This applies to SCOM 2016, 2019, and 2022. I will start with what people want most – the “list”:
These are the most common changes and settings I recommend adjusting on SCOM management servers.
Simply run these from an elevated command prompt on all your management servers.
reg add "HKLM\SYSTEM\CurrentControlSet\Services\HealthService\Parameters" /v "State Queue Items" /t REG_DWORD /d 20480 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Services\HealthService\Parameters" /v "Persistence Checkpoint Depth Maximum" /t REG_DWORD /d 104857600 /f
reg add "HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL" /v "DALInitiateClearPool" /t REG_DWORD /d 1 /f
reg add "HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL" /v "DALInitiateClearPoolSeconds" /t REG_DWORD /d 60 /f
reg add "HKLM\SOFTWARE\Microsoft\System Center\2010\Common" /v "GroupCalcPollingIntervalMilliseconds" /t REG_DWORD /d 1800000 /f
reg add "HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse" /v "Command Timeout Seconds" /t REG_DWORD /d 1800 /f
reg add "HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse" /v "Deployment Command Timeout Seconds" /t REG_DWORD /d 86400 /f
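If you would rather keep these settings as data (for change control, or to regenerate the commands for a new management server), a minimal sketch along these lines might help. The helper names `reg_add_command` and `render_all` are hypothetical and only for illustration; the keys, value names, and data are copied from the list above. The script only builds the command text and does not touch the registry.

```python
# Settings copied from the list above: (registry key, value name, decimal DWORD data).
SETTINGS = [
    (r"HKLM\SYSTEM\CurrentControlSet\Services\HealthService\Parameters",
     "State Queue Items", 20480),
    (r"HKLM\SYSTEM\CurrentControlSet\Services\HealthService\Parameters",
     "Persistence Checkpoint Depth Maximum", 104857600),
    (r"HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL",
     "DALInitiateClearPool", 1),
    (r"HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL",
     "DALInitiateClearPoolSeconds", 60),
    (r"HKLM\SOFTWARE\Microsoft\System Center\2010\Common",
     "GroupCalcPollingIntervalMilliseconds", 1800000),
    (r"HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse",
     "Command Timeout Seconds", 1800),
    (r"HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse",
     "Deployment Command Timeout Seconds", 86400),
]


def reg_add_command(key, name, data):
    """Render one `reg add` line for a REG_DWORD value (hypothetical helper)."""
    return f'reg add "{key}" /v "{name}" /t REG_DWORD /d {data} /f'


def render_all(settings=SETTINGS):
    """Render every setting, one command per list entry."""
    return [reg_add_command(key, name, data) for key, name, data in settings]


if __name__ == "__main__":
    # Print one command per line, ready to paste into an elevated prompt.
    for line in render_all():
        print(line)
```

Keeping the table in one place makes it easier to confirm that every management server received the same seven values.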
I will explain each setting in detail below:
1. HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\
REG_DWORD Decimal Value: State Queue Items = 20480
SCOM default existing registry value: (not present)
SCOM default value in code: 10240
Description: This sets the maximum size of the healthservice internal state queue. It should be equal to or larger than the number of monitor-based workflows running in a healthservice. Too small a value, or too many workflows, will cause state change loss.
2. HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\
REG_DWORD Decimal Value: Persistence Checkpoint Depth Maximum = 104857600
SCOM default existing registry value = 20971520
Description: This applies to management servers that host a large number of agentless objects, which results in the MS running a large number of workflows (network/URL/Linux/3rd party/VEEAM). This is an ESE DB setting which controls how often ESE writes to disk. A larger value will decrease disk IO caused by the SCOM healthservice but increase ESE recovery time in the case of a healthservice crash.
3. HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL\
REG_DWORD Decimal Value:
DALInitiateClearPool = 1
DALInitiateClearPoolSeconds = 60
SCOM existing registry value: (not present)
Description: This is a critical setting on ALL management servers in ANY management group. These two values configure the SDK service to attempt a reconnection to SQL Server on a regular basis after a disconnection. Without these settings, an extended SQL outage can cause a management server to never reconnect when SQL comes back online (per http://support.microsoft.com/kb/2913046/en-us). All management servers in a management group should get the registry change.
4. HKLM\SOFTWARE\Microsoft\System Center\2010\Common\
REG_DWORD Decimal Value: GroupCalcPollingIntervalMilliseconds = 1800000
SCOM existing registry value: (not present)
SCOM default code value: 30000 (30 seconds)
Description: This setting will slow down how often group calculation runs to find changes in group memberships. Group calculation can be very expensive, especially with a large number of groups, a large agent count, or complex group membership expressions. Groups with complex expressions in large environments can actually take several minutes to calculate. Multiply that by a large number of groups and you have problems. Slowing this down will help keep groupcalc from consuming all the healthservice and database I/O. 1800000 milliseconds is 30 minutes. This means once a group initializes (31410 event) on a management server in the pool, that specific group will wait 30 minutes before evaluating whether any members need to be added or removed based on the dynamic inclusion criteria in the expression.
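To put the two intervals side by side, here is a rough back-of-the-envelope sketch. It assumes each group is evaluated at most once per polling interval and ignores how long a calculation itself takes, so treat the numbers as upper bounds rather than measurements. The function name `polls_per_day` is made up for this illustration.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # milliseconds in one day


def polls_per_day(interval_ms):
    """Upper bound on membership evaluations per group per day."""
    return MS_PER_DAY // interval_ms


default_polls = polls_per_day(30_000)     # 30 seconds, the default in code
tuned_polls = polls_per_day(1_800_000)    # 30 minutes, the value recommended above

# prints: 2880 48 60
print(default_polls, tuned_polls, default_polls // tuned_polls)
```

Under those assumptions, a single group goes from up to 2880 evaluations a day to 48, a 60x reduction, and that saving is repeated for every group in the management group.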
5. HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse\
REG_DWORD Decimal Value: Command Timeout Seconds = 1800
SCOM existing registry value: (not present)
SCOM default code value: 600
Description: This helps with dataset maintenance as the default timeout of 10 minutes is often too short. Setting this to a longer value helps reduce the 31552 events you might see with standard database maintenance. This is a very common issue. http://blogs.technet.com/b/kevinholman/archive/2010/08/30/the-31552-event-or-why-is-my-data-warehouse-server-consuming-so-much-cpu.aspx This should be adjusted to however long it takes aggregations or other maintenance to run in your environment. We need this to complete in less than one hour, so if it takes more than 30 minutes to complete, you really need to investigate why it is so slow, either from too much data or SQL performance issues.
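The rule of thumb in that description can be written down as a tiny triage function. This is only a sketch of the reasoning, assuming you have already measured how long a maintenance run takes in your environment; the function name and the returned labels are invented for the example, while the 600 and 1800 second thresholds come from the text above.

```python
DEFAULT_TIMEOUT_S = 600        # default in code (10 minutes)
RECOMMENDED_TIMEOUT_S = 1800   # value recommended above (30 minutes)


def triage_maintenance(duration_s, timeout_s=RECOMMENDED_TIMEOUT_S):
    """Classify one maintenance run by its measured duration in seconds."""
    if duration_s > timeout_s:
        # Longer than 30 minutes: look at data volume or SQL performance.
        return "exceeds timeout, investigate"
    if duration_s > DEFAULT_TIMEOUT_S:
        # Would have timed out at the 10 minute default.
        return "needs extended timeout"
    return "fits default"


for seconds in (300, 900, 2400):
    print(seconds, triage_maintenance(seconds))
```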
6. HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse\
REG_DWORD Decimal Value: Deployment Command Timeout Seconds = 86400
SCOM existing registry value: (not present)
SCOM default code value: 10800 (3 hours)
Description: This helps with deployment of heavy-handed scripts that are applied during version upgrades and cumulative updates. Customers often see blocking on the DW database while creating indexes, and this prevents the script from being deployed within the default of 3 hours. Setting this value to allow one full day to deploy the script resolves most customer issues. Setting this to a longer value helps reduce the 31552 events you might see with standard database maintenance after a version upgrade or UR deployment. This is a very common issue in large environments or with very large warehouse databases.
7. HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL\
REG_DWORD Decimal Value:
DALCommandTimeoutSeconds = 14400
SCOM existing registry value: (not present)
SCOM default code value: 3600 (1 hour)
Description: This is needed to extend the timeout for DAL commands, such as MP import. Normally MPs import quickly; however, in some cases an MP import impacts a large number of entities (such as upgrading the System.Library during a SCOM in-place version upgrade of a larger environment) and can run past the default timeout.
Ok, that covers the “standard” stuff.
Pool Manager:
This next one should not be needed. I am documenting it because it was erroneously recommended in the past. You should ONLY change this one if directed to by Microsoft support. I personally have never seen an environment where this change should be made, and I DO NOT recommend it.
If you make changes to this setting, the same change must be made on ALL management servers, otherwise the resource pools will constantly fail. All management servers must have identical settings here. If you add a management server in the future, this setting must be applied immediately if you modified it on other management servers, or you will see your resource pools constantly committing suicide and failing over to other management servers, reinitializing all workflows in a loop. All the other settings in this article are generally beneficial. This specific one for PoolManager should receive great scrutiny before changing, due to the risks. It is NOT included in my reg-add list above for good reason.
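Because the requirement is that every management server carries identical PoolManager values, a consistency check is easy to sketch. The example below assumes you have already collected the values from each server by whatever means you prefer; the server names MS01 to MS03 and the function `find_inconsistent_servers` are hypothetical. A server where the PoolManager key is absent is represented as `None`.

```python
from collections import Counter


def find_inconsistent_servers(pool_settings):
    """Return servers whose PoolManager values differ from the most common set.

    pool_settings maps a server name to a dict of PoolManager values,
    or to None when the PoolManager key does not exist on that server.
    """
    def freeze(values):
        # Dicts are not hashable, so convert to a sorted tuple for counting.
        return None if values is None else tuple(sorted(values.items()))

    frozen = {server: freeze(values) for server, values in pool_settings.items()}
    if not frozen:
        return []
    baseline, _count = Counter(frozen.values()).most_common(1)[0]
    return sorted(server for server, values in frozen.items() if values != baseline)


# Hypothetical inventory: MS03 was added later and never received the change.
inventory = {
    "MS01": {"PoolLeaseRequestPeriodSeconds": 600, "PoolNetworkLatencySeconds": 120},
    "MS02": {"PoolLeaseRequestPeriodSeconds": 600, "PoolNetworkLatencySeconds": 120},
    "MS03": None,
}
print(find_inconsistent_servers(inventory))  # prints: ['MS03']
```

An empty result means all servers agree, whether that is because none of them has the key (the default) or because all of them have the same values.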
HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\PoolManager\
REG_DWORD Decimal Value:
PoolLeaseRequestPeriodSeconds = 600
PoolNetworkLatencySeconds = 120
SCOM 2016 existing registry value: not present (must create PoolManager key and both values)
Default code values: PoolLeaseRequestPeriodSeconds = 120, PoolNetworkLatencySeconds = 30
This is VERY RARE to change, and in general I only recommend changing this under advisement from a support case. The resource pools work quite well on their own, and I have worked with very large environments that did not need these to be modified. This is more common when you are dealing with a rare condition, such as management group spread across datacenters with high latency links, DR sites, MASSIVE number of workflows running on management servers, etc.
DAL Command Timeouts:
The following registry change is applicable if you are planning an in-place version upgrade of SCOM. This is needed to extend the timeout for DAL commands, such as MP import. Normally MPs import quickly; however, in some cases an MP import impacts a large number of entities (such as upgrading the System.Library during a SCOM in-place version upgrade of a larger environment) and can run past the default timeout. You might see these in the logs after a failed in-place upgrade of SCOM:
[16:43:27]: Error: :ImportManagementPack: Error: Unable to load management pack D:\Installation_Files\2022 – Installation Files\System Center Operations Manager\Setup\AMD64\..\..\ManagementPacks\System.Library.mp
[16:43:27]: Error: :: Database error. MPInfra_p_ManagementPackInstall failed with exception:
<MP ID: 01c8b236-3bce-9dba-6f1c-c119bcdc2972><MP Version: 7.5.8501.1><MP PKT: 31bf3856ad364e35> Database error. MPInfra_p_ManagementPackInstall failed with exception:
Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL\
REG_DWORD Decimal Value:
DALCommandTimeoutSeconds = 14400
SCOM existing registry value: (not present)
SCOM default code value: 3600 (1 hour)
Description: Extends the timeout from 1 hour to 4 hours for DAL commands such as MP import.
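If you want to check a setup log for this specific failure before deciding the DAL timeout is the culprit, a simple text match is usually enough. The sketch below looks for the two strings that appear in the log excerpt above; the function name `has_mp_import_timeout` is hypothetical, and how you read the log file into a string is left to you.

```python
TIMEOUT_MARKER = "Execution Timeout Expired"
MP_INSTALL_MARKER = "MPInfra_p_ManagementPackInstall failed"


def has_mp_import_timeout(log_text):
    """True when the log shows an MP install failure together with a SQL timeout."""
    return MP_INSTALL_MARKER in log_text and TIMEOUT_MARKER in log_text


# Two lines taken from the log excerpt above.
sample = (
    "[16:43:27]: Error: :: Database error. "
    "MPInfra_p_ManagementPackInstall failed with exception:\n"
    "Execution Timeout Expired. The timeout period elapsed prior to completion "
    "of the operation or the server is not responding.\n"
)
print(has_mp_import_timeout(sample))  # prints: True
```

Requiring both markers avoids flagging unrelated MP install failures that were not caused by a timeout.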
Hi Kevin,
regarding regkey HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\PoolManager\
If it is present, can it be removed? We are updating SCOM to 1801 and we have to add a new MS on OSE 2016.
Before we add new MS I was thinking to remove this key.
Thank you,
Janez
Janez – NO – don’t remove the key. The note above references changing the key, not adding it. The management servers need this value to keep the resource pools healthy. It’s like the heartbeat in a cluster…
Actually – yes, I’d recommend removing the key, and all values within it. HOWEVER – this must be done on ALL management servers and Gateways (anyone who is a member of any pool) if it was previously configured.
HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\PoolManager\
This reg key DOES NOT EXIST on any SCOM management server by default. If it exists, it means someone added it, and typically that means it was added in error (either to try and solve a problem, or by poor guidance). These rarely solve problems, as pool issues are usually load or performance related, and changing the timings is usually only a band aid to the root cause.
If I had a deployment where someone configured this key, I would take a backup of one of the management servers reg keys, then delete the PoolManager key (including all contained values) on all management servers, then restart the healthservice on all management servers.
Do all these settings apply for a fresh install of SCOM 2019?
I was wondering the same thing, do they apply to SCOM 2019?
Updated article for clarity.
Could you add a column for these to your SCOM management / SCOM Servers state view so we can keep track of which servers have been updated? I manage a lot of MSs and forget which ones have been done.
I thought about that, but the number of columns is already making this MP really slow in big environments.
I actually wrote a “SCOM Health” MP, but never published it, which checks for all kinds of scenarios like this. Would you prefer I publish that MP?
Yes please
Yes, such an MP would be most appreciated
Hi Kevin,
First off may I thank you for all your work in the SCOM community over the years.
We have just completed a large upgrade of SCOM 2012 R2 to SCOM 2016 UR9. The upgrade has given us kittens for a number of days, with many issues that we have had to overcome.
This has caused the team to debate and discuss the ConfigService.config file and its settings, and I was wondering if you could advise on best practice for the settings, as in 2012 R2 we had many settings that had been tweaked over the 6 years SCOM 2012 was in place.
Now our new 2016 ConfigService.config doesn’t have these settings or tweaks, and we are really not sure if we should move them over to the new file.
2012 R2 – Settings
We are not really seeing any issues anymore, but have the odd error here and there.
Any advice would be great and thanking you in advance
In general, I do not recommend changing anything in configservice.config. There is no optimization that should be made across the board out of the box, in my opinion. All my largest customers run SCOM 2016 with zero modifications. There are very specific scenarios where customers might alter this file, under the guidance of Microsoft Support, however those are to troubleshoot a specific problem and not general in nature.
Thanks Kevin for the advice. We have quite a large SCOM group using all of the big MPs (Veeam/Citrix/Opslogix) and we are seeing a few failures with error code 10, though not all the time. They seem to error at a duration of between 52 and 80 seconds.
It’s only really on the DeltaSynchronization:
durationSeconds =61
DeltaSynchronization = 10
Those are the scenarios where it might be necessary to change from the defaults.
However, document any changes, keep backup copies of the files, and make the changes consistently on ALL management servers.
Personally, I’d much rather reduce the incoming load on config, which comes from discoveries or huge instance counts, but with some MPs you can’t due to their design.
Thank you
Hi Kevin what about “bulk insert command timeout seconds” we currently have it set to 80 seconds at the moment, it seems small compared to 1800 for “command timeout seconds” just wondering if you have any recommendations for that registry Value please?
I do NOT recommend changing “bulk insert command timeout seconds”, in most cases. (see more below)
This is VERY different than “command timeout seconds”. The “command timeout seconds” setting primarily controls the standard dataset maintenance, which runs very frequently, but only allows certain maintenance subroutines to run every hour or at longer intervals (such as hourly aggregations). This maintenance can sometimes take a long time to complete, so setting this to 30 minutes is better than the default timeout, especially for large environments.
However, this is different than bulk insert. Bulk insert is a method that SQL uses to insert perf and event data. We are constantly inserting data, and we do not expect any bulk insert to take longer than 30 seconds by default. If you have to extend this timeout, in my opinion, something is not optimized. Common issues I see are that either you are attempting to insert way too much perf data, or your Data Warehouse SQL server is undersized from a performance perspective – usually disk latency. Extending bulk insert timeout is often a band-aid, covering up a problem that should be focused on. I am not saying it should NEVER be extended, but great care and understanding should be taken BEFORE changing bulk insert timeout from the default.
Thanks Kevin. I used to have issues with perf data, but implementing your other article seems to have fixed those, so I don’t need to change this registry setting. Cheers.
https://kevinholman.com/2017/08/15/quicktip-disabling-workflows-to-optimize-for-large-environments/
Should these reg edits be applied to Gateway servers as well?
No – these are recommended for management servers. Many of these do not apply to Gateway servers, don’t work on them, or aren’t needed. Gateways are pretty solid out of the box.
Has “GroupCalcPollingIntervalMilliseconds” moved location in the registry between SCOM 2012 R2 and SCOM 2016 + 2019?
Your recommended settings have a different path for “GroupCalcPollingIntervalMilliseconds”:
“HKLM\SOFTWARE\Microsoft\System Center\2010\Common”
vs.
“HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0”
SCOM 2016 + 2019
– reg add "HKLM\SOFTWARE\Microsoft\System Center\2010\Common" /v "GroupCalcPollingIntervalMilliseconds" /t REG_DWORD /d 1800000 /f
SCOM 2012 R2
https://kevinholman.com/2014/06/25/tweaking-scom-2012-management-servers-for-large-environments/
– reg add "HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0" /v "GroupCalcPollingIntervalMilliseconds" /t REG_DWORD /d 900000 /f
YES.
It actually looks like it changed from SCOM 2007 R2 > SCOM 2012. It’s just that nobody noticed the old location we were still using was no longer relevant. 🙁 I updated it to the correct registry location for SCOM 2012 and later.
Hey Kevin,
is there a reg key that allows the management server to run several powershell scripts simultaneously?
Thanks in advance!
Have you been able to confirm all of these reg values are still good for a SCOM 2022 deployment?
Yes – they are applicable to SCOM 2022.
Hello guys, do you know how to increase the number of scripts that can run on an agent? I always get an error that the PowerShell script was aborted. Is there a registry setting available for that? Many thanks for the help.
Found it: it is maximumQueueSizeKb. The default is 15 MB. I set it to 75 MB.
I generally do not recommend adjusting Maximum queue size, except in specific scenarios and on specific agents (like a watcher agent solution). If your management packs don’t work with our existing default queue – I’d question why they are designed to load so much into the queue. Just FYI.
That’s actually the standard SQL Management Pack for SQL Server 2014/2016. I noticed that PowerShell scripts wouldn’t run on some servers and timed out. I tried to dig into which scripts take too much memory, but that took too much time, and it was finally fixed after I changed maximumQueueSizeKb to 75 MB. I also saw that solution in the https://ds.squaredup.com/blog/upgrading-to-scom-2019-step-by-step/ article. BTW, I always follow your articles, so thank you for your reply and all the amazing work you do for the community. Thank you.
We don’t support a SQL 2014/2016 Management Pack anymore – all supported monitoring of SQL uses the SQL Agnostic MP. We now use modules and not PowerShell scripts in most cases.
I have never seen the need nor recommendation to change maximumQueueSizeKb on management servers in the context of an upgrade.
The default value for SCOM management servers is and should always be 102400. Not sure why he recommended that. Regardless – queues on management servers are a very different thing than queues on agents. It is fine to increase the size as needed, just often it is used as a band aid when a badly designed MP is the real culprit. I can only see this being needed on SQL servers if you are using old unsupported MP’s, and have a lot of instances, databases, files/filegroups.
What is considered a ‘large amount of agentless objects’? Does this include Windows Cluster objects? I have a number of Linux servers in monitoring, but I am not sure what counts as a large amount. I am considering whether this reg setting is needed prior to a SCOM 2019 in-place upgrade to 2022.