
Recommended registry tweaks for SCOM 2016 and 2019 management servers


This applies to SCOM 2016 and 2019.  I will start with what people want most – the “list”:

 

These are the most common changes and settings I recommend to adjust on SCOM management servers.

Simply run these from an elevated command prompt on all your management servers.

 

reg add "HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters" /v "State Queue Items" /t REG_DWORD /d 20480 /f
reg add "HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters" /v "Persistence Checkpoint Depth Maximum" /t REG_DWORD /d 104857600 /f
reg add "HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL" /v "DALInitiateClearPool" /t REG_DWORD /d 1 /f
reg add "HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL" /v "DALInitiateClearPoolSeconds" /t REG_DWORD /d 60 /f
reg add "HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0" /v "GroupCalcPollingIntervalMilliseconds" /t REG_DWORD /d 900000 /f
reg add "HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse" /v "Command Timeout Seconds" /t REG_DWORD /d 1800 /f
reg add "HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse" /v "Deployment Command Timeout Seconds" /t REG_DWORD /d 86400 /f
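
If you would rather keep the list as data and generate the command lines from it, here is a minimal Python sketch. The `SETTINGS` table and the `build_reg_add` helper are names made up for this illustration; the keys, value names, and data are the same ones used in the list above. The script only prints the commands and does not touch the registry.

```python
# Recommended values from the list above: (key, value name, decimal data).
HS = r"HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters"
DAL = r"HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL"
OM = r"HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0"
DW = OM + r"\Data Warehouse"

SETTINGS = [
    (HS, "State Queue Items", 20480),
    (HS, "Persistence Checkpoint Depth Maximum", 104857600),
    (DAL, "DALInitiateClearPool", 1),
    (DAL, "DALInitiateClearPoolSeconds", 60),
    (OM, "GroupCalcPollingIntervalMilliseconds", 900000),
    (DW, "Command Timeout Seconds", 1800),
    (DW, "Deployment Command Timeout Seconds", 86400),
]


def build_reg_add(key, name, data):
    """Return one 'reg add' command line for a REG_DWORD value (hypothetical helper)."""
    return f'reg add "{key}" /v "{name}" /t REG_DWORD /d {data} /f'


if __name__ == "__main__":
    for key, name, data in SETTINGS:
        print(build_reg_add(key, name, data))
```

Keeping the values in one table makes it easier to review a change before it is pushed to every management server.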

 

I will explain each setting in detail below:

 

1.  HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\
REG_DWORD Decimal Value:        
State Queue Items = 20480

SCOM 2016 default existing registry value:   (not present)

SCOM 2016 default value in code:   10240

Description:  This sets the maximum size of the healthservice internal state queue.  It should be equal to or larger than the number of monitor-based workflows running in a healthservice.  Too small a value, or too many workflows, will cause state change loss.  http://blogs.msdn.com/b/rslaten/archive/2008/08/27/event-5206.aspx

 

2.  HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\
REG_DWORD Decimal Value:  
Persistence Checkpoint Depth Maximum = 104857600

SCOM 2016 default existing registry value = 20971520

Description:  This is for management servers that host a large number of agentless objects, which results in the MS running a large number of workflows (network/URL/Linux/3rd party/VEEAM).  This is an ESE DB setting which controls how often ESE writes to disk.  A larger value will decrease disk IO caused by the SCOM healthservice but increase ESE recovery time in the case of a healthservice crash.

 

3.  HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL\
REG_DWORD Decimal Value:
  DALInitiateClearPool = 1
  DALInitiateClearPoolSeconds = 60

SCOM 2016 existing registry value:   not present

Description:  This is a critical setting on ALL management servers in ANY management group.  This setting configures the SDK service to attempt a reconnection to SQL server upon disconnection, on a regular basis.  Without these settings, an extended SQL outage can cause a management server to never reconnect back to SQL when SQL comes back online after an outage.   Per:  http://support.microsoft.com/kb/2913046/en-us  All management servers in a management group should get the registry change.

 

4.  HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\
REG_DWORD Decimal Value:       
GroupCalcPollingIntervalMilliseconds = 900000

SCOM 2016 existing registry value:  (not present)

SCOM 2016 default code value:  30000 (30 seconds)

Description:  This setting will slow down how often group calculation runs to find changes in group memberships.  Group calculation can be very expensive, especially with a large number of groups, large agent count, or complex group membership expressions.  Slowing this down will help keep groupcalc from consuming all the healthservice and database I/O.  900000 is every 15 minutes.

 

5.  HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse\
REG_DWORD Decimal Value:    
Command Timeout Seconds = 1800

SCOM 2016 existing registry value:  (not present)

SCOM 2016 default code value:  600

Description:  This helps with dataset maintenance as the default timeout of 10 minutes is often too short.  Setting this to a longer value helps reduce the 31552 events you might see with standard database maintenance.  This is a very common issue.   http://blogs.technet.com/b/kevinholman/archive/2010/08/30/the-31552-event-or-why-is-my-data-warehouse-server-consuming-so-much-cpu.aspx  This should be adjusted to however long it takes aggregations or other maintenance to run in your environment.  We need this to complete in less than one hour, so if it takes more than 30 minutes to complete, you really need to investigate why it is so slow, either from too much data or SQL performance issues.

 

6.  HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse\
REG_DWORD Decimal Value:
Deployment Command Timeout Seconds = 86400

SCOM 2016 existing registry value:  (not present)

SCOM 2016 default code value:  10800 (3 hours)

Description:  This helps with deployment of heavy-handed scripts that are applied during version upgrades and cumulative updates.  Customers often see blocking on the DW database while indexes are created, and this prevents the script from being deployed within the default of 3 hours.  Setting this value to allow one full day to deploy the script resolves most customer issues.  Setting this to a longer value helps reduce the 31552 events you might see with standard database maintenance after a version upgrade or UR deployment.  This is a very common issue in large environments with very large warehouse databases.
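
As a quick sanity check on the units, the raw numbers in settings 2 and 4 through 6 can be converted into friendlier terms. This is only arithmetic on the values quoted above. One assumption is labeled in the code: the text does not state the unit of the checkpoint depth, and the sketch treats it as a byte count.

```python
MS_PER_MINUTE = 60_000
SECONDS_PER_MINUTE = 60
SECONDS_PER_HOUR = 3600
MIB = 1024 * 1024

# GroupCalcPollingIntervalMilliseconds: recommended vs. default in code
group_calc_minutes = 900000 / MS_PER_MINUTE
group_calc_default_seconds = 30000 / 1000

# Command Timeout Seconds: recommended vs. default in code
command_timeout_minutes = 1800 / SECONDS_PER_MINUTE
command_default_minutes = 600 / SECONDS_PER_MINUTE

# Deployment Command Timeout Seconds: recommended vs. default in code
deploy_timeout_hours = 86400 / SECONDS_PER_HOUR
deploy_default_hours = 10800 / SECONDS_PER_HOUR

# Persistence Checkpoint Depth Maximum.
# ASSUMPTION: the value is a byte count; the unit is not stated above.
checkpoint_recommended_mib = 104857600 / MIB
checkpoint_default_mib = 20971520 / MIB

if __name__ == "__main__":
    print(f"Group calc: every {group_calc_minutes:g} min (default {group_calc_default_seconds:g} s)")
    print(f"Command timeout: {command_timeout_minutes:g} min (default {command_default_minutes:g} min)")
    print(f"Deployment timeout: {deploy_timeout_hours:g} h (default {deploy_default_hours:g} h)")
    print(f"Checkpoint depth: {checkpoint_recommended_mib:g} MiB (default {checkpoint_default_mib:g} MiB)")
```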

 

 

Ok, that covers the “standard” stuff.
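
To confirm that a management server actually carries the six standard settings, something like the following Python sketch could be used. It is not an official tool. `read_dword`, `find_drift`, and `EXPECTED` are hypothetical names for this example, and `read_dword` relies on the standard-library `winreg` module, which exists only on Windows. Key paths are given relative to HKLM.

```python
HS = r"SYSTEM\CurrentControlSet\services\HealthService\Parameters"
DAL = r"SOFTWARE\Microsoft\System Center\2010\Common\DAL"
OM = r"SOFTWARE\Microsoft\Microsoft Operations Manager\3.0"
DW = OM + r"\Data Warehouse"

# (subkey under HKLM, value name) -> recommended decimal value
EXPECTED = {
    (HS, "State Queue Items"): 20480,
    (HS, "Persistence Checkpoint Depth Maximum"): 104857600,
    (DAL, "DALInitiateClearPool"): 1,
    (DAL, "DALInitiateClearPoolSeconds"): 60,
    (OM, "GroupCalcPollingIntervalMilliseconds"): 900000,
    (DW, "Command Timeout Seconds"): 1800,
    (DW, "Deployment Command Timeout Seconds"): 86400,
}


def read_dword(subkey, name):
    """Read one value from HKLM; return None if the key or value is missing.

    Windows only: winreg is imported here so the rest of the module
    can still be loaded on other platforms.
    """
    import winreg
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, subkey) as key:
            value, _value_type = winreg.QueryValueEx(key, name)
            return value
    except FileNotFoundError:
        return None


def find_drift(expected, actual):
    """Return {key: (expected, actual)} for every value that does not match."""
    return {
        k: (want, actual.get(k))
        for k, want in expected.items()
        if actual.get(k) != want
    }
```

On a management server you would build the current state with `actual = {k: read_dword(*k) for k in EXPECTED}` and print whatever `find_drift(EXPECTED, actual)` returns; an empty result means all seven values match.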

 

I will cover one other registry modification that is RARELY needed.  You should ONLY change this one if directed to by Microsoft support.  I personally have never seen an environment where this change should be made, and I DO NOT recommend it.

WARNING:

If you make changes to this setting, the same change must be made on ALL management servers, otherwise the resource pools will constantly fail.  All management servers must have identical settings here.  If you add a management server in the future, this setting must be applied immediately if you modified it on other management servers, or you will see your resource pools constantly committing suicide and failing over to other management servers, reinitializing all workflows in a loop.   All the other settings in this article are generally beneficial.  This specific one for PoolManager should receive great scrutiny before changing, due to the risks.  It is NOT included in my reg-add list above for good reason.

 

HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\PoolManager\
REG_DWORD Decimal Value: 
  PoolLeaseRequestPeriodSeconds = 600
  PoolNetworkLatencySeconds = 120

SCOM 2016 existing registry value:  not present (must create PoolManager key and both values)

SCOM 2016 default code values:  120 seconds (PoolLeaseRequestPeriodSeconds) and 30 seconds (PoolNetworkLatencySeconds)

This is VERY RARE to change, and in general I only recommend changing this under advisement from a support case.  The resource pools work quite well on their own, and I have worked with very large environments that did not need these to be modified.  This is more common when you are dealing with a rare condition, such as management group spread across datacenters with high latency links, DR sites, MASSIVE number of workflows running on management servers, etc.
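
Because every pool member must carry identical PoolManager settings (or none at all), a consistency check across servers is the main thing worth automating here. The sketch below is a hypothetical illustration: it assumes you have already collected each server's PoolManager values into a dictionary, with `None` meaning the key does not exist, and the server names are invented.

```python
def pool_manager_consistent(servers):
    """Return True only if every server has the same PoolManager configuration.

    `servers` maps a server name to a dict of PoolManager values,
    or to None when the PoolManager key is absent on that server.
    """
    configs = list(servers.values())
    return all(cfg == configs[0] for cfg in configs)


if __name__ == "__main__":
    modified = {"PoolLeaseRequestPeriodSeconds": 600, "PoolNetworkLatencySeconds": 120}

    # Hypothetical inventory: MS03 was added later and never got the change.
    inventory = {"MS01": modified, "MS02": dict(modified), "MS03": None}

    if not pool_manager_consistent(inventory):
        print("PoolManager settings differ between pool members")
```

The default state, where no server has the key, passes this check just like a fully modified environment does; only a mix of the two is flagged.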

19 Comments

  1. Janez

    Hi Kevin,

    regarding regkey HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\PoolManager\
    If it is present can it be removed? We update SCOM to 1801 and we have to add new MS on OSE 2016.
    Before we add new MS I was thinking to remove this key.

    Thank you,
    Janez

  2. Sean Tompkins

    Janez – NO – don’t remove the key. The note above references changing the key, not adding it. The management servers need this value to keep the resource pools healthy. It’s like the heartbeat in a cluster…

  3. Kevin Holman

    Actually – yes, I’d recommend removing the key, and all values within it. HOWEVER – this must be done on ALL management servers and Gateways (anyone who is a member of any pool) if it was previously configured.

    HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\PoolManager\

    This reg key DOES NOT EXIST on any SCOM management server by default. If it exists, it means someone added it, and typically that means it was added in error (either to try and solve a problem, or by poor guidance). These rarely solve problems, as pool issues are usually load or performance related, and changing the timings is usually only a band aid to the root cause.

    If I had a deployment where someone configured this key, I would take a backup of one of the management servers reg keys, then delete the PoolManager key (including all contained values) on all management servers, then restart the healthservice on all management servers.

  4. Pingback:OpsMgr 2016 – QuickStart Deployment Guide - Kevin Holman's Blog

  5. Andy

    Could you add a column for these to your SCOM management / SCOM Servers state view so we can keep track of which servers have been updated? I manage a lot of MSs and forget which ones have been done.

  6. David Anderson

    Hi Kevin,

    First off may I thank you for all your work in the SCOM community over the years.

    We have just completed a large upgrade of SCOM 2012 R2 over to SCOM 2016 UR9. The upgrade has given us kittens for a number of days, with many issues that we have had to overcome.

    This has caused the team to debate and discuss the ConfigService.config file and its settings, and I was wondering if you could advise on best practice for the settings, as in our 2012 R2 we had many settings that had been tweaked over the 6 years SCOM 2012 was in place.

    Now our new 2016 ConfigService.config doesn't have these settings or tweaks, and we are really not sure if we should move them over to the new file.

    2012 R2 – Settings

    We are not really seeing any issues anymore but have the odd error here and there.

    Any advice would be great, and thank you in advance.

    • Kevin Holman

      In general, I do not recommend changing anything in configservice.config. There is no optimization that should be made across the board out of the box, in my opinion. All my largest customers run SCOM 2016 with zero modifications. There are very specific scenarios where customers might alter this file, under the guidance of Microsoft Support, however those are to troubleshoot a specific problem and not general in nature.

      • David Anderson

        Thanks Kevin for the advice. We have quite a large SCOM group using all of the big MPs (Veeam/Citrix/Opslogix) and we are seeing a few failures with error code 10, though not all the time. They seem to error between 52 seconds and 80 seconds on the duration.

        It's only really on the DeltaSynchronization.

        durationSeconds =61
        DeltaSynchronization = 10

        • Kevin Holman

          Those are the scenarios where it might be necessary to change from the defaults.

          However, document any changes, keep backup copies of the files, and make the changes consistently on ALL management servers.

          Personally, I’d much rather reduce the incoming load on config, which comes from discoveries or huge instance counts, but with some MPs you can’t due to their design.

  7. Danny

    Hi Kevin what about “bulk insert command timeout seconds” we currently have it set to 80 seconds at the moment, it seems small compared to 1800 for “command timeout seconds” just wondering if you have any recommendations for that registry Value please?

    • Kevin Holman

      I do NOT recommend changing “bulk insert command timeout seconds”, in most cases. (see more below)

      This is VERY different than “command timeout seconds”. “Command timeout seconds” primarily controls the standard dataset maintenance, which runs very frequently but only allows certain maintenance subroutines to run every hour or at longer intervals (such as hourly aggregations). This maintenance can sometimes take a long time to complete, so setting this to 30 minutes is better than the default timeout, especially for large environments.

      However, this is different than bulk insert. Bulk insert is a method that SQL uses to insert perf and event data. We are constantly inserting data, and we do not expect any bulk insert to take longer than 30 seconds by default. If you have to extend this timeout, in my opinion, something is not optimized. Common issues I see are that either you are attempting to insert way too much perf data, or your Data Warehouse SQL server is undersized from a performance perspective – usually disk latency. Extending bulk insert timeout is often a band-aid, covering up a problem that should be focused on. I am not saying it should NEVER be extended, but great care and understanding should be taken BEFORE changing bulk insert timeout from the default.
