
Tweaking SCOM 2012 Management Servers for large environments

There are many articles on tweaking certain registry settings for SCOM agents, gateways, and management servers, for many reasons: large deployments, custom 3rd party MP’s, and monitoring Exchange 2010, to name a few.  Matt Goedtel has a good list on his blog:  http://blogs.technet.com/b/mgoedtel/archive/2010/08/24/performance-optimizations-for-operations-manager-2007-r2.aspx

 

Below are some settings that I change on management servers when monitoring large environments.  What does “large” mean?  I’d characterize that as a management group with a significant agent count (>1000), or a very large instance space (lots of management packs deployed, both MS and 3rd party, plus custom MP’s which don’t always behave well).  Perhaps you have a very large number of groups, or groups with complex expressions.  It could be that you are monitoring a large number of “agentless” items, such as Linux servers, network devices, or URLs.

These settings are very common, and I recommend them for all environments, with documented caveats below.

 

1.  Key:    HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\
REG_DWORD Decimal Value:    Persistence Checkpoint Depth Maximum = 104857600
SCOM 2012 default existing registry value = 20971520

All management servers that host a large number of agentless objects, which results in the MS running a large number of workflows (network/URL/Linux/3rd party/VEEAM):  This is an ESE DB setting which controls how often ESE writes to disk.  A larger value will decrease disk I/O caused by the SCOM healthservice, but will increase ESE recovery time in the case of a healthservice crash.
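
To get a feel for the size of this change, the small sketch below compares the default and suggested values. It assumes the registry value is a byte count, which this article does not state, and the helper name `to_mib` is purely illustrative.

```python
# Assumption: the registry value is expressed in bytes (not stated in the article).
DEFAULT_DEPTH = 20971520     # SCOM 2012 default existing registry value
TUNED_DEPTH = 104857600      # value suggested above


def to_mib(byte_count: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return byte_count / (1024 * 1024)


print(f"default: {to_mib(DEFAULT_DEPTH):.0f} MiB")      # default: 20 MiB
print(f"tuned:   {to_mib(TUNED_DEPTH):.0f} MiB")        # tuned:   100 MiB
print(f"ratio:   {TUNED_DEPTH // DEFAULT_DEPTH}x")      # ratio:   5x
```

Under that assumption, the suggested value is five times the default.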

2.  Key:    HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\
REG_DWORD Decimal Value:        State Queue Items = 20480
SCOM 2012 default existing registry value: not present.  Value must be created.  Default code value = 10240

All management servers in a large management group:  This sets the maximum size of the healthservice internal state queue.  It should be equal to or larger than the number of monitor-based workflows running in a healthservice.  Too small a value, or too many workflows, will cause state change loss.  http://blogs.msdn.com/b/rslaten/archive/2008/08/27/event-5206.aspx
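
The sizing rule above (queue size at least equal to the number of monitor-based workflows) can be sketched as a simple check. The workflow count used here is a made-up figure for illustration, and the function name is hypothetical. How you obtain the real count for a given healthservice is outside the scope of this sketch.

```python
DEFAULT_STATE_QUEUE_ITEMS = 10240   # code default when the registry value is absent
TUNED_STATE_QUEUE_ITEMS = 20480     # value suggested above


def state_queue_is_sufficient(monitor_workflows: int, queue_items: int) -> bool:
    """True when the queue can hold at least one item per monitor-based workflow."""
    return queue_items >= monitor_workflows


# Hypothetical workflow count, for illustration only.
example_workflows = 15000

for label, size in (("default", DEFAULT_STATE_QUEUE_ITEMS),
                    ("tuned", TUNED_STATE_QUEUE_ITEMS)):
    verdict = "ok" if state_queue_is_sufficient(example_workflows, size) else "too small"
    print(f"{label} ({size}): {verdict}")
```

With 15,000 monitor-based workflows, the default would be reported as too small and the suggested value as sufficient.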

3.  Key:    HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\PoolManager\
REG_DWORD Decimal Values:
    PoolLeaseRequestPeriodSeconds = 600
    PoolNetworkLatencySeconds = 120
SCOM 2012 existing registry value:  not present (must create PoolManager key and both values).  Default code values = 120/30 seconds

All management servers that participate in any resource pool and run a large number of workflows:  This is VERY RARE to change, and in general I only recommend changing this under advisement from a support case.  The resource pools work quite well on their own, and I have worked with very large environments that did not need these to be modified.  A change is more common when you are dealing with a rare condition, such as a management group spread across datacenters with high latency links, DR sites, or a MASSIVE number of workflows running on management servers.

4.  Key:     HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\
REG_DWORD Decimal Value:    GroupCalcPollingIntervalMilliseconds = 900000
SCOM 2012 existing registry value:  not present (must create value).  Default code value = 30000 (30 seconds)

All management servers that participate in the All Management Servers resource pool, that have a large agent count or large number of groups:  This setting will slow down how often group calculation runs to find changes in group memberships.  Group calculation can be very expensive, especially with a large number of groups, large agent count, or complex group membership expressions.  Slowing this down will help keep groupcalc from consuming all the healthservice and database I/O.
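
To put the two intervals in perspective, here is a rough calculation of how often group calculation would poll per hour. It assumes one polling pass per interval, which is a simplification, and the helper name is illustrative.

```python
MS_PER_HOUR = 60 * 60 * 1000

DEFAULT_INTERVAL_MS = 30000     # code default (30 seconds)
TUNED_INTERVAL_MS = 900000      # value suggested above (15 minutes)


def polls_per_hour(interval_ms: int) -> float:
    """Number of polling passes per hour, assuming one pass per interval."""
    return MS_PER_HOUR / interval_ms


print(polls_per_hour(DEFAULT_INTERVAL_MS))   # 120.0
print(polls_per_hour(TUNED_INTERVAL_MS))     # 4.0
```

Under that assumption, polling drops from 120 passes per hour to 4. The trade-off is that group membership changes can take up to 15 minutes to be picked up.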

5.  Key:    HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse\
REG_DWORD Decimal Value:    Command Timeout Seconds = 1200
SCOM 2012 existing registry value:  not present (must create “Data Warehouse” key and value).  Default in code value = 300

All management servers in a management group:  This helps with dataset maintenance, as the default timeout is often too short.  Setting this to a longer value helps reduce the 31552 events you might see with standard database maintenance.  This is a very common issue.   https://kevinholman.com/2010/08/29/the-31552-event-or-why-is-my-data-warehouse-server-consuming-so-much-cpu/

6.  Key:    HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse\
REG_DWORD Decimal Value:    Deployment Command Timeout Seconds = 86400
SCOM 2012 existing registry value:  not present (must create “Data Warehouse” key and value).  Default in code value = 10800 seconds (3 hours)

All management servers in a management group:  This helps with deployment of the heavy-handed scripts that are applied during version upgrades and cumulative updates.  Customers often see blocking on the DW database while indexes are being created, which prevents the script from being deployed within the default of 3 hours.  Setting this value to allow one full day to deploy the script resolves most customer issues.  Setting this to a longer value helps reduce the 31552 events you might see with standard database maintenance after a version upgrade or UR deployment.  This is a very common issue in large environments with very large warehouse databases.

 

7.  Key:    HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL\
REG_DWORD Decimal Values:
    DALInitiateClearPool = 1
    DALInitiateClearPoolSeconds = 60
SCOM 2012 existing registry value:  not present.  Code default = 30 seconds?

All management servers in ANY management group.  This setting configures the SDK service to attempt a reconnection to SQL server upon disconnection, on a regular basis.  Without these settings, an extended SQL outage can cause a management server to never reconnect back to SQL when SQL comes back online after an outage.   Per:  http://support.microsoft.com/kb/2913046/en-us  All management servers in a management group should get the registry change.

 

To summarize:

| Registry Key | Reg DWORD Value Name | Reg DWORD Decimal Value |
|---|---|---|
| HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\ | Persistence Checkpoint Depth Maximum | 104857600 |
| HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\ | State Queue Items | 20480 |
| HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\PoolManager\ | PoolLeaseRequestPeriodSeconds | 600 |
| HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\PoolManager\ | PoolNetworkLatencySeconds | 120 |
| HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\ | GroupCalcPollingIntervalMilliseconds | 900000 |
| HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse\ | Command Timeout Seconds | 1200 |
| HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse\ | Deployment Command Timeout Seconds | 86400 |
| HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL\ | DALInitiateClearPool | 1 |
| HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL\ | DALInitiateClearPoolSeconds | 60 |

 

****NOTE:

On modifying the following:

    HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters\PoolManager\
    REG_DWORD Decimal Values:
    PoolLeaseRequestPeriodSeconds = 600
    PoolNetworkLatencySeconds = 120

This should NOT be done unless you are guided to by Microsoft support, generally speaking.  If you make changes to this setting, the same change must be made on ALL management servers, otherwise the resource pools will constantly fail.  All management servers must have identical settings here.  If you add a management server in the future, this setting must be applied immediately if you modified it on other management servers, or you will see your resource pools constantly committing suicide and failing over to other management servers, reinitializing all workflows in a loop.   All the other settings in this article are generally beneficial.  This specific one for PoolManager should receive great scrutiny before changing, due to the risks.
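
Because the PoolManager values must be identical on every management server, a consistency check is worth running whenever a server is added. The sketch below works on an inventory you have already collected by whatever means you prefer. The server names and the `find_mismatched_servers` helper are hypothetical, and a missing value is treated as `None` (meaning the code default is in effect).

```python
from collections import Counter

POOL_VALUE_NAMES = ("PoolLeaseRequestPeriodSeconds", "PoolNetworkLatencySeconds")


def find_mismatched_servers(settings_by_server: dict) -> list:
    """Return servers whose PoolManager values differ from the most common combination."""
    combos = {
        server: tuple(values.get(name) for name in POOL_VALUE_NAMES)
        for server, values in settings_by_server.items()
    }
    if not combos:
        return []
    # Treat the most frequently seen combination as the intended baseline.
    baseline, _count = Counter(combos.values()).most_common(1)[0]
    return sorted(server for server, combo in combos.items() if combo != baseline)


# Hypothetical inventory: MS03 was added later and never received the values.
inventory = {
    "MS01": {"PoolLeaseRequestPeriodSeconds": 600, "PoolNetworkLatencySeconds": 120},
    "MS02": {"PoolLeaseRequestPeriodSeconds": 600, "PoolNetworkLatencySeconds": 120},
    "MS03": {},
}

print(find_mismatched_servers(inventory))   # ['MS03']
```

Using the most common combination as the baseline is only a heuristic. In a two-server group with differing values, you would need to decide manually which one is intended.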

 

 

Below are some simple reg add statements you can run to make setting these easy:

reg add "HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters" /v "State Queue Items" /t REG_DWORD /d 20480 /f
reg add "HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters" /v "Persistence Checkpoint Depth Maximum" /t REG_DWORD /d 104857600 /f
reg add "HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0" /v "GroupCalcPollingIntervalMilliseconds" /t REG_DWORD /d 900000 /f
reg add "HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse" /v "Command Timeout Seconds" /t REG_DWORD /d 1200 /f
reg add "HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse" /v "Deployment Command Timeout Seconds" /t REG_DWORD /d 86400 /f
reg add "HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL" /v "DALInitiateClearPool" /t REG_DWORD /d 1 /f
reg add "HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL" /v "DALInitiateClearPoolSeconds" /t REG_DWORD /d 60 /f
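
If you maintain these values in a script or a change record, it can help to generate the command lines from a single table rather than copying them from a web page, where typographic (curly) quotes can creep in and break the command. The following is a minimal sketch along those lines. The helper names are illustrative, and the PoolManager values are deliberately left out, as in the list above.

```python
# Registry values from this article, excluding the PoolManager pair.
SETTINGS = [
    (r"HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters", "State Queue Items", 20480),
    (r"HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters", "Persistence Checkpoint Depth Maximum", 104857600),
    (r"HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0", "GroupCalcPollingIntervalMilliseconds", 900000),
    (r"HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse", "Command Timeout Seconds", 1200),
    (r"HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Data Warehouse", "Deployment Command Timeout Seconds", 86400),
    (r"HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL", "DALInitiateClearPool", 1),
    (r"HKLM\SOFTWARE\Microsoft\System Center\2010\Common\DAL", "DALInitiateClearPoolSeconds", 60),
]


def build_reg_add(key: str, name: str, value: int) -> str:
    """Build one 'reg add' command line using plain ASCII double quotes."""
    return f'reg add "{key}" /v "{name}" /t REG_DWORD /d {value} /f'


def uses_only_plain_quotes(command: str) -> bool:
    """True when the command contains no typographic (curly) quote characters."""
    return not any(ch in command for ch in "\u201c\u201d\u2018\u2019")


COMMANDS = [build_reg_add(key, name, value) for key, name, value in SETTINGS]

for command in COMMANDS:
    assert uses_only_plain_quotes(command)
    print(command)
```

The script only prints the commands, so they can be reviewed before being run in an elevated command prompt on each management server.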

1 Comment

  1. Pavel Masek

    Hi Kevin, very nice article – thank you very much for this.
    I would like to ask if there are also similar suggested settings for SCOM agents? I have noticed this topic in the sources below:
    https://www.veeam.com/kb1026
    https://support.microsoft.com/en-us/help/975057/one-or-more-management-servers-and-their-managed-devices-are-dimmed-in (specifically related to possibly doubling the values for the ‘Persistence Version Store Maximum’ keys)
    – Free book – Operations Manager Field Experience (suggested values for the ‘State Queue Items’ key, which is missing by default).
    What is your opinion regarding that? Do you think that tweaking the agent is necessary in a large environment with many workflows running? We partly encounter the problems mentioned in the Veeam article. Thank you for your answer.
