
Stop Healthservice restarts in SCOM



 

This is probably the single biggest issue I find in 100% of customer environments.

YOU ARE IMPACTED.  Trust me.

 

This article applies to SCOM 2016 and 2019.  Quick download:  thekevinholman/SCOM.AgentThresholds: SCOM Agent threshold and alerting Management Pack (github.com)

 

SCOM monitors itself to ensure we aren’t using too much memory, or too many handles for the SCOM processes.  If we detect that the SCOM agent is using an unexpected amount of memory or handles, we will forcibly KILL the agent, and restart it.

That sounds good right?

In theory, yes.  In reality, however, this is KILLING your SCOM environment, and you probably aren’t even aware it is happening.

 

The problem?

1.  The default thresholds in SCOM 2016 are WAY out of touch with reality.  They were set almost 10 years ago, when systems used a LOT fewer resources than modern operating systems do today.  This is MUCH worse if you choose to MULTIHOME.  Multi-homed agents can use twice as many resources as non-multi-homed agents, and this restart can be issued from EITHER management group, but will affect BOTH.  (In SCOM 2019, these thresholds were changed to be more reasonable.)

2.  We don’t generate an alert when this happens, so you are blind that this is impacting you.
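To make the multi-homing point concrete, here is a minimal Python sketch (not SCOM code). Because either management group can issue the restart, the limit that matters in practice is the lowest one any group applies. The `GroupThresholds` type, the group names, and the sample values are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupThresholds:
    """Restart thresholds one management group applies to an agent (hypothetical type)."""
    name: str
    private_bytes: int
    handle_count: int


def effective_thresholds(groups):
    """Return the lowest (private_bytes, handle_count) limits across all groups."""
    if not groups:
        raise ValueError("agent must report to at least one management group")
    return (
        min(g.private_bytes for g in groups),
        min(g.handle_count for g in groups),
    )


# Illustrative values only: one group still on low limits, one already tuned.
groups = [
    GroupThresholds("MG-Old", private_bytes=943_718_400, handle_count=6_000),
    GroupThresholds("MG-New", private_bytes=1_610_612_736, handle_count=30_000),
]
print(effective_thresholds(groups))  # (943718400, 6000)
```

In this example the tuned group does not help: the agent is still restarted at the lower limits, so the overrides need to be applied in every management group the agent reports to.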

 

We need to change these in the product.  Until we do, a simple override is the solution.

 

Why is this so bad?

This is bad because of two impacts:

1.  You are hurting your monitored systems by restarting the agent on them over and over, causing the startup scripts to run in loops and actually consuming additional resources.  You are also going through periods of time without any monitoring, because while the agent is killed and restarting, the monitoring is unloaded.

2.  You are filling SCOM with state change events.  Every time all the monitors initialize, they send an updated “new” statechange event upon initialization.  You are hammering SCOM with useless state data.

 

What can I do about it?

Well, I am glad you asked!  We simply need to override 4 monitors, to give them realistic agent thresholds, and set them to generate an informational alert.  I will also include a view for these alerts so we can see if anyone is still generating them.  I will wrap all this in a sample management pack for you to download, below.

 

In the console, go to Authoring, Monitors, and change scope to “Agent”

image

 

We will override each one:

Private bytes monitors should be set to a default threshold of 1610612736  (about 1.6GB; exactly 1.5 GiB)

Handle Count monitors should be set to 30000  (the default of 6000 is WAY too low)

Override Generate Alert to True (to generate alerts)

Override Auto-Resolve to False (even though default is false, this must be set, to keep from auto-closing these so you can see them and their repeat count)

Override Alert severity to Information (to keep from ticketing on these events)
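The two numeric thresholds above can be sanity-checked with a short Python sketch. It only does the unit arithmetic and shows how far a sampled value sits above a limit. The `describe_breach` helper and the sample value are hypothetical, and treating "strictly greater than" as the breach condition is an assumption, not something taken from the monitor definition.

```python
# Recommended agent thresholds from the list above.
PRIVATE_BYTES_THRESHOLD = 1_610_612_736  # bytes
HANDLE_COUNT_THRESHOLD = 30_000

GIB = 1024 ** 3


def describe_breach(counter, value, threshold):
    """Return a short message if value exceeds threshold, else None.

    Assumption: a breach means value > threshold (hypothetical helper).
    """
    if value <= threshold:
        return None
    over = value - threshold
    return f"{counter}: {value} exceeds {threshold} by {over} ({over / threshold:.1%})"


print(PRIVATE_BYTES_THRESHOLD / GIB)  # 1.5 (GiB)
print(PRIVATE_BYTES_THRESHOLD / 1e9)  # 1.610612736 (decimal GB)
print(describe_breach("Handle Count", 31_500, HANDLE_COUNT_THRESHOLD))
```

Knowing which counter was breached and by how much is the first thing to look at when an agent keeps restarting after the overrides are in place.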

 

 

Override EACH monitor, “all objects of class” and choose “Agent” class.

image

 

NOTE: It is CRITICAL that we choose the “Agent” class for our overrides, because we do not want to impact thresholds already set on Management Servers or Gateways.

 

This is a good configuration:

image

image

image

image

 

Ok – those are much more reasonable defaults.

 

What else should I do?

Create an alert view that shows alerts that match your Alert Name

This will show you if you STILL have some agents restarting on a regular basis.  You should review the ones with high repeat counts on a weekly basis, and adjust their agent specific thresholds, or investigate why they are consuming so much, so often.  An occasional agent restart (one or fewer per day) is totally fine and probably not worth the time to investigate.
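The weekly review can be sketched as a simple filter, assuming you have exported each agent's repeat count from the alert view for the review window. The function name, the input dictionary, and the agent names are hypothetical. The sketch treats the repeat count as the number of restarts, which is an approximation.

```python
def agents_to_review(repeat_counts, window_days=7, max_per_day=1.0):
    """Return (agent, restarts_per_day) pairs above max_per_day, worst first."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    flagged = [
        (agent, count / window_days)
        for agent, count in repeat_counts.items()
        if count / window_days > max_per_day
    ]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hypothetical repeat counts for one week.
weekly_counts = {"DC01": 42, "SQL07": 7, "WEB03": 2}
for agent, rate in agents_to_review(weekly_counts):
    print(f"{agent}: {rate:.1f} restarts/day")  # DC01: 6.0 restarts/day
```

Agents at or below one restart per day (SQL07 and WEB03 here) are left out, matching the rule of thumb above.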

 

image

 

I also added to this MP – special overrides to enhance the Alert Name and Description, as documented here:

How to override the Alert Name and Alert Description of a Sealed Monitor – Kevin Holman’s Blog

 

I am including a management pack with these overrides, and the alert view, and you can download it below.

 

Download:   thekevinholman/SCOM.AgentThresholds: SCOM Agent threshold and alerting Management Pack (github.com)

26 Comments

  1. Peter Nilsson

    Hi!

    While tuning a new environment (with the help of https://kevinholman.com/2009/11/25/tuning-tip-turning-off-some-over-collection-of-events/), I just found out that you can also trace the restarts with the built-in event rule “Collect Restart System Center Management Health Service Events”, which is triggered by the actual restart, not by the memory or handle value. So I chose to leave that particular rule enabled instead of generating alerts. In a healthy environment these events should not fill up my databases anyway 🙂

    Thanks for all your insights, I’ve been using your blog basically every week for at least ten years!

    Regards
    Peter

  2. Pingback:SCOM 2016 – Agent (Health Service) high CPU utilization and service restart | POHN IT-Consulting GmbH

  3. Scott M.

    Good day Alexey,

    The SAC for SCOM has been discontinued but I’ll attempt to address SCOM 2019:

    In reviewing the Private Bytes Threshold monitors in SCOM 2019:
    The first threshold is considerably higher in SCOM 2019: 1,610,612,736 (vs 943,718,400 per Kevin’s SCOM 2016 recommendation), so the default setting in SCOM 2019 should suffice.

    SCOM 2019’s Override Value for Alert Severity still needs to be reviewed per environment and changed if warranted.

    SCOM 2019’s Override Value for Auto-Resolve Alert is in-line with what Kevin recommends for SCOM 2016.

    SCOM 2019’s Override Value for Generates Alert still needs to be reviewed per environment and changed if warranted.

    In reviewing the Handle Count Threshold monitors in SCOM 2019:
    The first threshold is exactly the same in SCOM 2019 as Kevin’s recommendation for SCOM 2016 (30,000), so no change should be needed.

    SCOM 2019’s Override Value for Alert Severity still needs to be reviewed per environment and changed if warranted.

    SCOM 2019’s Override Value for Auto-Resolve Alert is in-line with what Kevin recommends for SCOM 2016.

    SCOM 2019’s Override Value for Generates Alert still needs to be reviewed per environment and changed if warranted.

    Hopefully with the change (increase) in the Consecutive Sample Threshold, the noise of these alerts will be a thing of the past and when they do happen, will warrant action.

  4. Michael Ferstl

    Hi Kevin,
    I am having an issue where, on a lot of servers managed by a SCOM 2019 server, monitoringhost.exe restarts every 5 minutes and causes performance issues. I verified all the settings, but they are already higher than the minimum values recommended here.
    Have you had that experience already as well?

    br
    Michael

      • Ken

        I see Michael responded in a new post and not as a reply, but I wanted to respond here. We run SCOM 2016 UR9 in both our Production and NonProd/PreProd environments. I see MonitoringHost.exe restarting every 15 minutes on a lot of servers in our environment. We are using your recommended Agent overrides as well as UR9 with all the steps outlined in your excellent deployment guide. Any recommendations, or has this been seen by others elsewhere?

        • Kevin Holman

          That’s not normal. So it is important to know:

          1. Which monitor is causing the agent to restart?
          2. What threshold is being breached and by what value?

          This is the first step.

          The next step is to understand why. The most common cause is that you are discovering too many objects on this agent, either because of a bad MP or because of a scale that was not intended. You might need to change the MP, or you might need additional overrides for even larger thresholds in really rare scenarios.

          Another reason can be that your overrides are not working, are misconfigured, or are in conflict with other overrides.

  5. Michael Ferstl

    Hi Kevin, sorry didn’t see your reply till now.

    That’s the problem: I can’t see what is causing the restart of the process.
    We just realised via Dynatrace that we have a lot of process restarts on some machines, so I started investigating which machines are affected. About 80% of the machines show the same behaviour:
    There is one monitoringhost.exe process running all the time, and a second one starts, runs, finishes after a few seconds, and then starts again.
    The other 20% of the machines have 2 monitoringhost.exe processes running all the time.
    So I am not sure which behaviour is the correct one 🙂

    I tried to analyse it with Process Monitor, but can’t really see an obvious process exit reason:
    07:36:41,4202193 MonitoringHost.exe 12036 Thread Exit SUCCESS Thread ID: 12056, User Time: 0.1718750, Kernel Time: 0.0781250
    07:36:41,4222278 MonitoringHost.exe 12036 CreateFile C:\ SUCCESS Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Complete If Oplocked, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
    07:36:41,4225299 MonitoringHost.exe 12036 CreateFile C:\Program Files\Microsoft Monitoring Agent\Agent\MonitoringHost.exe SUCCESS Desired Access: Generic Read, Disposition: Open, Options: Synchronous IO Non-Alert, Non-Directory File, Complete If Oplocked, Open By ID, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: n/a, OpenResult: Opened
    07:36:41,4225647 MonitoringHost.exe 12036 CloseFile C:\ SUCCESS
    07:36:41,4226306 MonitoringHost.exe 12036 CloseFile C:\Program Files\Microsoft Monitoring Agent\Agent\MonitoringHost.exe SUCCESS
    07:36:41,4228776 MonitoringHost.exe 12036 Process Exit SUCCESS Exit Status: 0, User Time: 0.1875000 seconds, Kernel Time: 0.1875000 seconds, Private Bytes: 3.788.800, Peak Private Bytes: 4.128.768, Working Set: 15.642.624, Peak Working Set: 15.646.720
    07:36:41,4229403 MonitoringHost.exe 12036 RegOpenKey HKLM\System\CurrentControlSet\Services\bam\State\UserSettings\S-1-5-18 SUCCESS Desired Access: All Access
    07:36:41,4229677 MonitoringHost.exe 12036 RegQueryValue HKLM\System\CurrentControlSet\Services\bam\State\UserSettings\S-1-5-18\\Device\HarddiskVolume3\Program Files\Microsoft Monitoring Agent\Agent\MonitoringHost.exe NAME NOT FOUND Length: 40
    07:36:41,4229762 MonitoringHost.exe 12036 RegCloseKey HKLM\System\CurrentControlSet\Services\bam\State\UserSettings\S-1-5-18 SUCCESS
    07:36:41,4230846 MonitoringHost.exe 12036 CloseFile C:\Windows\System32 SUCCESS
    07:36:41,4231767 MonitoringHost.exe 12036 RegCloseKey HKLM\System\CurrentControlSet\Control\Session Manager SUCCESS
    07:36:41,4231845 MonitoringHost.exe 12036 RegCloseKey HKLM\System\CurrentControlSet\Control\Nls\Sorting\Versions SUCCESS
    07:36:41,4231910 MonitoringHost.exe 12036 RegCloseKey HKLM SUCCESS
    07:36:41,4232119 MonitoringHost.exe 12036 RegCloseKey HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options SUCCESS
    07:36:41,4232612 MonitoringHost.exe 12036 RegCloseKey HKCR SUCCESS
    07:36:41,4232941 MonitoringHost.exe 12036 RegCloseKey HKU\.DEFAULT\Software\Classes SUCCESS
    07:36:41,4233282 MonitoringHost.exe 12036 CloseFile C:\Windows\Registration\R000000000007.clb SUCCESS
    07:36:41,4234640 MonitoringHost.exe 12036 RegCloseKey HKLM\System\CurrentControlSet\Services\Tcpip\Parameters\Interfaces SUCCESS
    07:36:41,4234711 MonitoringHost.exe 12036 RegCloseKey HKLM\System\CurrentControlSet\Services\Tcpip6\Parameters\Interfaces SUCCESS
    07:36:41,4234827 MonitoringHost.exe 12036 CloseFile C:\Windows\System32\en-US\KernelBase.dll.mui SUCCESS
    07:36:41,4235328 MonitoringHost.exe 12036 RegCloseKey HKU\.DEFAULT\Control Panel\International SUCCESS
    07:36:41,4235385 MonitoringHost.exe 12036 RegCloseKey HKLM\System\CurrentControlSet\Control\Nls\Sorting\Ids SUCCESS

  6. Martijn

    Hi Kevin,

    I got to this article because, in SCOM 2019, whenever I change the action account in the ‘Default Action Account’ profile from system to a domain account (which is configured with logon as a service), multiple monitoringhost processes are spawned on that agent, but they claim 100% CPU and memory usage keeps increasing.

    Is the issue in this article still present in SCOM 2019? I have difficulty finding out why this is happening and what I need to do to fix it.

    • Kevin Holman

      Why are you changing the default action account from Local System? It should be Local System. Not all MPs support using a domain-based default action account.

    • Kevin Holman

      In SCOM 2019 – the default values are 1.6GB of private bytes and 30,000 handles. No changes should be needed when multi-homing between multiple management groups – just understand that the management group with the lowest settings will restart the agent when the thresholds are breached. When you multi-home – the agent is running twice the workloads, and using more handles and memory. So it’s possible you might need to bump the values up higher, but you will have to monitor for that. This is why I recommend turning on alerting, so you can be aware when this is happening, especially on agents hosting lots of objects (AD, SQL, etc.).

  7. Thomas

    Hi Kevin

    Not sure if this is a similar issue, but on our main management server, one of the MonitoringHost.exe processes keeps going up in memory usage to more than 8GB (server is configured with 16GB). We have monthly patch/reboot schedule, so this usage does reset then. Noticed this early 2020, and coincides with SCOM 2019 UR1 or update of some MPs at the same time. UR2 did not fix this so perhaps an MP, but no way we know of to find the offending MP – is there a way to find out? The memory usage history prior to this was pretty much flat. This only occurs on one management server (have 4 for different workloads).

    • Kevin Holman

      If it’s not causing a problem I wouldn’t be too concerned. But I would collect this value and watch it in a perf view – to see if it slowly rises and goes up and down, or if it is a linear graph that constantly goes up until there is some kind of failure. That’s a leak. If it is a leak then I’d open a support case, as it is likely a workflow or a module running on that specific management server that is causing it.
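As a rough illustration of telling "goes up and down" apart from "a linear graph that constantly goes up", the following Python sketch fits a straight line to collected memory samples and checks both the slope and how well the line fits. The function names, the 0.9 cut-off, and the sample values are assumptions chosen for illustration. This is not a SCOM feature and not a substitute for looking at the perf view.

```python
def linear_fit(samples):
    """Least-squares fit of samples against their index; returns (slope, r_squared)."""
    n = len(samples)
    if n < 3:
        raise ValueError("need at least three samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    syy = sum((y - mean_y) ** 2 for y in samples)
    slope = sxy / sxx
    # A perfectly flat series has no variance to explain; report 0.0.
    r_squared = 0.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, r_squared


def looks_like_leak(samples, min_r_squared=0.9):
    """True when memory grows and the growth is close to a straight line."""
    slope, r_squared = linear_fit(samples)
    return slope > 0 and r_squared >= min_r_squared


# Hypothetical private-bytes samples in MB, one per collection interval.
steady_growth = [500, 620, 740, 865, 980, 1100, 1225]
rises_and_falls = [500, 900, 520, 880, 510, 910, 505]
print(looks_like_leak(steady_growth))    # True
print(looks_like_leak(rises_and_falls))  # False
```

The second series has a slightly positive slope too, which is why the fit quality is checked as well as the direction.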

      • Thomas

        Thanks for the response. I think I’ve now narrowed it down to the Linux MPs. 80+ agents, one pool with one MP. Now added a second MP to the pool and it’s doing the same. The growth is linear, sounds like we need to open a case.

  8. Tristan BLONDEL

    Hello Kevin,
    I configured this a long time ago in my 2019 environment and I would like to add my 2 cents.
    The SCOM agent in its UR3 version consumes many more resources (CPU at least, and handle counts on the process) than the UR2 SCOM agent on the same server, causing the agent to forcibly restart. This issue is visible on AD servers, for example, when event logs are very active. This is currently tracked as a functional bug at Microsoft and the product team is working on it. They have changed the way event logs are read, and that seems to be the issue.
    Mathew MANOJ was referring to his article here: https://nathangau.wordpress.com/2021/06/03/update-on-security-monitoring-and-ur3/, saying that the issue was visible with the Security Monitoring Management Pack, but on my side this MP is not installed.

    My (159) agents are multihomed to Azure with a lot of Azure assessments (AD, Security, DNS…), which makes the issue appear. On my side, agents are raising a lot of event 26013, meaning that event logs are wrapped or not active. They then raise Handle Count and sometimes Private Bytes alerts on the monitoringhost process, sometimes alerts are raised for the queue filling up, and in the end some agents unload System rule(s), leaving the agent greyed out (but heartbeating). The MS servers are generating alerts because processing backlogged events is taking a long time. Each time, the affected workflow is related to security event collection rules (from SCOM or from Azure MPs).

    Currently the only way to limit the impact is to roll the agent back to UR2.

    • Russell Zotz

      Thank you for the info Tristan. I am having this same issue show up on Domain Controllers and it’s not good. It is causing high CPU. I’ve made some adjustments to the agent parameters which have lowered the CPU some, and I am hoping they have a fix soon.

      • Kevin Holman

        I’d strongly suggest opening up a support case with Microsoft on this issue. The more customers that do this will drive a private hotfix or speed up UR4. In the meantime – the workaround is simple – downgrade the agent to RTM + UR2.

        • Russell Zotz

          Agreed, will discuss with mgmt what they want to do. It was mentioned to disable the Security Monitoring MP, but I do not see one called that in the list of installed MPs, so it must be called something else, I guess. We upgraded only because our audit picked up that we were out of date, and we were on UR2.

  9. Sameer Mujawar

    Hi,
    Thanks for all your findings and the support you have been providing, which helps us resolve our SCOM issues.
    The “Stop Healthservice restarts in SCOM” guidance can be applied to systems where monitoring is done with an agent.
    What should be done in an Agentless Exception Monitoring scenario where memory and CPU utilization go high? Can you suggest any tweaks or configuration changes that will ensure monitoring 25000 clients per Management Server is achieved with proper CPU and memory utilization?

  10. Pingback:Coffee Break: The SCOM Clinic: Your Questions Answered (Part 1) - SCOMathon

    • Kevin Holman

      Partially. They finally fixed the default values – changing them to 30,000 handles and 1.6GB private bytes for both Healthservice and MonitoringHost.

      However, they still do not generate alerts by default, so if your agents are restarting, you won’t be aware of it. I still recommend turning on alerting for these monitors, even if using an informational alert, just so the SCOM admins can be aware of agents that are consuming WAY too many resources and suiciding.
