Stop Healthservice restarts in SCOM 2016


 

This is probably the single biggest issue I find in 100% of customer environments.

YOU ARE IMPACTED.  Trust me.

 

SCOM monitors itself to ensure we aren’t using too much memory, or too many handles for the SCOM processes.  If we detect that the SCOM agent is using an unexpected amount of memory or handles, we will forcibly KILL the agent, and restart it.

That sounds good right?

In theory, yes.  In reality, however, this is KILLING your SCOM environment, and you probably aren’t even aware it is happening.

 

The problem?

1.  The default thresholds are WAY out of touch with reality.  They were set almost 10 years ago, when systems used a LOT fewer resources than modern operating systems do today.  This is MUCH worse if you choose to MULTIHOME.  Multi-homed agents can use twice as many resources as non-multi-homed agents, and this restart can be issued from EITHER management group, but will affect BOTH.

2.  We don’t generate an alert when this happens, so you are blind to the fact that this is impacting you.

 

We need to change these in the product.  Until we do, a simple override is the solution.

 

Why is this so bad?

This is bad because of two impacts:

1.  You are hurting your monitored systems by restarting the agent over and over, causing the startup scripts to run on loops and actually consuming additional resources.  You are also going through periods of time without any monitoring, because while the agent is killed and restarting, the monitoring is unloaded.

2.  You are filling SCOM with state change events.  Every time all the monitors initialize, they send an updated “new” state change event upon initialization.  You are hammering SCOM with useless state data.

 

What can I do about it?

Well, I am glad you asked!  We simply need to override 4 monitors, to give them realistic agent thresholds, and set them to generate an informational alert.  I will also include a view for these alerts so we can see if anyone is still generating them.  I will wrap all this in a sample management pack for you to download.

 

In the console, go to Authoring, Monitors, and change scope to “Agent”

image

 

We will override each one:

Private bytes monitors should be set to a threshold of 943718400 bytes (900MB, triple the default of 300MB)

Handle Count monitors should be set to 30000  (the default of 6000 is WAY low)

Override Generate Alert to True (to generate alerts)

Override Auto-Resolve to False (even though default is false, this must be set, to keep from auto-closing these so you can see them and their repeat count)

Override Alert severity to Information (to keep from ticketing on these events)
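
If you want to sanity check the numbers above, here is a quick sketch of the arithmetic in Python.  It assumes the 300MB default uses binary megabytes (1MB = 1024 * 1024 bytes), which is what makes the 943718400 figure work out.  The variable and dictionary key names are made up for illustration and are not SCOM parameter names.

```python
# Illustrative arithmetic only; none of these names are SCOM identifiers.
MB = 1024 * 1024  # bytes per megabyte (binary convention, an assumption)

default_private_bytes = 300 * MB              # the 300MB default
new_private_bytes = 3 * default_private_bytes  # triple the default

default_handle_count = 6000
new_handle_count = 30000

# The override values described above, collected in one place.
overrides = {
    "private_bytes_threshold": new_private_bytes,
    "handle_count_threshold": new_handle_count,
    "generate_alert": True,
    "auto_resolve": False,
    "alert_severity": "Information",
}

print(new_private_bytes)                        # 943718400
print(new_private_bytes // MB)                  # 900
print(new_handle_count / default_handle_count)  # 5.0
```

So the private bytes value is 900MB expressed in bytes, and the handle count goes to five times the default.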

 

 

Override EACH monitor, “all objects of class” and choose “Agent” class.

image

 

NOTE: It is CRITICAL that we choose the “Agent” class for our overrides, because we do not want to impact thresholds already set on Management Servers or Gateways.

 

This is a good configuration:

image

image

image

image

 

Ok – those are much more reasonable defaults.

 

What else should I do?

Create an alert view that shows alerts with name “Microsoft.SystemCenter.Agent.%”

This will show you if you STILL have some agents restarting on a regular basis.  You should review the ones with high repeat counts on a weekly basis, and either adjust their agent-specific thresholds or investigate why they are consuming so much, so often.  An occasional agent restart (one or fewer per day) is totally fine and probably not worth the time to investigate.
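
As a rough sketch of that weekly review, the Python below filters a list of alerts to the ones matching the view criteria and flags agents averaging more than one restart per day.  Everything here is hypothetical: the `Alert` record, the sample server names, and the example alert names are made up, and it assumes you have exported the alert name, repeat count, and age from the view by some other means.  It also assumes a repeat count of 0 means the alert fired once.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    # Hypothetical record shape for an exported alert row.
    agent: str
    name: str
    repeat_count: int
    days_open: float


def is_agent_restart_alert(alert, prefix="Microsoft.SystemCenter.Agent."):
    # Equivalent of the view criteria "Microsoft.SystemCenter.Agent.%",
    # where % is a trailing wildcard.
    return alert.name.startswith(prefix)


def needs_review(alert, max_per_day=1.0):
    # Assumption: repeat_count 0 means the alert fired once.
    occurrences = alert.repeat_count + 1
    # Treat anything open for less than a day as one full day.
    per_day = occurrences / max(alert.days_open, 1.0)
    return per_day > max_per_day


# Made-up sample data for illustration.
alerts = [
    Alert("srv01", "Microsoft.SystemCenter.Agent.ExampleA", 40, 7),
    Alert("srv02", "Microsoft.SystemCenter.Agent.ExampleB", 2, 7),
    Alert("srv03", "Some.Other.Alert", 100, 7),
]

to_review = sorted(
    (a for a in alerts if is_agent_restart_alert(a) and needs_review(a)),
    key=lambda a: a.repeat_count,
    reverse=True,
)

for a in to_review:
    print(a.agent, a.repeat_count)
```

With this sample data only srv01 is flagged: srv02 averages less than one restart per day, and srv03 does not match the alert name criteria.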

 

image

 

I am including a management pack with these overrides and the alert view, and you can download it below if you prefer not to make your own.

 

Download:

https://gallery.technet.microsoft.com/SCOM-Agent-Threshold-b96c4d6a

2 Comments

  1. Peter Nilsson

    Hi!

    While tuning a new environment ( with the help of https://kevinholman.com/2009/11/25/tuning-tip-turning-off-some-over-collection-of-events/ ) I just found out that you can also trace the restarts with the built-in event rule “Collect Restart System Center Management Health Service Events”, which is triggered by the actual restart, not the memory or handle value. So I chose not to tune that particular rule out, instead of generating alerts. In a healthy environment these events should not fill up my databases anyway 🙂

    Thanks for all your insights, I’ve been using your blogs basically every week for at least ten years!

    Regards
    Peter
