This is probably the single biggest issue I find in 100% of customer environments.
YOU ARE IMPACTED. Trust me.
SCOM monitors itself to ensure we aren’t using too much memory, or too many handles for the SCOM processes. If we detect that the SCOM agent is using an unexpected amount of memory or handles, we will forcibly KILL the agent, and restart it.
That sounds good, right?
In theory, yes. In reality, however, this is KILLING your SCOM environment, and you probably aren’t even aware it is happening.
1. The default thresholds are WAY out of touch with reality. They were set almost 10 years ago, when systems used a LOT fewer resources than modern operating systems do today. This is MUCH worse if you choose to MULTIHOME. Multi-homed agents can use twice as many resources as non-multi-homed agents, and this restart can be issued from EITHER management group, but will affect BOTH.
2. We don’t generate an alert when this happens, so you have no way of seeing that this is impacting you.
We need to change these in the product. Until we do, a simple override is the solution.
Why is this so bad?
This is bad because of two impacts:
1. You are hurting your monitored systems by restarting the agent over and over, causing the startup scripts to run in a loop and actually consume additional resources. You are also going for periods of time without any monitoring, because while the agent is killed and restarting, the monitoring is unloaded.
2. You are filling SCOM with state change events. Every time the monitors initialize, each one sends an updated “new” state change event. You are hammering SCOM with useless state data.
What can I do about it?
Well, I am glad you asked! We simply need to override 4 monitors, to give them realistic agent thresholds, and set them to generate an informational alert. I will also include a view for these alerts so we can see if anyone is still generating them. I will wrap all this in a sample management pack for you to download.
In the console, go to Authoring, Monitors, and change scope to “Agent”
We will override each one:
Private bytes monitors should be set to a default threshold of 943718400 (triple the default of 300MB)
Handle Count monitors should be set to 30000 (the default of 6000 is WAY low)
Override Generate Alert to True (to generate alerts)
Override Auto-Resolve to False (even though default is false, this must be set, to keep from auto-closing these so you can see them and their repeat count)
Override Alert severity to Information (to keep from ticketing on these events)
Override EACH monitor, “all objects of class” and choose “Agent” class.
NOTE: It is CRITICAL that we choose the “Agent” class for our overrides, because we do not want to impact thresholds already set on Management Servers or Gateways.
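If you want to sanity check the numbers before typing them into the override dialog, here is a quick sketch of the arithmetic in Python. It assumes the 300MB default is counted in binary megabytes (1024 * 1024 bytes), which is what makes the tripled value come out to exactly 943718400. The constant and function names are just illustrative, not anything from the product.

```python
# Sketch of the threshold arithmetic. Names here are illustrative only.
MIB = 1024 * 1024  # bytes per (binary) megabyte, an assumption about the unit

DEFAULT_PRIVATE_BYTES = 300 * MIB  # default private bytes threshold (300MB)
DEFAULT_HANDLE_COUNT = 6000        # default handle count threshold


def private_bytes_threshold(multiplier: int = 3) -> int:
    """Return the private bytes threshold as a multiple of the default."""
    return DEFAULT_PRIVATE_BYTES * multiplier


new_private_bytes = private_bytes_threshold()  # triple the default
new_handle_count = 30000                       # value recommended above

print(new_private_bytes)                         # value to enter in the override
print(new_private_bytes // MIB)                  # same value in megabytes
print(new_handle_count // DEFAULT_HANDLE_COUNT)  # multiple of the default
```

The same helper can be reused if a specific agent needs a higher multiplier than the class-wide override.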
This is a good configuration:
Ok – those are much more reasonable defaults.
What else should I do?
Create an alert view that shows alerts with name “Microsoft.SystemCenter.Agent.%”
This will show you if you STILL have some agents restarting on a regular basis. You should review the ones with high repeat counts on a weekly basis, and adjust their agent-specific thresholds, or investigate why they are consuming so much, so often. An occasional agent restart (one or fewer per day) is totally fine and probably not worth the time to investigate.
I am including a management pack with these overrides and the alert view, and you can download it below if you prefer not to make your own.
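To make the weekly review rule concrete, here is a rough sketch in Python of how you might triage the alerts once you have copied them out of the view. Everything in it is hypothetical: the agent names, the sample counts, and the assumption that an alert's repeat count is a good enough stand-in for the number of restarts. It only applies the "more than one per day" rule of thumb and does not talk to SCOM at all.

```python
from typing import List, NamedTuple


class AgentAlert(NamedTuple):
    """One alert row copied from the view (hypothetical shape)."""
    agent: str
    repeat_count: int  # used here as an approximation of restart count
    days_open: int     # how many days the alert has been accumulating


def needs_review(alert: AgentAlert, max_per_day: float = 1.0) -> bool:
    """True when the agent restarts more often than the allowed daily rate."""
    days = max(alert.days_open, 1)  # avoid dividing by zero on new alerts
    return alert.repeat_count / days > max_per_day


def weekly_review(alerts: List[AgentAlert]) -> List[AgentAlert]:
    """Return the agents worth investigating, worst offenders first."""
    flagged = [a for a in alerts if needs_review(a)]
    return sorted(flagged, key=lambda a: a.repeat_count, reverse=True)


# Made-up sample data for illustration only.
sample = [
    AgentAlert("srv-a.example.local", 3, 7),
    AgentAlert("srv-b.example.local", 42, 7),
    AgentAlert("srv-c.example.local", 15, 7),
]

for alert in weekly_review(sample):
    print(alert.agent, alert.repeat_count)
```

Agents that fall under the daily rate are left out on purpose, so the list stays short enough to actually work through each week.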