This is a script I demoed recently. It is designed to help you keep track of agents that are not communicating with SCOM, fix them, or categorize them into groups for troubleshooting.
I was working with a customer that has a very large environment, with 1000 agents not communicating. Most of these were down because the customer has no integrated decommission process, so people retire servers without telling the monitoring team. This creates server-down alerts that get ignored, because the operations team receiving them recognizes the servers as decommissioned. This is a bad practice: removing monitoring should be part of any server decommission process.
When you run it, it writes a CSV report to the C:\windows\temp directory and outputs a grid to the screen.
The script gets all agents whose Health Service Watcher object is in a critical state, then loops through each one and checks:
- Is the server in maintenance mode?
- When was the server last communicating or reset?
- What are the management server assignments?
- Can we resolve the agent from DNS?
- Can we ping the agent now?
- Can we connect to the remote Service Control Manager?
- Can we get the status of the HealthService service?
- If it is stopped, start it
- If it is disabled, fix it
- If someone uninstalled the agent, let us know
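The order of those checks matters, because each one only makes sense if the previous one passed. Here is a minimal sketch of that decision flow. It is not the script itself: it is written in Python purely to illustrate the logic, and the category names, the `triage` function, and the probe callables (`resolve`, `ping`, `service_status`) are all hypothetical names made up for this example. A real version would query SCOM and the remote Service Control Manager instead of taking stubs.

```python
import socket
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AgentReport:
    name: str
    category: str      # hypothetical bucket name used for grouping
    action: str = ""   # what a fix-up pass would attempt, if anything


def resolve_dns(name: str) -> Optional[str]:
    """Return an IPv4 address for name, or None if the lookup fails."""
    try:
        return socket.gethostbyname(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return None


def triage(
    name: str,
    in_maintenance: bool,
    resolve: Callable[[str], Optional[str]],
    ping: Callable[[str], bool],
    service_status: Callable[[str], Optional[str]],
) -> AgentReport:
    """Walk the checks in order and stop at the first one that explains the outage.

    service_status is assumed to return "Running", "Stopped", "Disabled",
    "NotInstalled", or None when the remote service manager is unreachable.
    """
    if in_maintenance:
        return AgentReport(name, "MaintenanceMode")
    if resolve(name) is None:
        # No DNS record: a likely sign of a decommissioned server.
        return AgentReport(name, "DnsFailure")
    if not ping(name):
        return AgentReport(name, "NoPing")
    status = service_status(name)
    if status is None:
        # Firewall or permissions are the usual suspects here.
        return AgentReport(name, "ScmUnreachable")
    if status == "NotInstalled":
        return AgentReport(name, "AgentUninstalled")
    if status == "Disabled":
        return AgentReport(name, "ServiceDisabled", "enable and start service")
    if status == "Stopped":
        return AgentReport(name, "ServiceStopped", "start service")
    # Service is running but the agent still is not heartbeating.
    return AgentReport(name, "ServiceRunning", "investigate further")
```

With stubbed probes, `triage("srv01", False, lambda n: "10.0.0.1", lambda n: True, lambda n: "Stopped")` lands in the `ServiceStopped` bucket with a "start service" action. Passing the probes in as functions keeps the ordering logic separate from the network calls, which makes it easy to swap in whatever remote checks your environment allows.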
This is really helpful when you have a large environment and a large number of agents that are not communicating.
Obviously, firewalls create issues for running a script like this, and you must have rights on the agent machines in order to remotely interrogate or fix services.