Quick download: SCOM.GatewayMaintenanceMode (github.com)
This is an often-requested solution: how do we stop alert storms from agents that appear unavailable when the cause is a network outage rather than a real server-down issue?
This post covers a Management Pack based solution for that problem.
This MP was part of the SCOMAThon HackaSCOM 2021 challenge. You can watch the video presentation of this management pack here:
https://youtu.be/md21GGxAzUo?t=1556
Challenge:
When a Gateway is not available:
- Suppress alerts from all agents that report to that Gateway
- Trigger maintenance mode for these agents
- Remove maintenance mode when the GW is available again
Components:
- Monitor availability for each Gateway with ping
- Recovery to START maintenance mode
- Recovery to STOP maintenance mode
- Group for each Gateway
- Groups are populated with agents assigned to that Gateway
- Groups are placed into and out of Maintenance mode
- Views for Alerts, State, Group memberships
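The components above boil down to a small state machine per Gateway. The following Python sketch only illustrates that logic. It is not code from the MP, which implements it with a ping monitor and two recoveries in XML. The function name `evaluate_cycle`, the `is_reachable` callback, and the `START_MM` / `STOP_MM` labels are assumptions made for this example.

```python
from typing import Callable, Dict, List, Set, Tuple


def evaluate_cycle(
    gateway_groups: Dict[str, Set[str]],
    is_reachable: Callable[[str], bool],
    groups_in_mm: Set[str],
) -> List[Tuple[str, str]]:
    """Run one polling cycle and return the (action, gateway) pairs it triggers.

    gateway_groups maps each Gateway name to the agents that report to it.
    groups_in_mm holds the Gateways whose group is currently in maintenance
    mode; it is updated in place so the next cycle sees the new state.
    """
    actions: List[Tuple[str, str]] = []
    for gateway in sorted(gateway_groups):
        up = is_reachable(gateway)
        if not up and gateway not in groups_in_mm:
            # Gateway just became unreachable: put its agent group into MM.
            groups_in_mm.add(gateway)
            actions.append(("START_MM", gateway))
        elif up and gateway in groups_in_mm:
            # Gateway is back: take its agent group out of MM.
            groups_in_mm.discard(gateway)
            actions.append(("STOP_MM", gateway))
    return actions


# Hypothetical lab data: two Gateways, gw1 is currently unreachable.
groups = {"gw1": {"agent-a", "agent-b"}, "gw2": {"agent-c"}}
in_mm: Set[str] = set()
print(evaluate_cycle(groups, lambda gw: gw != "gw1", in_mm))
```

Keeping the current maintenance state separate from the probe result means a Gateway that stays down does not trigger the START action on every cycle.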
You will need to customize this MP to test it in your lab environment. It will not work out of the box as it is customized for my lab server names.
Out of the box the views look like this:
There is a monitor that pings each Gateway in your environment:
If we detect the GW is unavailable, there is a recovery on this monitor to set the corresponding group for this Gateway’s agents into maintenance mode:
You can see above the Group is in Maintenance mode. Below we will see the agents in that group in MM, with a special comment:
When the network outage is recovered – there is another recovery that runs to end maintenance mode:
What’s next?
The biggest issue with this solution is that it requires the customer to edit the XML to customize the MP for their environment. It needs to be made more dynamic in nature to accommodate any environment.
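One possible direction, sketched in Python rather than MP XML: build the per-Gateway groups from data instead of hard-coding server names. This is only an illustration under assumptions. The input is a hypothetical list of (agent, gateway) pairs, and how that list would be obtained from SCOM is not shown here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def group_agents_by_gateway(
    assignments: Iterable[Tuple[str, str]],
) -> Dict[str, Set[str]]:
    """Build one agent group per Gateway from (agent, gateway) pairs.

    Names are lower-cased so that "GW1" and "gw1" end up in the same group.
    """
    groups: Dict[str, Set[str]] = defaultdict(set)
    for agent, gateway in assignments:
        groups[gateway.lower()].add(agent.lower())
    return dict(groups)


# Hypothetical assignments; real names would come from the environment.
pairs = [("Web01", "GW1"), ("web02", "gw1"), ("DB01", "GW2")]
print(group_agents_by_gateway(pairs))
```

With groups derived this way, adding a Gateway would not require editing names by hand.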
Can the MP also be used with older SCOM versions? SCOM 2012 R2
It should be usable. You might have to adjust the minimum version requirements in the manifest; I don’t know. I don’t test with SCOM 2012 as that goes out of support this year.
Keep in mind this is an example MP, and very “version 1.0”.
Hi Kevin,
What can be modified here to put the Gateway and/or its connected agents into Maintenance Mode manually during Windows patching on the Gateway, which always requires a reboot of the Gateway and causes an alert storm?
Hi Kevin,
How can I customize the MP to target the GWs in our environment? I tried to rename the SCOM group, but the discovery is not running. Thanks.
Awesome idea Kevin!
I was wondering: we always deploy at least two GWs for HA/failover for each untrusted domain.
If one GW goes down, the agents fail over to the other, so triggering Maintenance Mode for a single GW failure is not correct.
Would it be possible to create Resource Pools for each untrusted domain and use the Resource Pool Availability monitor as the source for Maintenance Mode instead?
Or do you have a better idea? 🙂
In a case like this, I’d rather ping the site router. That would be a better indication that the site is down.
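A minimal Python sketch of that idea, assuming a hypothetical `is_reachable` probe: the site counts as down only when every probe target is unreachable. The targets can be the site router alone, or all Gateways of an HA pair. This is not part of the MP.

```python
from typing import Callable, Iterable


def site_is_down(
    targets: Iterable[str],
    is_reachable: Callable[[str], bool],
) -> bool:
    """Return True only if there is at least one target and none responds."""
    target_list = list(targets)
    if not target_list:
        # Nothing to probe: do not claim an outage.
        return False
    return not any(is_reachable(target) for target in target_list)


# Hypothetical probe results: one Gateway of the pair is still up.
status = {"gw-a": False, "gw-b": True, "site-router": True}
print(site_is_down(["gw-a", "gw-b"], status.get))  # False
print(site_is_down(["site-router"], status.get))   # False
```

With this check, a single failed Gateway in an HA pair does not put the agents into maintenance mode.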
Nice solution Kevin!
I’m thinking about the following: what happens to servers that are already in Maintenance Mode, for which you don’t want monitoring re-enabled when the gateway becomes healthy?
They should stay in MM. Newer versions of SCOM were enhanced so that the MM end date for objects is retained if they were already in MM for a longer period.
I have tried to understand how I can add my GW servers to this management pack, but I can’t see how :(. Would it be possible to create a fragment for MP Author Pro / Studio for adding a new GW server? I also hope that you will add documentation to GitHub on how to edit the MP file to add new GW servers (I have many of them).
Hi Jukka-Pekka,
I see your question was not answered.
@Kevin, will you be so kind as to provide some guidance here? I have 17 Gateway servers and need to add each separately, I assume. How do I create this?
Or is there a way to pick this up through the system?
I am using SCOM 2019 UR4, waiting for UR6 to come out and upgrade to UR5.
I am also planning to upgrade the system to SCOM 2022
Hi Kevin,
This is an awesome MP but I need to know where I change the XML for my environment.
Will you be so kind as to provide some guidance here? I have 17 Gateway servers and need to add each separately. How do I create or change this?
Kind regards.
Pieter
Hi Kevin,
There may be an issue that started in 2022 UR1 that can affect this process significantly (seconds to hours). It has to do with a change to a Stored Procedure. I have sent the full details and what was discussed with Microsoft, and what we did to resolve it via your Contact Me option, but it was kind of lengthy so not sure if I went over any character limit. Can you let me know if you didn’t receive it?
Hi Kevin,
This is an awesome MP but I need to know where I change the XML for my environment.
Will you be so kind as to provide some guidance here? I have 17 Gateway servers and need to add each separately. How do I create or change this? Could you please respond to my query?
Kind regards.
Pieter