In SCOM, when a Cluster Resource Group is not online, or partially online, we have a monitor that alerts on this condition.
This alert comes from a Rollup Monitor (Resource Group Rollup Monitor).
The problem with using a Rollup Monitor that targets the Cluster Resource Group is:
1. Rollup Monitors cannot provide context from the underlying Unit monitor, so you cannot get much data from them, other than “something is broken”
2. More importantly, there is nothing in the alert that tells you about the NODE Computer that the issue originated from. The source is simply the name of the Resource Group, as seen in the image above.
If you integrate SCOM with an Incident Management system, this particular alert is nearly useless, because it lacks a source or path to a real Computer/OS. However, this is the ONLY alert we get in SCOM to notify us that a Cluster Resource Group has a serious issue.
One of the solutions is to enable Alerting (via overrides) from the Unit monitors that actually do the detection on each node:
Now, we will get alerts from EACH node in the cluster, but the alerts will have a real server name that we can work with:
This solution has two downsides:
1. The Alerts are UGLY. Since the monitor was not designed with Alerting configured, the alert name is not formatted and the Alert Description is empty.
2. You will get one alert from EACH NODE in the cluster, for the same issue.
Issue #1 above we can fix. We can use the solution from: How to override the Alert Name and Alert Description of a Sealed Monitor – Kevin Holman’s Blog
Issue #2 we can’t work around. You just have to accept that you might see multiple alerts at the same time for the same issue from different node perspectives. Perhaps if your incident management system has dedupe capabilities, it can manage this.
For the ugly alert Name and empty Alert Description – let's address that:
I’ll start by creating a new empty Management Pack, “Overrides – Clusters”.
I’ll save my override examples above into this MP to turn on Alerting for the monitor.
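For reference, an override that turns on alerting ends up as a MonitorPropertyOverride in the Monitoring > Overrides section of the exported XML. The following is only a minimal sketch, not the exact output of the console: the override ID is a made-up name, and the MWCML and MWCMM aliases assume the references shown later in this post. Any companion overrides the console may have written (for alert severity, for example) are left out.

```xml
<Monitoring>
  <Overrides>
    <!-- Hypothetical ID: the console generates its own name for this override -->
    <MonitorPropertyOverride ID="OverrideForMonitor.Microsoft.Windows.Cluster.HostedGroup.StateMonitoring.GenerateAlert" Context="MWCML!Microsoft.Windows.Cluster.HostedGroup" Enforced="false" Monitor="MWCMM!Microsoft.Windows.Cluster.HostedGroup.StateMonitoring" Property="GenerateAlert">
      <!-- Turns on alert generation for the unit monitor on each node -->
      <Value>true</Value>
    </MonitorPropertyOverride>
  </Overrides>
</Monitoring>
```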
Export the MP, and open the XML. I need to create a Presentation > StringResources section, and create a StringResource in it for my new AlertMessage.
Here is my String Resource:
<Presentation>
  <StringResources>
    <StringResource ID="Overrides.Clusters.Resource.Group.State.Monitor.AlertMessage" />
  </StringResources>
</Presentation>
Next I need to create a DisplayString for my StringResource:
<DisplayString ElementID="Overrides.Clusters.Resource.Group.State.Monitor.AlertMessage">
  <Name>Cluster resource group offline or partially online</Name>
  <Description>Based on the severity of this alert, the resource group on the cluster is offline (Critical) or partially online (Warning).</Description>
</DisplayString>
Now here is what my Presentation and LanguagePack section looks like in XML:
<Presentation>
  <StringResources>
    <StringResource ID="Overrides.Clusters.Resource.Group.State.Monitor.AlertMessage" />
  </StringResources>
</Presentation>
<LanguagePacks>
  <LanguagePack ID="ENU" IsDefault="false">
    <DisplayStrings>
      <DisplayString ElementID="Overrides.Clusters">
        <Name>Overrides - Clusters</Name>
      </DisplayString>
      <DisplayString ElementID="Overrides.Clusters.Resource.Group.State.Monitor.AlertMessage">
        <Name>Cluster resource group offline or partially online</Name>
        <Description>Based on the severity of this alert, the resource group on the cluster is offline (Critical) or partially online (Warning).</Description>
      </DisplayString>
    </DisplayStrings>
  </LanguagePack>
</LanguagePacks>
Now – we can import this MP with these changes.
This doesn’t change the alert yet; it just sets things up so we can create the override. However, this type of override must reference the unique GUID of the StringResource, so we need to get that GUID before we continue. We can easily do that now, using a SQL query to find the GUID for the string we just created.
Query your OpsDB:
SELECT StringResourceId FROM StringResource WHERE StringResourceName = 'Overrides.Clusters.Resource.Group.State.Monitor.AlertMessage'
This GUID is what mine returned. Yours might be different, if you used anything different for the StringResourceID in XML. We have to get the GUID before creating the final step – the override.
Now that I have my GUID for my custom Alert String Resource, I can create my override in XML:
<MonitorPropertyOverride ID="OverrideForMonitor.Microsoft.Windows.Cluster.HostedGroup.StateMonitoring.AlertMessage" Context="MWCML!Microsoft.Windows.Cluster.HostedGroup" Enforced="false" Monitor="MWCMM!Microsoft.Windows.Cluster.HostedGroup.StateMonitoring" Property="AlertMessage">
  <Value>3DDCF88A-435E-40A2-BC42-50820505875D</Value>
</MonitorPropertyOverride>
Notice I used the GUID I got from my SQL query.
Now – the above part is easy to get messed up on.
You need to match up the References to the MP containing the “Microsoft.Windows.Cluster.HostedGroup” target class, and the MP containing the monitor “Microsoft.Windows.Cluster.HostedGroup.StateMonitoring”.
These will be in the Manifest > References section of your MP:
<Manifest>
  <Identity>
    <ID>Overrides.Clusters</ID>
    <Version>1.0.0.0</Version>
  </Identity>
  <Name>Overrides - Clusters</Name>
  <References>
    <Reference Alias="MWCMM">
      <ID>Microsoft.Windows.Cluster.Management.Monitoring</ID>
      <Version>10.1.0.0</Version>
      <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
    </Reference>
    <Reference Alias="MWCML">
      <ID>Microsoft.Windows.Cluster.Management.Library</ID>
      <Version>10.1.0.0</Version>
      <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
    </Reference>
    <Reference Alias="SC">
      <ID>Microsoft.SystemCenter.Library</ID>
      <Version>7.0.8448.6</Version>
      <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
    </Reference>
  </References>
</Manifest>
You need to match up the Reference Alias with what is in your MP. These might already match fine, or might need to be changed, depending on how the Override MP was initially created and which tools added the references, as the aliases are not standardized.
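As an illustration of what "matching up" means, suppose a tool had created the reference to the monitoring MP under a different alias. The alias "Cluster1" below is purely hypothetical. The point is only that the prefix before the "!" in Context and Monitor has to be the Alias of the Reference whose ID is the MP that contains that class or monitor. This sketch assumes the library MP reference still uses the MWCML alias.

```xml
<!-- In Manifest > References: hypothetical alias "Cluster1" for the monitoring MP -->
<Reference Alias="Cluster1">
  <ID>Microsoft.Windows.Cluster.Management.Monitoring</ID>
  <Version>10.1.0.0</Version>
  <PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
</Reference>

<!-- In Monitoring > Overrides: Monitor must then use Cluster1! instead of MWCMM! -->
<MonitorPropertyOverride ID="OverrideForMonitor.Microsoft.Windows.Cluster.HostedGroup.StateMonitoring.AlertMessage" Context="MWCML!Microsoft.Windows.Cluster.HostedGroup" Enforced="false" Monitor="Cluster1!Microsoft.Windows.Cluster.HostedGroup.StateMonitoring" Property="AlertMessage">
  <Value>3DDCF88A-435E-40A2-BC42-50820505875D</Value>
</MonitorPropertyOverride>
```

Either approach works: rename the alias in the Manifest to match the override, or change the prefix in the override to match the Manifest. Just be consistent everywhere that alias is used in the MP.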
Now Import the MP.
New alerts look MUCH better:
Now, the question might be – what kind of additional, dynamic data can we include in this alert to make it even better?
I’m glad you asked!
We can also include any context that is part of the monitor state change. These are usually propertybags that are part of the module output.
The override is for “AlertParameter1”, “AlertParameter2”, etc.
You can see these Context properties in Health Explorer, if your monitor has them:
So in this specific case, I can add into the Alert Description, the Name property value, which will tell me specifically which Cluster Resource Group Name has an issue.
First, I will add an override for the Cluster Resource Group Name value as seen in the Monitor Context output of the statechange above. You can get examples from this link.
<MonitorPropertyOverride ID="OverrideForMonitor.Microsoft.Windows.Cluster.HostedGroup.StateMonitoring.AlertParameter1" Context="MWCML!Microsoft.Windows.Cluster.HostedGroup" Enforced="false" Monitor="MWCMM!Microsoft.Windows.Cluster.HostedGroup.StateMonitoring" Property="AlertParameter1">
  <Value>$Data/Context/Property[@Name='Name']$</Value>
</MonitorPropertyOverride>
Lastly – we need to include these in our AlertDescription.
AlertParameter1 = {0}
AlertParameter2 = {1}
AlertParameter3 = {2}
etc…etc…
<DisplayString ElementID="Overrides.Clusters.Resource.Group.State.Monitor.AlertMessage">
  <Name>Cluster resource group offline or partially online</Name>
  <Description>Based on the severity of this alert, the resource group in the cluster is offline (Critical) or partially online (Warning). Cluster Resource Group Name: {0}</Description>
</DisplayString>
Now look at these alerts!
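The same pattern should extend to more parameters. Here is a rough sketch of adding a second one. The property name "SomeOtherProperty" and its label are placeholders I made up for illustration, not properties this monitor is known to output; substitute whatever Health Explorer actually shows in the state change context for your monitor.

```xml
<!-- In Monitoring > Overrides: AlertParameter2 feeds the {1} placeholder -->
<MonitorPropertyOverride ID="OverrideForMonitor.Microsoft.Windows.Cluster.HostedGroup.StateMonitoring.AlertParameter2" Context="MWCML!Microsoft.Windows.Cluster.HostedGroup" Enforced="false" Monitor="MWCMM!Microsoft.Windows.Cluster.HostedGroup.StateMonitoring" Property="AlertParameter2">
  <!-- Hypothetical property name: replace with one from your monitor's context -->
  <Value>$Data/Context/Property[@Name='SomeOtherProperty']$</Value>
</MonitorPropertyOverride>

<!-- In LanguagePacks > DisplayStrings: reference both {0} and {1} -->
<DisplayString ElementID="Overrides.Clusters.Resource.Group.State.Monitor.AlertMessage">
  <Name>Cluster resource group offline or partially online</Name>
  <Description>Based on the severity of this alert, the resource group in the cluster is offline (Critical) or partially online (Warning). Cluster Resource Group Name: {0} Some other property: {1}</Description>
</DisplayString>
```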
Recap:
1. Create a new StringResource (in the Presentation/StringResources section of the XML)
2. Create a new DisplayString (in the LanguagePacks section of the XML) that references your string resource and includes your modified Alert Name and AlertDescription.
3. Import this MP.
4. Get the GUID of your StringResource from SQL
5. Create a new Override in the XML, for “AlertMessage” property, with the GUID from above.
6. Add additional overrides for AlertParameters (optional) for any data included in the statechange context.
(Note: $Target and $MPElement replacements cannot be used in this override method for Alert Parameters, only $Data values work.)
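To show where each piece from the recap lands, here is a condensed skeleton of the override MP. Treat it as a sketch of the layout rather than a file to import as-is: the root element attributes and the Manifest are abbreviated (keep whatever your exported MP already has), and the GUID is the one my query returned, so yours will likely differ.

```xml
<!-- Keep the root element attributes exactly as they appear in your exported MP -->
<ManagementPack>
  <Manifest>
    <!-- Identity, Name, and References (MWCMM, MWCML, SC) as shown earlier -->
  </Manifest>
  <Monitoring>
    <Overrides>
      <!-- Step 5: point AlertMessage at the StringResource GUID from SQL -->
      <MonitorPropertyOverride ID="OverrideForMonitor.Microsoft.Windows.Cluster.HostedGroup.StateMonitoring.AlertMessage" Context="MWCML!Microsoft.Windows.Cluster.HostedGroup" Enforced="false" Monitor="MWCMM!Microsoft.Windows.Cluster.HostedGroup.StateMonitoring" Property="AlertMessage">
        <Value>3DDCF88A-435E-40A2-BC42-50820505875D</Value>
      </MonitorPropertyOverride>
      <!-- Step 6 (optional): pass state change context into {0} -->
      <MonitorPropertyOverride ID="OverrideForMonitor.Microsoft.Windows.Cluster.HostedGroup.StateMonitoring.AlertParameter1" Context="MWCML!Microsoft.Windows.Cluster.HostedGroup" Enforced="false" Monitor="MWCMM!Microsoft.Windows.Cluster.HostedGroup.StateMonitoring" Property="AlertParameter1">
        <Value>$Data/Context/Property[@Name='Name']$</Value>
      </MonitorPropertyOverride>
    </Overrides>
  </Monitoring>
  <Presentation>
    <StringResources>
      <!-- Step 1: the StringResource whose GUID you look up in step 4 -->
      <StringResource ID="Overrides.Clusters.Resource.Group.State.Monitor.AlertMessage" />
    </StringResources>
  </Presentation>
  <LanguagePacks>
    <LanguagePack ID="ENU" IsDefault="false">
      <DisplayStrings>
        <DisplayString ElementID="Overrides.Clusters">
          <Name>Overrides - Clusters</Name>
        </DisplayString>
        <!-- Step 2: the new Alert Name and Alert Description -->
        <DisplayString ElementID="Overrides.Clusters.Resource.Group.State.Monitor.AlertMessage">
          <Name>Cluster resource group offline or partially online</Name>
          <Description>Based on the severity of this alert, the resource group in the cluster is offline (Critical) or partially online (Warning). Cluster Resource Group Name: {0}</Description>
        </DisplayString>
      </DisplayStrings>
    </LanguagePack>
  </LanguagePacks>
</ManagementPack>
```

Remember the ordering constraint from the steps above: the MP has to be imported once with the StringResource and DisplayString (steps 1 to 3) before the GUID exists to query, and only then can the AlertMessage override be added and the MP imported again.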
You can download a copy of this Override MP example and use it if you prefer: Overrides.Clusters MP Download from Github
Personally, I think the entire Cluster MP needs a good overhaul, too many rule based alerts that need to be closed manually etc….
+1 gazillion to that. Unfortunately, I can also say that for most MS MPs as well (Orchestrator, SCCM…)
I agree with Robert.
How can I put this monitor into maintenance mode when the virtual server is set to maintenance? (All nodes where the vserver is)
Would this work where we want to monitor the health of “Other Resources” under a Role? e.g Hosted Windows Services?