This is something a LOT of people get wrong – so I wanted to write a post on the correct way to do it, using a very common target as an example.
When we write a monitor for something like “Processor\% Processor Time\_Total” and target “Windows Server Operating System”…. everything is very simple. “Windows Server Operating System” is a single instance target…. meaning there is only ONE “Operating System” instance per agent. “Processor\% Processor Time\_Total” is also a single instance counter…. using ONLY the “_Total” instance for our measurement. Therefore – your performance unit monitors for this example work just like you’d think.
However – Logical Disk is very different. On a given agent – there will often be MULTIPLE instances of “Logical Disk”, such as C:, D:, E:, F:, etc… We must write our monitors to take this into account.
For this reason – we cannot monitor a Logical Disk perf counter, and use “Windows Server Operating System” as the target. The only way this would work, is if we SPECIFICALLY chose the instance in perfmon. I will explain:
Bad example #1:
I want to monitor for the perf counter Logical Disk\% Free Space\<All Instances> so that I can get an alert when any logical disk is below 50% in free space.
I create a new monitor > unit monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Simple Threshold.
I target a generic class, such as “Windows Server Operating System”.
I choose the perf counter I want – and select all instances:
And save my monitor.
The problem with this workflow – is that we targeted a multi-instance perf counter at a single-instance target. This workflow will load on all Windows Server Operating Systems, and parse through all discovered instances of the counter. If an agent only has ONE instance of “Logical Disk” (C:) then this monitor will work perfectly…. if the C: drive does not have enough free space, the monitor turns red and alerts – no issues. HOWEVER… if an agent has MULTIPLE instances of logical disks (C:, D:, E:) AND those disks land on different sides of the threshold… the monitor will “flip-flop” as it examines each instance of the counter. For example, if C: is running out of space, but D: is not… the workflow will examine C:, turn red, generate an alert, then immediately examine D:, and turn back to green, closing the alert.
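The flip-flop is easy to reproduce in a toy model. The sketch below is plain Python, not SCOM code: the function name `evaluate_all_instances`, the sample values, and the order the instances arrive in are all made up for illustration. It simply feeds every instance's sample into ONE monitor state, which is roughly what the badly targeted monitor ends up doing.

```python
# Toy model only: names and sample values are hypothetical, not SCOM APIs.
THRESHOLD = 50.0  # unhealthy when % Free Space is below this value


def evaluate_all_instances(samples, threshold=THRESHOLD):
    """Feed every instance's sample into ONE monitor and log each state change."""
    state = "healthy"
    changes = []
    for instance, free_pct in samples:
        new_state = "unhealthy" if free_pct < threshold else "healthy"
        if new_state != state:
            changes.append((instance, new_state))
            state = new_state
    return changes


# One sampling pass on an agent where C: is nearly full and D: is mostly empty.
one_pass = [("C:", 12.0), ("D:", 80.0)]
changes = evaluate_all_instances(one_pass)
print(changes)  # [('C:', 'unhealthy'), ('D:', 'healthy')]
```

In this model a single pass produces two state changes and one alert that opens and closes right away, and the next pass does the same thing all over again.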
This is SERIOUS. This will FLOOD your environment with state changes and alerts, every minute, from EVERY Operating System.
A quick review of Health Explorer will show what is happening:
This monitor went “unhealthy” and issued an alert at 10:20:58AM for the C: instance:
Then went “healthy” in the same SECOND from the _Total Instance:
Then flipped back to unhealthy, at the same time – for the D: instance.
I think you can see how bad this is. I find this condition all the time, even in “mature” SCOM implementations… it just happens when someone creates a simple perf threshold monitor but doesn’t understand the class model, or multi-instance perf counters. In an environment with only 500 monitored agents – I can generate over 100,000 state changes – and 50,000 alerts, in an HOUR!!!!
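A quick sanity check on those numbers, as a rough Python calculation. The only inputs are the figures quoted above; the one-sample-per-minute interval is an assumption on my part, based on the "every minute" flood described earlier.

```python
# Figures quoted above (treat them as lower bounds: "over 100,000").
agents = 500
state_changes_per_hour = 100_000
alerts_per_hour = 50_000

# Assumption: the monitor samples the counter once per minute.
samples_per_hour = 60

changes_per_agent = state_changes_per_hour // agents        # per agent, per hour
alerts_per_agent = alerts_per_hour // agents                # per agent, per hour
changes_per_sample = changes_per_agent / samples_per_hour   # per agent, per sample

print(changes_per_agent, alerts_per_agent, round(changes_per_sample, 1))  # 200 100 3.3
```

One alert for every two state changes is what you would expect if each flip to red opens an alert and the following flip to green closes it. A few flips per sample on each agent is also plausible for a monitor bouncing across instances such as C:, _Total and D:.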
Ok – lesson learned – DON’T target a single-instance class, using a multi-instance perf counter. So – what should I have used? Well, in this case – I should use something like “Windows Server 2008 Logical Disk”. But we can still screw that up!
Bad example #2:
I want to monitor for the perf counter Logical Disk\% Free Space\<All Instances> so that I can get an alert when any logical disk is below 20% in free space.
I create a new monitor > Unit monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Simple Threshold.
I have learned from my mistake in Bad Example #1, so I target a more specific class, such as “Windows Server 2008 Logical Disk”.
I choose the perf counter I want – and select all instances:
And save my monitor.
Ack! The SAME problem! Why????
The problem is – instead of each Operating System instance loading this monitor and then parsing and measuring each counter instance, now EACH INSTANCE of logical disk is doing the SAME THING. This is actually WORSE than before…. because the number of monitors loaded is MUCH higher, and will flood me with even more state changes and alerts than before.
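To make the "worse than before" part concrete, here is a small toy model in Python. It is hypothetical illustration only, not SCOM code, and the disk list and sample values are invented. Every disk object runs its own copy of the monitor, and every copy still reads ALL instances of the counter.

```python
# Toy model only: names and sample values are hypothetical, not SCOM APIs.
THRESHOLD = 20.0  # unhealthy when % Free Space is below this value


def evaluate_all_instances(samples, threshold=THRESHOLD):
    """Feed every instance's sample into ONE monitor and log each state change."""
    state = "healthy"
    changes = []
    for instance, free_pct in samples:
        new_state = "unhealthy" if free_pct < threshold else "healthy"
        if new_state != state:
            changes.append((instance, new_state))
            state = new_state
    return changes


disks = ["C:", "D:", "F:"]
one_pass = [("D:", 10.0), ("C:", 60.0), ("F:", 75.0)]

# Each discovered disk loads its own copy of the monitor,
# and each copy sees the samples from every instance.
changes_by_disk = {disk: evaluate_all_instances(one_pass) for disk in disks}
total_changes = sum(len(changes) for changes in changes_by_disk.values())

print(changes_by_disk["F:"])  # [('D:', 'unhealthy'), ('C:', 'healthy')]
print(total_changes)          # 6
```

In this model the F: monitor goes red because of D: and green because of C:, and the same two flips repeat on every disk object, so the noise scales with the number of disks rather than the number of servers.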
Now if I look at Health Explorer – I will likely see MULTIPLE disks have gone red, and are “flip-flopping” and throwing alerts like never before.
When you dig into Health Explorer – you will see that they are being turned Unhealthy – and it isn’t even by their own drive letter! I will examine the F: drive monitor:
I can see it was turned unhealthy because of the free space threshold hit on the D: drive!
and then flipped back to healthy due to the available space on the C: instance:
This is very, very bad. So – what are we supposed to do???
We need to target the specific class (Windows Server 2008 Logical Disk) AND then use a Wildcard parameter, to match the INSTANCE name of the perf counter to the INSTANCE name of the “Logical Disk” object. Make sense? Such as – match up the “C:” perf counter instance – to the “C:” Device ID of the Logical Disk discovered in SCOM. This is actually easier than it sounds:
Good example:
I want to monitor for the perf counter Logical Disk\% Free Space\<All Instances> so that I can get an alert when any logical disk is below 20% in free space.
I create a new monitor > Unit monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Simple Threshold.
I have learned from my mistake in Bad Example #1, so I target a more specific class, such as “Windows Server 2008 Logical Disk”.
I choose the perf counter I want – but INSTEAD of selecting all instances, I learn from my mistake in Bad Example #2. This time I will UNCHECK the “All Instances” box, and use the “fly-out” on the right of the “Instance:” box:
This fly-out will present wildcard options, which are discovered properties of the Windows Server 2008 Logical Disk class. You can see all of these if you view that class in Discovered Inventory. What we need to do now – is use Discovered Inventory to find a property that matches the perfmon instance name. In perfmon – we see the instance names are “C:” or “D:”.
In Discovered Inventory – looking at the Windows Server 2008 Logical Disk, I can see that “Device ID” is probably a good property to match on:
So – I choose “Device ID” from the fly-out, which inserts this parameter wildcard, so that the monitor on EACH DISK will ONLY examine the perf data from the INSTANCE in perfmon that matches the disk drive letter.
The wildcard parameter is actually something like this:
$Target/Property[Type="MicrosoftWindowsLibrary6172210!Microsoft.Windows.LogicalDevice"]/DeviceID$
This is simply a reference to the “Device ID” property on the target, qualified by the class (and the MP) that defined that property.
Now – no more flip-flopping, no more statechangeevent floods, no more alert storms opening and closing several times per second.
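Here is a toy model of the fix in Python. Again this is hypothetical illustration with invented names and sample values, not how SCOM implements it internally. Each disk's monitor only looks at samples whose instance name equals its own Device ID.

```python
# Toy model only: names and sample values are hypothetical, not SCOM APIs.
THRESHOLD = 20.0  # unhealthy when % Free Space is below this value


def evaluate_matching_instance(device_id, samples, threshold=THRESHOLD):
    """Only evaluate samples whose perf instance name matches this disk's Device ID."""
    state = "healthy"
    changes = []
    for instance, free_pct in samples:
        if instance != device_id:
            continue  # ignore perf data that belongs to another disk
        new_state = "unhealthy" if free_pct < threshold else "healthy"
        if new_state != state:
            changes.append((instance, new_state))
            state = new_state
    return changes


disks = ["C:", "D:", "F:"]
one_pass = [("D:", 10.0), ("C:", 60.0), ("F:", 75.0)]

changes_by_disk = {disk: evaluate_matching_instance(disk, one_pass) for disk in disks}
print(changes_by_disk)  # {'C:': [], 'D:': [('D:', 'unhealthy')], 'F:': []}
```

Only the D: monitor changes state, and it stays red until D: actually gets its free space back.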
You can use this same process for any multi-instance perf object. I have a (slightly less verbose) example using SQL server HERE.
To determine if you have already messed up…. you can look at “Top 20 Alerts in an Operational Database, by Alert Count” and “Historical list of state changes by Monitor, by Day:” which are available on my SQL Query List. These should indicate lots of alerts, and monitor flip-flop, and should be investigated.
This article is very interesting and seems to perfectly match my needs… BUT!
Let me explain: I’m trying to monitor Read and Write Latency for a Cluster Shared Volume target.
If I check the counter on a server, I can see that the different instances are identified by something very close to the Volume Label.
In fact (as an example) for a disk on the cluster called VOLUME_STORAGE_16A65, I have a volume called VOLUME_STORAGE_16 and a performance counter linked to an instance called simply VOLUME_STORAGE.
Now, I wouldn’t have any problem if one of the CSV object attributes had this information inside. Unfortunately all the attributes are more complex (containing info about cluster name or extended mountpoint), or the name is reported in the form VOLUME_STORAGE_16.
And using this parameter results in an error in performance collection: in Event Viewer I can clearly see that my perfcoll rule is not able to match any counter for the VOLUME_STORAGE_16 instance (and that is normal: it is expecting only VOLUME_LABEL!).
I hope my issue is clear and I really appreciate any help on this configuration.
You should simply reference the built-in rules for CSV – and replicate the way they are built:
Look at: Microsoft.Windows.Server.ClusterSharedVolumeMonitoring.ClusterSharedVolume.Monitoring.CollectPerfDataSource.FreeSpaceMB
Thanks Kevin for your kind reply, great hint!
Now I’m deep diving into the configurations of the CSV rules, which seem to me more complex than expected…
Hi Kevin, great post and thank you, a question around this though is how do I define the wildcard parameter to only monitor a specific drive like D or E or G for example? I tried to use the “select performance counter” option to target a specific instance but when I monitor the monitor it still alerts for all the instances and not only the instance I targeted in the monitor. I am running SCOM 1801.
Why would you write a monitor, and ONLY have it apply to a specific instance?
Just use the example in this post, then create overrides to disable it for groups of disks with a specific drive letter, if that’s what you want. But it seems odd to me. Can you explain the customer ask/scenario?
I need to monitor drive space for non-system drives but the drives have different sizes, so to set the thresholds for different drives I decided to create monitors for the specified drives to give me that customizability on the group of servers I need those drives monitored on. Not sure if I can accomplish this another way? If I use the example, I will only be able to set the threshold for all the non-system drives to a set value and not a different threshold for each instance if I am not mistaken.
Hello Kevin,
I have seen that on lots of servers there is no drive letter; instead the disk has a long string like \\?\volume{xxxxx-xx-xx-xxxxxx}
What is the reason for this behavior and how do I rectify it?
Thanks in advance.
This is normal. There are many partitions that exist on servers. These discover in SCOM when you enable Mount Point discovery. If you don’t want to monitor these, it’s pretty easy to put them in a dynamic group and disable monitoring. Or, if you don’t need mount points monitored, then disable that.
https://docs.microsoft.com/en-us/windows-hardware/manufacture/desktop/configure-uefigpt-based-hard-drive-partitions
Thanks Kevin for the clarification.
I have created disk unit monitors using this example. My scenario is that we want an alert for a particular server (all drives) when disk usage reaches 50%, 60%, 70%, 80%, and 90%. Hence, the thresholds for Logical Disk\% Free Space are set at 50, 40, 30, 20, and 10 respectively.
I have created a monitor as you explained above and also created a group for all the drives on the server and enabled monitor for this group only.
Now I am getting alerts for all the drives from all the monitors – 50%, 60%, 70%, 80%, 90% – even when the disk free space is 42% or 93%.
I know that if disk free space is 10% then we should get alerts from all the monitors, as the threshold will be breached for all of them.
Is there any better way to meet this kind of objective, and why is SCOM alerting for all the thresholds even when the disk free space is 42%?
Hi Kevin, if I want to target logical/physical disk space on just 1 server, do I follow the same process? Thanks
That would be very strange. Why would you write monitoring for disk space for only 1 server?