
Writing monitors to target Logical or Physical Disks

This is something a LOT of people make mistakes on, so I wanted to write a post on the correct way to do it, using a very common target as an example.

When we write a monitor for something like “Processor\% Processor Time\_Total” and target “Windows Server Operating System”…. everything is very simple.  “Windows Server Operating System” is a single instance target…. meaning there is only ONE “Operating System” instance per agent.  “Processor\% Processor Time\_Total” is also a single instance counter…. using ONLY the “_Total” instance for our measurement.  Therefore – your performance unit monitors for this example work just like you’d think.

However, Logical Disk is very different.  A given agent will often have MULTIPLE instances of “Logical Disk”, such as C:, D:, E:, F:, etc.  We must write our monitors to take this into account.

For this reason – we cannot monitor a Logical Disk perf counter, and use “Windows Server Operating System” as the target.  The only way this would work, is if we SPECIFICALLY chose the instance in perfmon.  I will explain:

Bad example #1:

I want to monitor for the perf counter Logical Disk\% Free Space\<All Instances> so that I can get an alert when any logical disk is below 50% in free space.

I create a new monitor > unit monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Simple Threshold.

image

I target a generic class, such as “Windows Server Operating System”.

I choose the perf counter I want – and select all instances:

image

And save my monitor.

The problem with this workflow is that we targeted a multi-instance perf counter at a single-instance target.  This workflow will load on all Windows Server Operating Systems, and parse through all discovered instances of the counter.  If an agent only has ONE instance of “Logical Disk” (C:), then this monitor will work perfectly: if the C: drive does not have enough free space, it turns red and alerts, no issues.  HOWEVER, if an agent has MULTIPLE instances of logical disks (C:, D:, E:) AND those disks have different threshold results, the monitor will “flip-flop” as it examines each instance of the counter.  For example, if C: is running out of space but D: is not, the workflow will examine C:, turn red, generate an alert, then immediately examine D:, turn back to green, and close the alert.
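To make the flip-flop concrete, here is a minimal Python sketch of the behavior. It is NOT SCOM code: the function name, the sample values, and the order in which instances are examined are all assumptions for illustration. It models ONE monitor with ONE health state being fed every instance of the counter in turn:

```python
# Hypothetical model of a single-threshold monitor; not how SCOM is implemented.
THRESHOLD = 50.0  # percent free space; below this the monitor goes unhealthy


def state_changes(samples, threshold=THRESHOLD):
    """Feed (instance, free_pct) samples to one monitor; return each health flip."""
    state = "healthy"
    changes = []
    for instance, free_pct in samples:
        new_state = "unhealthy" if free_pct < threshold else "healthy"
        if new_state != state:
            # Every flip is a state change, and every flip to unhealthy is an alert.
            changes.append((instance, new_state))
            state = new_state
    return changes


# One polling cycle on an agent with several counter instances (made-up values).
cycle = [("C:", 12.0), ("_Total", 61.0), ("D:", 35.0), ("E:", 80.0)]
changes = state_changes(cycle)

for instance, new_state in changes:
    print(f"{instance} -> {new_state}")
```

In this sketch, four samples in one cycle produce four state changes on a single monitor, because each instance overwrites the verdict of the one before it.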

This is SERIOUS.  This will FLOOD your environment with state changes and alerts, every minute, from EVERY Operating System.

A quick review of Health Explorer will show what is happening:

This monitor went “unhealthy” and issued an alert at 10:20:58AM for the C: instance:

image

Then went “healthy” in the same SECOND from the _Total Instance:

image

Then flipped back to unhealthy, at the same time – for the D: instance.

image

 

I think you can see how bad this is.  I find this condition all the time, even in “mature” SCOM implementations… it just happens when someone creates a simple perf threshold monitor but doesn’t understand the class model, or multi-instance perf counters.  In an environment with only 500 monitored agents – I can generate over 100,000 state changes – and 50,000 alerts, in an HOUR!!!!

 

Ok, lesson learned: DON’T target a single-instance class using a multi-instance perf counter.  So, what should I have used?  Well, in this case, I should use something like “Windows Server 2008 Logical Disk”.  But we can still screw that up!

Bad example #2:

I want to monitor for the perf counter Logical Disk\% Free Space\<All Instances> so that I can get an alert when any logical disk is below 20% in free space.

I create a new monitor > Unit monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Simple Threshold.

image

I have learned from my mistake in Bad Example #1, so I target a more specific class, such as “Windows Server 2008 Logical Disk”.

I choose the perf counter I want – and select all instances:

image

And save my monitor.

Ack!  The SAME problem!  Why????

The problem is that now, instead of each Operating System instance loading this monitor and then parsing and measuring each counter instance, EACH INSTANCE of logical disk is doing the SAME THING.  This is actually WORSE than before, because the number of monitors loaded is MUCH higher, and will flood me with even more state changes and alerts.
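A rough Python sketch of why this is worse, under the same caveats: this is an illustrative model, not SCOM code, and the disk list, sample values, and function name are made up. Each discovered disk gets its own copy of the monitor, and every copy still reads ALL counter instances:

```python
# Hypothetical model; the 20% threshold matches the example above.
def flips(samples, threshold=20.0):
    """Count health state changes for one monitor fed every counter instance."""
    state, count = "healthy", 0
    for _instance, free_pct in samples:
        new_state = "unhealthy" if free_pct < threshold else "healthy"
        if new_state != state:
            state, count = new_state, count + 1
    return count


cycle = [("C:", 45.0), ("D:", 8.0), ("E:", 60.0), ("F:", 15.0)]
disks = ["C:", "D:", "E:", "F:"]  # discovered Logical Disk objects on one agent

# Bad example #1: one monitor per Operating System instance.
per_os_target = flips(cycle)

# Bad example #2: one monitor per disk, each still reading all instances.
per_disk_target = sum(flips(cycle) for _disk in disks)

print(per_os_target, per_disk_target)
```

With these made-up numbers the per-disk targeting multiplies the flips by the number of disks on the agent, which is the effect described above.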

Now if I look at Health Explorer – I will likely see MULTIPLE disks have gone red, and are “flip-flopping” and throwing alerts like never before.

image

 

When you dig into Health Explorer, you will see that they are being turned Unhealthy, and it isn’t even their drive letter!  I will examine the F: drive monitor:

I can see it was turned unhealthy because of the free space threshold hit on the D: drive!

image

and then flipped back to healthy due to the available space on the C: instance:

image

This is very, very bad.  So – what are we supposed to do???

 

We need to target the specific class (Windows Server 2008 Logical Disk) AND then use a Wildcard parameter to match the INSTANCE name of the perf counter to the INSTANCE name of the “Logical Disk” object.  Make sense?  For example, match up the “C:” perf counter instance to the “C:” Device ID of the Logical Disk discovered in SCOM.  This is actually easier than it sounds:

 

Good example:

 

I want to monitor for the perf counter Logical Disk\% Free Space\<All Instances> so that I can get an alert when any logical disk is below 20% in free space.

I create a new monitor > Unit monitor > Windows Performance Counters > Static Thresholds > Single Threshold > Simple Threshold.

image

I have learned from my mistake in Bad Example #1, so I target a more specific class, such as “Windows Server 2008 Logical Disk”.

I choose the perf counter I want, and INSTEAD of selecting all instances, I learn from my mistake in Bad Example #2.  This time I will UNCHECK the “All Instances” box, and use the “fly-out” on the right of the “Instance:” box:

image

 

This fly-out will present wildcard options, which are discovered properties of the Windows Server 2008 Logical Disk class.  You can see all of these if you view that class in Discovered Inventory.  What we need to do now is use Discovered Inventory to find a property that matches the perfmon instance name.  In perfmon, we see the instance names are “C:” or “D:”:

image

In Discovered Inventory – looking at the Windows Server 2008 Logical Disk, I can see that “Device ID” is probably a good property to match on:

image

 

So – I choose “Device ID” from the fly-out, which inserts this parameter wildcard, so that the monitor on EACH DISK will ONLY examine the perf data from the INSTANCE in perfmon that matches the disk drive letter.

image

 

The wildcard parameter is actually something like this:

$Target/Property[Type="MicrosoftWindowsLibrary6172210!Microsoft.Windows.LogicalDevice"]/DeviceID$

This simply is a reference to the MP that defined the “Device ID” property on the class.
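Here is a minimal Python sketch of what the wildcard buys us. Again, this only models the idea and is not SCOM code; the function name and sample values are assumptions. Each disk's monitor now ignores every counter instance except the one whose name equals its own Device ID:

```python
# Hypothetical model of per-instance matching; not how SCOM is implemented.
def evaluate(device_id, samples, threshold=20.0):
    """Return the health of one disk, using only its own counter instance."""
    state = "healthy"
    for instance, free_pct in samples:
        if instance != device_id:
            continue  # other disks and _Total never touch this monitor
        state = "unhealthy" if free_pct < threshold else "healthy"
    return state


cycle = [("C:", 45.0), ("D:", 8.0), ("_Total", 30.0), ("F:", 60.0)]

# One monitor per discovered disk, keyed by Device ID.
health = {disk: evaluate(disk, cycle) for disk in ["C:", "D:", "F:"]}
print(health)
```

In this model only the D: monitor goes unhealthy, and the F: monitor is never touched by the D: or C: samples.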

 

Now – no more flip-flopping, no more statechangeevent floods, no more alert storms opening and closing several times per second.

 

 

You can use this same process for any multi-instance perf object.  I have a (slightly less verbose) example using SQL server HERE.

 

To determine if you have already messed up…. you can look at “Top 20 Alerts in an Operational Database, by Alert Count” and “Historical list of state changes by Monitor, by Day:” which are available on my SQL Query List.  These should indicate lots of alerts, and monitor flip-flop, and should be investigated.

3 Comments

  1. Claudio Valentini

    This article, very interesting, seems to perfectly match my needs… BUT!
    Let me explain: I’m trying to monitor Read and Write Latency for a Cluster Shared Volume target.
    If I check the counter on a server, I can see that the different instances are identified by something very close to the Volume Label.
    In fact (as an example), for a disk on the cluster called VOLUME_STORAGE_16A65, I have a volume called VOLUME_STORAGE_16 and a performance counter linked to an instance called simply VOLUME_STORAGE.

    Now, I wouldn’t have any problem if one of the CSV object attributes had this information inside. Unfortunately all the attributes are more complex (containing info about cluster name or extended mountpoint), or the name is reported in the form VOLUME_STORAGE_16.
    And using this parameter results in an error in performance collection: in Event Viewer I can clearly see that my perfcoll rule is not able to match any counter for the VOLUME_STORAGE_16 instance (and that is normal: it is expecting only VOLUME_LABEL!).

    I hope my issue is clear and I really appreciate any help on this configuration.

    • Kevin Holman

      You should simply reference the built-in rules for CSV, and replicate the way they are built:

      Look at: Microsoft.Windows.Server.ClusterSharedVolumeMonitoring.ClusterSharedVolume.Monitoring.CollectPerfDataSource.FreeSpaceMB

      • CLAUDIO VALENTINI

        Thanks Kevin for your kind reply, great hint!
        Now I’m deep diving into the configurations of the CSV rules, which seem to me more complex than expected…
