
How does CPU monitoring work in the Windows Server 2016 management pack?



 

First – let me warn you.  The way SCOM monitors Processor time is complicated.  If you don’t like it – there is *NOTHING* wrong with nuking this from orbit (disabling via override) and creating your own very simple consecutive-samples (or average) monitor.  That said, while complicated and somewhat difficult to understand, it is very powerful and useful, and limits “noise”.

 

Ok, all warnings aside – let’s figure out how this works.

 

In the Windows Server 2016 OS Management Pack, there is a built-in monitor which evaluates the Processor load.  This monitor (Total CPU Utilization Percentage or Microsoft.Windows.Server.10.0.OperatingSystem.TotalCPUUtilization) targets the “Windows Server 2016 Operating System” class.

It runs every 15 minutes and evaluates after 3 samples.  The samples are not consecutive samples as the product knowledge states – they are AVERAGED samples.

Like previous versions of the CPU monitor, this is often misunderstood.  This monitor does not use a native perfmon module, it runs a PowerShell script.  The script evaluates TWO DIFFERENT perfmon counters:

Processor Information / % Processor Time / _Total  (default threshold 95)

System / Processor Queue Length (default threshold 15)

 

BOTH of the above thresholds must be met before we will create a monitor state change/alert.  This means that even if your server is stuck at 100% CPU utilization, it will not generate an alert most of the time.

The default threshold of “15” is multiplied by the number of logical CPUs on the server.  So on a typical VM with 4 virtual CPUs, this means that the value of System\Processor Queue Length must be greater than (15 * 4) = 60.  Not only that, but the value must be above 60 for the average of any three consecutive samples.  This is incredibly high.
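To make the dual-threshold logic concrete, here is a minimal Python sketch of the evaluation as described above – this is NOT the actual SCOM PowerShell script, just a model of the documented behavior: both averaged thresholds must be exceeded before the monitor changes state.

```python
# Minimal sketch of the documented evaluation logic (NOT the actual SCOM
# PowerShell script): both averaged thresholds must be exceeded to alert.
def monitor_fires(cpu_samples, queue_samples, cpu_count,
                  cpu_threshold=95.0, queue_threshold=15.0):
    """cpu_samples / queue_samples: the last 3 perfmon readings (15 min apart)."""
    cpu_avg = sum(cpu_samples) / len(cpu_samples)
    queue_avg = sum(queue_samples) / len(queue_samples)
    return cpu_avg > cpu_threshold and queue_avg > queue_threshold * cpu_count

# A 4-vCPU VM pegged at 100% CPU but with a modest queue does NOT alert:
print(monitor_fires([100, 100, 100], [20, 30, 25], cpu_count=4))  # False (25 <= 60)

# Only a queue averaging above 60 (15 * 4) alerts alongside high CPU:
print(monitor_fires([97, 99, 98], [70, 80, 65], cpu_count=4))     # True
```

Note that with three samples taken 15 minutes apart, the condition must persist for at least 30 minutes before the monitor can change state.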

This means it is VERY unlikely this monitor will ever trigger unless your system is absolutely HAMMERED.  If you like this, great!  If you don’t, then you have two options.

1)  Write your own monitor and make it a very simple consecutive- or average-samples threshold performance monitor.

2)  Override the default monitor – but set the “CPU Queue Length” threshold to “-1” as in the picture below:

[Screenshot: override setting the CPU Queue Length threshold to -1]

This will result in the equation ignoring the CPU queue length requirement, making the monitor consider “% Processor Time” only.  If you find this too noisy, you can keep the CPU queue length check but use a lower value than the default of 15.  Another thing to keep in mind: this is a PowerShell script based monitor, so if you want to run it VERY frequently (the default is every 15 minutes), consider replacing it with a less impactful native perfmon based monitor.
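Assuming the script compares the queue average against threshold × CPU count as described above (a plausible reading, not verified against the script source), a -1 threshold makes the limit negative, and a queue-length average can never be negative, so the queue condition is always satisfied:

```python
# Hypothetical illustration: with a -1 threshold, the queue limit becomes
# negative, so any non-negative queue average passes and only the
# "% Processor Time" threshold decides the monitor state.
def queue_condition_met(queue_avg, threshold, cpu_count):
    return queue_avg > threshold * cpu_count

assert queue_condition_met(0.0, -1, 4)        # even an idle queue passes (-1 * 4 = -4)
assert not queue_condition_met(50.0, 15, 4)   # default: 50 <= 15 * 4 = 60
assert queue_condition_met(70.0, 15, 4)       # default: 70 > 60
```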

The default monitor has a Diagnostic task on it that will output the top consuming processes to the Health Explorer state change context:

[Screenshot: diagnostic output showing the top CPU-consuming processes]

Note – the numbers are not exactly correct – my “ProcessorHog” process was consuming 100% of the CPU, but this server has 32 cores, so it looks like you need to multiply by the number of cores to understand the ACTUAL utilization consumed by a process.  This is typical of how Windows reports per-process CPU time, not a SCOM issue.

 

Ok – so that covers the basic monitoring of the CPU, from an _Total perspective.

 

What about monitoring individual *logical processors* like virtual CPUs or actual cores on physical servers?  Can we do that?

Yes, yes we can. 

First – let me start by saying – I DON’T recommend you do this.  In fact, I recommend AGAINST this.  This type of monitoring is INCREDIBLY detailed, and creates a huge instance space in SCOM that will only serve to slow down your environment, console, and increase config and monitoring load.  It should only be leveraged where you have a very specific need to monitor individual logical processing cores for very specific reasons, which should be rare.

There is a VERY specific scenario where this type of monitoring might be useful: when an individual single-threaded process “runs away” on CPU 0, core 0.  This has been seen on Skype servers and will impact server performance.  So if you MUST monitor for this condition, you can consider discovering these individual CPUs.  I still don’t recommend it, and certainly not across the board.

 

Ok, all warnings aside – let’s figure out how this works.

There is an optional discovery (disabled by default) in the Windows Server 2016 Operating System (Discovery) management pack to discover individual CPUs:  “Discover Windows CPUs” (Microsoft.Windows.Server.10.0.CPU.Discovery).  This discovery runs once a day and calls the Microsoft.Windows.Server.10.0.CPUDiscovery.ModuleType data source.  This data source runs a PowerShell script that discovers two object types:

1.  Microsoft.Windows.Server.10.0.Processor (Windows Server 2016 Processor)

2.  Microsoft.Windows.Server.10.0.LogicalProcessor (Windows Server 2016 Logical Processor)

If you enable this discovery – you will discover both types:

 

Let’s start with “Windows Server 2016 Processor”.  This class contains the actual physical or virtual processors in sockets, as they are exposed to the OS by the physical hardware or the virtualization layer.  See the examples below:

Physical server:

[Screenshot: discovered Processor instances on a physical server]

VM guest:

[Screenshot: discovered Processor instances on a VM guest]

 

By contrast – the “Windows Server 2016 Logical Processor” class shows instances of “Logical Processors”: virtual processors on a VM, and the logical CPUs exposed by the hardware on a physical server – either actual cores or hyper-threaded cores:

[Screenshot: discovered Logical Processor instances]

 

The former is how all our previous monitoring worked for individual CPU monitoring, which is pretty much worthless.  If we need to monitor cores, we generally don’t care about “sockets”.

The latter is new in the Windows Server 2016 management pack, which actually discovers individual logical CPUs as seen by the OS.

 

Now – let’s look at the monitoring provided out of the box.

If you enable the individual CPU discovery, there are three monitors targeting the “Windows Server 2016 Processor” class, one of which is enabled out of the box.  This is “CPU Percentage Utilization”.  It runs every three minutes, evaluates 5 samples, and has a threshold of “10”.  It is also a PowerShell script based monitor.

Comments on above:

1.  Monitoring for individual “socket” utilization seems really silly to me, and not useful at all.  You probably should not use this.

2.  The default threshold of “10” is WAY too low.  I have no idea why we would use that.

3.  The counter uses the older “Processor” perfmon object instead of the newer “Processor Information”.  The reason this isn’t a simple change is that the “Performance Monitor Instance Name” class property doesn’t match the newer counter’s instance naming (“Processor Information” instances are named by NUMA node and processor, such as “0,3”, rather than a plain index).

Additionally, there are three rules to collect perfmon data – one of which is enabled.  You should disable this collection rule as well, IF you just HAVE to discover individual CPUs.

 

Ok, now let’s move on to the Windows Server 2016 Logical Processor.

This is more useful, as it will monitor individual CORES (or virtual CPUs) to look for runaway single-threaded processes.

There are three monitors out of the box targeting this class and NONE of these are enabled by default.

The one for CPU utilization, Microsoft.Windows.Server.10.0.LogicalProcessor.CPUUtilization, is a native perfmon monitor using consecutive samples.  I like this WAY better than complicated and heavy-handed script based monitors.

HOWEVER – this will potentially be VERY noisy, as a server will have multiple CPUs, and these will all alarm any time the _Total condition is met.  This means duplication of alerts when a server is heavily utilized.  That said – if only a SINGLE logical processor is spiked but the overall CPU utilization is low, this will let you know that is happening.
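Both behaviors above can be illustrated with a small sketch of a consecutive-samples-over-threshold check applied per logical processor (the real monitor is a native perfmon module; this only models the alerting pattern):

```python
# Sketch of a consecutive-samples-over-threshold check, applied per core.
def consecutive_over(samples, threshold, required):
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= required:
            return True
    return False

# Heavily utilized server: all 4 per-core monitors trip -> duplicate alerts.
busy = {0: [99, 98, 99], 1: [97, 96, 98], 2: [99, 99, 97], 3: [98, 99, 99]}
print(sum(consecutive_over(s, 95, 3) for s in busy.values()))  # 4

# Single runaway thread pegging core 0 while _Total stays low -> still caught.
spike = {0: [100, 100, 100], 1: [5, 3, 4], 2: [2, 6, 1], 3: [4, 2, 3]}
print(sum(consecutive_over(s, 95, 3) for s in spike.values()))  # 1
```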

 

 

Bottom line:

1.  CPU monitoring at the OS level is somewhat complex, script based, and evaluates multiple perf counters before it triggers.  Be aware, and be proactive in managing this.

2.  Change your System Processor Queue Length threshold if needed to make this monitor actually trigger on high Processor Time.

3.  The individual CPUs can be discovered, but I DON’T recommend it.

4.  The default rules and monitors enabled for individual CPU monitoring focus on SOCKETS, aren’t very useful, and should be disabled.

5.  The new Logical Processor class in the Server 2016 MP is more useful as it monitors cores/logical CPUs, but all its monitoring is disabled by default.

16 Comments

  1. Joe McGowan

    Great post. Quick question – I would like to disable the “CPU Queue Length” as you recommend. Would I have to do that for each “Total CPU Utilization” monitor for every operating system? We have Server 2008, 2012, 2012 R2, and 2016. Thanks!

  2. edwio

    Thanks Kevin,
    We have a lot of Power Shell Script failed to run, due to SCOMpercentageCPUTimeCounter.ps1 script.
    What we can do? this issue is still existing on SCOM UR1 with latest Agnostic Operating System MP

  3. Alexander

    What would you describe as “very frequent testing” in these scenarios? As you say, it’s extremely rare to see these monitors trigger an alert. I don’t think I’ve ever come across it.

    We have around 350 VMs and we’d like to get more accurate CPU Usage statistics in order to catch bottlenecks and match the monitors with user experience etc. Would you say this is feasible using SCOM, or would it load the systems too much if we increase the schedule to run, say every 5 minutes or less?

    Also, is it correct that it would take 45 minutes of >95% load (+queue) before these would trigger? Or is it 3 samples within 15 minutes?

    Thank you!

    • Kevin Holman

      The default is 15 minutes and 3 samples – so that’s 45 minutes of a high CPU condition before you will see a statechange/Alert.

      Setting this to every 5 minutes is fine, as is making your own monitor with just CPU processor time evaluation from perfmon – that is a native counter and you could run that one MUCH more frequently with zero impact.

      • curtiss

        3 samples at 30 minutes is not actually 45 minutes, is it? it’s 30 minutes. because the “first” sample is at zero, the second sample is at 15, and the third is at 30.

        i have no problem setting the queue length to -1. but i find that in instances when a runaway service pegs an agents server’s CPU at 100% in a matter of seconds, scom is kind of useless, because the box’s cpu is too busy to tell scom how busy the cpu is. the performance graph is also blank. so it’s not until somebody reports a problem with some application being unresponsive, and we realize we can’t RDP to the server and powershell is super slow, and we have to bounce it or restart an IIS app pool to get it to come back.

        • Kevin Holman

          You need three samples over the threshold – so you are correct – technically it is a minimum of 30 minutes and 1 second in theory.

          I agree on the alert not firing if CPU is completely hammered. Mostly this happens because this workflow requires a script to be able to complete. If you switched to a simple perf monitor, the chances of it working and alerting are usually good, unless the CPU is so hammered that the agent can no longer function at all. That’s pretty rare.

          • Curtiss

            “unless the CPU is so hammered that the agent can no longer function at all. That’s pretty rare.”

            yep., because–i hate to say ‘unfortunately’ about this, but — unfortunately the agent usually doesn’t stop heartbeating. even that would be some indication of a problem. it’s in the dead zone of being too busy to say “hey here’s my cpu” but not too busy to say “don’t worry, i’m still up”.

            the corresponding performance collection rule looks like a straight perf counter, is that accurate? because i’m not getting any graph data during these episodes either. so if that rule doesn’t require a script workflow, and it’s already not working…how confident should i be about not using a script workflow for my monitor?

            also is there a way to subscribe to these comments and replies to your blog?

  4. Zid

    Hi Kevin, Is it possible to create a dashboard view displaying the CPU utilization for all the servers and not just the “Top 20” utilized servers which is the max limit settings to display in a grid. In my case I have more than 60+ odd servers and I need all of them to be displayed in a single view. Please let me know how this is done if possible. Thanks in advance

    • Kevin Holman

      Not in an out of the box dashboard. You could write a custom SSRS report to do this, or develop a dashboard using PowerBI that leverages the same query. You could also develop a PowerShell dashboard…. that pulls and displays perf data. I have done this in the past for customers, but never found it to be very fast.

  5. Mohan Ram K

    “Total CPU Utilization Percentage monitor” How we can include the alert description top 5 processes consuming CPU in alert description
