
Self Tuning Thresholds – love and hate

Mostly hate.  :-)

Self tuning thresholds were a new concept for OpsMgr 2007.  They “attempt” to “learn” what is normal for a performance counter, and alert when the value is outside of the learned baseline.  This is great when we have performance counters that will vary widely from company to company, and we don’t know a good static setting.  The downside?  They are noisy, they don’t work very well, they don’t alert at ALL until the baseline is learned, and they stop alerting after an agent restart, again until the baseline is re-learned.  For this reason – they are NOT good choices for any kind of critical monitoring in SCOM.  There is no development continuing to enhance them, and Microsoft MPs have stopped leveraging them because of all the downsides.

One of the core challenges:  if the counter being monitored varies widely on a regular basis… these monitors are extremely noisy… and generate the massive amount of alerts and state changes that they were designed to control. 

There are a couple good blog posts on these already.  Probably the best one I have read is here:  http://ops-mgr.spaces.live.com/blog/cns!3D3B8489FCAA9B51!183.entry   We will be referring to this several times.

One of the complaints about self-tuning thresholds…. is that the numbers reflected in the baseline don’t tell us anything about the actual values.  This is true…. these are based on an internal algorithm… so people see this “2.81” or “3.31” and don’t understand what it has to do with anything about our performance counter.

First – let’s take a look at the basic components of an STT:  We will create a new unit monitor.  Under Windows Performance, Self-Tuning Thresholds…. we have several types to choose from.  The most common are going to be a 2-state or 3-state baselining…. depending on how many states we want.  For this example – we will choose a 2-state baselining.

Let’s give it a name, choose Windows Server as the target, and choose the Performance parent monitor.  To keep this simple – let’s choose Processor\% Processor Time\_Total as our performance Object\Counter\Instance.  Set the interval to 1 minute.

image

Now – we get to adjust the business cycle.  I’m picking one day for this example.  Typically – you would choose a week…. especially if your server behaves differently on different days of the week.

We can choose how many business cycles to wait before alerting…. most of the time 1 business cycle is fine.

On “Sensitivity” we have a nice slider from “Low” to “High”.  In general…. we will be choosing a low sensitivity for our custom rules.  Low = fewest alerts, widest baseline range.  I will explain the numeric values for each setting later.

image

On the “Configure Health” screen…. within the envelope will be Healthy, and above the envelope will generate a state change and (optionally) an alert.

Groovy.  So – what did we just really create?  Well certainly – we created a monitor: 

image

But if we look at rules, on the same target…. we also created some rules:

image

One of the rules is to simply collect the performance data.  The other collects signature data.  Both on the same frequency we specified earlier.

So now…. on to the most important thing – the numbers.  When we created our 2-state baselining monitor – we pretty much accepted all defaults…. except we picked low sensitivity.  To see these numbers – create an override for all objects of type, and you can see what the defaults for the Low setting are:

image

So “inner” is 4.01 while “outer” is 4.51    We will look at these numbers more later.  This is important – because we will use these to adjust and override other counters later.

Also – on the signature collection rule that was created – a sensitivity value was placed:

image

So…. let’s try and find out how each setting affects these numbers – to better understand them.

I created 5 Self-Tuning 2-state baseline monitors…. each with a different sensitivity setting…. starting with low:

Low:  Inner: 4.01  Outer: 4.51   Rule Sensitivity:  4.01

Low-Mid:  Inner: 3.77  Outer: 4.27   Rule Sensitivity:  3.77

Mid:  Inner: 3.29   Outer: 3.79   Rule Sensitivity:  3.29

Mid-High: Inner: 2.81  Outer: 3.31  Rule Sensitivity:  2.81

High:  Inner: 2.57  Outer: 3.07  Rule Sensitivity:  2.57

That will give us a good baseline to use – when tuning these rules.  We can see that default inner sensitivity ranges from 2.57 to 4.01, and outer ranges from 3.07 to 4.51.   The larger the numbers…. the less sensitive the baseline range, and therefore fewer alerts.  The difference between the numbers is always .5 
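To keep these numbers handy while tuning, here is a minimal Python sketch that just records the five slider positions listed above and checks the .5 gap.  The names SENSITIVITY_PRESETS and outer_minus_inner are made up for this illustration; nothing here talks to SCOM.

```python
# Default values observed above for each slider position on a
# 2-state baselining monitor: (inner sensitivity, outer sensitivity).
# The signature collection rule's sensitivity matched the inner value.
# SENSITIVITY_PRESETS is a hypothetical name used only for this sketch.
SENSITIVITY_PRESETS = {
    "Low": (4.01, 4.51),
    "Low-Mid": (3.77, 4.27),
    "Mid": (3.29, 3.79),
    "Mid-High": (2.81, 3.31),
    "High": (2.57, 3.07),
}


def outer_minus_inner(preset):
    """Return the gap between outer and inner for a slider position."""
    inner, outer = SENSITIVITY_PRESETS[preset]
    # Round to two decimals to hide binary floating point noise.
    return round(outer - inner, 2)


for name, (inner, outer) in SENSITIVITY_PRESETS.items():
    gap = outer_minus_inner(name)
    print(f"{name:9} inner={inner:.2f} outer={outer:.2f} gap={gap:.2f}")
```

Larger numbers mean a less sensitive monitor, so "Low" sits at the top of the value range and "High" at the bottom.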

To tune these self tuning alerts….. we simply need to adjust these values, for the Performance signature rule, and the corresponding baselining monitor.

Here is a list of some very common noisy STT’s – taken from the link above:

  • ALERT=Information Store Transport Temp Table is outside the calculated baseline
  • RULE=Baseline Collection Rule for Information Store temp table number of entries (Rules, of type Exchange Queue)
  • MONITOR=IS Transport Temp Table Monitor (Exchange Queue, Entity Health, Performance)
  • ALERT= Mailbox Store Send Queue is outside the calculated baseline
  • RULE=Baseline Collection Rule for Mailbox Store Send Queue Length (Rules, of type Exchange Queue)
  • MONITOR=MB Store Send Queue Monitor (Exchange Queue, Entity Health, Performance)
  • ALERT=SMTP Local queue is outside calculated baseline
  • RULE=Baseline Collection Rule for SMTP Server Local Queue (Rules, of type Exchange Queue)
  • MONITOR=SMTP Local Queue Monitor (Exchange Queue, Entity Health, Performance)
  • ALERT=SMTP Messages in the Queue Directory is outside calculated baseline
  • RULE=Baseline Collection for SMTP Message Queue Directory (Rules, of type Exchange Queue)
  • MONITOR=SMTP Message Queue Directory Monitor (Exchange Queue, Entity Health, Performance)
  • ALERT=SMTP Remote Queue is outside the calculated baseline
  • RULE=Baseline Collection Rule for SMTP Server Remote Queue Length (Rules, of type Exchange Queue)
  • MONITOR= SMTP Remote Queue Monitor (Exchange Queue, Entity Health, Performance)
  • ALERT=SMTP Remote Retry Queue is outside the calculated baseline
  • RULE=Baseline Collection Rule for SMTP Server Remote Retry Queue Length (Rules, of type Exchange Queue)
  • MONITOR=SMTP Remote Retry Queue Monitor (Exchange Queue, Entity Health, Performance)
  • ALERT=IS Virtual Bytes is outside the calculated baseline
  • RULE=Baseline Collection Rule for IS Virtual Bytes (Rules, of type Exchange IS Service)
  • MONITOR=IS Virtual Bytes Monitor (Exchange IS Service, Entity Health, Performance)
  • ALERT= Number of RPC requests is outside the calculated baseline
  • RULE=Baseline Collection Rule for IS RPC Requests (Rule, of type Exchange IS Service)
  • MONITOR=IS RPC Requests Monitor (Exchange IS Service, Entity Health, Performance)

What we see – is that most of the default STT’s in the management packs are set to “Mid-High” sensitivity…. or an inner of 2.81 and outer of 3.31.  This is likely too sensitive, and needs to be adjusted.  Essentially… start by bumping up to the next set of numbers for both values, adjusting them from Mid-High to Mid, Low-Mid, or Low.

Here are the steps from the above blog post… with a few changes:

Steps to resolve: (perform all of these steps for each Alert in your environment which needs to be tuned)

  1. Find the rule that applies to the alert. (To find the rules, it’s easiest to change the scope to filter by the two areas that we need – which are the Exchange Queue and Exchange IS Service. Both of these are available when you click on scope and choose the option to view all targets. Then find rules with “Baseline Collection” as the start. This scopes it down to about 17 rules versus over 6000.) Details on the names of each of these rules are listed above. Disable the rule (Right-click on the rule, Overrides, Disable the rule for all objects of type: Exchange Queue, click Yes to accept).
  2. Change the rule sensitivity to 3.29 (Right-click on the rule, Overrides, Override the rule, For all Objects of type: Exchange Queue, check the Sensitivity parameter and set it to 3.29 (or higher if needed), click OK).
  3. Find the monitor that applies to the alert. This can be found by searching or scoping to the type of object identified for the rule. Disable the monitor (Right-click on the monitor, Overrides, Disable the monitor for all objects of type: Exchange Queue, click yes to accept).
  4. Change the monitor inner sensitivity to 3.29 (Right-click on the monitor, Overrides, Override the monitor, For all Objects of type: Exchange Queue, check the Inner Sensitivity parameter and set it to 3.29 if it’s not already set to that value, click OK).
  5. Change the monitor outer sensitivity to 3.79 (Right-click on the monitor, Overrides, Override the monitor, For all Objects of type: Exchange Queue, check the Outer Sensitivity parameter and set it to 3.79 if it’s not already set to that value, click OK).
  6. Re-enable the monitor. (Right-click on the monitor, click on Overrides Summary, delete the override that says Type, Exchange Queue, Enabled, False).
  7. Go back to the rule identified in step #1 and re-enable the rule. (Right-click on the rule, click on Overrides Summary, delete the override that says Type, Exchange Queue, Enabled, False).
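If you have a lot of these to tune, it can help to write the steps above out as a checklist per alert.  Here is a rough Python sketch that does only that: it builds the seven steps as text, in order, for a given rule and monitor.  The function name build_tuning_plan and its parameters are hypothetical, and it does not call any SCOM API; you would still make the overrides in the console.

```python
def build_tuning_plan(rule, monitor, target_type, inner=3.29, outer=3.79):
    """Return the seven tuning steps above as an ordered list of strings.

    Hypothetical helper for illustration only; it changes nothing in SCOM.
    The rule sensitivity is set to the same value as the monitor's inner
    sensitivity, as in steps 2 and 4.
    """
    return [
        f"Disable rule '{rule}' for all objects of type: {target_type}",
        f"Override rule '{rule}': Sensitivity = {inner}",
        f"Disable monitor '{monitor}' for all objects of type: {target_type}",
        f"Override monitor '{monitor}': Inner Sensitivity = {inner}",
        f"Override monitor '{monitor}': Outer Sensitivity = {outer}",
        f"Re-enable monitor '{monitor}' (delete the Enabled = False override)",
        f"Re-enable rule '{rule}' (delete the Enabled = False override)",
    ]


plan = build_tuning_plan(
    rule="Baseline Collection Rule for SMTP Server Local Queue",
    monitor="SMTP Local Queue Monitor",
    target_type="Exchange Queue",
)
for number, step in enumerate(plan, start=1):
    print(f"{number}. {step}")
```

The order matters: the rule and monitor are disabled before the values change, and the monitor is re-enabled before the rule.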

NOTE:  The “outer” sensitivity does not matter.  It is an early design leftover, and does not have an impact.  Only the inner sensitivity makes a difference in tuning.  There has been some conflicting information in the newsgroups, but this information has been verified with the dev team.

The only requirement on the outer is that it be a larger number than the inner.  So when adjusting – focus on bumping the inner in .5 increments, and just make sure the outer is any number higher than the inner…. such as .1 higher than inner.
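As a worked example of that arithmetic, here is a small Python sketch.  The function name bump_sensitivity and its defaults (a .5 step and a .1 margin for the outer) are just this post's rule of thumb written down, not anything built into SCOM.

```python
def bump_sensitivity(inner, step=0.5, outer_margin=0.1):
    """Return (new_inner, new_outer) for the next tuning attempt.

    Hypothetical helper: raises the inner value by `step` (a larger value
    means a less sensitive monitor) and places the outer value
    `outer_margin` above it, since the outer only has to be larger
    than the inner.
    """
    if step <= 0 or outer_margin <= 0:
        raise ValueError("step and outer_margin must be positive")
    new_inner = round(inner + step, 2)
    new_outer = round(new_inner + outer_margin, 2)
    return new_inner, new_outer


# Starting from the Mid-High default inner value of 2.81:
print(bump_sensitivity(2.81))
```

Remember to put the same new inner value on the signature collection rule's Sensitivity override, so the rule and the monitor stay in step.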

In Summary:

1.  Not all counters are good candidates for STT’s based on the performance counter pattern.

2.  Some of our built in STT’s are a bit on the sensitive side and should be tuned.  If the alert noise is high – start by tuning – lower the sensitivity.

3.  Some of our built in STT’s are targeting a perf counter that is not a good candidate for an STT. (e.g. SMTP queue, or any perf counter that is often “zero value” when healthy). 

4.  There is no simple way to view the learned baseline of an STT…. the “show baseline” in graph view does not display a range.

5. Any time a customer is not happy with the results of an STT monitor – they should simply create a static threshold monitor.  This is very basic and provides the best solution.  If you can’t tune noise out of an STT, or you NEED to know at what threshold an alert will be generated…. then simply turn off the STT, and create an identical static threshold monitor, of the average, or consecutive samples above, type.
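To show why a static threshold is easier to reason about, here is a minimal Python sketch of the “consecutive samples above” idea.  This is only an illustration of the logic, not how the SCOM monitor type is implemented; the function name, the state names, and the sample values are all made up for the example.

```python
def consecutive_samples_state(samples, threshold, required):
    """Return the health state after each sample.

    The state goes "Critical" once `required` samples in a row are above
    `threshold`, and returns to "Healthy" on the first sample that is
    not above it. Illustration only; names and states are assumptions.
    """
    streak = 0
    states = []
    for value in samples:
        # Count consecutive samples above the threshold; reset otherwise.
        streak = streak + 1 if value > threshold else 0
        states.append("Critical" if streak >= required else "Healthy")
    return states


# Hypothetical % Processor Time samples, threshold 90, 3 in a row required.
cpu_samples = [50, 95, 96, 97, 40]
print(consecutive_samples_state(cpu_samples, threshold=90, required=3))
# -> ['Healthy', 'Healthy', 'Healthy', 'Critical', 'Healthy']
```

With this type you always know the exact value and sample count that will trip the monitor, which is exactly what an STT cannot tell you.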

 

Personally?  I don’t use them, and recommend against them for any serious monitoring.  They were an awesome idea, but not implemented well enough to be useful in production monitoring environments.
