Menu Close

Tuning tip: Do you have monitors constantly “flip flopping” or a lot of old statechange events?

This is something I see in almost all clients when we perform a SCOM RAAS (Health Check).  The customer will have lots of data being inserted into the OpsDB from agents, about monitors that are constantly changing state.  This can have a very negative effect on overall performance of the database – because it can be a lot of data, and the management server is busy handling the state calculation, and synching this data about the state and any alert changes to the warehouse.

Many times the OpsMgr admin has no idea this is happening, because the alerts appear, and then auto-resolve so fast, you never see them – or don’t see them long enough to detect there is a problem.  I have seen databases where the statechangeevent table was the largest in the database – caused by these issues.

 

Too many state changes are generally caused by one (or many) of these issues:

1.  Badly written monitors that flip flop constantly.  Normally – this happens when you target a multi-instance perf counter incorrectly.  See my POST on this topic for more information.

2.  HealthService restarts.  See my POST on this topic.

3.  Old stale statechanges that will NEVER get groomed

4.  Frequent use of maintenance mode will cause lots of statechange activity

 

How can I detect if this is happening in my environment?

 

That is the right question!  For now – you can run a handful of SQL queries, which will show you the most common state changes going on in your environments.  These are listed on my SQL query blog page in the State section:

 

Noisiest monitors in the database: (Note – these will include old state changes – might not be current)

SELECT DISTINCT TOP 50 count(sce.StateId) as NumStateChanges, m.MonitorName, mt.typename AS TargetClass FROM StateChangeEvent sce with (nolock) JOIN state s with (nolock) on sce.StateId = s.StateId JOIN monitor m with (nolock) on s.MonitorId = m.MonitorId JOIN managedtype mt with (nolock) on m.TargetManagedEntityType = mt.ManagedTypeId WHERE m.IsUnitMonitor = 1 GROUP BY m.MonitorName,mt.typename ORDER BY NumStateChanges DESC

 

The above query will show us which monitors are flipping the most in the entire database.  This includes recent, and OLD data.  You have to be careful looking at this output – as you might spent a lot of time focusing on a monitor that had a problem long ago.  You see – we will only groom out old state changes for monitors that are CURRENTLY in a HEALTHY state, AT THE TIME that grooming runs.  We will not groom old state change events if the monitor is Disabled (unmonitored), in Maintenance Mode, Warning State, or Critical State.

What?

This means that if you had a major issue with a monitor in the past, and you solved it by disabling the monitor, we will NEVER, EVER groom that junk out.  This doesn’t really pose a problem, it just leaves a little database bloat, and messy statechangeevent views in Health Explorer.  But the real issue for me is – it makes it a bit tougher to only look at the problem monitors NOW.

To see if you have really old stale statechange data leftover in your database, you can run the following query:

declare @statedaystokeep INT SELECT @statedaystokeep = DaysToKeep from PartitionAndGroomingSettings WHERE ObjectName = 'StateChangeEvent' SELECT COUNT(*) as 'Total StateChanges', count(CASE WHEN sce.TimeGenerated > dateadd(dd,-@statedaystokeep,getutcdate()) THEN sce.TimeGenerated ELSE NULL END) as 'within grooming retention', count(CASE WHEN sce.TimeGenerated < dateadd(dd,-@statedaystokeep,getutcdate()) THEN sce.TimeGenerated ELSE NULL END) as '> grooming retention', count(CASE WHEN sce.TimeGenerated < dateadd(dd,-30,getutcdate()) THEN sce.TimeGenerated ELSE NULL END) as '> 30 days', count(CASE WHEN sce.TimeGenerated < dateadd(dd,-90,getutcdate()) THEN sce.TimeGenerated ELSE NULL END) as '> 90 days', count(CASE WHEN sce.TimeGenerated < dateadd(dd,-365,getutcdate()) THEN sce.TimeGenerated ELSE NULL END) as '> 365 days' from StateChangeEvent sce

 

You might find you have a couple YEARS worth of old state data.  This query could take MANY HOURS to complete.  It runs in a batch loop, grooming out 10,000 rows at a time. 

So – I have taken the built in grooming stored procedure, and modified the statement to groom out ALL statechange data, and only keep the number of days you have set in the UI.  (The default setting is 7 days).  I like to run this “cleanup” script from time to time, to clear out the old data, and whenever I am troubleshooting current issues with monitor flip-flop.  Here is the SQL query statement:

 

To clean up old StateChangeEvent data for state changes that are older than the defined grooming period, such as monitors currently in a disabled, warning, or critical state.  By default we only groom monitor statechangeevents where the monitor is enabled and healthy at the time of grooming.

USE [OperationsManager] GO SET ANSI_NULLS ON GO SET QUOTED_IDENTIFIER ON GO BEGIN SET NOCOUNT ON DECLARE @Err int DECLARE @Ret int DECLARE @DaysToKeep tinyint DECLARE @GroomingThresholdLocal datetime DECLARE @GroomingThresholdUTC datetime DECLARE @TimeGroomingRan datetime DECLARE @MaxTimeGroomed datetime DECLARE @RowCount int SET @TimeGroomingRan = getutcdate() SELECT @GroomingThresholdLocal = dbo.fn_GroomingThreshold(DaysToKeep, getdate()) FROM dbo.PartitionAndGroomingSettings WHERE ObjectName = 'StateChangeEvent' EXEC dbo.p_ConvertLocalTimeToUTC @GroomingThresholdLocal, @GroomingThresholdUTC OUT SET @Err = @@ERROR IF (@Err <> 0) BEGIN GOTO Error_Exit END SET @RowCount = 1 -- This is to update the settings table -- with the max groomed data SELECT @MaxTimeGroomed = MAX(TimeGenerated) FROM dbo.StateChangeEvent WHERE TimeGenerated < @GroomingThresholdUTC IF @MaxTimeGroomed IS NULL GOTO Success_Exit -- Instead of the FK DELETE CASCADE handling the deletion of the rows from -- the MJS table, do it explicitly. Performance is much better this way. DELETE MJS FROM dbo.MonitoringJobStatus MJS JOIN dbo.StateChangeEvent SCE ON SCE.StateChangeEventId = MJS.StateChangeEventId JOIN dbo.State S WITH(NOLOCK) ON SCE.[StateId] = S.[StateId] WHERE SCE.TimeGenerated < @GroomingThresholdUTC AND S.[HealthState] in (0,1,2,3) SELECT @Err = @@ERROR IF (@Err <> 0) BEGIN GOTO Error_Exit END WHILE (@RowCount > 0) BEGIN -- Delete StateChangeEvents that are older than @GroomingThresholdUTC -- We are doing this in chunks in separate transactions on -- purpose: to avoid the transaction log to grow too large. DELETE TOP (10000) SCE FROM dbo.StateChangeEvent SCE JOIN dbo.State S WITH(NOLOCK) ON SCE.[StateId] = S.[StateId] WHERE TimeGenerated < @GroomingThresholdUTC AND S.[HealthState] in (0,1,2,3) SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT IF (@Err <> 0) BEGIN GOTO Error_Exit END END UPDATE dbo.PartitionAndGroomingSettings SET GroomingRunTime = @TimeGroomingRan, DataGroomedMaxTime = @MaxTimeGroomed WHERE ObjectName = 'StateChangeEvent' SELECT @Err = @@ERROR, @RowCount = @@ROWCOUNT IF (@Err <> 0) BEGIN GOTO Error_Exit END Success_Exit: Error_Exit: END

 

 

Once this is cleaned up – you can re-run the “old stale” query – and see you should mostly see that statechanges are within the grooming threshold, as defined by your UI retention setting for database grooming.

Now – you can run the “Most common state changes” query – and identify which monitors are still causing a problem.

 

Look for monitors at the top with MUCH higher numbers than all others.  This will be “monitor flip flop” and you should use Health Explorer to find that monitor on a few instances – and figure out why it is changing state so much in the past few days.  Common conditions for this one are badly written monitors that target a single instance object, but monitor a multi-instance perf counter.  Also – just poor overall tuning can cause this – or poorly written custom script based monitors.

If you see a LOT of similar monitors at the top, with very similar state change counts, this is often indicative of HealthService restarts.  The Health service will submit new state change data every time it starts up.  So if the agent is bouncing every 10 minutes, that is a new state change for ALL monitors on that agent, every 10 minutes.  See link at the top of this article for resolution to that.

4 Comments

    • Kevin Holman

      Yes, absolutely. This cleanup works in SCOM 2007R2, 2012, 2012R2, 2016. I have not yet tested it for 2019 but I suspect it is good there too.

  1. Simon

    What happens if you have state changes in the future, an example being if the SCOM management server has had its time reset for a short period but between that time state changes took place but now they are stored in the database with that future date. Is there a cleanup script for this.
    Thanks

  2. Pingback:Cleaning old health state change data made easy - The Monitoring Guys

Leave a Reply

Your email address will not be published.