
What is config churn?

Config churn is when your management server is stuck in an almost never-ending loop of generating config.  This can be caused by “less than optimized” management packs, by pushing agents all the time, or by injecting major changes into a management group, such as overrides, custom rules and monitors, or updated management packs.  By examining this topic in depth, we will restate some already known best practices for maintaining a healthy management group, and gain some deeper knowledge as to why they are best practices in the first place.

 

Any time you push agents, create rules and monitors, or set overrides on widespread classes, you can create a config update on the RMS (Root Management Server) that must be sent down to ALL agents in the management group.  For small management groups (under 500 agents) this is generally not a big deal and processes rather quickly.  For large management groups (over 1000 agents), this can cause high resource utilization on the RMS and SQL database, in terms of CPU, memory, and disk I/O.  This can impact data insertion and console performance during these times.  For these reasons, we like to keep those activities to a minimum during working hours, and schedule these major changes in an off-hours maintenance window.

What about “less than optimized” management packs?  What does that mean?  Well, this means management packs that you might be using, that have poorly written discoveries.

We have long known that a worst practice in Management Pack development, is to have a discovery that discovers instances of a class, that has properties for those instances that are likely to change frequently. 

 

Ok… wait… Whaaaaat?

Let me put that in English:

Say we have a discovery for a Logical Disk.  This will discover any logical disk, like C:, D:, E:, Q:, etc.  When we write the discovery for a logical disk, we can add properties to that discovery.  These are attributes of the discovered instances.  So – in this case – let’s say we decided to add “Size” of the disk as a property, and “Free Space” as a property.  And for the discovery frequency – we will run this discovery every hour, looking for new disks.

“Size” is an excellent property for the Logical Disk class.  We like to know the size of the disks, and we can use this property to group them if needed.  “Size” of a logical disk is not something that we would expect to change very often.

“Free Space” is a horrible property for the Logical Disk class.  Free space is something that will likely change, just a small amount even, between each run of the discovery.  Free space is a property that is likely to change frequently, therefore – it should NOT be used in a discovery.

 

Make sense?

Ok – so… what’s the big deal?

Well, the agent will run almost all discoveries that it knows about when the health service starts up (like when you bounce the service, or after a reboot).  It will always send this discovery data to the management server.  After that, it will run them based on the “Interval” frequency specified on the discovery.  Sometimes this is as frequent as once per hour, sometimes as long as once per day.  When the discovery runs, the agent will inspect the discovery data that it gets, and compare it to the last discovery data it sent to the management server.  If nothing changed – the agent drops the discovery data and does nothing.  If anything changed in the values of the discovery data – it will re-submit the new data to the management server, which will submit this data to the database.  The RMS will detect the change, and will have to recalculate (regenerate) configuration.  You will see this on the RMS as a 21025 event:

Log Name:      Operations Manager 
Source:        OpsMgr Connector 
Date:          9/27/2009 11:51:49 PM 
Event ID:      21025 
Task Category: None 
Level:         Information 
Keywords:      Classic 
User:          N/A 
Computer:      OMRMS.opsmgr.net 
Description: 
OpsMgr has received new configuration for management group PROD1 from the Configuration Service.  The new state cookie is “D7 9B A4 BE 00 90 CF 13 35 B5 9B 5F 3B 14 FF 78 D6 13 9A 2D “

The 21025 event isn’t really “bad”… it simply means the config service did its job.  It re-generated its configuration file from the database data, and wrote it to:  \Program Files\System Center Operations Manager 2007\Health Service State\Connector Configuration Cache\<MGNAME>\OpsMgrConnector.Config.xml  The problem comes when this config file gets large (as in large agent count environments) and when the “Config Instance Space” (the total number of discovered objects) is large.  Recalculating this config can then have a significant impact on the disk where the file exists on the RMS, use lots of memory and CPU on the RMS for the config service, and use significant disk I/O on the SQL database.

If the RMS is in a perpetual cycle of recalculating config, and sending these config updates to all agents…. the performance of the management group is impacted.
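
To make the agent-side comparison concrete, here is a minimal sketch of the behavior described above.  It is not the agent's actual code: the snapshot structure, the function name, and the sample free space values are all made up for illustration.  It only shows why a property like “Free Space” causes re-submissions while “Size” does not.

```python
from typing import Dict, List, Optional

# Hypothetical shape of one discovery run: disk name -> property name -> value.
Snapshot = Dict[str, Dict[str, int]]


def count_resubmissions(runs: List[Snapshot]) -> int:
    """Count how many discovery runs would be sent to the management server.

    Follows the rule described in the article: the first run is always sent,
    and each later run is sent only if it differs from the last one sent.
    """
    sent = 0
    last_sent: Optional[Snapshot] = None
    for snapshot in runs:
        if snapshot != last_sent:
            sent += 1
            last_sent = snapshot
    return sent


# Illustrative hourly runs for one agent: free space drifts a little each hour.
free_space_gb = [120, 119, 119, 118, 117, 117]

size_only = [{"C:": {"Size": 500}} for _ in free_space_gb]
with_free_space = [{"C:": {"Size": 500, "FreeSpace": f}} for f in free_space_gb]

print(count_resubmissions(size_only))        # 1 (only the initial submission)
print(count_resubmissions(with_free_space))  # 4 (every change in free space is re-sent)
```

Each of those extra submissions is a candidate for another config recalculation on the RMS, multiplied by every agent running the same discovery.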

 

Daniele Grandini of Quae Nocent Docent is pretty much the “godfather” of good information researching the 21025 event.  Read his 3 part series on config churn here:

http://nocentdocent.wordpress.com/2009/07/09/troubleshooting-21025-events-wrap-up/

 

 

So – what can I do if I think I have too much config churn?

 

The biggest problem causing the most frequent config updates is management packs with noisy discoveries.  However, let’s wrap up all the issues that can cause it, and what you can do:

 

  1. New agents.  Discover/install/approve new agents in bulk and off-hours.
  2. Overrides.  Set overrides during off-hours, or create override MP’s in a lab, then synch to production management groups during scheduled off-hours windows.
  3. Custom rules and monitors.  Create these during off-hours, or create using the authoring console, test in a lab, then import to production during off-hours.
  4. Newly discovered instances.  For instance – someone adds a new disk, or SQL database, or DNS zone, to an existing agent.  Not much we can do about this, except set the expectation that this is done during off-hours.
  5. Group membership changes.
  6. Management packs with noisy discovery properties.  See below:

 

Ok – the remainder of this article will touch on #6.

 

How can I tell which discoveries are noisy?

 

Daniele Grandini has put together a good query on this, from his link:  http://nocentdocent.wordpress.com/2009/05/23/how-to-get-noisy-discovery-rules/

 

I will repost these (slightly modified) below:

 

/* Top Noisy Rules in the last 24 hours */
select ManagedEntityTypeSystemName, DiscoverySystemName, count(*) As 'Changes'
from (
  select distinct
    MP.ManagementPackSystemName,
    MET.ManagedEntityTypeSystemName,
    PropertySystemName,
    D.DiscoverySystemName,
    D.DiscoveryDefaultName,
    MET1.ManagedEntityTypeSystemName As 'TargetTypeSystemName',
    MET1.ManagedEntityTypeDefaultName 'TargetTypeDefaultName',
    ME.Path,
    ME.Name,
    C.OldValue,
    C.NewValue,
    C.ChangeDateTime
  from dbo.vManagedEntityPropertyChange C
  inner join dbo.vManagedEntity ME on ME.ManagedEntityRowId=C.ManagedEntityRowId
  inner join dbo.vManagedEntityTypeProperty METP on METP.PropertyGuid=C.PropertyGuid
  inner join dbo.vManagedEntityType MET on MET.ManagedEntityTypeRowId=ME.ManagedEntityTypeRowId
  inner join dbo.vManagementPack MP on MP.ManagementPackRowId=MET.ManagementPackRowId
  inner join dbo.vManagementPackVersion MPV on MPV.ManagementPackRowId=MP.ManagementPackRowId
  left join dbo.vDiscoveryManagementPackVersion DMP on DMP.ManagementPackVersionRowId=MPV.ManagementPackVersionRowId
    AND CAST(DefinitionXml.query('data(/Discovery/DiscoveryTypes/DiscoveryClass/@TypeID)') AS nvarchar(max)) like '%'+MET.ManagedEntityTypeSystemName+'%'
  left join dbo.vManagedEntityType MET1 on MET1.ManagedEntityTypeRowId=DMP.TargetManagedEntityTypeRowId
  left join dbo.vDiscovery D on D.DiscoveryRowId=DMP.DiscoveryRowId
  where ChangeDateTime > dateadd(hh,-24,getutcdate())
) As #T
group by ManagedEntityTypeSystemName, DiscoverySystemName
order by count(*) DESC

 

/* Modified properties in the last 24 hours */
select distinct
  MP.ManagementPackSystemName,
  MET.ManagedEntityTypeSystemName,
  PropertySystemName,
  D.DiscoverySystemName,
  D.DiscoveryDefaultName,
  MET1.ManagedEntityTypeSystemName As 'TargetTypeSystemName',
  MET1.ManagedEntityTypeDefaultName 'TargetTypeDefaultName',
  ME.Path,
  ME.Name,
  C.OldValue,
  C.NewValue,
  C.ChangeDateTime
from dbo.vManagedEntityPropertyChange C
inner join dbo.vManagedEntity ME on ME.ManagedEntityRowId=C.ManagedEntityRowId
inner join dbo.vManagedEntityTypeProperty METP on METP.PropertyGuid=C.PropertyGuid
inner join dbo.vManagedEntityType MET on MET.ManagedEntityTypeRowId=ME.ManagedEntityTypeRowId
inner join dbo.vManagementPack MP on MP.ManagementPackRowId=MET.ManagementPackRowId
inner join dbo.vManagementPackVersion MPV on MPV.ManagementPackRowId=MP.ManagementPackRowId
left join dbo.vDiscoveryManagementPackVersion DMP on DMP.ManagementPackVersionRowId=MPV.ManagementPackVersionRowId
  AND CAST(DefinitionXml.query('data(/Discovery/DiscoveryTypes/DiscoveryClass/@TypeID)') AS nvarchar(max)) like '%'+MET.ManagedEntityTypeSystemName+'%'
left join dbo.vManagedEntityType MET1 on MET1.ManagedEntityTypeRowId=DMP.TargetManagedEntityTypeRowId
left join dbo.vDiscovery D on D.DiscoveryRowId=DMP.DiscoveryRowId
where ChangeDateTime > dateadd(hh,-24,getutcdate())
ORDER BY MP.ManagementPackSystemName, MET.ManagedEntityTypeSystemName

 

Wow – that returned a LOT of discoveries running all the time!  What can I do?

 

  • Don’t import too many MP’s!  The FIRST line of defense – is NOT to import ANY management packs into a management group that you don’t absolutely need RIGHT THEN.  Management packs are constantly updated, and by the time you have an actual SLA in a technology area – there will likely be a newer, better MP available for it.  The biggest mistake many customers make is to import any available MP for a technology that they have internally.  They end up with a FLOOD of alerts, big fat databases, slow consoles, and lots of weird errors.  MP’s should be transitioned slowly, one at a time – tuning and resolving as you go.

 

  • Disable the noisy discoveries.  Probably not a great solution, unless they discover objects that you really don’t care about – but there are other objects in the MP that you DO want to monitor.  However – what I like to do is to look at the discovery – and see what class it discovers.  Then, using MPViewer – look at the rules and monitors that target that class.  I might find I really don’t need these rules and monitors in my business, so disabling the discovery is the simplest option to solve the performance problem.

 

  • Increase the interval of the discovery frequency.  This means… essentially – change any “bad” discoveries to run only once per day (86400 seconds) or once every 7 days (604800 seconds) or more (up to 4 weeks = 2419200 seconds)

 

  • Add a “synch time” override to the discovery – if possible.  This option is not available unless the MP author of the discovery exposed it.  What this will do – it causes all the agents to ONLY run the discovery at a distinct and specified time every day (say…. 1AM).  This might cause too much discovery data to flood in at one time… but since it will all come in at the same time – it won’t cause constant config churn all throughout the day.  I have never done this, because I don’t know how bad the impact is when all the discovery data comes in at the same time, so this is more of an idea I had.

 

  • Re-write the discovery.  If this is a custom MP – rewrite the discovery/MP, and remove that property which changes too often, or fix it.  If this is a sealed MP – talk to the vendor and get them to fix their MP.  Or – consider disabling it – and re-writing the discovery yourself – and fix it until the vendor is able to release an update.

 

  • Make sure your hardware and software is optimized for scalability.  On your RMS – it is good to place your config file on fast disks, especially in large environments.  I have worked with very large customers who were experiencing config churn, but had zero ill effects, because their RMS disk I/O was on a 4 spindle RAID10 with 15K spindles, CPU and memory were really good, and their SQL database disk I/O for the OpsDB was excellent with plenty of breathing room.  I have also worked with smaller agent counts, where config churn has a serious impact…. mostly due to the RMS config file being placed on the same RAID spindle set as the OS and pagefile, using only 2 older 10,000 RPM disks in a RAID1 mirror.  The SQL disk I/O was also just borderline for their agent count.  In these environments – I see config churn having a bigger impact.

 

  • Re-run the queries periodically – especially after importing/upgrading to a new management pack in your management group.  This “instance space change” report should be part of your testing and evaluation of a new MP when brought into your lab…. if you have a large agent count environment.
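
For the interval overrides mentioned in the list above, a little arithmetic shows why stretching the frequency helps.  This is only a rough sketch: the helper names and the agent count of 1000 are hypothetical, and the result is an upper bound on scheduled runs, since a run only triggers a config change when the discovered values actually changed.

```python
SECONDS_PER_DAY = 86_400
MAX_INTERVAL_SECONDS = 2_419_200  # 4 weeks, the upper limit cited in this article


def interval_seconds(days: float) -> int:
    """Convert days into an interval override value in seconds, capped at 4 weeks."""
    if days <= 0:
        raise ValueError("days must be positive")
    return min(int(days * SECONDS_PER_DAY), MAX_INTERVAL_SECONDS)


def scheduled_runs_per_week(interval: int, agents: int) -> float:
    """Upper bound on scheduled runs of one discovery per week across all agents."""
    return agents * (7 * SECONDS_PER_DAY) / interval


agents = 1000  # hypothetical agent count
for label, interval in [
    ("hourly", 3600),
    ("daily", interval_seconds(1)),
    ("weekly", interval_seconds(7)),
]:
    print(label, interval, scheduled_runs_per_week(interval, agents))
# hourly 3600 168000.0
# daily 86400 7000.0
# weekly 604800 1000.0
```

Going from hourly to daily cuts the number of chances for a noisy property to generate new discovery data by a factor of 24.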

 

 

Some very common discoveries I have seen – that have properties that change very frequently – are listed below.  I often recommend these be overridden to run once per day (86,400 seconds) or once per week (604800 seconds) if the problem is serious, or still existing when running once per day (large agent counts)

 

The top noisy MP’s with bad discoveries I find in customer environments – are almost ALWAYS some order of the following:

  • IIS MP
  • SQL MP (old versions only – see note below)
  • DNS MP (old versions only – see note below)
  • ADMP
  • DPM

 

Discovery Display Name | Discovery Target Class | Discovered Type | Property that is changing too much | Default frequency (seconds) | Modified frequency (seconds)
--- | --- | --- | --- | --- | ---
Windows Internet Information Services Base Classes Discovery Rule | IIS 2000 Server Role | IIS NNTP Virtual Server | MaxMessageSize | 3600 | 86400
Windows Internet Information Services Base Classes Discovery Rule | IIS 2003 Server Role | IIS FTP Site |   | 3600 | 86400
Windows Internet Information Services Web Sites x-x Discovery Rule (4 of these) | IIS 2003 Web Server | IIS Web Site | LoggingEnabled | 3600 | 86400
DNS 2003 Component Discovery | DNS 2003 Server | DNS 2003 Zone | SerialNumber | 21600 | 604800
DNS 2008 Component Discovery | DNS 2008 Server | DNS 2008 Zone | SerialNumber | 21600 | 604800
DNS 2003 Component Discovery | DNS 2003 Server | DNS Domain | PrimaryServer | 21600 | 604800
DNS 2003 Component Discovery | DNS 2008 Server | DNS Domain | PrimaryServer | 21600 | 604800
AD Remote Topology Discovery | Active Directory Domain Controller Server 2000 Computer Role | Active Directory Connection Object | LastSuccessfulSyncTime | 86400 | 604800
AD Remote Topology Discovery | Active Directory Domain Controller Server 2003 Computer Role | Active Directory Connection Object | LastSuccessfulSyncTime | 86400 | 604800
AD Remote Topology Discovery | Active Directory Domain Controller Server 2008 Computer Role | Active Directory Connection Object | LastSuccessfulSyncTime | 86400 | 604800
Discover SQL 2000 Databases** | SQL 2000 DB Engine | SQL 2000 DB | DatabaseSize, DatabaseSizeNumeric, LogSize, LogSizeNumeric | 1800 | 86400
Discover Databases for a Database Engine** | SQL 2005 DB Engine | SQL 2005 DB | DatabaseSize, DatabaseSizeNumeric, LogSize, LogSizeNumeric | 7200 | 86400
Discover Databases for a Database Engine** | SQL 2008 DB Engine | SQL 2008 DB | DatabaseSize, DatabaseSizeNumeric, LogSize, LogSizeNumeric | 7200 | 86400
Discover File Groups and Files** | SQL 2005 DB Engine | SQL 2005 DB File, SQL 2005 DB File Group | Size | 7200 | 86400
Discover File Groups and Files** | SQL 2008 DB Engine | SQL 2008 DB File, SQL 2008 DB File Group | Size | 7200 | 86400
Discover Network Adapters (Only Enabled) | Windows Server 2003 Operating System | Windows Server 2003 Network Adapter | Name, Description, IPAddress | 86820 | 604800
Discover Network Adapters (Only Enabled) | Windows Server 2008 Operating System | Windows Server 2008 Network Adapter | Name, Description, IPAddress | 86820 | 604800

 

The above is just a sample – you should examine the output of the noisy discovery queries earlier in this article and see what is impacting your management group the most.

 

Note some deeper level information on this topic:

 

What is the maximum value I can set a discovery frequency to?  Supposedly – the MAX value in seconds is 2419200, which is 4 weeks.  Normally – discoveries should not have to be stretched out so long – only if they are creating a problem.  Setting this number to 4 weeks essentially negates the discovery, which is no big deal if it is a discovery that is running for something already discovered.  However – for something like SQL databases – that means it might take 4 weeks to start monitoring a new database.  That is not good.  There is a workaround, however, that lets you use the extended frequencies and still discover items: when you restart the HealthService of an agent – it will immediately run all discoveries that apply to it that don’t have a synch time set.  This means – as a workaround to the workaround here – you can simply restart the agent if you add a new database, or IIS website, and need monitoring sooner than the max frequency time.

 

 

RMS Churn:  When a discovery property change comes in for an instance that is hosted by an agent – the RMS creates new config and sends it to that agent.  This is a normal process – but we want to keep it from happening too frequently.  It isn’t terribly expensive unless the number of instances hosted by the agent is very high.  (As in – a typical agent might have 40 instances, but a SQL server with 1000 databases has 1040 instances.)

Next up is the case where the discovery property change occurs, and the instance that sent up the change is a member of a group.  This is worse – because this now causes a config recalc for the agent, AND a config recalc for the RMS.  This is because the RMS has to evaluate group membership, since it hosts a group of instances and one or more of those instances changed – which might affect group membership.  For instance – if the SQL Database size property changes – this is no big deal.  UNLESS you have created groups of SQL databases somewhere in the management group – and this changed database is a member of one or more groups.  This will cause the RMS to update its own config.

Lastly – when a discovery property change comes in for an instance of a class that is hosted by the RMS – this causes the RMS to completely recalculate its own config as well, and update its local health service config file.  This is very expensive, and these instances should be given top consideration when fixing discoveries or extending their intervals.  The most common ones I see are the DNS Domain, DNS Zone, and AD Connection objects, which appear in the table above.  Changes to these instances are VERY expensive because these are logical instances that are not hosted by any SINGLE agent, so they get hosted by the RMS.  When they change – it forces the RMS to regenerate its own config.  This will be evident from a LARGE number of 21025 events showing up in the RMS OpsMgr event log.  Generally – we would only like to see this file updated when necessary; two to three times per hour is ideal.  However – if you are running the DNS management pack or ADMP, you are likely seeing this event every few MINUTES.  These DNS discoveries should be evaluated and overridden.
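
If you want a quick read on how often 21025 is firing, you can export the event timestamps from the RMS OpsMgr event log and bucket them per hour.  The sketch below assumes you already have the timestamps as text in the same format as the sample event earlier in this article; the function names and the sample data are made up, and the threshold of 3 per hour is just the “two to three times per hour” guideline from above.  Parsing AM/PM with `%p` assumes an English locale.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, List

TIMESTAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # matches "9/27/2009 11:51:49 PM"


def events_per_hour(timestamps: Iterable[str]) -> Dict[datetime, int]:
    """Count events per clock hour from exported timestamp strings."""
    counts: Counter = Counter()
    for text in timestamps:
        when = datetime.strptime(text.strip(), TIMESTAMP_FORMAT)
        counts[when.replace(minute=0, second=0)] += 1
    return dict(counts)


def busy_hours(counts: Dict[datetime, int], threshold: int = 3) -> List[datetime]:
    """Return the hours that saw more events than the threshold."""
    return sorted(hour for hour, n in counts.items() if n > threshold)


# Illustrative timestamps only, not real log data.
sample = [
    "9/27/2009 10:05:00 PM",
    "9/27/2009 10:40:00 PM",
    "9/27/2009 11:03:10 PM",
    "9/27/2009 11:12:45 PM",
    "9/27/2009 11:27:02 PM",
    "9/27/2009 11:41:30 PM",
    "9/27/2009 11:51:49 PM",
]

counts = events_per_hour(sample)
for hour in busy_hours(counts):
    print(hour, counts[hour])  # 2009-09-27 23:00:00 5
```

Hours that show up here are the ones worth lining up against the noisy discovery query output.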

 

Other items hosted by the RMS are groups.  When group membership changes – this impacts RMS performance.  This is due to the fact that the RMS hosts the group instances, and the relationships to what each group contains.  When group membership changes – the RMS generates new config.  This will also show up as a 21025 Event in the RMS OpsMgr event log.  So if you have tackled the discoveries from MP’s changing frequently – the next thing to look at is groups.  If you have a large management group, and you think this might be impacting you – one of the things you can do is to slow down the group populator module.  By default – this runs every 30 seconds.

We have a registry setting to make group calculation run less often, which lowers the performance hit on the database.  When you make this setting less frequent, group calculation will poll the database less often, with the tradeoff that the latency of group membership discovery will increase:

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\GroupCalcPollingIntervalMilliseconds

Default is 30,000 milliseconds (30 secs)   You can create this new DWORD value to control this setting.
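
As a rough sketch, the value could be created with the standard `reg add` command.  The helper below only builds the command line; it does not run it.  The 120,000 ms value is purely illustrative and not a recommendation from this article, and the helper name is made up.  Whether a service restart is needed for the change to take effect is not covered here, so test in a lab first.

```python
from typing import List

KEY_PATH = r"HKLM\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0"
VALUE_NAME = "GroupCalcPollingIntervalMilliseconds"
DEFAULT_MS = 30_000       # default stated in the article (30 seconds)
DWORD_MAX = 0xFFFFFFFF    # largest value a REG_DWORD can hold


def build_reg_add_command(interval_ms: int) -> List[str]:
    """Build (but do not run) a reg.exe command that sets the polling interval."""
    if not 0 < interval_ms <= DWORD_MAX:
        raise ValueError("interval_ms must fit in a positive DWORD")
    return [
        "reg", "add", KEY_PATH,
        "/v", VALUE_NAME,
        "/t", "REG_DWORD",
        "/d", str(interval_ms),
        "/f",  # overwrite an existing value without prompting
    ]


# Illustrative: poll every 2 minutes instead of the default 30 seconds.
command = build_reg_add_command(120_000)
print(command)
```

On the RMS, the resulting list could be passed to `subprocess.run(command, check=True)` from an elevated prompt.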

 

If you want to see all the instance types hosted by the RMS, run this query against the Operations Database:

 

DECLARE @RelationshipTypeId_Manages UNIQUEIDENTIFIER
SELECT @RelationshipTypeId_Manages = dbo.fn_RelationshipTypeId_Manages()

SELECT bme.FullName, dt.TopLevelEntityName, dt.BaseEntityName, dt.TypedEntityName
FROM BaseManagedEntity bme
RIGHT JOIN (
  SELECT
    HBME.BaseManagedEntityId AS HS_BMEID,
    TBME.FullName AS TopLevelEntityName,
    BME.FullName AS BaseEntityName,
    TYPE.TypeName AS TypedEntityName
  FROM BaseManagedEntity BME WITH(NOLOCK)
  INNER JOIN TypedManagedEntity TME WITH(NOLOCK)
    ON BME.BaseManagedEntityId = TME.BaseManagedEntityId
    AND BME.IsDeleted = 0
    AND TME.IsDeleted = 0
  INNER JOIN BaseManagedEntity TBME WITH(NOLOCK)
    ON BME.TopLevelHostEntityId = TBME.BaseManagedEntityId
    AND TBME.IsDeleted = 0
  INNER JOIN ManagedType TYPE WITH(NOLOCK)
    ON TME.ManagedTypeID = TYPE.ManagedTypeID
  LEFT JOIN Relationship R WITH(NOLOCK)
    ON R.TargetEntityId = TBME.BaseManagedEntityId
    AND R.RelationshipTypeId = @RelationshipTypeId_Manages
    AND R.IsDeleted = 0
  LEFT JOIN BaseManagedEntity HBME WITH(NOLOCK)
    ON R.SourceEntityId = HBME.BaseManagedEntityId
) AS dt ON dt.HS_BMEID = bme.BaseManagedEntityId
Where Fullname like '%RMSNAME%'
order by typedentityname

 

Change “RMSNAME” above to your RMS name.  You will see that most will be groups – but you might be surprised to see everything that is hosted by the RMS.

1 Comment

  1. Kamil

    Hello Kevin,
    this is a really fantastic article and query.
    But sometimes, in really big environments, we run into problems caused by other connectors, for example OnTAP or SCVMM. Could you provide an example query to find many state changes from many sources? It would be very helpful.
