SCOM 2012 – Grooming deep dive in the OperationsManager database
Grooming of the OpsDB in OpsMgr 2012 is very similar to OpsMgr 2007. Grooming is called once per day at 12:00am…. by the rule: “Partitioning and Grooming” You can search for this rule in the Authoring space of the console, under Rules. It is targeted to the“All Management Servers Resource Pool”and is part of the System Center Internal Library.
It calls the “p_PartitioningAndGrooming” stored procedure. This SP calls two other SP’s: p_Partitioningand thenp_Grooming
p_Partitioninginspects the tablePartitionAndGroomingSettings, and then calls the SPp_PartitionObjectfor each object in thePartitionAndGroomingSettingstable where “IsPartitioned = 1” (note – we partition event and perf into 61 daily tables – just like MOM 2005/SCOM 2007)
Thep_PartitionObjectSP first identifies the next partition in the sequence, truncates it to make sure it is empty, and then updates thePartitionTablestable in the database, to update theIsCurrentfield to the next numeric table for events and perf. It also sets the current time as the partition end time in the previous “is current” row, and sets the current time in the partition start time of the new “is current” row. Then it calls thep_PartitionAlterInsertViewsproc, to make new data start writing to the “new” current event and perf table.
To review which tables you are writing to – execute the following query: select * from partitiontables where IsCurrent = ‘1’
Aselect * from partitiontableswill show you all 61 event and perf tables, and when they were used. You should see aPartitionStartTimeupdated every day – around midnight (time is stored in UTC in the database). If partitioning is failing to run, then we wont see this date changing every day.
Ok – that’s the first step of thep_PartitioningAndGroomingsproc – Partitioning. Now – if that is all successful, we will start grooming!
Thep_Groomingis called after partitioning is successful. One of the first things it does – is to update the InternalJobHistory table. In this table – we keep a record of all partitioning and grooming jobs. It is a good spot check to see what’s going on with grooming. To have a peek at this table – execute aselect * from InternalJobHistory order by InternalJobHistoryId DESC
Thep_Groomingsproc then callsp_GroomPartitionedObjects
p_GroomPartitionedObjects will first examine thePartitionAndGroomingSettingsand compare the “days to keep” column value, against the current date, to figure out how many partitions to keep vs groom. It will then inspect the partitions (tables) to ensure they have data, and then truncate the partition, by callingp_PartitionTruncate. A truncate command is just a VERY fast and efficient way to delete all data from a table without issuing a highly transactional DELETE command.Thep_GroomPartitionedObjectssproc will then update thePartitionAndGroomingSettingstable with the current time, under theGroomingRunTimecolumn, to reflect when grooming last ran.
Next – thep_Groomingsproc continues, by callingp_GroomNonPartitionedObjects.
p_GroomNonPartitionedObjectsis a short, but complex sproc – in that is calls all the individual sprocs listed in thePartitionAndGroomingSettingstable where IsPartitioned = 0. The following stored procedures are present in my database as non-partitioned data:
Now, for the above sprocs, each one could potentially return a success or failure. They will also likely call additional sprocs, for specific tasks. You can see, the rabbit hole is deep. This is just an example of the complexity involved in self-maintenance and grooming. If you are experiencing a grooming failure of any kind, and the error messages involve any of the above stored procedures when you execute p_PartitioningAndGrooming manually, you should open a support case with Microsoft for troubleshooting and resolution. The theory is, that each of the above procedures grooms a specific non-partitioned dataset. Under NORMAL circumstances, each should be able to complete in a reasonable time frame. The challenge becomes evident when you have something go wrong, like alert storms, state change even storms from monitors flip-flop, lots of performance signature data from using self-tuning threshold monitors, huge amounts of pending SDK datasource data from large Exchange 2010 environments, or from other MP’s that might leverage this. Grooming non-partitioned data is slow, and highly resource intensive and transactional. These are specific delete statements, from tables directly, often combined with creating temp tables in TempDB. Having a good presized high performance TempDB can help, as will ensuring you have plenty of transaction log space for the database, and having the disk subsystem offer as many IOPS as possible. http://technet.microsoft.com/en-us/library/ms175527(v=SQL.105).aspx
Next – thep_Groomingsproc continues, by updating the InternalJobHistory table, to give it a status of success (StatusCode of 1 = success, 2= failed, 0 appears to be never completed?)
If you ever have a problem with grooming – or need to get your OpsDB database size under control – simply reduce the data retention days, in the console, under Administration, Settings, Database Grooming. To start with – I recommend setting all these to just 2 days, from the default of 7. This keeps your OpsDB under control until you have time to tune all the noise from the MP’s you import. So just reduce this number, then open up query analyzer, and executeEXECp_PartitioningAndGrooming When it is done, check the job status by executingselect * from InternalJobHistory order by InternalJobHistoryId DESC The last groom job should be present, and successful. The OpsDB size should be smaller, with more free space. And to validate, you can always run my large table query