This article is not just a warning about the Dell (Detailed) MP, but about the danger of importing ANY management pack into your environment without fully understanding its intended scope, scalability, and any known or common issues.
I recently worked with a customer who had an interesting issue. They had a very large agent-based monitoring environment (greater than 10,000 agents). While performing a supportability review, we noticed that config generation was failing. This was evidenced by the Config monitors showing red in the console, alerts being generated, events logged in the management server Operations Manager event logs, and most notably by the fact that agents were not getting updated config in a timely fashion.
Events were similar to:
Log Name: Operations Manager
Source: OpsMgr Management Configuration
Event ID: 29181
Computer: managementserver.domain.com
Description:
OpsMgr Management Configuration Service failed to execute ‘SnapshotSynchronization’ engine work item due to the following exceptionMicrosoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.DataAccessException: Data access operation failed
at Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.DataAccessOperation.ExecuteSynchronously(Int32 timeoutSeconds, WaitHandle stopWaitHandle)
at Microsoft.EnterpriseManagement.ManagementConfiguration.SqlConfigurationStore.ConfigurationStore.ExecuteOperationSynchronously(IDataAccessConnectedOperation operation, String operationName)
at Microsoft.EnterpriseManagement.ManagementConfiguration.SqlConfigurationStore.ConfigurationStore.EndSnapshot(String deltaWatermark)
at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.SnapshotSynchronizationWorkItem.EndSnapshot(String deltaWatermark)
at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.SnapshotSynchronizationWorkItem.ExecuteSharedWorkItem()
at Microsoft.EnterpriseManagement.ManagementConfiguration.Interop.SharedWorkItem.ExecuteWorkItem()
at Microsoft.EnterpriseManagement.ManagementConfiguration.Interop.ConfigServiceEngineWorkItem.Execute()
———————————–
System.Data.SqlClient.SqlException (0x80131904): Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. —> System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out
at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
at System.Data.SqlClient.SqlCommand.InternalEndExecuteReader(IAsyncResult asyncResult, String endMethod)
at System.Data.SqlClient.SqlCommand.EndExecuteReaderInternal(IAsyncResult asyncResult)
at System.Data.SqlClient.SqlCommand.EndExecuteReader(IAsyncResult asyncResult)
at Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.ReaderSqlCommandOperation.SqlCommandCompleted(IAsyncResult asyncResult)
ClientConnectionId:724196c1-d9ec-4f29-8807-b16cab05fcc6
Our initial issue was due to the fact that the management servers were running Windows Server 2012 RTM with .NET 4.5. There is a known issue with this combination, and we needed to install .NET 4.5.1 to resolve these timeouts. This got us past the initial Snapshot Config failures.
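If you are not sure which .NET 4.x release is installed on your management servers, you can check the registry. This is just a quick sketch; the release numbers below are the commonly published values (378389 = 4.5, 378675 or 378758 = 4.5.1), so verify them against Microsoft's documentation for your OS build.
# Check the installed .NET 4.x release on a management server (a quick sketch).
# Commonly published values: 378389 = 4.5, 378675 or 378758 = 4.5.1.
$release = (Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full' -Name Release).Release
if ($release -lt 378675) { Write-Warning ".NET 4.5.1 or later is not installed (Release = $release)." }
else { Write-Output ".NET release $release detected." }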
Next, we saw that Delta Config started failing:
Log Name: Operations Manager
Source: OpsMgr Management Configuration
Event ID: 29181
Computer: managementserver.domain.com
Description:
OpsMgr Management Configuration Service failed to execute ‘DeltaSynchronization’ engine work item due to the following exceptionMicrosoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.DataAccessException: Data access operation failed
at Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.DataAccessOperation.ExecuteSynchronously(Int32 timeoutSeconds, WaitHandle stopWaitHandle)
at Microsoft.EnterpriseManagement.ManagementConfiguration.CmdbOperations.CmdbDataProvider.GetConfigurationDelta(String watermark)
at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.TracingConfigurationDataProvider.GetConfigurationDelta(String watermark)
at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.DeltaSynchronizationWorkItem.TransferData(String watermark)
at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.DeltaSynchronizationWorkItem.ExecuteSharedWorkItem()
at Microsoft.EnterpriseManagement.ManagementConfiguration.Interop.SharedWorkItem.ExecuteWorkItem()
at Microsoft.EnterpriseManagement.ManagementConfiguration.Interop.ConfigServiceEngineWorkItem.Execute()
———————————–
System.Data.SqlClient.SqlException (0x80131904): Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. —> System.ComponentModel.Win32Exception (0x80004005): The wait operation timed out
at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
at System.Data.SqlClient.SqlDataReader.TryReadInternal(Boolean setTimeout, Boolean& more)
at System.Data.SqlClient.SqlDataReader.Read()
at Microsoft.EnterpriseManagement.ManagementConfiguration.CmdbOperations.EntityChangeDeltaReadOperation.ReadManagedEntitiesProperties(SqlDataReader reader)
at Microsoft.EnterpriseManagement.ManagementConfiguration.CmdbOperations.EntityChangeDeltaReadOperation.ReadData(SqlDataReader reader)
at Microsoft.EnterpriseManagement.ManagementConfiguration.DataAccessLayer.ReaderSqlCommandOperation.SqlCommandCompleted(IAsyncResult asyncResult)
ClientConnectionId:9d9ec759-e9bf-4c1e-a958-581377c630b3
We run a snapshot config every 24 hours by default. We run a delta config every 30 seconds by default. These are controlled via the ConfigService.config file located in the \Program Files\Microsoft System Center 2012 R2\Operations Manager\Server\ directory. Delta config timing out was odd. There can be many reasons for this, so the next step was to take a SQL trace and see what expensive queries were running.
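If you do not have a full SQL trace handy, a rough first pass is to look at the plan cache for the statements with the most accumulated CPU. This is a generic sketch, not anything SCOM-specific; the server and instance names are placeholders, and it assumes the SqlServer (or SQLPS) module is available for Invoke-Sqlcmd.
# Rough sketch: top statements by accumulated CPU time in the plan cache.
# Placeholder server/instance name; requires the SqlServer or SQLPS module.
$query = @"
SELECT TOP 10
    qs.total_worker_time / 1000 AS TotalCpuMs,
    qs.execution_count AS Executions,
    SUBSTRING(st.text, 1, 200) AS QueryText
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY qs.total_worker_time DESC
"@
Invoke-Sqlcmd -ServerInstance "SQLSERVER\INSTANCE" -Database "OperationsManager" -Query $query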
If you want to see these config jobs with more clarity, the Config service logs them to the CS.WorkItem table:
SELECT * FROM cs.workitem ORDER BY WorkItemRowId DESC
You can filter these by Delta Sync or the daily Snapshot sync as well:
SELECT * FROM cs.workitem WHERE WorkItemName like '%delta%' ORDER BY WorkItemRowId DESC
SELECT * FROM cs.workitem WHERE WorkItemName like '%snap%' ORDER BY WorkItemRowId DESC
WorkItemStateId indicates whether a job succeeded or failed. It is normal to see some failures; for instance, when multiple management servers try to execute the same job, some of those attempts will fail by design. The common state values are listed below, and an example query that pulls just the recent failures follows the list.
1 = Running
10 = Failed
12 = Abandoned
15 = Timed out
20 = Succeeded
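For example, to pull just the recent failures and timeouts, here is a sketch using Invoke-Sqlcmd (from the SqlServer or SQLPS module); the server name is a placeholder:
# Recent Config service work items that failed (10) or timed out (15).
Invoke-Sqlcmd -ServerInstance "SQLSERVER\INSTANCE" -Database "OperationsManager" `
    -Query "SELECT TOP 50 * FROM cs.WorkItem WHERE WorkItemStateId IN (10, 15) ORDER BY WorkItemRowId DESC"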
What we found was that one of the MPs (the Dell Hardware MP) was consuming a large amount of SQL Server CPU time just to query some ManagedType views in the database. Many of these queries were running for over 10 minutes.
When we researched further, we found that the “Dell Windows Server (Detailed Edition)” management pack had been imported, and its documentation made no mention of scalability limitations. However, in a much older (4.x) version of the documentation, Dell specifically states that the Detailed MP is recommended only for small environments, where the monitored server count is less than 300 agents! We had already discovered, and were monitoring, over 5000 Dell servers.
This massive influx of discovery data was also causing config churn, with workflow binding delays showing up as 2115 events for discovery data:
Log Name: Operations Manager
Source: HealthService
Event ID: 2115
Computer: managementserver.domain.com
Description:
A Bind Data Source in Management Group Production has posted items to the workflow, but has not received a response in 1510 seconds. This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.CollectDiscoveryData
Instance : managementserver.domain.com
Instance Id : {B3FA7F2F-3D4A-236D-D3FD-119B3E01C3E3}
So, just delete the MP, right?
Well, let's talk about what must happen when we delete an MP. When you right-click an MP in the console to delete it, we must first delete any discovered instances of any classes defined in that MP (such as an instance of “Dell Server BIOS”). In order to delete an instance of a class, we must also first delete ALL monitoring data associated with that instance. And I don't mean simply marking it as “deleted” in the database; it must actually be deleted transactionally from the tables. This means all alerts, all monitor-based state changes, all events, all performance data, etc. This can be MASSIVE overhead.
What we actually experienced was the console locking up. We could track the SQL statements trying to delete the management pack and all the instance data, but these would eventually time out and never return anything to the console. The operation would simply go away, and all the while our MP still existed.
So what can we do?
Well, we do have a possible solution: the Remove-SCOMDisabledClassInstance PowerShell cmdlet. This cmdlet allows us to delete the discovered instance data methodically and slowly. What it does is delete any discovered instances in the management group whose discovery is explicitly disabled via an override.
So, we find all the discoveries in the Dell Detailed MP and create a new override MP to store a disable override for each discovery. Then we run Remove-SCOMDisabledClassInstance. This will run and run and run, seemingly forever, until it returns with no errors. In many cases even this cmdlet will time out or crash with an exception, which can be normal when deleting a massive amount of data. A sketch of the process is below.
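Here is a minimal sketch of that process. It assumes you have already created an empty, unsealed override MP (the name “Dell.Detailed.Overrides” below is just an example), and that the display name filter actually matches the Dell Detailed MP in your environment; adjust both to fit.
# A minimal sketch; the MP names below are examples, adjust for your environment.
Import-Module OperationsManager

$dellMP     = Get-SCOMManagementPack -DisplayName "*Dell*Detailed*"   # the Dell Detailed MP
$overrideMP = Get-SCOMManagementPack -Name "Dell.Detailed.Overrides"  # empty unsealed override MP

# Create an enforced disable override for every discovery defined in the Dell MP.
foreach ($discovery in (Get-SCOMDiscovery -ManagementPack $dellMP))
{
    Disable-SCOMDiscovery -Discovery $discovery -ManagementPack $overrideMP -Enforce
}

# Now start deleting the disabled instances. Expect timeouts on large data sets;
# just run it again when it fails.
Remove-SCOMDisabledClassInstance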
One trick to help with this process is to set your state, performance, and event retention in the OpsDB to ONE day, and then run grooming. This greatly reduces the amount of data we must delete transactionally.
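Here is a rough sketch of that trick using the OperationsManager cmdlets. The parameter names are the ones I believe SCOM 2012 R2 exposes; double-check with Get-SCOMDatabaseGroomingSetting first, and record your current retention so you can set it back afterwards.
# Rough sketch: drop state change, performance, and event retention to 1 day.
# Record your current settings first so you can restore them later.
Import-Module OperationsManager
Get-SCOMDatabaseGroomingSetting   # note the current values

Set-SCOMDatabaseGroomingSetting -StateChangeEventDaysToKeep 1 `
                                -PerformanceDataDaysToKeep 1 `
                                -EventDaysToKeep 1

# Grooming normally runs on its own daily schedule. To force it right away, you can
# execute the p_PartitioningAndGrooming stored procedure against the OperationsManager
# database (for example from SQL Management Studio).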
Then, just keep running Remove-SCOMDisabledClassInstance. In this specific case, because the amount of data was so large, it took over a day and probably over 100 executions before the instances were all removed. You can track the instances being removed by creating a query that counts the records in the ManagedType tables you are deleting from. Here is part of the one I crafted for this MP:
select sum(TCount) As TotalCount from (
  select count(*) as TCount from MT_Dell$WindowsServer$Server union all
  select count(*) as TCount from MT_Dell$WindowsServer$BIOS union all
  select count(*) as TCount from MT_Dell$WindowsServer$Detailed$MemoryUnit union all
  select count(*) as TCount from MT_Dell$WindowsServer$Detailed$ProcUnit union all
  select count(*) as TCount from MT_Dell$WindowsServer$Detailed$PSUnit union all
  select count(*) as TCount from MT_Dell$WindowsServer$EnclosurePhysicalDisk union all
  select count(*) as TCount from MT_Dell$WindowsServer$ControllerConnector
) as T
As you run the Remove-SCOMDisabledClassInstance command, you will see these instance counts slowly eroding. You just have to keep running it until it completes without a timeout or an exception.
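If you get tired of re-running it by hand, a simple retry loop works too. This is just a sketch; it assumes the cmdlet throws a terminating error when it hits a timeout (which is what we observed), and it suppresses the confirmation prompt.
# Keep re-running Remove-SCOMDisabledClassInstance until a pass completes cleanly.
do {
    $failed = $false
    try {
        Remove-SCOMDisabledClassInstance -Confirm:$false -ErrorAction Stop
    }
    catch {
        $failed = $true
        Write-Warning "Deletion pass failed: $($_.Exception.Message). Retrying in 60 seconds..."
        Start-Sleep -Seconds 60
    }
} while ($failed)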
Once the instance count gets to zero, you can delete the MP. This time we found the MP deleted in seconds!
Now that this MP was gone, the expensive queries stopped, and we saw the binding on discovery data return to a more reasonable occurrence count and time value.
The lesson to learn here is: be careful when importing MPs. A badly written MP, or an MP designed for small environments, might wreak havoc in larger ones. Sometimes the recovery from this can be long and quite painful. An MP that tests out fine in your dev SCOM environment might have issues that won't be seen until it moves into production. You should always monitor a production SCOM deployment for changes after a new MP is brought in, to ensure you don't see a negative impact. Check the management server event logs, management server CPU, database size, and disk/CPU performance for any big change from your established baselines.
If you are designing a large agent deployment that nears our maximum scalability (currently 15,000 agents), great consideration must go into the management packs in scope. If you require management packs that discover a large instance space per agent, and/or have a large number of workflows, you might find that you cannot achieve the maximum scale.