I see this event a lot in customer environments. I am not an expert on troubleshooting this here… but saw this post in the MS newsgroups and felt it was worth capturing….
My experience has been that it is MUCH more common to see these when a management pack collects way too much discovery data than when there is any real performance problem with the data warehouse. In most cases, if the issue started right after bringing in a new MP, deleting that MP solves the problem. I have seen this repeatedly after importing the Cluster MP or the Exchange 2007 MP, but haven’t been able to fully investigate the root cause yet.
In a nutshell…. if they are happening just a couple times an hour…. and the time in seconds is fairly low (under a few minutes) then this is normal.
If they are happening very frequently – like every minute, and the times are increasing – then there is an issue that needs to be resolved.
Taken from the newsgroups:
——————————————-
In OpsMgr 2007 one of the performance concerns is DB/DW data insertion performance. Here is a description of how to identify and troubleshoot problems with DB/DW data insertion.
Symptoms:
DB/DW write action workflows run on a Management Server. They first keep data received from Agents / Gateways in an internal buffer, then create a batch of data from the buffer and insert that batch into the DB / DW. When the insertion of the first batch finishes, they create another batch and insert it. The size of the batch depends on how much data is available in the buffer when the batch is created, up to a maximum of 5000 data items per batch. If the incoming data item throughput (from Agent / Gateway) becomes larger, or the data item insertion throughput (to DB / DW) becomes smaller, the buffer will tend to accumulate more data and the batch size will tend to become larger. Different write action workflows run on an MS, and they handle data insertion to DB / DW for different types of data:
- Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange
- Microsoft.SystemCenter.DataWarehouse.CollectPerformanceData
- Microsoft.SystemCenter.DataWarehouse.CollectEventData
- Microsoft.SystemCenter.CollectAlerts
- Microsoft.SystemCenter.CollectEntityState
- Microsoft.SystemCenter.CollectPublishedEntityState
- Microsoft.SystemCenter.CollectDiscoveryData
- Microsoft.SystemCenter.CollectSignatureData
- Microsoft.SystemCenter.CollectEventData
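The buffering behavior described before the workflow list can be illustrated with a small simulation. This is only a sketch of the idea, not how the write action modules are actually implemented: the tick-based model and the names `take_batch` and `simulate` are assumptions made for illustration. Only the 5000-item cap comes from the description above.

```python
from collections import deque

MAX_BATCH_SIZE = 5000  # maximum data items per batch, per the description above


def take_batch(buffer: deque, max_size: int = MAX_BATCH_SIZE) -> list:
    """Remove and return up to max_size items from the front of the buffer."""
    count = min(len(buffer), max_size)
    return [buffer.popleft() for _ in range(count)]


def simulate(arrivals_per_tick: int, inserts_per_tick: int, ticks: int) -> list:
    """Return the size of each batch created over a number of ticks.

    Hypothetical model: only one batch is in flight at a time, and a new
    batch is created only once the previous one has been fully inserted.
    """
    buffer = deque()
    batch_sizes = []
    remaining = 0  # items of the in-flight batch still to be inserted
    for _ in range(ticks):
        buffer.extend(range(arrivals_per_tick))
        if remaining == 0 and buffer:
            batch = take_batch(buffer)
            batch_sizes.append(len(batch))
            remaining = len(batch)
        remaining = max(0, remaining - inserts_per_tick)
    return batch_sizes


# Insertion keeps up with arrivals: batch size stays flat.
print(simulate(arrivals_per_tick=100, inserts_per_tick=1000, ticks=5))
# Insertion is slower than arrivals: each new batch is larger than the last.
print(simulate(arrivals_per_tick=100, inserts_per_tick=50, ticks=7))
```

In this model, slower insertion (or faster arrival) makes every new batch larger until it reaches the cap, which is the pattern the performance counters further down are meant to reveal.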
When a DB/DW write action workflow on a Management Server notices that the insertion of a single data batch is slow (i.e. slower than 1 minute), it will start to log a 2115 NT event to the OpsMgr NT event log once every minute until the batch is inserted into the DB / DW or is dropped by the DB / DW write action module. So you will see 2115 events in the management server’s “Operations Manager” NT event log when it is slow to insert data into the DB / DW. You might also see 2115 events when there is a burst of data items coming to the Management Server and the number of data items in a batch is large. (This can happen when a large amount of discovery data is being inserted, for example from a freshly imported or noisy management pack.)
2115 events have 2 important pieces of information: the name of the workflow that has the insertion problem, and the pending time since the workflow started inserting the last data batch. Here is an example of a 2115 event:
————————————
A Bind Data Source in Management Group OpsMgr07PREMT01 has posted items to the workflow, but has not received a response in 3600 seconds. This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.CollectSignatureData
Instance : MOMPREMSMT02.redmond.corp.microsoft.com
Instance Id : {6D52A6BB-9535-9136-0EF2-128511F264C4}
——————————————
This 2115 event is saying that the DB write action workflow “Microsoft.SystemCenter.CollectSignatureData” (which writes performance signature data to the DB) is trying to insert a batch of signature data into the DB. It started inserting 3600 seconds ago, but the insertion has not finished yet. Normally, inserting a batch should finish within 1 minute.
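To pull those two pieces of information out of an event description, a minimal sketch could look like the following. The function name `parse_2115` and the regular expressions are assumptions for illustration, based only on the wording of the sample event shown here.

```python
import re

# Patterns based on the wording of the sample 2115 event description.
PENDING_RE = re.compile(r"has not received a response in (\d+) seconds")
WORKFLOW_RE = re.compile(r"Workflow Id\s*:\s*(\S+)")


def parse_2115(description: str):
    """Return (workflow_id, pending_seconds), or None if the text does not match."""
    pending = PENDING_RE.search(description)
    workflow = WORKFLOW_RE.search(description)
    if not pending or not workflow:
        return None
    return workflow.group(1), int(pending.group(1))


sample = (
    "A Bind Data Source in Management Group OpsMgr07PREMT01 has posted items "
    "to the workflow, but has not received a response in 3600 seconds. "
    "This indicates a performance or functional problem with the workflow.\n"
    "Workflow Id : Microsoft.SystemCenter.CollectSignatureData\n"
    "Instance : MOMPREMSMT02.redmond.corp.microsoft.com\n"
    "Instance Id : {6D52A6BB-9535-9136-0EF2-128511F264C4}"
)

print(parse_2115(sample))
# ('Microsoft.SystemCenter.CollectSignatureData', 3600)
```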
Normally, there should not be many 2115 events on a Management Server. If they happen less than 1 or 2 times every hour (per write action workflow), it is not a big concern, but if they happen more often than that, there is a DB / DW insertion problem.
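As a rough illustration of that rule of thumb, the sketch below counts 2115 events per workflow per clock hour and reports the buckets that exceed the "1 or 2 per hour" guideline. The function name `noisy_workflows`, the input shape, and the choice of clock-hour buckets are assumptions; how you export the events is up to you.

```python
from collections import Counter
from datetime import datetime

NORMAL_MAX_PER_HOUR = 2  # "less than 1 or 2 times every hour" per workflow


def noisy_workflows(events, max_per_hour=NORMAL_MAX_PER_HOUR):
    """events: iterable of (timestamp, workflow_id) pairs for 2115 events.

    Returns a sorted list of (workflow_id, hour_start, count) for every
    workflow/hour bucket with more than max_per_hour events.
    """
    buckets = Counter()
    for ts, workflow in events:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(workflow, hour)] += 1
    return sorted(
        (wf, hour, n) for (wf, hour), n in buckets.items() if n > max_per_hour
    )


events = [
    (datetime(2008, 5, 5, 14, 35, 3), "Microsoft.SystemCenter.DataWarehouse.CollectEventData"),
    (datetime(2008, 5, 5, 14, 36, 5), "Microsoft.SystemCenter.DataWarehouse.CollectEventData"),
    (datetime(2008, 5, 5, 14, 37, 6), "Microsoft.SystemCenter.DataWarehouse.CollectEventData"),
    (datetime(2008, 5, 5, 14, 40, 0), "Microsoft.SystemCenter.CollectDiscoveryData"),
]

for workflow, hour, count in noisy_workflows(events):
    print(workflow, hour, count)
```

With this input only the DataWarehouse.CollectEventData workflow is reported, because it logged three events in the same hour while CollectDiscoveryData logged one.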
The following performance counters on the Management Server give information about the DB / DW write action insertion batch size and insertion time. If the batch size is becoming larger (by default the maximum batch size is 5000), the management server is either slow in inserting data into the DB / DW or is getting a burst of data items from Agent / Gateway. The DB / DW write action’s Avg. Processing Time shows how much time it takes to write a batch of data to the DB / DW.
- OpsMgr DB Write Action Modules(*)\Avg. Batch Size
- OpsMgr DB Write Action Modules(*)\Avg. Processing Time
- OpsMgr DW Writer Module(*)\Avg. Batch Processing Time, ms
- OpsMgr DW Writer Module(*)\Avg. Batch Size
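Once you have collected a few samples of an Avg. Batch Size counter, a simple check like the one below can help you read the trend. It is a sketch only: the function name, the three labels, and the "strictly growing" rule are assumptions made for illustration. The 5000 limit is the default maximum batch size mentioned above.

```python
MAX_BATCH_SIZE = 5000  # default maximum batch size stated above


def assess_batch_sizes(samples, max_batch=MAX_BATCH_SIZE):
    """Classify a chronological series of 'Avg. Batch Size' samples.

    Returns 'at-limit' if the latest sample has reached the maximum,
    'growing' if every sample is larger than the one before it,
    otherwise 'steady'.
    """
    if not samples:
        return "steady"
    if samples[-1] >= max_batch:
        return "at-limit"
    if len(samples) > 1 and all(b > a for a, b in zip(samples, samples[1:])):
        return "growing"
    return "steady"


print(assess_batch_sizes([100, 120, 90]))     # steady
print(assess_batch_sizes([100, 400, 1600]))   # growing
print(assess_batch_sizes([3000, 5000]))       # at-limit
```

A "growing" or "at-limit" result does not say which side is at fault. As noted above, it can mean slow insertion or simply a burst of incoming data, so compare it with the processing time counters.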
Possible root causes:
- In OpsMgr, discovery data insertion is relatively expensive, so a discovery burst (a short period of time when a lot of discovery data is received by the management server) could cause 2115 events complaining about slow insertion of discovery data. Discovery insertion should not happen frequently, so if you consistently see 2115 events for discovery data collection, you either have a DB / DW insertion problem or some discovery rules in an MP are collecting too much discovery data.
- An OpsMgr Config update caused by an instance space change or MP import will impact CPU utilization on the DB and will have an impact on DB data insertion. After importing a new MP or after a big instance space change in a large environment, you will probably see more 2115 events than normal.
- Expensive UI queries can impact resource utilization on the DB and could have an impact on DB data insertion. When a user is doing an expensive UI operation, you will probably see more 2115 events than normal.
- When the DB / DW is out of space or offline, you will find the Management Server keeps logging 2115 events to the NT event log, and the pending time keeps getting higher.
- Sometimes an invalid data item sent from an Agent / Gateway will cause a DB / DW insertion error, which will end up as a 2115 event complaining about slow DB / DW insertion. In this case, please check the OpsMgr event log for relevant error events. This is more common in DW write action workflows.
- If the DB / DW hardware is not configured properly, there could be a performance issue, and it could cause slow data insertion to the DB / DW. The problem could be:
- The network link between the DB / DW and the MS is slow (either bandwidth is small or latency is large; as a best practice we recommend the MS be on the same LAN as the DB / DW).
- The data / log / tempdb disks used by the DB / DW are slow (we recommend separating data, log and tempdb onto different disks, using RAID 10 instead of RAID 5, and turning on the write cache of the array controllers).
- The OpsDB tables are too fragmented (this is a common cause of DB performance issues). Reindexing the affected tables will solve this issue.
- The DB / DW does not have enough memory.
Now – that is the GENERAL synopsis and how to attack them. Next – we will cover a specific issue we are seeing with a specific type of 2115 Event:
———————————————–
It appears we may be hitting a cache resolution error that we have been trying to catch for a while. This concerns the CollectEventData workflow. The error is very hard to catch, and we’re including a fix in SP2 to avoid it. There are two ways to resolve the problem in the meantime. Since the error happens very rarely, you can just restart the Health Service on the affected Management Server. Or you can prevent it from blocking the workflow by creating overrides in the following way:
———————————————–
1) Launch Console, switch to Authoring space and click “Rules”
2) In the top right-hand side of the screen click “Change Scope”
3) Select “Data Warehouse Connection Server” in the list of types, click “Ok”
4) Find “Event data collector” rule in the list of rules
5) Right click “Event data collector” rule, select Overrides/Override the Rule/For all objects of type…
6) Set Max Execution Attempt Count to 10
7) Set Execution Attempt Timeout Interval Seconds to 6
That way, if the DW event writer fails to process an event batch for about a minute, it will discard the batch. 2115 events related to DataWarehouse.CollectEventData should go away after you apply these overrides. BTW, while you’re at it you may want to override “Max Batches To Process Before Maintenance Count” to 50 if you have a relatively large environment. We think 50 is a better default setting than SP1’s 20 in this case, and we’ll switch the default to 50 in SP2.
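The "about a minute" figure follows from multiplying the two override values. A tiny worked example, assuming the attempts run back to back with no extra delay between them (the helper name is hypothetical):

```python
def discard_after_seconds(max_attempts: int, attempt_timeout_seconds: int) -> int:
    """Worst-case time spent retrying a batch before it is discarded.

    Assumes attempts run back to back with no delay in between.
    """
    return max_attempts * attempt_timeout_seconds


# Max Execution Attempt Count = 10, Execution Attempt Timeout Interval Seconds = 6
print(discard_after_seconds(10, 6))  # prints 60
```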
————————————————-
Essentially – to know if you are affected by the specific 2115 issue described above – here are the criteria:
1. You are seeing 2115 bind events in the OpsMgr event log of the RMS or MS, and they are recurring every minute.
2. The events have a Workflow Id of: Microsoft.SystemCenter.DataWarehouse.CollectEventData
3. The “has not received a response” time is increasing, and growing to be a very large number over time.
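The three criteria above can be sketched as a simple check over a list of parsed events. Treat it as an illustration only: the function name, the minimum of three events, and the 90 second tolerance for "every minute" are assumptions, and it does not try to judge what counts as a "very large" number.

```python
from datetime import datetime, timedelta

DW_EVENT_WORKFLOW = "Microsoft.SystemCenter.DataWarehouse.CollectEventData"


def matches_specific_issue(events, max_gap=timedelta(seconds=90)):
    """events: chronological list of (timestamp, workflow_id, pending_seconds).

    True when at least three events for the DW CollectEventData workflow
    recur about every minute with a strictly increasing pending time.
    """
    hits = [(ts, sec) for ts, wf, sec in events if wf == DW_EVENT_WORKFLOW]
    if len(hits) < 3:
        return False
    pairs = list(zip(hits, hits[1:]))
    recurring = all(b[0] - a[0] <= max_gap for a, b in pairs)
    increasing = all(b[1] > a[1] for a, b in pairs)
    return recurring and increasing


events = [
    (datetime(2008, 5, 5, 14, 35, 3), DW_EVENT_WORKFLOW, 706471),
    (datetime(2008, 5, 5, 14, 36, 5), DW_EVENT_WORKFLOW, 706533),
    (datetime(2008, 5, 5, 14, 37, 6), DW_EVENT_WORKFLOW, 706594),
]

print(matches_specific_issue(events))  # True
```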
Here is an example of a MS with the problem: Note consecutive events, from the CollectEventData workflow, occurring every minute, with the time being a large number and increasing:
Event Type: Warning
Event Source: HealthService
Event Category: None
Event ID: 2115
Date: 5/5/2008
Time: 2:37:06 PM
User: N/A
Computer: MS1
Description:
A Bind Data Source in Management Group MG1 has posted items to the workflow, but has not received a response in 706594 seconds. This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEventData
Instance : MS1.domain.com
Instance Id : {646486D0-E366-03CA-38E7-79A0D6F34F82}
Event Type: Warning
Event Source: HealthService
Event Category: None
Event ID: 2115
Date: 5/5/2008
Time: 2:36:05 PM
User: N/A
Computer: MS1
Description:
A Bind Data Source in Management Group MG1 has posted items to the workflow, but has not received a response in 706533 seconds. This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEventData
Instance : MS1.domain.com
Instance Id : {646486D0-E366-03CA-38E7-79A0D6F34F82}
Event Type: Warning
Event Source: HealthService
Event Category: None
Event ID: 2115
Date: 5/5/2008
Time: 2:35:03 PM
User: N/A
Computer: MS1
Description:
A Bind Data Source in Management Group MG1 has posted items to the workflow, but has not received a response in 706471 seconds. This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEventData
Instance : MS1.domain.com
Instance Id : {646486D0-E366-03CA-38E7-79A0D6F34F82}
reg add "HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters" /v "Thread Pool CLR Max Thread Count Min" /t REG_DWORD /d 512 /f
reg add "HKLM\SYSTEM\CurrentControlSet\services\HealthService\Parameters" /v "Thread Pool CLR Min Thread Count" /t REG_DWORD /d 50 /f
———————————————————————————————–
The above Registry keys fixed issues my customers had with 2115 events as well. I felt like I should put this here for clarity.
Hi Blake, those reg keys also resolved an issue with 2115 on my end. JFYI.
Greetings,
Stoyan
It looks like the two reg settings Blake posted helped with my endless 2115 events too. I only saw them on the RMS, but for all non-DW workflows.
I will monitor, and return if it turns out I was wrong.
The reg key settings can be found in the
“Guide to System Center Management Pack for Microsoft Azure CTP”
How can we troubleshoot 2115 errors relating to Microsoft.SystemCenter.CollectDiscoveryData? (assuming we don’t think it’s SQL Perf insertions that is the issue)
Are there any queries which can help us determine the MPs or workflows which are posting the most Discovery Data? (I see queries for Events, Alert, Perf and State Data, but not so much for Discovery).
I usually focus on my Config Churn queries for that.
I’ve been seeing similar errors the last couple weeks. Would the reg entries mentioned above work for any 2115 event or are they specific to the DataWarehouse.CollectEventData workflow?
The 2 I’ve been seeing lately:
A Bind Data Source in Management Group ******** has posted items to the workflow, but has not received a response in 60 seconds. This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.DataWarehouse.CollectEntityHealthStateChange
Instance : ******MS03.****.*******.ORG
Instance Id : {15835AE2-30FD-5BC7-CCBB-40133BED011A}
——————————————————————————————–
A Bind Data Source in Management Group ******** has posted items to the workflow, but has not received a response in 60 seconds. This indicates a performance or functional problem with the workflow.
Workflow Id : Microsoft.SystemCenter.Apm.CollectApplicationDiagnosticsEvents
Instance : ******MS03.****.*******.ORG
Instance Id : {15835AE2-30FD-5BC7-CCBB-40133BED011A}
Hi.
I have a lot of 2115 events (only regarding the Ops DB). They started to show up a week ago, and I get five of them every minute. I have followed several guides/articles. One of them is https://learn.microsoft.com/en-us/troubleshoot/system-center/scom/troubleshoot-event-2115-related-performance-problems and I also checked https://kevinholman.com/2008/04/21/event-id-2115-a-bind-data-source-in-management-group/ . None of them matches our problem. I cannot see any issues with disk latency, CPU usage, or memory usage. Both the Ops DB and the DW DB have way more than 50% unallocated space. Nothing has been changed in SCOM or the surrounding systems in the last few weeks. I have no MPs to update within the SCOM console update feature. I did see logon failures for the Data Access and Management Configuration accounts in the SQL log, but those matched a server restart, when the SQL service and DBs were not online yet while SCOM tried to log in. I solved this by setting the services to Auto Delayed start (to give SQL time to start before SCOM). Affected workflows below.
I would really appreciate some help or guidance.
Microsoft.SystemCenter.CollectPerformanceData
Microsoft.SystemCenter.CollectPublishedEntityState
Microsoft.SystemCenter.CollectEventData
Microsoft.SystemCenter.CollectAlerts
Microsoft.SystemCenter.CollectDiscoveryData
How big are the numbers in the event? Are they staying in the 60-180 seconds range? Or are they all climbing with each interval? If all climbing, I’d stop all three SCOM services on each management server, and flush the healthservice cache, then start the services back up – look for errors, and see how soon 2115’s come back.
Hi Kevin, and thanks for reply.
Numbers are very high and climbing all the time. Now the events for all 5 workflows are above 44 000 seconds and still climbing. I have now stopped all three services and deleted the Health Service State folder on our only Management server, then restarted the server and waited. It takes around 15 minutes before event 2115 starts showing up again, and the numbers then continue to climb every minute.
Similar 2115 event ids with Management servers greying out. All 5 workflows affected, 720 seconds and climbing.
Hi all. It has been some time since the last status update. I now want to share our findings and what the root cause of our problem was.
First off, Kevin helped us collect a lot of data, event logs and so on. Thanks a lot for your help. This did not point us in any specific direction. We continued to rule things out: removing unnecessary MPs, changing to a fresh DAS account, giving the DAS account very high permissions locally on the SCOM Management server, upgrading to SCOM 2025 (seems like a dumb idea, but we wanted to test), and so on. Nothing helped. After a few weeks of troubleshooting I felt the pressure to get a working monitoring system up and running. What if something major happens and we do not know about it? We decided to set up a new SCOM server based on Kevin’s step by step/quick guides. When I got to the stage of configuring SNMP to monitor our network switches, I found that we could not communicate via SNMP with our switches. Strangely, the old SCOM server had not reported anything about this issue. Anyway, we fixed the firewall rules (affecting both the new and old SCOM servers) to get SNMP and discovery of the switches working with the new SCOM server. And suddenly event 2115 disappeared on the old SCOM server. Then my colleague found this article (https://learn.microsoft.com/en-us/answers/questions/193587/event-2115-after-install-sqlserver-to-cu15). I know I had seen it before, but I had not read it carefully. The article points to specific situations where network devices are monitored and the network traffic is not working as it should, which can cause 2115 events in SCOM. Those events point toward a completely different thing to troubleshoot.
The root cause was our smart firewall receiving updates with new App IDs for SNMP. That made SNMP communication stop working, and in some way this also made SCOM fail after some time (days or even weeks).
Hope above can help someone. And again, a very big thanks to Kevin Holman.