We often think of tuning OpsMgr in terms of tuning “Alert Noise”: disabling rules that generate alerts we don’t care about, or modifying thresholds on monitors to make the alerts more actionable for our specific environment.
However, one area of OpsMgr that often goes overlooked is event over-collection. This has a cost, because these events are collected and create LAN/WAN traffic, agent overhead, OpsDB size bloat, and especially DataWarehouse size bloat. I have worked with customers who had a data warehouse that was over one third event data, and they had ZERO requirement for this, nor did they want it. They were paying for disk storage and backup expense, plus added time and resources on the framework, all for data they cared nothing about.
MOST of these events are collected out of the box by default OpsMgr collection rules from the “System Center Core Monitoring” MP. These events are items like “config requested”, “config delivered”, “new config active”. They might be interesting, but there is no advanced analysis included that uses them to detect a problem. In small environments, they are not usually a big deal. But in environments with a large agent count, these events can account for a LOT of data, and they provide little value unless you are doing something advanced in analyzing them. I have yet to see a customer who did that.
At a high level – here is how I like to review these events:
- Run the Most Common Events query against your OpsDB and review the output.
- Create a “My Workspace” view for each event that has a HIGH event count.
- Examine the event details for value to YOU.
- View the rule that collected the event.
- Does the rule also alert or do anything special, or does it simply collect the event?
- Do you think the event is required for any special reporting you do?
- Create an Override, in an Override MP for the rule source management pack, to disable the rule.
- Continue to the next event in the query output, and evaluate it.
So, what I like to do is run the “Most Common Events” query against the OpsDB, examine the top events, and consider disabling the rules that collect them:
Most common events by event number and event publishername:
-- Top 20 most common events in the OpsDB, grouped by event number and publisher name
SELECT TOP 20 Number AS EventID, COUNT(*) AS TotalEvents, PublisherName AS EventSource
FROM EventAllView eav WITH (NOLOCK)
GROUP BY Number, PublisherName
ORDER BY TotalEvents DESC
The trick is to run this query periodically, and to examine the most common events for YOUR environment. The easiest way to view these events and determine their value is to create a new Events view in My Workspace for each event, and then look at the event data and the rule that collected it. (I will use a common event, 21024, as an example.)
What we can see is that this is a very typical event, and there is likely no real value in collecting and storing it in the OpsDB or Warehouse.
Next – I will examine the rule. I will look at the Data Source section, and the Response section. The purpose here is to get a good idea of where this collection rule is looking, what events it is collecting, and if there is also an alert in the response section. If there is an alert in the response section – I assume this is important, and will generally leave these rules enabled.
If the rule simply collected the event (no alerting), is not used in any reports that I know about (rare condition) and I have determined the event provides little to no value to me, I disable it. You will find you can disable most of the top consumers in the database.
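The review criteria above can be written down as a small decision function. This is only a sketch of the reasoning, not anything that talks to SCOM: the `CollectionRule` class, its field names, and the sample rule names are hypothetical and exist purely for illustration. You would fill in the answers by hand after looking at the rule in the console.

```python
from dataclasses import dataclass


@dataclass
class CollectionRule:
    """What you learned about one event collection rule (hypothetical structure)."""
    name: str               # display name of the rule
    generates_alert: bool   # the Response section also raises an alert
    used_in_reports: bool   # a report you know about depends on this event
    valuable_to_you: bool   # your own judgment after reading the event details


def should_disable(rule: CollectionRule) -> bool:
    """Return True when the rule only collects an event that nobody needs."""
    if rule.generates_alert:
        return False  # alerting rules are assumed important: leave enabled
    if rule.used_in_reports:
        return False  # rare, but keep anything a report relies on
    return not rule.valuable_to_you


# Made-up examples: one plain collection rule, one rule that also alerts.
plain = CollectionRule("Example: collect config event", False, False, False)
alerting = CollectionRule("Example: collect and alert", True, False, False)
print(should_disable(plain), should_disable(alerting))
```

The order of the checks mirrors the order I look at things in the console: the alert check comes first because it is the quickest reason to leave a rule alone.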
Here is why I consider it totally cool to disable these uninteresting event collection rules:
- If they are really important, there will be a different alert-generating rule to fire an alert.
- They fill the databases and agent queues with unimportant information, and add agent load and network traffic along the way.
- While troubleshooting a real issue, we would examine the agent event log. We wouldn’t search through the database for collected events.
- Reporting on events is really slow, because we cannot aggregate them, so views and reports don’t work well with events.
- If we find we do need one later – simply remove the override.
Here is an example of this one:
So – I create an override in my “Overrides – System Center Core” MP, and disable this rule “for all objects of class”.
Here are some very common event IDs whose event collection rules I will generally end up disabling:
1206
1210
1215
1216
10102
10401
10403
10409
10457
10720
11771
21024
21025
21402
21403
21404
21405
29102
29103
I don’t recommend everyone disable all of these rules… I recommend you periodically view your top 10 or 20 events… and then review them for value. Just knocking out the top 10 events will often free up 90% of the space they were consuming.
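To check how much of the volume your top few events really account for, you can paste the query output into a short script. The sketch below is a minimal illustration in Python. The function name `cumulative_share` and the sample counts are made up for the example, and the share is relative to the rows you feed in (the top 20 returned by the query), not to the whole event table.

```python
def cumulative_share(rows, top_n):
    """Fraction of all listed events contributed by the top_n event IDs.

    rows: iterable of (event_id, total_events, event_source) tuples,
          in the same column order as the Most Common Events query output.
    """
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    total = sum(r[1] for r in ordered)
    if total == 0:
        return 0.0  # nothing collected, nothing to trim
    return sum(r[1] for r in ordered[:top_n]) / total


# Hypothetical sample output; the counts and source names are NOT real data.
sample = [
    (21024, 5000, "ExampleSource"),
    (21025, 3000, "ExampleSource"),
    (1206, 1500, "ExampleSource"),
    (10102, 400, "ExampleSource"),
    (11771, 100, "ExampleSource"),
]

for n in (1, 2, 3):
    print(f"top {n}: {cumulative_share(sample, n):.0%} of listed events")
```

If the top handful of rows already covers most of the total, those are the collection rules worth reviewing first.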
The above events are the ones I run into at most of my customers, and I generally turn these off, as we get no value from them. You might find you have some other events as your top consumers. I recommend you review them in the same manner as above, methodically. Then revisit this every month or two to see if anything changed.
I’d also love to hear if you have other events that you see as your top consumers that aren’t on my list above. SOME events are created from script (conversion MP’s), and unfortunately you cannot do much about those, because you would have to disable the script to fix them. I’d be happy to give feedback on those, or add any new ones to my list.
Hey Kevin, great site with great info. Just wanted to ask about my top events, as they are mostly different than what you have posted. Can Event ID 17 be disabled?
EventID TotalEvents EventSource
17 90664 Health Service Script
342 35381 AD FS
11771 3434 Health Service Modules
400 2973 MSExchange Monitoring SmtpConnectivity
1206 2808 HealthService
Anything can be disabled! Just check the rule that is collecting it – and ensure that rule doesn’t also alert, and if so – determine if that alert is necessary. Almost any event collection can be turned off. I am not sure if I have ever seen one that was beneficial.
Thanks Kevin, appreciate your response.
David
Hi Kevin,
I’ve recently witnessed a massive number of events with IDs 7000 and 7038. Your blog post gave me some guidance on how to deal with these, thanks!
Br, Leon
I had the product group disable these event collections in the latest Base OS MP’s, so this “should” be getting better. Service termination events can flood SCOM. I have seen more than one DB fill up because of this.
Hi Kevin,
I was seeing 1207 events on some of the agents where cluster services are running, and that’s expected, as the clustered instances are monitored as agentless and thus those rules/monitors (including discoveries) will not run on them. But the strange thing is that those discoveries are disabled by default in SCOM and shouldn’t be initiated on any agent. Note that there is no alert for these warning events in SCOM. Below is one example. The Nano Discovery is disabled by default.
Rule/Monitor “Microsoft.SystemCenter.Agent.NanoDiscovery” running for remote instance “***************” with id:”{F621A329-BBF5-54FB-178F-A1031169BCF3}” will be disabled as it is not remotable. Management group “**********”.
Interesting, I never noticed that before. Even when disabled, the workflow does flow down to the agent; it just doesn’t initialize. This is probably something that happens prior to initialization.
Yes, that’s what it looks like. These warnings are logged just before the discovery scripts trigger and the Health Service Script logs are written into the Ops Mgr logs.
Hi Kevin, Bish,
I am also facing this issue. Lots of 1207 events have been logged on SQL clusters, and the example is similar to the one mentioned in this thread -[ Rule/Monitor “Microsoft.SystemCenter.Agent.NanoDiscovery” running for remote instance “***************” with id:”{F621A329-BBF5-54FB-178F-A1031169BCF3}” will be disabled as it is not remotable..]. I would request help/suggestions to overcome this.
Thank you both in advance.
Regards,
Suri
1207 events are completely normal.
Hi Kevin,
We have enabled Agentless Exception Monitoring and we get lakhs of events with event ID 9999. We don’t want to disable collection of event ID 9999 entirely, but the events number in the lakhs and we want to control them. When we checked, we found events where parameter1 is apphang, BEX64, PRELEAK64, etc., which are of no use to us. We only want the events where parameter1 is appcrash, for application crash analysis. We tried disabling the Client Monitoring error collection rule for a specific instance, but that doesn’t let us filter dynamically; it only gives us the option to disable the event for that instance. Is there any possibility to modify the rule and specify parameter1 as per our requirement?
Disable the rule and re-write it with criteria to ignore specific parameter values.
Hi Kevin,
We are facing the issue “Processing Backlogged Events Taking a Long Time” on 2016 domain controllers.
Error Description
The Windows Event Log Provider monitoring the Security Event Log is 372 minutes behind in processing
events. This can occur when the provider is restarted after being offline for some time, or there
are too many events to be handled by the workflow.
As we observed, the Security log is flooding with the event IDs mentioned below, almost 4 lakhs per day. We don’t need these collections in monitoring.
4776 – Credential Validation
4634 – Logoff
4624 – Logon
4672 – Special Logon
5136 – Directory Service Changes
Kindly suggest how to fix the issue.
Are your agents SCOM 2019 UR3?
If so – please revert back to UR2. There is an issue in UR3 reading events in high volume.
Second – SCOM is not designed to collect security events, and can harm the monitored system trying to do this. We recommend using Azure Log Analytics to collect and analyze security events. SCOM can monitor the security event log to a certain degree – but it is critical that the workflows are simple and do not create a resource issue trying to scan deep text in the events as criteria.