Sometimes agents will not “talk” to the management server upon initial installation, and sometimes an agent that has worked fine for a long time will suddenly become unhealthy. Agent health is an ongoing part of any OpsMgr admin’s life.
This post is NOT an “end to end” manual of all the factors that influence agent health…. but that is something I am working on for a later time. There are many factors in an agent’s ability to communicate and work as expected. A few key areas that commonly affect this are listed below (with a quick connectivity check sketched just after the list):
- DNS name resolution (Agent to MS, and MS to Agent)
- DNS domain membership (disjointed namespace)
- DNS suffix search order
- Kerberos connectivity
- Kerberos SPNs accessible
- Firewalls blocking TCP port 5723
- Firewalls blocking access to AD for authentication
- Packet loss
- Invalid or old registry entries
- Missing registry entries
- Corrupt registry
- Default agent action accounts locked down/out (HSLockdown)
- HealthService Certificate configuration issues.
- Hotfixes required for OS Compatibility
- Management Server rejecting the agent
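Several of these (name resolution, the 5723 channel port, Kerberos SPNs) can be sanity-checked quickly from the agent itself. Here is a minimal PowerShell sketch, assuming a newer OS that has Resolve-DnsName and Test-NetConnection – the management server name is a placeholder, substitute your own:

```powershell
# Hypothetical management server name - substitute your own.
$ms = "MS01.contoso.com"

# DNS: can this machine resolve the management server?
Resolve-DnsName $ms

# Firewall: is the agent/management server channel port reachable?
Test-NetConnection -ComputerName $ms -Port 5723

# Kerberos: is the MSOMHSvc SPN registered for the management server?
setspn.exe -Q "MSOMHSvc/$ms"
```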
How do you detect agent issues from the console? The problem might be that they are not showing up in the console at all! Perhaps a manual install never shows up in Pending Actions. Or a push deployment stays stuck in Pending Actions and never shows up under “Agent Managed”. Or one that does show up under “Agent Managed” but never shows as being monitored – never returning agent version data, etc.
One of the BEST things you can do when faced with an agent health issue… is to look on the agent, in the OperationsManager event log. This is a fairly verbose log that will almost always give you a good hint as to the trouble with the agent. That is ALWAYS one of my first steps in troubleshooting.
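If you want to pull the recent warnings and errors out of that log quickly, here is a simple sketch (the log shows up to Get-WinEvent as “Operations Manager”):

```powershell
# Show recent warnings and errors from the agent's OperationsManager event log.
Get-WinEvent -LogName "Operations Manager" -MaxEvents 200 |
    Where-Object { $_.LevelDisplayName -in 'Critical','Error','Warning' } |
    Select-Object TimeCreated, Id, LevelDisplayName, Message |
    Format-Table -AutoSize -Wrap
```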
Another way of examining agent health is with the built-in views in OpsMgr. In the console there is an Agent Health State view, located under the Operations Manager folder in the Monitoring pane. This view is important – because it gives us the agent’s health from two different perspectives:
1. The perspective of the agent monitors running on the agent, measuring its own “health”.
2. The perspective of the “Health Service Watcher”, which is the agent being monitored from a management server.
If any of these are red or yellow – that is an excellent place to start. This should be an area that your level 1 support for Operations Manager checks DAILY. We should never have a high number of agents that are not green here. If we do – it is indicative of an unhealthy environment, or the admin team not adhering to best practices (such as keeping up with hotfixes, using maintenance mode correctly, etc.).
Use Health Explorer on these views – to drill down into exactly what is causing the Agent, or Health Service Watcher state to be unhealthy.
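If you would rather script this daily check than eyeball the view, here is a minimal sketch using the OperationsManager PowerShell module (SCOM 2012 and later) – the management server name is a placeholder:

```powershell
Import-Module OperationsManager
New-SCOMManagementGroupConnection -ComputerName "MS01.contoso.com"   # hypothetical MS name

# List every agent whose health state is not green.
Get-SCOMAgent |
    Where-Object { $_.HealthState -ne 'Success' } |
    Select-Object DisplayName, HealthState, PrimaryManagementServerName
```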
Now…. the following are some general steps to take to “fix” broken agents. These are not in a definitive order – the right order really comes down to what you find when looking at the logs after taking each step.
- Start the HealthService on the agent. You might find the HealthService is just not running. This should not be common or systemic. Consider enabling the recovery for this condition to restart the HealthService on Heartbeat failure. However – if this is systemic – it is indicative of something causing your HealthService to restart too frequently, or administrators stopping SCOM. Look in the OpsMgr event log for verification.
- Bounce the HealthService on the agent. Sometimes this is all that is needed to resolve an agent issue. Look in the OpsMgr event log after a HealthService restart, to make sure it is clean with no errors.
- Clear the HealthService queue and config (manually). This is done by stopping the HealthService, deleting the “\Program Files\System Center Operations Manager 2007\Health Service State” folder, and then starting the HealthService. This removes the agent config file and the agent queue files. The agent starts up with no configuration, so it will resort to the registry to determine which management server to talk to. From the registry it will find out whether it is AD integrated, or which fixed management server to talk to if not. This is located under the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Agent Management Groups\PROD1\Parent Health Services\ key, in the \<#>\NetworkName string value. The agent will then contact the management server – request config, receive config, download the appropriate management packs, apply them, run the discoveries, send up discovery data, and repeat the cycle for a little while. This is very much what happens on a new agent during initial deployment. (A scripted version of this flush is sketched just after this list.)
- Clear the HealthService queue and config (from the console). When looking at the above view (or any state view or discovered inventory view which targets the HealthService or Agent class) there is a task in the actions pane – “Flush Health Service State and Cache”. This performs a very similar action to the manual flush above…. as a console task. It will only work on an agent that is somewhat responsive…. if it does not work, you need to perform the flush manually, as the agent is truly broken from a communication standpoint with the management server. This task will never complete and will never return success – because the task breaks off from itself as the queue is flushed.
- “Repair” the agent from the console. This is done from the Administration pane – Agent Managed. You should not run a repair on any AD-integrated agent – as this will break the AD integration and assign it to the management server that ran the repair action. A “repair” technically just reinstalls the agent in a push fashion, just like an initial agent deployment. It will also apply/reapply any agent related hotfixes in the management server’s \Program Files\System Center Operations Manager 2007\AgentManagement\ directories.
- Reinstall the agent (manually). This would be for manual installs, or when push/repair is not possible. This is where the combination of options gets a little tricky. When you are at this point – where you have given up – I find that going all the way with a brute-force reinstall is the best approach. This means performing the following steps (a scripted sketch of the cleanup follows this list):
- Uninstall the agent via add/remove programs.
- Run the Operations Manager Cleanup Tool, CleanMom.exe or CleanMOM64.exe. This is designed to make sure that the service, files, and all registry entries are removed.
- Ensure that the agent’s folder is removed at: \Program Files\System Center Operations Manager 2007\
- Ensure that the following registry keys are deleted:
- HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager
- HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HealthService
- Reboot the agent machine (if possible)
- Delete the agent from Agent Managed in the OpsMgr console. This will allow a new HealthService ID to be detected and is sometimes a required step to get an agent to work properly, although not always required.
- Now that the agent is gone cleanly from both the OpsMgr console and the agent operating system…. manually reinstall the agent. Keep it simple – install it using a named management server/management group, and use Local System for the agent action account (this removes any common issues with a low-priv domain account, and with AD integration if used). If it works correctly – you can always reinstall again using low-priv or AD integration.
- Remember to import certificates at this point if you are using those on the individual agent.
- As always – look in the OperationsManager event log…. this will tell you if it connected, and is working, or if there is a connectivity issue.
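For reference, here is the manual “clear the queue and config” step from above as a rough PowerShell sketch. The paths are the 2007-era ones used in this post (newer agents live under a Microsoft Monitoring Agent folder, so adjust for your version), and PROD1 is this post’s example management group name:

```powershell
# Stop the agent, clear its queue and config, then start it again.
Stop-Service HealthService

Remove-Item "C:\Program Files\System Center Operations Manager 2007\Health Service State" -Recurse -Force

# Optional: check which management server the agent will fall back to.
Get-ItemProperty "HKLM:\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Agent Management Groups\PROD1\Parent Health Services\0" |
    Select-Object NetworkName

Start-Service HealthService
```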
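And the leftover cleanup from the brute-force reinstall steps, as a sketch – run elevated, AFTER the Add/Remove Programs uninstall and the cleanup tool:

```powershell
# Remove any leftover agent files and registry keys after uninstalling.
Remove-Item "C:\Program Files\System Center Operations Manager 2007" -Recurse -Force -ErrorAction SilentlyContinue

Remove-Item "HKLM:\SOFTWARE\Microsoft\Microsoft Operations Manager" -Recurse -Force -ErrorAction SilentlyContinue
Remove-Item "HKLM:\SYSTEM\CurrentControlSet\Services\HealthService" -Recurse -Force -ErrorAction SilentlyContinue

# Reboot if possible, then delete the agent from Agent Managed in the console
# before reinstalling manually.
Restart-Computer -Confirm
```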
To summarize…. there are many things that can cause an agent issue, and many methods to troubleshoot. At a very general level, my typical steps are:
- Review OpsMgr event log on agent
- Bounce HealthService
- Bounce the HealthService, clearing the \Health Service State folder
- Complete brute force reinstall of the agent.
If an external issue is causing the problem (DNS, Kerberos, firewall) then these steps likely will not help you…. but evidence of those issues should be available in the OpsMgr event log.
Hi Kevin,
After we upgraded to UR7 we are frequently getting this error on the management servers. Can you please help me with this?
So far:
1. Rebooted the servers
2. Renamed the health state folder and started the services
Issue:
Management Configuration Service failed to process agent configuration request
Management Configuration Service failed to process agent configuration request. OpsMgr Management Configuration Service failed to process configuration request (Xml configuration file or management pack request) due to the following exception
Microsoft.EnterpriseManagement.ManagementConfiguration.Interop.HealthServicePublicKeyNotRegisteredException: Missing certificate for Healthservice id 5446a7a5-637f-d72d-5946-db4ec08abf4f
Server stack trace:
at Microsoft.EnterpriseManagement.RuntimeService.RootConnectorMethods.OnRetrieveSecureData(Guid healthServiceId, ReadOnlyCollection`1 addedSecureStorageReferences, ReadOnlyCollection`1 removedSecureStorageReferences, ReadOnlyCollection`1 addedSecureStorageElements, ReadOnlyCollection`1 removedSecureStorageElements, String hashAlgorithmName, Byte[]& hashValue)
at Microsoft.EnterpriseManagement.RuntimeService.SDKReceiver.OnRetrieveSecureData(Guid healthServiceId, ReadOnlyCollection`1 addedSecureStorageReferences, ReadOnlyCollection`1 removedSecureStorageReferences, ReadOnlyCollection`1 addedSecureStorageElements, ReadOnlyCollection`1 removedSecureStorageElements, String hashAlgorithmName, Byte[]& hashValue)
at System.Runtime.Remoting.Messaging.StackBuilderSink._PrivateProcessMessage(IntPtr md, Object[] args, Object server, Object[]& outArgs)
at System.Runtime.Remoting.Messaging.StackBuilderSink.SyncProcessMessage(IMessage msg)
Exception rethrown at [0]:
at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
at Microsoft.EnterpriseManagement.Mom.Internal.ISdkService.OnRetrieveSecureData(Guid healthServiceId, ReadOnlyCollection`1 addedSecureStorageReferences, ReadOnlyCollection`1 removedSecureStorageReferences, ReadOnlyCollection`1 addedSecureStorageElements, ReadOnlyCollection`1 removedSecureStorageElements, String hashAlgorithmName, Byte[]& hashValue)
at Microsoft.EnterpriseManagement.ManagementConfiguration.Communication.CredentialDataProvider.GetSecureDataUnwrapped(Guid agentId, ICollection`1 addedReferenceList, ICollection`1 deletedReferenceList, ICollection`1 addedCredentialList, ICollection`1 deletedCredentialList, Byte[]& hashValue)
at Microsoft.EnterpriseManagement.ManagementConfiguration.Communication.CredentialDataProvider.GetSecureData(Guid agentId, ICollection`1 addedReferenceList, ICollection`1 deletedReferenceList, ICollection`1 addedCredentialList, ICollection`1 deletedCredentialList, Byte[]& hashValue)
at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.TracingCredentialDataProvider.GetSecureData(Guid agentId, ICollection`1 addedReferenceList, ICollection`1 deletedReferenceList, ICollection`1 addedCredentialList, ICollection`1 deletedCredentialList, Byte[]& hashValue)
at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.AgentConfigurationFormatter.WriteSecureData(AgentConfigurationStream stream, XmlWriter writer, Guid agentId, Hashtable credentialAssociationList, Hashtable credentialList)
at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.AgentConfigurationFormatter.WriteSnapshotState(AgentConfigurationStream stream, XmlWriter writer, AgentValidatedConfiguration validatedConfig)
at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.AgentConfigurationFormatter.GetSnapshotConfigurationStream(AgentValidatedConfiguration validatedConfig, AgentConfigurationCookie oldCookie, AgentConfigurationCookie& newCookie)
at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.AgentConfigurationBuilder.FormatConfig(ConfigurationRequestDescriptor requestDescriptor, IAgentConfiguration agentConfig)
at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.AgentRequestProcessor.ProcessConfigurationRequest(ICollection`1 requestList, Int32& processedRequestsCount)
at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.AgentRequestProcessor.Execute()
at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.ThreadManager.ResponseThreadStart(Object state)
First, check which server has HealthService id 5446a7a5-637f-d72d-5946-db4ec08abf4f (the one named in the “Missing certificate for Healthservice id” exception). Then look for its entry under:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HealthService\Parameters\Management Groups\mymgmtgroup\SSDB\References\some long string
If it is not there, you can pick it up from another machine in the same management group and merge it. Afterwards, restart the HealthService (System Center Management).
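If you need to figure out which computer owns that HealthService id, here is a quick sketch using the OperationsManager module – the GUID is the one from the exception, and the management server name is a placeholder:

```powershell
Import-Module OperationsManager
New-SCOMManagementGroupConnection -ComputerName "MS01.contoso.com"   # hypothetical MS name

# Resolve the HealthService id from the exception to an actual instance.
Get-SCOMClassInstance -Id "5446a7a5-637f-d72d-5946-db4ec08abf4f" |
    Select-Object DisplayName, FullName
```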
It seems insane to have such an entry as this, describing such effort over a “monitoring agent”.
A ‘repair’ from the console is not the best approach to take (fifth bullet)?
And a manual clean-up process that still doesn’t ensure a working re-install?
How is it even acceptable (“an ongoing task of any OpsMgr Admin’s life”) that a monitoring agent would become a bigger problem to solve than the monitored server?
Thank you, though, for this info. I’ll need to address an existing SCOM environment I walked in to.
A reply to an old post, sure – but I’ve only ever had to remove, clean up, and reinstall the agent ONCE since we started with SCOM 2007 back in 2009, over the thousands of servers that have been monitored. Any other ‘reinstall’ has been solely because someone uninstalled it for some other dumb reason (a customer wanted to use an AV solution with terrible coding that saw SCOM as an AV, for instance).
I may have been lucky with the update process, but given that not even 2% of our monitored servers are in the domain SCOM is in, our agent ‘upgrades’ have been handled by our deployment systems, as required, not by SCOM itself. Realistically I don’t think that makes a difference to any potential ‘issue’.
Most agent issues people might think require a ‘reinstall/repair’ simply don’t. They just need someone to look at logs and address an issue for things to work again.
If a discovery breaks the wrong way and adds something like service monitoring for a service that doesn’t exist, simply clear the health state cache. Not all MPs have ‘good’ exception handling and things can go wrong. I’ve had people think that’s a ‘broken’ agent…
So over thousands of servers and 14+ years of experience, only ONE horribly broken agent that needed a fix – that doesn’t sound too bad, does it? Sure, that box was a song and a dance to fix compared to other agents, but it wasn’t that hard either, i.e. use the cleanup tool and some manual steps on top, as per Kevin’s notes above.
So keeping the agent working properly I’ve found to be a doddle. It’s normally human error that causes issues for me (i.e. routes breaking), nothing else.
Is there a way to download System Center Operations Manager Agent (version 1807) without deploying it through the wizard?
Hello Kevin, good morning. I have somewhat unusual behavior. We are migrating from a SCOM 1801 installation to our SCOM 2016 Update Rollup 9 servers, with agent version 8.0.13.053. But some days most of the servers are gray, and the next day they are green again. It should be noted that the SCOM 2016 UR9 management group is in one domain, let’s say domain1, and the agents are in another domain, call it domain2; for authentication, a shortcut trust relationship is in place. The question is whether this intermittence could be related to using a different agent version, or to some problem in the trust relationship.
Errors seen on the agents (event 20070): The OpsMgr Connector connected to XXXXXX.domain.net, but the connection was closed immediately after authentication occurred. The most likely cause of this error is that the agent is not authorized to communicate with the server, or the server has not received configuration.
And on the MS side you see errors (event 20002): A device at IP xxx.xxx.xxx.xxx:rpc port attempted to connect but could not be authenticated, and was rejected.
Hi Kevin – good day. We have some strange behavior across our Log Analytics workspaces trying to ship custom logs:
A module of type “Microsoft.EnterpriseManagement.Mom.Modules.CloudFileUpload.CloudFileUploadWriteAction” reported an exception Microsoft.EnterpriseManagement.Mom.Modules.CloudFileUpload.FileUploadException: Unable to get blob container for CustomLog from https://workspaceid.ods.opinsights.azure.com/ContainerService.svc. Will keep trying according to the specified policy. —> System.Net.WebException: The underlying connection was closed: An unexpected error occurred on a send. —> System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. —> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
at System.Net.Sockets.Socket.EndReceive(IAsyncResult asyncResult)
at System.Net.Sockets.NetworkStream.EndRead(IAsyncResult asyncResult)
Hi Kevin – Nice info. Thanks.
I’ve faced some strange behavior in our SCOM environment. During switch maintenance, we lost the LAN connection for more than 2 hours, and our management server was hit hard with all the alerts generated from every device. After we restored the LAN connection, the management server was filled with event ID 20002: “A device at IP xx.x..xx.xx:58245 attempted to connect but could not be authenticated, and was rejected.” and it is grayed out.
All but a few client agents are also grayed out.
I got the errors below in the Operations Manager log:
OpsMgr was unable to set up a communications channel to scom and there are no failover hosts. Communication will resume when XXXXXX.XXXXXXX.XXX is available and communication from this computer is allowed.
The OpsMgr Connector could not connect to scom:5723. The error code is 10061L(No connection could be made because the target machine actively refused it.). Please verify there is network connectivity, the server is running and has registered its listening port, and there are no firewalls blocking traffic to the destination.
Please point me to the right direction.
Thanks,
Dj
Sounds like Kerberos is broken. Are these agents and SCOM servers in the same AD Forest? If so – something needs a reboot…. it seems like AD Kerberos auth is broken after the switch maintenance.
Some of the clients are in the same AD forest, but there are quite a few in a different AD forest, though there is a full trust between them.
I rebooted the RMS, the DB server it lives on, and all the DCs.
I accidentally posted the scom server name in my previous post, would you please remove it? (Including this sentence)
Thank you!
Dj
Full trusts that support Kerberos can be problematic… I still recommend gateways and certificates in these cases, because trusts can break, and then you lose all your agents when Kerberos breaks. If you feel that Kerberos is working, or have agents in the SAME forest that cannot authenticate, I recommend opening a support case.
Thanks for the input. I’ll consider adding gateway servers with certificates.
The issue seems to be related to a network/switch problem. For some reason the subnet that the RMS is in is not communicating with the other subnets.
Thank you, Kevin!
Hi Kevin,
With SCOM 2022 UR1 – let’s say your issue is the monitoring agent on a management server, where it stops, and all that’s in the logs indicating anything abnormal in operation is the local DB dismounting and then the agent shutting down (informational events). All healthy and trundling along fine until then. It’s not the whole fleet, just some (5 of 11), and always the same ones. Registry key adjustments are pushed to all of these servers via GPO, and that seems to be working.
I’m trying to work through OpsMgr traces, but I’m not sure what I should be collecting nor what’s ‘fluff’. If there was any event occurring prior, I would at least know where to look.
Regards
Dwayne
Under Administration -> Device Management -> Agent Managed, what would cause the “Change Primary Management Server”, Repair…, and Uninstall… options to be grayed out? I am running the console on the management server, with an account that’s in the Administrators group. SCOM version is 2019.
Thanks,
Alfredo
By design: https://kevinholman.com/2010/02/20/how-to-get-your-agents-back-to-remotely-manageable-in-scom/
That explains it. Our agents are all installed via SCCM, and only port 5723 is open. Thanks, Kevin.
Hi Kevin,
We were facing an issue where performance data for Linux servers was not coming through in SCOM. While the issue has been fixed, the question is how to pull an automated performance report for UNIX servers from the SCOM management servers, and whether any alerting can be done.
A quick and dirty fix that worked for us:
Delete the machine under “Agent Managed”. Do a manual uninstall on the machine, manually delete the “Microsoft Monitoring Agent” folder, and delete HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager. The HealthService entry under “Services” is always deleted by the uninstall, it seems.
Then push from the Discovery Wizard.
Is there a way to generate an alert if any computer on the SCOM console goes into the grey state?
We already have this – it is the Heartbeat failure alert.
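If you want to pull those programmatically, here is a quick sketch (the alert name below matches the stock heartbeat failure monitor – adjust if yours is customized):

```powershell
Import-Module OperationsManager
# List open Health Service Heartbeat Failure alerts (resolution state 0 = New).
Get-SCOMAlert -Name "Health Service Heartbeat Failure" -ResolutionState 0 |
    Select-Object MonitoringObjectDisplayName, TimeRaised
```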