Fixing troubled SCOM agents

Sometimes agents will not “talk” to the management server upon initial installation, and sometimes an agent can become unhealthy long after working fine.  Keeping agents healthy is an ongoing part of any OpsMgr Admin’s life.

This post is NOT an “end to end” manual of all the factors that influence agent health…. but that is something I am working on for a later time.  There are so many factors in an agent’s ability to communicate and work as expected.  A few key areas that commonly affect this are:

  • DNS name resolution (Agent to MS, and MS to Agent)
  • DNS domain membership (disjointed)
  • DNS suffix search order
  • Kerberos connectivity
  • Kerberos SPN’s accessible
  • Firewalls blocking 5723
  • Firewalls blocking access to AD for authentication
  • Packet loss
  • Invalid or old registry entries
  • Missing registry entries
  • Corrupt registry
  • Default agent action accounts locked down/out (HSLockdown)
  • HealthService Certificate configuration issues.
  • Hotfixes required for OS Compatibility
  • Management Server rejecting the agent
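Several of the items above (name resolution in both directions, and a firewall blocking 5723) can be checked quickly from the agent before touching anything else. The following is only a minimal sketch using the Python standard library, not an official tool: the function names are made up for illustration, and a successful TCP connect only shows that the port is reachable, not that Kerberos or certificate authentication will succeed.

```python
import socket


def resolve_forward(hostname):
    """Return the sorted IP addresses a name resolves to, or [] if it does not resolve."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def resolve_reverse(ip):
    """Return the host name registered for an IP address, or None if there is none."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None


def port_open(host, port=5723, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_agent_path(management_server, port=5723):
    """Collect the basic DNS and port facts for one management server."""
    addresses = resolve_forward(management_server)
    return {
        "server": management_server,
        "addresses": addresses,
        "reverse": {ip: resolve_reverse(ip) for ip in addresses},
        "port_open": bool(addresses) and port_open(management_server, port),
    }
```

Calling `check_agent_path("ms01.example.com")` (a placeholder name) on the agent returns a small dictionary. An empty `addresses` list points at DNS, a reverse name that does not match points at stale or disjointed DNS records, and `port_open` being false points at a firewall or a management server that is not listening.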

How do you detect agent issues from the console?  The problem might be that they are not showing up in the console at all!  Perhaps they might be a manual install that never shows up in Pending Actions?  Or a push deployment, that stays stuck in Pending actions and never shows up under “Agent Managed”.  Or even one that does show up under “Agent Managed” but never shows as being monitored… returning agent version data, etc.

One of the BEST things you can do when faced with an agent health issue… is to look on the agent, in the OperationsManager event log.  This is a fairly verbose log that will almost always give you a good hint as to the trouble with the agent.  That is ALWAYS one of my first steps in troubleshooting.
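If you would rather pull the most recent entries from a script than open Event Viewer, something like the sketch below may help. It shells out to the built-in Windows `wevtutil` tool. The channel name "Operations Manager" is an assumption about how the log is registered on the agent, and the helper names are hypothetical, so adjust them for your environment.

```python
import subprocess

# Assumed channel name of the OpsMgr log on the agent; verify it with "wevtutil el".
LOG_NAME = "Operations Manager"


def build_event_query(count=20, errors_only=False, log_name=LOG_NAME):
    """Build a wevtutil command line that returns the newest events as plain text."""
    command = ["wevtutil", "qe", log_name, f"/c:{count}", "/rd:true", "/f:text"]
    if errors_only:
        # XPath filter: Level 1 = critical, 2 = error, 3 = warning.
        command.append("/q:*[System[(Level=1 or Level=2 or Level=3)]]")
    return command


def read_recent_events(count=20, errors_only=False):
    """Run the query on the local machine (Windows only) and return the text output."""
    result = subprocess.run(
        build_event_query(count, errors_only),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On an agent, `print(read_recent_events(10, errors_only=True))` would show the ten newest warnings and errors, which is usually enough to see whether the problem is authentication, connectivity, or configuration.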

Another way of examining Agent health – is by the built in views in OpsMgr.  In the console – there is a view – Located at the following:


This view is important – because it gives us a perspective of the agent from two different points:

1.  The perspective of the agent monitors running on the agent, measuring its own “health”.

2.  The perspective of the “Health Service Watcher”, which is the agent being monitored from a Management Server.

If any of these are red or yellow – that is an excellent place to start.  This should be an area that your level 1 support for Operations Manager checks DAILY.  We should never have a high number of agents that are not green here.  If we do – this is indicative of an unhealthy environment, or the admin team not adhering to best practices (such as keeping up with hotfixes, using maintenance mode correctly, etc.).

Use Health Explorer on these views – to drill down into exactly what is causing the Agent, or Health Service Watcher state to be unhealthy.

Now…. the following are some general steps to take to “fix” broken agents.  These are not in definitive order.  The order of steps really comes down to what you find when looking at the logs after taking these steps.

  • Start the HealthService on the agent.  You might find the HealthService is just not running.  This should not be common or systemic.  Consider enabling the recovery for this condition to restart the HealthService on Heartbeat failure.  However – if this is systemic – it is indicative of something causing your HealthService to restart too frequently, or administrators stopping SCOM.  Look in the OpsMgr event log for verification.
  • Bounce the HealthService on the agent.  Sometimes this is all that is needed to resolve an agent issue.  Look in the OpsMgr event log after a HealthService restart, to make sure it is clean with no errors.
  • Clear the HealthService queue and config (manually).  This is done by stopping the HealthService.  Then deleting the “\Program Files\System Center Operations Manager 2007\Health Service State” folder.  Then start the HealthService.  This removes the agent config file, and the agent queue files.  The agent starts up with no configuration, so it will resort to the registry to determine what management server to talk to.  From the registry – it will find out if it is AD integrated, or a fixed management server to talk to if not.  This is located at HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Agent Management Groups\PROD1\Parent Health Services\ location, in the \<#>\NetworkName string value.  The agent will contact the management server – request config, receive config, download the appropriate management packs, apply them, run the discoveries, send up discovery data, and repeat the cycle for a little while.  This is very much what happens on a new agent during initial deployment.
  • Clear the HealthService queue and config (from the console).  When looking at the above view (or any state view or discovered inventory view which targets the HealthService or Agent class) there is a task in the actions pane – “Flush Health Service State and Cache”.  This will perform a very similar action to that above…. as a console task.  This will only work on an agent that is somewhat responsive…. if it does not work you need to perform this manually as the agent is really broken from communication with the management server.  This task will never complete, and will not return success – because the task breaks off from itself as the queue is flushed.
  • “Repair” the agent from the console.  This is done from the Administration pane – Agent Managed.  You should not run a repair on any AD-integrated agent – as this will break the AD integration and assign it to the management server that ran the repair action.  A “repair” technically just reinstalls the agent in a push fashion, just like an initial agent deployment.  It will also apply/reapply any agent related hotfixes in the management server’s \Program Files\System Center Operations Manager 2007\AgentManagement\ directories.
  • Reinstall the agent (manually).  This would be for manual installs or when push/repair is not possible.  This section is where the combination of options gets a little tricky.  When you are at this point… where you have given up, I find just going all the way with a brute force reinstall is the best way.  This means performing the following steps:
    • Uninstall the agent via add/remove programs.
    • Run the Operations Manager Cleanup Tool CleanMom.exe or CleanMOM64.exe.  This is designed to make sure that the service, files, and all registry entries are removed.
    • Ensure that the agent’s folder is removed at:  \Program Files\System Center Operations Manager 2007\
    • Ensure that the following registry keys are deleted:
      • HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager
      • HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HealthService
    • Reboot the agent machine (if possible)
    • Delete the agent from Agent Managed in the OpsMgr console.  This will allow a new HealthService ID to be detected and is sometimes a required step to get an agent to work properly, although not always required.
    • Now that the agent is gone cleanly from both OpsMgr console and the agent Operating System…. manually reinstall the agent.  Keep it simple – install it using a named management server/management group, and use Local System for the agent action account (these will remove any common issues with a low priv domain account, and AD integration if used)  If it works correctly – you can always reinstall again using low priv or AD integration.
    • Remember to import certificates at this point if you are using those on the individual agent.
    • As always – look in the OperationsManager event log…. this will tell you if it connected, and is working, or if there is a connectivity issue.
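After the state folder is cleared or the agent is reinstalled, the agent falls back to the registry location described in the steps above to find its management server, so it can be worth reading that value to confirm it points where you expect. The sketch below is an illustration only: it uses Python's Windows-only `winreg` module, the function names are hypothetical, and the management group name (PROD1 in the example above) is passed in as a parameter.

```python
BASE_KEY = r"SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Agent Management Groups"


def parent_services_key(management_group):
    """Build the registry path that holds the parent health services for one group."""
    return BASE_KEY + "\\" + management_group + r"\Parent Health Services"


def read_parent_servers(management_group):
    """Return every NetworkName value found under the numbered subkeys (Windows only)."""
    import winreg  # imported here so the path helper also works on other platforms

    servers = []
    path = parent_services_key(management_group)
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, path) as parent:
        subkey_count = winreg.QueryInfoKey(parent)[0]
        for index in range(subkey_count):
            with winreg.OpenKey(parent, winreg.EnumKey(parent, index)) as subkey:
                try:
                    value, _value_type = winreg.QueryValueEx(subkey, "NetworkName")
                except FileNotFoundError:
                    continue  # this subkey has no NetworkName value
                servers.append(value)
    return servers
```

If `read_parent_servers("PROD1")` returns an old or decommissioned management server name, that would fit the “invalid or old registry entries” cause listed earlier. An empty list may simply mean the agent is AD integrated.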

To summarize…. there are many things that can cause an agent issue, and many methods to troubleshoot.  However – at a very general level, my typical steps are:

  1. Review OpsMgr event log on agent
  2. Bounce HealthService
  3. Bounce HealthService clearing \Health Service State folder.
  4. Complete brute force reinstall of the agent.
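Step 3 is the one most worth scripting, since it is always the same three actions: stop the HealthService, delete the Health Service State folder, start the HealthService. The following is a rough sketch rather than a supported procedure. The `C:` drive letter in the default path is an assumption (the folder lives wherever the agent was installed), the helper names are made up for illustration, and it defaults to a dry run that only reports what it would do.

```python
import shutil
import subprocess

SERVICE = "HealthService"
# Assumed install location; the drive letter is a guess and may differ per agent.
STATE_DIR = r"C:\Program Files\System Center Operations Manager 2007\Health Service State"


def flush_plan(state_dir=STATE_DIR):
    """Return the ordered actions for clearing the agent queue and config."""
    return [
        ("run", ["net", "stop", SERVICE]),
        ("delete", state_dir),
        ("run", ["net", "start", SERVICE]),
    ]


def execute(plan, dry_run=True):
    """Log each action, and perform it only when dry_run is False."""
    log = []
    for action, target in plan:
        log.append(f"{action}: {target}")
        if dry_run:
            continue
        if action == "run":
            subprocess.run(target, check=True)  # raises if the command fails
        elif action == "delete":
            shutil.rmtree(target)  # removes the folder and everything in it
    return log
```

Running `execute(flush_plan())` prints nothing and changes nothing. It returns the three planned actions so you can review them. Passing `dry_run=False` from an elevated prompt on the agent performs them, after which the OpsMgr event log should show the agent requesting and receiving config again.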

If an external issue is the cause (DNS, Kerberos, Firewall) then these steps likely will not help you…. but evidence of those should be available in the OpsMgr event log.


  1. ilayaraja

    Hi kevin ,

    After we upgrade UR7 we are getting frequently this error on management servers. can you Please help me on this .
    So far
    1. rebooted servers
    2. renamed health state folder and started the services

    Management Configuration Service failed to process agent configuration request

    Management Configuration Service failed to process agent configuration request. OpsMgr Management Configuration Service failed to process configuration request (Xml configuration file or management pack request) due to the following exception

    Microsoft.EnterpriseManagement.ManagementConfiguration.Interop.HealthServicePublicKeyNotRegisteredException: Missing certificate for Healthservice id 5446a7a5-637f-d72d-5946-db4ec08abf4f

    Server stack trace:
    at Microsoft.EnterpriseManagement.RuntimeService.RootConnectorMethods.OnRetrieveSecureData(Guid healthServiceId, ReadOnlyCollection`1 addedSecureStorageReferences, ReadOnlyCollection`1 removedSecureStorageReferences, ReadOnlyCollection`1 addedSecureStorageElements, ReadOnlyCollection`1 removedSecureStorageElements, String hashAlgorithmName, Byte[]& hashValue)
    at Microsoft.EnterpriseManagement.RuntimeService.SDKReceiver.OnRetrieveSecureData(Guid healthServiceId, ReadOnlyCollection`1 addedSecureStorageReferences, ReadOnlyCollection`1 removedSecureStorageReferences, ReadOnlyCollection`1 addedSecureStorageElements, ReadOnlyCollection`1 removedSecureStorageElements, String hashAlgorithmName, Byte[]& hashValue)
    at System.Runtime.Remoting.Messaging.StackBuilderSink._PrivateProcessMessage(IntPtr md, Object[] args, Object server, Object[]& outArgs)
    at System.Runtime.Remoting.Messaging.StackBuilderSink.SyncProcessMessage(IMessage msg)

    Exception rethrown at [0]:
    at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
    at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
    at Microsoft.EnterpriseManagement.Mom.Internal.ISdkService.OnRetrieveSecureData(Guid healthServiceId, ReadOnlyCollection`1 addedSecureStorageReferences, ReadOnlyCollection`1 removedSecureStorageReferences, ReadOnlyCollection`1 addedSecureStorageElements, ReadOnlyCollection`1 removedSecureStorageElements, String hashAlgorithmName, Byte[]& hashValue)
    at Microsoft.EnterpriseManagement.ManagementConfiguration.Communication.CredentialDataProvider.GetSecureDataUnwrapped(Guid agentId, ICollection`1 addedReferenceList, ICollection`1 deletedReferenceList, ICollection`1 addedCredentialList, ICollection`1 deletedCredentialList, Byte[]& hashValue)
    at Microsoft.EnterpriseManagement.ManagementConfiguration.Communication.CredentialDataProvider.GetSecureData(Guid agentId, ICollection`1 addedReferenceList, ICollection`1 deletedReferenceList, ICollection`1 addedCredentialList, ICollection`1 deletedCredentialList, Byte[]& hashValue)
    at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.TracingCredentialDataProvider.GetSecureData(Guid agentId, ICollection`1 addedReferenceList, ICollection`1 deletedReferenceList, ICollection`1 addedCredentialList, ICollection`1 deletedCredentialList, Byte[]& hashValue)
    at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.AgentConfigurationFormatter.WriteSecureData(AgentConfigurationStream stream, XmlWriter writer, Guid agentId, Hashtable credentialAssociationList, Hashtable credentialList)
    at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.AgentConfigurationFormatter.WriteSnapshotState(AgentConfigurationStream stream, XmlWriter writer, AgentValidatedConfiguration validatedConfig)
    at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.AgentConfigurationFormatter.GetSnapshotConfigurationStream(AgentValidatedConfiguration validatedConfig, AgentConfigurationCookie oldCookie, AgentConfigurationCookie& newCookie)
    at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.AgentConfigurationBuilder.FormatConfig(ConfigurationRequestDescriptor requestDescriptor, IAgentConfiguration agentConfig)
    at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.AgentRequestProcessor.ProcessConfigurationRequest(ICollection`1 requestList, Int32& processedRequestsCount)
    at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.AgentRequestProcessor.Execute()
    at Microsoft.EnterpriseManagement.ManagementConfiguration.Engine.ThreadManager.ResponseThreadStart(Object state)

  2. pradeep Teotia

    First you can check for server having Healthservice id 5446a7a5-637f-d72d-5946-db4ec08abf4f Microsoft.EnterpriseManagement.ManagementConfiguration.Interop.HealthServicePublicKeyNotRegisteredException: Missing certificate for Healthservice id 5446a7a5-637f-d72d-5946-db4ec08abf4f.

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\HealthService\Parameters\Management Groups\mymgmtgroup\SSDB\References\some long string

    If it is not there you can pick it up from another machine in the same management group and merge it. Afterwards restart your healthservice (System Center Management).

  3. mark

    It seems insane to have such an entry as this, describing such effort over a “monitoring agent”.
    A ‘repair’ from the console is not the best approach to take (fifth bullet)?
    And a manual clean-up process that still doesn’t ensure a working re-install?
    How is it even acceptable (“an ongoing task of any OpsMgr Admin’s life”) that a monitoring agent would become a bigger problem to solve than the monitored server?

    Thank you, though, for this info. I’ll need to address an existing SCOM environment I walked in to.

  4. Tharindu

    Is there a way to download System Center Operations Manager Agent (version 1807) without deploying it through the wizard?

  5. David Jacob

    Hello Kevin, good morning, I have a somewhat unusual behavior, we are migrating to our SCOM 2016 Update Rollup 9 servers with an agent version from a SCOM 1801 intalation, but one day most of the servers are gray, up to date The following are in green, it should be noted that the MG of Scom 2016 UR9 is in one domain, let’s say domain1 and the agents are in another domain, which we call domain2 for authentication, or a short-cut trust relationship is performed, the question is whether This intermittence could be realion to be using a different version of agents or some problem in the trust relationship
    errors are seen in agents 20070: The OpsMgr Connector connected to, but the connection was closed immediately after authentication occurred. The most likely cause of this error is that the agent is not authorized to communicate with the server, or the server has not received configuration.
    and on the MS side you see errors 2002: A device at IP port attempted to connect but could not be authenticated, and was rejected.

  6. Dan

    Hi Kevin – good day, Have some strange behavior across our log analytics workspaces trying to ship custom logs:
    A module of type “Microsoft.EnterpriseManagement.Mom.Modules.CloudFileUpload.CloudFileUploadWriteAction” reported an exception Microsoft.EnterpriseManagement.Mom.Modules.CloudFileUpload.FileUploadException: Unable to get blob container for CustomLog from Will keep trying according to the specified policy. —> System.Net.WebException: The underlying connection was closed: An unexpected error occurred on a send. —> System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. —> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
    at System.Net.Sockets.Socket.EndReceive(IAsyncResult asyncResult)
    at System.Net.Sockets.NetworkStream.EndRead(IAsyncResult asyncResult)

  7. Dera

    Hi Kevin – Nice info. Thanks.

    I’ve faced some strange behavior in our SCOM environment. After a switch maintenance, we’ve lost a LAN connection for more than 2 hours and our management server was hit hard with all alerts generated from every device. After we restore the LAN connection, the management server is filled with event ID 20002 : “A device at IP xx.x..xx.xx:58245 attempted to connect but could not be authenticated, and was rejected.” and it’s grayed out.

    All except few clients agents are also grayed out.
    I got the below error in Ops Manager log

    OpsMgr was unable to set up a communications channel to scom and there are no failover hosts. Communication will resume when XXXXXX.XXXXXXX.XXX is available and communication from this computer is allowed.

    The OpsMgr Connector could not connect to scom:5723. The error code is 10061L(No connection could be made because the target machine actively refused it.). Please verify there is network connectivity, the server is running and has registered it’s listening port, and there are no firewalls blocking traffic to the destination.

    Please point me to the right direction.


    • Kevin Holman

      Sounds like Kerberos is broken. Are these agents and SCOM servers in the same AD Forest? If so – something needs a reboot…. it seems like AD Kerberos auth is broken after the switch maintenance.

      • Dera

        Some of the clients are in the same AD forest, but there’re quite a few which are in different AD forest but there’s a full trust between the them.

        I rebooted the RMS, the DB it lives in, all DCs.

        I accidentally posted the scom server name in my previous post, would you please remove it? (Including this sentence)

        Thank you!


        • Kevin Holman

          Full trusts that support Kerberos can be problematic… I still recommend Gateways and certificates in these cases, because trusts can break, and then you lose all your agents when Kerberos breaks. If you feel that Kerberos is working, or have agents in the SAME forest that cannot be authenticated, I recommend opening a support case.

  8. Dera

    Thanks for the input. I’ll consider adding Gateway servers with Cert.

    The issue seems to be related to a network/switch problem. For some reason the subnet that the RMS is found is not communicating with other subnets.

    Thank you, Kevin!
