
Agents that never connect to management server

I was working with a customer on this issue:

The agent would install correctly. A push install worked (although it took forever), and a manual installation made the agent show up in pending, but after approval it would never communicate with a management server.

The logs on the management server didn’t show anything interesting.

The agent was logging this specific event – with the unique part highlighted:

Log Name:      Operations Manager
Source:        OpsMgr Connector
Date:          10/27/2014 10:07:37 AM
Event ID:      20071
Computer:      foo.contoso.com
Description:
The OpsMgr Connector connected to MS1.contoso.com, but the connection was closed immediately without authentication taking place.  The most likely cause of this error is a failure to authenticate either this agent or the server .  Check the event log on the server and on the agent for events which indicate a failure to authenticate.

Normally, we see the agent getting “rejected” by the management server.  In this case, the management server just didn’t respond.  We ran a verbose ETL trace of the agent, and captured an agent startup, which includes the attempt to communicate with the primary assigned MS:

[MOMChannel] [] [Information] :MOMChannel::ChannelTimeoutManagerImpl::OnTimerCallback{ChannelTimeoutManager_cpp117}Channel has timed out after 1498ms
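When the formatted trace is large, it can help to pull out just the channel timeout entries. The following is a minimal sketch in Python, assuming the trace has already been formatted to text and that timeout lines look like the one above. The function name `find_channel_timeouts` is made up for illustration and is not part of any SCOM tooling.

```python
import re

# Matches the timeout message shown in the trace line above.
TIMEOUT_RE = re.compile(r"Channel has timed out after (\d+)ms")

def find_channel_timeouts(lines):
    """Return the timeout durations (in ms) found in formatted trace lines."""
    durations = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            durations.append(int(match.group(1)))
    return durations

# The trace line from this post, split across two string literals for width.
sample = [
    "[MOMChannel] [] [Information] :MOMChannel::ChannelTimeoutManagerImpl::"
    "OnTimerCallback{ChannelTimeoutManager_cpp117}Channel has timed out after 1498ms",
]
print(find_channel_timeouts(sample))
```

A run of timeouts in the 1 to 2 second range on a pre-UR3 agent would be consistent with the short default timeout discussed next.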

There are a few possibilities.

First, UR3 for SCOM 2012 R2 included a fix that changed some of the default communication timeouts from 1 second to 20 seconds.  This helps resolve issues when agents are a long distance away, network-wise, and Kerberos authentication takes a long time.  So my first recommendation would be to apply UR3 to both the management servers and the agents and attempt a repro.

However, this was not the case for us.  These were in the same datacenter, on the same subnet even!

To rule out a network issue, we tried to copy a large zipped file across the network.  The copy took a very long time and then failed.
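If you want a number rather than an impression from the copy test, you can time it. This is a rough sketch in Python, not something we ran for this customer. `timed_copy` is a hypothetical helper, and the paths in the usage comment are placeholders.

```python
import os
import shutil
import time

def timed_copy(src, dst):
    """Copy src to dst and return (elapsed_seconds, megabytes_per_second)."""
    size_mb = os.path.getsize(src) / (1024 * 1024)
    start = time.perf_counter()
    shutil.copyfile(src, dst)  # raises OSError if the copy fails partway
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small files.
    rate = size_mb / elapsed if elapsed > 0 else float("inf")
    return elapsed, rate

# Example usage (placeholder paths, adjust for your environment):
# elapsed, rate = timed_copy(r"C:\temp\big.zip", r"\\servername\share\big.zip")
# print(f"{elapsed:.1f}s at {rate:.1f} MB/s")
```

Running the same copy to a known-good server on the same subnet gives you a baseline to compare against.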

Next, we performed a ping test:

ping servername -t -l 65500

The -l option in ping controls the size of the packet sent, and we saw the server either return extraordinarily long ping times or time out altogether.  This all points to a failure in the network card.  Sure enough, this was a physical server and not a VM.  As reliable as today’s hardware is, you just can’t rule out an old-school issue like this.
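If you save the output of a long-running ping test to a file, a few lines of Python can summarize it. This is only a sketch: it assumes English-language Windows ping output, the `summarize_ping` name is made up, and the sample text below (including the IP address and timings) is invented for illustration, not captured from the customer environment.

```python
import re

# Windows ping reply lines carry the round trip as "time=12ms" or "time<1ms".
TIME_RE = re.compile(r"time[=<](\d+)ms")

def summarize_ping(output):
    """Count replies and timeouts and report the slowest reply in ms."""
    times = []
    timeouts = 0
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("Request timed out"):
            timeouts += 1
        elif line.startswith("Reply from"):
            match = TIME_RE.search(line)
            if match:
                # "time<1ms" is counted as 1 ms.
                times.append(int(match.group(1)))
    return {
        "replies": len(times),
        "timeouts": timeouts,
        "max_ms": max(times) if times else None,
    }

# Illustrative sample only; the address and timings are invented.
sample = """\
Pinging foo.contoso.com [10.0.0.5] with 65500 bytes of data:
Reply from 10.0.0.5: bytes=65500 time=1850ms TTL=128
Request timed out.
Reply from 10.0.0.5: bytes=65500 time=2210ms TTL=128
"""
print(summarize_ping(sample))
```

On a healthy server in the same subnet you would expect no timeouts and reply times of a few milliseconds, even at this packet size.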

3 Comments

  1. Mike Z.

    Hi Kevin,

    Can you please elaborate on changing “some of the default timeouts for communication from 1 second to 20 seconds?” I believe I am having this issue and need to know where to change this setting in the registry. Thanks!

    • Kevin Holman

      You don’t. These were default timeouts in code, and generally should not be changed. Make sure you have at LEAST SCOM 2012R2 UR3 or later on both the agent and the management server, and you now have these extended timeouts. If we cannot negotiate Kerberos within 20 seconds, something is terribly wrong with your environment.

      My guess is that something else is wrong here.
