This is something I have worked with pretty much every customer on. If you assign agents manually to management servers or gateways, you might want to automate load balancing those agents across multiple management servers. There are some community solutions out there already, but they often move agents every day, unnecessarily. This solution incorporates a threshold: a percentage of the total agents must be out of balance before any load balancing occurs.
A common scenario I see is having something like 5 management servers, where 3 MS are dedicated to monitoring Windows agents and 2 MS are dedicated to UNIX/Linux, URL monitoring, network monitoring, etc. You might wish to keep agents from reporting to, or even failing over to, these dedicated management servers.
Another common scenario is Gateways, where you deploy multiple gateways in a specific network location for high availability. Agents assigned to a GW only communicate with their assigned GW. To get them to fail over to a second GW, you must use PowerShell and the SCOM SDK to configure this manually for all the agents. When you use my automation for load balancing, it also handles adding the other GW as a failover for any agents that get moved.
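To illustrate what that manual work looks like (this is not the MP's code), here is a minimal sketch using the OperationsManager PowerShell module. The gateway and agent names are placeholders, and the ordering of the two calls reflects the common practice of giving the agent a known failover before switching its primary:

```powershell
# Minimal sketch - manually move one agent to GW02 as its primary and keep GW01
# as its failover. All server/agent names are placeholders for your environment.
Import-Module OperationsManager

$gw01  = Get-SCOMGatewayManagementServer | Where-Object { $_.Name -eq "GW01.domain.com" }
$gw02  = Get-SCOMGatewayManagementServer | Where-Object { $_.Name -eq "GW02.domain.com" }
$agent = Get-SCOMAgent -DNSHostName "agent01.domain.com"

# Set the failover first so the agent keeps a server it already knows about,
# then switch the primary to the other gateway.
$agent | Set-SCOMParentManagementServer -FailoverServer $gw01
$agent | Set-SCOMParentManagementServer -PrimaryServer $gw02
```

Repeating that for every agent, on every gateway pair, is exactly the work the load balancing rules automate.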
Quick Download: SCOM Agent Load Balancing Management Pack on GitHub
Once you import the MP, you will find two rules targeting the “All Management Servers Resource Pool”:
These rules are disabled and require configuration first. Open the Rule for Management Servers first.
Set the rule to Enabled, then go to the Configuration tab and edit the data source:
By default this rule runs once per day at 5:01 AM. Change this if needed.
Next, edit the Write Action:
Edit the SCOMServerList to be a comma-separated list of the SCOM management servers that you wish to load balance as a unit.
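For example (these server names are just placeholders; use the names of your own management servers), the value might look like this:

```
MS01.domain.com,MS02.domain.com,MS03.domain.com
```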
The script will only load balance agents if more than 5 percent of the agents are out of balance. This keeps the rule from moving agents every day unnecessarily. You can change this threshold if you like.
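To illustrate the idea behind the threshold (this is a rough sketch of the logic, not the MP's actual script), assuming a 5 percent threshold and placeholder server names:

```powershell
# Rough sketch of the imbalance check only - not the MP's actual script.
# Assumes the OperationsManager module is loaded; server names are placeholders.
$serverList = @("MS01.domain.com","MS02.domain.com","MS03.domain.com")
$threshold  = 5   # percent of total agents that must be out of balance

$agents = Get-SCOMAgent | Where-Object { $serverList -contains $_.PrimaryManagementServerName }
$idealPerServer = [math]::Ceiling($agents.Count / $serverList.Count)

# Count how many agents sit above the ideal count on any one server
$excess = 0
foreach ($server in $serverList) {
    $count = @($agents | Where-Object { $_.PrimaryManagementServerName -eq $server }).Count
    if ($count -gt $idealPerServer) { $excess += ($count - $idealPerServer) }
}

$percentOutOfBalance = if ($agents.Count -gt 0) { ($excess / $agents.Count) * 100 } else { 0 }
if ($percentOutOfBalance -gt $threshold) {
    Write-Output "Out of balance by $([math]::Round($percentOutOfBalance,1)) percent - rebalance would run."
} else {
    Write-Output "Within threshold - no agents moved."
}
```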
Save the rule and it will start running at the scheduled time.
This workflow logs Event 8001 to the Operations Manager event log on one of your management servers (whichever MS is hosting the All Management Servers Resource Pool workflows). It will always run on the same management server, unless that MS is down.
If there are errors, then alerts will be raised:
You can use the GW rule in the same way. If you have multiple GW pairs, then I recommend creating a new rule for each GW pair, by copying and pasting the GW rule several times, and editing the configuration in the XML.
Quick Download: SCOM Agent Load Balancing Management Pack on GitHub
Hi Kevin
Damn, this is long overdue and has always been a pain. I even tried scripting it into the agent install, but it never worked perfectly. Great job!
Does this overcome the bug with changing the primary management server through the console? The bug where some agents will generate heartbeat alerts because they don’t actually change when told?
I haven’t ever seen that bug… unless you are talking about Gateways – in which case that’s by design. But I haven’t ever seen it when moving from one MS to another, as long as you provide the failover server list and that list contains at least ONE server that was in the agent’s configuration previously.
Hi Kevin
Is it possible to change the target from “All Management Servers Resource Pool” to something you can override in the Management Pack?
I feel like that would be a mistake. Why would you want to do that? Technically you could change the target to Collection Management Server and just enable the rule for a specific management server, but this is backwards, as you lose high availability for the rule. When you force a rule to run only on a defined MS just to make it easy to know which MS is running the rule, you destroy high availability, which is the benefit of targeting a pool. The MS that owns the AMSRP will always own it, unless the number of MS changes or that MS is down. If you truly need to limit or control this in some way, you could create your own custom resource pool with two servers in it, and then target the rule to that RP.
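If you do go the custom pool route, a minimal sketch with the OperationsManager module might look like this (the pool and server names are placeholders):

```powershell
# Minimal sketch - create a custom resource pool containing two management servers,
# then target the load balancing rule at that pool. Names are placeholders.
Import-Module OperationsManager

$members = Get-SCOMManagementServer -Name "MS01.domain.com","MS02.domain.com"
New-SCOMResourcePool -DisplayName "Custom Agent Load Balancing Pool" -Member $members
```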
But Gateway servers are not members of the “All Management Servers Resource Pool” – at least they are not in any of my SCOM management groups.
So agent load balancing on the Gateways wouldn’t work with your Management Pack for the GW rule, then.
Where the rule runs has ZERO relationship to what is being load balanced. The AMSRP is simply the target for “where to run this rule” – somewhere that can access the SDK. What you are load balancing has ZERO relationship to that.
Hi Kevin
How does this management pack differ from the one Tao Yang created in his “OpsMgr Self Maintenance Management Pack”?
They are similar. Both target the AMSRP for high availability. His uses a resource pool to select management servers, which I always thought was odd, because resource pools do not have anything to do with agents. He uses a custom resource pool for load balancing simply as selection criteria for which management servers to load balance.
Also, his does not have a threshold for the number of agents BEFORE balancing. It will load balance even if there is only ONE agent that needs to be moved… so it ends up moving agents, likely every single day. My solution uses a threshold of the percentage of agents that need to be moved before triggering a rebalance. This ensures we are not load balancing every day, or with every change in the number of agents.
Lastly, my solution works fine with Gateways as well: there is a rule for Management Servers and a separate rule for Gateways. A customer might have multiple GW-served network locations, each with multiple GW pairs. My solution makes it pretty easy to copy and paste the GW rule for each GW pair.
Hi Kevin,
Does this MP have a minimum supported SCOM version? I am running SCOM 2016 UR10 and I am not seeing the management pack objects for the MP in the Authoring pane after successfully importing it.
Please disregard my question about not seeing the agent load balancer rules. I failed to read the article correctly.
Hi Kevin,
Great work on this MP, it saves me a lot of time. I have one feature request, though. Is it possible to add a failover server even when the agent is not being moved to another MS/GW? I now have a number of agents which have a failover MS/GW configured and a number of agents which don’t have a failover, because a failover is only added when the agent is moved to another MS/GW.
I see this commonly… but checking EVERY agent for proper failover config can be expensive… so I would not make it part of the GW load balancer. Normally I do this manually via PowerShell: https://kevinholman.com/2018/08/06/assigning-gateways-and-agents-to-management-servers-using-powershell/
Otherwise, you could write a rule that runs a script, gets all agents assigned to a specific gateway, and then adds a specific failover if it is missing. But you’d need a copy of the rule running for EACH gateway that you have a failover partner for. I have written things like this for my customers, but it is usually customized for their environment and hard to make generic. Usually, if a customer is comfortable editing the XML, copying/pasting a rule, and fixing up the display strings, they are also comfortable writing their own script.
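For illustration, a minimal sketch of that kind of per-gateway script might look like the following (the gateway names are placeholders, and this is not the MP's code):

```powershell
# Minimal sketch - for agents reporting to GW01, add GW02 as the failover if
# no failover is configured. Gateway names are placeholders.
Import-Module OperationsManager

$primaryGW  = Get-SCOMGatewayManagementServer | Where-Object { $_.Name -eq "GW01.domain.com" }
$failoverGW = Get-SCOMGatewayManagementServer | Where-Object { $_.Name -eq "GW02.domain.com" }

$agents = Get-SCOMAgent | Where-Object { $_.PrimaryManagementServerName -eq $primaryGW.Name }
foreach ($agent in $agents) {
    # Only touch agents that have no failover configured at all
    if ($agent.GetFailoverManagementServers().Count -eq 0) {
        $agent | Set-SCOMParentManagementServer -FailoverServer $failoverGW
    }
}
```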
Do you think it would be useful to create a MP example that does this in a generic format?
Thank you for explaining your thoughts on this and why you chose not to add it to the MP. Yes, it would be of great help to have an MP example which I can adjust for our environments.