
Understanding SCOM Resource Pools



 

 

Resource pools are nothing new – they were introduced in SCOM 2012 RTM, for two reasons:

1.  To remove the single-point-of-failure that was the RMS role in SCOM 2007.

2.  To provide a mechanism for high availability of agentless/remote workflows, such as Unix/Linux, Network, and URL monitoring, among others.

 

That said – they are often not fully understood.

 

Let’s talk about the primary components of a Resource Pool.  I am going to “dumb this down” a lot, because it is actually quite complex behind the scenes.  So I will break this down into “roles” with regard to Resource Pools.  The primary “role” components we will discuss are:

1.  Members

2.  Observers

3.  Default Observer

 

Members of a pool are either a Management Server or a Gateway Server.

Observers are “observer-only” roles.  These are Management Servers or Gateway servers that do NOT participate in loading workflows for the pool, but do participate in quorum decisions.  Healthservice-based observer-only roles are rarely used under normal circumstances: you would use them only if you wanted high availability for your pool while keeping a limited number of Healthservices actually running pool workflows.

Default Observer is the SCOM Operations Database.  It is set to “Enabled” or “Disabled” for every pool: enabled by default for all pools created in the UI, and disabled by default for all pools created via PowerShell using the New-SCOMResourcePool cmdlet.  It exists for the following reason:

To allow a pool to have high availability when it contains only two management servers.

 

Let’s talk about that.

A pool requires ONE or more members.

A pool requires THREE (quorum voting) members to establish high availability.

High availability is the ability to have a member be unavailable, with no loss of monitoring.

 

The reason we need THREE (quorum voting) members (not two) for high availability is because of the quorum algorithm.  We require that MORE than 50% of the quorum voting members in a pool be available.  If you have only two members of a pool, and one is down, you have lost quorum, because of the “greater than 50%” rule.
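
The “greater than 50%” rule can be sketched in a few lines of PowerShell.  This is only an illustration of the arithmetic described above, not how SCOM implements it internally; `Test-PoolQuorum` and its parameter names are hypothetical and are not SCOM cmdlets.

```powershell
# Hypothetical helper (NOT a SCOM cmdlet): applies the "greater than 50%" rule.
function Test-PoolQuorum {
    param(
        [int]$VotingMembers,     # total quorum voting members in the pool
        [int]$AvailableMembers   # voting members that are currently available
    )
    # Quorum holds only when strictly more than half of the voters are available.
    return ($AvailableMembers * 2) -gt $VotingMembers
}

Test-PoolQuorum -VotingMembers 2 -AvailableMembers 1   # False: 1 of 2 is exactly 50%
Test-PoolQuorum -VotingMembers 3 -AvailableMembers 2   # True: 2 of 3 is more than 50%
```

Comparing `AvailableMembers * 2` against `VotingMembers` avoids fractional math, so exactly 50% is treated as lost quorum.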

Therefore – the “Default Observer” was dreamed up, so customers would not HAVE to deploy a minimum of THREE management servers just to get high availability for their Resource Pools.  It is a special quorum voting “observer” role, to allow for high availability of pools when you have two management servers deployed.  This reduced cost and complexity for a basic SCOM deployment.

 

Let’s break this into “scenarios”

 

Single Management server in pool

The default observer is enabled by default.

There is no high availability, because the management server is a single point of failure.

The default observer provides no benefit (nor harm) in this case.

 

Two management servers in pool

The default observer is enabled by default.

There is high availability for the pool, because there are three voting members (2 MS + Default Observer)

If you disable the default observer, you will lose high availability for the pool.

 

Three management servers in pool

The default observer is enabled by default.

There is high availability for the pool, because there are four voting members (3 MS + Default Observer)

You can only have ONE management server down and still maintain the pool (greater than 50% rule): if two MS are down, only 50% of voting members remain, so the pool suicides.

The default observer in this case provides NO value.  It does not increase the number of management servers that can be down, therefore it does not increase pool stability.

You can consider removing the DO (Default Observer) in this scenario.

 

Four management servers in pool

The default observer is enabled by default.

There is high availability for the pool, because there are five voting members (4 MS + Default Observer)

You can only have TWO management servers down and still maintain the pool (greater than 50% rule): if three MS are down, fewer than 50% of voting members remain, so the pool suicides.

The default observer in this case provides significant value, because it increases the number of management servers that can be down.  Without the DO in this case, you’d only have 4 quorum members, which only allows for ONE to be unavailable.

 

Five or more management servers in pool

The default observer is enabled by default.

There is high availability for the pool, because there are 6 voting members (5 MS + Default Observer)

You can only have TWO management servers down and still maintain the pool (greater than 50% rule): if three MS are down, only 50% of voting members remain, so the pool suicides.

The default observer in this case provides NO value.  It does not increase the number of management servers that can be down, therefore it does not increase pool stability.

You can consider removing the DO (Default Observer) in this scenario.

 

One could argue that once you have 3 or more management servers in a pool, any “odd” number of management servers is a good case for removing the DO from the pool.  I’d also argue that once you hit 5 management servers, you are probably big enough that the database is under significant load (you wouldn’t typically have 5 management servers in a small environment).  When the database is under heavy load, the default observer might not perform well, and might experience latency in resource pool calculations/voting.
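
The scenarios above can be tabulated with a short sketch.  `Get-ToleratedFailures` is a hypothetical helper (not a SCOM cmdlet), and it assumes the database behind the default observer stays reachable, so only management servers are counted as going down.

```powershell
# Hypothetical helper (NOT a SCOM cmdlet): how many management servers can be
# down while more than 50% of the quorum voting members remain available.
function Get-ToleratedFailures {
    param(
        [int]$ManagementServers,
        [bool]$DefaultObserver
    )
    # The default observer contributes one extra vote when enabled.
    # Assumption: the database stays up, so that vote is always available.
    $voters = $ManagementServers + [int]$DefaultObserver
    # With N voters, at most floor((N - 1) / 2) can be unavailable.
    return [int][math]::Floor(($voters - 1) / 2)
}

# Compare pools of 1 to 5 management servers, with and without the DO.
1..5 | ForEach-Object {
    [pscustomobject]@{
        ManagementServers = $_
        DownWithDO        = Get-ToleratedFailures -ManagementServers $_ -DefaultObserver $true
        DownWithoutDO     = Get-ToleratedFailures -ManagementServers $_ -DefaultObserver $false
    }
} | Format-Table -AutoSize
```

The two columns differ only for 2 and 4 management servers, which matches the observation that the DO adds nothing to pools with an odd number of members.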

The way the default observer plays a role – is that each MANAGEMENT SERVER in the pool, queries its own local SDK service – which allows it to get data from the database.  There is a table in the SCOM Operations database for the default observer.  So if the SDK service is under load, or the database, we could experience latency that otherwise would not exist.

 

Gateways as resource pool members

 

Next – we should discuss the Gateway role as it pertains to Resource Pools.  Microsoft supports resource pool membership for Management Servers AND for Gateway servers.

For instance, a customer might monitor Unix/Linux servers in a firewalled off DMZ, or across a small WAN circuit where you want the agentless communication localized.  In this scenario, a customer might create dedicated resource pools for Gateways in those locations, to perform monitoring.

 

Single Gateway server in pool

The default observer is enabled by default.

There is no high availability, because the Gateway server is a single point of failure.

The default observer should NOT be used here, because Gateways do not have a local SDK service, therefore they cannot query the database.

 

Two Gateway servers in pool

The default observer is enabled by default.

One would THINK there is high availability for the pool, because there are two GW’s in the pool, right?  HOWEVER – that is NOT the case.  As we discussed above – we need three voting members to establish high availability for a pool.  Since the Default Observer is NEVER valid for a pool consisting of Gateways, there are only TWO members of this pool.  The pool will run, and will load balance workflows, but if either pool member goes down, the pool suicides.  In this case – you actually have WORSE availability than if you placed a single member in the pool!

In order to maintain high availability for a pool made of Gateways, you need to have THREE GW’s in the pool.

The default observer should NOT be used here, because Gateways do not have a local SDK service, therefore they cannot query the database.

 

Three Gateway servers in pool

The default observer is enabled by default.

There is high availability for the pool, because there are three voting members (3 GW)

You can only have ONE Gateway server down and still maintain the pool (greater than 50% rule): if two GW are down, fewer than 50% of voting members remain, so the pool suicides.

The default observer should NOT be used here, because Gateways do not have a local SDK service, therefore they cannot query the database.

 

 

Let’s take a minute and process this.

 

What we have learned, is that you should remove the DO from any pool comprised of Gateways.

You should consider removing the DO from pools when 5 or more Management Servers are present.
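
These two rules of thumb can be written down as a small decision helper.  This is a sketch only: `Get-DefaultObserverAdvice`, its parameters, and its return values are hypothetical names made up for illustration, and the function does not query SCOM; you would supply the member type and count yourself.

```powershell
# Hypothetical helper (NOT a SCOM cmdlet): encodes the rules of thumb above.
function Get-DefaultObserverAdvice {
    param(
        [ValidateSet('ManagementServer', 'Gateway')]
        [string]$MemberType,
        [int]$MemberCount
    )
    if ($MemberType -eq 'Gateway') {
        # Gateways have no local SDK service, so they cannot query the database.
        return 'Disable'
    }
    if ($MemberCount -ge 5) {
        # Large pools usually mean a busy database, which can add voting latency.
        return 'ConsiderDisabling'
    }
    return 'KeepDefault'
}

Get-DefaultObserverAdvice -MemberType Gateway -MemberCount 3            # Disable
Get-DefaultObserverAdvice -MemberType ManagementServer -MemberCount 2   # KeepDefault
Get-DefaultObserverAdvice -MemberType ManagementServer -MemberCount 5   # ConsiderDisabling
```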

If your pools are stable, and you aren’t having any problems with high availability, then this really doesn’t make much difference, which is why the defaults are set the way they are.

 

So we have talked about pool members, and the default observer…… but what about the “observer” role?

This role is really unique, and will not be used very often.  I cannot think of a single enterprise deployment where I have seen it used.  Generally speaking – if we are adding a dedicated observer for a pool (which is a management server or a GW server) then why not just make that server a full blown pool member?

There is only one scenario I can think of where this might be useful: a company with SCOM deployed in a datacenter that, in the SAME DATACENTER, has a DMZ with two gateways deployed because of firewall rules.  In this case, you could potentially make their parent management server a dedicated observer only, and this would work because TCP port 5723 is already open for Healthservice communication.  This is incredibly rare, and the best practice would be to just go ahead and plan for three Gateway servers in the DMZ.

 

Remember – for resource pool members – Microsoft supports Management Servers and Gateways.

For resource pool observers – the same, Management Servers and Gateways.

 

That said – I have done some testing making an *agent* a dedicated observer, such as the DMZ scenario above, and it does work.  The agent becomes a voting member for quorum, and high availability is created by this.  Microsoft didn’t plan or test this scenario – so it is technically unsupported.

Which got me to thinking – “what if I create a resource pool, and make its membership strictly agents”???

Well, that works too.  You cannot do this using the UI, but you can in PowerShell.  I create a resource pool of only agents, then set up URL monitoring to that pool, and high availability and load distribution worked great.  Again, not technically supported by Microsoft, but a unique capability nonetheless.

 

Lastly – I will demonstrate some PowerShell commands to work with this stuff.

 

To view the pools, their Default Observer status, and if they are Automatic or Manual:

$pools = Get-SCOMResourcePool
$pools | Format-List DisplayName,UseDefaultObserver,IsDynamic

 

To DISABLE the default observer for a pool:

$pool = Get-SCOMResourcePool -DisplayName "Your Pool Name"
$pool.UseDefaultObserver = $false
$pool.ApplyChanges()

 

To ENABLE the default observer for a pool:

$pool = Get-SCOMResourcePool -DisplayName "Your Pool Name"
$pool.UseDefaultObserver = $true
$pool.ApplyChanges()

 

To set a pool to MANUAL membership:

$pool = Get-SCOMResourcePool -DisplayName "Your Pool Name"
$pool | Set-SCOMResourcePool -EnableAutomaticMembership $false
$pool.ApplyChanges()

 

To set a pool to AUTOMATIC membership:

$pool = Get-SCOMResourcePool -DisplayName "Your Pool Name"
$pool | Set-SCOMResourcePool -EnableAutomaticMembership $true
$pool.ApplyChanges()

 

To add or remove Management Servers or Gateways from a manual pool:

$pool = Get-SCOMResourcePool -DisplayName "Your Pool Name"
$MS = Get-SCOMManagementServer -Name "YourMSorGW.domain.com"
# To add the server as a pool member:
$pool | Set-SCOMResourcePool -Member $MS -Action "Add"
# To remove the server from the pool:
$pool | Set-SCOMResourcePool -Member $MS -Action "Remove"

 

To add or remove Management Servers or Gateways as Observers only to a pool:

$pool = Get-SCOMResourcePool -DisplayName "Your Pool Name"
$Observer = Get-SCOMManagementServer -Name "YourMSorGW.domain.com"
# To add the server as an observer:
$pool | Set-SCOMResourcePool -Observer $Observer -Action "Add"
# To remove the observer:
$pool | Set-SCOMResourcePool -Observer $Observer -Action "Remove"

 

If you want to play with adding AGENTS as a resource pool member or observer (not supported) then simply change “Get-SCOMManagementServer” above – to “Get-SCOMAgent”

 

 

Credits:

A debt of gratitude to Mihai Sarbulescu at Microsoft for his guidance on this topic – he has forgotten more about Resource Pools than most people at Microsoft ever knew.

28 Comments

  1. peter

    Is it possible to define group members by resource pool alone? If server 1 is resource pool 1 for my Unix computers that use local accounts, and server 2 is resource pool 2 for my Unix computers that use AD-integrated accounts, I need to distribute the different Run As/action account credentials to the relevant servers/resource pools.

    Server naming conventions and OS versions are so mixed and similar that the only way I can define a group is by adding names one by one to the group, or by defining each object in the profile's Run As accounts. Creating groups based on their resource pool would be less messy and would make it easier to add and remove servers. Any ideas?

    • Kevin Holman

      To make sure I understand – you would like to create a Group, defined in a Management Pack, that contains Windows Computer objects, that are members of a specific Resource Group?

  2. Mark Van Doren

    Hi Kevin,

    We currently manage about 300 Linux servers in a resource pool comprised of 3 SCOM 2012 R2 management servers. We have 20 gateway hosts that we used to manage our Windows infrastructure. I was wondering if it was possible to create a resource pool comprised of gateway servers to monitor our linux environment. I’m not seeing any practical examples of this — other people seem to only use management servers in their Unix/Linux resource pool (as we do now). Is a resource pool with gateway hosts used to monitor unix/linux something that is supported in SCOM 2012 R2? I can’t seem to find any documentation around that scenario, and it seems as though the gateways don’t have a clean way to import/house the requisite certs. Thanks for any advice you may be able to provide!

  3. Mark Van Doren

    Right now we have 5 Management servers, 3 of which are dedicated to our unix/linux resource pool. We monitor ~1100 windows hosts, and ~300 linux hosts. The benefit of using the gateways for linux monitoring would be that it would solve a lot of firewall headaches for us due to their positioning in our infrastructure. If you have any documentation around creating a resource pool comprised of only gateway servers, and how they are configured in terms of importing the necessary certs/communicating with the management hosts for linux, that would be extremely helpful. Or, do you not need certs on the gateway hosts that manage linux, as the requisite certs are already installed on the management hosts that they would be sending their info through? We are running 2012 R2 with UR14.

  4. Alex

    Would be great if Microsoft would let Kevin put documentation up, because the official documentation was not helpful.

    For anyone authoring MPs that leverage resource groups for Gateway Pools: the key, as Kevin called out, is NOT to use the default observer with Gateways.

  5. Thug User

    Hi Kevin. I want to implement SCOM 2019 across 2 datacenters (primary/secondary), so I will have management servers (MS) and gateway servers (GS) in both DCs. It will have a SQL AlwaysOn DB cluster. Can I put MS and GS from both DCs in the same resource pool? I also want agents to automatically fail over if one datacenter fails. How can I achieve this?

    • Kevin Holman

      I do not recommend this. In general, I do not recommend putting management servers in more than one datacenter. SCOM management servers need to be VERY low latency to the database, and to each other. The more agents you have, the more important this becomes. Management servers should not spread a resource pool across datacenters either, in general. The exception is when you have a DR datacenter a VERY short distance away, and the network latency is VERY, VERY low (always under 5ms). Even then you will see degraded performance, as the MS in the remote datacenter has higher latency for SQL locks/writes/stored procedures.

      If the purpose is DR, I recommend using a replication technology to replicate your Management Servers to the DR datacenter, and then boot them up in the case of a disaster.

  6. Thug Usher

    Thanks Kevin. This is for two “primary” datacenters. Please can you advise on the best setup I can use using gateway servers and management servers. This is for 600 servers and 600 network devices

    • Kevin Holman

      Why do you feel you “need” anything in the “remote” datacenter? If the bandwidth is sufficient to remote DC, I wouldn’t put anything in there and manage all resources from the primary DC. Especially so for Windows Computers…. the difference between managing them remotely and managing them through a local Gateway is very, very small. Everything has to end up back in the database.

      If you have significant resources in the remote DC, AND you are concerned about SCOM consuming bandwidth, you potentially could save some bandwidth for monitoring the remote network devices by placing three gateway servers in the remote DC, and creating a resource pool for those network devices (removing the default observer). But honestly, I’d only do this if your network monitoring is consuming significant bandwidth.

      People automatically assume “I have a datacenter, there should be some monitoring infrastructure there” but many times, this thought process just increases complexity, lowers availability, and harms performance and stability.

      • Kiwifulla

        Our topology is similar to Thug’s. We have two primary DC’s that our SCOM MG spans. The reason we don’t just whack a Gateway out there instead of another Management Server is because we have a small team of admins spanning both sites, so it would be top heavy to set up a MG for each DC and have to socialise/implement every MP and config change ongoing across both MGs, effectively doubling all admin and infrastructure and there would no doubt be configuration drift as people do overrides etc. in one and not the other.

        Because of our company’s site topology, we need to be able to monitor all agents 24×7 from both sites and by the “one” logical on-call team. Basically if we lose one DC, we need the other to be able to carry on with the monitoring (appreciating that we might lose monitoring of the agents in the site that is down if they are in the same DC which not all of them are though) and of course the time to bring SQL back if the surviving site’s SQL node was not the active node within the AG (which is only a few minutes to kick in that process).

        Microsoft consulting/support ratified this design years ago, but the main concern I raised was 10 ms latency between the MS not in the same site as the active database server node. We use AG’s with a passive sync replica in the same site (results in zero SCOM outages during monthly patching) and an async replica in the other site.

        When I enquired about those “special” pool manager reg keys that should only be implemented if instructed by Microsoft, they said see how it behaves and only implement if needed (which we didn’t). We have been running this setup since SCOM 2012 and we don’t have issues *that often. I would say though that is probably because we only have about 750 agents. We run about 300 MP’s. We’ve even had a SCOM RaaS a few years later (before moving to SCOM 2016 and AG’s but still the same topology otherwise) which came back as no issues/healthy.

        We do planned DR failovers twice a year which involves making the async node the active node etc. and generally this has worked perfectly for years. I admit I have never liked the 10ms thing because *once every couple of months or so we get snapshot sync/delta alerts if everything is a bit busy, but it generally sorts itself out. I would like to remove that side of our topology though to be honest, so no doubt if we go to SCOM MI this problem will go away!

  7. Raoul

    Hi Kevin, We have some resource pools that have 2 Gateways. The purpose of those gateways is monitoring some servers in a different domain, that do not have a trust. Now I know this is even worse than 1 Gateway (from HA perspective). Now I was thinking, can I add a SCOM Management Server as a observer only? I am guessing that the observer doesn’t need to be in that same domain as the Gateways.

    • Raoul

      I did some testing on my own. I created a Resource Pool with only 2 gateways. No extra observers whatsoever. After stopping the “Microsoft Monitoring Agent” service on 1 of the Gateways, I get a Resource Pool heartbeat failure. But within a minute, this alert resolves itself, without me starting the agent or anything.

      So from that point, it looks like the pool didn’t suicide? Only when I stop the agent on both Gateways does the Resource Pool heartbeat failure alert stay active.

  8. Raoul

    It appears that when I disable the Default Observer, then the pool DOES suicide when I stop 1 Gateway. When the Default Observer is enabled, the pool stays alive. Now this sounds weird to me, because from what I understand, the Default Observer shouldn’t work, because the Gateway does not have an SDK service.

    I do have to point out, that in my testing environment, the Management Server and the Gateways are in the same domain. I am using SCOM 1801 here, did anything change perhaps? Or am I missing something here?

  9. Thom Davis

    For URL monitoring the sizing guide says “URLs monitored by each management server participating in the resource pool”. Does that REALLY mean only management servers can participate in a URL pool. I thought GTWs could do that too.

    • Kevin Holman

      Gateways can participate in a pool, and can easily do URL monitoring. However, they do not scale the same as a MS, because the GW does not have a local SDK and direct path to the database. A GW must write the output to its local queue, then send that data to a management server. This is what limits the scale of a GW somewhat when looking at agents, linux monitoring, or any agentless work like URL.

  10. Amanpreet Bansal

    Hello Kevin,
    You mentioned replication technology for a DR setup.
    Did you write any document about how to set up DR for SCOM 2016?
    Also, can we use DR in the cloud using Azure Site Recovery (primary on-prem and DR on Azure)? Any references will be highly appreciated.

  11. John

    Hello Kevin,
    Is it possible to create several Resource Pools for Unix monitoring, to have control over the workflows?
    We have some custom-developed MPs (Unix monitoring, already using cookdown), and in that case we could reserve the workflows generated by specific MPs for one specific pool, and distribute the other Unix monitoring workflows to the Management Servers of another Resource Pool.
    Sometimes we get errors (scripts are being dropped...) and we have noticed that each time, one of the two servers of the Unix pool had taken on most of the Unix workflows.

    Thanks

    • Kevin Holman

      If your resources are failing over – then I’d focus on pool sizing/load/stability.

      To answer your question – yes, you can distribute workflows to specific servers via specific pools, however those workflows must target a class that is hosted by that specific pool. When you target a UNIX/Linux Computer class, or any class hosted by it – this will follow the same path of pool hosting that the computer follows.

  12. Saravanan

    EventID: 26373 – The query resulted in # Rows
    Due to this the Resource Pool seems to be affected and throwing the alert for RP
    How do we fix this issue and where we have to start the troubleshooting?

  13. Pingback:SCOM Resource Pools, part 1: how & why to create one - Monitoring Stuff

  14. dwayne

    Kevin,

    an interesting dilemma for you…

    All Management Servers Resource Pool [Watcher] alerts.

    Issue is no management server agent isn’t responding that we can confirm. ie we only ever see the pool and watcher alert (too frequently). Scanning the event logs on all Managment servers (all 23 of them) shows no events for pools. default observer is disabled (that made it better) its only down for 5 min at most. If it wasn’t an event log monitor but a script based one, I’d assume it’s failing in a ‘failed’ state at random – still no idea why as we don’t have logs indicating that either. We won’t go into the noise 12 Managment servers really failing would cause either.

    any thoughts. This is a system that’s run for years with no issues, any pool alert prior was confirmable with a root cause, then it started doing this.

    I’ve made tasks etc. to aid in investigation, but they always come up clean on 100% of the servers.

    Regards
    Dwayne

  15. Larry Leblanc

    Cheers Kevin,

    OK, so clearly the ‘Default Observer’ (i.e. the OperationsManager database) serves a similar, if not identical function as a ‘witness’ does in the context of a Windows Server cluster. I have the following 2 questions:

    Q1: Can you confirm that the “Default Observer” plays a PASSIVE role in the quorum voting process? In other words, each MS/member collects an extra vote for their ‘team’ if it is able to communicate with the OperationsManager database, but has lost contact with 1 (or more) fellow pool members?

    I am basing this understanding on the following passage:

    “The way the default observer plays a role – is that each MANAGEMENT SERVER in the pool, queries its own local SDK service – which allows it to get data from the database. There is a table in the SCOM Operations database for the default observer. So if the SDK service is under load, or the database, we could experience latency that otherwise would not exist.”

    Q2: What happens if a ‘Default Observer’-enabled pool containing 2 members loses contact with each other, but not with the OperationsManager database? Would this not lead to a ‘split brain’-equivalent scenario, where both pool members simultaneously attempt to ‘take control’ of the workflows targeting the pool?

    Kind regards,
    Larry

    • dwayne

      Hi Larry,

      Q1: Yes, it has a vote. Kevin explains this above, and it works like the typical cluster quorum disk, for instance. This is to give HA to 2-server pools by being able to maintain a majority when one fails/stops/etc.

      To stop its vote (i.e. on odd-count pools, or gateway pools, etc.) you have to disable it. I don’t think there is a way to change its vote value as in clustering (which can be a dangerous thing if not done right, especially longer term if the environment grows).

      Q2: no its not like clustering in these regards. there is always an observer checking on this (ie a pool member)

      this can be seen if you trace the pool activity

      0]11840.30752::12/16/2022-09:24:28.216 [PoolManager] [] [Failover] :CClient::CheckAckReceived{Failover_cpp2674}UPDATING AVAILABILITY INFO for process ServerA.test.local, in pool with id , from observer ServerB.test.local, with timestamp 12/16/2022-09:24:27.631, process version , process counter , and alive status 1.

      if they can both reach the DB all the roles/pool allocations are in here and only one will be seen as down in a two server pool as the observer say ServerA can see itself but not ServerB so flags ServerB as UNAVAILABLE and the DB updated to reflect.

      [0]11840.30752::12/16/2022-09:24:28.216 [PoolManager] [] [Failover] :CPoolMember::UpdateAvailability{pool_cpp2979}CPool member ServerB.test.local with id in pool with id is now UNAVAILABLE.

      so clearly no split brain, one is marked down, one is working and if you have the default observer you’re still up

      Kevin please correct me if I’m wrong anywhere here. I can only go on observation.

  16. Diego pereira

    help good morning! I’m going to ask a question about pooling, but it’s a little off topic, I work with SCOM 2022 and had a network pool with more than 1000 devices. I created 2 new pools using different management servers. Delete the old pool and discover devices from our new pools. However, some devices appear to be duplicates in the monitoring/network device and one of them is gray and the other is healthy but in the administration/network the devices do not appear twice. Gray devices generate an alert and do not allow to be reset, when I tried to reset via powershell I received a message that there are no pools for the device. Could you help me with this problem?
