We have 3 primary sites under a CAS (bad, but we have no choice with so many clients). Because we also have Nomad, we don’t care where clients get assigned. We care only that each site has roughly the same client count as the others. But we drifted about 30K clients too many on one site and simply made use of CM12 R2’s function to move clients. So we moved them to level set the count.
The downside, and we knew this, was that each client would have to do a full inventory and SUP scan. That’s a lot of traffic but we’ve done this before without issue. But this time we melted the SUPs with many full scans. And the wonderful Rapid Fail detection built into IIS decided to protect us by stopping our WSUS App pool. Late at night.
Now in CM12 post SP1 (we’re on R2), clients make use of the SUPList which is a list of all possible SUPs available. Clients find one SUP off that list and stick to it. They never change unless they can’t reach their SUP after 4 attempts (30 minutes between each – the 5th attempt is to a new SUP). Well with the app pool off, all clients trying to scan would fail and start looking for new SUPs. A new SUP means a full scan. A full scan from 110K clients is far worse than from just 10K when we’re moving things. Needless to say our SUPs were working very hard the next morning to serve clients. On a normal day the NIC on one of our SUPs shows about 1Mbps of traffic, but after starting the WSUS App pool we were at over 850Mbps going out per SUP.
Disabling Rapid Fail is one nice fix to help keep that app pool from stopping, but we also increased the default RAM on that from 5GB to 20GB (the SUPs have 24GB so we were clearly wasting most of that). I know of another company who has 85K clients on 2 SUPs who boosted their RAM from 24 GB to 48 GB to help IIS serve clients. Another option is to add more SUPs, but RAM is probably cheaper than another VM. This default Private Memory Limit is 5GB, so for those of us weirdoes with lots of RAM, it makes sense to crank this up if you can. We actually did this long ago, but we’re thinking the Server 2012 R2 upgrade over Server 2012 wiped our settings out.
By the way, the obvious ‘treatment’ during such a meltdown is to throttle IIS. We set our servers down to 50 Mbps and the network team was happy; your setting will vary based on client count and bandwidth. Our long term insurance here will be QoS. UPDATE: Jeff Carreon just posted a tidbit on how to throttle quickly in case of an emergency using PowerShell.
So how do we keep our settings? We ask Sherry who knows DCM! Read more on her CIs to enforce our settings here.