I was at a customer site recently where they were simulating a data centre failure as part of their Disaster Recovery plan. They replicate the Site server as a VM to another data centre in real time and fail over to it when the main Site server is taken offline. During the testing they came across a problem with Clients sticking to Distribution Points that had been made unavailable as part of the Disaster Recovery exercise. In reality the Distribution Points should also be replicated and switched over, but this wasn’t done. The problem was that the Clients would not install anything that had already been deployed to them but not yet installed: the deployments were on the Client, but the Application content had not yet been downloaded.
This Distribution Point stickiness is a well-known feature of ConfigMgr going back a long time. The Client selects a Distribution Point containing the content it requires from the provided list of Distribution Points and tries to contact it. At that point the Client realises the Distribution Point is down, notes it as a recoverable error as opposed to an unrecoverable error, and retries the Distribution Point for up to 8 hours before moving on to the next Distribution Point in the list. This is to give the administrators a chance to recover the Distribution Point. If the next Distribution Point is also down, there’s another long wait, and so on until an available Distribution Point is found and downloads begin. This Client had 3 Distribution Points and unfortunately the two failed Distribution Points were at the top of the list, resulting in a 16 hour wait before content for existing Deployments could be downloaded and the applications installed. Obviously the customer wanted to get around this quickly.
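To put a number on that wait, here is a rough Python sketch of the arithmetic. It is only a simplified model of the behaviour described above, not the Client’s actual logic: it assumes the full 8 hour retry window is spent on every offline Distribution Point, and the function name is made up for illustration.

```python
def worst_case_wait_hours(dp_online, retry_hours=8):
    """Hours spent retrying offline DPs before the first online one is reached.

    dp_online: list of booleans in the order the Client tries the DPs
               (True = online). Returns None if no DP is online at all.
    """
    waited = 0
    for online in dp_online:
        if online:
            return waited
        waited += retry_hours  # assume the full retry window is used up
    return None


# Two failed DPs ahead of one healthy DP, as in the customer's exercise
print(worst_case_wait_hours([False, False, True]))  # 16
```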
There are a couple of workarounds for this:
- Boundary change on the Site server if it is still available
- Robert’s special magic!
If you remove the failed Distribution Point from a boundary group, the Client will ignore the unavailable Distribution Point and go to a Distribution Point that is still servicing the Client’s boundaries. It’s a very simple solution: it requires a change in the ConfigMgr Console to remove the Distribution Point from the Boundary Group (or from specific boundaries if you have configured the Site that way) and to add the boundary to the Site system hosting the available Distribution Point. Once DR is complete you’d have to change all this back to how it was before the outage took place. This solution requires several “moves” so to speak. It’s not much effort and it works, but I like solutions that are simple and contain fewer moves.
Here’s an example of a Boundary group where you would remove the unavailable Distribution Point:
Once the Clients talk to the Management Point again they will update the list of Distribution Points that service their boundary, ignore the Distribution Point that is no longer servicing them and select a new Distribution Point to retrieve the content from. The Client checks for policy based on the Client Settings (Client Policy) applied to it, with the default being every 60 minutes. Client Settings in ConfigMgr 2012 can be defined on a per-Collection basis rather than being a site-wide setting as in ConfigMgr 2007, so depending on how you have set this up the Client will do a policy update based on either the default value or any custom settings that you have applied.
In theory you could also change the existing Site system containing the unavailable Distribution Point from fast connection to slow connection so that the available Distribution Point servicing the Client’s boundaries takes precedence. I haven’t tested this, but it seems viable, it is simple and it has far fewer moves. I’ll test it at some point to see if the Client behaves itself and that this is a workable solution. Like the boundary change, it assumes the Primary is available for the administrator to make the changes.
Changing boundaries assumes the Primary is still available, the SMS Provider is accessible and the Console can establish a session to actually make the changes. If the Primary is unavailable but some Distribution Points that could service the Client for content are online, the Client will hang around for 8hrs+ per unavailable Distribution Point until it finds an available one, and as an administrator there is nothing you can do until the Primary is back in operation. I’d hazard a guess that once the Primary is back, the recovery of the Distribution Points will not be far behind it. In this scenario, where the Primary has not returned to operational status, you need a different way to motivate the Client into moving on to the next Distribution Point in its list until it finds an available one. This is where the Magic comes into play!
Well it’s not really magic, I just wanted to keep you reading and get you down to here where the fun really begins!
If the Site is unavailable, or you have a streak of derring-do in you and you’d like to leave the Site server’s configuration intact, you’ll have to think outside the box. Here’s the solution I came up with, and it’s a pretty simple one: you make a change in one place outside of ConfigMgr and in under an hour the Clients will select the online Distribution Point and proceed to download content. That kind of qualifies as magic!
How we achieve this is by changing the A record for the unavailable Distribution Point in DNS so that it points at a member server that is available. This tricks the Client into recognising the Distribution Point as being in an unrecoverable state, and thus, it will move onto the next Distribution Point in the list. Brilliant.
Below, I’ve produced some screenshots to assist with explaining the steps involved to achieve this result, but the shots are not taken from an environment where the problem is being reproduced fully. Instead, I’ve just taken shots of the relevant dialogues and shots of actual real logs that came from the client (their details stripped and replaced with some names of servers in my lab), so don’t worry if in the shots you don’t see Distribution Points listed and such stuff. Anyway if you are expecting a hand-holding through this solution then I think you’d better not continue, you should be familiar with all the moving parts involved and be able to understand this fully end-to-end.
Let’s go over what will happen:
- Administrator changes the Distribution Point’s A record in DNS to point at a different IP address
- DNS replication takes place if there are multiple DNS servers in the environment
- The Client’s local DNS Cache eventually expires the A record and retrieves an updated one from its assigned DNS server
- The ConfigMgr Client retries the Distribution Point and gets an unrecoverable error instead of a recoverable error, so no further waiting will occur
- The ConfigMgr Client immediately moves on to the next Distribution Point, where it either downloads the required content or, if that Distribution Point is also unavailable, repeats the same process
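The steps above can be sketched as a small Python simulation. This is a toy model of the behaviour as described in this article, not the real ConfigMgr Client code: the names `DP01` to `DP03`, the `Probe` enum and the `select_dp` function are all hypothetical, and DNS replication and TTL latency are deliberately left out.

```python
from enum import Enum


class Probe(Enum):
    OK = "ok"
    RECOVERABLE = "recoverable"      # no answer at all: Client keeps retrying
    UNRECOVERABLE = "unrecoverable"  # wrong host answered: Client moves on


def probe(dp, online, repointed):
    """Classify what the Client sees when it contacts a DP (assumed model)."""
    if dp in online:
        return Probe.OK
    if dp in repointed:
        return Probe.UNRECOVERABLE
    return Probe.RECOVERABLE


def select_dp(dp_list, online, repointed, retry_hours=8):
    """Walk the DP list in order; return (chosen DP or None, hours waited)."""
    waited = 0
    for dp in dp_list:
        result = probe(dp, online, repointed)
        if result is Probe.OK:
            return dp, waited
        if result is Probe.RECOVERABLE:
            waited += retry_hours    # the Client sticks for the retry window
        # UNRECOVERABLE: no waiting, fall through to the next DP
    return None, waited


dps = ["DP01", "DP02", "DP03"]
print(select_dp(dps, online={"DP03"}, repointed=set()))             # ('DP03', 16)
print(select_dp(dps, online={"DP03"}, repointed={"DP01", "DP02"}))  # ('DP03', 0)
```

The second call models both failed Distribution Points having their A records repointed at a live server, which removes the wait entirely in this simplified picture.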
Let’s go over how the Client handles DNS. The DNS Cache on every OS is managed automatically for us. Each entry in the local DNS Cache has a Time To Live (TTL), and once the TTL has expired the A record is removed from the DNS cache on the Client and refreshed from the assigned DNS server. This works in our favour but also introduces some latency into the solution.
Here’s a shot of my DNS server’s Zone properties showing the TTL. As you can see I have retained the MS default TTL of 60 minutes (3,600 seconds):
When my Client first looks up this Distribution Point’s A record, it retrieves the record, puts it in the local DNS Cache and counts down the TTL until it has expired.
Here’s a shot of a record in the local DNS Cache and as you can see it has 1,197 seconds left before the A record is flushed from the local DNS Cache and replaced with an updated A record from the assigned DNS server:
I flushed my DNS cache then pinged the FQDN to populate the local DNS Cache for this screenshot. You flush the cache using IPCONFIG/FLUSHDNS and you view the local DNS Cache using IPCONFIG/DISPLAYDNS.
Notice the Time to Live (TTL) is displayed in seconds and repeated rendering using IPCONFIG/DISPLAYDNS will show the value decrementing. When the TTL has expired contact is made to the assigned DNS server(s) and a refreshed A record will be retrieved.
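If it helps to see why the changed A record doesn’t show up straight away, here is a minimal Python sketch of a TTL-based cache. It is a toy model, not how the Windows DNS Client service is implemented; the class name, the `dp01.example.com` host name and the 192.0.2.x addresses are all made up for illustration, and time is passed in as plain seconds.

```python
class TtlDnsCache:
    """Toy model of a local DNS cache that honours a per-record TTL."""

    def __init__(self, lookup, ttl_seconds=3600):
        self._lookup = lookup   # function name -> IP, stands in for the DNS server
        self._ttl = ttl_seconds
        self._entries = {}      # name -> (ip, expiry time in seconds)

    def resolve(self, name, now):
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]     # still cached: the DNS server is not asked
        ip = self._lookup(name)  # expired or missing: refresh from the server
        self._entries[name] = (ip, now + self._ttl)
        return ip


zone = {"dp01.example.com": "192.0.2.10"}      # hypothetical DNS zone
cache = TtlDnsCache(zone.__getitem__)

first = cache.resolve("dp01.example.com", now=0)     # populates the cache
zone["dp01.example.com"] = "192.0.2.50"              # admin repoints the A record
stale = cache.resolve("dp01.example.com", now=2403)  # 1,197 seconds of TTL left
fresh = cache.resolve("dp01.example.com", now=3600)  # TTL expired, record refreshed
print(first, stale, fresh)  # 192.0.2.10 192.0.2.10 192.0.2.50
```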
If you wanted you could expedite this process by manually flushing the local DNS Cache, but this isn’t very practical on many Clients. To flush the local DNS Cache use IPCONFIG/FLUSHDNS; the Client’s DNS Cache will be flushed and future name lookups will retrieve the latest A record from the assigned DNS server.
In theory you could create a Package\Program containing no content, just a command line to flush the local DNS Cache, then deploy it to your Clients so that they flush the DNS cache quickly. This assumes the Primary is available, and it requires a lot of back-end fiddling: new objects in ConfigMgr and a new deployment to all affected Clients. It is a viable way to speed things up if you need to, but the cost in administrative time, the assumption that the Primary is available and the latency in Clients picking up new Policy may make this excessive, and by the time the Deployment fires on the Client the TTL may be expiring anyway. If your Clients check for new Policy every 15 minutes this is doable I guess.
Here’s a shot of the A record for my Site server in the active DNS Zone for my domain. The A record is not from a Distribution Point, but as I said these shots are there simply to point out the areas to head to:
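A back-of-the-envelope way to judge whether a flush deployment is worth the effort is to compare worst cases. The Python sketch below is only that comparison under simplifying assumptions: it ignores DNS replication time and the time the Program takes to run, and the function name is hypothetical.

```python
def worst_case_minutes(ttl_minutes, policy_minutes=None):
    """Worst-case minutes until a Client sees the new A record.

    Without a flush deployment the Client waits out the cached TTL.
    With one (policy_minutes set) it waits for whichever comes first:
    the next policy poll or the TTL expiring on its own.
    """
    if policy_minutes is None:
        return ttl_minutes
    return min(ttl_minutes, policy_minutes)


print(worst_case_minutes(60))      # 60: just let the TTL expire
print(worst_case_minutes(60, 60))  # 60: default policy interval, nothing gained
print(worst_case_minutes(60, 15))  # 15: a 15 minute policy interval helps
```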
You will see that the IP Address is pointing at the actual IP address of the unavailable Distribution Point.
It’s pretty obvious where you’d make the change of IP address, but just to ram it home here’s another screenshot:
The IP address you use must represent a server that is online otherwise you’ll remain pinned to the unavailable Distribution Point.
Once you have changed the IP address for the unavailable Distribution Point the A record will be replicated to the other DNS servers in your environment. If you have a lot of DNS servers this could also introduce latency into the solution while you wait for the A record to change on the DNS server the client is using for name resolution. You could in practice determine which DNS server or servers the Clients are using and make the change right there, then let it replicate from there out to the other DNS servers to speed things up.
After the TTL has expired for the A record in the Client’s local DNS Cache, the A record is refreshed and the changed IP will be shown if you use IPCONFIG/DISPLAYDNS.
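If you’d rather check this from a script than eyeball the cache, something along these lines would do. It is a minimal Python sketch using the standard `socket.getaddrinfo` call, which asks the operating system’s resolver; the helper names, the `dp01.example.com` host name and the 192.0.2.x addresses are invented for illustration, and the example run uses a stand-in resolver so it doesn’t depend on any real DNS.

```python
import socket


def resolved_ipv4(fqdn, resolver=socket.getaddrinfo):
    """Return the sorted IPv4 addresses the resolver hands back for fqdn."""
    infos = resolver(fqdn, None, socket.AF_INET)
    # Each result is (family, type, proto, canonname, (address, port))
    return sorted({info[4][0] for info in infos})


def points_at(fqdn, expected_ip, resolver=socket.getaddrinfo):
    """True if fqdn currently resolves to expected_ip on this machine."""
    return expected_ip in resolved_ipv4(fqdn, resolver)


def fake_resolver(host, port, family=0):
    """Stand-in for getaddrinfo that always answers with the repointed IP."""
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("192.0.2.50", 0))]


print(points_at("dp01.example.com", "192.0.2.50", resolver=fake_resolver))  # True
print(points_at("dp01.example.com", "192.0.2.10", resolver=fake_resolver))  # False
```

On a real Client you would call `points_at` with the Distribution Point’s actual FQDN and the replacement IP, leaving out the `resolver` argument; it raises `socket.gaierror` if the name doesn’t resolve at all.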
OK, that’s the infrastructure changes made. Now it is entirely on the ConfigMgr Client to realise that the Distribution Point is in an unrecoverable state.
Once the A record has been updated, the Client will try to contact the unavailable Distribution Point, but this time it will get a response from the member server using the specified IP address. When the Client attempts to connect to this IP address there will be a conflict because the name of the server isn’t correct. This induces an unauthorised error back to the Client, which tricks it into treating this as an unrecoverable error.
Here’s a shot of the list of Distribution Points containing the Content the Client wants. Notice we have three Distribution Points listed, two of which, 01 and 02, are down\unavailable\off the network (in reality they were just VMs switched off for this simulated DR exercise), while 03 is our available Distribution Point:
Here’s a shot of the Client seeing that the Distribution Point returned an unauthorised error which induces the unrecoverable error we need to unpin from that Distribution Point:
At this point the magic has worked, we have forced the Client to note the Distribution Point as being in an unrecoverable state due to the unauthorised error being returned!
Here’s a shot of the Client immediately going to the next Distribution Point in the list. This Distribution Point is also down. Rinse and repeat as they say:
And here’s a shot of the Client moving onto the third Distribution Point in the list, this one is online, notice the lack of errors, this is an available Distribution Point and content transfer will begin immediately thereafter:
So, by making a change to just the A record for the unavailable Distribution Points we have unstuck the Client from the failed Distribution Points and forced it to move onto the available Distribution Point. We have effectively controlled default behaviour, behaviour we had no control over before without making changes on the Site server with regards to Boundary\Boundary Groups and Site systems containing Distribution Point roles that service those boundaries.
This solution allows us to step in and take control even if the Primary is offline and we are unable to make any changes to boundaries. If this was the case, without the magic above we’d have to recover the Primary to make the changes to the boundaries.
I guess I should sign off with the usual disclaimers that this is Magic and sometimes Magic can burn, so always go through this in a lab, in dev, pre-prod, whatever you have. If you have tens of thousands of Clients involved, there will be a lot of traffic generated towards the replacement IP address entered into the A record for the unavailable Distribution Points. If the Primary is still online it makes sense to enter its IP address into the unavailable Distribution Points’ A records, so as to not interfere with production services on a member server you might otherwise choose for the IP address needed above.
I should also thank Jon Rumens for letting me tinker with his DEV environment to produce this solution. I’d also like to call out this guy, Carlitog, who posted on the System Center 2012 Configuration Manager forums a solution for handling a boundary change if the Primary is down but you use remote SQL and have another SMS Provider available to leverage for making changes via PowerShell. Check out the thread here.
I hope this article is helpful for you, or at least insightful.