Patch Deployment Strategy and Scheduling

How do you schedule when updates will be applied to your machines? What deployment strategy and design do you use? Patching always carries a risk of accidental disruption, but that risk can be minimized through careful scheduling, planning and processes. This is partly a scheduling question and partly a deployment design question: which groups of users can control the schedule on their machines to some degree, which cannot, and how that control is implemented.

How do you develop a deployment strategy and schedule? You start by understanding the business needs of your users, and the different categories of users and machines that might exist. Not only are there important differences between scheduling workstations and servers, but there are also differences between groups of machines within each type.

One common strategy separates the updates from the reboots. Reboots are performed by users or scheduled independently. That leaves machines that have not yet rebooted vulnerable, possibly for an extended period. Products or OS components that are partially updated, and partially waiting for the reboot, may not operate properly because of the mix of file versions in place. A better approach is to assure that the reboot happens shortly after the updates, at a time that is appropriate to the particular machine.

The general analysis process is to identify groups of machines with common scheduling characteristics, determine how to identify specific machines in each group, and then develop a deployment strategy that meets the needs of each group with the simplest possible structure. You don't want to create ten different deployment packages if you can create two or three that meet all of the needs. You also may have exception workstations or servers that are not patched through automation that you want to address in the design. The following sections address workstations and servers separately, as each group has its own unique characteristics and issues.


Workstation Updates
Here are examples of categories of machines you might have:

  • You probably have a very large number of workstations that are only used during day shift, and can easily be updated and rebooted at night without disruption.
  • Workstations used by night operations staff definitely cannot be rebooted at midnight without causing problems.
  • Call centers that operate 24 hours a day will be disrupted no matter when you reboot them if you do all machines at once.
  • You may have users, such as insurance company actuaries, who have processes that run for several days. Rebooting their machines on the wrong night will certainly cause problems.
  • You also may have laptops that are frequently away during the deployment, and remote machines that have unique scheduling issues.

It's vital to understand how you can identify the machines in any group you're considering for special treatment. If they are in unique subnets, or are automatically placed in specific AD groups, that makes it easy. If you need to manage lists manually you must allow for the manual effort and some unknown degree of errors during a deployment. Machines will be replaced and you won't be informed. What would be the impact of such errors? Your deployment plan must minimize such potential impacts.
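Identification by subnet can be sketched in a few lines. The following is only an illustration in Python, not part of SMS: the subnet ranges, the group names and the deployment_group function are all hypothetical, and in practice the same rule would more likely be expressed as a collection query.

```python
import ipaddress

# Hypothetical mapping of subnets to deployment groups. The address ranges
# and group names are invented for illustration only.
SUBNET_GROUPS = {
    "10.20.0.0/16": "branch-overnight",
    "10.30.5.0/24": "call-center",
    "10.30.6.0/24": "night-operations",
}

# Machines that match no special subnet fall into the general group.
DEFAULT_GROUP = "user-scheduled"


def deployment_group(ip_address: str) -> str:
    """Return the deployment group for a machine based on its subnet."""
    address = ipaddress.ip_address(ip_address)
    matches = []
    for cidr, group in SUBNET_GROUPS.items():
        network = ipaddress.ip_network(cidr)
        if address in network:
            # Remember the prefix length so the most specific subnet wins.
            matches.append((network.prefixlen, group))
    if not matches:
        return DEFAULT_GROUP
    return max(matches)[1]
```

A replaced machine that lands in the same subnet is picked up automatically, which is the advantage over a manually maintained list.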

Companies like banks and retail chains are likely to have a large number of remote locations that can easily be identified by subnet and could easily be updated and rebooted during the night. It might be worth having one deployment process for such machines that gets them out of the way quickly, while addressing operations centers separately. That way you quickly get much of the needed protection, and the opportunity to identify and resolve patching issues with those machines, with minimal risk of disruption. Appropriate email warnings to users would take care of any individuals working late that night. Such machines will commonly have the same software configuration, or very few variations, so it's easy to test a sample group ahead of time to assure there are no conflicts.

The remaining machines have many different configurations and operating schedules. The deployment strategy I personally prefer allows users to decide when to apply updates and reboots within limits determined by security needs. This is easily done using the provisions built into the SMS Distribute Software Updates Wizard, as described in my article Settings for Distribute Software Updates Wizard. This allows the staff of a call center to distribute the reboots to avoid disruption, and machines used for night operations to be updated during idle times of day. Many other designs can work, though - the main thing is understanding the business needs and environment.

Any deployment strategy must also allow for handling emergency patches that must be applied in a very short time. Identify which categories of machines will be updated as groups under such circumstances, and assure that you are prepared to implement this with very little notice.

Server Updates
With servers, you first need to limit all updates to occur within the maintenance windows. You also need to control which servers are rebooted at the same time. Rebooting servers requires planning similar to that for workstations. Examples of categories you might have include:

  • Most servers probably can be updated and rebooted at the appropriate time without concern. The applications and services running on them will dependably shut down and restart properly.
  • Others are not as dependable, or the OS itself may be known to be unreliable on some machines. Any servers that can't be rebooted safely need to be identified and projects started to correct the problems. Until they are corrected, such machines require manual intervention when patching and rebooting.
  • Some servers might require running special scripts at shutdown or startup. Scripts might be created to allow previously unreliable machines to be rebooted safely if the scripts are run as part of the patching process.
  • Rebooting both machines in a cluster at the same time will cause problems. Note that SMS must be able to address each machine individually.
  • Other load balancing systems require similar treatment.
  • Rebooting all domain controllers at once is probably not a good idea either.
  • Rebooting the SMS servers while other servers are copying patch updates from them would also cause problems.
  • Other groups of servers that must be done in stages might be identified by your server support teams.
  • If you use Virtual Machine servers hosted on a server running a Microsoft OS, you need to coordinate updating and rebooting the guest and host systems to avoid double disruptions.
  • You may need or want to warn server support staff who are working on a server when updates are going to be applied and the machine rebooted. This could be done through separate scripts or using interactive updates as for workstations, but with different countdown settings and time constraints. The staff could be advised how to prevent updates from running while they are performing critical emergency maintenance.
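The staging requirement for clusters, load balancing groups and domain controllers comes down to one rule: no two machines that must be coordinated may reboot in the same window. Here is a minimal Python sketch of that rule. The assign_reboot_waves function, the machine names and the group names are all hypothetical, and the wave numbers would still have to be mapped to real maintenance windows.

```python
from collections import defaultdict


def assign_reboot_waves(servers):
    """Assign each server a reboot wave (0, 1, 2, ...).

    servers is a list of (machine_name, coordination_group) pairs, where
    coordination_group is None for servers that can reboot at any time.
    Members of the same coordination group never share a wave.
    """
    next_wave = defaultdict(int)  # next free wave number for each group
    waves = {}
    for name, group in servers:
        if group is None:
            # Independent servers can all go in the first wave.
            waves[name] = 0
        else:
            waves[name] = next_wave[group]
            next_wave[group] += 1
    return waves


# Hypothetical inventory, for illustration only.
example_servers = [
    ("WEB01", "web-load-balancer"),
    ("WEB02", "web-load-balancer"),
    ("DC01", "domain-controllers"),
    ("DC02", "domain-controllers"),
    ("FILE01", None),
]
example_waves = assign_reboot_waves(example_servers)
```

With this input, WEB01 and DC01 land in wave 0 and WEB02 and DC02 in wave 1, so at least one member of each group stays up while the other reboots.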

Servers in your DMZ present a very special case. These usually are the first to be patched, after necessary testing is completed, because they are the most vulnerable. Security issues mean these machines can't be part of the normal SMS configuration, so they may be patched manually or through some other process.

Most of the various reboot scenarios can probably be handled by some combination of silent updates with reboots, silent updates without reboots or interactive updates that would be managed by one of the server support team members. You might be able to simplify the SMS setup by creating a DSUW package that updates without rebooting, and using a separate SMS package to perform the reboots. That risks conflict with the emergency maintenance scenario, however, if updates are installed while an administrator is trying to reinstall some component.

Scheduling requires creating one collection for each time period. If several reboot scenarios are used you may need sub-collections or separate collections by both schedule and package type.
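To see how quickly the number of collections grows, it helps to list the distinct combinations of schedule and reboot type. This Python sketch does that for a hypothetical inventory. The build_collections function, the schedule codes and the reboot type names are assumptions made for illustration, not SMS terms.

```python
from collections import defaultdict


def build_collections(servers):
    """Group servers into one collection per (schedule, reboot type) pair.

    servers is an iterable of (machine_name, schedule_code, reboot_type).
    Returns a dict mapping (schedule_code, reboot_type) to a sorted list
    of machine names.
    """
    collections = defaultdict(list)
    for machine, schedule_code, reboot_type in servers:
        collections[(schedule_code, reboot_type)].append(machine)
    return {key: sorted(members) for key, members in collections.items()}


# Hypothetical inventory: schedule "1" and "2" stand for two time periods.
example_collections = build_collections([
    ("WEB01", "1", "auto-reboot"),
    ("WEB02", "2", "auto-reboot"),
    ("SQL01", "2", "interactive"),
    ("FILE01", "1", "auto-reboot"),
])
```

Four servers already produce three collections here, which is a good argument for keeping the number of schedules and reboot scenarios as small as the business needs allow.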

An easy way to manage this would be creating a new SQL table that has machine name, a management group name, a one-byte schedule code representing when the updates would occur, and a reboot code indicating automatic reboot, no reboot, interactive, or run a script. The management group would be used for management reporting. All machines that require coordinated scheduling, such as a load balancing group or the domain controllers, would share one group name, and each such set of machines would have its own name. If the scheduling is managed by other teams, the leading portion of the group name could identify the team.
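One possible shape for such a table is sketched below. It uses Python's built-in sqlite3 module with an in-memory database purely so the example is self-contained; a real implementation would more likely live in SQL Server next to the SMS data. The table name, the column names and the reboot code letters (A, N, I, S) are all assumptions.

```python
import sqlite3

# Hypothetical schema. Reboot codes are assumed to be:
#   A = automatic reboot, N = no reboot, I = interactive, S = run a script
SCHEMA = """
CREATE TABLE server_patch_schedule (
    machine_name     TEXT PRIMARY KEY,
    management_group TEXT NOT NULL,
    schedule_code    TEXT NOT NULL CHECK (length(schedule_code) = 1),
    reboot_code      TEXT NOT NULL CHECK (reboot_code IN ('A', 'N', 'I', 'S'))
)
"""


def create_schedule_db():
    """Create an in-memory database holding the empty schedule table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


conn = create_schedule_db()
# Invented sample rows; the "WEBTEAM-" prefix identifies the owning team.
conn.executemany(
    "INSERT INTO server_patch_schedule VALUES (?, ?, ?, ?)",
    [
        ("WEB01", "WEBTEAM-load-balancer", "1", "A"),
        ("WEB02", "WEBTEAM-load-balancer", "2", "A"),
        ("SQL01", "DBTEAM-cluster", "3", "S"),
    ],
)
conn.commit()
```

The CHECK constraints reject an unknown reboot code at insert time, which reduces how much invalid data the reports have to catch later.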

You'd then create an SMS report that summarizes the update scheduling by group, linked to a detail report that lists each member of the selected group with its specific values. A report must also show all servers that don't have data in this table or have invalid data. You might create a web update form that allows authorized people to update this data. The form should also record the ID of the person who last updated a record, with the date and time, in the SQL data. That information would be displayed in the detail report.
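The exception report is the piece most likely to be forgotten, so here is a rough Python sketch of its logic. The schedule_exceptions function, the set of valid schedule codes and the reboot code letters are hypothetical; in SMS this would be a report query joining the server inventory to the scheduling table.

```python
# Hypothetical valid values, for illustration only.
VALID_SCHEDULE_CODES = {"1", "2", "3"}
VALID_REBOOT_CODES = {"A", "N", "I", "S"}


def schedule_exceptions(known_servers, schedule_rows):
    """List servers with missing or invalid scheduling data.

    known_servers is an iterable of machine names from the inventory.
    schedule_rows maps machine name to (schedule_code, reboot_code).
    Returns a sorted list of (machine_name, problem) pairs.
    """
    problems = []
    for machine in known_servers:
        row = schedule_rows.get(machine)
        if row is None:
            problems.append((machine, "no scheduling data"))
            continue
        schedule_code, reboot_code = row
        if schedule_code not in VALID_SCHEDULE_CODES:
            problems.append((machine, "invalid schedule code"))
        if reboot_code not in VALID_REBOOT_CODES:
            problems.append((machine, "invalid reboot code"))
    return sorted(problems)


# Invented data: FILE01 was never entered, SQL01 has a bad reboot code.
example_problems = schedule_exceptions(
    ["WEB01", "SQL01", "FILE01"],
    {"WEB01": ("1", "A"), "SQL01": ("2", "X")},
)
```

Running the check against the full server inventory, rather than against the table alone, is what catches newly built or replaced machines that nobody has scheduled yet.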

Baseline (Catchup) Patching
Assuring that all machines are fully patched even after applications or OS components are installed or reinstalled is a critical part of the strategy. Without that, every time an application such as an Office component is installed, or a server administrator uninstalls and reinstalls a failing software component, you might be creating a vulnerability.

This has been addressed in a separate article, even though it must be part of the overall strategy, because of the complexity of the issues this raises. See Baseline Patching.

Comments

# Patch Management Process Overview

Friday, September 28, 2007 4:32 PM by Steve Pruitt at myITforum.com

I've seen several postings recently from people just getting into SMS patching that don't know
