It's surprising how many organizations don't plan well for change. Change Control is a well-known process, one that is well defined in many different frameworks (ITIL, the ISO 27000 series, and NIST, for starters). Yet many organizations plan changes over coffee and a napkin (or a Visio diagram on a good day). This almost always results in figuring out problems during the change (I don't know about you, but the less 1am thinking I need to do, the better off I am!), conflicting changes, or changes that just plain don't work and need to be backed out in a panic.
IT is supposed to be a support group for the larger organization, so we should be striving for planned changes. From the point of view of the larger business, we want system maintenance and upgrades to be as close to a non-event as we can possibly make it.
So, how do most IT organizations make this happen? In most cases, with a formal change control process.
The Change Control Form:
The Change Control Form highlights a lot of the things that need to be done for a formal change control process.
You need a plan for each change. This should include the steps along the way to implement the change, and how long they might take. For network changes, I'll often include the actual CLI changes I'll be making on the router or switch, for example. Doing a good job in this section can really streamline the change when the time comes - many of my changes end up being all cut/paste operations. The last thing I need to do in a 1am change window is figure stuff out - the less thinking, the better!
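If the same change is going to several devices, the cut/paste commands can even be generated ahead of time from a template. Here's a minimal sketch of the idea - the device names and VLAN values are made-up examples, not from any real change:

```python
# Generate per-device CLI snippets for a change plan ahead of time,
# so the 1am change window is pure cut/paste.
# The device names and VLAN values below are hypothetical examples.

TEMPLATE = """\
! --- change plan for {host} ---
configure terminal
vlan {vlan}
 name {vlan_name}
end
write memory
"""

devices = [
    {"host": "core-sw-01", "vlan": 210, "vlan_name": "VOICE-2F"},
    {"host": "core-sw-02", "vlan": 210, "vlan_name": "VOICE-2F"},
]

# Print each rendered snippet; paste the output into the change form.
for dev in devices:
    print(TEMPLATE.format(**dev))
```

The rendered output goes straight into the change plan section of the form, so the reviewers see exactly what will be typed.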
Who is the change owner? Every change needs to be "hung around someone's neck", so that if there's a resulting problem, the helpdesk or management knows who to call the next day.
You need to indicate that you (or someone) has tested this change.
A good estimate on how long the change will take is almost always required.
A backout plan is a MUST. If you've ever had a change go south on you, you have an appreciation for how useful a step-by-step backout plan can be! Be sure to have a time estimate for this as well.
The business will want to know how long any affected systems will be affected. Be sure to include the backout time in your outage estimate, and also indicate which other systems will be affected.
The business will want to know what is being updated, but more importantly, what *else* will be affected by your change. For instance, updating a core router cluster might result in "all systems will be offline". Or, if your cluster is built in a different way, the answer here might be "no systems impacted".
Often you'll need to indicate which specific user groups need to be notified of changes - filling this in is often a cooperative effort between the change owner and the management team.
I have an example change control form that I'll often use as a starting point when a client needs change control - you can find it here ==> http://isc.sans.edu/diaryimages/files/change%20control%20form%201a.doc
The Review Process:
Every change to the production system should be reviewed. This isn't necessarily so that everyone gets a chance to tell you that there's a better way to accomplish what you're doing (though this can sometimes happen); it's more to ensure that you don't have conflicting changes, or too many changes at once.
Conflicting changes are a real problem in larger IT groups. Back in the day, I had a set of server updates scheduled with a client. The networking team had router upgrades booked that same weekend, so all of our server updates got pre-empted, halfway through the process (oops). Needless to say, that client now has weekly change control meetings.
Too much change is another common problem. If you've got a high volume of changes going through in a short period of time, it's often tough to keep track of which changes might be impacted by which other changes. It's also easy to let things like documentation slide if there are a lot of changes planned, with the good intentions of "catching up on that later" (we all know how this ends!).
Anyway, the common practice is to have weekly change control meetings, which have to be attended by a rep from each major group in the department. Each change is there for a yes/no - this meeting is not the place to hash out exactly how anything might be done or done better. If those discussions are still in play, then the change should be denied until there is agreement on these points. Changes should be distributed to each member of the group in advance to prevent issues like this.
Workflow tools like SharePoint or bolt-on products for email are a common way to deal with this approval process, but these are essentially "paving the cowpath" tools - plain old paper or simple emails can do a good job here too.
You'll always have emergency changes - if you've got a known problem and the fix must go in before your next change control meeting or you risk a service interruption, for instance. Or you need a quick network change before that new application can be rolled out (one that nobody thought of until the last minute). Normally there is a fast track for emergency changes of this type, where a small quorum of change approvers can green-light an emergency change. Note that this gets your change in quickly, but should NOT get you off the hook for the rest of the process - you still need to present it at the next change control meeting so everyone *else* knows what's been done, do the documentation updates, assess the change for violations of regulatory compliance or security policies, and all the rest. What the emergency process gives you is a special dispensation to do all the other process work after the change is in.
Everyone hates documentation. Even I do, and I'm the guy telling folks they should keep their docs up to date. But seriously, if you change your environment, you should reflect that in your documentation. I recently had to restore a firewall from 12-month-old documentation, then dig the changes out of 12 months of correspondence in the sysadmin's inbox - this was NOT fun.
The easy way to do this is to automate your documentation. If you can script data collection, so that your critical infrastructure documentation is updated automatically, you can schedule it as a daily process. Not only does this keep your docs up to date, you can then script "diff" reports and alert on changes. This is a great way to report to your IT management on changes, both approved (good changes) and unapproved (bad changes).
There are lots of products like this (CatTools springs immediately to mind - I'm a network person), but you can easily write your own scripts as well, for network gear, SANs (the storage kind), Windows or Linux hosts, or anything that supports scripting really (more here: http://www.sans.org/reading_room/whitepapers/auditing/admins-documentation-hackers-pentest_33303 - and there are tons of other resources and tools on the net to help!)
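The daily diff report can be as simple as comparing today's collected config against yesterday's snapshot. A minimal sketch using Python's standard difflib - the config text here is a made-up example, and the collection step itself (SSH to the device, save the running config) is left out:

```python
# Compare yesterday's saved device config against today's and emit a
# unified diff. An empty diff means "no change"; anything else is worth
# a look, whether the change was approved or not.
import difflib

def config_diff(old_text: str, new_text: str, label: str = "device") -> str:
    """Return a unified diff between two config snapshots ('' if identical)."""
    return "".join(difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=f"{label} (yesterday)",
        tofile=f"{label} (today)"))

# Hypothetical snapshots - in practice these come from your nightly
# collection script.
yesterday = "hostname core-rtr-01\nvlan 10\n name USERS\n"
today = "hostname core-rtr-01\nvlan 10\n name USERS\nvlan 666\n name OOPS\n"

report = config_diff(yesterday, today, "core-rtr-01")
if report:
    print("Change detected:\n" + report)  # alert / email this to management
```

Schedule the collection-plus-diff run nightly and the "bad change" report writes itself.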
DRP / BCP review:
Does this change need to be replicated at your DR site and reflected in your DR / BCP plan? The last thing you want is for a change to go into production, only to find out it never made it to the DR site during an expensive DR test window - or worse yet, during an actual disaster.
Security and Regulatory Review
Changes to production systems should be reviewed with an eye towards the original design of the system. For instance:
Does the proposed change open up a backdoor path between two different security zones? We see this often, where developers are ready to deploy a new system that requires access to the same database as a different application in a different security zone. Simply opening up the required port for direct database access could easily end badly for everyone at the next PCI or NERC/FERC audit (or whatever regulatory umbrella you might be under). Really, this issue should be addressed earlier in the process, but the change control process that governs production is your final safety net for issues of this type. Well, the final safety net except for the audit, that is ...
Or, is that new subnet inside or outside the range permitted for various VPN user groups? It's often the case that a new subnet or host should *not* be accessible over the VPN for most users (SQL servers, for instance). On the other hand, other hosts need to be accessible to almost all users (the corporate intranet). The important thing is that as changes are implemented, access to any new service should be assessed against the underlying security policies that should drive the design.
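This kind of check is easy to script as part of the review - for instance, testing a proposed new subnet against the ranges each VPN group is permitted to reach. A minimal sketch using Python's standard ipaddress module; the group names and address ranges are hypothetical examples, not a recommendation:

```python
# Check which VPN user groups would be able to reach a proposed new
# subnet, based on the ranges each group is permitted. The group names
# and ranges below are hypothetical examples.
import ipaddress

# Hypothetical policy: which address ranges each VPN group may reach.
VPN_POLICY = {
    "all-staff": ["10.10.0.0/16"],                    # general office services
    "dba-group": ["10.10.0.0/16", "10.20.5.0/24"],    # DBAs also reach SQL
}

def groups_with_access(new_subnet: str) -> list:
    """Return the VPN groups whose permitted ranges cover new_subnet."""
    net = ipaddress.ip_network(new_subnet)
    return [group for group, ranges in VPN_POLICY.items()
            if any(net.subnet_of(ipaddress.ip_network(r)) for r in ranges)]

# A new SQL server subnet should only show up under dba-group:
print(groups_with_access("10.20.5.0/28"))
```

If a new SQL subnet comes back reachable by "all-staff", the change review has caught a policy violation before the auditors do.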
Hey, have I talked about Risk Assessment enough lately? Here we go again - I'm seeing more and more organizations that have inserted a Risk Assessment section into their change control document. I really (really) applaud this trend, but it'd be better if it were coupled with an awareness of how risk assessment should be done. What is often seen in these new sections is a list of the remotest possible negative outcomes of the change. What I'd rather see is more reference to the hard work that's gone into establishing system redundancies.
The change control document that I posted above is one that I've used as an example for a while now. If you've got a more complete document, if there is something that you feel should be added or changed, or if I've missed anything in the process description, please use our comment form and share!
"IT is supposed to be a support group for the larger organization, so we should be striving for planned changes. From the point of view of the larger business, we want system maintenance and upgrades to be as close to a non-event as we can possibly make it."
This is not necessarily true. The truer answer is: management doesn't want changes to interrupt business processes if the cost of the interruption is too high.
If you have a server that won't boot because it's showing the keyboard error "Press F1 to continue", you don't want a change control form and a three-day bureaucratic process before an engineer is allowed to clear the error and get the service back up.
If you aren't careful, excessive planning can result in cost overruns exceeding the benefits of the change - it rules out, on cost grounds, changes that may be low-risk incremental improvements to systems. Incremental changes that, added up over time, might be a massive improvement.
Disruptions caused by change may be deemed OK if the cost is low, or the benefits of the change are high enough. A Visio diagram or a sketch on a napkin might be fine in many circumstances, particularly for changes of limited scope whose full impacts can easily be understood.
When you introduce complicated changes, the full impacts may NEVER be understood and fully anticipated, even with years of planning.
You have similar issues with software development ... the engineer may require feedback from the live system while making the change, to fully chart out the best parameters and the exact next steps.
This makes it physically impossible or unjustifiably expensive to even attempt to plot out precisely in advance.
Spending months planning a change costs lots of money, in the form of many lost man-hours spent planning or making up paperwork - possibly massively more money than a little extra time spent solving other problems.
Therefore, it is not obvious at all, that such bureaucracy is always called for.
For some cases it will be a sensible CYA measure for IT management.
For other cases, such formalized change process constructions are just an irrational, artificial introduction of excessive complexity.
Mysid - I think you're missing the point entirely. A good change process can take into account all the issues you've pointed out.
Emergency CAB process. Changes can be raised retrospectively when incidents occur, so you don't have to wait three days to restart a server that has crashed.
"A visio diagram or sketch on a napkin, might be fine in many circumstances, particularly for changes of limited scope, with full impacts that can easily be understood." - this is called a pre-approved change. In large organiseations, one of the big points of a CAB process is so that changes made by different teams do not effect eachother. You may have a regular, scheduled change that brings down a server every weekend, but what about the development team doing their major release in the same window?
"You have issues similar with software development.... the Engineer may require feedback from the live system while making the change, to fully chart out the best parameters, and the exact next steps." - For obvious reason you typically do not test changes in a production environment, so your engineer is doing something fundamentally wrong in this case. This is a fundamental part of release management.
In smaller organisations where you only have 10-15 people making changes you might not consider using a CAB, but for larger organisations with hundreds of IT staff it's absolutely necessary.