Lessons from DR

I arrived in Philadelphia late Sunday afternoon and departed for home Wednesday morning.  During this time span I managed to get a cumulative total of 12 hours of sleep.  "What accounts for your sleep deprivation?" you ask.  "Too much caffeine in your system?"  Well, yes, but the main culprit is our yearly disaster recovery test.

Our test began Monday morning at 8 a.m. and ran for the next 48 hours.  During this time we recovered an ERP system, a database cluster, a data warehouse, Exchange, and miscellaneous third-party applications.  I'm happy to report we were largely successful with our recovery, meeting all of our primary objectives.

Below I will outline the principles we followed to make this exercise as successful as it was.  I will also lay out several suggestions that, had we followed them, would have made it even more efficient.

Principle One:  Have a well-documented plan

One of the main problems with having IT people perform your disaster recovery tests is that we are notorious for our lack of planning.  By definition we are "DO"ers.  We prefer to leave the paperwork to others.

Having a well-documented plan is essential to performing the recovery exercise effectively.  One of the main items you want to focus on when developing a plan is clearly identifying dependencies.  Only after all your dependencies are identified can you decide in which order specific items must be done and which tasks can be done in parallel.
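To make the point about dependencies concrete, here is a minimal sketch of how a dependency list can be turned into an ordered plan.  The task names are made up for illustration and are not our actual plan; the only assumption is that each task lists the tasks that must finish before it can start.  It uses Python's standard graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical recovery tasks, each mapped to the tasks it depends on.
dependencies = {
    "restore_network": set(),
    "build_db_servers": {"restore_network"},
    "build_app_servers": {"restore_network"},
    "restore_database": {"build_db_servers"},
    "install_erp": {"build_app_servers", "restore_database"},
}

def recovery_waves(deps):
    """Group tasks into waves; tasks in the same wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose prerequisites are done
        waves.append(ready)
        sorter.done(*ready)
    return waves

for number, wave in enumerate(recovery_waves(dependencies), start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

Each wave is a set of tasks with no unmet dependencies, so the people assigned to a later wave do not need to be on the clock until the earlier waves are finished.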

Part of this plan should also encompass all server configuration information.  You need good documentation explaining what your end state should look like.  You don't want to be in a situation where all you have is a box full of tapes and a lot of questions.

Principle Two:  Have a DR Kit

The DR Kit is extremely important to the success of your disaster recovery exercise.  It should contain all of your installation media, service packs, hot fixes, and a copy of the project plan itself.  Basically this kit should have everything required to perform an installation of the operating system, databases, and applications.

This next comment should be filed under "it goes without saying," but because of the category "you'd be surprised," I'll go ahead and mention it anyway.  Make sure your DR Kit is stored in an off-site location.  Typically this will be the same location as the backup media you have off-sited.  (If you currently don't have an off-site location you should stop reading this article right now and ponder the difference between cheap and dumb ass because you, or at least your company, fall into one of these two categories.)

Principle Three:  Have clearly defined success criteria

You might think this third item is fairly obvious, but you'd be surprised at how many times it can become an issue.  An example of this would be all of the servers currently in your production landscape versus what is actually required to run the application.  Are your success criteria getting all of the servers up, or making certain functionality available?

Principle Four:  Don't trust tape

Anytime you're recovering a system from tape you're in extremely dangerous territory.  Even assuming that there are no errors on your backup media, you're still left with the situation where you are attempting to restore a system to unlike hardware.  There are many out there who make a career out of being able to pull this off successfully.  As for me, I don't try to make a backup off of an IBM 360 run on some HP ProLiant server.  My preference is to rebuild the server from scratch, reinstall the application, and restore only the data that is absolutely necessary from tape.  Obviously your business transactional data can't be restored from installation CDs; however, just about everything else can be.  All the specific configuration information for your applications should be stored in your DR Kit.

Paul's DR suggestions:

It's often been said that war is 99% boredom and 1% terror.  The percentages might change slightly but this can also be true of disaster recovery.  I'd like to lay out several suggestions I have for making yours go easier.

Suggestion One:  Do your DR remotely

At your DR location all you should actually need is a tape library and one or two IT individuals to physically touch the servers when they need a warm caress.  The rest of you should be able to do all of your setup and configuration remotely.  One of the main advantages of performing the DR remotely is that, since you're no longer paying for hotel and travel, you can suddenly afford to involve more people and thus avoid death by sleep deprivation.

At my last DR we had not made allowances for remote connectivity; however, a colleague and I did come equipped with Sprint PPC-6700 smart phones.  We plugged these phones into the USB ports of a couple of servers and were able to provide remote access to our Malaysian team.  This meant that at three in the morning, when we were not thinking clearly, we had support from people who were wide-eyed and alert.

Suggestion Two:  Get enough sleep

I'm sure everyone's familiar with the statistics comparing the effects of sleep deprivation to blood alcohol levels.  The bottom line is that when you're exhausted you're not going to make smart decisions.  Make sure you have enough people available to divide up the work as necessary.

If you have your DR planned out properly you should be able to avoid having people sitting around waiting for hours to begin their tasks.  For example, if your DBAs are not needed until after the database servers are built, you probably don't need them to start until much later in the timeline.

Suggestion Three:  Use USB hard drives

This one was a real winner in our past DRs.  Copy all of your installation media, configuration files, service packs, hot fixes, and drivers onto a USB hard drive and place it into your DR Kit.  You also need to make sure that you have all of these burned to DVDs as well, since hard drives can fail quite easily.  However, doing installations off of a hard drive is much faster than having to sort through a stack of a hundred DVDs to pick out the one item that you need.
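Since hard drives can fail quietly, it may be worth checking the drive before you depend on it.  The following is only a sketch of one way to do that, not something we used in our test: record a SHA-256 checksum for every file when you build the kit, then re-check the drive against that record before the exercise.  The function names are my own invention for illustration.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large ISO images do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }

def find_damaged(root, manifest):
    """Return the recorded files that are now missing or have changed."""
    current = build_manifest(root)
    return sorted(
        name for name, checksum in manifest.items()
        if current.get(name) != checksum
    )
```

You would call build_manifest on the drive when you assemble the kit, store the result with your project plan, and run find_damaged against the drive before each test.  An empty list means every recorded file is still present and unchanged.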

Suggestion Four: Don't cluster

One way of saving yourself a lot of pain is to avoid recovering clustered systems as clusters.  Clusters are a great way to provide redundancy in your primary data center but should not be a concern during a DR.  The added complexity of clusters will dramatically add to the amount of time it takes to get production back up and running.  My recommendation is to give the physical server the name of the cluster group.  Since applications refer to the cluster group name and not the node name, the fact that the cluster does not exist should be invisible to your applications.  You also have the option of adding DNS aliases for any other cluster group names that you want to point to a specific machine.

Comments

  • 7/10/2008 10:25 AM Oscar wrote:
    Hi Paul, nice article! Enough sleep is key for sure

    I love your videos, very informative. Specifically the ones that test Windows Server Backup and restore to a VPC.

    Unfortunate for me, I have not had success on trying to restore a backup into a Virtual PC drive. In a nutshell, I get an error that says:

    "The Automated System Recover(ASR) pre restore operation failed.

    ERROR: Volume Shadow Copy Service operation error (0x80042407)
    The disk that is set as active in BIOS is too small to recover the original system disk. Replace the disk with a larger one and retry the restore operation"

    Here is what I am using:

    -a mac laptop with a windows partition (Windows 2008 Enterprise)

    -a NAS to backup to with 1TB

    -a VM with the at least 20GB more than the original backed up partition.

    Any help you can provide, I mean, even if I need to pay you for this, I don't care at this point, I just need to ensure I can restore my backups.

    Thank you,
    Oscar