Disaster Recovery Planning
Floods, fires, earthquakes, power outages — there's a long list of unplanned interruptions that can wreak havoc with IT systems. Having a recovery plan is key to minimizing downtime.
Most businesses cannot function without their computer systems operating effectively. As a result, disaster recovery planning for IT has become a critical aspect of business continuity planning. Yet, this is an area in which companies often fall victim to procrastination. If you were to ask a random sample of organizations what would happen if their computer systems were destroyed by some type of disaster, a typical answer would be that the company has backup tapes of data stored offsite. A majority would have nothing more than that.
And without a clear strategy, recovery from backup tapes can be a hit-or-miss proposition. The most important aspect of disaster recovery planning is to assess the potential risks to the organization should computer systems be inaccessible or inoperable for an extended period of time. For IT departments, planning starts with defining the downtime threshold, that is, how long the organization can tolerate being without its systems. For some organizations, the threshold may be set at 48 hours, while for others, it may be several days or even weeks.
Typically, disaster recovery planning covers a defined period of time, enabling the organization to continue operations while the old data centre is repaired or a new data centre is assembled. When a major disaster hits, companies need to have an alternate location to set up computer systems and install application software. These elements must be in place before data from backup tapes can be restored.
The plan should address the fact that servers, unlike desktop computers, cannot be readily purchased off the shelf at a local retail shop. And given that it may be impractical to try to recover all computer applications in a short timeframe, the plan should prioritize the recovery of applications. Most organizations cannot function without e-mail, so re-establishing e-mail service is usually a top priority. Many organizations also rely on their Web sites to deliver services, both internally and externally. For such organizations, an adequate Internet connection would be part of the initial disaster recovery steps.
A Step-by-Step Approach
The following best practices on how to conduct IT disaster recovery planning are recommended by IBM:
Phase 1: Criticality Analysis
- Determine which business units and processes are critical and contribute most to the delivery of the company's services and products;
- Determine the requirements necessary to support these processes during an extended outage;
- Establish what kind of disasters you are planning for;
- Determine the purpose of the disaster recovery plan;
- Identify which computer applications need to be recovered first;
- Classify functions according to recovery timeframes and prioritize the recovery sequence for computer applications.
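The last step of Phase 1 can be illustrated with a short Python sketch. It assumes each application has been assigned a maximum tolerable downtime in hours; the `Application` class, the application names, and the hour figures are all hypothetical placeholders, not values from IBM's recommendations.

```python
from dataclasses import dataclass


@dataclass
class Application:
    name: str
    max_downtime_hours: float  # assumed figure: longest outage the business can tolerate


def recovery_sequence(apps):
    """Return applications ordered from shortest to longest tolerable downtime."""
    return sorted(apps, key=lambda app: app.max_downtime_hours)


# Hypothetical inventory; names and hours are placeholders for illustration.
inventory = [
    Application("payroll", 168),
    Application("e-mail", 4),
    Application("web site", 24),
]

for position, app in enumerate(recovery_sequence(inventory), start=1):
    print(f"{position}. {app.name} (recover within {app.max_downtime_hours} hours)")
```

Sorting on a single number is the simplest possible ranking. A real criticality analysis would likely weigh other factors as well, such as dependencies between applications.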
Phase 2: Recovery Plan Development
- Identify, prioritize, and sequence the tasks that need to be performed during recovery, such as setting up servers, arranging for power and Internet connections, installing systems software, installing application software, and restoring data from backup tapes;
- Determine roles and responsibilities, and document who does what, how, and when;
- Determine resource requirements such as location, power requirements, server requirements, and people resources;
- Document all disaster recovery procedures.
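Sequencing recovery tasks is essentially a dependency-ordering problem. The sketch below is one possible way to express it, using the `graphlib` module from the Python standard library (Python 3.9 or later). The task names follow the examples listed above, but the dependencies between them are assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks that must finish before it can start.
# These dependencies are assumed for illustration only.
recovery_tasks = {
    "arrange power": set(),
    "arrange Internet connection": set(),
    "set up servers": {"arrange power"},
    "install systems software": {"set up servers"},
    "install application software": {"install systems software"},
    "restore data from backup": {"install application software"},
}


def task_order(tasks):
    """Return one valid execution order.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(tasks).static_order())


for step, task in enumerate(task_order(recovery_tasks), start=1):
    print(f"{step}. {task}")
```

Tasks with no dependency between them, such as arranging power and arranging an Internet connection, may appear in either order. In practice they could be assigned to different people and carried out in parallel.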
Phase 3: Active Testing
The final step is to test the recovery plan to verify that it is appropriate and workable. Depending on what the plan entails, this may range from simply testing recovery from backup to flying staff to server farms to install systems and application software, restore data from backup tapes, and then conduct a complete test of the recovered computer systems to ensure that everything is restored correctly. For complex systems, active testing may go through several iterations, as errors and omissions in the disaster recovery procedures are discovered and corrected. Testing must be repeated at least once a year to keep the procedures current.
Disaster Recovery Strategies
Many business processes require that critical applications be restored within hours or days; the smaller the recovery window, the more costly the recovery strategy. For example, the recovery window for financial institutions is small, as any delay causes major disruption. For some accounting offices, it is possible to continue the business manually while computer systems are repaired, allowing recovery time to be days or even weeks. Business requirements will dictate the appropriate disaster recovery strategy. In general, one of four main strategies is used:
1. Duplicate systems
The fastest recovery strategy calls for a duplicate set of computer hardware, data, communications equipment, power supply, and an Internet connection, ready for activation at an alternate site. With this strategy, recovery times are measured in minutes or hours.
If no downtime is allowed, a mirror site is used — basically a redundant setup elsewhere, running in parallel to the computer systems in production. All transactions are automatically recorded by the production systems and the mirror site.
If a disaster hits the production site, the mirror site kicks in and continues the operation. Mirror sites are the most expensive solution but provide the best disaster protection; however, not many organizations can afford or need this strategy.
2. Hot-site service
A hot-site subscription is an agreement with a recovery services provider to access a physical location equipped with the necessary hardware. This strategy has relatively high ongoing operating costs, and the shorter the guaranteed timeframe, the more costly the subscription. Typically, hot-site service provides recovery within a few hours to a few days. Hot-site service is designed to carry the organization for several weeks while a replacement data centre is built. It is not intended for extended use.
3. Quick-ship subscription
Another disaster recovery strategy is a subscription for guaranteed shipment of servers and other critical equipment — generally known as "quick-shipment." Quick shipments involve an agreement with a recovery services vendor for the guaranteed delivery of computer hardware to the customer's alternate recovery location. This strategy is generally less expensive than the first two.
4. Purchase at time of disaster
The least responsive strategy involves purchasing the required computer hardware and communications equipment at the time of the disaster. This strategy has the lowest operating costs, as there is virtually no cost unless a disaster happens. However, the cost of acquiring equipment at the time of a disaster is potentially the highest, as a premium price may be paid to get quick delivery. Recovery times for this strategy are usually measured in weeks.
Completing the Plan
The most practical solution for many organizations will likely be a combination of the above strategies. For the most critical applications with a very short recovery window, it often makes sense to set up a duplicate system in a different location, ready to be turned on if and when a disaster strikes. For less critical applications, the plan may call for the requisite hardware to be purchased on an as-needed basis.
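As a rough illustration of how a combined plan might assign a strategy to each application, the Python sketch below maps a recovery window to one of the strategies described above. The hour thresholds are assumptions chosen only to match the general ranges mentioned in the text (minutes or hours, a few days, weeks), and the function name and sample applications are hypothetical.

```python
# Assumed thresholds in hours; adjust them to the organization's own requirements.
DUPLICATE_SYSTEMS_LIMIT = 4      # "minutes or hours"
HOT_SITE_LIMIT = 72              # "a few hours to a few days"
QUICK_SHIP_LIMIT = 336           # assumed: up to two weeks


def suggest_strategy(recovery_window_hours):
    """Suggest a recovery strategy for a given recovery window (in hours)."""
    if recovery_window_hours <= 0:
        return "mirror site"  # no downtime allowed
    if recovery_window_hours <= DUPLICATE_SYSTEMS_LIMIT:
        return "duplicate systems"
    if recovery_window_hours <= HOT_SITE_LIMIT:
        return "hot-site service"
    if recovery_window_hours <= QUICK_SHIP_LIMIT:
        return "quick-ship subscription"
    return "purchase at time of disaster"


# Hypothetical applications and recovery windows, for illustration only.
windows = {"online banking": 0, "e-mail": 2, "web site": 48, "archive": 720}
for name, hours in windows.items():
    print(f"{name}: {suggest_strategy(hours)}")
```

A simple threshold table like this ignores cost, which in practice is weighed against the recovery window before a strategy is chosen.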
It is important to note that disaster recovery is not just an IT issue; all business units should have complementary business continuity plans. If they do not, critical IT services may be recovered quickly but then go unused, because the business units have not identified their core operational requirements.
Many organizations make the mistake of expending huge resources developing a disaster recovery plan complete with detailed documentation, but then fail to keep the plan up to date. An effective plan is a living plan and must be revised on an ongoing basis. At least once a year, organizations should conduct a complete test to confirm that the plan is appropriate and workable.
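A simple automated reminder can help keep the annual test from slipping. The Python sketch below assumes the date of the last complete test is recorded somewhere. The function name and the 365-day interval are illustrative choices, and the dates in the usage example are made up.

```python
from datetime import date, timedelta

# Assumed interval reflecting the "at least once a year" guideline.
TEST_INTERVAL = timedelta(days=365)


def plan_test_overdue(last_test, today=None):
    """Return True if more than TEST_INTERVAL has passed since the last complete test."""
    if today is None:
        today = date.today()
    return today - last_test > TEST_INTERVAL


# Hypothetical dates, for illustration only.
if plan_test_overdue(date(2023, 1, 15), today=date(2024, 6, 1)):
    print("Disaster recovery plan test is overdue.")
```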
Disaster Recovery Planning Resources
Developing a disaster recovery plan can be quite an onerous task. For complex operations, external consultants who specialize in disaster recovery planning can simplify the process of preparing a plan. A number of online resources are also available. Examples include:
- The Disaster Recovery Guide covers the A to Z of disaster recovery and features detailed information. www.disaster-recovery-guide.com
- University of Toronto, Computing and Network Services, is an excellent resource. www.utoronto.ca/security/index.htm
- Disaster Recovery Planning is a vendor-neutral information resource provided free of charge by Toigo Partners International. www.drplanning.org
- Disaster Recovery Zone contains essential advice for understanding the role and responsibilities of the disaster recovery project leader. www.disaster-recovery-zone.com