Edition e1.0.0
Updated October 9, 2023
A disaster recovery plan is critical to your organization’s ability to respond to and recover from a range of disruptive events.
The objectives of this plan are to:
Undertake risk management assessment.
Define and prioritize your critical business functions.
Detail your immediate response to a critical incident.
Detail the strategies and actions to be taken to enable you to stay in business.
In plain English, the aim of this entire plan is to know what has gone wrong and get your most critical systems and processes back up and running with minimal disruption.
Next, we are going to look at all the sections you would typically put into your business continuity plan or disaster recovery plan and outline the types of information you should capture in each. Towards the end of this chapter we’ll look at what to do after an incident or disaster, and mistakes to avoid.
You need to manage the risks to your business by identifying and analyzing the things that may have an adverse effect on your business and choosing the best method of dealing with each of these identified risks.
The questions to ask are:
What could cause an impact?
How serious would that impact be?
What is the likelihood of this occurring?
Can it be reduced or eliminated?
Risk/Description | Likelihood | Impact | Preventative Action | Contingency Plans |
---|---|---|---|---|
Natural Disaster | Low | High | Insurance; off-site backups in multiple locations | |
Epidemic | Low | High | Well-defined and tested remote working arrangements. | |
Fire | Medium | High | Use of well-provisioned working spaces with fire prevention mechanisms such as sprinklers. | Off-site backups in multiple locations; insurance |
Flood | Medium | High | Use of watertight and well-maintained working environments. | Insurance |
Theft of Equipment | High | Medium | Encryption of all disks and portable devices. Encryption of backup files. Physical controls on working spaces. Guidance for travel with work devices. | Restoration of device data from backups; insurance |
Loss of Key Staff Member | High | Medium | Ensuring roles are known by multiple staff members. Use of access management and sharing solutions to ensure all passwords and access keys are securely stored and accessible. | Prompt assessment and revocation of accesses. |
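One way to turn a register like this into a review order is to score each risk. The sketch below is only an illustration: the numeric scale for Low/Medium/High and the likelihood-times-impact score are assumptions (a common convention, not something this plan prescribes), and `Risk` and `rank_risks` are hypothetical names.

```python
from dataclasses import dataclass

# Assumed numeric scale for the Low/Medium/High ratings used in the table.
LEVELS = {"Low": 1, "Medium": 2, "High": 3}


@dataclass
class Risk:
    description: str
    likelihood: str  # "Low", "Medium", or "High"
    impact: str      # "Low", "Medium", or "High"

    @property
    def score(self) -> int:
        # Likelihood x impact is a common convention, not a rule from this plan.
        return LEVELS[self.likelihood] * LEVELS[self.impact]


def rank_risks(risks: list[Risk]) -> list[Risk]:
    """Return risks ordered from highest to lowest score (ties keep input order)."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


register = [
    Risk("Natural Disaster", "Low", "High"),
    Risk("Epidemic", "Low", "High"),
    Risk("Fire", "Medium", "High"),
    Risk("Flood", "Medium", "High"),
    Risk("Theft of Equipment", "High", "Medium"),
    Risk("Loss of Key Staff Member", "High", "Medium"),
]

for risk in rank_risks(register):
    print(f"{risk.score}  {risk.description}")
```

A score like this is only a starting point for discussion; two risks with the same score may still deserve very different preventative actions.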
Determine what types of insurance are available, and purchase the necessary policies. Your disaster recovery plan should document any policies you have so that if something happens, they are easy to find and trigger.
Example data to capture about your insurance policies:
insurance type
policy details and documents
exclusions
insurance company and contact details
renewal and review dates.
Ensure that any backup processes for critical data are recorded in your disaster recovery plan.
This helps you understand how much data you can recover, how that recovery process works, and when it was last tested.
Example data to capture about your backups:
backup frequency
where and how the backups are made
owner of the system and associated backups
recovery procedures (where to find them)
how frequently the backups are tested and when the last test was held.
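If you keep these details in a structured form, it becomes easy to spot backups whose restore test is overdue. The following is a minimal sketch under assumed names: `BackupRecord`, its fields, and the sample systems, owners, and dates are all hypothetical and only mirror the list above.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class BackupRecord:
    system: str
    owner: str
    frequency: str            # e.g. "daily"
    location: str             # where and how the backup is made
    recovery_procedure: str   # where to find the procedure
    test_interval_days: int   # how often a restore test should happen
    last_tested: date

    def test_overdue(self, today: date) -> bool:
        """True if the last restore test is older than the agreed interval."""
        return today - self.last_tested > timedelta(days=self.test_interval_days)


# Hypothetical sample records for illustration only.
records = [
    BackupRecord("billing-db", "Finance lead", "daily", "off-site object storage",
                 "plan appendix B", 90, date(2023, 1, 10)),
    BackupRecord("wiki", "IT lead", "weekly", "second-region snapshot",
                 "plan appendix C", 180, date(2023, 8, 1)),
]

today = date(2023, 10, 9)
overdue = [r.system for r in records if r.test_overdue(today)]
print(overdue)
```

Passing `today` in explicitly, rather than reading the clock inside the method, keeps the check easy to test.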
This is where we start getting into the really crucial part of disaster recovery. As we all know, you can’t do everything at once; there always has to be an order to the actions we carry out that works with the time, money, and people we have available.
Disaster recovery is one area that really highlights this reality. Imagine you lost all of your systems in one day after a freak accident in a hosting center wiped out your infrastructure. While this is a highly unlikely, contrived example, the point still stands. If you had nothing and had to rebuild everything from scratch to resume your business operations, what would you restore first?
There are two key measurements we use to prioritize our systems.
Definition: The recovery time objective (RTO) is the amount of time you can operate or survive as a business without a system. In short, how quickly do you need the system to resume?
Definition: The recovery point objective (RPO) lets us define how much data we would need to have restored for a system to function or to be of use to our company.
The RTO and RPO are a balance. Here are some scenarios that outline the relationship between these two values.
You may be able to resume services very quickly (short RTO) but with a small amount of data (short RPO), such as only the records from the last hour.
You may be able to last a long time without a system (long RTO) so long as when it comes back, you haven’t lost any data at all (long RPO).
You may need your system to be back quickly (short RTO) and have all the data back including historical records (long RPO).
Whatever your requirements when recovering from a disaster, it’s important that every system, tool, or process that needs to be restored is documented in your plan, along with your RTO and RPO for that system. This analysis allows those responding to the event to prioritize and get systems back up in the right order, as well as with enough data to make them useful.
Remember that not everything can be restored first and not all data can come back in those early hours and days. Think carefully about your RTO and RPO expectations so that you can make this process easy and reduce conflict or arguments.
Critical Business Activity/System | Name of the system |
---|---|
Description | What does this system do? |
Priority | What is the priority for this system when recovering? |
Impact of system loss | What impact would losing this system or not recovering it have on the organization? |
Recovery time objective (RTO) | How long can you live without it? |
Recovery point objective (RPO) | How much data do you need back? |
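Once each system has an entry like the one above, the entries can be sorted to give responders a restore order. This is a rough sketch, assuming that priority is recorded as a number where 1 means "restore first" and that ties are broken by the shortest RTO; `SystemEntry`, `restore_order`, and the sample systems are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class SystemEntry:
    name: str
    priority: int    # 1 = restore first (assumed convention)
    rto: timedelta   # how long the business can operate without the system
    rpo: str         # how much data is needed back, in the plan's own words


def restore_order(systems: list[SystemEntry]) -> list[SystemEntry]:
    """Order by priority first, then by the shortest RTO."""
    return sorted(systems, key=lambda s: (s.priority, s.rto))


# Hypothetical sample entries for illustration only.
systems = [
    SystemEntry("Internal wiki", 3, timedelta(days=7), "all pages and history"),
    SystemEntry("Payments", 1, timedelta(hours=4), "all transactions"),
    SystemEntry("Email", 2, timedelta(hours=24), "last 30 days of mail"),
    SystemEntry("Customer support desk", 1, timedelta(hours=8), "open tickets only"),
]

for s in restore_order(systems):
    print(f"P{s.priority}  RTO {s.rto}  {s.name}  (RPO: {s.rpo})")
```

The RPO is kept as free text here because the plan describes it in terms of how much data is needed back, which does not always reduce to a single number.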
Business continuity requires a number of coordinated roles to work efficiently.
To ensure that your organization is able to respond quickly, incident-specific roles such as “incident lead” and “deputy” should be rotated between team members. This redundancy reduces the reliance on individuals and that nasty “key person risk” we discussed in Part III.
Role | Description | Responsibilities |
---|---|---|
Business Continuity Owner | Owns this business continuity plan and holds management-level accountability for it and its associated risks. | • Update and maintain this document • Arrange for regular tests of this process |
Incident Lead | Controls and leads activities for a specific business continuity event. | • Lead the response team • Coordinate response activities • Manage prioritization during event |
Deputy | Supports the Incident Lead and manages communications for a specific incident. | • Manage communications with internal and external stakeholders • Support the incident lead |
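Rotating the incident-specific roles can be as simple as a round-robin schedule. The sketch below is one possible approach, not a prescribed method: the function name `rotation`, the idea of fixed "periods", and the rule that each deputy becomes the next lead are all assumptions made for illustration.

```python
def rotation(team: list[str], periods: int) -> list[tuple[str, str]]:
    """Return (incident lead, deputy) pairs, one per period, round-robin.

    Assumed rule: the deputy in one period becomes the lead in the next,
    so everyone gets practice in both roles.
    """
    if len(team) < 2:
        raise ValueError("need at least two people to fill both roles")
    schedule = []
    for i in range(periods):
        lead = team[i % len(team)]
        deputy = team[(i + 1) % len(team)]
        schedule.append((lead, deputy))
    return schedule


# Hypothetical team members.
for period, (lead, deputy) in enumerate(rotation(["Ana", "Ben", "Chi"], 4), start=1):
    print(f"Period {period}: lead={lead}, deputy={deputy}")
```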
This is where our plan starts to move from collecting important information to documenting the key steps we need our response team to take for every event.
The following activities should be conducted in the event of a serious business continuity incident. They are listed in priority order.
Assess the severity of the incident.
Evacuate the site.
Account for everyone.
Identify any injuries to people.
Contact emergency services.
Start event log.
Begin restoration plan activities.
Activate staff members and resources.
It’s important to review these suggestions and see if there are any additional steps you need to take based on your location, operating model, health and safety risks, or culture.
The aim should remain the same, however. The first steps of this process are always focused on quickly triaging the situation and ensuring people are removed from harm’s way. Later steps focus on addressing human harm first and then, when safe to do so, restoring services and operations.
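The "start event log" step above does not need special tooling; a timestamped, append-only list of notes is enough. The following sketch assumes a hypothetical `EventLog` class and sample entries. In a real incident the log may well be pen and paper, so treat this only as an illustration of what each entry should capture.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class EventLog:
    """Append-only record of what happened, when, and who noted it."""
    entries: list[tuple[datetime, str, str]] = field(default_factory=list)

    def record(self, author: str, note: str, when: Optional[datetime] = None) -> None:
        # Default to the current UTC time so entries from different sites line up.
        stamp = when or datetime.now(timezone.utc)
        self.entries.append((stamp, author, note))

    def render(self) -> str:
        """Return the log as plain text, one entry per line."""
        return "\n".join(
            f"{stamp.isoformat(timespec='minutes')}  {author}: {note}"
            for stamp, author, note in self.entries
        )


# Hypothetical entries for illustration only.
log = EventLog()
log.record("Incident Lead", "Fire alarm triggered, site evacuated",
           when=datetime(2023, 10, 9, 9, 15, tzinfo=timezone.utc))
log.record("Deputy", "All staff accounted for at muster point",
           when=datetime(2023, 10, 9, 9, 32, tzinfo=timezone.utc))
print(log.render())
```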
Upon loss of your physical infrastructure or office, or any event that prevents staff from safely reaching usual working premises, you need to ensure that your team take the right steps to stay safe:
If you are at the affected site at the time of the event, report to the business continuity lead to register your safety and presence. In some cases, like a fire alarm, this might be at a physical location like a car park or muster point. In cases like natural disasters, or for remote teams affected by disaster events, this might be a digital check-in to say you are safe.
Seek medical assistance where required.
Remain at or return to your homes or other appropriate safe location.
Resume working in a remote capacity when safe to do so.
Await further instructions.
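Whether check-ins happen at a muster point or digitally, the business continuity lead needs to know who has not yet registered as safe. A minimal sketch, assuming a hypothetical roster of names and a hypothetical `unaccounted` helper:

```python
def unaccounted(roster: set[str], checked_in: set[str]) -> list[str]:
    """People on the roster who have not yet registered as safe, sorted by name."""
    return sorted(roster - checked_in)


# Hypothetical names for illustration only.
roster = {"Ana", "Ben", "Chi", "Dev"}
checked_in = {"Ben", "Dev"}

print(unaccounted(roster, checked_in))
```

Names that appear in `checked_in` but not in `roster` (visitors, for example) are ignored by this helper, so a real process would need a separate way to account for them.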
The important thing to remember is that by documenting this plan and your expectations, you can remove some of the anxiety and uncertainty from a very stressful situation. Disasters and business continuity events are very hard to manage for most people, and by having a simple, well-communicated plan, your team can focus on staying safe and can fall back to your instructions at any time if they are lost, uncertain, or unclear as to what is expected of them.
Once your people are safe, the next step is to locate the essential equipment and documentation you need to begin the recovery process.
For essential equipment, you should make sure the following are available:
emergency medical supplies
first aid kits
earthquake kits
flashlights.
Many national and international civil defense organizations provide guidance on preparing these kits and what to include in them. Please remember, if your team is remote or distributed, you should provide this equipment to all operating locations.
When preparing your essential documentation, you should make sure that the following are available:
contact details for communicating with the team and key stakeholders
insurance information
your disaster recovery plan
recovery codes for secure accounts, password managers, or other highly sensitive systems.
Each of these items should be stored somewhere suitable and accessible in the event of a disaster. It’s no good having a well-documented plan if nobody can find it. Your plan and critical information should be stored both electronically and physically in a number of geographically separated locations. This ensures that in a bad situation, there should always be a copy accessible.
As well as choosing good locations, ensure that multiple people have access so that in the event of injury or loss of contact, there are additional people who can locate the plan and activate it.
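These storage expectations can be checked mechanically if you keep a short inventory of where each copy of the plan lives. The sketch below is illustrative only: `PlanCopy`, `storage_gaps`, the `region` field used as a stand-in for geographic separation, and the sample copies are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class PlanCopy:
    location: str        # e.g. "office safe" or "shared drive"
    region: str          # assumed coarse label for geographic separation
    medium: str          # "electronic" or "physical"
    holders: list[str]   # people who can reach this copy


def storage_gaps(copies: list[PlanCopy]) -> list[str]:
    """Return a list of problems; an empty list means no gaps were found."""
    problems = []
    if len({c.region for c in copies}) < 2:
        problems.append("copies are not geographically separated")
    if not any(c.medium == "electronic" for c in copies):
        problems.append("no electronic copy")
    if not any(c.medium == "physical" for c in copies):
        problems.append("no physical copy")
    everyone = set().union(*(c.holders for c in copies))
    if len(everyone) < 2:
        problems.append("fewer than two people have access")
    return problems


# Hypothetical inventory for illustration only.
copies = [
    PlanCopy("office safe", "region-1", "physical", ["Ana"]),
    PlanCopy("shared drive", "region-2", "electronic", ["Ana", "Ben"]),
]
print(storage_gaps(copies))
```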
Much like you need to be able to find your first aid kit in case of an emergency, having contact details at hand is also critical to how well you can respond.
Remember that depending on the type of emergency, you won’t just be able to look up someone’s details on your company’s computer systems. You may have to resort to more manual and old-school mechanisms, like a call sheet.
When capturing contact information, remember to capture the details of both internal contacts (people on your team) and external contacts (people outside your organization who are essential to its operation).
For each of these groups, you should capture:
name
contact number
responsibilities or roles
which company they represent (external only).
The painful part of this section is maintaining your lists. It’s been a long time since any of us kept a physical address book. Make a point to schedule updates to both this list and your overall disaster recovery plan and assign owners from across your team to ensure that many hands make light work of its upkeep.
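If the contact details live in a structured list, a printable call sheet can be regenerated whenever the list is updated. This is a minimal sketch: `Contact`, `call_sheet`, the pipe-separated layout, and the sample names and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Contact:
    name: str
    number: str
    role: str
    company: Optional[str] = None  # set for external contacts only


def call_sheet(contacts: list[Contact]) -> str:
    """Render a plain-text call sheet: internal contacts first, then by name."""
    ordered = sorted(contacts, key=lambda c: (c.company is not None, c.name))
    lines = []
    for c in ordered:
        org = f" ({c.company})" if c.company else ""
        lines.append(f"{c.name}{org} | {c.number} | {c.role}")
    return "\n".join(lines)


# Hypothetical contacts with fictional numbers.
contacts = [
    Contact("Sam Broker", "555-0142", "Insurance claims", company="Example Insurance"),
    Contact("Chi", "555-0101", "Deputy"),
    Contact("Ana", "555-0100", "Incident Lead"),
]
print(call_sheet(contacts))
```

Printing the result and storing it with the physical copy of the plan keeps the call sheet usable when company systems are unavailable.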
Finally, our disaster recovery plan would not be complete without instructions on how to recover the systems, infrastructure, facilities, and data we rely on to get the job done every day. The more systems you have and the more complex your organization, the more you will need to document here.
The aim of this section is to give responders enough information to get going with restoring systems. This often includes:
where to find recovery playbooks for each system
who to contact for each system to talk through the process and set expectations
where to find essential equipment, backups, or authentication materials
how to get physical access where needed.
Remember that any document you reference here should be stored along with the overall plan so that it can be accessed in times of need.
The first common element of both disaster recovery and incident response plans is the need to plan your communications during an emergency. There are many reasons why you don’t want to leave this to chance:
Your normal communication tools may not be available due to an outage or fault.
You may have no physical access to your communication devices, or other physical locations or equipment needed to use them.