CISSP Reloaded–Domain 8: BCP and DR

This is the 8th part of my CISSP Reloaded series, where I am revisiting the 10 CISSP domains I studied many years ago to see what has changed and how much of it I have retained, as well as adding my own personal thoughts, experiences and rambles into the mix. Read the other domains here: (Domain 1) (Domain 2) (Domain 3) (Domain 4) (Domain 5) (Domain 6) (Domain 7)

Do you ever watch those life insurance adverts where they show a family playing happily and then in comes the deep grim voiceover that somberly asks, “Who will look after them should the worst happen?” It doesn’t say the words death by heart attack, or falling from a balcony to your doom or even death by PowerPoint. But we all know that’s what they mean.

Which is the exact same voice that goes through my head any time I’ve had to sit in on a Business Continuity or Disaster Recovery planning session. I possibly even end up speaking like that guy. “So tell me, how will you run your website…. Should the worst happen?”

“Who will answer your customers’ phone calls…. Should the worst happen?”

“How will we operate the projector… should the worst happen?”

Now you’re also going to have that stuck in your head and it will come out at the most inappropriate time causing you to chuckle to yourself while everyone around you thinks you’ve gone a bit loopy. I like how it’s worded in a non-offensive and open way, to bring about unlimited possibilities. It’s like a game of inception. You don’t want to say the exact words because you’ll look like a doomsday hater, but you want to plant the seed of doubt in their minds. Get them thinking.

If there’s nothing else you take away from this domain, take this lesson. Forget the technology, forget your load balancers and your availability criteria. When you talk to the business about continuity and disaster recovery, your job is to leave them feeling as paranoid as an A-List celebrity out shopping without wearing a pair of sunglasses 3 sizes too big to conceal their identity. I find that parents of children aged between 3 and 8 already have the necessary skills needed to be effective at this. They are used to giving out these subliminal messages to their kids. “You better clean your room right now young man, or you’re going to be in trouble.” Of course, the term ‘trouble’ is never really quantified. The kid usually conjures up some wild imaginative thing such as their parent turning into a werewolf and eating them while they sleep. Whereas the parent is desperately hoping the child does what they’re told or they’ll be forced to make up some laughable punishment, like the naughty step.

Another of my favourite parenting techniques has to be the counting to 3 method. It’s where the parent tells the child to do something and the child stubbornly refuses. So the parent slowly, but very firmly starts to count. It becomes a battle of nerves at that point like a spaghetti Western standoff, the tension mounts in the room. The gauntlet has been thrown down by the parent, they’ve sent out a clear message that insubordination will not be tolerated. For a while the child resists…. Then the parent says “2”. At this stage time slows down, an eerie silence takes over, the child can hear the clock ticking, becomes aware of their own breathing and heightened pulse rate. The parent raises an eyebrow as if to indicate that if they reach “3”, the floodgates of hell will be opened up and demons will emerge from every corner and rip the child limb from limb.

So the child gives in, stomps their feet and does what they were asked to do. The parent sighs a giant sigh of relief knowing that if they ever reached 3, their whole game would be up. The bluff is saved to be repeated another day.

I have no idea how I’ve ended up talking about parenting techniques – this is the very reason I’ve been told I desperately need an editor, so they can take out all my crap. But that would probably turn a 4000 word chapter into 50 words.

Business Continuity Planning and Disaster Recovery are usually uttered in the same breath and used somewhat interchangeably, but do they mean the same thing? Well, they’re a bit like sisters who are born a year apart. They’re not quite twins, but the similarities are undeniable. They have the same mannerisms, probably share that same awkward snorty laugh and are the same build. But when you look closer you’ll note the differences, like how one has a mole on her left cheek or has 3 piercings in her ear whereas the other only has 1. This leads to it almost becoming an obsession, with you wanting to check the mole out before speaking to either because you want to know exactly who you’re talking to. Which is how you should approach Business Continuity and Disaster Recovery. They’re the two sisters who everyone else mixes up, but you know who’s who because of their moles and piercings.

Sisters who aren’t twins and parenting tips? I swear this is the most messed up domain I’m writing about. Probably because I’ve got little experience in doing a lot of Business Continuity or Disaster Recovery planning. I usually end up asking a project if they’ve considered it, and they grunt and mutter something about having bought two servers and installed one in each of the data centres, which are 50 miles apart. I usually nod, tick it off on my checklist and make a mental note to go and verify that the data centres are actually ones the company owns and manages, and to validate it on the plan. It would work a whole lot better if I made an actual note of these things on the piece of paper in front of me, because I kind of forget mental notes. That’s the problem with mental notes. Depending on your mental capacity, you can end up forgetting them, or overwriting them with other notes, or even worse you start doodling on them in your mind. Which is why you should always document your business continuity and disaster recovery plans. The last thing you want is for a tragedy to hit and to have 8 different senior managers in a meeting trying to remember what they agreed would be the best course of action to take in the event of this incident.

Business continuity planning is rather pro-active. It’s like taking a first aid kit with you on holiday because you know that the kids will inevitably trip over and cut their knee. The first aid kit will allow you to take a bit of pleasure in disinfecting the wound while the child wails and apply a plaster. After a while your kid can carry on playing as normal. Or it’s like having a spare wheel in the back of your car. You have a plan that if your car journey was interrupted by a flat tyre, you could change it, or if you’re like my wife, you phone me up to come and change the wheel. This allows you to continue the car journey albeit with a slight interruption.

Disaster recovery, as its name suggests, is how you would recover from a real disaster. Like if your kid got taken hostage by flying monkeys who wanted to raise the man-cub as their own and name him Mowgli… or if your car’s engine blew up. These are disasters and the recovery is usually a reactive process. So it’s like having car breakdown cover so that a man who kills people on the weekends can drive up in his pickup truck, look around your car, confirm that the engine has blown up and offer to tow your car to a garage. I’m not sure what kind of plan you’d invoke to get your child back from the flying monkeys though.

Business Continuity Planning

BCP is all about continuing business activities when something has happened. My CISSP notes break it down into four phases:

1. Scope and Plan Initiation

2. Business Impact Assessment

3. Business Continuity Plan Development

4. Plan Approval and Implementation

1. Scope and Plan Initiation

As they say, a journey of a thousand miles begins with a single step. The scope and plan initiation phase is the first step you need to take to create a BCP. In order to properly scope the plan, it’s important to understand what the company does, which activities are important and which aren’t, and which systems are crucial to support the important activities. Now, if a company has done a good job of risk assessing and classifying all their assets, then this should be an easy process of simply going through your assets and ticking off the ones that are needed.

Unfortunately, it’s never that easy, so what happens is you end up setting up a working group of well-trained monkeys to go around with checklists trying to understand what all the assets are and trying to make sense of how the organization works. This highlights a fundamental disconnect between most security departments and the business. If you ever find yourself doing this activity in a business, stop and ask: if you don’t already know what the business does or what the critical assets are, then how do you know what you’re supposed to be protecting and where your security controls are most needed?

During this phase a lot of large organisations will set up a committee or two, a steering group and an advisory board of some sort. Just do what’s right for the organization and plough on.

2. Business Impact Assessment

The BIA is what we should be doing on each asset when it’s deployed. But again, we would have either misplaced the record or it would be so out of date that the BCP creation process will drive us into conducting another set of BIAs on each system.

In simple terms, the BIA seeks to answer what the impact to the business will be if a particular system were rendered unavailable. Or if you want to jazz up the words a bit, what would be the impact to the business if a rogue state sponsored some hacktivists to totally cyber-pwn your box.

There are different templates and complexities of BIAs that different companies adopt. Generally though, you’ll be working with someone in the business to answer the questions. Well-crafted questions will allow you to prioritise the system you’re looking at, understand how it operates, and therefore reach some sort of scientific conclusion on what the impact to the business will be should the system be unavailable through any means.

You can then put it into a bucket for how much downtime is acceptable. For example, it could be 1-3 hours, 1 day, 1 week or even 1 month or more that a system can be down before having significant impact. A monthly payroll system may only be used once a month, so if the system is down at the beginning of the month for a few days, there isn’t a major impact. Perhaps there are manual workarounds that could be deployed in the interim. On the other hand, there could be an online store that generates over 80% of the company’s revenue, so you can’t afford much downtime at all.
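To make the bucket idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration: the bucket limits, the System class, the function names and the downtime figures for the two example systems are all made up for the purpose, not taken from any standard or from my notes. It assumes the BIA gives you one number per system, the longest downtime the business says it can tolerate, and uses that both to pick a bucket and to suggest what to recover first.

```python
from dataclasses import dataclass

# Hypothetical bucket limits in hours, loosely following the examples above.
BUCKETS = [
    (3, "1-3 hours"),
    (24, "1 day"),
    (24 * 7, "1 week"),
    (24 * 30, "1 month"),
]


@dataclass
class System:
    name: str
    max_downtime_hours: float  # assumed to be agreed with the business in the BIA


def bucket_for(system: System) -> str:
    """Return the first bucket whose limit covers the tolerable downtime."""
    for limit, label in BUCKETS:
        if system.max_downtime_hours <= limit:
            return label
    return "more than 1 month"


def recovery_order(systems: list[System]) -> list[System]:
    """Least tolerable downtime first, so those systems get recovered first."""
    return sorted(systems, key=lambda s: s.max_downtime_hours)


# Illustrative figures only; real values come out of the BIA conversations.
systems = [
    System("monthly payroll", 72),
    System("online store", 1),
]

for s in recovery_order(systems):
    print(f"{s.name}: {bucket_for(s)}")
```

The code is the easy part. The number you feed into it is only as good as the conversation that produced it.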

The important thing to bear in mind whilst completing a BIA and arriving at a conclusion is that people will answer questions based on their own understanding and view of the world. Once I was speaking to a person about the criticality of a system, to which he responded that it was very very critical and couldn’t be down for any length of time. Probing a bit further, I enquired as to why it was so important, to which he responded that without the system, he wouldn’t be able to do his job. I agreed that him not being able to do his job was most definitely an impact on him as an individual, but what would the impact to the company be? Would other processes fall down, would customers be unable to proceed with decisions, would the CEO be asking for data out of this system? He rather sheepishly replied “no, I don’t suppose anyone else would really notice.”

I tried to assure him that this wasn’t about his job security, I was just trying to figure out which systems need to be recovered first if we invoke Business Continuity procedures. He nodded, but I don’t think he believed me. The next time I walked past his desk he was on a job site.

The main point being that simply getting people to fill out a set of BIA questions alone isn’t sufficient. You need to be involved to a degree to ensure the quality of the responses is sufficient.

3. Business Continuity Plan Development

By now you would have collated some information and saved some money by identifying redundant jobs, so you have enough information to start developing the plan. The strategy should encompass everything that may need continuity, so that would not just be computing, but also your facilities, people, supplies and other equipment. Think of things like a transport strike that means people can’t get into the office. Or a blocked drain in the building that means people can’t work just because of the smell.

Many years ago, when I first started work, there was a young graduate who had started at the same time as me. A few months into the job, he had an unfortunate accident and passed away. Naturally everyone in the office was upset by this, particularly his team members who worked directly with him daily. On the day of his funeral, all of his team and most of the rest of the office were out attending it. Sure, it had a big impact on the company that day when nearly all of the IT support function was out, but no manager is going to prevent people from going to a funeral, and even if you could force a few people to stay behind, would they really be in the state of mind to operate efficiently?

When planning, keep in mind that people won’t always be in the best state of mind to make the most rational decisions. Factor this into your plans.

4. Plan Approval and Implementation

The emotional state of people during a disaster is a good reason why it’s so important to have your plan fully documented. It’s a lot easier to agree and document a plan of action in advance. What you do need to ensure though is that the documented plan is approved at the right level. Nothing is more fun than having a documented plan that is ignored by some big chief who wants to play hero during an incident.

Also make sure that people are aware of the plan. Because if you’ve left the organization or are on holiday somewhere on the other side of the world when someone needs to use the plan, it’s no good if they can’t find it. But don’t make it publicly available for anyone to pick up and read. After all, it could contain sensitive information about your company, its assets and other proprietary information.

Finally, keep the plan up to date. Why do you have a plan that talks about using the win95 recovery disk? Yes, I actually read that in a plan… in 2008!

Disaster Recovery Planning

DR plans are for when things really go bad. I’m not talking about a blocked sewage pipe, I’m talking about every single pipe getting blocked and spontaneously bursting, thus flooding your entire building. It’s Armageddon, it’s a realization of those scenarios you see in Hollywood movies but thought could never happen to you.

I find myself talking in the voice of the movie trailer guy as I write this.

Where DR plans differ from BCP is that with DR planning, you’re looking to set up a framework or a method for how the company can effectively make decisions in a logical way should the worst happen.

In essence, a DR plan will seek to minimize any decision making required by staff during a disaster. During an event, people may be emotional, worried, complacent, or just curious as to what’s going on and hence distracted from what they need to do. It’s not the best time to expect them to be making strategic decisions. Having people know what they are supposed to be doing will hopefully protect the organization from major failure by minimizing the risk from delays in recovering from an incident.

The planning process is similar to the BCP process. So assuming you’ve already undertaken BC planning, you’ll have the BIAs to hand, so you’ll start from defining what you need to do for the business to recover. A lot of material you will read will go into the merits of having mutual aid agreements with other companies whose facilities you could share and vice versa if the need arises. Or having your own hot, cold or warm backup sites. Of course, being so many years old, my notes don’t make any reference to the cloud, which is another place where businesses are hosting their critical applications.

It is interesting, because a lot of companies are not factoring cloud-based (or general 3rd party hosted) applications into their DR plans because contractually, the cloud provider is responsible for it all. I’d argue, though: what would you do if your cloud provider got hit by a disaster they couldn’t recover from? What would you do in order to continue your business operation? The answer will vary depending on the type of business and the criticality of the applications that are run in the cloud. The point is that you can’t blindly rely on a 3rd party provider just because they have claimed something in the contract. Which is why it’s important to conduct adequate due diligence on your 3rd parties to make sure they really do have the capabilities to back up their claims. It’s like, would you seriously go into a dangerous situation armed with a gun that only fires plastic pellets and jams after 3 rounds? Or would you take in a fully tested and functioning AK47? It depends on the situation, but there are few times where having an AK47 doesn’t strongly increase your negotiating power.

Which brings us nicely onto testing. There’s no point in having a lovely and wonderfully orchestrated DR plan if it all starts to fall apart when you need it because somebody forgot to get toilet roll for the building or change the static IP addresses. There are different types of tests you can undertake such as:

Checklist

This may seem like the lazy person’s test, but it is very cost effective. It’s where you send the plan to everyone and they all review it in their own time. It’s not really a test per se but more of a guide to agree the principles. Think of it as sending instructions to someone on how to swim, but without the cost or hassle of actually getting into the pool.

Structured walk-through

A step up from the checklist, it’s where everyone gathers in a room and walks through the plan collectively. Again, people don’t get wet, but it’s a good way to laugh at others’ responsibilities and duties in the plan.

Simulation

A simulation is where you do a dry run involving all the staff who will be involved in providing support in the event of a disaster. They’ll usually all be asked to reconvene at an alternative site and pretend there is a disaster going on around them. Most staff end up treating it as half a day out of the office and try to figure out what they’ll be doing once they finish.

Parallel

A parallel test is a full test of the recovery plan, using all staff and resources. The key point though is that the actual production environment isn’t touched and is left running as usual. In effect you end up running a second ‘hot’ site for the duration of this test.

Full interruption

This is the real deal. Other than the burning building and screaming ladies, this is where the production system is shut down and the disaster recovery plan is tested to its limits. Although this is the only way to be absolutely sure you’ve got a proper recovery plan in place, it’s one that requires extreme bravery to execute. Most people stop and wonder, what if they can’t recover, what if something breaks, what if the main system is unrecoverable. So they usually declare that their parallel test was good enough and leave it at that.

Communicating externally

A disaster is similar in many ways to any other incident, except at a much larger scale. Just like an incident, it’s important to have well-established communication channels set up through which you can get in touch with key contacts such as the police, fire services, medical facilities, utility providers, press, customers, shareholders, partners, the list goes on.

It’s a good strategy to utilize social media to communicate with your wider customer base as it’s usually quicker and more direct. So you see, social media isn’t all bad, is it?

Another important communication strategy that needs to be put in place is dealing with staff and their families. If, as a result of a disaster, there is a loss of life or serious injury, how would you communicate with the family? If, as a result of the disaster, the business is impacted so badly that it has to lay off staff, how is that communicated?

Communicating with the media is also something that needs to be handled carefully in the best of times, but even more so during a disaster. You want the company to be accessible, but also ensure a media-trained spokesperson is nominated. The last thing you want is Bill from mainframe support on the 9 o’clock news flapping his gums about the disaster that he knows absolutely nothing about because he was in the basement when it all happened.

They think it’s all over

Be sure the plan is very clear as to when the disaster is over. Usually this is when all operations are returned to the normal state in the original locations etc. This is important because it’s at that point you can take a snapshot of your data, assets and personnel and compare it with the pre-disaster state to effectively gauge the impact.
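As a rough sketch of what that comparison could look like, assume you can export an inventory before and after the event as a simple mapping of item name to state. The function name, the item names and the states below are all hypothetical, purely for illustration; a real asset register will be messier.

```python
def compare_snapshots(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {item: state} snapshots and report what differs."""
    shared = before.keys() & after.keys()
    return {
        # in the pre-disaster snapshot but gone afterwards
        "missing": sorted(before.keys() - after.keys()),
        # only present after the disaster
        "new": sorted(after.keys() - before.keys()),
        # present in both, but the recorded state has changed
        "changed": sorted(k for k in shared if before[k] != after[k]),
    }


# Made-up inventories, for illustration only.
pre = {"laptop-001": "issued", "laptop-002": "issued", "db-server": "running"}
post = {"laptop-001": "issued", "db-server": "restored from backup"}

print(compare_snapshots(pre, post))
```

Anything that lands in the missing list is worth a second look before you declare the whole thing over.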

Disasters are a good opportunity for thieves and fraudsters to attack. Setting off fire alarms in order to walk out of buildings with a couple of laptops is a common technique. You can also find the business is subject to vandalism and looting so try to ensure as many different considerations are agreed upon before you’re faced with a calamity.
