Fires, SAN failures, tornados, and car accidents all came up today, yet it was a very good day at work. I just started my job two weeks ago and made a list of questions to go through with them. After asking the one about when they last pulled an offsite backup and restored each Tier-1 database, I ended up in the middle of a project to restructure the disaster recovery and business continuity plan.
As a DBA I thought my job was to ensure that a database was recoverable from any disaster and that if you focused completely on the worst disasters possible then you would be safe from anything. In the short-sighted version of things, perhaps I’m still right. However, in the realistic version I had much to learn, and still do.
What is my job?
First, my job isn’t to make sure the databases are recoverable, it’s to make sure that database recoverability is part of a full business continuity plan. That goes well beyond restoring a database to some server when I have no idea where it is, how it got there, or what network it’s connected to, and there’s no app server to go with it. There is so much more to a DBA’s responsibility in this aspect than I once believed.
If the app isn’t restored with the database then there’s no purpose for the database. If there’s nowhere for the employees to sit and use the app then there’s no purpose for the app. Finally, if you restore too far from your current location then no one except the most senior and dedicated employees will travel to work. All of this has to fit together for the mere responsibilities of a DBA to matter at all.
Second, the worst of disasters isn’t the most common, and a multiple-day turnaround time is unreasonable for something as simple as a SAN failing or a building burning down. Save the week-long turnaround times for the disasters that remove your ability to work in the city you’re in. Small disasters shouldn’t cost more than a couple days of data and productivity loss.
You need to look at disasters in different levels, and you’ll realize that as you go up the levels they become increasingly less common. Therefore you can’t have a simple business continuity plan and expect it to cover it all. Also, if you don’t have anything yet, you can’t wait until you’ve figured it all out before you implement the first levels.
At the most common, you have a processing mistake. The server and infrastructure are perfectly fine, someone just did something they shouldn’t have done. Forget about the details, it doesn’t matter to you at this point if it was a developer mixing up < and > (haha, brings back memories of lost weekends), a user deleting the wrong records, or a DBA making a simple mistake. No matter what, your challenge at this point is the same, you’re missing data and the business wants life back to normal as quickly as possible. For this the business needs to define what level of recoverability you need to have and help ensure you have the budget to do it. If they never want to lose more than 2 minutes of data, you need log backups. If an hour is ok then you can get by with Volume Shadow Copy Service (VSS) snapshots on VMs or possibly differentials on physical boxes. If a whole day of data loss is acceptable then you’re home free with daily fulls or differentials. No matter what you’re talking about here, this should be easily accessible by the DBA without outside help anytime day or night.
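To make that conversation with the business concrete, here is a minimal sketch in Python of the mapping described above. The three options come straight from the paragraph; the exact cutoffs between them (under an hour, under a day) and the function name `minimum_backup_strategy` are my assumptions for illustration, not a rule.

```python
from datetime import timedelta


def minimum_backup_strategy(max_data_loss: timedelta) -> str:
    """Return the least demanding option that still meets the tolerated data loss.

    The cutoffs are assumed: anything under an hour is treated as needing
    log backups, anything under a day as needing snapshots or differentials.
    """
    if max_data_loss <= timedelta(0):
        raise ValueError("tolerated data loss must be positive")
    if max_data_loss < timedelta(hours=1):
        return "log backups"
    if max_data_loss < timedelta(days=1):
        return "VSS snapshots on VMs or differentials on physical boxes"
    return "daily fulls or differentials"


for tolerance in (timedelta(minutes=2), timedelta(hours=1), timedelta(days=1)):
    print(tolerance, "->", minimum_backup_strategy(tolerance))
```

The point isn’t the code, it’s that the input is a number the business has to give you, preferably in writing.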
Small Hardware Failures
Now let’s move on to local hardware failures. A single drive shouldn’t be a concern because there’s no excuse not to have a minimum of RAID-5 on absolutely every file on a server. Not saying you don’t have to double-check that the server “Prod-012” isn’t a laptop sitting under a stack of papers on an engineer’s desk, I’m just saying there’s no excuse. Anyways, motherboards don’t last forever, RAM fails, and if your RAID-5 array is too large then it’s no safer than a stand-alone drive. If you lose something, how long will it take to have something else take over? If you lose a couple drives do you have enough spare space to get you by until you can replace them? As for servers, Active/Passive clusters or VMs are great, one loss and you’re up and running in 5 minutes without too many complaints. Active/Active clusters are also great if they’re set up right, as long as you’re not double-dipping on the RAM with two instances each set up to use 50 GB of RAM on a box with 64 GB of RAM where the OS will lose the fight quickly. Standalones, however, are out of luck and need to have a plan in place. How long is the business able to deal without that server? Did the business give you that answer in writing, and is the person in the business responsible for that server currently aware of that potential downtime as of your last annual review of disaster recovery and business continuity? Wait, do you do an annual review with the business of your disaster recovery and business continuity??? Many companies don’t, and it’s your responsibility to start that conversation.
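The Active/Active RAM warning above is simple arithmetic, so here is a rough sketch of the check. The function name `memory_headroom_gb` and the 4 GB set aside for the OS are assumptions I picked for illustration; use whatever reserve fits your servers.

```python
def memory_headroom_gb(
    box_ram_gb: float,
    instance_max_gb: list[float],
    os_reserve_gb: float = 4.0,  # assumed OS reserve, not a recommendation
) -> float:
    """RAM left on one node if every instance fails over to it and grows to its max."""
    return box_ram_gb - os_reserve_gb - sum(instance_max_gb)


# The example from the paragraph: two instances at 50 GB each on a 64 GB box.
headroom = memory_headroom_gb(64, [50, 50])
if headroom < 0:
    print(f"Overcommitted by {-headroom} GB after failover")
else:
    print(f"{headroom} GB to spare after failover")
```

A negative number means that after a failover both instances are fighting each other and the OS for memory that isn’t there.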
Shared Hardware Failures
Shared hardware and services also fail, no matter how fault-tolerant you think they may be. Power goes out, do you have a generator? Networks go down, do you have two connections that can handle the traffic at an acceptable speed coming in through separate hardware? SANs spontaneously combust, were your data files and backup files on that same rack? Walk through your server room one day with a system admin and point to each device, discuss what happens when it stops working. Telephone poles and transformers take time to replace, were your fault-tolerant connections coming in on the same pole? Now we’re at the point that I’m not expecting you to have a spare on hand, but you still need a plan. SANs are expensive, so you at least need to know where the money is going to come from if you need a new one overnight. Again, know your risks, know your recovery time, and make sure the business has agreed to that in writing.
Now for what people tend to have in mind, fires, tornados, floods, blah, blah, blah, they all happen. I’m in Pittsburgh, so hurricanes are covered in floods and there is no specific earthquake plan. Is your server room in an area that would get flooded more than once a century? If so, can it be moved? If the building your server room is in is no longer there, what do you have to recover with? Offsite backups of databases aren’t enough. You aren’t restoring databases at this point. If that’s what you were thinking, you’re in a heap of trouble here.
First, you’re panicked; everyone’s panicked. Where is your plan, it wasn’t in that building, was it? Who knows where to find the plan, since you’re no longer in a disaster where you can count on everyone being available? Can everything be restored to a functioning state with acceptable data loss from data stored offsite, and was it tested to prove that it could be? Remember, if you’ve only restored the database then you’ve restored a lot of valuable, yet useless, information. Are the app servers in that same plan and up-to-date so the users can use the data? Where are those users going to sit? I know, this is on a DBA’s blog, so mostly DBAs are reading this, and we don’t tend to think about a user’s cube, or whether their temporary cube is further than an hourly employee would even consider traveling to work. However, right now our job is business continuity, not strictly databases.
So, say you do have an offsite, easily accessible plan, written so that panicked and stressed employees, mixed in with temps making up for the employees who had to tend to their families instead of showing up for work, could understand and implement it flawlessly. What does the business expect? I keep coming back to this because the answer is usually that they think we have a million-dollar plan while giving us a small budget. Realistically it may take us a week to have it all together, and we’d never know the full picture without testing it. However, the business may be there saying that if it’s down for more than four days then we’re out of business, which would really suck after putting all that work into this awesome, yet lower-budget plan. Make sure everyone knows the expectation and have that offsite plan not only signed off by the business, but the proof that they signed off on it stored offsite as well. The IT department isn’t the only department stressed out when the data center burns to the ground, and stressed out people like to blame others for everything no matter who is at fault.
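One way to keep that expectation gap from hiding is to write both numbers down next to each other for every level of disaster. This is only a sketch: the `DisasterLevel` class and `expectation_gaps` function are hypothetical names, the week versus four days figures are the example from the paragraph above, and the SAN row is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DisasterLevel:
    name: str
    tested_recovery_days: float  # what testing says IT can actually deliver
    business_max_days: float     # what the business signed off on


def expectation_gaps(levels: list[DisasterLevel]) -> list[str]:
    """Names of the levels where tested recovery is slower than the business can survive."""
    return [lvl.name for lvl in levels if lvl.tested_recovery_days > lvl.business_max_days]


levels = [
    DisasterLevel("SAN failure", 2, 2),       # illustrative numbers only
    DisasterLevel("data center lost", 7, 4),  # a week to recover, four days to survive
]
print(expectation_gaps(levels))
```

Anything that shows up in that list is a conversation to have with the business now, not after the building is gone.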
Oh no, I never thought that would happen
Finally, the largest of disasters. What if the entire city your data center is in becomes inaccessible? I know, how often does that really happen? Hurricanes took out New Orleans for months and parts of New York City for weeks, earthquakes threaten to do the same to parts of the country, terrorist attacks are possible, plagues can happen (although I don’t foresee the zombie apocalypse), and I now know where the nearest nuclear power plant is. Do you have recent backups stored outside of the region that can recover from these types of disasters, and has the business signed off on this? For some companies it is acceptable for the answer to be that those backups don’t exist and that the company goes out of business in these disasters, and that’s ok. Again, get it in writing, and store that piece of information in your personal offsite storage (yay, Google Drive). If this is the case, you’re protecting your rehirability when a company goes out of business saying their IT department failed them in a disaster.
I provide advice, not business continuity plans
Now it’s time for my disclaimer. This is by no means all-inclusive, and nothing you get from ANYONE’s blog should be taken as such. Some companies have someone who has the job title VP of Business Continuity, and some don’t. No matter what the case, it’s your responsibility as a DBA to make sure the DBA team has a part in this process and that the process is implementable for all levels of disaster, with a level of restorability that the business is aware of. Unless you’re the only person in the IT department you shouldn’t have to do this alone, but it is your responsibility to start it if no one else has taken that responsibility.