Tuesday, January 1, 2008

Disaster Recovery Planning

In response to my posting about IT Governance, I received a very good question about prioritizing infrastructure spending: "Without an IT infrastructure steering committee, how do you resolve investment prioritization around these unseen but critical investments?"

Every year, I receive approximately $10 million at BIDMC and $3 million at HMS for infrastructure spending on networks, servers, desktops, storage and wiring. This budget is an annuity based on the value of our IT infrastructure and the lifecycle of the components. However, it does not include funding for disaster recovery.

Five years ago, an audit at BIDMC pointed out our vulnerability to a disaster affecting the CareGroup data center, since the building itself is a single point of failure. I worked with the Board and senior management to raise awareness of disaster recovery planning and the need to make a multi-year capital investment. I've mentioned our disaster recovery work in previous blog entries, but have not provided the details.

Cost of Information Technology

What will keep me up at night in 2008
Some Like it Hot

Here are all the details of how we're doing it, including our budgets.

Step 1 We inventoried all our applications and determined the service levels required based on the business impact of downtime. We did not hire a team of expensive consultants for a formal business impact analysis. Instead, we used our existing governance committees to brainstorm how long applications could be interrupted before clinical workflow would be disrupted to the point of causing harm. Here are some examples of our informal business impact analysis:

Code Paging system - If a patient suffered a cardiac arrest and the code team did not respond, the patient could die. Hence, downtime of the code paging system must be a few minutes per year at most. No downtime at all is the goal.

Provider Order Entry - If medications, diagnostic testing and diets cannot be ordered, patients could have delays in therapy resulting in pain, extended illness or harm. Hence downtime of POE must be hours per year at most.

Revenue Cycle systems - If bills cannot be sent out for a day, no real harm is done since billing in hospitals is not a real time activity. However, if several days pass without billing, cash flow could be interrupted. Hence, downtime of revenue cycle systems must be a few days per year at most.

Library catalog - If the library catalog is disrupted, users will have to seek other sources of information on the web. A slight inconvenience will occur. Downtimes could be extensive without causing harm.
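
The informal analysis above can be sketched as a small ranking exercise. This is only an illustration, not the process our governance committees actually followed: the numeric downtime limits are assumed stand-ins for the qualitative phrases in the examples ("a few minutes", "hours", "a few days", "extensive"), and the names App, required_availability and rank_by_criticality are hypothetical.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


@dataclass(frozen=True)
class App:
    name: str
    # Tolerable downtime per year. The figures used below are illustrative
    # assumptions, not values taken from the post.
    max_downtime_minutes_per_year: float


def required_availability(app: App) -> float:
    """Fraction of the year the application must be up to stay within its limit."""
    return 1 - app.max_downtime_minutes_per_year / MINUTES_PER_YEAR


def rank_by_criticality(apps):
    """Most critical first: the less downtime an app tolerates, the higher it ranks."""
    return sorted(apps, key=lambda a: a.max_downtime_minutes_per_year)


APPS = [
    App("Library catalog", 14 * 24 * 60),   # "extensive": two weeks is a placeholder
    App("Revenue cycle", 3 * 24 * 60),      # "a few days per year": assumed 3 days
    App("Provider order entry", 4 * 60),    # "hours per year": assumed 4 hours
    App("Code paging", 5),                  # "a few minutes per year": assumed 5 minutes
]

if __name__ == "__main__":
    for app in rank_by_criticality(APPS):
        print(f"{app.name}: {required_availability(app):.5%} availability required")
```

Sorting by tolerable downtime gives a simple ordering for investment: applications at the top of the list are the ones whose single points of failure would be addressed first.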

Step 2 We mapped out single points of failure in power, cooling, networks, servers, storage and infrastructure applications (e.g. DNS/DHCP). We developed an incremental plan to address these vulnerabilities and hired a new employee to coordinate risk mitigation efforts, beginning with enhancements to our existing data center.
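
As a rough sketch of what such a mapping can look like, the inventory can be written as a table of infrastructure layers and the independent components that provide each one; any layer with fewer than two providers lacks redundancy. The inventory below is entirely hypothetical and does not describe the actual BIDMC or HMS environment, and the function name single_points_of_failure is an assumption made for this example.

```python
def single_points_of_failure(inventory: dict[str, list[str]]) -> list[str]:
    """Return the layers served by fewer than two independent components."""
    return sorted(layer for layer, providers in inventory.items() if len(providers) < 2)


# Hypothetical inventory for illustration only.
INVENTORY = {
    "power": ["utility feed", "generator"],
    "cooling": ["chiller 1"],
    "network core": ["core switch 1"],
    "storage": ["array 1", "array 2"],
    "DNS/DHCP": ["dns-1", "dns-2"],
    "data center building": ["primary site"],
}

if __name__ == "__main__":
    for layer in single_points_of_failure(INVENTORY):
        print(f"No redundancy: {layer}")
```

In this made-up inventory the building itself shows up as a single point of failure, which is the kind of finding that no amount of work inside the existing data center can resolve.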

Step 3 Since the data center itself was a single point of failure, we constructed a geographically distant data center to mitigate loss of the primary data center and have begun replicating data and applications in this secondary location.

BIDMC is in year 3 of a 5-year disaster recovery center implementation plan. The year-by-year budget, which totals $13 million and will support the recovery time and recovery point objectives specified by our business impact analysis, is here.

Harvard Medical School
HMS is in year 2 of a 5-year plan to provide similar protections. Since HMS is not a healthcare delivery organization but provides education, research and administrative services, the uptime requirements are less rigorous. HMS had a slightly different set of business requirements to meet when we began this project. Its primary data center was located in a 100-year-old building with limited electrical and cooling support. Hence we wanted to establish a secondary data center as an extension of the existing primary data center, then move all mission-critical systems to the new data center, reserving the original data center for less critical applications and disaster recovery. The HMS five-year milestones can be summarized as:

Year 1 Create a new Data Center and run both the old and new physical locations as a single "virtual data center". This allowed us to keep existing applications running, add new applications to the new data center, and migrate servers from old to new in phases.
Year 2 Create a redundant network core and begin to operate the two physical locations as a primary and backup data center. Hire a disaster recovery coordinator.
Year 3 Create redundant storage, high performance computing and active/passive email hosting divided between the two data centers.
Years 4 and 5 Create redundant installations of critical applications between the two data centers.

Each year of these plans enables us to progressively reduce risk. Of course, this disaster recovery planning must be complemented by a disaster response plan including calling/paging trees, communication strategies, and a playbook for responding to critical incidents. I'll post these plans in a later blog entry.

Just like security, disaster recovery planning is a journey. It requires a dedicated team, a project plan and a budget. We'll never be done, but by 2011 we'll have mitigated the risk of single points of data center failure for the majority of our applications.


Rich Kubica said...

We are also implementing a Disaster Recovery Site at Hartford Hospital. Given the criticality of clinical applications, we have modified the design to be what we have labelled a "Continue to Run" (CTR) site.

Good luck as you move forward.

Gary said...

Hi guys.

I'd be interested in how you are both differentiating investment in continuity/resilience/high-availability systems and networks, from "true" contingency/DR planning. What I'm getting at is the distinction between 'keeping the lights on' and 'getting things going again if the lights go off'. In my experience, managers seldom appreciate the difference until it's pointed out to them ... or they suffer a major incident!

Thanks for an interesting posting. I'm currently researching for a security awareness module on BCP/DRP so this is very opportune.

Kind regards,
Gary Hinson

Rich Kubica said...

Gary, there is complexity in having the DR site be available for continued processing. We have clustered our critical servers and SANs across a fiber-optic network. For other recovery, more true DR, we have placed our Test and Development Servers at the DR Site and we will repurpose them in the event of an actual disaster. Business Continuity (Downtime Processes) is another part of the puzzle that needs to be addressed.

Differentiation is based on clinical need as John H has indicated in his entry.