Tuesday, January 1, 2008

Disaster Recovery Planning

In response to my posting about IT Governance, I received a very good question about prioritizing infrastructure spending: "Without an IT infrastructure steering committee, how do you resolve investment prioritization around these unseen but critical investments?"

Every year, I receive approximately $10 million at BIDMC and $3 million at HMS for infrastructure spending on networks, servers, desktops, storage and wiring. This budget is an annuity based on the value of our IT infrastructure and the lifecycle of the components. However, it does not include funding for disaster recovery.
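
One way to read "an annuity based on value and lifecycle" is as straight-line replacement: each asset class contributes its replacement value divided by its useful life. The sketch below assumes that reading. The asset classes, dollar values and lifecycles are hypothetical placeholders, not BIDMC or HMS figures.

```python
def annual_refresh_budget(assets: dict[str, tuple[float, float]]) -> float:
    """Sum of (replacement value / lifecycle in years) across asset classes."""
    return sum(value / years for value, years in assets.values())


# Hypothetical replacement values (dollars) and lifecycles (years).
# These numbers are invented for illustration only.
assets = {
    "servers": (10_000_000, 5),
    "desktops": (8_000_000, 4),
    "network": (14_000_000, 7),
}

print(f"${annual_refresh_budget(assets):,.0f} per year")
```

Because the amount follows from the installed base rather than from new projects, it recurs every year without being re-argued, which is what makes it behave like an annuity.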

Five years ago, an audit at BIDMC pointed out our vulnerability to a disaster affecting the CareGroup data center, since the building itself is a single point of failure. I worked with the Board and senior management to raise awareness of disaster recovery planning and the need to make a multi-year capital investment. I've mentioned our disaster recovery work in previous blog entries, but have not provided the details.

Cost of Information Technology
What will keep me up at night in 2008
Some Like it Hot

Here are all the details of how we're doing it, including our budgets.

BIDMC
Step 1 We inventoried all our applications and determined the service levels required based on the business impact of downtime. We did not hire a team of expensive consultants for a formal business impact analysis. Instead, we used our existing governance committees to brainstorm how long applications could be interrupted before clinical workflow would be disrupted to the point of causing harm. Here are some examples of our informal business impact analysis:

Code Paging system - If a patient suffered a cardiac arrest and the code team did not respond, the patient could die. Hence, downtime of the code paging system must be a few minutes per year at most. No downtime at all is the goal.

Provider Order Entry - If medications, diagnostic testing and diets cannot be ordered, patients could have delays in therapy resulting in pain, extended illness or harm. Hence downtime of POE must be hours per year at most.

Revenue Cycle systems - If bills cannot be sent out for a day, no real harm is done since billing in hospitals is not a real time activity. However, if several days pass without billing, cash flow could be interrupted. Hence, downtime of revenue cycle systems must be a few days per year at most.

Library catalog - If the library catalog is disrupted, users will have to seek other sources of information on the web. A slight inconvenience will occur. Downtimes could be extensive without causing harm.
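
The informal analysis above can be turned into availability targets by converting each tolerable downtime into a fraction of the year. This is only a sketch: the post gives the tolerances as "a few minutes", "hours" and "a few days" per year, so the exact minute values below are assumed placeholders, and the `Application` class and `required_availability` function are hypothetical names.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


@dataclass
class Application:
    name: str
    max_downtime_minutes_per_year: float  # assumed figure, not from the post


def required_availability(app: Application) -> float:
    """Fraction of the year the application must be up."""
    return 1.0 - app.max_downtime_minutes_per_year / MINUTES_PER_YEAR


# Illustrative values only, chosen to match the qualitative tiers in the text.
apps = [
    Application("Code paging", 5),                # "a few minutes per year"
    Application("Provider order entry", 4 * 60),  # "hours per year"
    Application("Revenue cycle", 3 * 24 * 60),    # "a few days per year"
]

# Most demanding applications first.
for app in sorted(apps, key=lambda a: a.max_downtime_minutes_per_year):
    print(f"{app.name}: {required_availability(app):.5%} availability")
```

Ranking applications this way makes it easier to decide which ones justify the most expensive protection.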

Step 2 We mapped out single points of failure in power, cooling, networks, servers, storage and infrastructure applications (e.g. DNS/DHCP). We developed an incremental plan to address these vulnerabilities and hired a new employee to coordinate risk mitigation efforts, beginning with enhancements to our existing data center.
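
A minimal sketch of the mapping exercise in Step 2, assuming the inventory can be written down as a list of independent components per infrastructure layer. The layer and component names are made up for illustration and do not describe our actual environment.

```python
# Hypothetical inventory: for each infrastructure layer, the independent
# components currently able to provide it. All names are invented.
inventory = {
    "power": ["utility-feed-A", "generator-1"],
    "cooling": ["chiller-1"],
    "network core": ["core-switch-1", "core-switch-2"],
    "DNS/DHCP": ["dns-1"],
    "data center building": ["primary-site"],
}


def single_points_of_failure(layers: dict[str, list[str]]) -> list[str]:
    """Return the layers that are served by exactly one component."""
    return sorted(layer for layer, providers in layers.items() if len(providers) == 1)


for layer in single_points_of_failure(inventory):
    print(f"Single point of failure: {layer} ({inventory[layer][0]})")
```

In this toy inventory the building itself shows up as a single point of failure, which is the problem Step 3 addresses.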

Step 3 Since the data center itself was a single point of failure, we constructed a geographically distant data center to mitigate loss of the primary data center and have begun replicating data and applications in this secondary location.

BIDMC is in year 3 of a 5-year disaster recovery center implementation plan. The year-by-year budget, totaling $13 million, which will support the recovery time and recovery point objectives specified by our business impact analysis, is here.
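
For readers less familiar with the terms: a recovery point objective (RPO) limits how much recent data may be lost, and a recovery time objective (RTO) limits how long a service may stay down. The sketch below shows one way to check a recovery drill against both. The function name, timestamps and objectives are all hypothetical.

```python
from datetime import datetime, timedelta


def meets_objectives(
    last_replicated: datetime,
    failure_time: datetime,
    service_restored: datetime,
    rpo: timedelta,
    rto: timedelta,
) -> tuple[bool, bool]:
    """Return (rpo_met, rto_met) for one outage."""
    data_loss_window = failure_time - last_replicated  # data not yet copied
    downtime = service_restored - failure_time         # time spent offline
    return data_loss_window <= rpo, downtime <= rto


# Hypothetical drill: every time and objective here is invented.
failure = datetime(2008, 1, 1, 12, 0)
rpo_met, rto_met = meets_objectives(
    last_replicated=failure - timedelta(minutes=10),
    failure_time=failure,
    service_restored=failure + timedelta(hours=3),
    rpo=timedelta(minutes=15),
    rto=timedelta(hours=2),
)
print(f"RPO met: {rpo_met}, RTO met: {rto_met}")
```

In this invented drill the replication lag is within the RPO but the three-hour restore misses the two-hour RTO, so the data is safe while the service is down too long.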

Harvard Medical School
HMS is in year 2 of a 5-year plan to provide similar protections. Since HMS is not a healthcare delivery organization but provides education, research and administrative services, the uptime requirements are less rigorous. HMS had a slightly different set of business requirements to meet when we began this project. Its primary data center was located in a 100-year-old building with limited electrical and cooling support. Hence, we wanted to establish a secondary data center as an extension of the existing primary data center, then move all mission-critical systems to the new data center, reserving the original data center for less critical applications and disaster recovery. The HMS five-year milestones can be summarized as:

Year 1 Create a new Data Center and run both the old and new physical locations as a single "virtual data center". This allowed us to keep existing applications running, add new applications to the new data center, and migrate servers from old to new in phases.
Year 2 Create a redundant network core and begin to operate the two physical locations as a primary and backup data center. Hire a disaster recovery coordinator.
Year 3 Create redundant storage, high performance computing and active/passive email hosting divided between the two data centers.
Year 4 Create redundant installations of critical applications between the two data centers.
Year 5 Create redundant installations of critical applications between the two data centers.

Each year of these plans enables us to progressively reduce risk. Of course, this disaster recovery planning must be complemented by a disaster response plan including calling/paging trees, communication strategies, and a playbook for responding to critical incidents. I'll post these plans in a later blog entry.

Just like security, disaster recovery planning is a journey. It requires a dedicated team, a project plan and a budget. We'll never be done, but by 2011 we'll have mitigated the risk of single points of data center failure for the majority of our applications.

5 comments:

Rich Kubica said...

We are also implementing a Disaster Recovery Site at Hartford Hospital. Given the criticality of clinical applications, we have modified the design to be what we have labelled a "Continue to Run" (CTR) site.

Good luck as you move forward.

NoticeBored said...

Hi guys.

I'd be interested in how you are both differentiating investment in continuity/resilience/high-availability systems and networks, from "true" contingency/DR planning. What I'm getting at is the distinction between 'keeping the lights on' and 'getting things going again if the lights go off'. In my experience, managers seldom appreciate the difference until it's pointed out to them ... or they suffer a major incident!

Thanks for an interesting posting. I'm currently researching for a security awareness module on BCP/DRP so this is very opportune.

Kind regards,
Gary Hinson
www.NoticeBored.com

Rich Kubica said...

Gary, there is complexity in having the DR site be available for continued processing. We have clustered our critical servers and SANs across a fiber-optic network. For other recovery, more true DR, we have placed our Test and Development Servers at the DR Site and we will repurpose them in the event of an actual disaster. Business Continuity (Downtime Processes) is another part of the puzzle that needs to be addressed.

Differentiation is based on clinical need as John H has indicated in his entry.
