Tuesday, March 4, 2008

The CareGroup Network Outage

On November 13, 2002 at 1:45pm, Beth Israel Deaconess Medical Center went from the hospital of 2002 to the hospital of 1972. Network traffic stopped. No clinician could order medications or labs electronically. No decision support was available. Luckily, no patients were harmed. Here's the story of what happened and our lessons learned.

In the years after the 1996 merger of the Beth Israel and Deaconess Hospitals, operating losses caused several years of capital starvation (the network budget in 2002 was $50,000 for the entire $2 billion enterprise). We were not able to invest in infrastructure, so our network components were beyond their useful life.

However, the root cause of the problem was not this infrastructure underinvestment; it was my lack of enterprise network infrastructure knowledge. I did not know what I did not know. Here are the details:

1. Our network topology was perfectly architected for 1996. Back in the days when the internet was a friendly place and all internal and external collaborators could be trusted, a switched core (layer 2) that transmitted all packets from place to place was a reasonable design. After 1996, the growing likelihood of denial-of-service attacks, trojans, and other malware meant that networks should be routed and highly segmented, isolating any bad actor to a constrained local area. At the time of our outage, a data flood in one part of the network propagated to every other part of the network - a bit like living downstream from a dam that collapsed. A well-meaning researcher had created a Napster-like application that began exchanging hundreds of gigabytes of data via multicast with multiple collaborators. The entire network was so saturated that we could not diagnose the root cause of the data flood. I did not know that a switched core was a single point of failure.
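In a segmented design, the kind of flood described above stays inside one segment. As a rough sketch (the interface names, VLAN number, and thresholds here are hypothetical, not our actual configuration), a Catalyst-style access port confines a host to a single VLAN behind a routed boundary, and storm control caps the fraction of link bandwidth that broadcast or multicast traffic may consume:

```
! Hypothetical access-port configuration (illustrative only).
! The port is confined to one VLAN; a routed distribution layer
! keeps floods from crossing segment boundaries.
interface GigabitEthernet0/1
 switchport mode access
 switchport access vlan 10           ! research systems isolated in VLAN 10
 storm-control broadcast level 5.00  ! suppress broadcast above 5% of link speed
 storm-control multicast level 5.00  ! suppress multicast above 5% of link speed
 storm-control action trap           ! raise an SNMP trap instead of forwarding the flood
```

With a configuration along these lines, a runaway multicast application saturates one access port and one VLAN rather than the entire campus.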

2. Our network team was managed by a very smart engineer who did not share all his knowledge with the rest of the team. Much of our network configuration was poorly documented. With knowledge of our network isolated to one person, we had a single point of human failure. I did not know that this engineer was unfamiliar with the best practices for routed/redundant network cores, routed distribution layers, and switched access layers isolated into VLANs with quality-of-service configurations to prevent monopolization of bandwidth by any one user or application. We brought in a Cisco partner, Callisma, to document the network, but the network failure occurred before they were finished.
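The quality-of-service piece of that best practice can be sketched with Cisco's modular QoS configuration (again hypothetical - the ACL, class names, and the 10 Mbps cap are invented for illustration): traffic matching a class is policed to a fixed rate, so no one application can monopolize a link.

```
! Hypothetical policy capping one application's share of bandwidth.
ip access-list extended RESEARCH-FILESHARE
 permit ip 10.1.10.0 0.0.0.255 any    ! traffic from the research subnet
class-map match-all BULK-TRANSFER
 match access-group name RESEARCH-FILESHARE
policy-map LIMIT-BULK
 class BULK-TRANSFER
  police 10000000 conform-action transmit exceed-action drop  ! ~10 Mbps cap
interface GigabitEthernet0/2
 service-policy input LIMIT-BULK      ! apply the cap to inbound traffic
```

Had a policy of this shape been in place, the researcher's file-sharing application would have been throttled at its access link rather than saturating the core.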

3. I did not know about spanning tree algorithms, the Hot Standby Router Protocol (HSRP), and Open Shortest Path First (OSPF). During the outage, I approved configuration changes that actually made the situation worse by triggering spanning tree recalculations, flooding the network with even more traffic.
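For readers unfamiliar with those protocols, here is a minimal hedged sketch (addresses, group numbers, and process IDs are invented): HSRP lets two distribution routers share one virtual gateway address so that a router failure is invisible to hosts, and OSPF recomputes routes around a failed link per subnet instead of forcing a campus-wide layer 2 spanning tree reconvergence.

```
! Hypothetical HSRP pair: two routers present virtual gateway 10.1.10.1
! to hosts in VLAN 10; if the active router fails, the standby takes over.
interface Vlan10
 ip address 10.1.10.2 255.255.255.0
 standby 10 ip 10.1.10.1
 standby 10 priority 110   ! higher priority wins the active role
 standby 10 preempt        ! reclaim the active role after recovery
!
! Hypothetical OSPF process: failures are rerouted at layer 3,
! contained to the affected subnets.
router ospf 1
 network 10.1.0.0 0.0.255.255 area 0
```

In a layer 2 switched core, by contrast, topology changes are handled by spanning tree, whose recalculations can briefly flood the whole broadcast domain - which is exactly what our mid-outage changes triggered.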

4. I did not establish a relationship with the vendor (Cisco) that enabled them to warn me about our vulnerabilities. A relationship with a vendor can take many forms, ranging from a sales driven vendor/client adversarial relationship to a collaborative partnership. In 2002, Cisco was just another vendor we purchased from. Today they are a collaborative partner with a seat at the table when we plan new infrastructure.

5. I did not know that we needed "out of band" tools to gain insight into the problems with the network. In effect, we required the network to be functional to diagnose problems with the network.
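Out of band access typically means a serial console server or a dedicated management port on a physically separate network, so a device remains reachable even when its production links are saturated. A hypothetical fragment (the port name and addressing are invented, and vary by platform):

```
! Hypothetical dedicated management interface on a physically
! separate management network, reachable when production is down.
interface FastEthernet0
 description Out-of-band management - isolated network
 ip address 192.168.100.5 255.255.255.0
 no shutdown
```

During our outage, every monitoring tool we had rode the same saturated network it was supposed to diagnose.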

6. We did not have a robust, tested downtime plan for a total network collapse. When the outage occurred, we rapidly designed new processes to transport lab results, orders, and other data via runners from place to place.

7. We did not have a robust communication plan for responding to a total network collapse. Email, web-based paging, portals, and anything that used the data network for communication was down. Voice mail broadcasts using our PBX and regular phones (not IP phones) turned out to save the day.

8. When we diagnosed the problem, we explored many root causes and made many changes in the hope that we'd find the magic bullet to cure the problem. In the end, we fixed many basic structural problems in the network, which took two days and eventually solved the problem. A more expedient solution would have been to reverse all the changes we made in our attempts to fix the network once we had stopped the internal attack. When a crisis occurs, making changes on top of changes can make diagnosis and remediation even more difficult.

9. We did not have an enterprise-wide change control process to ensure that every network configuration, server addition, and software enhancement was documented and assessed for impact on security, stability, and performance. Today we have a weekly change control board that includes all IS departments, Cisco engineering services, and IS leadership to assess and communicate all configuration changes.

10. I was risk averse and did not want to replace the leadership of the network team for fear that terminating our single point of human failure would result in an outage. The price of keeping the leadership in place was a worse outage. I should have acted sooner to bolster leadership of the team.

Despite the pain and stress of the outage, there was a "lemons to lemonade" ending. Without this incident, the medical center might never have realized the importance of investing in basic IT infrastructure. If not for the "perfect storm", we might have limped along with a marginal network for years.

Today, we receive annual capital funding to support a regular refresh of our technology base, and we are asked to introduce change at a pace that is manageable. People in the medical center still remember the outage and are more accepting of a tightly managed infrastructure such as locked down workstations, security precautions, disaster recovery initiatives, and maintenance downtime windows.

During and immediately following the event, I presented this overview of the outage to senior management. Shortly after the outage, I worked with CIO Magazine to document the events and lessons learned.

I hope that my experience will help other organizations prevent network and other IT outages in the future.

Comments:

xcontrol2000 said...

Hi John, first let me say how much I enjoy your blog and your willingness to share information. I remember hearing about the network failure when it happened and I can only imagine what a difficult time it was for you and your staff.

I have two questions regarding the changes you have made since. First, I noticed you mentioned using a strong CMDB, and I am wondering if you have implemented, or plan to implement, any of the ITIL framework? Certainly the CMDB isn't used only there, but I was curious if you use it. The second question is regarding documentation: it's always a struggle just to get it done, but a bigger challenge is to get it into a system where it is usable and fairly easy to maintain. I wanted to know what you use for system documentation in your organization. Thank you very much.
Chris Antonellis

John Halamka said...

We use CiscoWorks (http://www.cisco.com/en/US/products/sw/cscowork/ps2425/) to collect configuration data on our network. We also contract with Cisco's Advanced Services, who use an appliance to capture the configuration data from CiscoWorks, transport it to their location, and scrub it against a set of best-practice rules. Our folks are also good about updating "as-built" conditions in Visio whenever a major change has occurred. We have found this combination to be a satisfactory way to manage our network configuration without overburdening the staff with nice-to-haves. In terms of ITIL, many of our staff have had ITIL training and observe its teachings in managing our service levels, capacity, performance, availability, and problems.

ian said...

Hello John, I salute you for your transparency and willingness to share best practices. You've clearly learned from this incident, and improved your organization's network infrastructure. I hope other IT professionals can learn from this as well.

jessica said...

Hi John,
First, I have to say this is very weird, as I have just finished reading a case study about this in Applegate's "Corporate Information Strategy and Management". I have the homework due on the 8th.

May I ask…did you implement anything back then that was specifically looking out 10-15 years?

Thank you for your time,
Jessica Ramsaran

