Tuesday, March 4, 2008

The CareGroup Network Outage

On November 13, 2002 at 1:45pm, Beth Israel Deaconess Medical Center went from the hospital of 2002 to the hospital of 1972. Network traffic stopped. No clinician could order medications or labs electronically. No decision support was available. Luckily, no patients were harmed. Here's the story of what happened and our lessons learned.

In the years after the 1996 merger of the Beth Israel and Deaconess Hospitals, operating losses caused several years of capital starvation (the network budget in 2002 was $50,000 for the entire $2 billion enterprise). We were not able to invest in infrastructure, so our network components were beyond their useful life.

However, the root cause of the problem was not this infrastructure underinvestment; it was my lack of enterprise network infrastructure knowledge. I did not know what I did not know. Here are the details:

1. Our network topology was perfectly architected for 1996. Back in the days when the internet was a friendly place where all internal and external collaborators could be trusted, a switched core (layer 2) that transmitted all packets from place to place was a reasonable design. After 1996, the likelihood of denial-of-service attacks, trojans, or other malware meant that networks should be routed and highly segmented, isolating any bad actors to a constrained local area (the sketch after this list illustrates the containment benefit of segmentation). At the time of our outage, a data flood in one part of the network propagated to every other part of the network - a bit like living downstream from a dam that collapsed. A well-meaning researcher created a Napster-like application that began exchanging hundreds of gigabytes of data via multicast with multiple collaborators. The entire network was so saturated that we could not diagnose the root cause of the data flood. I did not know that a switched core was a single point of failure.

2. Our network team was managed by a very smart engineer who did not share all his knowledge with the rest of the team. Much of our network configuration was poorly documented. With knowledge of our network isolated to one person, we had a single point of human failure. I did not know that this engineer was unfamiliar with best practices for routed/redundant network cores, routed distribution layers, and switched access layers isolated into VLANs with quality-of-service (QoS) configurations to prevent monopolization of bandwidth by any one user or application. We brought in a Cisco partner, Callisma, to document the network, but the network failure occurred before they were finished.

3. I did not know about spanning tree algorithms, the Hot Standby Router Protocol (HSRP), or Open Shortest Path First (OSPF). During the outage, I approved configuration changes that actually made the situation worse by causing spanning tree propagations, flooding the network with even more traffic.

4. I did not establish a relationship with our vendor (Cisco) that would have enabled them to warn me about our vulnerabilities. A relationship with a vendor can take many forms, ranging from a sales-driven, adversarial vendor/client relationship to a collaborative partnership. In 2002, Cisco was just another vendor we purchased from. Today they are a collaborative partner with a seat at the table when we plan new infrastructure.

5. I did not know that we needed "out-of-band" tools to gain insight into problems with the network. In effect, we required the network to be functional in order to diagnose problems with the network (a minimal sketch of the idea follows this list).

6. We did not have a robust, tested downtime plan for a total network collapse. When the outage occurred, we rapidly designed new processes to transport lab results, orders, and other data via runners from place to place.

7. We did not have a robust communication plan for responding to a total network collapse. Email, web-based paging, portals, and anything that used the data network for communication was down. Voice mail broadcasts using our PBX and regular phones (not IP phones) turned out to save the day.

8. When we diagnosed the problem, we explored many root causes and made many changes in the hope that we'd find the magic bullet to cure the problem. In the end, we fixed many basic structural problems in the network, which took two days and eventually solved the problem. A more expedient approach would have been to reverse all the changes we had made in our attempts to fix the network once we had stopped the internal data flood. When a crisis occurs, making changes on top of changes can make diagnosis and remediation even more difficult.

9. We did not have an enterprise-wide change control process to ensure that every network configuration change, server addition, and software enhancement was documented and assessed for its impact on security, stability, and performance. Today we have a weekly change control board that includes all IS departments, Cisco engineering services, and IS leadership to assess and communicate all configuration changes.

10. I was risk averse and did not want to replace the leadership of the network team for fear that terminating our single point of human failure would result in an outage. The price of keeping the leadership in place was a worse outage. I should have acted sooner to bolster leadership of the team.
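
To make the segmentation lesson from items 1 and 2 concrete, here is a minimal Python sketch. It is an illustration only, not our actual addressing plan: the 10.0.0.0/16 range, the /24 subnet size, and the offending host address are all invented. The point it demonstrates is containment: in a flat layer-2 design a flood reaches every host, while in a routed, segmented design the worst case is one saturated subnet.

import ipaddress

# Hypothetical flat layer-2 design: every host shares one broadcast domain,
# so a multicast or broadcast flood from any single host reaches all of them.
flat_network = ipaddress.ip_network("10.0.0.0/16")

# Hypothetical segmented design: the same space carved into routed /24 subnets,
# one per VLAN, with routers (and filters/QoS) sitting between them.
vlan_subnets = list(flat_network.subnets(new_prefix=24))

def blast_radius(offender_ip, broadcast_domains):
    """Return the broadcast domain that a flooding host can directly saturate."""
    address = ipaddress.ip_address(offender_ip)
    for domain in broadcast_domains:
        if address in domain:
            return domain
    return None

# A single misbehaving workstation (invented address for illustration).
offender = "10.0.37.25"
print("Flat design:      1 broadcast domain,", flat_network.num_addresses, "addresses")
print("Segmented design:", len(vlan_subnets), "broadcast domains of",
      vlan_subnets[0].num_addresses, "addresses each")
print("Flood confined to", blast_radius(offender, [flat_network]), "in the flat design")
print("Flood confined to", blast_radius(offender, vlan_subnets), "in the segmented design")

In a real deployment, the routed boundaries are also where access lists and quality-of-service policies can throttle a misbehaving application before it affects its neighbors.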
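
Item 5's point about out-of-band visibility can also be sketched simply. The fragment below is again only an illustration, with invented device names and management addresses: it polls switch management interfaces and writes the results to local disk, assuming the polling workstation reaches those addresses over a physically separate management network or console servers, so the checks keep working even when the production network is saturated.

import datetime
import subprocess

# Invented device names and management addresses. In a real deployment these
# would be reached over a physically separate management network or console
# servers, so the checks still work when the production network is saturated.
MANAGEMENT_IPS = {
    "core-switch-1": "192.168.250.1",
    "core-switch-2": "192.168.250.2",
    "distribution-a": "192.168.250.10",
}

def reachable(ip, timeout_seconds=2):
    """One ICMP echo via the system ping command (Unix-style flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_seconds), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

# Log to local disk so the record survives even if every network path is down.
with open("oob_status.log", "a") as log:
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    for name, ip in MANAGEMENT_IPS.items():
        status = "UP" if reachable(ip) else "UNREACHABLE"
        log.write(f"{stamp} {name} ({ip}) {status}\n")

Commercial out-of-band tools do far more than this; the sketch only shows the principle that diagnosis must not depend on the network being diagnosed.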

Despite the pain and stress of the outage, there was a "lemons to lemonade" ending. Without this incident, the medical center would never have realized the importance of investing in basic IT infrastructure. If not for the "perfect storm", we might have limped along with a marginal network for years.

Today, we receive annual capital funding to support a regular refresh of our technology base, and we are asked to introduce change at a pace that is manageable. People in the medical center still remember the outage and are more accepting of a tightly managed infrastructure: locked-down workstations, security precautions, disaster recovery initiatives, and maintenance downtime windows.

During and immediately following the event, I presented this overview of the outage to senior management. Shortly after the outage, I worked with CIO Magazine to document the events and lessons learned.

I hope that my experience will help other organizations prevent network and other IT outages in the future.

4 comments:

John Halamka said...

We use CiscoWorks (http://www.cisco.com/en/US/products/sw/cscowork/ps2425/) to collect configuration data on our network. We also contract with Cisco's Advanced Services group, which uses an appliance to capture the configuration data from CiscoWorks, transport it to their location, and scrub it against a set of best-practice rules. Our folks are also good about updating "as-built" diagrams in Visio whenever a major change occurs. We have found this combination to be a satisfactory way to manage our network configuration without overburdening the staff with nice-to-haves. In terms of ITIL, many of our staff have had ITIL training and apply its practices in managing our service levels, capacity, performance, availability, and problems.

I Lamont said...

Hello John, I salute you for your transparency and willingness to share best practices. You've clearly learned from this incident, and improved your organization's network infrastructure. I hope other IT professionals can learn from this as well.

Unknown said...

Hi John,
First, I have to say this is a strange coincidence, as I have just finished reading a case study about this incident in Applegate's "Corporate Information Strategy and Management." I have homework on it due on the 8th.

May I ask…did you implement anything back then that was specifically looking out 10-15 years?

Thank you for your time,
Jessica Ramsaran