Wednesday, July 31, 2013

Downtime in 2002 versus 2013

On November 13, 2002, the network core at Beth Israel Deaconess failed due to a complex series of events and the hospital lost access to all applications.   Clinicians had no email, no lab results, no PACS images, and no order entry.    All centrally stored files were unavailable.   The revenue cycle could not flow.   For 2 days, the hospital of 2002 became the hospital of 1972.  Much has been written about this incident including a CIO Magazine article and a Harvard Business School case.

On July 25, 2013, a storage virtualization appliance at BIDMC failed in a manner that left us with a Hobson's choice: do nothing and risk potential data loss, or intervene and create slowness/downtime. Since data loss was not an option, we chose slowness. Here's the email I sent to all staff on the morning of July 25.

"Last evening, the vendor of the storage components that support Home directories (H:) and Shared drives (S:) recommended that we run a re-indexing maintenance task in order avoid potential data corruption. They anticipated this task could be run in the middle of the night and would not impact our users.   They were mistaken.

The indexing continues to run and must run to completion to protect H: and S: drive data.  While it is running, access to H: and S: will be slow, but also selected clinical web applications such as Provider Order Entry, webOMR, Peri-operative Information System, and the ED Dashboard will be slow.  Our engineers are monitoring the clinical web applications minute to minute and making adjustments to ensure they are as functional as possible.   We are also investigating options to separate clinical web applications from the storage systems which are causing the slowness.

All available IS resources are focused on resolving this as soon as possible.  We ask that all staff and clinical services affected by the interruption utilize downtime procedures  until the issue is resolved.  We apologize for the disruption this issue has caused to patients, providers, and staff."

2002 and 2013 were very different experiences.   Here's a brief analysis:

1.  Although 2002 was an enterprise downtime of all applications, there was an expectation and understanding that failure happens. The early 2000s were still early in the history of the web. There was no cloud, no app-enabled smartphones, and no universal adoption of social networking. Technology was not massively redundant. Planned downtime still occurred on nights and weekends.

In 2013, there is a sense that IT is like heat, power, and light - always there and assumed to be high performing.   Any downtime is unacceptable as emphasized by the typical emails I received from clinicians:

"My patients are still coming on time and expect the high quality care they normally receive. They also want it in a timely manner.  Telling them the computer system is down is not an acceptable answer to them.   Having an electronic health care record is vital but when we as physicians rely on it and when it is not available, it leads to gaps in care."

"Any idea how long we will be down? I am at the point where I may cancel my office for the rest of the day as I cannot provide adequate care without access to electronic records."

In 2013, we've become dependent on technology and any downtime procedures seem insufficient.

2. The burden of regulation is much different in 2013. Meaningful Use, the Affordable Care Act, ICD-10, the HIPAA Omnibus rule, and the Physician Quality Reporting System did not exist in 2002. There is a sense now that clinicians cannot get through each day unless every tool and process, especially IT-related ones, is working perfectly.

Add downtime/slowness and the camel's back is broken.

3.   Society, in general, has more anxiety and less optimism.    Competition for scarce resources  translates into less flexibility, impatience, and lack of a long-term perspective.

4.  The failure modes of technology in 2013 are more subtle and are harder to anticipate.

In 2002, networking was simple. Servers were physical. Storage was physical. Today, networks are multi-layered. Servers are virtual. Storage is virtual. More moving parts and more complexity lead to more capabilities, but when failure occurs, it takes a multi-disciplinary team to diagnose and treat it.

5.  Users are more savvy.   Here's another email:

"Although I was profoundly impacted by today's events as a PCP trying to see 21 patients, I understand how difficult it is to balance all that goes into making a decision with a vendor on hardware/software maintenance. However, I was responsible for this for a large private group on very sophisticated IT, and I would urge you to consider doing future maintenance and upgrade projects starting on Friday nights, so as to have as little impact as possible on ambulatory patient care."

My experience with last week's event will shape the way I think about future communications for any IT-related issues. Expectations are higher, tolerance is lower, and clinician stress is overwhelming. No data was lost, no patient harm occurred, and the entire event lasted a few hours, not a few days. However, it will take months of perfection to regain the trust of my stakeholders.

It's been 10 years since we had to use downtime procedures.   We'll continue to reduce single points of failure and remove complexity, reducing the potential for downtime.   As a clinician I know that reliability, security, and usability are critical.   As a CIO I know how hard this is to deliver every day.


Medical Quack said...

Everybody gives me a bad time, but this is what I say all the time to make a point about complexities...

“The short order computer code kitchen burned down several years ago and there was no fire sale”…

Yes indeed it was easier a few years ago before "platforms" and "clouds" came in but now there's a lot more rules and configurations, not to mention code out there:) I was talking to an MD client today and hearing his issues with the corporate hospital VPN he uses that won't turn loose and he has to shut down his computer to get away:)

It's much harder to manage today I fully agree and when you say so, what are the other Healthcare CIOs saying:)

Anonymous said...

The statement at the end... it's been 10 years since we had to use downtime procedures... made me wonder if "downtime drills" have a new place in the hospital, along with fire drills, mass casualty drills, etc. Hopefully, those events are even more rare than server failures... but we drill for them annually or even more, and then the staff seems to come together and perform beautifully in the direst of circumstances.

Oh I See (CIO inverted) said...

We recently migrated data centers and had to plan for a 48-hour downtime over a weekend. While we were up and running in 36, the real impact was felt on Monday morning, when some apps did not behave the way they were expected to.

You rightly say that tolerance levels are getting lower and expectations are climbing new peaks. The challenge is to balance these.

Thanks for sharing.