Monday, May 16, 2011

Should We Abandon the Cloud?

It's been a bad month for the cloud.

First there was the major Amazon EC2 (Elastic Compute Cloud) outage April 21-22 that brought down many businesses and websites.  Some of the data was unrecoverable and transactions were lost.

Next, the May 10-13 outage of Microsoft's cloud-based email and Office services (Business Productivity Online Suite) caused major angst among its customers, who thought that the cloud offered increased reliability.

Then we had the May 11-13 Google Blogger outage which brought down editing, commenting, and content for thousands of blogs.

Outages from the 3 largest providers of cloud services within a 2 week period do not bode well.

Yesterday, Twitter went down as well.

Many have suggested we abandon a cloud-only strategy.

Should we abandon the cloud for healthcare?  Absolutely not.

Should we reset our expectations that highly reliable, secure computing can be provided at very low cost by "top men" in the cloud?  Absolutely yes.

I am a cloud provider.   At my Harvard Medical School Data Center, I provide 4000 cores and 2 petabytes of data to thousands of faculty and staff.   At BIDMC, I provide 500 virtualized servers and a petabyte of data to 12,000 users.   Our BIDPO/BIDMC Community EHR Private Cloud provides electronic health records to 300 providers.

I know what it takes to provide 99.999% uptime.  Multiple redundant data centers, clustered servers, arrays of tiered storage, and extraordinary power engineering.
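
To put a number on that target, here is a minimal arithmetic sketch of the downtime budget an availability figure implies. The function names are hypothetical, the 365.25-day year is an assumption, and the redundancy formula assumes component failures are fully independent, which is an idealization rather than a description of any real data center.

```python
# Downtime budget implied by an availability target (illustrative arithmetic only).
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: an average year of 365.25 days


def downtime_minutes_per_year(availability: float) -> float:
    """Return the minutes of downtime per year allowed by an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components.

    Assumption: failures are independent, so the system is down only
    when every copy is down at the same time.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - component_availability) ** copies


if __name__ == "__main__":
    # Print the yearly downtime budget for two to five nines.
    for target in (0.99, 0.999, 0.9999, 0.99999):
        print(f"{target:.5f} -> {downtime_minutes_per_year(target):8.2f} min/year")
```

Under these assumptions, 99.999% leaves a little over five minutes of downtime per year. On paper, pairing two 99% components yields 99.99%, but only if their failures really are independent; a single configuration mistake or software bug that hits both copies at once breaks that assumption.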

With all of this amazing infrastructure comes complexity.   With complexity comes unanticipated consequences, change control challenges, and human causes of failure.

Let's look at the downtime I've had this year.

1.  BIDMC has a highly redundant, geographically dispersed Domain Name System (DNS) architecture.   In theory it should not be able to fail.  In practice it did.  The vendor was attempting to add features that would make us even more resilient.  Instead of making changes to a test DNS appliance, they accidentally made changes to a production DNS appliance.   We experienced downtime in several of our applications.

2.  HMS has clustered thousands of computing cores together to create a highly robust community resource connected to a petabyte of distributed storage nodes.   In theory it should be invincible.   In practice it went down.   A user with limited high performance computing experience launched a poorly written job on 400 cores in parallel that caused a core dump every second, all contending for the same disk space.   Storage was overwhelmed and went offline for numerous applications.

3.  BIDMC has a highly available cluster to support clinical applications.    We've upgraded to the most advanced and feature rich Linux operating system.  Unfortunately, it had a bug: when used in a very high performance clustered environment, the entire storage filesystem became unavailable.  We had downtime.

4.  BIDMC has one of the most sophisticated power management systems in the industry - every component is redundant.   As we added features to make us even more redundant, we needed to temporarily reroute power, which is not an issue for us because every network router and switch has two power supplies.   We had completed 4 of 5 data center switch migrations when the redundant power supply failed on the 5th switch, bringing down several applications.

5.  The BIDPO EHR hosting center has a highly redundant and secure network.  Unfortunately, bugs in the network operating system on some of the key components stopped all traffic from flowing.

These examples illustrate that even the most well engineered infrastructure can fail due to human mistakes, operating system bugs, and unanticipated consequences of change.

The cloud is truly no different.  Believing that Microsoft, Google, Amazon or anyone else can engineer perfection at low cost is fantasy.   Technology is changing so fast and increasing demand requires so much change that every day is like replacing the wings on a 747 while it's flying.   On occasion bad things will happen.   We need to have robust downtime procedures and business continuity planning to respond to failures when they occur.

The idea of creating big data in the cloud, clusters of processors, and leveraging the internet to support software as a service applications is sound.

There will be problems.   New approaches to troubleshooting issues in the cloud will make "diagnosis and treatment" of slowness and downtime faster.

Problems on a centralized cloud architecture that is homogeneous, well documented, and highly staffed can be more rapidly resolved than problems in distributed, poorly staffed, one-off installations.

Thus, I'm a believer in the public cloud and private clouds.  I will continue to use them for EHRs and other healthcare applications.   However, I have no belief that the public cloud will have substantially less downtime or lower cost than I can engineer myself.

The reason to use the public cloud is so that my limited staff can spend their time innovating - creating infrastructure and applications that the public cloud has not yet envisioned or refuses to support because of regulatory requirements  (such as HIPAA).

Despite the black cloud of the past two weeks, the future of the cloud, tempered by a dose of reality to reset expectations, is bright.

10 comments:

Donald Green MD said...

The major issue is not with the "cloud" but the vagaries of human behavior that accompany it. It seems presently out of sync with desired outcomes. To be reasonably protective of data, its human creators have to practice efficient work flow and ethics at their places of work. Until this necessity is addressed more globally this problem will not be solved, no matter the size of the investment.

Using this blogger's personal observations of computer use in hospitals many instances have been noted of concern. A number of users of digital information in hospitals walk away from machines leaving them open to any Tom, Dick, or Harry or their female counterparts. Without a single ability to hack into a computer its database is easily open to pirating. Of course this is not the whole story and John has outlined more sophisticated instances of data invasion.

Further if this protection does indeed lead to soaring costs it is marching in an unsustainable direction. It is opposite to the struggle for lowering costs in health care. This will inevitably lead either to more Federal or State regulations or frank takeover for that matter.

Maybe a breather is necessary before marching on. Let's help providers be excellent at their job by meshing their ability to enhance the doctor-patient relationship with best practices of use of technological tools. Then let's see what makes sense in connecting up these systems.

Anonymous said...

You bring up some interesting thoughts on the reliability of the cloud. The cloud is highly dependent on access to universal Internet bandwidth. The telecoms have been able to deliver 5x9's reliability for many years, but they only have to address one of the many components of the IT stack. What can we learn from how they do it? They spend billions of dollars on redundancy and infrastructure in support.

You might find my blog on the subject of the increasing interdependence of cloud services to be relevant to this discussion.

http://itknowledgeexchange.techtarget.com/it-consulting/tangled-up-in-clouds-interdependency-lessons-from-the-aws-outage/

Peter Charland said...

This is a great discussion, and underscores what we at Stratus have learned in over 30 years of delivering world class high availability solutions - that uptime assurance involves not only resilient technologies, but also proactive monitoring and management, and best practices. You can't bet on technology alone.

MR01 said...

I really liked the "Top Men" reference!

Mallesh Murugesan said...

Interesting thoughts. From what I have seen and experienced, reliability is not the #1 reason for not adopting cloud solutions; the availability and the security of the architecture are. Specifically in a health care environment, availability (sufficient bandwidth to access data) and security (HIPAA) will make or break the move to a cloud architecture.

Vijay said...

Still the advantages outweigh the minus points. For extremely mission critical apps it's pretty risky. To reduce this risk, redundancy can be implemented for mission critical apps. The cost of redundancy will still offset the cost of a possible outage.

We should see 3rd party business continuity services for Clouds, in the near future.

GLevin said...

I recently posted a similar comment on my blog at Health Train Express. I expressed my own reservations and opinions. Following this I interviewed Ryan Howard from Practice Fusion. Medical applications should not and would not be hosted on clouds such as AWS; PF is hosted on its own network servers. General cloud servers do include encryption and security but have not been CCHIT or HIPAA certified. My blog is at http://healthtrain.blogspot.com

Sheldon Devine said...

Something to think about...very interesting conversation and comments

Anonymous said...

My company has yet to move anything into a public cloud and I am reluctant to do so. As mentioned in the article, the infrastructures in place at these cloud providers are more than likely very complex, and I don't care what any cloud provider will try and say, they will have issues that will cause downtime.

My concern is that when a problem arises in these complex environments causing downtime for my company's applications or services, we will now have to sit and wait in line as they work on getting the thousands of other customers back up and running.

If we have an issue at my company on our systems currently, we worry only about getting our own problem fixed as quickly as possible, and we have more flexibility in creating temporary workarounds to reduce any downtime that may occur.

Anonymous said...

Hi, John - long time! I'm quite late to this party - a friend just steered me here.

On e-patients.net I wrote a post about all this, "Cloud-only is dead for healthcare," that has been widely misconstrued as "Dave said cloud is dead." NO, doofuses out there, read the post; I said cloud-ONLY is dead. In the bullets I said (twice), "You gotta have offline data backups," and "For any 'can’t afford to be down' situation, you gotta have ironclad availability."

Nonetheless, commenters gave me a good thumping for the stuff I didn't actually say :-).

More important, though, they took me to school about the uptime of non-cloud systems.

Whatever. I use cloud for almost every aspect of my own little business. It rocks.

(Perversely, Blogger's comment login appears to be failing tonight, so I have to sign this manually.)

e-Patient Dave