Wednesday, June 3, 2009

Service Level Agreements

I was recently asked about our approach to Service Level Agreements (SLAs) at BIDMC.

We develop customer-facing SLAs for every new infrastructure component and application as part of our standard project management methodology. We work collaboratively with the application owner and subject matter experts to develop a mutually acceptable process for support escalation, with defined availability and response times.

The end result is a series of documents that outline customer and IS responsibilities and provide enough detail about the application to understand its scope and uses.

Customer Facing Documents:
1. Customer Project and Post Project Responsibilities - This document serves as a foundation for each project and sets customer expectations for support roles and responsibilities.
2. Service Level Agreement - I've attached an SLA for a live application to illustrate the types of service level documentation we provide.

Internal IT Documents:
1. Business Impact Analysis - a worksheet used by our managers to facilitate discussions with application owners and document service level objectives based on business requirements.
2. Service Level Objectives - availability and disaster recovery service levels by class of application

A few general observations about our SLAs.

1. Much of the work that once required planned downtime is now done as a background task, thanks to improvements in our configurations. For example, we have clustered servers, redundant network components and Internet connections, mirrored storage devices, shadowed or mirrored databases, and other improvements that have remarkably decreased the need for disruptive, planned outages.

2. Escalation processes differ slightly for our mission critical clinical applications, such that downtime over two hours triggers implementation of paper-based downtime procedures.

3. In addition to our own hosted applications, we have a few Software as a Service applications. Our SLAs with hosting vendors include:

a. Expected uptime. In some cases this is backed by a well-defined formula that states the goal, e.g., 99.9%, and any other qualifiers, such as excluding planned downtime that is done at mutually agreed-upon times. Whatever is set as an uptime goal usually drives the high availability and disaster recovery configurations.

b. Transaction performance. This has traditionally not been a problem for us, but for applications that may not have been engineered well, it's an important component of an SLA.

c. Escalation. Defining the event levels (priority one, priority two etc.), contacts, and what response time (phone vs on-site) and repair time can be expected is a key component. Time to repair is usually a tough negotiation in hosted application SLAs.

d. Remedies. This is not usually defined in internal agreements, but is for vendor agreements. The typical remedy is a credit on future maintenance payments, which is not always satisfying if you lose an application for a prolonged period.
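
To make the uptime item concrete, here is a minimal sketch of one common way such a formula is written, with mutually agreed planned downtime excluded from the measurement base. It is an illustration only, not the formula from any BIDMC or vendor agreement; the function names (uptime_percent, downtime_budget) and the sample figures are assumptions chosen for the example.

```python
from datetime import timedelta


def uptime_percent(period, unplanned_downtime, planned_downtime=timedelta(0)):
    """Availability as a percentage of scheduled service time.

    Assumption: agreed planned downtime is removed from the base, so it
    neither counts against the goal nor inflates the denominator.
    """
    scheduled = period - planned_downtime
    if scheduled <= timedelta(0):
        raise ValueError("planned downtime must be shorter than the period")
    return 100.0 * (scheduled - unplanned_downtime) / scheduled


def downtime_budget(period, goal_percent, planned_downtime=timedelta(0)):
    """Unplanned downtime allowed in the period before the goal is missed."""
    scheduled = period - planned_downtime
    return scheduled * (1 - goal_percent / 100.0)


# Hypothetical month: 30 days, 4 hours of agreed maintenance,
# 30 minutes of unplanned outage, measured against a 99.9% goal.
month = timedelta(days=30)
planned = timedelta(hours=4)
unplanned = timedelta(minutes=30)

achieved = uptime_percent(month, unplanned, planned)
budget = downtime_budget(month, 99.9, planned)

print(f"Achieved uptime: {achieved:.3f}%")
print(f"Allowed unplanned downtime at 99.9%: {budget}")
print("Goal met" if achieved >= 99.9 else "Goal missed")
```

Whether planned maintenance is excluded, and over what period uptime is measured, changes the result noticeably, which is why those qualifiers belong in the agreement itself.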

Feel free to use my SLA documents as templates in support of your own service level documentation needs.


GreenLeaves said...

Very informative, and thank you for sharing the documents. I was interested in the process by which the paper-based response is activated.

Is this automatic, i.e., do the users automatically initiate the process at the 2-hour point, or does IT initiate it as soon as they see that the 2-hour window will be exceeded?
My guess is that either can occur.

Bernz said...

I've run a 24/7 legal SaaS for 8 years. We have, annually, around 99.99% uptime, though we've had periods of inaccessibility for certain customers due to Internet issues that are out of our control or the discovery of a bug that causes inaccessibility in certain circumstances. We have SLAs that give back money if we're down for certain periods of time, though given that a minute of downtime for us can cost a client about $2,000, the money we give back is paltry.

IT has become mission-critical and, in the case of health, life-critical. There is speculation that the recent Air France disaster was due to buggy software. At that point, SLAs of 99.999% don't matter when the "downtime" is loss of life.

I like Dr. Halamka's approach to dealing with SaaS vendors by explicitly defining what "service" really means. Not enough of our customers do that, and I would love to see more of them spell out what they expect "service" to mean.

Slicehost, a pretty well-regarded cloud vendor, has a fun answer for their SLAs -

"Do you offer an SLA? Not for Slices and here’s why: most hosting SLA agreements are just plain silly. They promise things like 99.9% uptime, but downtime excludes: scheduled maintenance, network outages, hardware failures and software trouble. Well what exactly is left to cause downtime? Here’s our SLA: we’ll do our best to keep your machines running smoothly for as long as possible and get them up ASAP should something go wrong."

Not exactly right for a hospital, but I admit I like that attitude -- not making excuses but aiming to satisfy the customer.

Bill Boyle said...

Thank you for sharing these templates.

John Halamka said...

Regarding GreenLeaves' comment, the decision to move to manual/paper-based downtime procedures is made together with clinical leadership. If we're very close to issue resolution, the 2 hours might be extended a bit. It's far worse to go to paper, then back to electronic. Transcription from paper causes errors and compounds the risk of downtime, so we really try to avoid it.

spilledmilk said...

Excellent insight into critical processes and procedures from an SLA perspective.

I am in agreement with everyone: staying away from paper as long as possible during a downtime is the best option.

GreenLeaves said...

Does the hospital ever practice failover to paper and resumption of normal operations?
When things run smoothly for long periods these processes are quickly forgotten.

Vaughan Merlyn said...

Very helpful, thanks. It seems like you have deftly avoided the trap of SLA hell! I have seen organizations where the "A" (agreement) was the overarching concept, and IT organizations saw SLAs as a way of contractually keeping out of trouble with users, while users saw SLAs as a way of contractually holding IT accountable for its services.

When the "A" dominates over the "S" and the "L", from my experience, this is a slippery slope to a destructive business-IT relationship. When, on the other hand (as you seem to have done) the process of reaching agreement on services and service level objectives is robust and collaborative, service level negotiation and agreement become important business-IT relationship building tools.