As CIOs we have significant responsibility but limited authority. We're accountable for stability, reliability, and security but cannot always control all the variables.
Here's an example of unrelated events coming together to create a problem. It is now well on its way to resolution, and there are many lessons learned that I'd like to share with you.
Timeline
November 2010 - Harvard Medical School financial and compliance experts asked for a temporary hold on computing and storage chargebacks so that a thoughtful service center model could be implemented, one that adheres to every rule and regulation for grant funds flow.
January-March 2011 - Many ARRA grants enabled the research community to purchase microscopes, next generation sequencers, flow cytometry equipment, and other tools/technology that generate vast amounts of data. There was no specific process in place that required storage/IT resource plans before the purchases were made.
April 2011 - The Research Information Technology Group expanded the number of CPU Cores in our high performance computing cluster from 1000 to 4000. Although this did not specifically create an additional storage burden, it enabled the community to run 4 times as many jobs, increasing I/O demand.
May 2011 - All the new research equipment and tools/technologies were turned on by the research community. Storage demands grew from 650 terabytes to 1.1 petabytes in a few weeks. In parallel, next generation sequencing software, which tends to do millions of reads/writes, ran 4 times more often. At the same time, all IT resources were available without a chargeback. Effectively, the demand for a free service became infinite.
June 2011 - A few hours of storage downtime occurred as capacity crossed a threshold such that fewer hard drives were available to handle the I/O load. We initiated a series of immediate actions to resolve the problem, as described in my email to the community below:
"Dear HMS IT User Community:
As a followup to my June 14 email about our plan to rapidly improve storage performance and capacity, here is an update.
The demand for storage between April and June 2011 increased 70%, from 650 Terabytes to 1.1 Petabytes (1,100,000,000,000,000 bytes), and research storage activity doubled. Per our promise we have:
1. Separated all web and other applications into their own storage cluster, enhancing the speed and reliability of all our applications
2. Separated all home folders and administrative collaborations into their own storage cluster, enhancing the speed and reliability of file access for every user
3. Planned migrations of several research collaborations into a separate pool of specialized high speed storage.
4. Retained an expert consultant to provide an independent review of our storage infrastructure in 3 phases - short term improvements to ensure stability, medium term improvements to support growth, and long term improvements to ensure sustainability. Their report will be presented to the Research Computing Governance Committee tomorrow. After all stakeholders have reviewed that report, we will use existing budgets to make additional storage purchases that are consistent with the long term needs of the community.
5. The Research Computing Governance Committee is also working hard on policies to ensure everyone is a good steward of IT resources. A collaboratively developed chargeback model for computing and storage is nearly complete and will be widely vetted for feedback. Until the chargeback model is in place, we'll continue to use quotas to limit growth. By enhancing our storage supply and managing our demand, we will all be successful.
In the meantime, faculty will be receiving a note from the chair of the Research Computing Governance Committee, with further information and suggestions regarding efficient use of computing resources.
Thanks for your ongoing support, we're making progress
John"
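As a quick check on the figures quoted in the timeline and the email, here is a minimal sketch of the arithmetic. It assumes decimal units (1 terabyte = 10^12 bytes), which is what the byte count in the email implies; the function name is only illustrative.

```python
# Figures quoted in the post, assuming decimal units (1 TB = 10**12 bytes).
TB = 10**12
before_bytes = 650 * TB   # storage in use before the growth
after_bytes = 1100 * TB   # 1.1 petabytes

def percent_growth(before, after):
    """Return the growth from `before` to `after` as a percentage."""
    return (after - before) / before * 100

growth = percent_growth(before_bytes, after_bytes)
print(f"Storage growth: {growth:.1f}%")
print(f"1.1 PB in bytes: {after_bytes:,}")
```

The result is a little over 69%, which the email rounds to 70%.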
What can we learn from this?
1. Policy and Technology need to be developed together. No amount of hardware technology will satisfy customer needs unless there is some policy as to how the technology is used. I should have focused on demand management in parallel with supply management, enforcing rigorous quotas and providing useful self-service reports while the chargeback model was being revised.
2. Governance is essential to IT success. Although no one in the user community relayed any storage plans or issues to me, there should have been appropriate committees or workgroups established to coordinate efforts among research labs. In administrative and educational areas, it's common for groups to coordinate efforts with enterprise initiatives. In research areas, it's more common for local efforts to occur without broad coordination. Establishing governance that includes all research lab administrators would help improve this.
3. Approval processes for purchases need to include IT planning. When grants are used to purchase equipment there is no specific oversight of the infrastructure implications of adding such equipment on firewalls, networks, servers and storage. Purchases that generate data should require additional approvals to align infrastructure supply and demand.
4. Be wary of "Big Bang" go-lives. Our high performance computing upgrade was a single event - 1000 cores to 4000 cores. This should have been phased to better assess the impact of the expansion on application use, other elements of infrastructure, and customer expectations.
5. Know your own blind spots. As with the CareGroup Network Outage, there are aspects of emerging technology which are so new that I do not know what I do not know. When storage demand increases by 70% and throughput accelerates by a factor of 4, what happens to an advanced storage infrastructure? Bringing in a third party storage consultant would have filled in knowledge gaps.
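The quotas and self-service reports mentioned in lesson 1 could be sketched roughly as follows. This is an illustration only, not the reporting HMS actually built: the lab names, usage numbers, the 80% warning threshold, and the names LabUsage and quota_report are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LabUsage:
    lab: str         # hypothetical lab identifier
    used_tb: float   # storage currently used, in terabytes
    quota_tb: float  # storage quota, in terabytes

def quota_report(usages, warn_at=0.8):
    """Return (lab, percent_used, status) rows, highest utilization first.

    warn_at is an assumed warning threshold, expressed as a fraction of quota.
    """
    rows = []
    for u in usages:
        fraction = u.used_tb / u.quota_tb
        if fraction >= 1.0:
            status = "OVER QUOTA"
        elif fraction >= warn_at:
            status = "WARNING"
        else:
            status = "OK"
        rows.append((u.lab, round(fraction * 100, 1), status))
    return sorted(rows, key=lambda r: r[1], reverse=True)

# Made-up sample data for illustration.
sample = [
    LabUsage("lab-a", 42.0, 50.0),
    LabUsage("lab-b", 61.0, 60.0),
    LabUsage("lab-c", 10.0, 40.0),
]
report = quota_report(sample)
for lab, pct, status in report:
    print(f"{lab:8} {pct:6.1f}% {status}")
```

A report like this, made available to each lab, lets users see their own consumption before a quota is enforced or a chargeback applies.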
In the world of healthcare quality, there's an analogy that error is like slices of Swiss cheese. If a stack of individual slices is lined up precisely, you can get a hole all the way through the stack. In this case, a series of unrelated events lined up to create a problem. I hope that IT professionals can use this episode to realign their "slices" and prevent infinite demand from impacting a limited supply.