Wednesday, November 7, 2007

A Downtime Lesson Learned

Earlier this week, we had a few minutes of unexpected downtime in our mission-critical clinical systems. The cause was interesting - a programmer accidentally asked our computers to do more than they were capable of doing. A runaway process brought a very powerful, highly redundant cluster to the point of unresponsiveness. The cluster was so saturated that even our system administrators could not log on to kill the offending processes.

The event led me to examine all of last year's unplanned clinical system downtime. Three of our four unplanned events were related to runaway processes taking over computers before system administrators could intervene. Thus, we have highly redundant, highly scalable, change-controlled infrastructure and software, but today's technology does not give computers the intelligence to say no when asked to perform processes that exceed the computer's capacity.

Over the next few weeks, we're going to think about automated watchdog systems that can take action just before the computers pass the point of no return. Actions could be killing runaway processes, reserving sufficient computer resources for high-priority tasks such as system administration interventions, throttling processes that take more than a certain amount of total resources, or preventing new processes/sessions of a certain type from being initiated after a certain threshold is reached.
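
To make the idea concrete, here is a rough sketch of the decision logic such a watchdog might use. It is only an illustration, not something we run: the names (ProcSample, plan_actions) and every threshold value are assumptions invented for the example, and a real watchdog would also need a way to collect the samples and carry out the actions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ProcSample:
    """One process as seen by a hypothetical watchdog."""
    pid: int
    cpu_share: float          # fraction of total capacity in use, 0.0 to 1.0
    protected: bool = False   # e.g. a system administration session


def plan_actions(
    samples: List[ProcSample],
    throttle_limit: float = 0.25,   # assumed: throttle above 25% of capacity
    kill_limit: float = 0.60,       # assumed: kill above 60% of capacity
    admit_ceiling: float = 0.85,    # assumed: refuse new sessions above 85% total
) -> Tuple[List[Tuple[int, str]], bool]:
    """Return (actions, admit_new) for the current set of samples."""
    actions = []
    for s in samples:
        if s.protected:
            # Never act against processes reserved for administrators.
            continue
        if s.cpu_share >= kill_limit:
            actions.append((s.pid, "kill"))
        elif s.cpu_share >= throttle_limit:
            actions.append((s.pid, "throttle"))
    total = sum(s.cpu_share for s in samples)
    admit_new = total < admit_ceiling
    return actions, admit_new


samples = [
    ProcSample(pid=101, cpu_share=0.62),
    ProcSample(pid=102, cpu_share=0.26),
    ProcSample(pid=1, cpu_share=0.03, protected=True),
    ProcSample(pid=103, cpu_share=0.05),
]
actions, admit_new = plan_actions(samples)
print(actions, admit_new)
```

The point of separating the decision from the enforcement is that the thresholds can be tested and tuned without touching a live cluster.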

If anyone has any experience with building self-regulators into larger server farms, let me know!