Tuesday, December 23, 2008

Troubleshooting Complex IT problems

Whenever I'm asked to solve an intermittent IT problem such as occasional slowness, occasional lost data or intermittently failing hardware, the research is often complex.

Here's a brief example of the efforts we employ to solve IT mysteries.

A few weeks ago we were told that the Interpreters at BIDMC often received pages with ambiguous call back numbers. At BIDMC, valid numbers are 5 digit extensions, 7 digit local numbers and 10 digit long distance numbers. Interpreters often received 4 digit or 6 digit numbers that were impossible to call back.
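The length rule above is easy to express as a check. This is a minimal sketch, assuming call back numbers arrive as strings that may contain punctuation; the names `VALID_LENGTHS` and `is_callable` are made up for illustration and are not part of the paging software.

```python
# Valid call back lengths at BIDMC, per the paragraph above:
# 5-digit extensions, 7-digit local numbers, 10-digit long distance numbers.
VALID_LENGTHS = {5, 7, 10}


def is_callable(callback: str) -> bool:
    """Return True if the call back number has a dialable length.

    Non-digit characters (dashes, spaces) are ignored. This checks length
    only; it cannot tell whether the digits themselves are correct.
    """
    digits = "".join(ch for ch in callback if ch.isdigit())
    return len(digits) in VALID_LENGTHS


if __name__ == "__main__":
    # Hypothetical sample values, not taken from real paging logs.
    for number in ["12345", "1234", "123456", "617-555-0100"]:
        print(number, is_callable(number))
```

A 4 digit or 6 digit entry fails this check, which matches the pages the Interpreters could not call back.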

The most obvious explanation for such an intermittent problem that only seemed to occur in one department was human error: doctors misdialing numbers, assuming the last 4 digits would be enough to identify their call back number.

We sent out a broadcast email instructing the clinicians to always dial at least 5 digits.

That did not cure the problem.

We then began a data analysis. Could we relate the bad pages to a particular individual, department or location? We found no correlation.

We then asked if the problem was truly isolated to Interpreters. Our data analysis suggested that it occurred regularly in several departments. No others had mentioned it, but the problem was real.

We then asked if the problem was unique to BIDMC, since we share a paging system with other hospitals. The analysis suggested that it was unique to us, since other hospitals did not have the problem.

It seemed unlikely that just our doctors were using the paging system improperly, so we began analyzing all the hardware and software involved in paging - phones, interface boards, servers and software. Since some of these components were redundant, we experimented with taking one member of each clustered service offline to see if we could isolate a problem in one switch, one signal processor or one server. Still no resolution.

We then spoke with the manufacturer of the paging software. They had no reports of similar problems. We then spoke to the engineer who wrote the software. He had an idea.

When you use a paging system, the dialog goes something like this:

"Please dial the person you wish to page"
User enters pager number
"Please enter your call back number"
User enters call back number
"Thank you"
Page is sent

Many interactive telephone systems have a "buffer clearing function" that clears any pending input before the user enters a number. It keeps digits from carrying over between voice prompts and discards any background sounds picked up between prompts. This sounds great, but what happens if an experienced user dials without listening to the prompts? A fast dialer enters the pager number immediately, pauses for 1 second, and then starts entering the call back number before the "Please enter your call back number" prompt has finished. When the buffer is cleared, the call back digits already entered are likely to be lost, leaving a truncated number.
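The engineer's idea can be illustrated with a toy model. This is only a sketch under stated assumptions: it assumes the buffer is cleared at the moment the prompt ends and that every digit keyed before then is discarded. The vendor's real logic is not known to us in this detail, and the function name, timings and digits below are invented for illustration.

```python
from typing import List, Tuple

# A keypress is (time in milliseconds since the prompt started, digit).
Keypress = Tuple[int, str]


def collect_digits(keypresses: List[Keypress],
                   prompt_end_ms: int,
                   clear_buffer: bool) -> str:
    """Return the call back number the system would record.

    Assumption: with buffer clearing on, any digit keyed before the
    prompt ends is thrown away; with it off, every digit is kept.
    """
    if clear_buffer:
        keypresses = [(t, d) for t, d in keypresses if t >= prompt_end_ms]
    return "".join(d for _, d in keypresses)


# Hypothetical fast dialer: starts keying a 5 digit extension 1.0 s into
# a prompt that takes 1.2 s to finish, one digit every 0.3 s.
fast_dialer = [(1000, "5"), (1300, "4"), (1600, "3"), (1900, "2"), (2200, "1")]
PROMPT_END_MS = 1200

with_clearing = collect_digits(fast_dialer, PROMPT_END_MS, clear_buffer=True)
without_clearing = collect_digits(fast_dialer, PROMPT_END_MS, clear_buffer=False)

print(with_clearing)     # 4321  (first digit lost, 4 digits remain)
print(without_clearing)  # 54321
```

In this model a single early keypress turns a valid 5 digit extension into a 4 digit number, and a 7 digit number would become 6 digits in the same way.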

We turned off buffer clearing and here's what happened.

------Percent Bad Call Back Numbers------

Pager                  Pre-change        Post-change
Russian interpreter    163/493 = 33%     3/244 = 1%
Spanish interpreter    262/959 = 27%     4/404 = 1%
Chinese interpreter    207/1086 = 19%    8/417 = 2%
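The percentages in the table can be recomputed from the raw counts. This is a small sketch using only the figures shown above; the helper name `bad_rate` is ours.

```python
# (bad pages, total pages) before and after buffer clearing was turned off,
# copied from the table above.
counts = {
    "Russian interpreter": {"pre": (163, 493), "post": (3, 244)},
    "Spanish interpreter": {"pre": (262, 959), "post": (4, 404)},
    "Chinese interpreter": {"pre": (207, 1086), "post": (8, 417)},
}


def bad_rate(bad: int, total: int) -> float:
    """Percentage of pages with an invalid call back number."""
    return 100.0 * bad / total


for pager, periods in counts.items():
    pre = bad_rate(*periods["pre"])
    post = bad_rate(*periods["post"])
    print(f"{pager}: {pre:.0f}% -> {post:.0f}%")

# Total bad pages remaining after the change, across the three pagers.
post_bad_total = sum(periods["post"][0] for periods in counts.values())
print(post_bad_total)  # 15
```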

Of the 15 pages in the post-change group listed as bad, most, if not all, were likely misdialed. For example, a couple were "123". Others were bad, but were immediately followed by what appeared to be a correct return page, suggesting the caller knew they had entered bad data.

The troubleshooting process was complicated by the fact that truncation only occurred with pagers signed out to other numbers, as is the case with Interpreter Services. The "clear buffer" function had no impact on pagers that were not signed out. We tried many times to "type ahead" call back numbers for these and were unable to mimic the problem. It appears to have been a problem only if someone signed their pager out to another. It turns out that the buffer clearing software works differently for pagers with a status of "signed out/covered by".

My advice for diagnosing complex operational IT problems is to work stepwise with every layer of architecture, and isolate the problem. Then, follow the advice of Sherlock Holmes in The Sign of Four: "How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?"

Along the way, you may need tools such as OpNet to help isolate every component of hardware and software and gather data. In this case, our usual network-based software tools did not help because the critical connection we needed to check was the traffic between the phones and the interface boards on the paging server. We needed to analyze whether the DTMF input coming from our telephone system was providing the correct digits. If the digits captured before hitting the interface boards were bad, we would know the problem was in how the DTMF signals were handled on our side. If the captured digits were good but differed from what was being logged in the paging server, we would know the problem was inside the server.
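That isolation logic amounts to comparing the same call back number at three points. Here is a hedged sketch, assuming you can obtain the digits the caller intended, the digits captured just before the interface boards, and the digits logged by the paging server; `localize_fault` and its return strings are hypothetical and not part of any tool we used.

```python
def localize_fault(dialed: str, captured: str, logged: str) -> str:
    """Say which side of the interface boards altered a call back number.

    dialed   -- digits the caller intended to enter
    captured -- digits seen on the line before the interface boards
    logged   -- digits recorded in the paging server log
    """
    if captured != dialed:
        # Digits were already wrong before reaching the paging server.
        return "telephone system"
    if logged != captured:
        # Digits arrived intact but were changed after the boards.
        return "paging server"
    return "no fault found"


if __name__ == "__main__":
    # Hypothetical values showing a truncation inside the server.
    print(localize_fault("54321", "54321", "4321"))  # paging server
```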

I wish troubleshooting intermittent IT problems were easier, but alas, many modern technologies have so much complexity that it takes all the skills of a forensic pathologist to solve the problem.

The folks at CSI would be proud of my team.


apw said...

My only modification to your recommendation is to do everything you possibly can to replicate the problem before worrying about possible causes.

Putting in the effort to replicate problems up-front is ultimately cheaper and faster. Testing, researching, and re-testing results in a better understanding of the problem, and can usually be done by just one person. You'll be able to eliminate any number of red herrings without having to involve unaffected teams or systems.

John Halamka said...

A very good point. In this case we tried to replicate the problem, but could not because we tried it with standard pagers, not those which were signed out. Our customers could replicate the problem, which convinced us to pursue it.