Tuesday, December 23, 2008

Troubleshooting Complex IT problems

Whenever I'm asked to solve an intermittent IT problem such as occasional slowness, occasional lost data or intermittently failing hardware, the research is often complex.

Here's a brief example of the efforts we employ to solve IT mysteries.

A few weeks ago we were told that the Interpreters at BIDMC often received pages with ambiguous call back numbers. At BIDMC, valid numbers are 5-digit extensions, 7-digit local numbers and 10-digit long distance numbers. Interpreters often received 4-digit or 6-digit numbers that were impossible to call back.
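As a rough illustration, the length rule above can be written as a short Python check. The function name and the category labels are hypothetical and are not part of the paging system itself.

```python
# Valid call back lengths as described in the post; the labels are illustrative.
VALID_LENGTHS = {5: "extension", 7: "local", 10: "long distance"}


def classify_callback(number: str) -> str:
    """Classify a call back number by its digit count, ignoring punctuation."""
    digits = "".join(ch for ch in number if ch.isdigit())
    return VALID_LENGTHS.get(len(digits), "invalid")


print(classify_callback("71234"))         # extension
print(classify_callback("1234"))          # invalid
print(classify_callback("617-555-0100"))  # long distance
```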

The most obvious explanation for such an intermittent problem that only seemed to occur in one department was human error. Doctors misdialed numbers, assuming the last 4 digits would be enough to identify their call back number.

We sent out a broadcast email instructing the clinicians to always dial at least 5 digits.

That did not cure the problem.

We then began a data analysis. Could we relate the bad pages to a particular individual, department or location? We found no correlation.

We then asked if the problem was truly isolated to Interpreters. Our data analysis suggested that it occurred regularly in several departments. No others had mentioned it, but the problem was real.

We then asked if the problem was unique to BIDMC, since we share a paging system with other hospitals. The analysis suggested that it was unique to us, since other hospitals did not have the problem.

It seemed unlikely that just our doctors were using the paging system improperly, so we began analyzing all the hardware and software involved in paging: phones, interface boards, servers and applications. Since some of these components were redundant, we experimented with taking one member of clustered services offline to see if we could isolate a problem in one switch, one signal processor or one server. Still no resolution.

We then spoke with the manufacturer of the paging software. They had no reports of similar problems. We then spoke to the engineer who wrote the software. He had an idea.

When you use a paging system, the dialog goes something like this:

"Please dial the person you wish to page"
User enters pager number
"Please enter your call back number"
User enters call back number
"Thank you"
Page is sent

Many interactive telephone systems have a "buffer clearing function" that clears any pending input before users enter numbers. It prevents digits from carrying over between voice prompts and cancels out any background sounds that occurred between prompts. This sounds great, but what happens if an experienced user dials without waiting for the prompts? A person dialing very fast enters the pager number immediately, pauses for 1 second, and starts entering the call back number before the "Please enter your call back number" prompt has finished. Any digits entered before the buffer is cleared are discarded, so the call back number is likely to be truncated.
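The effect can be sketched in a few lines of Python. This is a toy model, not the vendor's implementation: it assumes the buffer is flushed at the moment the prompt finishes, and the timings, the function name and the sample extension are all made up for illustration.

```python
from typing import List, Tuple

# (seconds since the call back prompt started playing, digit pressed)
Keypress = Tuple[float, str]


def collect_callback(keys: List[Keypress], prompt_seconds: float,
                     clear_buffer: bool) -> str:
    """Return the digits the system keeps as the call back number."""
    if clear_buffer:
        # Assumption: anything typed while the prompt is playing is flushed.
        kept = [digit for t, digit in keys if t >= prompt_seconds]
    else:
        # With clearing disabled, type-ahead digits are preserved.
        kept = [digit for _, digit in keys]
    return "".join(kept)


# A fast caller starts typing extension 71234 just before a 2.0 s prompt ends.
fast_caller = [(1.8, "7"), (2.1, "1"), (2.4, "2"), (2.7, "3"), (3.0, "4")]

print(collect_callback(fast_caller, 2.0, clear_buffer=True))   # 1234
print(collect_callback(fast_caller, 2.0, clear_buffer=False))  # 71234
```

In this model, losing a single leading digit turns a valid 5-digit extension into the kind of 4-digit number the Interpreters could not call back.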

We turned off buffer clearing and here's what happened.

------Percent Bad Call Back Numbers------

Pager                  Pre-change        Post-change

Russian interpreter    163/493 = 33%     3/244 = 1%

Spanish interpreter    262/959 = 27%     4/404 = 1%

Chinese interpreter    207/1086 = 19%    8/417 = 2%

Of the 15 pages in the post-change group listed as bad, most, if not all, were likely misdialed. For example, a couple were "123". Others were bad, but were immediately followed by what appeared to be a correct return page, suggesting the caller knew they had entered bad data.
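For readers who want to check the arithmetic, here is a small Python sketch that recomputes the percentages from the raw counts reported above. The variable and function names are illustrative; only the counts come from our data.

```python
# (bad pages, total pages) before and after buffer clearing was turned off.
counts = {
    "Russian interpreter": ((163, 493), (3, 244)),
    "Spanish interpreter": ((262, 959), (4, 404)),
    "Chinese interpreter": ((207, 1086), (8, 417)),
}


def pct(bad: int, total: int) -> float:
    """Percentage of pages with a bad call back number."""
    return 100.0 * bad / total


for pager, (pre, post) in counts.items():
    print(f"{pager}: {pct(*pre):.0f}% pre-change, {pct(*post):.0f}% post-change")

# Total bad pages after the change, across all three pagers.
post_bad = sum(post[0] for _, post in counts.values())
print(post_bad)  # 15
```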

The troubleshooting process was complicated by the fact that truncation only occurred with pagers signed out to other numbers, as is the case with Interpreter Services. The "clear buffer" function had no impact on pagers that were not signed out. We tried many times to "type ahead" call back numbers for these and were unable to mimic the problem. It appears to have only been a problem if someone signed their pager out to another. It turns out that the buffer clearing software works differently for pagers with a status of "signed out/covered by".

My advice for diagnosing complex operational IT problems is to work stepwise through every layer of the architecture and isolate the problem. Then, follow the advice of Sherlock Holmes in The Sign of Four: "How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?"

Along the way, you may need tools such as OpNet to help isolate every component of hardware and software and gather data. In this case, our usual network-based software tools did not help because the critical connection we needed to check was the traffic between the phones and the interface boards on the paging server. We needed to analyze whether the DTMF input coming from our telephone system was providing the correct digits. If the digits captured before hitting the interface boards were bad, we would know the problem was in how the DTMF signals were handled on our side. If the captured digits were good but differed from what was being logged in the paging server, we would know the problem was inside the server.
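That decision rule is simple enough to write down. The Python sketch below assumes a controlled test call where we know which digits were dialed, can capture the digits arriving at the interface boards, and can read what the paging server logged. The function name and return labels are hypothetical.

```python
def locate_fault(dialed: str, captured: str, logged: str) -> str:
    """Point at the layer where the call back digits first go wrong.

    dialed   -- digits the tester actually keyed in
    captured -- digits seen on the line before the interface boards
    logged   -- digits recorded by the paging server
    """
    if captured != dialed:
        # Digits were already wrong before they reached the paging server.
        return "telephone system"
    if logged != captured:
        # Good digits arrived, but the server recorded something else.
        return "paging server"
    return "no fault observed"


print(locate_fault("71234", "71234", "1234"))   # paging server
print(locate_fault("71234", "1234", "1234"))    # telephone system
print(locate_fault("71234", "71234", "71234"))  # no fault observed
```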

I wish troubleshooting intermittent IT problems were easier, but alas, many modern technologies have so much complexity that it takes all the skills of a forensic pathologist to solve the problem.

The folks at CSI would be proud of my team.

4 comments:

APW said...

My only modification to your recommendation is to do everything you possibly can to replicate the problem before worrying about possible causes.

Putting in the effort to replicate problems up-front is ultimately cheaper and faster. Testing, researching, and re-testing results in a better understanding of the problem, and can usually be done by just one person. You'll be able to eliminate any number of red herrings without having to involve unaffected teams or systems.

John Halamka said...

A very good point. In this case we tried to replicate the problem, but could not because we tried it with standard pagers, not those which were signed out. Our customers could replicate the problem, which convinced us to pursue it.
