Liz Kowalczyk from the Boston Globe recently ran a series on patient deaths linked to alarm malfunctions in hospitals. It turns out the "malfunctions" weren't related to software or hardware failures. The deaths occurred because personnel didn't notice or react to the alarms.
They call it "alarm fatigue." Monitors help save lives by alerting doctors and nurses that a patient is — or soon could be — in trouble. But with the use of monitors rising, their beeps can become so relentless, and false alarms so numerous, that nurses become desensitized — sometimes leaving patients to die without anyone rushing to their bedside. On a 15-bed unit at Johns Hopkins Hospital in Baltimore, staff documented an average of 942 alarms per day — about 1 critical alarm every 90 seconds.
I think development and operations teams suffer from a similar problem: event log fatigue. Without constant vigilance on the part of both teams, the event logs become so littered with "normal errors" that they become useless. When something does go wrong, the following conversation typically ensues:
Dev: Why didn't you sound the alarm when the error first started appearing in the log? Were you asleep at the wheel?
Ops: I couldn't see it because it was surrounded by 10,000 normal errors. I can't do my job because you aren't doing yours.
To avoid event log fatigue, each group needs to hold the other accountable. There are two rules that must be held inviolable:
- All errors are bad errors -- There can be no such thing as a "normal error." This means devs must fix all errors (and ops applies pressure until they do) and ops must alert on every error (and devs apply pressure until they do). A sketch of what "alert on every error" looks like appears after this list.
- Dev and Ops must both be looking at the same error logs -- It's a mistake for a developer to think they aren't responsible for looking at those logs at least once a day. It's also a mistake for ops not to give dev access to them.
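To make the first rule concrete, here is a minimal sketch of an "alert on every error" monitor, assuming a generic stream of log events with a severity field. The event shape and the `tail_events`/`send_alert` helpers are hypothetical placeholders for illustration; the important part is that the filter has no list of "normal errors" to quietly swallow.

```python
import logging

# Any event at ERROR severity or above triggers an alert -- there is no
# allowlist of "known bad" errors to ignore.
ERROR_THRESHOLD = logging.ERROR

def should_alert(event: dict) -> bool:
    """Return True for every event at ERROR severity or above."""
    return event.get("severity", logging.NOTSET) >= ERROR_THRESHOLD

def monitor(tail_events, send_alert):
    """Read events from a log stream and alert on every single error."""
    for event in tail_events():  # tail_events: hypothetical log-stream reader
        if should_alert(event):
            send_alert(  # send_alert: hypothetical pager/notification hook
                f"[{event.get('server', '?')}] {event.get('application', '?')}: "
                f"{event.get('message', '')}"
            )
```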
How do we do it at msnbc.com?
In the past, we've definitely suffered from event log fatigue. In our new systems built on Skypad, we have 3 tools in place to help dev and ops look at the same information:
- Lorax -- It's all about the logs. We aggregate the event logs from every server and expose the data via an API that lets a simple client app run queries filtered by any number of factors (application, severity, DateTime, server, etc.). A rough sketch of such a query appears after this list.
- Avicode -- This is a fantastic product that we started using even before Microsoft acquired them. It detects application problems in real time -- including security and connectivity issues, performance bottlenecks, and code failures -- and delivers immediate root-cause information. It's worth every penny.
- Perf counter dashboard -- This is a centralized monitoring dashboard that lets anyone in the company see key performance counters for every server in a single, at-a-glance view. Green squares are good. Red squares are bad. (The color logic is sketched below as well.)
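To give a feel for the kind of query a Lorax client can run, here is a rough sketch in Python. The endpoint URL, parameter names, and response shape are assumptions made for illustration; they are not Lorax's actual API.

```python
from datetime import datetime, timedelta
import requests

# Placeholder endpoint for a Lorax-style log aggregation API (assumed, not real).
LORAX_URL = "http://lorax.internal/api/events"

def query_events(application=None, severity=None, server=None, since=None, until=None):
    """Query aggregated event logs by application, severity, server, and time window."""
    params = {
        "application": application,
        "severity": severity,
        "server": server,
        "since": since.isoformat() if since else None,
        "until": until.isoformat() if until else None,
    }
    # Drop unset filters so the API only sees what the caller asked for.
    params = {k: v for k, v in params.items() if v is not None}
    response = requests.get(LORAX_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

# Example: every error logged by the (hypothetical) "frontend" app in the last hour.
errors = query_events(application="frontend",
                      severity="Error",
                      since=datetime.utcnow() - timedelta(hours=1))
```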
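The dashboard's color logic is nothing fancier than comparing each counter to a threshold. Here is a rough sketch of that classification; the counter names and thresholds are made up for illustration.

```python
# Made-up counter names and thresholds, purely for illustration.
THRESHOLDS = {
    "cpu_percent": 85,        # red above 85% CPU
    "requests_queued": 100,   # red above 100 queued requests
    "avg_response_ms": 500,   # red above 500 ms average response time
}

def classify(server_counters: dict) -> dict:
    """Map each counter reading on one server to 'green' or 'red'."""
    return {
        name: "red" if value > THRESHOLDS[name] else "green"
        for name, value in server_counters.items()
        if name in THRESHOLDS
    }

print(classify({"cpu_percent": 92, "requests_queued": 12, "avg_response_ms": 340}))
# {'cpu_percent': 'red', 'requests_queued': 'green', 'avg_response_ms': 'green'}
```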
With the right tools in place, the other piece is constant vigilance. I can't offer any silver bullets here other than helping every developer and every ops person believe that all errors are bad errors. We're getting better at this, but it doesn't happen overnight.
A final caution
Be wary of queueing up a project entitled "fix all the errors in the logs." Projects like these typically get handed off to maintenance teams who might suppress the error without actually fixing the problem. The team that created the problem needs to fix it (and not suppress it).