I stumbled across this post quite by accident. It articulates very clearly something I learned the hard way at a past job. I was responsible for the operations of our datacenter, and in particular for its availability (the percentage of time everything was running correctly). One of the routine exercises we went through was called “root cause analysis”, an absolutely standard methodology for identifying the cause of an outage. Once the cause was identified, the idea was that you'd take action to prevent it from ever happening again, thus incrementally improving your availability metrics over time.
Nice theory. It even worked, sometimes. But many times it did not. Here's one relatively simple real-life example to illustrate how this theory could break down:
We had a planned change to the configuration of one of the applications we developed. The change consisted of adding two words to one line of a configuration file, which would be automatically re-read by our application. We had tested the change on a lab instance of the application. Our technician opened the file in his favorite editor, made the change, and saved the file. About two minutes later, our application crashed (taking down a stock trading application with hundreds of users).
We restarted our application, and everything was fine. Then we went into full root cause analysis mode, as this failure was kind of scary. I'll spare you all the details, but eventually we figured it out: the technician had saved the file but left it open in his editor, and the editor literally kept the file open, evidently with an exclusive lock on it. Our application tried to re-read the file, got a “file in use” error, and promptly exited.
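In hindsight, the reload path didn't have to treat a failed open as fatal. Here's a minimal sketch of a more forgiving reload in Python; the names (`parse_config`, `reload_config`) and the retry parameters are purely illustrative, not how our application was actually written:

```python
import time

def parse_config(text):
    # Trivial stand-in parser: "key value" lines become dict entries.
    pairs = (line.split(None, 1) for line in text.splitlines())
    return {p[0]: p[1] for p in pairs if len(p) == 2}

def reload_config(path, current_config, retries=5, delay=0.5):
    """Re-read a config file, tolerating transient open errors.

    If the file is momentarily unreadable (for example, held open
    and locked by an editor), retry a few times, then fall back to
    the configuration we already have instead of exiting.
    """
    for _ in range(retries):
        try:
            with open(path) as f:
                return parse_config(f.read())
        except OSError:
            time.sleep(delay)
    # All retries failed: warn and keep running on the old config.
    print(f"warning: could not re-read {path}; keeping previous config")
    return current_config
```

Falling back to the last-known-good configuration would have turned the exact same operator mistake into a logged warning instead of an outage.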
The root cause analysis exercise pointed the finger of blame at the technician for leaving the configuration file open. The mitigations my team recommended were (a) training the technicians not to do that, and (b) always making such changes with two-man teams, one to watch the other and verify that the editor was closed.
I was not at all happy with this outcome. Leaving the file open only caused an error because our application was stupid enough to behave that way. And our technician was only able to cause this error because we were too lazy to automate this sort of configuration change. To me, there were several “causes”, all contributing:
- the technician's error
- our lame software crashing on a config file open error
- our failure to automate the task (see the sketch below)
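On that last point: the change we made was exactly the kind of thing a small script can do safely. Here's a hedged sketch of that idea, again in Python; `apply_config_change` and its arguments are made up for illustration, not the tool we eventually built:

```python
import os
import tempfile

def apply_config_change(path, old_line, new_line):
    """Replace one line of a config file, then swap the file in atomically.

    The edited contents are written to a temporary file in the same
    directory and renamed over the original, so no editor is ever left
    holding the live file open, and readers see either the old version
    or the new one, never a half-written file.
    """
    with open(path) as f:
        text = f.read()
    if old_line not in text:
        raise ValueError(f"expected line not found in {path}")

    # Note: mkstemp creates the temp file with restrictive permissions;
    # a real tool would also copy the original file's mode and owner.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(text.replace(old_line, new_line, 1))
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomic rename on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)
        raise
```

The write-then-rename pattern sidesteps the editor problem entirely: nothing interactive ever touches the live file.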
The article linked above discusses the failings of root cause analysis in a clear, easy-to-understand way. I particularly liked this observation:
Finding the root cause of a failure is like finding a root cause of a success.
In a single sentence, that's exactly the problem I found with root cause analysis. In the end, we kept using root cause analysis as a tool for analyzing our outages, but the mitigations we chose were not the ones you'd expect root cause analysis to produce. I wish I'd read this article back then, as I think it would have prompted me to think through a different approach...