You want to get to the root of the problem and fix it, instead of routinely treating symptoms every time they occur. But you have to pick your battles. Most organizations and products have a lot of problems. For some, the root causes are known; for others, they are unknown and have never been investigated. Some are not deemed worthy of investigation because adapting to the symptom is quick and easy.
The battles most organizations pick first are the symptoms that recur most often and cost the most time and money to treat. So if a symptom with an unknown cause happens rarely and doesn’t take much time or trouble to work around, it probably will not be investigated anytime soon. That might be fine, or it might hide a landmine: a root cause with more serious consequences later that could have been prevented. Sometimes it costs a lot more to confine root cause analysis to only the high-frequency, high-cost symptoms.
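A minimal sketch of that prioritization, assuming each symptom is tracked with a rough monthly frequency and a rough cost per occurrence. The `Symptom` class, the field names, and the sample figures are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Symptom:
    name: str
    occurrences_per_month: int
    cost_per_occurrence: float  # time and money to treat one occurrence
    root_cause_known: bool

    @property
    def monthly_cost(self) -> float:
        # Recurring cost of treating the symptom instead of its cause.
        return self.occurrences_per_month * self.cost_per_occurrence


def rank_for_investigation(symptoms):
    """Return the unknown-cause symptoms, most expensive to treat first."""
    unknown = [s for s in symptoms if not s.root_cause_known]
    return sorted(unknown, key=lambda s: s.monthly_cost, reverse=True)


# Made-up sample data.
symptoms = [
    Symptom("login timeout", 40, 25.0, False),
    Symptom("report export fails", 2, 15.0, False),
    Symptom("slow search", 10, 5.0, True),
]

ranked = rank_for_investigation(symptoms)
for s in ranked:
    print(f"{s.name}: {s.monthly_cost:.2f} per month")
```

Note what this ranking cannot see: the rare, cheap symptom lands at the bottom of the list, which is exactly where a landmine root cause can hide. A score built only on frequency and cost says nothing about how bad the consequences could be later.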
Here’s an example of how root cause analysis might work.
A customer calls customer service about a software bug. Customer service can’t figure it out and escalates it to a programmer. The programmer gets on the phone with the customer and figures out a way to “make it work” for this customer. Devoting programmers’ time to customer service calls is an expensive habit. How many more customers will call with that problem?
You have a choice: train your customer service reps to apply the “make-it-work” solution for every customer who calls in with the same symptom, or investigate the root cause, fix it, and release a patch so the symptom will never happen again for any customer.
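One rough way to frame that choice is a break-even count: how many calls it takes before repeating the workaround costs more than a one-time fix. The sketch below assumes both costs can be estimated as single numbers. The function name and the figures are hypothetical.

```python
import math


def break_even_calls(fix_cost: float, cost_per_workaround_call: float) -> int:
    """Smallest number of calls at which the workaround total exceeds the fix cost."""
    if cost_per_workaround_call <= 0:
        raise ValueError("cost_per_workaround_call must be positive")
    return math.floor(fix_cost / cost_per_workaround_call) + 1


# Hypothetical estimates: investigating, fixing, and patching costs 4000,
# and each call handled with the workaround costs 50.
calls_needed = break_even_calls(4000, 50)
print(f"The fix pays for itself from call number {calls_needed}")
```

This only compares direct costs. It leaves out harder-to-price factors, such as the risk that the same root cause produces a more serious symptom later.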
Once you’ve decided to investigate, then what? You have to isolate several things:
- Where in the software code the error is.
- What exactly in the code is causing the error; for example, the code looks ok, and it interacts with another area of code that looks ok too, but the fact that they interact a certain way under certain unusual conditions isn’t ok.
- Once you’ve found the code-interaction problem, you’re done, right? No! Keep going.
- Who wrote the code in the two areas that look ok but don’t interact well under certain conditions? Was it two different programmers? If so, you may have uncovered a deeper root cause: a misunderstanding about how the areas interact, a lack of information, or certain conditions that were never communicated.
- How were these two coders managed and coordinated (which allowed faulty interaction of two areas of code)?
- Where did communication break down (which left one coder without enough knowledge of the other area to know how the code areas would interact)?
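The questions above trace a chain from the visible error down to its deepest cause. Here is a minimal sketch of that chain as a linked data structure, assuming each finding points at the cause behind it. The `Cause` class, its fields, and the descriptions are hypothetical, written only to mirror the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cause:
    description: str
    kind: str  # "technical" or "people"
    caused_by: Optional["Cause"] = None  # the deeper cause, if one was found


def chain(cause: Cause) -> List[Cause]:
    """Walk from the surface finding down to the deepest known cause."""
    links = []
    current: Optional[Cause] = cause
    while current is not None:
        links.append(current)
        current = current.caused_by
    return links


def root_of(cause: Cause) -> Cause:
    """The last link in the chain: the deepest cause found so far."""
    return chain(cause)[-1]


# Findings from the example, deepest first.
communication = Cause("communication between the coders broke down", "people")
coordination = Cause("the two coders were not coordinated", "people", communication)
interaction = Cause("two code areas interact badly under unusual conditions",
                    "technical", coordination)
error = Cause("error located in the code", "technical", interaction)

for depth, link in enumerate(chain(error)):
    print("  " * depth + f"[{link.kind}] {link.description}")
```

Stopping at the first technical link would leave the two people-related links untouched, which is why the investigation keeps going after the code-interaction problem is found.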
Now you know the true root cause was a communication failure, both in management and between the coders. If you fix only the 2nd-level cause (the code itself interacting wrong), you still haven’t fixed the 1st-level cause, the true “root cause.” You have to fix the communication: manage the coding, ensure coders talk to each other and coordinate, verify that all conditions are understood, and ensure everyone understands how all the areas interact.
Next, you have to dredge to the bottom of the code to discover any other code-area-interaction glitches. Some may be very obscure, the kind a normal customer might encounter once every thousand years. But one of them may crash the system for a big client who manages to beat the odds and hit that obscure glitch. Once you know the root cause is miscommunication that leads to code areas not interacting properly, you know there is a risk of the same issue elsewhere in the program.
Now that you’ve fixed the interaction glitch, and any other glitches you found during this phase of investigation (and several test cycles), you can release the patch.
Not only is the deepest level of the code-interaction glitch fixed, but the communication problem that led to it “falling through the cracks” is also remedied. You can rest a little easier because you have done some good preventive fixing: you tackled the people part as well as the technical part of the issue. That’s some real good root cause analysis.