In his classic essay “Stop Those Hiccoughs!” the humorist Robert Benchley describes all the different “hiccough cures” he has tried over the years. He summarizes them by saying, “each time, … after a few seconds of waiting … that one, big hiccough always breaks the tension, indicating that the whole performance has been a ghastly flop.”

It’s true. The most frustrating moment in problem-solving is when you’ve finally fixed an error, only to see the very same thing turn up again the next day. When this happens, it means that you may have implemented an immediate correction, but you haven’t found the real “root cause”—the thing that is fundamentally driving the problem. Until you find that root cause and set it straight, you’ll stumble over the same issue again and again.

How do you find a root cause? Well, before you can find it, you have to know what it looks like. You have to know what it takes for something to count as a root cause. There are several features to look for.

  • A root cause actually causes the problem. It should be like flipping a light switch: if you remove the root cause, the problem goes away. If you put it back again, the problem reappears.
  • But it’s not just any cause. You can fix a surface cause without touching the underlying problem. If I’m putting things away in your warehouse and I put a box on the wrong shelf, you can tell me to move it. That’s a surface fix, and it works until I go on break and the next guy makes the same mistake. Now you put up a sign, and nobody gets it wrong any more. That’s addressing the root cause.
  • A root cause is actionable. You can do something about it. If there’s a fire in that warehouse, the causes might include oily rags or someone smoking where he shouldn’t. But another cause is oxygen in the atmosphere. Right? No oxygen, no fire. All the same, oxygen doesn’t count as a root cause because there’s nothing we can do about it. We’re not going to move the warehouse to the Moon.
  • For that reason, “human error” doesn’t count as a root cause. We will never permanently eliminate human error. What we can do is treat human error as a valuable symptom. It shows us where we have a process or an activity that we need to “error-proof,” that is, to redesign so that it is no longer possible to make that particular error. (Putting up a sign in your warehouse might be an example of “error-proofing.”)
  • Lastly, a root cause never assigns personal blame. Yes, someone goofed and that caused the problem. It’s tempting to blame him for what he did wrong. But if it was possible for him to make a mistake, it’s possible for the next guy to make the same mistake. Rather than blaming the one who goofed, it is more productive to redesign the work so the mistake won’t come back.

Now that you recognize a root cause, how do you find it?

There are a number of ways to do this, but the simplest is called “5-Why” analysis. The idea is that you look at the problem and ask, “Why did it happen?” Next, remember that your first answer is probably only a surface cause; so as soon as you have that answer, ask, “And why did that happen?” With each answer, ask in turn why that happened. How many times do you ask “Why?” It doesn’t really have to be five. You might get the answer in four, or it might take twelve. Five is just a good average. But remember that a root cause has to be actionable, so when you start getting answers that you can’t do anything about, you may have gone too far.

Let me give an example.

  • Problem: My car won’t start.
  • Why won’t it start? The battery is dead.
  • Why is the battery dead? The alternator isn’t working.
  • Why isn’t the alternator working? The alternator belt is broken.
  • Why is the alternator belt broken? It wore out and was never replaced.
  • Why was it never replaced? I didn’t maintain the car according to the schedule in the manual.
  • So the root cause of my car not starting is that I didn’t maintain it properly.

If you look at it, this cause meets all the criteria I listed above.

  • It really is a cause. If I maintain my car correctly, this won’t happen; but if I fail to maintain my car, it’s sure to happen sooner or later.
  • It’s not just a surface cause. The dead battery is a surface cause, and yes, I could just charge the battery. But as long as the alternator isn’t working, that charge won’t last long. Maintaining my car correctly means that the problem won’t recur.
  • It is actionable. All I have to do is maintain my car according to the schedule.
  • It’s not personal. That’s why the analysis stopped where it did. The goal was to arrive at an action I can take to eliminate the problem for good. If I then ask, “Why didn’t I maintain my car?” the answers probably sound like personal blame: maybe I’m lazy or forgetful. That doesn’t help, because at that point we’re talking about me and not the car.

There’s a way to check whether your answers make sense. Read the answers backwards to see if they make a logical story. In this case they do:

  • I didn’t maintain the car according to the schedule in the manual.
  • Therefore the alternator belt wasn’t replaced when it wore out.
  • Therefore the alternator belt broke.
  • Therefore the alternator didn’t work.
  • Therefore the battery died.
  • Therefore my car wouldn’t start.

That checks out exactly. But if you get answers that don’t make a logical story when you read them backwards, then you probably made a mistake in your 5-Why. One way to avoid that kind of mistake is to ask your questions the way I asked them in the example. Notice that I didn’t just say “Why? Why? Why?” Each time I repeated the previous answer in the next question: “Why is the battery dead?” “Why isn’t the alternator working?” Asking the questions like this focuses your attention, so that your answers are more likely to make a logical story when you read them backwards.
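If you like to keep your analysis in a script or notebook, the read-it-backwards check can be sketched in a few lines of Python. This is only an illustration built on the car example: the names `chain`, `why_questions`, and `backward_story` are my own, not part of any standard tool, and the code only assembles the story. Deciding whether it is logical is still up to you.

```python
# Illustrative sketch only: `chain`, `why_questions`, and `backward_story`
# are hypothetical names, not part of any standard tool.

# The example chain, from the problem down to the root cause.
chain = [
    "My car won't start",
    "The battery is dead",
    "The alternator isn't working",
    "The alternator belt is broken",
    "The belt wore out and was never replaced",
    "I didn't maintain the car according to the schedule in the manual",
]


def why_questions(chain):
    """Build each question by repeating the previous answer inside it."""
    return [f'Why is this so: "{answer}"?' for answer in chain[:-1]]


def backward_story(chain):
    """Read the chain from root cause back to the problem, joined by 'Therefore'."""
    root, *effects = reversed(chain)
    return [root + "."] + [f"Therefore: {effect}." for effect in effects]


# Print the story so a person can judge whether each step follows from the last.
for line in backward_story(chain):
    print(line)
```

If any “Therefore” line makes you hesitate when you read the printed story, that is the link in the chain to re-examine.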

Notice something else. Sometimes you get two answers to a single “Why?” The answer to “Why is the alternator belt broken?” is “It wore out and was never replaced.” When you get two answers, your investigation may have to branch so that you can follow each of them. In the example, I followed the branch “Why was it never replaced?” Suppose we follow the other branch too. Then we get: “Why did the alternator belt wear out?” But the real answer to that is, “Everything wears out over time,” and that’s not actionable. It’s just a fact of life and there’s nothing we can do to change it. So that’s not a useful branch to follow, which is why I ignored it. But another time, it might not be so clear. In general, if you get an answer like “Because X and Y,” ask about each one in turn.
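A branching investigation is really a small tree, and it can help to write it down that way. Here is a minimal Python sketch, assuming you mark by hand whether each answer is actionable. The `Cause` class and the `actionable_roots` function are hypothetical names chosen for this illustration, and the rule it applies (keep only the branches that end in an actionable answer) is a simplification of the judgment described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cause:
    """One answer in a 5-Why tree (hypothetical structure for illustration)."""
    text: str
    actionable: bool = True  # assumption: the analyst decides this by hand
    why: List["Cause"] = field(default_factory=list)  # answers to "Why did this happen?"


def actionable_roots(cause):
    """Follow every branch to its end and keep only the actionable endpoints."""
    if not cause.why:
        return [cause.text] if cause.actionable else []
    found = []
    for answer in cause.why:
        found.extend(actionable_roots(answer))
    return found


# "Because X and Y": the broken belt gets two answers, so the tree branches.
belt = Cause(
    "The alternator belt is broken",
    why=[
        Cause(
            "The belt wore out",
            why=[Cause("Everything wears out over time", actionable=False)],
        ),
        Cause(
            "The belt was never replaced",
            why=[Cause("I didn't maintain the car according to the schedule in the manual")],
        ),
    ],
)

print(actionable_roots(belt))
```

In this sketch the “wore out” branch is dropped because it ends in a fact of life, and only the maintenance schedule survives as a candidate root cause.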

5-Why analysis is very powerful. There are ways you can build on it, and I’ll discuss them in a future article. But even by itself, this approach will take you far if you follow it diligently.