When there is a production problem, root cause analysis can find a solution, provided that the question "Why?" is asked often enough.

When faced with a manufacturing problem, quality professionals at world-class organizations typically implement two types of remedial actions, known as "control of nonconforming product" and "root cause analysis with corrective action." These actions are separate processes, both essential, which must not be confused with each other.

Control of nonconforming product answers the question, "What do we do with the product at hand?" This issue must be addressed immediately.

Product that is known or suspected to have a problem must be contained and prevented from reaching subsequent manufacturing operations and the customer. If the affected product has left the factory, the need to get it back must be assessed, and the method determined.

Root cause analysis with corrective action addresses the question, "Why is the product nonconforming in the first place, and how do we avoid having more of it in the future?" This may take considerable investigation and an investment of time.

A correctly performed root cause analysis has depth and breadth--concepts that will be explained. A tree diagram illustrating the typical problem, "Nonconforming product has been sent to the customer," will be built. Four "stock answers" will be presented and explained. Mistake-proof and fail-safe techniques will be defined and examples given.

Why or why not?
The classic approach to root cause analysis of a problem situation is to ask the question "Why?" (or "Why not?") a number of times. Five times is often quoted, but it is not a fixed number. True root causes generally are deep, and corrective actions at a deep level are far-reaching and long lasting. If problems are addressed before enough questions are asked, the problems, while diminished, will generally recur.

It is important to know how deep is "deep enough." Some rules of thumb can help make that determination:

1. One of four "stock answers" is reached.

2. Common sense indicates that no more "Why?" or "Why not?" questions are needed to solve the problem.

3. The corrective action involved is permanent.
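For teams that track investigations in software, the rules of thumb above can be sketched as a simple check. The Python below is only an illustration, not part of any standard method. The `Cause` record and its two flags are hypothetical names, and treating the three rules as alternatives (any one is enough) is an assumption, since the rules guide judgment rather than replace it. The four stock answers are the ones listed later in this article.

```python
from dataclasses import dataclass

# The four stock answers named in the article.
STOCK_ANSWERS = {
    "insufficient training",
    "insufficient time",
    "carelessness",
    "simple human error",
}


@dataclass
class Cause:
    """Hypothetical record for one answer to a "Why?" question."""
    description: str
    # Both flags are set by the investigator's judgment, not computed.
    needs_more_questions: bool = True
    action_is_permanent: bool = False


def deep_enough(cause: Cause) -> bool:
    """Return True if any one of the three rules of thumb is met."""
    if cause.description.lower() in STOCK_ANSWERS:   # rule 1
        return True
    if not cause.needs_more_questions:               # rule 2
        return True
    return cause.action_is_permanent                 # rule 3
```

A check like this only records the investigator's conclusion. Deciding whether common sense is satisfied, or whether an action is truly permanent, still takes a person.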

During the course of an investigation, the focus needs to be on the quality system. How did the system allow the problem to occur? Where is the weakness in the system?

The goal is one or more system-based improvements as opposed to finding someone to blame. Unfortunately, after all the questions are asked, "someone" does have to take responsibility for action, so the investigator may need to rely on good interpersonal skills to keep that person from feeling blamed for the weakness in the system. In addition, good persuasion and follow-up skills may be needed to ensure implementation of the improvements.

Root cause analysis is not easy, and it may unearth situations that are messy or time-consuming to correct. Fixing the root cause is often the responsibility of a different person than the individual involved in investigating the problem, and that finding can be interpreted as buck-passing if the process is not understood and embraced by the whole organization.

After an analysis has gone deep enough, the corrective action that corresponds to the bottom root cause is determined and implemented. This corrective action enables the preceding cause in the path to be addressed, and the process continues backwards to the original "Why?"

For an example, see the diagram at right. Using a flowchart or tree diagram format, we start with the typical problem "Nonconforming product has been sent to the customer" and ask "Why?" or "Why not?" at the end of each block. In this example, the number of successive questions down any one path ranges from three to six. In other investigations, there may be fewer than three questions or more than six questions down one of the paths, but three to six is a typical range.

The diagram shows a wide range of possible answers as to why the nonconforming product was sent to the customer. There can be more than one answer to each question. An actual investigation may not have as many paths, but in most cases, more than one opportunity should be pursued. This is what is meant by ensuring that an investigation has "breadth."
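Depth and breadth can be made concrete by writing the tree down as data. The sketch below is a minimal Python illustration. The `Why` class and the helper functions are hypothetical names, and the intermediate blocks in the sample tree are invented for the example. Only the starting problem, the two sides of the tree and the two bottom causes come from the article.

```python
from dataclasses import dataclass, field


@dataclass
class Why:
    """One block in the tree: an answer to the previous "Why?"."""
    answer: str
    causes: list["Why"] = field(default_factory=list)


def paths(node: Why) -> list[list[str]]:
    """Return every path from the problem down to a bottom cause."""
    if not node.causes:
        return [[node.answer]]
    return [[node.answer] + p for child in node.causes for p in paths(child)]


def correction_order(path: list[str]) -> list[str]:
    """Corrective action starts at the bottom cause and works backwards."""
    return list(reversed(path))


# Sample tree; the middle blocks are assumptions made for illustration.
tree = Why("Nonconforming product has been sent to the customer", [
    Why("Inspector did not catch the nonconforming product", [
        Why("Inspection procedure was not followed", [
            Why("Procedure book has disappeared"),
        ]),
    ]),
    Why("Machine made the nonconforming product", [
        Why("Machine is not capable of holding the tolerance", [
            Why("More capable machine was eliminated from the budget"),
        ]),
    ]),
])

all_paths = paths(tree)
breadth = len(all_paths)                    # number of paths pursued
depths = [len(p) - 1 for p in all_paths]    # "Why?" questions asked per path
```

Here `breadth` is 2 and each path is three questions deep. An investigation with only one path, or with paths only one question deep, would be a signal to keep asking.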

In the example, the left side of the diagram pertains to operators or inspectors not catching the nonconforming product; the right side pertains to the machine making the nonconforming product in the first place. Both avenues of pursuit are important. And to go broader still, while pursuing what actually went wrong this time around, the investigator should question what else could have gone wrong, and make improvements down those paths. That would be categorized as preventive action.

Stock answers
Many paths can lead to one of four stock answers:

A. Insufficient training

B. Insufficient time

C. Carelessness

D. Simple human error.

"Insufficient training" is a skills issue. It may be apparent that a person did not have the skills needed to do the task in question. The corrective action is to provide the appropriate training. However, it should be acknowledged that this investigative path could be taken one or two steps further. Why is a person on the job with insufficient skills? What is wrong with our hiring system or our training system? Going this deep may be necessary to avoid future problems.

"Insufficient time to do the job" is a conclusion that requires some judgement to reach. It can be a real problem--the boss piles on work and people are reluctant to say they have too much to do. If overload is truly the case, the corrective action is for the boss or management to provide help in setting priorities or to provide more resources such as more people or more efficient equipment. On the other hand, if the person in question is just slower than his or her peers, other actions may be necessary, such as coaching, training, reassignment or firing.

"Carelessness" by the employee is a difficult conclusion to reach. It is made easier if the person admits the carelessness or has a documented history of previous careless acts, but an admission or documentation may be hard to come by. To help determine if the act was careless, get several heads together and try to reach a consensus that this conclusion is correct. Nevertheless, carelessness is a true root cause and is a conclusion that is warranted some of the time. The corrective action is for management to provide the appropriate consequences for a first-time or repeat offense as described in the company's counseling system procedures. This will set the tone that carelessness is unacceptable. In addition, try to eliminate or minimize the opportunities for careless errors by instituting mistake-proof and fail-safe techniques (more on this topic later).

"Simple human error" or "making an honest mistake" is the last of the four stock answers. People are human, and humans make mistakes. Here again, the corrective action is to eliminate or minimize the opportunities for errors to occur by instituting mistake-proof and fail-safe techniques.

Non-stock answers
It is no surprise that the answers can range from simple to complex. As an example, look at two of the paths from the diagram.

One path in the diagram leads to a procedure book that has disappeared. No one knows why. The corrective action is simply to replace the book and chain it down so it doesn't disappear again.

Another path in the example leads to a more capable machine that was eliminated from the budget because it was too expensive. Possible corrective actions are to pursue a series of process improvements on the existing machine, or to redesign the product to be more manufacturable with the equipment at hand. Neither of these options will be easy to complete. They will likely require a range of problem-solving tools to get started. Brainstorming may be necessary to gather ideas on possible avenues to pursue. A Pareto analysis of data may help set priorities. Cause and effect diagrams may help identify what influences what. Design of experiments may be necessary to determine the relative importance of variables. The whole bag of problem-solving tools may need to be invoked.
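As one illustration of the tools mentioned above, a Pareto analysis simply ranks categories by frequency and tracks the cumulative share, so that the few categories causing most of the trouble stand out. The Python below is a minimal sketch. The defect names and counts are made up for the example, and `pareto` is a hypothetical helper, not a function from any library.

```python
from collections import Counter

# Hypothetical defect log; names and counts are invented for illustration.
defects = ["scratch"] * 5 + ["misalignment"] * 3 + ["missing part"] * 2


def pareto(observations):
    """Return (category, count, cumulative percent) rows, most frequent first."""
    counts = Counter(observations).most_common()
    total = sum(n for _, n in counts)
    rows, running = [], 0
    for category, n in counts:
        running += n
        rows.append((category, n, round(100 * running / total, 1)))
    return rows


for row in pareto(defects):
    print(row)
```

With real data, the top one or two rows usually suggest where process improvement effort should go first.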

Sometimes the true root cause cannot be eliminated because the corrective action may take too long, be too expensive or be in conflict with other priorities. In those cases, corrective action for the preceding step in the path will need to be put into place. This will not be as effective as addressing the true root cause, but it may be the best business decision in that instance. For example, it may be determined that a certain type of assembly failure can be eliminated with the minor redesign of a part. To implement the redesign will take three months and require some investment in tooling; however, the product is going to be discontinued in four months. Thus, the decision is made to forgo the redesign and beef up inspection instead.

After a thorough analysis is conducted, it is wise to eliminate or minimize opportunities for careless errors or simple human errors to occur by instituting mistake-proof and fail-safe techniques.

Mistake-proofing
Mistake-proofing is defined as implementing practices or devices that make it difficult or impossible for a task to be completed incorrectly--or that make it easy to do a task right. Despite its name, it is not 100% effective, but it can help a great deal. Mistake-proofing focuses on the human element of a potential error.

Examples are:

  • Creating checklists that must be filled out and signed off
  • Employing color coding as visual cues
  • Labeling bins so parts don't get mixed
  • Defining standard language such as defect names in an attempt to minimize communication misunderstandings
  • Requiring cross-checking of work done by someone else.

Fail-safing is defined as designing a system to make it physically impossible to fail. Fail-safing focuses on parts or equipment. Examples are:

  • Designing parts in an assembly so that the components only fit together one way, which eliminates the potential for error caused by mixing parts
  • Chaining down a procedure book so that it doesn't disappear
  • Requiring two switches, one for each hand, to operate a potentially dangerous piece of equipment such as a chopper
  • Using sensors to stop a manufacturing line when a certain condition is detected, such as a missing component.
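The last two fail-safe examples above amount to simple interlock logic. The Python below is a minimal sketch of that logic only. The `Station` class and `two_hand_start` function are hypothetical names, and having the stop latch until someone resets it is an assumption made for the example. A real machine would implement this in its safety controls, not in a script.

```python
from dataclasses import dataclass
from typing import Optional


def two_hand_start(left_pressed: bool, right_pressed: bool) -> bool:
    """Two-switch control: the machine may run only if both hands are on switches."""
    return left_pressed and right_pressed


@dataclass
class Station:
    """Hypothetical model of one station on a manufacturing line."""
    running: bool = True
    stop_reason: Optional[str] = None

    def check(self, component_present: bool) -> None:
        # Fail-safe: a missing component stops the line by itself,
        # without relying on a person to notice the problem.
        if not component_present:
            self.running = False
            self.stop_reason = "missing component"
        # Assumption: a later good reading does not restart the line;
        # the stop stays latched until someone investigates.
```

The design choice worth noting is that the safe state is the default: the equipment runs only while every condition is met, and a single failed condition stops it.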

Don't give up
It is rare that a root cause cannot be determined. Even when the answers are difficult to find, closing out an investigation by doing nothing is usually unacceptable. If an investigation is stymied, developing corrective actions on a best-guess basis is appropriate. If the problem recurs, additional investigative leads may become apparent at that time.

Root cause analysis is a topic as old as the hills, but it is not always done. And when it is done, it is not always done thoroughly, with depth and breadth. And even when a thorough analysis is done, it is not always followed up effectively, such as with mistake-proofing or fail-safing. Conducting thorough investigations and ensuring follow-through, though often difficult, will improve product quality and save the company money in the long run. And to state the obvious, high quality and low cost are two descriptors often associated with world-class status.