There are many approaches to Root Cause Analysis: [1]
- Events and Causal Factors
- Change Analysis
- Barrier Analysis
- Tree Diagrams
- Why-Why Chart
- Pareto
- Story Telling
- Fault Tree
- Failure Mode and Effects Analysis
The Why-Why approach, or the 5 Whys, created by Sakichi Toyoda, is common in Agile and Lean. It involves asking why something happened, then asking that same question again, in search of the cause-and-effect relationship behind the underlying problem, until the root cause is found.
This is a simplistic approach of asking five times until you can no longer answer the question. Five is an arbitrary number. This method produces a linear set of causal relationships and uses the experience of the problem owner to determine the root cause and corresponding solution. [2] The Five Whys method is inappropriate for any complex event but can be useful for minor problems that require nothing more than a basic discussion of an event.
The Five Whys root cause analysis proposes that if you retrace the chain of causes that led to a particular event, you will eventually find the single cause that set everything else in motion, known as the “root cause.” The theory is that finding and eliminating this single root cause will solve the problem.
Like a string of falling dominos, when we simply ask why, why, why, like the conventional Five Whys method, we believe that A caused B, B caused C, C caused D, and somewhere at the end of this causal chain there is a magical single cause that started everything, i.e., the root cause. In the thirteenth century, St. Thomas Aquinas of Sicily taught us the fallacy of this strategy when he proposed that “potency cannot reduce itself to act.” Or, as he clarified with this example, “the copper cannot become a statue by its own existence.” It requires the conditional cause of the copper’s existence and the actions of a sculptor. Unfortunately, this simple and important observation has not been understood or incorporated into everyday thinking, and most people continue to see the world linearly. [3]
We have been led to believe that effective problem solving can be had by finding the root cause at the end of a chain of causes. On the surface, this seems to make sense; but on examination, it is overly simplistic because it ignores the infinite set of causes. Our world is not linear and therefore this logic is overly simplistic and grossly ineffective. As we have seen, our world is made up of an infinite set of causes all connected through causal relationships. Some of these relationships are complicated by feedback loops. Some causes seem to come out of nowhere (the subconscious mind) but all should lead to a point of ignorance. The beginning and end of causation are determined by our knowledge and understanding of the problem.
Once we understand the nonlinearity of our universe, limiting oneself to a linear understanding such as the Five Whys method makes for terribly ineffective solutions. By understanding that there is an infinite number of causes connected in many ways, we begin to see that an infinite number of possible solutions exist. We may only need to affect one cause in a chain so that the problem does not occur—or we may need to attack several causes.
An Example of the Naivety of Five Whys
The Five Whys approach suffers from the typical problem-solving pitfalls of:
- Stopping too soon - asking 5 questions ignores the fact that there may be dozens of questions that need to be asked before we find the terminal cause. It may be that when we reach the end of the line of questions, there is another set of causes altogether.
- The need to place blame - belief that individuals need to be identified and their behaviors corrected to remove the cause.
- The root cause myth - the myth that root causes have solutions that can be acted upon by removing, changing, or controlling them so the problem does not recur.
- The false belief in common sense and a single reality - if perception is reality and everyone's reality is unique, what is reality or truth? It turns out everything is relative to our own truths, and the best we can hope for is to find a way to incorporate others' truths into ours.
- Groovenation - is a term created by Dean Gano in [2] to describe the justification of beliefs. To be groovenated is to hold strong biases and prejudices.
- Story Telling - our primary form of communication is storytelling, which describes an event by relating people (who), places (where), and things (what) in a linear time frame (when). Incident reports in search of the root cause are a prime example.
- Categorical Thinking - is caused by the mind's need to order what it perceives and is a natural process. But it can keep us from seeing how categorization leads to intellectual laziness. Notions of good and bad are categorical thinking. Instead of seeking to understand, the observed outcomes and the related root causes are categorized as good or bad. It's not the categorization itself that is the problem; it is the belief that once categorized, relationships can be established. Categorization is strongly linked to storytelling.
Cause and Effect, the chancellors of God - Ralph Waldo Emerson, 1856
Nothing happens without a cause. [2]
- Cause and Effect are the same thing.
- Causes and Effects are part of an infinite continuum of causes.
- Each effect has at least two causes in the form of a Condition and an Action.
- An effect exists only if its conditions and actions exist at the same point in time.
The 5 Whys address NONE of these required conditions to find the root cause.
- What are the conditions that allow the software to not function properly?
- What actions are taken on the part of the software and/or the user that result in non-functioning outcomes?
- At what time do these non-functioning outcomes occur?
In the Root Cause Analysis method used by NASA, the National Nuclear Security Administration, many services in the DOD, and most OSHA 1910.119 (Process Safety Management) root cause analyses, we proceed by asking and answering.
We can now go back to the resource and start to ask and answer, but these answers must be in the form of three elements: the Effect, the Action that creates the Effect, and the Condition that allows the Action to create the Effect.
- Primary Effect – is any effect we want to prevent
- Action – momentary causes that bring conditions together to cause an effect
- Conditions – causes that exist over time before an action brings them together
- Causal set – the fundamental causal element of all that happens. It is made up of an effect and its immediate causes that represent a single causal relationship.
- As a minimum, the causes in this set consist of an action and one or more conditions.
- Causal sets, like causes, cannot exist alone.
- They are part of a continuum of causes with no beginning or end, which leads us to the next principle:
- Causes and Effects are Part of an Infinite Continuum of Causes.
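A minimal sketch of the causal set idea, assuming hypothetical class and field names (`Cause`, `CausalSet`, `is_complete`) that are not part of any cited tool. It reuses the copper statue example from earlier: the copper's existence is a condition, the sculptor's work is an action, and the effect needs both.

```python
from dataclasses import dataclass, field
from enum import Enum


class CauseKind(Enum):
    ACTION = "action"        # momentary cause
    CONDITION = "condition"  # cause that exists over time


@dataclass(frozen=True)
class Cause:
    description: str
    kind: CauseKind


@dataclass
class CausalSet:
    """One effect together with its immediate causes (illustrative model only)."""
    effect: str
    causes: list[Cause] = field(default_factory=list)

    def is_complete(self) -> bool:
        """True when the set has at least one action and at least one condition."""
        kinds = {cause.kind for cause in self.causes}
        return CauseKind.ACTION in kinds and CauseKind.CONDITION in kinds


# A single-cause answer, as a linear chain of whys would record it.
copper_only = CausalSet(
    effect="A copper statue exists",
    causes=[Cause("Copper exists", CauseKind.CONDITION)],
)

# The same effect with both the condition and the action identified.
statue = CausalSet(
    effect="A copper statue exists",
    causes=copper_only.causes + [Cause("A sculptor shapes the copper", CauseKind.ACTION)],
)

print(copper_only.is_complete())
print(statue.is_complete())
```

The first set reports as incomplete because it names only a condition; the second is complete because an action joins it.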
To create the chart above, there are seven steps [3], [4]:
- Define the problem
  - What is the Primary Effect that creates the problem?
  - When does this Primary Effect occur?
  - Where is the Primary Effect seen?
  - What is the significance of the Primary Effect and its impact on the goals of the project or operations?
- Determine the Causal Relationships between each Action or Condition cause
  - For each Effect in the chart beyond the Primary Effect, define a Cause By Effect that is an Action or a Condition
- Make a graphical representation of the Causes, labeling each as Condition or Action.
- Provide the evidence, which can be
  - Observation - I saw it
  - Written - I read it
  - Verbal - I was told it
  - Sound - I heard it
  - Sensed - I smelled it
  - Touch - I felt it
- Determine if the Causes are Sufficient and Necessary
  - For each cause on the path in the chart, provide the reason for stopping, any feedback loops (one cause ending that starts another cause), and how to obtain the information needed
- Identify the Effective Solutions for each Cause
- Implement and Track the Identified solutions
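The evidence and reason-for-stopping steps above can be sketched as a check over a small cause tree. This is an illustration only: `ChartCause`, `open_items`, and the sample chart contents are hypothetical names and data, not the charting tool referenced in this article.

```python
from dataclasses import dataclass, field

# Evidence types named in the list above.
EVIDENCE_TYPES = {"observation", "written", "verbal", "sound", "sensed", "touch"}


@dataclass
class ChartCause:
    description: str
    kind: str                                               # "effect", "action", or "condition"
    evidence: dict[str, str] = field(default_factory=dict)  # evidence type -> note
    causes: list["ChartCause"] = field(default_factory=list)
    stop_reason: str = ""                                   # why this path ends here


def open_items(cause: ChartCause, path: tuple[str, ...] = ()) -> list[tuple[str, str]]:
    """Walk the chart and return (path, issue) pairs that still need work."""
    here = path + (cause.description,)
    label = " <- ".join(here)
    issues = []
    if not cause.evidence:
        issues.append((label, "no evidence recorded"))
    for kind in sorted(set(cause.evidence) - EVIDENCE_TYPES):
        issues.append((label, f"unknown evidence type: {kind}"))
    if not cause.causes and not cause.stop_reason:
        issues.append((label, "path ends without a reason for stopping"))
    for child in cause.causes:
        issues.extend(open_items(child, here))
    return issues


# Hypothetical chart: one primary effect with a condition and an action.
primary = ChartCause(
    description="Software does not function properly",
    kind="effect",
    evidence={"observation": "failure reproduced on a test server"},
    causes=[
        ChartCause(
            description="Server configuration differs from design assumptions",
            kind="condition",
            evidence={"written": "configuration compared with the guideline"},
            stop_reason="customer owns the server configuration",
        ),
        ChartCause(description="Software is started on the server", kind="action"),
    ],
)

for label, issue in open_items(primary):
    print(f"{label}: {issue}")
```

In this sample, only the action is flagged: it has no evidence and ends its path without a stated reason for stopping.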
With this approach, we can see that cause and effect are the same thing.
Then, of course, further investigation is needed: why was the seal maintained? The method used in our Software Intensive System of Systems domain, ranging from Enterprise IT, Cyber Security, manned and unmanned space flight, and petrochemical process plants, all the way to nuclear weapons design, development, and testing, treats the Root Cause Analysis process as having a potentially infinite number of causes (conditions and actions).
A Case Study using Flawed and Naive Root Cause Analysis
The PMI Disciplined Agile approach to root cause analysis provides an example that makes a good case study of why the 5 Whys is naive, flawed, and doesn't find the cause of the problem, only collects a set of symptoms that require further Root Cause investigation:
- Q1: “Why are we having to rework the system?”
- A1: “Because the programs do not function properly on our customers’ servers.”
- Q2: “Why do the programs not function properly on our customers’ servers?”
- A2: “Because the code was designed one way, but the servers are configured for another way.”
- Q3: “Why are our customers’ servers being configured differently from how it was expected?”
- A3: “Because our customers are not following our guidelines for server configuration.”
- Q4: “Why are our customers not following our guidelines for server configuration?”
- A4: “Because they aren’t aware of the guidelines.”
- Q5: “Why aren’t these customers aware of them?”
- A5: “Because sales, who is supposed to make sure they know of this configuration requirement, isn’t telling them.”
- Q6: “Why isn’t sales telling our customers they need to do this?”
- A6: “Because when a customer is ready to buy, sales tend to shut up and just get the contract signed. Closing the deal seems to be the most important thing to sales.”
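As a rough check of this chain against the principle stated earlier, that each effect has at least two causes in the form of a Condition and an Action, the sketch below tabulates the answers. The pair texts are paraphrases of the questions and answers above, and the variable names are illustrative.

```python
from collections import defaultdict

# (effect, stated cause) pairs paraphrased from Q1/A1 through Q6/A6 above.
five_whys_answers = [
    ("The system has to be reworked",
     "Programs do not function properly on customers' servers"),
    ("Programs do not function properly on customers' servers",
     "Code was designed one way, servers are configured another way"),
    ("Customers' servers are configured differently from what was expected",
     "Customers are not following the server configuration guidelines"),
    ("Customers are not following the server configuration guidelines",
     "Customers are not aware of the guidelines"),
    ("Customers are not aware of the guidelines",
     "Sales is not telling customers about the configuration requirement"),
    ("Sales is not telling customers about the configuration requirement",
     "Sales focuses on getting the contract signed"),
]

# Group the stated causes under the effect each one is offered for.
causes_by_effect = defaultdict(list)
for effect, cause in five_whys_answers:
    causes_by_effect[effect].append(cause)

# An effect with fewer than two causes cannot have both a condition and an action.
under_specified = [
    effect for effect, causes in causes_by_effect.items() if len(causes) < 2
]

print(f"{len(under_specified)} of {len(causes_by_effect)} effects "
      "have fewer than two stated causes")
```

Every effect in the chain is given exactly one cause, so none of them can contain both an action and a condition.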
The Five Whys suffers from three flaws:
- Incomplete problem definition - we communicate by telling each other stories because we don't think causally, and we then infer:
- The software doesn't function properly because it was designed wrong (Q1,2/A1,2).
- What's the evidence the software is not functioning properly?
- What's the definition of properly?
- What were the originally needed capabilities that are being observed as not functioning properly?
- What's the evidence it was designed wrong?
- Unknown causal relationships - causal relationships often remain unknown because we do not think causally. Instead, we communicate by storytelling, make inferences surrounding the stories, and treat those stories as causes.
- Talking in terms of human error, lack of training, and other categorical causes like management or user inaction (Q3,4/A3,4)
- What's the evidence customers aren't following our guidelines?
- Are our guidelines adequate to properly use our software?
- Are the customers trained and verified to enable them to follow our guidelines?
- What's the reason they're not following our guidelines?
- Insufficient time?
- Overworked?
- No access to instruction at the time of need?
- Why aren't customers aware? (Q5,6/A5,6)
- What's the evidence sales isn't telling customers what they need?
- Why is this a sales task and not a training and support task?
- What's the cause of sales tending to shut up?
- A focus on the solution - by focusing on solutions without clearly and concisely defining the problem and the cause of the problem, we end up solving the wrong problem.
- Focusing on solutions is often caused by Groovenation, where we think we already know the solution before we know what the problem is. We use that false belief and cease to look further.
Some Guidance for Successfully Finding the Root Cause of the Effect
And an example of putting that advice to work
And the foundation of a Root Cause Analysis used in our Software Intensive System of Systems domain using both Agile and Traditional methods
References
With these starting points, here is a compendium of resources we use in our Software Intensive System of Systems domain for determining the Root Cause and identifying the corrective and preventive actions needed to Handle the risk created by the Root Cause.
- Root Cause Investigation (RCI) Best Practices Guide Product Overview, May 8, 2014, Roland Duphily, Acquisition Risk and Reliability Engineering Department Systems Engineering Division Prepared for National Reconnaissance Office 14675 Lee Road, Chantilly, VA 20151-1715
- Apollo Root Cause Analysis: Effective Solutions to Everyday Problems Every Time, 3rd Edition, Dean L. Gano, Apollonian Publications, LLC, Richland Washington.
- Seven Steps to Effective Problem-Solving and Strategies for Personal Success, Dean L. Gano, Apollonian Publications, LLC, Richland Washington. (A National Nuclear Security Administration (NNSA) site).
- "Reality Charting: Creating the Reality Chart," Apollonian Publishing, 2017.
- Dean Gano was a participant in the Root Cause Analysis of the Three Mile Island nuclear power station accident. The common understanding was a stuck coolant valve, and that's what the public came to consider the root cause. The actual root cause was the alarm printer indicating the valve was stuck, which was many minutes behind real time. Had the operators been aware of the stuck valve in real time, they would have initiated corrective action and we'd have nuclear power in America today.