Friday, January 28, 2011

5: The Domino Effect, Part 1











The domino effect is another way of graphically demonstrating the string of "causes" which always underlies a phenomenon.  As you will see, it reinforces all that we have been talking about (past tense versus present tense, blame versus introspection, invisible versus visible forces, and the initial invisibility of causal factors).

I'll describe the domino effect by referring to an actual failure I dealt with several years ago.  I'll use a chemical plant example, but I could just as easily have used any other example.

A sulfuric acid plant depended on 2 large air compressors for control of its process.  One day in mid-summer, one of the 2 compressors unexpectedly "tripped off line," and coasted down to a standstill.  Since the other compressor was being maintained, the whole plant had to be shut down due to a loss of instrument air pressure.

Try to put yourself at the scene as I walk you through this example, and you'll see an actual example of all that we have been discussing.
People immediately started scurrying around, trying to understand what happened in order to know how to respond.  Initially, all they knew was that the compressor shut itself down automatically (see the initial domino drawing).  Referring to our earlier discussions, this is the failure (problem), the unexpected, unplanned pain, the "tip of the iceberg" which suddenly emerged from the depths of the unknown.  Although we know this situation is part of a whole string of causal factors (an entity), we don't know what they are yet.  We intuitively desire to find out more about the situation -- we begin our search for the causes.
Before we continue, consider what would happen if the operator simply hit the restart button.  Most probably, the compressor would trip again, almost immediately (depending, of course, on the cause).  In other words, when we react to the immediate problem, without doing any digging, the problem is likely to recur instantly.
But most operators would not react to the immediate problem -- they know better.  They would "dig" a little, trying to understand what caused the trip.  As I said earlier, this desire to search is the clue that the situation is not fully understood -- that the causal factors are unknown.
Therefore the operator searched, and almost immediately found a light on the control panel indicating that the trip was due to high vibration.  In essence, the operator pushed back the "cloak of the unknown" and found another domino.  This initial probe consumed very little time and energy.

Notice that the focus of the operator's inquiry is now on the "high vibration reading."  Because he made it visible, it becomes the problem.

Now put yourself in his shoes.  The plant has been shut down.  You (the operator) are the available "expert" on the compressor until others arrive.  From past experience, you know that the vibration sensors might have been triggered by something errant in the system, i.e., some one-time pressure surge or other condition.  Although you desire to confirm your hunches as you did before, you don't know how to do this in a timely manner.  So you assume the cause was an errant condition, and push the restart button in the hope that you are correct.

In essence, you (the operator) do all you can do in search of cause within your perceived constraints.  But the compressor trips out again.  The high vibration light reappears.  Your assumption was incorrect.  You "throw your hands into the air," give up the quest, and call the maintenance crew.

The maintenance crew arrives.  By this time, the other compressor has been restarted and the plant is back up and running.  After checking the sensor circuitry and finding it okay, the maintenance people begin to disassemble the compressor in search of the cause of the vibration.  They remove all the components, and find nothing wrong.  They do all they can do, and then send the components to the machine shop.  Although they consume time and energy, they are not successful in defining the next domino.

The components are now in the hands of the machine shop.  Upon inspection of the components, the shop finds that the high speed shaft and impeller assembly are bowed.  They involve maintenance engineering, asking them to perform some calculations, and find that the observed bowing would, in fact, cause very high vibration at operating speed -- sufficient to trip the unit.  They found the next domino!
[Figure: domino drawing, bowed shaft]
At this point, it might look like the machine shop was solely responsible for finding the "cause."  But remember that the operator was also involved, having ruled out an errant condition.  The maintenance crew was involved, having ruled out any visually-obvious conditions.  The machine shop just happened to be at the right place, with the right tools, to define the next domino for this particular failure.

Also recognize that it took considerable effort and time to actually "push back the cloak" to this point!  It took more than 24 hours.

Although the implications of this discussion should be obvious, they must be stated.  Firstly, it takes time and effort to dig into the invisible areas of a situation.  Secondly, we cannot expect the person who just happened to be there to be responsible for taking the time required to define the invisible, underlying causes to a sufficient degree, unless specifically given that responsibility.  Finally, we cannot expect anybody with perceived constraints to be able to define deeper, underlying causes.  Left to our own individual initiatives, each person seems to do all they can do to push back the cloak, then pass it off to someone else.

And we're not even close to where we want to end up!

It is important to note what would happen if we stopped the pursuit at this point.  If the components were straightened (or new ones were purchased) and installed without pursuing the causes any further, the compressor would obviously fail again.  But instead of tripping instantly (as it did with the operator), a period of time would elapse before the components bowed again, perhaps days or weeks.  In other words, all that would be gained by stopping the quest at this point is time -- time before the next failure.

Unfortunately, this is intuitively obvious and appealing to many people.  It falls right into our desire for expediency.  It would allow the shop, the operator, and the maintenance crew to do something else.  It gets the compressor "off their plate" for the time being.  But although this is where many would stop, this plant pursued it another step.

The machine shop noted that the original components were discolored in the area of the bearing.  They looked "blue."  The shop recognized that this was an indication of overheating.  The on-site metallurgist confirmed that the shaft had overheated, and also suggested that this was the cause of the bowing.  In other words, the next domino had been found.
[Figure: domino drawing, hot journals]
But now the machine shop ran into their own perceived constraints.  They knew what should be checked next, but had no means to check it on their own.  Therefore, they forwarded the information pertaining to the overheated shaft to the area supervisor.  They also suggested that he check the oil level in the compressor's reservoir, as well as the condition of the pump.

But the oil had already been drained and discarded.  Therefore, the area supervisor checked the oil pump on the compressor to assure it was functional.  It worked just as it should.  Having done all he was asked to do, the supervisor decided enough was enough!  He had checked what he could, but because of so many other demands on his time, he decided to reassemble the compressor, fill the reservoir with newly purchased oil, and restart it.

Four months later, it tripped again!

Important Points:
  • It takes time and effort to reveal the previously invisible causes of a failure.
  • It often takes many different perspectives (areas of expertise) to reveal the causes of a failure.
  • It becomes very tempting to stop the investigation when something new is found, even though there is much more to learn.
  • As we act on deeper and deeper levels of causes, we delay the time until the next failure.  This is good.
  • But if we do not dig deep enough, the failure will eventually recur.