Gene Kranz: How to prevent disasters in space

The Space Foundation surveyed its members in 2010: Who is your space hero? Which one person inspires you the most? The top response was Neil Armstrong.

But do you know who came second? Gene Kranz. Many people outside the space community might not know who he is, but Kranz played a pivotal role in preventing what could have been humanity’s greatest disaster in space. He was the man who saved the Apollo 13 mission and the three astronauts aboard.

When Apollo 13 was nearing the moon, an explosion occurred, and the spacecraft’s computer system rebooted. Scientists on the ground were unsure: Was this just an instrument failure, or had something truly gone wrong in space? Alarms were blaring in every direction, and engineers in Houston suspected it was mostly an instrument malfunction. After all, it seemed unlikely that four different systems had failed at once.

Amid this confusion, flight director Gene Kranz made the bold decision to abort the moon landing, shut down everything except emergency systems, and turn the spacecraft around to bring the astronauts home safely.

Only later did the engineers discover that an oxygen tank had burst. Had they hesitated even a little longer, the outcome could have been catastrophic. Kranz made his decision swiftly and decisively without even knowing the exact cause of the problem.

While everyone else was lost in the noise of simultaneous alerts, Kranz zeroed in on the most critical data and made the right call. But before we explore how he managed that, let’s contrast it with another alert system.

The nifty alert at a nuclear plant

At the Three Mile Island nuclear reactor, engineers had designed an alert system that was considered ingenious: when a cooling system malfunctioned, a bright lightbulb would turn on, alerting technicians to fix the issue.

Unfortunately, in March 1979, when the cooling system failed, the lightbulb didn’t turn on. Why? Because the lightbulb had burned out. Thanks to this small oversight, the failure went unnoticed and escalated into a loss-of-coolant accident that exposed over two million people to radiation. The nuclear unit had to be shut down permanently.

What happens when the alert itself malfunctions?

W. Edwards Deming and Fixing Japan

The nifty bulb alert system was set up to track failures. According to W. Edwards Deming, that is exactly the wrong thing to track.

Deming is the statistician who helped Japan become a world leader in quality manufacturing after World War II. He taught companies like Sony and Toyota to avoid costly mistakes and improve efficiency.

His core philosophy? Don’t wait for machines to break down.

Deming taught that if you wait for failure to happen before you act, you will always be responding too late. Instead of monitoring for breakdowns, you should monitor for the absence of normal conditions.

This was the fatal flaw in the Three Mile Island bulb alert system. The engineers were watching for failure. They should have been watching for normal. The lightbulb should have been on during normal operation, and turned off when something went wrong.
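
As a sketch of what that inverted design looks like in software, consider a heartbeat (a “dead man’s switch”): the system must keep signaling that it is normal, and silence itself becomes the alarm. The timings and function names below are illustrative assumptions, not taken from any real plant:

```python
# A hedged sketch of the inverted design: a heartbeat monitor.
# The monitored system must keep proving it is normal; silence
# is the alarm. All timings and names here are illustrative.

import time

HEARTBEAT_INTERVAL = 1.0   # how often "normal" should check in (seconds)
last_heartbeat = time.monotonic()

def on_heartbeat():
    """Called whenever the cooling system reports 'still normal'."""
    global last_heartbeat
    last_heartbeat = time.monotonic()

def normal_is_absent():
    """True if the 'normal' signal has gone quiet for too long.

    Note the failure mode: if the sensor, the wiring, or the bulb
    breaks, the heartbeat stops and we STILL get an alarm.
    """
    return time.monotonic() - last_heartbeat > 2 * HEARTBEAT_INTERVAL

on_heartbeat()       # the system reports normal once...
time.sleep(2.5)      # ...then goes silent
if normal_is_absent():
    print("ALARM: normal signal missing; investigate now")
```

Had Three Mile Island been wired this way, a burned-out bulb would have looked exactly like a cooling failure: dark means investigate.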

Deming’s Control Charts: The Power of Simplicity

One of Deming’s key teachings was the use of control charts. He taught factory managers in Japan to graph their activities and watch for deviations from normal operations.

  1. First, define your “normal”: establish a baseline of what a well-functioning system looks like.
  2. Then monitor that normal consistently, and mark any deviations from it.

The beauty of Deming’s system lies in its simplicity. Imagine waiting for someone to see flames before calling the fire department – that’s monitoring for failure. In contrast, a smoke alarm constantly monitors for the absence of normal by detecting smoke, long before a fire fully develops. Similarly, control charts track small deviations from the norm, triggering early alerts when something starts to go wrong.
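
To make the two steps concrete, here is a minimal control-chart sketch in Python. The baseline readings and the classic three-sigma control limits are illustrative assumptions, not data from any real factory:

```python
# A minimal control-chart sketch: learn "normal" from baseline data,
# then flag any reading that drifts outside the control limits.
# The readings and the 3-sigma rule below are illustrative only.

def control_limits(baseline):
    """Step 1: define 'normal' as the mean plus/minus 3 sigma."""
    n = len(baseline)
    mean = sum(baseline) / n
    sigma = (sum((x - mean) ** 2 for x in baseline) / n) ** 0.5
    return mean - 3 * sigma, mean + 3 * sigma

def deviations(readings, low, high):
    """Step 2: monitor consistently, marking anything outside normal."""
    return [(i, x) for i, x in enumerate(readings) if not (low <= x <= high)]

# Step 1: establish the baseline from healthy coolant temperatures (C).
baseline = [70.1, 69.8, 70.3, 70.0, 69.9, 70.2, 70.1]
low, high = control_limits(baseline)

# Step 2: watch live readings for the absence of normal.
live = [70.0, 70.2, 71.9, 70.1]
for i, value in deviations(live, low, high):
    print(f"Reading {i}: {value} is outside normal ({low:.2f}-{high:.2f})")
```

Notice that the chart never asks “did something fail?”; it only asks “does this still look like normal?”, which is what surfaces trouble early.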

Gene Kranz: Monitoring for the Absence of Normal

In the middle of the chaos and general confusion, Kranz didn’t react to the individual failure alerts. He understood that there was too much noise, and the engineers were unsure whether the alerts were real or due to instrument errors.

So, Kranz chose a different path. He monitored for the absence of normal conditions.

He asked his team to focus on three key consumables that any spacecraft uses: water, oxygen, and power. By comparing these levels to what would be expected during a normal flight, Kranz saw troubling anomalies. The data didn’t need to shout failure; it simply showed that things weren’t operating as they should be.
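
To illustrate the shape of that check, the sketch below compares a few consumable readings against what a normal flight would show. Every metric name, number, and tolerance is invented for illustration; this is not actual Apollo 13 telemetry:

```python
# A hedged sketch of Kranz's approach: instead of asking "which alarm
# is real?", compare each consumable against the expected normal level.
# All figures below are made up; they are not real Apollo 13 telemetry.

# What "normal" looks like at this point in a nominal mission:
# each metric maps to (expected value, allowed tolerance).
expected = {
    "oxygen_psi": (860.0, 40.0),
    "power_amps": (75.0, 5.0),
    "water_lbs":  (300.0, 15.0),
}

# Current readings. The question is not which alarm fired, but which
# values have left the band of normal.
telemetry = {"oxygen_psi": 320.0, "power_amps": 51.0, "water_lbs": 296.0}

for metric, (normal, tolerance) in expected.items():
    drift = telemetry[metric] - normal
    status = "OK" if abs(drift) <= tolerance else "NOT NORMAL"
    print(f"{metric:12s} drift {drift:+7.1f} -> {status}")
```

Run it and two of the three consumables sit far outside their normal bands; no single alarm needs to be trusted for the picture to be unmistakable.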

Once Kranz identified the abnormalities, he made a swift decision, abandoning the moon landing and turning the spacecraft back to Earth. If he had waited for confirmation of clear-cut failures, he would not have saved the lives of the astronauts aboard Apollo 13.

Action Summary:

  • A good alert system is one that warns you before the failure occurs, not after. So don’t monitor for failures; monitor for normal, and then track deviations from the norm.
  • The first step is to define what normal looks like.