Officials are still struggling to understand how the Boeing 737 Max — a plane model built with the latest, most advanced technology, one that has been flying commercially for less than two years — could have failed so catastrophically so early in its career. Two of these airliners crashed in the span of five months, killing 346 people in total. But while much of the focus has fallen on a faulty sensor, the underlying cause appears to involve the way complex subsystems on the plane interacted with one another, a dynamic that has played out in other recent catastrophes, ranging from Deepwater Horizon to Air France 447.
While complexity’s hazards have become all the more apparent with the 737 Max crashes and the difficult development of self-driving cars, they have long been an object of intense interest for MIT aeronautics professor Nancy Leveson, who nearly 40 years ago began studying what she calls “software-intensive, complex, tightly coupled systems.”
Back then, Leveson was a newly minted computer science PhD trying to figure out how to fix a torpedo for the U.S. Navy. The Mark 48 Advanced Capability (ADCAP) simply wasn’t working, a major problem given that the weapon was intended to defend the United States against Soviet subs and the nuclear missiles they carried.
“[Navy officials] were just tearing their hair out,” Leveson says today. “They’d never used so many computers in these weapons systems before.”
Leveson realized that technology had advanced to such a point that the routine problem-solving methods engineers had long employed would no longer suffice. A new methodology needed to be developed.
To avoid disasters, engineers have traditionally identified which individual components in a system might fail and how. The most minor parts can make a huge difference. A classic example is the space shuttle Challenger, which blew up in 1986 during liftoff because a single O-ring failed to provide an adequate seal due to unusually low temperatures on the day of the launch. If you can avoid this kind of flaw — if you can guarantee that every component will perform as intended — you can eliminate any risk of failure.
At least in theory.
What Leveson realized is that as complexity increases within a system, this approach loses its effectiveness. Things can go catastrophically wrong even when every individual component is working precisely as its designers imagined. “It’s a matter of unsafe interactions among components,” she says. “We need stronger tools to keep up with the amount of complexity we want to build into our systems.”
Leveson developed her insights into an approach called System-Theoretic Process Analysis (STPA), which has spread rapidly through private industry and the military. Among those adopting it today are the Air Force, most U.S. car makers, and Boeing.
“She’s literally invented a new approach to safety,” says Shem Malmquist, a professor of aeronautics at the Florida Institute of Technology. As technology becomes increasingly complex and automated, the type of accident STPA was developed to address will become more prevalent and the need for Leveson’s approach more urgent.
Leveson cites an incident that took place in 1993, when a Lufthansa Airbus A-320 overran a runway in Warsaw, Poland, killing two and injuring 68. The plane’s engines were equipped with thrust reversers that were intended to slow down the plane during landing. But there was a catch. Because reversing thrust would be catastrophic if it occurred in the air, the system was designed so that it wouldn’t engage unless both main landing gear were on the ground. This was an important safety feature, and normally this arrangement worked well. But on the day in question, the airport was buffeted by unusually strong winds.
Anticipating that the air would be pushing him sideways, the Lufthansa pilot followed standard procedure and came in to land with the plane rolled to one side to counteract the lateral drift. This resulted in the plane setting down on just one of its two main landing gear. Since the other landing gear didn’t touch down until the plane was halfway down the runway, the thrust reversers didn’t deploy until it was too late, leaving the pilot helpless to stop the plane in time. Everything worked the way it was supposed to, but the Lufthansa plane was destroyed, and two people died.
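For readers who think in code, the interlock can be sketched in a few lines of Python. This is only an illustration of the logic as described above, not the aircraft's actual software: the names `GearState`, `reversers_permitted` and `first_permitted_step`, and the four-step landing timeline, are all invented for the example, and the real system's conditions were more detailed than a pair of on-ground flags.

```python
from dataclasses import dataclass


@dataclass
class GearState:
    """Hypothetical snapshot: is each main landing gear on the ground?"""
    left_on_ground: bool
    right_on_ground: bool


def reversers_permitted(gear: GearState) -> bool:
    """Simplified interlock: allow reverse thrust only with both main gear down."""
    return gear.left_on_ground and gear.right_on_ground


def first_permitted_step(timeline):
    """Return the index of the first step where reverse thrust is allowed, or None."""
    for step, gear in enumerate(timeline):
        if reversers_permitted(gear):
            return step
    return None


# In flight the interlock does its job and blocks reverse thrust.
in_flight = GearState(left_on_ground=False, right_on_ground=False)

# Invented crosswind landing: one gear touches down well before the other.
crosswind_landing = [
    GearState(False, False),  # still airborne
    GearState(False, True),   # one gear on the runway
    GearState(False, True),   # rolling on one gear, runway slipping by
    GearState(True, True),    # second gear finally down
]

print(reversers_permitted(in_flight))
print(first_permitted_step(crosswind_landing))
```

Nothing in this sketch malfunctions. The function returns exactly what it was written to return at every step, yet braking help is withheld for most of the rollout, and there is no pilot override anywhere in the logic.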
To prevent this kind of disaster from happening in the future, STPA calls for engineers to enumerate the control loops embedded in a system and understand how they are connected together. This requires a paradigm shift.
“The old paradigm is ‘prevent components of the system from failing,’ while STPA says ‘enforce constraints on behavior,’” Leveson says. “So it has changed from a reliability problem to a control problem.” The essence of STPA is to “look at what is the worst case that could happen and how we can prevent that from happening.”
In reviewing the Lufthansa accident, for example, Leveson was struck by the fact that the A-320’s designers had felt so confident about the automatic system’s operation that they had left the human pilots with no way to override it. “Reliable operation of the automation is not the problem here,” she wrote in a later analysis. “Instead the issue is whether software can be constructed that will exhibit correct appropriate behavior under every foreseeable and unforeseeable situation and whether we should be trusting software over pilots.”
While the machines that surround us are becoming more complicated — “cars today have 100 million lines of code in them,” Leveson points out, compared to the 400,000 lines of code used on the original space shuttle — that complexity is just part of the picture. People, with all of our psychological quirks and organizational oddities, can both reduce risk and increase it.
Leveson sees a red flag, for instance, in the current state of semiautonomous vehicles. Manufacturers constantly remind consumers that their nascent self-driving systems are not capable of handling all possible situations that could arise, so drivers must stay alert and focused at all times. But Leveson believes this is unrealistic.
“No human can just sit there and not do anything and then be ready when an emergency occurs,” Leveson says. The problem of inattention was on display when a self-driving Uber car hit and killed a pedestrian in Arizona last year. A safety driver was behind the wheel but not paying attention to the road when the vehicle collided with a woman crossing the road. Nor does it help when tech entrepreneurs like Elon Musk boldly predict that fully self-driving cars will be available within a couple of years, as the Tesla CEO did on Monday.
Leveson will not publicly address the factors underlying the crashes of the Ethiopian Airlines and Lion Air 737 Max planes, explaining that Boeing is one of her clients. Nevertheless, these crashes and other similar catastrophes have driven home to the public that complex systems can produce unexpected behavior and that the human operators of those systems can struggle to respond effectively. We expect that as technology advances, it will necessarily become safer, and for the most part that’s been true. (More than twice as many people died in aviation accidents 20 years ago as do today, even though the number of people flying has vastly increased since then.) But better technology comes with greater complexity, which can generate dangers all its own.
And that’s why STPA is an idea whose time appears to have come. “It’s spreading like mad,” Leveson says, “which is keeping me very busy.”