Comment by akulbe 4 days ago

I'm curious why you say "there is no such thing as a root cause". Is this because that's what you genuinely believe, or was this just Amazon culture?

willthames 4 days ago

Perhaps more accurate is that there is no one root cause, but a bunch of contributing factors.

I had a quick search to find my favourite article on this one. It might be https://willgallego.com/2018/04/02/no-seriously-root-cause-i... as it demolishes Five Whys, but https://www.kitchensoap.com/2012/02/10/each-necessary-but-on... is a classic.

donavanm 4 days ago

In addition to the other comments, see https://en.m.wikipedia.org/wiki/Ishikawa_diagram or the “new school” of safety engineering that's trying to get away from ideas like “RCA.” Sidney Dekker publishes an older version of his seminal work at https://www.humanfactors.lth.se/fileadmin/lusa/Sidney_Dekker...

I'm ex-AWS, and the internal tools _try_ to get away from the myth of human error and root cause. But it's difficult when humans read and understand in a linear, post hoc fashion.

lazyant 4 days ago

https://en.wikipedia.org/wiki/Swiss_cheese_model (kind of) is the pretty standard way in SRE nowadays to talk about root causes (plural), because usually it's more than one specific thing.

Twirrim 3 days ago

There's some serious academic critique of the Swiss Cheese model, too.

That said, I rather like it. It's straightforward to explain to people; they very quickly "get" it, and it gets them thinking along the right lines.

Twirrim 3 days ago

Amazon culture was still really rather root-cause oriented when I left. The COE (Correction of Error) process they followed for post-incident analysis was flawed and based on discredited approaches like the "5 whys". I don't know if it has changed since I left; I rather hope it has.

I genuinely believe there is no such thing as a root cause. That belief is grounded both in personal experience and in 40+ years of academic research demonstrating, in far greater detail, that no failure comes about as the result of a single thing. Failure is always complex, and approaches to incident analysis and remediation that are grounded in the idea of a single root cause are largely ineffectual. You have to consider all contributing factors if you want to make actual progress. The tech industry tends to trail really far behind on this subject.

A couple of quick up-front reading suggestions, if I may:

* https://how.complexsystems.fail/ - An excellent, and short, read. It's not really written from the perspective of tech, but every single point is easily applicable to technological systems. I encourage engineers to think about the systems that they're responsible for at work, their code, incidents they've been involved in, as they read it. Everything we deal with is fundamentally complex. Even a basic application that runs on a single server is complex, because of all the layers of the operating system below it, let alone the wider environment of the network and beyond.

* If you're willing to go deeper: "Behind Human Error", ISBN-13: 9780754678342. It's a very approachable book, written by Dr Woods and several other prominent names in academic research into failure.

My favourite example to use when talking about why there's no such thing as a root cause has been the 737-MAX situation, which has only become more apt over time as we've learned just how bad that plane was.

With the original 737-MAX situation, we had two planes crash in the space of 5 months.

If you follow a root cause analysis approach, you'll end up looking at the fact that the planes relied on a single sensor to tell them their angle of attack (AoA). So replace the single sensor with multiple sensors, something that was already possible to do, and the planes stop crashing. Job done. Resilience achieved!

That doesn't really address a lot of important things, like, how did we end up in a situation where the plane was ever sold with a single AoA sensor (especially one with a track record of inaccuracy)?

Why did the Maneuvering Characteristics Augmentation System (MCAS) keep overriding pilot input on the controls, which it didn't do on previous implementations of MCAS systems? Why was the existence of the MCAS system not in the manuals? Why could MCAS repeatedly activate?

If the MCAS hadn't behaved in such a bizarre fashion, where it kept overriding input, it's arguable that the pilots would have been able to correct its erroneous dropping of the nose. If you're sticking with a "root cause" model, there's a pretty strong argument that the MCAS system was the actual root cause of the incident.

Given the fairly fundamental changes to the controls, and the introduction of MCAS, why were pilots not required to be trained on the new model of plane? Is Boeing's attempt to avoid pilots needing retraining the actual root cause?

Why was an MCAS system necessary in the first place? It's because of the engine changes they'd made, and how they'd had to position the engines relative to the wings, which tended to result in the plane's nose wanting to pitch up and induce a stall. The MCAS system was introduced to help counteract that new behaviour. Was the new engine choice the root cause of failure?

and so on and so forth..

If you look at the NTSB accident report for the 737-MAX flights, it goes way deeper and broader into the incident. It takes a holistic view of the situation, because even the answers to those questions I pose above are multi-faceted. As a result, it had a long list of recommendations for Boeing, the FAA, and all involved parties.

Those two crashes were an emergent property of a complex system. They were a symptom. A side effect. They required failures in the way things were communicated within Boeing, failures in FAA certification, failures in engineering, failures in management decision making, the works.

No one at Boeing deliberately set out to make a plane that crashes. A lot of the decisions about how they do things are reasonable, and make some sense, in isolation. But together they all contributed to a system in which it became a question of "when", not "if", a major disaster would occur.

If resilience and incident analysis just focus on a singular root cause, the systems that produced the bad outcome will continue, and will inevitably produce more bad decisions. The only thing that would improve is that they would probably never again decide to sell a plane with a single AoA sensor.

As we've seen over time, the whole broken system has resulted in a lot of issues with the MAX, beyond that sensor situation.