Comment by Twirrim 4 days ago

> line management was unable or unwilling to invest in fixing root causes of operational issues.

Sorry for an obligatory: there is no such thing as a root cause.

That said, that matches my general experience too (I left about 9 years ago). Unless the S-team specifically calls a team out on a particular metric, it's not going to get touched.

Even then they'll try to game the metric. Sev2 rate is too high? Let's find some alarms that generate lots of false positives and just make them sev3 instead, rather than investigate why. No way that can backfire... wait, what do you mean I had an outage and didn't know, because the alarm used to fire legitimately too?
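
To make the anti-pattern concrete, here's a minimal sketch (hypothetical alarm names and paging policy, not Amazon's actual tooling) of how "fixing" a noisy alarm by downgrading it also silences the genuine pages it used to produce:

    def handle(alarm):
        # Hypothetical policy: sev2 pages a human, sev3 just files a ticket.
        if alarm["severity"] <= 2:
            print(f"PAGE oncall: {alarm['name']}")   # investigated right now
        else:
            print(f"ticket filed: {alarm['name']}")  # looked at... eventually

    noisy = {"name": "HostHealthPing", "severity": 2}
    # The "fix": downgrade instead of investigating the false positives.
    noisy["severity"] = 3  # sev2 count improves, and so does time-to-detect
    handle(noisy)

The metric looks better, but the small fraction of firings that were real outages now sits in a ticket queue instead of waking anyone up.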

That major S3 collapse several years ago was caused by a component that engineers had warned leadership about for at least 4-5 years while I was there. They'd carefully gathered data, written reports, and drawn up remediation plans that weren't particularly painful. Engineers knew it was in an increasingly fragile state. It took the outage for leadership to recognise that maybe, just maybe, it was time to follow the plan laid out by engineering. I can't talk about the actual what/why of that component, but if I did, it'd have you facepalming, because it was painfully obvious before the incident that an incident was inevitable.

Unfortunately, it seems like an unwillingness to invest in operations just pervades the tech industry. So many folks I speak to across a wide variety of tech companies are constantly having to fight to get operations considered any kind of a priority. No one gets promoted for performing miracles keeping stuff running.

akulbe 4 days ago

I'm curious why you say "there is no such thing as a root cause". Is this because that's what you genuinely believe, or was this just Amazon culture?

  • willthames 4 days ago

    Perhaps more accurate is that there is no one root cause, but a bunch of contributing factors.

    I had a quick search to find my favourite article on this one. It might be https://willgallego.com/2018/04/02/no-seriously-root-cause-i... as it demolishes Five Whys, but https://www.kitchensoap.com/2012/02/10/each-necessary-but-on... is a classic.

  • donavanm 4 days ago

    In addition to the other comments, see https://en.m.wikipedia.org/wiki/Ishikawa_diagram or the "new school" of safety engineering that's trying to get away from ideas like "RCA". Sidney Dekker publishes an older version of his seminal work at https://www.humanfactors.lth.se/fileadmin/lusa/Sidney_Dekker...

    I'm ex-AWS, and the internal tools _try_ to get away from the myth of human error and root cause. But it's difficult when humans read and understand in a linear, post-hoc fashion.

  • lazyant 4 days ago

    https://en.wikipedia.org/wiki/Swiss_cheese_model (kind of) is the pretty standard way in SRE nowadays to talk about root causes (plural), because it's usually more than one specific thing.

    • Twirrim 3 days ago

      There's some serious academic critique of the Swiss Cheese model, too.

      That said, I rather like it. It's straightforward to explain to people, they very quickly "get" it, and it gets them thinking along the right lines.
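
      A quick back-of-the-envelope sketch of why it clicks (the per-layer numbers here are invented purely for illustration): an incident needs a hole in every layer, so no single layer is "the" cause.

          # Swiss cheese, numerically: every defensive layer must miss.
          layers = {"code review": 0.10, "tests": 0.10, "canary deploy": 0.10}

          p_incident = 1.0
          for name, p_miss in layers.items():
              p_incident *= p_miss

          print(f"P(all layers miss) = {p_incident:.4f}")  # 0.0010, i.e. 0.1%
          # Remove any one layer and the risk jumps tenfold;
          # every hole contributed to the outcome.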

  • Twirrim 3 days ago

    Amazon culture was still really rather root-cause oriented when I left. The COE ("Correction of Error") process used for post-incident analysis was flawed, built on discredited approaches like the "5 whys". I don't know if it has changed since I left; I rather hope it has.

    I genuinely believe there is no such thing as a root cause. That belief is grounded both in personal experience and in 40+ years of academic research demonstrating in far greater detail that no failure comes about as the result of a single thing. Failure is always complex, and approaches to incident analysis and remediation that assume a single root cause are largely ineffectual. You have to consider all contributing factors if you want to make actual progress. The tech industry tends to trail really far behind on this subject.

    A couple of quick up-front reading suggestions, if I may:

    * https://how.complexsystems.fail/ - An excellent, and short, read. It's not really written from the perspective of tech, but every single point is easily applicable to technological systems. I encourage engineers to think about the systems that they're responsible for at work, their code, incidents they've been involved in, as they read it. Everything we deal with is fundamentally complex. Even a basic application that runs on a single server is complex, because of all the layers of the operating system below it, let alone the wider environment of the network and beyond.

    * If you're willing to go deeper: "Behind Human Error" (ISBN-13: 978-0754678342). It's a very approachable book, written by Dr David Woods and several other prominent names in the academic study of failure.

    My favourite example to use when talking about why there's no such thing as a root cause has been the 737-MAX situation. Which has only become more apt over time, as we've learned just how bad that plane was.

    With the original 737-MAX situation, we had two planes crash in the space of 5 months.

    If we follow a root cause analysis approach, you'll end up looking at the fact that the planes were reliant on a single sensor to tell them their angle of attack (AoA). So replace the single sensor with multiple, something that was already possible to do, and the planes stop crashing. Job done. Resilience achieved!
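
    For what it's worth, that narrow fix really is textbook. A toy sketch of the standard remedy (illustrative only, nothing like real avionics code) is median voting across redundant sensors, so one faulty reading can't steer the system:

        def voted_aoa(readings):
            # Median-of-N voting: a single faulty sensor gets outvoted.
            assert len(readings) >= 3, "redundancy needs multiple sources"
            return sorted(readings)[len(readings) // 2]

        # One vane reading wildly wrong, two healthy ones:
        print(voted_aoa([4.8, 5.1, 26.0]))  # -> 5.1; the bad reading is discarded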

    That doesn't really address a lot of important things, like, how did we end up in a situation where the plane was ever sold with a single AoA sensor (especially one with a track record of inaccuracy)?

    Why did the Maneuvering Characteristics Augmentation System (MCAS) keep overriding pilot input on the controls, which it didn't do on previous implementations of MCAS systems? Why was the existence of the MCAS system not in the manuals? Why could MCAS repeatedly activate?

    If MCAS hadn't behaved in such a bizarre fashion, repeatedly overriding input, it's arguable that the pilots would have been able to correct its erroneous pushing-down of the nose. If you're sticking with a "root cause" model, there's a pretty strong argument that the MCAS system was the actual root cause of the incident.

    Given the fairly fundamental changes to the controls, and the introduction of MCAS, why were pilots not required to be trained on the new model of plane? Is Boeing's attempt to avoid pilots needing retraining the actual root cause?

    Why was an MCAS system necessary in the first place? It's because of the engine changes they'd made: the engines had to be positioned relative to the wings in a way that tended to make the plane's nose pitch up and risk a stall. MCAS was introduced to help counteract that new behaviour. Was the new engine choice the root cause of failure?

    ...and so on, and so forth.

    If you look at the NTSB's report on the 737-MAX crashes, it goes way deeper and broader into the incidents. It takes a holistic view of the situation, because even the answers to those questions I pose above are multi-faceted. As a result it had a long list of recommendations for Boeing, the FAA, and all involved parties.

    Those two crashes were an emergent property of a complex system. They were a symptom. A side effect. They required failures in the way things were communicated within Boeing, failures in FAA certification, failures in engineering, failures in management decision making, the works.

    No one at Boeing deliberately set out to make a plane that crashes. A lot of the decisions about how they do things are reasonable, and make some sense, in isolation. But together they made for a system in which a major disaster was a question of "when", not "if".

    If resilience and incident analysis focus on just a singular root cause, the system that produced the bad outcome will carry on, and will inevitably produce more bad decisions. The only thing that improves is that they'd probably never again decide to sell a plane with a single AoA sensor.

    As we've seen over time, the whole broken system has resulted in a lot of issues with the MAX, beyond that sensor situation.

33MHz-i486 3 days ago

The really dumb thing about working at AWS is that they pay so much lip service to Ops. You can literally spend a third of a week in meetings talking about Ops issues, yet not a single long-term project to improve the deeper architectural problems that cause bad Ops ever gets funded.

jp57 4 days ago

> Sorry for an obligatory: there is no such thing as a root cause.

While I get what you mean, I think most people who've been in the situation know what I'm talking about. The same alarms go off constantly, and you keep doing the expedient thing to make them stop, without investing any effort in preventing them from going off again in the same situation in the future.

Of course there is a chain of causes, and maybe you need to refactor a module, or maybe you need to redesign an interface, or maybe you need to throw the whole thing away and start over -- we did all those things in different situations while I was there -- but there's a point at which looking at deeper causes loses value because those causes are not in our power to fix and we're left to defend against those failures: a system we rely on is unreliable; machines and networks go down unexpectedly; a lot of people have poor reading comprehension so even good docs are sometimes useless; we are all sinners whose pride and sloth sometimes leads us to make crappy software and systems; etc.
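
And for the failures we can only defend against, the defenses themselves are well worn. A generic sketch (not any particular system's implementation): wrap the unreliable dependency in retries with exponential backoff and jitter, and fail loudly when patience runs out.

    import random
    import time

    def call_with_retries(fn, attempts=4, base_delay=0.2):
        # Defend against a flaky dependency we can't fix: retry with
        # exponential backoff plus jitter, then surface the failure.
        for attempt in range(attempts):
            try:
                return fn()
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of patience; let the caller (and alarms) see it
                # back off 0.2s, 0.4s, 0.8s... jittered to avoid retry herds
                time.sleep(base_delay * 2 ** attempt * random.uniform(0.5, 1.5))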