jp57 4 days ago

My experience there (15 years ago) was that on-call was terrible because line management was unable or unwilling to invest in fixing root causes of operational issues.

When I started I lucked into a situation where I was one engineer on a "team" of two. We didn't have a manager and were reporting to the director of our department. He only had about an hour a week to meet with us. We spent a lot of time fixing broken stuff that we'd inherited (a task that I actually found kind of fun), and soon our ops load started going down. We eventually got another engineer and a manager who was willing to prioritize fixing the root causes of our on-call tickets.

During Black Friday week of my second year there we had essentially no operational issues and spent our time brainstorming future work while we kept an eye on our performance dashboards. We got semi-scolded by a senior engineer from a neighboring team because we didn't "seem very busy". Our manager called that a win.

Even back then Amazon had the reputation for being a brutal place to work and for burning out engineers, but I rather liked it. I ultimately left because my wife hated living in Seattle.

hypeatei 4 days ago

> We got semi-scolded by a senior engineer from a neighboring team because we didn't "seem very busy"

What the hell? Hope you told him off, not his job or his business. Weird.

  • jp57 4 days ago

    Well, he rolled up to the same director and was the most senior engineer under that director, so it was a little bit his business.

    And, like I said, it was only a semi-scolding: he came to me and quietly said something like, "you guys don't seem very busy", which I took to mean "why are you loudly brainstorming future work when the guys in the next row of cubes over haven't slept in 36 hours?"

    My answer was, "All our stuff is working."

    • ethbr1 4 days ago

      I heard it expressed years ago, when the role was still called sysops, that it was the one job where the better you were, the less work you did.

      It was attached to a similar anecdote about someone being yelled at for crafting a well-oiled system.

  • goostavos 4 days ago

    Stuff like that can almost always be traced back to that senior being told to "be visible." Show up! Have opinions on things (loudly)! "Scale yourself!" Other mumbo jumbo. It often leads to these weird, misguided drive-bys where everyone is left confused.

    • kevinventullo 3 days ago

      Yet here we are talking about the person… they sound pretty influential to me. (I’m only half-joking)

Twirrim 4 days ago

> line management was unable or unwilling to invest in fixing root causes of operational issues.

Sorry for an obligatory: there is no such thing as a root cause.

That said, that matches my general experience too (I left about 9 years ago). Unless the S-team specifically calls out a particular metric, it's not going to get touched.

Even then, they'll try to game the metric. Sev2 rate is too high? Let's find some alarms that generate lots of false positives and just make them sev3 instead, rather than investigate why. No way it can backfire... wait, what do you mean I had an outage and didn't know, because the alarm used to fire legitimately too?
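
To make that failure mode concrete, here's a hypothetical sketch (invented alarm names and a made-up paging policy, nothing to do with the real internal tooling): if paging keys off severity, downgrading a noisy alarm also silently stops it paging for the legitimate incidents it catches.

  # Hypothetical paging policy; invented names, not AWS's actual tooling.
  ALARMS = {
      "frontend-5xx-rate": {"severity": 2},  # noisy: lots of false positives
      "storage-disk-full": {"severity": 2},
  }

  def pages_oncall(alarm):
      # Only sev2 or worse pages a human; sev3 just files a ticket.
      return ALARMS[alarm]["severity"] <= 2

  # "Fix" the sev2-rate metric by downgrading the noisy alarm...
  ALARMS["frontend-5xx-rate"]["severity"] = 3

  # ...and now a legitimate 5xx spike no longer wakes anyone up.
  assert pages_oncall("frontend-5xx-rate") is False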

That major S3 collapse several years ago was caused by a component that engineers had warned leadership about for at least 4-5 years when I was there. They'd carefully gathered data, written reports, written up remediation plans that weren't particularly painful. Engineers knew it was in an increasingly fragile state. It took the outage for leadership to recognise that maybe, just maybe, it was time to follow the plan laid out by engineering. I can't talk about the actual what/why of that component, but if I did it'd have you facepalming, because it was painfully obvious before the incident that an incident was inevitable.

Unfortunately, it seems like an unwillingness to invest in operations just pervades the tech industry. So many folks I speak to across a wide variety of tech companies are constantly having to fight to get operations considered any kind of a priority. No one gets promoted for performing miracles keeping stuff running.

  • akulbe 4 days ago

    I'm curious why you say "there is no such thing as a root cause". Is this because that's what you genuinely believe, or was this just Amazon culture?

    • willthames 4 days ago

      Perhaps more accurate is that there is no one root cause, but a bunch of contributing factors.

      I had a quick search to find my favourite article on this one. It might be https://willgallego.com/2018/04/02/no-seriously-root-cause-i... as it demolishes Five Whys, but https://www.kitchensoap.com/2012/02/10/each-necessary-but-on... is a classic.

    • donavanm 4 days ago

      In addition to the other comments, see https://en.m.wikipedia.org/wiki/Ishikawa_diagram or the “new school” of safety engineering that's trying to get away from ideas like “RCA.” Sidney Dekker publishes an older version of his seminal work at https://www.humanfactors.lth.se/fileadmin/lusa/Sidney_Dekker...

      I'm ex-AWS, and the internal tools _try_ to get away from the myth of human error and root cause. But it's difficult when humans read and understand in a linear, post hoc fashion.

    • lazyant 4 days ago

      https://en.wikipedia.org/wiki/Swiss_cheese_model (kind of). It's pretty standard nowadays in SRE to talk about root causes (plural), because usually it's more than one specific thing.

      • Twirrim 3 days ago

        There's some serious academic critique of the Swiss Cheese model, too.

        That said, I rather like it. It's straightforward to explain to people, they very quickly "get" it, and it gets them thinking along the right lines.

    • Twirrim 3 days ago

      Amazon culture was still really rather root-cause oriented when I left. The COE process they followed for post-incident analysis was flawed and based on discredited approaches like the "5 whys". I don't know if it has changed since I left; I rather hope it has.

      I genuinely believe there is no such thing as a root cause. That belief is grounded both in personal experience and in 40+ years of academic research demonstrating, in far greater detail, that no failure comes about as the result of a single thing. Failure is always complex, and approaches to incident analysis and remediation that assume a single cause are largely ineffectual. You have to consider all the contributing factors if you want to make actual progress. The tech industry tends to trail really far behind on this subject.

      A couple of quick up-front reading suggestions, if I may:

      * https://how.complexsystems.fail/ - An excellent, and short, read. It's not really written from the perspective of tech, but every single point is easily applicable to technological systems. I encourage engineers to think about the systems that they're responsible for at work, their code, incidents they've been involved in, as they read it. Everything we deal with is fundamentally complex. Even a basic application that runs on a single server is complex, because of all the layers of the operating system below it, let alone the wider environment of the network and beyond.

      * If you're willing to go deeper: "Behind Human Error" (ISBN-13: 9780754678342). It's a very approachable book, written by Dr Woods and several other prominent names in academic research into failure.

      My favourite example to use when talking about why there's no such thing as a root cause is the 737-MAX situation, which has only become more apt over time as we've learned just how bad that plane was.

      With the original 737-MAX situation, we had two planes crash in the space of 5 months.

      If you follow a root cause analysis approach, you'll end up looking at the fact that the planes were reliant on a single sensor to tell them their angle of attack (AoA). So replace the single sensor with multiple sensors, something that was already possible to do, and the planes stop crashing. Job done. Resilience achieved!
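
      (A toy sketch of what that narrow fix amounts to; this is purely my own illustration of redundancy with voting, nothing to do with Boeing's actual avionics. With three sensors, taking the median means one faulty reading can't steer the system.)

        # Toy two-out-of-three voting; illustrative only, not avionics code.
        def voted_aoa(readings):
            # Median of three independent readings: one wildly-wrong
            # sensor can no longer drive the control system alone.
            assert len(readings) == 3
            return sorted(readings)[1]

        # A sensor stuck at a bogus angle is simply outvoted:
        print(voted_aoa([4.9, 5.1, 74.5]))  # prints 5.1, not 74.5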

      That doesn't really address a lot of important things, like, how did we end up in a situation where the plane was ever sold with a single AoA sensor (especially one with a track record of inaccuracy)?

      Why did the Maneuvering Characteristics Augmentation System (MCAS) keep overriding pilot input on the controls, which it didn't do on previous implementations of MCAS systems? Why was the existence of the MCAS system not in the manuals? Why could MCAS repeatedly activate?

      If the MCAS hadn't behaved in such a bizarre fashion, where it kept overriding input, it's arguable that the pilots would have been able to correct its erroneous dropping of the nose. If you're sticking with a "root cause" model, there's a pretty strong argument that the MCAS system was the actual root cause of the incident.

      Given the fairly fundamental changes to the controls, and the introduction of the MCAS, why were pilots not required to be trained on the new model of plane? Is Boeing's attempt to avoid pilots needing retraining the actual root cause?

      Why was an MCAS system necessary in the first place? It's because of the engine changes they'd made, and how they'd had to position the engines relative to the wings, which tended to make the plane's nose pitch up and induce a stall. The MCAS system was introduced to help counteract that new behaviour. Was the new engine choice the root cause of failure?

      And so on, and so forth.

      If you look at the NTSB accident report for the 737-MAX flights, it goes way deeper and broader into the incident. It takes a holistic view of the situation, because even the answers to the questions I pose above are multi-faceted. As a result, it had a long list of recommendations for Boeing, the FAA, and all involved parties.

      Those two crashes were an emergent property of a complex system. A symptom. A side effect. They required failures in the way things were communicated within Boeing, failures in FAA certification, failures in engineering, failures in management decision making, the works.

      No one at Boeing deliberately set out to make a plane that crashes. A lot of the decisions about how they do things are reasonable, and make some sense, in isolation. But together they made for a system where a major disaster was a question of "when" and not "if".

      If resilience and incident analysis just focus on a singular root cause, the systems that produced the bad outcome will carry on, and will inevitably produce more bad decisions. The only thing that would improve is that they would probably never again decide to sell a plane with a single AoA sensor.

      As we've seen over time, the whole broken system has resulted in a lot of issues with the MAX, beyond that sensor situation.

  • 33MHz-i486 3 days ago

    The really dumb thing about working at AWS is that they pay so much lip service to ops. You can literally spend a third of a week in meetings talking about ops issues, but not a single long-term project to fix the deeper architectural problems that cause bad ops ever gets funded.

  • jp57 4 days ago

    > Sorry for an obligatory: there is no such thing as a root cause.

    While I get what you mean, I think most people who've been in the situation know what I'm talking about. The same alarms are going off constantly, and you keep doing the expedient thing to make them stop, without investing any effort into preventing them from firing again under the same conditions in the future.

    Of course there is a chain of causes, and maybe you need to refactor a module, or maybe you need to redesign an interface, or maybe you need to throw the whole thing away and start over -- we did all those things in different situations while I was there -- but there's a point at which looking at deeper causes loses value, because those causes are not in our power to fix and we're left to defend against those failures: a system we rely on is unreliable; machines and networks go down unexpectedly; a lot of people have poor reading comprehension, so even good docs are sometimes useless; we are all sinners whose pride and sloth sometimes lead us to make crappy software and systems; etc.