Comment by Twirrim 4 days ago
> line management was unable or unwilling to invest in fixing root causes of operational issues.
Sorry for an obligatory: there is no such thing as a root cause.
That said, that matches my general experience too (I left about 9 years ago). Unless the S-team specifically calls them out for any particular metric, it's not going to get touched.
Even then they'll try to game the metric. Sev2 rate is too high? Let's find some alarms that are behind lots of false positives and just make them sev3 instead, rather than investigate why. No way it can backfire... wait, what do you mean I had an outage and didn't know, because the alarm used to fire legitimately too?
That major S3 collapse several years ago was caused by a component that engineers had been warning leadership about for at least 4-5 years while I was there. They'd carefully gathered data, written reports, and written up remediation plans that weren't particularly painful. Engineers knew it was in an increasingly fragile state. It took the outage for leadership to recognise that maybe, just maybe, it was time to follow the plan laid out by engineering. I can't talk about the actual what/why of that component, but if I could, it'd have you facepalming, because it was painfully obvious before the incident that an incident was inevitable.
Unfortunately, it seems like an unwillingness to invest in operations just pervades the tech industry. So many folks I speak to across a wide variety of tech companies are constantly having to fight to get operations considered any kind of a priority. No one gets promoted for performing miracles keeping stuff running.
I'm curious why you say "there is no such thing as a root cause". Is this because that's what you genuinely believe, or was this just Amazon culture?