Comment by supriyo-biswas a day ago

I guess this is only possible at engineering-focused organizations that value technical excellence, and it also requires that one person be right often enough to justify spending their social capital advocating for the engineering changes they want to see.

As a counterexample, I worked at a company with an extremely bureaucratic release process: multiple levels of review from stakeholders, people manually monitoring the system after a release, and a policy of performing deployments only at night, all indicators of a lack of confidence in the organization's engineering processes.

While company management talked a lot about faster releases, “falling behind in the age of AI”, and the like, they also loved their processes and would rather keep them, because to them the processes were a sign of meticulousness and quality. I hated it, but I don’t see how anyone, even people with far more clout than me, could have changed it, even though in private discussions they would acknowledge that it was slow and could do with more automation.

tw04 a day ago

Do you have an example of a large organization that deploys in the middle of the business day and hasn’t had a catastrophic failure? I don’t think “deploying after hours” is a sign of a lack of confidence in engineering; it’s just basic common sense not to disrupt the people paying your bills because it might be slightly less convenient for a small subset of your employees.

People always point to Facebook, but they literally constantly have issues; it’s just that nobody dies when the like button glitches on grandma’s feed.

  • supriyo-biswas a day ago

    I feel like a lot of people have the mistaken impression that they don't need to invest in engineering processes because there's a "downtime" window during which they can deploy. Large companies don't have that luxury because their application is in use around the clock, so they typically do some sort of blue/green, canary or cellular deployment, where alarm/metric thresholds can be used to stop further traffic propagation and/or trigger a rollback.
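
    To make the threshold part concrete, here's a minimal sketch of such a canary gate (Python; the set_weight/get_error_rate/promote/rollback hooks and the specific numbers are purely illustrative stand-ins for whatever your deploy and metrics tooling exposes): shift a small slice of traffic, watch an error-rate metric during a bake period, and roll back as soon as it breaches the threshold.

      import time

      def canary_release(set_weight, get_error_rate, promote, rollback,
                         canary_weight=0.01, threshold=0.005, bake_minutes=15):
          # All four callables are assumed hooks into your own deploy/metrics tooling.
          set_weight(canary_weight)              # e.g. send 1% of traffic to the new version
          for _ in range(bake_minutes):
              time.sleep(60)
              if get_error_rate() > threshold:   # alarm/metric threshold breached
                  rollback()                     # stop propagation and revert traffic
                  return False
          promote()                              # healthy for the whole bake period: go to 100%
          return True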

    I also see that people are generally unwilling to invest in an integration test suite that can be run against a staging environment before the deployment, which would catch a lot of these issues. At a smaller scale, you can run a lightweight integration test just before releasing the new version, using test data on accounts that you control, similar to a canary. That's something I wanted to pursue there, but by that time I had decided to leave.
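
    As a rough sketch of the test-accounts idea (pytest-style; the staging URL, endpoints and credentials below are placeholders, not anything from a real system), you exercise one critical path end-to-end right before promoting:

      import requests

      STAGING_URL = "https://staging.example.com"      # placeholder host
      TEST_ACCOUNT = {"user": "synthetic-checkout-1",  # account we own and control
                      "password": "not-a-real-secret"}

      def test_login_and_place_order():
          # Exercise one critical path against staging before releasing.
          session = requests.Session()
          resp = session.post(f"{STAGING_URL}/login", json=TEST_ACCOUNT, timeout=10)
          assert resp.status_code == 200

          resp = session.post(f"{STAGING_URL}/orders",
                              json={"sku": "test-sku-1", "qty": 1}, timeout=10)
          assert resp.status_code == 201
          assert resp.json()["status"] == "accepted"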

    Note that "inconvenience" is not a concern for me; all organizations maintaining external applications have the concept of oncall. And any large organization, at scale, will have failures; it's just that Facebook has gotten good at mitigating them.

  • adrianN a day ago

    Most sufficiently large companies don’t have the luxury of „after hours“ because they have customers in every time zone.

  • decimalenough a day ago

    If you have properly tuned canary releases and sufficiently large scale, it's effectively safe to release at any time, because any failures will be caught by the 1% stage.
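
    As a sketch of what that staged widening looks like (the percentages and hooks are illustrative, not any particular platform's API), each stage only proceeds if the previous one stayed healthy, so most bad releases never get past the 1% stage:

      def progressive_rollout(set_weight, is_healthy, rollback,
                              stages=(0.01, 0.10, 0.50, 1.00)):
          # Widen traffic stage by stage; revert on the first unhealthy check.
          for fraction in stages:
              set_weight(fraction)    # route this fraction of traffic to the new version
              if not is_healthy():    # caller-supplied check: error rate, latency, alarms
                  rollback()
                  return False
          return True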