Comment by supriyo-biswas a day ago

I feel like a lot of people have the mistaken impression that they don't need to invest in engineering processes because there's a "downtime" window during which they can deploy. Large companies don't have this luxury because their application is in use all the time, so they usually do some sort of blue/green, canary, or cellular deployment, where alarm/metric thresholds are used to stop further traffic propagation and/or trigger a rollback.
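
To make the threshold idea concrete, here is a rough sketch of a staged (canary-style) rollout gate. It is only an illustration under my own assumptions: `staged_rollout`, `error_rate_at`, and the 1% default threshold are hypothetical names and values, not any particular company's tooling. A real system would shift traffic, wait for metrics to settle, and read them from a monitoring service.

```python
from typing import Callable, Sequence, Tuple


def staged_rollout(
    stages: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float = 0.01,
) -> Tuple[str, int]:
    """Walk through traffic percentages, stopping at the first unhealthy stage.

    `error_rate_at(percent)` is a hypothetical hook that returns the error
    rate observed while `percent` of traffic is on the new version.
    Returns ("completed", last_percent) or ("rolled_back", failing_percent).
    """
    last_healthy = 0
    for percent in stages:
        observed = error_rate_at(percent)
        if observed > max_error_rate:
            # Threshold tripped: stop propagating traffic and roll back.
            return ("rolled_back", percent)
        last_healthy = percent
    return ("completed", last_healthy)


if __name__ == "__main__":
    # Simulated metric: the new version starts failing once it takes 50% of traffic.
    def flaky(percent: int) -> float:
        return 0.002 if percent < 50 else 0.08

    print(staged_rollout([1, 10, 50, 100], flaky))
```

The point of the design is that a bad release only ever sees the traffic share of the stage where it failed, rather than all users at once.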

I also see that people are generally unwilling to invest in an integration test suite that can be run on a staging environment before the deployment, which would catch a lot of these issues. At a smaller scale, you can also run a lightweight integration test, using test data on accounts that you control, just before you release the new version, similar to a canary. That is something I wanted to pursue there, but by that time I had decided to leave.
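
As a minimal sketch of what such a pre-release check could look like (all names here, including `run_prerelease_checks` and `safe_to_release`, are hypothetical, and the checks themselves are stand-ins for real calls against test accounts you control):

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def run_prerelease_checks(
    checks: Iterable[Tuple[str, Callable[[], None]]]
) -> List[CheckResult]:
    """Run each named check; a check fails by raising an exception."""
    results = []
    for name, check in checks:
        try:
            check()
            results.append(CheckResult(name, True))
        except Exception as exc:  # any failure should block the release
            results.append(CheckResult(name, False, str(exc)))
    return results


def safe_to_release(results: List[CheckResult]) -> bool:
    # Require at least one check so an empty suite does not pass by accident.
    return bool(results) and all(r.passed for r in results)


if __name__ == "__main__":
    def login_works() -> None:
        pass  # placeholder: would log in with a test account

    def post_is_visible() -> None:
        raise AssertionError("post not visible to test follower")

    results = run_prerelease_checks(
        [("login", login_works), ("post_visible", post_is_visible)]
    )
    for r in results:
        print(r.name, "ok" if r.passed else "FAILED: " + r.detail)
    print("release" if safe_to_release(results) else "hold")
```

The release script would call something like this right before flipping traffic and refuse to proceed if `safe_to_release` returns `False`.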

Note that "inconvenience" is not a concern for me; all organizations maintaining external applications have the concept of on-call. And any large organization, at scale, will have failures; it's just that Facebook has gotten good at mitigating them.