Comment by siscia

Comment by siscia 6 days ago

0 replies

I see few blind spots from the write up.

1. Traffic for a new version was loaded up too quickly. I usually lobby for releasing updates slowly. This alone would have prevented the issue.

1. Tasks cannot fail under load. Load Shedding should be in place exactly for this reason. You don't take more than you can chew. If more arrives you slowly and politely refuse the request. You need to be both, slow and polite, so that the client will slowly retry and you won't incur in the herding issue.

1. The monitoring issue should have triggered (most likely) an increase of latency. That should have been enough to not complete the deployment and rollback carefully.

I am sure engineers in canva had their reason, and that the write up does not account for everything. Just some food for thought for other engineers.