Comment by ec109685 6 days ago

The incident report said that “the growth of off-heap memory” was a cause of the OOM.

Why would too much traffic have caused that to increase specifically? The overhead of a connection in the kernel isn’t that high.

To reduce pressure in the future, they could smear the downloading of new assets over time by background fetching. E.g., when the canary of a new Canva release starts, clients still on the existing version could probabilistically download the new assets in the background, so when they switch, there’s nothing new to download.
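
A rough sketch of what I mean by smearing, not a claim about how Canva’s client works. It swaps the coin flip for a deterministic hash-based offset, so each client gets a stable prefetch time spread across a window. The names `prefetch_delay_seconds`, `should_prefetch_now`, `client_id`, `release_id`, and the window length are all made up for illustration.

```python
import hashlib


def prefetch_delay_seconds(client_id: str, release_id: str, window_seconds: int) -> int:
    """Return a stable per-client offset in [0, window_seconds).

    Hashing (release_id, client_id) spreads clients roughly evenly over the
    window, and the same client always gets the same offset for a release.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    digest = hashlib.sha256(f"{release_id}:{client_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds


def should_prefetch_now(
    client_id: str,
    release_id: str,
    seconds_since_canary_start: float,
    window_seconds: int,
) -> bool:
    """True once this client's slot in the smear window has arrived."""
    delay = prefetch_delay_seconds(client_id, release_id, window_seconds)
    return seconds_since_canary_start >= delay
```

A client on the old version would call `should_prefetch_now(...)` whenever it polls for release info and start the background download the first time it returns true. The point is only that the fetches are spread over the window instead of all landing at the moment of the switch.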

Features like collapse forwarding and stale-while-revalidate are powerful tools for CDNs, but they have non-intuitive failure modes that you have to be aware of. Anything that synchronizes huge numbers of requests is dangerous to stability.