Comment by supriyo-biswas 6 days ago


> checksumming does make sense because it ensures that the file you've transferred is complete and what was expected.

TCP has a checksum that catches corrupted packets in transit, and TLS protects against MITM.

I've always found this aspect of S3's design questionable. Sending both a Content-MD5 AND an x-amz-content-sha256 header, and taking up gobs of compute in the process, sheesh...

It's also part of the reason why running MinIO in its single-node, single-drive mode is a resource hog.
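
For reference, this is roughly the work those two headers represent on a plain single-part PUT: Content-MD5 is the base64-encoded MD5 digest of the body, and x-amz-content-sha256 is the hex SHA-256 digest covered by SigV4 signing. A minimal Python sketch of just the digest computation (the actual request signing and HTTP plumbing are omitted):

    import base64
    import hashlib

    def upload_integrity_headers(body: bytes) -> dict:
        # Each header is a separate full pass over the payload, which is
        # the "gobs of compute" being complained about above.
        return {
            "Content-MD5": base64.b64encode(hashlib.md5(body).digest()).decode(),
            "x-amz-content-sha256": hashlib.sha256(body).hexdigest(),
        }

    print(upload_integrity_headers(b"hello world"))
    # {'Content-MD5': 'XrY7u+Ae7tCTyyK7j1rNww==',
    #  'x-amz-content-sha256': 'b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9'}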

lacop 6 days ago

I got some empirical data on this!

The Effingo file-copy service does strong application-layer checksums and detects about 4.5 corruptions per exabyte transferred (figure 9, section 6.2 in [1]).

This is on top of TCP checksums, transport layer checksums/encryption (gRPC), ECC RAM and other layers along the way.

Many of these could be traced back to a "broken" machine that was eventually taken out.

[1] https://dl.acm.org/doi/abs/10.1145/3651890.3672262
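
As a rough illustration of what an application-layer check adds on top of TCP/TLS (a minimal sketch, not Effingo's actual implementation): hash the bytes independently on each side of the copy and compare the digests, so corruption introduced by a bad NIC, flaky RAM, or a broken host between the two hash points gets caught.

    import hashlib

    CHUNK = 1 << 20  # 1 MiB read size

    def sha256_of(path: str) -> str:
        # Stream the file and return its hex SHA-256 without loading it all into RAM.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK):
                h.update(chunk)
        return h.hexdigest()

    def verify_copy(src_path: str, dst_path: str) -> None:
        # End-to-end check after a copy; in a real service the two digests
        # would be computed on different machines and compared over the wire.
        if sha256_of(src_path) != sha256_of(dst_path):
            raise IOError(f"corruption detected: {src_path} -> {dst_path}")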

alwyn 6 days ago

In my view one reason is to ensure integrity down the line. You want the checksum of a file to still be the same when you download it maybe years later. If it isn't, you get warned about it. Without the checksum, how will you know for sure? Keep your own database of checksums? :)
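
One way to do that without keeping a separate database is to attach your own digest to the object as user metadata at upload time and re-check it on download. A hedged boto3 sketch; the bucket and key are hypothetical, and it assumes credentials are already configured:

    import hashlib

    import boto3  # assumes boto3 is installed and AWS credentials are set up

    s3 = boto3.client("s3")
    BUCKET, KEY = "my-backup-bucket", "archives/2017/photos.tar"  # hypothetical names

    def upload_with_digest(path: str) -> None:
        # Store our own SHA-256 as user metadata so it travels with the object.
        body = open(path, "rb").read()
        s3.put_object(Bucket=BUCKET, Key=KEY, Body=body,
                      Metadata={"sha256": hashlib.sha256(body).hexdigest()})

    def download_and_verify(path: str) -> None:
        # Years later: re-hash the downloaded bytes and compare to the stored value.
        obj = s3.get_object(Bucket=BUCKET, Key=KEY)
        data = obj["Body"].read()
        if hashlib.sha256(data).hexdigest() != obj["Metadata"]["sha256"]:
            raise IOError(f"{KEY}: downloaded bytes do not match stored checksum")
        with open(path, "wb") as f:
            f.write(data)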

  • supriyo-biswas 6 days ago

    If we're talking about bitrot protection, I'm pretty sure S3 would use some form of checksum (such as CRC32 or xxHash) on each internal block to facilitate the Reed-Solomon repair process.

    If it's about verifying whether it's the same file, you can use the ETag header, which is computed server-side by S3. Although I don't like this design, as it ossifies the checksum algorithm.
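
    A minimal sketch of the ETag approach (hypothetical bucket/key, boto3 assumed), with the usual caveat that it only works for objects uploaded in a single part without SSE-KMS; multipart ETags are an MD5-of-part-MD5s with a "-N" suffix, so they can't be compared this way:

        import hashlib

        import boto3  # assumes AWS credentials are configured

        s3 = boto3.client("s3")

        def matches_etag(path: str, bucket: str, key: str) -> bool:
            # Compare a local file's MD5 against the object's S3 ETag.
            etag = s3.head_object(Bucket=bucket, Key=key)["ETag"].strip('"')
            if "-" in etag:
                raise ValueError("multipart ETag, not a plain MD5")
            with open(path, "rb") as f:
                return hashlib.md5(f.read()).hexdigest() == etag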

dboreham 6 days ago

It's well known (apparently not?) that applications can't rely on TCP checksums.
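
The underlying issue: the TCP checksum is a 16-bit ones'-complement sum, so it is blind to reordered 16-bit words (and to pairs of errors that cancel out), and at 16 bits even random corruption slips through roughly 1 time in 65,536. A small sketch of the RFC 1071-style sum (without the TCP pseudo-header) showing two different payloads with identical checksums:

    import struct

    def internet_checksum(data: bytes) -> int:
        # RFC 1071 ones'-complement sum over 16-bit words, as TCP uses.
        if len(data) % 2:
            data += b"\x00"
        total = sum(struct.unpack(f"!{len(data) // 2}H", data))
        while total >> 16:                      # fold carries back in
            total = (total & 0xFFFF) + (total >> 16)
        return ~total & 0xFFFF

    pkt1 = b"\x12\x34\xab\xcd\x00\x01"
    pkt2 = b"\xab\xcd\x12\x34\x00\x01"          # same 16-bit words, reordered
    assert pkt1 != pkt2
    assert internet_checksum(pkt1) == internet_checksum(pkt2)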
