Comment by tom1337 6 days ago

checksumming does make sense because it ensures that the file you've transferred is complete and what was expected. if the checksum of the file you've downloaded differs from the one the server gave you, you should not process the file further and should throw an error instead (the worst case would probably be a man-in-the-middle attack, a less severe one being packet loss, i guess)
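
A minimal sketch of the check described above, assuming the server publishes a hex SHA-256 digest alongside the file. The names `verify_download` and `ChecksumMismatch` are hypothetical and chosen for illustration only.

```python
import hashlib


class ChecksumMismatch(Exception):
    """Raised when a downloaded file does not match the expected digest."""


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Return the digest if it matches, otherwise refuse to continue."""
    actual = sha256_of_file(path)
    if actual != expected_hex.lower():
        raise ChecksumMismatch(f"{path}: expected {expected_hex}, got {actual}")
    return actual
```

Raising instead of returning a boolean makes it harder for a caller to ignore the mismatch and process the file anyway.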

supriyo-biswas 6 days ago

> checksumming does make sense because it ensures that the file you've transferred is complete and what was expected.

TCP has a checksum for packet loss, and TLS protects against MITM.

I've always found this aspect of S3's design questionable. Sending both a Content-MD5 AND an x-amz-content-sha256 header and taking up gobs of compute in the process, sheesh...
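
For reference, a rough sketch of what computing both headers for one request body involves. It assumes the usual encodings (Content-MD5 is the base64 of the raw MD5 digest, x-amz-content-sha256 is the hex SHA-256 of the payload); the helper name `s3_integrity_headers` is made up for this example.

```python
import base64
import hashlib


def s3_integrity_headers(body: bytes) -> dict:
    """Compute both integrity headers; the body is hashed twice."""
    md5_digest = hashlib.md5(body).digest()
    return {
        # Content-MD5 carries the base64 encoding of the 16 raw digest bytes.
        "Content-MD5": base64.b64encode(md5_digest).decode("ascii"),
        # x-amz-content-sha256 carries the lowercase hex SHA-256 of the payload.
        "x-amz-content-sha256": hashlib.sha256(body).hexdigest(),
    }
```

Both digests are a full pass over the payload, which is where the extra compute mentioned above comes from.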

It's also part of the reason why running minio in its single node single drive mode is a resource hog.

lacop 6 days ago

I got some empirical data on this!
Effingo file copy service does application-layer strong checksums and detects about 4.5 corruptions per exabyte transferred (figure 9, section 6.2 in [1]).
This is on top of TCP checksums, transport layer checksums/encryption (gRPC), ECC RAM and other layers along the way.
Many of these could be traced back to a "broken" machine that was eventually taken out.
[1] https://dl.acm.org/doi/abs/10.1145/3651890.3672262
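
To put the 4.5 corruptions per exabyte figure in perspective, here is a back-of-the-envelope sketch. It assumes a decimal exabyte (10^18 bytes) and treats corruptions as independent events (a Poisson model); neither assumption comes from the paper.

```python
import math

CORRUPTIONS_PER_EXABYTE = 4.5  # figure quoted in the comment above
BYTES_PER_EXABYTE = 10**18     # assumption: decimal exabyte


def expected_corruptions(bytes_transferred):
    """Expected number of corruptions at the quoted rate."""
    return CORRUPTIONS_PER_EXABYTE * bytes_transferred / BYTES_PER_EXABYTE


def probability_of_any_corruption(bytes_transferred):
    """P(at least one corruption), assuming independent (Poisson) events."""
    return -math.expm1(-expected_corruptions(bytes_transferred))
```

Under these assumptions a single petabyte transfer expects about 0.0045 corruptions, while a full exabyte is very likely to see at least one.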

alwyn 6 days ago

In my view one reason is to ensure integrity down the line. You want the checksum of a file to still be the same when you download it maybe years later. If it isn't, you get warned about it. Without the checksum, how will you know for sure? Keep your own database of checksums? :)

- supriyo-biswas 6 days ago
  
  If we're talking about bitrot protection, I'm pretty sure S3 would use some form of checksum (such as crc32 or xxhash) on each internal block to facilitate the Reed-Solomon process.
  If it's about verifying whether it's the same file, you can use the ETag header, which is computed server-side by S3. I don't like this design, though, as it ossifies the checksum algorithm.
  
  everfrustrated 6 days ago
  
  You may be interested in this https://aws.amazon.com/blogs/aws/introducing-default-data-in...
  
dboreham 6 days ago

Well known (apparently not?) that applications can't rely on TCP checksums.
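
One way to see why: the TCP checksum is a 16-bit ones' complement sum (the RFC 1071 Internet checksum), so reordering 16-bit words does not change it. The sketch below covers only a payload; real TCP also sums a pseudo-header and the TCP header, which this example leaves out.

```python
import hashlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        # Fold any carry bits back into the low 16 bits.
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


original = b"\x12\x34\x56\x78"
swapped = b"\x56\x78\x12\x34"  # same 16-bit words, different order

# The weak checksum cannot tell the two payloads apart...
same_weak_checksum = internet_checksum(original) == internet_checksum(swapped)
# ...while a cryptographic hash can.
same_strong_hash = hashlib.sha256(original).digest() == hashlib.sha256(swapped).digest()
```

Addition is commutative, so any corruption that reorders words, or that adds to one word what it subtracts from another, slips through.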

- [removed] 6 days ago
  
  [deleted]
  
vbezhenar 6 days ago

TLS ensures that the stream was not altered. Any further checksums are redundant.

huntaub 6 days ago

This is actually not the case. The TLS stream ensures that the packets transferred between your machine and S3 are not corrupted, but that doesn't protect against bit-flips which could (though, obviously, shouldn't) occur from within S3 itself. The benefit of an end-to-end checksum like this is that the S3 system can store it directly next to the data, validate it when it reads the data back (making sure that nothing has changed since your original PutObject), and then give it back to you on request (so that you can also validate it in your client). It's the only way for your client to have bullet-proof certainty of integrity the entire time that the data is in the system.
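
The flow described above can be sketched with a toy in-memory store. This is an illustration only, not how S3 is implemented; `ToyObjectStore` and `IntegrityError` are hypothetical names.

```python
import hashlib


class IntegrityError(Exception):
    """Raised when data does not match its recorded checksum."""


class ToyObjectStore:
    """Keeps the client-supplied checksum next to the data it describes."""

    def __init__(self):
        self._objects = {}  # key -> (data, sha256 hex digest)

    def put(self, key, data, client_sha256):
        # Reject the upload if it was damaged before reaching the store.
        if hashlib.sha256(data).hexdigest() != client_sha256:
            raise IntegrityError("upload does not match client checksum")
        self._objects[key] = (data, client_sha256)

    def get(self, key):
        data, stored_sha256 = self._objects[key]
        # Re-validate on read to catch corruption that happened at rest.
        if hashlib.sha256(data).hexdigest() != stored_sha256:
            raise IntegrityError("stored object no longer matches checksum")
        # Return the checksum too, so the client can verify its own copy.
        return data, stored_sha256
```

Because the checksum originates at the client and travels with the object, every hop, including the store itself, can be checked against the same value.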

tom1337 6 days ago

That's true, but wouldn't it still be required if you have an internal S3 service which is used by internal services and does not have HTTPS (as it is not exposed to the public)? I get that the best practice would be to also use HTTPS there, but I'd guess that's not the norm?

- vbezhenar 6 days ago
  
  TCP packets do have checksums, but they are fairly weak, so for plain HTTP additional checksums make sense. That said, I'm not sure whether there are any internal AWS S3 deployments working over HTTP, or why they would complicate their protocol for everyone to help such a niche use case.
  I'm sure that they have reasons for this whole request signature scheme over traditional "Authorization: Bearer $token" header, but I never understood it.
  
  easton 6 days ago
  
  AWS has a video about it somewhere, but in general, it’s because S3 was designed in a world where not all browsers/clients had HTTPS and it was a reasonably expensive operation to do the encryption (like, IE6 world). SigV4 (and its predecessors) are cheap and easy once you understand the code.
  https://youtube.com/watch?v=tPr1AgGkvc4, about 10 minutes in I think.
  
  formerly_proven 6 days ago
  
  Because a bearer token is a bearer token to do any request, while a pre-signed request allows you to hand out the capability to perform _only that specific request_.
  
  degamad 5 days ago
  
  Bearer tokens have a defined scope, which could be used to limit functionality in a similar way to pre-signed requests.
  However, the S3 pre-signed request functionality was launched in 2011, while the Bearer token RFC 6750 wasn't standardised until 2012...
  
Spooky23 6 days ago

Not always. Lots of companies intercept and potentially modify TLS traffic between network boundaries.

neon_me 6 days ago

yes, you are right!

On the other hand, S3 uses checksums only to verify the expected upload (on the write from client -> server) ... and surprisingly you can also do that in parallel after the upload, by comparing the MD5 hash of the blob to the ETag (*with some caveats)
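
A hedged sketch of that comparison. It assumes the commonly observed ETag behaviour: a plain hex MD5 for single-part uploads, and for multipart uploads the MD5 of the concatenated per-part MD5 digests followed by `-<part count>`. Other cases (for example objects encrypted with SSE-KMS) do not follow this pattern, which is one of the caveats. All function names here are made up for illustration.

```python
import hashlib


def single_part_etag(data: bytes) -> str:
    """ETag as commonly observed for a single-part upload: hex MD5."""
    return hashlib.md5(data).hexdigest()


def multipart_etag(data: bytes, part_size: int) -> str:
    """ETag as commonly observed for multipart uploads (assumption)."""
    part_digests = [
        hashlib.md5(data[i:i + part_size]).digest()
        for i in range(0, len(data), part_size)
    ]
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"


def matches_etag(data: bytes, etag: str, part_size=None) -> bool:
    """Compare local data against an ETag header value."""
    etag = etag.strip('"')  # the header value is usually wrapped in quotes
    if "-" in etag:
        if part_size is None:
            raise ValueError("part_size is required for multipart ETags")
        return multipart_etag(data, part_size) == etag
    return single_part_etag(data) == etag
```

For a multipart object the check only works if you know the part size used at upload time, since a different split produces a different ETag.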

0x1ceb00da 6 days ago

You need the checksum only if the file is big and you're downloading it to disk, or if you're paranoid that some malware with root access might be altering the contents of your memory.

lazide 6 days ago

Or you really care about the data and are aware of the statistical inevitability of a bit flip somewhere along the line if you’re operating long enough.

arbll 6 days ago

I mean if malware has root and is altering your memory, it's not like you're in a position where this check is meaningful haha
