Comment by stu2010 13 hours ago

This is cool. I'd say the most common tool in this space is bgzip[1]. Have you thought about training a dictionary on the first few chunks of each file and embedding the dictionary in a skippable frame at the start? It likely makes less difference at a 2 MB chunk size, but at smaller chunk sizes it could have a significant benefit.

[1] https://www.htslib.org/doc/bgzip.html
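
To make the skippable-frame idea concrete, here is a rough sketch of the framing only, using nothing but the standard library. The layout (a little-endian magic number in the range 0x184D2A50 to 0x184D2A5F, a 4-byte little-endian size, then arbitrary user data) comes from the zstd frame format. The function names and the idea of putting dictionary bytes in the payload are my own assumptions, not something the seekable format defines today.

```python
import struct

# Skippable frame magic numbers span 0x184D2A50..0x184D2A5F (zstd frame format).
SKIPPABLE_MAGIC_BASE = 0x184D2A50


def wrap_in_skippable_frame(payload: bytes, variant: int = 0) -> bytes:
    """Wrap payload as: 4-byte LE magic, 4-byte LE payload size, payload.

    Hypothetical helper: the payload could be dictionary bytes, but what it
    means is entirely up to the application.
    """
    if not 0 <= variant <= 0xF:
        raise ValueError("variant must be in 0..15")
    if len(payload) > 0xFFFFFFFF:
        raise ValueError("payload too large for a 32-bit frame size")
    header = struct.pack("<II", SKIPPABLE_MAGIC_BASE + variant, len(payload))
    return header + payload


def read_leading_skippable_frame(data: bytes):
    """Return (payload, rest) if data starts with a skippable frame,
    otherwise (None, data)."""
    if len(data) < 8:
        return None, data
    magic, size = struct.unpack_from("<II", data, 0)
    if (magic & 0xFFFFFFF0) != SKIPPABLE_MAGIC_BASE:
        return None, data
    if len(data) < 8 + size:
        raise ValueError("truncated skippable frame")
    return data[8:8 + size], data[8 + size:]
```

A writer would emit `wrap_in_skippable_frame(dictionary_bytes)` before the first compressed chunk, and a dictionary-aware reader would call `read_leading_skippable_frame` before decoding. A plain zstd decoder skips the frame, but it would then fail on the chunks unless it was given the same dictionary some other way. If I'm reading the seekable spec right, the seek table is itself a skippable frame with magic 0x184D2A5E, so picking a different variant seems safer.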

jeroenhd 13 hours ago

Looking at the spec (https://github.com/facebook/zstd/blob/dev/contrib/seekable_f...), I don't see any mention of custom dictionaries like you describe.

The spec does mention:

> While only Checksum_Flag currently exists, there are 7 other bits in this field that can be used for future changes to the format, for example the addition of inline dictionaries.

so I don't think seekable zstd supports these dictionaries just yet.

With multiple inline dictionaries, one could detect when new chunks compress badly with the previous dictionary and train new ones on the fly. Could be useful for compressing formats with headers and mixed data (e.g. game files, which can contain a mix of text + audio + video, or just regular old .tar files I suppose).
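
A toy sketch of that "switch dictionaries when the current one stops helping" loop, to show the control flow. Big caveats: it uses zlib's preset dictionaries (`zdict`) from the Python standard library as a stand-in for zstd, "training" is faked by reusing the raw chunk that compressed badly, and the 10% threshold and all function names are made up for illustration.

```python
import zlib


def compress_chunk(chunk: bytes, zdict: bytes = b"") -> bytes:
    """Compress one chunk independently, optionally primed with a dictionary."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(chunk) + comp.flush()


def decompress_chunk(blob: bytes, zdict: bytes = b"") -> bytes:
    """Inverse of compress_chunk; needs the same dictionary."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


def compress_adaptive(chunks, min_saving=0.10):
    """Return (dictionaries, [(dict_id, blob), ...]).

    dict_id 0 means "no dictionary". When the current dictionary saves less
    than min_saving over plain compression, the chunk is stored plain and
    becomes the next dictionary (a naive substitute for real training).
    """
    dictionaries = [b""]
    current = 0
    out = []
    for chunk in chunks:
        plain = compress_chunk(chunk)
        with_dict = compress_chunk(chunk, dictionaries[current])
        if current == 0 or len(with_dict) > (1 - min_saving) * len(plain):
            out.append((0, plain))
            dictionaries.append(chunk[-32768:])  # zlib's window is 32 KiB
            current = len(dictionaries) - 1
        else:
            out.append((current, with_dict))
    return dictionaries, out
```

Each chunk stays independently decodable given its `dict_id`, which is what a seekable format needs. A real version would have to account for the size of every dictionary it embeds; with incompressible data this sketch would happily start a new one on every chunk.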


ikawe 7 hours ago

Custom dictionaries are a feature of vanilla (non-seekable) zstd. As I understand it, every seekable zstd file is also a valid zstd file, so it should be possible?
https://github.com/facebook/zstd?tab=readme-ov-file#the-case...

rorosen 2 hours ago

Yes, dictionaries should be totally possible. To be honest, though, I've never tried them, because I usually only compress big files. They can be set on the (de)compression contexts the same way as with regular zstd.