Comment by leobuskin

Comment by leobuskin 15 hours ago

2 replies

I don’t think the size of content matters, it’s all about patterns (and their repetitiveness) within, and FASTA is a great target, if I understand the format correctly

kevincox 5 hours ago

Size of content does matter. Because the start of your content effectively builds up a dictionary. Once you have built a custom dictionary from the content you are compressing the initial dictionary is no longer relevant. So a dictionary effectively only helps at the start of the content. Exactly what "start" means will depend on your data and the algorithm but as the size of the data you compress grows the relative benefit of the dictionary drops.

Or another way to look at this is that the total bytes saved of the dictionary will plateau. So your dictionary may save 50% of the first MB, 10% of the next MB and 5% of the rest of the first 10MB. It matters a lot if you are compressing 2MB of data (7% savings!) but not so much if you are compressing 1GB (<1%).

privatelypublic 14 hours ago

I tried building a zstd dictionary for something(compressing many instances of the same Mono(.net) binary serialized class, mostly identical), and in this case it provided no real advantage. Honestly, I didn't dig into it too much, but will give --long a try shortly.

PS: what the author implicitly suggests cannot be replaced with zstd tweaks. It'll be interesting to look at the file in imhex- especially if I can find an existing Pattern File.