Comment by leobuskin
What about a specialized dict for FASTA? Shouldn't it increase ZSTD compression significantly?
From what I've read (though I haven't tested it myself and can't find my source), dictionaries aren't very useful when the dataset is big, and just using '--long' covers that improvement.
Have any of you tested it?
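For anyone who wants to test it, here's a rough sketch using the zstd CLI (the file names, dictionary size, and --long window are made-up placeholders; `--train` wants a reasonably large set of representative sample files):

```sh
# Train a dictionary from representative samples (hypothetical paths/sizes)
zstd --train samples/*.fasta -o fasta.dict --maxdict=112640

# Compress the same input with the dictionary vs. with long-distance matching
zstd -19 -D fasta.dict big.fasta -o big.dict.zst
zstd -19 --long=27     big.fasta -o big.long.zst
ls -l big.dict.zst big.long.zst

# Decompressing the dictionary output needs -D fasta.dict again, and
# decompressing the --long output needs a matching window:
#   zstd -d --long=27 big.long.zst
```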
The size of the content does matter, because the start of your content effectively builds up a dictionary of its own. Once the compressor has built a custom dictionary from the content it is compressing, the initial dictionary is no longer relevant. So a dictionary effectively only helps at the start of the content. Exactly what "start" means depends on your data and the algorithm, but as the size of the data you compress grows, the relative benefit of the dictionary drops.
Or, another way to look at this: the total bytes saved by the dictionary will plateau. Say your dictionary saves 50% of the first MB, 10% of the next MB, and 5% of the rest of the first 10MB, about 1MB in total. That matters a lot if you are compressing 2MB of data (0.6MB saved, a 30% saving!) but not so much if you are compressing 1GB (<1%).
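One crude way to see this plateau on your own data (a sketch, assuming GNU coreutils and the hypothetical fasta.dict trained above) is to compress growing prefixes of a file with and without the dictionary and watch the gap shrink:

```sh
for mb in 1 2 10 100; do
  head -c ${mb}M big.fasta > prefix.fasta
  zstd -q -f -19               prefix.fasta -o plain.zst
  zstd -q -f -19 -D fasta.dict prefix.fasta -o dict.zst
  echo "${mb}MB: plain=$(stat -c%s plain.zst) dict=$(stat -c%s dict.zst)"
done
```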
I tried building a zstd dictionary for something (compressing many instances of the same Mono (.NET) binary-serialized class, mostly identical), and in that case it provided no real advantage. Honestly, I didn't dig into it too much, but I'll give --long a try shortly.
PS: what the author implicitly suggests cannot be replaced with zstd tweaks. It'll be interesting to look at the file in ImHex, especially if I can find an existing pattern file for it.
Yes, I'd expect a dict-based approach to do better here; that's probably how it should be done. But --long is compelling for me because it requires almost no effort, it's still very fast, and it can dramatically improve the compression ratio.