Comment by tialaramex 9 hours ago
It might be worth (in some other context) introducing a pre-processing step which handles this at both ends. I'm thinking of something like PNG: the PNG compression is "just" zlib, which wouldn't do a great job on raw RGBA, but there's a per-row filter step first, so e.g. we can store just the difference from the row above. Now big areas of block colour or vertical stripes are mostly zeros, and those compress well.
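Roughly, the "Up" filter idea looks like this (a minimal Python sketch, not real PNG code; the toy gradient image is made up purely for illustration):

```python
import zlib

def up_filter(rows):
    """Replace each row with its byte-wise difference from the row above."""
    prev = bytes(len(rows[0]))  # the row "above" the first row is all zeros
    out = []
    for row in rows:
        out.append(bytes((b - p) & 0xFF for b, p in zip(row, prev)))
        prev = row
    return out

def up_unfilter(filtered):
    """Invert up_filter by adding each delta row to the reconstructed previous row."""
    prev = bytes(len(filtered[0]))
    out = []
    for delta in filtered:
        row = bytes((d + p) & 0xFF for d, p in zip(delta, prev))
        out.append(row)
        prev = row
    return out

# Toy "image": a vertical gradient, so every raw row is different but each
# row differs from the one above by a constant amount.
base = bytes([200, 10, 10, 255] * 16)
rows = [bytes((b + i) & 0xFF for b in base) for i in range(256)]

raw = b"".join(rows)
filtered = b"".join(up_filter(rows))
assert b"".join(up_unfilter(up_filter(rows))) == raw
print("raw:     ", len(zlib.compress(raw, 9)))
print("filtered:", len(zlib.compress(filtered, 9)))
```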
Guessing which PNG filters to use can make a huge difference to compression with only a tiny change to write speed. Or (like Adobe 20+ years ago) you can screw it up and get worse compression and slower speeds. These days brutal "try everything" modes exist which can squeeze out those last few bytes by trying even the unlikeliest combinations.
I can imagine a filter layer which says this textual data comes in 78-character blocks punctuated with \n, so we strip those out and then compress; in the opposite direction we decompress and then put the newlines back.
For FASTA we can just unconditionally remove the extra newlines, but that won't hold for most other inputs, so the filters would help there.
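As a rough sketch of that filter (assuming a fixed 78-column wrap and nothing cleverer, like detecting the width or handling ragged lines):

```python
import zlib

WIDTH = 78  # assumed fixed line width

def strip_newlines(text: bytes) -> bytes:
    """Filter step before compression: drop the periodic newlines."""
    return text.replace(b"\n", b"")

def restore_newlines(data: bytes, width: int = WIDTH) -> bytes:
    """Inverse filter after decompression: re-wrap at the known width."""
    lines = [data[i:i + width] for i in range(0, len(data), width)]
    return b"\n".join(lines) + b"\n"

original = b"ACGT" * 200                       # pretend sequence data, no newlines
wrapped = restore_newlines(original)           # the 78-column wrapped form on disk
compressed = zlib.compress(strip_newlines(wrapped), 9)
roundtrip = restore_newlines(zlib.decompress(compressed))
assert roundtrip == wrapped
```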
For one approach to compressing FASTQ (which is a series of 3 distinct lines), we broke the distinct lines out into their own streams and then compressed each stream independently. That way the coder for the header line, sequence line, and error line could learn the model of that specific line type and get slightly better compression (not unlike a columnar format, although in this case we simply combined "blocks" of streams together with a record separator).
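Something in the spirit of that splitting, as a minimal Python sketch (not the actual tool; it assumes standard 4-line FASTQ records and drops the '+' separator line for brevity):

```python
import zlib

def split_streams(fastq_text: str):
    """Route each FASTQ line type into its own stream."""
    headers, seqs, quals = [], [], []
    lines = fastq_text.splitlines()
    for i in range(0, len(lines), 4):   # 4 lines per record: @header, seq, +, quality
        headers.append(lines[i])
        seqs.append(lines[i + 1])
        quals.append(lines[i + 3])
    return ("\n".join(headers), "\n".join(seqs), "\n".join(quals))

def compressed_size(parts):
    """Total size when each part is compressed independently."""
    return sum(len(zlib.compress(p.encode(), 9)) for p in parts)

# Made-up FASTQ data just to exercise the code.
fastq = "".join(
    f"@read{i} lane=1\nACGTACGTGGTTAACC\n+\nIIIIHHHHGGGGFFFF\n"
    for i in range(1000)
)
print("interleaved:", compressed_size([fastq]))
print("per-stream: ", compressed_size(split_streams(fastq)))
```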
I'm still pretty amazed that periodic newlines hurt compression ratios so much, given the compressor can use both Huffman coding and a lookback dictionary.
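A quick way to see it for yourself (a rough sketch; random data is a worst case for both forms, so real sequence data will behave somewhat differently):

```python
import random
import zlib

random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(1_000_000))
wrapped = "\n".join(seq[i:i + 60] for i in range(0, len(seq), 60))

# Compare zlib output for the unwrapped run vs. the 60-column wrapped form.
print("unwrapped:", len(zlib.compress(seq.encode(), 9)))
print("wrapped:  ", len(zlib.compress(wrapped.encode(), 9)))
```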
The best rule in sequence data storage is to store as little of it as possible.