Comment by jefftk 18 hours ago

The FASTA format looks like:

    > title
    bases with optional newlines
    > title
    bases with optional newlines
    ...
The author is talking about removing the non-semantic optional newlines (hard wrapping), not all the newlines in the file.

It makes a lot of sense that this would work: bacteria have many subsequences in common, but if you insert non-semantic newlines at effectively random offsets then compression tools will not be able to use the repetition effectively.
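In code, the unwrapping being described might look something like this (a sketch; `unwrap_fasta` is a made-up helper, not a tool from the article):

```python
def unwrap_fasta(text):
    """Join the hard-wrapped sequence lines of each FASTA record,
    leaving one title line and one sequence line per record."""
    out, seq_parts = [], []
    for line in text.splitlines():
        if line.startswith(">"):
            if seq_parts:  # flush the previous record's bases
                out.append("".join(seq_parts))
                seq_parts = []
            out.append(line)
        else:
            seq_parts.append(line)
    if seq_parts:
        out.append("".join(seq_parts))
    return "\n".join(out) + "\n"
```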

LeifCarrotson 10 hours ago

In case "bases with optional newlines" wasn't obvious to anyone else, a specific example (from Wikipedia) is:

    ;LCBO - Prolactin precursor - Bovine
    MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
    EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
    VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
    ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*
where "SS...EM", "HL...VT", or "ED...AR" may be common subsequences, but the plaintext file arbitrarily wraps at column 65 so it renders nicely on a DEC VT100 terminal from the 70s.

Or, for an even simpler example:

    ; plaintext
    GATTAC
    AGATTA
    CAGATT
    ACAGAT
    TACAGA
    TTACAG
    ATTACA
becomes, on disk, something like

    ; plaintext\r\nGATTAC\r\nAGATTA\r\nCAGATT\r\nACAGAT\r\nTACAGA\r\nTTACAG\r\nATTACA\r\n
which is hard to compress, while

    ; plaintext\r\nGATTACAGATTACAGATTACAGATTACAGATTACAGATTACA
is just

    "; plaintext\r\n" + "GATTACA" * 6
and then, if you want, you can reflow the text when it's time to render to the screen.
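To make the effect measurable, here is a small experiment (hypothetical, not from the thread): compress the same repetitive sequence once hard-wrapped and once as a single line, and compare the zlib output sizes.

```python
import textwrap
import zlib

# The same highly repetitive "sequence", stored two ways.
seq = "GATTACA" * 10_000
flat = seq.encode()                                   # one long line
wrapped = "\n".join(textwrap.wrap(seq, 60)).encode()  # hard-wrapped at 60 columns

flat_size = len(zlib.compress(flat, 9))
wrapped_size = len(zlib.compress(wrapped, 9))
# The unwrapped version compresses better: the inserted newlines break
# up the period-7 repeats into a much longer effective period.
```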
  • tgtweak 8 hours ago

    Feels like this could be an extension to the compression library (which could then retain the newlines as such) rather than requiring external document tailoring. It also feels like a very specific use case, but this optimization might have larger applications outside this narrow field/format.

  • Terr_ 7 hours ago

    Huh, so in other words: "If you don't arbitrarily interrupt continuous sequences of data with cosmetic noise, they compress better."

bede 17 hours ago

Thank you for clarifying this – yes the non-semantic nature of these particular line breaks is a key detail I omitted.

  • tialaramex 9 hours ago

    It might be worth (in some other context) introducing a pre-processing step which handles this at both ends. I'm thinking of PNG: the PNG compression is "just" zlib, which wouldn't do a great job on raw RGBA. However, there's a per-row filter step first, so e.g. we can store each row as just its difference from the row above; then big areas of block colour or vertical stripes become mostly zeros, and those compress well.

    Guessing which PNG filters to use can make a huge difference to compression with only a tiny change to write speed. Or (like Adobe 20+ years ago) you can screw it up and get worse compression and slower speeds. These days brutal "try everything" modes exist which can squeeze out those last few bytes by trying even the unlikeliest combinations.
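A toy illustration of that filtering idea (not PNG's actual encoder; the image data here is invented): delta-filter each row against the one above, so zlib sees tiny values instead of drifting ones.

```python
import random
import zlib

random.seed(0)
W, H = 200, 200

# A grayscale "image" where each row drifts slightly from the one above:
# plain zlib finds few repeats, but the row-to-row deltas are tiny.
rows = [bytes(random.randrange(256) for _ in range(W))]
for _ in range(H - 1):
    rows.append(bytes((p + random.randrange(4)) % 256 for p in rows[-1]))
raw = b"".join(rows)

def up_filter(img, width):
    """PNG-style "Up" filter: store each row as its byte-wise
    difference (mod 256) from the row above."""
    out = bytearray(img[:width])  # first row has no predecessor
    for i in range(width, len(img), width):
        out += bytes((img[i + x] - img[i - width + x]) % 256 for x in range(width))
    return bytes(out)

raw_size = len(zlib.compress(raw, 9))
filtered_size = len(zlib.compress(up_filter(raw, W), 9))
```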

    I can imagine a filter layer which says this textual data comes in 78 character blocks punctuated with \n so we're going to strip those out, then compress and in the opposite direction we decompress then put back the newlines.

    For FASTA we can just unconditionally choose to remove the extra newlines but that may not be true for most inputs, so the filters would help there.
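Such a filter pair might be sketched like this (assuming the input really is uniformly wrapped at one fixed width, so only that width needs to be stored alongside the compressed data):

```python
def strip_wrap(text, width):
    """Forward filter: drop the hard wrap; only the column width
    needs to be remembered to invert this."""
    lines = text.split("\n")
    assert all(len(line) == width for line in lines[:-1]), "not uniformly wrapped"
    return "".join(lines)

def restore_wrap(flat, width):
    """Inverse filter: reinsert a newline every `width` characters."""
    return "\n".join(flat[i:i + width] for i in range(0, len(flat), width))
```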

    • dekhn 6 hours ago

      For one approach to compressing FASTQ (which is a series of three distinct line types), we broke the distinct lines out into their own streams and then compressed each stream independently. That way the coder for the header line, sequence line, and error line could learn the model of that specific line type and get slightly better compression (not unlike a columnar format, although in this case we simply combined "blocks" of streams together with a record separator).

      I'm still pretty amazed that periodic newlines hurt compression ratios so much, given that the compressor can use both Huffman coding and a lookback dictionary.

      The best rule in sequence data storage is to store as little of it as possible.
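The per-field splitting described above could be sketched as follows (hypothetical code, not dekhn's actual implementation; it assumes well-formed 4-line FASTQ records, where the third `+` line carries no information):

```python
import zlib

def split_streams(fastq_text):
    """Split 4-line FASTQ records into per-field streams so each
    compressor only ever sees one kind of line."""
    lines = fastq_text.splitlines()
    headers, seqs, quals = [], [], []
    for i in range(0, len(lines), 4):
        headers.append(lines[i])      # @ header line
        seqs.append(lines[i + 1])     # base sequence line
        # lines[i + 2] is the '+' separator and carries no information
        quals.append(lines[i + 3])    # quality / "error" line
    return ["\n".join(s).encode() for s in (headers, seqs, quals)]

def compress_columnar(fastq_text):
    # Compress each stream independently, like a columnar format.
    return [zlib.compress(stream, 9) for stream in split_streams(fastq_text)]
```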

AndrewOMartin 17 hours ago

The compression ratio would likely skyrocket if you sorted the list of bases.

  • shellfishgene 17 hours ago

    You're joking, but a few bioinformatics tools use the Burrows-Wheeler transform to save memory, which is a bit like sorting the bases.
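For reference, a naive version of the Burrows-Wheeler transform (quadratic and for illustration only; real tools use suffix-array-based constructions):

```python
def bwt(s):
    """Burrows-Wheeler transform: sort every rotation of s + '$' and
    take the last column, which groups characters that share similar
    right-contexts -- making the output more compressible."""
    s = s + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)
```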

    • jefftk 16 hours ago

      You can also improve compression by reordering the sequences within the FASTA file, as long as you're using it as a dictionary and not a list of title-sequence pairs.

mylons 12 hours ago

this is also the insight that the bwa developer had: to use the burrows-wheeler transform, which is part of bzip2, due to its compression properties being particularly good for genomic sequences.

  • dekhn 6 hours ago

    I once had the distinct pleasure of hosting the author of BWA (R. Durbin) at Google, and pointing out "That's Mike Burrows, over there, next to Jeff Dean and Sanjay Ghemawat". That led to an interesting discussion between Durbin and Dean on DNA sequence compression. It's not the first time I've been in a room with a bunch of geniuses and simply kept my mouth shut so nobody would know I'm an idiot.

amelius 16 hours ago

This was a question that I thought was interesting enough to test ChatGPT with.

Surprisingly, it gave an answer along the lines of the parent comment.

However, it seems it didn't figure this out by itself but it says:

> It’s widely known in bioinformatics that “one-line Fasta” files compress much better with LZ-based algorithms, and this is discussed in forums, papers, and practical guides on storing genomic data.
