Comment by Kim_Bruning 19 hours ago
Now I'm wondering why this works. DNA clearly has some interesting redundancy strategies. (it might also depend on genome?)
I think one important factor you missed is frameshifting. Compression algorithms work on bytes (8 bits). Imagine the exact same sequence occurring at two different offsets mod 4: the packed encoding will produce completely different bytes, and the compression algorithm will be unable to make use of the repetition.
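A tiny sketch of the frameshift point (illustrative code, not from the dataset; it assumes the 2-bits-per-base packing discussed further down): the same bases packed one offset apart produce entirely different bytes.

```python
# Pack a DNA string at 2 bits per base, then show that the same
# sequence at different offsets mod 4 yields unrelated byte streams.

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack_2bit(seq: str) -> bytes:
    """Pack 4 bases per byte, MSB first, padding the tail with A (0b00)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4].ljust(4, "A")
        b = 0
        for base in chunk:
            b = (b << 2) | CODE[base]
        out.append(b)
    return bytes(out)

motif = "GATTACA" * 8           # the repeated subsequence
a = pack_2bit(motif)            # offset 0 mod 4
b = pack_2bit("C" + motif)      # offset 1 mod 4: every byte shifts
print(a.hex())
print(b.hex())
# The two byte streams have no alignment in common, so a byte-oriented
# matcher sees no repetition even though the underlying bases are identical.
```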
Some are bit-by-bit (e.g. the PPM family of compressors[1]), but the normal input granularity for most compressors is a byte. (There are even specialized ones that work on e.g. 32 bits at a time.)
[1] Many of the context models in a typical PPM compressor will be byte-by-byte, so even that isn't fully clear-cut.
They output a bitstream, yeah, but I don't know of anything general-purpose that effectively consumes units smaller than bytes (unless you count the various specialized handlers inside general-purpose compression algorithms, e.g. for long lists of floats).
This is a dataset of bacterial DNA. Any two related bacteria will have long strings of the same letters. But it won't be neatly aligned, so the line breaks will mess up pattern matching.
Exactly. The line breaks interrupt the runs of otherwise identical bytes in identical sequences. Unless two identical subsequences are exactly in phase with respect to their line breaks, the hashes used for long-range matching differ even though the underlying sequences are the same.
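A quick way to see the effect (a sketch: zlib stands in for whatever long-range matcher is actually used, and the 60-column wrap and prefix length are my own assumptions): compress two copies of the same sequence whose line breaks are out of phase, then again with the newlines stripped.

```python
import random
import textwrap
import zlib

random.seed(0)
seq = "".join(random.choice("GATC") for _ in range(10_000))

# Two FASTA-style bodies containing the same sequence, but the second
# starts with a short prefix, so its 60-column line breaks fall out of
# phase with the first body's.
rec1 = "\n".join(textwrap.wrap(seq, 60))
rec2 = "\n".join(textwrap.wrap("GATTACA" + seq, 60))

wrapped = (rec1 + "\n" + rec2).encode()
stripped = wrapped.replace(b"\n", b"")

# With newlines, matches between the two copies keep hitting a newline
# at a different position, so they stay short. Without newlines, the
# second copy is one long back-reference and compresses far better.
print(len(zlib.compress(wrapped, 9)))
print(len(zlib.compress(stripped, 9)))
```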
The FASTA format stores nucleotides in text form... compression is used to make this tractable at genome sizes, but it's by no means perfect.
Depending on what you need to represent, you can get a 4x reduction in data size without compression at all, just by representing each of G, A, T, C with 2 bits rather than 8.
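A minimal round-trip sketch of the 2-bits-per-base idea (illustrative names; it handles only the four plain bases, not N or other IUPAC codes):

```python
# Pack G/A/T/C into 2 bits each and unpack again, checking the 4x.

B2 = {"G": 0, "A": 1, "T": 2, "C": 3}
BASES = "GATC"

def pack(seq):
    bits = 0
    for base in seq:
        bits = (bits << 2) | B2[base]
    # Return the length too, so trailing padding isn't ambiguous on unpack.
    return len(seq), bits.to_bytes((2 * len(seq) + 7) // 8, "big")

def unpack(n, data):
    bits = int.from_bytes(data, "big")
    return "".join(BASES[(bits >> (2 * i)) & 3] for i in range(n))[::-1]

n, packed = pack("GATTACA" * 100)
assert unpack(n, packed) == "GATTACA" * 100
print(len("GATTACA" * 100), "->", len(packed))  # prints: 700 -> 175
```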
Compression on top of that "should" result in the same compressed size as the original text (after all, the "information" being compressed is the same), except that compression isn't perfect.
Newlines are an example of something that's "information" in the text format that isn't relevant, yet the compression scheme didn't know that.