Comment by FL33TW00D 20 hours ago

7 replies

Looking forward to the relegation of FASTQ and FASTA to the depths of hell where they belong. Incredibly inefficient and poorly designed formats.

jefftk 18 hours ago

How so? As long as you remove the hard wrapping and use compression aren't they in the same range as other options?

(I currently store a lot of data as FASTQ, and smaller file sizes could save us a bunch of money. But FASTQ + zstd is very good.)
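For anyone wondering what "remove the hard wrapping" amounts to in practice, here is a minimal sketch in Python. The function name `unwrap_fasta` is made up for illustration, and it assumes plain FASTA input where header lines start with `>`. It joins each record's wrapped sequence lines into a single line, which tends to make life easier for a general-purpose compressor such as zstd.

```python
from typing import Iterable, Iterator


def unwrap_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with each record's sequence joined onto one line.

    Hypothetical helper: assumes headers start with '>' and that every
    other non-empty line is sequence data belonging to the last header.
    """
    seq_parts = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # Flush the previous record's sequence before the new header.
            if seq_parts:
                yield "".join(seq_parts)
                seq_parts = []
            yield line
        elif line:
            seq_parts.append(line)
    # Flush the final record.
    if seq_parts:
        yield "".join(seq_parts)


if __name__ == "__main__":
    wrapped = [">read1\n", "ACGT\n", "ACGT\n", ">read2\n", "GG\n"]
    for out_line in unwrap_fasta(wrapped):
        print(out_line)
```

The output can then be piped through whatever compressor you already use; the sketch itself does no compression.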

  • FL33TW00D 15 hours ago
    • optionalsquid 9 hours ago

      The fact that these formats are unable to represent degenerate bases (Ns in particular, but also the remaining IUPAC bases) renders them, in my experience, unusable for many, if not most, use cases, including the storage of FASTQ data.

      • dwattttt 8 hours ago

        The question of how to represent things not specified in the original format is a tough one.

        At the loosest end, a format can leave lots of space for new symbols, and you can just use those to represent something new. But then not everyone agrees on what a new symbol means, and worse, multiple groups can use the same symbol to mean different things.

        On the other end of the spectrum, you can be strict about the format, and not leave space for new symbols. Then to represent new things you need a new standard, and people to agree on it.

        It's mostly a question of how well code can be updated and agreed upon, and how strict you can require your tooling to be w.r.t. formats.
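To make the strict-versus-loose trade-off concrete, here is a small Python sketch. The names `STRICT_ALPHABET`, `IUPAC_ALPHABET`, and `unknown_symbols` are invented for this example and do not come from any particular tool. A strict reader would reject anything outside its alphabet, while a looser one accepts the wider IUPAC nucleotide set.

```python
# Illustrative alphabets; the names are hypothetical, not from a real library.
STRICT_ALPHABET = frozenset("ACGT")
# IUPAC nucleotide codes, including the degenerate bases (N, R, Y, ...).
IUPAC_ALPHABET = frozenset("ACGTURYSWKMBDHVN")


def unknown_symbols(seq: str, alphabet: frozenset) -> set:
    """Return the symbols in seq that the given alphabet cannot represent."""
    return set(seq.upper()) - alphabet


if __name__ == "__main__":
    read = "ACGTNNRY"
    # A strict format has no way to store these symbols.
    print(sorted(unknown_symbols(read, STRICT_ALPHABET)))
    # The wider alphabet covers all of them.
    print(sorted(unknown_symbols(read, IUPAC_ALPHABET)))
```

Whether a tool should fail on the first set or silently accept it is exactly the agreement problem described above.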

  • fwip 16 hours ago

    There are a few options out there that have noticeably better compression, with the downside of being less widely compatible with tools. zstd also has the benefit of being very fast (depending on your settings, of course).

    CRAM compresses unmapped fastq pretty well, and can do even better with reference-based compression. If your institution is okay with it, you can see additional savings by quantizing quality scores (modern Illumina sequencers already do this for you). If you're aligning your data anyways, probably retaining just the compressed CRAM file with unmapped reads included is your best bet.
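As a rough illustration of what quantizing quality scores means, here is a Python sketch that maps each Phred+33 quality character onto a small set of representative values. The bin edges and representative scores below are assumptions chosen for the example, not the scheme of any specific sequencer or tool, and `quantize_quality` is a made-up name.

```python
import bisect

# Hypothetical bins, for illustration only.
# BIN_UPPER holds the inclusive upper bound of every bin except the last.
BIN_UPPER = [9, 19, 24, 29, 34, 39]
# One representative score per bin; the last entry covers scores of 40 and up.
BIN_VALUE = [6, 15, 22, 27, 33, 37, 40]


def quantize_quality(qual: str, offset: int = 33) -> str:
    """Replace each quality character with its bin's representative score.

    Assumes Phred+33 encoding by default (score = ord(char) - 33).
    """
    out = []
    for ch in qual:
        score = ord(ch) - offset
        # Index of the first bin whose upper bound is >= score.
        idx = bisect.bisect_left(BIN_UPPER, score)
        out.append(chr(BIN_VALUE[idx] + offset))
    return "".join(out)


if __name__ == "__main__":
    print(quantize_quality("IIIHGF5+!"))
```

With fewer distinct symbols in the quality line, a compressor has much less entropy to encode, which is where the extra savings come from. The cost is that the original scores cannot be recovered.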

    There are also other fasta/fastq-specific tools like fqzcomp or MZPAQ. Last I checked, both of these could about halve the size of our fastq.gz files.