Comment by ashvardanian 18 hours ago

Nice observation!

Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)

I’ve worked with large genomic datasets on my own dime, and the default formats show their limits quickly. With FASTA, the first step for me is usually conversion: split headers from sequences, store them in Arrow-like tapes for CPU/GPU processing, and persist as Parquet when needed. It’s straightforward, but surprisingly underused in bioinformatics: most pipelines stick to plain text even when modern data tooling would make things much easier :(
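
A rough, stdlib-only sketch of that conversion step, in case it helps. The `Tape` class and `fasta_to_tapes` function are names made up for this illustration, not from any library; the layout (one contiguous byte buffer plus an offsets array) is assumed to mirror Arrow's variable-length binary columns, and multi-line sequences are simply concatenated.

```python
from array import array
from dataclasses import dataclass, field
from typing import Iterable, Tuple


@dataclass
class Tape:
    """Hypothetical Arrow-style string column: one byte buffer plus offsets."""
    data: bytearray = field(default_factory=bytearray)
    offsets: array = field(default_factory=lambda: array("Q", [0]))

    def append(self, value: bytes) -> None:
        self.data.extend(value)
        self.offsets.append(len(self.data))

    def __len__(self) -> int:
        return len(self.offsets) - 1

    def __getitem__(self, i: int) -> bytes:
        # Non-negative indices only; entry i spans offsets[i]..offsets[i + 1].
        return bytes(self.data[self.offsets[i]:self.offsets[i + 1]])


def fasta_to_tapes(lines: Iterable[bytes]) -> Tuple[Tape, Tape]:
    """Split FASTA lines into a header tape and a sequence tape."""
    headers, sequences = Tape(), Tape()
    open_record = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(b">"):
            if open_record:
                # Close the previous sequence before starting a new record.
                sequences.offsets.append(len(sequences.data))
            headers.append(line[1:])
            open_record = True
        elif open_record:
            # Sequence lines of one record are joined without newlines.
            sequences.data.extend(line)
    if open_record:
        sequences.offsets.append(len(sequences.data))
    return headers, sequences
```

Usage would be something like `headers, seqs = fasta_to_tapes(open("genome.fa", "rb"))`, where the file name is a placeholder. The two buffers per tape are the kind of thing that can be handed to an Arrow library and written out as Parquet; that part is left out here to keep the sketch dependency-free.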

jltsiren 17 hours ago

Basic text formats persist because everyone supports them. Many tools have better file formats for internal purposes, but they are rarely flexible enough and robust enough for wider use. There are occasional proposals for better general-purpose formats, but the people proposing them rarely agree on which of the competing proposals should be adopted. And even if they manage to agree, they probably don't have the time and the money to make it actually happen.

  • vintermann 15 hours ago

    Also for historical reasons I think, since Perl used to be the big bioinformatics language, and it is surprisingly hard to compete with in string handling.

    • lazide 14 hours ago

      Perl+strings really is one of those ‘unreasonably effective’ combinations.

      It feels like benzene in some ways. Use it correctly and gdamn. Just don’t huff it - I mean - use it for your enterprise backend - and it’s worth it.

bede 16 hours ago

Yes, when doing anything intensive with lots of sequences it generally makes sense to liberate them from FASTA as early as possible and index them somehow. But as an interchange format FASTA seems quite sticky. I find the pervasiveness of fastq.gz particularly unfortunate with Gzip being as slow as it is.

> Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)

I even confused myself about this while writing :-)

  • chrchang523 10 hours ago

    Note that BGZF solves gzip’s speed problem (libdeflate + parallel compression/decompression) without breaking compatibility, and usually the hit to compression ratio is tolerable.
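
To illustrate why that stays compatible: as far as I understand the format, a BGZF file is a series of small, independent gzip members, each carrying a `BC` extra subfield that records the block size. A minimal sketch under that assumption follows. It uses Python's `zlib` rather than libdeflate, a thread pool stands in for the parallel compression, and `bgzf_block`, `bgzf_compress`, and `BLOCK_INPUT` are names invented for this example.

```python
import gzip
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor

# Uncompressed bytes per block, chosen so a whole block stays under 64 KiB.
BLOCK_INPUT = 65280


def bgzf_block(chunk: bytes, level: int = 6) -> bytes:
    """Wrap one chunk as a gzip member with a BGZF-style 'BC' extra field."""
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)  # raw deflate stream
    body = comp.compress(chunk) + comp.flush()
    total = 12 + 6 + len(body) + 8  # gzip header + extra field + data + trailer
    # ID1, ID2, CM=deflate, FLG=FEXTRA, MTIME=0, XFL=0, OS=unknown, XLEN=6
    header = struct.pack("<4BI2BH", 31, 139, 8, 4, 0, 0, 255, 6)
    # Subfield 'B','C', payload length 2, payload = total block size minus 1
    extra = struct.pack("<2BHH", 66, 67, 2, total - 1)
    trailer = struct.pack("<II", zlib.crc32(chunk), len(chunk))
    return header + extra + body + trailer


def bgzf_compress(data: bytes, workers: int = 4) -> bytes:
    """Compress blocks independently, then append an empty end-of-file block."""
    chunks = [data[i:i + BLOCK_INPUT] for i in range(0, len(data), BLOCK_INPUT)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        blocks = list(pool.map(bgzf_block, chunks))  # map preserves block order
    return b"".join(blocks) + bgzf_block(b"")


if __name__ == "__main__":
    payload = b"@read1\nACGTACGT\n+\nIIIIIIII\n" * 10000
    packed = bgzf_compress(payload)
    # A plain gzip reader sees ordinary concatenated members.
    assert gzip.decompress(packed) == payload
```

Because every block is a complete gzip member, ordinary gzip readers decompress the file as usual, while block-aware tools can work on blocks independently. Restarting the compressor for every block is presumably where the hit to compression ratio comes from.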