boothby 14 hours ago

Spend a few years handling data in arcane, one-off, and proprietary file formats conceived by "brilliant" programmers with strong CS backgrounds and you might reconsider the conclusion you've come to here.

  • dwattttt 6 hours ago

    This is a presentation problem, or possibly a lack of tooling problem.

    A binary format with a tool that renders it to text works the same as a text format; if the rendering is lossless, you could even consume the text format rather than the binary.

    A "text" format is built to be understandable, but that's not a requirement; you could write a text format that isn't descriptive, and you'd have just as much trouble understanding what 'A' means as you would understanding what 'C0' means for a binary format.

    Undocumented formats are a pain, whether they're in text or binary.
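
To make the "lossless rendering" point concrete, here is a minimal Python sketch. The record layout (a 1-byte kind code followed by a 4-byte little-endian length) and the names `RECORD`, `render`, and `parse` are invented for illustration and do not correspond to any real format.

```python
import struct

# Hypothetical two-field binary record: a 1-byte kind code and a 4-byte
# little-endian length. The layout is an assumption made for this example.
RECORD = struct.Struct("<BI")

def render(blob: bytes) -> str:
    """Render one binary record as a descriptive line of text."""
    kind, length = RECORD.unpack(blob)
    return f"kind=0x{kind:02X} length={length}"

def parse(text: str) -> bytes:
    """Rebuild the exact bytes from the rendered text."""
    fields = dict(item.split("=") for item in text.split())
    return RECORD.pack(int(fields["kind"], 16), int(fields["length"]))

blob = bytes([0xC0, 0x10, 0x00, 0x00, 0x00])
text = render(blob)
assert parse(text) == blob  # the rendering loses no information
```

Because `parse(render(blob))` returns the original bytes, a tool could consume either representation. The bare byte `C0` only becomes meaningful once something documents it as a "kind" field, which is the documentation point made above.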

semiinfinitely 17 hours ago

other file formats that rival fasta in stupidity include fastq pdb bed sam cram vcf. further reading [1]

> "intentionally or not, bioinformatics found a way to survive: obfuscation. By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques"

1. https://madhadron.com/science/farewell_to_bioinformatics.htm...

  • jakobnissen 17 hours ago

    SAM is not a bad file format. What's bad about SAM?

    • optionalsquid 15 hours ago

      I don't dislike the format, and it is much, much better than what it replaced, but SAM, and its binary sister-format BAM, do have some flaws:

      - The original index format could not handle large chromosomes, so now there are two index formats: .bai and .csi

      - For BAM, the CIGAR (alignment description) operation count is limited to 16 bits, which means that very long alignments cannot be represented. One workaround I've seen (but thankfully not used) is saving the CIGAR as a string in a tag

      - SAM cannot unambiguously represent sequences with only a single base (e.g. after trimming), since a '*' in the quality column can be interpreted either as a single Phred score (9) or as a special value meaning "no qualities". BAM can represent such sequences unambiguously, but most tools output SAM

      • jakobnissen 11 hours ago

        True. I'd consider these minor flaws. W.r.t. the CIGAR, the spec says you do need to store it as a tag.
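
The '*' ambiguity described above can be sketched in a few lines of Python, assuming the usual Phred+33 encoding of the SAM quality column. The function name `decode_qual` is hypothetical, and treating a lone '*' on a one-base read as a score is an arbitrary choice made for this sketch, not something the format settles.

```python
def decode_qual(qual: str, seq_len: int):
    """Decode a SAM QUAL field under Phred+33; None means qualities absent."""
    if qual == "*" and seq_len != 1:
        return None  # unambiguous: a lone '*' cannot cover several bases
    # Each character encodes one score as its ASCII code minus 33.
    return [ord(c) - 33 for c in qual]

print(decode_qual("IIII", 4))  # [40, 40, 40, 40]
print(decode_qual("*", 4))     # None
print(decode_qual("*", 1))     # [9], though "no qualities" is equally valid
```

For a one-base read, the text alone gives a parser nothing to distinguish the two readings, so any choice here is a guess.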

totalperspectiv 15 hours ago

> a testament to the massive gap in perceived vs actual programming ability of the average bioinformatician.

This is not really a fair statement. Literally all of software bears the weight of some early poor choice that then keeps moving forward via weight of momentum. FASTA and FASTQ formats are exceptionally dumb though.

Fraterkes 17 hours ago

I’ll do you the immense favor of taking the bait. What’s so bad about it?

StillBored 12 hours ago

I think the prevalence of the format vs something more widely used should be part of that metric.

On those grounds, the lack of pre-tokenization in html/css/js ranks at this point as a planet-killing level of poor choices.

fwip 14 hours ago

It might be the stupidest, but stupid in the sense of "the simplest thing that could possibly work."

When FASTA was invented, Sanger sequencing reads would be around a thousand bases in length. Even back then, disk space wasn't so precious that you couldn't spend several kilobytes on the results of your experiment. Plus, being able to view your results with `more` is a useful feature when you're working with data of that size.

And, despite its simplicity, it has worked for forty years.

  • michaelhoffman 13 hours ago

    When FASTA was invented in 1985, generally sequencing reads would be about half that.

    The simplicity of FASTA seems like a dream compared to the GenBank flat file format used before then. And around the year 2000, less computationally-inclined scientists were storing sequence in Microsoft Word binary .doc files.

    A lot of file formats (including bioinformatics formats!) have come and gone in that time period. I don't think many would design it this way today, but it has a lot of nice features like the ones you point out that led to its longevity.

  • melagonster 2 hours ago

    Yes, if someone wants to, they can do many analyses with grep!

  • attractivechaos 13 hours ago

    FASTA was invented in the late 1980s. At that time, Unix tools often limited line length. Even in the early 2000s, some Unix tools (on AIX, as I remember) still had this limit.
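
As a rough illustration of the simplicity discussed in this subthread, here is a minimal Python sketch of a FASTA reader that joins wrapped sequence lines, plus a record count equivalent to `grep -c '^>'`. The function name `read_fasta` and the two example records are made up for illustration, and the sketch ignores edge cases that real parsers handle.

```python
from typing import Iterable, Iterator, Tuple

def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical input; the first sequence is wrapped across two lines.
example = """>seq1 hypothetical record
ACGTACGT
ACGT
>seq2
GGCC
""".splitlines()

records = list(read_fasta(example))
# Counting header lines is the equivalent of `grep -c '^>'`.
n_records = sum(1 for line in example if line.startswith(">"))
```

The line wrapping that old tools required is the only real complication: a sequence may span several lines, so a reader has to accumulate lines until the next `>` header.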