Comment by semiinfinitely

Comment by semiinfinitely 18 hours ago

FASTA is a candidate for the stupidest file format ever invented and a testament to the massive gap in perceived vs actual programming ability of the average bioinformatician.

boothby 14 hours ago

Spend a few years handling data in arcane, one-off, and proprietary file formats conceived by "brilliant" programmers with strong CS backgrounds and you might reconsider the conclusion you've come to here.

dwattttt 6 hours ago

This is a presentation problem, or possibly a lack of tooling problem.
A binary format with a tool that renders it to text works the same as a text format; if the rendering is lossless, you could even consume the text format rather than the binary.
A "text" format is built to be understandable, but that's not a requirement; you could write a text format that isn't descriptive, and you'd have just as much trouble understanding what 'A' means as you would understanding what 'C0' means for a binary format.
Undocumented formats are a pain, whether they're in text or binary.
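To make the "lossless rendering" point concrete, here is a minimal sketch of a made-up 2-bit packing for DNA that round-trips to text. The code table and the names `pack` and `unpack` are assumptions for illustration only, not any existing format, and it only handles uppercase A, C, G, T.

```python
# Hypothetical 2-bit codes, chosen for illustration only (not a real format).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> tuple[int, bytes]:
    """Pack four bases per byte; return (base_count, packed_bytes)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so decoding is position-independent.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return len(seq), bytes(out)


def unpack(count: int, data: bytes) -> str:
    """Render the packed bytes back to text, dropping padding bases."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:count])


count, blob = pack("TAAA")
print(blob.hex())            # the byte is only meaningful if you know the table
print(unpack(count, blob))   # the text rendering recovers the input
```

Without the code table, the packed byte tells you nothing; with it, the text and binary forms carry the same information, which is the documentation point made above.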

semiinfinitely 17 hours ago

other file formats that rival fasta in stupidity include fastq, pdb, bed, sam, cram, and vcf. further reading [1]

> "intentionally or not, bioinformatics found a way to survive: obfuscation. By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques"

1. https://madhadron.com/science/farewell_to_bioinformatics.htm...

jakobnissen 17 hours ago

SAM is not a bad file format. What's bad about SAM?

- optionalsquid 15 hours ago
  
  I don't dislike the format, and it is much, much better than what it replaced, but SAM and its binary sister format BAM do have some flaws:
  - The original index format could not handle large chromosomes, so now there are two index formats: .bai and .csi
  - For BAM, the CIGAR (alignment description) operation count is limited to 16 bits, which means that very long alignments cannot be represented. One workaround I've seen (but thankfully not used) is saving the CIGAR as a string in a tag
  - SAM cannot unambiguously represent sequences with only a single base (e.g. after trimming), since a '*' in the quality column can be interpreted either as a single Phred score (9) or as a special value meaning "no qualities". BAM can represent such sequences unambiguously, but most tools output SAM
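The single-base ambiguity in the last point can be sketched as follows, assuming the usual Phred+33 text encoding of SAM qualities. The helper names `encode_qual` and `decode_qual` are hypothetical, and the decoder shows one naive policy (a lone `*` always means "no qualities"), not what any particular tool does.

```python
from typing import List, Optional


def encode_qual(scores: Optional[List[int]]) -> str:
    """Encode Phred scores as a SAM QUAL string; None means 'no qualities'."""
    if scores is None:
        return "*"
    return "".join(chr(q + 33) for q in scores)


def decode_qual(qual: str, seq_len: int) -> Optional[List[int]]:
    """Naive decoder: treats a lone '*' as 'no qualities' (an assumed policy)."""
    if qual == "*":
        return None
    if len(qual) != seq_len:
        raise ValueError("QUAL length does not match sequence length")
    return [ord(c) - 33 for c in qual]


one_base_q9 = encode_qual([9])   # a single base with Phred score 9
missing = encode_qual(None)      # a read with no qualities stored

# Both cases produce the same text, so the distinction is lost on decode.
print(one_base_q9 == missing)
print(decode_qual(one_base_q9, 1))
```

Whichever policy a decoder picks, one of the two inputs cannot survive a round trip through the text form.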
  
  jakobnissen 11 hours ago
  
  True. I'd consider these minor flaws. W.r.t. the CIGAR, the spec says you do need to store it as a tag.
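A rough sketch of that tag workaround, based on my reading of the SAM/BAM spec (please check the spec before relying on it): when an alignment has more than 65535 CIGAR operations, the real CIGAR goes into a `CG` tag as an array of 32-bit integers, and the CIGAR field itself holds a two-operation placeholder of the form `<read length>S<reference length>N`. The function names below are hypothetical; only the packing rule (`op_len << 4 | op`, with operations ordered `MIDNSHP=X`) is taken from the spec.

```python
import re
from typing import List, Optional, Tuple

CIGAR_OPS = "MIDNSHP=X"          # BAM operation codes 0..8, in this order
MAX_BAM_CIGAR_OPS = 0xFFFF       # the operation count is a 16-bit field
_CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, op) pairs, rejecting stray text."""
    ops = [(int(n), op) for n, op in _CIGAR_RE.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def pack_ops(ops: List[Tuple[int, str]]) -> List[int]:
    """Pack each operation as op_len << 4 | op_code, as BAM does."""
    return [(n << 4) | CIGAR_OPS.index(op) for n, op in ops]


def bam_cigar_fields(cigar: str) -> Tuple[List[int], Optional[List[int]]]:
    """Return (CIGAR field values, CG tag values or None if not needed)."""
    ops = parse_cigar(cigar)
    if len(ops) <= MAX_BAM_CIGAR_OPS:
        return pack_ops(ops), None
    read_len = sum(n for n, op in ops if op in "MIS=X")   # query-consuming ops
    ref_len = sum(n for n, op in ops if op in "MDN=X")    # reference-consuming ops
    placeholder = [(read_len, "S"), (ref_len, "N")]
    return pack_ops(placeholder), pack_ops(ops)


field, cg_tag = bam_cigar_fields("1M1I" * 40000)   # 80000 operations
print(len(field), len(cg_tag))
```

This only builds the integer values; writing an actual BAM record (byte order, tag headers, compression) is out of scope here.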
  
totalperspectiv 15 hours ago

> a testament to the massive gap in perceived vs actual programming ability of the average bioinformatician.

This is not really a fair statement. Literally all of software bears the weight of some early poor choice that then keeps moving forward via weight of momentum. FASTA and FASTQ formats are exceptionally dumb though.

Fraterkes 17 hours ago

I’ll do you the immense favor of taking the bait. What’s so bad about it?

jszymborski 16 hours ago

It's a fine format for what it is.
A parser to stream FASTA can be written in like 30 lines [0], much easier than say CSV where the edge cases can get hairy.
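For a sense of scale, here is a minimal streaming sketch in Python. It is not the gist linked in [0], just an illustration; the name `read_fasta` and the choice to skip blank lines are my own assumptions.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs one record at a time."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines between or inside records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)  # emit the final record


example = io.StringIO(">seq1 demo\nACGT\nAC\n\n>seq2\nGGG\n")
records = list(read_fasta(example))
print(records)
```

Only one record is held in memory at a time, so the same loop works on a multi-gigabyte file opened with `open(path)`.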
If you need something like fast random reads, use the FAIDX format [1], or even better just store it in an LMDB or SQLite embedded db.
People forget FASTA was from 1985, and it sticks around because (1) it's easy to parse and write (2) we have mountains of sequences in that format going back 4 decades.
[0] https://gist.github.com/jszym/9860a2671dabb45424f2673a49e4b5...
[1] https://seqan.readthedocs.io/en/main/Tutorial/InputOutput/In...
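For the random-read suggestion, a minimal sketch of the SQLite option using Python's standard `sqlite3` module. The table name `seqs`, the function names, and the 0-based half-open coordinates are assumptions for illustration, not a convention from any tool.

```python
import sqlite3
from typing import Iterable, Optional, Tuple


def build_db(records: Iterable[Tuple[str, str]], path: str = ":memory:") -> sqlite3.Connection:
    """Store (name, sequence) pairs in a table keyed by sequence name."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seqs (name TEXT PRIMARY KEY, seq TEXT NOT NULL)"
    )
    with conn:  # commit the inserts as a single transaction
        conn.executemany("INSERT INTO seqs (name, seq) VALUES (?, ?)", records)
    return conn


def fetch(conn: sqlite3.Connection, name: str, start: int = 0,
          end: Optional[int] = None) -> Optional[str]:
    """Return seq[start:end] (0-based, end-exclusive), or None if name is unknown."""
    if end is None:
        sql = "SELECT substr(seq, ?) FROM seqs WHERE name = ?"
        params = (start + 1, name)          # SQLite substr() is 1-based
    else:
        sql = "SELECT substr(seq, ?, ?) FROM seqs WHERE name = ?"
        params = (start + 1, end - start, name)
    row = conn.execute(sql, params).fetchone()
    return row[0] if row else None


conn = build_db([("chr1", "ACGTACGT"), ("chr2", "GGGCCC")])
print(fetch(conn, "chr1", 2, 5))
```

The primary key gives an indexed lookup by name. For chromosome-sized sequences you would probably want to store fixed-size chunks rather than one huge row, which this sketch does not attempt.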

StillBored 12 hours ago

I think the prevalence of the format vs something more widely used should be part of that metric.

On those grounds, the lack of pre-tokenization in html/css/js ranks at this point as a planet killing level of poor choices.

fwip 14 hours ago

It might be the stupidest, but stupid in the sense of "the simplest thing that could possibly work."

When FASTA was invented, Sanger sequencing reads would be around a thousand bases in length. Even back then, disk space wasn't so precious that you couldn't spend several kilobytes on the results of your experiment. Plus, being able to view your results with `more` is a useful feature when you're working with data of that size.

And, despite its simplicity, it has worked for forty years.

michaelhoffman 13 hours ago

When FASTA was invented in 1985, generally sequencing reads would be about half that.
The simplicity of FASTA seems like a dream compared to the GenBank flat file format used before then. And around the year 2000, less computationally-inclined scientists were storing sequence in Microsoft Word binary .doc files.
A lot of file formats (including bioinformatics formats!) have come and gone in that time period. I don't think many would design it this way today, but it has a lot of nice features like the ones you point out that led to its longevity.

melagonster 2 hours ago

Yes, if someone wants to, they can do many analyses with grep!
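As a rough Python equivalent of the kind of line-oriented query grep allows on FASTA (counting records, or pulling headers that match a pattern), here is a small sketch. The function names are hypothetical.

```python
import io
import re
from typing import List, TextIO


def count_records(handle: TextIO) -> int:
    """Count records by counting header lines, like grep -c '^>'."""
    return sum(1 for line in handle if line.startswith(">"))


def grep_headers(handle: TextIO, pattern: str) -> List[str]:
    """Return headers (without the leading '>') that match a regex."""
    rx = re.compile(pattern)
    return [
        line[1:].rstrip("\r\n")
        for line in handle
        if line.startswith(">") and rx.search(line[1:])
    ]


text = ">sp|P1 human kinase\nMKV\n>sp|P2 mouse kinase\nMAL\n>sp|P3 human actin\nMDD\n"
print(count_records(io.StringIO(text)))
print(grep_headers(io.StringIO(text), r"human"))
```

One caveat with any line-based search: a motif that spans a line break in a wrapped sequence will be missed unless the lines are joined first.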

attractivechaos 13 hours ago

FASTA was invented in the late 1980s. At that time, Unix tools often limited line length. Even in the early 2000s, some Unix tools (on AIX, as I remember) still had this limit.
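That limit is why FASTA sequences are conventionally wrapped at a fixed width. A minimal writer sketch follows; the name `write_fasta` and the default width of 60 are assumptions (a common convention, not a requirement of the format).

```python
import io
from typing import TextIO


def write_fasta(handle: TextIO, header: str, seq: str, width: int = 60) -> None:
    """Write one record, wrapping the sequence at `width` characters per line."""
    if width < 1:
        raise ValueError("width must be positive")
    handle.write(f">{header}\n")
    for i in range(0, len(seq), width):
        handle.write(seq[i:i + width] + "\n")


out = io.StringIO()
write_fasta(out, "seq1", "A" * 130, width=60)
lines = out.getvalue().splitlines()
print([len(line) for line in lines[1:]])
```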
