Comment by jefftk 18 hours ago
The FASTA format looks like:
> title
bases with optional newlines
> title
bases with optional newlines
...
The author is talking about removing the non-semantic optional newlines (hard wrapping), not all the newlines in the file.
It makes a lot of sense that this would work: bacteria have many subsequences in common, but if you insert non-semantic newlines at effectively random offsets then compression tools will not be able to use the repetition effectively.
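To make the unwrapping step concrete, here is a minimal Python sketch, assuming the simple layout shown above (a title line starting with `>`, followed by sequence lines). The names `unwrap_fasta` and `compressed_size` are hypothetical helpers for illustration, not anything from the original post or from a FASTA library.

```python
import zlib


def unwrap_fasta(text: str) -> str:
    """Join each record's wrapped sequence lines into one line.

    Title lines (starting with '>') are kept as-is, so the only
    newlines removed are the non-semantic hard-wrap ones.
    """
    out = []
    seq = []
    for line in text.splitlines():
        if line.startswith(">"):
            if seq:  # flush the previous record's bases
                out.append("".join(seq))
                seq = []
            out.append(line)
        elif line.strip():  # skip blank lines, collect bases
            seq.append(line.strip())
    if seq:
        out.append("".join(seq))
    return "\n".join(out) + "\n"


def compressed_size(text: str) -> int:
    """Size in bytes after zlib compression (assumed stand-in for any compressor)."""
    return len(zlib.compress(text.encode("ascii")))
```

On a real file you could compare `compressed_size(original)` with `compressed_size(unwrap_fasta(original))`; how much it helps will depend on the data and on the compressor.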
In case "bases with optional newlines" wasn't obvious to anyone else, a specific example (from Wikipedia) is:
where "SS...EM", "HL..VT", or "ED..AR" may be common subsequences, but the plaintext file arbitrarily wraps at column 65 so it renders on a DEC VT100 terminal from the 70s nicely.
Or, for an even simpler example:
becomes, on disk, something like which is hard to compress, while is just and then, if you want, you can reflow the text when it's time to render to the screen.
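As a rough illustration of reflowing at render time, here is a small Python sketch. The function name `wrap_sequence` and the sample string are made up for this example; the default width of 65 is just the column mentioned above.

```python
def wrap_sequence(seq: str, width: int = 65) -> str:
    """Hard-wrap an unwrapped sequence for display only.

    The stored data stays on one line; wrapping is applied when
    it is time to render to the screen.
    """
    return "\n".join(seq[i:i + width] for i in range(0, len(seq), width))


# Hypothetical sample: stored unwrapped, wrapped only for display.
stored = "ABCDEFG"
print(wrap_sequence(stored, width=3))
```

Removing the newlines from the wrapped output gives back the stored string, so nothing is lost by keeping the unwrapped form on disk.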