Comment by mgr86
Have you considered XML? TEI, for example, is very robust and mature for marking up documents.
Understandable. I work in academic publishing, and while the "XML is everywhere" crowd is graying, retiring, or even dying :( it remains an excellent option for document markup. Additionally, a lot of government data produced in the US and EU makes heavy use of XML technologies. I imagine those agencies could be interested consumers of Nanonets-OCR. TEI could be a good choice, as well-tested and well-developed conversions exist to other popular, less structured formats.
Do check out MyST Markdown (https://mystmd.org)! Academic publishing is a space where MyST is being used, such as https://www.elementalmicroscopy.com/ via Curvenote.
(I'm a MyST contributor)
Yeah this really hurts. If your goal is to precisely mark up a document with some structural elements, XML is strictly superior to Markdown.
The fact that someone would go to all the work to build a model to extract the structure of documents, then choose an output format strictly less expressive than XML, speaks poorly of the state of cross-generational knowledge sharing within the industry.
I think the choice mainly stems from how you want to use the output. If the output is going to be fed to another LLM, then you want a markup language 1) whose grammar does not cause too many issues with tokenization, 2) that the LLM has seen a lot of in the past, and 3) that generates a minimal number of tokens. I think Markdown fits those criteria much better than other markup languages.
If the goal is to parse this output programmatically, then I agree a more structured markup language is a better choice.
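To make that trade-off concrete, here is a minimal sketch that encodes the same tiny fragment both ways. The fragment is made up for illustration; the XML uses TEI-style element names (div, head, p) but omits the TEI namespace for brevity, and character count is only a crude stand-in for token count, since real numbers depend on the tokenizer.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, encoded twice. The XML uses TEI-style element
# names but leaves out the TEI namespace to keep the example short.
tei = '<div type="section"><head>Results</head><p>Accuracy improved.</p></div>'
markdown = "## Results\n\nAccuracy improved.\n"

# Programmatic use: the XML structure can be queried directly.
root = ET.fromstring(tei)
heading = root.findtext("head")
paragraphs = [p.text for p in root.findall("p")]

# The Markdown side needs ad hoc string handling to recover the same heading.
md_heading = next(
    line[3:] for line in markdown.splitlines() if line.startswith("## ")
)

# Rough size comparison (characters, not tokens).
xml_chars = len(tei)
md_chars = len(markdown)

print(heading, paragraphs, md_heading)
print(f"XML: {xml_chars} chars, Markdown: {md_chars} chars")
```

The XML version is longer but gives you named elements to query; the Markdown version is shorter but leaves the structure implicit in line prefixes.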
First I heard of it. https://en.wikipedia.org/wiki/Text_Encoding_Initiative