Comment by ks2048 8 hours ago

It’s a shame all these models target markdown and not something with more structure and a specification. There are different flavors of Markdown and limited support for footnotes, references, figures, etc.

souvik3333 7 hours ago

Actually, we have trained the model to convert to markdown and do semantic tagging at the same time. E.g., equations will be extracted as LaTeX equations, and images (plots, figures, and so on) will be described within `<img>` tags. Same with `<signature>`, `<watermark>`, and `<page_number>`.

Also, for complex tables we extract the table as HTML instead of markdown.
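For concreteness, a page converted this way might look roughly like the following. This is an illustrative sketch based on the tags described above, not the model's actual output:

```
# Quarterly Report <page_number>3</page_number>

Growth is computed as $r = \frac{P_t - P_{t-1}}{P_{t-1}}$.

<img>Line plot of revenue by quarter, rising sharply in Q3.</img>

<table>
  <tr><th>Quarter</th><th>Revenue</th></tr>
  <tr><td>Q1</td><td>1.2M</td></tr>
  <tr><td>Q2</td><td>1.4M</td></tr>
</table>

<signature>J. Smith</signature>
<watermark>CONFIDENTIAL</watermark>
```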

  • mgr86 7 hours ago

    Have you considered XML? TEI, for example, is very robust and mature for marking up documents.

    • esafak 6 hours ago
      • mgr86 6 hours ago

        Understandable. I work in academic publishing, and while the "XML is everywhere" crowd is graying, retiring, or even dying :( it still remains an excellent option for document markup. Additionally, a lot of the government data produced in the US and EU makes heavy use of XML technologies; I imagine those producers could be interested consumers of Nanonets-OCR. TEI could be a good choice, as well-tested and well-developed conversions to other popular, less structured formats already exist.

  • jtbayly 7 hours ago

    What happens to footnotes?

    • souvik3333 6 hours ago

      They will be extracted as normal text on a new line, appended as the last line.

starkparker 4 hours ago

I was more excited to hear about "structured Markdown" than about the LLM OCR model, but the extent of it just seems to be tagging certain elements. That's useful in an LLM context, but not as much outside of it.