Comment by souvik3333

Comment by souvik3333 7 hours ago

7 replies

Actually, we have trained the model to convert to markdown and do semantic tagging at the same time. E.g., equations will be extracted as LaTeX, and images (plots, figures, and so on) will be described within `<img>` tags. The same goes for `<signature>`, `<watermark>`, and `<page_number>`.

Also, we extract complex tables as HTML tables instead of markdown.
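For anyone consuming this kind of output, here is a minimal sketch of pulling the tagged spans out of the markdown. It is not from Nanonets: the sample text, the assumption that each tag appears as a simple `<tag>...</tag>` pair, and the helper names `extract_tags` and `strip_tags` are all hypothetical, so check them against real model output.

```python
import re

# Hypothetical sample output; the real model output may differ in detail.
sample = (
    "# Quarterly report\n"
    "<img>Bar chart of revenue by quarter</img>\n"
    "$$E = mc^2$$\n"
    "<table><tr><td>Q1</td><td>10</td></tr></table>\n"
    "<signature>J. Doe</signature>\n"
    "<watermark>DRAFT</watermark>\n"
    "<page_number>3</page_number>\n"
)

# Semantic tags named in the comment above.
TAGS = ("img", "signature", "watermark", "page_number")

# Assumes non-nested <tag>...</tag> pairs; \1 matches the same tag name on close.
TAG_RE = re.compile(r"<(%s)>(.*?)</\1>" % "|".join(TAGS), re.DOTALL)


def extract_tags(text):
    """Return a dict mapping each semantic tag to the list of its contents."""
    found = {tag: [] for tag in TAGS}
    for match in TAG_RE.finditer(text):
        found[match.group(1)].append(match.group(2).strip())
    return found


def strip_tags(text):
    """Remove the semantic spans, leaving markdown, LaTeX, and HTML tables."""
    return TAG_RE.sub("", text)


if __name__ == "__main__":
    print(extract_tags(sample))
    print(strip_tags(sample))
```

HTML tables are deliberately left untouched here, since `<table>` is ordinary markup in the output and not one of the semantic tags.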

mgr86 6 hours ago

Have you considered XML? TEI, for example, is very robust and mature for marking up documents.

  • esafak 6 hours ago
    • mgr86 6 hours ago

      Understandable. I work in academic publishing, and while the "XML is everywhere" crowd is graying, retiring, or even dying :( it still remains an excellent option for document markup. Additionally, a lot of government data produced in the US and EU makes heavy use of XML technologies. I imagine those agencies could be interested consumers of Nanonets-OCR. TEI could be a good choice, as well-tested and well-developed conversions exist to other popular, less structured formats.

jtbayly 7 hours ago

What happens to footnotes?

  • souvik3333 6 hours ago

    They will be extracted on a new line as normal text. It will be the last line.