Comment by tecoholic 4 days ago

4 replies

> Converts an image to a single-page PDF with a hidden text layer using Tesseract. This is the 'State Preservation' step.

Does this mean the text-only PDF page is turned into an image that covers the full page, with the text still underneath? So any machine-based extraction would still get the text but would probably lose all the bounding box information, and regular users can no longer select text with their mouse?
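
For context, OCR tools typically write a "hidden text layer" as ordinary PDF text drawn in text rendering mode 3 (invisible) and positioned over the page image. Below is a minimal sketch of such a content stream, assuming that approach. The function name `hidden_text_stream` and the `(text, x, y, size)` tuple layout are hypothetical illustrations, not taken from the project.

```python
def hidden_text_stream(words, font="F1"):
    """Build a PDF content stream that places each word invisibly.

    `words` is an iterable of (text, x, y, size) tuples in PDF points.
    This is a hypothetical stand-in for OCR output, not the project's format.
    """
    # BT begins a text object; "3 Tr" selects render mode 3 (no fill, no stroke).
    ops = ["BT", "3 Tr"]
    for text, x, y, size in words:
        # Escape the characters that are special inside a PDF literal string.
        escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        ops.append(f"/{font} {size} Tf")   # select font and size
        ops.append(f"1 0 0 1 {x} {y} Tm")  # set an absolute position for this word
        ops.append(f"({escaped}) Tj")      # show the (invisible) text
    ops.append("ET")  # end the text object
    return "\n".join(ops)


stream = hidden_text_stream([("Hello", 72, 700, 12), ("world", 104, 700, 12)])
print(stream)
```

Because each word carries its own position in a layer like this, an extractor can in principle recover per-word coordinates. Whether a given PDF keeps them depends on how the layer was written.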

kumarm 4 days ago

Seems true, and I really wish the project included some sample PDF output.

My text-to-speech app uses bounding boxes to show which text in the PDF is being read, and it would not work well with PDFs from this project.

  • GavCo 3 days ago

    OP here. I added a sample PDF output to the project assets and put screenshots in the README. The text is selectable after rehydration. Would this work with your app?

    • tecoholic 3 days ago

      Wait! what? This is incredible. Amazing work.

    • kumarm 3 days ago

      Amazing. Worked really well. Thank you.