Comment by quuxplusone 6 days ago
Copyright issues aside (e.g. if your thing is public domain), the galaxy-brain approach is to upload your raw scanned PDF to the Internet Archive (archive.org), fill in the appropriate metadata, wait about 24 hours for their post-upload format-conversion tasks to run automatically, and then download the size-optimized and OCR-ized PDF from them.
I've done this with a few documents from the French and Spanish national archives, which were originally provided as enormous non-OCRed PDFs but shrank to 10% the size (or less) after passage through archive.org and incidentally became full-text-searchable.
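For anyone who wants to script that round trip, here is a rough sketch in Python. It assumes the third-party `internetarchive` package for the upload step and that you have already configured your archive.org credentials. The helper names, the example identifier, and the `_text.pdf` suffix for the derived searchable PDF are my assumptions, so check the item's file list for the real name once the derive tasks have finished.

```python
from urllib.parse import quote


def build_metadata(title, language="fre"):
    """Minimal metadata for a scanned document (field values are illustrative)."""
    return {"title": title, "mediatype": "texts", "language": language}


def derived_pdf_url(identifier, original_filename):
    """Guess the download URL of the OCRed, size-optimized PDF.

    Assumption: the derived file is named <original stem>_text.pdf.
    Confirm against the item's actual file list before relying on it.
    """
    stem = original_filename.rsplit(".", 1)[0]
    return f"https://archive.org/download/{quote(identifier)}/{quote(stem)}_text.pdf"


def upload_scan(identifier, path, metadata):
    """Upload the raw scan; needs `pip install internetarchive` and configured keys."""
    import internetarchive  # imported lazily so the helpers above work without it

    return internetarchive.upload(identifier, files=[path], metadata=metadata)


if __name__ == "__main__":
    # Hypothetical identifier and filename, for illustration only.
    print(build_metadata("Example scanned register"))
    print(derived_pdf_url("example-scanned-register", "register.pdf"))
```

You would call `upload_scan(...)` once, wait a day or so for the conversion tasks, and then fetch the URL from `derived_pdf_url(...)` with whatever downloader you like.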
When I last checked a few months ago, LLMs were more accurate than the OCR the archive uses. The archive's version is (or was) not using context to figure out that, for example, “in the garden was a trge” should be “in the garden was a tree”. LLMs, depending on the prompt, do this.
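To make "using context" concrete, here is a toy sketch. It is not what the archive's OCR or any LLM actually does: the vocabulary and the bigram counts are made up, and they stand in for the much richer context model an LLM brings. The idea is just that several words are a plausible spelling fix for "trge", and the preceding word decides between them.

```python
import difflib

# Hypothetical vocabulary and bigram counts, invented for this example.
VOCAB = ["in", "the", "garden", "was", "a", "tree", "true", "three"]
BIGRAMS = {("a", "tree"): 50, ("a", "true"): 5, ("a", "three"): 0}


def correct(tokens):
    """Replace unknown tokens with the similar word that best fits the previous word."""
    out = []
    for tok in tokens:
        if tok in VOCAB:
            out.append(tok)
            continue
        # Spelling-only candidates: this step alone cannot choose between them.
        candidates = difflib.get_close_matches(tok, VOCAB, n=3, cutoff=0.6)
        if not candidates:
            out.append(tok)  # nothing plausible, leave the token untouched
            continue
        prev = out[-1] if out else ""
        # Context step: prefer the candidate seen most often after the previous word.
        out.append(max(candidates, key=lambda c: BIGRAMS.get((prev, c), 0)))
    return out


if __name__ == "__main__":
    print(" ".join(correct("in the garden was a trge".split())))
```

By spelling alone, "tree" and "true" are equally close to "trge"; it is only the "a ___" context that tips it toward "tree".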