Comment by aragonite
Comment by aragonite 6 days ago
I did this very recently for a 19th century book in German with occasionally some Greek. The method that produces the highest level of accuracy I've found is to use ImageMagick to extract each page as a image, then send each image file to Claude Sonnet (encoded as base64) with a simple user prompt like "Transcribe the complete text from this image verbatim with no additional commentary or explanations". The whole thing is completed in under an hour & the result is near perfect and certainly much better than from standard OCR softwares.
> a 19th century book
If you're dealing with public domain material, you can just upload to archive.org. They'll OCR the whole thing and make it available to you and everyone else. (If you got it from archive.org, check the sidebar for the existing OCR files.)