Comment by aragonite
I did try the full text OCR from archive.org, but unfortunately the error rate is too high. Here are some screenshots to show what I mean:
- Original book image: https://imgur.com/a8KxGpY
- OCR from archive.org: https://imgur.com/VUtjiON
- Output from Claude: https://imgur.com/keUyhjR
Ah, yeah, that's not uncommon. I was operating on an assumption, based on experience seeing language models make mistakes, that the two approaches would be within an acceptable range of each other for your texts, plus the idea that it's better to share the work than not.
Note, though, that if you're dealing with a work (or edition) that cannot otherwise be found on archive.org and you do upload it, you are permitted as the owner of that item to open up the OCRed version and edit it. So the workflow might be better stated as:
1. upload to archive.org
2. check the OCR results
3. if the OCR error rate is too high, correct a local copy, either by hand or with the help of a language model
4. overwrite the autogenerated OCR results with the copy from step 3 in order to share with others
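For step 2, one rough way to put a number on "too high" is to hand-correct a short sample page and measure the character error rate of the OCR output against it. The sketch below is only an illustration: the function names and the 2% threshold are my own assumptions, not anything archive.org provides or recommends.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # drop a character from a
                curr[j - 1] + 1,     # add a character from b
                prev[j - 1] + cost,  # substitute (or keep, if equal)
            ))
        prev = curr
    return prev[-1]


def char_error_rate(ocr_text: str, reference: str) -> float:
    """Edit distance divided by the length of the hand-corrected reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(ocr_text, reference) / len(reference)


def needs_correction(ocr_text: str, reference: str, threshold: float = 0.02) -> bool:
    """True if the OCR sample is worse than the (assumed) acceptable threshold."""
    return char_error_rate(ocr_text, reference) > threshold
```

If the sample page comes back over the threshold, that's the signal to go on to step 3 rather than sharing the autogenerated text as-is.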
(For those unaware and wanting to go the collaborative route, there is also the Wikipedia-adjacent WMF project called Wikisource. It has the upside of being more open (at least in theory) than, say, a GitHub repo, since PRs are not required for others to get their changes integrated. One might, however, find it to be less open in practice, since it is inhabited by a fair few wikiassholes of the sort that folks will probably be familiar with from Wikipedia.)