Comment by aragonite

Comment by aragonite 6 days ago

I did this very recently for a 19th century book in German with occasionally some Greek. The method that produces the highest level of accuracy I've found is to use ImageMagick to extract each page as a image, then send each image file to Claude Sonnet (encoded as base64) with a simple user prompt like "Transcribe the complete text from this image verbatim with no additional commentary or explanations". The whole thing is completed in under an hour & the result is near perfect and certainly much better than from standard OCR softwares.

cxr 5 days ago

> a 19th century book

If you're dealing with public domain material, you can just upload to archive.org. They'll OCR the whole thing and make it available to you and everyone else. (If you got it from archive.org, check the sidebar for the existing OCR files.)

Reply View 3 replies

aragonite 5 days ago

I did try the full text OCR from archive.org, but unfortunately the error rate is too high. Here are some screenshots to show what I mean:
- Original book image: https://imgur.com/a8KxGpY
- OCR from archive.org: https://imgur.com/VUtjiON
- Output from Claude: https://imgur.com/keUyhjR

Reply View | 1 reply
- cxr 5 days ago
  
  Ah, yeah, that's not uncommon. I was operating on an assumption, based on experience seeing language models make mistakes, that the two approaches would be within an acceptable range of each other for your texts, plus the idea that it's better to share the work than not.
  Note if you're dealing with a work (or edition) that cannot otherwise be found on archive.org, though, then if you do upload it, you are permitted as the owner of that item to open up the OCRed version and edit it. So an alternative workflow might be better stated:
  1. upload to archive.org
  2. check the OCR results
  3. correct a local copy by hand or use a language model to assist if the OCR error rate is too high
  4. overwrite the autogenerated OCR results with the copy from step 3 in order to share with others
  (For those unaware and wanting to go the collaborative route, there is also the Wikipedia-adjacent WMF project called Wikisource. It has the upside of being more open (at least in theory) than, say, a GitHub repo—since PRs are not required for others to get their changes integrated. One might find, however, it to be less open in practice, since it is inhabited by a fair few wikiassholes of the sort that folks will probably be familiar with from Wikipedia.)
  
  Reply View | 0 replies
joseda-hg 5 days ago

Maybe I've just had back luck, but their OCR butchered some of the books I've tried to get

Reply View | 0 replies

HarHarVeryFunny 6 days ago

Is it really necessary to split it into pages? Not so bad if you automate it I suppose, but aren't there models that will accept a large PDF directly (I know Sonnet has a 32MB limit)?

Reply View 6 replies

7thpower 6 days ago

They are limited on how much they can output and there is generally an inverse relationship between the amount of tokens you send vs quality after the first 20-30 thousand tokens.

Reply View | 3 replies
- smallnix 5 days ago
  
  Are there papers on this effect? That quality of responses diminishes with very large inputs I mean. I observed the same.
  
  Reply View | 2 replies
  
  Breza 6 hours ago
  
  I've experienced this problem but I haven't come across papers about it. For this context, it would be interesting to compare the accuracy of transcribing one page at a time to batches of n pages.
  
  Reply View | 0 replies
  
  HarHarVeryFunny 5 days ago
  
  I think these models all "cheat" to some extent with their long context lengths.
  The original transformer had dense attention where every token attends to every other token, and the computational cost therefore grew quadratically with increased context length. There are other attention patterns than can be used though, such as only attending to recent tokens (sliding window attention), or only having a few global tokens that attend to all the others, or even attending to random tokens, or using combinations of these (e.g. Google's "Big Bird" attention from their Elmo/Bert muppet era).
  I don't know what types of attention the SOTA closed source models are using, and they may well be using different techniques, but it'd not be surprising if there was "less attention" to tokens far back in the context. It's not obvious why this would affect a task like doing page-by-page OCR on a long PDF though, since there it's only the most recent page that needs attending to.
  
  Reply View | 0 replies
therealpygon 5 days ago

Necessary? No. Better? Probably. Despite larger context windows, attention and hallucinations aren’t completely a thing of the past within the expanded context windows today. Splitting to individual pages likely helps ensure that you stay well within a normal context window size that seems to avoid most of these issues. Asking an LLM to maintain attention for a single page is much more achievable than an entire book.
Also, PDF size isn’t a relevant measurement of token lengths when it comes to PDFs which can range from a collection of high quality JPEG images to thousand(s) of pages of text

Reply View | 0 replies
siva7 5 days ago

They all accept large PDFs (or any kind of input) but the quality of the output will suffer for various reasons.

Reply View | 0 replies

ant6n 5 days ago

I recently did some OCRing with OpenAI. I found o3-mini-hi to be imagining and changing text, whereas the older (?) o4 was more accurate. It’s a bit worrying that some of the models screw around with the text.

Reply View 8 replies

jazzyjackson 5 days ago

There’s GPT4, then GPT4o (o for Omni, as in multi modal) and then GPT o1 (chain of thought / internal reasoning) then o3 (because o2 is a stadium in London that I guess is very litigious about its trademark?), o3-mini is the latest but yes optimized to be faster and cheaper

Reply View | 7 replies
- polshaw 5 days ago
  
  o2 is the UK's largest mobile network operator. They bought naming rights to what was known as the millennium dome (not even a stadium).
  
  Reply View | 1 reply
  
  jazzyjackson 5 days ago
  
  Ahh makes sense :)
  
  Reply View | 0 replies
- dotancohen 5 days ago
  
  What is the o3 model good for? Is it just an evolution of o1 (chain of thought / internal reasoning)?
  
  Reply View | 2 replies
  
  KTibow 5 days ago
  
  Yes
  (albeit I believe o3-mini isn't natively multimodal)
  
  Reply View | 1 reply
  
  dotancohen 5 days ago
  
  I see, thank you.
  
  Reply View | 0 replies
- ant6n 5 days ago
  
  Which one is the smartest, and most knowledgeable? (Like least likely to make up facts)
  
  Reply View | 1 reply
  
  wrsh07 5 days ago
  
  4o is going to be better for a straight up factual question
  (But eg I asked it about something Martin Short / John Mulaney said on SNL and it needed 2 prompts to get the correct answer..... the first answer wasn't making anything up it was just reasonably misinterpreting something)
  It also has web search which will be more accurate if the pages it reads are good (it uses bing search, so if possible provide your own links and forcibly enable web search)
  Similarly the latest Anthropic Claude Sonnet model (it's the new Sonnet 3.5 as of ~Oct) is very good.
  The idea behind o3 mini is that it only knows as much as 4o mini (the names suck, we know) but it will be able to consider its initial response and edit it if it doesn't meet the original prompt's criteria
  
  Reply View | 0 replies

hkonsti 5 days ago

Do you have a rough estimate of what the price per page was for this?

Reply View 1 reply

aragonite 2 days ago

It must have been under $3 for the 150 or so API calls, possibly even under $2, though I'm less sure about that.

Reply View | 0 replies

woile 5 days ago

What about preserving the style like titles and subtitles?

Reply View 1 reply

aragonite 5 days ago

You can request Markdown output, which takes care of text styling like italics and bold. For sections and subsections, in my own case they already have numerical labels (like "3.1.4") so I didn't feel the need to add extra formatting to make them stand out. Incidentally, even if you don't specify markdown output, Claude (at least in my case) automatically uses proper Unicode superscript numbers (like ¹, ², ³) for footnotes, which I find very neat.

Reply View | 0 replies

tmaly 2 days ago

how big were the image files in terms of size/resolution that go you the level of accuracy you needed with Claude?

Reply View 1 reply

aragonite 2 days ago

300dpi (`magick -density 300 book.pdf page_%03d.png` was the command I used). The PDF is a from archieve.org & a very high-quality scan (https://ia601307.us.archive.org/5/items/derlgnertheori00rsuo...)

Reply View | 0 replies