Comment by jncfhnb 7 hours ago
I don’t think I believe that OCR can’t do it but random humans can
OCR is VERY good
> I don’t think I believe that OCR can’t do it but random humans can
Considering the people involved are experts in their field, are certainly aware of OCR capabilities, and have publicized a need thusly:
... the National Archives is looking for volunteers who can
help transcribe and organize its many handwritten records ...
Perhaps "random humans" can perform tasks which could reshape your belief:
> OCR is VERY good
No. Sign up and look at the current missions. A lot of what they want transcribed is totally straightforward to OCR --- not even LLM, OCR. Whatever's going on, and I'm not second-guessing them, a pretty big chunk of their problem appears to be well within the state of the art. The appeal to authority isn't going to play here, because you can just click through to the archives and see what they're trying to figure out.
> No. Sign up and look at the current missions. A lot of what they want transcribed is totally straightforward to OCR --- not even LLM, OCR. Whatever's going on, and I'm not second-guessing them, a pretty big chunk of their problem appears to be well within the state of the art.
If it's that easy, then do it and be the hero they want.
Or maybe, just maybe, "a pretty big chunk of their problem appears to be well within the state of the art" is a sweeping generalization lacking understanding of the difficulties involved.
Also, you seem to have taken issue with the phrase “random humans” because you’re confused about what’s being done here. It is random humans. Non-experts.
Experts are asking for the help of non-experts.
> Anyone with an internet connection can volunteer to transcribe historical documents and help make the archives’ digital catalog more accessible
> There are conceivable reasons why they may be telling a half truth here. Just engaging the public is a worthy goal here.
Asserting an ulterior motive without supporting proof is to engage in conspiracy theories.
Sometimes a cigar is just a cigar.[0]
It doesn't look like a cigar (very tricky documents) though. Hence the skepticism.
> I don’t think I believe that OCR can’t do it but random humans can
I do.
> OCR is VERY good
Uh, my experience is extremely different.
The archivists themselves say that they run into such texts often enough that this program was needed:
> The agency uses artificial intelligence and a technology known as optical character recognition to extract text from historical documents. But these methods don’t always work, and they aren’t always accurate.
They are absolutely aware of the advances in these tools, so if they say they're not completely there yet I believe them. One likely reason is that the models probably have less 1800s-era cursive in their training set than they do modern cursive.
It's likely that with more human-tagged data they could improve on the state of the art for OCR, but it's pretty arrogant to doubt the agency in charge of this sort of thing when they say the tech isn't there yet.
Can someone please post a sample of one of these images that can only be read by a human for us naive OCR believers to see?
To be fair there was a similar discussion a few days ago in which an SME remained unconvinced: https://news.ycombinator.com/item?id=42566391
I don't necessarily agree with her conclusion because she wasn't participating directly in the thread and wasn't completely responsive to some of the points raised, but still, it appears that there are a few instances of difficult-to-read handwriting where OCR is still coming in second to skilled human interpretation.
> I would challenge you to find a picture of text that you think a human can read and OCR cannot.
Are you aware of CAPTCHA[0] images?
Solvable with the right tools.
https://github.com/noCaptchaAi/NoCaptcha-Ai-Browser-Extensio...
> Solvable with the right tools.
The original assertion was:
I would challenge you to find a picture of text
that you think a human can read and OCR cannot.
Not if many CAPTCHA image challenges could be automated. Unless the tool referenced guarantees 100% correct solutions for all manipulated text images.
I mean, all you have to do is feed the image to ChatGPT, and it will read it basically as well as you can.
Denying/downvoting reality is always an option, of course.
Actually I think in 2025 you are correct, we just haven’t got the best tech into the OCR software that’s out there in the real world. I just pasted the letter from the article into ChatGPT (4o) and asked “what does this old letter say?” The response:
—-
The following is the declaration of James Lambert, a soldier of the Revolutionary War in North America.
The said James Lambert on this day personally appeared in the Probate Court of the County of Dearborn in the State of Indiana and at the November Term of said Court (1841), it being a court of record established by the laws of Indiana and made oath that:
On the 25th day of March 1842 he will be eighty-five years old; that he was born in the State of Maryland; that he is now a resident of said county and has been for the 27 years last past; that he has lived in Virginia, Maryland, Pennsylvania…
—-