jahewson 6 hours ago

Actually I think in 2025 you are correct, we just haven’t got the best tech into the OCR software that’s out there in the real world. I just pasted the letter from the article into ChatGPT (4o) and asked “what does this old letter say?” The response:

---

The following is the declaration of James Lambert, a soldier of the Revolutionary War in North America.

The said James Lambert on this day personally appeared in the Probate Court of the County of Dearborn in the State of Indiana and at the November Term of said Court (1841), it being a court of record established by the laws of Indiana and made oath that:

On the 25th day of March 1842 he will be eighty-five years old; that he was born in the State of Maryland; that he is now a resident of said county and has been for the 27 years last past; that he has lived in Virginia, Maryland, Pennsylvania…

---

ozbonus 6 hours ago

I've been trying every state of the art OCR solution on my students' handwritten essays for fifteen years and have yet to find anything even close to acceptable.

AdieuToLogic 7 hours ago

> I don’t think I believe that OCR can’t do it but random humans can

Considering the people involved are experts in their field, are certainly aware of OCR capabilities, and have publicized a need thusly:

  ... the National Archives is looking for volunteers who can
  help transcribe and organize its many handwritten records ...

Perhaps "random humans" can perform tasks which could reshape your belief:

> OCR is VERY good

  • tptacek 6 hours ago

    No. Sign up and look at the current missions. A lot of what they want transcribed is totally straightforward to OCR --- not even LLM, OCR. Whatever's going on, and I'm not second-guessing them, a pretty big chunk of their problem appears to be well within the state of the art. The appeal to authority isn't going to play here, because you can just click through to the archives and see what they're trying to figure out.

    • AdieuToLogic 6 hours ago

      > No. Sign up and look at the current missions. A lot of what they want transcribed is totally straightforward to OCR --- not even LLM, OCR. Whatever's going on, and I'm not second-guessing them, a pretty big chunk of their problem appears to be well within the state of the art.

      If it's that easy, then do it and be the hero they want.

      Or maybe, just maybe, "a pretty big chunk of their problem appears to be well within the state of the art" is a sweeping generalization lacking understanding of the difficulties involved.

      • tptacek 5 hours ago

        Go ahead and find something hard, and relate back the steps you took to find it.

  • jncfhnb 2 hours ago

    Also, you seem to have taken issue with the phrase “random humans” because you’re confused about what’s being done here. It is random humans. Non-experts.

    Experts are asking for the help of non-experts.

    > Anyone with an internet connection can volunteer to transcribe historical documents and help make the archives’ digital catalog more accessible

  • jncfhnb 7 hours ago

    There are conceivable reasons why they may be telling a half truth here. Just engaging the public is a worthy goal here.

    • AdieuToLogic 6 hours ago

      > There are conceivable reasons why they may be telling a half truth here. Just engaging the public is a worthy goal here.

      Asserting an ulterior motive without supporting proof is to engage in conspiracy theories.

      Sometimes a cigar is just a cigar.[0]

      0 - https://quoteinvestigator.com/2011/08/12/just-a-cigar/

      • jncfhnb 2 hours ago

        The alternative is me saying that appealing to their “expertise” is an appeal to authority fallacy that flies in the face of general evidence that modern OCR is far better than humans at character recognition. Especially random, non-specialized humans.

      • Dylan16807 5 hours ago

        It doesn't look like a cigar (very tricky documents) though. Hence the skepticism.

BugsJustFindMe 7 hours ago

> I don’t think I believe that OCR can’t do it but random humans can

I do.

> OCR is VERY good

Uh, my experience is extremely different.

  • jncfhnb 7 hours ago

    I would challenge you to find a picture of text that you think a human can read and OCR cannot. I’m happy to demonstrate. The text shown in this article is trivial.

    • demosthanos 7 hours ago

      The archivists themselves say that they run into such texts often enough that this program was needed:

      > The agency uses artificial intelligence and a technology known as optical character recognition to extract text from historical documents. But these methods don’t always work, and they aren’t always accurate.

      They are absolutely aware of the advances in these tools, so if they say they're not completely there yet I believe them. One likely reason is that the models probably have less 1800s-era cursive in their training set than they do modern cursive.

      It's likely that with more human-tagged data they could improve on the state of the art for OCR, but it's pretty arrogant to doubt the agency in charge of this sort of thing when they say the tech isn't there yet.

      • jncfhnb 2 hours ago

        Then please provide a single example that we can’t instantly solve. Happy to prove them wrong.

      • tedunangst 6 hours ago

        Can someone please post a sample of one of these images that can only be read by a human for us naive OCR believers to see?

        • CamperBob2 5 hours ago

          To be fair there was a similar discussion a few days ago in which an SME remained unconvinced: https://news.ycombinator.com/item?id=42566391

          I don't necessarily agree with her conclusion because she wasn't participating directly in the thread and wasn't completely responsive to some of the points raised, but still, it appears that there are a few instances of difficult-to-read handwriting where OCR is still coming in second to skilled human interpretation.

    • AdieuToLogic 6 hours ago

      > I would challenge you to find a picture of text that you think a human can read and OCR cannot.

      Are you aware of CAPTCHA[0] images?

      0 - https://en.wikipedia.org/wiki/CAPTCHA

      • jncfhnb 2 hours ago

        Text that is _intentionally constructed_ to fool computers but not humans is obviously out of scope. But they’re generally easily solved with OCR these days anyway.

      • jahewson 6 hours ago
        • AdieuToLogic 5 hours ago

          > Solvable with the right tools.

          The original assertion was:

            I would challenge you to find a picture of text
            that you think a human can read and OCR cannot.
          
          Not whether many CAPTCHA image challenges could be automated, unless the tool referenced guarantees 100% correct solutions for all manipulated text images.

  • CamperBob2 7 hours ago

    Your experience is obsolete.

    • BugsJustFindMe 7 hours ago

      Oh, ok then.

      • CamperBob2 6 hours ago

        I mean, all you have to do is feed the image to ChatGPT, and it will read it basically as well as you can.

        Denying/downvoting reality is always an option, of course.