Comment by demosthanos a year ago

The archivists themselves say that they run into such texts often enough that this program was needed:

> The agency uses artificial intelligence and a technology known as optical character recognition to extract text from historical documents. But these methods don’t always work, and they aren’t always accurate.

They are absolutely aware of the advances in these tools, so if they say the tools aren't completely there yet, I believe them. One likely reason is that the models probably have less 1800s-era cursive in their training set than they do modern cursive.

It's likely that with more human-tagged data they could improve on the state of the art for OCR, but it's pretty arrogant to doubt the agency in charge of this sort of thing when they say the tech isn't there yet.

tedunangst a year ago

Can someone please post a sample of one of these images that can only be read by a human for us naive OCR believers to see?

CamperBob2 a year ago

To be fair, there was a similar discussion a few days ago in which an SME remained unconvinced: https://news.ycombinator.com/item?id=42566391
I don't necessarily agree with her conclusion because she wasn't participating directly in the thread and wasn't completely responsive to some of the points raised, but still, it appears that there are a few instances of difficult-to-read handwriting where OCR is still coming in second to skilled human interpretation.

- jncfhnb a year ago
  
  That’s comprehension of English, not reading characters.
  
BugsJustFindMe a year ago

I've posted these above, but I'll give you your own copy because the bits are free. Does your OCR work on these? Mine sadly doesn't. But if yours does, then I'll switch to it.
https://imgur.com/a/CDU6Lgs

- jncfhnb a year ago
  
  The problem statement was text that random humans can read and OCR cannot.
  If you want to provide a good-faith answer, at least make it English. I assume this is French, but it’s obviously much harder to evaluate on both ends when you’re mixing up the language.
  
  BugsJustFindMe a year ago
  
  I'm confused.
  Which parts of "OCR" and "human" stand for "modern English"?
  Are you suggesting that humans can't read or write in French? Because I can point to a lot of them who would disagree.
  
jncfhnb a year ago

Then please provide a single example that we can’t instantly solve. Happy to prove them wrong.
