Comment by demosthanos 7 hours ago
Before commenting asking about why they don't just use LLMs, please note that the article specifically calls out that they do, but it's not always a viable solution:
> The agency uses artificial intelligence and a technology known as optical character recognition to extract text from historical documents. But these methods don’t always work, and they aren’t always accurate.
The document at the top is likely an especially easy document to read precisely because it's meant to be the hook to get people to sign up and get started. It isn't going to be representative of the full breadth of documents that the National Archives want people to go through.
OK, fair enough, but can you find one in this article that's hard for an LLM? The gnarliest one I saw, 4o handled instantly, and I went back and looked carefully at the image and the text and I'm sold.
Like if this is a crowdsourcing project, why not do a first pass with an LLM and present users with both the image and the best-effort LLM pass?
Later
I signed up, went to the current missions, and they all seem to be post-1900 and typeset. They're blurry, but 4o cuts through them like a hot knife through butter.