Show HN: Ocrbase – pdf → .md/.json document OCR and structured extraction API
(github.com) 81 points by adammajcher 11 hours ago
Most production software is wrappers around existing libraries. The relevant question is whether this wrapper adds operational or usability value, not whether it reimplements OCR. If there are architectural or reliability concerns, it’d be more useful to call those out directly.
Sure. The self host guide tells me to enter my github secret, in plain-text, in an env file. But it doesn't tell me why I should do that.
Do people actually store their secrets in plain text on the file system in production environments? Just seems a bit wild to me.
This is admittedly dated, but even back in December 2023 GPT-4 with its Vision preview was able to do structured extraction very reliably, and I'd imagine Gemini 3 Flash is much better than that.
https://binal.pub/2023/12/structured-ocr-with-gpt-vision/
Back-of-the-napkin math (which I could be messing up completely), but I think you could process a 100-page PDF for ~$0.50 or less using Gemini 3 Flash?
> 560 input tokens per page * 100 pages = 56,000 tokens = $0.028 input ($0.50/M input tokens)
> ~1000 output tokens per page * 100 pages = 100,000 tokens = $0.30 output ($3/M output tokens)
(https://ai.google.dev/gemini-api/docs/gemini-3#media_resolut...)
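For anyone who wants to check the arithmetic, here is a minimal sketch of the same estimate. The token counts and per-million prices are the figures quoted in the comment above, treated as assumptions rather than verified pricing; `estimate_cost` is a hypothetical helper, not part of any API.

```python
# Back-of-the-napkin cost estimate using the figures quoted in the comment.
# All inputs are the commenter's assumptions, not verified pricing.

def estimate_cost(pages, input_tokens_per_page, output_tokens_per_page,
                  input_price_per_million, output_price_per_million):
    """Return (input_cost, output_cost, total_cost) in dollars."""
    input_tokens = pages * input_tokens_per_page
    output_tokens = pages * output_tokens_per_page
    input_cost = input_tokens / 1_000_000 * input_price_per_million
    output_cost = output_tokens / 1_000_000 * output_price_per_million
    return input_cost, output_cost, input_cost + output_cost


input_cost, output_cost, total = estimate_cost(
    pages=100,
    input_tokens_per_page=560,      # assumed image tokens per page
    output_tokens_per_page=1000,    # assumed output tokens per page
    input_price_per_million=0.50,   # assumed $ per 1M input tokens
    output_price_per_million=3.00,  # assumed $ per 1M output tokens
)
print(f"input ${input_cost:.3f} + output ${output_cost:.2f} = ${total:.3f}")
```

Under those assumptions the total comes to roughly $0.33 per 100 pages, consistent with the "~$0.50 or less" guess.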
Sure, for some small projects I recommend that my friends use Gemini 3 Flash. ocrbase is aimed more at scale and self-hosting: fixed infra cost, high throughput, and no data leaving your environment. At large volumes, that tradeoff starts to matter more than per-100-page pricing.
How is this better than Surya/Marker or kreuzberg (https://github.com/kreuzberg-dev/kreuzberg)?
I have a flow where I extract text from a PDF with pdf-parse and then feed that to an AI for data extraction. If that fails, I convert it to a PNG and send the image for data extraction. This works very well and would presumably be far cheaper, as I'm generally sending text to the model instead of relying on images. Isn't just sending the images for OCR significantly more expensive?
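A rough sketch of that text-first flow with an image fallback, in Python rather than the Node pdf-parse setup described. Everything here is hypothetical: the four callables stand in for a real PDF text parser, a page renderer, and the two model calls, and "fails" is assumed to mean either empty extracted text or an extractor that returns `None`.

```python
from typing import Callable, Optional, Tuple


def extract_with_fallback(
    pdf_bytes: bytes,
    pdf_to_text: Callable[[bytes], str],
    pdf_to_png: Callable[[bytes], bytes],
    extract_from_text: Callable[[str], Optional[dict]],
    extract_from_image: Callable[[bytes], Optional[dict]],
) -> Tuple[Optional[dict], str]:
    """Try the cheap text path first; fall back to the image path.

    Returns (result, path_used), where path_used is "text" or "image".
    """
    text = pdf_to_text(pdf_bytes)
    if text.strip():  # assumption: empty text means a scanned PDF
        result = extract_from_text(text)
        if result is not None:  # assumption: None signals a failed extraction
            return result, "text"
    # Fallback: render the PDF to an image and extract from that instead.
    return extract_from_image(pdf_to_png(pdf_bytes)), "image"


# Stub callables stand in for a real parser, renderer, and model calls.
digital = extract_with_fallback(
    b"%PDF-digital",
    pdf_to_text=lambda pdf: "Invoice 123",
    pdf_to_png=lambda pdf: b"png-bytes",
    extract_from_text=lambda text: {"source": "text"},
    extract_from_image=lambda png: {"source": "image"},
)
scanned = extract_with_fallback(
    b"%PDF-scanned",
    pdf_to_text=lambda pdf: "",
    pdf_to_png=lambda pdf: b"png-bytes",
    extract_from_text=lambda text: {"source": "text"},
    extract_from_image=lambda png: {"source": "image"},
)
print(digital, scanned)
```

Returning which path was used makes it easy to log how often the more expensive image route is actually taken.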
I always render an image and OCR that so I don’t get odd problems from invisible text and it also avoids being affected by anything for SEO.
There was an interesting discussion on here a couple of months back about images vs text, driven by this article: https://www.seangoedecke.com/text-tokens-as-image-tokens/
Discussion is here: https://news.ycombinator.com/item?id=45652952
By definition, OCR means optical character recognition. Which extraction method works depends on the contents of the PDF. Often the available PDFs are just scans of printed documents or handwritten notes. If machine-readable text is available, your approach is great.
How does this compare to dots.ocr? I got fantastic results when I tested dots.
Having worked with PaddleOCR, Tesseract, and many other OCR tools before, this is still one of the best and smoothest OCR experiences I've ever had. Deployed in minutes.
What matters most is how well OCR and structured data extraction tools handle documents with high variation at production scale. In real workflows like accounting, every invoice, purchase order, or contract can look different. The extraction system must still work reliably across these variations with minimal ongoing tweaks.
Equally important is how easily you can build a human-in-the-loop review layer on top of the tool. This is needed not only to improve accuracy, but also for compliance—especially in regulated industries like insurance.
Other tools in this space:
LLMWhisperer/Unstract (AGPL)
Reducto
Extend Ai
LlamaParse
Docling
Why is 12GB+ VRAM a requirement? The OCR model looks fairly small (https://huggingface.co/PaddlePaddle/PaddleOCR-VL/tree/main), so I'm assuming the VRAM is needed for some processing afterwards.
This is essentially a (vibe-coded?) wrapper around PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR
The "guts" are here: https://github.com/majcheradam/ocrbase/blob/7706ef79493c47e8...