Show HN: Morphik – Open-source RAG that understands PDF images, runs locally

200 points by Adityav369 3 months ago

Hey HN, we’re Adi and Arnav. A few months ago, we hit a wall trying to get LLMs to answer questions over research papers and instruction manuals. Everything worked fine until the answer lived inside an image or diagram embedded in the PDF. Even GPT‑4o flubbed it (we recently tried o3 on the same task, and surprisingly it flubbed it too). Naive RAG pipelines just pulled in some text chunks and ignored the rest.

We took an invention disclosure PDF (https://drive.google.com/file/d/1ySzQgbNZkC5dPLtE3pnnVL2rW_9...) containing an IRR‑vs‑frequency graph and asked GPT “From the graph, at what frequency is the IRR maximized?”. We originally tried this on gpt-4o, but while writing this post we used the new natively multimodal model o4‑mini‑high. After a 30‑second thinking pause, it asked for clarifications, then churned out buggy code, pulled data from the wrong page, and still couldn’t answer the question. We wrote up the full story with screenshots here: https://docs.morphik.ai/blogs/gpt-vs-morphik-multimodal.

We got frustrated enough to try fixing it ourselves.

We built Morphik to do multimodal retrieval over documents like PDFs, where images and diagrams matter as much as the text.

To do this, we use ColPali-style embeddings, which treat each document page as an image and generate multi-vector representations. These embeddings capture layout, typography, and visual context, allowing retrieval to return a whole table or schematic, not just nearby tokens. Combined with vector search, Morphik can now retrieve the exact pages with relevant diagrams and pass them as images to the LLM to get relevant answers. It’s able to answer the question with an 8B llama 3.1 vision running locally!
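
For readers new to multi-vector retrieval, here is a minimal sketch of the late-interaction ("MaxSim") scoring that ColPali-style models are generally described as using. It is not Morphik's code: the function names, the two-dimensional vectors, and the page ids are all made up for illustration, and real embeddings come from a vision-language model rather than hand-written lists.

```python
from math import sqrt

def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def maxsim_score(query_vecs, page_vecs):
    """For each query vector, keep its best-matching page vector, then sum those maxima."""
    q = [normalize(v) for v in query_vecs]
    p = [normalize(v) for v in page_vecs]
    return sum(max(sum(a * b for a, b in zip(qv, pv)) for pv in p) for qv in q)

def rank_pages(query_vecs, pages):
    """Return page ids sorted from best to worst MaxSim score."""
    scores = {page_id: maxsim_score(query_vecs, vecs) for page_id, vecs in pages.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy data (hypothetical): one vector per query token, one vector per page patch.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_3_graph": [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]],
    "page_1_text": [[0.9, 0.1], [0.8, 0.2]],
}
ranking = rank_pages(query, pages)
```

Because every query vector gets to pick its own best patch, a page that covers both parts of the query (here `page_3_graph`) outranks a page that only matches one of them.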

Early pharma testers hit our system with queries like "Which EGFR inhibitors at 50 mg showed ≥ 30% tumor reduction?" We returned the right tables and plots, but still hit a bottleneck: we weren’t able to connect the dots across multiple reports. So we built a knowledge graph: we tag entities in both text and images, normalize synonyms (Erlotinib → EGFR inhibitor), infer relations (e.g. administered_at, yields_reduction), and stitch everything into a graph. Now a single query can traverse that graph across documents and surface a coherent, cross‑document answer along with the correct pages as images.

To illustrate that, and just for fun, we built a graph of 100 of Paul Graham’s essays here: https://pggraph.streamlit.app/ You can search for various nodes (e.g. startup, sam altman, paul graham) and see the corresponding connections. In our system, we create graphs and store the relevant text chunks along with the entities, so at query time we can extract the relevant entity, search the graph, and pull in the text chunks of all connected nodes, which improves cross-document queries.
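
The graph lookup described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Morphik's implementation: the `ChunkGraph` class, the `ALIASES` table, and the report snippets are hypothetical, and a real system would extract entities and relations with a model rather than take them as arguments.

```python
from collections import defaultdict

# Hypothetical synonym table; the post's example maps Erlotinib to EGFR inhibitor.
ALIASES = {"erlotinib": "egfr inhibitor"}

def canonical(name):
    """Lowercase an entity name and collapse known synonyms onto one node."""
    key = name.strip().lower()
    return ALIASES.get(key, key)

class ChunkGraph:
    def __init__(self):
        self.neighbors = defaultdict(list)  # node -> [(relation, other node)]
        self.chunks = defaultdict(list)     # node -> text chunks mentioning it

    def add_relation(self, source, relation, target, chunk):
        s, t = canonical(source), canonical(target)
        self.neighbors[s].append((relation, t))
        self.neighbors[t].append((relation, s))
        self.chunks[s].append(chunk)
        self.chunks[t].append(chunk)

    def retrieve(self, entity, hops=1):
        """Collect chunks from the entity's node and every node within `hops` edges."""
        start = canonical(entity)
        visited, frontier = [start], [start]
        for _ in range(hops):
            next_frontier = []
            for node in frontier:
                for _relation, neighbor in self.neighbors.get(node, []):
                    if neighbor not in visited:
                        visited.append(neighbor)
                        next_frontier.append(neighbor)
            frontier = next_frontier
        results = []
        for node in visited:
            for chunk in self.chunks.get(node, []):
                if chunk not in results:  # de-duplicate, keep first-seen order
                    results.append(chunk)
        return results

# Made-up chunks from two different reports.
graph = ChunkGraph()
graph.add_relation("Erlotinib", "administered_at", "50 mg", "report A: Erlotinib dosed at 50 mg")
graph.add_relation("EGFR inhibitor", "yields_reduction", "30% tumor reduction",
                   "report B: EGFR inhibitor arm showed 30% tumor reduction")
context = graph.retrieve("erlotinib")
```

Because both names resolve to the same node, a query about Erlotinib pulls in the chunk from report B even though that report never uses the drug's name.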

For longer or multi-turn queries, we added persistent KV caching, which stores the intermediate key-value states from the transformer's attention layers. Instead of recomputing attention over the whole context from scratch every time, we reuse the cached states, speeding up repeated queries and letting us handle much longer context windows.
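
As a toy illustration of the idea (not Morphik's code), the sketch below caches keys and values for a single attention head, saves them, and reuses them for a follow-up token. The `KVCache` class, the identity projections, and the JSON persistence are simplifying assumptions; real models have many layers and heads, learned projections, and tensor-based cache formats.

```python
import json
from math import exp

def attend(query, keys, values):
    """Softmax attention of one query vector over a list of keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

class KVCache:
    def __init__(self, keys=None, values=None):
        self.keys = list(keys or [])
        self.values = list(values or [])

    def step(self, token_vec):
        """Append this token's key/value, then attend over everything cached so far."""
        # Toy assumption: identity projections, so key = value = query = token_vec.
        self.keys.append(list(token_vec))
        self.values.append(list(token_vec))
        return attend(token_vec, self.keys, self.values)

    def dumps(self):
        """Serialize the cache so it can outlive the current process."""
        return json.dumps({"keys": self.keys, "values": self.values})

    @classmethod
    def loads(cls, blob):
        data = json.loads(blob)
        return cls(data["keys"], data["values"])

def recompute(tokens):
    """Baseline: rebuild all keys/values from scratch for the last token's attention."""
    return attend(tokens[-1], tokens, tokens)

document = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
for token in document:
    cache.step(token)
blob = cache.dumps()             # persist once after reading the document

follow_up = [0.5, 0.5]
restored = KVCache.loads(blob)   # later query: load instead of re-reading the document
cached_out = restored.step(follow_up)
fresh_out = recompute(document + [follow_up])
```

The cached path only processes the new token, while the baseline reprocesses the whole document; both produce the same attention output.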

We’re open‑source under the MIT Expat license: https://github.com/morphik-org/morphik-core

Would love to hear your RAG horror stories, what worked, what didn’t and any feedback on Morphik. We’re here for it.

codegeek 3 months ago

"We’re open‑source under the MIT Expat license"

Not quite. You should clarify a bit more. The README has this about their license.

"Certain features - such as Morphik Console - are not available in the open-source version. Any feature in the ee namespace is not available in the open-source version and carries a different license. Any feature outside that is open source under the MIT expat license."

Adityav369 3 months ago

Thanks, we should have been clearer. The part in ee is our UI, which can be used for testing or in dev environments. The main code, including the API, SDK, and the entire backend logic, is MIT Expat.

thot_experiment 3 months ago

I'd love to have something like this but calling a cloud is a no-go for me. I have a half-baked tool that a friend of mine and I applied to the Mozilla Builders Grant with (didn't get in). It's janky and I don't have time to work on it right now, but it does the thing. I also find myself using OpenWebUI's context RAG stuff sometimes, but I'd really like to have a way to dump all of my private documents into a DB and have search/RAG work against them locally, preferably in a way that's agnostic of the LLM backend.

Does such a project exist?

Adityav369 3 months ago

You can run this fully locally using Ollama for inference, although you'll need larger models and a beefy machine for great results. On my end llama 3.2 8B does a good job on technical docs, but the bigger the better lol.

- thot_experiment 3 months ago
  
  Ahh, I didn't see that, I just saw them talking about a free tier or whatever and my eyes glazed over. I'll try it out with Mistral-small 3.1 at some point tonight, I've been having really great results with its multimodal understanding.
  
- mrtimo 3 months ago
  
  how would you use this within open-web-ui locally?
  
oceansweep 3 months ago

Hey yes, I’m building exactly that.
https://github.com/rmusser01/tldw
I first built a POC in Gradio and am now rebuilding it as a FastAPI app. The media processing endpoints work, but I’m still tweaking media ingestion to allow for syncing to clients (the idea is to allow for a client-first design). The GitHub repo doesn’t show any of the recent changes, but if you check back in 2-3 weeks, I think I’ll have the API version pushed to the main branch.

osigurdson 3 months ago

Just curious, are you fine with running things in your own AWS / Azure / GCP account or do you really mean that the solution has to be fully on-premise?

- thot_experiment 3 months ago
  
  Airgapped. It really makes threat modelling so so soooo much easier. The air gap is temporal, so if I were being attacked by a state-level actor exfiltration is possible, but for this specific application I either have the data live and no internet, or internet and no data. I also have some lesser stuff that I allow on-prem w/ internet and just trust the firewall, but absolutely no way am I doing any sensitive data storage or inference in the cloud.
  Since people will be curious, one lesser thing I use this for is a diary/assistant, and it's nice to have the peace of mind that I can dump my innermost thoughts without any concern for oversharing.
  
  ArnavAgrawal03 3 months ago
  
  Totally agree that air-gapped provides unparalleled peace of mind. That's a major reason why we have strong support for local deployment. Nice to know that our hypothesis is somewhat accurate :)
  
  rank0 3 months ago
  
  What kind of hardware do you need for this setup?
  
  thot_experiment 3 months ago
  
  A computer with a couple of gaming GPUs, a LAN cable you can unplug, and an encrypted external hard drive to offline your sensitive data.
  
w10-1 3 months ago

The architecture sounds very, very promising. Normalizing entities and relations to put in a graph for RAG sounds great. (I'm still a bit unclear on ingesting or updating existing graphs.)

Curious about the suitability of this for PDFs of conference presentation slides vs academic papers. Is this sensitive or tunable to such distinctions?

Looking for tests/validation; are they all in the evaluation folder? A Pharma example would be great.

Thank you for documenting the telemetry. I appreciate the ee commercialization dance :)

Adityav369 3 months ago

For ingesting graphs, you can define a filter, or certain document ids. When updating, we check whether any other docs have been added that match that filter (or you can specify new doc ids). We then do entity and relationship extraction again, and do entity resolution against the existing graph to merge the two.
Creating graphs and entity resolution are both tunable: you can specify domain-specific prompts and overrides (will add a pharma example!) (https://docs.morphik.ai/python-sdk/create_graph#parameters). I tried to add code, but it was formatting badly, sorry for the redirect.
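
A rough sketch of the update flow described above: newly extracted entities are resolved against the existing graph and then merged. This is not the Morphik SDK; `merge_graphs`, the dict layout, and the `aliases` table are hypothetical stand-ins, and in particular the alias lookup stands in for whatever prompt-driven entity resolution the real system performs.

```python
def canonical(name, aliases):
    """Normalize whitespace and case, then map known aliases to one canonical name."""
    key = " ".join(name.lower().split())
    return aliases.get(key, key)

def merge_graphs(existing, extracted, aliases):
    """Return a new graph with the extracted entities and relations folded in."""
    merged = {
        "entities": {name: set(docs) for name, docs in existing["entities"].items()},
        "relations": set(existing["relations"]),
    }
    for name, doc_id in extracted["entities"]:
        merged["entities"].setdefault(canonical(name, aliases), set()).add(doc_id)
    for source, relation, target in extracted["relations"]:
        merged["relations"].add(
            (canonical(source, aliases), relation, canonical(target, aliases))
        )
    return merged

# Made-up example: doc_2 is newly ingested and mentions the drug by another name.
existing = {
    "entities": {"egfr inhibitor": {"doc_1"}, "50 mg": {"doc_1"}},
    "relations": {("egfr inhibitor", "administered_at", "50 mg")},
}
extracted = {
    "entities": [("Erlotinib", "doc_2"), ("30% tumor reduction", "doc_2")],
    "relations": [("Erlotinib", "yields_reduction", "30% tumor reduction")],
}
aliases = {"erlotinib": "egfr inhibitor"}
merged = merge_graphs(existing, extracted, aliases)
```

After the merge, the single `egfr inhibitor` node is linked to both documents, and the original graph object is left untouched.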

DavidPP 3 months ago

I'm currently building an internal tool using SurrealDB directly, but I'm curious to try Morphik since it implements features I haven't had the time to figure out yet. (For example, I started with hardcoded schemas and I like how you support both.)

Minor nitpick, but the README for your ui-component project under ee says:

"License This project is part of Morphik and is licensed under the MIT License."

However, your ee folder has an "enterprise" license, not the MIT license.

Adityav369 3 months ago

Thanks for pointing that out! Fixed it.
For the metadata extraction, we save these as Column(JSONB) for each document, which allows the schema to be changed on the fly.
Although, I keep wondering if it would have been better to use something like mongodb for this part, just because it's more natural.
Please let me know if you have questions and how it works out for you.

trollbridge 3 months ago

If it’s MIT open source, what does the paid part apply to?

Adityav369 3 months ago

The paid part applies to the ui-component, which provides a chat user interface. The core code, SDK, and API are all under the MIT license.

- trollbridge 3 months ago
  
  Thanks for responding. That’s pretty reasonable…
  
mmsc 3 months ago

Have you thought about some type of preprocessing to make the PDFs "simpler"? https://github.com/freedomofpress/dangerzone does something like that in its first stage.

ArnavAgrawal03 3 months ago

While we don't use this yet, it seems very promising - thanks! We did something similar (with LibreOffice, for example) to support non-PDF datatypes, but this seems to be coming at it more from a security perspective - which makes sense.

MitPitt 3 months ago

Should I use this if I don't plan on working with pdfs? What's the best RAG currently?

Adityav369 3 months ago

Depends on your document types.
If you're using txts, then plain RAG built on top of any vector database can suffice depending on your queries (if they directly reference the text, or can be made to, then similarity search is good enough). If they are cross-document, retrieving a high number of chunks with plain RAG might also do a good job.
If you have tables, images, etc. then using a better extraction mechanism (maybe unstructured, or other document processors) and then creating the embeddings can also work well.
I'd say if docs are simple, then just building your own pipeline on top of a vector db is good!
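
To make "plain RAG" concrete, here is a minimal similarity-search sketch. It makes several simplifying assumptions: word-count vectors stand in for a real embedding model, a Python list stands in for a vector database, and the manual snippets are made up.

```python
from collections import Counter
from math import sqrt

def embed(text):
    """Toy embedding: a bag-of-words count vector (a real pipeline would use a model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, chunks, k=2):
    """Return the k chunks most similar to the query, best first."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vec, embed(chunk)), reverse=True)
    return ranked[:k]

chunks = [
    "the pump must be primed before first use",
    "warranty covers parts for two years",
    "prime the pump by filling the inlet hose",
]
hits = top_k("how to prime the pump", chunks, k=2)
```

The retrieved chunks would then be pasted into the LLM prompt as context. Raising `k` is the "high number of chunks" knob mentioned above for cross-document questions.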

jkc101 3 months ago

Looks cool! What are the compute requirements or recommendations for self-hosting Morphik? What are the scaling limits? Can you provide a sense for latencies for ingestion and retrieval as the index size grows?

Adityav369 3 months ago

Depending on the use case, it happily runs on my MacBook Air M2 (16GB RAM) with MPS for small PDFs, and searching over 100-150 documents with ColPali takes 2-ish minutes. Very rough numbers. Ingestion takes around 15-20-ish seconds a page, which is on the slower end. On an A100, ingestion with ColPali takes 4-5 seconds per page (we haven't performance-optimized or tuned batch sizes yet, tho). Without ColPali it is much faster. Ingestion speed doesn't change much as the index size grows.
I'd be happy to report back after some testing, we are looking to optimize more of this soon, as speed is somewhat of a missing piece at the moment.

breadislove 3 months ago

I uploaded a file and it's been processing for over an hour now. No failure or anything. Maybe you should look into that.

Adityav369 3 months ago

Yeah, we had an overload on the ingestion queue. If you try again, it will be much faster as we just moved to a beefier machine. (The previous ingestion will still complete since it is in the queue, but new ones will be faster.)

- hliyan 3 months ago
  
  Wait, your title says this "runs locally"?
  
  ArnavAgrawal03 3 months ago
  
  Yes! If you're running the local version and it's taking long, that's an indication that your GPU isn't being used properly. This can be traced back to the `colpali_embedding_model.py` file, where you can set the device and attention implementation you want PyTorch to use.
  
jrvarela56 3 months ago

Couldn't upload files, all had error 'failed to fetch'

ArnavAgrawal03 3 months ago

Hey! What format of files are you uploading? Seems to work OK on my end...

- jrvarela56 3 months ago
  
  Pdfs
  
Alifatisk 3 months ago

How could I extract rectangles from PDF and then do something like this?

Adityav369 3 months ago

Do you mean ingesting the extracted rectangles/bounding boxes? We're actually working on bounding boxes; this is a good insight and we can add it to the product. However, the way we ingest is literally converting each page to an image and then embedding that, so the text, layout, and diagrams are all encoded in. Would like to know what the exact use case is so we can help you better.

- mnky9800n 3 months ago
  
  Why do you convert to image? It’s easy to turn the components of a pdf into separate items and then ingest them individually. I also imagine at some point rasterizing vectors will become a pain point for some reason.
  
  Adityav369 3 months ago
  
  Mainly to maintain layout information. Also search becomes easier this way.
  
nostrebored 3 months ago

ColQwen is basically a strict upgrade — would give it a go!

Adityav369 3 months ago

We do use ColQwen! Currently 2, but upgrading to 2.5 soon :)

mertleee 3 months ago

Is this running a custom llm under the hood or?

ArnavAgrawal03 3 months ago

No, you can bring your own LLM. In the cloud, we're querying gpt-4o. We're looking to expand to have some fine-tuned VLMs for document parsing and extraction further in the roadmap, but that would heavily depend on use-case.

Imanari 3 months ago

Looks really nice! How does it handle tables?

Adityav369 3 months ago

We have two ingestion pathways: 1. regular OCR + text embeddings; 2. ColPali. We've observed that ColPali does a much better job with tables since it can encode positional information and layouts as well.

- th0ma5 3 months ago
  
  Whenever I ask people who want to use such features at scale which figure could be out of place or have a transposed digit, the project generally evaporates.
  