Comment by maxcomperatore a day ago
I've been working on something like this for my own stuff. My drive has screenshots, PDFs, Markdown files, invoices, and random logs, and I always forget what I named things years ago.
What helped me:
- ran OCR on images with Tesseract (slow, but it works)
- used Unstructured and LangChain to parse and chunk everything, even spreadsheets and emails
- embedded the chunks with sentence-transformers and indexed them with FAISS
- then built a local LLM agent (a quantized Mistral model) to rerank the retrieved results
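
For anyone who wants to see the shape of the chunk, embed, index, search part, here is a minimal sketch. It is not my actual code: to keep it runnable with only the standard library, it uses a toy hashed bag-of-words embedding and a brute-force cosine search as stand-ins for sentence-transformers and FAISS. The names `chunk`, `embed`, `FlatIndex`, and the sample file paths are all made up for illustration.

```python
import hashlib
import math
import re
from collections import Counter

DIM = 256  # toy embedding size (assumption); real models output e.g. 384 dims


def chunk(text, size=40, overlap=10):
    """Split text into overlapping windows of `size` words."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed(text):
    """Toy stand-in for a real embedding model: hashed bag of words,
    L2-normalized. With sentence-transformers you would instead call
    SentenceTransformer(...).encode(texts, normalize_embeddings=True)."""
    vec = [0.0] * DIM
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    for token, count in Counter(tokens).items():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class FlatIndex:
    """Brute-force inner-product index, playing the role that
    faiss.IndexFlatIP plays in a real setup."""

    def __init__(self):
        self.vectors = []
        self.payloads = []  # (path, chunk text) for each stored vector

    def add(self, path, text):
        for piece in chunk(text):
            self.vectors.append(embed(piece))
            self.payloads.append((path, piece))

    def search(self, query, k=3):
        q = embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), payload)
            for v, payload in zip(self.vectors, self.payloads)
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]


# Hypothetical documents; in practice the text comes from OCR and parsers.
index = FlatIndex()
index.add("invoices/acme.txt", "Invoice from Acme for server hosting, total due 120 USD")
index.add("logs/backup.log", "nightly backup finished without errors")
index.add("notes/trip.md", "packing list for the hiking trip in the mountains")

results = index.search("hosting invoice", k=1)
for score, (path, text) in results:
    print(f"{score:.2f}  {path}  {text}")
```

The toy embedding only matches on shared words, so it is really just keyword search in vector form. The semantic part comes from swapping `embed` for a real model, and the reranking step would then reorder the top hits from `search`.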
It's rough, but it works like a semantic grep for your whole disk.
If you want something less DIY, paperless-ng plus AnythingLLM plus a lightweight embedding model could work. Or wait a few months and someone will wrap it all in an Electron app with Stripe on the homepage, lol.
Funny how much time we spend trying to find stuff we already wrote.