Comment by maxcomperatore

i've been working on something like this for my own stuff. my drive has screenshots, pdfs, md files, invoices and random logs, and i always forget what i named things years ago

what helped me was:

- ran ocr on images with tesseract (slow but it works)

- used unstructured and langchain to parse and chunk everything, even spreadsheets and emails

- embedded chunks with sentence-transformers and indexed it with faiss

- then built a local llm agent (a quantized mistral model) to rerank the search results

it's rough but it works like a semantic grep for your whole disk
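
for anyone who wants the shape of the chunk / embed / index / search part, here's a minimal stdlib-only sketch. it is not my actual code: `embed` is a toy bag-of-words stand-in for a sentence-transformers model (where you'd call `SentenceTransformer.encode`), the plain python list stands in for a faiss index (something like `faiss.IndexFlatIP`), and there's no ocr or llm rerank step. all function names, file paths and sample texts are made up for illustration.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=50, overlap=10):
    """Split text into overlapping word windows (assumes overlap < max_words)."""
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


def embed(text):
    """Toy stand-in for a real embedding model: L2-normalized word counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(a, b):
    """Dot product of two sparse vectors; equals cosine since both are normalized."""
    return sum(weight * b.get(tok, 0.0) for tok, weight in a.items())


def build_index(docs):
    """docs maps path -> extracted text; returns a list of (path, chunk, vector)."""
    index = []
    for path, text in docs.items():
        for chunk in chunk_text(text):
            index.append((path, chunk, embed(chunk)))
    return index


def search(index, query, k=3):
    """Brute-force nearest-neighbour search; returns the top k (score, path, chunk)."""
    q = embed(query)
    scored = [(cosine(q, vec), path, chunk) for path, chunk, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# hypothetical files, as if text had already been extracted from them
docs = {
    "invoices/acme_2021.txt": "invoice from acme corp for server hosting total 120 usd",
    "notes/faiss.md": "notes on building a vector index for semantic search",
    "logs/backup.log": "backup job finished without errors",
}
index = build_index(docs)
print(search(index, "acme hosting invoice", k=1)[0][1])  # invoices/acme_2021.txt
```

word counts only match exact words, so this toy version is closer to plain grep than semantic grep. the "semantic" part comes from swapping `embed` for a real model, and the brute-force loop in `search` is the piece a vector index replaces once you have a lot of chunks.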

if you want less diy, paperless-ng plus anythingllm plus a lightweight embedding model could work. or wait a few months and someone will wrap it all in an electron app with stripe on the homepage lol

funny how much time we spend trying to find stuff we already wrote