Comment by msgodel
Multimodal Qwen is pretty good at OCR, although it's pretty slow without a GPU.
For pure search you're almost certainly better off building an index of CLIP embeddings and then doing cosine similarity with a query embedding to find things. I have gigabytes of reaction images and memes I've been thinking about doing this with.
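For anyone wondering what the search side looks like, here's a rough sketch of just the cosine similarity part. It's not a real CLIP pipeline: the three-dimensional vectors and file names are made-up stand-ins for embeddings you'd get from a CLIP image encoder (for the index) and text encoder (for the query), and `build_index` and `search` are hypothetical helper names. It's pure Python so it runs anywhere; for gigabytes of images you'd presumably want the vectors in a NumPy array or a proper vector index instead.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_index(embeddings):
    """Turn {name: vector} into a list of (name, unit_vector) pairs."""
    return [(name, normalize(vec)) for name, vec in embeddings.items()]


def search(index, query_vec, top_k=3):
    """Return the top_k (name, score) pairs ranked by cosine similarity."""
    q = normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), name)
        for name, vec in index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, score) for score, name in scored[:top_k]]


# Toy vectors standing in for CLIP image embeddings (assumption, not real data).
toy_embeddings = {
    "cat_meme.png": [0.9, 0.1, 0.0],
    "dog_reaction.jpg": [0.1, 0.9, 0.1],
    "stock_chart.png": [0.0, 0.1, 0.9],
}

index = build_index(toy_embeddings)

# Toy vector standing in for the CLIP text embedding of a query.
results = search(index, [1.0, 0.2, 0.0], top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Normalizing once at index time means each query is just one dot product per image, which is why this approach scales much better than running a multimodal model over every file.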