Comment by schmorptron 2 days ago
One thing I don't get about the ever-recurring RAG discussions and the hype men proclaiming "RAG is dead" is that people seem to be talking about wholly different things. My mental model is that what is called RAG can be either:
- a predefined document / document-chunk store where every chunk gets a vector embedding, and a similarity lookup decides which chunks get pulled into context, so you don't have to pull in whole classes of documents and fill it up (see the sketch after this list)
- the web-search-like features in LLM chat interfaces, where they do a keyword search and pull relevant documents into context, but somehow only ephemerally, with the full documents not taking up context later in the thread (unsure about this, did I understand it right?).
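For the first meaning, here is a minimal sketch of the embed-then-lookup loop. Everything in it is illustrative: `embed()` is a hypothetical placeholder for a real embedding model (an API call, sentence-transformers, etc.), and the chunks are dummy strings.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding function; swap in a real model."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)  # unit-normalize so dot product = cosine sim

# Index time: chunk the documents, store one embedding per chunk.
chunks = ["chunk one ...", "chunk two ...", "chunk three ..."]
index = np.stack([embed(c) for c in chunks])

# Query time: embed the question, take top-k chunks by cosine similarity,
# and splice only those chunks into the prompt.
def retrieve(query: str, k: int = 2) -> list[str]:
    scores = index @ embed(query)
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

prompt = "Context:\n" + "\n".join(retrieve("what is in chunk two?")) + "\n\nQuestion: ..."
```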
With the new models with million+ token context windows, some were arguing that we can just throw whole books into the context non-ephemerally, but doesn't that significantly reduce the diversity of sources we can include at once if we hard-commit to everything staying in context forever? I guess it might help with consistency? But isn't the mechanism by which we decide what to keep in context still some kind of RAG, just with whole documents as larger chunks instead of only parts?
I'd be ecstatic if someone who really knows their stuff could clear this up for me.
Technically, RAG is anything that augments generation with external search. However, it often has a narrower meaning: "uses a vector DB."
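To make the broad definition concrete: the "R" can be any external search, not just a vector DB. A toy sketch, assuming a hypothetical `keyword_retrieve()` that scores documents by word overlap (a stand-in for real keyword search like BM25):

```python
# Retrieval via plain keyword overlap instead of embeddings: still RAG.
def keyword_retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

docs = ["RAG augments generation with retrieval.", "Context windows keep growing."]
context = "\n".join(keyword_retrieve("what does RAG do?", docs))
# The generation step then sees `context` prepended to the user's question.
```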
Throwing everything into one large context window is often impractical - it takes much more time to process, and many models struggle to find information accurately if too much is going on in the context window ("lost in the middle").
The "classic" RAG still has its place when you want low latency (or you're limited by VRAM) and the results are already good enough.