Comment by schmorptron 2 days ago

One thing I don't get about the ever-recurring RAG discussions and the hype men proclaiming "RAG is dead" is that people seem to be talking about wholly different things. My mental model is that what is called RAG can either be:

- a predefined document store / document chunk store where every chunk gets a vector embedding, and a lookup decides what gets pulled into context, so you don't have to pull in whole classes of documents and fill the context up (sketched just after this list)

- the web-search-like features in LLM chat interfaces, where they do a keyword search and pull relevant documents into context, but somehow only ephemerally, with the full documents not taking up context later in the thread (unsure about this, did I understand it right?).
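
A minimal sketch of what I picture for the first flavour, just to make the question concrete; embed() here is a stand-in for whatever embedding model would be used, and the rest is plain numpy:

```python
# Sketch of "classic" RAG: chunk documents up front, embed every chunk once,
# then at query time pull only the top-k most similar chunks into the prompt.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder: return one unit-length vector per text (e.g. from an embedding API)."""
    raise NotImplementedError

def build_index(documents: list[str], chunk_size: int = 500):
    # Fixed-size character chunks, embedded ahead of time (before any question is known).
    chunks = [doc[i:i + chunk_size] for doc in documents for i in range(0, len(doc), chunk_size)]
    return chunks, embed(chunks)

def retrieve(query: str, chunks: list[str], vectors: np.ndarray, k: int = 5) -> list[str]:
    # Cosine similarity against the precomputed chunk vectors; only the winners enter the context.
    scores = vectors @ embed([query])[0]
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```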

With the new models with million-plus-token context windows, some were arguing that we can just throw whole books into the context non-ephemerally, but doesn't that significantly reduce the diversity of possible sources we can include at once if we hard-commit to everything staying in context forever? I guess it might help with consistency? But is the mechanism with which we decide what to keep in context not still some kind of RAG, just with larger chunks or whole documents instead of only parts?

I'd be ecstatic if someone who really knows their stuff could clear this up for me.

kgeist 2 days ago

Technically, RAG is anything that augments generation with external search. However, it often has a narrower meaning: "uses a vector DB."

Throwing everything into one large context window is often impractical - it takes much more time to process, and many models struggle to find information accurately if too much is going on in the context window ("lost in the middle").

The "classic" RAG still has its place when you want low latency (or you're limited by VRAM) and the results are already good enough.

impossiblefork 2 days ago

We can't throw infinitely many things into the context, though.

My impression is that GPT-5 gets confused: not quite right away, but after a couple of pages it has no idea. It doesn't take pages upon pages before it forgets things.

  • aerhardt 2 days ago

    I’m currently experimenting with prompts of ~300k tokens for a certain classification task, and I think I might be able to make it work. GPT-5 chokes, but Gemini 2.5 Pro is showing promise. The jury's still out and I might change my tune in a couple of weeks.

    • impossiblefork 2 days ago

      It should also be said that what I say here is focused on things where these models have problems.

      For example, I consider the model confused when it starts outputting stereotyped or clichéd responses, and I intentionally go at problems that I know the models have trouble with (I already know they can program and do some maths, but I want to see what they can't do). But if you're using them for things they're made for, and which aren't confusing, such as people arguing with each other, then you're likely to succeed.

      Prompts with lots of examples are reasonable and I know they can get very long.

GistNoesis 2 days ago

The answer is adaptability.

In both cases for "Question Answering" it's about similarity search, but there are two main orthogonal differences between RAG and Non-RAG:

- Knowing the question at the time of index building

- Higher-order features: the ability to compare fetched documents with one another and refine the question

Non-RAG, aka a multi-layer (non-causal) transformer with infinite context, is the more generic version. It is fully differentiable, meaning you can use machine learning to learn how to Non-RAG better: each layer of the transformer can use the previous layer to reason about and refine the similarity search. (A causal transformer knows the question at the time it is fed the question, and can choose to focus its attention on different parts of the previously computed features of the provided documents, but it may benefit from having some reflection tokens, or better, from being given the question before being presented with the documents, provided you've trained it to answer like that.)

RAG is an approximation of the generic case that makes it faster and cheaper. It usually breaks end-to-end differentiability by using external tools, so if you want to use machine learning to learn how to RAG better, you need some variant of reinforcement learning, which learns more slowly. RAG usually doesn't know the question at the time of index building, and documents are treated independently of each other, so there are no (automatic) higher-order features (the embeddings are fixed).

A third usual approximation is to feed the output of RAG into Non-RAG, to hopefully get the best of both worlds. You can learn the Non-RAG-given-RAG part with machine learning (if you train it on conversations where it used RAG), but the RAG part won't improve by itself.
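
A rough sketch of that third setup, with retrieve standing in for the frozen RAG step (like the one sketched further up the thread) and generate for the trained transformer call; both are just placeholders here:

```python
# Hypothetical hybrid: a frozen retriever feeds a trainable generator.
# Only `generate` is part of the differentiable model, so fine-tuning can
# improve how the answer is written, but never how the documents are fetched.
def answer(question: str, chunks, vectors, retrieve, generate) -> str:
    context = "\n\n".join(retrieve(question, chunks, vectors, k=5))  # non-differentiable step
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return generate(prompt)  # the only part you can train end to end
```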

Non-RAG needs to learn, so it needs a big training dataset, but fortunately it can pick up question-answer pairs in an unsupervised fashion when you feed it the whole web, and you only need a small instruction-tuning and preference-optimization dataset to shape it to your needs. If performance isn't what you expect in a specific case, you can provide more specific examples and retrain the model until it gets it, and you get better performance for the case you care about. You can improve the best case, but it's hard to improve the worst case.

RAG gives you more control over what you feed it, but the content has to be provided in a more structured way. You can prevent worst cases more easily, but it's hard to improve the good case.

edanm a day ago

> My mental model is that what is called RAG can either be:

RAG is confusing, because if you look at the words making up the acronym RAG, it seems like it could be either of the things you mentioned. But it originally referred to a specific technique of embeddings + vector search - this is how it was used in the ML article that defined the term, and this is how most people in the industry actually use it.

It annoys me, because I think it should refer to all techniques of augmenting generation, but in practice it's often not used that way.

There are reasons that specifically make the "embeddings" idea special - namely, it's a relatively new technique that actually fits LLMs very well, because it's a semantic search - meaning it works on "the same input" as LLMs do, a free-text query. (As opposed to traditional lookups that work on keyword search or similar.)
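
A toy illustration of that point, reusing the placeholder embed() from the sketch further up the thread; with a reasonable embedding model, the semantic lookup finds the match that keyword overlap misses:

```python
# Keyword lookup vs. semantic lookup on a paraphrased query.
import numpy as np

docs = ["How to reset your password", "Billing and invoices", "Two-factor authentication setup"]
query = "I forgot my login credentials"

# Keyword search: token overlap only. The query shares no words with any document,
# so nothing is found.
keyword_hits = [d for d in docs if set(query.lower().split()) & set(d.lower().split())]

# Semantic search: nearest embedding. The query and the password-reset document are
# close in meaning, so a decent embedding model ranks that document first.
vectors = embed(docs)                               # placeholder embed() from the earlier sketch
best = docs[int(np.argmax(vectors @ embed([query])[0]))]
```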

As for whether RAG is dead - if you mean specifically vector embeddings and semantic search, it's possible, because you could theoretically use other techniques for augmentation, e.g. an agent that understands a user's question about a codebase and uses grep/find/etc. to look for the information, or composes a search query to search the internet for something. But it's definitely not going to die in that second sense of "we need some way to augment the LLM's knowledge before text generation"; that will probably always be relevant, as you say.
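
Something like this, as a sketch of that grep-style alternative; ask_llm is a placeholder for whatever chat-completion call is used, and the whole point is that the "index" is just plain-text search:

```python
# Hypothetical agentic retrieval over a codebase: no vector index, the model
# just proposes a search pattern and then reads the grep hits.
import subprocess

def grep_repo(pattern: str, repo_dir: str) -> str:
    # Recursive, case-insensitive search with line numbers.
    result = subprocess.run(["grep", "-rin", pattern, repo_dir],
                            capture_output=True, text=True)
    return result.stdout[:4000]  # truncate so the hits themselves don't blow up the context

def answer_about_codebase(question: str, repo_dir: str, ask_llm) -> str:
    pattern = ask_llm(f"Give one grep pattern (just the pattern) to locate code relevant to: {question}")
    hits = grep_repo(pattern.strip(), repo_dir)
    return ask_llm(f"Using these grep results:\n{hits}\n\nAnswer the question: {question}")
```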