Comment by bawana 14 days ago

But you still have to load the data for each request. And in an LLM, doesn't this mean the WHOLE KV cache, since the KV cache changes after every computation? So why isn't THIS the bottleneck? Gemini is talking about a context window of a million tokens; how big would the KV cache for this get?
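For scale, here is a rough back-of-envelope sketch in Python. The model shape below (80 layers, 8 KV heads with grouped-query attention, head dim 128, fp16) is an assumption for illustration; Gemini's actual architecture isn't public:

```python
# Back-of-envelope KV-cache size estimate (assumed model shape, not Gemini's).
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for the K and V tensors, one pair cached per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 70B-class model: 80 layers, 8 KV heads (GQA), head_dim 128, fp16.
size = kv_cache_bytes(seq_len=1_000_000, n_layers=80, n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB")  # ~305 GiB for a single 1M-token request
```

Even with grouped-query attention shrinking the KV heads, that's hundreds of GiB per request; with full multi-head attention (64 KV heads) the same formula gives roughly 2.4 TiB, so the cache really is a serious memory bottleneck at that context length.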