Comment by miven
> The key and value vectors are cached; that's kind of the whole point of autoregressive transformer models. The "state" not only survives within the KV cache but, in some sense, grows continuously with each token added, and is reused for each subsequent token.
Hmm, maybe I misunderstood that part, but so far I thought the KV cache was really just that: a cache. Since all the previous tokens of the sequence stay the same, it makes no sense to recompute the same K and V vectors in each decoding round.
But that doesn't change the fact that the only inputs to the Q, K, and V calculations are the tokens (or, in later layers, information derived from the tokens), and each vector in the cache maps directly to one input token.
So I think you could disable the cache and recompute everything in each round, and you'd still get the same result, just a lot slower.
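As a sanity check, here's a minimal sketch of that equivalence (everything here is made up for illustration: single-head attention, random projection weights `Wq`/`Wk`/`Wv`, no multi-head structure, layer norms, or positional encodings): incremental decoding with a KV cache yields the same attention outputs as recomputing every K and V vector from scratch at each step.

```python
import torch

torch.manual_seed(0)
d = 16  # model/head dimension (arbitrary for this sketch)
T = 8   # sequence length
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
x = torch.randn(T, d)  # stand-in token embeddings

def attend(q, K, V):
    # One query attending over all keys seen so far (causal by construction,
    # since K and V only ever contain positions <= t).
    scores = (q @ K.T) / d**0.5
    return torch.softmax(scores, dim=-1) @ V

# 1) Incremental decoding with a KV cache: append one K/V pair per token.
K_cache, V_cache, cached_out = [], [], []
for t in range(T):
    K_cache.append(x[t] @ Wk)
    V_cache.append(x[t] @ Wv)
    cached_out.append(attend(x[t] @ Wq, torch.stack(K_cache), torch.stack(V_cache)))

# 2) No cache: recompute all K and V from the tokens at every step.
nocache_out = []
for t in range(T):
    K = x[: t + 1] @ Wk  # same tokens -> same K/V in every round
    V = x[: t + 1] @ Wv
    nocache_out.append(attend(x[t] @ Wq, K, V))

print(torch.allclose(torch.stack(cached_out), torch.stack(nocache_out)))  # True
```

Since the K and V projections are pure functions of the (unchanging) prefix tokens, caching them is purely an optimization; the printed `True` confirms both paths produce identical outputs up to floating-point tolerance.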