Comment by miven a day ago

The key and value vectors are cached, and that's kind of the whole point of autoregressive transformer models: the "state" not only survives within the KV cache but, in some sense, grows continuously with each token added, and it is reused for every subsequent token.
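
Roughly, in toy single-head terms (made-up dimensions and random weights, not any real model's code), the decoding loop looks like this:

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    d = 16
    W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

    k_cache, v_cache = [], []                # grows by one entry per token

    def decode_step(x_new):
        # x_new: embedding of the newest token only, shape (d,)
        q = x_new @ W_q                      # query for the new token
        k_cache.append(x_new @ W_k)          # its key/value get appended...
        v_cache.append(x_new @ W_v)
        K, V = torch.stack(k_cache), torch.stack(v_cache)
        attn = F.softmax(q @ K.T / d ** 0.5, dim=-1)
        return attn @ V                      # ...and the whole cache is reused

    for t in range(5):
        decode_step(torch.randn(d))
        print(f"step {t}: cache holds {len(k_cache)} key/value pairs")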

xg15 a day ago

Hmm, maybe I misunderstood that part, but so far I thought the KV cache was really just that - a cache. Because all the previous tokens of the sequence stay the same, it makes no sense to compute the same K and V vectors again in each round.

But that doesn't change the fact that the only inputs to the Q, K, and V calculations are the tokens (or, in later layers, information derived from the tokens), and that each vector in the cache maps directly to an input token.

So I think you could disable the cache and recompute everything in each round and you'd still get the same result, just a lot slower.
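
That should be easy enough to check, assuming you have the transformers library and the small gpt2 checkpoint around (the prompt and token count here are just for illustration): greedy decoding with and without the cache should spit out the same tokens, just at very different speeds.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    inputs = tok("The KV cache is", return_tensors="pt")

    with torch.no_grad():
        fast = model.generate(**inputs, max_new_tokens=20,
                              do_sample=False, use_cache=True)
        slow = model.generate(**inputs, max_new_tokens=20,
                              do_sample=False, use_cache=False)

    # barring tiny floating-point nondeterminism, these should match exactly
    print(torch.equal(fast, slow))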

  • miven a day ago

    That's absolutely correct: the KV cache is just an optimization trick, and you could run the model without it; that's essentially how encoder-only transformers do it.

    I guess what I'm trying to convey is that the latent representations within a transformer are conditioned on all previous latents through attention. So, at least in principle, even though the old cache of course never changes, it does grow with each new token, which means the "state" can be brought up to date by being incorporated, in updated form, into the representations of subsequent tokens.
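
    Concretely (toy numbers, single head): the cached keys and values below never change, but every new query produces a fresh softmax-weighted mixture over that whole history, so the old state gets re-summarized into each new position.

        import torch
        import torch.nn.functional as F

        torch.manual_seed(0)
        d = 8
        K = torch.randn(6, d)          # keys of 6 already-cached tokens (frozen)
        V = torch.randn(6, d)          # their cached values (frozen)
        q_new = torch.randn(d)         # query of the token being generated now

        weights = F.softmax(q_new @ K.T / d ** 0.5, dim=-1)
        new_repr = weights @ V         # all of the past folded into this one vector
        print(weights)                 # how much the new token pulls from each old one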
