Comment by miven 2 days ago

That's absolutely correct: the KV cache is just an optimization trick. You could run the model without it, recomputing keys and values over the whole prefix at every step; that's effectively what encoder-only transformers do.
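A minimal sketch of that equivalence, using a toy single-head causal attention in NumPy (all names and shapes here are illustrative, not any particular library's API): decoding step by step with a KV cache gives the same outputs as recomputing keys and values over the full prefix each time.

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention for a single query vector.
    scores = q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max())
    w = w / w.sum()
    return w @ V

rng = np.random.default_rng(0)
d, T = 8, 5
X = rng.normal(size=(T, d))                       # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Without a cache: recompute K and V over the whole prefix at every step.
out_full = []
for t in range(T):
    prefix = X[: t + 1]
    out_full.append(attention(X[t] @ Wq, prefix @ Wk, prefix @ Wv))

# With a KV cache: append one new K/V row per step; old rows never change.
out_cached = []
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for t in range(T):
    K_cache = np.vstack([K_cache, (X[t] @ Wk)[None]])
    V_cache = np.vstack([V_cache, (X[t] @ Wv)[None]])
    out_cached.append(attention(X[t] @ Wq, K_cache, V_cache))

assert np.allclose(out_full, out_cached)  # identical either way
```

The cache only avoids redundant computation; it changes nothing about what each position attends to.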

I guess what I'm trying to convey is that the latent representations within a transformer are conditioned on all previous latents through attention. So, at least in principle, while the old cache entries of course never change, the cache grows with each new token, which means the "state" can still be brought up to date by being incorporated, in updated form, into subsequent tokens.