Comment by xg15
Yes, but all that internal state only survives until the end of the computation chain that predicts the next token; it doesn't persist across the entire sequence the way it would in a recurrent network.
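A toy PyTorch sketch of that contrast (the sizes and the `predict_next` helper are made up for illustration):

```python
import torch
import torch.nn as nn

# Recurrent network: the hidden state h is explicitly threaded
# through the whole sequence and can carry arbitrary information
# from one token to the next.
cell = nn.GRUCell(input_size=16, hidden_size=32)
h = torch.zeros(1, 32)
for x in torch.randn(5, 1, 16):  # five input steps
    h = cell(x, h)               # h survives across tokens

# Transformer-style prediction: each step is a fresh function of the
# visible token ids; per-layer activations are discarded after the
# step (KV caching aside, see below). `model` is hypothetical here.
def predict_next(model, token_ids):
    logits = model(token_ids)            # recomputed from the tokens
    return logits[:, -1].argmax(dim=-1)  # only a token id carries over
```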
There is literally no difference between a model predicting the tokens "<thought> I think the second choice looks best </thought>" and a user putting those tokens into the prompt: The input for the next round would be exactly the same.
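A quick way to convince yourself of this, sketched with Hugging Face transformers (GPT-2 is just a stand-in model, not one anybody in the thread named):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Options: A, B, C."
thought = " <thought> I think the second choice looks best </thought>"

# Whether `thought` was sampled by the model or typed by the user,
# the next round's input is the exact same sequence of token ids:
ids = tok(prompt + thought, return_tensors="pt").input_ids

with torch.no_grad():
    next_logits = model(ids).logits[:, -1]  # identical either way
```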
So the tokens kind of act like a bottleneck (or, more precisely, the sampling of exactly one next token at the end of each prediction round does). During the prediction of one token, the model can go crazy with hidden state, but not across several tokens. That forces the model to do "long-form" reasoning through the tokens and not through hidden state.
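Schematically, the bottleneck in a greedy decoding loop (again with transformers-style kwargs, as an illustrative sketch rather than anyone's actual implementation):

```python
import torch

@torch.no_grad()
def generate(model, ids, steps):
    for _ in range(steps):
        out = model(ids, output_hidden_states=True)
        # Within one step the model has rich internal state:
        # out.hidden_states holds a (batch, seq, hidden) tensor per layer.
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        # Across steps, only this one discrete token id survives;
        # all the hidden vectors behind it are thrown away.
        ids = torch.cat([ids, next_id], dim=-1)
    return ids
```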
The key and value vectors are cached; that's kind of the whole point of autoregressive transformer models. The "state" not only survives in the KV cache but, in some sense, grows continuously with each token added, and is reused for every subsequent token.
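Concretely, in the Hugging Face API that looks something like this (a sketch, with GPT-2 again as the stand-in):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The quick brown", return_tensors="pt").input_ids
past = None
with torch.no_grad():
    for _ in range(5):
        out = model(ids, past_key_values=past, use_cache=True)
        # The cached key/value tensors gain one position per token
        # and are reused by every subsequent prediction step.
        past = out.past_key_values
        ids = out.logits[:, -1].argmax(dim=-1, keepdim=True)  # feed only the new token
```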