Comment by phailhaus
Until we have world models, that is exactly what they are. They literally only understand text, and what text is likely to follow previous text. They are very good at this, because we've given them a metric ton of training data. Everything is "what does a response to this look like?"
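To make that concrete, here's a minimal sketch of the text-in, text-out loop. The names (`model`, `tokenizer`, `eos_id`) are hypothetical placeholders, not any particular library's API; the point is just that every token is picked by conditioning on the text produced so far.

    # Hypothetical sketch of autoregressive completion: the model only ever sees
    # a growing sequence of tokens and predicts the likeliest next one.
    def complete(model, tokenizer, prompt, max_new_tokens=256):
        tokens = tokenizer.encode(prompt)       # text in
        for _ in range(max_new_tokens):
            logits = model(tokens)              # scores over the vocabulary,
                                                # conditioned only on prior text
            next_token = logits.argmax()        # greedy pick of the likeliest token
            tokens.append(next_token)
            if next_token == tokenizer.eos_id:  # stop when the model emits end-of-sequence
                break
        return tokenizer.decode(tokens)         # text out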
This limitation is exactly why "reasoning models" work so well: they force the model to write its intermediate steps out as text, because if the "thinking" step is not persisted to text, it does not exist, and the LLM cannot act on it.
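A rough sketch of that idea, assuming the same kind of hypothetical text-in/text-out `generate` function as above (not a real API): the model has no hidden scratchpad between calls, so intermediate steps only influence the answer if they are written out and fed back in as part of the prompt.

    # Hypothetical sketch of chain-of-thought style "reasoning".
    def answer_with_reasoning(generate, question):
        # First pass: ask the model to spell out its working as plain text.
        thinking = generate(f"{question}\n\nThink step by step before answering:")

        # Second pass: the only way that working can affect the final answer is
        # by putting it back into the context. Unwritten "thoughts" don't exist.
        return generate(f"{question}\n\n{thinking}\n\nFinal answer:")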
Text comes in, text goes out, but there's a lot of complexity in the middle. It's not a "world model", but there's definitely modeling of the world going on inside.