Comment by famouswaffles 13 hours ago
Revealing emergent human-like conceptual representations from language prediction - https://www.pnas.org/doi/10.1073/pnas.2512514122
Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task - https://openreview.net/forum?id=DeG07_TcZvT
On the Biology of a Large Language Model - https://transformer-circuits.pub/2025/attribution-graphs/bio...
Emergent Introspective Awareness in Large Language Models - https://transformer-circuits.pub/2025/introspection/index.ht...
Thanks for that. I've read the two Lindsey papers before. I think these are all interesting, but they are also what used to be called "just-so stories". That is, they describe a way of understanding what the LLM is doing, but do not actually describe what the LLM is doing.
And this is OK and still quite interesting - we do it to ourselves all the time. Often it's the only way we have of understanding the world (or ourselves).
However, in the case of LLMs, which are tools that we have created from scratch, I think we can require a higher standard.
I don't personally think that any of these papers suggest that LLMs manipulate concepts. They do suggest that the internal representation after training is highly complex (superposition, in particular), and that when inputs are presented, it isn't unreasonable to talk about the observable behavior as if it involved represented concepts. It is a useful stance to take, similar to Dennett's intentional stance.
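To pin down what "superposition" means concretely, here's a minimal numpy sketch (my own toy illustration, not taken from any of the papers): many more candidate features than dimensions, packed into a single activation vector as non-orthogonal directions, and read back only approximately by projection.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, dim = 200, 50                 # 4x more features than dimensions

# Random unit vectors: nearly orthogonal in high dimensions, but not exactly.
directions = rng.standard_normal((n_features, dim))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse activation: only a few features are "on" at once.
true_vals = np.zeros(n_features)
active = rng.choice(n_features, size=5, replace=False)
true_vals[active] = rng.uniform(0.5, 1.5, size=5)

# Superpose: one 50-d vector encodes values for all 200 features.
hidden = true_vals @ directions

# Read each feature back by projecting onto its direction.
readout = directions @ hidden
for i in active:
    print(f"feature {i}: true {true_vals[i]:.2f}, read {readout[i]:.2f}")
# Active features read back close to their true values; inactive ones
# show small nonzero "interference" - the cost of the compression.
```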
However, while this may turn out to be how a lot of human cognition works, I don't think it is the significant part of what happens when we actively reason. Nor do I think it corresponds to what most people mean by "manipulate concepts".
The LLM, despite the presence of "features" that may correspond to human concepts, is relentlessly forward-driving: given these inputs, what is my output? Look at the description in the 3rd paper of the arithmetic example. This is not "manipulating concepts" - it's a trick that often gets to the right answer (just like many human tricks used for arithmetic, only somewhat less reliable). It is extremely different, however, from "rigorous" arithmetic - the stuff you learned when you were somewhere between the ages of 5 and 12, perhaps - that always gives the right answer and involves no pattern matching, no inference, no approximations. The same thing can be said, I think, about every other example in all 4 papers, to some degree or another.
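To make the contrast concrete, here's a toy Python sketch (my own caricature, loosely inspired by the paper's description of parallel approximate and ones-digit paths, not its actual mechanism): a "trick" adder that combines a coarse magnitude estimate with a memorized ones-digit pattern, next to the exact digit-by-digit algorithm. The trick is right most of the time; the algorithm is right every time.

```python
def heuristic_add(a: int, b: int) -> int:
    """'Trick' adder: coarse magnitude estimate + memorized ones digit.

    A caricature of the parallel approximate/exact paths described in the
    attribution-graphs paper - usually right, sometimes off by ten.
    """
    rough = round(a, -1) + round(b, -1)    # low-precision magnitude path
    ones = (a % 10 + b % 10) % 10          # memorized ones-digit pattern
    # Snap the rough estimate to the nearest value ending in `ones`.
    candidates = (rough - 10 + ones, rough + ones, rough + 10 + ones)
    return min(candidates, key=lambda c: abs(c - rough))


def exact_add(a: int, b: int) -> int:
    """Grade-school algorithm: digit-by-digit with carries.

    Always correct for non-negative ints; no estimation involved.
    """
    result, carry, place = 0, 0, 1
    while a or b or carry:
        s = a % 10 + b % 10 + carry
        result += (s % 10) * place
        carry, a, b, place = s // 10, a // 10, b // 10, place * 10
    return result


if __name__ == "__main__":
    import random
    trials = [(random.randint(0, 999), random.randint(0, 999))
              for _ in range(10_000)]
    hits = sum(heuristic_add(a, b) == a + b for a, b in trials)
    print(f"heuristic adder: {hits}/10000 correct")  # high, but below 100%
    assert all(exact_add(a, b) == a + b for a, b in trials)  # always correct
```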
What I do think is true (and very interesting) is that it seems somewhere between possible and likely that a lot more human cognition than we've previously suspected uses mechanisms similar to the ones these papers are uncovering/describing.