Comment by orbital-decay 3 days ago

I don't know, I've played with o1 and it seems obvious that it has the same issues as any other CoT: it has to be custom-tailored to the task to work effectively, which quickly turns into whack-a-mole (even CoT generators like Self-Discover run into the same limitation).

Starting from the strawberry example: it counts 3 "r"s in "strawbery" because the training makes it ignore spelling errors when they're not relevant to the conversation (which makes sense in an instruction-tuned model), and their CoT doesn't catch it because it's not specialized enough. Will this scale with more compute thrown at it? I'm not sure I believe their scaling numbers; the proof will be in the pudding.
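
For reference, the ground truth the model is missing is trivial to check with plain string ops, which have no tokenization to trip over:

    # Count "r"s in the correctly and incorrectly spelled word.
    # The claim above is that o1 silently "corrects" the typo
    # and answers 3 for both.
    for word in ("strawberry", "strawbery"):
        print(word, "->", word.count("r"))
    # strawberry -> 3
    # strawbery -> 2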

I've also had mixed results with coding in Python: it's barely better than Sonnet in my experience, but it wastes a lot more tokens, and those tokens are a lot more expensive.

They might have improved things and made a SotA CoT that works in tandem with their training method, but that is definitely not what they were originally hyping (some architecture-level mechanism, or at least something qualitatively different). It also pretty obviously still has limited compute time per token and has to fit the whole chain into the context window (which still suffers from the lost-in-the-middle issue, by the way). That puts a hard limit on expressivity and task complexity.
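
If you want to check the lost-in-the-middle point yourself, a rough needle-in-a-haystack probe is enough. A minimal sketch, assuming the official openai Python client with OPENAI_API_KEY set; the model name and passphrase are placeholders, not claims about any specific API identifier:

    # Bury one fact at varying depths in filler text and see whether
    # retrieval gets worse when the fact sits mid-context.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    FILLER = "The sky was grey and nothing much happened that day. " * 200
    NEEDLE = "The secret passphrase is 'kumquat-417'."

    def probe(depth: float, model: str = "o1-preview") -> str:
        """Insert NEEDLE at `depth` (0.0 = start, 1.0 = end) and query."""
        cut = int(len(FILLER) * depth)
        context = FILLER[:cut] + " " + NEEDLE + " " + FILLER[cut:]
        resp = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": context + "\n\nWhat is the secret passphrase?",
            }],
        )
        return resp.choices[0].message.content

    # Accuracy dipping around depth 0.5 is the lost-in-the-middle signature.
    for d in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"depth={d:.2f}: {probe(d)!r}")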