Comment by orbital-decay

Comment by orbital-decay a year ago

Scaling like what? Are there any comparisons with and without CoT, or with other models with their CoT? As far as I'm aware, their CoT part is secret. I'm sure the finetuning does some lifting, but I'm also sure the difference in a fair comparison won't be remotely as significant as it's being hyped currently.

This is still clearly CoT, with all its limitations and caveats as expected. That's an improvement, sure, but definitely not a qualitative leap like OAI is trying to present it. (in a really shady manner)

famouswaffles a year ago

CoT performance of any other SOTA quickly plateaus as tokens grow. It does not have anywhere near the same graph as o1's test time compute plot.

Saying it's just CoT is kind of meaningless. Even just looking at the examples on Open AI's blog and you quickly see no other model today can generate or utilize CoT to anywhere near that quality through prompting or naive fine-tuning.

Reply View 1 reply

orbital-decay a year ago

I don't know, I've played with o1 and it seems obvious that it has the same issues as any other CoT - it has to be custom tailored to the task to work effectively, which quickly turns into whack-a-mole (even CoT generators like Self-Discover still have the same limitation).
Starting from the strawberry example: it counts 3 "r"s in "strawbery", because the training makes it ignore grammatical errors if they're not relevant to the conversation (which makes sense in an instruction-tuned model) and their CoT doesn't catch it because it's not specialized enough. Will this scale with more compute thrown at it? I'm not sure I believe their scaling numbers. The proof should be in the pudding.
I've also had mixed results with coding in Python, it's barely better than Sonnet in my experience, but wastes a lot more tokens, which are a lot more expensive.
They might have improved things and made a SotA CoT that works in tandem with their training method, but that is definitely not what they were originally hyping (some architecture-level mechanism, or at least something qualitatively different). It also pretty obviously still has limited compute time per token and has to fit into the context (which is also still suffering from the lost-in-the-middle issue by the way). This puts the hard limit on the expressivity and task complexity.

Reply View | 0 replies