Comment by quantadev 4 days ago

OpenAI is of course going to claim what they've done is very special and hard to replicate. They're a for-profit company and they want to harm the competition any way they can.

If they were just doing prompt engineering and multiple inferences, they'd definitely want to keep that a competitive secret and send all the open-source devs off in random directions, or keep them guessing, rather than telling them which way to go to replicate Q-Star.

parineum 4 days ago

> and they want to harm the competition any way they can.

That's an incredibly cynical choice of phrasing.

Of course they don't want to help the competition; that's what competition is. The competition isn't helping OpenAI either.

  • quantadev 4 days ago

    It's not cynical to simply remind everyone who and what is motivating OpenAI (i.e. ClosedAI) at this point. They're no longer about helping the "AI community"; they're about holding things back from it. Like you said: "That's what competition is."

whimsicalism 4 days ago

Nobody has shown CoT scaling like this except DeepMind; it is very obviously a result of their alignment pipeline, not just prompting.

  • orbital-decay 4 days ago

    Scaling like what? Are there any comparisons with and without CoT, or against other models using their own CoT? As far as I'm aware, their CoT part is secret. I'm sure the fine-tuning does some lifting, but I'm also sure the difference in a fair comparison won't be remotely as significant as the current hype suggests.

    This is still clearly CoT, with all its limitations and caveats, as expected. That's an improvement, sure, but definitely not the qualitative leap OAI is trying to present it as (in a really shady manner).

    • og_kalu 3 days ago

      CoT performance of any other SOTA model quickly plateaus as tokens grow; nothing else has a curve anywhere near o1's test-time compute plot.

      Saying it's just CoT is kind of meaningless. Even just looking at the examples on OpenAI's blog, you quickly see that no other model today can generate or utilize CoT at anywhere near that quality through prompting or naive fine-tuning.

      • orbital-decay 3 days ago

        I don't know; I've played with o1 and it seems obvious that it has the same issues as any other CoT: it has to be custom-tailored to the task to work effectively, which quickly turns into whack-a-mole (even CoT generators like Self-Discover still have the same limitation).

        Starting from the strawberry example: it counts 3 "r"s in "strawbery", because the training makes it ignore spelling errors if they're not relevant to the conversation (which makes sense in an instruction-tuned model), and their CoT doesn't catch it because it's not specialized enough. Will this scale with more compute thrown at it? I'm not sure I believe their scaling numbers. The proof will be in the pudding.
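
        A trivial sketch of the literal count in plain Python (no model involved), just to pin down the discrepancy: answering three "r"s means the model has silently corrected the spelling before counting.

            print("strawbery".count("r"))   # 2 - the misspelled word actually in the prompt
            print("strawberry".count("r"))  # 3 - the corrected spelling the model answers for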

        I've also had mixed results with coding in Python; it's barely better than Sonnet in my experience, but it wastes a lot more tokens, which are also a lot more expensive.

        They might have improved things and made a SotA CoT that works in tandem with their training method, but that is definitely not what they were originally hyping (some architecture-level mechanism, or at least something qualitatively different). It also pretty obviously still has limited compute time per token and has to fit into the context (which is also still suffering from the lost-in-the-middle issue, by the way). This puts a hard limit on expressivity and task complexity.

  • quantadev 4 days ago

    For example, a team of GPT-3.5 agents can outperform GPT-4o. A single inference is essentially a chain reaction: once a set of tokens has been generated, the model is only predicting the next token as it builds an answer, and it can't revise or rethink what it has already written. CoT will always outperform the single-inference approach.
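
    A minimal sketch of that difference, assuming a hypothetical call_model() helper wrapping whatever chat-completion API you use (not OpenAI's actual pipeline): the second pass can re-read and correct the first draft, which a single left-to-right decode cannot.

        def call_model(prompt: str) -> str:
            """Hypothetical wrapper around an LLM chat-completion call."""
            raise NotImplementedError("plug in your provider's client here")

        def single_inference(question: str) -> str:
            # One forward pass: tokens are emitted left to right and never revisited.
            return call_model(question)

        def draft_then_revise(question: str) -> str:
            # Pass 1: produce a chain-of-thought draft.
            draft = call_model("Think step by step and answer:\n" + question)
            # Pass 2: a fresh inference that can inspect and correct the draft.
            return call_model(
                "Question:\n" + question + "\n\nDraft reasoning:\n" + draft +
                "\n\nCheck the draft for mistakes and give a corrected final answer."
            )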