thorum 5 days ago

o1’s innovation is not Chain-of-Thought. It’s teaching the model to do CoT well (from massive amounts of human feedback) instead of just pretending to. You’ll never get o1 performance just from prompt engineering.

visarga 4 days ago

> from massive amounts of human feedback

It might be the 200M user base of OpenAI that provided the necessary guidance for advanced CoT, implicitly. Every user chat session is also an opportunity for the model to get feedback and elicit experience from the user.

narrator 4 days ago

If the training data for these LLMs comes from humanity in general, and the model is trying to imitate humanity, wouldn't its IQ tend toward the average of all of humanity? Perhaps the only people who talk about STEM topics are people of higher IQ generally, along with a lot of poor students asking homework questions. So the way to get higher-IQ output is to critique the lower-IQ answers (which may be more numerous), rejecting their flaws in favor of the higher-IQ answers. That, or just train more heavily on textbooks and so forth. The hard part is figuring out how to reject errors, and maybe train on synthetic data generated without reasoning errors.

  • Meganet 4 days ago

    An LLM combines expertise from ALL experts.

    An LLM can therefore have a higher IQ, because it can combine all fields.

    Also, parameters and architecture might or might not be a limiting factor for us humans or for an LLM. But LLM parameter sizes, optimizations, etc. are just at the beginning.

    If we now have a good reasoning LLM, we can build more test data automatically: basically using the original content plus creating new content, which can then lead to new knowledge = research.

  • killerstorm 4 days ago

    No.

    Does Midjourney output look like an average human drawing?

    Obviously, OpenAI knows how to train a classifier...

    • cubefox 3 days ago

      > Does Midjourney output look like an average human drawing?

      No, perhaps because it's heavily trained on photos.

qudat 4 days ago

Do you actually know that's what's happening? The details were extremely murky the last time I read about it (a couple of days ago). For all we know, they are doing model routing and prompt engineering to get o1 to work.

logicchains 4 days ago

Maybe they didn't use a huge amount of human feedback. Where it excels is coding and math/logic, so they could have used compilers/unit tests to give it coding feedback, and a theorem prover like Lean for the math feedback.
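
A rough sketch of what that kind of automated signal could look like (purely illustrative, not OpenAI's actual pipeline): run the model's generated code against unit tests and turn the outcome into a reward.

    import os
    import subprocess
    import tempfile

    def code_reward(generated_code: str, test_code: str) -> float:
        """Binary reward for coding tasks: 1.0 if the model's code passes
        the given unit tests, else 0.0. Illustrative only."""
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "candidate.py")
            with open(path, "w") as f:
                f.write(generated_code + "\n\n" + test_code)
            try:
                # pytest exits with code 0 only when every test passes
                result = subprocess.run(
                    ["pytest", "-q", path],
                    capture_output=True, text=True, timeout=30,
                )
                return 1.0 if result.returncode == 0 else 0.0
            except subprocess.TimeoutExpired:
                return 0.0

A Lean proof checker could play the same role for math: reward only the completions whose proofs compile.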

quantadev 4 days ago

OpenAI is of course going to claim what they've done is very special and hard to replicate. They're a for-profit company and they want to harm the competition any way they can.

If they were just doing prompt engineering and multiple inferences, they'd definitely want to keep that a competitive secret and send all the open-source devs off in random directions, or keep them guessing, rather than telling them which way to go to replicate Q-Star.

  • parineum 4 days ago

    > and they want to harm the competition any way they can.

    That's an incredibly cynical choice of phrasing.

    Of course they don't want to help the competition; that's what competition is. The competition isn't helping OpenAI either.

    • quantadev 4 days ago

      It's not cynical to simply remind everyone who and what is motivating OpenAI (i.e. ClosedAI) at this point. They're no longer about helping the "AI community". They're about holding back from the community. Like you said: "That's what competition is."

  • whimsicalism 4 days ago

    Nobody has shown CoT scaling like this except DeepMind; it is very obviously a result of their alignment pipeline, not just prompting.

    • orbital-decay 4 days ago

      Scaling like what? Are there any comparisons with and without CoT, or against other models with their own CoT? As far as I'm aware, their CoT part is secret. I'm sure the fine-tuning does some lifting, but I'm also sure the difference in a fair comparison won't be remotely as significant as the current hype suggests.

      This is still clearly CoT, with all its limitations and caveats, as expected. That's an improvement, sure, but definitely not the qualitative leap OpenAI is trying (in a really shady manner) to present it as.

      • og_kalu 3 days ago

        CoT performance of any other SOTA model quickly plateaus as tokens grow; nothing has a curve anywhere near o1's test-time compute plot.

        Saying it's just CoT is kind of meaningless. Even just looking at the examples on OpenAI's blog, you can quickly see that no other model today can generate or utilize CoT of anywhere near that quality through prompting or naive fine-tuning.

        • orbital-decay 3 days ago

          I don't know. I've played with o1, and it seems obvious that it has the same issues as any other CoT: it has to be custom-tailored to the task to work effectively, which quickly turns into whack-a-mole (even CoT generators like Self-Discover still have the same limitation).

          Starting from the strawberry example: it counts 3 "r"s in "strawbery", because the training makes it ignore spelling mistakes when they're not relevant to the conversation (which makes sense in an instruction-tuned model), and their CoT doesn't catch it because it's not specialized enough. Will this scale with more compute thrown at it? I'm not sure I believe their scaling numbers; the proof should be in the pudding.
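
          (For reference, the literal string only contains two:)

              >>> "strawbery".count("r")
              2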

          I've also had mixed results with coding in Python: it's barely better than Sonnet in my experience, but it wastes a lot more tokens, which are also a lot more expensive.

          They might have improved things and made a SOTA CoT that works in tandem with their training method, but that is definitely not what they were originally hyping (some architecture-level mechanism, or at least something qualitatively different). It also pretty obviously still has limited compute time per token and has to fit into the context (which also still suffers from the lost-in-the-middle issue, by the way). That puts a hard limit on expressivity and task complexity.

    • quantadev 4 days ago

      For example, a team of GPT-3.5 agents can outperform GPT-4o. A single inference is essentially a chain reaction: as the model builds an answer, it only ever predicts the next token given the tokens already generated, and it can't revise or rethink. CoT will always outperform the single-inference approach.
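
      A minimal sketch of that draft-critique-revise idea (the prompts and the ask() helper are placeholders for whatever LLM API you call):

          def ask(prompt: str) -> str:
              """Placeholder for a single LLM inference (e.g. a GPT-3.5 call)."""
              raise NotImplementedError

          def answer_with_revision(question: str, rounds: int = 2) -> str:
              # first inference: produce a draft
              draft = ask(f"Answer the question:\n{question}")
              for _ in range(rounds):
                  # second inference: a "critic" agent looks for flaws
                  critique = ask(
                      f"Question: {question}\nDraft answer: {draft}\n"
                      "List any mistakes or gaps in the draft."
                  )
                  # third inference: revise using the critique
                  draft = ask(
                      f"Question: {question}\nDraft answer: {draft}\n"
                      f"Critique: {critique}\nWrite an improved answer."
                  )
              return draft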

Oras 4 days ago

Well, with Tree of Thought (ToT) and fine-tuned models, I'm sure you can achieve the same performance, with margin to improve as you identify the bottlenecks (rough sketch of the idea below).

I'm not convinced OpenAI is using one model. Look at the thinking process (UI), which takes time, and then suddenly, you have the output streamed out at high speed.

But even so, people are after results, not really the underlying technology. There is no difference between doing it with one model vs. multiple models.
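
A rough sketch of what a ToT-style search could look like (propose() and score() are placeholders for LLM calls; the depth and beam numbers are arbitrary):

    from typing import List, Tuple

    def propose(state: str, k: int = 3) -> List[str]:
        """Placeholder: ask an LLM for k candidate next 'thoughts' given the partial solution."""
        raise NotImplementedError

    def score(state: str) -> float:
        """Placeholder: ask an LLM (or a heuristic) how promising a partial solution looks."""
        raise NotImplementedError

    def tree_of_thought(problem: str, depth: int = 3, beam: int = 2) -> str:
        frontier: List[Tuple[float, str]] = [(0.0, problem)]
        for _ in range(depth):
            candidates = []
            for _, state in frontier:
                for thought in propose(state):
                    new_state = state + "\n" + thought
                    candidates.append((score(new_state), new_state))
            # keep only the `beam` highest-scoring partial solutions
            frontier = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam]
        return frontier[0][1]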

  • alach11 4 days ago

    > I'm not convinced OpenAI is using one model. Look at the thinking process (UI), which takes time, and then suddenly, you have the output streamed out at high speed.

    According to OpenAI, the model does its thinking behind the scenes, then at the end summarizes that thinking for the user. We don't get to see the original chain-of-thought reasoning, just the AI's own summary of that reasoning. That explains the output timing.

kristianp 5 days ago

Does o1 need some method to allow it to generate lengthy chains of thought, or does it just do it normally after being trained to do so?

If so, I imagine o1 clones could initially just be fine-tunes of Llama models.

  • astrange 4 days ago

    You need an extremely large amount of training data consisting of good CoTs. And there probably is some magic; we know LLMs aren't capable of self-reflection, and none of the other models are any good at iterating toward a better answer.

    Example prompt for that: "give me three sentences that end in 'is'."

hjaveed 3 days ago

Can you share any resources on teaching the model to do CoT? Their release blog doesn't document much.