Comment by Jianghong94 5 days ago

Yeah, or a lot of people can just fake progress by attaching whatever viral tag to their glue code. I mean, to start with, unless you do a bit of fine-tuning + RLHF, there's no way to get o1-like behavior.

arthurcolle 5 days ago

No, it's a lot more than RLHF. I think they figured out a way to have the LLM actively plot out scenario trajectories via context window manipulation, then use some kind of ad hoc reward-shaping mechanism to select the best path based on the user's profile, i.e. the scenario most likely to be "liked" (a context window state change up to some N tokens; it seems like they've been treating ~50k total as the sweet spot, minus the ~20k reasoning tokens).
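
For what it's worth, that reads like best-of-N sampling against a reward model. A minimal sketch of the idea, where best_of_n_rollouts, sample_fn, and reward_fn are all hypothetical stand-ins for whatever they actually run, not anything confirmed about o1:

    import math

    def best_of_n_rollouts(prompt, sample_fn, reward_fn,
                           n=8, reasoning_budget=20_000):
        # Sample n candidate reasoning trajectories, score each with a
        # (hypothetical) shaped reward model, keep the highest-scoring one.
        best_score, best_traj = -math.inf, None
        for _ in range(n):
            trajectory = sample_fn(prompt, max_tokens=reasoning_budget)
            score = reward_fn(prompt, trajectory)  # e.g. predicted "likability"
            if score > best_score:
                best_score, best_traj = score, trajectory
        return best_traj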

Also, I think they deliberately give you bad answers sometimes (a lot, over the last year) to build up extended chains where the user isn't getting what they want and has to explain why. I ended up with ten or so of these conversations where it takes about 100 messages before it gets the right answer, and it made me wonder whether they're using this.

just my rambles

  • hadeson 4 days ago

    I like the Tree of Thoughts theory that treats each chain-of-thought 'branch' as a possible hypothesis. They might have trained a search system that quickly explores some of these branches and, by some metric, picks the one most likely to be right at the moment as the answer.
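
    If that's right, it would look roughly like beam search over partial chains. A minimal sketch, assuming hypothetical expand_fn / score_fn helpers (nothing here is the actual Tree of Thoughts paper's or OpenAI's code):

        import heapq

        def tree_of_thoughts_search(problem, expand_fn, score_fn,
                                    beam_width=3, depth=4):
            # Breadth-limited search over chains of thought: expand_fn
            # proposes candidate next "thoughts" for a partial chain,
            # score_fn is the selection metric the comment speculates about.
            frontier = [[problem]]
            for _ in range(depth):
                candidates = [chain + [thought]
                              for chain in frontier
                              for thought in expand_fn(chain)]
                if not candidates:
                    break
                # keep only the beam_width most promising branches
                frontier = heapq.nlargest(beam_width, candidates, key=score_fn)
            return max(frontier, key=score_fn)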