Comment by bogtog

Comment by bogtog 3 days ago

The premise of this post and the one cited near the start (https://www.tobyord.com/writing/inefficiency-of-reinforcemen...) is that RL involves just 1 bit of learning for a rollout, rewarding success/failure.

However, the way I'm seeing this is that a RL rollout may involve, say, 100 small decisions out of a pool of 1,000 possible decisions. Each training step, will slightly upregulate/downregulate a given training step in the step's condition. There will be uncertainty about which decision was helpful/harmful -- we only have 1 bit of information after all -- but this setup where many steps are slowly learned across many examples seems like it would lend itself well to generalization (e.g., instead of 1 bit in one context, you get a hundred 0.01 bit insights across 100 contexts). There may be some benefits not captured by comparing the number of bits relative to pretraining.

As the blog says, "Fewer bits, sure, but very valuable bits", this also seems like a different factor that would also be true. Learning these small decisions may be vastly more valuable for producing accurate outputs than learning through pretraining.

refulgentis 3 days ago

Dwarkesh's blogging confuses me, because I am not sure if the message is free-associating, or, relaying information gathered.

ex. how this reads if it is free-associating: "shower thought: RL on LLMs is kinda just 'did it work or not?' and the answer is just 'yes or no', yes or no is a boolean, a boolean is 1 bit, then bring in information theory interpretation of that, therefore RL doesn't give nearly as much info as, like, a bunch of words in pretraining"

ex. how this reads if it is relaying information gathered: "A common problem across people at companies who speak honestly with me about the engineering side off the air is figuring out how to get more out of RL. The biggest wall currently is the cross product of RL training being slowww and lack of GPUs. More than one of them has shared with me that if you can crack the part where the model gets very little info out of one run, then the GPU problem goes away. You can't GPU your way out of how little info they get"

I am continuing to assume it is much more A than B, given your thorough sounding explanation and my prior that he's not shooting the shit about specific technical problems off-air with multiple grunts.

Reply View 6 replies

robrenaud 3 days ago

He is essentially expanding upon an idea made by Andrej Karpathy on his podcast about a month prior.
Karpathy says that basically "RL sucks" and that it's like "sucking bits of supervision through a straw".
https://x.com/dwarkesh_sp/status/1979259041013731752/mediaVi...

Reply View | 0 replies
bugglebeetle 3 days ago

Dwarkesh has a CS degree, but zero academic training or real world experience in deep learning, so all of his blogging is just secondhand bullshitting to further siphon off a veneer of expertise from his podcast guests.

Reply View | 4 replies
- vessenes 3 days ago
  
  So grumpy! Please pick up the torch and educate the world better; it can only help.
  
  Reply View | 3 replies
  
  refulgentis 3 days ago
  
  Better to be honest than say nothing, plenty of people say nothing. I asked a polite question thats near-impossible to answer without that level of honesty.
  
  Reply View | 1 reply
  
  vessenes 2 days ago
  
  I thought your question was great. I read the Dwarkesh post as scratch space for working out his thinking - so, closer to a shower thought. But also, an attempt to do what he’s really great at, which is distill and summarize at a “random engineer” level of complexity.
  You can kind of hear him pull in these extremely differing views on the future from very different sources, try and synthesize them, and also come out with some of his own perspective this year - I think it’s interesting. At the very least, his perspective is hyper-informed - he’s got fairly high-trust access to a lot of decision makers and senior researchers - and he’s smart and curious.
  This year we’ve had him bring in the 2027 folks (AI explosion on schedule), Hinton (LLMs are literally divorced from reality, and a total dead-end), both Ilya (we probably need emotions for super intelligence, also I won’t tell you my plan), Karpathy and Dario (Dario maybe twice?), Gwen, all with very very different perspectives on what’s coming and why.
  So, I think if you read him as one of the chroniclers of this era his own take is super interesting, and he’s in a position to be of great use precisely at synthesizing and (maybe) predicting; he should keep it up.
  
  Reply View | 0 replies
  
  bugglebeetle 3 days ago
  
  I teach and mentor lots of folks in my world. What I don’t do is feign expertise to rub shoulders with the people doing the actual work so I can soak money from rubes with ad rolls.
  
  Reply View | 0 replies

ACCount37 3 days ago

RL is very important - because while it's inefficient, and sucks at creating entirely new behaviors or features in LLMs, it excels at bringing existing features together and tuning them to perform well.

It's a bit like LLM glue. The glue isn't the main material - but it's the one that holds it all together.

Reply View 1 reply

elchananHaas 3 days ago

RL before LLMs can very much learn new behaviors. Take a look at AlphaGo for that. It can also learn to drive in simulated environments. RL in LLMs is not learning the same way, so it can't create it's own behaviors.

Reply View | 0 replies

macleginn 3 days ago

It is the same type of learning, fundamentally: increasing/decreasing token probabilities based on the left context. RL simply provides more training data from online sampling.

Reply View 0 replies