Comment by refulgentis 3 days ago
Dwarkesh's blogging confuses me, because I am not sure whether he is free-associating or relaying information he has gathered.
ex. how this reads if it is free-associating: "shower thought: RL on LLMs is kinda just 'did it work or not?' and the answer is just 'yes or no', yes or no is a boolean, a boolean is 1 bit, then bring in the information-theory interpretation of that, therefore RL doesn't give nearly as much info as, like, a bunch of words in pretraining"
or
ex. how this reads if it is relaying information gathered: "A common problem across people at companies who speak honestly with me about the engineering side off the air is figuring out how to get more out of RL. The biggest wall currently is the cross product of RL training being slowww and a lack of GPUs. More than one of them has shared with me that if you can crack the part where the model gets very little info out of one run, then the GPU problem goes away. You can't GPU your way out of how little info they get"
I am continuing to assume it is much more the former than the latter, given your thorough-sounding explanation and my prior that he's not shooting the shit about specific technical problems off-air with multiple grunts.
He is essentially expanding on an idea Andrej Karpathy raised on his podcast about a month prior.
Karpathy says that basically "RL sucks" and that it's like "sucking bits of supervision through a straw".
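The "straw" intuition can be made concrete with a back-of-envelope calculation. The numbers below (vocab size, document length) are illustrative assumptions of mine, not figures from Dwarkesh or Karpathy; the point is only the orders-of-magnitude gap between a per-token cross-entropy signal and a single scalar reward:

```python
import math

# Illustrative assumptions (not from the original discussion):
vocab_size = 50_000      # assumed tokenizer vocabulary
tokens_per_doc = 1_000   # assumed pretraining document length

# Upper bound on supervision from one pretraining document:
# each next-token target can carry up to log2(vocab) bits.
pretrain_bits = tokens_per_doc * math.log2(vocab_size)

# A binary pass/fail reward on an entire RL rollout is at most 1 bit,
# no matter how many tokens the rollout generated.
rl_bits = 1.0

print(f"pretraining: ~{pretrain_bits:,.0f} bits per 1,000-token document")
print(f"RL: {rl_bits:.0f} bit per whole episode")
```

Under these (generous to RL, since real reward signals are noisy) assumptions, one pretraining document supplies on the order of ten thousand times more bits of supervision than one verified RL episode.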
https://x.com/dwarkesh_sp/status/1979259041013731752/mediaVi...