Comment by derbOac 3 days ago

There are some insights there about the base rate of correct responses and how pretraining boosts it. Basically, it's a question of searching a suboptimal versus an optimal region of the model space, at a suboptimal versus an optimal rate.

I think the framing of the discussion in general is kind of misleading though, because it sidesteps the question of "information-inefficient about what?"

In RL, the model is becoming more informative about a stimulus-action-feedback space; in SL the model is becoming more informative about a stimulus-feedback space. RL is effectively "built for" searching a larger space.

In setups like the essay's, where you're directly comparing SL and RL, you're effectively saying for RL "the action space is restricted to dictionary X and the feedback space is a binary yes or no", and for SL "the feedback space is restricted to dictionary X". So in a certain sense you're equating the RL action space with the SL feedback space.
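
That equating is easy to make concrete. Here's a toy sketch, assuming a PyTorch-style model over a fixed dictionary of K responses (names like `logits` and `y` are made up for illustration, not from the essay): RL as REINFORCE with a binary reward, next to the corresponding SL cross-entropy loss.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy setup: a "dictionary" of K possible responses,
# a model emitting logits over them, and one correct answer y.
K = 10
logits = torch.randn(1, K, requires_grad=True)
y = torch.tensor([3])  # index of the single correct response

# SL: the feedback *is* the correct response; update log p(y|x) directly.
sl_loss = F.cross_entropy(logits, y)

# RL (REINFORCE, binary reward): sample an action from the policy,
# get reward 1 iff it matches y, and reinforce the sampled action.
probs = F.softmax(logits, dim=-1)
action = torch.multinomial(probs, 1).squeeze(-1)  # shape (1,)
reward = (action == y).float()
rl_loss = -(reward * F.log_softmax(logits, dim=-1)[0, action]).mean()

# When the sampled action happens to be correct, the RL gradient equals
# the SL gradient; when it's wrong, the binary reward zeroes the update.
# That zeroing is where the information inefficiency shows up.
```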

In that case, maybe searching over suboptimal regions of that shared RL-action/SL-feedback space is inefficient. But the reason RL exists, I think, is that it generalizes to situations where the feedback and action spaces are bigger. Maybe you want to differentially associate different responses with different rewards, or sample from a response space so large that you can't define it a priori. Then SL breaks down?
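
A sketch of that breakdown, under the same hypothetical toy setup: once rewards are graded per response rather than a single correct answer, there's no one supervised target to regress onto, but policy-gradient updates still go through.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: K possible responses with graded, per-response
# rewards (partial credit, preference scores, ...) instead of one answer.
K = 10
logits = torch.randn(1, K, requires_grad=True)
rewards = torch.rand(K)

# REINFORCE only needs reward evaluations at sampled responses; it never
# needs the reward function written out as a target distribution.
samples = torch.multinomial(F.softmax(logits, dim=-1), 5,
                            replacement=True).squeeze(0)  # shape (5,)
logp = F.log_softmax(logits, dim=-1)[0, samples]
pg_loss = -(rewards[samples] * logp).mean()
pg_loss.backward()

# SL would need the "right" answer (or a full target distribution) up
# front; here the model only queries the reward at points it samples,
# which also scales to response spaces too large to enumerate.
```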

Maybe this is obvious, but I get a little uneasy talking about the information efficiency of RL and SL without a broader framework for when the two are equivalent and for what information the model is representing in each case. It seems to me RL is a kind of superset of SL in terms of what it's capable of representing, which maybe leads to inefficiencies when it isn't being used to its fullest.