Comment by pixelpoet
Probably a dumb/obvious question: if the bias comes from selecting the Q-maximum, can't this simply be replaced by sampling from a PDF?
This is correct. I will add that sampling from the distribution thereafter is equivalent to on-policy learning.
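Concretely, replacing the max with a sampled action turns the standard off-policy Q-learning update into the on-policy SARSA update (textbook forms, written here for reference):

    Q(S,A) \leftarrow Q(S,A) + \alpha \left[ R + \gamma \max_a Q(S',a) - Q(S,A) \right]                      % Q-learning (off-policy)
    Q(S,A) \leftarrow Q(S,A) + \alpha \left[ R + \gamma Q(S',A') - Q(S,A) \right],\ A' \sim \pi(\cdot \mid S')  % SARSA (on-policy)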
I think this means you're no longer optimizing for the right thing. My understanding is that the max is important because the value of taking action A from state S includes the presumption that you'll pick the best available action from each successive state you visit. If you swap in some PDF, you're instead asking: what's the value of taking action A if I then act by randomly sampling from that distribution thereafter?
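A minimal sketch of the two bootstrap targets being contrasted, assuming a toy tabular Q and a softmax sampling distribution (all names and values here are illustrative, not from the thread):

    import numpy as np

    # Toy setup: Q is a small tabular value function, gamma the discount.
    rng = np.random.default_rng(0)
    Q = rng.normal(size=(5, 3))  # 5 states, 3 actions (illustrative)
    gamma = 0.99

    def softmax(x, temp=1.0):
        z = (x - x.max()) / temp  # shift for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def q_learning_target(r, s_next):
        # Off-policy: presumes greedy action selection from s_next onward.
        return r + gamma * Q[s_next].max()

    def sampled_target(r, s_next, temp=1.0):
        # On-policy style: bootstrap on an action sampled from a softmax
        # over Q-values, i.e. the value of following that distribution
        # thereafter rather than acting greedily.
        pi = softmax(Q[s_next], temp)
        a_next = rng.choice(len(pi), p=pi)
        return r + gamma * Q[s_next, a_next]

    print(q_learning_target(1.0, 2), sampled_target(1.0, 2))

The sampled target is an unbiased estimate of the value of the sampling policy, which avoids the maximization bias but also means the estimate no longer tracks the greedy policy's value.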