Comment by pixelpoet
Probably a dumb/obvious question: if the bias comes from selecting the Q-maximum, can't this simply be replaced by sampling from a PDF?
This is correct. I will add that sampling from the distribution thereafter is equivalent to on-policy learning.
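Concretely, replacing the max with a sampled action turns the standard off-policy Q-learning update into the on-policy SARSA update (textbook forms, written here for reference):

    Q(S,A) \leftarrow Q(S,A) + \alpha \left[ R + \gamma \max_a Q(S',a) - Q(S,A) \right]                      % Q-learning (off-policy)
    Q(S,A) \leftarrow Q(S,A) + \alpha \left[ R + \gamma Q(S',A') - Q(S,A) \right],\ A' \sim \pi(\cdot \mid S')  % SARSA (on-policy)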
I think this means you're no longer optimizing for the right thing. My understanding is that the max is important because the value of taking action A from state S includes the presumption that you'll pick the best available action from each successive state you visit. If you swap in some PDF, you're instead asking: what's the value of taking action A if I then act by randomly sampling from that distribution thereafter?
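A minimal sketch of the two bootstrap targets being contrasted, assuming a toy tabular Q and a softmax sampling distribution (all names and values here are illustrative, not from the thread):

    import numpy as np

    # Toy setup: Q is a small tabular value function, gamma the discount.
    rng = np.random.default_rng(0)
    Q = rng.normal(size=(5, 3))  # 5 states, 3 actions (illustrative)
    gamma = 0.99

    def softmax(x, temp=1.0):
        z = (x - x.max()) / temp  # shift for numerical stability
        e = np.exp(z)
        return e / e.sum()

    def q_learning_target(r, s_next):
        # Off-policy: presumes greedy action selection from s_next onward.
        return r + gamma * Q[s_next].max()

    def sampled_target(r, s_next, temp=1.0):
        # On-policy style: bootstrap on an action sampled from a softmax
        # over Q-values, i.e. the value of following that distribution
        # thereafter rather than acting greedily.
        pi = softmax(Q[s_next], temp)
        a_next = rng.choice(len(pi), p=pi)
        return r + gamma * Q[s_next, a_next]

    print(q_learning_target(1.0, 2), sampled_target(1.0, 2))

The sampled target is an unbiased estimate of the value of the sampling policy, which avoids the maximization bias but also means the estimate no longer tracks the greedy policy's value.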