abeppu a day ago

I think this means you're no longer optimizing for the right thing. My understanding is that the max is important because the value of taking action A from state S includes the presumption that you'll pick the best available action from each successive state you visit. If you swap in some probability distribution, you're instead asking: what's the value of taking action A if I then act by randomly sampling from that distribution thereafter?
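
The distinction can be sketched with a toy one-step backup (all numbers and the policy here are hypothetical, just to show the two targets side by side): the Q-learning target uses the max over next-state action values, while swapping in a distribution pi gives the value of sampling from pi thereafter (in expectation, the Expected SARSA / on-policy target).

```python
import numpy as np

# Hypothetical Q-values for 3 actions in the next state s'.
q_next = np.array([1.0, 2.0, 0.5])
reward, gamma = 1.0, 0.9

# Q-learning (off-policy) target: presumes greedy action selection from s' onward.
target_max = reward + gamma * q_next.max()          # 1 + 0.9 * 2.0 = 2.8

# Swapping in a distribution pi over actions: the target is the expected
# value of sampling actions from pi thereafter.
pi = np.array([0.2, 0.6, 0.2])                      # hypothetical policy distribution
target_sampled = reward + gamma * (pi * q_next).sum()  # 1 + 0.9 * 1.5 = 2.35

print(target_max, target_sampled)
```

Unless pi puts all its mass on the argmax action, the sampled target is strictly lower, which is why the resulting Q-values no longer estimate the optimal value function.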

  • elchananHaas 16 hours ago

    This is correct. I will add that sampling from the distribution thereafter is equivalent to on-policy learning.