Comment by abeppu
The thing that worked for them was reducing the horizon. From my limited and dated understanding, I thought that's what the gamma term was for: exponentially you discount the value of stuff in the future to the point of being negligible (or less than the epsilon of differences you can even represent). So ... why/when is exponential discounting not enough?
It's touched on in the article, but for really long-horizon tasks you may not want a small discount factor, i.e. you need gamma very close to 1.
For example, if you have really sparse rewards in a long-horizon task (say, a reward that arrives 1000 timesteps after the action), then even a discount factor of 0.99 won't capture it: 0.99^1000 ≈ 4.3e-5.
Essentially, if your discount factor is too small for an environment, it becomes nearly impossible to learn the corresponding credit assignment.
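
To make that concrete, here's a minimal sketch (my own illustration, not from the article) of how much a delayed reward contributes to the return under exponential discounting. The function name and the choice of reward/delay values are just for the example.

```python
# Sketch: how much a reward received `delay` steps later contributes to the
# discounted return at the time of the action, i.e. gamma**delay * reward.

def discounted_contribution(reward: float, delay: int, gamma: float) -> float:
    """Present value of a reward arriving `delay` timesteps in the future."""
    return (gamma ** delay) * reward

if __name__ == "__main__":
    for gamma in (0.9, 0.99, 0.999):
        # A reward of 1.0 that arrives 1000 timesteps after the action.
        v = discounted_contribution(reward=1.0, delay=1000, gamma=gamma)
        print(f"gamma={gamma}: contribution ≈ {v:.2e}")
    # gamma=0.99 gives ≈ 4.3e-05 -- a learning signal that small is easily
    # swamped by noise or by the precision of the value-function approximator.
```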