Comment by munro
Here's an interesting write-up comparing various algorithms and different epsilon-greedy exploration percentages:
https://github.com/raffg/multi_armed_bandit
It shows that 10% exploration performs the best; it's a simple algorithm that works really well.
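For reference, here's a minimal epsilon-greedy sketch (my own illustration, not the repo's code), assuming Bernoulli-style rewards and a fixed 10% exploration rate:

```python
import random

def epsilon_greedy(values, epsilon=0.1):
    """Pick an arm: explore with probability epsilon, otherwise exploit the best mean."""
    if random.random() < epsilon:
        return random.randrange(len(values))                   # explore: random arm
    return max(range(len(values)), key=lambda i: values[i])    # exploit: best observed mean

def update(counts, values, arm, reward):
    """Incrementally update the running mean reward for the chosen arm."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]
```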
It also shows the Thompson Sampling algorithm converges a bit faster: the best arm is chosen by sampling from each arm's beta distribution, which eliminates the explicit explore phase. And you can use the built-in random.betavariate!
https://github.com/raffg/multi_armed_bandit/blob/42b7377541c...
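A rough sketch of that idea (again my own illustration, assuming Bernoulli rewards and made-up click rates, not the repo's exact implementation):

```python
import random

def thompson_sample(successes, failures):
    """Choose the arm with the highest draw from its Beta(successes+1, failures+1) posterior."""
    samples = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])

# hypothetical usage with 3 arms and invented true rates
true_rates = [0.05, 0.10, 0.15]
successes = [0, 0, 0]
failures = [0, 0, 0]
for _ in range(10_000):
    arm = thompson_sample(successes, failures)
    if random.random() < true_rates[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1
```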