Comment by munro
Here's an interesting write-up comparing various algorithms and different epsilon-greedy exploration percentages:
https://github.com/raffg/multi_armed_bandit
It shows that 10% exploration performs the best; it's a simple algorithm that works really well.
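For reference, here's a minimal epsilon-greedy sketch (my own illustration, not the repo's code), assuming Bernoulli-style rewards and a fixed 10% exploration rate:

```python
import random

def epsilon_greedy(values, epsilon=0.1):
    """Pick an arm: explore with probability epsilon, otherwise exploit the best mean."""
    if random.random() < epsilon:
        return random.randrange(len(values))                   # explore: random arm
    return max(range(len(values)), key=lambda i: values[i])    # exploit: best observed mean

def update(counts, values, arm, reward):
    """Incrementally update the running mean reward for the chosen arm."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]
```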
It also shows the Thompson Sampling algorithm converges a bit faster: the best arm is chosen by sampling from each arm's beta distribution, which eliminates the explicit explore phase. And you can use the built-in random.betavariate!
https://github.com/raffg/multi_armed_bandit/blob/42b7377541c...
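A rough sketch of that idea (again my own illustration, assuming Bernoulli rewards and made-up click rates, not the repo's exact implementation):

```python
import random

def thompson_sample(successes, failures):
    """Choose the arm with the highest draw from its Beta(successes+1, failures+1) posterior."""
    samples = [random.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])

# hypothetical usage with 3 arms and invented true rates
true_rates = [0.05, 0.10, 0.15]
successes = [0, 0, 0]
failures = [0, 0, 0]
for _ in range(10_000):
    arm = thompson_sample(successes, failures)
    if random.random() < true_rates[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1
```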