Comment by cauch
Another way of seeing the situation: let your MAB solution run for a while. Suppose orange has been tested 17 times and blue has been tested 12 times. This is exactly equivalent to an A/B test where you show the orange button to 17 people and the blue button to 12 people.
The trick is to find the right number of tests for each color so that you get good statistical significance. MAB does not do that well, because you cannot easily force testing of an option that scored badly before it got enough trials to be statistically meaningful. Imagine you have 10 colors and orange first scores 0/1: it will take a very long time before that color is re-tested in any significant way, since you first need to fall into the 10% exploration branch, and then you still have only ~10% chance of randomly picking this color rather than one of the others (see the sketch below). With A/B testing, you can do a power analysis beforehand (or at any point during the test) to know when to stop.
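To make those numbers concrete, here is a minimal sketch, assuming epsilon-greedy with epsilon = 0.1 as the MAB strategy and made-up reward rates (neither is specified in the comment). A non-greedy arm is only picked with probability 0.1 × 1/10 = 1% per step, so it takes roughly 100 steps on average before orange gets a second trial:

```python
# Hypothetical epsilon-greedy simulation: how long before a 0/1 arm is re-tested?
import random

def steps_until_retest(n_arms=10, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Arm 0 ("orange") starts with an observed 0/1; the others start at 1/1.
    successes = [0] + [1] * (n_arms - 1)
    trials = [1] * n_arms
    step = 0
    while True:
        step += 1
        if rng.random() < epsilon:                       # explore: uniform random arm
            arm = rng.randrange(n_arms)
        else:                                            # exploit: best empirical mean
            arm = max(range(n_arms), key=lambda a: successes[a] / trials[a])
        if arm == 0:
            return step                                  # first time orange is re-tested
        trials[arm] += 1
        successes[arm] += rng.random() < 0.5             # assumed 50% reward rate

runs = [steps_until_retest(seed=s) for s in range(1000)]
print(f"mean steps before orange is re-tested: {sum(runs) / len(runs):.0f}")  # ~100
```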
The literature does not start with "just do A/B testing" because it is not the same problem. In MAB, your goal is not to demonstrate that one option is bad; it is to make your own decision when faced with a fixed situation.
> The trick is to find the exact best number of test for each color so that we have good statistical significance
Yes, A/B testing will force through enough trials to get statistical significance (it is definitely an "exploration first" strategy), but in many cases you care about maximizing reward as well, in particular during testing. A/B testing does very poorly at balancing exploration with exploitation in general.
This is especially true if the situation is dynamic. Will you A/B test forever in case something has changed, and accept that long-term loss in reward?
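A rough sketch of the exploitation cost, with assumed conversion rates (5% vs 10%) and Thompson sampling standing in for the MAB side; none of these numbers come from the thread. A fixed 50/50 split keeps sending half the traffic to the worse arm for the whole test, while Thompson sampling shifts traffic toward the better arm early:

```python
# Hypothetical comparison of reward earned *during* testing:
# fixed 50/50 A/B split vs. Thompson sampling on two Bernoulli arms.
import random

def run(policy, p=(0.05, 0.10), horizon=10_000, seed=0):
    rng = random.Random(seed)
    wins = [1, 1]; losses = [1, 1]                       # Beta(1, 1) priors
    reward = 0
    for t in range(horizon):
        if policy == "ab":                                # uniform 50/50 split
            arm = t % 2
        else:                                             # Thompson sampling
            samples = [rng.betavariate(wins[a], losses[a]) for a in (0, 1)]
            arm = samples.index(max(samples))
        r = rng.random() < p[arm]                         # Bernoulli reward
        reward += r
        wins[arm] += r; losses[arm] += 1 - r
    return reward

print("A/B reward:     ", run("ab"))                      # ~750 conversions expected
print("Thompson reward:", run("thompson"))                # closer to the ~1000 ceiling
```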