Comment by jbentley1
Multi-armed bandits make a big assumption: that effectiveness is static over time. If the bandit tips traffic slightly towards option B at a moment when effectiveness happens to be higher (maybe a sale just started), B will start to look overwhelmingly like the winner and can get locked into that state.
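To make that concrete, here is a toy simulation of the lock-in, assuming a plain Beta-Bernoulli Thompson-sampling bandit over two arms; the conversion rates, step counts, and the "sale" window are all made up for illustration:

```python
# Toy illustration of the failure mode above: arm B's true rate spikes
# temporarily (a "sale"), the bandit shifts traffic to B, and the stale
# spike-period evidence keeps B looking like the winner afterwards.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = np.ones(2), np.ones(2)        # Beta(1, 1) posteriors for A, B
post_spike_pulls = np.zeros(2)

for t in range(4_000):
    # B converts at 10% during the sale (first 2,000 steps), 3% after;
    # A converts at a steady 5%.
    p_true = np.array([0.05, 0.10 if t < 2_000 else 0.03])
    arm = int(np.argmax(rng.beta(alpha, beta)))   # Thompson sample
    reward = rng.random() < p_true[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward
    if t >= 2_000:
        post_spike_pulls[arm] += 1

print("traffic after the spike ended (A, B):", post_spike_pulls)
print("posterior means (A, B):", (alpha / (alpha + beta)).round(3))
# With these numbers B typically keeps most of the post-spike traffic and a
# higher posterior mean, even though its true rate has dropped below A's.
```

Because nearly all of the post-spike evidence is collected on B, A's estimate barely updates, and the data gathered during the spike keeps B's posterior looking best long after its real advantage is gone.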
You can solve this with propensity scores, but it is more complicated to implement and you need to log every interaction.
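A minimal sketch of what the propensity-score fix involves, assuming each logged interaction stores the arm shown, the reward, and the probability (propensity) with which the bandit chose that arm at the time; the record layout here is illustrative:

```python
from collections import defaultdict
import random

def ips_estimates(log):
    """Inverse-propensity-scored reward estimate per arm.

    Each record's reward is weighted by 1 / propensity, and the sum is
    divided by the total number of logged interactions (not the number of
    times that arm was shown), which corrects for the bandit having sent
    uneven traffic to the arms.
    """
    totals = defaultdict(float)
    n = len(log)
    for rec in log:
        totals[rec["arm"]] += rec["reward"] / rec["propensity"]
    return {arm: s / n for arm, s in totals.items()}

# Tiny synthetic log: the bandit sent 80% of traffic to B, whose true
# rate is 3%, and 20% to A, whose true rate is 5%.
random.seed(0)
log = []
for _ in range(10_000):
    if random.random() < 0.8:
        log.append({"arm": "B", "propensity": 0.8,
                    "reward": int(random.random() < 0.03)})
    else:
        log.append({"arm": "A", "propensity": 0.2,
                    "reward": int(random.random() < 0.05)})

print(ips_estimates(log))  # roughly {'B': 0.03, 'A': 0.05} despite the skew
```

The reweighting undoes the bandit's skewed allocation within any time window, but the estimate for a rarely shown arm has high variance, which is part of why this is more complicated in practice and why you need every interaction logged with its propensity.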
This objection is mentioned specifically in the post.
You can add a forgetting factor that down-weights older results.
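For what that can look like in code, here is a sketch of a Beta-Bernoulli Thompson-sampling bandit with an exponential forgetting factor; the class name and gamma value are illustrative, not from the post:

```python
import numpy as np

class DiscountedThompson:
    """Beta-Bernoulli Thompson sampling with an exponential forgetting factor."""

    def __init__(self, n_arms, gamma=0.999, seed=0):
        self.gamma = gamma                # per-step decay of old evidence
        self.alpha = np.ones(n_arms)      # decayed success counts + prior
        self.beta = np.ones(n_arms)       # decayed failure counts + prior
        self.rng = np.random.default_rng(seed)

    def choose(self):
        # Sample a plausible rate for each arm and play the best sample.
        return int(np.argmax(self.rng.beta(self.alpha, self.beta)))

    def update(self, arm, reward):
        # Shrink all past counts toward the Beta(1, 1) prior, then add the
        # new observation, so data older than roughly 1 / (1 - gamma)
        # steps carries little weight and the posterior can track drift.
        self.alpha = 1 + self.gamma * (self.alpha - 1)
        self.beta = 1 + self.gamma * (self.beta - 1)
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward
```

Smaller gamma forgets faster, so the bandit can react to shifts like the end of a sale, at the cost of throwing away statistical power when nothing is actually changing.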