Comment by sarpdag
I really like the multi-armed bandit approach, but it struggles with common scenarios involving delayed rewards or multiple success criteria, such as testing ecommerce search with number of orders and GMV guardrails.
For simple, immediate-feedback cases like button clicks, the specific implementation becomes less critical.
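To make the immediate-feedback case concrete, here is a minimal sketch, assuming Bernoulli click rewards and Thompson sampling with uniform Beta(1, 1) priors. The function name `thompson_sampling` and the click rates are made up for illustration; this is one common bandit strategy, not a specific implementation discussed here.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate a Beta-Bernoulli Thompson sampling bandit.

    true_rates: hypothetical click probability of each variant (unknown in practice).
    Returns (pulls, successes) per arm.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms

    for _ in range(rounds):
        # Draw one plausible click rate per arm from its Beta posterior.
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(n_arms)
        ]
        # Show the variant whose sampled rate is highest.
        arm = max(range(n_arms), key=lambda i: samples[i])

        # The reward arrives immediately: the user clicked or did not.
        clicked = rng.random() < true_rates[arm]
        if clicked:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1

    return pulls, successes


if __name__ == "__main__":
    pulls, successes = thompson_sampling([0.05, 0.5], rounds=2000)
    print("pulls per arm:", pulls)
    print("clicks per arm:", successes)
```

The posterior update happens in the same loop iteration as the pull, which is exactly what breaks down when the reward (an order, say) arrives hours or days later.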
It’s best for immediate rewards. If you have delayed rewards, there is a paper on sampling from the “delay distribution” that solves this.