Comment by sweezyjeezy 5 days ago
One of the assumptions of vanilla multi-armed bandits is that the underlying reward rates are fixed. That assumption isn't valid in a lot of cases, including e-commerce. The author is dismissive and hand-wavy about this, and having worked in e-commerce SaaS I'd be a bit more cautious.
Imagine that you are running MAB on a website with a control/treatment variant. After a bit you end up sampling the treatment a little more, say 60/40. You now start running a sale, and the conversion rate for both sides goes up equally. But since you are now sampling more from the treatment variant, its aggregate conversion rate goes up faster than the control's, so you start weighting even more towards that variant.
Fluctuating reward rates are everywhere in e-commerce, and they tend to destabilise MAB proportions even on two identical variants; they can even cause the bandit to lean towards the wrong one. There are more sophisticated MAB approaches that drop the fixed reward-rate assumption, but they have to model a lot more uncertainty, and so they optimise more conservatively.
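The feedback loop described above can be sketched with a toy simulation. This assumes an epsilon-greedy policy acting on all-time aggregate conversion rates (one simple MAB policy, not necessarily what any particular vendor ships), and all the rates and traffic splits below are made-up illustrative numbers:

```python
import random

def simulate(seed=0, horizon=10_000, eps=0.05):
    """Epsilon-greedy on aggregate conversion rates, two IDENTICAL arms.

    Warm start: historical traffic split 40/60 (control/treatment) at a
    ~5% base rate, with the treatment's empirical mean a hair higher --
    the 60/40 situation the comment describes. Then a "sale" raises BOTH
    true rates equally, to 10%. All numbers are illustrative assumptions.
    """
    rng = random.Random(seed)
    sale_rate = 0.10              # true rate for BOTH arms during the sale
    pulls = [4000, 6000]          # historical pulls: [control, treatment]
    wins = [200, 302]             # ~5% each; treatment fractionally ahead
    new_pulls = [0, 0]            # traffic allocated during the sale
    for _ in range(horizon):
        if rng.random() < eps:    # small exploration floor
            arm = rng.randrange(2)
        else:                     # greedy on the ALL-TIME empirical mean
            arm = 0 if wins[0] / pulls[0] > wins[1] / pulls[1] else 1
        pulls[arm] += 1
        new_pulls[arm] += 1
        wins[arm] += rng.random() < sale_rate
    rates = [wins[i] / pulls[i] for i in range(2)]
    return new_pulls, rates

new_pulls, rates = simulate()
print("sale-period traffic (control, treatment):", new_pulls)
print("aggregate conversion rates:", [round(r, 3) for r in rates])
```

The treatment receives nearly all of the sale traffic, so its aggregate rate climbs quickly toward 10%, while the starved control's aggregate stays frozen near its stale 5%, and the gap that "justifies" the skewed allocation keeps widening, even though the two arms are identical by construction.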
> ...the conversion rate for both sides goes up equally.
If the conversion rate "goes up equally", why did you not measure this and use that as a basis for your decisions?
> its aggregate conversion rate goes up faster than the control - you start weighting even more towards that variant.
This simply sounds like bad math. Wouldn't this kill most experiments that start the variant at 10% and do not provide a 10x improvement?