Comment by taion 6 days ago
The problem with this approach is that it requires the system doing randomization to be aware of the rewards. That doesn't make a lot of sense architecturally – the rewards you care about often relate to how the user engages with your product, and you would generally expect those to be collected via some offline analytics system that is disjoint from your online serving system.
Additionally, doing randomization on a per-request basis heavily limits the kinds of user behaviors you can observe. Often you want to consistently assign the same user to the same condition to observe long-term changes in user behavior.
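A minimal sketch of what consistent assignment can look like, assuming deterministic hashing of a user ID into experiment buckets (the function and experiment names here are hypothetical, not from the comment):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants: list[str]) -> str:
    """Hash the user/experiment pair so the same user always lands
    in the same condition across requests."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Same user, same experiment -> same variant on every request.
assert assign_variant("user-123", "pricing-v2", ["control", "treatment"]) == \
       assign_variant("user-123", "pricing-v2", ["control", "treatment"])
```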
This approach is pretty clever on paper, but it's a poor fit both for how experimentation works in practice and for how these systems are typically designed.
I don't know, all of these are pretty surmountable. We've done dynamic pricing with contextual multi-armed bandits: each context gets a single decision per time block, and the gross profit summed up at the end of the block is used as the reward for the agent.
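A rough sketch of that setup, assuming an epsilon-greedy policy and illustrative names for the context/arm bookkeeping (the actual system described above may work differently):

```python
import random
from collections import defaultdict

# Running average of block-level gross profit per (context, arm) pair.
value = defaultdict(float)
pulls = defaultdict(int)

def choose_arm(context: str, arms: list[str], epsilon: float = 0.1) -> str:
    """Pick one arm for this context, held fixed for the whole time block."""
    if random.random() < epsilon:
        return random.choice(arms)
    return max(arms, key=lambda arm: value[(context, arm)])

def end_of_block_update(context: str, arm: str, gross_profit: float) -> None:
    """Feed back the gross profit summed over the block as the reward."""
    key = (context, arm)
    pulls[key] += 1
    value[key] += (gross_profit - value[key]) / pulls[key]
```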
That being said, I agree that MABs are poor for experimentation (they produce biased estimates that depend on somewhat hard-to-quantify properties of your policy). But they're not for experimentation! They're for optimizing a target metric.
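To illustrate the bias point, here's a small simulation sketch (my own example, not from the comment): under an adaptive policy, the naive per-arm sample means tend to sit below the true means, because arms that get unlucky early are sampled less and their low estimates never get corrected. How big the gap is depends on the policy's exploration rate and horizon, which is the hard-to-quantify part.

```python
import random

def run_once(epsilon: float = 0.1, horizon: int = 500, true_means=(0.5, 0.6)):
    """Epsilon-greedy on two Bernoulli arms; return the naive
    per-arm sample means the policy ends up with."""
    pulls = [0, 0]
    sums = [0.0, 0.0]
    for _ in range(horizon):
        if 0 in pulls or random.random() < epsilon:
            arm = random.randrange(2)
        else:
            arm = 0 if sums[0] / pulls[0] >= sums[1] / pulls[1] else 1
        sums[arm] += 1.0 if random.random() < true_means[arm] else 0.0
        pulls[arm] += 1
    return [sums[i] / pulls[i] for i in range(2)]

# Averaged over many runs, these estimates typically fall below the
# true means (0.5, 0.6); the size of the bias depends on epsilon and horizon.
runs = [run_once() for _ in range(2000)]
print([sum(r[i] for r in runs) / len(runs) for i in range(2)])
```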