Comment by ted_dunning

Comment by ted_dunning 5 days ago

Multi-armed bandit approaches do not require an immediate feedback loop. They also do the best you can with delayed feedback or with episodic adjustment.

So if you are doing A/B tests, it is quite reasonable to use Thompson sampling at fixed intervals to adjust the proportions. If your response variable is not time-invariant, this is actually best practice.
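
As a rough illustration of adjusting proportions at fixed intervals, here is a minimal sketch. It assumes binary rewards (convert / no convert) and a Beta(1, 1) prior on each arm. The function name `thompson_proportions` and the counts are made up for the example. At each interval you would re-run it on the counts accumulated so far and route traffic by the returned shares.

```python
import random

def thompson_proportions(successes, failures, draws=10_000, seed=0):
    """Estimate, for each arm, the posterior probability that it is the best arm.

    Assumes Bernoulli rewards with an independent Beta(1, 1) prior per arm.
    The returned shares can be used as traffic proportions until the next update.
    """
    rng = random.Random(seed)
    wins = [0] * len(successes)
    for _ in range(draws):
        # Draw one sample from each arm's Beta posterior.
        samples = [
            rng.betavariate(s + 1, f + 1)
            for s, f in zip(successes, failures)
        ]
        # Credit the arm whose sampled conversion rate is highest.
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]

# Hypothetical counts observed up to the current update interval.
successes = [40, 55]    # conversions for arm A, arm B
failures = [960, 945]   # non-conversions for arm A, arm B

proportions = thompson_proportions(successes, failures)
print(proportions)  # share of traffic to give each arm until the next update
```

Nothing here needs feedback per request: the counts can arrive late or in batches, and the proportions only change when you recompute them.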

orasis 2 days ago

Having significant experience with bandits in production, I strongly recommend only using them for immediate feedback. If the rewards are at all disconnected from the action, you likely won’t be happy with the results.