Comment by jbentley1
Multi-armed bandits make a big assumption: that effectiveness is static over time. If the bandit tips traffic slightly towards option B at a moment when effectiveness happens to be higher (maybe a sale just started), B will start to look overwhelmingly like the winner and can get locked into that state.
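To make that concrete, here is a toy simulation of the lock-in, assuming a plain Beta-Bernoulli Thompson-sampling bandit over two arms; the conversion rates, step counts, and the "sale" window are all made up for illustration:

```python
# Toy illustration of the failure mode above: arm B's true rate spikes
# temporarily (a "sale"), the bandit shifts traffic to B, and the stale
# spike-period evidence keeps B looking like the winner afterwards.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = np.ones(2), np.ones(2)        # Beta(1, 1) posteriors for A, B
post_spike_pulls = np.zeros(2)

for t in range(4_000):
    # B converts at 10% during the sale (first 2,000 steps), 3% after;
    # A converts at a steady 5%.
    p_true = np.array([0.05, 0.10 if t < 2_000 else 0.03])
    arm = int(np.argmax(rng.beta(alpha, beta)))   # Thompson sample
    reward = rng.random() < p_true[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward
    if t >= 2_000:
        post_spike_pulls[arm] += 1

print("traffic after the spike ended (A, B):", post_spike_pulls)
print("posterior means (A, B):", (alpha / (alpha + beta)).round(3))
# With these numbers B typically keeps most of the post-spike traffic and a
# higher posterior mean, even though its true rate has dropped below A's.
```

Because nearly all of the post-spike evidence is collected on B, A's estimate barely updates, and the data gathered during the spike keeps B's posterior looking best long after its real advantage is gone.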
You can solve this with propensity scores, but it is more complicated to implement and you need to log every interaction.
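A minimal sketch of what the propensity-score fix involves, assuming each logged interaction stores the arm shown, the reward, and the probability (propensity) with which the bandit chose that arm at the time; the record layout here is illustrative:

```python
from collections import defaultdict
import random

def ips_estimates(log):
    """Inverse-propensity-scored reward estimate per arm.

    Each record's reward is weighted by 1 / propensity, and the sum is
    divided by the total number of logged interactions (not the number of
    times that arm was shown), which corrects for the bandit having sent
    uneven traffic to the arms.
    """
    totals = defaultdict(float)
    n = len(log)
    for rec in log:
        totals[rec["arm"]] += rec["reward"] / rec["propensity"]
    return {arm: s / n for arm, s in totals.items()}

# Tiny synthetic log: the bandit sent 80% of traffic to B, whose true
# rate is 3%, and 20% to A, whose true rate is 5%.
random.seed(0)
log = []
for _ in range(10_000):
    if random.random() < 0.8:
        log.append({"arm": "B", "propensity": 0.8,
                    "reward": int(random.random() < 0.03)})
    else:
        log.append({"arm": "A", "propensity": 0.2,
                    "reward": int(random.random() < 0.05)})

print(ips_estimates(log))  # roughly {'B': 0.03, 'A': 0.05} despite the skew
```

The reweighting undoes the bandit's skewed allocation within any time window, but the estimate for a rarely shown arm has high variance, which is part of why this is more complicated in practice and why you need every interaction logged with its propensity.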
This objection is mentioned specifically in the post.
You can add a forgetting factor that down-weights older results.
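For what that can look like in code, here is a sketch of a Beta-Bernoulli Thompson-sampling bandit with an exponential forgetting factor; the class name and gamma value are illustrative, not from the post:

```python
import numpy as np

class DiscountedThompson:
    """Beta-Bernoulli Thompson sampling with an exponential forgetting factor."""

    def __init__(self, n_arms, gamma=0.999, seed=0):
        self.gamma = gamma                # per-step decay of old evidence
        self.alpha = np.ones(n_arms)      # decayed success counts + prior
        self.beta = np.ones(n_arms)       # decayed failure counts + prior
        self.rng = np.random.default_rng(seed)

    def choose(self):
        # Sample a plausible rate for each arm and play the best sample.
        return int(np.argmax(self.rng.beta(self.alpha, self.beta)))

    def update(self, arm, reward):
        # Shrink all past counts toward the Beta(1, 1) prior, then add the
        # new observation, so data older than roughly 1 / (1 - gamma)
        # steps carries little weight and the posterior can track drift.
        self.alpha = 1 + self.gamma * (self.alpha - 1)
        self.beta = 1 + self.gamma * (self.beta - 1)
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward
```

Smaller gamma forgets faster, so the bandit can react to shifts like the end of a sale, at the cost of throwing away statistical power when nothing is actually changing.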