Comment by LPisGood

Comment by LPisGood 6 days ago

13 replies

Multi-arm bandit does beat A/B testing in the sense that standard A/B testing does not seek to maximize reward during the testing period, MAB does. MAB also generalizes better to testing many things than A/B testing.

cle 6 days ago

This is a double-edged sword. There are often cases in real-world systems where the "reward" the MAB maximizes is biased by eligibility issues, system caching, bugs, etc. If this happens, your MAB has the potential to converge on the worst possible experience for your users, something a static treatment allocation won't do.

  • LPisGood 6 days ago

    I haven’t seen these particular shortcomings before, but I certainly agree that if your data is bad, this ML approach will also be bad.

    Can you share some more details about your experiences with those particular types of failures?

    • cle 6 days ago

      Sure! A really simple (and common) example would be a setup w/ treatment A and treatment B, your code does "if session_assignment == A .... else .... B" . In the else branch you do something that for whatever reason causes misbehavior (perhaps it sometimes crashes or throws an exception or uses a buffer that drops records under high load to protect availability). That's suprisingly common. Or perhaps you were hashing on the wrong key to generate session assignments--ex you accidentally used an ID that expires after 24 hours of inactivity...now only highly active people get correctly sampled.

      Another common one I saw was due to different systems handling different treatments, and there being caching discrepancies between the two, like esp in a MAB where allocations are constantly changing, if one system has a much longer TTL than the other then you might see allocation lags for one treatment and not the other, biasing the data. Or perhaps one system deploys much more frequently and the load balancer draining doesn't wait for records to finish uploading before it kills the process.

      The most subtle ones were eligibility biases, where one treatment might cause users to drop out of an experiment entirely. Like if you have a signup form and you want to measure long-term retention, and one treatment causes some cohorts to not complete the signup entirely.

      There are definitely mitigations for these issues, like you can monitor the expected vs. actual allocations and alert if they go out-of-whack. That has its own set of problems and statistics though.

crazygringo 6 days ago

No -- you can't have your cake and eat it too.

You get zero benefits from MAB over A/B if you simply end your A/B test once you've achieved statistical significance and pick the best option. Which is what any efficient A/B test does -- there no reason to have any fixed "testing period" beyond what is needed to achieve statistical significance.

While, to the contrary, the MAB described in the article does not maximize reward -- as I explained in my previous comment. Because the post's version runs indefinitely, it has worse long-term reward because it continues to test inferior options long after they've been proven worse. If you leave it running, you're harming yourself.

And I have no idea what you mean by MAB "generalizing" more. But it doesn't matter if it's worse to begin with.

(Also, it's a huge red flag that the post doesn't even mention statistical significance.)

  • LPisGood 6 days ago

    > you can't have your cake and eat it too

    I disagree. There is a vast array of literature on solving the MAB problem that may as well be grouped into a bin called “how to optimally strike a balance between having one’s cake and eating it too.”

    The optimization techniques to solve MAB problem seek to optimize reward by giving the right balance of exploration and exploitation. In other words, these techniques attempt to determine the optimal way to strike a balance between exploring if another option is better and exploiting the option currently predicted to be best.

    There is a strong reason this literature doesn’t start and end with: “just do A/B testing, there is no better approach”

    • crazygringo 6 days ago

      I'm not talking about the literature -- I'm talking about the extremely simplistic and sub-optimal procedure described in the post.

      If you want to get sophisticated, MAB properly done is essentially just A/B testing with optimal strategies for deciding when to end individual A/B tests, or balancing tests optimally for a limited number of trials. But again, it doesn't "beat" A/B testing -- it is A/B testing in that sense.

      And that's what I mean. You can't magically increase your reward while simultaneously getting statistically significant results. Either your results are significant to a desired level or not, and there's no getting around the number of samples you need to achieve that.

      • LPisGood 6 days ago

        I am talking about the literature which solves MAB in a variety of ways, including the one in the post.

        > MAB properly done is essentially just A/B testing

        Words are only useful insofar as their meanings invoke ideas, and in my experience absolutely no one thinks of other MAB strategies when someone talks about A/B testing.

        Sure, you can classify A/B testing as one extremely suboptimal approach to solving MAB problem. This classification doesn’t help much though, because the other MAB techniques do “magically increase the rewards” compared this simple technique.

        • crazygringo 3 days ago

          > Sure, you can classify A/B testing as one extremely suboptimal approach to solving MAB problem. This classification doesn’t help much though, because the other MAB techniques do “magically increase the rewards” compared this simple technique.

          You are quite simply wrong. There is nothing suboptimal about an A/B test between two choices performed until desired statistical significance. There is nothing you can do to magically increase anything.

          If you think there is, you'll have to describe something specific. Because nowhere in the academic MAB literature does anyone attempt to state the contrary. And which, again, is why this blog post is so flawed.

    • cauch 6 days ago

      Another way of seeing the situation: let run your MAB solution for a while. Orange has been tested 17 times and blue has been tested 12 times. This is exactly equivalent of doing a A/B testing where you display 1 time the orange button to 17 persons and 1 time the blue button to 12 persons.

      The trick is to find the exact best number of test for each color so that we have good statistical significance. MAB does not do that well, as you cannot easily force testing an option that was bad when this option did not get enough trial to have a good statistical significance (imagine you have 10 colors and the color orange first score 0/1. It will take a very long while before this color will be re-tested quite significantly: you need to first fall into the 10%, but then you still have ~10% to randomly pick this color and not one of the other). With A/B testing, you can do a power analysis before hand (or whenever during) to know when to stop.

      Literature does not start with "just do A/B testing" because it is not the same problem. In MAB, your goal is not to demonstrate that one is bad, it's to do your own decision when faced with a fixed situation.

      • LPisGood 6 days ago

        > The trick is to find the exact best number of test for each color so that we have good statistical significance

        Yes, A/B testing will force through enough trials to get statistical significance(it is definitely a “exploration first strategy), but in many cases, you care about maximizing reward as well, in particular during testing. A/B testing does very poorly at balancing exploitation with exploitation in general.

        This is especially true if the situation is dynamic. Will you A/B test forever in case something has changed and give up that long term loss in reward value?

        • cauch 6 days ago

          But the proposed MAB system does not even propose a method to know when this system needs to be stopped (and remove all the choices except the best one).

          With the A/B testing, you can do power analysis whenever you want, including in the middle of the experiment. It will just be an iterative adjustment that converges.

          In fact, you can even run on all possibilities in advance (if A get 1% and B get 1%, how many A and B do I need, if A get 2% and B get 1%, if A get 3% and B get 1%, ...) and it will give you the exact boundaries to stop for any configurations before even running the experiment. You will just have to stop trialing option A as soon as option A crosses the already decided significance threshold for A.

          So, no, the A/B testing will never run forever. And A/B testing will always be better than the MAB solution, because you will have a better way to stop trying a bad solution as soon as you have crossed the threshold you decided is enough to consider it's a bad solution.

ertdfgcvb 5 days ago

Isn't that the point of testing (to not maximize reward but rather wait and collect data)? It sounds like maximizing reward during the experiment period can bias the results