Monday, September 17, 2012

A/B Testing vs MAB algorithms - It's complicated

Several months ago, several blog posts appeared on Hacker News comparing A/B testing and multi-armed bandit techniques.  If you want to review the posts and the discussion, see 20 lines of code that beat A/B testing every time, then Why multi-armed bandit algorithm is not "better" than A/B testing and finally Why Multi-armed Bandit algorithms are superior to A/B testing (with Math).  I participated in those discussions, and ever since then I've been wanting to write up my thoughts once I had them in a compact enough form to do so.

That has taken an unfortunately long time.  In fact I've given up on saying everything that I want to say in a compact form, and will try to only say what I think is most important.  And even that has wound up less compact than I'd like...

First a disclaimer.  Website optimization has been a large part of what I've done in the last decade, and I've been a heavy user of A/B testing.  See Effective A/B Testing for a well-regarded tutorial that I did on it several years ago.  I have much less experience with multi-armed bandit approaches.  I don't believe that I am biased.  But if I were, it is clear what my bias would be.

Here is a quick summary of what the rest of this post will attempt to convince you of.
  1. If you have actual traffic and are not running tests, start now.  I don't actually expand on this fact, but  it is trivially true.  If you've not started testing, it would be a shock if you can't find at least one 5-10% improvement in your business within 6 months of trying it.  Odds are that you'll find several.  What is that worth to you?
  2. A/B testing is an effective optimization methodology.
  3. A good multi-armed bandit strategy provides another effective optimization methodology.  Behind the scenes there is more math, more assumptions, and the possibility of better theoretical characteristics than A/B testing.
  4. Despite this, if you want to be confident in your statistics, want to be able to do more complex analysis, or have certain business problems, A/B testing likely is a better fit.
  5. And finally if you want an automated "set and forget" approach, particularly if you need to do continuous optimization, bandit approaches should be considered first.

That summary requires a lot of justification.  Read on for that.


What is A/B testing?  A/B testing is when you create 2 or more versions of something, say a web page, then for a period of time you randomly choose which version each person gets.  You track how frequently those groups do things you like (sign up, click on something, make a purchase - anything you care to track).  Eventually you do statistics and figure out what the better one is.  Then you roll out that version to everyone.  You can see my tutorial above if you want to roll a system yourself, or use an prebuilt package.  Google Website OptimizerVisual Website Optimizer/, and Optimizely are popular third party sites that make it easy to do A/B testing for websites.

Next, what are multi-armed bandit (MAB) algorithms?  Well suppose that you are facing a slot machine with several "arms" that you can pull.  Every time you pull an arm, you get some random response.  You know nothing about the arms when you start.  MAB algorithms try to strike a balance between exploring the performance of the arms and exploiting the better one.  You start with lots of exploration and little or no exploitation, then move to lots of exploitation and very little exploration.  There is a lot of mathematical literature on the subject.  Again you can roll your own, or you can use a pre-built solution like Myna.

Now how do A/B testing and MAB algorithms relate?  Well the problem that A/B testing is usually trying to solve can be described as a MAB problem.  Admittedly, you have random visitors to your website rather than a slot machine, and you have to choose which web page to deliver rather than choosing the arm on a slot machine.  But mathematically there is an equivalence.  When looked at from this point of view A/B testing itself is merely a potential algorithm for MAB.  And it is far from the best one known.

So MAB has a ton of theory and is better, right?  Not necessarily.  But before I can explain why, I need to explain enough about how MAB works that you can see the shortcomings.

Standard multi-armed bandit theory, in the simplest case, makes two basic assumptions.  The first is that you know what happened with the previous trials.  The second is that the behavior of the machine is random based on the arm pulled, but its behavior does not change over time, or depending on your previous actions.  There is a lot of research on ways to relax these assumptions, but that is the basic starting place.  After all the problem is an idealization of how it is possible to learn from experience.  Unless you know the past and have some reason to believe that the future is similar to the past, there is no possibility of being able to learn from experience!

MAB algorithms trade off exploration and exploitation, how can you do that?  The simplest possible technique is called epsilon-greedy.  Some fixed fraction of the time you explore versions randomly, the rest of the time you go for the version that is best so far.  This algorithm is quite bad, it has linear regret.  What is regret?  Regret is the regret you feel for finding out that you were wasting time on the wrong version, and could have made more by doing the right thing all along.  Since you're trying bad versions a fixed fraction of the time, your regret is linearly proportional to how many trials you have had.

Next we have epsilon-first where you spend a fixed period exploring the arms, and from then on pull the one you think is best.  A/B testing is one way to do epsilon-first.  This algorithm also has linear regret on average.  There is a fixed probability that you'll make the wrong choice, and from then on you'll forever do the wrong thing.  However in practice you stop reasonably quickly, and the linear term is very small.  So the regret tends to be acceptable in practice.

But we can do better.  The best techniques have logarithmic regret meaning that the fraction of the time that you spend doing the wrong thing scales as the log of the number of times you're faced with the web page (or are faced with new users).  Many techniques achieve this theoretical bound.  For example we can make the probability of pulling a specific lever proportional to our confidence that it is not worse than our average observed so far across all trials.  With this, as poor options acquire statistical evidence that they perform worse than average, they drop out of the traffic mix and the bar is raised.  Eventually the best version is the only one getting significant traffic.

If you've got experience with statistics, feel free to start computing expected values, variances, then confidence intervals that you do lookups into standard normal distributions with for a 1-tailed test.  The resulting MAB algorithm will work well as long as the distribution of each arm is independent, and has both an expected value and variance.

If you are not confident about statistics, all is not lost.  There is a lot of work on simple to implement algorithms.  One very simple one is UCB1.  In that one the version that you pick for the next pull is the one that makes avg(version) + sqrt( 2 lng(trials_of_all_versions) / trials(version) ) as large as possible.  If you want you can read the derivation, or just trust that it works (it does if its assumptions are met).  But do note that there is logarithmic, and then there is logarithmic with a good factor and smaller constant terms.  UCB1 is acceptable, but not the best algorithm available.

OK, that's enough theory about MAB.  Now it is time to start looking at how it might not work so well in the real world of A/B testing.
  1. Performance varies over time.  Conversion rates tend to vary by time of day, day of week, day of month, and so on.  Furthermore your site does not stand still.  While you are running one test, you often will start another test, get final results, and roll out a significant improvement.  Now in your original test you can wind up thinking that version B looks like it is better, so you keep on trying version B, and it keeps on performing better than your old version A data.  But you don't know whether version B is performing better because it is better, or because your whole site is better.

    Eventually you'll try A enough that you can come to a conclusion.  But in practice "logarithmic growth in regret" means that you need to wait an exponential amount of time before unbiased new A data starts to balance out conclusions that you're making based on the worse performance of old A data.

    A/B testing avoids this problem by making sure that all statistical comparisons are apples to apples.  I had an email discussion with the people at Myna about this, and we concluded that there is more than one way to make MAB algorithms that can handle this problem, but it certainly does not happen by default.  Nor have they gotten around to implementing it yet.  (I should write a blog post one of these days on how to solve this problem.)
  2. You may not have data from all past trials.  Suppose, for instance, that you are running an A/B test on an email program.  Every day you send out a batch of emails.  People don't generally open and react to them until after the day's send is over, and lag times of several days to respond is not uncommon.  So what do you do in a MAB algorithm when you are judging on known to be incomplete data?  Depending on the algorithm, it isn't clear.
  3. A/B testing applies more easily to sticky behavior.  Often during A/B tests you're testing something, like the presence of a link, that you don't want constantly every time a person looks at it.  So there are business reasons to make the version of the test that someone sees be sticky.  This creates the previous problem about data taking a while to get fully baked.  But it also introduces the problem that by the time you have figured out which version is better, a large portion of your user base is on the bad one.

    A/B testing gives you a clear point to make the business decision to pull the plug on those regrettable choices.  A MAB algorithm by default does not.  That said, many MAB algorithms can be instrumented to give you conversion rates, confidence bounds, and so on.
  4. A/B testing lets you clean up your code.  This is a variant of the previous problem.  When you test, you try lots of things.  When you A/B test you declare winners and delete code for losing versions.  With MAB algorithms you tend to just leave them there.  Over time this leads to grungy source code, and on your next test it may not be obvious what old testing code still matters.  Of course if your MAB algorithm gives you confidence intervals, and you eventually clean up losing versions, then this is a non-issue.
  5. Your value metric is not always clear.  Companies that have been doing A/B testing for a while find it worthwhile to track multiple metrics.  It may not be clear up front which metric will matter most.  Suppose that a change manages to increase signups on the site, and decreases how quickly they click on a paying ad.  Is that an improvement or not?  (This was the actual outcome of a real test at a real company.)  You really want people thinking about business decision.  If you leave it up to an unsupervised algorithm that never had those boundary cases thought through, you'll be making critical decisions about the business by chance, and will be missing out on opportunities to figure out the real answers.
  6. You may want to notice and handle different performance on user segments?  People who have been on the site for a while often react differently to changes than people who are new to the site.  If you segment your data, then you can zero in on (for instance) whether the behavior of existing users shifts after they've been exposed to the change for a while.  The after-the-fact analysis in A/B testing can let you perform this kind of sophisticated analysis.  MAB algorithms don't.
But it is not all one sided.  There are concrete reasons that you might prefer to use a MAB algorithm.  Here are some:
  1. Bandit approaches are mathematically better.  If the necessary assumptions are met, they simply are.  As noted above, the assumptions tend not to be perfectly met in real life, but if you don't think the violation is that bad, why not be better?
  2. The other potential problems above do not apply.  The problems that I listed above are all based on actual experience at actual companies.  But they may not be particularly applicable to your company.  Perhaps it is OK on your site if a page looks different every time a user sees it.  It may not matter to you that you are unable to do analyses that you weren't planning on doing.  And so on.  You know your company better than I do, make the right decision.
  3. A/B testing requires more attention.  At the very least, A/B testing requires that someone looks at, and understands, the test results.  And every so often you'll get into situations where you wait on being able to try new ideas because you're running an existing test that you don't want to stop.  Any decent MAB algorithm skips that.  Nobody ever needs to look at results (though you can), and with a good algorithm you can just throw a new idea out into the mix at any time that you want to.
  4. You want to do continuous optimization. There are many cases where you want to have a system that is constantly optimizing and re-optimizing itself based on what is currently happening, without human intervention.  The "without human intervention" bit means that you can't use A/B testing.  But you can adapt MAB algorithms for this purpose.  All that you do is discount data collected a while ago - just multiply the number of old observations by some decay rate - so that recent data counts for more.  Then let it run and watch it adapt.

    Things like Yahoo C.O.R.E may give inspiration for what can be aspired to with enough traffic.  Obviously any system like that will need a lot of tuning and customization for your specific situation, even if you're not trying to be that sophisticated.

    But you do not need an insane amount of traffic to benefit from continuous optimization.  For instance consider the problem of presenting content that people acclimatize to and then learn to ignore.  Examples include display ads, or email subject lines.  In this case the more often that you show particular versions, the worse they do.  Can you find a more perverse optimization problem?  Continuous optimization can help by letting you keep optimizing for what people have not yet acclimatized to.  (I believe I can beat a MAB-based algorithm for this specific case, but that is a subject for another day.)
If you've read this far, congratulations.  Let me just remind you of the summary, and you can decide for yourself whether I've supported my claims:
  1. If you have actual traffic and are not running tests, start now.  I don't actually expand on this fact, but  it is trivially true.  If you've not started testing, it would be a shock if you can't find at least one 5-10% improvement in your business within 6 months of trying it.  Odds are that you'll find several.  What is that worth to you?
  2. A/B testing is an effective optimization methodology.
  3. A good multi-armed bandit strategy provides another effective optimization methodology.  Behind the scenes there is more math, more assumptions, and the possibility of better theoretical characteristics than A/B testing.
  4. Despite this, if you want to be confident in your statistics, want to be able to do more complex analysis, or have certain business problems, A/B testing likely is a better fit.
  5. And finally if you want an automated "set and forget" approach, particularly if you need to do continuous optimization, bandit approaches should be considered first.
My thanks to Paras Chopra, Patrick McKenzie and Noel Welsh for feedbacks on earlier drafts of this.  And see http://news.ycombinator.com/item?id=4532578 for further discussion.

1 comment:

Alex Birkett said...

Hi - loved the article. Wanted to stop by and let you know I cited it in my latest ConversionXL post: http://conversionxl.com/bandit-tests/

Cheers,
Alex