Monday, October 29, 2012

A/B testing scale cheat sheet

This is not a guide to how to do A/B testing.  If you want that, see Effective A/B Testing, or any number of companies that will help you with A/B testing.  Instead this is a cheat sheet of basic facts on A/B testing (mostly on the scale involved) to help people who are just beginning to figure out what is feasible.
  • If you've never tested, expect to find a number of 5-20% wins.

    In my experience, most companies find several changes that each add in the neighborhood of 5-20% to the bottom line in their first year of testing.  If you have enough volume to reliably detect wins of this size in a reasonable time frame, you should start now.  If not, then you're not a great candidate for it... yet.
  • Experienced testers find smaller wins.

    When you first start testing, you pick up the low-hanging fruit.  The rapid increase in profits can spoil you.  Over time the wins get smaller.  And more specific to your business.  Which means they will take longer to detect.  Expect this.  But if you're still finding an average of one 2% win every month, that compounds to roughly a 27% improvement in conversion rates per year.
  • What works for others likely works for you.  But not always.

    Companies that test a lot have generally settled on simple value propositions, streamlined their signup processes, put up big calls to action, placed the same call to action in multiple places, and done email marketing.  Those are probably going to be good things for you to do as well.  However Amazon famously relies on customer ratings and reviews.  If you do not have their scale or product mix, you'll likely get very few reviews per product, and may get overwhelmingly negative reviews.  So borrow, but test.
  • Your testing methodology needs to be appropriate to your scale and experience.

    A company with 5k prospects per day might like to run a complex multivariate test all of the way to signed-up prospect, and be able to find subtle 1% conversion wins.  But they don't generate nearly enough data to do this.  They may need to be satisfied with simple tests on the top step of the conversion funnel.  But it would be a serious mistake for a company like Amazon to settle for such a poor testing methodology.  In general you should use the most sophisticated testing methodology that you generate enough data to carry off.
  • Back of the envelope: A 10% win typically needs around 800-1500 successes per version to be seen.

    One of the top questions people have is how long a test takes to run.  Unfortunately it depends on all sorts of things, including your traffic volume, conversion rates, the size of the win to be found, luck, and what confidence you cut the test off at.  But if one version gets to 800 successes when the new one is at 880, you can conclude at a 95% confidence level.  If you wait until you have 1500 versus 1650, you can conclude at a 99% confidence level.  This data point, combined with your knowledge of your business, gives you a starting point for planning.
  • Back of the envelope: Sensitivity scales as the square root of the number of observations.

    For example a 5% win takes about 2x as much sensitivity as a 10% win, which means 4x as much data.  So you need 3200-6000 successes per version to see it.
  • Data required is roughly linear with number of versions.

    Running more versions requires a bit more data per version to reach confidence.  But not a lot.  Thus the amount of data you need is roughly proportional to the number of versions.  (But if some versions are real dogs, it is OK to randomly move people from those versions to other versions, which speeds up tests with a lot of versions.)  Before considering a complicated multivariate test, you should do a back of the envelope estimate to see if it is feasible for your business.
  • Even if you knew the theoretical win, you can't predict how long it will actually take to within a factor of 3.

    An A/B test reaches confidence when the observed difference is bigger than chance alone can plausibly explain.  However your observed difference is the underlying signal plus a chance component.  If the chance component is in the same direction as the underlying signal, the test finishes very fast.  If the chance component is the opposite direction, then you need enough data that the underlying signal overrides the chance signal, and goes on to still be larger than chance could explain.  The difference in time is usually within a factor of 3 either way, but it is entirely luck which direction you get.  (The rough estimates above are not too far from where you've got a 50% chance of having an answer.)
  • The lead when you hit confidence is not your real win.

    This is the flip side of the above point.  It matters because someone usually has the thankless task of forecasting growth.  If you saw what looked like an 8% win, the real win could easily be 4%.  Or 12%.  Nailing that number down with any semblance of precision will take a lot more data, which means time and lost sales.  There generally isn't much business value in knowing how much better your better version is, but whoever draws up forecasts will find the lack of a precise answer inconvenient.
  • Test early, test often.

    Suppose that you have 3 changes to test.  If you run 3 tests, you can learn 3 facts.  If you run one test with all three changes, you don't know which change actually made a difference.  Small and granular tests therefore do more to sharpen your intuition about what works.
  • Testing one step in the conversion funnel is OK only if you're small and just beginning testing.

    Every business has some sort of conversion funnel which potential customers go through.  They see your ad, click on it, click on a registration link, actually sign up, etc.  As a business, you care about people who actually made you money.  Each step loses people.  Generally, whatever version pushes more people through the top step gets more business in the end.  Particularly if it is a big win.  But not always!  Therefore if testing eventual conversions takes you too long, and you're still finding 10%+ wins at the top step in your funnel, it makes business sense to test and run with those preliminary answers.  You'll make some mistakes, but you'll get more right than wrong.  Testing poorly is better than not testing at all.
  • People respond to change.

    If you change your email subject lines, people may be 2-5% more likely to click just because it is different, whether or not it is better.  Conversely moving a button on the website may decrease clicks because people don't see the button where they expect it.  If you've progressed to looking for small wins, then you should keep an eye out for tests where this is likely to matter, and try to dig a little deeper on this.
  • A/B testing revenue takes more data.  A lot more.

    How much more depends on your business.  But be ready to see data requirements rise by a factor of 10 or more.  Why?  In the majority of companies, a fairly small fraction of customers spend a lot more than average.  The detailed behavior of this subgroup matters a lot to revenue, so you need enough data to average out random fluctuations in this slice of the data.
  • Interaction effects are likely ignorable.

    Often people have several things that they would like to test at the same time.  If you have sufficient data, of course, you would like to look at each slice caused by a combination of possible versions separately, and look for interaction effects that might arise with specific combinations.  In my experience, most companies don't have enough volume to do that.  However if you assign people to test versions randomly, and apply common sense to avoid obvious interaction effects (eg red text on red background would cause an interaction effect), then you're probably OK.  Imperfect testing is better than not testing, and the imperfection of proceeding is generally pretty small.
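The back of the envelope bullets above can be turned into a rough planning sketch.  It assumes that each version's success count is approximately Poisson (so the variance of the difference between two counts is the sum of their means) and that a two-sided normal cutoff is used.  The function names are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist


def successes_needed(lift, confidence=0.95):
    """Baseline successes at which a true relative `lift` sits right at the
    significance cutoff, i.e. roughly even odds of detecting it.

    Assumption: counts are approximately Poisson, so with s baseline
    successes the difference has mean lift*s and variance s + s*(1+lift).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided cutoff
    # Solve lift*s / sqrt(s*(2+lift)) = z for s.
    return z * z * (2 + lift) / (lift * lift)


def yearly_gain(monthly_win, months=12):
    """Compounded improvement from one win of size `monthly_win` per month."""
    return (1 + monthly_win) ** months - 1


if __name__ == "__main__":
    for lift in (0.10, 0.05):
        for conf in (0.95, 0.99):
            n = successes_needed(lift, conf)
            print(f"lift={lift:.0%} confidence={conf:.0%}: ~{n:.0f} successes per version")
    print(f"one 2% win per month compounds to {yearly_gain(0.02):.1%} per year")
```

Under these assumptions a 10% win comes out near 800 successes at 95% confidence and near 1400 at 99%, and halving the win to 5% multiplies the requirement by roughly 4, in line with the square root rule.  Multiply by the number of versions for a first guess at total data needed.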
As always, feedback is welcome.  I have helped a number of companies in the Los Angeles area on A/B testing, and this tries to answer the most common questions that I've encountered about how much work it is, and what returns they can hope for.


Dan Siroker said...

Great blog post! I agree with many of your points.

Some thoughts:

On your idea that the testing methodology needs to be appropriate to your scale and experience: I agree that most people are best served by simple A/B tests that incorporate the winner as they go along. However I disagree that companies like Amazon should use the most sophisticated testing methodology that they can generate enough data to carry off. Even large companies only run a small percentage of the tests they want to run. Keeping it simple and running as many tests as they can ends up producing much more successful outcomes overall than running one massive multivariate test that tries to anticipate, ahead of time, everything one might want to test.

Testing one part of the funnel: we typically recommend people focus on the bottleneck in their funnel to maximize the chance that improving it will improve the beginning-to-end conversion rate. For example, back in 2007 during the Obama campaign we did a great job of getting people on our email list to donate and volunteer, so we focused on getting more people onto our email list as the bottleneck. Here is the experiment we ran:

On revenue: another reason why A/B testing revenue takes longer is that there is a much higher outlier effect. When looking at unique conversions divided by unique visitors one visitor can't bias the data dramatically (unless they keep clearing their cookies before converting). With revenue a handful of visitors in one test group can spend dramatically more than the average and skew the results.

Overall, great post and thanks for sharing your insights!


martingoodson said...

Nice article. This was surprising though:

"Back of the envelope: A 10% win typically needs around 500 successes per version to be seen."

How are you calculating that? Using power.prop.test in R with a fairly low power of 80% gives me something like 1200 in each group. (power.prop.test(p1=0.05, p2=0.05*1.1, power=0.8, alternative='one.sided'))

btilly said...

@martingoodson: You're right, I missed a factor in my back of the envelope. :-(

Conversion rates from visitor to signed up are usually low. The sum of many individually unlikely conversions is approximately Poisson. That has mean and variance the same. In an A/B test we're looking at something that is roughly one Poisson minus another. So the variance is twice the mean for each individual one. (That's the factor that I missed.) To draw statistical conclusions with 95% confidence, we want the measured difference to be about 2 standard deviations away.

Now let's play with numbers. If we had 100 expected successes, the variance of the difference would be 200, which would make a standard deviation be about 10 * sqrt(2), which would mean that we can detect at a 95% confidence level roughly a difference of 20 * sqrt(2) = 10 * sqrt(8). Applying the above scaling rule, if we expect 800 successes then we'd be detecting a 10% difference on average at the 95% confidence level. (A check with a chi-square calculator confirms that.)

Upping to 1000 raises that to about 97% confidence. Upping to 1500 gives 99% confidence. In practice, near the upper limit of the testing effort that you're willing to run, you can get away with surprisingly low confidence. This surprised me when I figured it out, so I need to do a blog post some time on it.
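The arithmetic in this comment can be checked with a short sketch.  It assumes the same Poisson approximation described above, estimates the variance of the difference as the sum of the two observed counts, and reports a two-sided confidence.  The function name is hypothetical.

```python
import math
from statistics import NormalDist


def confidence(successes_a, successes_b):
    """Two-sided confidence that two versions differ, given their success
    counts and equal traffic per version.

    Assumption: each count is approximately Poisson, so the variance of
    the difference is estimated by successes_a + successes_b.
    """
    sd = math.sqrt(successes_a + successes_b)
    if sd == 0:
        return 0.0  # no data yet, so no evidence either way
    z = abs(successes_b - successes_a) / sd
    return 2 * NormalDist().cdf(z) - 1


if __name__ == "__main__":
    for a, b in [(800, 880), (1000, 1100), (1500, 1650)]:
        print(f"{a} vs {b}: {confidence(a, b):.1%} confidence")
```

With these inputs the three cases land close to 95%, 97% and 99% respectively.  The sketch only applies when both versions received the same amount of traffic; unequal splits would need rates rather than raw counts.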

btilly said...

@Dan Siroker: There is always a balance. Even at Amazon's scale, you still parallelize what you can. But there is a possibility that choice of headline will interact with, say, which marketing copy you put under it. Or with the geographic location of the visitor. As a matter of course you'd want to look for such correlations.

When you have less traffic, you can't really ask those questions.

On the conversion funnel, most of the time, improving any one step is good. But experienced companies generally will encounter examples of, "It looked good at this step then sucked overall." Once you've encountered that, you tend to become shy of not testing all of the way to the final conversion.

martin said...

"Applying the above scaling rule, if we expect 800 successes then we'd be detecting a 10% difference on average at the 95% confidence level."

This is not quite right. You need to take into account the variability in the counts, under the alternative hypothesis. This will tell you the probability of a significant result, when there really is a difference in the two conversion rates.

With 800 successes you will only detect a true difference 50% of the time (ie power is 50%). Any power calculator will show you this (eg R's power.prop.test). A well powered test requires 1200-2200 successes. eg see

btilly said...

@martin, when I say "on average" I mean "You've got even odds of detecting it". Which is the same as 50% power. Therefore the figure is correct.

But the later bullet point saying that there is a large factor of variance in when you will detect it in practice supports what you said.
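The two positions in this exchange can be put side by side with a rough power calculation.  This is a sketch under the same Poisson approximation, with a two-sided cutoff and ignoring the tiny chance of a significant result in the wrong direction.  The function name is hypothetical, and the numbers are approximations rather than a replacement for a proper power calculator.

```python
import math
from statistics import NormalDist


def power(baseline_successes, lift, confidence=0.95):
    """Approximate probability of reaching `confidence` when the true
    relative improvement is `lift`.

    Assumption: counts are approximately Poisson, so the difference has
    mean lift*s and variance s*(2+lift) for s baseline successes.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - (1 - confidence) / 2)
    expected_diff = lift * baseline_successes
    sd = math.sqrt(baseline_successes * (2 + lift))
    return 1 - nd.cdf(z_crit - expected_diff / sd)


if __name__ == "__main__":
    for s in (800, 1650, 2200):
        print(f"{s} baseline successes, 10% lift: power ~{power(s, 0.10):.0%}")
```

Under these assumptions 800 successes gives about even odds of detecting a 10% win, while roughly 1650 and 2200 successes correspond to about 80% and 90% power.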
