- If you've never tested, expect to find a number of 5-20% wins.
In my experience, most companies find several changes that each add in the neighborhood of 5-20% to the bottom line in their first year of testing. If you have enough volume to reliably detect wins of this size in a reasonable time frame, you should start now. If not, then you're not a great candidate for it... yet.
- Experienced testers find smaller wins.
When you first start testing, you pick up the low-hanging fruit. The rapid increase in profits can spoil you. Over time the wins get smaller and more specific to your business, which means they will take longer to detect. Expect this. But if you're still finding an average of one 2% win every month, that compounds to better than a 25% improvement in conversion rates per year (1.02^12 ≈ 1.27).
- What works for others likely works for you. But not always.
Companies that test a lot have settled on simple value propositions, streamlined signup processes, big calls to action, the same call to action in multiple places, and email marketing. Those are probably good things for you to do as well. However, Amazon famously relies on customer ratings and reviews. If you do not have their scale or product mix, you'll likely get very few reviews per product, and the reviews you do get may be overwhelmingly negative. So borrow, but test.
- Your testing methodology needs to be appropriate to your scale and experience.
A company with 5k prospects per day might like to run a complex multivariate test all the way to signed-up prospect, and be able to find subtle 1% conversion wins. But they don't generate nearly enough data to do this. They may need to be satisfied with simple tests on the top step of the conversion funnel. But it would be a serious mistake for a company like Amazon to settle for such a crude testing methodology. In general you should use the most sophisticated testing methodology that your data volume can support.
- Back of the envelope: A 10% win typically needs around 800-1500 successes per version to be seen.
One of the top questions people have is how long a test takes to run. Unfortunately that depends on all sorts of things, including your traffic volume, conversion rates, the size of the win to be found, luck, and what confidence level you cut the test off at. But if one version is at 800 successes when the new one is at 880, you can call the test for the new version at a 95% confidence level. If you wait until you have 1500 versus 1650, you can call it at a 99% confidence level. This data point, combined with your knowledge of your business, gives you a starting point for planning.
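If you want to sanity-check those numbers yourself, here is a minimal Python sketch of the calculation I have in mind: treat every success as a coin flip between the two versions and ask how far the observed split sits from 50/50 under a normal approximation. The function and the numbers are purely illustrative.

```python
import math

def split_confidence(successes_a, successes_b):
    """Two-sided confidence that the split of successes between two versions
    (with equal traffic) differs from 50/50, via the normal approximation
    to the binomial. Back-of-the-envelope only, not a full test."""
    total = successes_a + successes_b
    expected = total / 2.0
    std_dev = math.sqrt(total * 0.25)            # binomial sd with p = 0.5
    z = abs(successes_b - expected) / std_dev    # distance from an even split, in sds
    return math.erf(z / math.sqrt(2))            # two-sided confidence level

print(round(split_confidence(800, 880), 3))      # ~0.95
print(round(split_confidence(1500, 1650), 3))    # ~0.99
```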
- Back of the envelope: Sensitivity scales as the square root of the number of observations.
For example, detecting a 5% win takes about 2x the sensitivity of a 10% win, which means 4x as much data. So you need around 3200-6000 successes per version to see it.
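Applying that scaling to the earlier rule of thumb is a one-liner; a throwaway sketch:

```python
base_win, base_range = 0.10, (800, 1500)       # ~800-1500 successes/version for a 10% win
target_win = 0.05
scale = (base_win / target_win) ** 2           # half the win -> 4x the data
print([int(n * scale) for n in base_range])    # [3200, 6000]
```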
- Data required is roughly linear with number of versions.
Running more versions requires a bit more data per version to reach confidence. But not a lot. Thus the amount of data you need is roughly proportional to the number of versions. (But if some versions are real dogs, it is OK to randomly move people from those versions to other versions, which speeds up tests with a lot of versions.) Before considering a complicated multivariate test, you should do a back of the envelope to see if it is feasible for your business.
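Putting the last few rules of thumb together, here is one way to do that back of the envelope. The function and its inputs are my own illustration: it assumes roughly 1500 successes per version for a 10% win, scales that by 1/win², and treats data needs as linear in the number of versions.

```python
import math

def rough_days_to_confidence(daily_prospects, conversion_rate, versions,
                             relative_win, successes_for_10pct_win=1500):
    """Very rough estimate of how long a test takes, using the rules of
    thumb above: data per version scales as 1/win^2, and total data scales
    roughly linearly with the number of versions."""
    needed_per_version = successes_for_10pct_win * (0.10 / relative_win) ** 2
    total_successes = needed_per_version * versions
    successes_per_day = daily_prospects * conversion_rate
    return math.ceil(total_successes / successes_per_day)

# Hypothetical numbers: 5,000 prospects/day converting at 2%.
print(rough_days_to_confidence(5000, 0.02, 2, 0.10))  # simple A/B on a 10% win: ~30 days
print(rough_days_to_confidence(5000, 0.02, 8, 0.01))  # 8-way test on a 1% win: ~12,000 days
```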
- Even if you knew the theoretical win, you couldn't predict how long the test will actually take to within a factor of 3.
An A/B test reaches confidence when the observed difference is bigger than chance alone can plausibly explain. However your observed difference is the underlying signal plus a chance component. If the chance component points in the same direction as the underlying signal, the test finishes very fast. If the chance component points in the opposite direction, then you need enough data for the underlying signal to override it and still end up larger than chance could explain. The difference in time is usually within a factor of 3 either way, but which direction you get is entirely luck. (The rough estimates above are not too far from the point where you've got a 50% chance of having an answer.)
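If you want to see that spread for yourself, here is a rough Monte Carlo sketch (my own illustration, built on numpy): it simulates a true 10% win and stops each test the moment the split of successes crosses 95% confidence, then reports how widely the stopping times vary. Checking after every batch like this also lets lucky tests stop early, which is part of the point.

```python
import numpy as np

rng = np.random.default_rng(1)

def visitors_until_confident(p_a=0.05, lift=0.10, z_cut=1.96,
                             batch=1000, max_batches=500):
    """One simulated test: accumulate batches of `batch` visitors per version,
    stopping once the split of successes is z_cut sds away from 50/50."""
    wins_a = wins_b = 0
    for i in range(1, max_batches + 1):
        wins_a += rng.binomial(batch, p_a)
        wins_b += rng.binomial(batch, p_a * (1 + lift))
        total = wins_a + wins_b
        if total and abs(wins_b - total / 2) > z_cut * np.sqrt(total * 0.25):
            return i * batch
    return max_batches * batch

stops = np.sort([visitors_until_confident() for _ in range(200)])
print(stops[20], stops[100], stops[180])  # 10th/50th/90th percentiles of visitors per version
```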
- The lead when you hit confidence is not your real win.
This is the flip side of the above point. It matters because someone usually has the thankless task of forecasting growth. If you saw what looked like an 8% win, the real win could easily be 4%. Or 12%. Nailing that number down with any semblance of precision will take a lot more data, which means time and lost sales. There generally isn't much business value in knowing how much better your better version is, but whoever draws up forecasts will find the lack of a precise answer inconvenient.
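One way to see how loosely the true win is pinned down at that moment: put a rough confidence interval around the ratio of successes. This sketch treats the counts as Poisson and puts a normal interval on the log of their ratio, an approximation I'm choosing for simplicity; the counts are hypothetical ones where an 8% observed win has only just reached 95% confidence.

```python
import math

def lift_confidence_interval(successes_a, successes_b, z=1.96):
    """Approximate 95% CI on the relative win, treating the success counts as
    Poisson and putting a normal interval on the log of their ratio."""
    log_ratio = math.log(successes_b / successes_a)
    se = math.sqrt(1 / successes_a + 1 / successes_b)
    return (math.exp(log_ratio - z * se) - 1,
            math.exp(log_ratio + z * se) - 1)

low, high = lift_confidence_interval(1250, 1350)   # an observed 8% win
print(f"{low:.1%} to {high:.1%}")                  # roughly 0% to 17%
```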
- Test early, test often.
Suppose that you have 3 changes to test. If you run 3 tests, you can learn 3 facts. If you run one test with all three changes combined, you only learn whether the bundle helped, not which change actually made the difference. Small, granular tests therefore do more to sharpen your intuition about what works.
- Testing one step in the conversion funnel is OK only if you're small and just beginning testing.
Every business has some sort of conversion funnel which potential customers go through. They see your ad, click on it, click on a registration link, actually sign up, etc. As a business, you care about people who actually made you money. Each step loses people. Generally, whatever version pushes more people through the top step gets more business in the end. Particularly if it is a big win. But not always! Therefore if testing eventual conversions takes you too long, and you're still finding 10%+ wins at the top step in your funnel, it makes business sense to test and run with those preliminary answers. You'll make some mistakes, but you'll get more right than wrong. Testing poorly is better than not testing at all.
- People respond to change.
If you change your email subject lines, people may be 2-5% more likely to click just because they are different, whether or not they are better. Conversely, moving a button on the website may decrease clicks because people don't see the button where they expect it. If you've progressed to looking for small wins, you should keep an eye out for tests where this is likely to matter, and try to dig a little deeper into them.
- A/B testing revenue takes more data. A lot more.
How much more depends on your business. But be ready to see data requirements rise by a factor of 10 or more. Why? In the majority of companies, a fairly small fraction of customers spend a lot more than average. The detailed behavior of this subgroup matters a lot to revenue, so you need enough data to average out random fluctuations in this slice of the data.
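Here is a toy illustration of the mechanism. The numbers are invented: 5% of visitors buy, and 5% of buyers are big spenders at 50x a typical order. The data a metric needs to detect a given relative win scales with its (standard deviation / mean)², so comparing that quantity for a conversion metric and a revenue metric gives the rough multiplier.

```python
import numpy as np

rng = np.random.default_rng(0)
visitors = 1_000_000

# Made-up business: 5% of visitors buy; 5% of buyers are big spenders at 50x.
converted = (rng.random(visitors) < 0.05).astype(float)
spend = converted * np.where(rng.random(visitors) < 0.05, 500.0, 10.0)

# Data needed to see a given relative win scales with (std/mean)^2 of the metric.
cv2_conversion = (converted.std() / converted.mean()) ** 2
cv2_revenue = (spend.std() / spend.mean()) ** 2
print(round(cv2_revenue / cv2_conversion, 1))   # on the order of 10x with these numbers
```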
- Interaction effects are likely ignorable.
Often people have several things that they would like to test at the same time. With sufficient data you would, of course, look at each combination of versions separately and check for interaction effects that arise only with specific combinations. In my experience, most companies don't have enough volume to do that. However, if you assign people to test versions independently at random, and apply common sense to avoid obvious interactions (e.g., red text on a red background), then you're probably OK. Imperfect testing is better than not testing, and the imperfection of proceeding is generally pretty small.
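One cheap way to get that independent random assignment across simultaneous tests is to hash the user together with the test name, so the same person lands in uncorrelated versions of different tests. This is a common trick, but the function below is my own sketch rather than any particular tool's API.

```python
import hashlib

def variant(user_id: str, test_name: str, versions: int) -> int:
    """Deterministically assign a user to a version of one test,
    independently of their assignment in any other test."""
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % versions

# Two tests running at once: mixing the test name into the hash makes the
# assignments effectively independent, so combinations of versions are evenly
# mixed and mild interaction effects average out.
print(variant("user-42", "signup_button_copy", 2),
      variant("user-42", "pricing_page_layout", 3))
```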
As always, feedback is welcome. I have helped a number of companies in the Los Angeles area with A/B testing, and this tries to answer the most common questions that I've encountered about how much work it is and what returns they can hope for.