Comments on Random Observations: A/B testing scale cheat sheet
Feed author: btilly (http://www.blogger.com/profile/04335648152419715383)
Feed updated: 2021-09-20T03:43:49.212-07:00
7 comments

2015-04-13T06:17:59.559-07:00
This comment has been removed by the author.
Posted by Richard Hayes (https://www.blogger.com/profile/00656388493685578168)

2012-10-29T16:35:15.666-07:00
@martin, when I say "on average" I mean "You've got even odds of detecting it". Which is the same as 50% power. Therefore the figure is correct.

But the later bullet point saying that there is a large factor of variance in when you will detect it in practice supports what you said.
Posted by btilly (https://www.blogger.com/profile/04335648152419715383)

2012-10-29T16:29:55.896-07:00
"Applying the above scaling rule, if we expect 800 successes then we'd be detecting a 10% difference on average at the 95% confidence level."

This is not quite right. You need to take into account the variability in the counts under the alternative hypothesis. This will tell you the probability of a significant result when there really is a difference in the two conversion rates.

With 800 successes you will only detect a true difference 50% of the time (i.e. power is 50%). Any power calculator will show you this (e.g. R's power.prop.test). A well-powered test requires 1200-2200 successes.
e.g. see http://en.wikipedia.org/wiki/Statistical_power
Posted by martin (https://www.blogger.com/profile/17263260129250188582)

2012-10-29T12:56:04.744-07:00
@Dan Siroker: There is always a balance. Even at Amazon's scale, you still parallelize what you can. But there is a possibility that choice of headline will interact with, say, which marketing copy you put under it. Or with the geographic location of the visitor. As a matter of course you'd want to look for such correlations.

When you have less traffic, you can't really ask those questions.

On the conversion funnel, most of the time, improving any one step is good. But experienced companies generally will encounter examples of, "It looked good at this step then sucked overall." Once you've encountered that, you tend to become shy of not testing all of the way to the final conversion.
Posted by btilly (https://www.blogger.com/profile/04335648152419715383)

2012-10-29T12:29:29.717-07:00
@martingoodson: You're right, I missed a factor in my back of the envelope. :-(

Conversion rates from visitor to signed up are usually low. The sum of many individually unlikely conversions is approximately Poisson. That has mean and variance the same. In an A/B test we're looking at something that is roughly one Poisson minus another. So the variance of the difference is twice the mean of each individual one. (That's the factor that I missed.) To draw statistical conclusions with 95% confidence, we want the measured difference to be about 2 standard deviations away from zero.

Now let's play with numbers.
If we had 100 expected successes, the variance of the difference would be 200, which would make the standard deviation about 10 * sqrt(2), which would mean that we can detect at a 95% confidence level roughly a difference of 20 * sqrt(2) = 10 * sqrt(8). Applying the above scaling rule, if we expect 800 successes then we'd be detecting a 10% difference on average at the 95% confidence level. (A check with a chi-square calculator confirms that.)

Upping to 1000 raises that to about 97% confidence. Upping to 1500 gives 99% confidence. In practice, near the upper limit of the testing effort that you're willing to run, you can get away with surprisingly low confidence. This surprised me when I figured it out, so I need to do a blog post some time on it.
Posted by btilly (https://www.blogger.com/profile/04335648152419715383)

2012-10-29T11:11:14.653-07:00
Nice article. This was surprising though:

"Back of the envelope: A 10% win typically needs around 500 successes per version to be seen."

How are you calculating that? Using power.prop.test in R with a fairly low power of 80% gives me something like 1200 in each group. (power.prop.test(p1=0.05, p2=0.05*1.1, power=0.8, alternative='one.sided'))
Posted by Anonymous

2012-10-29T11:04:31.090-07:00
Great blog post! I agree with many of your points.

Some thoughts:

On your idea that the testing methodology needs to be appropriate to your scale and experience: I agree that most people are best served by simple A/B tests that incorporate the winner as they go along. However, I disagree that companies like Amazon should use the most sophisticated testing methodology that they can generate enough data to carry off. Even large companies only run a small percentage of the tests they want to run. Keeping it simple and running as many tests as they can ends up producing much more successful outcomes overall vs. running one big massive multivariate test that tries to anticipate all of the things that one might want to test ahead of time.

Testing one part of the funnel: we typically recommend people focus on the bottleneck in their funnel to maximize the chance that improving it will improve the beginning-to-end conversion rate. For example, back in 2007 during the Obama campaign, we did a great job of getting people on our email list to donate and volunteer, so we focused on getting more people on our email list as the bottleneck. Here is the experiment we ran: http://www.youtube.com/watch?v=7xV7dlwMChc#t=4m47s

On revenue: another reason why A/B testing revenue takes longer is that there is a much higher outlier effect.
When looking at unique conversions divided by unique visitors, one visitor can't bias the data dramatically (unless they keep clearing their cookies before converting). With revenue, a handful of visitors in one test group can spend dramatically more than the average and skew the results.

Overall, great post and thanks for sharing your insights!

Dan
Posted by Anonymous (https://www.blogger.com/profile/17788075557167362593)
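
The disagreement earlier in this thread between "95% confidence at 800 successes" and "only 50% power at 800 successes" can be checked numerically. The following is a minimal sketch, not anyone's actual calculation from the thread: it assumes the Poisson reasoning in btilly's reply (variance of the difference is about twice the expected successes per version) plus a normal approximation. The function names `z_score`, `confidence`, and `power` are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def z_score(expected_successes, relative_lift=0.10):
    """Expected difference between the two versions, in standard deviations."""
    expected_diff = expected_successes * relative_lift
    # Assumption from the thread: the difference of two roughly Poisson
    # counts has a variance of about twice the mean of either one.
    sd_of_diff = sqrt(2 * expected_successes)
    return expected_diff / sd_of_diff


def confidence(expected_successes, relative_lift=0.10):
    """Two-sided confidence reached if the observed difference equals the expected one."""
    return 2 * STD_NORMAL.cdf(z_score(expected_successes, relative_lift)) - 1


def power(expected_successes, relative_lift=0.10, alpha=0.05):
    """Approximate chance that a real lift clears a two-sided significance threshold."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Ignores the negligible chance of a significant result in the wrong direction.
    return 1 - STD_NORMAL.cdf(z_crit - z_score(expected_successes, relative_lift))


for n in (800, 1000, 1500):
    print(f"{n} successes: z={z_score(n):.2f}, "
          f"confidence={confidence(n):.3f}, power={power(n):.2f}")
```

Under these assumptions, 800 expected successes puts a 10% lift at two standard deviations. That corresponds both to about 95% confidence when the observed difference matches the expected one and to roughly even odds of reaching significance, which is why the two readings in the thread are compatible.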