Elon Musk has been dropping hints about his Hyperloop idea. (We cannot call it a proposal because he has not actually proposed it yet.) There is a lot of curiosity about what it might be. Given Elon's history, the idea will sound audacious and yet will actually be workable.
Jacques Mattheij recently speculated on the topic. His proposal has the serious problem that the friction of the air against the sides of the tunnel would lose way too much energy. But it got me thinking, and I have what may be a more realistic proposal.
First, imagine a tube that goes in a loop from Los Angeles to San Francisco. Let's put flaps on the walls. When they are open, air pressure can equalize. When they are closed, they don't leak much. Now let's put large, heavy objects going around and around the loop. For lack of a better name, let's call them plungers. The plungers can be floated and moved very efficiently with maglev technology. As each plunger approaches, the flaps open so that air can get pushed out, then close so that it doesn't come back in. This is not an evacuated tube (Elon explicitly says that his technology isn't an evacuated tube), but it results in a decent vacuum away from the shockwaves in front of each plunger. That eliminates most of your friction losses. I don't know how low the remaining losses would be, but Elon claims they are low enough that solar panels on top of the device provide more than enough energy to keep it permanently going. I see no reason to disbelieve that solar panels could do that.
Now where do the people fit in? People go into vehicles that I'll call cars, even though they aren't really cars. These cars can be fired by a railgun to match speed with the plungers, and injected into the tube in front of one. We can build the plunger with a space in its front that the car fits in. This space has air trapped in it by the shock wave, so the people can breathe. On its own that space would heat up due to the friction on the gas, but you can put a heat sink (e.g. a block of ice) in the car and keep it comfortable inside. Near the end of your journey the plunger ejects the car from this space on a course that launches it out of the tube while the plunger continues on its way. The car is then stopped with regenerative braking that recovers most of the launch energy, resulting in surprisingly little energy loss for taking the trip.
Now what would some of the specs be? Well, Elon claims 30 minutes from downtown Los Angeles to downtown San Francisco. According to Google Maps that's 382 miles, which is roughly 600 km. So the plungers should be moving at around 1200 km/hour. The full loop, there and back, is about 1200 km, so each plunger takes about an hour to go around. If we put 12 plungers on the loop, and have plenty of vehicles, then you get in your vehicle and have a launch opportunity every 5 minutes. Increase the number of plungers, and the time to launch can be decreased while the capacity of the system is increased. If we space the plungers a third of a km apart, we would have 3600 of them and could be launching into the system every second. It probably is more efficient if you instead make the plungers larger so that the cars carry more people. So instead of car, think "bus". But after the initial system is built, you can later add new entrance/exit ramps and ramp up capacity. As Elon has promised, you would not need to reserve tickets - you'd pretty much arrive and then go.
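To sanity-check those numbers, here is a quick back-of-envelope sketch in Python. All of the figures (600 km one way, a loop of twice that, the plunger counts) are just the illustrative numbers from above, not anything from an actual design:

```python
# A quick sanity check of the trip numbers above, using the illustrative
# figures from the text rather than any real design.
route_km = 600                     # LA to SF, one way
loop_km = 2 * route_km             # the tube loops there and back
speed_kmh = route_km / 0.5         # a 30-minute trip -> about 1200 km/h
lap_hours = loop_km / speed_kmh    # each plunger takes about an hour per lap

for n_plungers in (12, 3600):
    headway_s = lap_hours * 3600 / n_plungers
    spacing_km = loop_km / n_plungers
    print(n_plungers, round(headway_s), round(spacing_km, 2))
# 12 plungers   -> one passes every 300 s (5 minutes), spaced 100 km apart
# 3600 plungers -> one passes every second, spaced a third of a km apart
```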
Elon also claims that his system could store a lot of energy, enough to collect energy during the day and run off of it at night. The obvious place to store energy is in the kinetic energy of the plungers. How much energy are we talking? Well, 1200 km/hour is a third of a km per second. That's 55.5 kJ of kinetic energy per kg of plunger. So 64,800 kg (about 143,000 pounds) of plunger stores a megawatt-hour of energy. Suppose that is one plunger. If you've got a thousand plungers, and each is storing a full megawatt-hour, you could continuously draw 50 megawatts of power. You'd use up over half of the stored energy at night, then regain it during the day. Trips taken during the early morning commute might take 50% longer than during the evening, but it is doable. This gets better if we make the plungers bigger, have a more efficient system, or have more plungers. I'm sure that Elon has thought about the ideal parameters. But if heavy plungers are good, well, put in enough metal for maglev to work and then add rock. You'd store a lot of energy.
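The same kind of sketch for the energy arithmetic; the thousand-plunger count, 1 MWh per plunger, and an assumed 12-hour night are illustrative values, not a design:

```python
# The energy-storage arithmetic above, spelled out.
speed_ms = 1200 / 3.6                    # 1200 km/h is about 333 m/s (a third of a km/s)
ke_per_kg = 0.5 * speed_ms**2            # ~55.5 kJ of kinetic energy per kg of plunger
joules_per_mwh = 3.6e9
kg_per_mwh = joules_per_mwh / ke_per_kg  # ~64,800 kg of plunger stores 1 MWh

stored_mwh = 1000 * 1.0                  # a thousand plungers at 1 MWh each
night_draw_mw = 50
used_overnight_mwh = night_draw_mw * 12  # 600 MWh over a 12-hour night: over half the store
print(round(ke_per_kg), round(kg_per_mwh), stored_mwh, used_overnight_mwh)
```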
Heck, the solar panel angle is fun but not really necessary. From the point of view of the electric power grid, it would be very, very good to have a large energy sink that can even out power fluctuations. Renewable energy often arrives at different times than we'd like to draw power out. Sometimes, as with wind, we get very sharp spikes that we need to even out. If designed properly, the Hyperloop could absorb pretty much any power spike, and could bleed enough power out to be interesting. Therefore if the power utilities are smart, they should be willing to pay to add more plungers. Not because they care that they are improving peak capacity and reducing waiting time, but because they want to be able to store more energy in the system.
I'm sure that there are many improvements on this design that I have not thought of but which Elon has. I'm also sure that Elon has detailed blueprints that take this from a half-baked concept to something you can start to put cost estimates on. But this idea looks doable to me, and looks like it could - at least in principle - justify all of the claims that Elon has been making for the Hyperloop.
Finally I'd love to see this built. I'd love to see it built in California. But, unless someone like Elon pushes it, I'd be willing to bet that the Chinese get it first.
BTW for further discussion, see Hacker News
Monday, October 29, 2012
A/B testing scale cheat sheet
This is not a guide to how to do A/B testing. If you want that, see Effective A/B Testing, or any number of companies that will help you with A/B testing. Instead this is a cheat sheet of basic facts on A/B testing (mostly on the scale involved) to help people who are just getting started figure out what is feasible.
- If you've never tested, expect to find a number of 5-20% wins.
In my experience, most companies find several changes that each add in the neighborhood of 5-20% to the bottom line in their first year of testing. If you have enough volume to reliably detect wins of this size in a reasonable time frame, you should start now. If not, then you're not a great candidate for it... yet.
- Experienced testers find smaller wins.
When you first start testing, you pick up the low-hanging fruit. The rapid increase in profits can spoil you. Over time the wins get smaller and more specific to your business, which means they will take longer to detect. Expect this. But if you're still finding an average of one 2% win every month, that is around a 25% improvement in conversion rates per year.
- What works for others likely works for you. But not always.
Companies that test a lot tend to have settled on simple value propositions, streamlined signup processes, prominent calls to action, the same call to action in multiple places, and email marketing. Those are probably going to be good things for you to do as well. However Amazon famously relies on customer ratings and reviews. If you do not have their scale or product mix, you'll likely get very few reviews per product, and may get overwhelmingly negative reviews. So borrow, but test.
- Your testing methodology needs to be appropriate to your scale and experience.
A company with 5k prospects per day might like to run a complex multivariate test all the way down to signed-up prospects, and be able to find subtle 1% conversion wins. But they don't generate nearly enough data to do this. They may need to be satisfied with simple tests on the top step of the conversion funnel. But it would be a serious mistake for a company like Amazon to settle for such a poor testing methodology. In general you should use the most sophisticated testing methodology that you generate enough data to carry off.
- Back of the envelope: A 10% win typically needs around 800-1500 successes per version to be seen.
One of the top questions people have is how long a test takes to run. Unfortunately it depends on all sorts of things, including your traffic volume, conversion rates, the size of the win to be found, luck, and what confidence you cut the test off at. But if one version gets to 800 successes while the new one is at 880, you can call the test at a 95% confidence level. If you wait until you have 1500 versus 1650, you can call it at a 99% confidence level. This data point, combined with your knowledge of your business, gives you a starting point for planning.
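Here is one simple way to reproduce that rule of thumb, assuming equal traffic to both versions: under the null hypothesis that the versions are equally good, each success is equally likely to land in either one, so the split of successes should look like a series of fair coin flips. This is a sketch of that check, not necessarily the exact test your testing tool uses:

```python
from math import erf, sqrt

def split_confidence(successes_a, successes_b):
    """Two-sided confidence that the difference is real, treating the split of
    successes between two equal-traffic versions as fair coin flips under the null."""
    n = successes_a + successes_b
    z = abs(successes_b - n / 2) / sqrt(n * 0.25)  # normal approximation to the binomial
    return erf(z / sqrt(2))

print(split_confidence(800, 880))     # ~0.95
print(split_confidence(1500, 1650))   # ~0.99
```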
- Back of the envelope: Sensitivity scales as the square root of the number of observations.
For example a 5% win takes about 2x as much sensitivity as a 10% win, which means 4x as much data. So you need 3200-6000 successes per version to see it.
- Data required is roughly linear with number of versions.
Running more versions requires a bit more data per version to reach confidence. But not a lot. Thus the amount of data you need is roughly proportional to the number of versions. (But if some versions are real dogs, it is OK to randomly move people from those versions to other versions, which speeds up tests with a lot of versions.) Before considering a complicated multivariate test, you should do a back of the envelope to see if it is feasible for your business.
- Even if you knew the theoretical win, you can't predict how long it will actually take to within a factor of 3.
An A/B test reaches confidence when the observed difference is bigger than chance alone can plausibly explain. However your observed difference is the underlying signal plus a chance component. If the chance component is in the same direction as the underlying signal, the test finishes very fast. If the chance component is in the opposite direction, then you need enough data for the underlying signal to override the chance component and still end up larger than chance could explain. The difference in time is usually within a factor of 3 either way, but it is entirely luck which direction you get. (The rough estimates above are not too far from where you've got a 50% chance of having an answer.)
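If you want to see this for yourself, here is a toy simulation sketch. All of the rates and parameters are made up, and it repeatedly peeks at a fixed threshold, which is not a rigorous stopping rule; the point is only to illustrate how widely "time to an answer" varies from run to run when the true win is the same every time:

```python
import random
from math import sqrt

def visitors_to_call(p_a=0.10, lift=0.10, check_every=500, cap=500_000):
    """Run one simulated sequential test (version B has a true `lift` over A)
    and return how many visitors it took for B to lead at the 95% threshold."""
    p_b = p_a * (1 + lift)
    a = b = 0
    for v in range(1, cap + 1):
        if random.random() < 0.5:
            a += random.random() < p_a   # a visitor sent to A, maybe a success
        else:
            b += random.random() < p_b   # a visitor sent to B, maybe a success
        if v % check_every == 0:
            n = a + b
            if n and (b - n / 2) / sqrt(n * 0.25) >= 1.96:
                return v
    return cap

runs = sorted(visitors_to_call() for _ in range(100))
print(runs[10], runs[50], runs[90])      # a fast run, a typical run, a slow run
```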
- The lead when you hit confidence is not your real win.
This is the flip side of the above point. It matters because someone usually has the thankless task of forecasting growth. If you saw what looked like an 8% win, the real win could easily be 4%. Or 12%. Nailing that number down with any semblance of precision will take a lot more data, which means time and lost sales. There generally isn't much business value in knowing how much better your better version is, but whoever draws up forecasts will find the lack of a precise answer inconvenient.
- Test early, test often.
Suppose that you have 3 changes to test. If you run 3 tests, you can learn 3 facts. If you run one test with all three changes, you don't know which change actually made a difference. Small and granular tests therefore do more to sharpen your intuition about what works.
- Testing one step in the conversion funnel is OK only if you're small and just beginning testing.
Every business has some sort of conversion funnel which potential customers go through. They see your ad, click on it, click on a registration link, actually sign up, etc. As a business, you care about people who actually made you money. Each step loses people. Generally, whatever version pushes more people through the top step gets more business in the end. Particularly if it is a big win. But not always! Therefore if testing eventual conversions takes you too long, and you're still finding 10%+ wins at the top step in your funnel, it makes business sense to test and run with those preliminary answers. You'll make some mistakes, but you'll get more right than wrong. Testing poorly is better than not testing at all.
- People respond to change.
If you change your email subject lines, people may be 2-5% more likely to click just because it is different, whether or not it is better. Conversely moving a button on the website may decrease clicks because people don't see the button where they expect it. If you've progressed to looking for small wins, then you should keep an eye out for tests where this is likely to matter, and try to dig a little deeper on this.
- A/B testing revenue takes more data. A lot more.
How much more depends on your business. But be ready to see data requirements rise by a factor of 10 or more. Why? In the majority of companies, a fairly small fraction of customers spend a lot more than average. The detailed behavior of this subgroup matters a lot to revenue, so you need enough data to average out random fluctuations in this slice of the data.
- Interaction effects are likely ignorable.
Often people have several things that they would like to test at the same time. If you have sufficient data, of course, you would like to look at each slice caused by a combination of possible versions separately, and look for interaction effects that might arise with specific combinations. In my experience, most companies don't have enough volume to do that. However if you assign people to test versions randomly, and apply common sense to avoid obvious interaction effects (e.g. red text on a red background), then you're probably OK. Imperfect testing is better than not testing, and the imperfection of proceeding this way is generally pretty small.
As always, feedback is welcome. I have helped a number of companies in the Los Angeles area with A/B testing, and this tries to answer the most common questions that I've encountered about how much work it is and what returns they can hope for.
Wednesday, October 17, 2012
My son's flashcard routine
My 7-year-old son is in grade 2. In the previous grade, despite his intelligence, he was significantly behind his class in handwriting and spelling, and was still reversing letters. He was getting extra help from his teacher, but he still had an uphill battle. So I decided to start a flashcard routine to assist. This solved the original problem. Here is a description of the current routine, and how it has evolved to this point.
It will surprise nobody who has read Teaching Linear Algebra that I started with the thought of some sort of spaced repetition system to maximize his long-term retention with a minimum of effort. I needed to help him with handwriting among other things, so I wanted to personally evaluate how he was doing. This seemed simplest with a manual system. I therefore settled on a variation of the Leitner system because that is easy to keep track of by hand.
To make things simple for me to track, I am doing things by powers of 2. Every day we do the whole first pile. Half of the second. A quarter of the third. And so on. (Currently we top out at a 1/256th pile, but are not yet doing any cards from it.) Cards that are done correctly move into the next pile. Those that he gets wrong fall into the bad pile, which is the next day's every day pile.
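For the curious, here is one way to model that bookkeeping in code. The rotation details (treating each pile as a queue and taking the front 1/2^k of it each day) are my guess at a workable implementation, not exactly how we run it by hand:

```python
from collections import deque

# Piles 0..8: pile 0 is the every-day pile, pile k is seen 1/2**k of at a time,
# topping out at the 1/256th pile.
piles = [deque() for _ in range(9)]

def review_day(quiz):
    """Run one day's review. `quiz(card)` should return True if he gets it right."""
    missed = deque()
    for k, pile in enumerate(piles):
        share = -(-len(pile) // 2**k)            # ceil(len / 2**k): today's portion
        for _ in range(share):
            card = pile.popleft()
            if quiz(card):
                piles[min(k + 1, len(piles) - 1)].append(card)   # promote it
            else:
                missed.append(card)              # it falls into the bad pile
    piles[0].extend(missed)                      # tomorrow's every-day pile
    return list(missed)                          # candidates for the end-of-day drill
```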
So far, so good. I tried this, and quickly found that I did an excellent job of sifting through all of the words he knew and getting the ones he didn't know into the bottom pile. But he wasn't learning those. This led to frustration. Not good.
I then added an extra drill on the pile that he got wrong. At the end of the session, we do a quick drill with just the problem cards. Here is how the drill works until we get down to 3 cards. If he gets a card right on the first try, or gets right a card that has come all the way from the bottom since he last got it wrong, it is removed from the drill. If he gets it wrong, I tell him how to do it, and put it back in the pile near the top so he sees it again soon. If he gets it right after a recent reminder, it goes to the bottom to get a chance to come out of the drill.
After we get down to 3 cards, I switch the drill up. If he gets a card wrong I correct him and put it in slot 2. If he gets it right I put it on the bottom. Once he gets all three right, I end the drill for that day.
After I added this final drill on the problem cards, the "not learning" problem disappeared. He began learning, and saw his school performance improve. His spelling tests went from under half the words correct to the 80-100% range. Everyone was happy.
It is worth noting that at the end of grade 1 he took several tests, and we found that he was spelling at a grade 3 level. We have no direct measurement proving it, but I guarantee that he spells even better now.
This happiness lasted until he got used to doing well. Over time we had more piles. In school he was being given more words. I began adding simple arithmetic facts. This meant more and more work. Not fun work. Sometimes he would make a mistake on a card that he had known for a long time. Then he'd get upset. Once he got upset he'd get lots of others wrong. Over the next few days we'd get the cards moving back up the piles, then it would happen again. The flashcard routine became a point of conflict.
Then I had a great idea (which I borrowed from a speech therapist). The idea is that I'd mix a reward activity and flashcards. We'd start on the reward, then do a pile, go back to the reward, then do another pile, go back to the reward, and so on. The specific reward activity that we're using is that I'm reading books to him that are beyond his current reading level (currently The Black Cauldron), but in principle it could be anything. With this shift, the motivation problem completely disappeared. He enjoys the reward. The flashcards are a minor annoyance that gets him the reward. If he goes off track, the reward restores his equilibrium. Intellectually he's happy that he's mastering the material. But the reward is motivation.
With this fix in place, we lasted several months. Then we developed an issue. A couple of words were sufficiently hard that they just stayed in the bad pile every day. So I made a minor tweak. I had been doing his top pile, then his next, then his next, on down. But instead I do his every day pile. Then go into the top pile, next, next, etc. But after each of those groups I try him again on the every day words that he hasn't gotten right yet. Thus he is forced to get his trouble words right 2x per day. This helped him master them and got them moving back up.
With that fix, we lasted until this week. This week we had a problem. His spelling test for this week includes the word embarrassing. (And he can get a bonus for knowing peculiar.) The problem is that this word has enough spelling tricks that he simply cannot get it right in one pass. We tried several times, without success. I therefore have added flashcards like em(barr)assing for which he gets told, "The word 'embarrassing' starts 'em'. Write the 'barr' bit." With these intermediate flashcards he seems to be breaking the whole word up into manageable tasks, from which he can learn the word itself. But I've also generated a ton of temporary flashcards, which may become an issue. (I plan on removing those piecemeal ones after he successfully gets them into the every-8-day pile. In a few weeks I'll know how well this is working.)
That brings us to the current state of his flashcard routine. He currently has hundreds of spelling words and basic arithmetic facts learned. 373 of them learned sufficiently well that he reviews them less than once per month. But I am sure that I'm not done tweaking. Here are current issues:
- One week is not enough. Every week he is given a new set of words to master. But as anyone who has done spaced repetition knows, a week is not very long to master material. Spaced repetition excels for memorizing a body of data over years, not one week. On most weeks he is given a set of standard words to learn, and a set of words for bonus points. With the bonus words he usually gets over 100% on his tests. But we don't stop, so now he'd do substantially better on last week's test than he actually did last week.
- He's only learning what I know that he needs to. This week I reached out to his teacher, mentioned that I am doing flashcards with him, and asked for feedback on more ways to use them for his benefit. She pointed out a number of things he can improve on, including common words that he has wrong, grammar, poems he is supposed to memorize, and geography that he is supposed to learn. The flashcard routine can help with these issues in time, but I had not been aware that he needed it. Better late than never...
- Work is climbing again. Currently every day I add 2 cards. Plus every week I add a spelling test of unpredictable size (this week 27, of which he already knew one). This is increasing the size of the bottom piles, and the work has been increasing. It is manageable, but I'm keeping my eye on it.
- This takes my time. At the moment that's unavoidable. One of the issues that we're still working on is handwriting, so there needs to be a human evaluation of what he's doing. But still I'm taking an hour per day with this. I think it is an hour well-spent that we both value. However in a couple of years if his sister needs similar help, what then? In the long run I'd love to offload the flashcards to a computer program, but the idea of a reward activity has to be in there. All of the flashcard apps that I've seen assume that doing flashcards is itself a fun activity. That will not work for my son. Maybe I'm being too picky. But I've developed opinions about what works while fine-tuning my son's system. If there is something that fits that, I'd love to find it.
Labels: children, education, flashcards, Leitner system, school, spaced repetition, spelling
Monday, October 8, 2012
How reliable will the Falcon 9 be?
Let's apply statistics to see, based on current launch data, how reliable we should expect the Falcon 9 to be.
Falcon 9 just had a launch that succeeded despite an engine failure. According to design parameters, it should be able to survive the failure of any two engines. But the flight can be lost if we lose 3+ engines. Exactly how reliable is the Falcon 9 design?
Let me first take a naive approach. To date we've had 4 launches of the Falcon 9, each with 9 engines (that's the 9 in Falcon 9), and have seen one in-flight engine failure. The measured success rate of an engine is therefore 35/36. With that in mind, we can produce the following figures.
- Probability of no engine failures: (35/36)**9 * (1 - 35/36)**0 * (9 choose 0) = (35/36)**9 = 77.6%
- Probability of 1 engine failure: (35/36)**8 * (1 - 35/36)**1 * (9 choose 1) = (35/36)**8 * (1/36) * 9 = 20.0%
- Probability of 2 engine failures: (35/36)**7 * (1 - 35/36)**2 * (9 choose 2) = (35/36)**7 * (1/36)**2 * 36 = 2.3%
- Probability of 3+ engine failures: 1 - above probabilities = 0.2% (actually 0.16%)
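For anyone who wants to check the arithmetic, a few lines of Python reproduce those binomial figures:

```python
from math import comb

p = 1 / 36                                 # measured per-engine failure rate
probs = [comb(9, k) * p**k * (1 - p)**(9 - k) for k in range(10)]
print([round(x, 4) for x in probs[:3]])    # ~[0.776, 0.1996, 0.0228]
print(round(1 - sum(probs[:3]), 4))        # P(3+ failures) ~ 0.0016
```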
For comparison the US Space Shuttle had a failure rate of 2/135 which is about 1.5%.
So SpaceX flights are dangerous compared to most things that we do, but so far seem much better than any previous mode of transport, including the US Space Shuttle. Which was previously the most reliable form of transport into space. (Not the safest though! Soyuz has that record because, unlike the Space Shuttle, they've demonstrated the ability to have passengers survive a catastrophic failure that aborted the mission.)
But is that the end of the story? No!
Suppose that the true failure rate of each individual engine is actually 10%. Then an exactly parallel calculation to the above will find that the failure rate of a rocket launch is 5.3%. That doesn't sound very reliable!
However is it reasonable to think that 10% is a likely failure rate for each engine? Well, suppose that before we had seen any launches we thought that a 10% failure rate was just as likely as a failure rate of 1/36. Our observation is 1 engine failure out of 36. The odds of that exact observation with a 10% failure rate are 9.0%. The odds of that observation with a failure rate of 1/36 are 37.3%. According to Bayes' theorem, the probabilities that we give to theories after making an observation should be proportional to our initial belief in the probability of that theory times the probability of the given observation under that theory.
That is a mouthful. Let's look at numbers. In this hypothetical scenario our initial belief was a 50% chance of a 10% failure rate, and a 50% chance of a failure rate of 1/36. After observing 36 instances of engines lifting off with 1 failure, the 10% theory has probability proportional to 4.5%, while the 1/36 theory has probability proportional to 18.65%. Thus our updated belief is that the 10% theory has likelihood 4.5/(4.5 + 18.65) = 0.194, or about 20%. (Without the intermediate rounding we'd actually be at 0.195.) And the 1/36 theory has likelihood around 80%. Then combining the predictions of the theories with the likelihood assigned to each theory we get an estimated failure rate of 0.053 * 0.195 + 0.0016 * 0.805 = 0.0116 = 1.16%. Our confidence in the record put up by the Falcon 9 is not as good now!
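Here is the same two-theory update written out in Python, which reproduces the numbers above:

```python
from math import comb

def engine_evidence(p):      # probability of exactly 1 failure in 36 engine firings
    return comb(36, 1) * p * (1 - p)**35

def launch_failure(p):       # probability of 3+ engine failures out of 9
    return 1 - sum(comb(9, k) * p**k * (1 - p)**(9 - k) for k in range(3))

priors = {0.10: 0.5, 1/36: 0.5}                       # the two theories, equally weighted
weights = {p: w * engine_evidence(p) for p, w in priors.items()}
total = sum(weights.values())
posterior = {p: w / total for p, w in weights.items()}

estimate = sum(posterior[p] * launch_failure(p) for p in posterior)
print(round(posterior[0.10], 3), round(estimate, 4))  # ~0.195 and ~0.0116
```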
Please note the following characteristics of this analysis:
- Observations do not tell us what reality is, they update our models of reality.
- A wide range of failure probabilities fit the limited observations that we have so far on the Falcon 9.
- With enough data, theories that are far away from the observed average become very unlikely.
Now a curious person might want to know what the odds of failure would be if we included more possible prior theories. I whipped up a quick Perl script to do the calculation for an initial expectation that 0.00%, 0.01%, 0.02%, ..., 99.99%, 100% were all equally likely failure rates a priori. When I run that script I get a probability of 0.0198180199757443, which is an estimated failure rate of about 2%. If you start with different beliefs, you can generate very different specific numbers. For an extreme instance, if you believe that SpaceX is constantly improving, so that their future engines are likely to be more reliable than their past ones, then ridiculously good numbers become very plausible.
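For reference, a rough Python sketch of that same calculation (a uniform grid of candidate engine failure rates, each weighted by how well it explains 1 failure in 36 firings) lands at about the same 2% figure:

```python
from math import comb

def engine_evidence(p):      # probability of exactly 1 failure in 36 engine firings
    return 36 * p * (1 - p)**35

def launch_failure(p):       # probability of 3+ engine failures out of 9
    return 1 - sum(comb(9, k) * p**k * (1 - p)**(9 - k) for k in range(3))

grid = [i / 10000 for i in range(10001)]       # 0.00%, 0.01%, ..., 100.00%
weights = [engine_evidence(p) for p in grid]   # a flat prior simply cancels out
total = sum(weights)
estimate = sum((w / total) * launch_failure(p) for p, w in zip(grid, weights))
print(estimate)                                # ~0.0198, i.e. about a 2% failure rate
```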
However the bottom line is that we cannot yet, based on the data that we have so far, conclude that we have good evidence that the Falcon 9 actually will put up a better reliability record over its lifetime than previous space vehicles.
Monday, September 17, 2012
A/B Testing vs MAB algorithms - It's complicated
A few months ago, several blog posts appeared on Hacker News comparing A/B testing and multi-armed bandit techniques. If you want to review the posts and the discussion, see 20 lines of code that beat A/B testing every time, then Why multi-armed bandit algorithm is not "better" than A/B testing and finally Why Multi-armed Bandit algorithms are superior to A/B testing (with Math). I participated in those discussions, and ever since then I've been wanting to write up my thoughts once I had them in a compact enough form to do so.
That has taken an unfortunately long time. In fact I've given up on saying everything that I want to say in a compact form, and will try to only say what I think is most important. And even that has wound up less compact than I'd like...
First a disclaimer. Website optimization has been a large part of what I've done in the last decade, and I've been a heavy user of A/B testing. See Effective A/B Testing for a well-regarded tutorial that I did on it several years ago. I have much less experience with multi-armed bandit approaches. I don't believe that I am biased. But if I were, it is clear what my bias would be.
Here is a quick summary of what the rest of this post will attempt to convince you of.
- If you have actual traffic and are not running tests, start now. I don't actually expand on this point, but it is trivially true. If you've not started testing, it would be a shock if you couldn't find at least one 5-10% improvement in your business within 6 months of trying it. Odds are that you'll find several. What is that worth to you?
- A/B testing is an effective optimization methodology.
- A good multi-armed bandit strategy provides another effective optimization methodology. Behind the scenes there is more math, more assumptions, and the possibility of better theoretical characteristics than A/B testing.
- Despite this, if you want to be confident in your statistics, want to be able to do more complex analysis, or have certain business problems, A/B testing likely is a better fit.
- And finally if you want an automated "set and forget" approach, particularly if you need to do continuous optimization, bandit approaches should be considered first.
That summary requires a lot of justification. Read on for that.