Random Observations: How reliable will the Falcon 9 be?

Let's apply some statistics to the launch data so far and see how reliable we should expect the Falcon 9 to be.

Falcon 9 just had a launch that succeeded despite an engine failure. According to design parameters, it should be able to survive the failure of any two engines. But the flight can be lost if we lose 3+ engines. Exactly how reliable is the Falcon 9 design?

Let me first take a naive approach. To date we've had 4 launches of the Falcon 9, each with 9 engines (that's the 9 in Falcon 9), and have seen one in-flight engine failure. The measured success rate of an engine is therefore 35/36. With that in mind, we can produce the following figures.

Probability of no engine failures: (35/36)**9 * (1 - 35/36)**0 * (9 choose 0) = (35/36)**9 = 77.6%
Probability of 1 engine failure: (35/36)**8 * (1 - 35/36)**1 * (9 choose 1) = (35/36)**8 * (1/36) * 9 = 20.0%
Probability of 2 engine failures: (35/36)**7 * (1 - 35/36)**2 * (9 choose 2) = (35/36)**7 * (1/36)**2 * 36 = 2.3%
Probability of 3+ engine failures: 1 - above probabilities = 0.2% (actually 0.16%)
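
If you want to check the arithmetic above yourself, here is a minimal Python sketch of the same binomial calculation. The function name engine_failure_pmf is my own invention for illustration, and the code assumes, as the naive approach does, that engines fail independently with the measured rate of 1/36.

```python
from math import comb

def engine_failure_pmf(k, n=9, p=1/36):
    """Probability that exactly k of n independent engines fail,
    when each engine fails with probability p (binomial distribution)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probabilities of 0, 1 and 2 engine failures on a single launch.
survivable = [engine_failure_pmf(k) for k in range(3)]

# The launch is assumed lost when 3 or more engines fail.
p_loss = 1 - sum(survivable)

for k, prob in enumerate(survivable):
    print(f"{k} engine failures: {prob:.1%}")
print(f"3+ engine failures: {p_loss:.2%}")
```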

For comparison the US Space Shuttle had a failure rate of 2/135 which is about 1.5%.

So SpaceX flights are dangerous compared to most things that we do, but so far they look much better than any previous mode of transport into space, including the US Space Shuttle, which was previously the most reliable. (Not the safest, though! Soyuz holds that record because, unlike the Space Shuttle, it has demonstrated that passengers can survive a catastrophic failure that aborts the mission.)

But is that the end of the story? No!

Suppose that the true failure rate of each individual engine is actually 10%. Then an exactly parallel calculation to the above will find that the failure rate of a rocket launch is 5.3%. That doesn't sound very reliable!
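
To see how sensitive the launch failure rate is to the assumed per-engine failure rate, here is a small Python sketch. The function name launch_failure_probability and the intermediate 5% rate are my own choices for illustration. It assumes independent engine failures and that a launch survives up to two of them.

```python
from math import comb

def launch_failure_probability(p, engines=9, tolerated=2):
    """Probability that more than `tolerated` of `engines` engines fail,
    when each engine fails independently with probability p."""
    p_survive = sum(
        comb(engines, k) * p**k * (1 - p)**(engines - k)
        for k in range(tolerated + 1)
    )
    return 1 - p_survive

# Compare the measured engine failure rate with more pessimistic ones.
for p in (1/36, 0.05, 0.10):
    print(f"engine failure rate {p:.3f} -> "
          f"launch failure rate {launch_failure_probability(p):.2%}")
```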

However, is it reasonable to think that 10% is a likely failure rate for an engine? Well, suppose that before we had seen any launches we thought that a 10% failure rate was exactly as likely as a failure rate of 1/36. Our observation is 1 engine failure out of 36. The probability of that exact observation with a 10% failure rate is 9.0%. The probability of that observation with a failure rate of 1/36 is 37.3%. According to Bayes' theorem, the probabilities that we give to theories after making an observation should be proportional to our initial belief in each theory times the probability of the given observation under that theory.

That is a mouthful. Let's look at numbers. In this hypothetical scenario our initial belief was a 50% chance of a 10% failure rate, and a 50% chance of a failure rate of 1/36. After observing 36 instances of engines lifting off with 1 failure, the 10% theory has probability proportional to 0.5 * 9.0% = 4.5%, while the 1/36 theory has probability proportional to 0.5 * 37.3% = 18.65%. Thus our updated belief is that the 10% theory has probability 4.5/(4.5 + 18.65) = 0.194, or about 19%. (Without the intermediate rounding we'd actually be at 0.195.) And the 1/36 theory has probability around 81%. Then combining the predictions of the theories with the probability assigned to each theory, we get an estimated launch failure rate of 0.053 * 0.195 + 0.0016 * 0.805 = 0.0116 = 1.16%. Our confidence in the record put up by the Falcon 9 is not as good now!
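
The two-theory update can be sketched in a few lines of Python. This is only an illustration under the assumptions of the scenario (two candidate engine failure rates, a 50/50 prior, independent engines). The names likelihood, launch_failure_probability, priors and posterior are mine.

```python
from math import comb

ENGINES_FLOWN = 36   # 4 launches * 9 engines
FAILURES_SEEN = 1

def likelihood(p):
    """Probability of seeing exactly FAILURES_SEEN failures among
    ENGINES_FLOWN engines if each fails with probability p."""
    return (comb(ENGINES_FLOWN, FAILURES_SEEN)
            * p**FAILURES_SEEN
            * (1 - p)**(ENGINES_FLOWN - FAILURES_SEEN))

def launch_failure_probability(p, engines=9, tolerated=2):
    """Probability that more than `tolerated` of `engines` engines fail."""
    p_survive = sum(
        comb(engines, k) * p**k * (1 - p)**(engines - k)
        for k in range(tolerated + 1)
    )
    return 1 - p_survive

# Prior: each theory about the engine failure rate is equally likely.
priors = {0.10: 0.5, 1/36: 0.5}

# Bayes' theorem: posterior is proportional to prior * likelihood.
weights = {p: prior * likelihood(p) for p, prior in priors.items()}
total = sum(weights.values())
posterior = {p: w / total for p, w in weights.items()}

# Blend each theory's predicted launch failure rate by its posterior weight.
predicted = sum(posterior[p] * launch_failure_probability(p) for p in posterior)

for p, prob in posterior.items():
    print(f"engine failure rate {p:.4f}: posterior {prob:.3f}")
print(f"predicted launch failure rate: {predicted:.2%}")
```

Changing the entries in priors is an easy way to see how much the answer depends on what you believed before the launches.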

Please note the following characteristics of this analysis:

Observations do not tell us what reality is, they update our models of reality.
A wide range of failure probabilities fit the limited observations that we have so far on the Falcon 9.
With enough data, theories that are far away from the observed average become very unlikely.

Now a curious person might want to know what the odds of failure would be if we included more possible prior theories. I whipped up a quick Perl script to do the calculation for an initial expectation that 0.00%, 0.01%, 0.02%, ..., 99.99%, 100% were all equally likely engine failure rates a priori. When I run that script I get a probability of 0.0198180199757443, which is an estimated launch failure rate of about 2%. If you start with different beliefs, you can generate very different specific numbers. As an extreme example, if you believe that SpaceX is constantly improving, so that their future engines are likely to be more reliable than their past ones, then ridiculously good numbers become very plausible.
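
For anyone who wants to reproduce that kind of calculation, here is a minimal Python sketch of the same idea. To be clear, this is not the Perl script mentioned above, just my reconstruction of the approach it describes: a flat prior over engine failure rates from 0% to 100% in steps of 0.01%, updated on 1 failure in 36 engines. All names in it are hypothetical. It should land close to the roughly 2% figure quoted above.

```python
from math import comb

def launch_failure_probability(p, engines=9, tolerated=2):
    """Probability that more than `tolerated` of `engines` engines fail."""
    p_survive = sum(
        comb(engines, k) * p**k * (1 - p)**(engines - k)
        for k in range(tolerated + 1)
    )
    return 1 - p_survive

STEPS = 10_000  # grid of engine failure rates: 0.00%, 0.01%, ..., 100%
grid = [i / STEPS for i in range(STEPS + 1)]

# With a flat prior, the posterior weight of each rate is proportional to
# the likelihood of 1 failure in 36 engines. The constant factor
# (36 choose 1) cancels when we normalize, so it is left out.
weights = [p * (1 - p)**35 for p in grid]
total = sum(weights)

# Posterior-weighted average of the launch failure probability.
estimate = sum(
    w * launch_failure_probability(p) for p, w in zip(grid, weights)
) / total

print(f"estimated launch failure rate: {estimate:.4f}")
```

Swapping in a non-flat list of prior weights (multiplied into weights) is how you would explore the "different beliefs" mentioned above.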

However the bottom line is that we cannot yet, based on the data that we have so far, conclude that we have good evidence that the Falcon 9 actually will put up a better reliability record over its lifetime than previous space vehicles.