Friday, February 5, 2010

Developing on HEAD scales to Google

I have a simple rule of thumb for what I am and am not allowed to say about Google. If I can find it said in some official-looking place from Google, then I think I'm allowed to say it. Otherwise not.

I was therefore very glad to run across this talk by Guido van Rossum describing his code review tool, Mondrian. It was posted to YouTube by Google Tech Talks and, judging from the knowledge level of the audience, delivered to a non-Google audience. To verify that it was not, in fact, accidentally made open, I found an O'Reilly article confirming that the talk was public.

I therefore feel comfortable talking about anything and everything said in that video. So you can view this as my spin on the material in that video, and if you want the full version without the editorializing, feel free to take an hour and watch the video for the original source.

These days everyone who is competent has agreed that source control is a Good Thing to use. However opinions on how to use it vary widely. One argument in particular that I've seen play out in multiple organizations is over the value of using multiple branches versus having everyone develop on one branch, which is often called something like MAIN or HEAD. The issues involved are somewhat complex and not obvious, so let me give a quick overview.

The primary argument for branching is that people can work on different features on different schedules without getting in each other's way. The primary argument for everyone developing on the same branch is that you discover conflicts immediately, when they are easiest to resolve, rather than months later when people have moved on to different things and have lost context for the pieces of code that conflict. The primary argument against branching is the pain of merging later. (Granted, distributed version control systems like git have done a lot to reduce this pain. But they have not eliminated it.) The primary argument against developing on HEAD is that it requires a constant level of diligence on the part of developers. When any developer can break all developers, you need to be careful about what you check in.

I've just kept this down to the primary arguments because the secondary back-and-forth arguments get long, involved, and heated. Also, what I've described as two ways of working are really two families of approaches. There are a lot of ways of organizing multiple branches. And there are quite a few useful uses for branches in a software project even if everyone is developing on the same branch. None of this is made easier by the fact that (as with many religious wars) the people involved have imprinted on one way of doing things. That makes them hyper-aware of the problems in other approaches, while they don't even notice the potential pain points in their own.

Until I came to Google my personal position was that everyone working on HEAD was the best approach as long as your team was small enough to make it work. And I vaguely accepted that the pain of branching was a necessary evil on large software projects. Even when the pain reached the point of craziness, I was mostly willing to accept unsupported claims that it was necessary.

Then I came to Google. There are a lot of groups at Google, and they are free to do their own thing. Many do. However most groups develop out of one giant code base on HEAD. And it works as a development model. Google has made it scale.

Guido's talk describes many elements of what makes it scale. The first piece is having good developers. The second piece is an enforced policy of having every single patch go through a code review process before anything is checked in. The third piece is a lot of proprietary infrastructure that Google has built up to make things work. And beyond that you have people paying attention to best practices such as consistent style, good unit testing, and so on and so forth. (All of which are reinforced in the code review process.)

My opinion after seeing Google has changed. I freely admit that there are real process and tool challenges to making it possible for large teams to develop everything on HEAD and have it scale smoothly. However it is possible. I've seen it work. And, speaking personally, this is my preferred way to work.

Different organizations are different: they have different capabilities, different needs, and different goals. Sight unseen, I'm not going to tell anyone else that their organization should try to work like Google does. I simply don't have the facts about your situation. But if, like me not that long ago, you've accepted the claim that large development teams have no choice but to branch, you now know better.

Now admittedly most people can't go and see this first hand for themselves at Google. But if you want to watch a large successful project developing code to a high standard in this way, I recommend watching clang. And if you want to know more about what kinds of tool support you need to make this work, go listen to Guido. He's smarter than I am, has been doing it longer, and actually built some of the basic tool support for it.

Monday, February 1, 2010

What is intelligence?

Last September I posted on why hard sciences are different from soft ones. The subject of this blog post is a perfect illustration of the critical difference. In the hard sciences people mostly agree on basic definitions, broad agreement exists on key problems to solve, and in general there is agreement on a basic paradigm. In the soft sciences, through no fault of the researchers, none of this exists.

Few things illustrate this better than getting a handle on what intelligence is. Here we have a subject that has been seriously studied for over a century. Yet researchers have not achieved agreement on whether they should focus on a single measure of intelligence called g, or multiple measures, or if multiple measures then how many there should be and how they should be divided. Note here that if you talk to individual researchers this fact does not worry them, because each researcher works in a community of other researchers who have achieved agreement on this. Each school of thought therefore churns out a stream of research, but they have failed to convert each other. And after a century the debate shows no sign of ending any time soon.

Please note that I am far from an expert on any of this, so I'll try to summarize my understanding in broad terms, but my understanding could be incorrect. Also be aware that a lot of experts disagree with each other strongly, and therefore it is easy to find experts who will disagree with anything interesting that I can say. Please keep those caveats in mind as I try to flounder my way through the complications.

Let's start with the simplest approach, namely Spearman's g. What Spearman did in 1904 was take a number of things that one would think correlate with intelligence, such as grades, and observe that they correlated well with each other. He then looked for some linear combination that correlated better with each of the initial factors than any initial factor did. He found one that did so, and found that it successfully predicted a very large part of the variation. In modern mathematical terms he did something called a principal component analysis and looked for the most significant component. He then found that the most significant component really did capture a lot of the variability.

Spearman called this factor g for general intelligence.
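For anyone curious what that procedure looks like in modern terms, here is a minimal sketch in Python (assuming numpy is available; the data is synthetic and invented purely for illustration, not Spearman's). Several test scores are generated from one hidden factor plus noise, and the first principal component of their correlation matrix ends up carrying a large share of the variance, which is the shape of evidence Spearman was working from.

    import numpy as np

    rng = np.random.default_rng(0)
    n_people, n_tests = 5000, 6

    # Hypothetical setup: one shared latent factor plus independent noise per test.
    latent = rng.normal(0, 1, n_people)
    loadings = rng.uniform(0.5, 0.9, n_tests)      # how strongly each test reflects the factor
    noise = rng.normal(0, 1, (n_people, n_tests))
    scores = latent[:, None] * loadings + noise * np.sqrt(1 - loadings**2)

    # Principal component analysis via the eigenvalues of the correlation matrix.
    eigvals = np.linalg.eigvalsh(np.corrcoef(scores.T))[::-1]
    print("share of variance per component:", eigvals / eigvals.sum())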

Not long after, the IQ test was invented. Before long people added IQ tests to the mix of factors, and found that a well-designed IQ test served as a fairly good predictor of g. Then g in turn served as a decent predictor for many things, including academic performance and success in the workplace. Since IQ tests were fast and easy to administer, IQ tests and ability tests took off. (Ability tests such as the SAT and GRE exams actually are fairly directly comparable to IQ tests, and high-IQ societies such as MENSA are happy to accept them as proof that your IQ is high enough to belong.)

All of this raises many important questions. We got g out of a mathematical construct. But is it real? Is there really such a thing as general intelligence? Even if there is such a thing as general intelligence, is it a stable attribute of a person? Does your general intelligence change over time? And since a principal component analysis can produce multiple components, how important are the lesser components? (The existence of a split between mathematical and verbal ability is obvious enough that most ability tests split those into separate sections.) Are there important factors which have been missed? How much of this is affected by heredity versus environment? How does race play into this? Can training change how you score on the test? Can it change your actual intelligence?

Believe you me, those questions have been researched. Extensively. And argued. More extensively. The only points on which there seems to be a decent consensus are that, to the extent that general intelligence makes sense, it seems to be fairly stable, and that both hereditary and environmental factors affect it. Every other question is still under debate, though it is easy to find experts who make definitive claims that the matter has been completely settled one way or the other.

As one example of the ongoing debate, ETS, which puts out various tests including the SAT exams, claims that training cannot improve your performance. At the same time Kaplan makes money selling the claim that they can significantly improve your performance. Kaplan has enough evidence of effectiveness that I accept that they are right, but the debate continues. I suspect that is not least because ability tests are just IQ tests under a different name, and so accepting that Kaplan can teach an ability test is evidence that a key tool underlying a lot of research into psychology is fundamentally flawed, which opens up a can of worms that many don't want to look into.

For another example, I strongly suggest that mathematically inclined readers read "g, a Statistical Myth". It presents a toy statistical model which demonstrates how readily a large number of independent random factors with a small amount of interaction could give rise to all of the statistical evidence which underlies claims that such a thing as general intelligence exists. Given that any hereditary effect on intelligence is the result of a large number of fairly independent random factors called genes, I think the point is an important one. The idea that we each have a huge number of different intellectual abilities at different degrees of competence accords very well with my understanding of how the brain works, and with my experience of interacting with others.
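To make that argument concrete, here is a rough simulation of the general idea (my own toy setup with made-up numbers, not the essay's actual model): give each simulated person hundreds of completely independent narrow abilities, let each test draw on a random overlapping subset of them, and look at what the correlation matrix of the test scores does. A dominant first component shows up anyway.

    import numpy as np

    rng = np.random.default_rng(1)
    n_people, n_abilities, n_tests = 5000, 500, 8

    # Hundreds of independent narrow abilities, no general factor anywhere.
    abilities = rng.normal(0, 1, (n_people, n_abilities))

    # Each test taps a random roughly-half of the abilities with random positive weights.
    masks = rng.random((n_tests, n_abilities)) < 0.5
    weights = masks * rng.uniform(0.5, 1.5, (n_tests, n_abilities))
    scores = abilities @ weights.T

    corr = np.corrcoef(scores.T)
    eigvals = np.linalg.eigvalsh(corr)[::-1]
    print("mean inter-test correlation:", corr[np.triu_indices(n_tests, 1)].mean())
    print("first component's share of variance:", eigvals[0] / eigvals.sum())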

However it is also true that a small number of factors have a major effect on many different kinds of mental activities. Thus some combination of the number of things you can keep in your mind at the same time and your speed of reasoning may be influential enough, on enough things, to deserve the name "general intelligence".

At this point I'd say that the jury is not just out; they are deadlocked and randomly fighting each other. And have been in this state for the better part of a century.




So what do I have to contribute to the discussion? I'd like to throw out a toy mathematical model for how IQ and intelligence relate. Please do not take this too seriously. After all, it is not even agreed at this point that general intelligence is a particularly well-defined concept. But what I'd like to show is that even if you do accept that general intelligence exists, and that IQ is reasonably well correlated with it, IQ tests are not a particularly effective way to find the smartest people. (Whether or not any better way exists is a question I'd like to avoid. In fact, please forget this parenthetical point.)

Now suppose you sit down to take a test. You come prepared with many relevant abilities. You have your vocabulary, your polished reasoning ability, and a body of basic knowledge you're expected to have acquired about the world. We are generally prepared to accept these as components of intelligence, and the test will try to measure them. Let's call that INT. However you also come prepared with other relevant abilities. There is how well you are feeling, how much sleep you got, any tricks you know for handling the types of questions that are likely to be asked, cultural cues you share with the test makers, and how stressed you are. We are generally not prepared to accept these as components of intelligence, yet they affect how we do. Let us call these TST.

For my toy model let's say that INT and TST are independent random normals with a mean of 0 and a standard deviation of 10. That's math talk describing a particular bell curve. If you want the odds of it being less than a particular number you just have to divide the number by 10 to get the number of standard deviations you're out, and then look at a standard table and read off the probability.

Now let's say that the measured IQ will be 100+INT+TST. If you're a mathematician you'll know that IQ will have a normal distribution with mean 100 and a standard deviation of 10*sqrt(2) which is about 14.14. The correlation between IQ and INT turns out to be 70.7%. By comparison the Stanford-Binet is designed to have a mean of 100 and a standard deviation of 16, and its correlation with other measures of intelligence like academic performance is (depending on who measures it and how it is done) in the range of 70-80%. So while the model is pretty simple, it is not too far off from how a real IQ test behaves. And it captures the obvious observation that there are factors that affect test performance which we don't think of as intelligence.
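If you would rather check those numbers than take my word for them, a brute-force simulation does it in a few lines (a sketch in Python, assuming numpy is installed; the constants are just the model's definitions above):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 1_000_000
    INT = rng.normal(0, 10, n)    # the part we are willing to call intelligence
    TST = rng.normal(0, 10, n)    # sleep, stress, test-taking tricks, and so on
    IQ = 100 + INT + TST

    print("IQ standard deviation:", IQ.std())             # about 14.14 = 10*sqrt(2)
    print("corr(IQ, INT):", np.corrcoef(IQ, INT)[0, 1])   # about 0.707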

Now the fun thing about a model like this is that we can play with it since we know exactly how it is set up. And are free to add any internal details we want. And while we can't necessarily believe the answers we get, we can gain valuable intuition about what kinds of answers we can expect to see if we understood a far more complex system. Such as real people with real brains.

To analyze this I'm going to complicate the model one more time by introducing two more random variables. Let's make them independent normal variables with mean 0 and standard deviation 10/sqrt(2), and call them AVG and DIF. It turns out that (AVG+DIF) and (AVG-DIF) are independent normal variables with mean 0 and standard deviation 10. So let's make (AVG+DIF) be INT, and (AVG-DIF) be TST. High-school algebra shows that AVG = (INT+TST)/2 and DIF = (INT-TST)/2. (Hence the names.)

What was the point of that? Well it is this. At this point IQ = 100+2*AVG. And INT = AVG+DIF. Which means that if you give me an IQ, I can calculate AVG then use the known distribution of DIF to calculate confidence intervals on INT.
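If the algebra feels slippery, the same reparameterization can be checked numerically (again just a numpy sketch of the model as defined above):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 1_000_000
    AVG = rng.normal(0, 10 / np.sqrt(2), n)
    DIF = rng.normal(0, 10 / np.sqrt(2), n)
    INT, TST = AVG + DIF, AVG - DIF
    IQ = 100 + INT + TST

    print("std dev of INT and TST:", INT.std(), TST.std())                  # both about 10
    print("corr(INT, TST):", np.corrcoef(INT, TST)[0, 1])                   # about 0
    print("max |IQ - (100 + 2*AVG)|:", np.abs(IQ - (100 + 2 * AVG)).max())  # essentially 0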

For example, in this model how intelligent is a typical member of MENSA? Well, MENSA accepts anyone whose IQ is in the top 2% of the population. So a typical member might have an IQ at the boundary of the top 1%. (If you haven't taken statistics these calculations may make your eyes glaze over. You'll have to just trust that I'm coming up with correct numbers.) That IQ is about 2.326 standard deviations out, which after multiplying by the 10*sqrt(2) and adding 100 is an IQ of 132.90. (Remember that our standard deviation is lower than the regular IQ test's, so our purported IQ scores will be lower than on a regular test.) Which means that AVG is 16.45. On average our INT will be the same as the AVG, which puts you 1.645 standard deviations out, which is right around the 95th percentile.

We can go farther and calculate a 95% confidence interval for what the real intelligence of this individual is. 95% of the time a normally distributed variable will be within 1.960 standard deviations of 0, so that means that DIF will be in the range +- 13.86. Which means that INT will be in (2.59, 30.31). Which means that INT is anywhere from 0.259 to 3.031 standard deviations out, which would put INT anywhere from about the 60th percentile to almost the 99.9th percentile. Hmmm, we really haven't nailed down INT very closely, have we? A similar calculation shows that with 80% odds INT is between the 77th and 99.5th percentiles. And you can even do fun things like show that with about 72% odds, this individual is not in the top 2% of the population on INT, and therefore would not qualify for MENSA if they were able to measure INT directly instead of the more complicated IQ.
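If you want to reproduce those numbers yourself, here is a sketch using the normal distribution functions in scipy (assumed to be installed; the constants are simply the toy model's parameters):

    from scipy.stats import norm

    sd_IQ, sd_INT, sd_DIF = 10 * 2**0.5, 10.0, 10 / 2**0.5

    iq = 100 + norm.ppf(0.99) * sd_IQ   # "typical" member at the top-1% boundary, about 132.90
    avg = (iq - 100) / 2                # expected INT given that IQ, about 16.45
    print("expected INT percentile:", norm.cdf(avg / sd_INT))     # about 0.95

    lo, hi = avg - 1.96 * sd_DIF, avg + 1.96 * sd_DIF             # 95% interval for INT
    print("95% CI percentiles:", norm.cdf(lo / sd_INT), norm.cdf(hi / sd_INT))

    cutoff_top2 = norm.ppf(0.98) * sd_INT   # INT needed to be in the top 2%
    print("P(not in top 2% of INT):", norm.cdf((cutoff_top2 - avg) / sd_DIF))   # about 0.72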

Let's reverse the question. Let's take someone at the boundary of the 99th percentile in INT, and ask what IQ we should expect on average. Well, that INT is 2.326 standard deviations out, which is an INT of 23.26, for an expected IQ of 123.26, which is 1.645 standard deviations out, which is right around the 95th percentile. So sorry, you're probably not going to test as smart enough for MENSA even though you're probably smarter than most of the people in MENSA. How likely is this person to fail to meet MENSA's admission standards? Well, the cutoff for MENSA is the top 2%, which is 2.0537 standard deviations out, which is an IQ of 129.04 (again remember that the toy model's IQs are not as spread out as real IQ tests), so we need TST to be at least 5.78, which is 0.578 standard deviations out, which gives a probability of about 72% of not qualifying for MENSA.
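The reverse calculation looks like this in the same style (same scipy assumption as before):

    from scipy.stats import norm

    sd_IQ, sd_INT, sd_TST = 10 * 2**0.5, 10.0, 10.0

    int_99 = norm.ppf(0.99) * sd_INT     # about 23.26
    expected_iq = 100 + int_99           # TST averages out to zero, so about 123.26
    print("expected IQ percentile:", norm.cdf((expected_iq - 100) / sd_IQ))   # about 0.95

    mensa_cutoff = 100 + norm.ppf(0.98) * sd_IQ   # about 129.04
    tst_needed = mensa_cutoff - expected_iq       # about 5.78
    print("P(misses the MENSA cutoff):", norm.cdf(tst_needed / sd_TST))       # about 0.72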

The upshot? This model predicts that most members of MENSA are not in the top 2% on intelligence, and most people in the top 2% of intelligence would not qualify for membership in MENSA.




Now let's apply this to a question of more personal interest. According to this model, how intelligent am I likely to be? I bring this up because in my last blog post I used my GRE scores to discuss my likely IQ, and therefore my likely intelligence, then used that as a benchmark to compare with the people I've met at Google. So if I believe that high IQs systematically overestimate INT, how badly is mine likely to be overestimated according to my model?

Well, my scores were V: 760, Q: 780 and A: 800. Going to a handy IQ calculator I put in 1540 for my GRE V+Q and find that I am 3.625 standard deviations out and only 1/7143 people have an IQ that high. So on my toy model that is an IQ of 151.265... Which implies that AVG is 25.633. So on average you'd expect my INT to be 25.633, which is 2.5633 standard deviations out, which puts me around the 99.5th percentile, or about 1/200 on INT. (Incidentally the average person with an INT that high would fall around the 96th percentile on IQ, and therefore would not qualify for MENSA.) But let's go farther. 50% of the time a normal distribution falls within 0.67449 standard deviations of the mean, which means that DIF is half the time within the range -4.769 to 4.769. That makes a 50% confidence interval for my INT the range (20.864, 30.402). Which would put me anywhere from the 98th percentile to the 99.9th percentile in INT. Which is in line with my claim in the previous blog post that I wouldn't be entirely surprised to find that I am at the 99.9th percentile in intelligence. According to this model the odds are that I'm not that smart, but it is possible.
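For completeness, here is that calculation run through the same machinery (the 3.625 standard deviation figure comes from the IQ calculator mentioned above; everything else is just the toy model, sketched with scipy):

    from scipy.stats import norm

    sd_IQ, sd_INT, sd_DIF = 10 * 2**0.5, 10.0, 10 / 2**0.5

    iq = 100 + 3.625 * sd_IQ    # about 151.27 on the toy model's scale
    avg = (iq - 100) / 2        # expected INT, about 25.63
    print("expected INT percentile:", norm.cdf(avg / sd_INT))   # about 0.995, roughly 1 in 200

    half_width = norm.ppf(0.75) * sd_DIF   # 50% confidence half-width, about 4.77
    lo, hi = avg - half_width, avg + half_width
    print("50% CI percentiles:", norm.cdf(lo / sd_INT), norm.cdf(hi / sd_INT))  # ~0.98 to ~0.999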

Interestingly there is a symmetry in this model which means that all calculations that I've done for INT apply for TST as well. Which means that I am likely to have a number of characteristics that make me good at taking tests but which are not reflected in my general intelligence. This is actually true.

The vast majority of people, when they are faced with a major test or interview, get stressed. Now what does stress do to you? Well, stress is caused by adrenaline. Adrenaline prepares you for a fight-or-flight reaction. This means decreased blood flow to things like your stomach and neocortex (which is where most of your higher reasoning happens), and increased blood flow to areas that are likely to see action, like your muscles. The result? Improved vision, hearing, faster reflexes, and increased strength. Also significant damage to your ability to carry out complex cognitive tasks. Needless to say, this is about the worst possible physiological response to sitting in place for several hours and completing a series of complex cognitive tasks.

Apparently evolution didn't anticipate that millions of us would have our life courses depend on how we do on multiple choice exams. :-)

Very oddly, many members of my family, including me, have the exact opposite response. When I walk into a test the thought process I have is basically, "Well this is it. I've prepared all I'm going to, and now I've got a multiple-guess test which will overstate my abilities. Let's see how I do." And I relax. Comparing me to a normal person based on the resulting test score is like starting with two runners, taking one out back and beating on him for a while, then expecting them to run a fair race. No matter how objectively you measure the result, I have an unfair advantage.




Random note. I originally intended to just present the toy model and use it to provide mathematical support for my impression that a really high IQ score is decent evidence of intelligence, but that the true intelligence of most intelligent people isn't actually reflected in their IQ. But to sanity-test the model I needed to find the estimated correlation between a standard IQ test and g, the general intelligence factor. And reading up on that, I realized how much more complicated the subject was than I had thought. I also found myself convinced by "g, a Statistical Myth" that g is probably a statistical artifact with no real meaning. Which is how this turned into 2 long and only tangentially related blog posts mushed into 1. I apologize for the length. I didn't intend it to turn out this way, but I hope it was an interesting read.