What I learned by discovering statistics using R

I would summarize many of my driving interests under the heading of “scientific epistemology”. However, for a long time I had an egregious blind spot: statistics. Although I read my way through Sokal and Rohlf’s classic text “Biometry” six years ago, it left me with something less than a working understanding of statistics as a research scientist would use it. Whether this was my fault or the text’s, or simply a matter of incompatibility, is hard to say.

To ameliorate the situation, I spent much of my spare time last April plowing through each and every chapter of “Discovering Statistics Using R” by Andy Field (and co-authors). On the whole, it was an immensely enjoyable experience. Here are a few of my meta-insights.

  1. You can grasp the statistical concepts without becoming a mathematician. I sometimes have difficulty assimilating knowledge if I fail to understand its foundations — e.g., to learn how a drug is used without understanding its molecular mode of action. This difficulty persisted even after I had identified it as a hindrance. (This is part of why I wandered from medicine into the history and philosophy of science, where an obsession with foundations is generally a natural advantage.) Analogously, I was worried that I might get stuck with my statistics text as soon as I encountered some mathematical theorem that I had to accept but couldn’t understand with reasonable effort and within reasonable time. Happily, I found it easy to deal with mathematical black boxes in statistics. I think two things helped. First, DSUR introduces the black boxes efficiently and often labels them explicitly, which makes it easier to accept them. Second, many statistical black boxes can be grasped intuitively. For instance, there is the “variance sum law”, which states that the variance of the differences or sums of two independent variables is equal to the sum of the variances of the two variables (this matters, for example, if you are testing whether the means of two populations differ in a t-test). I don’t know how you prove this (although it is not difficult to imagine the outlines of a proof), but I nevertheless find it highly plausible that the variance sum law holds. (A quick simulation also supports it; see the sketch after this list.) Other questions are more difficult — e.g., why do correlation coefficients range from -1 to 1? Mathematician friends tell me that the answer to this is nontrivial. Nevertheless, I did not have any difficulty accepting it, and so my education in statistics could proceed. I found that there were many similar instances of very tolerable black boxes.
  2. Statistics should be seen in relation to concrete study designs. When I read “Biometry”, I think I lost the forest for the trees: I learned about the theory of statistics but failed to see how it applied to concrete research situations. One of the strengths of DSUR is that it is pretty clear about how each statistical method relates to familiar types of study designs.
  3. The importance of the computer is hard to exaggerate. “Biometry” was originally written in the 1970s, and its primary tool was the pencil: It taught me how to do statistics by hand, if necessary. I get that this can be useful for teaching concepts. But in practice (and in 2014) I found it vastly more enjoyable to study statistics in close contact with R, where I learned how to actually work on more or less realistic data sets. I like to joke that I love computers and will take any excuse to spend more time with them. More seriously, I think that doing statistics is pretty similar to programming: Understanding the concepts is one thing, but you also need to learn which functions take which arguments, where to put the semicolons, and what the error messages mean. There is a craft to statistics, and I think that familiarity with the craft makes it easier to assimilate the theory.
  4. Emotions matter. It is well known that learning without positive emotion is difficult for us humans. It was important, therefore, that DSUR helped me to get excited about statistical methods. I get that you should have a good conceptual grasp of the assumptions that a data set must meet if you want to do an ANOVA. But studying those assumptions before you have ever done an ANOVA and thus before you have discovered the potential power of the method is, frankly, boring. DSUR helps you to see — more importantly, to feel — that statistical methods are really cool and powerful, and this helps you through more tedious things like checking whether your data are homoscedastic.
  5. Philosophy of science is useful. During the first half of the book, in the throes of young romance, I felt that statistics is the key to understanding scientific epistemology and in some sense removes the need for a philosophy of science. But I quickly recuperated: I now think with renewed conviction that the philosophy of science is tremendously important. It should be taught alongside statistics to students. One cannot make sense of scientific methodology by understanding only statistics but none of the concepts that traditionally live in the philosophy of science. Statistics texts hardly touch many of these questions: What is a cause? What is the logic of causal inference, and what are its prerequisites? (Which is the basis for asking: And how does statistics help in inferring causes?) What is the epistemological role of scientific models? What are mechanisms, and what does it take to ascertain them? How do causal processes at different levels of organization relate to each other? What is an explanation, and what role does explanatory power play in the confirmation of scientific hypotheses? Many of these questions do not (currently) have definitive answers. But I do think (based on experience) that most working scientists have strong intuitions about them that help them in their epistemological work — and if nothing else, the philosophy of science can prime these intuitions and help to produce better scientists. To my surprise, then, an immersion in statistics has helped me to better appreciate one of my parent disciplines (which are, in this order: biomedicine, history of science, and philosophy of science).
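
As an aside on the variance sum law from point 1: a quick simulation makes it easy to convince yourself that it holds. The following sketch is my own (plain Python; the chosen variances of 4 and 9 are arbitrary). It simply compares the variance of the sums and of the differences of two independent samples with the sum of the individual variances:

    import random
    import statistics

    N = 100_000
    x = [random.gauss(0, 2) for _ in range(N)]   # variance of X is about 4
    y = [random.gauss(5, 3) for _ in range(N)]   # variance of Y is about 9

    print(statistics.variance(x) + statistics.variance(y))      # ~ 13
    print(statistics.variance([a + b for a, b in zip(x, y)]))   # ~ 13
    print(statistics.variance([a - b for a, b in zip(x, y)]))   # ~ 13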

Grand Theories

David Hull:

Although grand theories about the nature of science are currently out of fashion, I think we need to rehabilitate them. We need to construct theories about science the way that scientists construct theories about fluids, gene flow and continental drift. To construct such theories, we need data, and our only source of data is the study of science, past and present.

Exactly.

(Original at JSTOR.)

The power of natural selection

Last week I wrote a version of Richard Dawkins’s “methinks it is a weasel” program (as explained in The Blind Watchmaker). The point of the program is to demonstrate the power of cumulative selection in comparison to pure chance. Consider a string such as “in the beginning god created the heavens and the earth”. In a purely random process, the probability of this string occurring is minuscule: with 27 letters in the alphabet (don’t forget the space!) and 54 letters in the string, the number of possible strings is 27⁵⁴, or 1.97 × 10⁷⁷. Your chances of hitting on this string by producing random strings are, for all practical purposes, zero.
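
(The arithmetic is easy to check, for instance in Python:)

    n_characters, string_length = 27, 54
    print(f"{n_characters ** string_length:.2e}")   # roughly 1.97e+77, as above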

But the situation changes once we introduce selection and cumulation. The program begins by creating a population of random strings, each 54 letters in length. None of these will be very close to the target string “in the beginning god created the heavens and the earth”. Nevertheless, some strings will match the target string in a few positions. The program evaluates each one to determine the best match. For example, the following best candidate in generation 1 (from an actual run of the program) shares 7 letters with the target string:

 gen 1: tashiwwsmsianhdfyf yvrrjutym bjjoig byxfpkwpkkhzfj g h
target: in the beginning god created the heavens and the earth

The program then takes this one best match and mutates it to create a new population of candidate strings. For example, each letter (in each string in the population) might be replaced with a randomly chosen letter from the alphabet with a probability of 0.09 (resulting, in this case, in around 4.8 replaced letters per string on average). This new population of strings is then again evaluated, and the best match to the target string is again retained and mutated. The mutations can be either neutral (if a non-matching letter is replaced with a non-matching letter, or if a matching letter is replaced with itself), detrimental (if a matching letter is replaced with a non-matching one) or beneficial (if a non-matching letter is replaced with a matching one).

Many generations will yield no or only small improvements — for example, when a new population of 100 strings was created based on the first-generation string above, the best candidate in the second generation had gained only one matching letter, in addition to a number of neutral mutations:

 gen 2: tashiwwsmsitnhdfyfvyvrrjutym bjjoig byxfhkwpkkhzfj gth
 gen 1: tashiwwsmsianhdfyf yvrrjutym bjjoig byxfpkwpkkhzfj g h
target: in the beginning god created the heavens and the earth

This cumulative process of mutation and selection continues until each letter in the string matches the target string, at which point the program stops. Needless to say, in the beginning most mutations will be neutral. As the string approaches the target string, more and more mutations will be detrimental, which is why the program can take quite a number of generations towards the end to optimize the last few letters.
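
For the curious, here is a minimal Python sketch of the procedure just described. It is my own reconstruction for illustration (not the linked Weasel.py and not my Objective-C implementation), and the function and parameter names are my own choices:

    import random
    import string

    ALPHABET = string.ascii_lowercase + " "   # 27 characters, space included
    TARGET = "in the beginning god created the heavens and the earth"

    def score(candidate):
        """Number of positions at which the candidate matches the target."""
        return sum(c == t for c, t in zip(candidate, TARGET))

    def mutate(parent, mutation_rate):
        """Replace each letter with a random letter with probability mutation_rate."""
        return "".join(
            random.choice(ALPHABET) if random.random() < mutation_rate else letter
            for letter in parent
        )

    def weasel(mutation_rate=0.09, population_size=1000,
               max_generations=10_000, verbose=True):
        """Cumulative selection towards TARGET; returns (generations, best string)."""
        best = "".join(random.choice(ALPHABET) for _ in TARGET)   # random start
        generation = 0
        while best != TARGET and generation < max_generations:
            generation += 1
            # Mutate the current best string to produce a new population.
            # Every position may mutate (matching letters are not locked in),
            # so the best of the new population can be worse than its parent.
            population = [mutate(best, mutation_rate) for _ in range(population_size)]
            best = max(population, key=score)
            if verbose:
                print(f"gen {generation:4d} ({score(best):2d}/54): {best}")
        return generation, best

    if __name__ == "__main__":
        weasel()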

I would offer my own version to the internet, but it seems redundant since a nice Python version of it is already available, and it is easier to play around with the source code of the Python program than with the code of my Objective-C implementation. (On a Mac, you can run the Python version by opening a Terminal, cd-ing to the directory containing Weasel.py, and running “python Weasel.py”.)

The main message of the program is that cumulative selection is very different from random generation. In a typical run of the program, it takes around 112 generations to select the target string. If the proverbial monkeys tapping away randomly at typewriters produced one string per second, it would take them 6 × 10⁶⁹ years to explore all the possible strings of 54 letters. The equivalent selection process — also producing one string per second — would be completed after only 31 hours. Such is the power of random variation coupled with cumulation!
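
(Again, the numbers are easy to verify; the 31-hour figure works out if each generation produces 1,000 strings, which is the population size of the successful runs discussed further below:)

    SECONDS_PER_YEAR = 60 * 60 * 24 * 365
    print(27 ** 54 / SECONDS_PER_YEAR)   # roughly 6e69 years for the random monkeys
    print(112 * 1000 / 3600)             # roughly 31 hours for cumulative selection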

The program has been criticized for exaggerating the power of selection. The critics argue that the program retains correct letters permanently and does not allow them to mutate any more, which is obviously not how mutations work in nature. However, the criticism backfires, since the selection process of the program works fine even if all letters are allowed to mutate in each generation. (Note: That all letters are allowed to mutate does not mean that all letters will mutate; this depends on the mutation rate, discussed below.) Both my implementation and the Python version allow all letters to mutate: it is entirely possible for the number of differences to the target string to increase from one generation to the next, and this often happens. Nevertheless, over time the deleterious mutations are removed.

When I talked about this program in my lectures, some students were concerned about a kind of cheating. They felt that the program “already knew” the target string, so that it did not mirror evolution in nature, where the outcome is unknown. Maybe there is a version of this objection that I have not considered, but generally speaking I think it misses the point. The evolved string is found by the program through a process of blind variation and selection (just as in nature); the target string is only used to determine the “fitness” of a particular letter in a particular location. This reflects actual selection processes: biological variations will also have fitness values relative to the environment in which they occur.

It is instructive to consider how variations in the parameters “mutation rate” and “population size” affect the selection process. The mutation rate, in this program, is a number between 0 and 1. It determines the probability with which an individual letter in a string is replaced during the production of a new population of strings. Trivially, a mutation rate of 0 means no mutations and thus no change in the strings over time. A mutation rate of 1 means that each letter is mutated in each generation, which amounts to the absence of cumulative selection: gains in one generation are not retained in the next. For cumulative selection to work, mutation rates have to be relatively low (try it). In my experiments, mutation rates much above 0.1 generally lead to a selection process that oscillates around a certain number of differences and does not terminate. The reason for this is clear: high mutation rates interfere with the retention of matching letters.

More surprising perhaps are variations of the population size. This variable determines how many strings are produced (and mutated) in each generation. Even though I should have known better, I expected this not to matter too much: I was thinking about the total number of variations produced, and so surely it should be immaterial whether I’m producing 300 generations x 100 strings or 3000 generations x 10 strings — the total number of strings is the same. But this is where so-called “genetic drift” becomes an issue! Consider that each generation begins with a “best candidate” string and produces a population of mutated variants of it. In a reasonably large population, there will be many neutral or detrimental variants and a few improved ones; the improved ones are then selected as the template for the next generation. However, the smaller the population size, the more probable it becomes that none of the few produced variants are improvements. It is easy to see this if you assume a population size of 1: most one-off variants of a string will not be improvements, especially in the latter parts of the selection process (when most letters already match the target). Thus, small population sizes make it possible for the string to start to “drift” randomly, simply because each generation only realizes a small sample of possible variations, most of which are neutral or detrimental.

However, the effects of population size and mutation rate interact. For instance, a mutation rate of 0.09 and a population size of 1000 will allow “in the beginning god created the heavens and the earth” to evolve. If you change the population size to 100, then the process will not terminate: it will oscillate between 10 and 15 differences or so. If you now reduce the mutation rate a bit, say to 0.05, then the process will again terminate. I leave it as an exercise for the reader to figure out the explanation of the phenomenon!
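
(If you want to explore the interaction yourself, and assuming the weasel() function from the sketch above is in scope, a small parameter sweep will do; the generation cap simply stops runs that would otherwise oscillate indefinitely:)

    # Per the observations above, only the 0.09/100 combination should fail
    # to reach the target within the generation cap.
    for rate in (0.05, 0.09):
        for pop_size in (100, 1000):
            generations, best = weasel(mutation_rate=rate, population_size=pop_size,
                                       max_generations=5000, verbose=False)
            status = "reached the target" if best == TARGET else "did not terminate"
            print(f"rate {rate}, population {pop_size}: {status} after {generations} generations")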

Data never tell a story on their own

As a follow-up on last week’s post, here’s Paul Krugman on Nate Silver’s new FiveThirtyEight:

I’d argue that many of the critics are getting the problem wrong. It’s not the reliance on data; numbers can be good, and can even be revelatory. But data never tell a story on their own. They need to be viewed through the lens of some kind of model, and it’s very important to do your best to get a good model. And that usually means turning to experts in whatever field you’re addressing.

A tentative suggestion: It seems to me that when Krugman says “model”, philosophers of science might prefer to say “mechanism”. I don’t think Krugman wants you to use a particular way of representing reality (a model); he wants you to analyze data with reference to the actual entities and interactions in the system under study (its mechanism).

Silver’s Mining Playbook (data mining, that is)

Having left the New York Times last year, Nate Silver has now relaunched FiveThirtyEight. I love Silver’s work, and I think his contribution to the American political discourse is invaluable. His push for “data journalism” is timely and necessary. But then, I would advocate a data-driven approach to most areas of life.

When it comes to the philosophy of science, however, Silver could and should be more sophisticated. This bothered me about his intriguing but flawed book “The Signal and the Noise” (look out for a brief review on this blog in the future), and it became apparent again in this piece introducing the new FiveThirtyEight:

Suppose you did have a credible explanation of why the 2012 election, or the 2014 Super Bowl, or the War of 1812, unfolded as it did. How much does this tell you about how elections or football games or wars play out in general, under circumstances that are similar in some ways but different in other ways?

These are hard questions. No matter how well you understand a discrete event, it can be difficult to tell how much of it was unique to the circumstances, and how many of its lessons are generalizable into principles. But data journalism at least has some coherent methods of generalization. They are borrowed from the scientific method. Generalization is a fundamental concern of science, and it’s achieved by verifying hypotheses through predictions or repeated experiments.

The first of these hyperlinks is to the Stanford Encyclopedia of Philosophy’s entry on scientific progress, and the second is to the entry on Karl Popper. Both are problematic — let me explain.

Popper is famous for advocating the view that there is no such thing as verification. Contrary to what generations of scientists and philosophers of science thought, Popper argued, there is in fact no way to show that some generalization such as “all ravens are black” is true, or even likely to be true. On Popper’s account, all you can do is refute false generalizations. A hundred black ravens do not show that all ravens are black. Nor do a thousand, or a million. However, a single white raven shows conclusively that “all ravens are black” is false. It is debatable whether this works as a general method of science: The philosophical consequences of such a view are severe, and it is pretty clear that scientists do not actually think like this. But one thing is certain: “verifying hypotheses through predictions or repeated experiments” is not a good characterization of Popper’s position.

Modern opinion on the role of generalizations in science is more divided. Certainly they play a role. But it is doubtful that science’s main goal is to search for generalizations, since these are often not too interesting. Take the above example of “all ravens are black”. This is known not to be true, but suppose it were: Would it be very interesting? We would still not know whether it just happens to be true, or whether there is something lawful about it. I would argue that the goal of science is different: it is to understand causal mechanisms. To return to the ravens, we want to know in detail how the causal mechanisms of raven pigmentation work. This will give us an understanding of why most ravens are black, and also of how the mechanisms of pigmentation can change to produce differently colored ravens. Knowledge of causal mechanisms is far more insightful and useful than knowledge about generalizations. But the transition from generalizations to causal mechanisms is one of the great challenges for data mining approaches such as Silver’s.

(Thanks to my friend Fabio Molo for drawing my attention to Silver’s piece, and to Tim Räz for paronomastic help with the title.)