Silver’s Mining Playbook (data mining, that is)

Having left the New York Times last year, Nate Silver has now relaunched FiveThirtyEight. I love Silver’s work, and I think his contribution to American political discourse is invaluable. His push for “data journalism” is timely and necessary. But then, I would advocate a data-driven approach to most areas of life.

When it comes to the philosophy of science, however, Silver could and should be more sophisticated. This bothered me about his intriguing but flawed book “The Signal and the Noise” (look out for a brief review on this blog in the future), and it became apparent again in this piece introducing the new FiveThirtyEight:

Suppose you did have a credible explanation of why the 2012 election, or the 2014 Super Bowl, or the War of 1812, unfolded as it did. How much does this tell you about how elections or football games or wars play out in general, under circumstances that are similar in some ways but different in other ways?

These are hard questions. No matter how well you understand a discrete event, it can be difficult to tell how much of it was unique to the circumstances, and how many of its lessons are generalizable into principles. But data journalism at least has some coherent methods of generalization. They are borrowed from the scientific method. Generalization is a fundamental concern of science, and it’s achieved by verifying hypotheses through predictions or repeated experiments.

The first of these hyperlinks is to the Stanford Encyclopedia of Philosophy’s entry on scientific progress, and the second is to the entry on Karl Popper. Both are problematic – let me explain.

Popper is famous for advocating the view that there is no such thing as verification. Contrary to what generations of scientists and philosophers of science thought, Popper argued, there is in fact no way to show that some generalization such as “all ravens are black” is true, or even likely to be true. On Popper’s account, all you can do is refute false generalizations. A hundred black ravens do not show that all ravens are black. Nor do a thousand, or a million. However, a single white raven shows conclusively that “all ravens are black” is false. It is debatable whether this works as a general method of science: the philosophical consequences of such a view are severe, and it is pretty clear that scientists do not actually think like this. But one thing is certain: “verifying hypotheses through predictions or repeated experiments” is not a good characterization of Popper’s position.
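To make Popper’s asymmetry concrete, here is a toy sketch (the function and names are my own illustration, not anything from Silver or Popper): no number of confirming observations establishes a universal generalization, while a single counterexample refutes it.

```python
# Toy illustration of Popper's asymmetry between verification and refutation.
# A universal generalization like "all ravens are black" can never be proven
# by observations, but one counterexample refutes it conclusively.

def refuted(hypothesis, observations):
    """A universal generalization is refuted by any single counterexample."""
    return any(not hypothesis(obs) for obs in observations)

def all_ravens_are_black(raven):
    return raven == "black"

# A million black ravens: the hypothesis survives, but is not thereby proven.
print(refuted(all_ravens_are_black, ["black"] * 1_000_000))            # False

# One white raven settles the matter: the generalization is false.
print(refuted(all_ravens_are_black, ["black"] * 1_000_000 + ["white"]))  # True
```

The point of the asymmetry is that the two outcomes are not symmetric in what they license: `False` here means only “not yet refuted,” never “verified.”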

Modern opinion on the role of generalizations in science is more divided. Certainly they play a role. But it is doubtful that science’s main goal is to search for generalizations, since these are often not very interesting in themselves. Take the above example of “all ravens are black”. This is known not to be true, but suppose it were: would it be very interesting? We would still not know whether it just happens to be true, or whether there is something lawful about it. I would argue that the goal of science is different: it is to understand causal mechanisms. To return to the ravens, we want to know in detail how the causal mechanisms of raven pigmentation work. This will give us an understanding of why most ravens are black, and also of how the mechanisms of pigmentation can change to produce differently colored ravens. Knowledge of causal mechanisms is far more insightful and useful than knowledge of generalizations. But the transition from generalizations to causal mechanisms is one of the great challenges for data mining approaches such as Silver’s.

(Thanks to my friend Fabio Molo for drawing my attention to Silver’s piece, and to Tim Räz for paronomastic help with the title.)