The curse of statistics

“But what gives grave concern to public health leaders is that the increase in lung cancer mortality shows a suspicious parallel to the enormous increase in cigarette consumption (now 2500 cigarettes per year for every human being in the United States).

The latest study, which is published in The Journal of the American Medical Association (May 27, 1952), by a group of noted cancer workers headed by Dr. Alton Ochsner…discloses that during the period 1920 to 1948, deaths from bronchiogenic carcinoma in the United States increased more than ten times, from 1.1 to 11.3 per 100,000 of the population. From 1938 to 1948, lung cancer deaths increased 144 percent.”

How do we know when one thing causes another? To take the example above, how do we know whether cigarette consumption causes lung cancer? The scientific consensus — particularly now, as opposed to 1952, when this Reader’s Digest article was published — suggests overwhelmingly that cancer is indeed caused by cigarettes. But suppose we were trying to reason for ourselves, and not simply take at face value what the experts tell us.

The evidence cited by the excerpt above is largely circumstantial. There is a “suspicious parallel” between two different curves: the trend of cigarette consumption over time, and the trend of incidence of lung cancer over time. That these two trends are causally linked is obvious to us in hindsight, but by no means was it obvious to people of that era. Indeed, the fact that two such “time series” display similar behavior does not, in general, imply that they have a causal dependence. “Correlation is not causation”, as the hoary adage goes.
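The point is easy to demonstrate with a toy simulation. Below is a sketch (the series labels are purely illustrative, not real data): two series that each drift upward for entirely independent reasons will nonetheless show a very high correlation coefficient, simply because both trend over time.

```python
import random

random.seed(0)

# Two "time series" that each trend upward for unrelated reasons.
# Neither causes the other, yet their correlation is very high.
# The labels are illustrative only, not real data.
years = range(30)
cigarettes = [t + random.gauss(0, 2) for t in years]
cancer_deaths = [t + random.gauss(0, 2) for t in years]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

print(f"correlation: {pearson(cigarettes, cancer_deaths):.2f}")
```

Any two series dominated by a shared trend will correlate this way, which is exactly why a "suspicious parallel" between two curves is not, by itself, evidence of causation.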

Why is this the case?

Let’s take a “classical example” (this paragraph is adapted from the excellent book by Judea Pearl, Causal Inference in Statistics: A Primer). Suppose we are trying to determine whether a particular drug helps people recover from a disease. In other words, is taking the drug causally linked to recovery from the disease? We find that, out of the 350 people who took the drug, 273 (78%) recovered, but of the 350 people who did not take the drug, 289 (83%) recovered. In short, taking the drug is correlated with lower recovery rates. So the drug is not helpful, and might even be harmful.

This type of evidence is, superficially, even stronger than that presented by Reader’s Digest above. In the cigarette example, we are making an observation at the level of the entire population. The entire population has experienced an increase in cancer deaths. And the entire population has been smoking more cigarettes. In the drug example, the observations are segregated by the “causal factor”: whether the individual took the drug. (The analogue in the cigarette example would be if we published lung cancer rates for smokers and non-smokers and showed that the former was significantly larger than the latter. This is, of course, also true.)

Continuing with the drug example, we also find that, if we further segregate our observations by gender, the correlation seems to flip. For men, 81 out of 87 (93%) recover with the drug and only 234 out of 270 (87%) recover without the drug. For women, the numbers are 73% and 69%, respectively. We are left with the paradox, known as Simpson’s paradox, that the drug seems to help both men and women but does not help the general population.
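The paradox can be verified with simple arithmetic. Here is a sketch using the counts above (the women's counts, 192 of 263 and 55 of 80, are reconstructed from the stated totals, since only percentages are given for women):

```python
# Recovery counts as (recovered, total) per group. The men's counts are
# quoted in the text; the women's counts are reconstructed from the
# stated totals (273 of 350 with the drug, 289 of 350 without).
data = {
    ("men", "drug"): (81, 87),
    ("men", "no drug"): (234, 270),
    ("women", "drug"): (192, 263),
    ("women", "no drug"): (55, 80),
}

def rate(recovered, total):
    return recovered / total

# Within each gender, the drug group recovers at a higher rate...
for gender in ("men", "women"):
    print(gender,
          f"{rate(*data[(gender, 'drug')]):.0%} with drug,",
          f"{rate(*data[(gender, 'no drug')]):.0%} without")

# ...yet pooled over genders, the drug group recovers at a lower rate.
pooled = {"drug": (0, 0), "no drug": (0, 0)}
for (gender, arm), (rec, tot) in data.items():
    r, t = pooled[arm]
    pooled[arm] = (r + rec, t + tot)
for arm, (r, t) in pooled.items():
    print(arm, f"{rate(r, t):.0%}")
```

The reversal happens because women (who recover less often regardless of treatment) make up most of the drug group, dragging down its pooled rate.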

So what is going on? Here’s Pearl’s explanation:

The answer is nowhere to be found in simple statistics. In order to decide whether the drug will harm or help a patient, we first have to understand the story behind the data — the causal mechanism that led to, or generated, the results we see. For instance, suppose we knew an additional fact: Estrogen has a negative effect on recovery, so women are less likely to recover than men, regardless of the drug. In addition, as we can see from the data, women are significantly more likely to take the drug than men are….being a woman is a common cause of both drug taking and failure to recover.

What this means, in this particular example, is that the gender-segregated analysis is correct, and the overall population-level analysis is incorrect. We need to segregate by gender (“condition on” gender, to use the statistician’s terminology) because it is a factor that mediates the causal relationship between taking the drug and its effectiveness. So, in fact, the drug is beneficial, and our initial conclusion was in error.
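In Pearl's framework, "conditioning on gender" amounts to the adjustment formula: estimate the effect of the drug by averaging the gender-specific recovery rates, weighted by how common each gender is in the whole population. A sketch using the same counts as the example (women's counts again reconstructed from the stated totals):

```python
# Recovery counts as (recovered, total); women's counts reconstructed
# from the stated totals, since the text gives only percentages.
data = {
    ("men", "drug"): (81, 87),
    ("men", "no drug"): (234, 270),
    ("women", "drug"): (192, 263),
    ("women", "no drug"): (55, 80),
}
genders = ("men", "women")
n = sum(tot for _, tot in data.values())  # 700 people in total

def p_gender(g):
    """Fraction of the whole population with gender g."""
    return sum(tot for (g2, _), (_, tot) in data.items() if g2 == g) / n

def adjusted(arm):
    """Adjustment formula: sum over g of P(recovery | arm, g) * P(g)."""
    total = 0.0
    for g in genders:
        rec, tot = data[(g, arm)]
        total += (rec / tot) * p_gender(g)
    return total

print(f"P(recovery | do(drug))    = {adjusted('drug'):.3f}")
print(f"P(recovery | do(no drug)) = {adjusted('no drug'):.3f}")
```

The adjusted recovery rate is higher with the drug than without it, confirming that once gender is properly accounted for, the drug is beneficial.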

Pearl goes on to construct a theoretical framework that allows us to take any “story behind the data” — any “causal story” — and figure out how to analyze the data to uncover the causal relationship, if it exists, in a mathematically rigorous way. His framework tells us which variables we ought to condition on, and which variables we ought not to. It’s one of the more beautiful things I’ve read in recent memory, and I rarely use that descriptor for technical books anymore.

That being said, it’s largely useless. And this is no criticism of Judea Pearl (who, again, I admire greatly) or other statisticians in the field of causal inference. Pearl and others have found that, if we can write down the causal story, and we have the data that includes the variables in that causal story, we can (usually) come to a conclusion about whether the relationship between two variables is causal. But those are really big ifs! In many (most?) cases of practical interest, we do not have a comprehensive causal story, nor do we have an exhaustive dataset. The result is that there are many things that we should probably believe, but also cannot be rigorously justified according to causal statistics. If we ignore Pearl’s theoretical framework, and simply take the data at face value, we risk running aground on Simpson’s paradox, as we did above. But it is also sensible to ignore that risk if we are trying to make a practical decision.

(As an aside, statisticians and social scientists typically overcome this problem by one of two methods. First is the randomized experiment. This is a very powerful tool but also very limited in scope. Take, for instance, a randomized experiment for the link between cigarette consumption and cancer. We would have to mandate that half the study participants start smoking, the other half don’t, and wait decades for members of both groups to either develop cancer or not. Obviously, this would not pass muster with any ethics board. The second method is to construct a simplified but plausible causal story. We take only the variables we believe to be the most important confounders. This is perhaps the only way to make progress in real-world problems, but it is clearly less than satisfactory, and opens up the work to charges of “errors of omission”.)

Let’s take an example close to my heart. I used to use non-stick Teflon pans. I don’t anymore. The reason is that there is strong circumstantial evidence that using such cookware is bad for you. I’ll be the first to admit, though, that I am unable to find strong evidence proving a causal link between Teflon and bad health outcomes. Here’s one decent summary from the New York Times.

Of particular concern is perfluorooctanoic acid, PFOA, also known as C-8, which is a crucial ingredient in the making of Teflon.

In 2004, DuPont agreed to pay more than $100 million to settle another class-action lawsuit brought by Ohio and West Virginia residents who contended that releases of PFOA from a plant in West Virginia contaminated supplies of drinking water.

DuPont argues on its Web site that the PFOA is burned off in the manufacturing process of the Teflon coating and not present in the finished cookware.

DuPont and the Cookware Manufacturers Association, a trade organization, maintain that Teflon is completely safe and actually helps people by lowering the amount of oil needed for cooking and reducing the risk of fire because there is less fat that might burst into flames.

The Environmental Protection Agency says on its Web site that “routine use” of nonstick cookware does not pose a concern and there is no reason for consumers to stop using it.

But the nonprofit Environmental Working Group says research has shown that Teflon cookware can harm birds and cause flu-like symptoms in humans by emitting toxic fumes when heated at high temperatures.

PFOA is found at a very low level in the blood of most Americans, according to the Environmental Protection Agency, and studies have shown it can cause health problems in laboratory animals.

But the agency does not know exactly how people are being exposed to the chemical, nor how dangerous it is in the human body.

Suppose we were trying to prove a causal link between Teflon and a negative health outcome. (PFOA has been linked to “kidney cancer, testicular cancer, ulcerative colitis, thyroid disease, hypercholesterolemia (high cholesterol), and pregnancy-induced hypertension.”) To properly account for Simpson’s paradox, we would have to construct a complete causal story. We would need to figure out all of the variables associated with those outcomes — things like gender, age, race, obesity, diet, exercise level, exposure to other chemicals, etc. — and attempt to condition on all of these, if appropriate. (Note further that some of these are interrelated, like diet and obesity, which makes the conditioning even more difficult.) Even if we were able to draw a causal diagram with the correct “colliders”, “confounders”, etc., we would also need a dataset that encompassed all of these variables. That is, we would need to know, for each person, whether they used a nonstick pan, and with what frequency; whether they were exposed to other chemicals, and, if so, by how much; what their “typical” diet and exercise regimen consisted of; and so on. The dataset would also need to be large enough (i.e., have enough participants) that our conclusion would be statistically significant even after accounting for all of these “covariates”.

This is essentially impossible. And it makes me sad, because there are so many causal claims that directly touch our lives that are similarly unanswerable. Is alcohol consumption bad for you? Does living near a fracking site lead to Ewing’s sarcoma? Does the minimum wage fail to cause job loss? Does indoor dining increase the spread of coronavirus? Does outdoor dining not? Is wearing a mask good for the wearer? Is it good for others? Is global warming caused by man-made carbon emissions? Does lead exposure contribute to crime rates?

I would answer “yes” to all of these questions, but I would also struggle to justify my conclusions. I am caught in a strange dual-mindedness. I think it is reasonable for the EPA to claim that Teflon cookware is not unsafe, yet I think it is also reasonable for any individual to claim that Teflon cookware is unsafe. I am sure there are scientists at the EPA who have much stricter standards at home — as far as the food, drink, chemicals, etc. their families use and consume — than what their employer officially recommends.

I am reminded of the slide shown at the top of this post, taken from Governor Cuomo’s coronavirus briefings, in which he showed the results of statewide contact tracing data, broken down by the source of infection. Bars and restaurants accounted for only 1.4% of cases. Gyms, hair salons, and barber shops were less than 1% combined. In contrast, “household spread” was the predominant cause, totalling 74% of all cases. The takeaway, apparently, was that these seemingly high-risk areas were lower risk than we thought, and the real danger lay at home. (Which raises the question — how were these contagious people at home getting infected in the first place?)

My view is that this slide is arrant nonsense. Even setting aside the problems with the data itself (how does one know where they contracted Covid? If I experience multiple points of possible exposure, how is that fact tabulated in the statistics?), we have once again confused observational data with a causal finding. The real question for an individual is: if I go to the gym, what is my risk of catching Covid? And the corresponding question for a government is: if it shuts down gyms, by how much would it reduce the spread of Covid? The data presented by Cuomo has almost nothing to do with those questions, despite appearing to. Indeed, almost no observational data can help us here, unless that data contends with the (enormously complicated) causal story of Covid. The curse of statistics is the disparity between what it promises theoretically and what it is practically capable of doing. Our hammers are weaker than our nails; our eyes are bigger than our stomachs.

