False confidence


I started a new job not too long ago, and one of my responsibilities is to help our leadership make “build or buy” decisions. To “build”, here, means to construct a piece of software ourselves; to “buy” means to purchase that software from a third-party company that, presumably, specializes in such a thing.

I recently talked to one company, which makes a product to automate advertising spending on Facebook and other ad platforms. (One of the relatively unnoticed consequences of the rise of Facebook and Google, who, along with Amazon, Microsoft, and Apple, make up an astonishing 20% of my Vanguard S&P index fund, is how they’ve spawned a clutch of smaller companies, latching onto their success like a remora attaches to the underside of a shark.) This product tries to predict the “lifetime value” (LTV) of a cohort of customers that are being acquired, and automatically reallocates your budget to spend on channels or ads that are drawing in the highest LTV cohorts. The executives gave us a tour of the tool itself, complete with slick-looking visualizations and a modern user interface. My eyes alighted on one plot, showing a forecast of how the value of a particular cohort accumulates over time. The CEO self-assuredly explained that the tool provides a “confidence” interval for the forecast in order to approximate its uncertainty, and further provides the true (“actual”) values of LTV that can be compared to the forecast, as the data trickles in. I noticed that the confidence interval spanned the range of $1.24-$1.52, but the actual data point was at $8. Visually it was quite striking, marooned on its own island. I asked the CEO why the confidence interval seemed too skinny, and he responded, “confidence intervals are always too skinny”, and then laughed and moved on. He was right, but probably not for the reason he intended.

(As an aside, if one assumes the forecast is normally distributed, the probability of seeing a data point as extreme as $8 (or more), given a 95% confidence interval of [$1.24, $1.52], is something like 10^(-1900), a number so vanishingly small that the human mind cannot conceive of it. For comparison, the number of atoms in the universe is only about 10^80. Even if one assumes a “fatter-tailed” distribution, which is usually the case in practice, the idea that the tool’s confidence interval was capturing the true uncertainty is highly implausible.)
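
For readers who want to check the order of magnitude, here is a rough sketch in Python. It assumes, as the aside does, that the interval is a symmetric 95% interval around a normally distributed forecast; the variable names are mine, and the tail formula is the usual large-z approximation rather than an exact calculation.

```python
import math

# Interval and observation quoted in the text.
lower, upper, observed = 1.24, 1.52, 8.00

# Assumption: a symmetric 95% interval of a normal forecast, so the
# half-width of the interval equals about 1.96 standard deviations.
Z_95 = 1.959964
mean = (lower + upper) / 2
sigma = (upper - lower) / 2 / Z_95
z = (observed - mean) / sigma  # how many standard deviations away $8 is

# math.erfc underflows to 0.0 this far out in the tail, so work in log space
# with the large-z approximation P(Z > z) ~ pdf(z) / z.
log_tail = -(z * z) / 2 - math.log(z * math.sqrt(2 * math.pi))
log10_tail = log_tail / math.log(10)

print(f"z-score: {z:.1f}")
print(f"log10 of the one-sided tail probability: {log10_tail:.1f}")
```

Under these assumptions the observation sits more than 90 standard deviations from the forecast, and the exponent comes out in the neighborhood of -1870, which is consistent with “something like 10^(-1900)”.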

The term “uncertainty”, used above, contains surprising subtleties. There are actually two types of uncertainty at play when trying to predict something, whether it is the LTV of a cohort of customers or the number of Americans who will die of COVID-19 in the next month. Predictions, of course, require models, and these models are almost always “parametric”, meaning that they have coefficients (“parameters”) that must be estimated by fitting the model to the data. Kevin Hassett, of Dow 36,000 fame, fit a cubic polynomial model to COVID-related deaths over time to conclude that the number of deaths in the U.S. would hit zero by May 15, or a week from now. (Jared Kushner was reportedly impressed by all of this; I was more impressed by the fact that Americans would start coming back to life after May 15, as the prediction reached negative territory.) More sophisticated approaches, such as those used by actual epidemiologists, often employ R0, a now-infamous parameter that captures the average number of people that an infected person will go on to infect. Regardless of whether there are four parameters, as in a cubic model, or dozens and dozens, as in more complex epidemiological models, each of these parameters has inherent uncertainty. We can infer R0 only approximately from the data, not exactly. One way to capture uncertainty in the prediction is to make use of the uncertainty in the parameters, and to propagate it through the model. In other words, we can make a series of separate forecasts of COVID-19 infections and deaths if R0 is 1, or 0.7, or 1.3, or 2.6. By combining the data that already exists with the model itself, we can infer the probability that R0 takes on any of these values.
The rest is a mathematical exercise, known as the Monte Carlo method: sample from the inferred probability distribution of the parameters (in the simplest case, just R0), run a forecast for each sample, and collate all of the results (ideally in a slick visualization with a modern interface).
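
The sample, forecast, collate loop can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the toy model (cases multiply by R0 each generation), the normal distribution standing in for the inferred distribution of R0, and the function names. It is not how any real epidemiological model works.

```python
import random
import statistics


def forecast_cases(r0, initial_cases=100, generations=5):
    """Hypothetical toy model: cases multiply by r0 every generation."""
    return initial_cases * r0 ** generations


def monte_carlo_forecast(r0_mean, r0_sd, n_samples=10_000, seed=0):
    """Propagate uncertainty in R0 through the toy model."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_samples):
        # Step 1: sample a parameter value (clipped so R0 is never negative).
        r0 = max(rng.gauss(r0_mean, r0_sd), 0.0)
        # Step 2: run one forecast for that sample.
        results.append(forecast_cases(r0))
    # Step 3: collate the forecasts into a central 95% interval and a median.
    results.sort()
    low = results[int(0.025 * n_samples)]
    high = results[int(0.975 * n_samples) - 1]
    return low, statistics.median(results), high


# Assumed inferred distribution for R0: normal, mean 1.3, standard deviation 0.2.
low, median, high = monte_carlo_forecast(r0_mean=1.3, r0_sd=0.2)
print(f"95% interval: {low:.0f} to {high:.0f} cases, median {median:.0f}")
```

Note how a modest spread in R0 turns into a very wide spread in forecast cases, because the parameter is raised to a power. The interval is still only as trustworthy as the toy model that produced it.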

But is what we have captured actually the true uncertainty? Perhaps surprisingly, the answer is no. We have accounted for the “aleatoric” uncertainty, but not the “epistemic” uncertainty. In other words, our uncertainty estimate is correct given that the model that we have assumed is true, but is certainly not correct otherwise. Kevin Hassett’s cubic model likely has very little aleatoric uncertainty, at least as far as the projection that deaths will reach zero by the middle of May. But reporting an uncertainty interval of [May 12, May 18] for zero deaths would seem ludicrous to most people, perhaps excepting Jared Kushner.

Even the epidemiological models suffer from this issue, in less noticeable ways. (This essay in NY Mag serves as an excellent primer.) The 95% uncertainty intervals produced by the most press-worthy model, the University of Washington’s IHME model, have been wrong 70% of the time, which is hard to believe if they represent the true bounds on uncertainty. I am not an epidemiologist, and I don’t mean to fault the work that they do. And I certainly don’t mean to lend credence to the complaints of charlatans like Richard Epstein, whose forecasts have been even worse. But I know enough about building models to say that even modeling the past is very difficult, let alone the future!

Let’s step back and consider all of the variables at play. There is no single R0 for the entire country: each micro-region within the U.S. has its own rate of spreading. This rate depends on the particular policies the local government has implemented, how well the populace has adhered to those policies, exogenous factors like the weather (which purportedly affects the rate of transmission), and stochastic factors like the presence of “super-spreaders”. The entire concept of an R0 is a mathematical abstraction that hides a wealth of statistical intricacy. If one person infects 50 others at a church, and 10 other people stay at home and infect no one, how does one assign an R0 to that circumstance? A model that started off as a simple curve-fitting exercise must now expand to encompass parameters related to epidemiology, psychology, sociology, geography, meteorology, and politics. It calls to mind Laplace’s boast that “An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect were also vast enough to submit these data to analysis, it would embrace in a single formula the movements of the greatest bodies of the universe and those of the tiniest atom; for such an intellect nothing would be uncertain and the future just like the past would be present before its eyes.” In our case, we might truly need all of the atoms in the universe to forecast the course of this disease.
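
To make the church example above concrete, here is a small sketch in Python using only the numbers from that example (one person infecting 50, ten people infecting no one). The variable names are mine, and treating the plain average as an “R0” is a simplification for illustration.

```python
import statistics

# Secondary infections for the 11 people in the example:
# one person infects 50 others, ten people infect nobody.
secondary_infections = [50] + [0] * 10

mean_infections = statistics.mean(secondary_infections)      # 50 / 11
median_infections = statistics.median(secondary_infections)
share_infecting_nobody = secondary_infections.count(0) / len(secondary_infections)

print(f"mean: {mean_infections:.2f}")
print(f"median: {median_infections}")
print(f"share who infected nobody: {share_infecting_nobody:.0%}")
```

The average works out to roughly 4.5 infections per person, which would suggest explosive spread, yet the median is zero and ten of the eleven people infected no one. A single number cannot describe both facts at once.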

I am reminded of a recent story published in the New Yorker about Ali Khan, a former senior official at the C.D.C. He was heavily involved in the Singaporean government’s response to SARS (which is also caused by a coronavirus). The disease had the ability to spread incredibly quickly given the right environment, as this anecdote illustrates:

SARS reached Toronto on February 23, 2003, carried by a seventy-eight-year-old woman, who, with her husband, had spent several nights of a two-week trip to Hong Kong on the ninth floor of the Metropole Hotel. The woman sickened, then died at home on March 5th, attended by family, including one of her sons, who soon showed symptoms himself. After a week of breathing difficulties, he went to an emergency room and there, without isolation, was given medication through a nebulizer, which turns liquid into mist, pushing it down a patient’s throat. “It helps open up your airways,” Khan told me—a useful and safe tool to prevent, say, an asthma attack. But, with a highly infectious virus, unwise. “When you breathe that back out, essentially you’re taking all the virus in your lungs and you’re breathing it back out into the air—in the E.R. where you’re being treated.” Two other patients in the E.R. were infected, one of whom soon went to a coronary-care unit with a heart attack. There he eventually infected eight nurses, one doctor, three other patients, two clerks, his own wife, and two technicians, among others. You could call him a super-spreader. One E.R. visit led to a hundred and twenty-eight cases among people associated with the hospital. Seventeen of them died.

To be perfectly honest, I don’t remember even being worried about SARS back in 2003. The disease, as the New Yorker piece mentions, “touched the U.S. only gently, producing twenty-seven probable cases and no deaths, most likely for reasons amounting to luck”. One important reason is that the primary cluster of the disease was in Singapore, where the local population was subjected to extreme measures to try to stamp it out, and where local health authorities were able to implement the suggested practices of intensive testing and tracing. If I were asked in 2003 to construct a forecast, encompassing both aleatoric and epistemic uncertainty, of the deaths that would result from a SARS flareup, wouldn’t my resulting uncertainty interval be enormous? If the universe that we inhabit was one in which the U.S. avoided a pandemic “for reasons amounting to luck”, then what about the other universes in which we were not so lucky?

I would contend that we are asking too much of our models. The goal is not to produce forecasts of an inherently unpredictable process. The goal is to help political actors figure out which interventions and policies are most likely to produce a return to normalcy. And, I would argue further, models are largely irrelevant to this calculus. When models fail, turn to experiments. And across the world we have, inadvertently, been conducting a series of “natural” experiments. We have countries like Brazil, whose leader treats COVID-19 like a joke. We have countries like Sweden, which has not imposed harsh lockdown orders but whose economy has suffered nonetheless. We have countries like Italy, which did not take the threat seriously at first but now appears to be making substantial progress towards reopening after imposing strict quarantines. We also have countries that never had a full-blown epidemic because they employed large-scale testing and tracing to limit the spread of the disease. And, finally, we have a country like ours, which has tried a combination of these strategies, half-heartedly and heterogeneously, and ended up with dismal results. Is there much more a model could tell us at this point? And if it told us that, on our current course, the number of deaths would be 200,000 instead of 100,000, would that affect our decision-making? Should it?

We are engulfed in tragedy, particularly in my city of New York. We obsess over models and forecasts as a way to recover a modicum of certainty in these uncertain times. We have turned to science because politics is so bleak. I fear, though, that this obsession — the armchair epidemiology on Twitter, the deluge of new forecasts, the concern devoted to every wiggle, the chart and map pornography — is increasingly just a distraction: from the fact that what we should be doing is clear, but, in typical American fashion, we are not doing it.

