Truth or consequences

The best gauge of a society is truth: its prevalence and how it’s treated.

–Robert Gore

Recently we were treated to yet another salvo in the cholesterol wars. A prospective study conducted by Danish researchers has shown that high levels of ‘good’ cholesterol are not really good for you, especially if you are male. As far as I can tell from the publication, the study followed all of the accepted practices. So why are the results in such marked contrast to accepted beliefs?

It’s not just ‘good’ cholesterol; there are countless examples from the medical literature: dietary fat, salt, carbohydrates, and on and on.

What’s going on here?

Lies, damn lies, and statistics; and/or you can prove anything with statistics

The issue is that the incentives for producing slipshod research combine with the relative ease of using mathematics to justify desired conclusions. The pursuit of a small p-value isn’t particularly rare; Ioannidis claims that more than half of all scientific studies are false, and Nature printed a study that found 61% of studies in psychology were not replicable.

Nor is it particularly recent; Wikipedia dates the ‘crisis’ to 2010. Tom Siegfried reported in Science News: “It’s science’s dirtiest secret: The ‘scientific method’ of testing hypotheses by statistical analysis stands on a flimsy foundation.”

In 2015 the American Statistical Association (ASA) crafted a statement on the use of p-values in hypothesis testing. They admonish researchers to be careful about what a p-value is actually saying, and they encourage researchers to publish the results of all tests done on the data. However, about a year later the Royal Statistical Society published a study of ASA publications that found no difference in the frequency of p-value use following the statement. I guess the ASA assumes its members know how to use them; but do they really?

Andrew Gelman notes that the issue isn’t just with the tests that were applied but is also a function of “what tests would have been done had the data been different.” I don’t think many researchers plan out what they will do when they don’t get the results they expect. Oh, they know what they would have done with a sufficiently small p-value: unthinkingly publish and move on; but that’s not enough. P-values are a function of the tests done and all of the tests not done. He calls this decision tree of research the “garden of forking paths”. The actual path through this garden is all-important to interpreting p-values. If they can’t be interpreted, they are useless.
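To see how much the forking paths matter, here is a little simulation sketch (my illustration, not Gelman’s; the three analysis “paths” and sample sizes are made up, and NumPy and SciPy are assumed to be installed). Both groups come from the same distribution, yet the analyst gets to pick whichever of three reasonable-looking analyses pans out and report only that one.

```python
# A toy simulation of the garden of forking paths (my illustration, not Gelman's).
# Both groups are drawn from the SAME distribution, so any "effect" is pure noise.
# NumPy and SciPy are assumed installed; all settings are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_per_group = 10_000, 30
false_positives = 0

for _ in range(n_sims):
    a = rng.normal(size=n_per_group)
    b = rng.normal(size=n_per_group)

    # Forking paths: after peeking at the data, the analyst picks whichever
    # analysis "seems appropriate" and reports only that one p-value.
    candidate_p = [
        stats.ttest_ind(a, b).pvalue,                                # path 1: plain t-test
        stats.mannwhitneyu(a, b).pvalue,                             # path 2: rank-based test
        stats.ttest_ind(a[np.abs(a) < 2], b[np.abs(b) < 2]).pvalue,  # path 3: drop "outliers" first
    ]
    if min(candidate_p) < 0.05:   # follow whichever path "works"
        false_positives += 1

print(f"Nominal false-positive rate: 0.05  Realized: {false_positives / n_sims:.3f}")
```

Run it and the realized false-positive rate lands well above the nominal 5%, even though any single test, taken on its own, was perfectly legitimate.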

How did we get here?

In the 19th century, physicists were deriving theories of randomness. By the early 20th century these ideas had been turned into two competing approaches to inference: significance testing and hypothesis testing. Over the course of a few decades the two methods were blended until they became indistinguishable, and that hybrid is what’s almost universally taught in statistics classes.

I can’t rule out either accidental or intentional misapplication of statistical procedures: the lies, and damn lies. What is glossed over is that even if the methods of hypothesis testing are applied exactly as they are taught, in an honorable manner, they only work within a very narrow range of application. Now, there is no doubt that this narrow range of application, experiments designed to falsify a specific hypothesis, is extremely important. However, the procedures have clearly been applied to situations beyond the bounds of proper application, such as prospective data examination through meta-analysis.

The primary goals of inferential statistics are to compare data sets, detect and quantify patterns, adjust beliefs, and make decisions. Each of these has its own set of processes because their goals, while complementary, are subtly different. The question “What does this data say about this new drug?” is not the same as “Will this drug help me with my symptoms?”. Unfortunately, in the scientific literature, and even more commonly in the popular media, these questions are treated as if they are the same.

Fisher was adamant that significance testing should only be applied to a single test and those results extended only to the population represented by the sample. Since it isn’t possible to tell how well the sample represents the population of interest, he suggested random selection. Then, if the sample is big enough and the distribution of the population values conforms to expected norms, you should be fine.

Neyman and Pearson allowed multiple tests constrained to a specific model of the underlying data, but also restricted inference to the population represented by the sample. In both cases, one decides what the data look like, builds a specific, informed guess of expectations, and then tests that guess against the data. If the value of the test statistic is larger than some threshold, one rejects the guess and starts over with a new guess and a new sample.
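Here is a sketch of that textbook recipe, with made-up numbers and SciPy assumed available: one guess, one threshold fixed in advance, one test, one decision.

```python
# A sketch of the single pre-specified test described above; the numbers are
# simulated and hypothetical, and SciPy is assumed to be installed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05                                   # rejection threshold fixed BEFORE seeing data

# The guess (null hypothesis): the treatment does nothing, both groups share one mean.
control   = rng.normal(loc=100.0, scale=15.0, size=50)
treatment = rng.normal(loc=92.0,  scale=15.0, size=50)   # simulated true effect of -8

result = stats.ttest_ind(treatment, control)
if result.pvalue < alpha:
    print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}: "
          "reject the guess -- for the population this sample represents.")
else:
    print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}: "
          "no evidence against the guess; rethink and collect a new sample.")
```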

This is kind of hard: you have to think about your problem in advance, and it’s wasteful of data. Researchers find it hard to dispose of data just because they changed their mind on a research issue. So, statisticians advocate adjustments for multiple comparisons: change your mind midstream, then adjust the threshold for significance. However, it’s rare that published papers actually report multiple comparisons that are not part of their designed study, even when they cite previous work on a data set. As Professor Gelman points out, even this is not enough; what would you have done if the data had been different?
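For what that adjustment looks like in practice, here is a minimal sketch using the Holm correction, one of several standard methods; the five raw p-values are invented, and statsmodels is assumed to be installed.

```python
# A sketch of a multiple-comparisons adjustment using the Holm correction;
# the five raw p-values are invented, and statsmodels is assumed installed.
from statsmodels.stats.multitest import multipletests

raw_p = [0.012, 0.049, 0.031, 0.20, 0.003]     # hypothetical p-values from five comparisons

# Holm (step-down Bonferroni) keeps the family-wise error rate at 5 percent.
reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")

for p, p_adj, r in zip(raw_p, adjusted_p, reject):
    print(f"raw p = {p:.3f}   adjusted p = {p_adj:.3f}   still significant: {r}")
```

Two of the five comparisons that looked significant on their raw p-values no longer clear the bar once the adjustment is applied; that is the price of changing your mind midstream.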

So, how do we deal with this?

The most popular path is to ignore all of this mathy stuff as much ado and seek a small p-value for your study; after all, this pays more, in both fame and fortune. One could also plan the research, formulate a falsifiable hypothesis, collect sufficient data to reasonably expect to detect the expected result, apply the tests, and then replicate the results on a new data set; little fame and no fortune here.

There are other statistical paradigms with the potential to guide research. The most promising is information theory. Developed by Claude Shannon in the 1940s to find signals in noisy telecommunications channels, it seeks to answer questions of belief. Information theory has been expanded to cover Bayesian analysis, another framework dealing with beliefs.

Bayesian testing starts from the premise that the researcher has an idea of what is going on. This is referred to as the prior; it’s a mathematical representation of the probability model currently held by the researchers. That model is updated with the information contained in the data. Some folks don’t like this because it’s hard to be honest about one’s prior beliefs; it’s much easier to say, “It’s not me, randomness did it.”

Bayesian statistics are geared to update one’s beliefs about the specific concept under test.  For instance, a researcher may have this idea that a particular diet will lower cholesterol in females aged 20 to 30.  The researcher summarizes their beliefs into a model that seems to work.  Then they select a sample, apply the treatment to a randomly selected portion of the sample, and see what happens.  At the end of the experiment the data is compared to the selected model and the researcher’s beliefs about the effectiveness of the new diet are updated.
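Here is what that update can look like in the simplest conjugate case: a normal prior on the mean cholesterol reduction with a known noise level. Every number below (the prior, the noise, the eight observations) is hypothetical and just stands in for the diet example above; only NumPy is assumed.

```python
# A sketch of the belief update for the diet example, using a conjugate
# normal-normal model; every number here is hypothetical.
import numpy as np

# Prior belief: the diet lowers cholesterol by about 5 mg/dL, but we're unsure (sd = 10).
prior_mean, prior_var = 5.0, 10.0 ** 2

# Trial results: observed reductions (mg/dL) in the treated group; noise sd assumed known.
reductions = np.array([12.0, 3.0, 8.0, -1.0, 15.0, 6.0, 9.0, 4.0])
sigma2 = 12.0 ** 2
n, xbar = len(reductions), reductions.mean()

# Conjugate update: precisions (1/variance) add, and the posterior mean is a
# precision-weighted blend of the prior mean and the sample mean.
post_var = 1.0 / (1.0 / prior_var + n / sigma2)
post_mean = post_var * (prior_mean / prior_var + n * xbar / sigma2)

print(f"Prior belief:     reduction = {prior_mean:.1f} mg/dL (sd {prior_var ** 0.5:.1f})")
print(f"Posterior belief: reduction = {post_mean:.1f} mg/dL (sd {post_var ** 0.5:.1f})")
```

The posterior mean sits between the prior guess and the sample average, pulled toward whichever carries more precision; that is the whole “update your beliefs” machinery in two lines of arithmetic.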

To be fair, Bayesian statistics are just as sensitive to issues of sample size, bias in selection, and multiple comparisons. Selecting the appropriate Bayesian methods and applying them properly certainly helps, but the inference still relies on the veracity of the sample. And all you get out is an updated belief; but that is closer to what the general public is looking for. What should I believe about your research topic given the data you collected? Remember the caveat: given the data collected; it makes all the difference in the world. How do you compare to the population that made up the sample? Oh, and how do you know that?

Recently we have seen the rise of the machine. It’s now possible to generate theory-free methods of pattern recognition; no pesky probability limits or prior beliefs, just pure associations. Plunk in massive data sets and read out the frequency of associations.
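In its rawest form, “reading out the frequency of associations” is just joint counting. Here is a toy sketch with a handful of invented records (pandas assumed installed); it is the same kind of table the machine version builds at scale.

```python
# A toy sketch of "pure association" mining: joint counts over whatever records
# you happen to have. The six records are invented; pandas is assumed installed.
import pandas as pd

records = pd.DataFrame({
    "diet":        ["low-fat", "low-carb", "low-fat", "low-carb", "low-fat", "low-carb"],
    "cholesterol": ["high",    "normal",   "high",    "high",     "normal",  "normal"],
})

# The "frequency of associations": how often each diet/cholesterol pair co-occurs
# in THIS sample -- nothing here says how it generalizes beyond it.
print(pd.crosstab(records["diet"], records["cholesterol"], normalize="all"))
```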

The downside of this is that the sample is treated as the population, and all inference applies only to the sample. The machine learning people want to get around this by increasing the sample size. Want an answer to a medical question? Then put every medical record ever collected into the pile and let the machine sort it out. In my day job I do this, not with medical data, but with aerial vegetation images.

What now?

In October 2017 the ASA is holding a Symposium on Statistical Inference. For just under $600 you too can participate (ASA members get a discount). Since they already have their statement on p-values, I’m not sure what they hope to accomplish; consensus? We shall see…

I agree with Andrew Gelman: “the solution is not to reform p-values or to replace them with some other statistical summary or threshold, but rather to move toward a greater acceptance of uncertainty and embracing of variation.” Since to be effective this has to be coupled with a greater respect for responsibility, I don’t see it happening any time soon.