Rain hammers against your office window. The room is lit only by the harsh glow of a computer monitor, darkness pervading the space as the nights draw in. Your face pales and your heart fills with dread as analyses flash across the screen, until it appears. Crawling forth from months’ worth of data collection, a sight so terrifying that no creature of the night could compare: p = 0.051.
You scream in terror! Then your office mate switches on the light and somewhat ruins the ambience. You go home for the night and try to remember the big picture, but insignificant results still sure seem scary. Perhaps you could say it’s “approaching significance”? And of course you can always reason that perhaps there weren’t enough participants. Sure, you did your power calculations, as you’re sure Frankenstein did when bringing his monster to life using the mains, but there’s always a chance you just couldn’t quite catch that true effect. That’s what you’ll focus on, because ultimately that insignificant result doesn’t prove the null. As any good statistician will tell you, a p value can never confirm the true absence of an effect.
P values are to statistics as jump scares are to horror movies. Whatever the data, whatever the test, they always pop up to terrify us with the prospect of null results. We spend so much time cowering under the spreadsheets that we often forget to consider what that p value really means. When we glance behind the curtain, we see that a p value tells us how likely we would be to see data at least as extreme as ours if no effect actually existed. Setting our threshold of significance at the magic 0.05 therefore means accepting a 5% chance of reporting a significant result when there is truly nothing there: a false positive born purely of random chance.
If I were to flip a fair coin one hundred times, it’s entirely possible, though astronomically unlikely, that it lands on heads every time. My conclusion might therefore be that the coin is biased, and this would be statistically justified, but nonetheless incorrect.
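To put a number on that nightmare, here is a minimal sketch in Python (the flip counts are made up purely for illustration): a standard binomial test will happily brand a perfectly fair coin as biased if luck runs hot enough.

```python
from scipy.stats import binomtest

# Hypothetical run of luck: a fair coin that happens to land heads every single time.
n_flips = 100
n_heads = 100

# Two-sided binomial test against the null hypothesis of a fair coin (p = 0.5).
result = binomtest(n_heads, n_flips, p=0.5, alternative="two-sided")
print(f"p = {result.pvalue:.3g}")  # vanishingly small, so we'd declare the coin "biased"
# The statistics are sound; the conclusion is wrong. The coin really was fair.
```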
Taking this to its conclusion: of all the studies out there chasing effects that don’t actually exist, roughly 1 in 20 that use p values and that magic significance threshold will report a false positive – now that’s horrifying!
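If that 1-in-20 figure sounds abstract, a quick simulation makes it concrete (the sample sizes and number of studies here are invented for the sake of the sketch): test two groups drawn from exactly the same population over and over, and roughly 5% of those tests come back “significant” anyway.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=2024)
n_studies = 10_000     # imaginary studies, each testing a true null effect
n_per_group = 30
alpha = 0.05

false_positives = 0
for _ in range(n_studies):
    group_a = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
    group_b = rng.normal(loc=0.0, scale=1.0, size=n_per_group)  # same population: no true effect
    if ttest_ind(group_a, group_b).pvalue < alpha:
        false_positives += 1

print(f"'Significant' results: {false_positives / n_studies:.1%}")  # hovers around 5%
```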
Fortunately, replications are there to help us weed out the false positives, but we might also ask: why don’t we decrease our threshold for significance? Well, sometimes we do. The more statistical tests we run, the more stringent our threshold for significance should be. We see this in techniques like the Bonferroni correction, and very commonly in neuroimaging. If you haven’t had the joy of reading Bennett et al.’s (2009) work employing a very deceased salmon and an fMRI machine to prove the importance of such corrections, I highly recommend you do so now. But if we keep reducing our threshold for significance, our likelihood of false negatives increases: we might miss true effects and get more of those dreaded null results. The magic 0.05 threshold is set as a balance between our chances of finding a false positive and a false negative… or so the story goes. In my PhD interview, that is the textbook answer I gave, and the textbook answer I’d been giving my students for years. But when my interviewers pressed, I caved and told them my real thoughts. Our significance threshold is 0.05 because we had to pick a number, and that’s what we went with.
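Here is a rough sketch of the Bonferroni idea, with invented p values rather than anything salmon-related: the more comparisons we make, the smaller each individual p value must be before we call it significant.

```python
import numpy as np

# Imaginary p values from 20 separate comparisons run on pure-noise data.
# When the null is true, p values are uniformly distributed between 0 and 1.
rng = np.random.default_rng(seed=7)
p_values = rng.uniform(0.0, 1.0, size=20)

alpha = 0.05
m = len(p_values)

uncorrected = np.sum(p_values < alpha)       # each test judged against 0.05
bonferroni = np.sum(p_values < alpha / m)    # Bonferroni: divide the threshold by the number of tests

print(f"Uncorrected 'hits': {uncorrected}, Bonferroni-corrected 'hits': {bonferroni}")
```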
So, significance as measured by p values currently has two flaws: 1) the thresholds of significance we’ve chosen seem to be at least somewhat arbitrary, and 2) null findings are scary. That latter point isn’t entirely down to statistics, but the statistics certainly aren’t helping. The fact is that insignificant results are almost impossible to interpret when using p values (and frequentist statistics more broadly). We cannot say that our null result proves there’s nothing to find, only that we didn’t find it. And we could have failed to find it because there’s nothing to find, or simply because we lacked the power to detect it. With p values alone, there’s no way to distinguish between the two.
Here we need a technique which would allow us to disambiguate a lack of power from a true null effect… if only there were a way! But what’s this? A grizzled old-timer comes over the horizon, carrying a lantern and wisdom from years gone by that will surely give us exactly the trick we need. And, as in every good scary movie, the trick happens to come in the form of probability distributions. Bayesian statistics are nothing new; in fact, Bayes’ theorem is hundreds of years old, and its application to statistics was around long before the frequentist approach we now use as convention. In its early days it was missing some key components, and we were unable to use Bayes in the ‘plug and play’ manner that we can now. So, in its latest form, what advantages could Bayes give us?
Firstly, Bayes (at least to me) is much more intuitive in the way it calculates our statistics. Whereas frequentist approaches calculate the probability of observing data like ours given our current model or hypothesis, Bayes flips this on its head to calculate the probability of the model given our data. Given that our data definitely exist, and it’s our model which is in question, this perhaps already seems a touch more sensible. Bayes also allows us to find evidence in favour of the null hypothesis. No more confusion between a lack of power and a true null effect. It’s true that Bayes factors are subject to somewhat arbitrary cut-offs of their own, but these now mark varying levels of evidence: evidence for the alternative, evidence for the null, and crucially a middle ground between the two where there isn’t sufficient evidence to draw a conclusion at all. Even better, incorporating Bayesian statistics into analyses isn’t too complicated anymore, with readily available packages in R and MATLAB, as well as friendly and free-to-use software like JASP. There’s also the lovely extension of techniques like Bayesian stopping rules, which I won’t go into here, but you can read more about them in the resources linked below.
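To make that concrete, here is a minimal sketch of a Bayes factor for the coin example from earlier. It uses a simple uniform prior on the coin’s bias purely for illustration (JASP and the R/MATLAB packages mentioned above use their own default priors), and the flip counts are again invented.

```python
from math import comb, exp, log
from scipy.special import betaln

def bf01_coin(n_heads, n_flips):
    """Bayes factor for H0 (fair coin, theta = 0.5) versus H1 (theta uniform on [0, 1])."""
    # Marginal likelihood of the data under H0: a plain binomial with theta fixed at 0.5.
    log_evidence_h0 = log(comb(n_flips, n_heads)) + n_flips * log(0.5)
    # Marginal likelihood under H1: integrate the binomial likelihood over the uniform prior.
    log_evidence_h1 = log(comb(n_flips, n_heads)) + betaln(n_heads + 1, n_flips - n_heads + 1)
    return exp(log_evidence_h0 - log_evidence_h1)

print(bf01_coin(52, 100))  # > 1: evidence *for* the fair-coin null, something a p value can't give us
print(bf01_coin(70, 100))  # well below 1: evidence that the coin really is biased
```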
As an age of open science shines a light on the dark and scary statistical practices of yore, perhaps it’s time to also leave the horror of p values in the past. Though most people will still include them out of convention, alternative techniques like Bayes now offer increased interpretability with very little additional effort. As someone currently running a study whose results will form the basis of my PhD, I certainly find it a little less scary to work with Bayes than to worry about that dreaded insignificant result. We know that null findings are important, so let’s give them the same statistical backing that we give our alternatives.
And so, as the dark night has passed and the monsters return to their slumber, let’s remember that null results aren’t so horrific after all, and that maybe the statistical tests we use should reflect just how important they are.
More about Bayes:
- Dienes (2011) Bayesian Versus Orthodox Statistics: Which Side Are You On?
- Schönbrodt et al. (2017) Sequential hypothesis testing with Bayes Factors: Efficiently testing mean differences
- Hackenberger (2019) Bayes or not Bayes, is this the question?
- Download JASP here: https://jasp-stats.org/download/

Rebecca Williams
Author
Rebecca Williams is a PhD student at the University of Cambridge. Though originally from ‘up North’ in a small town called Leigh, she did her undergraduate and master’s degrees at the University of Oxford before defecting to Cambridge for her doctorate researching frontotemporal dementia and apathy. She now spends her days collecting data from wonderful volunteers, and coding. Outside work, she plays board games and is very crafty.