Why Statistical Significance is Killing Science

by Joseph Mercola, DO | Guest Writer

Published April 25, 2019
Academia, Business

In 2016, the American Statistical Association¹ released a statement warning against the misuse of statistical significance in interpreting scientific research. More recently, a commentary published in the journal Nature² called for the research community to abandon the concept of statistical significance.

According to the article, before it was published in Nature³ it was endorsed by more than 800 statisticians and scientists from around the world. Why are so many researchers concerned about the P-value in statistical analysis?

In 2014, George Cobb, a professor emeritus of mathematics and statistics, posed two questions to members of an American Statistical Association discussion forum.⁴ In the first question, he asked why colleges and grad schools teach P=0.05, and found this was the value used by the scientific community. In the second question he asked why the scientific community used this particular P-value and found this was what was taught in school.

In other words, it was circular logic that drove the continued belief in an arbitrary value of P=0.05. Additionally, researchers and manufacturers may alter the perception of a result, making a positive response in an experimental group look larger or smaller against the control group simply by reporting either relative or absolute risk.

However, since many readers are not statisticians, it’s helpful to first understand the mathematical basis behind P-values and confidence intervals, and how absolute and relative risk may be easily manipulated.

Probability Frameworks Define How Researchers Present Numbers

At the beginning of a study, researchers define a hypothesis, a proposed explanation based on limited evidence, which they hope the research will either support or refute. Once the data are gathered, researchers employ statisticians to analyze the information and determine whether the experiment supported their hypothesis.

The world of statistics is all about probability, which is simply how likely it is that something will or will not happen, based on the data. Data collected from samples are used in science to infer whether what happens in the sample would likely happen in the entire population.⁵

For instance, if you wanted to find the average height of men around the world, you couldn’t measure every man’s height to get the answer, so researchers would estimate the number. Samples would be gathered from subpopulations to infer the height. These numbers are then evaluated using a framework. In many instances, medical research⁶ uses a Bayesian framework.⁷

Under a Bayesian framework, researchers treat probability as a general measure of uncertainty. This framework has no problem assigning probabilities to nonrepeatable events.

The frequentist framework defines the probability of a repeatable random event as its long-term frequency of occurrence. In other words, frequentists don’t attach probabilities to hypotheses or to any fixed but unknown values in general.⁸

The P-value belongs to the frequentist framework. The researcher first defines a null hypothesis, which states there is no difference or no change between the control group and the experimental group.⁹ The alternate hypothesis is the opposite of the null hypothesis, stating there is a difference.

What’s Behind the Numbers?

The simple definition of the P-value is that it represents the probability of seeing results at least as extreme as those observed if the null hypothesis were true. If P = 0.25, then a difference this large between the experimental group and the control group would turn up about 25 percent of the time even if there were no real change.¹⁰ It is not the probability that the null hypothesis itself is true. In the medical field,¹¹ the conventional cutoff is 0.05, the threshold below which a result is considered statistically significant.

A cutoff of 0.05, or 5 percent, corresponds to a confidence level of 95 percent. When the P-value falls below it, researchers conclude that the difference between the two observations is unlikely to be due to random variation alone, and they reject the null hypothesis.¹²

Researchers look for a small P-value, typically less than 0.05, to indicate strong evidence the null hypothesis may be rejected. When P-values are close to the cutoff, they may be considered marginal and able to go either way in most other fields.¹³
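As an illustration of what a P-value measures, the sketch below computes one by brute force for a made-up trial. The counts (45 of 100 responders on treatment, 30 of 100 on control) and the function name are invented for this example, and the method, a one-sided Fisher exact test, is one common choice rather than anything the cited sources prescribe. It asks: if the group labels made no difference, how often would chance alone put at least this many responders in the treatment group?

```python
from math import comb


def one_sided_p_value(resp_treated, n_treated, resp_control, n_control):
    """One-sided Fisher exact test.

    Returns the probability, assuming the null hypothesis that group
    labels are irrelevant, of seeing at least `resp_treated` responders
    in the treated group.
    """
    total = n_treated + n_control
    responders = resp_treated + resp_control
    non_responders = total - responders
    # Number of equally likely ways to choose who lands in the treated group.
    all_splits = comb(total, n_treated)
    # Count the splits that give the treated group k responders, for every
    # k at least as large as the observed value.
    extreme = 0
    for k in range(resp_treated, min(responders, n_treated) + 1):
        extreme += comb(responders, k) * comb(non_responders, n_treated - k)
    return extreme / all_splits


# Hypothetical trial: 45/100 responders on treatment, 30/100 on control.
p = one_sided_p_value(45, 100, 30, 100)
print(f"P-value: {p:.4f}")
```

The number printed is the probability of data this extreme given no effect. It is not the probability that there is no effect.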

Since “perfectly” random samples cannot be obtained and definitive conclusions are difficult to confirm without perfectly random samples, the P-value attempts to quantify the resulting uncertainty.¹⁴

The chosen cutoff also sets the confidence level used for confidence intervals. Imagine you’re trying to find out how many people from Ohio have taken a two-week vacation in the past year. You could ask every resident in the state, but to save time and money you could sample a smaller group, and the answer would be an estimate.¹⁵ Each time you repeat the survey, the results may be slightly different.

When using this type of estimate, researchers use a confidence interval to give a range of values, above and below the finding, within which the actual value is likely to fall. If the confidence interval is plus or minus 4 percentage points and 47 percent of the sample took a two-week vacation, researchers believe that, had they asked the entire relevant population, between 43 percent and 51 percent would have taken one.

The confidence level expresses, as a percentage, how often intervals constructed this way would contain the true population value if the survey were repeated many times. If the confidence level is 95 percent, the researcher is 95 percent confident that between 43 percent and 51 percent of the population took a two-week vacation.¹⁶
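The vacation example can be reproduced with a few lines of arithmetic. The sketch below uses the textbook normal-approximation interval for a proportion. The sample size of 600 is an assumption, chosen because it yields a margin of about 4 percentage points; the article does not say how many people were surveyed, and the function name is made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(p_hat, n, confidence=0.95):
    """Normal-approximation (Wald) interval for a sample proportion."""
    # z is about 1.96 for a 95 percent confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# 47 percent said yes; the sample size of 600 is an assumption.
low, high = proportion_confidence_interval(0.47, 600)
print(f"{low:.1%} to {high:.1%}")
```

With these inputs the interval runs from roughly 43 percent to 51 percent. A smaller sample would widen it and a larger one would narrow it.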

Scientists Rebelling Against Statistical Significance

Kenneth Rothman, professor of epidemiology and medicine at Boston University, took to Twitter with a copy of a letter to the JAMA editor after it was rejected from the medical journal.¹⁷ In the letter, signed by Rothman and two of his colleagues from Boston University, they outline their agreement with the American Statistical Association statement, stating,¹⁸ “Scientific conclusions and business or policy decisions should not be based only on whether a P-value passes a specific threshold.”

William M. Briggs, PhD, author and statistician, writes all statisticians have felt the stinging disappointment from clients when P-values do not fit the client’s expectations, despite explanations of how this significance has no bearing on real life and how there may be better methods of evaluating the experiment’s success.¹⁹

After receiving emails from other statisticians outlining their reasons for maintaining the status quo of using P-values to ascertain the value of a study, and ignoring arguments he lays out, Briggs goes on to say:²⁰

A popular thrust is to say smart people wouldn’t use something dumb, like P-values. To which I respond smart people do lots of dumb things. And voting doesn’t give truth.

Numbers May Not Accurately Represent Results

A recent commentary in the journal Nature explains why P-values, confidence intervals and confidence levels are not accurate indicators of whether a study has supported or refuted its hypothesis. The authors urge researchers to:²¹

[N]ever conclude there is ‘no difference’ or ‘no association’ just because a P value is larger than a threshold such as 0.05 or, equivalently, because a confidence interval includes zero. Neither should we conclude that two studies conflict because one had a statistically significant result and the other did not. These errors waste research efforts and misinform policy decisions.

The authors compare two studies analyzing the effects of anti-inflammatory drugs. Although both studies found the same risk ratio of 1.2, one had more precise measurements and so reported a statistically significant risk, while the other did not. The authors wrote:²²

It is ludicrous to conclude that the statistically non-significant results showed ‘no association,’ when the interval estimate included serious risk increases; it is equally absurd to claim these results were in contrast with the earlier results showing an identical observed effect. Yet these common practices show how reliance on thresholds of statistical significance can mislead us.
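A small calculation shows how two studies with an identical observed effect can land on opposite sides of the significance threshold. The counts below are hypothetical and are not the data from the studies the authors discuss. They were chosen only so that both give a risk ratio of 1.2, and the interval uses the standard log method for a risk ratio.

```python
from math import exp, log, sqrt


def risk_ratio_with_ci(events_exposed, n_exposed,
                       events_unexposed, n_unexposed, z=1.96):
    """Risk ratio and approximate 95 percent interval (log method)."""
    rr = (events_exposed / n_exposed) / (events_unexposed / n_unexposed)
    # Standard error of log(rr); it shrinks as the counts grow.
    se = sqrt(1 / events_exposed - 1 / n_exposed
              + 1 / events_unexposed - 1 / n_unexposed)
    return rr, exp(log(rr) - z * se), exp(log(rr) + z * se)


# Hypothetical large study: 1,200 of 10,000 exposed vs 1,000 of 10,000 unexposed.
large = risk_ratio_with_ci(1200, 10000, 1000, 10000)
# Hypothetical small study: 12 of 100 exposed vs 10 of 100 unexposed.
small = risk_ratio_with_ci(12, 100, 10, 100)

for name, (rr, low, high) in (("large", large), ("small", small)):
    print(f"{name}: risk ratio {rr:.2f}, interval {low:.2f} to {high:.2f}")
```

Both studies report a risk ratio of 1.2. The large study's interval sits entirely above 1, while the small study's interval spans 1 and also stretches well above 2. Calling only the first "significant" hides the fact that the second is fully compatible with the same, or a much larger, increase in risk.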

The authors call for the entire concept of statistical significance to be abandoned and urge researchers to embrace uncertainty. Scientists should describe the practical implications of the values and limits in their data rather than claiming “no association” whenever a result fails to reach the significance threshold.²³

They believe treating confidence intervals as ranges of compatible values, rather than as pass-or-fail tests, will eliminate bad practices and may introduce better ones. Instead of relying on significance thresholds, they hope scientists will include more detailed methods sections and emphasize their estimates by explicitly discussing the upper and lower limits of their confidence intervals.

Relative Risk or Absolute Risk?

George Canning was a British statesman and politician who served briefly as prime minister of the United Kingdom in 1827.²⁴ He was quoted in the Dictionary of Thoughts, published in 1908, as saying, “I can prove anything by statistics except the truth.”²⁵

As you read research or media stories, the risk associated with a particular action is usually expressed as relative risk or absolute risk. Unfortunately, the type of risk may not be identified. For instance, you may hear a particular action will reduce the risk of prostate cancer by 65 percent.

Unless you know if this refers to absolute risk or relative risk, it’s difficult to determine how much this action would affect you. Relative risk is a number used to compare the risk between two different groups, often an experimental group and a control group. The absolute risk is a number that stands on its own and does not require comparison.²⁶

For instance, imagine there were a clinical trial to evaluate a new medication researchers hypothesized would prevent prostate cancer, and 200 men signed up for the trial. The researchers split the group into two, with 100 men receiving a placebo and 100 men receiving the experimental drug.

In the control group, two men developed prostate cancer. In the treatment group only one man developed prostate cancer. When the two groups are compared, the researchers find there is a 50 percent reduction in prostate cancer when they talk about relative risk. This is because one developed it in the treatment group and two developed it in the control group.

Since one is half of two, there is a 50 percent reduction in the development of the disease. This number can sound really good and potentially encourage someone to take a medication with significant side effects if they believe it can cut their risk of prostate cancer in half.

The absolute risk reduction, however, is far smaller. In the control group, 98 men never developed cancer. In the treatment group, 99 men never developed cancer. Put another way, in the control group, the risk of developing prostate cancer was 2 percent, since 2 out of 100 got cancer, while in the treatment group, the risk lowered to 1 percent.

This means there is a 1 percent absolute risk of developing prostate cancer with the medication, compared to 2 percent without it. The difference, your absolute risk reduction, is not 50 percent but 1 percentage point (2 minus 1). Knowing this, taking the drug may not seem worth it.
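The arithmetic behind the two figures is short enough to check by hand or in code. The sketch below uses the trial numbers from the example above; the function name is made up for illustration.

```python
def risk_reductions(events_control, n_control, events_treated, n_treated):
    """Return (relative risk reduction, absolute risk reduction)."""
    control_risk = events_control / n_control
    treated_risk = events_treated / n_treated
    # Absolute reduction: the plain difference between the two risks.
    absolute = control_risk - treated_risk
    # Relative reduction: that difference as a share of the control risk.
    relative = absolute / control_risk
    return relative, absolute


# 2 of 100 men on placebo and 1 of 100 men on the drug developed cancer.
relative, absolute = risk_reductions(2, 100, 1, 100)
print(f"Relative risk reduction: {relative:.0%}")
print(f"Absolute risk reduction: {absolute:.1%}")
```

The same trial yields a 50 percent relative reduction and a 1 percentage point absolute reduction. Which one a headline quotes changes how impressive the drug sounds.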

Note: This article was reprinted with the author’s permission. It was originally published on Dr. Mercola’s website at www.mercola.com.

References:

¹ The American Statistician, 2016; 70(2).
² Nature, March 20, 2019.
³ Ibid.
⁴ See Footnote 1.
⁵ Probabilistic World, June 16, 2016.
⁶ Statistics in Medicine, 2000; 19: 3291.
⁷ BMJ, 2005; 330(7499).
⁸ See Footnote 5.
⁹ SPSS Tutorials, Null Hypothesis.
¹⁰ DZone, September 5, 2018.
¹¹ See Footnote 2.
¹² See Footnote 10.
¹³ Dummies, What a P-Value Tells You About Statistical Data.
¹⁴ Stats Direct, P Values.
¹⁵ Institute for Work and Health, Confidence Intervals.
¹⁶ Research Basics, Confidence Intervals and Levels.
¹⁷ Twitter, Ken_Rothman.
¹⁸ American Statistical Association, March 7, 2016.
¹⁹ William Briggs, March 11, 2019.
²⁰ William Briggs, January 8, 2019.
²¹ See Footnote 2.
²² See Footnote 2.
²³ See Footnote 2.
²⁴ Encyclopedia Britannica, George Canning.
²⁵ The Famous People, 18 Best George Canning Quotes.
²⁶ National Breast Cancer Coalition, Relative Risk Versus Absolute Risk.

14 Responses

Michael says:

April 26, 2019 at 12:42 am

Interesting – I had almost decided this was an article for mathematicians and I’m not, until I reached the very last section on relative and absolute risk and the example was very valuable. The rest of the article not so much.

Jan Mclellan says:

April 26, 2019 at 12:54 am

Someone once said “there are statistics, statistics and damn lies”. How true!

1. Andrew Forbes says:
  
  April 28, 2019 at 12:02 pm
  
  “There are lies, damned lies and statistics.” Mark Twain and/or Disraeli (some Jewish guy? Look him up). Jeez, it’s not that hard to do a tiny bit of research and get it right. But then that is the nature of this forum right?
  
2. Andrew Forbes says:
  
  April 28, 2019 at 12:04 pm
  
  “There are lies, damned lies and statistics.” Mark Twain and/or Disraeli (some Jewish guy? Look him up). Jeez, Jan, it’s not that hard to do a tiny bit of research and get it right. But then that is the nature of this forum right?
  
Jay says:

April 26, 2019 at 9:23 am

Thank you, thank you for posting this! I have been a long-time critic of the abuse of statistics. Statistical study is not science. Over the last 50 years, lazy “scientists” have abandoned the rigorous scientific method that involves multiple iterations of developing cause-and-effect hypotheses, experimentation and fine-tuning, in favor of shortcuts using statistical data manipulation. With enough slicing and dicing of the data, a statistician can produce any conclusion you tell him to produce! Charles Darwin, Galileo, and Mendel would be turning in their graves.

Incidentally the exact same abuse of p-value significance was also responsible for trillion dollars of mortgage-backed-securities being declared safe, which ultimately caused the 2008 financial crisis.

michael KENDALL says:

April 26, 2019 at 9:23 am

This is a beautiful, well written article that shows great insight and academic acumen and having used many examples of statistical analysis in my PhD dissertation in 1972 explains many discrepancies, especially regarding relative and absolute risk. These observations should also be extrapolated to the vaccine crisis crippling America. Anything can be “proven” with statistical methodology.

Julie Williams says:

April 26, 2019 at 10:50 am

Damn lies is what they are and have always been pushing. PUSH BACK -and don’t stop. (((They))) are the relentless borg banking on our naïveté and ignorance. They don’t debate because they can’t. Like cockroaches- they scurry back into the safety of their dark ideologies and total (((legal))) immunity -insisting that we are disease carrying vermin that need to be poisoned to death for our own good. Meanwhile we have lawless open borders letting every creature in -carrying gods knows what into our once safe and prosperous homeland.

luis says:

April 26, 2019 at 11:38 am

The body of knowledge needed to understand this article is missing from the college curricula as required. Sadly we are in the process of “simplifying” math, etc. More knowledge is needed to navigate present day life. Succinctly: more articles like this one are very much needed: How else can we realize how much more education is missing?
Thank you!

Alexander says:

April 26, 2019 at 12:12 pm

A very important topic indeed, but unfortunately there are many errors in this article that may mislead readers. There are far too many errors to list them here.
This reflects poorly on NVIC.

Yes, epidemiology and biostatistics are somewhat technical, but the basics are not hard to learn. There are excellent free online courses on statistics in medicine which I would recommend to anyone with an interest.

One particular fundamental error here is unfortunately widely misunderstood to be true:
“The simple definition of the P-value is that it represents the probability of the null hypothesis being true. If P = 0.25 then there is a 25% probability of no change between the experimental group and the control group.”
No. The p-value is the probability of your observed data if the null hypothesis is true, assuming that a bunch of usually unstated assumptions are true. That’s very different. Life would be so much simpler if we could directly estimate the probability of the null hypothesis being true.

1. Karen says:
  
  April 28, 2019 at 11:50 am
  
  Alexander, Thank you for pointing this out. What little I can understand I would like to understand correctly. Is there anything online in the nature of Statistics for Dummies (I have a serious math disability – nothing computes). I took statistics in college and graduate school, but didn’t understand it either time. The only thing I really understood and remembered is the difference between median and mean. If only they could explain statistics in a picture book I might begin to grasp it. They start throwing letters together with numbers and my brain starts protesting, “What? Are we reading or doing math?” Then it just shuts down altogether.
  
mark says:

April 26, 2019 at 2:13 pm

Lies, damned lies, and statistics
Mark Twain

mark says:

April 26, 2019 at 2:14 pm

British prime minister Benjamin Disraeli: “There are three kinds of lies: lies, damned lies, and statistics.”

John Castleman says:

April 30, 2019 at 11:34 pm

Excellent. I have majored in statistical methods and spent a career in risk assessment and this article sums up my own understanding and observations better than anything else I’ve seen written. I fully support this and would add that preoccupation with statistics has often been at the cost of understanding the scientific processes, physical relationships and factors involved in the subject of study. Uncertainty, lack of knowledge and less common true randomness all need to be understood and statistics won’t tell you that.

Nick Kottenstette says:

July 6, 2019 at 8:59 am

Thank you for publishing this article. P-values, confidence intervals and a thorough understanding of the process all help one to decide the overall benefit or risk it can provide.
Relying heavily on any one aspect is naive and clearly does not help one move forward to improve the process with additional gained insight.

Time and time again I have seen engineers not fully understand the value of their own data set when a certain p threshold was not met as it pertains to a certain assumption of the process they were trying to fit.
