Why Even Scientists Get Statistics Wrong
A survey found that 97% of researchers endorsed incorrect interpretations of confidence intervals. How can this be, and why does it matter for science?
Imagine you're reading a groundbreaking medical study. The authors report that their new treatment reduces symptoms by 40%, with a 95% confidence interval of 35% to 45%. What does this actually mean? If you interpreted this as "there's a 95% probability the true effect lies between 35% and 45%," you'd be in good company—but you'd be wrong. In fact, this common misunderstanding represents one of the most pervasive statistical illusions in modern research.
97% of surveyed researchers endorsed incorrect interpretations of confidence intervals.
Surveys of scientists have revealed widespread misinterpretation of these statistical concepts. In one eye-opening study, researchers presented six false statements about confidence intervals to experienced scientists and graduate students. All six misinterpretations were endorsed by the majority of respondents, with one misunderstanding accepted by 97% of those surveyed [4].
This statistical confusion isn't just an academic exercise—it contributes to what many call the "replication crisis" in science, where findings that appear solid in one study fail to hold up in subsequent research [8]. Understanding what confidence intervals and standard errors really mean—and what they don't—is essential for both scientists and consumers of science.
A confidence interval provides a range of values used to estimate an unknown statistical parameter, such as a population mean [6]. The confidence level—typically 95%—refers to the long-run reliability of the method used to generate the interval.
In other words, if we were to repeat the same sampling process 100 times, approximately 95 of those confidence intervals would contain the true population parameter. The critical point is that for any single, specific interval we've already calculated, we cannot say there's a 95% probability that it contains the true value [6]. That particular interval either contains the parameter or it doesn't.
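To make the long-run reading concrete, here is a minimal simulation sketch (Python with NumPy and SciPy; the normal population, sample size, and number of repetitions are illustrative assumptions, not values from any study). Each simulated interval either covers the true mean or it doesn't; only the long-run proportion is close to 95%.

```python
# Sketch: the "95%" describes the procedure's long-run coverage,
# not the probability for any single interval.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_mean, true_sd = 50.0, 10.0        # assumed population (illustrative)
n, n_repeats = 30, 10_000              # sample size and number of repeated experiments

t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% critical value
hits = 0
for _ in range(n_repeats):
    sample = rng.normal(true_mean, true_sd, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - t_crit * se, sample.mean() + t_crit * se
    hits += (lo <= true_mean <= hi)    # this particular interval covers, or it doesn't

print(f"Coverage over {n_repeats} intervals: {hits / n_repeats:.3f}")  # close to 0.95
```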
Perhaps an even more common confusion lies in distinguishing between standard error (SE) and standard deviation (SD). Though mathematically related, they answer fundamentally different questions [3]: the standard deviation describes how much individual observations vary, while the standard error describes how precisely the sample mean estimates the population mean.
The relationship between them is expressed mathematically: Standard Error = Standard Deviation / √(Sample Size) [3]. This formula reveals why larger sample sizes yield more precise estimates—as sample size increases, standard error decreases.
| Aspect | Standard Deviation (SD) | Standard Error (SE) |
|---|---|---|
| Measures | Spread of individual data points | Uncertainty in the sample mean |
| Based on | Individual observations | Sampling distribution |
| Affected by sample size? | No | Yes (larger samples → smaller SE) |
| Use case | "How much do individual test scores vary?" | "How precise is our estimate of the average test score?" |
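As a small illustration of the table above, the snippet below computes both quantities for a handful of made-up test scores (Python with NumPy; the scores themselves are invented for the example):

```python
# Sketch: SD describes the spread of individual scores,
# SE describes how precisely the mean of those scores is estimated.
import numpy as np

scores = np.array([72, 85, 90, 68, 77, 95, 81, 74, 88, 79], dtype=float)  # illustrative data

sd = scores.std(ddof=1)           # spread of individual observations
se = sd / np.sqrt(scores.size)    # uncertainty in the sample mean: SD / sqrt(n)

print(f"mean = {scores.mean():.1f}")
print(f"SD   = {sd:.2f}  (how much individual scores vary)")
print(f"SE   = {se:.2f}  (how precise the estimate of the average is)")
# Quadrupling the sample size would leave SD roughly unchanged
# but halve SE, because SE = SD / sqrt(n).
```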
To investigate how researchers interpret confidence intervals, Hoekstra et al. (2014) conducted a survey that has become landmark evidence of statistical misunderstanding [4]. The researchers presented participants with a scenario: a 95% confidence interval for a mean was calculated as [0.1, 0.4]. Participants were then asked to evaluate six statements about what this confidence interval meant.
The survey was administered to 442 researchers and students, including experienced professors and PhD candidates across various disciplines. The participants represented a broad spectrum of scientific expertise, from beginners to established researchers [4].
All six of the statements were, in fact, incorrect interpretations of the interval; the only fully correct response was to endorse none of them. Participants were asked to indicate which statements they believed followed logically from the reported confidence interval [4].
The findings were startling. The majority of researchers endorsed false interpretations of confidence intervals. The most commonly accepted misunderstanding—endorsed by 97% of respondents—was Statement 4: "There is a 95% probability that the true mean lies between 0.1 and 0.4" [4].
This misinterpretation represents what statisticians call the "probability of the parameter" fallacy. From a frequentist statistical perspective (the framework used by most researchers), the true population parameter is fixed, not random. Therefore, once a confidence interval is calculated, it either contains the parameter or it doesn't—there's no probability involved for that specific interval [4][6].
| Statement Type | Example | Percentage Endorsing |
|---|---|---|
| Probability about parameter | "There is a 95% probability that the true mean lies between 0.1 and 0.4." | 97% |
| Probability about future samples | "If we repeated the experiment, there is a 95% probability the new estimate would fall between 0.1 and 0.4." | 66% |
| Confidence statement about parameter | "We can be 95% confident that the true mean lies between 0.1 and 0.4." | 58% |
In response to critiques that the misunderstandings might be merely linguistic, the researchers devised a clever thought experiment using the nonsense word "bunky" to isolate the conceptual problem [4].
They proposed: Suppose we sample participants from a population and calculate a 95% confidence interval using Student's t method. We then say we have "95% bunkiness" in that interval, meaning that in the long run, 95% of such intervals would contain the true value.
Now, suppose a reliable source tells us the population standard deviation, additional information that the Student's t procedure never uses. The long-run behavior of the t intervals doesn't change, so our "bunkiness" doesn't change either: 95% of such intervals still contain the true value. And yet, once we know the population standard deviation, the width of our particular interval tells us something about how plausible it is that this interval captured the true mean, so our rational belief about it need no longer be 95% [4].
This reveals the core problem: "bunkiness" (like confidence) is a property of the method, not the specific interval. The multiplicity of ways to generate intervals creates a reference class problem—there are many "long runs" we could consider, each giving different probabilities [4]. This demonstrates that the misunderstanding isn't just about word choice but about fundamental statistical concepts.
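A small simulation sketch can make this reference class problem tangible (Python with NumPy and SciPy; the population, the tiny sample size, and the choice to split intervals at the median width are illustrative assumptions, not part of the original argument). Overall coverage of the Student's t procedure is 95%, but once we use extra information, here whether a given interval came out unusually narrow or unusually wide, the coverage within those subsets is no longer 95%:

```python
# Sketch: conditioning on extra information changes what we can say
# about a *specific* interval, even though the procedure's long-run
# coverage stays fixed at 95%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_mean, true_sd = 0.0, 1.0          # assumed population (illustrative)
n, n_repeats = 5, 100_000              # small samples make the effect easy to see
t_crit = stats.t.ppf(0.975, df=n - 1)

half_widths, covered = [], []
for _ in range(n_repeats):
    sample = rng.normal(true_mean, true_sd, size=n)
    hw = t_crit * sample.std(ddof=1) / np.sqrt(n)
    half_widths.append(hw)
    covered.append(abs(sample.mean() - true_mean) <= hw)

half_widths, covered = np.array(half_widths), np.array(covered)
narrow = half_widths < np.median(half_widths)

print(f"overall coverage:          {covered.mean():.3f}")           # ~0.95
print(f"coverage of narrow half:   {covered[narrow].mean():.3f}")   # noticeably below 0.95
print(f"coverage of wide half:     {covered[~narrow].mean():.3f}")  # noticeably above 0.95
```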
Error bars are commonly used in scientific publications to represent uncertainty, but their meaning is frequently ambiguous or misinterpreted [7]. A survey by Belia et al. (2005) found that researchers often confuse different types of error bars [7].
The problem is compounded when bar plots are used inappropriately to display data. As one principle of effective data visualization notes: "Bar plots are noted for their very low data density... A good use of a bar plot might be to show counts of something, while poor use of a bar plot might be to show group means" [9].
| Error Bar Type | Represents | Common Misinterpretation |
|---|---|---|
| Standard Deviation (SD) | Spread of individual data points | Precision of the mean estimate |
| Standard Error (SE) | Uncertainty in the sample mean | Spread of individual data points |
| Confidence Interval (CI) | Range of plausible values for parameter | Probability that parameter lies in interval |
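To see how much the choice of error bar changes a figure, here is a plotting sketch (Python with Matplotlib; the two groups, their sizes, and their values are invented for illustration). The same data are drawn three times, once per error bar type from the table above:

```python
# Sketch: identical data, three very different-looking sets of error bars.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
groups = {"Control": rng.normal(10.0, 3.0, 25),      # illustrative data
          "Treatment": rng.normal(12.0, 3.0, 25)}

fig, axes = plt.subplots(1, 3, figsize=(9, 3), sharey=True)
for ax, kind in zip(axes, ["SD", "SE", "95% CI"]):
    for i, data in enumerate(groups.values()):
        sd = data.std(ddof=1)
        se = sd / np.sqrt(data.size)
        if kind == "SD":
            err = sd                                          # spread of raw observations
        elif kind == "SE":
            err = se                                          # precision of the mean
        else:
            err = stats.t.ppf(0.975, df=data.size - 1) * se   # half-width of the 95% CI
        ax.errorbar(i, data.mean(), yerr=err, fmt="o", capsize=4)
    ax.set_title(f"Mean ± {kind}")
    ax.set_xticks([0, 1])
    ax.set_xticklabels(groups.keys())
    ax.set_xlim(-0.5, 1.5)
plt.tight_layout()
plt.show()
```

Readers often assume the shortest bars signal the most reliable result; here they simply mean the standard error was plotted instead of the standard deviation.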
With a 95% confidence level, the procedure produces intervals that contain the true value in 95% of repeated samples; raising the confidence level (say, to 99%) widens the interval computed from the same data, while lowering it narrows the interval.
| If you see... | Correct Interpretation | Common Pitfall to Avoid |
|---|---|---|
| 95% CI: [2.1, 5.3] | "This method produces intervals that contain the true parameter in 95% of repeated samples." | "There's a 95% chance the true value is between 2.1 and 5.3." |
| Mean ± SE | "This reflects the precision of our mean estimate." | "This shows how spread out the individual data points are." |
| Mean ± SD | "This shows the variability of individual observations around the mean." | "This reflects how precise our mean estimate is." |
| Non-overlapping error bars (SE) | "The difference may be statistically significant, but non-overlapping SE bars alone correspond only to roughly p < 0.16 for two independent groups; a gap of about one SE is needed for p ≈ 0.05." | "The group means are dramatically different, or certainly significant at p < 0.05." |
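The standard-error row deserves one extra caution, sketched below (Python with SciPy; equal group sizes and equal standard errors are simplifying assumptions). When the SE bars of two independent groups just touch, the two-sided p-value is only around 0.16; a gap of roughly one SE between the bar ends is needed before p drops to about 0.05:

```python
# Sketch: what p-value corresponds to a given gap between SE error bars
# for two independent groups with equal n and equal SEs.
import numpy as np
from scipy import stats

n = 30        # per-group sample size (assumed equal)
se = 1.0      # standard error of each group mean (assumed equal)

for gap in [0.0, 0.5, 1.0]:               # gap between bar ends, in SE units
    diff = 2 * se + gap * se              # distance between the two means
    se_diff = np.sqrt(se**2 + se**2)      # SE of the difference of independent means
    t = diff / se_diff
    p = 2 * stats.t.sf(t, df=2 * n - 2)   # two-sided p-value
    print(f"gap = {gap:.1f} SE  ->  p = {p:.3f}")

# gap = 0.0 SE -> p roughly 0.16  (bars just touch: not significant)
# gap = 0.5 SE -> p roughly 0.08
# gap = 1.0 SE -> p roughly 0.04  (about one SE of daylight: p near 0.05)
```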
The widespread misunderstanding of confidence intervals and standard error bars represents more than just a technical statistical issue—it reflects a gap in scientific education that has real consequences for how research is conducted, interpreted, and applied.