The mismatch between statistical thinking and statistical practice

Why it matters for understanding the replication crisis

Richard D. Morey (Twitter @richarddmorey)
ESRC Research Methods Festival // July 2016

The "crisis"

Majority of scientists polled believe science has "reproducibility crisis"

(Monya Baker, Nature News, 25 May 2016)

Some suggested fixes

  • Pre-registration
  • More replication
  • Open data/materials
  • Better methods education
  • Effect sizes/CIs
  • Meta-analysis
  • Bayesian statistics
  • etc, etc

 

None of these things will "fix" the crisis.

Bayes and the reproducibility crisis

How can Bayes shed light on the reproducibility crisis?

  • Not primarily through an end to NHST
  • Not primarily through better techniques
  • Not primarily through allowing null results to be evidential

Bayes forces us to grapple with a foundational question: what is [statistical] evidence?

How much evidence?

Rosnow and Rosenthal on "evidence"

"He [Fisher] did not give specific advice on how to appraise "the exact strength" of the evidence, but the use of statistical power analysis, effect-size estimation procedures, and quantitative meta-analytic procedures (to which we refer later) enables us to do this with relative ease."

Rosnow and Rosenthal (1989), "Statistical Procedures and the Justification of Knowledge in Psychological Science"

How much evidence?

Rosnow, in response to the question "What definition of evidence did you have in mind here that would be exactly quantifiable?"

"I was not thinking of it then in an epistemic sense...I was thinking of "strength of evidence" in randomized controlled experiments in an operational sense as the size of the effect in correlational terms...Still...I humbly notice that I continue to skirt the deeper question!"

Rosnow (2012, personal communication)

Differing principles of evidence

Frequentist Evidence Principle:

"\(y\) is (strong) evidence against \(H_0\), i.e. (strong) evidence of discrepancy from \(H_0\), if and only if, where \(H_0\) a correct description of the mechanism generating \(y\), then, with high probability, this would have resulted in a less discordant result than is exemplified by \(y\)."

(Mayo and Cox, 2006, emphasis added)
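
A minimal Python sketch of this principle in practice (the observed \(t\) statistic and degrees of freedom below are illustrative assumptions, not values from the talk): the \(p\) value is the probability, computed under \(H_0\), of a result at least as discordant as the one observed, so a small \(p\) means that, were \(H_0\) true, a less discordant result would have occurred with high probability.

    # Frequentist evidence as a tail probability under H0 (sketch).
    # The observed t statistic and degrees of freedom are illustrative.
    from scipy import stats

    t_obs, df = 2.1, 38                     # assumed observed t and its df
    p = 2 * stats.t.sf(abs(t_obs), df)      # P(|T| >= |t_obs|) under H0
    print(f"p = {p:.3f}; a less discordant result had probability {1 - p:.3f} under H0")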

Bayesian Evidence Principle:

\(y\) is evidence differentiating between \(H_1\) and \(H_2\) insomuch as a reasonable person's beliefs regarding \(H_1\) versus \(H_2\) should be swayed by \(y\) (where "reasonable" is defined as acting in accordance with probability theory and conditionalization).
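
By contrast, a minimal Python sketch of the Bayesian principle (the two point hypotheses and the single observation are illustrative assumptions): the data \(y\) sway belief only through the likelihood ratio, which updates prior odds to posterior odds by conditionalization.

    # Bayesian evidence: y sways belief via the likelihood ratio (sketch).
    # The two point hypotheses and the observed value are illustrative.
    from scipy import stats

    y = 1.2                                       # a single observed data point
    h1, h2 = stats.norm(0, 1), stats.norm(1, 1)   # H1: N(0,1); H2: N(1,1)

    likelihood_ratio = h1.pdf(y) / h2.pdf(y)      # evidence in y for H1 vs H2
    prior_odds = 1.0                              # indifference, for illustration
    posterior_odds = prior_odds * likelihood_ratio
    print(f"Likelihood ratio: {likelihood_ratio:.3f}; posterior odds: {posterior_odds:.3f}")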

Two examples of differing evidential principles

A mismatch between theory and practice

  • Stopping rules: Is \(p\) hacking a crime?
  • Confidence intervals: What do they mean?

Is \(p\) hacking a crime?

A hypothetical scenario

You and your colleague Dr. Smith are doing research together. She is in charge of data collection, but has not been in the lab for a week, so you decide to take a look at the data. A \(t\) test on the comparison of interest yields \(p=0.045\).

You decide to interpret this as "strong enough" evidence against the null to begin writing up the results.

Is \(p\) hacking a crime?

Scenario A

You notice in Dr. Smith's email outbox a message saying that she ran the last three participants after noting that \(p=0.051\), hoping to get just a "bit more evidence".

Does this change your perception of the evidence?

Is \(p\) hacking a crime?

Scenario B

You notice in Dr. Smith's email outbox that the last three participants were run because of a miscommunication between research assistants: one thought the other was off sick and ran extra participants to make up for the subjects they believed had been "lost".

Does this change your perception of the evidence?

\(p\) hacking and the evidence

"\(p\) hacking" is bad under the frequentist evidential principle.

  • \(p\) hacking increases the Type I error rate (see the simulation sketch below).
  • \(p\) hacking is considered a "Questionable Research Practice"
  • \(p\) values must be corrected for changes to the stopping rule.

Observed evidence was the same either way. What changes is what might have been observed.
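
A minimal Python simulation sketch of the frequentist objection (the group size of 20, the extra 3 participants, and the .05–.10 "peek" window are illustrative assumptions): if researchers add participants whenever \(p\) lands just above .05 and test again, the long-run Type I error rate rises above the nominal 5%, even though both groups are drawn from the same population.

    # Optional stopping inflates the Type I error rate (simulation sketch).
    # Sample sizes and the "peek" window are illustrative assumptions.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_sims, n_initial, n_extra = 20_000, 20, 3
    false_positives = 0

    for _ in range(n_sims):
        a = rng.normal(size=n_initial)       # both groups from the same
        b = rng.normal(size=n_initial)       # population, so H0 is true
        p = stats.ttest_ind(a, b).pvalue
        if 0.05 < p < 0.10:                  # "just get a bit more evidence"
            a = np.concatenate([a, rng.normal(size=n_extra)])
            b = np.concatenate([b, rng.normal(size=n_extra)])
            p = stats.ttest_ind(a, b).pvalue
        false_positives += p < 0.05

    print(f"Type I error rate: {false_positives / n_sims:.3f} (nominal .05)")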


\(p\) hacking represents a mismatch between

  • Evidential intuition (what matters is what was observed)
  • Frequentist principles (what might have been observed matters)

Neyman (1937) and theory of confidence intervals

What is a frequentist confidence procedure?

  1. Sample data from a population.
  2. Compute two numbers from the data \((L, U)\) using some procedure.
  3. Say "The parameter is inside the interval \((L, U)\)."

If the procedure is such that the statement in (3) is true \(X\%\) of the time in repeated samples, the procedure is an \(X\%\) confidence procedure.
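
A minimal Python sketch of this definition (the normal population, its mean and spread, and the sample size are illustrative assumptions): apply the usual \(t\)-based interval procedure to many repeated samples and check how often the statement in (3) is true.

    # A confidence procedure is defined by its long-run coverage (sketch).
    # The population, its parameters, and the sample size are illustrative.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_mu, n, n_reps = 10.0, 25, 10_000
    covered = 0

    for _ in range(n_reps):
        y = rng.normal(loc=true_mu, scale=2.0, size=n)        # 1. sample data
        L, U = stats.t.interval(0.95, n - 1,                  # 2. compute (L, U)
                                loc=y.mean(), scale=stats.sem(y))
        covered += (L <= true_mu <= U)                        # 3. is the statement true?

    print(f"Coverage over repeated samples: {covered / n_reps:.3f}")  # close to .95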

What is a confidence interval?

  • Masson and Loftus (2003): "[t]he interpretation of the confidence interval constructed around that specific mean would be that there is a 95% probability that the interval is one of the 95% of all possible confidence intervals that includes the population mean. Put more simply, in the absence of any other information, there is a 95% probability that the obtained confidence interval includes the population mean."

  • Kalinowski (2010): "A good way to think about a CI is as a range of plausible values for the population mean (or another population parameter such as a correlation)."

  • Cumming (2014): "[w]e can be 95% confident that our interval includes [the parameter] and can think of the lower and upper limits as likely lower and upper bounds for [the parameter]."

  • Young and Lewis (1997): "[t]he width of the CI gives us information on the precision of the point estimate."

  • Cumming (2014): "[l]ong confidence intervals (CIs) will soon let us know if our experiment is weak and can give only imprecise estimates"

Neyman and theory of confidence intervals

"Consider now the case when a sample ... is already drawn and the [confidence interval] given...Can we say that in this particular case the probability of the true value of [the parameter] falling between [the bounds of the CI] is equal to [$1-\alpha$]? The answer is obviously in the negative." (1937, p. 349)

"...it is not suggested that we can 'conclude' that [the CI contains the true value], nor that we should 'believe' that [the CI contains the true value]..." (1941, p. 133)

"[Statistical inferences are] certainly not any sort of 'reasoning', at least not in the sense in which this word is used in other instances; they are acts of will." (1957, p. 10)

The fallacy of placing confidence in confidence intervals

Morey et al (2016)

  • Observed CIs don't (generally) track precision (see the sketch below)
  • Observed CIs don't (generally) allow evaluation of "likelihood"
  • Observed CIs don't (generally) have an \(X\%\) probability of containing the true parameter
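
A Python sketch in the spirit of the uniform ("submarine") example discussed by Morey et al. (2016) (the specific numbers below are illustrative): with two observations from Uniform(\(\theta-5, \theta+5\)), the interval between them is a valid 50% confidence procedure, yet wide observed intervals are exactly the ones that pin \(\theta\) down most tightly and contain it with certainty, while narrow ones contain it far less than 50% of the time.

    # Observed CI width need not track precision (simulation sketch; the
    # cutoff of 5 for "wide" is an illustrative choice).
    import numpy as np

    rng = np.random.default_rng(2)
    theta, n_reps = 0.0, 100_000

    y = rng.uniform(theta - 5, theta + 5, size=(n_reps, 2))
    lo, hi = y.min(axis=1), y.max(axis=1)        # 50% procedure: between the two obs
    width = hi - lo
    covers = (lo < theta) & (theta < hi)

    wide = width > 5                             # far-apart obs: theta must lie between them
    print(f"Overall coverage:      {covers.mean():.2f}  (the advertised 50%)")
    print(f"Coverage when wide:    {covers[wide].mean():.2f}")
    print(f"Coverage when narrow:  {covers[~wide].mean():.2f}")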

Confidence intervals: just inverted significance tests!


No free lunch: if you want Bayesian statistics, do Bayesian statistics.

Let's make real choices about methods.

  • Researchers have not made considered choices about evidential principles
  • Leads to disconnect between practice and theory
  • Upshot: Training does not prepare students to be thoughtful scientists
  • Upshot: Conclusions are not constrained by theory of evidence

Critical comparison of frequentist and Bayesian principles of evidence can lead to more careful science