Perspect Psychol Sci. Author manuscript; available in PMC 2017 Jul 1.
PMCID: PMC4968573
NIHMSID: NIHMS775432
What should we expect when we replicate? A statistical view of replicability in psychological science
Abstract
A recent report on the replicability of key psychological findings is a major contribution toward understanding the human side of the scientific process. Despite the careful and nuanced analysis reported in the paper, mass, social, and scientific media adhered to the simple narrative that only 36% of the studies replicated their original results. Here we show that 77% of the replication effect sizes reported were within a 95% prediction interval based on the original effect size. Our analysis suggests two critical issues in understanding replication of psychological studies. First, our intuitive expectations for what a replication should show do not always match statistical estimates of replication. Second, when the results of original studies are very imprecise, they create wide prediction intervals - and a wide range of replication effects that are consistent with the original. This may lead to effects that replicate successfully, in that the replication results are consistent with statistical expectations, but that do not provide much information about the size (or existence) of the true effect. In this light, the results of the Reproducibility Project: Psychology can be viewed as statistically consistent with what one would expect when performing a large-scale replication experiment.
Introduction
It is natural to hope that when two scientific experiments are conducted in the same way, they will lead to identical conclusions. This is the intuition behind the recent tour-de-force replication of 100 psychological studies by the Open Science Collaboration, Reproducibility Project: Psychology (Open Science Collaboration, 2015). At considerable expense and with painstaking effort, the researchers attempted to replicate the exact conditions of each experiment, collect the data, and analyze them identically to the original study.
The original analysis considered both subjective and quantitative measures of whether the results of the original study were replicated in each case. The authors compared average effect sizes, compared effect sizes to confidence intervals, and collected subjective and qualitative assessments of replication. Despite the measured tone of the manuscript, the resulting mass, social, and scientific media coverage of the paper fixated on the statement that just 36% of the studies replicated the original result (Patil & Leek, 2015).
Although we may hope that a properly replicated study will give the same result as the original, statistical principles suggest that this may not be the case. The Reproducibility Project: Psychology study coincided with extensive discussion of what it means for a study to be reproducible and how to account for different sources of variation when replicating (Ledgerwood, 2014). Stanley and Spence (Stanley & Spence, 2014) showed through simulation how sampling and measurement variation interact with the size and reliability of an effect to produce wide distributions of replication effect sizes. These examinations were accompanied by discussions of adequate study power (Maxwell, 2004; McShane & Böckenholt, 2014), sample size (Gelman & Carlin, 2014; Schönbrodt & Perugini, 2013), and how meta-analysis may address the consequences of inadequate power or sample size (Braver, Thoemmes, & Rosenthal, 2014). Anderson and Maxwell (Anderson & Maxwell, 2015) furthered these concepts by categorizing the different goals of replicating a study and recommending appropriate analyses and equivalence tests specific to each goal. In sum, the sources of variability that make replicating the outcome of a particular study so difficult were well documented when the Reproducibility Project: Psychology study was underway.
Here we present a view of replication based on prediction intervals - a statistical technique for predicting the range of effects we would expect to see in a replication study. This technique respects both the variability in the original study and the variability in the replication study to arrive at a global view of whether the results of the two are consistent. The statistical analysis shows that our intuitive understanding of replication can be flawed. The key point is that there is variability both in the original study and in the replication study. When the original study is small or poorly designed, the range of potential replication estimates consistent with the original estimate will be large. Larger, more carefully designed studies will have a narrower range of consistent replication estimates. With this view, many smaller studies will show statistically consistent replications, even if they provide very little information about the quantity of interest. In other words, a replication may be statistically successful yet carry little information about the true effects being studied.
Our analysis re-emphasizes the importance of well-designed studies run with sufficient sample sizes for drawing informative conclusions. It also suggests that replicating studies with small original sample sizes may be relatively uninformative - the replication estimates will be statistically consistent with the original even in cases where the estimates change sign or are quite different from the original study.
Defining and Quantifying Replication Using P-values
In the original paper describing the Reproducibility Project: Psychology, a number of approaches to quantifying reproducibility were considered. The widely publicized 36% figure refers only to the percentage of study pairs that reported a statistically significant (P < 0.05) result in both the original and replication studies. The relatively low number of results that were statistically significant in both studies was the focus of extreme headlines like "Over half of psychology studies fail reproducibility test" (Baker, 2015) and played into the prevailing narrative that science is in crisis (Gelman & Loken, 2014).
The most widely disseminated claim from this paper is based on a misinterpretation of reproducibility and replicability. Reproducibility is defined informally as the ability to recompute data analytic results conditional on an observed data set and knowledge of the statistical pipeline used to calculate them (Peng, 2011; Peng, Dominici, & Zeger, 2006). The expectation for a study to be reproducible is that the exact same numbers will be produced from the same code and data every time. Replicability of a study is the chance that a new experiment targeting the same scientific question will produce a consistent result (Asendorpf et al., 2013; Ioannidis, 2005). When a study is replicated, it is not expected that the same numbers will result, for a host of reasons including both natural variability and changes in the sample population, methods, or analysis techniques (Leek & Peng, 2015).
We therefore do not expect to get the same answer even if a perfect replication is performed. Defining replication as subsequent results with P < 0.05 squares with the intuitive idea that replication studies should arrive at similar conclusions, so it makes sense that, despite the many metrics reported in the original paper, the media chose to focus on this number. However, this definition is flawed, since there is variation in both the original study and in the replication study, as has been much studied in the psychology community. Even if you performed 10,000 perfect studies and 10,000 perfect replications of those studies, you would expect the number of times both P-values are less than 0.05 to vary.
In real studies we do not know the truth - what the real effect size is or whether the study found it. An alternative is to generate simulated data where the effect size and variability are already known, and then apply statistical methods to see how they behave. We conducted a small simulation based on the effect sizes presented in the original article. In the original report, the authors applied transformations to 73 of the 100 studies whose effects were reported via test statistics other than the correlation coefficient (e.g. t-statistics, F-statistics). We simulated 10,000 perfect replications of these 73 studies based on one degree of freedom tests. Each of these 10,000 simulations represents a perfect version of the Reproducibility Project with no errors. In each case, we calculated the percentage of P-values less than 0.05. The percentage of P-values less than 0.05 ranged from 73% to 91% (1st to 3rd quartile; high: 100%; low: 6%), a high degree of variability (Figure S1).
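The sketch below illustrates one plausible way to run this kind of simulation; it is not the paper's code. The correlations and sample sizes are made-up placeholders standing in for the 73 one-degree-of-freedom studies, each observed correlation is treated as the true effect, and replication estimates are drawn via the Fisher z normal approximation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Placeholder inputs: invented correlations and replication sample sizes.
# The actual analysis used the 73 studies' reported values.
r_true = rng.uniform(0.1, 0.5, size=73)   # treat observed effects as true effects
n_rep = rng.integers(30, 200, size=73)    # replication sample sizes

def share_significant(r_true, n_rep, rng, n_sim=10_000, alpha=0.05):
    """Share of studies with replication P < alpha in each simulated round,
    using the Fisher z normal approximation for the sample correlation."""
    z_true = np.arctanh(r_true)
    se = 1.0 / np.sqrt(n_rep - 3)
    shares = np.empty(n_sim)
    for i in range(n_sim):
        z_rep = rng.normal(z_true, se)              # simulated replication estimates
        p = 2 * stats.norm.sf(np.abs(z_rep / se))   # test of zero correlation
        shares[i] = np.mean(p < alpha)
    return shares

shares = share_significant(r_true, n_rep, rng)
print(np.percentile(shares, [25, 50, 75]))  # spread of "percent significant" across rounds
```

Even with no errors anywhere in this toy setup, the fraction of "significant" replications varies from round to round, which is the point of the simulation.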
Prediction Intervals
Sampling variation alone may contribute to "un-replicated" results if you define replication by a P-value cutoff. We instead consider a more direct approach by asking the question: "What effect would we expect to see in the replication study once we have seen the original effect?" This expectation depends on many variables about how the experiments are performed (Goodman, 1992). Here we assume the replication experiment is indeed a true replication - not an unreasonable assumption in light of the effort expended to replicate these experiments accurately.
One statistical quantity that incorporates what we can reasonably expect from subsequent samples is the prediction interval. A traditional 95% confidence interval describes our uncertainty about a population parameter of interest. We may see an odds ratio reported in a paper as 1.6 [1.2, 2.0]. Here, 1.6 is our best estimate of the true population odds ratio based on the observed data. The range [1.2, 2.0] is the 95% confidence interval constructed from this study. If we were able to observe 100 samples and construct a 95% confidence interval for each sample, about 95 of the 100 intervals would contain the true population odds ratio.
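For readers who want to see where an interval like 1.6 [1.2, 2.0] typically comes from, here is a minimal sketch of the standard Wald construction of a 95% confidence interval for an odds ratio on the log scale. The 2x2 table counts are invented for illustration, not data from any study discussed here.

```python
import numpy as np
from scipy import stats

# Invented 2x2 table: rows are exposure groups, columns are outcome yes/no.
a, b = 160, 100   # exposed: outcome yes / no
c, d = 110, 110   # unexposed: outcome yes / no

log_or = np.log((a * d) / (b * c))      # log odds ratio
se = np.sqrt(1/a + 1/b + 1/c + 1/d)     # Wald standard error of log(OR)
crit = stats.norm.ppf(0.975)            # 97.5% normal quantile, about 1.96

lo, hi = np.exp(log_or - crit * se), np.exp(log_or + crit * se)
print(f"OR = {np.exp(log_or):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```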
A prediction interval makes an analogous claim about an individual future observation given what we have already observed. In our context, given the observed original correlation and some distributional assumptions (described in detail in the Supplementary section), we can construct a 95% prediction interval and state that if we were to replicate the exact same study 100 times, about 95 of the observed replication correlations would fall within the corresponding prediction interval.
Using Prediction Intervals to Assess Replication
Assuming the replication is true and using the derived correlations from the original manuscript, we applied Fisher's z-transformation (Fisher, 1915) to calculate a pointwise 95% prediction interval for the replication effect size given the original effect. The 95% prediction interval is tanh(tanh^{-1}(r_orig) ± z_0.975 · sqrt(1/(n_orig − 3) + 1/(n_rep − 3))), where r_orig is the correlation estimate in the original study; n_orig and n_rep are the sample sizes in the original and replication studies; and z_0.975 is the 97.5% quantile of the standard normal distribution (Supplementary Methods). The prediction interval accounts for variation in both the original study and in the replication study through the sample sizes incorporated in the expression of the standard error.
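A minimal sketch of this interval in Python follows; the function name and example numbers are ours for illustration and are not taken from the paper's code.

```python
import numpy as np
from scipy import stats

def prediction_interval(r_orig, n_orig, n_rep, level=0.95):
    """Prediction interval for the replication correlation, built on the
    Fisher z scale and transformed back to the correlation scale."""
    z = np.arctanh(r_orig)                                # Fisher z of the original estimate
    se = np.sqrt(1.0 / (n_orig - 3) + 1.0 / (n_rep - 3))  # combines variation from both studies
    crit = stats.norm.ppf(1 - (1 - level) / 2)            # z_0.975 when level = 0.95
    return np.tanh(z - crit * se), np.tanh(z + crit * se)

# Example with made-up numbers: original r = 0.30 with n = 50, replication n = 80.
print(prediction_interval(0.30, 50, 80))
```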
We observe that for the 92 studies where a replication correlation effect size could be calculated, 69 (or 75%) were covered by the 95% prediction interval based on the original correlation effect size (Figure 1). In two cases, the replication effect was actually larger than the upper bound of the 95% prediction interval. Given the asymmetric nature of the comparison, one might consider these effects as having "replicated with effect clear". We then estimate that 71/92 (or 77%) of replication effects are in or above the 95% prediction interval based on the original effect. Some of the effects that changed signs upon replication still fell within the 95% prediction intervals calculated based on the original effects. This is unsurprising in light of the relatively small sample sizes and effects in both the original and replication studies (Figure S2).
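The classification of replication effects as below, inside, or above the interval can be sketched as follows. The arrays here are small hypothetical stand-ins for the 92 study pairs, not the actual data, and the interval function simply re-implements the Fisher z formula above.

```python
import numpy as np
from scipy import stats

def prediction_interval(r_orig, n_orig, n_rep, level=0.95):
    # Fisher z based prediction interval, as in the formula above.
    z = np.arctanh(r_orig)
    se = np.sqrt(1.0 / (n_orig - 3) + 1.0 / (n_rep - 3))
    crit = stats.norm.ppf(1 - (1 - level) / 2)
    return np.tanh(z - crit * se), np.tanh(z + crit * se)

# Hypothetical study pairs: original and replication correlations and sample sizes.
r_orig = np.array([0.31, 0.22, 0.45, 0.10])
r_rep  = np.array([-0.03, -0.005, 0.60, 0.40])
n_orig = np.array([53, 110, 40, 500])
n_rep  = np.array([72, 222, 60, 500])

lo, hi = prediction_interval(r_orig, n_orig, n_rep)
inside = (r_rep >= lo) & (r_rep <= hi)   # covered by the interval
above = r_rep > hi                       # "replicated with effect clear"
print(f"covered: {inside.mean():.0%}; covered or above: {(inside | above).mean():.0%}")
```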
Figure 1. 95% prediction intervals suggest most replication effects fall in the expected range.
Original effects on the correlation scale (x-axis) are plotted against replication effects (y-axis). Each vertical line is the 95% prediction interval based on the original effect size. Replication effects may fall below (pink), inside (gray), or above (blue) the 95% prediction interval.
We note here that of the 69 replication effect sizes that were covered by the 95% prediction interval, two replications showed a slightly negative correlation (−0.005, −0.034) as compared to a positive correlation in the original study (0.22, 0.31, respectively). In the first study, the original and replication sample sizes were 110 and 222; in the second study, they were 53 and 72. We would classify these two studies as "replicated with ambiguous effect" as opposed to "replicated with effect clear" due to the change in direction of the effect, although both are very close to zero. All other negative replication effects did not fall into the 95% prediction intervals, and hence were considered "did not replicate".
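As a quick check using the prediction interval formula above, for the first of these studies the lower bound is tanh(tanh^{-1}(0.22) − 1.96 · sqrt(1/(110 − 3) + 1/(222 − 3))) ≈ tanh(0.224 − 0.231) ≈ −0.008, so the interval extends just below zero and the slightly negative replication estimate (−0.005) is covered.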
We also considered the 73 studies the authors reported to be based on one degree of freedom tests. In 51 of these 73 studies (70%), the replication effect was inside the 95% prediction interval. The same two cases where the replication effect exceeded the 95% prediction interval were in this set, leaving us with an estimate that 53/73 (73%) of these studies had replication effects consistent with the original effects.
Based on the theory of the prediction interval, we expect about 2.5% of the replication effects to be above and 2.5% of the replication effects to be below the prediction interval bounds. Since about 23% were below the bounds, this suggests that not all effects replicate or that there were important sources of heterogeneity between the studies that were not accounted for. The key message is that replication data - even for studies that should replicate - are subject to natural sampling variation in addition to a host of other confounding factors.
It is notable that almost all of the replication study effect sizes were smaller than the original study effect sizes, whether or not they fell within the 95% prediction interval. In the original set of 92 studies, among those where the replication effect falls within the 95% prediction interval (69 studies), 55/69 (80%) had a replication effect size that was smaller than the original effect size.
There is almost certainly some level of publication bias in the original estimates r_orig. This bias means that the difference between the original and replication estimates will have a non-zero expectation. If we make the reasonable assumption that people usually report larger effects, then the bias in this quantity will be positive. Based on the calculation (in Supplementary Methods), our prediction intervals are less likely to cover the true value when bias exists in the original studies. This is likely the reason for some of the discrepancy between our observed and expected coverage of prediction intervals.
This speaks to the notion that there are likely a host of biases that pervade the original studies, pertaining mostly to the desire to report a statistically significant effect - even if it is small or unlikely to replicate (Gelman & Weakliem, 2009). In this sense, our analysis complements the findings of the Open Science Collaboration while simultaneously providing some additional perspective on the expectation of replicability.
Conclusion
We need a new definition of replication that acknowledges variation in both the original study and in the replication study. Specifically, a study replicates if the data collected from the replication are drawn from the same distribution as the data from the original experiment. To definitively evaluate replication we will need multiple independent replications of the same study. This view is consistent with the long-standing idea that a claim will only be settled by a scientific process rather than a single definitive scientific paper. We support Registered Replication Reports (Simons, Holcombe, & Spellman, 2014) and other such policies that incentivize researcher contributions to these efforts.
The Reproducibility Project: Psychology study highlights the fact that effects may be exaggerated and that replicating a study perfectly is challenging. We were caught off guard by the immediate and strong sentiment that psychology and other sciences may be in crisis (Gelman & Loken, 2014). The fact that many effects fall within the predicted ranges despite the long interval between the original and replication studies, the complicated nature of some of the experiments, and the differences in populations and investigators performing the studies is a reason for some guarded optimism about the scientific process. It is also in line with estimates we have previously made about the rate of false discoveries in the medical literature (Jager & Leek, 2014).
Nonetheless, our analysis also makes two more general points about studying replication in psychological science. First, replication should consider both the variability in the original study and the variability in the replication study. When both are considered, studies may replicate statistically in ways that are unintuitive. For example, replication effects with opposite signs may still be statistically consistent with the original study.
Second, our work highlights the critical importance of good study design and sufficient sample sizes, both when performing original research and when deciding which studies to replicate. Our work shows that studies with small sample sizes - like many in the Reproducibility Project: Psychology - will produce wide prediction intervals. Although this may mean that the replication estimates will be statistically consistent with the original estimates, they may not be very informative. This means that replication of studies that are poorly designed or insufficiently powered may not tell us much about replication. But if the replication is well designed and powered, it may tell us something about whether the effect appears to be there at all.
We stress that the approach outlined here is easily applied when the result of interest in a study can be summarized by one value upon which we can place distributional assumptions. In reality, most scientific studies are more complex, dealing in multiple stimuli (Westfall, Kenny, & Judd, 2014), adaptation over time and circumstance (Berry, 2011), and complicated data sources (Cardon & Bell, 2001), to name only a few. Our suggestion of 95% prediction intervals to help assess replication is meant to establish a conceptual framework and motivate researchers to begin considering what is a reasonable expectation for a replicated result. Extending these concepts to modern study designs is the next step in understanding the replicability of scientific research.
Supplementary Material
References
- Anderson SF, Maxwell SE. There's more than one way to conduct a replication study: Beyond statistical significance. Psychological Methods. 2015. http://dx.doi.org/10.1037/met0000051
- Asendorpf JB, Conner M, De Fruyt F, De Houwer J, Denissen JJ, Fiedler K, et al. Recommendations for increasing replicability in psychology. European Journal of Personality. 2013;27(2):108–119.
- Baker M. Over half of psychology studies fail reproducibility test. 2015 Aug. http://www.nature.com/news/over-half-of-psychology-studies-fail-reproducibility-test-1.18248 [Online; posted 27-August-2015]
- Berry DA. Adaptive clinical trials: the promise and the caution. Journal of Clinical Oncology. 2011;29(6):606–609.
- Braver SL, Thoemmes FJ, Rosenthal R. Continuously cumulating meta-analysis and replicability. Perspectives on Psychological Science. 2014;9(3):333–342.
- Cardon LR, Bell JI. Association study designs for complex diseases. Nature Reviews Genetics. 2001;2(2):91–99.
- Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349(6251):aac4716.
- Fisher RA. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika. 1915:507–521.
- Gelman A, Carlin J. Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science. 2014;9(6):641–651.
- Gelman A, Loken E. The statistical crisis in science. American Scientist. 2014;102(6):460.
- Gelman A, Weakliem D. Of beauty, sex and power: Too little attention has been paid to the statistical challenges in estimating small effects. American Scientist. 2009:310–316.
- Goodman SN. A comment on replication, P-values and evidence. Statistics in Medicine. 1992;11(7):875–879.
- Ioannidis JP. Contradicted and initially stronger effects in highly cited clinical research. JAMA. 2005;294(2):218–228.
- Jager LR, Leek JT. An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics. 2014;15(1):1–12.
- Ledgerwood A. Introduction to the special section on advancing our methods and practices. Perspectives on Psychological Science. 2014;9(3):275–277.
- Leek JT, Peng RD. Statistics: P values are just the tip of the iceberg. Nature. 2015;520(7549):612.
- Maxwell SE. The persistence of underpowered studies in psychological research: causes, consequences, and remedies. Psychological Methods. 2004;9(2):147.
- McShane BB, Böckenholt U. You cannot step into the same river twice: When power analyses are optimistic. Perspectives on Psychological Science. 2014;9(6):612–625.
- Patil P, Leek JT. Reporting of 36% of studies replicate in the media. 2015 Sep. https://github.com/jtleek/replication_paper/blob/gh-pages/in_the_media.md [Online; updated 16-September-2015]
- Peng RD. Reproducible research in computational science. Science. 2011;334(6060):1226.
- Peng RD, Dominici F, Zeger SL. Reproducible epidemiologic research. American Journal of Epidemiology. 2006;163(9):783–789.
- Schönbrodt FD, Perugini M. At what sample size do correlations stabilize? Journal of Research in Personality. 2013;47(5):609–612.
- Simons DJ, Holcombe AO, Spellman BA. An introduction to Registered Replication Reports at Perspectives on Psychological Science. Perspectives on Psychological Science. 2014;9(5):552–555.
- Stanley DJ, Spence JR. Expectations for replications: Are yours realistic? Perspectives on Psychological Science. 2014;9(3):305–318.
- Westfall J, Kenny DA, Judd CM. Statistical power and optimal design in experiments in which samples of participants respond to samples of stimuli. Journal of Experimental Psychology: General. 2014;143(5):2020.
Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4968573/