A hallmark of science is reproducibility. A finding is promoted from anecdote to scientific evidence if it can be reproduced (Lykken, 1968; Popper, 1959). There is growing awareness that problems exist with reproducibility in psychology. A recent estimate is that fewer than half of the findings in cognitive and social psychology are reproducible (Open Science Collaboration, 2015). In addition, there have been several high-profile, preregistered, multi-lab failures to replicate well-known effects in psychology (Eerland et al., 2016; Hagger et al., 2016; Wagenmakers et al., 2016). A similar multi-lab replication that was considered successful yielded an effect size that was much smaller than the original (Alogna et al., 2014). These findings have engendered pessimism about reproducibility.

Coincident with the start of the reproducibility debate was the advent of online experimentation. Crowd-sourcing websites, such as Amazon Mechanical Turk, offered the prospect of more efficient, powerful, and generalizable ways of testing psychological theories (Buhrmester, Kwang, & Gosling, 2011). The lower monetary cost and greater time efficiency of conducting experiments online rather than in a physical lab allowed researchers to recruit larger numbers of participants across broader geographical, age, and educational ranges than is possible with undergraduate samples (Paolacci & Chandler, 2014). However, online experimentation presents challenges, typically associated with the loss of control over the testing environment and conditions (Bohannon, 2016). Most relevant to the reproducibility debate, online participant pools are large but not infinite, and hundreds of studies are conducted on the same participant pool every day, familiarizing participants with study materials and procedures (Chandler, Mueller, & Paolacci, 2014; Stewart et al., 2015). Of particular concern for reproducibility, participants may take part in studies in which they have participated before. A recent preregistered study found sizable reductions in decision-making effects among participants who had previously participated in the same studies, suggesting that nonnaïve participants may pose a threat to reproducibility (Chandler et al., 2015). Indeed, nonnaïve participants have been implicated in failures to replicate and declining effect sizes (DeVoe & House, 2016; Rand et al., 2014).

Although concerns with reproducibility span the entire field of psychology and beyond, results in cognitive psychology are typically considered comparatively robust (Open Science Collaboration, 2015). We put a sample of these findings to a particularly stringent test by running them under circumstances that are increasingly representative of current practices of data collection but that are also documented as challenging for reproducibility. In particular, we conducted the first preregistered replication of a large set of cognitive psychological effects in the most popular online participant pool (see Crump, McDonnell, & Gureckis, 2013, and Zwaan & Pecher, 2012, for non-preregistered replications on MTurk). Most importantly, we examined whether reproducibility depends on participant nonnaïveté by conducting the same experiments twice on the same participants a few days apart.

Research suggests that access to knowledge obtained from previous participation (e.g., from alternative conditions or elaboration) can affect people’s responses and may reduce effect sizes when participants accordingly adjust their intuitive responses towards what is perceived as normatively correct (Chandler et al., 2015). However, studies in cognitive psychology typically have nontransparent research goals, making memory of previous experiences irrelevant. Accordingly, a reduction of effect size due to repeated participation should be close to zero.

We tested the hypothesis that cognitive psychology is relatively immune to nonnaïveté effects in a series of nine preregistered experiments (https://osf.io/shej3/wiki/home/; see Table 1 for descriptions of each experiment). We selected these experiments for the following reasons. First, we wanted a broad coverage of cognitive psychology. Therefore, we selected three experiments each from the domains of perception/action, memory, and language, arguably the major areas in the field of cognitive psychology. Second, we selected findings that are both well known and known to be robust. After all, testing immunity to nonnaïveté effects presupposes that one finds effects in the first place. Third, we selected tasks that lend themselves to online testing. And fourth, we selected tasks that our team had experience with.

Table 1 Brief descriptions of and references to all replicated experiments

Although these findings have proven to be highly reproducible in the laboratory, their robustness in an online environment has not yet been established in preregistered experiments. More importantly, it is unknown whether these findings are robust to the presence of nonnaïve participants. We tested this in the most conservative case, in which all participants had encountered the study before.

General method

Detailed method descriptions for each experiment can be found in the Supplementary Materials. Participants were tested in two waves using the Mechanical Turk platform. Approval for data collection was obtained from the Institutional Review Board in the Department of Psychology at Erasmus University Rotterdam. All experiments were programmed in Inquisit. The Inquisit scripts used for collecting the data can be found at https://osf.io/ghv6m/. At the end of wave 1 of each experimental task, participants were asked to provide the following information: age, gender, native language, and education. At the end of both waves, we asked the following questions, each of which could be answered by selecting one of the alternatives “not at all,” “somewhat,” or “very much”: “I’m in a noisy environment”; “There are a lot of distractions here”; “I’m in a busy environment”; “All instructions were clear”; “I found the experiment interesting”; “I followed the instructions closely”; “The experiment was difficult”; “I did my best on the task at hand”; “I was distracted during the experiment.”

In all experiments, we created different versions of the materials and, in some cases, of the key assignments, so that stimulus materials and key assignments were counterbalanced. Participants were randomly assigned to one of the versions when they participated in wave 1. Upon returning 3 or 4 days later for wave 2, half of the participants were assigned to the exact same version of the experiment, and the other half were assigned to a different version such that there was zero overlap between the stimuli in the first and second wave. Participants who had participated in one of the experiments were not prohibited from participating in the other experiments.
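As a concrete illustration, the following minimal sketch shows one way this assignment logic could be realized. It is not the authors' implementation (the experiments were run in Inquisit); the names `N_VERSIONS`, `assign_wave1`, and `assign_wave2` are hypothetical.

```python
import random

N_VERSIONS = 2  # hypothetical number of counterbalancing versions

def assign_wave1(participant_ids):
    """Randomly assign each participant to a counterbalancing version at wave 1."""
    return {pid: random.randrange(N_VERSIONS) for pid in participant_ids}

def assign_wave2(wave1_versions):
    """At wave 2, half of the returning participants get the identical version;
    the other half get the complementary version, so the stimuli do not overlap."""
    returning = list(wave1_versions)
    random.shuffle(returning)
    half = len(returning) // 2
    same, different = returning[:half], returning[half:]
    wave2 = {pid: wave1_versions[pid] for pid in same}
    wave2.update({pid: (wave1_versions[pid] + 1) % N_VERSIONS for pid in different})
    condition = {pid: "same" for pid in same}
    condition.update({pid: "different" for pid in different})
    return wave2, condition
```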

Sampling plan

For each experiment, we started by recruiting 200 participants: 100 on Monday and 100 on Thursday. Three or four days after the first participation, each participant was invited to participate again. Our goal was a final sample size of 80 participants per condition (same items or different items on the second occasion), taking into account nonresponses and the exclusion criteria below. Whenever we ended up with fewer than 80 participants per condition, we recruited another batch. Because we expected a null effect for the crucial interactions, power analyses could not be used to determine our sample sizes; such analyses require that one predicts an effect and has strong arguments for its magnitude. We therefore decided to obtain more observations than is typical of previous experiments examining the same effects, which makes our parameter estimates relatively precise.

Exclusion criteria

Data from participants with an accuracy below 80% in the RT tasks, an accuracy below 10% in the memory tasks, or a mean reaction time (RT) longer than the group mean + 3 SD were excluded. Data from each participant in the RT tasks were trimmed by excluding trials on which the RT deviated more than 3 SD from that participant's mean. From the remainder, participants were excluded (starting with those who participated last) to create equal numbers of participants per counterbalancing version.
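The sketch below illustrates the participant- and trial-level exclusion rules for an RT task. It is an assumed, minimal rendering of the rules stated above, not the authors' analysis code; the column names `subject`, `rt`, and `correct` are hypothetical, and the final step of equalizing counterbalancing cells is omitted.

```python
import pandas as pd

def apply_exclusions(trials: pd.DataFrame) -> pd.DataFrame:
    # 1. Exclude participants with accuracy below 80%.
    acc = trials.groupby("subject")["correct"].mean()
    keep = acc[acc >= 0.80].index
    trials = trials[trials["subject"].isin(keep)]

    # 2. Exclude participants whose mean RT exceeds the group mean + 3 SD
    #    (computed over the per-participant mean RTs).
    subj_mean_rt = trials.groupby("subject")["rt"].mean()
    cutoff = subj_mean_rt.mean() + 3 * subj_mean_rt.std()
    keep = subj_mean_rt[subj_mean_rt <= cutoff].index
    trials = trials[trials["subject"].isin(keep)]

    # 3. Trim trials deviating more than 3 SD from that participant's mean RT.
    def trim(group):
        m, sd = group["rt"].mean(), group["rt"].std()
        return group[(group["rt"] - m).abs() <= 3 * sd]

    return trials.groupby("subject", group_keys=False).apply(trim)
```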

Participants were recruited via Amazon Mechanical Turk. They participated in two waves, held approximately 3 days apart. In the second wave, half of the participants took part in an exact copy of the experiment they had participated in before; the other half took part in a version with identical instructions and procedure but different stimuli. A recent study demonstrated that certain findings replicated with the same but not with a different set of (similar) stimuli (Bahník & Vranka, 2017). Our manipulation allowed us to examine whether changing the surface features of an experiment (i.e., the stimuli) affects the reproducibility of its effect in the same sample of participants. Each experiment had a sample size of 80 per between-subjects condition (same stimuli vs. different stimuli).

General results

Detailed results per experiment are described in the Supplementary Materials. Data for all experiments can be found here: https://osf.io/b27fd/. The results can be summarized as follows. First, the first wave yielded highly significant effects for all nine experiments, with Bayes factors in excess of 10,000 in support of the prediction in each case. Second, each effect was replicated during the second wave. Third, effect size did not vary as a function of wave; Bayes factors showed moderate to very strong support for the null hypothesis. Fourth, it did not matter whether participants had previously completed the exact same experiment or one with different stimuli. The main results are summarized in Fig. 1. The x-axis displays the wave-1 effect sizes and the y-axis the wave-2 effect sizes. The blue dots indicate the same-stimuli condition and the red dots the different-stimuli condition. The numbers indicate the specific experiment (e.g., 5 = false memory).
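For concreteness, the sketch below shows one standard way to compute a within-subject effect size per wave: Cohen's d based on the paired differences (mean difference divided by the SD of the differences). The actual effect sizes in Fig. 1 were computed in JASP; the variable names and simulated numbers here are purely illustrative and are not the study data.

```python
import numpy as np

def cohens_d_paired(cond_a, cond_b):
    """Cohen's d for a within-subject contrast: mean of the paired
    differences divided by the standard deviation of the differences."""
    diff = np.asarray(cond_a) - np.asarray(cond_b)
    return diff.mean() / diff.std(ddof=1)

# Hypothetical per-participant mean RTs in the two conditions of one task,
# measured separately in each wave (simulated values, 80 participants).
rng = np.random.default_rng(0)
wave1_a, wave1_b = rng.normal(600, 50, 80), rng.normal(630, 50, 80)
wave2_a, wave2_b = rng.normal(580, 50, 80), rng.normal(610, 50, 80)

d_wave1 = cohens_d_paired(wave1_b, wave1_a)
d_wave2 = cohens_d_paired(wave2_b, wave2_a)
print(f"wave 1 d = {d_wave1:.2f}, wave 2 d = {d_wave2:.2f}")
```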

Fig. 1

Wave 1 effect size versus wave 2 effect size (Cohen’s d). Effect sizes were computed in JASP (JASP Team, 2017). The diagonal line represents equal effect sizes. For each experiment, separate effect sizes are plotted for same materials between sessions (blue solid dots) and different materials between sessions (red striped dots). Labels correspond to the experiments listed in Table 1.

In the preregistration, we stated that “Bayesian analysis will be used to determine whether the effect size difference between waves 1 and 2 better fits a 0% reduction model or a 25% reduction model.” However, the absence of a reduction in effect sizes from wave 1 to wave 2—the wave 2 effect sizes were, if anything, larger than the wave 1 effect sizes—rendered the planned analysis meaningless. We therefore did not conduct this analysis.

General discussion

Overall, these results present good news for the field of psychology. In contrast to findings in other parts of the field (Chandler et al., 2015), the effects we studied were reproducible in samples of nonnaïve participants, which are increasingly becoming a staple of psychological research. What the tasks used in this research have in common is that they (1) use within-subjects designs and (2) have opaque goals. Although it is clear that participants may learn something from their previous experience with the experiments (e.g., response times were often faster in wave 2 than in wave 1), this learning did not extend to the nature of the manipulation. We should note that it is not impossible that some of our participants had previously participated in similar experiments. For these participants, wave 1 would actually be wave N+1 and wave 2 would be wave N+2. Nevertheless, it appears that the tasks used in this study are so constraining that they encapsulate behavior from contextual variation, and even from recent relevant experiences, to yield highly reproducible effects. We should add a note of caution. What we have examined are the basic effects obtained with each of these paradigms. In the literature, one often finds variations designed to examine how the basic effect varies as a function of some other factor, such as manipulations of instructions, stimulus materials (e.g., emotional vs. neutral stimuli), or subject population (patients vs. controls), or the addition of a secondary task. The jury is still out on whether such secondary findings are as robust as the more basic findings we have presented here.