One way to examine the nature of human information processing is to explore the impact of stimuli on the reaction to subsequent stimuli. In sequential priming paradigms, briefly shown prime stimuli are followed by target stimuli, to which participants must respond. Differences in target performance as a function of the relationship between primes and targets can provide insight into the functional structure of semantic memory as well as into the processes involved in the early encoding of stimuli (Spruyt, Gast, & Moors, 2011). While there are many ways in which prime and target stimuli might be related, most studies have looked at the impact of evaluative relatedness (e.g., sun-friendship), associative relatedness (e.g., bread-butter), and/or category relatedness (e.g., goat-dog; for a more detailed discussion, see Hutchison, 2003).

In one variant of this paradigm, the pronunciation (or naming) task is used. In this task, stimuli can be words or pictures, and participants are to read out or name the target. If prime and target are semantically related, pronunciation is often facilitated (i.e., responses to the target are faster and more accurate) relative to the case in which they are not related. This is called a semantic priming effect. However, not all types of prime-target relationships necessarily influence responses to the targets. Facilitation via associative relationships is well established (see Hutchison, 2003; Neely, 1991). In contrast, experiments investigating evaluative prime-target relationships have thus far produced mixed results (e.g., Bargh, Chaiken, Raymond, & Hymes, 1996; Klauer & Musch, 2001; Spruyt, Hermans, Pandelaere, De Houwer, & Eelen, 2004; for categorical relationships, see Hutchison, 2003; Lucas, 2000; Williams, 1996). Importantly, some experimental manipulations appear to increase the chances of obtaining the evaluative priming effect in the pronunciation/naming task. For instance, researchers have employed pictures to increase prime salience (e.g., Spruyt & Hermans, 2008) or degraded targets to increase processing difficulty (e.g., De Houwer, Hermans, & Spruyt, 2001). To account for such findings, it has been argued that feature-specific attention allocation might play an important role in pronunciation tasks. More specifically, it was hypothesized that (task-irrelevant) semantic prime information will affect target responding only if and to the extent that it is selectively attended to (Everaert, Spruyt, & De Houwer, 2011; Spruyt, De Houwer, & Hermans, 2009; Spruyt, De Houwer, Hermans, & Eelen, 2007; see also Kiefer & Martens, 2010, and Storbeck & Robinson, 2004, for related arguments).

Importantly, this line of research is part of a broader debate in cognitive psychology that has been conducted over the past 20 years. Typically, associative priming effects have been explained by mechanisms of early-stage cognitive processing: Due to spreading of activation, the very encoding of a target stimulus is assumed to proceed more efficiently when it is preceded by an associatively related prime stimulus than when it is preceded by an associatively unrelated prime stimulus. Some theorists have argued that similar mechanisms may be at play with other prime-target relationships such as evaluative congruency (e.g., Wentura & Rothermund, 2003). For example, Bargh et al. (1996) suggested that the mere perception of a prime stimulus is sufficient to activate all other concepts of the same valence in memory. Others, however, have argued that evaluative incongruency of prime and target results in an internal conflict that consumes cognitive resources (Gast, Werner, Heitmann, Spruyt, & Rothermund, 2014; Hermans, Van den Broeck, & Eelen, 1998), leading to impaired task performance. Finally, some scholars have questioned the assumption that evaluative information can influence subsequent early-stage encoding processes (e.g., Klauer, Roßnagel, & Musch, 1997). In their view, facilitated target responding (“evaluative priming”) will occur only if primes and targets can, in principle, elicit the same response tendency or response tendencies with feature overlap. For instance, in the standard evaluative priming task (Fazio, Sanbonmatsu, Powell, & Kardes, 1986), participants are instructed to categorize targets as positive or negative. Here, primes and targets might activate the same or different response tendencies, thereby facilitating or hindering target performance, respectively. This response-level account of evaluative priming has received much empirical and theoretical support (e.g., De Houwer, Hermans, Rothermund, & Wentura, 2002; Klinger, Burton, & Pitts, 2000; Spruyt et al., 2007; Wentura, 1999).

In the pronunciation task, primes never elicit the same response as targets do. Accordingly, response-level accounts cannot explain the occurrence of semantic (e.g., evaluative) priming effects in this task. This task feature and other procedural aspects (Klauer, Becker, & Spruyt, 2016) make the task an interesting test case for the question of whether priming effects exist at the encoding level. Spruyt and colleagues (2007, 2009; see also Spruyt, De Houwer, Everaert, & Hermans, 2012) tested this hypothesis in a series of experiments, assuming that attention needs to be allocated to the relevant stimulus dimension for such priming effects to emerge, as elaborated above. In the experiment revisited in this manuscript (Spruyt et al., 2009, Exp. 3), attention allocation was manipulated by means of induction trials that required half of the participants to assess the targets’ category membership with respect to two non-evaluative categories (i.e., objects and persons). For the other participants, the induction trials required an evaluative judgment of the targets (i.e., positive vs. negative). Induction trials and pronunciation trials were presented in an intermixed, random order. In the crucial pronunciation trials, participants from the category assessment group showed a categorical priming effect (i.e., object primes primed object targets, and person primes primed person targets) and no evaluative priming effect. Conversely, participants from the valence assessment group showed an evaluative priming effect and no categorical priming effect. Spruyt and colleagues (2009) interpreted this pattern of results as indicating (a) that the semantic analysis of task-irrelevant stimuli is critically dependent upon dimension-specific attention allocation and (b) that, in consequence, semantic priming effects arise along the attended dimension in the pronunciation task.

Because of the theoretical relevance of the research question and the frequent difficulties in replicating non-associative priming effects in the pronunciation task (see above), it is desirable to replicate these results in a different laboratory. For this purpose, we engaged in an adversarial collaboration (e.g., Kahneman, 2003), cooperating as a team of researchers with competing hypotheses about the outcome of the experiments. Results of a first pilot study reported below failed to replicate the original finding by Spruyt and colleagues (2009), but the statistical power of this study may have been too low, and there was one potentially critical discrepancy between the two studies.

Importantly, the current study is not designed to question the role of attention in stimulus processing. The crucial question in this research project is whether selective attention is sufficient to produce “automatic” semantic priming effects, even in the absence of direct response competition. As has been argued by Werner and Rothermund (2013), a combination of selective attention and dimension-relevant response compatibilities might be necessary to produce the respective semantic priming effects. However, the debate on this question, which is also at the heart of this article, persists (Rothermund & Werner, 2014; Spruyt, 2014; Spruyt & Tibboel, 2015).

Spruyt, De Houwer, and Hermans’s (2009) original study

Procedure

As described above, Spruyt and colleagues (2009, Exp. 3) implemented two types of trials in their experiment: In only 25 % of the trials were participants to pronounce the targets. In the remaining trials, participants were to categorize the stimuli according to their category membership: In the affective categorization condition, stimuli were 48 positive and 48 negative words; in the non-affective semantic categorization condition, stimuli were 48 words referring to objects and 48 words referring to humans. Half of the words were randomly selected as primes and half as targets, with an equal number of primes and targets from each affective or semantic category. The categorization trials were marked by a cue, that is, a green rectangle enclosing the target. In the pronunciation trials, no such cue was given. The stimuli for these trials were 16 positive and 16 negative words, evenly referring to either objects or humans, randomly selected from a total of 64 words, with equal numbers of primes and targets in each valence × semantic category combination (see Spruyt et al., 2009, Appendix). The procedures and the data preparation are further described in the online supplement.
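For illustration, the stratified random assignment of words to prime and target roles described above can be sketched as follows. This is a minimal sketch only; the word lists, category labels, and function name are placeholders and are not taken from the original stimulus set or scripts.

```python
# Sketch of the stratified random assignment of words to prime and target
# roles: within each category, half of the words become primes, half targets.
# Word lists and category labels are illustrative placeholders only.
import random

def split_primes_targets(words_by_category, seed=2009):
    rng = random.Random(seed)
    primes, targets = [], []
    for category, words in words_by_category.items():
        shuffled = list(words)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        primes.extend(shuffled[:half])    # first half serves as primes
        targets.extend(shuffled[half:])   # second half serves as targets
    return primes, targets

demo = {"positive": ["happy", "friend", "sun", "gift"],
        "negative": ["war", "pain", "loss", "fraud"]}
primes, targets = split_primes_targets(demo)  # two primes and two targets per category
```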

An analysis of variance with the factors categorization condition, stimulus dimension, and congruence revealed a significant three-way interaction between these factors, F(1, 52) = 7.64, p < .01, MSE = 144.65. Separate analyses per categorization condition revealed a significant evaluative priming effect in the affective categorization condition, F(1, 26) = 18.38, p < .01, MSE = 144.75, and a significant semantic priming effect in the semantic categorization condition, F(1, 26) = 8.92, p < .01, MSE = 145.49. All other effects were nonsignificant, Fs < 2.81, ps > .1.

Data reproducibility

Following the exact analysis strategy of the original authors, we were able to reproduce the original findings reported by Spruyt et al. (2009). Nevertheless, our analyses also revealed that the original report included an error, the correction of which was inconsequential for the main pattern of findings. A more detailed description of the reanalysis is provided in the online supplement. We then started with a first replication attempt.

Pilot study

The existence of the evaluative priming effect in the pronunciation task is the topic of an ongoing debate, and the main interest of the first two authors is in this effect. Based on the published results, we calculated an effect size of d = .825 for the evaluative priming effect in the target experiment. Therefore, a sample of N = 40 (n = 20 in the affective categorization condition) ensured a statistical power of .97 to detect the priming effect observed in the original study by means of a one-tailed test. In the meantime, support and development of Affect 3.0, the software with which the original experiment had been programmed, had been discontinued, so the third author programmed a new version of the experiment with Affect 4.0 (Spruyt, Clarysse, Vansteenwegen, Baeyens, & Hermans, 2010). In this way, procedural details of the experiment were believed to be identical, with the exception of a blue instead of a green rectangle as the cue for the categorization trials and the translation of the stimuli into German.
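The reported power value can be verified with a short calculation. The sketch below assumes a one-tailed paired-design (one-sample) t-test on the priming scores at α = .05 and uses statsmodels for illustration, which is not necessarily the tool used for the original calculation.

```python
# Power check for the pilot study, assuming a one-tailed paired (one-sample)
# t-test on each participant's priming score; statsmodels is used here for
# illustration only.
from statsmodels.stats.power import TTestPower

power = TTestPower().solve_power(effect_size=0.825,      # d reported above
                                 nobs=20,                # participants per condition
                                 alpha=0.05,
                                 alternative='larger')   # one-tailed test
print(round(power, 2))  # approximately .97
```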

Data analysis revealed that the crucial three-way interaction between condition, stimulus dimension, and congruence failed to reach significance, F < 1, ηp² = .02. Likewise, the interaction between stimulus dimension and congruence was unreliable in both conditions, in particular in the evaluative categorization condition, which is of main interest to the first two authors, F < 1, ηp² = .02. Mean latencies per condition can also be seen in Table 1.

Table 1 Mean priming effects for each condition of the Pilot Study (in ms, 95 % confidence intervals in square brackets). Mean response latencies as a function of condition and congruence (error percentages in parentheses)

The third author voiced concern, however, that the study might actually have been underpowered: While the target study showed a large effect size, the structurally similar experiment preceding it (Exp. 1) showed a considerably lower effect size. A weighted average of the evaluative priming effects from the original Experiments 1 and 3 is d = .65; a weighted average of the categorical priming effects from the original Experiments 2 and 3 is d = .58. Discussions between the three authors also revealed one procedural difference between the two experiments: The new version of the experiment did not include a response deadline.
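As an illustration of how such weighted averages can be formed, the sketch below weights each experiment’s Cohen’s d by its sample size. The d of .825 with n = 27 per condition for Experiment 3 follows from the statistics reported earlier; the values for Experiment 1 are assumed placeholders, as they are not stated in the text above.

```python
# Sample-size-weighted mean of Cohen's d values; the Experiment 1 values are
# hypothetical placeholders, whereas d = .825 with n = 27 per condition for
# Experiment 3 follows from the statistics reported earlier.
def weighted_mean_d(ds, ns):
    return sum(d * n for d, n in zip(ds, ns)) / sum(ns)

d_exp1, n_exp1 = 0.50, 40    # hypothetical values for Experiment 1
d_exp3, n_exp3 = 0.825, 27   # derived from F(1, 26) = 18.38 in Experiment 3
print(round(weighted_mean_d([d_exp1, d_exp3], [n_exp1, n_exp3]), 2))
```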

Main study

To show that our procedures and instruments are in principle capable of capturing priming effects, we conducted a second pilot study (Klauer et al., in press). We tested for associative priming effects in the pronunciation task (cf. Neely, 1991). We found a reliable associative priming effect of 10 ms (SD = 12 ms) with an estimated effect size of d = .88; t(19) = 3.93, p < .01 (N = 20).

Having demonstrated the ability to detect priming effects in the pronunciation task with the instruments, participant population, and procedures used in our laboratory, we undertook a second attempt at replicating the original experiment of Spruyt and colleagues (2009). This time, the study was conducted with an experimental script that was basically identical to the original one used by Spruyt and colleagues (2009) and with a larger sample (N = 80 participants, leading to a power of .97 even for the weaker categorical semantic priming effect estimated at d = .58). The only difference was that the script was implemented in PsychoPy version 1.81 (Peirce, 2007). Note that we implemented a 3,000-ms response deadline to be as close to the original experimental program as possible. Also, experimenters emphasized speed during the initial instructions, as was done in the original experiment.
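The same kind of check as in the earlier sketch applies here, now for the weaker categorical priming effect and the larger main-study sample (again assuming a one-tailed paired-design test at α = .05, computed with statsmodels rather than the original, unstated tool).

```python
# Power check for the main study: n = 40 participants per condition and the
# weaker categorical priming effect of d = .58, one-tailed at alpha = .05.
from statsmodels.stats.power import TTestPower

print(round(TTestPower().solve_power(effect_size=0.58, nobs=40,
                                     alpha=0.05, alternative='larger'), 2))
# approximately .97
```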

At the very end of the experimental session, a short questionnaire was administered that captured the degree to which participants thought the experiment might have important implications for science. It was included at the request of the third author, as this variable had been found to moderate priming effects in similar experiments. The use and analysis of this measure was exploratory in nature.

Participants

A total of N = 89 participants was recruited at the University of Freiburg. The datasets of three participants were excluded prior to data analysis because these participants were not German native speakers, had already participated in the same experiment, or had been reported by the experimenters as not dealing with the task appropriately. Furthermore, five participants were excluded because their mean categorization accuracies were extreme outliers according to Tukey’s criterion (i.e., more than three times the interquartile range above the upper or below the lower quartile; Tukey, 1977). Finally, the dataset of one participant was excluded because the planned number of participants in the affective categorization condition had already been reached. The remaining participants were inconspicuous with regard to their mean latencies and pronunciation accuracies. The final sample consisted of N = 80 participants. They received partial course credit or 5 Euros for their participation.
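The outlier rule based on Tukey’s criterion can be illustrated with a small sketch; the accuracy values and the function name below are illustrative and are not taken from the actual analysis script.

```python
# Tukey's (1977) criterion with the 3 x IQR multiplier used in this study;
# the accuracy values are illustrative only.
import numpy as np

def extreme_outliers(values, k=3.0):
    """Flag values more than k interquartile ranges beyond the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

accuracies = np.array([0.97, 0.95, 0.96, 0.62, 0.98])  # illustrative values
print(extreme_outliers(accuracies))  # flags only the participant with .62
```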

Results

The preregistered main statistical and outlier analyses closely followed the original analyses (Spruyt et al., 2009; the raw data, the initial manuscript, and the analysis script can be found at the Open Science Framework, https://osf.io/qtdhg/). Only trials with correct pronunciation were included (erroneous responses amounted to 0.79 % of trials, voice-key failures to 0.38 % of trials, and 1.45 % of trials were lost because participants erroneously made a categorization response in pronunciation trials, leading to an overall exclusion of 2.62 % of trials). Trials with latencies above 3 s or below 150 ms were also excluded (another 0.31 % of trials). Finally, a conditional latency exclusion criterion employed by Spruyt et al. (2009) was applied, excluding 2.36 % of trials (see the online supplement for details).
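The fixed trial-exclusion steps (correct pronunciations only, latencies between 150 ms and 3 s) can be sketched as follows; the conditional latency criterion from the supplement is not reproduced here, and the column names are assumptions rather than those used in the OSF analysis script.

```python
# Fixed trial-exclusion steps: correct pronunciations only, latencies between
# 150 ms and 3,000 ms. Column names ("correct", "rt_ms") are assumptions.
import pandas as pd

def apply_fixed_exclusions(trials: pd.DataFrame) -> pd.DataFrame:
    return trials[(trials["correct"] == 1)
                  & (trials["rt_ms"] >= 150)
                  & (trials["rt_ms"] <= 3000)]

# Illustrative data only:
demo = pd.DataFrame({"correct": [1, 1, 0, 1], "rt_ms": [620, 3400, 540, 110]})
kept = apply_fixed_exclusions(demo)             # keeps only the first trial
excluded_pct = 100 * (1 - len(kept) / len(demo))
```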

An analysis of variance with the factors categorization condition, stimulus dimension, and congruence revealed no significant three-way interaction between these factors, F(1, 78) = 0.766, p = .384, ηp² = .01. There was a hint of a possible overall main effect of congruence, F(1, 78) = 2.496, p = .118, ηp² = .031. All other effects were nonsignificant, Fs < 1.16, ps > .28. Separate one-tailed analyses per categorization condition revealed no significant evaluative priming effect in the affective categorization condition, t(39) = 1.143, p = .13, d = .18, and no significant semantic priming effect in the semantic categorization condition, t(39) = −0.125, p = .55, d = −.02. For mean response latencies and priming effects, see Table 2.

Table 2 Mean priming effects for each condition of the Main Study (in ms, 95 % confidence intervals in square brackets). Mean response latencies as a function of condition and congruence (error percentages in parentheses)

Discussion

The attempt to replicate the conditional congruency effects found in Spruyt et al. (2009) was not successful. In line with an earlier report by Spruyt et al. (2012), there was no semantic priming effect in the semantic categorization condition. In addition, there was little evidence for an evaluative priming effect in the evaluative categorization condition. At least in this attempt, selective attention was not enough to produce “automatic” priming effects in the pronunciation task.

However, the slightly positive evaluative priming effect in the evaluative categorization condition leaves open the question of whether there is a rather small effect that could not be detected in our current design. It has been considered good practice (Tressoldi, 2012), and is desired by this journal (Wolfe, 2013), to base power calculations on empirical effect sizes. However, this practice has recently been questioned, especially when one is interested in whether there is any effect at all (e.g., Simonsohn, 2015). The effect sizes of the five studies that have examined the impact of dimensional attention on evaluative priming effects in the pronunciation task were very heterogeneous (ranging from d = −.05 to d = .83, as can be seen in our mini meta-analysis in Fig. 1). We cannot rule out the possibility that there is a small evaluative priming effect in this design, as is also implied by the meta-analytic weighted mean of d = .39, but the true mean is most likely much smaller than in Spruyt et al.’s (2009) studies.

Also, priming effects in the current research design might be biased, as it has been shown that paradigms requiring participants to switch between tasks are susceptible to distortions by task-switching effects (Klauer & Mierke, 2005; Klauer, Schmitz, Teige-Mocigemba, & Voss, 2010). Given the intermixed presentation of pronunciation trials and induction trials, it could be argued that the observed priming effects are a by-product of task-switching effects. More specifically, when participants undergo a pronunciation trial following an induction trial, it might be easier for them to disengage from the first task on congruent trials than on incongruent trials, thereby producing an artifactual evaluative priming effect. Hence, future endeavors in this paradigm might want to control for this: Contingent upon the results of our main study, we had actually planned a second study; see our online supplement for more information.
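For readers who wish to reproduce such a mini meta-analysis, the sketch below shows one common way to compute a fixed-effect weighted mean of within-subject d values, weighting each study by the inverse of its approximate sampling variance, var(d) ≈ 1/n + d²/(2n). The d and n values in the example call are illustrative placeholders, not the exact inputs to Fig. 1.

```python
# Fixed-effect weighted mean of within-subject effect sizes, weighting each
# study by the inverse of its approximate sampling variance,
# var(d) ~= 1/n + d**2 / (2 * n). The example values are placeholders.
def fixed_effect_mean_d(ds, ns):
    weights = [1.0 / (1.0 / n + d ** 2 / (2.0 * n)) for d, n in zip(ds, ns)]
    return sum(w * d for w, d in zip(weights, ds)) / sum(weights)

print(round(fixed_effect_mean_d(ds=[0.83, 0.40, -0.05], ns=[27, 30, 25]), 2))
```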

Fig. 1 Summary of the studies that examined the impact of dimensional attention on affective priming effects in the pronunciation task. ID 1: Spruyt et al. (2009, Exp. 1); ID 2: Spruyt et al. (2009, Exp. 3); ID 3: Spruyt et al. (2012); ID 4: this manuscript, pilot study; ID 5: this manuscript, main study

There are, of course, several explanations for the present failures to replicate the original findings reported by Spruyt et al. (2009). Besides chance (false positives in the original studies, false negatives in the replications), one might argue that experimenter demand or expectancy effects may have played a role (e.g., Klein et al., 2012). Note, however, that the experimenters in the present studies were not explicitly informed about the hypotheses under investigation. In addition, one might argue that linguistic differences between German and Dutch may have been responsible for the observed discrepancy between the two sets of studies. We have recently summarized empirical work, however, suggesting that the contribution of linguistic factors is probably quite small (Klauer et al., in press). In sum, more research is needed to further examine the extent to which evaluative information can impact early encoding processes, preferably using a diverse set of paradigms that mitigate or rule out task-switching effects (Gast et al., 2014; Spruyt, 2014; Spruyt & Tibboel, 2015; Werner & Rothermund, 2013; see also the online supplement).