Underpowered samples, false negatives, and unconscious learning
The scientific community has witnessed growing concern about the high rate of false positives and unreliable results within the psychological literature, but the harmful impact of false negatives has been largely ignored. False negatives are particularly concerning in research areas where demonstrating the absence of an effect is crucial, such as studies of unconscious or implicit processing. Research on implicit processes seeks evidence of above-chance performance on some implicit behavioral measure at the same time as chance-level performance (that is, a null result) on an explicit measure of awareness. A systematic review of 73 studies of contextual cuing, a popular implicit learning paradigm, involving 181 statistical analyses of awareness tests, reveals how underpowered studies can lead to failure to reject a false null hypothesis. Among the studies that reported sufficient information, the meta-analytic effect size across awareness tests was d_z = 0.31 (95 % CI 0.24–0.37), showing that participants’ learning in these experiments was conscious. The unusually large number of positive results in this literature cannot be explained by selective publication. Instead, our analyses demonstrate that these tests are typically insensitive and underpowered to detect medium to small, but true, effects in awareness tests. These findings challenge a widespread and theoretically important claim about the extent of unconscious human cognition.
Keywords: Contextual cuing · False negatives · Implicit learning · Null hypothesis significance testing · Statistical power
Research practices in the behavioral sciences are under scrutiny to an extent that would have been inconceivable 10 years ago. Much of the debate has concerned habits (such as “p-hacking” and the file-drawer effect) which can boost the prevalence of false positives in the published literature (Ioannidis, Munafò, Fusar-Poli, Nosek, & David, 2014; Simmons, Nelson, & Simonsohn, 2011). Much less attention has been paid to the harmful consequences of false negatives, namely reports which purport to present evidence supporting false null hypotheses (Fiedler, Kutzner, & Krueger, 2012). Via meta-analysis of a particular sub-literature within the field of implicit learning, we demonstrate how the use of underpowered experiments and Null Hypothesis Significance Testing (NHST) can combine to encourage the reporting of false negatives and consequent theoretical distortion.
When a researcher obtains a result that is significant at p < .05 and consequently reports that the null hypothesis is rejected, then of course we have learned something: That the likelihood of obtaining data at least as extreme as those that were observed, if the null hypothesis is true, is less than 5 %. Many would argue that we have not learned very much – for example, we have not learned that the null hypothesis is false or unlikely (Dienes, 2011; Fidler & Loftus, 2009). In contrast, when the researcher finds a result that is not significant (p > .05) and consequently concludes that the null hypothesis cannot be rejected, from the point of view of NHST we have learned literally nothing. We have not learned that the experimental hypothesis is false (the experiment may be underpowered) nor have we learned that the null hypothesis is true. Thus there is a sense in which any conclusions drawn from failures to reject the null hypothesis are intrinsically more problematic than those drawn from rejections of the null.
Underpowered studies are a major contributing factor to the reporting of both false positives and false negatives (Button et al., 2013). The power of typical studies in psychology, combined with typical effect sizes, indicates that the literature contains far more significant results than it should, suggesting that it is therefore biased in favor of significant findings (false positives) rejecting true null hypotheses (Francis, 2012). But low power might also contribute to the reporting of false negatives, when authors wish to demonstrate the absence of some effect. For instance, the absence of judgmental biases outside the laboratory (e.g., List, 2002), the absence of gender differences in math performance (e.g., Hyde, Lindberg, Linn, Ellis, & Williams, 2008), the absence of differences between studies run in the laboratory versus online (McGraw, Tew, & Williams, 2000), the absence of awareness in studies of implicit processing, and many other such influential claims depend on null effects which could potentially be false negatives if based on low-powered studies. NHST provides further impetus, in that its dichotomous nature (significant/nonsignificant at the arbitrary p = .05 cliff-edge) and focus on rejection of the null hypothesis encourage both researchers and students to interpret failure to reject the null hypothesis as implying that the null hypothesis is true (Hoekstra, Finch, Kiers, & Johnson, 2006). As Fidler and Loftus (2009) note, “this kind of almost irresistible logical slippage can, and often does, lead to all manner of interpretational mischief later on” (p. 29).
Confidence intervals (CIs) have an important role to play in the interpretation of null results (but see Hoekstra, Morey, Rouder, & Wagenmakers, 2014). If such intervals include zero but are narrow, then it can safely be concluded that the effect in question is either small or negligible in magnitude (though of course it cannot be concluded that the effect is non-existent). But if the intervals are wide, then little confidence can be placed on the null result and a motivation is provided for running larger sample sizes. Equally important is the role that meta-analysis can play in reaching valid conclusions across bodies of research featuring null results. Even though individual underpowered studies may fail to reject the null hypothesis, meta-analysis across a set of such studies may permit modest but real effects to be detected.
In the present research we illustrate these issues via a systematic review of a large body of studies within the field of implicit learning. These studies depend crucially on null results in awareness checks, because implicit learning by definition involves mental processing in the absence of awareness. As we show, the majority of these studies are underpowered to detect small but real awareness effects. We illustrate how the computation of CIs (and their graphic depiction) and meta-analysis can lead to radically different conclusions from those reached in the individual studies themselves. Our results challenge a theoretically crucial conclusion drawn from this body of research.
Null results as a crucial feature of research on implicit processing
Research on implicit processing provides an excellent example to illustrate the consequences of overreliance on NHST to gather support for the null hypothesis. In a typical experiment on implicit processing, participants’ performance on some task is above a baseline level, but this behavioral outcome is seemingly not accompanied by any awareness of the environmental cues or regularities that gave rise to the behavior. For instance, in research on subliminal perception, some form of behavior is primed by a briefly-flashed stimulus of which participants are unaware (e.g., Dehaene et al., 1998); research in neuropsychology suggests that perception, memory, and choices can be influenced by cues unconsciously in various patient populations (Bechara et al., 1995; Cohen & Squire, 1980; Goodale, Milner, Jakobson, & Carey, 1991); in research on behavior priming, some behavioral response such as voting intentions (Hassin, Ferguson, Shidlovski, & Gross, 2007), walking speed (Bargh, Chen, & Burrows, 1996), or answering general knowledge questions (Dijksterhuis & van Knippenberg, 1998) is influenced by a subtle cue without participants being aware of this influence; research on implicit moral judgments, emotions, and attitudes proposes that behaviors in each of these domains can again be influenced by environmental cues unconsciously (Bargh, 2006; Williams & Bargh, 2008), and so on. Usually the absence of awareness is inferred from a null result in an awareness test (Dienes, 2015). For example, participants might fail to detect stimuli in a forced-choice test or they might perform at chance when asked to exert some control over the cue’s influence on their behavior.
However, as mentioned above, null results in NHST are inherently ambiguous. They can mean either that the null hypothesis is true or that there is insufficient evidence to reject it. In the context of implicit processing experiments, this means that when an awareness test yields a non-significant result, this can indicate either that participants were really unconscious of the cue or that the awareness test is inadequate to permit a firm conclusion about whether participants were aware or not. Unfortunately, the statistical analyses reported in many implicit processing experiments are insufficient to test which of these two interpretations is more plausible. A Bayesian approach to statistical analysis might allow researchers to quantify to what extent null results reflect a real absence of effects or a lack of statistical sensitivity (Dienes, 2015; Rouder, Speckman, Sun, Morey, & Iverson, 2009). However, these Bayesian analyses are seldom conducted (or reported) on data from awareness tests. Furthermore, researchers sometimes report so little information in their statistical analyses that it is also difficult for other researchers to compute these Bayesian analyses on reported data.
Usually, the implicitness of this learning is assessed by means of a recognition test conducted at the end of the experiment. Participants are shown all the repeating patterns intermixed with new random patterns and are asked to report whether they have already seen each of those patterns. The learning effect found during the training phase is considered implicit if the number of patterns correctly recognized as old in the recognition test (hits) is no larger than the number of random patterns wrongly classified as old (false alarms), or if participants’ performance is at chance (50 % correct) overall. Another popular test used to assess whether learning was implicit is to ask participants to guess where the target would be in a search display where the target has been replaced by an additional distractor. If they perform at chance in this task, their learning about the repeating search configurations is again considered implicit. In both procedures, learning is assumed to be unconscious if a statistical comparison yields a null result.
However, as explained above, the statistical analyses typically conducted in these studies do not allow one to conclude that the null effects observed in the awareness tests reflect truly random performance. Meta-analysis across the whole body of experiments published in this domain permits us to check whether these null results reflect a real absence of awareness. Based on the relative proportions of significant results or on the overall trends of mean performance in awareness tests it is possible to measure to what extent the prevalence of null results reveals a genuine absence of awareness or merely insensitivity of statistical data in individual studies.
Proportion and distribution of significant results
To assess to what extent the null results observed in these analyses reflect a real absence of awareness or a mere lack of statistical sensitivity, we conducted a systematic review of the literature. As explained in Appendix 1, we included in our analyses all the experiments that found spatial contextual cueing and that included either of the two awareness tests explained above (i.e., a recognition test or a target guessing test).
By definition, research on implicit processing assumes that participants lack awareness of the relevant regularity, and accordingly 78.5 % of the awareness tests yielded nonsignificant (p > .05) differences. However, 21.5 % of the awareness tests did yield a significant difference, well above (binomial p < .001) the theoretical 5 % of false positives that should be observed if the one-tailed null hypothesis is true with a standard α = .05. This proportion of significant results becomes particularly striking if we take into account that most of these statistical contrasts actually relied on two-tailed t-tests, for which the theoretical proportion of false positives would be just 2.5 %. The proportion of significant (p < .05) or marginally significant (.05 < p < .10) results was 27.6 %, again above the theoretical 10 % that would be predicted on the null hypothesis given a one-tailed test, binomial p < .001.
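The binomial comparisons above can be sketched with an exact upper-tail probability. The counts used here (39 and 50 out of 181 tests) are assumptions inferred from the reported percentages and the total number of analyses; they are not stated in the text, and the function name is ours.

```python
from math import comb

def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Assumed counts, inferred from the reported percentages (not given in the text):
# 39 of 181 tests significant (21.5 %), 50 of 181 significant or marginal (27.6 %).
n_tests = 181
p_significant = binomial_upper_tail(39, n_tests, 0.05)
p_marginal = binomial_upper_tail(50, n_tests, 0.10)

print(f"P(>= 39 significant | rate .05) = {p_significant:.2e}")
print(f"P(>= 50 significant or marginal | rate .10) = {p_marginal:.2e}")
```

Under these assumed counts both tail probabilities fall far below .001, in line with the reported binomial tests.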
Regardless of the results of the inferential analyses, we also coded for each study whether participants performed numerically above chance (+1), exactly at chance (0), or below chance (-1) (see Appendix 1 for further details). The mean value of this direction score across experiments was 0.53 (95 % CI 0.41–0.66), far above the theoretical 0 that should be observed under the null hypothesis, t(165) = 8.468, p < .001, d_z = 0.66. The proportion of experiments scoring 1 was 66.9 %, significantly above 50 % in a binomial test, p < .001. Interestingly, within our database, the vast majority of experiments that reported a significant result had direction scores of 1. A logistic regression confirmed that there was a relationship between the direction scores and the probability of a significant result in the awareness tests, B = 1.37, SE B = 0.483, Wald = 8.114, Odds ratio = 3.95, Model χ²(1) = 16.11, p < .001. In other words, significant results were far more likely to be associated with numerically above- than below-chance performance in the awareness test.
Overall, these results are not consistent with the idea that the null hypothesis reflects the true distribution of results in the awareness tests. On a true null hypothesis (hits = false alarms in the awareness test, or performance equal to chance), only around 5 % of studies should yield a significant result, and the number of effects in the “explicit” direction should equal those in the wrong direction. There should be no tendency for significant awareness results to be more prevalent in one direction than the other.
Is there publication bias in the results of awareness tests?
However, it is still possible that the null hypothesis is true and that the unusually large number of significant results reflects a bias favoring the publication of significant results versus non-significant results. Even if participants perform at chance in the awareness test, occasionally the statistical analyses will yield a significant result by mere chance. If researchers or journals are biased towards publishing significant results, then the proportion of these in the published literature will exceed the theoretical proportion of false alarms that would be expected under the null hypothesis. Although this hypothesis might appear counterintuitive given that truly implicit learning requires null awareness, it is important to evaluate this possibility within the studies included in the meta-analysis.
Deviations from chance are more likely to occur in low quality experiments where the measurement error is larger (e.g., smaller samples or unreliable methods). That is to say, under the null hypothesis, large and significant effect sizes are more likely to be obtained in low- than in high-powered experiments. In meta-analyses, this trend is usually represented by means of a funnel plot representing the relationship between effect size and the measurement error. Unfortunately, it is difficult to draw a funnel plot with the information available in our dataset because many experiments did not report sufficient statistical information to compute effect sizes. For instance, standard errors and exact t-values were reported only in roughly half of the analyses. However, if publication bias were responsible for the unusually large number of significant results, then one would expect to find more significant results in low quality studies.
Sample size, defined as the number of participants, is another important determinant of the methodological quality of an experiment. Studies conducted on larger samples are more likely to yield results that converge to the true effect size. Figure 2B shows the relationship between sample size and statistical significance in contextual cuing experiments. The height of each bar represents the sample size of the study. As in the case of the previous analysis, a logistic regression suggests that the probability of finding a significant result grows with sample size, B = 0.024, SE B = 0.013, Wald = 3.247, Odds ratio = 1.024, Model χ²(1) = 3.128, p = .077. Although only marginally significant, this trend goes in the opposite direction from the one predicted if the high number of positive results were due to a publication bias favoring significant results over non-significant ones.
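To make the size of this trend concrete, the odds ratio is simply the exponential of the logistic regression coefficient. The sketch below recovers the reported value and, as a purely hypothetical illustration, scales it to a difference of 20 participants between two studies.

```python
import math

b_sample_size = 0.024  # logistic regression coefficient reported in the text

# Odds ratio for one additional participant.
odds_ratio_per_participant = math.exp(b_sample_size)

# Hypothetical comparison: one study tests 20 more participants than another.
odds_ratio_per_20 = math.exp(b_sample_size * 20)

print(round(odds_ratio_per_participant, 3))  # 1.024
print(round(odds_ratio_per_20, 2))
```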
A defender of the implicit nature of contextual cuing could argue that awareness truly is absent in these studies, and that publication bias explains the prevalence of significant results in the meta-analysis. The results above show that this hypothesis is implausible and that the prevalence is not attributable to publication bias. However, they also show something else of importance, namely that many of the reported null results are likely to be false negatives arising from underpowered studies. As the quality of the measurement improves in terms of sample size and number of observations, it becomes appreciably more likely that the study will yield evidence of awareness.
Effect sizes and statistical power
Overall, these analyses suggest that there is a true positive effect in the awareness tests employed in the studies included in the meta-analysis, and that failures to reach statistical significance are largely due to the small number of observations registered in most experiments, both in terms of sample size and in the number of trials included in the awareness test. Additional evidence for this interpretation can be obtained by exploring the typical size of the effect found in the awareness tests.
In many of the studies included in the present analyses, the authors failed to report sufficient information to compute the effect size of the results of the awareness test. Very frequently, the only piece of information available was that p-values were larger than .05, without additional details about t- or F-values. However, we were able to compute effect sizes for 96 of the statistical contrasts included in our data set. Based on sample sizes, reported t-values or, alternatively, one-degree-of-freedom F-statistics we were able to compute Cohen’s d_z effect size scores. We coded d_z scores as positive if the outcome went in the “explicit” direction (e.g., hit rate > false-alarm rate, regardless of significance) and as negative if the pattern of results was the opposite. Given the significant heterogeneity of effect sizes, Q(95) = 160.78, p < .001, we conducted a meta-analysis on d_z scores using a random effects model. The meta-analytic mean d_z was 0.31 (95 % CI 0.24–0.37).
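The steps just described can be sketched as follows. This is an illustration under stated assumptions, not the analysis actually run: the sampling variance of d_z uses a common large-sample approximation, the pooling uses the DerSimonian-Laird estimator (the text only says "random effects model"), and the (t, n) pairs are hypothetical rather than data from the review.

```python
import math

def dz_from_t(t: float, n: int) -> float:
    """Cohen's d_z for a paired or one-sample design: t divided by sqrt(n)."""
    return t / math.sqrt(n)

def dz_from_f(f: float, n: int, positive: bool = True) -> float:
    """With one numerator df, F = t**2, so the sign has to be coded by hand."""
    t = math.sqrt(f)
    return dz_from_t(t if positive else -t, n)

def dz_variance(d: float, n: int) -> float:
    """Large-sample approximation to the sampling variance (an assumption)."""
    return 1 / n + d**2 / (2 * n)

def random_effects(effects, variances):
    """DerSimonian-Laird pooled estimate, 95 % CI, Q statistic and tau^2.

    Needs at least two contrasts with unequal or equal finite variances.
    """
    w = [1 / v for v in variances]
    fixed = sum(wi * d for wi, d in zip(w, effects)) / sum(w)
    q = sum(wi * (d - fixed) ** 2 for wi, d in zip(w, effects))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * d for wi, d in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se), q, tau2

# Hypothetical (t, n) pairs for illustration only; not data from the review.
contrasts = [(2.4, 16), (0.5, 12), (1.1, 20), (-0.3, 16), (3.0, 40)]
effects = [dz_from_t(t, n) for t, n in contrasts]
variances = [dz_variance(d, n) for d, (_, n) in zip(effects, contrasts)]

pooled, (ci_low, ci_high), q, tau2 = random_effects(effects, variances)
print(f"pooled d_z = {pooled:.2f}, 95 % CI [{ci_low:.2f}, {ci_high:.2f}], Q = {q:.2f}")
```

As a check on the conversion, the direction-score test reported earlier, t(165) = 8.468 with 166 experiments, gives `dz_from_t(8.468, 166)` of about 0.66.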
Interestingly, although small, the meta-analytic effect size remains significantly greater than zero even if one actively removes from the meta-analysis all the statistical contrasts that turned out to be individually significant, d_z = 0.16 (95 % CI 0.10–0.22). Thus aggregate awareness is evident even amongst those studies that obtained no significant awareness and were on that basis interpreted as showing implicit learning. This speaks against the possibility that the studies in the meta-analysis represent two quite distinct sub-groups, one in which learning is truly conscious and one in which it is truly unconscious. Even when the truly conscious studies are removed, the remainder yield above-chance awareness.
It is important to acknowledge that the real effect size might be smaller than our meta-analytic estimate of d_z = 0.31. The t- and F-values were less likely to be reported when awareness tests failed to reach statistical significance, because in many of those cases the authors simply noted that p-values were larger than .05. Even so, assuming that 0.31 is approximately the true d_z of the typical awareness test, it is possible to compute the sample size required to achieve a specific level of statistical power. Using G*Power 3 (Faul, Erdfelder, Lang, & Buchner, 2007) we found that, assuming a d_z of .31, a sample size of at least 66 participants would be needed to achieve statistical power of .80 in a one-tailed paired-samples t-test. For the more frequent two-tailed t-test, the figure goes up to 84. But recall that, as just mentioned, 0.31 might overestimate the real effect size.
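These sample sizes can be approximated without dedicated software. The sketch below uses a normal approximation with a small-sample correction term (z_alpha squared over 2) rather than an exact noncentral-t calculation, so it is an approximation to what a power program computes, and the function name is ours.

```python
import math
from statistics import NormalDist

def required_n(d_z: float, power: float = 0.80, alpha: float = 0.05, tails: int = 2) -> int:
    """Approximate n for a paired-samples t-test (normal approximation plus correction)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / tails)
    z_power = z(power)
    n = ((z_alpha + z_power) / d_z) ** 2 + z_alpha**2 / 2
    return math.ceil(n)

print(required_n(0.31, tails=1))  # 66
print(required_n(0.31, tails=2))  # 84
```

Because the required n scales with the inverse square of d_z, any overestimation of the true effect size translates into a substantially larger required sample.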
Most interestingly, the median N of all the contrasts included in the meta-analysis (also including the ones for which d_z could not be calculated) was 16. The statistical power of a sample of 16 participants to obtain a significant two-tailed effect given a d_z of 0.31 is around .21. Note that this level of statistical power is virtually identical to the proportion of significant results (21.5 %) observed in our dataset. Given the small size of the effect found in the typical awareness test, studies with samples of this size are seriously underpowered. At the same time, the distribution of significant and nonsignificant results is close to what would be expected if the awareness results in individual studies are sampled from a distribution with a mean effect size of around .30.
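The power of a 16-participant study can also be estimated by simulation. The sketch below assumes normally distributed difference scores with unit standard deviation and a true d_z of 0.31, and uses the tabled two-tailed critical value of t for 15 degrees of freedom; the number of simulations and the seed are arbitrary choices.

```python
import math
import random

def simulated_power(d_z: float, n: int, t_crit: float, n_sims: int = 20000, seed: int = 1) -> float:
    """Share of simulated one-sample t-tests with |t| above the critical value."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_sims):
        # Difference scores drawn from a normal distribution with mean d_z, SD 1.
        sample = [rng.gauss(d_z, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
        t = mean / math.sqrt(variance / n)
        if abs(t) > t_crit:
            significant += 1
    return significant / n_sims

# 2.131 is the tabled two-tailed .05 critical value of t for df = 15.
power = simulated_power(d_z=0.31, n=16, t_crit=2.131)
print(f"Estimated power: {power:.2f}")
```

With these settings the estimate should land close to the value of about .21 given above, meaning that roughly four out of five such studies would miss a real effect of this size.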
Effect size in implicit versus explicit measures
It might be countered that this effect size in the awareness test is far too small to account for the usually large contextual cueing effect found in these experiments, as the typical contextual cueing experiment yields effect sizes well above d_z = 1 on the implicit RT measure. If participants had conscious access to the representations learned in contextual cuing, why should this knowledge yield larger effects when assessed by means of visual search than when measured by means of an awareness test? This concern neglects the fact that contextual cuing and awareness are measured with radically different procedures. Even if they were measuring exactly the same memory trace, the differences between the procedures are so numerous that it would be naïve to expect the same effect size in both of them. Just to mention a clear difference, contextual cuing is traditionally assessed by gathering reaction times from hundreds of trials (usually more than 500 across the experiment). In contrast, awareness is assessed by means of just a few discrete responses. As can be seen in Fig. 2A, the number of trials rarely goes beyond 24 or 40. One cannot expect to find the same precision in a dependent variable based on a few observations of a discrete response as in one based on hundreds of observations of a continuous measure, even if those two dependent variables are measuring exactly the same latent variable.
In fact, when other constraints are taken into account, a small effect size is exactly what one would expect to find in any measure of contextual cuing that is not based on a very large number of observations. The available evidence shows that the faster reaction times found in repeated patterns are usually attributable to a small number of search displays (Schlagbauer, Müller, Zehetleitner, & Geyer, 2012; Smyth & Shanks, 2008). In other words, participants seem to learn very little or nothing about most of the search displays. Furthermore, it is also known that even for the search displays that elicit some learning, participants do not seem to acquire detailed information about all the elements in the search display. Instead, they seem to learn something only about the two or three distractors that happen to be closest to the target (Brady & Chun, 2007; Olson & Chun, 2002). Trying to detect these fragmentary memory traces in a brief recognition test, where each pattern is only presented once, is like finding a needle in a haystack. It is hardly surprising that the resulting effects are small.
This simulation illustrates that the fact that the effect size of awareness is small does not mean that it is insufficient to explain or cannot be related to the (usually large) size of the contextual cuing effect found in reaction times. Instead, the small effect size found in awareness tests is exactly what one would expect to find when a subtle effect is assessed with an unreliable test. This problem does not apply to the usual measure of contextual cuing, which typically relies on hundreds of trials and consequently yields very precise estimations (and therefore large effect sizes) for even very subtle effects. The asymmetry between the small effects found in the awareness test and the large effects found in visual search facilitation can be attributed to differences in the sensitivity of the two measures (we return to this issue later).
It is interesting to note that the superior sensitivity of contextual cueing measures relative to awareness tests is also evident in experimental protocols where a brief awareness test is sufficiently powered to detect above-chance performance. For instance, it is widely acknowledged that contextual cueing is explicit when natural scenes are used as contexts. In these experiments (not included in our meta-analysis), a short test is usually enough to detect explicit awareness. But even so, this effect is disproportionally smaller than the corresponding contextual cuing effect found in reaction times. As an example, Brockmole and Henderson (2006, Experiment 1) found that participants performed above chance in a location-guessing test, and this effect was so large (d_z = 1.14) that it reached statistical significance with a small sample of only eight participants. But even this seemingly large effect is tiny compared to the huge size of the contextual cueing effect (d_z = 6.54). Thus, the reduced sensitivity of awareness tests is obvious even in experiments where learning is unambiguously considered explicit and tests are adequately powered to detect above-chance awareness.
Confidence intervals as a partial solution to the false-negative problem
Recall that a positive difference indicates that the proportion of hits was larger than the proportion of false alarms, in other words that participants were able to discriminate repeated from random search displays. As can be seen, for many of the studies with small sample sizes (19/21), the CI includes zero. Those results are usually taken as a proof that participants were unaware of learning. However, in general, these CIs are very wide. They include not just a small region around zero, but also a wide range of positive values. Therefore, these studies do not allow one to conclude that participants were unaware. They simply demonstrate that these experiments do not permit the level of awareness to be estimated with any precision.
In contrast, among the six experiments with the largest sample sizes the CIs are narrower and only one of them includes zero. Interestingly, the meta-analytic 95 % CI of all the experiments included in the figure overlaps with the CI of every single study. In other words, although the larger experiments yield significant results and the smaller experiments tend to yield non-significant results, there is actually no contradiction between them. Null results create the illusion that there is no difference between hits and false alarms and that participants were, therefore, unaware of learning. But the CIs do not allow this inference to be made with any degree of certainty. The use of CIs and graphic depiction is a powerful method for conveying the degree of precision in the estimate and of avoiding the temptation to interpret a failure to reject the null as evidence in favor of the null (Cumming, 2014; Fidler & Loftus, 2009).
Bayes Factors as an alternative solution
CIs and meta-analysis provide a particularly clear and simple means to illustrate the uncertainty associated with underpowered studies, especially when the goal of the researchers is to draw conclusions on the basis of null results. However, an important shortcoming of CIs is that they fail to quantify the extent to which the results of an experiment favor the null or the alternative hypothesis. If an experiment yields a precise (i.e., narrow) CI around zero, it is legitimate to conclude that the null hypothesis is probably supported by the data, or at least that the effect is of little practical significance. But in the absence of a means to quantify support for the null hypothesis precisely, this judgment remains somewhat arbitrary and subjective.
In contrast, Bayes Factors provide such a means to quantify the extent to which evidence favors the null or the alternative hypothesis and could accordingly play an important role in future research on contextual cuing and other implicit learning effects (Dienes, 2015). Specifically, a Bayes Factor (BF10) represents the ratio between the likelihood of the data given the alternative hypothesis (H1) and the likelihood of the data given the null hypothesis (H0). A BF10 larger than 3 is usually considered to reflect substantial support in favor of the alternative hypothesis and values larger than 10 strong support. Conversely, values lower than 1/3 are considered substantial evidence and values lower than 1/10 strong support for the null hypothesis (Wetzels, Matzke, Lee, Rouder, Iverson, & Wagenmakers, 2011).
Do the results of the awareness tests reviewed in our meta-analysis provide more support for the null hypothesis than for the alternative hypothesis? To answer this question, we converted all the 96 effect sizes entered in the meta-analysis back to t-values that we submitted to a Bayes Factor analysis using a Cauchy distribution with a (default) scaling factor r = 0.707 as the alternative hypothesis. To improve the comparability of values supporting the null hypothesis (originally bounded from 0 to 1) with values supporting the alternative hypothesis (originally bounded from 1 to ∞), we took the natural logarithm of all BF10 values, which yields a symmetric distribution where all negative values support the null hypothesis and all positive values support the alternative hypothesis. On this logarithmic scale, values roughly larger than 1.1 provide substantial support for the alternative hypothesis (BF10 > 3) and values roughly larger than 2.3 provide strong support (BF10 > 10). Conversely, values lower than −1.1 or than −2.3 constitute substantial and strong support for the null hypothesis.
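A procedure of this kind can be sketched in a few lines. The code below is a minimal illustration, not the software used for the reported analysis: it assumes the one-sample JZS formulation of the kind described by Rouder et al. (2009), with the Cauchy prior written as a normal prior whose variance is mixed over an inverse-gamma distribution, and it uses a simple midpoint rule for the integral. The function name, the integration scheme, and the (d_z, n) inputs are our own choices.

```python
import math

def jzs_bf10(t: float, n: int, r: float = 0.707, steps: int = 20000) -> float:
    """One-sample JZS Bayes factor (alternative over null), midpoint-rule integration."""
    nu = n - 1
    null_likelihood = (1 + t**2 / nu) ** (-(nu + 1) / 2)
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) / steps      # midpoint of each slice of (0, 1)
        g = u / (1 - u)            # maps (0, 1) onto (0, infinity)
        jacobian = 1 / (1 - u) ** 2
        a = 1 + n * g * r**2
        likelihood = a**-0.5 * (1 + t**2 / (a * nu)) ** (-(nu + 1) / 2)
        # Inverse-gamma(1/2, 1/2) density on g, which yields a Cauchy prior on effect size.
        prior = (2 * math.pi) ** -0.5 * g**-1.5 * math.exp(-1 / (2 * g))
        total += likelihood * prior * jacobian
    return (total / steps) / null_likelihood

# Hypothetical (d_z, n) pairs, converted back to t as t = d_z * sqrt(n).
for d_z, n in [(0.0, 16), (0.31, 16), (0.31, 84)]:
    t = d_z * math.sqrt(n)
    log_bf = math.log(jzs_bf10(t, n))
    print(f"d_z = {d_z:.2f}, n = {n}: ln(BF10) = {log_bf:.2f}")
```

The thresholds quoted above follow directly from the logarithm: ln(3) is about 1.1 and ln(10) is about 2.3.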
Therefore, this Bayesian analysis offers a somewhat tantalizing view of the implicitness of contextual cueing that has important implications for future research: On the one hand, there are a large number of studies with results numerically more consistent with the null hypothesis (no awareness) than with the alternative hypothesis (awareness). On the other hand, there are more experiments strongly supporting the alternative hypothesis than strongly supporting the null hypothesis. Fortunately, Bayesian statistics also offer a way of resolving this apparent contradiction regarding the inconclusiveness of existing evidence. Although in NHST researchers are not free to continue testing participants after reaching the sample size they specified a priori, Bayesian statistics do allow researchers to continue gathering data (e.g., in an awareness test) until a specific level of precision is reached (Dienes, 2011, 2014), for instance, until the Bayes Factor becomes larger than 10 or smaller than 1/10. This feature of Bayesian statistics makes Bayes Factors a powerful means by which future research could establish the implicitness of contextual cuing and other seemingly unconscious learning effects (Rouder, Morey, Speckman, & Pratte, 2007).
Correlations and post hoc data selection
We should acknowledge that many of the studies included in the meta-analysis based their conclusion that the contextual cuing they obtained was implicit not only on a null result in an awareness test but also on one or both of two additional pieces of evidence: correlations and post hoc data selection. However, both of these are statistically problematic.
The first of these refers to the finding that across participants, the magnitude of contextual cuing tends not to be significantly correlated with the measure of awareness. For instance, going back to the examples depicted in Fig. 4, Zellin, Conci, von Mühlenen, and Müller (2013, Experiment 3) found a marginally significant effect in the awareness test. However, instead of concluding that learning was explicit, they went on to estimate the correlation between the results of the awareness test and the size of contextual cueing and found a correlation of r = .42, p > .10. This lack of significant correlation seems on the face of it to provide further and stronger support for the claim that learning is implicit, but a moment’s thought reveals that once again absence of evidence is not the same as evidence of absence. Without knowing the CI on the correlation coefficient, we cannot evaluate how much weight to place on the null result, yet authors never report such CIs. We computed the 95 % CI on the correlation coefficient obtained by Zellin, Conci et al. (2013, Experiment 3) and found that it had lower and upper limits of −.14 and .77. Thus the data in this study are compatible with a true correlation as large as .77 or as low as −.14. Similarly, Conci and von Mühlenen (2011, Experiment 2) and Preston and Gabrieli (2008) reported non-significant correlations with 95 % CIs of [−.42 to .62] and [−.33 to .49], respectively. Obviously, these estimations are too imprecise to permit any strong conclusions to be drawn.
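Confidence intervals of this kind are commonly approximated with the Fisher z-transformation. The sketch below shows that calculation. It is not necessarily the exact procedure used for the intervals reported above, and the sample size passed in the example (n = 14) is a hypothetical value chosen for illustration, not one taken from any of the cited studies.

```python
import math
from statistics import NormalDist


def correlation_ci(r: float, n: int, level: float = 0.95) -> tuple:
    """Approximate CI for a Pearson correlation via the Fisher z-transformation."""
    if n <= 3:
        raise ValueError("the Fisher approximation needs n > 3")
    z = math.atanh(r)                    # Fisher z of the observed correlation
    se = 1.0 / math.sqrt(n - 3)          # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    # Back-transform the limits from the z scale to the correlation scale.
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


if __name__ == "__main__":
    # n = 14 is hypothetical; it only illustrates how wide the interval is
    # for a correlation of .42 in a small sample.
    lo, hi = correlation_ci(0.42, 14)
    print(f"r = .42, n = 14: 95% CI [{lo:.2f}, {hi:.2f}]")
    lo, hi = correlation_ci(0.42, 100)
    print(f"r = .42, n = 100: 95% CI [{lo:.2f}, {hi:.2f}]")
```

With a small sample the interval spans zero and also reaches large positive values, whereas the same correlation in a sample of 100 would exclude zero. This is why a non-significant correlation alone says little about whether the true correlation is absent.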
Furthermore, it is common practice to report the correlation between explicit and implicit measures of learning only when the explicit awareness measures yield significant results (e.g., Conci & von Mühlenen, 2011, Experiment 2; Geyer, Shi, & Müller, 2010; Peterson & Kramer, 2001; Preston & Gabrieli, 2008). This is particularly problematic. In just the same way that multiple testing increases the risk of type 1 errors, it also increases the risk of type 2 errors. Put differently, if researchers explore different awareness measures until they find one that yields a null result, the chances that the null result will reflect a false negative increase as the number of statistical tests grows. To prevent type 1 errors when multiple comparisons are conducted it is usual to make adjustments of α, like the Bonferroni correction. Similarly, in order to prevent type 2 errors, it would be necessary to adjust β for multiple comparisons, which is virtually identical to increasing statistical power, defined as 1 − β.
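The arithmetic behind this point can be sketched in a few lines, under the simplifying assumption that the tests are statistically independent. Awareness measures taken from the same participants will usually be correlated, so the numbers are an illustration rather than an exact account. Both function names are hypothetical.

```python
def prob_any_false_negative(power: float, k: int) -> float:
    """Probability that at least one of k independent tests of a true effect
    is non-significant, when each test has the given power (1 - beta)."""
    return 1.0 - power ** k


def required_power_per_test(familywise_beta: float, k: int) -> float:
    """Per-test power needed so that the chance of any false negative across
    k independent tests stays at familywise_beta."""
    return (1.0 - familywise_beta) ** (1.0 / k)


if __name__ == "__main__":
    for k in (1, 2, 3, 5):
        p = prob_any_false_negative(0.80, k)
        print(f"{k} test(s) at power .80: P(at least one null result) = {p:.3f}")
    print(f"Power per test to hold beta at .20 over 3 tests: "
          f"{required_power_per_test(0.20, 3):.3f}")
```

Even with the conventional power of .80 per test, a researcher who tries three measures has close to an even chance of obtaining at least one null result when awareness is real. Holding the overall type 2 error rate constant therefore requires higher power in each individual test.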
We have argued here that studies which measure awareness alongside some “implicit” behavioral measure can yield erroneous evidence if NHST leads researchers to mistake weak awareness for null awareness. We have also noted that this problem applies not only to the interpretation of the awareness measure itself and whether it exceeds chance, but also extends to interpretation of correlations between implicit and explicit measures where absence of evidence can again be misinterpreted as evidence of absence. One final method may at first sight appear to avoid these problems by unequivocally ensuring null awareness: Selecting participants post hoc who score at or below chance on the awareness measure. If such a sample of participants (or a sample of configurations) shows significant contextual cuing (which they do: e.g., Colagiuri, Livesey, & Harris, 2011; Geyer, Shi, & Müller, 2010; Geyer, Zehetleitner, & Müller, 2010; Smyth & Shanks, 2008), then surely this is clear evidence of true implicit learning? The answer to this question is an emphatic “no.” The method is statistically unsound (Shanks & Berry, 2012).
We now select only those simulated participants who individually score at or below chance (d = 0) on the awareness measure (illustrated by the open circles in Fig. 6) and we ask what contextual cuing score we see in these “unaware” participants. The score in these participants is ~70 msec. Despite the fact that contextual cuing and awareness are based on the same underlying knowledge representation in this model (and on nothing else apart from noise), and that these participants are selected on the basis of chance (or below chance) awareness, they nonetheless show a highly reliable contextual cuing effect. There is no mystery to this: It is simply a manifestation of regression to the mean. In noisy bivariate data, a sample created by applying a cut-off on one dimension will have a mean on the other dimension that is closer to the overall mean. Note that although this demonstration concerns participants selected post hoc, the same logic applies to configurations selected in the same way (e.g., Geyer, Shi, & Müller, 2010; for a similar approach, see Conci & von Mühlenen, 2011, Experiment 2). It implies that the logic of interpreting significant contextual cuing in participants (or configurations) retrospectively chosen because their awareness is at or below chance as evidence of implicit learning can be misleading.
Lastly, note that across all of the data generated by the model, the effect size for contextual cuing is Cohen’s d ≈ 1 while that for awareness is d ≈ 0.3 (these can be calculated directly from Eqs. 1 and 2). Thus, confirming what we claimed earlier, the fact that real studies might yield larger effect sizes for contextual cuing than for awareness does not license the conclusion that the former is based on some special form of unconscious knowledge. It arises simply because the model assumes a greater relative contribution of random error to awareness measures than to contextual cuing.
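The selection artefact can be reproduced with a toy simulation. The sketch below is not the model defined by Eqs. 1 and 2. It assumes a simple single-source model in which both measures equal the same latent knowledge value plus independent Gaussian noise, with parameters chosen (hypothetically) so that the effect sizes are roughly d ≈ 1 for contextual cuing and d ≈ 0.3 for awareness. Scores are in arbitrary standardized units, not milliseconds.

```python
import random
import statistics


def simulate(n=50_000, seed=1):
    """Toy single-source model: both measures share one latent knowledge value."""
    rng = random.Random(seed)
    sd_k = 0.7                                   # spread of latent knowledge (assumed)
    sd_cue_noise = (1.0 - sd_k ** 2) ** 0.5      # total SD of cuing = 1, so d is about 1
    sd_aware_total = 1.0 / 0.3                   # total SD of awareness, so d is about 0.3
    sd_aware_noise = (sd_aware_total ** 2 - sd_k ** 2) ** 0.5
    cuing, awareness = [], []
    for _ in range(n):
        k = rng.gauss(1.0, sd_k)                 # the same knowledge drives both measures
        cuing.append(k + rng.gauss(0.0, sd_cue_noise))
        awareness.append(k + rng.gauss(0.0, sd_aware_noise))
    return cuing, awareness


cuing, awareness = simulate()

# Post hoc selection: keep only simulated participants at or below chance (0).
unaware = [c for c, a in zip(cuing, awareness) if a <= 0.0]

overall_mean = statistics.fmean(cuing)
unaware_mean = statistics.fmean(unaware)

if __name__ == "__main__":
    print(f"d (cuing)     = {overall_mean / statistics.stdev(cuing):.2f}")
    print(f"d (awareness) = {statistics.fmean(awareness) / statistics.stdev(awareness):.2f}")
    print(f"mean cuing, all participants       = {overall_mean:.2f}")
    print(f"mean cuing, 'unaware' participants = {unaware_mean:.2f} (n = {len(unaware)})")
```

In this toy model the "unaware" subgroup still shows a clearly positive cuing score that is only slightly below the overall mean, even though no unconscious knowledge source exists in the simulation. The noisier the awareness measure, the less the cut-off removes real knowledge from the selected subgroup.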
Conclusions drawn by authors and impact on publication quality
The analyses conducted so far give us reasons to suspect that many, if not most, of the null results obtained in this kind of awareness test can be considered false negatives. This conclusion stands in stark contrast with the certainty with which authors interpret these null results as strong evidence in support of the null hypothesis. As an example, the experiment with the widest CI in Fig. 4 is Experiment 4 from Conci and von Mühlenen (2011). In spite of the uncertainty revealed by the CI of the awareness test, the conclusion drawn by the authors was that “no explicit awareness of the display repetitions could be formed” (Conci & von Mühlenen, 2011, p. 219). Note also the results of the two conditions analysed in Experiment 4 of Zellin, Conci et al. (2013). Although they include zero, the CIs do not exclude a wide range of positive values. Obviously, no conclusion can be drawn with any assurance from the results of those awareness tests. However, the interpretation of the authors was that “observers did not explicitly recognize the old context-displays” (Zellin, Conci et al., 2013, p. 10).
In recent years the scientific community has witnessed growing concern about the high rate of false positives and unreliable results within published studies (Francis, 2012; Ioannidis et al., 2014; Simonsohn, Nelson, & Simmons, 2014). In contrast, the potential impact of false negatives has remained largely ignored (Fiedler et al., 2012). This asymmetry is natural, given that most experiments seek to observe positive results. However, there are many areas of psychological research where the evidential value given to null results is critical. In fact, there are several reasons to suspect that the over-interpretation of null results is even more dangerous than the prevalence of false positives in some areas of research. First, null results are inherently ambiguous. They indicate that there is not enough support for the alternative hypothesis, but they are silent about the amount of support for the null hypothesis. Second, unlike positive results, null results are surprisingly easy to obtain by mere statistical artefacts. Simply using a small sample or a noisy measure can suffice to produce a false negative.
The results of the present systematic review suggest that these problems might be obscuring our view of implicit learning and memory in particular and, perhaps, implicit processing in general. It is popularly claimed that contextual cuing and other implicit learning effects take place without participants becoming aware of the representations they learn (Chun & Jiang, 2003). Contrary to this prevalent view, we found that the seemingly chance-level performance of participants in awareness tests is more likely to reflect a type 2 error. The overall proportion of positive results is too large for the null hypothesis to be true. This proportion cannot easily be explained in terms of publication bias favoring positive results, but is perfectly consistent with the frequency of positive results that one would expect to find, given a true but modest-sized awareness effect, in underpowered experiments using unreliable dependent measures. This result is also consistent with experimental evidence suggesting that the quality of the awareness test is a key determinant of whether contextual cuing experiments yield “explicit” or “implicit” results (Smyth & Shanks, 2008).
We have offered some suggestions about how future studies could provide firmer evidence for implicit learning in contextual cuing, including increasing sample sizes to boost power, reporting CIs, and continuing to collect awareness (e.g., recognition) data until the Bayes Factor crosses a boundary of evidential support. We have also suggested that two data analytic techniques should unequivocally be avoided in future studies: The calculation of implicit-explicit correlations after finding that the implicit score is significantly greater than chance, and post hoc data selection.
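To give a sense of the sample sizes involved, the sketch below uses a normal approximation to the power of a two-sided one-sample test, taking the meta-analytic awareness effect of d z = 0.31 as the true effect. The normal approximation slightly overstates the power of a t-test in small samples, and the example sample size of 16 is a hypothetical value, not a figure drawn from the reviewed studies.

```python
from statistics import NormalDist


def approx_power(d: float, n: int, alpha: float = 0.05) -> float:
    """Normal approximation to the power of a two-sided one-sample test
    for a standardized effect d with n participants."""
    nd = NormalDist()
    crit = nd.inv_cdf(1.0 - alpha / 2.0)
    shift = d * n ** 0.5                 # expected value of the test statistic
    return nd.cdf(shift - crit) + nd.cdf(-shift - crit)


def n_for_power(d: float, target: float = 0.80, alpha: float = 0.05) -> int:
    """Smallest n whose approximate power reaches the target."""
    n = 2
    while approx_power(d, n, alpha) < target:
        n += 1
    return n


if __name__ == "__main__":
    d_z = 0.31  # meta-analytic effect size for awareness tests
    print(f"approximate power with n = 16 (hypothetical): {approx_power(d_z, 16):.2f}")
    print(f"approximate n needed for power .80: {n_for_power(d_z)}")
```

Under this approximation, a hypothetical experiment with 16 participants would detect an effect of this size only about one time in four, and reaching the conventional power of .80 would take on the order of 80 participants.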
Before ending, we would also like to emphasize that we do not believe that researchers working in this area are following these practices (e.g., using small numbers of testing trials or relying on NHST to claim support for the null hypothesis) in a deliberate attempt to deceive their readers. Most likely, researchers are simply routinely following a research protocol that, with its pros and cons, has become standard. It must be acknowledged that many of the experiments included in the present meta-analysis (and especially those that made no mention of awareness in their titles) were designed primarily to explore issues largely unrelated to the question of whether contextual cuing is implicit or not, such as the role of working memory in contextual cuing, how spatial associations are formed, the neural underpinnings of contextual learning, and so on. The fact that awareness was only a secondary concern might explain why the vast majority of them did not include a sensitive (and lengthy) awareness test and why they relied on simple NHST to analyse their results. But this only serves to illustrate how easily a particular conception can gain momentum in a substantial body of literature and become part of the zeitgeist, despite weak evidence.
Although we restricted our analyses to experiments conducted within a specific implicit learning paradigm, the same problem extends to other phenomena where participants’ awareness is discounted on the basis of NHST, such as subliminal perception and other forms of unconscious learning and implicit processing that we have not considered here (e.g., Dehaene et al., 1998; Pessiglione et al., 2008). False negatives also pose important problems for current attempts to replicate controversial findings.
These and other examples show that null results in underpowered studies may give the false impression that an effect is genuinely absent when actually it is not. They can also create the impression that there is a deep inconsistency between studies showing significant results and those yielding null results, even when the latter just reflect a lack of statistical sensitivity. Fortunately, researchers can resort to alternative statistical analyses when they need to assess the amount of support for the null hypothesis, including CIs, Bayes factors, and counternull values (Cumming, 2014; Dienes, 2015; Rosenthal & Rubin, 1994; Rouder et al., 2009). The price we pay for our reluctance to use these alternatives to NHST is that important aspects of what we believe about cognition may be mistaken.
The authors were supported by Grant ES/J007196/1 from the Economic and Social Research Council. We are indebted to Marvin Chun, Markus Conci, Thomas Geyer, Hermann Müller, and Eric-Jan Wagenmakers for their valuable comments on earlier versions of this article.
References marked with an asterisk indicate studies included in the meta-analysis.
- *Barnes, K. A., Howard, J. H., Jr., Howard, D. V., Gilotty, L., Kenworthy, L., Gaillard, W. D., & Vaidya, C. J. (2008). Intact implicit learning of spatial context and temporal sequences in childhood autism spectrum disorder. Neuropsychology, 22, 563–570.
- *Barnes, K. A., Howard, J. H., Jr., Howard, D. V., Kenealy, L., & Vaidya, C. J. (2010). Two forms of implicit learning in childhood ADHD. Developmental Neuropsychology, 35, 494–505.
- *Bennett, I. J., Barnes, K. A., Howard, J. H., Jr., & Howard, D. V. (2009). An abbreviated implicit spatial context learning task that yields greater learning. Behavior Research Methods, 41, 391–395.
- *Brady, T. F., & Chun, M. M. (2007). Spatial constraints on learning in visual search: Modeling contextual cuing. Journal of Experimental Psychology: Human Perception and Performance, 33, 798–815.
- *Chaumon, M., Drouet, V., & Tallon-Baudry, C. (2008). Unconscious associative memory affects visual processing before 100 ms. Journal of Vision, 8, 10.
- *Chaumon, M., Schwartz, D., & Tallon-Baudry, C. (2008). Unconscious learning versus visual perception: Dissociable roles for gamma oscillations revealed in MEG. Journal of Cognitive Neuroscience, 21, 2287–2299.
- *Chua, K. P., & Chun, M. M. (2003). Implicit scene learning is viewpoint dependent. Perception & Psychophysics, 65, 72–80.
- *Chun, M. M., & Jiang, Y. (1998). Contextual cueing: Implicit learning and memory of visual context guides spatial attention. Cognitive Psychology, 36, 28–71.
- *Chun, M. M., & Jiang, Y. (2003). Implicit, long-term spatial contextual memory. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29, 224–234.
- *Chun, M. M., & Phelps, E. A. (1999). Memory deficits for implicit contextual information in amnesic subjects with hippocampal damage. Nature Neuroscience, 2, 844–847.
- *Colagiuri, B., Livesey, E. J., & Harris, J. A. (2011). Can expectancies produce placebo effects for implicit learning? Psychonomic Bulletin & Review, 18, 399–405.
- *Conci, M., & Müller, H. J. (2012). Contextual learning of multiple target locations in visual search. Visual Cognition, 20, 746–770.
- *Conci, M., Sun, L., & Müller, H. J. (2011). Contextual remapping in visual search after predictable target-location changes. Psychological Research, 75, 279–289.
- *Conci, M., & von Mühlenen, A. (2009). Region segmentation and contextual cuing in visual search. Attention, Perception, & Psychophysics, 71, 1514–1524.
- *Conci, M., & von Mühlenen, A. (2011). Limitations of perceptual segmentation on contextual cueing in visual search. Visual Cognition, 19, 203–233.
- *Dixon, M. L., Zelazo, P. D., & De Rosa, E. (2010). Evidence for intact memory-guided attention in school-aged children. Developmental Science, 13, 161–169.
- *Endo, N., & Takeda, Y. (2005). Use of spatial context is restricted by relative position in implicit learning. Psychonomic Bulletin & Review, 12, 880–885.
- *Geringswald, F., Baumgartner, F., & Pollmann, S. (2012). Simulated loss of foveal vision eliminates visual search advantage in repeated displays. Frontiers in Human Neuroscience, 6, 134.
- *Geringswald, F., Herbik, A., Hoffmann, M. B., & Pollmann, S. (2013). Contextual cueing impairment in patients with age-related macular degeneration. Journal of Vision, 13, 28.
- *Geyer, T., Baumgartner, F., Müller, H. J., & Pollmann, S. (2012). Medial temporal lobe-dependent repetition suppression and enhancement due to implicit vs. explicit processing of individual repeated search displays. Frontiers in Human Neuroscience, 6, 272.
- *Geyer, T., Shi, Z., & Müller, H. J. (2010). Contextual cueing in multiconjunction visual search is dependent on color- and configuration-based intertrial contingencies. Journal of Experimental Psychology: Human Perception and Performance, 36, 515–532.
- *Geyer, T., Zehetleitner, M., & Müller, H. J. (2010). Contextual cueing of pop-out visual search: When context guides the deployment of attention. Journal of Vision, 10, 20.
- *Giesbrecht, B., Sy, J. L., & Guerin, S. A. (2013). Both memory and attention systems contribute to visual search for targets cued by implicitly learned context. Vision Research, 85, 80–89.
- *Greene, A. J., Gross, W. L., Elsinger, C. L., & Rao, S. M. (2007). Hippocampal differentiation without recognition: An fMRI analysis of the contextual cueing task. Learning & Memory, 14, 548–553.
- *Howard, J. H., Jr., Howard, D. V., Dennis, N. A., Yankovich, H., & Vaidya, C. J. (2004). Implicit spatial contextual learning in healthy aging. Neuropsychology, 18, 124–134.
- *Jiménez, L., & Vázquez, G. A. (2011). Implicit sequence learning and contextual cueing do not compete for central cognitive resources. Journal of Experimental Psychology: Human Perception and Performance, 37, 222–235.
- *Jiménez-Fernández, G., Vaquera, J. M. M., Jiménez, L., & Defior, S. (2011). Dyslexic children show deficits in implicit sequence learning, but not in explicit sequence learning or contextual cueing. Annals of Dyslexia, 61, 85–110.
- *Johnson, J. S., Woodman, G. F., Braun, E., & Luck, S. J. (2007). Implicit memory influences the allocation of attention in visual cortex. Psychonomic Bulletin & Review, 14, 834–839.
- *Kawahara, J. (2003). Contextual cueing in 3D layouts defined by binocular disparity. Visual Cognition, 10, 837–852.
- *Kourkoulou, A., Kuhn, G., Findlay, J. M., & Leekam, S. R. (2013). Eye movement difficulties in autism spectrum disorder: Implications for implicit contextual cueing. Autism Research, 6, 177–189.
- *Kourkoulou, A., Leekam, S. R., & Findlay, J. M. (2012). Implicit learning of local context in autism spectrum disorder. Journal of Autism and Developmental Disorders, 42, 244–256.
- *Le Dantec, C. C., Melton, E. E., & Seitz, A. R. (2012). A triple dissociation between learning of target, distractors, and spatial contexts. Journal of Vision, 12, 5.
- *Luethi, M., Meier, B., & Sandi, C. (2009). Stress effect on working memory, explicit memory, and implicit memory for neutral and emotional stimuli in healthy men. Frontiers in Behavioral Neuroscience, 2, 5.
- *Makovski, T., & Jiang, Y. V. (2011). Investigating the role of response in spatial context learning. Quarterly Journal of Experimental Psychology, 64, 1563–1579.
- *Manginelli, A. A., Baumgartner, F., & Pollmann, S. (2013). Dorsal and ventral working memory-related brain areas support distinct processes in contextual cueing. NeuroImage, 67, 363–374.
- *Manginelli, A. A., Geringswald, F., & Pollmann, S. (2012). Visual search facilitation in repeated displays depends on visuospatial working memory. Experimental Psychology, 59, 47–54.
- *Manginelli, A. A., Langer, N., Klose, D., & Pollmann, S. (2013). Contextual cueing under working memory load: Selective interference of visuospatial load with expression of learning. Attention, Perception, & Psychophysics, 75, 1103–1117.
- *Manginelli, A. A., & Pollmann, S. (2009). Misleading contextual cues: How do they affect visual search? Psychological Research, 73, 212–221.
- *Manns, J. R., & Squire, L. R. (2001). Perceptual learning, awareness, and the hippocampus. Hippocampus, 11, 776–782.
- *Mednick, S. C., Makovski, T., Cai, D. J., & Jiang, Y. V. (2009). Sleep and rest facilitate implicit memory in a visual search task. Vision Research, 49, 2557–2565.
- *Nabeta, T., Ono, F., & Kawahara, J.-I. (2003). Transfer of spatial context from visual to haptic search. Perception, 32, 1351–1358.
- *Negash, S., Petersen, L. E., Geda, Y. E., Knopman, D. S., Boeve, B. F., et al. (2007). Effects of ApoE genotype and mild cognitive impairment on implicit learning. Neurobiology of Aging, 28, 885–893.
- *Ogawa, H., & Watanabe, K. (2010). Time to learn: Evidence for two types of attentional guidance in contextual cueing. Perception, 39, 72–80.
- *Ogawa, H., & Watanabe, K. (2011). Implicit learning increases preference for predictive visual display. Attention, Perception, & Psychophysics, 73, 1815–1822.
- *Olson, I. R., & Chun, M. M. (2002). Perceptual constraints on implicit learning of spatial context. Visual Cognition, 9, 273–302.
- *Oudman, E., Van der Stigchel, S., Wester, A. J., Kessels, R. P. C., & Postma, A. (2011). Intact memory for implicit contextual information in Korsakoff’s amnesia. Neuropsychologia, 49, 2848–2855.
- *Peterson, M. S., & Kramer, A. F. (2001). Attentional guidance of the eyes by contextual information and abrupt onsets. Perception & Psychophysics, 63, 1239–1249.
- *Pollmann, S., & Manginelli, A. A. (2009). Anterior prefrontal involvement in implicit contextual change detection. Frontiers in Human Neuroscience, 3, 28.
- *Pollmann, S., & Manginelli, A. A. (2010). Repeated contextual search cues lead to reduced BOLD-onset times in early visual and left inferior frontal cortex. The Open Neuroimaging Journal, 4, 9–15.
- *Preston, A. R., & Gabrieli, J. D. E. (2008). Dissociation between explicit memory and configural memory in the human medial temporal lobe. Cerebral Cortex, 18, 2192–2207.
- *Rausei, V., Makovski, T., & Jiang, Y. (2007). Attention dependency in implicit learning of repeated search context. Quarterly Journal of Experimental Psychology, 60, 1321–1328.
- *Schankin, A., & Schubö, A. (2009a). Cognitive processes facilitated by contextual cueing: Evidence from event-related brain potentials. Psychophysiology, 46, 668–679.
- *Schankin, A., & Schubö, A. (2009b). The time course of attentional guidance in contextual cueing. In L. Paletta & J. K. Tsotsos (Eds.), Attention in cognitive systems: Lecture notes in computer sciences (pp. 69–84). Berlin: Springer.
- *Schankin, A., Stursberg, O., & Schubö, A. (2008). The role of implicit context information in guiding visual-spatial attention. In B. Caputo & M. Vincze (Eds.), Cognitive vision (pp. 93–106). Berlin: Springer.
- *Schlagbauer, B., Müller, H. J., Zehetleitner, M., & Geyer, T. (2012). Awareness in contextual cueing of visual search as measured with concurrent access- and phenomenal-consciousness tasks. Journal of Vision, 12, 25.
- *Shi, Z., Zang, X., Jia, L., Geyer, T., & Müller, H. J. (2013). Transfer of contextual cueing in full-icon display remapping. Journal of Vision, 13, 2.
- *Smyth, A. C., & Shanks, D. R. (2008). Awareness in contextual cuing with extended and concurrent explicit tests. Memory & Cognition, 36, 403–415.
- *Smyth, A. C., & Shanks, D. R. (2011). Aging and implicit learning: Explorations in contextual cueing. Psychology and Aging, 26, 127–132.
- *Song, J.-H., & Jiang, Y. (2005). Connecting the past with the present: How do humans match an incoming visual display with visual memory? Journal of Vision, 5, 322–330.
- *Travers, B. G., Powell, P. S., Mussey, J. L., Klinger, L. G., Crisler, M. E., & Klinger, M. R. (2013). Spatial and identity cues differentially affect implicit contextual cueing in adolescents and adults with autism spectrum disorder. Journal of Autism and Developmental Disorders, 43, 2393–2404.
- *Travis, S. L., Mattingley, J. B., & Dux, P. E. (2013). On the role of working memory in spatial contextual cueing. Journal of Experimental Psychology: Learning, Memory, and Cognition, 39, 208–219.
- *Tseng, P., Hsu, T.-Z, Tzeng, O. J. L., Hung, D. L., & Juan, C.-H. (2011). Probabilities in implicit learning. Perception, 40, 822–829.
- *Tseng, Y.-C., & Li, C.-S. R. (2004). Oculomotor correlates of context-guided learning in visual search. Perception & Psychophysics, 66, 1363–1378.
- *Tseng, Y.-C., & Lleras, A. (2013). Rewarding context accelerates implicit guidance in visual search. Attention, Perception, & Psychophysics, 75, 287–298.
- *Vaidya, C. J., Huger, M., Howard, D. V., & Howard, J. H. (2007). Developmental differences in implicit learning of spatial context. Neuropsychology, 21, 497–506.
- *van Asselen, M., Almeida, I., Andre, R., Januário, C., Gonçalves, A. F., & Castelo-Branco, M. (2009). The role of the basal ganglia in implicit contextual learning: A study of Parkinson's disease. Neuropsychologia, 47, 1269–1273.
- *van Asselen, M., Almeida, I., Julio, F., Januario, C., Campos, E. B., Simoes, M., & Castelo-Branco, M. (2012). Implicit contextual learning in prodromal and early stage Huntington’s disease patients. Journal of the International Neuropsychological Society, 18, 689–696.
- *van Asselen, M., & Castelo-Branco, M. (2009). The role of peripheral vision in implicit contextual cuing. Attention, Perception, & Psychophysics, 71, 76–81.
- *Zellin, M., Conci, M., von Mühlenen, A., & Müller, H. J. (2011). Two (or three) is one too many: Testing the flexibility of contextual cueing with multiple target locations. Attention, Perception, & Psychophysics, 73, 2065–2076.
- *Zellin, M., Conci, M., von Mühlenen, A., & Müller, H. J. (2013). Here today, gone tomorrow: Adaptation to change in memory-guided visual search. PLoS ONE, 8, e59466.
- *Zellin, M., von Mühlenen, A., Müller, H. J., & Conci, M. (2013). Statistical learning in the past modulates contextual cueing in the future. Journal of Vision, 13, 19.
- *Zhao, G., Liu, Q., Jiao, J., Zhou, P., Li, H., & Sun, H.-J. (2012). Dual-state modulation of the contextual cueing effect: Evidence from eye movement recordings. Journal of Vision, 12, 11.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.