Introduction

A hotly debated question in contemporary psycholinguistics is whether the regular use of multiple languages has consequences for cognition more generally (Adesope, Lavin, Thompson, & Ungerleider, 2010). It is widely assumed that words in each of a bilingual’s two lexicons are active during language use and compete for selection (for an overview, see Bialystok, Craik, Green, & Gollan, 2009). Theorists have proposed that competition between first language (L1) and second language (L2) lexical items is overcome using more general cognitive processes. For example, Green (1998) proposed that words in the non-target language are inhibited during regular language use. Others have speculated that bilingual language use might involve conflict monitoring (Abutalebi et al., 2012). This raises the intriguing possibility that bilingualism may exercise and thereby strengthen these processes, yielding a so-called bilingual advantage. This hypothesis has been tested using a variety of tasks and bilingual samples; however, results of these studies have been mixed (Costa, Hernández, Costa-Faidella, & Sebastián-Gallés, 2009; de Bruin, Treccani, & Della Sala, 2015b; Hilchey & Klein, 2011; Lehtonen, Soveri, Laine, Järvenpää, de Bruin, & Antfolk, 2018; Paap, Johnson, & Sawi, 2015).

Across many studies, interference-control tasks, such as the Simon, Flanker, and variations of the Stroop, have been used as tests of a bilingual advantage. In all interference-control tasks, participants respond either to a single bivalent stimulus or to a single stimulus with a distracter stimulus present. On some trials, the distracter dimension or stimulus cues the same response as the target (congruent trials) and on other trials it cues a different response (incongruent trials). These tasks yield two dependent variables (DVs). The first DV is global reaction time (RT), which is the average RT across congruent and incongruent trials. It is generally assumed to reflect the efficiency of processing in a high-conflict environment. The second DV is interference cost, calculated as the difference in mean RT between incongruent and congruent trials. It is thought to reflect the amount of extra time it takes to suppress the response cued by the incongruent information.
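As a minimal illustration (with invented RT values), the two DVs can be computed from the mean RTs of the two trial types as follows:

```r
# Illustrative only: hypothetical mean RTs (in ms) for the two trial types
congruent_rt   <- 520
incongruent_rt <- 580

interference_cost <- incongruent_rt - congruent_rt        # 60 ms
global_rt         <- (congruent_rt + incongruent_rt) / 2  # 550 ms
```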

Researchers have predicted bilingual advantages on both of these DVs. For example, Green’s (1998) prominent model of bilingual lexical production assumes that lexical items for each of a bilingual’s two languages compete for selection and that non-target lexical items are suppressed by a domain-general inhibition mechanism. Bialystok, Craik, Klein, and Viswanathan (2004) speculated that regular exercise of this mechanism would lead bilinguals to exhibit smaller interference effects (i.e., lower interference cost) than monolinguals. Costa et al. (2009) suggested that bilinguals might develop enhanced conflict-monitoring skills, leading to a general advantage in processing stimuli with a high degree of conflict and hence a bilingual advantage on global RT. Similarly, noting inconsistent evidence for a bilingual advantage on interference effects, Hilchey and Klein (2011) suggested that bilinguals might process conflict differently than monolinguals by developing separate processing pathways for conflicting and non-conflicting stimuli, which should give bilinguals an advantage on global RT. Importantly, these predictions are not necessarily mutually exclusive. Indeed, Bialystok (2017) recently sketched out a theory in which bilingualism enhances broader executive attention abilities, which presumably might lead to bilingual advantages on both interference cost and global RT across a broad range of tasks.

Many researchers have sought to synthesize this literature with an eye toward explaining inconsistencies in findings across studies. For example, Hilchey and Klein (2011) conducted a literature review, identified a consistent bilingual advantage on global RTs but not on interference costs, and developed the previously described theoretical account of how such an advantage might occur. However, in a follow-up review, Hilchey, Saint-Aubin, and Klein (2015) found inconsistent evidence of a bilingual advantage for both interference cost and global RT. Valian (2015), focusing on executive functioning (EF) more generally, argued that many cognitively challenging activities, including practicing music and exercising, involve and strengthen EF. She speculated that bilingual advantages on EF tasks are real, but that because they compete with advantages conferred by other challenging activities, they are difficult to detect in individual studies. If this were true, one promising approach would be to quantitatively synthesize many studies rather than to consider individual studies, since in any single study the effect of bilingualism may be overwhelmed by other factors. In addition to three older syntheses (Adesope et al., 2010; Costa et al., 2009; Hilchey & Klein, 2011), four recent reports (de Bruin et al., 2015b; Lehtonen et al., 2018; Paap et al., 2015; Paap et al., 2017a) have quantitatively synthesized the large database of bilingual advantage studies.

De Bruin et al. (2015b) conducted two analyses to determine whether publication bias may explain the inconsistent findings across studies exploring the bilingual advantage. First, they collected conference abstracts reporting comparisons between bilinguals and monolinguals on any cognitive task, and conducted a logistic regression to determine whether abstracts reporting an advantage were more likely to result in publication than those that did not. Results revealed that 63% of conference abstracts reporting a bilingual advantage were subsequently published, compared with only 36% of those reporting no bilingual advantage. Second, they conducted a meta-analysis on the conference submissions that had been published in a journal, which yielded a small-to-medium effect size (d = .30), with funnel plots indicating strong evidence of publication bias.

It is important to note that the de Bruin et al. (2015b) study was primarily an assessment of the presence of publication bias rather than an attempt to estimate an average effect size for the bilingual advantage. As noted by the authors, and by Bialystok, Kroll, Green, MacWhinney, and Craik (2015), publication bias is ubiquitous in psychology and its existence does not imply the absence of an effect. Notably, de Bruin et al. (2015b) included any study that contained a bilingual group, a monolingual group, and some sort of cognitive task, and thus did not specifically focus on EF.

Paap et al. (2015) adopted a different approach in their analysis of the relationship of bilingualism to EF. They surveyed the literature and summarized the results of all studies comparing monolinguals and bilinguals on interference-control and task-switching tasks. They relied on a vote-counting procedure, in which they coded each comparison as either supporting or failing to support a bilingual advantage according to whether it yielded a significant p-value. Across all measures, the proportion of comparisons yielding a significant bilingual advantage was low (the highest proportion of significant tests was ~.22). Moreover, they found that small-sample studies were more likely to produce bilingual advantages than were large-sample studies. They argued that the few significant comparisons might reflect questionable research practices endemic to psychology, confounds, and between-group differences on non-EF constructs measured by these tasks. Paap et al.’s (2015) contribution is important, especially because of its focus on questionable research practices and interpretive methods in psychology research in general, and bilingual advantage research in particular. These factors certainly account for some of the discrepant findings. However, their review does not definitively disprove the existence of a bilingual advantage. First, it is unlikely that the advantage exists for all bilingual individuals; looking solely at aggregate results might obscure effects for specific groups. Bilingual advantages may be restricted to certain age groups or to bilinguals who began learning their L2 before a certain age. Second, as noted by Linck (2015), the vote-counting procedure is potentially misleading as it conflates large and small effect sizes.

Two meta-analyses conducted contemporaneously with the present one, however, provide stronger evidence against a bilingual advantage. First, Paap et al. (2017a) averaged global RT and interference cost scores from 101 studies including Simon, Flanker, Stroop, or Attentional Network tasks. Using raw RT difference scores, rather than effect sizes, they analyzed small-sample studies (n < 40) and large-sample studies (n > 40) separately. Amongst small-sample studies they found bilingual advantages of 107 ms for interference costs and 114 ms for global RTs, both of which significantly differed from 0. However, amongst large-sample studies, they found bilingual advantages of only 6 ms for interference costs and 9 ms for global RTs. Paap et al. (2017a) interpreted these findings as providing, at best, weak evidence for a bilingual advantage.

Second, Lehtonen et al. (2018) conducted a large-scale meta-analysis of studies comparing bilingual and monolingual adults on a broad array of cognitive tasks, including scores derived from interference-control tasks, and considered several theoretically and methodologically significant moderators. They found a non-significant effect size for monitoring, which included global RT, and a very small significant effect size for inhibition, which included interference cost, that became non-significant after correcting for publication bias. Moreover, they found that task did not moderate effect sizes within either of the broader domains, suggesting that effect sizes on individual interference-control tasks were non-significant. Further, they found that age and age of L2 acquisition (AoA), both considered in the present analysis, did not significantly moderate effect sizes. The contribution of Lehtonen et al. (2018) is important in that it provides a broad review of different tasks, and yet it found only a small, tenuous bilingual advantage for inhibition that may be attributable to publication bias. However, a meta-analysis focused on a narrower array of tasks may be more likely to uncover group differences. Furthermore, Lehtonen et al. only examined studies with adults, but it is possible that any bilingual advantage may be more pronounced in childhood, when executive functions are still developing.

The current study

While these recent reports provide clearer evidence against a bilingual advantage, the debate is still heated (Bialystok, 2017). We conducted a series of meta-analyses using a multiverse approach (Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016). As is evident from various meta-analyses of research on cognitive effects of other experiential variables, such as videogame play, different meta-analyses on the same topic can reach different conclusions because of different decisions about inclusion criteria, coding, and analytic strategies (e.g., Bediou, Adams, Mayer, Tipton, Green, & Bavelier, 2018; Powers, Brooks, Aldrich, Palladino, & Alfieri, 2013; Sala, Tatlidil, & Gobet, 2018). It is therefore important to see how robust conclusions are over variations in these factors. Our multiverse analytic approach was inspired by the recommendations of Steegen et al. (2016), who note that in a given empirical study the dataset is often one of a large set of reasonable datasets; conducting the same analysis over the entire multiverse of datasets allows the analyst to determine whether their conclusions are specific to a particular dataset or analytic strategy. While Steegen et al. (2016) promoted the use of a multiverse analysis in the context of a single empirical study, it seems particularly relevant to meta-analysis, where decisions about coding and inclusion are often imperfect judgment calls. Moreover, the multiverse approach can be extended to alternative approaches to modeling dependencies between effect sizes (e.g., multiple tasks within the same study, multiple DVs from the same task), since there is likely no ideal model for the observed dependence.

Therefore, in the present analysis, we first conducted what we call our “preferred” analysis that utilized a broad definition of bilinguals but included only standard versions of interference-control tasks. We then conducted a multiverse analysis to determine how sensitive our conclusions were to decisions about inclusion criteria, coding of variables, and plausible analytic strategies. In the multiverse analysis, we varied the definition of bilingual group, task selection, coding of the age of L2 acquisition (AoA), and the treatment of outliers.

Potential moderators of effect sizes

Age

One moderator that has been considered extensively in the bilingual advantage literature is age. It is generally agreed that EF exhibits a complex developmental trajectory, developing slowly during childhood, peaking in early adulthood, and declining later in life (Zelazo & Lee, 2010). For example, Waszak, Li, and Hommel (2010) conducted a lifespan study of interference cost on the Flanker task and observed the smallest costs between the ages of 16 and 42 years. Interference cost decreased non-linearly over childhood and increased non-linearly after the age of 42 years. Bialystok, Martin, and Viswanathan (2005b) have argued that when EF is operating at peak efficiency, bilingual advantages might not be easily detected. If bilingualism accelerates the development of this system and/or ameliorates its decline, stronger advantages for bilinguals may be seen in children and older adults, relative to young adults.

When looking at individual studies, findings are mixed for all age groups. For example, while several studies report significant advantages for children (Engel de Abreu, Cruz-Santos, Tourinho, Martin, & Bialystok, 2012; Kapa & Colombo, 2013; Martin-Rhee & Bialystok, 2008; Poarch & Bialystok, 2015; Poarch & van Hell, 2012; Yang, Yang, & Lust, 2011), a few recent, large-scale studies have failed to observe a bilingual advantage (Antón et al., 2014; Duñabeitia et al., 2014; Gathercole et al., 2014). Findings are even more mixed among young adults, with some studies finding a bilingual advantage on interference-control tasks (e.g., Coderre, van Heuven, & Conklin, 2013; Costa, Hernández, & Sebastián-Gallés, 2008; Luk, Anderson, Craik, Grady, & Bialystok, 2010; Luk, de Sa, & Bialystok, 2011), and others failing to do so (Bialystok et al., 2004; Gathercole et al., 2014; Paap & Greenberg, 2013; Paap & Sawi, 2014). Results with older adults are also mixed. Bialystok et al. (2004) reported several studies finding a bilingual advantage for older adults on the Simon task, as did Salvatierra and Rosselli (2011), but neither Kirk, Fiala, Scott-Brown, and Kempe (2014) nor Gathercole et al. (2014) found one. In their recent meta-analysis, Lehtonen et al. (2018) found no effect of age, providing the clearest available evidence against a moderating effect of age; however, this analysis considered a narrower age range than the present study (only adults) and a broader array of tasks.

In testing for a moderating effect of age, at least three patterns of results might be expected. First, there could be a main effect of age, with children and perhaps older adults showing a larger effect of bilingualism on EF than young adults. Alternatively, if the bilingual advantage were specific to global RT or interference cost, there could be an interaction between age and the DV, with children and older adults differing from young adults on only one of the two measures. Or there could be larger effect sizes for children than young adults on one DV and larger effect sizes for older adults than young adults on the other DV. If this were the case, it would suggest that the two DVs reflect different EF constructs over the lifespan.

Task

Several interference-control tasks have been used to compare monolinguals and bilinguals. While these tasks are often viewed as interchangeable indicators of the same constructs, there are several reasons to doubt this claim. First, while factor-analytic studies have reported that many tasks of inhibition load onto a single factor (Miyake, Friedman, Emerson, Witzki, & Howerter, 2000), few studies have utilized multiple interference-control tasks (e.g., Simon, Flanker, and Stroop), and when they have, they have typically not used interference cost as the DV. For example, Miyake et al. (2000) found that the difference between incongruent and neutral trials (trials without distracters) on the Stroop task loaded on an inhibition factor (later re-named common EF; Miyake & Friedman, 2012), along with the stop-signal and anti-saccade tasks. In a subsequent study, Friedman and Miyake (2004) found that the difference in RT between congruent and neutral trials on the Flanker task loaded on a shared factor with the analogous measure of cost from the Stroop task; however, it is not clear whether interference cost would follow the same pattern. Second, as reviewed by Paap and Sawi (2014), the correlations between raw scores on these tasks are often quite low, especially for interference cost. For example, Paap and Greenberg (2013) reported a non-significant correlation between interference cost in the Simon and Flanker tasks (r = .01); see also Salthouse (2010). Strikingly, measures of interference cost across different versions of the Stroop task appear to be uncorrelated as well (Shilling, Chetwynd, & Rabbitt, 2002). Global RT tends to exhibit larger correlations; for example, Paap and Sawi (2014) report correlations of .60 between global RT from Simon and Flanker tasks. Given this pattern of correlations, it is sensible to question whether the Simon, Flanker, and Stroop tasks are indices of the same EF constructs. If the three tasks measure separable constructs, a bilingual advantage may emerge on only a single task. It is, therefore, possible that we may observe a main effect of task on effect sizes, or an interaction of task with the DV. However, as stated above, Lehtonen et al. (2018) did not find that task moderated bilingual advantages in their meta-analysis.

Age of L2 acquisition

Researchers have speculated that the effects of bilingualism on interference-control tasks might be influenced by the age at which the bilingual subject learned their L2; however, as with other potential moderators, effects have been mixed. Using the Attentional Network Test as a measure of EF, Pelham and Abrams (2014) reported a bilingual advantage of similar magnitude for Spanish-English bilinguals who acquired their L2 either in childhood or in adulthood, relative to monolinguals. In contrast, other studies suggest larger effects for bilinguals with early AoA of their L2. Luk et al. (2011) compared college-aged early bilinguals (average age of regular use of L2 = 5 years) to late bilinguals (average age of regular use of L2 = 15 years) and to monolinguals on the Flanker task. They found that early bilinguals exhibited significantly smaller interference cost than the other two groups, who did not differ significantly. Subsequent regression analyses found that AoA, treated as a continuous variable, was positively related to interference cost. Tao, Marzecová, Taft, Asanowicz, and Wodniecka (2011) compared college-aged monolinguals to early (average age of exposure to L2 = 3 years old) and late (average age of exposure to L2 = 8 years old) bilinguals on a lateralized Flanker task. Early bilinguals exhibited faster global RT and smaller interference cost than monolinguals, whereas late bilinguals only exhibited smaller interference cost relative to monolinguals. Kapa and Colombo (2013) compared 10-year-old monolinguals, early bilinguals (L2 learned before age 3 years), and late bilinguals (L2 learned after age 3 years) on the Flanker task. They found that, after controlling for age and English receptive vocabulary, early bilinguals outperformed both the late bilinguals and the monolinguals. Note, however, that the cut-off used by Kapa and Colombo (2013) to distinguish early and late AoA is notably earlier than the other studies (presumably due to the young age of the sample).

Taken together, these studies suggest that the effect of bilingualism may vary as a function of the AoA of the L2, with early bilinguals exhibiting a larger bilingual advantage than late bilinguals. Moreover, if the bilingual advantage is specific to a particular DV, there might be an interaction between AoA and DV, with AoA modulating the effect of bilingualism more strongly for one of the two DVs. Although Lehtonen et al. (2018) did not find AoA to moderate their findings, we revisited this issue here using a multiverse approach to coding AoA.

Research questions

The current meta-analysis explored the following research questions:

1) What is the magnitude of the effect of bilingualism on global RT and interference cost?

2) Is the bilingual advantage restricted to particular interference-control tasks, such as the Simon or Flanker tasks? If so, does task interact with the DV (global RT vs. interference cost)?

3) Does the bilingual advantage vary as a function of the age of participants, with smaller effects for young adult samples than for children or older adults? Does the effect of age interact with the DV?

4) Does the bilingual advantage vary as a function of the AoA of the L2, with larger effects for early bilinguals than for late bilinguals? If so, does this effect interact with the DV?

5) How robust are the results given the variety of plausible data sets and coding procedures?

Method

Literature search

In this section we describe our method for finding and selecting studies to be included in the analyses.

PsycINFO was searched periodically until December 2017. Search terms included a combination of bilingual or bilingualism with executive control, executive function, inhibition, or interference control. Reference sections of the relevant studies and review articles (e.g., de Bruin et al., 2015b; Paap et al., 2015; Lehtonen et al., 2018) were examined to identify additional studies for inclusion. Of the 286 studies consulted, 80 met the following criteria for inclusion (see Fig. 1 for more details):

1. Study included a bilingual group. We defined bilingualism in two different ways, to create two separate datasets for our multiverse analysis. First, under a broad definition, we treated any group designated as bilingual by the research team as bilingual. Second, under a narrow definition, we looked for groups who could reasonably be described as nearly balanced on the basis of reported proficiency or, when proficiency information was unavailable, reported usage.

    The first dataset, which used the broad definition, included all studies that had a group designated as bilingual; see Table 1 for a description of all datasets. This included lapsed and late bilinguals, but excluded groups designated as second language learners (excluding one study and two groups from included studies) and bidialectals (excluding one study and two groups from included studies). We favor this broad definition because we believe the researchers of the original studies were in the best position to define their groups as bilingual or monolingual, given their access to demographic information, their familiarity with the local sociolinguistic context, and their familiarity with the participants. We therefore used this dataset in our preferred analysis.

    The second dataset used a narrow definition of bilingual and included only studies with participants who were either of nearly balanced proficiency or, if proficiency information was unavailable, reported nearly balanced usage of the two languages. Studies were excluded if the authors did not provide information about their bilingual participants’ language proficiency or exposure. To identify groups of balanced bilinguals, we first looked at information about proficiency. If self-reported proficiency was reported for both L1 and L2, we calculated the ratio of the weaker to the stronger language. When proficiency was reported separately for speaking, listening, writing, and reading, we considered only speaking and listening, so as not to eliminate populations who had not been schooled in one of their languages. If only average proficiency across those four domains was reported, we used those scores. If proficiency was not reported for both languages, but was reported for the weaker language, we took the ratio of the weaker language to the maximum possible score. We preferred the ratio of reported L1 to L2 proficiency to this score, as the former accounts for response bias. If self-reported proficiency was not reported, but a linguistic measure was given in L1 and L2, we took the ratio of scores on these instruments. We preferred self-reported proficiency because it is a holistic measure of language proficiency, whereas many of the standardized tasks were specific to vocabulary. Several studies did not report proficiency but did report measures of usage of or exposure to each language. One subset of these studies reported Likert scales ranging from use of only one language to use of only the other language, with the middle value reflecting equal usage of the two languages. In studies with children that provided one item for the language used by the child and another item for the language spoken to the child, we averaged across these two items, but ignored items about siblings and media (radio, television, and books). In studies with adults that presented Likert scales for several time points (e.g., childhood, school age), we converted the Likert-scale scores into proportions and averaged across all time points.

    As an operational definition of balanced bilingual, we selected bilingual groups with a ratio of at least .66 for proficiency in their weaker language relative to their stronger language, or with usage of or exposure to each of the two languages between .33 and .66 (an illustrative sketch of this classification rule follows the list below). We also added bilingual groups from studies that did not report mean proficiency or mean exposure, but stated that all bilinguals exceeded a given value of proficiency or exposure, provided that value met our thresholds. Note that for some studies we combined a different set of bilingual groups in analyses using the broad versus narrow definitions of bilingualism; see the section below on Data preparation and effect size calculation.

2. Study included at least one monolingual group. For both the preferred analysis and the multiverse analysis, we treated a group as monolingual if it was designated as such by the research team.

3. Participants were at least 4.5 years old and without psychological impairment. We therefore excluded studies examining potential benefits of bilingualism as a protective factor in dementia (e.g., Bialystok, Craik, & Freedman, 2007).

4. Study contained RT data or efficiency scores (RT divided by accuracy) from at least one of five commonly used non-verbal interference-control tasks: Flanker, Simon, Simon Arrows, Numerical Stroop, and Attentional Network Test (ANT). All these tasks share the following structure: Participants are asked to make judgments about a visually presented target stimulus, and either an additional stimulus or an irrelevant dimension of the target stimulus can elicit a competing response. On incongruent trials the distracter (stimulus or dimension) elicits a response different from the target; on congruent trials the distracter (stimulus or dimension) elicits the same response as the target. We avoided tasks with explicitly verbal stimuli, such as the traditional Stroop and Verbal Flanker tasks, and tasks that involved a non-keypress response. In our preferred analysis, we considered only standard versions of these tasks, which did not, for example, contain unbalanced numbers of congruent and incongruent trials; see the section on Moderator coding for more details. However, in our multiverse analysis, we considered two datasets, one with standard versions only and one with all versions of the tasks; see Table 1 for a description of how many instances of each task type were included in each of the datasets.
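To make the narrow-definition thresholds from criterion 1 concrete, the following minimal R sketch applies them to a single group; the function and argument names are hypothetical, and the rule ordering mirrors the description above rather than our actual analysis scripts.

```r
# Hypothetical sketch of the balanced-bilingual classification used for the
# narrow definition. Proficiency enters as a weaker/stronger-language ratio;
# usage enters as the proportion of use of one of the two languages.
classify_balanced <- function(prof_l1 = NA, prof_l2 = NA, usage_prop = NA) {
  if (!is.na(prof_l1) && !is.na(prof_l2)) {
    ratio <- min(prof_l1, prof_l2) / max(prof_l1, prof_l2)
    return(ratio >= .66)
  }
  if (!is.na(usage_prop)) {
    return(usage_prop >= .33 && usage_prop <= .66)
  }
  NA  # insufficient information: the study would be excluded
}

classify_balanced(prof_l1 = 9, prof_l2 = 7)  # ratio = .78 -> TRUE (balanced)
classify_balanced(usage_prop = .25)          # unbalanced usage -> FALSE
```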

Fig. 1 PRISMA flow chart indicating the steps involved in determining the eligibility and inclusion of studies in the meta-analysis of the impact of bilingualism on interference-control tasks

Table 1 Number of comparisons for moderator variables in each of the four datasets used in the multiverse analysis. Note that the preferred analysis used a broad definition of bilingual and standard versions of tasks

Data preparation and effect size calculation

The 80 studies comprised 253 comparisons, each involving a bilingual group compared with a monolingual group on a single task; see Table 2 for a complete list of the comparisons with moderator values and effect sizes for global RT and interference cost. Hence, if a study reported on several separate bilingual and monolingual groups (e.g., varying in age), each of these groups was included as a separate comparison. Care was taken to minimize the number of statistically dependent comparisons while maximizing the number of available effect sizes, as follows:

Table 2 Effect sizes and moderators for all comparisons

First, if a study contained a single monolingual group and two bilingual groups and the bilingual groups did not differ on AoA of the L2 (see Moderator coding below), the two bilingual groups were averaged together. However, if the two bilingual groups differed in AoA and were compared to the same monolingual group, the N for the monolingual group was split in half across the two comparisons. For example, Luk et al. (2011) compared one monolingual group with early and late bilingual groups. Since the early and late bilingual groups differed on AoA, they were included as separate comparisons. The same monolingual mean was included in both comparisons, but the monolingual N for each comparison was half the N in the study. This was done because including the same control group in two effect sizes without correcting the N double counts those participants, thereby artificially inflating the number of observations (Higgins, Deeks, & Altman, 2008).
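As a sketch of this correction (with invented summary statistics), the shared monolingual group’s N is simply halved before each effect size is computed; metafor’s escalc() then yields the standardized mean difference and its sampling variance for each comparison:

```r
library(metafor)

# Hypothetical summary statistics: one monolingual group (n = 40) compared
# with early and late bilingual groups. The monolingual n is split in half
# across the two comparisons so the shared controls are not double counted.
dat <- data.frame(
  comparison = c("early_vs_mono", "late_vs_mono"),
  m_mono = c(600, 600), sd_mono = c(90, 90), n_mono = c(40 / 2, 40 / 2),
  m_bil  = c(560, 590), sd_bil  = c(85, 95), n_bil  = c(35, 32)
)

# Positive g = faster (smaller) mean RTs for bilinguals than monolinguals
es <- escalc(measure = "SMD",
             m1i = m_mono, sd1i = sd_mono, n1i = n_mono,
             m2i = m_bil,  sd2i = sd_bil,  n2i = n_bil,
             data = dat)
es[, c("comparison", "yi", "vi")]
```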

Second, if multiple tasks were given to the same sample, each task counted as a separate comparison. This amounts to assuming that scores on these tasks are independent. As described in the Introduction, interference costs tend to be uncorrelated, whereas global RT tends to correlate across interference-control tasks. Thus, the assumption of independence is violated for global RT from studies that included multiple interference-control tasks. Results should be interpreted with some caution because of this violation. However, if a study reported multiple blocks on the same task, and each block was reported separately, only the first block was included. This was done because averaging standard deviations across blocks would require assuming a correlation between consecutive blocks on the same task. Moreover, if a study included more than one condition of the same task (e.g., with varying working memory demands), only trials from the standard condition were chosen. The reason we did not include multiple blocks or conditions from the same task is that these are far more likely to be correlated than different tasks. Averaging them together would have required knowing the correlations between separate blocks or conditions of the same task.

Each comparison typically yielded two DVs, for global RT and interference cost, respectively. Recall that interference cost is calculated as the difference in RTs between incongruent and congruent trials, and global RT is calculated as the average RT across these two trial types. In some studies, these scores and their standard deviations were reported. However, in many cases, means and standard deviations (or standard errors) were reported for the congruent and incongruent trials, but not for global RT or interference cost. Because RTs for congruent and incongruent trials are correlated, calculating the standard deviation for global RT and interference cost from these scores requires the correlation between trial types. Two strategies were employed to estimate this correlation. When the means and standard deviations of RTs for congruent and incongruent trials and for interference cost were all reported, it was possible to solve algebraically for the correlation coefficient. However, in most cases correlation coefficients had to be stipulated. In order to do so, a dataset of correlation coefficients was created. First, authors who were e-mailed to provide information for an earlier version of this project (Donnelly, Brooks, & Homer, 2015) were asked to provide correlations between RTs for congruent and incongruent trials within the same task. We then conducted a meta-analysis on these correlations, and estimated an average correlation coefficient of .72, which was used for the remaining studies. In three studies, no standard deviations were provided. Standard deviations for these studies were estimated in the following manner: A linear regression predicting the standard deviation from the mean RT was fit to the available means and standard deviations, and then used to predict the standard deviations for the three studies based on the published means.
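The variance algebra underlying these reconstructions is standard, and the sketch below (with invented values and the stipulated correlation of .72) shows how the standard deviations of global RT and interference cost follow from the congruent- and incongruent-trial statistics, as well as how the correlation can be recovered algebraically when the standard deviation of the interference cost is reported:

```r
# Hypothetical condition means/SDs; r = .72 is the meta-analytically estimated
# correlation between congruent and incongruent RTs described above.
m_con <- 520; sd_con <- 80
m_inc <- 580; sd_inc <- 95
r     <- .72

# Interference cost = incongruent - congruent
m_cost  <- m_inc - m_con
sd_cost <- sqrt(sd_con^2 + sd_inc^2 - 2 * r * sd_con * sd_inc)

# Global RT = mean of the two trial types
m_global  <- (m_con + m_inc) / 2
sd_global <- sqrt(sd_con^2 + sd_inc^2 + 2 * r * sd_con * sd_inc) / 2

# When the SD of the interference cost is reported, the same identity can
# instead be solved for the implied correlation:
r_implied <- (sd_con^2 + sd_inc^2 - sd_cost^2) / (2 * sd_con * sd_inc)
```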

Moderator coding

In this section we describe how the moderator variables were coded. The numbers presented below represent the dataset utilizing the broad definition of bilingual; see Table 1 for numbers of comparisons included in each of the four datasets used in the multiverse analysis.

Task

Task was coded as a factor variable with four levels: Flanker, Simon, ANT, and Stroop-like. Task was further coded as standard or non-standard (see below). Flanker tasks included the traditional and alternating-position Flanker tasks. Standard versions of the Flanker task presented the distracters on the left and right sides of the target arrow and did not include additional EF manipulations within the same block (k = 22). However, in the expanded dataset, we included all versions of the Flanker task, including those with go/no-go trials within a block and unconventional placements of distracter arrows (k = 24). The Simon task included the traditional Simon task and a version used by Blumenfeld and Marian (2014), in which arrows rather than colored squares were used as the stimuli, but, unlike in the Simon Arrows task, participants had to remember an arbitrary mapping between upward- and downward-pointing arrows and left and right key presses. Standard versions of the task included equal numbers of congruent and incongruent trials and did not explicitly manipulate the number of switch trials (k = 54; if the numbers of congruent and incongruent trials were not explicitly mentioned, they were assumed to be equal). In the expanded set, we included versions of the task with switch manipulations and unbalanced numbers of congruent and incongruent trials (k = 59). The ANT and lateralized ANT were coded as the ANT, with the lateralized ANT coded as a non-standard version of the task. In general, ANT data were presented in one of two formats, with data for each cueing condition presented either separately or aggregated across the cueing conditions. When each cueing condition was presented separately, we used the no-cue condition to calculate effect sizes. When data were aggregated across cueing conditions, we used the aggregated data. We coded data from the no-cue condition as the standard version of the task (k = 11), but included both sources of data in our expanded dataset (k = 29) because the aggregated scores were also influenced by the dynamics of attentional cueing. We coded both the Numerical Stroop and the Simon Arrows (or Spatial Stroop) tasks as Stroop-like because there were relatively few instances of each, and because, unlike the Simon task, the Simon Arrows task does not have arbitrary stimulus-response mappings and, therefore, may be more Stroop-like. We treated versions with equal numbers of congruent and incongruent trials as standard (k = 12). In our expanded dataset, we additionally included versions with unbalanced numbers of congruent and incongruent trials, and one study that randomly interspersed mind-wandering probes between trials (k = 20).

Age

Age was coded categorically with the prediction that larger effect sizes would be observed in children and older adults than in adults. Hence, we coded participants younger than 13 years as children, participants aged 18–40 years as younger adults, and participants over 60 years as older adults. Thirteen comparisons were excluded from the analysis of age: in 12 cases the participants’ ages were outside these three categories and in one case RTs were aggregated across multiple age groups in the original study (Blumenfeld, Schroeder, Bobb, Freeman, & Marian, 2016). Note that these 13 comparisons were included in tests of other moderators.

AoA of the L2

Sabourin, Brien, and Burkholder (2014) found empirical evidence of differences in lexical organization between participants who began learning their L2 before versus after age 7 years; thus, for the purpose of defining early versus late bilinguals, we aimed to use this age as a cut-off. However, because the studies varied widely in what information was reported, there was no single rule that could be applied to every study. For our multiverse analysis, we coded AoA of the L2 according to two schemes: lax and strict.

In applying our lax scheme, we considered any relevant information that was reported and applied a set of rules in a fixed order to determine AoA. First, if participants’ mean self-reported age of L2 acquisition, immersion, or immigration, or the mean number of years of L2 exposure, was reported, this number was used (k = 36). Next, if the maximum age of acquisition was reported, this was used (k = 16). Next, if the authors stated that participants used both languages in school, participants were treated as early AoA (k = 9). Next, if participants were described as early or late bilinguals, they were coded as such (k = 12). Finally, if participants were 6 years of age or younger, they were coded as early AoA (k = 12; i.e., all studies with participants 6 years of age or younger). The lax coding scheme resulted in 60 early and 25 late AoA comparisons. We used this definition of AoA in our preferred analysis.

According to our strict scheme, we calculated AoA only if one of four conditions was met. First, if participants’ mean self-reported age of L2 acquisition, immersion, or immigration, or the mean number of years of L2 exposure, was reported, this number was used (k = 33). Second, if the paper reported the latest AoA of its bilingual participants, and this age was below 7 years, this number was used (k = 8). Third, if the paper reported that participants learned both languages in school, and reported the age at which both languages were used in the curriculum, we used this age (k = 6); note that this could overestimate AoA, but it did not lead to any late AoA categorizations. Fourth, if participants were 6 years of age or younger, they were coded as early AoA (k = 12). The strict coding scheme resulted in 44 early and 15 late AoA comparisons.
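As an illustration of how the lax coding cascade operates, the sketch below applies the rules in the fixed order described above; the function and field names are hypothetical, and the age-7 cut-off follows Sabourin et al. (2014):

```r
# Hypothetical sketch of the lax AoA coding cascade; the first applicable
# rule determines the coding ("early" vs. "late").
code_aoa_lax <- function(mean_aoa = NA, max_aoa = NA, school_l2 = FALSE,
                         author_label = NA, sample_age = NA, cutoff = 7) {
  if (!is.na(mean_aoa))  return(if (mean_aoa < cutoff) "early" else "late")
  if (!is.na(max_aoa))   return(if (max_aoa  < cutoff) "early" else "late")
  if (school_l2)         return("early")          # both languages used in school
  if (!is.na(author_label)) return(author_label)  # authors' own "early"/"late"
  if (!is.na(sample_age) && sample_age <= 6) return("early")
  NA  # AoA could not be coded
}

code_aoa_lax(mean_aoa = 10)    # "late"
code_aoa_lax(sample_age = 5)   # "early"
```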

Analytic strategy

In both our preferred and multiverse analyses, we first fit models to estimate average effect sizes for global RTs and interference costs. We then tested each moderator separately. To do so, we included both its main effect and its interaction with DV, and conducted an omnibus test to determine whether they jointly improved fit.

As global RT and interference cost are DVs taken from the same studies, they are likely to be correlated at the level of both the true effects and the sampling errors. The ideal way to model these effect sizes would be to conduct a multivariate meta-analysis. However, such models require that the analyst knows or assumes a value for the correlation between the two variables within a study. As these correlations are generally not reported, we fit four separate multivariate meta-analyses. First, based on available unpublished data, we calculated the correlation between global RTs and interference costs in Flanker, Simon, and Stroop tasks and obtained an average correlation coefficient of .14. In our preferred analysis, we stipulated this as the correlation between the two DVs. Second, in our multiverse analysis we considered the unlikely possibility that the DVs were uncorrelated, and conducted a multivariate meta-analysis with a within-comparison correlation of 0. Third, we considered the possibility that the true within-comparison correlation was roughly double the estimated value and assumed a within-comparison correlation of .30. Fourth, a helpful reviewer pointed out that cluster-robust variance estimation could be used to account for the unknown correlation. In all of these models, in addition to estimating random effects by comparison, we also estimated random effects by publication. Another possible way of analyzing these data is to conduct a multilevel univariate meta-analysis with cluster-robust variance estimation within comparisons (Hedges, Tipton, & Johnson, 2010). We considered this fifth model in our multiverse analysis as well. All analyses were conducted using the metafor package in R (Viechtbauer, 2010); the code is available in the Online Supplemental Materials.
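In skeleton form, the multivariate models can be specified with metafor roughly as follows; the data-frame columns (yi, vi, dv, comparison, paper) are hypothetical placeholders, and the stipulated within-comparison correlation enters through the block-diagonal sampling covariance matrix V. This is a sketch of the modeling approach, not the exact code used in our analyses (which is available in the Online Supplemental Materials).

```r
library(metafor)

# 'dat' is assumed to hold one row per effect size, with columns yi (Hedges' g),
# vi (sampling variance), dv ("global" or "cost"), comparison, and paper.
rho <- .14  # stipulated correlation between the sampling errors of the two DVs
dat <- dat[order(dat$comparison), ]

# Block-diagonal sampling covariance matrix: one block per comparison
V <- bldiag(lapply(split(dat, dat$comparison), function(d) {
  S <- diag(d$vi, nrow = nrow(d))
  if (nrow(d) == 2) S[1, 2] <- S[2, 1] <- rho * sqrt(d$vi[1] * d$vi[2])
  S
}))

# Multivariate model: unstructured random effects for the two DVs within
# comparisons, plus random effects by publication
m_mv <- rma.mv(yi, V, mods = ~ dv - 1,
               random = list(~ dv | comparison, ~ 1 | paper),
               struct = "UN", data = dat)

# Alternative: univariate multilevel model with cluster-robust inference
m_crve <- robust(rma.mv(yi, vi, mods = ~ dv - 1,
                        random = ~ 1 | paper/comparison, data = dat),
                 cluster = dat$paper)
```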

Results

We report the results in two parts. First, we report in detail the results of our preferred analysis, which utilized a broad definition of bilingual but included only standard versions of interference-control tasks. Second, we present a multiverse analysis, examining how inferences about the overall effect sizes and moderators are impacted by different plausible decisions for the construction and analysis of the dataset. A preliminary model was a multivariate meta-analysis allowing effect sizes to vary by comparison and by paper. However, this yielded a highly skewed distribution of internally standardized residuals with several large positive values (seven standardized residuals between 2.63 and 7.98). Ideally we would have calculated externally standardized residuals, but at this point metafor only calculates internally standardized residuals for multivariate models. As a result, these residuals are likely underestimates. Five of these large standardized residuals came from a landmark study that reported extremely large effect sizes (Bialystok et al., 2004). The question of what to do with outlying effect sizes is difficult. On the one hand, these effect sizes likely violate the assumption of normally distributed random effects and exert an outsized influence on the overall effect size estimates. On the other hand, this is a landmark study in the field and does not contain obvious methodological shortcomings that would justify excluding it entirely. We therefore decided to exclude this study as an “outlier” in our preferred analysis, but included a dimension in our multiverse analysis to examine how the results are influenced by the inclusion of this study.

Part 1: Preferred analysis

To examine the overall effect sizes, a multivariate random-effects meta-analysis with unstructured random effects by comparison and paper was fit to the subset of the data containing the studies with broadly defined bilinguals, standard versions of tasks, and without the study identified above as an outlier. For analyses involving age of L2 acquisition, the preferred analysis used the lax coding of AoA, as this included more comparisons, especially of bilinguals with late AoA. The preferred model revealed a very small but significant effect of bilingualism on global RT (g = .13, Z = 2.51, p = .012, CI = .03; .23), and a very small but significant effect of bilingualism on interference cost (g = .11, Z = 2.71, p = .007, CI = .03; .18). The model also revealed a significant amount of residual heterogeneity, Q(179) = 382.00, p < .001. To determine the extent of publication bias on these effect sizes, Egger’s regression tests, predicting effect sizes from standard errors, were conducted on global RT and interference cost separately. Note that this approach does not account for the dependence between sampling errors and true effect sizes in these studies and, therefore, should only be taken as a heuristic. There was significant funnel plot asymmetry for global RT (Z = 3.23, p = .001), but not for interference cost (Z = .51, p = .607). In order to assess the influence of publication bias, we followed Lehtonen et al. (2018) and used the PEESE method to correct the effect sizes. This method entails predicting effect sizes from their variances in a weighted least squares regression (weighted by standard errors). The intercept in these models can be conceived of as the zero-variance effect size. Because the test above does not account for the dependence between effect sizes, we applied the PEESE correction to both global RT and interference cost, though we were more concerned about the funnel plot asymmetry for global RT. The PEESE method revealed a corrected global RT effect size that was negative and did not significantly differ from 0 (g = –.17, t(86) = –1.86, p = .067, CI = –.34; .01). Consistent with the results of Egger’s regression test, there was also a significant positive relationship between each effect size and its variance (b = 3.59, t(86) = 3.26, p = .002, CI = 1.40; 5.78). For interference cost, the corrected effect size was of similar magnitude to the uncorrected one (g = .08, t(91) = 1.17, p = .25, CI = –.05; .21) but was no longer significant. The relationship between each effect size and its variance was also not significant (b = .32, t(91) = .39, p = .699, CI = –1.33; 1.64), which is consistent with the results of Egger’s regression test above. Importantly, neither Egger’s regression test nor the PEESE correction accounts for the dependence between observations, and both should therefore be considered very rough approximations of the unbiased effect sizes. Because of the evidence of publication bias for global RT, we report PEESE-corrected effect sizes for all subsequent analyses, as well as slopes of the relationship between variance and effect size, in order to determine whether publication bias is likely in a given analysis.
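As a rough sketch of these bias diagnostics (again treating the effect sizes as independent, as cautioned above), an Egger-type asymmetry test and the PEESE correction can be run separately for each DV along the following lines; 'dat_rt' is a hypothetical data frame holding only the global RT effect sizes, and the inverse-variance weights shown are the conventional choice rather than necessarily the exact weighting used here:

```r
library(metafor)

# 'dat_rt' is a hypothetical subset with one global RT effect size (yi) and
# sampling variance (vi) per comparison; the same code applies to the
# interference cost subset. Dependence between the DVs is ignored, so these
# are heuristic checks only.
res_rt <- rma(yi, vi, data = dat_rt)

# Egger-type test of funnel plot asymmetry: effect sizes regressed on SEs
regtest(res_rt, predictor = "sei")

# PEESE correction: weighted least squares regression of effect sizes on
# their variances; the intercept estimates the "zero-variance" effect size.
peese <- lm(yi ~ vi, data = dat_rt, weights = 1 / vi)
summary(peese)  # intercept = bias-corrected effect size; slope = bias signal
```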

To examine the effects of the three moderator variables (age group, task, and AoA), three models were fit testing for the main effects of each moderator as well as their interaction with DV (global RT or interference cost). The effects of these coefficients were then examined using omnibus tests. Including age group and its interaction with the DV did not significantly improve model fit (Q(4) = 1.55, p = .818), nor did including task and its interaction with the DV (Q(6) = 5.49, p = .483). However, including lax-coded AoA as a dummy variable (0 for early AoA and 1 for late AoA) and its interaction with the DV did significantly improve fit (Q(2) = 6.52, p = .038). This effect appears to be due to a significant interaction between AoA and DV (b = .31, Z = 2.52, p = .012, CI = .07; .55). In order to understand this interaction, we conducted separate multilevel univariate meta-analyses on global RT and interference cost with AoA as a moderator. For global RT, there was a very small, significant effect size for early AoA comparisons (g = .15, Z = 2.48, p = .013, CI = .03; .26) and AoA did not significantly moderate effect sizes (b = -.05, Z = -.06, p = .952, CI = -.16; .16). For interference cost, there was a very small, non-significant effect size for early AoA comparisons (g = .09, Z = 1.85, p = .064, CI = -.01; .18). Effect sizes were significantly larger in comparisons with late as opposed to early bilinguals (b = .18, Z = 2.64, p = .008, CI = .05; .32). In order to determine whether effect sizes from the early and late AoA comparisons were significantly different from 0, we conducted multivariate meta-analyses for early and late AoA comparisons separately. Amongst early AoA comparisons, there was a very small but significant effect size for global RT (g = .14, Z = 2.89, p = .022, CI = .02; .26) and a non-significant effect size for interference cost (g = .07, Z = 1.47, p = .142, CI = -.02; .15). When the PEESE method was applied to global RTs, the resulting effect size was non-significant (g = –.15, t(49) = -1.3, p = .132, CI = –.36; .07) and the relationship between effect size variances and effect size magnitudes was significant (b = 3.68, t(49) = 2.69, p = .010, CI = .94; 6.42). Amongst late AoA comparisons, there was a non-significant effect for global RT (g = -.02, Z = -.27, p = .789, CI = –.16; .13), and a small but significant effect size for interference cost (g = .25, Z = 3.29, p = .001, CI = .10; .40). When the PEESE method was applied to the effect size for interference cost for late AoA comparisons, the resulting effect size was no longer significant (g = .14, t(23) = .81, p = .43, CI = -.22; .50). However, there was no evidence of publication bias (b = .81, t(23) = .40, p = .694, CI = –3.39; 5.01), so this non-significant corrected effect size may reflect the relatively small number of effect sizes included in this analysis (k = 25).

Part 2: Multiverse analysis

To examine the sensitivity of our conclusions to plausible analysis and design decisions, we conducted sets of analyses influenced by Steegen et al.’s (2016) multiverse analysis approach. More specifically, we addressed each research question between 40 and 80 times, making different decisions about the construction of the dataset and the analysis. We considered four datasets with varying inclusion criteria, defined by broad versus narrow definitions of bilingual and exclusion/inclusion of non-standard versions of interference-control tasks (see Table 1). We considered five statistical models: four multivariate meta-analyses handling the within-comparison correlation in different ways and one univariate multilevel meta-analysis with cluster-robust variance estimation. Model 1 used an estimate of .14 for the correlation between the sampling errors for global RT and interference cost; Model 2 used an estimate of 0 for the correlation; Model 3 used an estimate of .30 for the correlation; Model 4 used cluster-robust variance estimation; and Model 5 was the multilevel meta-analysis with robust variance estimation. We compared these methods with and without the inclusion of the Bialystok et al. (2004) study that was identified as an outlier in our preferred analysis. Moreover, for analysis of the moderator AoA of the L2, we considered both strict and lax coding of AoA.
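In skeleton form, the multiverse can be organized as a grid of dataset and model choices that is looped over, refitting the same specification in each cell; build_dataset() and fit_model() below are hypothetical helpers standing in for the dataset-construction and model-fitting code described above:

```r
# Skeleton of the multiverse loop; build_dataset() and fit_model() are
# hypothetical stand-ins for the actual dataset-construction and
# model-fitting code.
grid <- expand.grid(
  bilingual_def = c("broad", "narrow"),
  task_set      = c("standard", "expanded"),
  outliers      = c("excluded", "included"),
  model         = c("mv_rho14", "mv_rho0", "mv_rho30", "mv_crve", "uni_crve"),
  stringsAsFactors = FALSE
)

results <- lapply(seq_len(nrow(grid)), function(i) {
  cell <- grid[i, ]
  d    <- build_dataset(bilingual_def = cell$bilingual_def,
                        task_set      = cell$task_set,
                        outliers      = cell$outliers)
  fit_model(d, model = cell$model)  # returns effect sizes and moderator tests
})
```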

Tables 3 and 4 show the effect size estimates for global RT and interference cost, respectively, for each of the analyses in the multiverse. As shown in Table 3, global RT effect sizes ranged between .07 and .17 when the outlier effect sizes were excluded and between .13 and .20 when the outliers were included. In general, effect sizes were somewhat smaller in the univariate multilevel meta-analyses than in the multivariate meta-analyses. As shown in Table 4, the interference cost effect sizes ranged between .10 and .12 when outliers were excluded, and between .16 and .20 when outliers were included. Tables 5, 6, and 7 show the p-values for analyses examining the effects of the three moderators (age group, task, AoA). All models include the main effects of the moderators and their interactions with the DV. As shown in Table 5, age did not significantly moderate effect sizes in any cells of the multiverse. As shown in Table 6, task also did not significantly moderate effect sizes in any cells of the multiverse. However, as shown in Table 7, when using lax coding of AoA, AoA tended to significantly moderate effect sizes when outliers were excluded; when using strict coding of AoA, AoA tended to significantly moderate effect sizes only under a broad definition of bilingual with the expanded task set.

Table 3 Average global reaction times (RTs) for different possible analyses
Table 4 Average interference cost for different possible analyses
Table 5 P-values for age across different possible analyses
Table 6 P-values for task across different possible analyses
Table 7 P-values for AoA across different possible analyses

Two results from the multiverse analysis warrant follow-up. First, when outliers were included, effect sizes increased. In order to determine whether these larger effect sizes were affected by publication bias, we applied the PEESE correction to the global RT and interference cost effect sizes with outliers included. Because these outliers were large effect sizes with large standard errors, they strongly influenced the regression line and resulted in publication-bias-corrected effect sizes that seemed implausible because the coefficients were negative (indicative of better performance for monolinguals). The corrected effect sizes were significant for both global RT (g = –.35, t(91) = –4.52, p < .001, CI = –.50; –.19; b = 6.49, t(91) = 8.73, p < .001, CI = 5.01; 7.97) and interference cost (g = –.24, t(96) = –3.53, p = .001, CI = –.36; –.09; b = 5.13, t(96) = 7.45, p < .001, CI = 3.76; 6.49).

Second, as AoA interacted with the DV in the preferred analysis (with AoA coded using the lax scheme), it was necessary to see whether the effect was similar when AoA was coded using the strict scheme. We therefore re-ran the models with strict coding of AoA. The models used a broad definition of “bilingual” and both standard and non-standard tasks without outliers, and assumed a correlation of .14 between the sampling errors for global RT and interference cost. As in the previous analysis with lax coding, including AoA and its interaction with DV significantly improved model fit (Q(2) = 6.42, p = .040), with a significant interaction between AoA and DV (b = .23, Z = 2.09, p = .037, CI = .01; .45). Again, we ran separate analyses on global RT and interference cost. For global RTs, there was a non-significant effect size for early AoA comparisons (g = .11, Z = 1.72, p = .086, CI = -.02; .23) and a non-significant effect of AoA (b = .05, Z = .50, p = .61, CI = –.14; .23). For interference costs, there was a non-significant effect size for early AoA comparisons (g = .04, Z = 1.14, p = .252, CI = -.03; .12) and a significant effect of AoA (b = .22, Z = 3.35, p < .001, CI = .09; .35). To see whether late AoA effect sizes differed from 0, we conducted a multivariate meta-analysis on the late AoA comparisons, which yielded a non-significant effect for global RT (g = .08, Z = .80, p = .43, CI = –.12; .28) and a small but significant effect for interference cost (g = .23, Z = 2.72, p = .007, CI = .06; .40). When the PEESE method was applied to the effect sizes for interference cost for the late AoA comparisons, the corrected effect size was smaller and non-significant (g = .12, t(20) = .56, p = .583, CI = –.32; .55). However, since there was no evidence of publication bias (b = 1.57, t(20) = .65, p = .229, CI = –3.50; 6.6), and this analysis was based on very few effect sizes (k = 22), we are cautious about interpreting the corrected effect size.

Discussion

The purpose of this meta-analytic study was to determine under which circumstances bilinguals outperform monolinguals on interference-control tasks. In particular, we sought to estimate the average effect sizes for global RT and interference cost, two DVs associated with interference-control tasks, and determine whether effect sizes were moderated by age, task, and age of L2 acquisition. After conducting our preferred analysis, which utilized a broad definition of bilingual and only standard versions of interference-control tasks, we conducted a multiverse analysis to determine how sensitive our conclusions were to plausible constructions of the dataset and analyses. We examined how the results were affected by: (1) strictness of the criteria used to define bilinguals, (2) whether both standard and non-standard versions of interference-control task were included, (3) how strictly age of L2 acquisition was coded, (4) how the dependence between effect sizes was modeled, and (5) inclusion of effect sizes from two studies that were identified as outliers (Bialystok et al., 2004; Martin-Rhee & Bialystok, 2008). Results were largely consistent across all of the dataset construction and design possibilities.

Preferred analysis

Our preferred meta-analysis revealed very small but statistically significant effect sizes for both global RT and interference cost (g = .13 and g = .11, respectively). However, once corrected for publication bias, only the effect size for interference cost remained significant. The overall effect sizes observed were of similar magnitude to the monitoring and inhibition effect sizes reported by Lehtonen et al. (2018) in a recent meta-analysis of evidence for the bilingual advantage. Lehtonen et al.’s (2018) meta-analysis differed from the present study in several important ways. First, their meta-analysis only considered studies involving adult bilinguals and encompassed a wider range of tasks than the current analysis. Second, they used a different approach to model the dependence between observations. They conducted univariate multilevel meta-analyses of each domain separately, and included four levels of random effects. Third, when descriptive statistics for interference cost were not available, they used either the difference between incongruent and neutral trials or average RT on incongruent trials, whereas we estimated the relevant standard deviations by assuming a correlation of .72 between congruent and incongruent trials.

Our estimates of the overall effect size for the bilingual advantage confirm the small magnitude of effects reported in a recent meta-analysis conducted by Paap et al. (2017a). When Paap et al. considered only large-sample studies, they found an average bilingual advantage of 9 ms for global RT and 6 ms for interference cost. Paap et al.’s (2017a, b) analysis differs substantially from our own: they analyzed raw RTs rather than effect sizes, and they summarized the data with means weighted by sample size rather than with a formal meta-analysis, which weights effect sizes by the inverse of their sampling variances.
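The difference between the two summary approaches can be illustrated with a short sketch. All values and function names below are hypothetical, and the inverse-variance estimator shown is a simple fixed-effect version rather than the multivariate random-effects models used in our analyses.

import numpy as np

def sample_size_weighted_mean(effects, ns):
    """Average raw effects (e.g., RT differences in ms), weighting each study by its sample size."""
    effects, ns = np.asarray(effects, float), np.asarray(ns, float)
    return np.sum(ns * effects) / np.sum(ns)

def inverse_variance_weighted_mean(effects, variances):
    """Fixed-effect meta-analytic estimate: weight each effect size by 1 / sampling variance."""
    effects, v = np.asarray(effects, float), np.asarray(variances, float)
    w = 1.0 / v
    estimate = np.sum(w * effects) / np.sum(w)
    standard_error = np.sqrt(1.0 / np.sum(w))
    return estimate, standard_error

# Hypothetical inputs: raw RT advantages (ms) with sample sizes for the first
# approach, and standardized effect sizes with sampling variances for the second.
print(sample_size_weighted_mean([12.0, 5.0, 8.0], ns=[40, 200, 120]))
print(inverse_variance_weighted_mean([0.30, 0.05, 0.12], variances=[0.09, 0.01, 0.02]))

Sample-size weighting uses n as a rough proxy for precision, whereas inverse-variance weighting uses the estimated sampling variance directly and underlies the standard errors and confidence intervals reported by formal meta-analysis.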

Lehtonen et al. (2018) identified publication bias in their dataset and, once this was corrected for, effect sizes for monitoring and inhibition were indistinguishable from 0. Moreover, Paap et al. (2017a, b) found that effects of bilingualism were much larger for small-sample than for large-sample studies on both global RTs and interference costs. In contrast, our analysis revealed publication bias for global RT but not for interference cost. However, our method for testing publication bias did not account for the dependence between global RT and interference cost. When PEESE corrections were applied to each DV separately, the effects of bilingualism were not significant for either DV. Given this complex pattern of results, one might conclude either that there is no overall bilingual advantage on interference-control tasks, or that there is a very small bilingual advantage on interference cost but not on global RT.
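For readers unfamiliar with the procedure, PEESE is generally specified as a weighted least-squares meta-regression of each effect size on its sampling variance,

g_i = \beta_0 + \beta_1\, SE_i^2 + \varepsilon_i, \qquad w_i = 1/SE_i^2,

where the intercept \beta_0 estimates the effect for a hypothetical study with no sampling error (the bias-corrected estimate) and \beta_1 captures the small-study relationship between effect size and precision. This is the generic form of the method; as noted above, our implementation applied it to each DV separately and therefore ignored the dependence between them.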

Effect sizes were not moderated by age in our preferred analysis or in any cells of the multiverse analysis. This result is consistent with the findings of Lehtonen et al. (2018), who did not observe differences between younger and older adults when age was coded categorically or continuously, nor did they observe effects of age when each EF construct was considered separately. It is also consistent with the qualitative literature reviews conducted by Paap et al. (2015) and Hilchey, Saint-Aubin, and Klein (2015), who did not find evidence that age moderated effects of bilingualism. Such findings may be viewed as surprising, as it has been suggested that the influence of bilingualism on interference control should be more apparent during childhood and late adulthood than during young adulthood, when EF processes are expected to be at their peak (Bialystok, Martin, & Viswanathan, 2005b; Bialystok, 2017). Given the variability in performance on the Flanker task across ages (Waszak et al., 2010), it was conceivable that the effect of bilingualism would vary across age groups. However, de Bruin and Della Sala (2017) have shown that the pattern of age-related decline is not identical across interference costs from different tasks. It is therefore possible that age effects would be evident on some tasks but masked by null effects on others. Testing such a hypothesis, however, would require a model with a 2 (DV) × 3 (Age Group) × 4 (Task) interaction, for which our dataset is too small.

We did not observe an effect of task in our preferred analysis or in any cells of our multiverse analysis. This result is consistent with the analysis of Lehtonen et al. (2018), who found that task did not significantly moderate effect sizes within either their inhibition or monitoring domains, which closely map onto interference cost and global RT, respectively. Bialystok’s (2017) framework, which locates the bilingual advantage in general executive attention abilities rather than in specific executive functions, seems consistent with the lack of a task effect, but is difficult to reconcile with the very small and non-significant overall effect sizes.

AoA significantly moderated effect sizes in our preferred analysis, an effect that appeared to be driven by a significant interaction between AoA and the DV. We had expected early bilinguals to show larger advantages than late bilinguals on EF, and suggested that this AoA effect might interact with the DV. However, our results revealed a different pattern. While global RT was unaffected by AoA, the effect sizes for interference cost were larger for late than for early AoA comparisons. When late AoA comparisons were considered alone, there was a non-significant effect size for global RT but a small-to-medium effect size for interference cost. While the PEESE correction suggested this effect size was not different from 0, there was no evidence of publication bias, so the lack of significance may have reflected the relatively small number of observations in that analysis. Note that Lehtonen et al. (2018) also considered AoA as a moderator in their meta-analysis, but found that it did not moderate effect sizes.

While the direction of the effect of AoA on interference cost was unanticipated, it appears consistent with some existing theorizing: it has been argued that second languages learned late might incur more interference than those learned early, thereby leading to larger bilingual advantages on measures of inhibition (Bak, Vega-Mendoza, & Sorace, 2014). However, it is difficult to disentangle the direction of causality here, as it is also plausible that individuals with better inhibition abilities are more effective at learning a second language later in life. Furthermore, this account seems to run counter to regression studies showing that, when treated as a continuous variable, AoA is negatively associated with interference cost (Luk, De Sa, & Bialystok, 2011) or unrelated to interference cost (Paap & Sawi, 2014). However, given our results and the plausible theoretical account, more research teasing out the relationship between these variables seems warranted.

Multiverse analysis

The average effect sizes observed for global RT and interference cost were generally insensitive to variation in the modeling approach or dataset, although they were smaller under the multilevel model with robust variance estimation than under the multivariate meta-analysis models. Moreover, our results are likely anti-conservative, as we did not correct for multiple comparisons.
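Although the exact specifications varied across cells of the multiverse, the multilevel models shared a common general structure, which can be sketched schematically as

g_{ijk} = \mu + u_i + v_{ij} + \varepsilon_{ijk}, \qquad u_i \sim N(0, \tau^2_{\mathrm{paper}}), \quad v_{ij} \sim N(0, \tau^2_{\mathrm{comparison}}),

where g_{ijk} is the kth effect size from comparison j in paper i, \mu is the average effect, and the sampling error \varepsilon_{ijk} has a variance treated as known. Robust variance estimation then protects the standard errors against misspecification of the dependence among effect sizes from the same paper. This is a schematic rather than the exact parameterization used in any single cell of the analysis.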

In the preferred analysis we excluded one study because its exceptionally large effect sizes appeared to be outliers (i.e., g’s of up to 6.79). Including these outlying effect sizes strongly influenced the average effect, in some cases by large magnitudes. Although it is a landmark study without obvious methodological concerns, we are cautious about interpreting the models that included the outliers for two reasons. First, including the study produced a skewed pattern of residuals, which likely reflects a violation of the assumptions of normally distributed random effects (at the levels of paper and comparison) and sampling errors. Second, including this study in the PEESE analyses greatly influences inferences about publication bias, suggesting that the true publication-bias-free estimates should be significantly lower than 0 (i.e., favoring monolinguals), which is implausible. We therefore consider the effect size estimates from the models without outliers to be more accurate.

Overall, the results of this meta-analysis provide only very weak evidence for a bilingual advantage on interference-control tasks, and little evidence for an advantage on global RT in particular. For interference cost, there may be a small advantage for late bilinguals. Future research is necessary to replicate this finding, as it is inconsistent with the null effect of AoA in Lehtonen et al.’s (2018) meta-analysis, but it may suggest that late second-language learners enjoy increases in inhibitory control (Bak et al., 2014).

Limitations and methodological issues

Although our findings cast doubt on a robust bilingual advantage on interference-control tasks, it remains possible that bilingual advantages exist but are restricted to certain bilinguals, or that existing methodological and statistical practices are not sensitive enough to detect such effects. We consider these possibilities in turn.

One important factor to consider is that the term “bilingual” has been used in the literature quite broadly, often with little consideration of differences in participants’ language proficiency or distinct linguistic experiences. (See Surrain and Luk (2017) for a systematic review of how the term “bilingual” has been used in the literature.) It is possible that specific linguistic experiences are necessary for a bilingual advantage to emerge. For example, a potentially important moderator that was not included in the present study is the frequency of switching between languages. If bilingual advantages emerge because of frequent code switching, rather than as a consequence of lexical representation, then it is possible that bilinguals who switch between languages more frequently would exhibit smaller interference costs than those who switch less frequently, or than monolinguals. Costa et al. (2009) speculated that in socio-linguistic contexts characterized by a high degree of code switching, bilinguals must monitor for cues that determine when to switch languages; such cues may include the linguistic background of interlocutors or the broader linguistic context of the conversation. This suggests that EF advantages might only be evident for bilinguals from socio-linguistic contexts in which code switching is the norm. Unfortunately, while some studies have found a relationship between frequency of language switching and performance on cognitive tasks (Prior & Gollan, 2011; Verreyt, Woumans, Vandelanotte, Szmalec, & Duyck, 2016), other studies with larger samples have not (Johnson, Sawi, & Paap, 2015; Paap et al., 2017b). In a recent study, Hofweber, Marinis, and Treffers-Daller (2016) found that it was the type of code switching engaged in by bilinguals, rather than just its frequency, that predicted an inhibitory advantage in a high-monitoring condition of the Flanker task. Neither frequency nor type of language switching was considered in the current analysis because it could not be reliably determined from the majority of studies in the published literature; nonetheless, this may prove to be a fruitful avenue for future research.

Costa et al. (2009) suggested that superior coordination of related cognitive skills, such as monitoring and goal maintenance, may underlie any bilingual advantage; hence, reliably detecting group differences may require more complex tasks. For example, Morales, Yudes, Gómez-Ariza, and Bajo (2015) used a continuous performance task to evaluate both proactive and reactive control in bilingual and monolingual young adults. With this relatively more complex task, the authors found that bilinguals demonstrated better coordination of proactive and reactive control, resulting in fewer errors and significant ERP differences (i.e., N2 and P3a following the task prompt, and greater error-related negativity following incorrect responses). Such results suggest that more complex tasks and measures may be more sensitive to bilingual advantages in skills related to cognitive control.

Another possible reason why a bilingual advantage may exist but has been difficult to identify consistently is the existing literature’s reliance on measures of mean RTs and differences between mean RTs. It has long been acknowledged that task manipulations affect components of RT distributions beyond just the mean, particularly the tail of the distribution (Luce, 1986). Indeed, some researchers have begun using the exponentially modified Gaussian (ex-Gaussian) distribution, which provides estimates of the central tendency, variability, and tail of an RT distribution (Calabria et al., 2011). This technique is a step forward in that it could be more sensitive to experimental effects than a simple mean would be. However, one weakness of the approach is that the parameters of this distribution do not have distinct theoretical interpretations (Matzke & Wagenmakers, 2009).
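As an illustration of what the ex-Gaussian approach involves, the following sketch simulates RTs with a known Gaussian component and exponential tail and recovers the parameters with scipy’s exponnorm distribution; all values are hypothetical and the example is not drawn from any study reviewed here.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)

# Simulate illustrative RTs (in seconds): a Gaussian component (mu, sigma)
# plus an exponential tail (tau).
mu, sigma, tau = 0.45, 0.05, 0.15
rts = rng.normal(mu, sigma, 2000) + rng.exponential(tau, 2000)

# scipy parameterizes the ex-Gaussian as exponnorm(K, loc, scale), which maps
# onto the usual parameters as mu = loc, sigma = scale, tau = K * scale.
K, loc, scale = stats.exponnorm.fit(rts)
print(f"mu ~ {loc:.3f}, sigma ~ {scale:.3f}, tau ~ {K * scale:.3f}")

In this framing, a group difference confined to tau would change the mean and the skew in ways that a comparison of means alone cannot disentangle from a shift in mu.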

A more precise analysis of RT data comes from cognitive process models, such as the Ratcliff drift diffusion model of two-choice forced decision tasks (Ratcliff & McKoon, 2008). The diffusion model characterizes RT and accuracy distributions for such tasks using four parameters with distinct theoretical interpretations (e.g., the rate of evidence accumulation and an initial response bias). It provides excellent fits to RT distributions and has the advantage of interpretable parameters (Ratcliff & McKoon, 2008). However, these parameters cannot be recovered from the normal or ex-Gaussian distributions. Matzke and Wagenmakers (2009) simulated data from the diffusion model and fit the ex-Gaussian and shifted Wald distributions to the resulting datasets. They found that the diffusion model parameters did not map distinctly onto the parameters of either distribution; in other words, each parameter of the ex-Gaussian and shifted Wald distributions reflects multiple cognitive processes.
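To make the model’s parameters concrete, the following is a minimal simulation sketch of a two-boundary diffusion process; the parameter values are purely illustrative, and the implementation is a didactic simplification rather than a tool for fitting the model to data.

import numpy as np

def simulate_diffusion(v, a, z, t0, n_trials=500, dt=0.001, s=1.0, seed=0):
    # v  : drift rate (speed of evidence accumulation)
    # a  : boundary separation (response caution)
    # z  : starting point (response bias), with 0 < z < a
    # t0 : non-decision time in seconds (encoding and motor processes)
    # s  : within-trial noise, conventionally fixed as a scaling parameter
    rng = np.random.default_rng(seed)
    rts = np.empty(n_trials)
    choices = np.empty(n_trials, dtype=int)
    for i in range(n_trials):
        x, t = z, 0.0
        while 0.0 < x < a:  # accumulate evidence until a boundary is reached
            x += v * dt + s * np.sqrt(dt) * rng.standard_normal()
            t += dt
        rts[i] = t + t0           # decision time plus non-decision time
        choices[i] = int(x >= a)  # 1 = upper boundary, 0 = lower boundary
    return rts, choices

# Illustrative parameter values only.
rts, choices = simulate_diffusion(v=1.0, a=1.2, z=0.6, t0=0.3)
print(f"mean RT = {rts.mean():.3f} s, upper-boundary proportion = {choices.mean():.2f}")

The simulation also makes clear why mean RT conflates parameters: changes in boundary separation, drift rate, or non-decision time all shift the mean, but they leave different signatures in accuracy and in the shape of the RT distribution.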

A generalization of the diffusion model has been developed for interference-control tasks (White, Ratcliff, & Starns, 2011). This shrinking-spotlight diffusion model contains additional parameters corresponding to the shrinking of the attentional window within a trial and provides excellent fits to Simon, Flanker, and Stroop data (Servant, Montagnini, & Burle, 2014). However, like the simpler diffusion model, its components combine in unintuitive ways, meaning that the means, mean differences, and skewness of RT distributions reflect a mixture of cognitive processes. Ong, Sewell, Weekes, McKague, and Abutalebi (2017) found that bilinguals exhibited smaller non-decision times than monolinguals, which they interpreted as evidence that bilinguals filter out distracting information more efficiently. An advantage in the non-decision component may not consistently manifest in mean RTs, in which case an analysis of mean RTs might obscure a true underlying effect.

Conclusion

The purpose of this study was to determine the magnitude of a putative bilingual advantage on interference-control tasks and whether effects were moderated by age, task, and age of L2 acquisition. Counter to predictions, we found no evidence for a significant bilingual advantage on global RT after controlling for publication bias. For interference cost, although there was little evidence of publication bias, the overall effect size was very small after removing a landmark study identified as an outlier. Effect sizes were not significantly moderated by age or task, but were significantly moderated by an interaction between age of L2 acquisition and the DV. This interaction was driven by an effect of AoA on interference costs, with late bilinguals exhibiting larger advantages than early bilinguals. The robustness of our findings in the multiverse analysis supports the conclusions of other recent meta-analyses (Lehtonen et al., 2018; Paap et al., 2017a, b) and suggests that the cognitive benefits of bilingualism for standard interference-control tasks have been overstated.