Introduction

Robust working memory (WM) effects have been found across a range of complex cognitive processes, including reasoning, problem solving, planning, abstraction, mental arithmetic, and first language (L1) comprehension (Engle, 2002; for a meta-analysis addressing the role of WM in L1 comprehension, see Daneman & Merikle, 1996). Although theoretical positions vary, there is little argument that these complex cognitive behaviors at least partially rely on the attentional and executive control processes that underlie WM performance. Second language (L2) processing places demands on these WM resources, too, especially for less-proficient speakers, and growing evidence indicates that executive functions support the various cognitive control mechanisms necessary for L2 use (e.g., Abutalebi & Green, 2008; Hernandez & Meschyan, 2006). It should thus come as no surprise that WM has been implicated in studies of L2 processing (see, e.g., Michael & Gollan, 2005; Tokowicz, Michael, & Kroll, 2004) and learning (e.g., Linck & Weiss, 2011; Martin & Ellis, 2012). Although it is uncontroversial that WM is related to both L2 proficiency development and use, the magnitude of the WM effects and the specific component that drives these effects (i.e., the executive control vs. short-term store component of WM) have been somewhat inconsistent across studies (for recent reviews, see Juffs & Harrington, 2011, and Williams, 2011). Moreover, it is unclear whether these inconsistencies simply reflect variation around the population effect size due to noise (e.g., sampling error or measurement error), or whether they are instead due to systematic differences that can inform theoretical models. For example, to foreshadow our results, complex span measures are stronger predictors of L2 outcomes than are simple span measures, suggesting that the executive control component of WM may play a larger role than short-term memory when using an L2. 
A sizeable number of relevant studies that have varied in their research design factors (e.g., sample size, language of WM assessment) and participant characteristics (e.g., L2 proficiency level) are now available in the literature. Thus, a systematic, quantitative review seems warranted. To this end, we report a meta-analysis of the extant studies to better elucidate whether and under what conditions WM is related to performance on measures of L2 processing and proficiency.

Working memory

According to contemporary views, WM refers to the cognitive system(s) responsible for the control, regulation, and active maintenance of information in the face of distracting information (e.g., Conway, Jarrold, Kane, Miyake, & Towse, 2007). Baddeley’s seminal multicomponent model divided the construct of WM into two separable systems, a storage-based system (i.e., slave systems), analogous to short-term memory (STM), and an executive, attentional system that controls information between the slave systems and long-term memory stores (Baddeley, 1986; Baddeley & Hitch, 1974).

Many modern theories of human cognition describe a single system that is dedicated to the temporary processing, maintenance, and holding of information that is relevant to current tasks—that is, the WM system. Many theoretical models exist to describe its operation (see the variety of opinions offered in Miyake & Shah, 1999), but its function remains similar across models: It orders, stores, and manages immediate sensory details until they can be properly incorporated into the cognitive process that must integrate that data. The amount of data that can be stored for immediate, accurate recall (availability) is limited in size, and the speed with which it can be recalled (accessibility) varies. Ideal WM function, then, would increase both the accuracy of recall and the rate at which information in WM can be accessed.

WM is classically discussed in terms of two different subsystems or components: visuospatial WM, which represents, manipulates, and briefly maintains information in the spatial domain; and verbal WM, which handles verbally mediated representations and processing (Baddeley & Hitch, 1974; Baddeley & Logie, 1999) [Footnote 1]. More recent theories of WM are process-oriented rather than structural. Probably the most influential model in this regard is Cowan’s (1995, 2001, 2005) model. Cowan proposed a two-tier structure for WM, distinguishing a zone of privileged and immediate access (the focus of attention) from activated but not immediately accessible long-term memory. Memory in the focus of attention is highly accessible and available [Footnote 2], but the focus of attention is capacity-limited to a fixed number of items, or chunks. The activated portion of long-term memory is not capacity-limited, but memory in this state is prone to forgetting due to interference and/or decay. Attentional control processes are responsible for manipulating the contents of WM. Among other things, these processes activate, focus, update, switch, and inhibit memory during information processing. Here and for the remainder of this article, we will use the terms “attentional control processes” and “executive function” synonymously, which is consistent with Engle and Kane’s (2004) highly influential executive-attention theory of the variation in WM capacity.

Empirical support for the role of WM in complex cognition has come from the finding that WM capacity is a reliable predictor of performance on a wide variety of learning and high-level cognitive tasks, including tasks that tap general fluid intelligence (Engle, Tuholski, Laughlin, & Conway, 1999), reasoning ability (Kyllonen & Christal, 1990), mathematical ability (Ashcraft & Krause, 2007), and spatial ability (Kane et al., 2004). WM is an important component in many learning processes, including taking notes, following directions, or ignoring distractions (Engle, 2001; Engle, Carullo, & Collins, 1991; Piolat, Olive, & Kellogg, 2005).

Evidence suggests that WM is also an important part of language comprehension. Speakers with larger WM capacity are better able to learn vocabulary (in both first and second languages), write more proficiently, and have better L1 reading and listening comprehension (Atkins & Baddeley, 1998; Daneman & Hannon, 2007; Engle, 2001). Individual differences in WM—that is, the extent to which normal adults vary in their WM capacity—should therefore be important for understanding differences in these and other text comprehension processes. For instance, people differ in (1) the ability to remember new information encountered while reading, (2) the ability to make inferences about information encountered while reading, (3) the ability to access knowledge from long-term memory, and (4) the ability to integrate new information with knowledge from long-term memory (Daneman & Hannon, 2007). Because WM plays an important role in these broader cognitive processes and abilities, it comes as no surprise that WM is considered to be one of the most critical components of cognitive and linguistic achievement.

The measurement of WM capacity, which reflects individual differences in the efficacy with which the WM system functions (see Shipstead, Harrison, et al., 2013), is often separated between tasks that measure an individual’s ability to store and rehearse information—the so-called “simple” span tasks—and those that measure an individual’s ability to store information while faced with additional processing tasks—often termed “complex” span tasks. Simple span tasks, such as the forward digit span, word span, or nonword span, require an individual to recall a string of nonrelated letters, words, digits, or visual objects after a brief period of presentation. Complex span tasks, on the other hand, require an individual to actively process input (e.g., a sentence or a simple mathematical equation) while remembering a string of letters, words, digits, or objects.

A meta-analysis by Daneman and Merikle (1996) revealed that complex span tasks were better predictors of L1 comprehension than simple span tasks. It is also of interest to note that their findings did not depend on the nature of the stimuli in the complex span tasks. That is, operation span tasks (a predominantly nonlinguistic task) accounted for just as much variance of the criterion measures as did reading/listening span tasks that required the processing of linguistic material. The latter findings support the widely held notion that the executive control component of WM is a domain-general system (e.g., Baddeley, 2007).

A preliminary synthesis, which examined 16 studies focused on the role of WM in second language acquisition (SLA), featured a mean correlation coefficient (r) of .18 (Watanabe & Bergsleithner, 2006), suggesting that WM is positively related to L2 proficiency outcomes. Although few would argue that WM is unimportant for L2 processing, some researchers contend that WM’s importance has been overstated (e.g., Juffs & Harrington, 2011). This debate has been fueled not only by inconsistent results, but also by diverse research methodologies, which have led to difficulties in qualitatively comparing studies. In the following sections, we will review several studies that have examined the relationship between WM and L2 processing and proficiency development, identifying design factors that may have contributed to the diversity in the reported WM effects. The review below will provide a broad overview of the literature included in the meta-analysis, so as to justify the isolation of specific predictor variables. For a more in-depth discussion of specific studies, the interested reader is referred to Juffs and Harrington (2011) and Williams (2011).

Levels of proficiency

L2 processing (i.e., production and comprehension) generally requires more cognitive resources than does processing in the L1 (e.g., Green, 1998; Hernandez & Meschyan, 2006). Therefore, it is reasonable to argue that individuals with greater WM resources would perform better on processing tasks in the L2. However, a collection of studies have suggested that proficiency level may moderate this processing advantage. For example, according to Abu-Rabia (2001), Hummel (2009), Leeser (2007), Linck, Hoshino, and Kroll (2008), and Weissheimer and Mota (2009), low-proficiency bilinguals with greater WM spans performed significantly better than those with lower span scores on tasks addressing L2 processing abilities. However, when examining highly proficient bilinguals, several studies failed to show significant L2 processing advantages for individuals with higher WM scores (Fehringer & Fry, 2007a, 2007b; Foote, 2011; Hummel, 2009).

Working memory span task variables

The literature is quite inconsistent when it comes to selecting the language in which to measure WM capacity. Research by Osaka and Osaka (1992) indicated a strong positive correlation between WM span tasks administered in the L1 and those administered in the L2, supporting views that WM is a domain-general resource. However, the L2 proficiency of the participants was a factor to consider when deciding the language of the WM span task. To resolve this issue, researchers sometimes administer the same class of WM span task in both the L1 and the L2. Previous research has concerned itself with differences in the domain (i.e., verbal vs. nonverbal) content of the WM span task. Daneman and Merikle (1996) found that the content domain of the WM span task did not contribute to substantial differences in the amounts of variance explained. Although the content of the domain may not matter to a significant degree, Daneman and Merikle’s meta-analysis did show that complex span tasks were much better predictors of L1 comprehension performance than were simple span tasks. In L2 research, both simple and complex span tasks have been found to significantly predict L2 processing and proficiency outcome measures (for simple span, see, e.g., Christoffels, De Groot, & Waldorp, 2003; O’Brien, Segalowitz, Freed, & Collentine, 2007; Slevc & Miyake, 2006; for complex span, see, e.g., Abu-Rabia, 2001; Mackey & Sachs, 2012; Révész, 2012). However, it is difficult to determine which of these span tasks is a better predictor of L2 processing and proficiency tasks, due to the heterogeneity of the L2 tasks used and other research design factors.

Simple span tasks consistently account for significant amounts of variance in L1 vocabulary learning tasks and outcome measures (Gathercole, Willis, Emslie, & Baddeley, 1992). Similar results have been found in L2 learning studies (see Williams, 2011, for a review). Several researchers have suggested that the role of phonological STM is greater in less-proficient bilinguals (Cheung, 1996; Juffs & Harrington, 2011). Studies by Abu-Rabia (2001) and Speciale, Ellis, and Bywater (2004) showed that simple span measures correlated significantly with L2 lexical development. However, not all studies have shown relationships between simple span measures and L2 lexical development. For example, Akamatsu (2008) failed to find a significant correlation between word span (a simple span task) and gains made on a word-recognition task following a 7-week word recognition training period. One potential explanation for these inconsistencies is that changes in the speed of lexical retrieval likely reflect different processes or mechanisms than those contributing to knowledge acquisition that have been measured in other studies (e.g., L2 vocabulary development). Akamatsu even commented on the relative lack of cognitive demand in the word recognition tasks, another potential factor that may have led to the null findings. The rationale is that tasks that consume many cognitive resources will give individuals with high WM capacities an inherent advantage over those with lower WM capacities. If a particular task is so easy that all participants can perform it without much effort, WM capacity differences are much less likely to be found.

Studies spanning the past two decades have indicated that WM plays a role in L2 processing and proficiency development (e.g., Michael & Gollan, 2005; Williams, 2011). However, since the results across studies have been inconsistent, the precise nature of this role remains unclear, and Watanabe and Bergsleithner’s (2006) preliminary synthesis did not examine the potential influence of covariates on the magnitude of the population effect size. Our review of the literature identified a number of research design factors that may have contributed to the heterogeneity of the research findings. Specifically, we identified three categories of potential moderators of the relationship between WM and L2 outcomes, including characteristics of the WM measures, features of the criterion measures, and the proficiency of the participants included in the study (see Table 1). Given the interest in WM’s impact on L2 processing and proficiency development, and the number of studies now available in the literature, it is time for a quantitative synthesis of the extant results.

Table 1 Categories of variables examined as potential moderators of the relationship between working memory and second language criterion outcomes

The present meta-analysis

The goals of our meta-analytic review were twofold. First, we wanted to estimate the population WM effect size, on the metric of the correlation coefficient. Second, we wanted to examine the potential moderating influences of relevant variables, to better understand the boundary conditions of WM effects. To our knowledge, this is the first exhaustive quantitative synthesis of studies of WM effects in the L2 literature that has taken such covariates into account, and as such, it represents a major step forward in the field’s understanding of the relationship between WM and L2 processing and proficiency outcomes.

Method

Literature search

Studies were located online via keyword searches in databases (Academic Search Premier, Dissertation Abstracts International, ERIC, PsycINFO, and the Psychology and Behavioral Sciences collection) and in the Google Scholar search engine. All of the searches used variations of the following terms: second language, foreign language, bilingual* (with the asterisk serving as a wildcard operator), working memory, working memory capacity, WMC, working memory span, short-term memory, short-term memory span, reading span, listening span, operation span, digit span, nonword span, word span, and letter span. Tables of contents were inspected in peer-reviewed journals that focus on SLA- and bilingualism-related topics (i.e., Applied Psycholinguistics, Bilingualism: Language and Cognition, Language and Cognitive Processes, Language Learning, Second Language Research, and Studies in Second Language Acquisition). The reference lists of publications located through these search methods were also inspected to identify studies cited therein. Finally, for each study on this interim list, a “cited by” search was conducted to identify more recent articles that have cited the target reference. Our search included published articles and book chapters, as well as unpublished master’s theses and doctoral dissertations that were available in the databases as of September 19, 2012. In order to provide a comprehensive analysis and to mitigate the “file drawer problem” (i.e., publication bias; see below), unpublished studies were included in the meta-analysis (e.g., Rosenthal, 1979). All studies included in the meta-analysis are identified with an asterisk (*) in the References list.

Inclusion criteria

A set of inclusion criteria was designed to focus the meta-analysis on studies relevant to understanding the role of WM in adult L2 proficiency and processing outcomes. Each study was examined to identify whether it satisfied the following set of criteria.

  1. All participants were classified as adults (above the age of 18).

  2. Participants were classified as “nonnative bilinguals.” Here, we use the term “bilingual” liberally to refer to an individual with at least minimal knowledge of an L2, and “nonnative” to refer to individuals who began learning an L2 after first becoming proficient in a primary (native) language. This criterion excluded heritage speakers and childhood bilinguals who acquired both languages simultaneously as children.

  3. No participant had a known history of neurological or psychopathological problems (including learning and language impairments).

  4. In each study, at least one WM measure and one L2 outcome measure (assessing an aspect of processing and proficiency) were administered. Studies using performance measures that required participants to learn nonwords or artificial grammar rules were not included in the analysis, to restrict the analysis to studies of natural L2 processing and/or proficiency outcomes.

  5. It was necessary that each study quantify the relationship between the WM measures and the criterion measures through either a Pearson product–moment correlation coefficient (r) or another statistic (e.g., t, Cohen’s d, or F) that could be transformed into a correlation coefficient (see the Appendix for equations). Following standard meta-analytic procedures, results from analyses of variance were included only for F statistics with one degree of freedom in the numerator (e.g., Rosenthal, 1995). When a study simply stated that an effect was nonsignificant, without reporting an actual effect size, the effect size was assigned an estimate of r = 0 for the main analyses reported below. However, this is known to lead to conservative, downward-biased population estimates, and therefore an alternative approach (excluding the effect size) was conducted as part of a “sensitivity analysis” (Rosenthal, 1995). Note that this alternative approach is itself known to introduce upward bias in the population estimate. The sensitivity analysis allowed for an examination of the extent to which the inferences drawn from the meta-analyses were sensitive to these decisions.
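To illustrate the fifth criterion, the following Python sketch converts t, F, and Cohen’s d to the r metric. The Appendix equations are not reproduced in this section, so the sketch uses the textbook conversions, which are assumed (not confirmed) to match them. The function names are hypothetical, the d conversion assumes two groups of equal size, and because F carries no sign, the direction of the effect has to be supplied by the coder.

```python
import math


def r_from_t(t: float, df: int) -> float:
    """Convert a t statistic to r; the sign of t is preserved."""
    return t / math.sqrt(t * t + df)


def r_from_f(f: float, df_error: int, positive: bool = True) -> float:
    """Convert an F statistic with ONE numerator df to r.

    F is unsigned, so the direction of the effect is passed in explicitly.
    """
    if f < 0:
        raise ValueError("F cannot be negative")
    r = math.sqrt(f / (f + df_error))
    return r if positive else -r


def r_from_d(d: float) -> float:
    """Convert Cohen's d to r (assumption: two equal-sized groups)."""
    return d / math.sqrt(d * d + 4)


# Hypothetical inputs, for illustration only.
print(r_from_t(2.0, 12), r_from_f(4.0, 12), r_from_d(0.5))
```

With one numerator degree of freedom, F equals t squared, so the first two conversions agree in magnitude for the same test.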

Because nearly all of the studies failed to report knowledge of languages other than the two being empirically addressed, we were unable to control for proficiency in additional languages. The criteria above led to a final data set of 748 effect sizes from 79 independent samples involving 3,707 participants. See the online supplemental materials for a table that provides the following information for each of the 79 independent samples: study reference, sample size, median correlation coefficient, range of correlation coefficients, participant proficiency level, publication status, the coding results for the WM measures and criterion measures (see below), and the specific WM task(s) and criterion measure(s) used in a study.

Variables and coding procedures

The literature review identified several variables likely to influence the strength and/or direction of effect sizes. These variables categorized relevant characteristics of the WM tasks, the criterion tasks, the participants, and the publication status of the study. We will review the variable-coding procedures in turn.

WM span tasks

The WM span tasks varied on a number of factors, both within and across studies (see Table 2). First, the language of performance was classified as either L1 or L2, with tasks requiring the processing or storage of numeric stimuli coded as L1, since numeric calculation is typically performed in the L1. Second, tasks were classified as simple span tasks (i.e., measuring storage only) or complex span tasks (requiring both storage and processing; see Daneman & Merikle, 1996; Unsworth & Engle, 2007). Finally, WM measures were classified according to the content domain of the stimuli: verbal (i.e., requiring processing of linguistic material such as words or sentences) or nonverbal (i.e., requiring processing of nonlinguistic material, including numeric digits, math equations, or visuospatial images). For complex span tasks that included both verbal and nonverbal stimuli, this variable was determined on the basis of the content of the processing component. For example, in the operation span task, participants must process (make judgments about) simple arithmetic problems while storing words or letters; because the processing component was nonverbal, the operation span was classified as a nonverbal span task (see Daneman & Merikle, 1996, for a similar classification of the task).

Table 2 Classification system for coding working memory span tasks

L2 performance measures

Criterion measures of L2 performance were classified on the basis of the modality of the measure—namely, comprehension (e.g., lexical decision task), production (e.g., cloze test), or both (e.g., simultaneous interpretation: composite measures combining separate indicators of comprehension and production). In addition, each criterion measure was classified as focusing on language processing or proficiency. Processing measures gauged online language-processing abilities, such as those measured by gating tasks (e.g., McDonald, 2006), hesitation phenomena (e.g., Fehringer & Fry, 2007b), fluency during an oral proficiency interview (e.g., O’Brien, Segalowitz, Collentine, & Freed, 2006; O’Brien et al., 2007), speech generation tasks (e.g., Weissheimer & Mota, 2009), or lexical recognition tasks (e.g., Leeser, 2007). Proficiency measures, on the other hand, assessed L2 knowledge or more general language abilities. These included standardized tests of proficiency, such as the Michigan Test (e.g., Juffs, 2005) and the grammar and reading sections of TOEFL (e.g., Harrington & Sawyer, 1992), as well as nonstandardized tests of vocabulary (e.g., Hummel, 2009). We also noted whether these criterion measures used a standardized measure (e.g., TOEFL section scores, Michigan Test scores) or a nonstandardized measure (e.g., cloze test, grammaticality judgment task). A list of the various criterion measures found in the studies and their classifications within this coding scheme is presented in Table 3.

Table 3 Classification system for coding second language criterion measures

Participant L2 proficiency

The studies included in this meta-analysis varied greatly in how they described and/or quantified the L2 proficiency of their sample populations. For this meta-analysis, participants were categorized as either highly proficient learners (i.e., having extensive academic or professional exposure and/or intensive immersion experience) or less-proficient learners. Individuals labeled as highly proficient learners met one or more of the following criteria: (1) international students enrolled in an academic program (undergraduate or graduate) administered entirely in the participant’s L2, (2) masters- or PhD-level students specializing in the foreign language of study, or (3) professionals who were functioning completely in their foreign language and had begun learning their foreign language during adulthood. Individuals not meeting any of these criteria were labeled as less-proficient learners.

Publication status

Each study was coded as published or unpublished. This coding allowed for an examination of the potential for publication bias—the systematic underreporting of smaller effect sizes due to nonsignificant null hypothesis significance tests. If a meta-analysis focuses solely on published studies, the researcher risks inflating the estimated population effect size (see Rosenthal, 1979). Therefore, a concerted effort was made to include unpublished reports, including master’s theses and dissertation studies. Their inclusion allowed us to explicitly compare the estimated effect sizes from published versus unpublished studies, while also mitigating to some extent threats of publication bias.

Some unpublished studies that were identified during the literature search phase included subsets or supersets of participants whose data were subsequently published. The published and unpublished reports often contained nonidentical analyses (e.g., the dissertation contained the full correlation matrix for the predictors and outcomes, whereas the published article only reported targeted correlations to test specific hypotheses) and/or participant samples (e.g., the sample reported in the dissertation was supplemented with subsequently tested participants or was combined with another sample for the published article). Meta-analysis assumes and requires the independence of participants between different studies in the meta-analyzed data set (e.g., Hedges, Tipton, & Johnson, 2010). Therefore, when the samples of two studies overlapped, we included in the data set the study with the larger, more encompassing sample, and excluded the other study. In most cases, this led to the inclusion of the published article and the exclusion of the unpublished study. However, in one case (Fortkamp & Bergsleithner, 2007), the published article appeared to include a subset of the sample from the unpublished dissertation (Bergsleithner, 2007), and therefore the unpublished dissertation with the larger sample was included in place of the published article. A few studies were also reported in university or departmental bulletins (e.g., Ikeno, 2006, published in the Bulletin of the Faculty of Education). Although it was not easy to ascertain the extent of peer review for these venues, the bulletins were regularly produced and published by the universities, and therefore the studies reported therein were coded as published.

Interrater agreement

The data coding was performed by the first and second authors, with approximately 70% of the data being coded independently to check for interrater agreement. Across all coded variables, agreement ranged from 86% to 100% (median = 98.9%). After the initial coding was completed, disagreements were discussed and resolved by providing further specification of our coding scheme, where needed. The lowest agreement of 86% was found for the criterion modality variable, and was driven by disagreements on the criterion outcomes from three specific studies reporting multiple outcomes.

Analytic approach

The goal of this meta-analysis was to estimate the mean of the population distribution of effect sizes and to generalize the results beyond the sample of examined studies. Therefore, random-effects models were employed (Hedges & Vevea, 1998). Most studies reported multiple effect sizes (a median of four rs per sample), often due to the inclusion of different types of WM measures (e.g., simple and complex span tasks, or L1 and L2 administrations). This violates the assumption of independence among effect sizes in standard meta-analytic procedures (Hedges et al., 2010). Various methods have been proposed to address this violation, such as collapsing these dependent effect sizes into one “synthetic” effect size per study by computing the mean or median effect or by randomly selecting one effect size per study (see Marin-Martinez & Sanchez-Meca, 1999, for a comparison of the methods). A recently proposed alternative is to explicitly model the interdependence among effect sizes by robust variance estimation (Hedges et al., 2010), thereby eliminating the need to discard information through aggregation. Therefore, we employed robust variance estimation procedures in the R statistical software package (R Development Core Team, 2012) using the R code provided in Hedges et al.’s appendix.

As part of sensitivity analyses, we also followed an aggregation approach by first computing the median effect sizes for each study, then conducting standard random-effects meta-analyses using the “metafor” R package (Viechtbauer, 2010). Broadly speaking, the results paralleled those of the robust standard error (SE) method, and we report the results of both approaches below (see the “Complete-data meta-analysis” section). However, since we believe the robust SE method was better suited to our specific data set, for the covariate analyses we only report the results from the robust SE analyses.
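The aggregation approach can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used for the reported analyses: the function names and input rows are hypothetical, the median is taken on the r metric before transformation, the sampling variance of Fisher’s z is taken as 1/(n − 3), and the between-study variance is estimated with the DerSimonian–Laird method, which is swapped in here for simplicity and is not necessarily the estimator used with “metafor.”

```python
import math
from collections import defaultdict
from statistics import median


def aggregate_by_sample(effects):
    """Collapse (sample_id, r, n) rows to one (Fisher z, n) pair per sample.

    Assumption: the median is computed on r and then transformed to z.
    """
    rs = defaultdict(list)
    ns = {}
    for sample_id, r, n in effects:
        rs[sample_id].append(r)
        ns[sample_id] = n
    return [(math.atanh(median(values)), ns[sid]) for sid, values in rs.items()]


def random_effects_dl(z_n):
    """DerSimonian-Laird random-effects mean of Fisher z values.

    z_n: list of (z, n) pairs with n > 3; var(z) is taken as 1 / (n - 3).
    Returns (mean_z, se_of_mean_z, tau2).
    """
    k = len(z_n)
    w = [n - 3.0 for _, n in z_n]  # inverse-variance (fixed-effect) weights
    z = [zi for zi, _ in z_n]
    sw = sum(w)
    fixed = sum(wi * zi for wi, zi in zip(w, z)) / sw
    q = sum(wi * (zi - fixed) ** 2 for wi, zi in zip(w, z))  # Cochran's Q
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (1.0 / wi + tau2) for wi in w]  # random-effects weights
    mean_z = sum(wi * zi for wi, zi in zip(w_star, z)) / sum(w_star)
    return mean_z, math.sqrt(1.0 / sum(w_star)), tau2


# Hypothetical rows: three correlations from sample "A", one from sample "B".
rows = [("A", 0.20, 53), ("A", 0.40, 53), ("A", 0.30, 53), ("B", 0.30, 103)]
mean_z, se, tau2 = random_effects_dl(aggregate_by_sample(rows))
print(math.tanh(mean_z), se, tau2)
```

Collapsing to one value per sample restores independence among the analyzed effect sizes, at the cost of discarding the within-sample variation that the robust variance estimation approach retains.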

Following Higgins and Thompson (2002), rather than focusing on a categorical significance test of effect size heterogeneity between studies, we focused on quantifying the degree of heterogeneity in the analyzed effect size by reporting τ². We planned a priori to examine a number of potential covariates in order to address theoretical claims that have been posed in the literature, regardless of whether a hypothesis test determined that a significant amount of heterogeneity was present between studies. Moreover, a number of these covariates were observed within studies (e.g., both simple and complex span tasks), suggesting that characterizations of the between-study variance would not provide a complete picture of the results.

Correlation coefficients are known to be nonnormally distributed, and thus the recommended effect size for meta-analysis is Fisher’s z transform of r (e.g., Schafer, 1999). However, in the results below, we report effect sizes and 95% confidence intervals (CIs) on the original r metric to facilitate interpretation. See the Appendix for the equations used to convert effect sizes between r and Fisher’s z.
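As a brief illustration, Fisher’s z is the inverse hyperbolic tangent of r, and the back-transformation is the hyperbolic tangent. The Python sketch below shows both directions and how a CI computed on the z metric can be reported on the r metric. The function names are hypothetical, and the normal critical value of 1.96 for a 95% CI is an assumption of this sketch.

```python
import math


def r_to_z(r: float) -> float:
    """Fisher's z transform: z = 0.5 * ln((1 + r) / (1 - r)) = atanh(r)."""
    return math.atanh(r)


def z_to_r(z: float) -> float:
    """Back-transform Fisher's z to the r metric: r = tanh(z)."""
    return math.tanh(z)


def ci_on_r_metric(mean_z: float, se_z: float, crit: float = 1.96):
    """Build the interval on the z metric, then convert each bound to r."""
    return z_to_r(mean_z - crit * se_z), z_to_r(mean_z + crit * se_z)


# Hypothetical values, for illustration only.
print(ci_on_r_metric(r_to_z(0.25), 0.05))
```

Because the interval is symmetric on the z metric but not after back-transformation, CIs reported on the r metric are generally asymmetric around the point estimate.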

File drawer analysis

A concern for any meta-analysis is publication bias, that is, the potential for more extreme results to be overly represented in the literature, due to biases against publishing nonsignificant effects (Rosenthal, 1979). We assessed the presence of publication bias in our sample by multiple methods. First, we computed a fail-safe N (Orwin, 1983; Rosenthal, 1979), which estimates the number of missing, unpublished, or future studies with null effects that would be required for the probability of a Type I error in a significance test of \( \widehat{\rho} \) to rise above an acceptable level. Orwin suggested a variation on Rosenthal’s fail-safe N that identifies the number of studies with a particular effect size (e.g., null) that would be required to alter the observed effect size to reach a designated criterion (e.g., \( \widehat{\rho} \) = .01) that the meta-analyst believes would call into question the validity of the findings. For each analysis, we computed Orwin’s fail-safe N, using a criterion effect size of r = .01 and an effect size of r = 0 for missing studies (see the Appendix for the equation and details). This approach provides a sense of the stability of the findings of the meta-analysis, with a suggested rule of thumb being that results are valid and robust against the “file drawer problem” if the fail-safe N reaches or surpasses the value of 5k + 10, where k is the number of studies in the analysis (Rosenthal, 1979).
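A minimal Python sketch of this computation follows. It uses the commonly cited form of Orwin’s formula, N = k(mean effect − criterion) / (criterion − effect assumed for missing studies), which is assumed (not confirmed) to match the Appendix equation. The function names and the input values are hypothetical.

```python
def orwin_failsafe_n(k: int, mean_effect: float,
                     criterion: float = 0.01,
                     missing_effect: float = 0.0) -> float:
    """Number of missing studies (each with effect `missing_effect`) needed
    to pull the mean effect of k observed studies down to `criterion`."""
    if criterion == missing_effect:
        raise ValueError("criterion must differ from missing_effect")
    return k * (mean_effect - criterion) / (criterion - missing_effect)


def rosenthal_threshold(k: int) -> int:
    """Rule-of-thumb benchmark of 5k + 10."""
    return 5 * k + 10


# Hypothetical values: 10 studies with a mean effect of r = .20.
n_fs = orwin_failsafe_n(k=10, mean_effect=0.20)
print(n_fs, n_fs >= rosenthal_threshold(10))
```

With the missing-study effect fixed at 0, the formula reduces to k times the ratio of the excess over the criterion to the criterion itself, so a small criterion such as .01 yields large fail-safe values even for modest mean effects.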

Covariate analyses

We examined the potential moderating influences of the categorical covariates identified in the literature review by first creating subsets of the data set for each level of the covariate, then separately fitting random-effects models with robust standard errors to each subset (see Footnote 3). We also examined theoretically motivated interactions between the covariates, where sufficient data were available in each data subset to draw reasonable inferences. We report the results of two such interactions: Language × Complexity and Focus × Complexity.

Results and discussion

Prior to the analysis, extreme outliers were removed in order to prevent any undue influence on the inferences. Outliers were identified as any effect size more than twice the interquartile range above or below the median effect size (see Fig. 1). This criterion was applied to Fisher’s-z-transformed data, since these were the values submitted to the meta-analyses. This procedure identified 40 observed effect sizes (two thirds positive, one third negative), corresponding to the following unique values (converted to the r metric): –.97, –.62, –.57, –.53, –.48, –.42, –.39, –.37, –.36, .66, .67, .68, .69, .70, .71, .72, .73, .74, .76, .79, and .80. Note that this procedure removed twice as many positive as negative correlation values. This is not problematic, because the most likely effect of removing these outliers would be to attenuate any positive inflation of the population effect size estimate due to publication bias.
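The outlier criterion described above can be sketched as follows. The function name is hypothetical, and the quartile definition (the default “exclusive” method of Python’s `statistics.quantiles`) is an assumption, since the quartile method used in the analysis is not stated here.

```python
import math
import statistics

def flag_outliers(r_values, multiplier=2.0):
    """Return the correlations lying more than `multiplier` times the
    interquartile range above or below the median, judged on the
    Fisher's z scale (requires -1 < r < 1 for every value)."""
    z_values = [math.atanh(r) for r in r_values]
    median_z = statistics.median(z_values)
    # Assumption: quartiles from the default 'exclusive' method.
    q1, _, q3 = statistics.quantiles(z_values, n=4)
    iqr = q3 - q1
    lower = median_z - multiplier * iqr
    upper = median_z + multiplier * iqr
    return [r for r, z in zip(r_values, z_values) if z < lower or z > upper]

if __name__ == "__main__":
    # Hypothetical effect sizes for illustration only.
    sample = [0.10, 0.15, 0.20, 0.22, 0.25, 0.27, 0.30, 0.33, 0.35, 0.95]
    print(flag_outliers(sample))
```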

Fig. 1 Boxplot of the distribution of the analyzed effect sizes (Fisher’s z values) in the full data set, with outliers identified by asterisks (see the text for the corresponding values on the r metric)

Complete-data meta-analysis

Descriptive statistics

Table 4 provides descriptive statistics to characterize the sample of studies (e.g., number of independent samples, total number of participants) and the effect sizes included in the analyses (e.g., proportion of correlations with positive values, minimum and maximum observed values). Descriptive statistics are provided first for the complete data set, then separately for each covariate subset analysis.

Table 4 Descriptive statistics for the main complete-data analysis, individual covariate subset analyses, and covariate interaction subset analyses

Inferential statistics/results

A random-effects model with robust standard errors was fit to the full data set to compute \( \widehat{\rho} \) —the estimate of the mean value of ρ in the population distribution of effect sizes (Hedges et al., 2010). The results are reported in Table 5, including \( \widehat{\rho} \), 95% CIs, and fail-safe N. The analysis suggests that the population distribution of ρ is centered around a value of .255 and is significantly positive, as indicated by the 95% CIs that do not overlap with zero. In fact, across all analyses reported below, none of the 95% CIs overlapped with zero, indicating a robust positive relationship between WM and L2 outcomes across the range of covariates investigated here.

Table 5 Random-effects meta-analysis results for the complete-data analysis and for models examining individual covariates and the covariate interactions

For the complete-data analysis, the amount of between-study heterogeneity, τ², was estimated as being .017. For the results reported in Table 5, we specified the within-study correlation between effect sizes as being .80. Following Hedges et al. (2010), to check whether the results were sensitive to this value, we estimated τ², \( \widehat{\rho} \), and the SE of \( \widehat{\rho} \) across correlation values ranging from 0 to 1 in increments of .1. Across these values, we found that τ² ranged from .0167 to .0168, that \( \widehat{\rho} \) was equivalent to the fourth decimal place, and that the SE of \( \widehat{\rho} \) was equivalent to the fifth decimal place.

As part of a sensitivity analysis, a random-effects model was also fit to the aggregated data, with a median effect size being computed for any sample with more than one effect size. Following recommendations that meta-analysts report by-study effect sizes with CIs (e.g., Moher, Liberati, Tetzlaff, Altman, & the PRISMA Group, 2009), Fig. 2 presents a forest plot based on the aggregated-data random-effects model, after converting the effect sizes back to the r metric. Forest plots provide a visual depiction of the effect sizes estimated from a meta-analytic model, which can provide useful information regarding the nature of the distribution and precision of the effect sizes across samples.
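The samplewise aggregation step can be sketched as follows, assuming that each effect size is stored as a (sample identifier, r) pair and that the median is taken on the Fisher’s z scale. The function name and data layout are hypothetical.

```python
import math
import statistics
from collections import defaultdict

def aggregate_by_sample(effects):
    """Collapse multiple effect sizes per independent sample to one value.

    `effects` is an iterable of (sample_id, r) pairs. Returns a dict mapping
    each sample_id to the median Fisher's z of its effect sizes.
    """
    grouped = defaultdict(list)
    for sample_id, r in effects:
        grouped[sample_id].append(math.atanh(r))
    return {sample_id: statistics.median(z_list)
            for sample_id, z_list in grouped.items()}

if __name__ == "__main__":
    # Hypothetical data: sample "A" contributes three correlations, "B" one.
    data = [("A", 0.10), ("A", 0.30), ("A", 0.20), ("B", 0.40)]
    for sample_id, z in aggregate_by_sample(data).items():
        print(sample_id, round(math.tanh(z), 3))
```

Aggregating in this way yields one effect size per independent sample, at the cost of discarding within-sample variability.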

Fig. 2 Forest plot of effect sizes for each independent sample, based on a random-effects model fit to the aggregated data (median Fisher’s z). The estimated effects were transformed back to the r metric prior to generating the forest plot, to facilitate interpretation. Each independent sample is shown on a separate row, and the points (and bars) represent the estimated effect size (and CIs). The bottom-most row depicts the population value and SE, as estimated from the aggregated-data random-effects model. The dashed vertical line indicates an effect size of zero (i.e., no relationship)

Two patterns are worth noting. First, the population estimates were very similar in the two analyses (aggregated-data model, r = .253, CI = [.216, .289]; robust SE model, r = .255, CI = [.219, .291]). Second, one might be concerned that the inclusion of data from extremely small samples might bias the population effect size estimate. However, the magnitude of the estimated effect size does not appear to be related to sample size, which is indicated by the width of the CIs (for correlation coefficients, CI widths are inversely related to the sample size, so that smaller samples yield wider CIs). Indeed, the smallest samples (i.e., those effects with the widest CIs) appear across the entire range of the distribution of effect sizes, suggesting that the population estimate was not biased by the inclusion of particularly small samples.

In the aggregated-data random-effects model, approximately 12.4% of the variance in effect sizes was due to heterogeneity between the studies (τ² = .003, SE = .004). A comparison with the complete-data robust SE results indicated that the degree of heterogeneity in the effect sizes appears to have been underestimated by the aggregated-data random-effects model, as often happens with a samplewise aggregation procedure (Cheung & Chan, 2004). Although the random-effects meta-analysis model suggests that not much residual heterogeneity between studies needs to be explained by covariates, a number of theoretically motivated covariates were identified in the literature review, many of which may explain variability both between and within studies. Therefore, we now turn to a systematic examination of these covariates.
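To make the heterogeneity quantities concrete, the following sketch fits a random-effects model to one Fisher’s z value per sample using the DerSimonian–Laird moment estimator. This estimator is an assumption chosen for illustration (the estimator used in the analysis is not stated here), the function name is hypothetical, and the sketch does not implement the robust standard error approach of Hedges et al. (2010).

```python
import math

def dersimonian_laird(z_values, n_values):
    """Random-effects pooling of Fisher's z values (one per sample).

    Uses sampling variance 1 / (n - 3) for each z and the DerSimonian-Laird
    moment estimator for the between-study variance tau^2. Also returns the
    proportion of variance attributed to heterogeneity (I^2).
    """
    variances = [1.0 / (n - 3) for n in n_values]
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    fixed_mean = sum(w * z for w, z in zip(weights, z_values)) / sum_w
    q = sum(w * (z - fixed_mean) ** 2 for w, z in zip(weights, z_values))
    df = len(z_values) - 1
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c)
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    # Re-weight each study by the inverse of its total variance.
    re_weights = [1.0 / (v + tau2) for v in variances]
    sum_re = sum(re_weights)
    pooled_z = sum(w * z for w, z in zip(re_weights, z_values)) / sum_re
    return {"pooled_z": pooled_z, "pooled_r": math.tanh(pooled_z),
            "se_z": math.sqrt(1.0 / sum_re), "tau2": tau2, "i2": i2}

if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    print(dersimonian_laird([0.10, 0.25, 0.40], [40, 60, 80]))
```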

Covariate subset analyses

In all of the analyses reported here, a series of random-effects models with robust standard errors were fitted separately for each covariate. The results are reported in Table 5 following the random-effects meta-analytic model results. Overall, the subset analyses corroborate the findings of the full-data meta-analysis. The central tendency of the \( \widehat{\rho} \) estimates from the individual and interaction covariate models is approximately centered around the full-data meta-analysis estimate (mean \( \widehat{\rho} \) = .241, median \( \widehat{\rho} \) = .245). Moreover, all of the covariate analysis 95% CIs cover positive values and exclude zero, indicating significant positive correlations between WM and L2 outcomes, even when accounting for a range of covariates. We now consider each set of covariates in turn.

Characteristics of WM measures

The analysis of the language of the WM measure suggests that larger correlations between WM and L2 outcomes may be found when WM is measured in the L2 rather than the L1, with the L2 WM correlation estimate being .30. However, the partially overlapping 95% CIs indicate that this difference may not be robust. We argue that any difference is likely due to the confounding of L2 proficiency with WM abilities when WM tasks are administered in the L2. That is, to the extent that the WM task performance requires L2 use, the task will be an indicator of both WM abilities and L2 proficiency, and therefore will not purely measure WM. In the context of predicting L2 outcomes, this confound would inflate the WM–outcome correlation estimate. Indeed, the L2 WM covariate analysis estimated the third highest population value, and was one of only three estimates reaching .30. Therefore, these results suggest that researchers who wish to isolate the true relationship between WM and L2 proficiency should employ L1 measures, to provide a purer estimate of WM abilities.

A significantly stronger correlation was found for complex WM span tasks relative to simple span tasks (see the nonoverlapping CIs). This finding parallels the results of Daneman and Merikle’s (1996) meta-analysis of WM and L1 reading comprehension, in which complex (process-plus-storage) measures of WM were stronger predictors than simple (storage-only) measures. These results indicate that, across a range of L2 processing outcome measures, better WM abilities were related to better performance. Research on L2 aptitude has also implicated WM as an important individual difference, with some arguing that WM is at the core of L2 aptitude (Miyake & Friedman, 1998). These results further corroborate claims that WM is a critical component of any successful theory of L2 aptitude (see DeKeyser & Koeth, 2011).

Note that simple span tasks that measure STM—including phonological STM—also had significant and positive relationships with L2 outcomes. Phonological STM has been identified as an important contributor to L2 aptitude (e.g., Hummel, 2009), including a recent investigation of aptitude for high-level language proficiency (Linck et al., 2013). The present results are congruent with accounts that include both executive control (i.e., WM) and (phonological) STM as abilities that account for individual differences in L2 outcomes. Any comprehensive theoretical model of L2 outcomes—both processing and proficiency—likely should include both WM and STM. It will be useful for future studies to identify the types of tasks and conditions that modulate the relative contributions of WM and STM.

Focusing on the content of the WM measures, the covariate analyses indicated that verbal WM measures were somewhat more highly correlated with L2 outcomes than nonverbal WM measures, although their 95% CIs overlapped slightly. This result replicates the patterns reported in Daneman and Merikle’s (1996) meta-analysis of L1 reading comprehension (see Footnote 4). According to the multicomponent model of WM (e.g., Baddeley & Hitch, 1974), this difference would be attributed to the functioning of the phonological loop within the WM system. That is, one’s facility with processing or manipulating verbal content would be driven by the domain-specific WM component specialized for verbal information. However, contemporary views posit that WM (particularly the central executive) is a domain-general ability that operates independently of language (e.g., Engle, Kane, & Tuholski, 1999). According to such accounts, this verbal/nonverbal difference would simply be due to the overlap in the content being manipulated (i.e., common method bias), despite the fact that the WM system per se is not specialized for or constrained to a specific content domain. These results are congruent with both accounts.

Characteristics of criterion measures

The studies examining L2 processing outcomes and L2 proficiency outcomes showed similar magnitudes of correlations, around .25. Note that the processing outcomes were associated with a wider confidence interval and smaller fail-safe N, likely reflecting the smaller number of studies available (fewer than half the number available for the proficiency outcomes). Nonetheless, this result highlights the need for WM to be incorporated into comprehensive models of L2 processing as well as theories of SLA. Future research should examine the conditions under which WM affects various aspects of L2 processing and proficiency development, which could help elucidate the role of executive functions in specific L2 processes (for an example, see Robinson, 1995).

With respect to proficiency outcomes, a rich literature has focused on L2 aptitude effects, examining the role of individual differences in cognitive and perceptual abilities (e.g., Carroll, 1985; Grigorenko, Sternberg, & Ehrman, 2000). Contrary to some who have questioned the importance of WM (vs. phonological STM) in theories of aptitude (Juffs & Harrington, 2011), this meta-analysis corroborates claims that WM is correlated with L2 proficiency outcomes and, therefore, is an important component of any theoretical model of such outcomes—including models of L2 aptitude. To the extent that different components of aptitude are relevant to predicting the rate of success at earlier stages of learning versus the attainment of high-level proficiency (e.g., Linck et al., 2013), more research will be needed to contrast the contributions of WM for these two different proficiency outcomes.

Similar correlations were also found for comprehension and production outcomes, as well as for aggregate outcomes tapping into both skills (.24, .27, and .21, respectively), suggesting that WM is relevant to understanding both receptive and productive L2 abilities. Future research could compare and contrast the roles of WM on a specific process, such as lexical access, across different skills (e.g., during reading vs. speech production). Such an approach would further enhance theories of L2 processing and L2 proficiency by increasing the specificity of the role(s) of WM at various levels of analysis and across the various subskills.

No reliable differences in effect sizes were found between standardized criterion measures, such as the TOEFL subtests, and nonstandardized criterion measures, such as grammaticality judgment tasks (see Table 3), with moderate-sized correlations for both outcome types. The estimate, numerically, was slightly higher for standardized criterion measures, although it was also more uncertain (as indicated by the larger CI), likely due to the much smaller number of studies in our sample employing standardized criterion measures.

Characteristics of participants

Similar correlations were found with high- and low-proficiency bilinguals, suggesting that WM is related to L2 outcomes for both less- and more-proficient adult learners. It remains to be determined by future research whether the precise role of WM varies as a function of L2 proficiency. For example, studies on the executive function of inhibitory control have been interpreted as suggesting that the reliance on inhibitory control to support bilingual lexical selection changes as L2 proficiency increases (e.g., Costa & Santesteban, 2004; Schwieter & Sunderman, 2008).

WM Language × Complexity interaction

Numerically larger effects were found with complex WM measures than with simple WM measures, regardless of the language of administration, although these differences were only marginally significant, particularly for L2-administered WM measures (see the partially overlapping CIs). This pattern replicates the findings of the complexity covariate analysis reported above. Focusing on the complex WM tasks, effect sizes were marginally stronger for L2 than for L1 measures. As we discussed above, we suggest that such effects are driven primarily by the confounding of L2 proficiency and WM abilities.

Criterion Focus × WM Complexity interaction

For L2 processing outcomes, similar effect sizes were found with simple and complex WM measures. However, for L2 proficiency outcomes, the effect sizes for complex WM measures were significantly larger than the effect sizes for simple WM measures (.27 vs. .17, respectively), as indicated by the nonoverlapping CIs. This pattern suggests that the executive control component and the STM component of WM are similarly important for understanding differences in L2 processing, whereas the executive control component may be more critical when examining L2 proficiency outcomes. However, given the relatively small number of studies examining the relationship between simple WM measures and processing outcomes (k = 8), this result should be considered with caution until it is replicated in future studies.

File drawer analysis

Publication bias is a major issue that must be addressed in any successful meta-analysis (Rosenthal, 1979). To mitigate the risk of overestimating the population effect size due to such bias, we took considerable effort to locate unpublished studies, including many master’s theses and doctoral dissertations. Indeed, unpublished studies made up over 20% of the studies in our sample, contributing over 36% of the analyzed effect sizes. We also performed various analyses to assess the extent to which our effect size estimates were inflated due to publication bias. First, we examined the effect of publication status as a covariate. Although the effect size estimate for published studies was numerically larger than that for unpublished studies, the CIs overlapped almost entirely, indicating that the effect sizes did not differ significantly with publication status. Then, we computed the fail-safe N (Rosenthal, 1979) for each effect size estimate from the overall analysis and each covariate analysis (see rightmost column of Table 5). Across all analyses, the fail-safe N values were at least three times greater than the rule-of-thumb limit of 5k + 10, providing further evidence that publication bias does not threaten the validity of these results. For example, for the primary analysis, the fail-safe N estimates that over 2,000 studies reporting a correlation of near zero would be required, in addition to the observed 79 studies, to eliminate our confidence that a true effect existed in the population. Taken together, the publication-status analysis results and the collection of fail-safe N findings indicate that the inferences drawn from this meta-analysis likely were not significantly affected by publication bias, and that WM effects are robust and positive.

Sensitivity analysis

To examine whether particular assumptions of the reported analytic methods could have impacted the results, we conducted a series of alternative analyses in which particular assumptions were relaxed or further constrained, to assess the degree of variability in the estimated population effect size. Specifically, we repeated the analyses after modifying the data set in the following ways: (1) nine effect sizes reported as being “nonsignificant” were dropped from the analysis (recall that for the primary analyses, these effect sizes were set to zero and included in the analysis); (2) outliers were included; and (3) nonsignificant effect sizes were dropped and outliers were included. As expected, the effect size estimates were similar or slightly higher across these additional analyses, and the inferences drawn from the analyses were identical, with one exception: Although the verbal-versus-nonverbal contrast was not significantly different in the primary analysis (with partially overlapping 95% CIs), the effect sizes were significantly larger for verbal WM measures than for nonverbal measures in two of the three supplemental analyses, as indicated by nonoverlapping 95% CIs. This pattern suggests that WM measures involving the processing of verbal content may be more strongly associated with L2 outcomes. Taken together, these results indicate that the effect size estimates from the present meta-analysis are relatively robust to the specific inclusion/exclusion criteria.

General discussion

Since Baddeley and Hitch’s (1974) seminal article, WM has become a topic of ever-increasing interest, and has reached such a level of import that WM is regarded as a central construct in cognitive psychology (Conway et al., 2005). Over the past two decades, this interest has expanded to include the study of bilingualism, with the multicomponent model (see Baddeley, 2000) inspiring the integration of WM into theoretical accounts of L2 processing and SLA. Some researchers have even gone so far as to argue that WM is L2 aptitude (e.g., Miyake & Friedman, 1998). However, others have recently questioned the growing emphasis on WM, arguing that the empirical evidence is too inconsistent to justify a central role for WM in theories of SLA (e.g., Juffs, 2005).

This meta-analysis was conducted to provide a quantitative synthesis of the effect sizes reported in studies examining WM and L2 processing and proficiency outcomes. A series of analyses revealed a robust, positive correlation between WM and L2 outcomes, with a population effect size estimated at .255. We examined a set of covariates that were identified in the review of the literature. The covariate-analysis results indicated that the executive control component of WM (measured with complex span tasks) is more strongly related to L2 outcomes than is the storage component (measured by simple span tasks), which showed attenuated but still significantly positive effect sizes. Verbal WM measures also demonstrated slightly stronger correlations with L2 outcomes, likely due to domain overlap in the task stimuli.

Implications for theories of WM

From the framework of the multicomponent model of WM (see Baddeley, 2000), the stronger contribution of complex span measures can be interpreted as indicating differential importance for the component subsystems. Specifically, the responsibilities of the executive control system (e.g., managing conflict and preventing interference from distracting information) may be more important to L2 processing and proficiency than is simply maintaining an active representation in the phonological store. Indeed, evidence is growing that bilinguals and L2 learners must manage conflict between potentially competing representations from both languages, even when only using one language in a monolingual context (i.e., lexical access is “language nonselective”; see Dijkstra, 2005, for a review), suggesting a critical role for the executive control component of WM for successful L2 use (see also Hernandez & Meschyan, 2006).

More contemporary views of WM that do not posit slave subsystems might accommodate these results by pointing to the greater need for executive control—that is, simultaneous processing, attentional control, and coordination of multiple cognitive tasks—in the complex span tasks than in the simple span tasks. Given that bilinguals likely must engage these executive functions to support language use, the prediction is that WM tasks requiring greater executive control (i.e., complex span tasks) should be better predictors of L2 outcomes—as was borne out in the analysis. Although our data and analysis cannot adjudicate between these two competing views of WM, these results clearly indicate an important role of executive control processes for a range of L2 outcomes. More work will be needed to better specify the precise contributions of executive functions, as we discuss below.

Most current models of WM assume it to be domain-general. The stronger effect sizes found with L2 measures of WM (relative to L1 measures) could be interpreted as suggesting the need for a further fractionation of the multicomponent model into language-specific components (e.g., Alptekin & Erçetin, 2012). For example, one could argue for L1- and L2-specific phonological loops. However, this additional complexity is not warranted. The multicomponent model, as well as the other two more contemporary views of WM reviewed earlier, can easily address these content-specific differences, with the simplifying assumption that these effects are driven by the overlap in content between the measure and the criterion—not by the architecture of the WM construct itself. Similarly, the differences found between WM measures administered in the L1 versus the L2 likely reflect overlap in the content of the predictor and criterion. Moreover, empirical evidence has indicated that L1 and L2 measures of WM are highly related (e.g., Osaka & Osaka, 1992). No existing theoretical model posits separate WM components for each language, and the present results can be accommodated without positing any further fractionation of the WM construct.

Process decomposition

On the relation between individual differences in WM and executive functions, we have taken the position throughout this article that executive attention control processes of the WM system drive individual differences in WM; that is, the executive attention processes that are tapped by WM tasks are responsible for the covariation between WM and language processes. This view is consistent with Engle and Kane’s (2004) executive-attention theory of the variation in WM capacity. According to this view, the predictive power of WM capacity tasks (i.e., complex span tasks) comes from the fact that they tap executive attention processes, namely, the ability to maintain access to information and goals in the face of distraction, and despite interference and attentional shifts. Engle and Kane’s theory assumes that executive functions are components of the WM system, but it is otherwise agnostic with respect to the number of executive functions and how they relate to one another, or to how executive functions relate to individual differences in WM.

With respect to the number and nature of executive functions, one of the most influential frameworks to date is that of Miyake et al. (2000), which offers data showing that three highly prominent functions—updating, shifting, and inhibition—are related but separable processes. But, beyond this framework, we know very little about the unity and diversity of WM and executive function, how these constructs correlate, and how these abilities operate in the L1 and L2 domains. In short, we do not know which executive functions are the most important for L2 comprehension and production.

Two forthcoming papers may soon shed some new light on this topic, but only incompletely so. Shipstead, Harrison, et al. (2013) used structural equation modeling to test the unity and diversity of WM and four executive functions: memory updating, attention control, prospective memory, and verbal fluency. Their central hypothesis was that the relationship between WM and general fluid intelligence, which has been well documented elsewhere (e.g., Conway, Cowan, Bunting, Therriault, & Minkoff, 2002), can be explained by these several individual executive functions. Surprisingly, what they found was that the executive functions most highly related to individual differences in WM (memory updating and attention control) did not mediate the relationship between WM and general fluid intelligence, but the variance common to all the executive functions did partially mediate that relationship. These findings underscored the fact that variance in WM is more than just variance in executive function.

Shipstead, Trani, et al. (2013) extended these results into the L1 domain. They used structural equation modeling to relate these same four executive functions to verbal reasoning and multiple types of reading comprehension, including ordinary comprehension and comprehension when the reader is misled (e.g., garden-path sentences). The executive function memory updating fully mediated the relationship between WM and ordinary paragraph comprehension, and the attention control and verbal fluency functions were essential for comprehending the more ambiguous garden-path material. But, again, they did not find evidence of executive function mediating the relationship between WM and verbal reasoning ability, which is consistent with Shipstead, Harrison, et al.’s (2013) findings for WM and general fluid intelligence (i.e., reasoning) ability.

These new results have implications for Engle and Kane’s (2004) theory, but also for how we interpret our results here. First, we must assume that variance in WM is more than just variance in executive function, but considerably more research will be needed to specify which variance in WM is due to executive function and which is due to other aspects of WM (e.g., on the size of the focus of attention, see Cowan, 2001; on retrieval from secondary memory, see Unsworth & Spillers, 2010). Second, we do not know which executive functions are most important for L2 proficiency. What is needed is a study of the kind reported by Shipstead, Trani, et al. (2013), but with L2 materials, including tests of L2 proficiency, L2 comprehension, and verbal reasoning in the L2.

Although data on this topic are lacking, we can speculate on the kinds of executive functions that are important for L2 outcomes. For example, in a lexical decision task, participants are presented with a letter string and must decide whether or not the stimulus is a word. This task requires cognitive processes ranging from perceptual identification of the presented stimulus (e.g., a nonword letter string) to initiating a task-relevant response (e.g., pressing a button to indicate that the stimulus is not a word). To further our understanding of how WM supports the performance of L2 tasks, researchers could employ a process decomposition approach, whereby specific cognitive subprocesses are identified at a more fine-grained level (e.g., for updating: monitoring, item deletion, and active maintenance; see Miyake & Friedman, 2012, note 1). These subprocesses could then be linked to specific linguistic processes to increase the specificity of our understanding of when and how WM contributes to L2 outcomes.

Another executive function that could be important in the language domain is the need to resolve conflict between competing representations (e.g., attention control as defined by Shipstead, Harrison, et al., 2013, or inhibition as defined by Miyake et al., 2000). This ability is also relevant to L2 processing and proficiency, given that a bilingual’s two languages are both active and available in most circumstances (for reviews, see Kroll, Bobb, & Wodniecka, 2006; Kroll, Sumutka, & Schwartz, 2005). WM tasks can be manipulated to place more or less focus on this “conflict resolution” aspect of performance. The prediction would be that performance in conditions that require conflict resolution should be more highly correlated with L2 outcomes that tap into some facet of linguistic conflict resolution. Consider the n-back task, which requires participants to decide whether each stimulus in a sequence matches the one that appeared n items ago. In a low-conflict version, the list of memoranda for a given trial could be selected to have minimal repetition in nontarget locations, so that there would be little need to overcome proactive interference when making a judgment. To increase the conflict resolution demands, the task could be modified to include lures, that is, memoranda that are repeated just prior to or following the target location (e.g., on a three-back trial, a lure would appear in Position 2 or 4). We might then predict that performance in the high-conflict (lures) condition should better predict performance on L2 tasks that specifically rely on this kind of conflict resolution, such as the reading of garden-path sentences that require syntactic reinterpretation at the point of disambiguation (e.g., for evidence that n-back training improved L1 sentence processing, see Novick, Hussey, Teubner-Rhodes, Harbison, & Bunting, 2013).
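The distinction between targets and lures in such a task can be sketched as follows. The function name and labels are hypothetical, and the sketch covers only the classification of items in a stimulus sequence, not the design of an actual experiment.

```python
def classify_n_back(sequence, n):
    """Label each item in an n-back sequence.

    'target'    : the item matches the one presented n positions earlier.
    'lure'      : the item is not a target but matches the item presented
                  n - 1 or n + 1 positions earlier (e.g., 2 or 4 back on a
                  three-back trial).
    'nontarget' : anything else.
    """
    labels = []
    for i, item in enumerate(sequence):
        is_target = i >= n and sequence[i - n] == item
        # For n = 1, "n - 1 back" would be the item itself, so it is skipped.
        near_match = n > 1 and i >= n - 1 and sequence[i - (n - 1)] == item
        far_match = i >= n + 1 and sequence[i - (n + 1)] == item
        if is_target:
            labels.append("target")
        elif near_match or far_match:
            labels.append("lure")
        else:
            labels.append("nontarget")
    return labels

if __name__ == "__main__":
    # Hypothetical three-back letter sequence.
    letters = list("ABCACB")
    for letter, label in zip(letters, classify_n_back(letters, n=3)):
        print(letter, label)
```

A low-conflict list would contain few or no items labeled as lures, whereas a high-conflict list would include them deliberately.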

It will be important for future work to consider how specific tasks and conditions—in both WM and linguistic tasks—call upon specific executive control processes to better elucidate the contributions of specific control mechanisms to L2 processing and proficiency development.

WM and L2 outcomes: A (bi)directional relationship?

It is important to keep in mind that the effect size analyzed in this meta-analysis was the correlation coefficient, and therefore we cannot draw any inferences regarding causality. However, on the basis of these results, it is tempting to infer a directional relationship in which greater WM resources cause better performance on the L2 criterion measures. Such an account would be consistent with research in other domains, in which WM has been identified as a mechanism underlying individual differences in performance across a wide range of outcomes, such as analogical reasoning and reading comprehension (Cowan, 2005; Daneman & Merikle, 1996; Engle, 2001). Some evidence suggests that systematic training of executive control processes can lead to improvements not only in performance on similar WM tasks (i.e., near transfer; see Harrison et al., 2013; Sprenger et al., 2013; von Bastian & Oberauer, 2013), but also on language processing tasks that place similar demands on executive control (i.e., far transfer; e.g., Novick et al., 2013). If WM training can lead to improvements in L2 processing tasks requiring executive control, then this could suggest a causal relationship going in the direction of WM to L2 outcomes.

Evidence from another body of research suggests the opposite direction of causality. A growing literature is demonstrating that so-called “crib bilinguals” (individuals who have spoken multiple languages from birth) show enhanced executive functions relative to monolinguals. This has been demonstrated on tasks involving conflict, with evidence coming from behavioral methods (for a review, see Bialystok, 2010) as well as from neural measures of the efficiency of cognitive control (Gold, Kim, Johnson, Kryscio, & Smith, 2013). Initially, these results were interpreted as suggesting a benefit to inhibitory control processes in particular. However, more recent research has suggested that the benefits are not limited to contexts requiring inhibition, but rather extend to task conditions that place demands on executive control functions more generally. The assumption in this research is that the lifetime of experience managing multiple language systems within a single mind confers benefits to the domain-general executive control abilities of bilinguals, and that these benefits extend to other domains and tasks.

So, the directionality of the relationship between WM and L2 outcomes remains unclear. On the one hand, WM has been suggested as a (causal) mechanism underlying performance in a range of domains, including L2 processing and proficiency outcomes. But the bilingual-advantage literature suggests that the repeated, intensive performance of multilingual language tasks can impact executive functioning, and hence that the causation is in the opposite direction. The currently available data from these different literatures are unable to disentangle these possibilities. One main goal of future work could be to design experiments to clarify the direction of causality in the relationship between WM and L2 outcomes. It is entirely possible that the relationship is bidirectional, or that the directionality of the relationship depends on other factors, such as the level or time course of an analysis. These possibilities should be explored in order to identify the conditions in which WM impacts L2 outcomes, as well as the specific types and durations of L2 experience that can lead to improvements in executive control. This research will further our understanding of the complex interplay between language and cognition.

Implications for models of bilingualism

As we stated in the introduction, some have argued that the role of WM in L2 processing has been overstated (Juffs & Harrington, 2011; Williams, 2011). On the contrary, the results of this meta-analysis suggest the need to revise existing models of bilingual comprehension and production to address individual differences in WM. For example, consider the contributions of WM to Green’s (1998) inhibitory control model of bilingual speech production, which was motivated in part by Norman and Shallice’s (1986) model of action. When bilinguals speak in one language, they are unable to completely “turn off” their other language (see Kroll et al., 2006), suggesting the need for control mechanisms to resolve any potential cross-language competition. According to Green’s model, domain-general inhibitory control is the main mechanism for resolving this lexical competition. This control is exerted from outside the language system by the supervisory attentional system, which activates schemas that then prioritize task-relevant responses and inhibit inappropriate responses. The supervisory attentional system activates schemas on the basis of the current goals of the speaker. Individual differences in WM could be accounted for at the level of the supervisory attentional system, which is responsible for maintaining task goals and prioritizing task schemas. More efficient management of task schemas would allow individuals with greater WM to more quickly resolve interference between competing representations. That is, the relationship between WM and L2 outcomes (particularly those involving conflict) could be driven by better top-down control at the level of the task schemas.

Following our recommendation above to consider more fine-grained subprocesses of the WM system, we might go a step further and speculate on the different roles of various executive functions. Considering the three functions from Miyake et al.’s (2000) framework, Green’s model has clear connections to conflict resolution ability (or the inhibition executive function). According to the model, this ability would be the primary mechanism behind linguistic inhibitory control, and it might be represented at the level of the task schemas (which are responsible for inhibiting nontarget representations within the language system). The inhibitory control model was developed to account for a range of findings suggesting that representations in the nontarget language are suppressed in order to allow successful communication in the target language (Green, 1998; also see Kroll, Bobb, Misra, & Guo, 2008, for a recent review of evidence in favor of bilingual inhibitory processes). Moreover, some evidence has directly linked better domain-general inhibitory control abilities to reduced cross-language competition, as reflected by smaller switch costs in a language-switching task (Linck, Schwieter, & Sunderman, 2012).

Less is known about the precise roles for shifting and updating. We suggest that shifting ability might be represented at the level of the supervisory attentional system, where control is exerted over the language system by prioritizing different task schemas on the basis of the current goals of the speaker. Although little direct evidence has indicated this link, one useful data point comes from evidence suggesting that bilinguals who switch between their languages frequently during naturalistic conversations (i.e., frequent code switchers) show better performance on a domain-general, nonlinguistic task-switching task (Prior & Gollan, 2011), suggesting that shifting ability may contribute to bilingual language control by supporting shifts between task schemas. In contrast, updating may best be represented at the level of goal setting and maintenance in the face of distraction. The goal of the speaker provides top-down guidance over the language system, and the current goal must be maintained in the face of distracting information that can inappropriately activate other goals. In summary, we speculate that Green’s (1998) inhibitory control model could incorporate Miyake et al.’s (2000) three related but separable executive functions, such that the current goal of the speaker (updating) directly informs the supervisory attentional system’s functioning (shifting), which then translates that goal into a specific task schema that exerts control over the language system (inhibition). This discussion provides one possible direction that could be pursued to develop models of bilingual language processing. But what is clear is that the construct of WM—and a more nuanced fractionation of executive functions—can and should inform these developments.

Some models of language aptitude already account for differences in WM. Indeed, Miyake and Friedman (1998) argued that WM essentially underlies the components of some models of language aptitude. For example, Skehan (1989) hypothesized that language aptitude is composed of language analytic capacity, memory ability, and phonetic coding ability—all three of which may be driven by WM and STM. Similarly, Linck et al. (2013) proposed the inclusion of both WM and STM as key components of a model of aptitude for higher-level proficiency attainment. Their study was motivated by a theoretical model of aptitude focusing on the cognitive and perceptual abilities that underlie the skills required to attain high-level foreign language proficiency. With theories of language aptitude taking a more cognitive view of SLA in recent years (e.g., Dörnyei & Skehan, 2003), WM will clearly remain a core component of successful models of language aptitude.

Strengths and limitations of the present meta-analysis

The studies included in our sample cover a range of disciplines—including psycholinguistics, cognitive psychology, and SLA—and reflect a diverse set of sources (journals, proceedings, and unpublished studies). Consequently, the inferences from this meta-analysis are not biased by undue influence from a particular theoretical perspective. As we discussed above, these results have implications for models of bilingual language processing, theories of SLA, and research on the contributions of WM to performance more broadly construed.

To better understand why the effect sizes reported in the literature are so variable, we examined the studies that reported the most extreme negative effect sizes (rs < −.20), as well as studies that reported nonsignificant correlations without providing a specific correlation estimate. The sample size for many of these studies was in the range of 50–100 participants—above the median across the meta-analysis sample—suggesting that the results do not necessarily stem from low power. However, many of these studies employed global outcome measures (e.g., fluency or complexity), which may be susceptible to extra measurement error, relative to specific measures of language processing or proficiency. Alternatively, with global criterion measures, perhaps learners with less WM have more of an opportunity to employ compensatory strategies, thereby reducing the potential for WM to account for variability in these outcomes. Moreover, some studies included a participant sample that was heterogeneous with respect to education, language background, and degree of acculturation into the local society (e.g., Andringa, Olsthoorn, van Beuningen, Schoonen, & Hulstijn, 2012). The variability in education, L1 abilities, and length of L2 exposure may have introduced additional noise into the data that attenuated any detectable relationship between WM and the outcomes.

Our survey of the literature discovered few studies of highly proficient adult learners that were relevant to the present analyses. As additional studies of WM and L2 outcomes are conducted with highly proficient learners, further synthesis of the extant results will enhance our understanding of whether and how WM’s role(s) may change across the proficiency spectrum. In addition, we excluded from our meta-analysis any studies involving bilinguals who had been exposed to the L2 during childhood. Thus, it remains to be determined whether our findings would generalize to other participant populations, such as simultaneous, balanced bilinguals who have continually used both languages throughout their lives. Given that recent research has suggested that lifelong bilingualism confers cognitive benefits, including enhanced attention control (e.g., Bialystok, 2010), future studies and meta-analyses will be needed to determine whether WM’s relationship with L2 outcomes differs for this population, relative to adult L2 learners.

It is also important to note that this meta-analysis focused on bivariate correlations, and therefore necessarily ignored the potential explanatory power of other relevant factors, such as general intelligence. As we mentioned previously, WM and general intelligence are correlated (e.g., Conway, Cowan, Bunting, Therriault, & Minkoff, 2002); therefore, when accounting for L2 outcomes, variance is likely to be shared between WM and general intelligence. Taking the process decomposition approach advocated above, room is certainly available to further slice up the variance in L2 outcomes and to provide incremental explanations by investigating other relevant constructs. This approach fits with the results of ongoing work on L2 aptitude, in which WM has been identified as one component of aptitude, along with other relevant cognitive abilities, including associative learning and implicit learning (e.g., Linck et al., 2013). To move the field past simply stating that WM (broadly construed) is related to L2 outcomes, it is time to focus future efforts on further specifying the subprocesses within the WM system that drive the relationships between WM and L2 outcomes, and then examining the contributions of these subprocesses and other relevant factors, like general intelligence.
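The point about shared variance can be illustrated with the standard first-order partial correlation, which estimates the association between two variables after a third has been statistically controlled. The sketch below is not an analysis from this meta-analysis: the function name is an assumption, and the three input correlations are made-up values chosen only to show how a bivariate WM-L2 correlation can shrink once general intelligence is taken into account.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation between x and y, controlling for z.

    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))
    """
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denominator == 0:
        raise ValueError("partial correlation is undefined when |r_xz| or |r_yz| equals 1")
    return (r_xy - r_xz * r_yz) / denominator


if __name__ == "__main__":
    # Hypothetical correlations for illustration only; NOT estimates from the meta-analysis.
    r_wm_l2 = 0.25  # WM with an L2 outcome
    r_wm_g = 0.50   # WM with general intelligence
    r_g_l2 = 0.30   # general intelligence with the L2 outcome

    adjusted = partial_correlation(r_wm_l2, r_wm_g, r_g_l2)
    print(f"bivariate r = {r_wm_l2:.2f}, partial r = {adjusted:.2f}")
```

With these invented inputs the partial correlation is roughly half the bivariate value, which shows why studies that report all three correlations would allow a more precise account of WM’s unique contribution than bivariate effect sizes alone.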

Conclusions

In summary, the present meta-analysis was conducted to provide a quantitative synthesis of findings regarding the relationship between WM and a range of L2 outcomes, and to identify moderators of this relationship. The results are congruent with claims that WM is an important component of the cognitive processes underlying bilingual language processing and performance on measures of L2 proficiency. Nonetheless, significant work still remains to be done to link specific executive functions to specific language processes, in order to advance theoretical models and further our understanding of the contributions of domain-general cognitive control mechanisms to L2 outcomes.