The seminal report most often taken to support the efficacy of systematic phonics compared with alternative methods was a government document produced by the National Reading Panel (NRP, 2000), with the findings later published in peer-reviewed form (Ehri et al. 2001). The authors carried out the first meta-analysis evaluating the effects of systematic phonics compared with forms of instruction that include unsystematic or no phonics across a range of reading measures, including word naming, nonword naming, and text comprehension tasks. The meta-analysis included 66 treatment-control comparisons taken from 38 experiments, and the main findings can be seen in Table 1. Based on these findings, Ehri et al. (2001) concluded in the abstract:
“Systematic phonics instruction helped children learn to read better than all forms of control group instruction, including whole language. In sum, systematic phonics instruction proved effective and should be implemented as part of literacy programs to teach beginning reading as well as to prevent and remediate reading difficulties.”
The NRP report has been cited over 24,000 times and continues to be used in support of systematic phonics, with over 1000 citations in 2019. In addition, the Ehri et al. (2001) article has been cited over 1000 times. However, a careful look at the results undermines these strong conclusions.
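The results below are reported as Cohen's d values (standardized mean differences). As a quick reference for what a value like d = 0.41 means, here is a minimal sketch of the computation; the group means, standard deviations, and sample sizes are invented purely for illustration:

```python
# Cohen's d: the standardized mean difference used throughout the
# meta-analyses discussed here. All numbers below are hypothetical.

def cohens_d(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Difference in group means divided by the pooled standard deviation."""
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / pooled_var ** 0.5

# Invented reading-test scores for a treatment and a control group
d = cohens_d(mean_t=52.0, sd_t=10.0, n_t=30, mean_c=48.0, sd_c=10.0, n_c=30)
print(round(d, 2))  # 0.4: a 4-point gain measured against a 10-point spread
```

A d of 0.4 thus corresponds to a treatment group scoring about four tenths of a standard deviation above the control group, regardless of the test's raw scale.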
Table 1 Summary of findings

The most important limitation is that systematic phonics did not help children labeled “low achieving” poor readers (d = 0.15, not significant). These were children above first grade who were below-average readers and whose cognitive level was below average or not assessed. By contrast, children labeled “reading disabled,” who were below grade level in reading but at least average cognitively and in most cases above first grade, did benefit (d = 0.32). Note that, by definition, half the population of children above grade 1 will have an IQ below average, and it is likely that more than 50% of struggling readers above grade 1 will fall into this category given the comorbidity of developmental disorders (Gooch et al. 2014). Of course, additional research may show that systematic phonics does benefit low achieving poor readers (the NRP only included eight comparison groups in this condition), but there is no evidence for this from the NRP meta-analysis.
Second, based on the finding that effect sizes were greater when phonics instruction began by first grade (d = 0.55) rather than after first grade (d = 0.27), the authors of the NRP wrote in the executive summary “Phonics instruction taught early proved much more effective than phonics instruction introduced after first grade” (pp. 2–93). But in the body of the text, it becomes clear that findings do not support this strong conclusion. One problem is that the majority of older students (78%) in the various studies included in the NRP analysis were either low achieving readers or students with reading disability, and as noted above, systematic phonics was less effective with both these populations (especially the former group). With regard to the normally developing older readers, the NRP meta-analysis only included seven comparison groups, and four of them used the Orton-Gillingham method that was developed for younger students. As noted by Ehri et al. (2001):
“The conclusion that phonics instruction is less effective when introduced beyond first grade may be premature… Other types of phonics programs might prove more effective for older readers without any reading problems.” (p. 428)
This caveat in the body of the report is straightforwardly at odds with the executive summary, and yet it is the executive summary that so many authors cite as providing evidence that early phonics instruction is important.
Third, although the authors of the NRP emphasized that systematic phonics had a long-term impact, the effect size declined from d = 0.41 when children were tested immediately following the intervention to d = 0.27 following a 4- to 12-month delay. Moreover, the authors did not assess whether the long-term benefits extended to spelling, reading texts, or reading comprehension. Given that the short-term effects on spelling, reading texts, and reading comprehension were much reduced compared with the overall short-term effect (Table 1), there is no reason to assume these effects persisted.
Fourth, the evidence that systematic phonics is more effective than whole language is weaker still. This claim is not based on the overall effect size of d = 0.41, but rather on a subanalysis that specifically compared systematic phonics to whole language. This analysis was based on 12 rather than 38 studies, and not one of these 12 studies used a randomized control trial (RCT) design. The analysis showed a reduced overall effect of d = 0.31 (still significant), with the largest effect obtained for decoding (mean of the reported effect sizes was d = 0.55) and the smallest effect on comprehension (mean of the reported effect sizes was d = 0.19), with only two studies assessing performance following a delay. And although the NRP is often taken to support the efficacy of synthetic systematic phonics (the version of phonics legally mandated in the UK), the NRP meta-analysis only included four studies relevant for this comparison (of the 12 studies that compared systematic phonics with whole language, only four assessed synthetic phonics). The effect sizes in order of magnitude were d = 0.91 and d = 0.12 in two studies that assessed grade 1 and 2 students, respectively (Foorman et al. 1998); d = 0.07 in a study that assessed grade 1 students (Traweek & Berninger, 1997); and d = −0.47 in a study carried out on grade 2 students (Wilson & Norman, 1998).
In sum, rather than the strong conclusions emphasized in the executive summary of the NRP (2000) and the abstract of Ehri et al. (2001), the appropriate conclusion from this meta-analysis should be something like this:
Systematic phonics provides a small short-term benefit to spelling, reading text, and comprehension, with no evidence that these effects persist following a delay of 4–12 months (the effects were neither assessed nor reported). It is unclear whether there is an advantage of introducing phonics early, and there are no short- or long-term benefits for the majority of struggling readers above grade 1 (children with below-average intelligence). Systematic phonics did provide a moderate short-term benefit to regular word and pseudoword naming, with overall benefits significant but reduced by a third following 4–12 months.
And even these weak conclusions in support of systematic phonics are not justified given subsequent work by Camilli et al. (2003, 2006) and Torgerson et al. (2006) who reanalyzed the studies (or a subset of studies) included in the NRP, as described next.
Camilli et al. (2003, 2006)
Camilli et al. (2003) identified a number of flaws in the NRP meta-analysis, but here I emphasize one: it was not designed to assess whether there is any benefit in teaching phonics systematically. Similar design choices were made by all subsequent meta-analyses taken to support systematic phonics, and this has led to unwarranted conclusions from these meta-analyses, as I detail below.
As noted above, the headline figure from the NRP analysis is that systematic phonics showed an overall immediate effect size of d = 0.41. What needs to be emphasized is that this figure is the product of comparing systematic phonics with a heterogeneous control condition that included (1) intervention studies that used unsystematic phonics and (2) intervention studies that used no phonics. As an elementary point of logic, if you compare systematic phonics to a mixture of different methods, some of which use unsystematic phonics and others that use no phonics, then it is not possible to conclude that systematic phonics is more effective than unsystematic phonics. In order to assess whether the “systematic” in systematic phonics is important, it is necessary to compare systematic phonics to conditions that included unsystematic phonics, something that the NRP (2000) did not do.
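This logical point can be made concrete with a toy calculation. Suppose, purely for illustration, that systematic phonics beats no-phonics instruction by d = 0.8 but is no better than unsystematic phonics (d = 0.0), and that half the control comparisons are of each type. The pooled overall effect is then indistinguishable from a scenario in which systematic phonics modestly beats both control types:

```python
# Hypothetical sketch of the pooling problem: the same overall effect
# against a mixed control condition arises whether or not systematic
# phonics beats unsystematic phonics. All numbers are invented.

def pooled_effect(effects_vs_controls):
    """Simple (unweighted) mean of per-comparison effect sizes."""
    return sum(effects_vs_controls) / len(effects_vs_controls)

# Scenario A: a large benefit over no-phonics controls (d = 0.8) but no
# benefit over unsystematic-phonics controls (d = 0.0), pooled together.
scenario_a = [0.8, 0.8, 0.0, 0.0]

# Scenario B: a uniform modest benefit over every type of control.
scenario_b = [0.4, 0.4, 0.4, 0.4]

print(round(pooled_effect(scenario_a), 2))  # 0.4
print(round(pooled_effect(scenario_b), 2))  # 0.4 as well
```

Because both scenarios yield the same pooled d, an overall effect like the NRP's d = 0.41 cannot by itself tell us whether the "systematic" component did any work.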
The reason why this is important is that unsystematic phonics is standard in common alternatives to systematic phonics. Indeed, in addition to the widespread use of unsystematic phonics in the USA prior to the NRP (2000) report (as shown above in a quote from the NRP), Her Majesty’s Inspectorate (1990) reported that unsystematic phonics was also common in the UK prior to the legal requirement to teach systematic synthetic phonics in England in 2007, writing
“...phonic skills were taught almost universally and usually to beneficial effect” (p. 2) and that “Successful teachers of reading and the majority of schools used a mix of methods each reinforcing the other as the children’s reading developed” (p. 15).
Accordingly, the important question is whether systematic phonics is more effective than the unsystematic phonics that is used in alternative teaching methods.
In order to assess the importance of teaching phonics systematically, Camilli et al. (2003, 2006) coded the studies included in the NRP as having no phonics, unsystematic phonics, or systematic phonics. In addition, the authors also noted that some moderator variables were ignored by the NRP analysis that may have contributed to the outcomes. Accordingly, the authors also coded whether or not the intervention studies included language-based reading activities such as shared writing, shared reading, or guided reading, whether treatments were carried out in the regular class or involved tutoring outside the class, and whether basal readers were used (if known). Both the experimental and control groups were coded with regard to these moderator variables. It should also be noted that the Camilli et al. (2003, 2006) analyses were carried out on a slightly modified dataset given problems with some of the studies and conditions included in the NRP report. For example, the authors dropped one study (Vickery et al., 1987) that did not include a control condition (an exclusion condition according to the NRP) and included three studies that were incorrectly excluded (the studies did fulfill the NRP inclusion criterion), resulting in a total of 40 rather than 38 studies. The interested reader can find out more details regarding the slightly modified dataset in Camilli et al. (2003), but in any case, the different datasets produce the same outcome as discussed below.
The Camilli et al. (2003) analysis showed that the effect size of systematic phonics compared with nonsystematic phonics was significant but roughly half the size of the effect of systematic phonics reported in the NRP report (d = 0.24 vs. d = 0.41). Interestingly, the analysis also found significant and numerically larger effects of systematic language activities (d = 0.29) and tutoring (d = 0.40). The subsequent analysis by Camilli et al. (2006) was carried out on the same dataset but used a new method of analysis (a multilevel modeling approach) and included three rather than two levels of language-based reading activities as a moderator variable (none vs. some vs. high levels of language-based activities). This analysis revealed an even smaller effect of systematic phonics (d = 0.12) that was no longer significant. Camilli et al. (2006) took these findings to challenge the strong conclusion drawn by the authors of the NRP.
These analyses were subsequently supported by Stuebing et al. (2008), who reanalyzed the Camilli et al. (2003, 2006) dataset and showed that the different outcomes were not the consequence of the slightly different studies included in the Camilli and the NRP meta-analyses. However, Stuebing et al. (2008) drew a different conclusion, writing
The NRP question is analogous to asking about the value of receiving the intervention versus not receiving the intervention. The Camilli et al. (2003) report is analogous to asking what is the value of receiving a strong form of the intervention compared to receiving weaker forms of the intervention and relative to factors that moderate the outcomes. From our view, both questions are reasonable for intervention studies.
But the two questions are not equally relevant to teaching policy. The relevant question is whether systematic phonics is better than preexisting practices. Given that unsystematic phonics was standard practice, and given that the Camilli et al. (2006) analysis failed to show an advantage of systematic over unsystematic phonics, the Camilli et al. analysis challenges the main conclusion that schools should introduce systematic phonics.
To avoid any confusion, it is important to highlight that the Camilli et al. (2006) reanalysis of the NRP dataset does not suggest that grapheme-phoneme knowledge is unimportant. Indeed, their reanalysis suggests that systematic phonics is significantly better than a nonphonics control condition. Rather, their key finding is that systematic phonics was no better than nonsystematic phonics as commonly used in schools.
Torgerson et al. (2006)
The Torgerson et al. (2006) meta-analysis was primarily motivated by another key limitation of the NRP report not touched on thus far, namely, the fact that the NRP included studies that employed both randomized and nonrandomized designs. Given the methodological problems with nonrandomized studies, Torgerson et al. (2006) carried out a new meta-analysis that was limited to randomized control trials (RCTs). But it is worth noting two additional limitations of the NRP report that motivated this analysis.
First, the authors were concerned that bias played a role in the 13 RCT studies included in the original NRP report given that the NRP report only considered published studies (studies that obtained null effects may have been more difficult to publish). Indeed, the authors carried out a funnel plot analysis on these 13 studies and concluded that the results provided: “…prima facie evidence for publication bias, since it seems highly unlikely that no RCT has ever returned a null or negative result in this field.” Accordingly, Torgerson et al. (2006) searched for unpublished studies that met their inclusion criteria. They found one additional study, which reported an effect size of −0.17, and included it in their analyses. Note that this bias would have inflated the small effects reported in the NRP (2000) and the Camilli et al. (2003, 2006) meta-analyses. Second, Torgerson et al. removed two studies that should have been excluded from the NRP analyses (Gittelman and Feingold 1983, because it did not include a phonics instruction intervention group; Mantzicopoulos et al., 1992, because the children in the control condition did not receive a reading intervention, and the attrition rate of the study was extreme, with 437 children randomized and only 168 children tested). This led to 12 studies that compared systematic phonics to a control condition that included unsystematic phonics or no phonics instruction. The key positive result was for word reading accuracy, with an effect size estimated at between 0.27 and 0.38 (depending on assumptions built into the analyses). By contrast, no significant effects were obtained for comprehension (d estimates ranging between 0.24 and 0.35) or spelling (d = 0.09).
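The way publication bias inflates a meta-analytic mean can be illustrated with a small simulation (all parameters are invented; this is not a reanalysis of Torgerson et al.'s data). If comparisons are published only when they cross a significance threshold, the mean of the published studies systematically overstates the true effect:

```python
# Toy simulation of publication bias. A small true effect (d = 0.20)
# is estimated in many studies with sampling error; only "significant"
# results (z > 1.96) survive to publication. Parameters are hypothetical.
import random

random.seed(1)
true_d, se = 0.20, 0.25          # assumed true effect and per-study standard error
studies = [random.gauss(true_d, se) for _ in range(2000)]

published = [d for d in studies if d / se > 1.96]  # passes the significance filter
all_mean = sum(studies) / len(studies)
pub_mean = sum(published) / len(published)

print(round(all_mean, 2))  # close to the true 0.20
print(round(pub_mean, 2))  # substantially larger than the true effect
```

The direction of the distortion is the key point: if anything, a literature filtered this way makes systematic phonics look better than it is, which is why the funnel plot evidence matters for interpreting the small effects above.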
There are, however, reasons to question the significant word reading accuracy result. This result was largely due to one outlier study (Umbach et al. 1989) that obtained a massive effect on word reading accuracy (d = 2.69). In this study, the control group was taught by two regular teachers with help from two university-supervised practicum students, whereas the experimental group was taught by four Masters’ degree students who were participating in a practicum at a nearby university. Accordingly, there is a clear confound in the design of the study. Torgerson et al. themselves reanalyzed the results when this study was excluded and found that the word reading accuracy effect was reduced (d estimates between 0.20 and 0.21), just reaching significance in one analysis (p = 0.03) and nonsignificant in another (p = 0.09). For a summary of findings, see Table 1. And even these findings likely overestimate the efficacy of systematic phonics given the evidence that publication bias may have inflated the effect size estimates. As Torgerson et al. wrote
In addition, the strong possibility of publication bias affecting the results cannot be excluded. This is based on results of the funnel plot... It seems clear that a cautious approach is justified (p. 48).
The conclusions one can draw are further weakened by the quality of the studies included in the meta-analysis, with the authors writing
…none of the 14 trials reported method of random allocation or sample size justification, and only two reported blinded assessment of outcome… all were lacking in their reporting of some issues that are important for methodological rigor. Quality of reporting is a good but not perfect indicator of design quality. Therefore due to the limitations in the quality of reporting the overall quality of the trials was judged to be “variable” but limited.
Nevertheless, despite all the above issues, the authors concluded
Systematic phonics instruction within a broad literacy curriculum appears to have a greater effect on children’s progress in reading than whole language or whole word approaches. The effect size is moderate but still important.
This quote not only greatly exaggerates the strength of the findings (which helps explain why the meta-analysis has been cited over 250 times in support of systematic phonics), but it again reveals a misunderstanding regarding the conclusions one can draw from the design of the meta-analysis. The study continued to use the design of the NRP (2000) meta-analysis that compared systematic phonics to a control condition that combined (1) nonsystematic phonics and (2) no phonics. Accordingly, it is not possible to conclude that systematic phonics is more effective than whole word instruction that uses unsystematic phonics. That would require a direct comparison between conditions that was not carried out.
To summarize thus far, a careful review of the NRP (2000) findings shows that the benefits of systematic phonics for reading text, spelling, and comprehension are weak and short-lived, with reduced or no benefits for struggling readers beyond grade 1. The subsequent Camilli et al. (2003, 2006) and Torgerson et al. (2006) reanalyses further weaken these conclusions. Indeed, Camilli et al. (2006) found no overall benefit of systematic phonics over nonsystematic phonics, and Torgerson et al. (2006) did not find any benefit of systematic phonics for word reading accuracy, comprehension, or spelling in the subset of RCT studies included in the NRP (when one outlier study was excluded). The null effects in the Torgerson et al. (2006) meta-analysis were obtained despite evidence for publication bias and a flawed design that combined unsystematic and no-phonics studies into a single control condition (with both of these factors serving to inflate the apparent benefits of systematic phonics).
McArthur et al. (2012)
This meta-analysis was designed to assess the efficacy of systematic phonics with children, adolescents, and adults with reading difficulties. The authors included studies that used randomization, quasi-randomization, or minimization (which minimizes differences between groups on one or more factors) to assign participants to either a systematic phonics intervention group or a control group that received no training or alternative training that did not involve any reading activity (e.g., math training). That is, the control group received no phonics at all. Based on these criteria, the authors identified 11 studies that assessed a range of reading outcomes, although some outcome measures were only assessed in a few studies. Critically, the authors found a significant effect on word reading accuracy (d = 0.47, p = 0.03) and nonword reading accuracy (d = 0.76, p < 0.01), whereas no significant effects were obtained for word reading fluency (d = −0.51; the expected direction), reading comprehension (d = 0.14), spelling (d = 0.36), or nonword reading fluency (d = 0.38, the unexpected direction). Based on the results, the authors concluded that systematic phonics improved performance, but they were also cautious in their conclusion, writing
…there is a widely held belief that phonics training is the best way to treat poor reading. Given this belief, we were surprised to find that of 6632 records, we found only 11 studies that examined the effect of a relatively pure phonics training programme in poor readers. While the outcomes of these studies generally support the belief in phonics, many more randomised controlled trials (RCTs) are needed before we can be confident about the strength and extent of the effects of phonics training per se in English-speaking poor word readers.
But there are reasons to question even these modest conclusions. One notable feature of the word reading accuracy results is that they were largely driven by two studies (Levy and Lysynchuk 1997; Levy et al. 1999) with effect sizes of d = 1.12 and d = 1.80, respectively. The remaining eight studies that assessed word reading accuracy reported a mean effect size of 0.16 (see Appendix 1.1, page 63). This is problematic given that the children in the Levy studies were trained on one set of words, and then reading accuracy was assessed on another set of words that shared either onsets or rimes with the trained items (e.g., a child might have been trained on the word beak and later tested on the word peak; the stimuli were not presented in either paper). Accordingly, the large benefits observed in the phonics conditions compared with a nontrained control group only show that training generalized to highly similar words, rather than improving word reading accuracy more generally (the claim of the meta-analysis). In addition, both Levy et al. studies taught systematic phonics using one-on-one tutoring. Although McArthur et al. reported that group size did not have an overall impact on performance, one-on-one training studies with a tutor showed an average effect size of d = 0.93 (over three studies). Accordingly, the large effect size for word reading accuracy may be more the product of one-on-one training with a tutor than any benefit of phonics per se, consistent with the findings of Camilli et al. (2003). In the absence of the two studies by Levy and colleagues, there is no evidence from the McArthur et al. (2012) meta-analysis that systematic phonics improved word reading accuracy, word reading fluency, reading comprehension, spelling, or nonword reading fluency, leaving only a benefit for nonword reading accuracy.
But even putting these concerns aside, the most important point to note is that this meta-analysis compared systematic phonics to no extra training at all, or to training on nonreading tasks. Accordingly, it is not appropriate to attribute any benefits to systematic phonics per se; any form of extra instruction may have mediated the (extremely limited) gains. So once again, this analysis should not be used to claim that systematic phonics is better than standard alternative methods, such as whole language, which do include unsystematic phonics.
Galuschka et al. (2014)
Galuschka et al. carried out a meta-analysis of randomized controlled studies that focused on children and adolescents with reading difficulties. The authors identified 22 trials with a total of 49 comparisons of experimental and control groups that tested a wide range of interventions, including five trials evaluating reading fluency trainings, three phonemic awareness instructions, three reading comprehension trainings, 29 phonics instructions, three auditory trainings, two medical treatments, and four interventions with colored overlays or lenses. Outcomes were divided into reading and spelling measures.
The authors noted that only phonics produced a significant effect, with an overall effect size of g′ = 0.32, and concluded
This finding is consistent with those reported in previous meta-analyses... At the current state of knowledge, it is adequate to conclude that the systematic instruction of letter-sound correspondences and decoding strategies, and the application of these skills in reading and writing activities, is the most effective method for improving literacy skills of children and adolescents with reading disabilities
However, there are serious problems with this conclusion. Most notably, the overall effect size observed for phonics (g′ = 0.32) was similar to the outcomes with phonemic awareness instruction (g′ = 0.28), reading fluency training (g′ = 0.30), auditory training (g′ = 0.39), and colored overlays (g′ = 0.32), with only reading comprehension training (g′ = 0.18) and medical treatment (g′ = 0.12) producing numerically reduced effects. The reason significant results were only obtained for phonics is that there were many more phonics interventions. In order to support their conclusion that phonics is more effective, the authors would need to show an interaction between the phonics condition and the alternative methods. They did not report this analysis, and given the similar effect sizes across conditions (with small sample sizes), this analysis would not be significant. Of course, future research might support the authors’ conclusion, but this meta-analysis does not support it.
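The point that significance tracked the number of comparisons rather than the size of the effect can be sketched with a back-of-the-envelope calculation. Assuming (hypothetically) a common per-comparison standard error of 0.45, a mean effect of g′ = 0.32 over 29 comparisons is well past the conventional significance threshold, while a slightly larger g′ = 0.39 over only 3 comparisons is not:

```python
# Why only phonics reached significance in Galuschka et al. (2014):
# with similar effect sizes, the z statistic grows with the number of
# comparisons. The per-comparison standard error (0.45) is invented.
import math

def z_for_pooled(g, per_study_se, k):
    """z statistic for the mean of k equally precise comparisons."""
    return g / (per_study_se / math.sqrt(k))

print(round(z_for_pooled(0.32, 0.45, 29), 2))  # phonics: well above 1.96
print(round(z_for_pooled(0.39, 0.45, 3), 2))   # auditory training: below 1.96
```

Under these assumptions, the smaller effect is "significant" and the larger one is not, purely because of the number of comparisons available, which is why a direct interaction test between methods would be needed to support the authors' claim.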
To further compromise the authors’ conclusion, Galuschka et al. reported evidence from a funnel plot analysis that the published phonics studies were biased. Using a method called Duval and Tweedie’s trim and fill, they measured the extent of publication bias and estimated an unbiased effect size for systematic phonics that was greatly reduced, although still significant (g′ = 0.198). And yet again, the design of the meta-analysis did not assess whether systematic phonics was more effective than unsystematic phonics (let alone show that systematic phonics is more effective than the alternative methods they did investigate). Nevertheless, the meta-analysis is frequently cited as evidence in support of systematic phonics over whole language (e.g., Lim and Oei 2015; Treiman 2018; Van der Kleij et al. 2017).
Suggate (2010, 2016)

Suggate (2010) carried out a meta-analysis to investigate the relative advantages of systematic phonics, phonological awareness, and comprehension-based interventions with children at risk of reading problems. The central question was whether different forms of intervention were more effective with different age groups of children, who varied from preschool to grade 7.
The meta-analysis included peer-reviewed randomized and quasi-experimental studies, with control groups receiving either typical instruction or an alternative “in-house” school reading intervention. Suggate identified 85 studies with 116 interventions: 13 were classified as phonological awareness, 36 as phonics, 37 as comprehension based, and 30 as mixed. Twelve studies were conducted with participants who did not speak English. A range of dependent measures were assessed, from prereading measures (e.g., letter knowledge, phonemic/sound awareness) to reading and comprehension measures.
Averaging over age, similar overall effects were observed for phonological awareness (d = 0.47), phonics (d = 0.50), meaning-based (d = 0.58), and mixed (d = 0.43) interventions. The critical novel finding, however, was a significant interaction between method of instruction and age of child, such that phonics was most useful in kindergarten for reading measures, but alternative interventions were more effective for older children. As Suggate (2010) writes
If reading skills per se are targeted, then there is a clear advantage for phonics interventions early and—taking into account sample sizes and available data—comprehension or mixed interventions later.
However, this is not a safe conclusion. First, the difference in effect size between phonics and alternative methods was approximately d = 0.10 in kindergarten and d = 0.05 in grade 1 (as estimated from Figure 1 in Suggate 2010). This is not a strong basis for arguing the importance of early systematic phonics. It is also important to note that 10% of the studies included in the meta-analysis were carried out with non-English-speaking children. Although the overall difference between non-English (d = 0.61) and English (d = 0.48) studies was reported as nonsignificant, the difference approached significance (p = 0.06). Indeed, the phonics intervention that reported the very largest effect size (d = 1.37) was carried out with Hebrew speakers (Aram & Biron, 2004), and this study contributed to the estimate of the phonics effect size in prekindergarten. Accordingly, the small advantage of phonics (the main novel finding in this report) is inflated when applied to English. And once again, the treatments were compared with a control condition that combined a range of teaching conditions, and accordingly, it is again unclear whether there was a difference between systematic and unsystematic phonics during early instruction.
But the most critical limitation is that Suggate’s (2010) conclusion regarding the benefits of early phonics instruction is contradicted by a subsequent Suggate (2016) meta-analysis. This meta-analysis included 71 experimental and quasi-experimental reading interventions that assessed the short- and long-term impacts of phonemic awareness, phonics, fluency, and comprehension interventions on prereading, reading, reading comprehension, and spelling measures. The analysis revealed an overall short-term effect (d = 0.37) that decreased at follow-up (d = 0.22; mean delay of 11.17 months), with phonics producing the most short-lived benefits. Specifically, the long-term effects were phonics, d = 0.07; fluency, d = 0.28; comprehension, d = 0.46; and phonemic awareness, d = 0.36.
As with the other meta-analyses, there are additional issues that should be raised. For example, a funnel plot analysis revealed evidence of publication bias, especially in the long-term condition, and once again, the study did not compare systematic to unsystematic phonics. It is striking that the long-term benefits of systematic phonics are so small despite these factors that should be expected to inflate effect sizes.