1 Introduction

There has been an increasing emphasis on evidence-based literacy practices in education (Polanin et al., 2023). Curriculum development, pedagogy, and policy are expected to draw on high-quality research that has demonstrated positive reading and writing outcomes for students (Newman & Gough, 2020). This paper provides a narrative synthesis systematic review/NSSR (Popay et al., 2006) of the effectiveness of genre theory (GT) (Martin, 1985) and systemic functional linguistics (SFL) (Halliday, 1985) practices for improving reading and writing outcomes in K-10 education. Internationally, the GT/SFL framework has had significant influence on reading and writing curriculum design, teacher training, and pedagogy for over 30 years, yet the evidence base has never been synthesized. The focus of this review is on student outcomes in English-dominant countries: Australia, the UK, the USA, New Zealand, and Canada, given the framework’s notable influence on policy, curriculum, initial teacher education (ITE), and practice in these contexts.

2 Genre theory and systemic functional linguistics in curriculum and pedagogy

A short synopsis cannot capture the richness of SFL or GT, so the following presents only the essential information for readers. SFL is a school of linguistics developed by Michael Halliday (Halliday, 1985) which places emphasis on how language is used to accomplish various functions, and it has long been closely connected with education in Australia and the UK (Christie, 2005). These functions can relate to text (e.g., a language pattern typical of a certain text type), discourse management (e.g., how words function as cohesive links), and/or society (e.g., how particular language patterns privilege certain social structures). At the core of SFL is a description of grammar with a rich metalanguage for describing language functions. For example, SFL describes three ‘metafunctions’ of grammar: experiential (e.g., the function of language to represent a physical, sensory, or mental experience), interpersonal (e.g., the function of language to negotiate relationships), and textual (e.g., how language functions to create texts). The framework also includes ‘register’ variables, which capture how language use is affected by ‘field’ (e.g., subject matter), ‘tenor’ (e.g., level of formality), and ‘mode’ (e.g., written, spoken). Functions constitute a ‘system of choices’ (Myhill et al., 2021), which speakers/writers and hearers/readers navigate in their linguistic experience. Genre theory situates itself within the broader tradition of SFL. It emerged in the 1980s with a focus on how particular text-types (called ‘genres’) can be analysed as configurations of function-driven choices with a social purpose (Martin, 1985; Christie, 2004). For example, the purpose of a student’s science report typically includes providing information on an experiment, such as aims, procedures, outcomes, and observations. These social-communicative functions account for linguistic patterns found in the information report genre (e.g., headings, subheadings, content), and student accomplishment is calibrated to successful receptive and expressive mastery of various genres (Derewianka, 2012).

More so than other linguistic theories, GT/SFL has long focussed on curriculum and pedagogy (Christie, 1994), having had a particularly significant impact on the Australian and UK educational systems, as well as influence in Canada, New Zealand, and the USA (Schleppegrell, 2016). Derewianka (2012) states that GT/SFL constitutes the basis for the ‘language strand’ of the Australian Curriculum, and its approach to grammar is described by Myhill (2018) as primarily ‘Hallidayan’ (p. 11). In the UK, GT/SFL informed the National Literacy Strategy (Walsh, 2006) and the Language in the National Curriculum Project (Myhill, 2021). Beyond curricula, GT/SFL is widely taught in initial teacher education and influences practice through textbooks using the framework (Rose, 2020). Literacy activities in classrooms based on GT/SFL are typically identifiable through the metalanguage of the framework. For example, activities mentioned in the research record within the Australian education context include having students highlight the ‘transitivity functions’ in sentences and explore ‘mood’, ‘modality’, or ‘theme’ (Williams, 2004; Derewianka & Jones, 2012). At the macro-level, K-10 students are taught the explicit features of a genre, the functions of certain language patterns in that genre, and the functions of grammar within that genre. This instruction is designed to achieve positive reading and writing outcomes by giving students an explicit metalanguage for analysis and raising their awareness of why authors might use certain language patterns and text structures. Students who learn this ‘knowledge about language’ are, it is argued, then able to apply it to their own experience and improve their reading and writing (Derewianka, 2011).

Two of the more developed teaching–learning approaches are ‘Reading to Learn’ and the ‘Exeter Pedagogy’. Reading to Learn/R2L (Rose, 2020) is a teacher professional development program and pedagogy. Teachers are taught about GT/SFL and the associated metalanguage. They learn, for example, how to identify and label the functions of word groups and clauses, and how these function in texts. In the classroom, teachers initially focus on reading comprehension and, via direct instruction, guide students through example genre texts, deconstructing these using the framework. Next, teachers turn to writing, and students are tasked with applying what they have learned to their own texts. The Exeter Pedagogy/EP is primarily contextualized grammar instruction for improving writing outcomes. Myhill (2021) describes EP as derived from Halliday’s “notion of grammar as choice. The goal of teaching is to support students’ understanding of this crucial relationship between grammatical choice and meaning making” (p. 268). Classroom practices follow the LEAD principles: Link grammar to its function in the genre; teach grammar through Examples; use Authentic texts; and Discuss grammar using metalanguage. This grammar approach is argued to improve student writing because it maintains a close link between grammar and the functional needs of emerging writers, whereas traditional grammar teaching lacked this link and was thus ineffective.

3 Narrative synthesis review for evaluating the evidence for literacy practices

The following systematic review adheres to the ESRC Guidelines for Narrative Synthesis Systematic Reviews (NSSR) (Popay et al., 2006). NSSRs were designed as the strongest alternative to a meta-analysis when studies exhibit heterogeneity (e.g., in methods, interventions, outcome measures, reporting, instruments) and pooling statistical information is precluded. They rely on narrative and statistical data to summarize studies, explain results, and synthesize evidence ‘in a way that tells a convincing story of why something needs to be done, or needs to be stopped, or why we have no idea whether a long-established policy or practice makes a positive difference’ (Popay et al., 2006, p. 5).

Our current review is modelled on Erbeli and Rice’s (2022) recent literacy-focussed application of the NSSR methodology to the evidence for sustained silent reading (SSR). Like GT/SFL, SSR has many classroom variations and is widely accepted as a practice that improves reading and writing outcomes. Erbeli and Rice (2022) reviewed research published in the past 20 years that met the following criteria: experimental/quasi-experimental with a control/comparison group; measured any reading outcome within K-12 (e.g., fluency, comprehension, vocabulary); and published in a peer-reviewed journal. A total of 16,422 published articles were sourced, but after screening only 14 met the standards, which is not a large potential evidence base. Erbeli and Rice (2022) then reviewed these studies using two codebooks: one that extracted details about the research designs (e.g., sample size, effect sizes) and another that assessed quality (e.g., controls, measurement rigour, reporting quality). They report that 71% of studies were conducted with elementary school students and that 70% of these reported no positive reading outcomes. While the remaining middle/high school studies reported positive outcomes, their quality was poor, making the evidence unreliable. Erbeli and Rice (2022) concluded ‘there is still insufficient and incomplete rigorous evidence on the benefit of independent reading on reading outcomes… we cannot indicate that research has proven or unproven that such procedures actually work’ (p. 15). The study provides a caution to literacy stakeholders that widely valued and common practices cannot be assumed to have a strong evidence base, and this motivates the current study to explore the evidence base for GT/SFL approaches to reading and writing.

4 The current study

The following review evaluates the extent to which the research record contains evidence of positive outcomes on reading or writing for practices based on GT/SFL. Any measured outcome on any aspect of reading or writing is included. The intention of this study was to cast a wide net to be as fair as possible to potential benefits in the classroom. Thus, all potential studies that included keywords from an extensive pool of GT/SFL-related metalanguage were sought and reviewed. A study did not need to be a direct test of the effectiveness of GT/SFL for improving a reading or writing outcome: while all such studies that were found were included, so too were any intervention studies that made explicit links to GT/SFL. Studies without explicit links (e.g., lacking citation of prominent authors, or use of the metalanguage) were excluded, such as studies of genre from different theoretical orientations. Furthermore, in the spirit of casting a wide net, all studies were sought that had any measured outcome on either reading or writing. The scope was limited to grades 1–10 in English-dominant education systems (Australia, UK, USA, New Zealand, Canada), as a goal of this study is to make recommendations in this educational policy space. GT/SFL is also important in other educational contexts, such as university-level and L2 English-medium instruction, but these contexts are different enough from K-10 mainstream schooling that they should be evaluated in separate research. Also note that we have not focussed on multimodal text production or comprehension. The procedures in this study, coupled with close reading of the research, indicate that no important studies have been missed that would alter the conclusions related to the evidence of impact reported here. The research questions were:

  • RQ1: What is the published evidence indicating classroom practices that incorporate GT/SFL improve reading outcomes?

  • RQ2: What is the published evidence indicating classroom practices that incorporate GT/SFL improve writing outcomes?

5 Methods

A Boolean search of 11 major library databases was conducted, with search terms and protocols reviewed and optimised by research librarians. Keywords were an extensive list of GT/SFL metalanguage. For transparency, the search terms are provided in (1), and all bibliographic records are available on the Open Science Framework (https://rb.gy/jxsjoi).

  (1) Search terms and databases

[Figure: full list of search terms and databases searched]

Keywords were searched within entire texts published within the past 30 years, including peer-reviewed books, journals, and chapters. Online registers were also searched, including the What Works Clearinghouse/WWC (US) (already included in ERIC), the Early Intervention Foundation (UK), and the Centre for Education Statistics and Evaluation/CESE (AUS). No equivalent registers for New Zealand or Canadian research were found. Eleven references listed on the ‘Reading to Learn’ webpage were also added. This yielded an initial pool of 9787 records for screening. Records were uploaded into the systematic review tool Rayyan and duplicates were removed, leaving 7846 publications. These were screened blindly by two research team members against the inclusion criteria in (2).

  (2) Inclusion criteria

    • Experimental or quasi-experimental

    • Peer-reviewed journal articles, books, chapters, research reports

    • Measured any writing or reading outcome

    • Grades K-10 (Australia, Canada, UK, New Zealand, US)

    • Contained a control/comparison group

    • Intervention focussed on English-language outcomes

    • Exclude studies focused principally on English as a second/additional language

    • Exclude studies not containing sufficient information to judge eligibility

    • Exclude studies more than 30 years old (before 1991)

Calibration was undertaken at a team meeting before commencing review to ensure screening consistency. Screening involved reading abstracts; when sufficient detail could not be found in an abstract to make a judgment, the full text was read. After blind screening, all studies with agreement were included. Those with a split decision were discussed at team meetings and a group decision was made. Ancestry searches were then conducted; that is, studies cited in reference lists were sourced, added back into the pool, and blind evaluation was again undertaken. A PRISMA flow diagram is provided in Fig. 1 (Page et al., 2021).

Fig. 1 PRISMA workflow chart

Five final studies were identified after initial screening, and four more from their references. One funding report, McRae et al. (2000), was excluded under the criterion of insufficient information to judge eligibility. Nevertheless, its codebooks and a narrative are provided in the supplementary materials because it has been cited as evidence of GT/SFL effects ‘up to three times as effective as other literacy approaches (McRae et al., 2000)’ (Rose et al., 2003, p. 42). Another funding report, Report on Reading to Learn: Middle Years Literacy Project (2009), was requested from the funding body but not received. One book chapter, Myhill and Watson (2017), was found after blind screening through reading further papers by researchers represented in the final pool; it met the criteria for inclusion, so it was added to the evidence synthesis. These methods produced eight publications that met the criteria for review. Each was reviewed through close reading and the production of a study summary table, adapted from Erbeli and Rice (2022), which extracted each study’s research design, target grade, outcome measures, participant numbers, intervention frequency/length, and effect sizes. Additionally, study quality was assessed through the nine indicators in (3).

  (3) Indicators of study quality

    1. Were participants in the sample comparable at baseline to comparisons?

    2. Was the intervention clearly described?

    3. Was intervention fidelity described and evaluated?

    4. Was attrition reported, and was it equivalent across conditions?

    5. Was the comparison instruction described?

    6. Was documentation of the comparison instruction provided?

    7. Were multiple outcome measures used, balancing measures aligned with the intervention against measures of generalised performance?

    8. Was the reliability of measures provided?

    9. Was the validity of measures provided?

These indicators, following Erbeli and Rice (2022), were broadly assessed as high, low, N/A (not applicable), or ‘?’ (unsure, e.g., not reported). The study summaries, quality evaluations, and synthesis of the final pool of studies were not blind but were developed through group discussion in reviewing the research record. From this information, a narrative synthesis was developed for each study to evaluate the research evidence for positive outcomes on reading or writing for GT/SFL practices.

6 Results and discussion

Because of the wide variation in study design and reporting conventions, the studies could not be merged into a single readable table. Therefore, each study is first presented narratively, followed by an overall evidence synthesis. The studies included in the synthesis are reported in Table 1.

Table 1 Studies in the evidence synthesis

Table 1 reflects that in over 30 years of research, very few studies have been completed that are eligible for consideration as educational evidence. Thus, even before further detailed analysis, the possible evidence for GT/SFL improving reading and writing outcomes in K-10 is not substantial. Studies 1 and 2 are book chapters; 3, 4, and 5 are journal articles; and 6, 7, and 8 are funding-body reports. Journal articles are considered to have higher quality peer review than chapters or funding reports. Myhill et al. (2013) and Jones et al. (2013) report aspects of the same intervention. Similarly, Rose and Martin (2013) reports the same intervention as Rose (2011). This reduces the research record to six interventions. Of these, three took place in Australia and three in the UK. All UK studies were by the same research group and focussed on the Exeter Pedagogy; all Australian studies focussed on Reading to Learn. Evidence is stronger when independent research teams replicate results. The studies are now reviewed in turn.

6.1 Rose and Martin (2013) and Rose (2011)

These studies applied GT/SFL ‘theories in a large-scale educational intervention in Australian primary and secondary schools… to measure the effectiveness of genre-based literacy pedagogy’ (Rose & Martin, 2013, p. 1). The outcome measure was writing improvement. Table 2 provides a synopsis.

Table 2 Summary of Rose and Martin (2013) and Rose (2011)

The intervention involved approximately 400 teachers trained in Reading to Learn/R2L, who implemented it in their classrooms over one school year. All grades K-8 were included. Only approximate numbers are reported, so the precise number of participants is unclear. Teachers were trained in the metalanguage, the basics of the theory and grammar, and how to guide students to write genres by deconstructing model texts, then jointly producing texts, and finally independently producing texts. Direct teaching of the features of genres, from grammatical choices to text-level organisation, occurred. It is not reported how often the practices were implemented over the year, nor whether implementation frequency varied amongst teachers. Fidelity was not part of the research design, so it is unknown how close classroom practice was to what teachers were asked to implement. These issues limit accurate measurement of intervention effects.

The measured outcome was writing achievement, assessed with a rubric aligned with the intervention’s theoretical framework. Assessment tasks aligned with interventions can inflate outcomes compared to independent measures (Erbeli & Rice, 2022). However, the rubric is discussed as being the basis for the writing rubric of Australia’s national standardized test (the National Assessment Program Literacy and Numeracy/NAPLAN). This speaks to a relationship to a widely used, external, and normed measurement instrument (and to the influence of GT/SFL), though no statistical data on concurrent validity is reported. Writing samples were scored 0–3 on each of fourteen criteria: purpose, staging, phases, field, tenor, mode, lexis, appraisal, conjunction, reference, grammar, spelling, punctuation, and presentation, giving the 42-point maximum referenced below. The entire population of students in the intervention was not included in outcome measures. Rather, 100 teachers were randomly selected and then asked to identify papers from low, middle, and high achieving students. It is not reported (but seems unlikely) that teachers randomly selected student work, which means teachers may have, even inadvertently, chosen work that showed an impact of their teaching.

Effects of the intervention are reported as ‘growth rates’ and proportional increases. It is reported that the intervention improved writing by 2.34 times expected growth. This seems to have been computed as follows. Growth rates were benchmarked by the researchers taking the intervention rubric (42 points possible) and marking student-writing exemplars provided to teachers to help them calibrate their assessment. These exemplars were taken from the Australian National Literacy Profiles and the NSW Board of Studies. The researchers state that a score of ‘around’ 35 on their rubric was equivalent to a B range (scale A-E) on the national literacy grading curve at the time, based on marking the exemplars. They argue that standard average growth per year was therefore equivalent to a seven-point improvement on the intervention rubric (7 points = 1 standard ‘growth rate’). On this basis, they conclude R2L advanced student writing by 2.34 times expected growth. Limited details are provided on the reliability and validity of this approach, and how many benchmark papers were marked, and by how many people, is not reported. Exemplars, even if carefully chosen for teacher-assessment calibration by education departments, may not represent the actual growth in writing of a larger pool of students. Technically, one might argue that this study does not have a real control group and should have been excluded from the systematic review; however, we felt that, as a large-scale intervention, it was proper to include it. Percentage gain scores are also reported, but it is unclear whether students gained beyond maturation, given there is no pre/post data from a comparison group.
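On this logic, the reported multiplier appears to reduce to the following arithmetic (our reconstruction; the report does not set out the calculation explicitly):

$$\text{growth multiplier} = \frac{\text{mean gain on the 42-point rubric}}{7\ \text{points (one expected year of growth)}}$$

so the reported multiplier of 2.34 implies a mean gain of roughly $2.34 \times 7 \approx 16.4$ rubric points over the school year.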

Rose (2011) concludes that ‘these growth rates are unparalleled by national and international standards. They are consistent, though, with independent evaluations of R2L programs, using a range of measures, that show average growth at 2 to 4 times expected rates (Culican, 2006; McCrae et al., 2000)’ (p. 6). The flaws in the measurement of outcomes cast doubt on these conclusions.

6.2 Culican (2006)

Culican (2006) reports an intervention in Australia to evaluate R2L on reading and writing outcomes within the Catholic education sector. The study, summarized in Table 3, reports on an intervention conducted in 2003, repeated with adaptations in 2004.

Table 3 Summary of Culican (2006)

The project involved 58 teachers, 24 schools, and 410 students across grades 5–8. Teachers each identified a minimum of six students who were below the standards for their grade. Neither random sampling nor assignment occurred at the school, teacher, or student-participant levels. Teachers selected student groups based on holistic evaluations of ability, informed by observation of general literacy difficulties, EAL background, presence of language disorders, or behavioural issues. The composition of participants is unknown, opening the possibility that some groups may have had more or fewer EAL students, others more language disorders, and others more behavioural issues. This is problematic because, for example, a student with a language disorder is likely to respond differently to a literacy intervention than a learner acquiring an additional language.

Two interventions were run: the first occurred over 6 months and the second over one school year. Both began with teacher professional development in R2L. Teachers chose to implement the pedagogy either in a ‘withdrawal group’, in which students came out of classes typically 2–4 times per week; a ‘separate group within the whole class’, in which students remained in their class but undertook R2L activities; as a ‘whole class’, in which all students including target participants undertook the activities; or a ‘combination’ of these. Most (41.6%) used the combination delivery. Frequency of intervention is unclear overall but is reported as up to four times per week for 30–45 min for ‘some’ teachers, so students may have experienced quite different interventions. Teachers were asked in the 2004 intervention to complete a Record of Scaffolding Sessions fortnightly to, in part, monitor fidelity; however, no data is reported on how well aligned teaching practices were.

Only the 2004 iteration reports control groups, which were created at the school level with teachers asked to select a representative comparison sample. It is reported that this was typically an entire class at the same year level. Composition is unclear, but the comparison does not appear to have been made to non-intervention lower literacy learners. The idea, presumably, was to see if the intervention helped close the gap between the lower proficiency students and the class average. It is reported that in some instances, class norms included the data from target students, meaning that the comparison data incorporated intervention data.

The reading outcome is reported in two forms, although only one measure was taken: the standardized test DART (Developmental Assessment Resource for Teachers). The researchers transformed this score into ‘approximate’ grade-level curriculum standards known as the CSF, i.e., where the student might be placed on the state curriculum progressions. Each curriculum level was divided into three sub-levels, beginning (B), consolidating (C), and established (E), which were assigned numeric values three points apart. For example, the researchers argued that each DART raw score range matched one of the three sub-levels in the CSF for each grade: a DART raw score of 3–4 represented reading proficiency at grade-level 3B, while 5–7 placed a student at 3C, three points higher in the curriculum progression. No data is reported on the validity or reliability of this transformation, and one concern, as shown in Table 3, is that while the DART scores for the combination delivery model are reported as not significant, the transformation of the same data into CSF form is reported as significant and indicative of the intervention’s positive effect. These issues call into question a conclusion such as that ‘20% of students made gains of two or more CSF levels, or four times the expected rate of literacy development’ (Culican, 2006, p. 57).
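To make the concern concrete, the transformation can be sketched as below. This is a minimal reconstruction, not the authors’ procedure: only the two grade-3 cut-points are stated in Culican (2006), and the numeric anchors and the third cut-point are hypothetical.

```python
# Minimal sketch of the reported DART-to-CSF transformation (our
# reconstruction, not code from Culican, 2006). Only the first two
# cut-points below are stated in the report; the numeric anchors and
# the third cut-point are hypothetical continuations of the pattern.
def dart_to_csf(raw_score: int) -> tuple[str, int]:
    """Map a DART raw score to a CSF sub-level and a numeric score.

    Sub-levels within a curriculum level (B, C, E) are spaced three
    points apart, so moving up one sub-level counts as +3.
    """
    cut_points = [
        (3, "3B", 0),   # stated: raw 3-4 -> 3B (numeric anchor illustrative)
        (5, "3C", 3),   # stated: raw 5-7 -> 3C, three points above 3B
        (8, "3E", 6),   # hypothetical continuation
    ]
    label, points = None, None
    for lowest_raw, sublevel, numeric in cut_points:
        if raw_score >= lowest_raw:
            label, points = sublevel, numeric
    if label is None:
        raise ValueError("raw score below the reported range")
    return label, points

# A one-point raw gain across a cut-point (4 -> 5) registers as a
# three-point CSF gain, which is how the transformed scores can reach
# significance when the underlying DART scores do not.
print(dart_to_csf(4), dart_to_csf(5))  # ('3B', 0) ('3C', 3)
```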

Outcome measures report consistent gains of four to eight points on DART over the 6-month intervention, at all year levels. While no controls were included in 2003, the 2004 iteration reports an overall statistically significant outcome for reading. However, based on the descriptive statistics provided, the effect size is d = 0.2, which is negligible (Dione-Rodgers et al., 2012). Three of the four delivery conditions showed no significant improvement; only the whole class condition did. This significance test is problematic, however, because it combined 29 students across all grades (5, 6, 7, 8) and their data was also included in the comparison data (i.e., the class to which they belonged).
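This effect size is our computation from the published descriptive statistics, assuming the conventional pooled formulation of Cohen’s d (the report itself does not state a formula):

$$d = \frac{\bar{X}_{\text{intervention}} - \bar{X}_{\text{comparison}}}{s_{\text{pooled}}}, \qquad s_{\text{pooled}} = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}$$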

In addition to DART, teacher-administered running records were used. The texts for these running records were assigned grade-level suitability based on researchers’ judgments. Only 69 were completed in 2003 and 79 in 2004, but the data is reported as indicating reading gains. In 2003, 27 students advanced by 1 year (i.e., moving from reading texts benchmarked by researchers as being at the grade 5–6 level to texts classified as grade 6–7 level), and 25 progressed by more than 1 year. Indeed, within the two 10-week terms, ten students advanced their reading level by 3 years, from reading at grade 3–4 level to grade 7–8 level. In 2004, similar gains are reported: 29 students improving by one year level, 27 by 2–3 year levels, and one advancing by more than 5 years. These are remarkable gains, but measurement may be an issue: the texts selected by the researchers may not have been reliable indicators of reading grade, which would explain findings such as a low proficiency reader advancing 5 years within a short time. The intervention also targeted writing. Writing samples (pre and post) were collected, but no quantitative data is reported. A positive effect of the intervention is concluded based on only three written samples taken at different periods of the intervention, which is not strong evidence. A positive effect is also reported based on survey results and qualitative comments from 36 teachers, 28 of whom responded that the intervention had ‘considerable’ or ‘high’ impact. A concern is that these 28 teachers represent only 48% of the 58 teacher-participants in the study.

6.3 Dione-Rodgers, Harriman, Laing and Snitch (2012)

Dione-Rodgers et al. (2012) reports an evaluation of R2L over approximately 18 months in 18 schools (15 in the public sector). Schools self-selected to participate. The study measured reading outcomes in two cohorts (approximately 1290 students) from grades 3 and 5. The research is summarised in Table 4.

Table 4 Summary of Dione-Rodgers et al. (2012)

This study measured reading development of two cohorts through NAPLAN scores, a standardised test of literacy and numeracy in Australia, taken by all children at grades 3 and 5. The intervention did not use randomisation in either school sampling or student assignment to conditions and was implemented at the whole class level. The design did not have a true control group. The researchers argue that state NAPLAN norms functioned as comparison data to determine intervention effects. One concern (noted by the researchers) is that given GT/SFL was widespread across schools, students within the comparison condition may have experienced aspects of the intervention.

All cohorts in the intervention performed significantly worse in their NAPLAN scores, with negative effects ranging from −0.41 to −0.51, considered ‘medium’ effect sizes. Fidelity was monitored through a survey, but only 60% of teachers reported using R2L frequently. The 40% who reported they never or only sometimes used the resources, however, did not have their data excluded. The negative impact may have resulted from a lack of buy-in amongst teachers. The researchers also suggest that the NAPLAN examinations did not occur at the end of the intervention but during it, so testing for positive responses to the intervention was premature. While the NAPLAN measure may not have been well aligned in terms of scheduling, it is a high-quality instrument because it is a nation-wide reading comprehension measure, with prior validity and reliability checks, external to the intervention. Despite the statistical outcomes, Dione-Rodgers et al. (2012) conclude that students improved based on qualitative feedback. In a survey, 56% of teachers believed reading outcomes had improved for all/most students, and 39% that some had improved. Teacher perceptions should not be ignored, but this is not the strongest form of reading achievement evidence, and it contrasts with the objective measures. Overall, the study does not provide objective evidence for, or against, this GT/SFL pedagogy improving reading outcomes.

6.4 Myhill, Jones, Watson, and Lines (2013) and Jones, Myhill, and Bailey (2013)

Myhill et al. (2013) and Jones et al. (2013) report a randomised controlled trial conducted in the UK on the effect of functional grammar instruction on writing outcomes (i.e., the ‘Exeter Pedagogy’). It involved 32 schools and 744 students in grade 8. The research is summarised in Table 5.

Table 5 Summary of Myhill et al. (2013) and Jones et al. (2013)

This intervention consisted of three units containing explicit functional grammar teaching, with writing as the focus of the lessons. The genres targeted were narrative, poetry, and argumentative. Each unit consisted of three 1-hour lessons per week over 3 weeks, with the units spread across one school year. The 32 schools were selected from a randomised list of schools, with half randomly assigned to the intervention and comparison conditions. One intervention group was removed due to low fidelity. Each teacher taught one class of Year 8 students. Teachers took a grammar-knowledge test, and comparison and intervention groups were matched on this variable. Teachers were told the focus of the study was writing, not the effect of grammar on writing, masking the condition. Teaching materials included explicit teaching of grammatical metalanguage through authentic examples, a focus on how grammatical functions enhance writing, dialogic talk, and the use of model texts. Both intervention and control group lessons had the same curriculum outcomes. Evaluation of the pedagogy was based on gain scores (%) and multiple regression modelling of scores with covariates. Writing was assessed with mark schemes aligned with the national curriculum, marked out of 30 points by external examiners blind to condition.

As shown in Table 5, the intervention group gained 5.11% more than the comparison group. Sub-analysis of the three writing components indicated gains for intervention students across sentence structure and punctuation, text structure and organisation, and composition and effect. However, interactions showed that only students who entered the intervention with above average writing ability benefitted from it. While this study meets all recommended indicators for study quality, other than the use of multiple outcome measures, the evidence it reports is limited: functional grammar may have a small, positive impact on higher performing students’ writing in year 8. Moreover, Myhill et al. (2013) note that the quality of the intervention’s lesson plans compared to the ‘business-as-usual’ lessons could account for this effect.
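The interaction finding can be read off a model of the following generic form (a sketch for exposition only; the published regression includes covariates not reproduced here):

$$\text{post}_i = \beta_0 + \beta_1\,\text{condition}_i + \beta_2\,\text{pre}_i + \beta_3\,(\text{condition}_i \times \text{pre}_i) + \varepsilon_i$$

The treatment effect for student $i$ is $\beta_1 + \beta_3\,\text{pre}_i$, so a significant positive interaction coefficient $\beta_3$ means the benefit grows with pre-test writing ability, which is what the finding that only above-average writers benefitted reflects. Dropping the interaction term yields the ANCOVA specification (post-test regressed on condition with pre-test as covariate) of the kind used in the studies reviewed below.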

6.5 Myhill, Jones, and Lines (2018)

This study followed Myhill et al. (2013) to determine effects for less proficient writers. The study included seven schools and 315 students (158 in the intervention) identified as performing below age-11 minimum standards. The focus was on contextualised grammar instruction over 4 weeks for improving narrative genre writing. The study is summarized in Table 6.

Table 6 Summary of Myhill et al. (2018)

Random assignment and sampling were not used this time. Baseline measures of writing at pre-intervention indicated comparability, with both conditions having a mean score of level 3.7 (below minimum standards). Teachers in the intervention were provided with a half-day of training and resources. The earlier study had used masking to minimise problems such as intervention teachers believing in the value of teaching functional grammar, whose effect could then be confounded with enhanced teaching. The researchers examined a corpus of 50 low-proficiency student writing samples and concluded that weaknesses included punctuation, description through noun phrases, limited sentence variety, character and setting, and a tendency to use patterns not typical of written narratives. Lesson plans therefore drew attention to these features. Fidelity to the intervention is not reported.

Pre- and post-test measures were a writing sample. Half the students took prompt A at pre-test and half prompt B, and the prompts were crossed at post-test, protecting against task effects. The marking schemes from the earlier study were used, and samples were double marked, blind, by examiners independent of the research team. Attrition was high at 23%: 42 students were lost from the intervention and 30 from the comparison. The intervention group gained 0.3 points more than the comparison group. An ANCOVA using the pretest scores as a covariate was statistically significant, but with a small/negligible effect size (d = 0.17). Sub-analyses indicated that sentence structure and punctuation improved significantly with a small to moderate effect (p = 0.02, d = 0.33), but no effect was found for text structure and organisation, nor for composition and effect. Though noted in the earlier study, the possible confound of higher quality lesson plans remained unaddressed. Given these limitations and the small to negligible overall effect size, it is difficult to conclude from this study that GT/SFL instruction improved writing outcomes.

6.6 Myhill and Watson (2017)

This intervention focused on reading comprehension and writing amongst 15–16-year-olds (grade 10/GCSE). It lasted 3 weeks (nine lessons), with 161 intervention students and 147 controls. A summary is provided in Table 7.

Table 7 Summary of Myhill and Watson (2017)

The intervention class received lesson plans aligned with the GCSE outcomes, and the effect was measured via an adapted GCSE examination task and rubric. For reading, assessment consisted of two questions on inferential comprehension and information retrieval for a target text, and three questions on language analysis and the use of quotations. For writing, assessment included composition, sentence structure, and spelling. Business as usual was the comparison condition. Attrition was similar between groups, but fidelity monitoring and the reliability/validity of the measurement instrument are not clear. The intervention group had higher pre-test scores, as shown in Table 7. At outcome, the intervention group gained 0.5 points for reading, while the comparison group declined by 0.4. For writing, intervention students gained 0.5 points compared with a decline of 0.6 for the comparison. Standard deviations and effect sizes are unreported. The researchers note variability amongst classes, with some in the intervention showing negative outcomes, which they suggest may be a teacher effect or a lack of student buy-in to the test at the end of semester. An ANCOVA indicated statistically significant effects overall (reading: p = 0.001; writing: p = 0.02). Sub-analysis indicated no effects on the two reading comprehension measures, but significant differences for the language analysis questions. Overall, this study reports that nine lessons incorporating GT/SFL teaching showed no improvements to reading comprehension for grade 10/GCSE students but may enhance responses to questions requiring language analysis. The researchers do not rule out that the result may be due to less committed test-taking from students who knew they were not receiving an intervention.

7 Evidence synthesis

This systematic review makes several contributions. The available evidence is inconsistent with the long-standing influence of GT/SFL in K-10 curriculum, policy, and practice, most notably in Australia and the UK. More high-quality research is needed before it can be concluded that the framework is valuable for improving measurable reading and writing outcomes in this educational context. Very little GT/SFL practice has been tested in the classroom, and despite many publications relating to practice, few meet the criteria required for a robust evidence base. It is legitimate to question whether interventions with controls should be valued as the strongest form of educational evidence (Myhill et al., 2013), but we must nevertheless acknowledge that the research record does not currently include many interventions with controls. The influence of GT/SFL on curriculum and teacher training for many decades has therefore not been driven by high-quality, rigorously researched evidence of positive outcomes for students.

Though the overall pool is small, this review shows that an intervention has been conducted for every grade K-10 except grade 9. At least 5000 students have been involved (the unknown number of students in Rose and Martin (2013) makes estimation difficult). It is not unreasonable to conclude that for no grade does the research record provide significant evidence that GT/SFL practices have improved reading and writing outcomes. The exception would be the positive effect on higher ability students within grade 8 reported in Myhill et al. (2013). Yet even then, the researchers acknowledge this effect may have been due to the high-quality lesson materials the research team provided to teachers, rather than the GT/SFL component within those materials. This issue with the researcher-developed materials in the Exeter Pedagogy studies is instructive. In educational research, it is reasonable for business as usual to form the comparison because this represents what is common in classrooms. However, the challenge faced by these studies was that rich materials (e.g., lesson plans and resources developed by expert researchers that may be more engaging, carefully structured, and scaffolded than typical lessons) make it difficult to demonstrate that it was the grammar component of the lessons, rather than just better lessons overall, that made the difference to outcomes. This equivalency problem needs to be considered in future research.

It is important to close with the clear message that this synthesis does not show that GT/SFL-related pedagogies are ineffective. Rather, it shows that the current research record is not sufficient to prove or disprove the value of this approach. More high-quality research is needed: it could provide the evidence base, confirm no effect, or show a negative effect.

8 Limitations

This study is not a review of the evidence from all educational contexts. GT/SFL in second language acquisition has been widely researched, both within the target countries of this paper and elsewhere. Higher education is also an important area. There may be clear evidence of positive impacts in these contexts. The current review has not included ‘grey’ literature, e.g., unpublished studies, master’s and PhD theses, and non-peer-reviewed papers, because the focus has been on the evidence in the peer-reviewed research record. The study quality indicators rely on the information that is published to make an evaluation; they do not make a distinction between the study and what is published, so studies can be penalized for reporting rather than design issues. Quality indicators were not rated blindly with interrater reliability checks. Other researchers may have different views on the quality of the studies; however, this would be surprising, and we would also argue that slightly different quality evaluations would not impact the evidence base presented here. Finally, this study has not explored multimodal texts, an active and important research area that draws on GT/SFL.

9 Conclusion

Large-scale interventions and randomised controlled trials require significant commitments from many in the education community, including schools, funding agencies, universities, and researchers. It is reasonable to make trade-offs in research design, given resource limitations, and to try to provide the educational community with as much information as possible. Those involved in the projects explored in this evidence synthesis should be commended for their contributions, which are valuable attempts to evaluate the potential benefits of GT/SFL in K-10 classrooms. However, the research record currently lacks substantial evidence of positive impacts. This systematic review found few published studies on the effectiveness of GT/SFL for improving students’ reading and writing, and the research that has been done does not provide clear evidence of benefits. This information is potentially useful for those involved in discussions around policy, curriculum, and teacher training in several countries.