Text generation—the process of encoding ideas into language for writing at word, sentence, and discourse levels—involves oral language, as writers use their oral vocabulary and grammatical knowledge to generate words and sentences for their texts (Abbott & Berninger, 1993; Babayigit & Stainthorp, 2010). Yet, oral language skills are rarely a target of writing interventions (Goldfeld et al., 2017; Spencer & Petersen, 2018). The lack of intervention studies targeting text generation skills through oral language training appears surprising when one considers that difficulties in learning to write and read are often caused by deficits in underlying oral language skills (Dockrell & Connelly, 2013; Dockrell, Lindsay, Connelly, & Mackie, 2007; Hulme & Snowling, 2014). Even when writing difficulties are not associated with clear language impairments, oral and written expression are strongly related (Berninger & Abbott, 2010; Mehta, Foorman, Branum-Martin, & Taylor, 2005).

In line with other recent initiatives (Goldfeld et al., 2017), the present study was aimed at filling the extant evidence gap in the effectiveness of classroom-level oral language interventions on students’ writing. We did this by testing the effects of a focused oral-language (i.e., sentence generation) intervention on Italian fifth and 10th graders’ text generation and written composition skills.

The role of oral language and text generation skills in writing development

Oral language abilities are not always an explicit component of developmental writing models (see, for example, Berninger, 2000; Berninger et al., 2002). Yet, these models assume that oral language underpins text generation, the core component of the developing writing process (Berninger et al., 2002; Kim, 2016; Kim & Schatschneider, 2017). Indeed, text generation is often operationalized in writing research as oral vocabulary, grammatical, and discourse skills (Abbott & Berninger, 1993; Kim & Schatschneider, 2017).

The not-so-simple view of writing describes writing development as the product of the development of three interacting processes: text generation, which involves oral language skills, transcription (spelling and handwriting), and executive function (including attentional control, planning, review, and self-regulation abilities; Berninger, 2000; Berninger et al., 2002). The limited transcription skills (spelling and handwriting) of beginning writers (Abbott, Berninger, & Fayol, 2010; Berninger, 2000; Berninger et al., 1992; Graham, Berninger, Abbott, Abbott, & Whitaker, 1997; Pinto, Tarchi, & Bigozzi, 2015) hinder the translation of their oral language knowledge (vocabulary, grammatical and oral discourse skills) into writing and thus the potential impact of their oral language on writing. However, as writers get older and transcription skills automatize, the proportion of variance accounted for by transcription processes in writing drops (Berninger, 1999) and individual differences in text generation (and oral language skills) become more discriminant and start to influence writing more directly (Berninger, Nagy, & Beers, 2011; McCutchen, Covill, Hoyne, & Mildes, 1994). This process may require up to 4 years of writing instruction for English-speaking children (Berninger et al., 2011; Juel, 1988; McCutchen et al., 1994), but only 2 or 3 years for students writing in shallow orthographies, such as Italian (Arfé, Dockrell, & De Bernardi, 2016; Arfé, Cona, & Merella, 2018; Babayigit & Stainthorp, 2010).

Recently, research conducted with English-speaking beginning writers has also suggested that developmental models of writing might have underestimated the role of oral language skills in early written composition (e.g., Kent, Wanzek, Petscher, Al Otaiba, & Kim, 2014; Kim, Al Otaiba, Wanzek, & Gatlin, 2015). This has led to the development of the direct and indirect effects model of writing (DIEW, Kim, 2016; Kim & Schatschneider, 2017). In contrast to the not-so-simple view of writing, the DIEW model of writing (Kim & Schatschneider, 2017) considers oral language skills to be a direct influence on written composition from the beginning of formal writing instruction, proposing that not only transcription skills but also oral discourse-level skills (a component of text generation) have a strong and direct effect on written production beginning in Grade 1. Similarly to the not-so-simple view of writing, the DIEW model of writing has been validated by empirical evidence of early effects of oral discourse-level language skills on writing (Kim, 2019; Kim & Schatschneider, 2017), which has been observed in older writers as well (fourth graders; Kim, 2019).

Other studies have explored more closely the interrelationship between oral and written language abilities across ages and grade levels (Berninger & Abbott, 2010; Mehta et al., 2005), showing that oral and written language are independent yet strongly interrelated constructs. For example, Mehta et al. (2005) found that, in elementary school children (third graders), the correlation between the two factors was .70 at the student level and 1.00 at the classroom level, where the two competences were indistinguishable. Berninger and Abbott (2010) found strong and significant correlations between oral language factors (listening comprehension and oral expression) and writing in Grade 1 (.65 correlation between writing and listening comprehension and .62 correlation between writing and oral expression), Grade 3 (.70 and .68), and Grade 5 (.67 and .66). These findings confirm that oral and written expression, though not identical, draw on common processes.

In line with this research, a recent meta-analysis (Graham, Hebert, Fishman, Ray, & Rouse, 2020) showed a consistent and significant association between oral and written language problems from the preschool years to adolescence (Grade 12). According to the authors, three theoretical explanations can be given for the link between oral language and writing difficulties. The first, the shared knowledge hypothesis, is that oral and written production share knowledge representations (phonological, morphological, and syntactic knowledge) and processes (e.g., translating ideas into language units, monitoring language production), and that problems in constructing these representations or in these processes thus affect both production modalities. The second explanation, the rhetorical relations hypothesis, is that children need to develop similar pragmatic and rhetorical skills (e.g., sense of audience and the ability to choose appropriate rhetorical devices) in the two production modalities. Children’s problems in developing these rhetorical and pragmatic skills in oral production can influence writing development as well. A third explanation posits that the same general learning deficit or basic dysfunctions (e.g., a processing capacity limitation or deficits in procedural memory) can cause oral and written language difficulties (the learning deficit hypothesis). In all cases, empowering skills and processes in one modality (oral language) should translate into benefits to the other modality (written language).

The increasing awareness of the role played by oral language skills in writing has recently led researchers to test the benefits of introducing oral language interventions into writing or literacy instructional programs (Goldfeld et al., 2017; Spencer & Petersen, 2018). Yet the research in this field is still limited, and these intervention studies have been focused on young writers only (first to third graders).

Although the foundational oral language components of text generation are partly developed before children learn to write (Kent et al., 2014; Pinto et al., 2015), they continue to develop concurrently with written language during upper elementary school years and beyond (Alamargot et al., 2015; Chanquoy & Negro, 1996; Hebert, Bohaty, Nelson, & Roehling, 2018; Hebert, Kearns, Hayes, Bazis, & Cooper, 2018; Jones, Myhill, & Bailey, 2013; Ravid & Tolchinsky, 2002; Wijekumar et al., 2019). For example, Ravid and Berman (2006) demonstrated that children’s oral and written productions in English and Hebrew showed parallel developmental patterns from 9–10 years (Grade 4) to 16–17 years of age (high school), with a turning point in the development of oral and written language abilities in adolescence between ages 12–13 and 16–17. Thus, training in oral language skills could also be beneficial for these older writers.

The efficacy of text generation interventions and interventions focused on sentence-level skills

Limited evidence exists in general on the efficacy of instructional programs that target text generation directly and primarily (McMaster, Kunkel, Shin, Jung, & Lembke, 2018). In a best-evidence review, McMaster et al. (2018) showed that 17 out of 25 studies addressing Grades 1 to 3 targeted text generation in multicomponent interventions; however, most of them did not examine the specific contribution of training text generation processes. Other meta-analyses that examined the effectiveness of writing interventions with elementary school children (Graham, McKeown, Kiuhara, & Harris, 2012; Graham & Perin, 2007) or older writers (Graham & Perin, 2007) analyzed four types of instructional programs focused primarily on text generation—explicit teaching of grammar, sentence combining, teaching text structure, and emulating good texts—and reported only small to moderate efficacy (effect sizes from 0.25 to 0.59) for those programs. Of the instructional strategies assessed, sentence combining (i.e., teaching students to construct compound, complex or sophisticated sentences) and text structure interventions (i.e., providing explicit knowledge of the text structure of specific text types) led to the largest effect sizes (0.50 and 0.59, respectively). Grammar instruction was the sole treatment that appeared to have a negligible or even negative effect on students’ writing (Graham et al., 2012).

Independently of the limited efficacy of grammar instruction, however, the fluency with which writers generate sentences (i.e., grammatical and syntactic structures) orally or in writing appears to be a crucial element of the writing performance of young writers (e.g., Arfé et al., 2016). In fact, in upper elementary school, measures of writing fluency are most effective in discriminating between good and struggling writers (Dockrell, Connelly, & Arfé, 2019; McCutchen et al., 1994).

Text writing fluency is an important feature of skilled writing, reflecting the facility with which writers generate and produce connected texts (Alves & Limpo, 2015; Dockrell et al., 2019; Kim, Gatlin, Al Otaiba, & Wanzek, 2018; Limpo & Alves, 2018). Writing research has operationalized writing fluency in several different ways: (a) as the number of words produced in a text per minute (i.e., the speed of transcription and text generation; Limpo & Alves, 2018); (b) as the number of words produced in a writing burst (i.e., between writing pauses, a second speed measure; Alves & Limpo, 2015; Dockrell et al., 2019); or (c) as the number of correctly connected words produced within a time limit (i.e., the accuracy and speed of transcription and text generation; Kim et al., 2018). These measures tap different aspects of writing fluency. The first two measures (a and b) tap the automaticity of transcription and text generation. The third measure (c) taps the efficiency with which children execute the transcription and text generation task (i.e., how well and rapidly they perform the task). In both cases, in beginning writers, transcription skills mainly constrain text fluency. However, as noted earlier, later in writing development—from Grade 3—the contribution of oral language (text generation) skills to text writing fluency increases. As words in a text connect by means of grammatical and syntactic rules, sentence fluency (fluency in generating grammatical and syntactic structures) becomes an important index of writing ability at this point in writing development (Arfé et al., 2016; Dockrell et al., 2019; McCutchen et al., 1994). In this study we operationalized text writing fluency and sentence fluency as a writer’s speed of producing connected text (words in a text or complete sentences).
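The first two operationalizations listed above reduce to simple ratios, which the following minimal Python sketch illustrates. The function names, the representation of bursts as a list of word counts between pauses, and the sample values are hypothetical and serve only as an illustration; they are not taken from the cited studies. Measure (c) is omitted because counting correctly connected words depends on a rater's judgment.

```python
def words_per_minute(total_words: int, minutes: float) -> float:
    """Measure (a): number of words produced per minute of writing."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return total_words / minutes


def mean_burst_length(burst_word_counts: list[int]) -> float:
    """Measure (b): average number of words written between two pauses.

    Each entry is the word count of one writing burst (hypothetical input;
    how pauses are detected is outside the scope of this sketch).
    """
    if not burst_word_counts:
        raise ValueError("at least one burst is required")
    return sum(burst_word_counts) / len(burst_word_counts)


# Illustrative values only: 180 words written in a 15-min task,
# and three bursts of 4, 6, and 5 words.
rate = words_per_minute(180, 15)
burst = mean_burst_length([4, 6, 5])
```

Both functions return speed-type indices; neither says anything about accuracy, which is why measure (c) is treated in the text as tapping a different aspect of fluency.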

A writer’s ability to fluently generate and handle complex sentence structures can be critical, especially in Romance languages, such as Italian (Arfé et al., 2016), where a very shallow orthography may reduce the demands of spelling, but the morphological and grammatical complexity of the language increases the demands of text generation (Arfé et al., 2016). Thus, sentence generation skills not only play a significant role in the quality and complexity of the texts produced by young Italian writers (Arfé et al., 2016), but also continue to account for variance in writing fluency (i.e., speed) even during high school years (ages 16–18; Danzak & Arfé, 2016). The development of the cognitive and neural underpinnings of syntactic processing is indeed prolonged, continuing through adolescence (Schneider, Abel, Ogiela, Middleton, & Maguire, 2016), and the ability to organize information in complex sentences by means of a variety of subordinate conjunctions represents one of the areas of greatest weakness for adolescents with language learning needs (Gamez, Lesaux, & Rizzo, 2016).

Training oral language abilities at sentence and intersentential levels may thus be an effective strategy to improve writing skills, especially (although not only) in a morpho-syntactically complex language. Past intervention studies have shown that sentence-level interventions focused on sentence construction can be effective for writers of different ages (from upper elementary school to high school), and the positive effects can transfer to extended composition (Datchuk & Kubina, 2013; Limpo & Alves, 2013; Myhill, Jones, Lines, & Watson, 2012; Saddler & Graham, 2005). However, the focus of these interventions was on written sentence construction. To our knowledge, no research has attempted to test oral sentence generation skill training. Training oral sentence construction and fluency can be beneficial for writers, allowing for focus on the linguistic processes of text generation without the additional demands of transcription, which may be a constraint for translation processes even during middle school years (Grades 7–8; Limpo, Alves, & Connelly, 2017) and up to adulthood (Olive & Kellogg, 2002). Such interventions could be particularly effective for those writers who show weak foundational oral language skills because of language disabilities (Dockrell, Lindsay, & Connelly, 2009) or socioeconomic or linguistic disadvantages (Hoff, 2013). Empowering the oral language skills of these writers could have beneficial effects on their writing outcomes.

In the present study, we aimed to explore the effectiveness of a nine-session, classroom-level oral language intervention focused on the oral sentence generation skills of students in Grades 5 and 10. As noted earlier, students’ linguistic skills in oral and written language abilities continue to develop from 9–10 years of age (Grade 4) to 16–17 years (high school; Ravid & Berman, 2006). Grades 5 and 10 represent two important turning points in writing development. Grade 5 is the last year of elementary school in Italy. At this stage, written production tasks become more complex and varied to prepare children for middle school compositional tasks. Fluency in text generation represents a prerequisite and developmental constraint for engaging in these more complex tasks (Dockrell et al., 2019). Grade 10 is the last grade in which grammatical skills are taught in Italian school and represents a consolidation stage of higher level syntactic skills and metalinguistic abilities that are necessary for rhetorically complex writing tasks, such as persuasive writing (Brimo & Hall-Mills, 2019), as well as academic writing in general (Silliman, 2014). We addressed the intervention to classrooms in which teachers reported significant writing needs among students and targeted oral sentence generation and reformulation skills, progressing through increasing levels of difficulty, from simple to complex sentence construction up to the use of inter-sentential links at the discourse level. We examined the effectiveness of the training in improving written sentence fluency (close transfer) and writing quality and fluency (far transfer). Three research questions guided this study:

  1. Would an oral language intervention focused on sentence and inter-sentential construction skills yield significant gains at the sentence and text levels in fifth- and 10th-graders’ writing?

  2. Will potential gains differ depending on students’ grade level or level of writing development? Related to this, will the intervention affect text generation at different levels (i.e., sentence and text level) in fifth-grade and 10th-grade students’ writing?

  3. Will students retain writing gains post-intervention?

We made the following hypotheses:

  • Given that the cognitive underpinnings of syntactic skills continue to develop until adolescence (Schneider et al., 2016), we expected that training oral sentence generation skills would lead to significant improvements in students’ written sentence generation fluency and written sentence reformulation skills at both grade levels (Grades 5 and 10). However, we expected greater improvements at the sentence level for the younger writers (fifth graders), whose sentence-level text generation skills were assumed to be less mature.

  • Given the role played by oral language and sentence generation fluency in students’ writing (Dockrell et al., 2019), students’ gains following the training should transfer to text production, impacting the fluency (text generation speed) and quality of the students’ written compositions.

  • Students should maintain these positive effects at a 5-week follow-up from the end of the intervention.

The study

Participants

We enrolled two groups of participants in the study, one including four classrooms of young writers attending the last year (Grade 5) of an Italian public primary school in northwest Italy, and one including four classrooms of older writers attending the second year of an Italian public high school (Grade 10) in the same region. Inclusion criteria for the study, at the classroom level, were: (a) a classroom reported by the language teacher to present writing needs; and (b) teachers’ commitment to participating in the full (nine sessions) classroom-level intervention. Inclusion criteria at the student level included: (a) residing in Italy for at least 3 years; (b) no reported sensory, motor, or intellectual disabilities; and (c) parents’ written informed consent to the study. Currently, there are no specific formulae for the computation of power and sample size in cohort stepped-wedge trials involving repeated measures. The required sample size, computed for a factorial repeated-measures design, is N = 128, with power set at .80 and effect size at .25, p < .05, and N = 171, with power set at .90. Based on these calculations, we initially recruited 167 participants (81 fifth graders and 86 10th graders) to take part in the study. Of these, only 115 students participated in all the assessment phases of the study (pretest, posttest, and follow-up) and in a sufficient number of training sessions (at least 7 out of 9) to be included in the final study sample. Thus, attrition was 31%. The dropout rate was higher among high school students (n = 39 participants; 45%), probably due to the relatively greater socioeconomic disadvantage of these students in comparison to the elementary school students (χ2 = 109.33, df = 2, p < .001). Students from lower socioeconomic backgrounds typically show higher rates of school absenteeism (Klein, Sosu, & Dare, 2020).
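The attrition figures reported above can be checked with a few lines of arithmetic. The sketch below uses only the counts given in this paragraph; the variable names are hypothetical, and the fifth-grade dropout count and rate are derived here rather than stated in the text.

```python
# Counts reported in the paragraph above.
recruited = {"grade5": 81, "grade10": 86}
final_total = 115
dropped_grade10 = 39

recruited_total = sum(recruited.values())         # 81 + 86
dropped_total = recruited_total - final_total     # students lost overall
dropped_grade5 = dropped_total - dropped_grade10  # derived, not reported


def percent(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


overall_attrition = percent(dropped_total, recruited_total)
grade10_attrition = percent(dropped_grade10, recruited["grade10"])
grade5_attrition = percent(dropped_grade5, recruited["grade5"])  # derived
```

Under these counts the overall and Grade 10 rates match the reported 31% and 45%, and the derived Grade 5 rate is considerably lower, consistent with the statement that dropout was higher among high school students.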

The final study sample comprised 68 primary school children (37 girls, 54%, mean age = 9.92) from four fifth grade classrooms and 47 high school students (37 girls, 78%, mean age = 15.54) from four 10th grade classrooms. Table 1 reports the participant characteristics. Four fifth graders and four 10th graders of this sample were diagnosed with learning disabilities: three fifth graders and one 10th grader presented a mixed learning disorder, including deficits in reading, writing, and math; and three 10th graders and one fifth grader had other kinds of learning disabilities (including visuospatial or attentional disorders). Twenty-seven participants (23% of the sample, seven fifth graders and twenty 10th graders) were from other countries of origin (Romania, Albania, Senegal, China, Brazil, Morocco, and Russia) and learned Italian (their second language, L2) at school. Only seven of these students (three fifth graders and four 10th graders) had been in Italy for less than 6 years (range 3–5 years) at the time of the study. They were fluent enough in Italian to take part in the study activities.

Table 1 Pretest (T1) differences between the experimental and wait-list groups: means (standard deviation)

We did not obtain the school’s permission to collect students’ socioeconomic status (SES) data. However, the two school districts were in medium (the fifth graders’ school) and medium-to-low (the 10th graders’ school) SES areas, respectively. The participants who dropped from the study did not differ significantly from the other study participants on any of the demographic variables considered (i.e., age, gender, SES area, learning disabilities, or years in Italy). Little’s MCAR (missing completely at random) test run on pretest scores showed that the students with missing data were not significantly different from the participants without missing data: χ2 = 13.15, p = .28 (fifth graders) and χ2 = 20.24, p = .09 (10th graders).

For each grade level (5th or 10th), we randomly assigned classrooms to an experimental (two fifth grade classrooms, n = 35; two 10th grade classrooms, n = 20) or a wait-list condition (two fifth grade classrooms, n = 33; two 10th grade classrooms, n = 27). The experimental and wait-list groups did not differ significantly in gender distribution, students’ years in Italy, or prevalence of learning disabilities (see Table 1). The only significant difference was in age between the fifth grade experimental and wait-list groups, t(66) = −2.19, p < .05: the children in the wait-list group were slightly older than the children in the experimental group (see Table 1).

All parents provided written informed consent for their children to participate in the study. In addition, we requested students’ oral consent to the study before each assessment session.

Procedure

We ran a stepped-wedge cluster-randomized controlled trial (Campbell et al., 2019) with groups (classrooms) randomly assigned to an experimental or wait-list condition and both groups (the experimental and wait-list) receiving the intervention at different times. Figure 1 displays the study timeline. This experimental design allowed all students to receive the oral language intervention, while also controlling for intervention effects. Participants’ sentence-level and text-level writing skills were assessed three times over the study period (see Fig. 1): pretest (T1), posttest (T2, after 3 weeks of intervention), and follow-up (T3, 5 weeks after the posttest).

Fig. 1 Study design

The Grade 5 and Grade 10 experimental groups received the intervention immediately after the pretest, between the end of November and December. During this time span, the wait-list groups received business-as-usual (BAU) writing instruction. After the 3-week experimental/BAU intervention, at the end of December, we reassessed the fifth grade and 10th-grade students’ sentence-level and text-level writing skills (T2, posttest). Next, the wait-list group received the experimental training for an equivalent duration (3 weeks in January), and the experimental group followed standard (BAU) writing instruction in their classrooms. At the end of January (T3), after this second 3-week intervention, we reassessed the students’ sentence-level and text-level writing skills. For the experimental group, this second posttest, 5 weeks after the first, consisted of a follow-up assessment.
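The crossover structure of the design described above can be summarized as a small lookup table. This is only an illustrative Python sketch: the dictionary layout, the labels, and the helper function are hypothetical conveniences, while the ordering of conditions follows the description in the text.

```python
# Instruction received by each group in the two 3-week periods,
# in chronological order (period 1 precedes T2, period 2 precedes T3).
SCHEDULE = {
    "experimental": ["intervention", "BAU"],
    "wait-list": ["BAU", "intervention"],
}
ASSESSMENTS = ["T1", "T2", "T3"]


def condition_before(group: str, assessment: str):
    """Return what a group received in the period just before an assessment.

    T1 is the pretest, so no study period precedes it and None is returned.
    """
    index = ASSESSMENTS.index(assessment)
    if index == 0:
        return None
    return SCHEDULE[group][index - 1]
```

Reading the table this way makes the role of each time point explicit: T2 is a posttest for the experimental group only, whereas T3 is a posttest for the wait-list group and a follow-up for the experimental group.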

Pretest and posttest assessments

We assessed the students’ sentence-level and text-level writing skills at the pretest (T1), posttest (T2), and at the follow-up (T3) by written sentence generation and sentence reformulation tasks (Arfé et al., 2016), and written composition tasks (fantasy stories). We used parallel versions of the written sentence generation and reformulation tasks and similar text production tasks to retest sentence-level and text-level writing skills over time. Students completed all tasks in one classroom session of approximately 1 h. In addition to these tasks, all participants completed a standardized spelling task in the pretest session to assess their transcription skills.

Standardized spelling tasks

Different standardized spelling tests are used in Italy to assess spelling during elementary school and high school years. Therefore, we used two different spelling tasks for the two age groups of this study: the fifth and 10th graders.

Word Dictation (Grade 5) We used the word-spelling subtest of the Battery for the Assessment of Dyslexia and Dysorthographia (DDE-2; Sartori, Job, & Tressoldi, 2007) to assess the fifth graders’ spelling abilities. Participants wrote, from dictation, 48 words varying in length (from two- to four-syllable words), frequency, and orthographic structure. We scored the number of misspelled words. The concurrent validity of this subtest, reported by prior studies (Arfé et al., 2016), is .82.

Text Dictation (Grade 10) For the 10th graders, we used a standardized text dictation task from the Neuropsychological Assessment Battery for Adolescence (BVN 12–18; Gugliotta, Bisiacchi, Cendron, Tressoldi, & Vio, 2009) to assess spelling skills. This test consists of writing a short text (152 words) under dictation within a 2-min time limit. We asked students to pay attention to the dictation and to write down the text as accurately and as rapidly as they could. The number of correctly spelled words was scored. The manual does not provide reliability or validity values for this subtest, which is used extensively in clinical settings.

Sentence-level tasks We used two tasks that have shown good discriminant validity in the assessment of students’ text generation skills (Arfé et al., 2016; Dockrell et al., 2019).

Sentence generation This task provides a sentence-level measure of students’ text generation fluency, tapping their ability to generate ideas and to translate them into written sentences (syntactic and semantic structures; Arfé et al., 2016). Students received two word pairs (e.g., acqua-ponte/water-bridge and cani-gatti/dogs-cats) and were asked to generate as many different sentences as they could using both words in 5 min. We scored only sentences containing both words and allowed no changes to the word pair (e.g., changing word form from singular to plural). Participants practiced with an example word pair before starting the test. We recorded sentence fluency (number of different sentences generated) and accuracy (sentence generation accuracy scores). For accuracy, we adapted the scoring from Arfé et al. (2016) and Dockrell et al.’s (2019) studies, allowing for better discrimination among older students’ sentence generation skills. Scoring criteria were:

  • Each new sentence that was grammatically and semantically accurate earned a score of 2;

  • Each sentence that was either semantically or grammatically incorrect earned a score of 1;

  • Each sentence that was semantically and grammatically incorrect, did not include both target words, or included variations of the target word pair (e.g., plural instead of singular) earned a score of 0;

  • Sentences that varied minimally from the previous ones (e.g., The fireman jumped in the water from the bridge and The old lady jumped in the water from the bridge) earned a score of 0.5.

We did not code errors in punctuation, capitalization, or misspellings.
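The scoring criteria above can be expressed as a short decision rule. The following Python sketch is one possible reading, not the scoring procedure actually used: the class and field names are hypothetical, each field stands for a human rater's judgment, and two points are assumptions not settled by the text, namely that the 0.5 score for minimal variations takes precedence over the scores of 1 and 2, and that a student's accuracy total is the sum of the sentence scores.

```python
from dataclasses import dataclass


@dataclass
class SentenceJudgment:
    """A rater's judgments about one generated sentence (hypothetical structure)."""
    has_both_targets_unchanged: bool  # both target words present, no form changes
    grammatical: bool                 # grammatically accurate
    semantic: bool                    # semantically accurate
    minimal_variation: bool = False   # differs only minimally from an earlier sentence


def accuracy_score(j: SentenceJudgment) -> float:
    # Missing or altered target words always earn 0.
    if not j.has_both_targets_unchanged:
        return 0.0
    # Incorrect both semantically and grammatically earns 0.
    if not j.grammatical and not j.semantic:
        return 0.0
    # ASSUMPTION: a near-repetition earns 0.5 regardless of its accuracy.
    if j.minimal_variation:
        return 0.5
    # Fully accurate new sentence.
    if j.grammatical and j.semantic:
        return 2.0
    # Exactly one of the two dimensions is incorrect.
    return 1.0


def total_accuracy(judgments: list[SentenceJudgment]) -> float:
    # ASSUMPTION: the accuracy total is the sum of sentence scores.
    return sum(accuracy_score(j) for j in judgments)
```

Punctuation, capitalization, and spelling do not appear as fields because they were not coded.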

Reliability The second author and a trained independent rater (a master’s student), who was blind to the research groups and to the hypotheses of the study, independently scored the sentences. Interrater agreement computed on 100% of the sentences was high, 95%, and similar to that found by Dockrell et al. (2019): 94%. The two raters discussed and solved any disagreements.
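For readers unfamiliar with the index, a percentage of interrater agreement is commonly computed as the share of items on which two raters assign the same score. The Python sketch below illustrates that calculation under the assumption that exact-match agreement was used; the text does not specify the formula, and the function name and sample scores are hypothetical.

```python
def percent_agreement(rater_a: list[float], rater_b: list[float]) -> float:
    """Percentage of items given identical scores by two raters.

    ASSUMPTION: exact-match agreement; the source does not state the formula.
    """
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same items")
    if not rater_a:
        raise ValueError("at least one item is required")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100 * matches / len(rater_a)


# Illustrative scores only: the raters disagree on the third of four items.
example = percent_agreement([2, 1, 0, 2], [2, 1, 1, 2])
```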

Sentence reformulation

This was an additional, sentence-level measure of text generation fluency based on higher-order metalinguistic skills. The task assessed students’ ability to reformulate sentences (i.e., find alternative words or grammatical structures to translate a given meaning; Arfé et al., 2016). Students were asked to reformulate two simple (one clause) and two complex (main plus subordinate clause) sentences. For each sentence, the student could generate up to three reformulations by using different words (i.e., synonyms or paraphrases) and/or grammatical/syntactic structures (e.g., translating sentences to passive voice). Unlike in the sentence generation task, in the sentence reformulation task students were constrained by semantics; that is, the reformulated sentence needed to convey the same meaning as the given sentence. A time limit of 10 min was given for each trial. Before starting the test, the children practiced with two training items. The scoring was the same as that used in prior studies (Arfé et al., 2016):

  • A score of 2 was awarded if the sentence reformulation was grammatically correct and maintained the meaning of the target sentence (e.g., Alice wishes to play cards with Lucia for Alice wants to play cards with Lucia).

  • A score of 1 was given to reformulations that were grammatically correct but did not maintain the original meaning of the item (e.g., Alice plays cards with Lucia for Alice wants to play cards with Lucia).

  • A score of 0 was given when the reformulation was incorrect both grammatically and semantically or the reformulated sentence was totally unrelated to the target (e.g., Alice makes a cake with Lucia for Alice wants to play cards with Lucia). Scores could range from 0 to 24 (2 [maximum score] × 3 reformulations × 4 sentences).

As with the sentence generation task, errors in punctuation, capitalization, or misspellings were not scored.
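The reformulation rubric can likewise be sketched as a decision rule. In the Python sketch below, the function name and parameters are hypothetical and each argument stands for a rater's judgment. The rubric as described does not say how an ungrammatical reformulation that preserves the meaning should be scored, so the sketch returns None for that case instead of guessing.

```python
from typing import Optional

# Maximum total: 2 points x 3 reformulations x 4 sentences.
MAX_SCORE = 2 * 3 * 4


def reformulation_score(
    grammatical: bool, same_meaning: bool, related: bool = True
) -> Optional[int]:
    """Score one reformulation according to the three criteria listed above."""
    # Totally unrelated to the target sentence.
    if not related:
        return 0
    # Grammatically correct and meaning maintained.
    if grammatical and same_meaning:
        return 2
    # Grammatically correct but meaning changed.
    if grammatical and not same_meaning:
        return 1
    # Incorrect both grammatically and semantically.
    if not grammatical and not same_meaning:
        return 0
    # Ungrammatical but meaning-preserving: not covered by the rubric as described.
    return None
```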

Reliability

Arfé et al. (2016), who tested younger writers (6- to 8-year-old children), reported an interrater reliability of 93%. In this study, interrater agreement between the second author and the independent rater was 86% on 100% of the sentence reformulations. All cases of disagreements were discussed between the two raters and solved.

Text-level tasks

Written composition Students were asked to write fantasy stories based on a topic title. Similarly to other studies (Dockrell et al., 2019), 15 min were allotted to complete the task. Three topic titles were given in the three (T1, T2, and T3) assessments: “An old man, feeling useless and seeking attention, decides to become a cat… [Continue]” (Topic 1, T1); “One morning, you wake up to discover you are an adult. Your parents have become children/adolescents. What happens next?” (Topic 2, T2); and “Today is the 1st of February of the year 3000. Retell your day” (Topic 3, T3). When 15 min had passed, students were asked to stop writing. If, at that moment, they were writing a sentence, they were allowed to complete it before stopping.

Texts were scored according to text quality and text generation fluency (number of words).

Text quality A scoring rubric was used to score text quality on a scale from 1 to 4. Two dimensions were scored, based on prior research (Arfé et al., 2016) and existing analytical writing scales (WOLD; Rust, 1996).

Macrostructure The macrostructural dimension included two dimensions of the WOLD: Ideas and development; and Organization, unity, and coherence (Rust, 1996). Text macrostructure was assessed considering the number of ideas produced and the quality of their logical organization (i.e., how well they were connected in a coherent text). Ideas were scored 1 when the text included few and poorly elaborated ideas and 4 when the text was rich with ideas, which were always articulated. Organization was scored 1 when the text presented a weak logical organization and several instances of incoherence and 4 for a very coherent and logically organized text. The two scores were summed for a macrostructure score (range 2–8).
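As a minimal sketch of how the two 1 to 4 ratings combine into the 2 to 8 macrostructure score, assuming integer ratings (the function name is hypothetical):

```python
def macrostructure_score(ideas, organization):
    """Sum the Ideas and Organization ratings (each 1-4) into a 2-8 score."""
    for name, value in (("ideas", ideas), ("organization", organization)):
        if value not in (1, 2, 3, 4):
            raise ValueError(f"{name} must be an integer from 1 to 4")
    return ideas + organization
```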

Language The language dimension referred to the microstructure of the text, and included the Vocabulary, Sentence structure and variety, and Grammar and usage dimensions of the WOLD (Rust, 1996). The linguistic quality of the text was scored considering the writer’s lexical choices, the grammatical and syntactic accuracy of the text produced, and the variety of syntactic structures used. The highest score (4) was given to texts with contextually appropriate lexical choices, syntactically accurate sentences, and a great variety of lexical and syntactic structures. The lowest score (1) was given to texts with incorrect lexical choices, frequent repetitions, and several grammatical and syntactic errors.

Text fluency The number of words written in 15 min was used to score writing fluency (i.e., speed of text generation).

Reliability Interrater agreement computed on 100% of the texts, independently scored by the second author and the independent rater, was 81% for macrostructural quality, 82% for linguistic quality, and 100% for text fluency. Disagreements between the two raters were discussed and resolved. When reaching a consensus was difficult, the average score between the two raters was used.
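The agreement and resolution procedure can be illustrated with a short sketch. It assumes that agreement means exact matches between the two raters' scores, which the text does not state explicitly; both function names are hypothetical.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of items on which two raters gave exactly the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)


def resolve_score(score_a, score_b, consensus=None):
    """Final score: the shared value, else the consensus, else the raters' mean."""
    if score_a == score_b:
        return float(score_a)
    if consensus is not None:
        return float(consensus)
    # Fallback described in the text when consensus was difficult to reach.
    return (score_a + score_b) / 2
```

For example, `resolve_score(3, 4)` falls back to the average of the two ratings, whereas `resolve_score(3, 4, consensus=4)` keeps the value the raters agreed on after discussion.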

The intervention

The intervention consisted of nine 90-min language workshops conducted by the second author, with the assistance of the classroom teacher. The workshops aimed at stimulating the development of students’ oral language abilities through group sentence generation games. Playful teamwork activities (in groups of 4–5 members each) were designed to encourage students’ interaction, engagement, and problem solving (Boscolo, Gelati, & Galvan, 2012). The linguistic games were aimed at (a) increasing students’ awareness of sentence construction rules and (b) enhancing their sentence generation and linguistic fluency.

Linguistic games of different types (e.g., sentence re-construction, sentence expansion, sentence combining) alternated during the intervention to force students to generalize and flexibly use the linguistic abilities developed during the workshops. Workshop sessions and games were organized in levels of increasing difficulty. Games focused on simple sentence (verb plus complement) construction were introduced first, followed by games requiring the construction of longer and more complex sentences, involving the use of logical connectives and anaphors. Task constraints increased progressively to reflect the demands of text generation in writing, in which the production of new sentences is constrained by the text generated up to that moment. Finally, the students played games in which they had to transfer their sentence generation skills to argumentative or narrative tasks. Activities were discussed with the class at the end of each workshop, and informative and corrective feedback was provided to the teams.

The nine 90-min workshops were administered to the experimental group (between T1 and T2) and to the wait-list group (between T2 and T3; see Fig. 1), at a rate of three workshops per week, during school hours, over 3 weeks (about 13 h of intervention in total).

Sessions were organized as follows.

Session 1. How do words and sentences affect meaning and message?

The goal of this introductory workshop was to frame the intervention and engage students in the workshop activities: A story (Grade 5) or an essay (Grade 10) around the theme of “language use” was read to the classes to prompt students’ questions and classroom discussion. Questions related to how a text meaning might change depending on the words or sentences used, or questions around how specific lexical or grammatical choices can strengthen a given message, were elicited. Then, the first game (broken sentences) was presented. In this session, students were not yet grouped in teams, but worked individually. This allowed the researcher to get to know the students and then assign them to mixed-ability teams in the second workshop session.

Broken sentences, basic

The students received a set of cards, each containing one of the elements of a broken sentence (a noun, a verb, a preposition, etc.). The task was to reconstruct the sentence in the best way possible, recombining the given sentence elements. Once the sentence was reconstructed, other students were asked to reduce its length as much as possible, up to its kernel, by deleting all unnecessary sentence elements one by one. Elements could be deleted if their deletion did not affect the grammaticality or semantic completeness of the sentence. The goal of this game was to invite students to reflect on the core role of verbs in sentences and the minimal elements necessary to convey a complete sentence meaning (see Tesnière, 2015).

Session 2

In this session, students were assigned to groups. Linguistic games were played in teams of 4–5 members each. The experimenter always introduced new games, showing the students how to perform the tasks. Two games were played in Session 2.

Sketches, basic

As the verb is the structural center of the clause, our sentence generation intervention started from verb valence. In linguistics, valence is the number of dependent elements (arguments or complements) that are necessary to complete a verb meaning (Tesnière, 2015). Just as atoms combine in chemical compounds, words combine with other words in sentences. Valence is a measure of their combining capacity. Some verbs have greater valence than others, because they require more elements to convey a complete meaning. For example, the verb “to give” (Italian, dare) is trivalent, as it needs at least three complements (who gives what, to whom) to convey the action meaning, whereas the verb “to rain” (Italian, piovere) is avalent (zero-valent), because it does not require complements (piove/it rains conveys a complete meaning).

All teams received a box containing puzzle pieces representing different verbs. The task was to randomly pick up a puzzle piece (a verb) and represent the verb meaning in a “sentence sketch.” The sketch should be “interpreted” by actors (team members), embodying all the sentence elements necessary to represent the verb meaning. Each team selected as many actors as necessary to convey the meaning of the target verb. For example, to represent the verb “to give,” the team required a minimum of three “actor elements”: an agent (who gives), an object (what), a receiver (to whom): Marco gives the book to Luisa. In turn, each team “interpreted” the sentence for the classroom, explaining why a certain number of “actor elements” were necessary. The goal was to stimulate a reflection about verb valence.

Sketches, advanced

As in the basic sketches game, teams picked a verb puzzle piece from a verb box and represented the action meant by the verb in a sketch. Unlike the basic sketches game, however, the task also involved developing the sentence further. Students were taught to use questions to decide how to expand the sentence meaning. For example, the action “went” could be represented in a simple sketch: “He went to the park”. Then, questions such as “with whom?”, “why did he go there?”, and “how did he get to the park?” could help develop the sentence. One point was given to the team for each additional element that was correctly added to the basic sentence.

Session 3

In Session 3, students continued practicing sentence construction skills, but with increasing content and grammatical constraints. Generating sentences respecting specific constraints was considered an important target skill of the intervention, as text generation is always constrained by the preceding text, the words that have been used, and rhetorical constraints. Two games were played.

Sentence puzzles, basic

Each team received a box with 15 puzzle pieces corresponding to sentence elements. Each element had a different color depending on its grammatical category (e.g., light blue for articles, pink for verbs, violet for nouns, brown for prepositions, etc.). The task was to use as many puzzle pieces (sentence elements) as possible to generate a sentence. One point was awarded for each element (puzzle piece) added to the sentence. The longer the sentence, the higher the team’s score. Grammatically incorrect sentences were scored 0.

Crazy sentences, basic (connectives)

For this game, teams were randomly given two picture cards depicting two content elements each (e.g., a man and a penguin on one card and a king and a present on the second) and were asked to formulate (within 20 s) a sentence that related the four elements. To combine all elements in a sentence, students had to use connectives. However, they were not allowed to use the additive connective, and. As noted in the introduction, students with language learning needs may lack the ability to organize information in complex sentences by means of conjunctions (Gamez et al., 2016). The students in this study tended to use “and” in place of more specific subordinate or coordinate connectives. By not allowing the use of “and”, we pushed them to make use of more specific connectives in sentences. The team received one point for each grammatically accurate sentence that did not include the connective and.

Session 4

In Session 4, an advanced version of the “sentence puzzles” game was proposed, in which the linguistic constraints of the game increased.

Sentence puzzles, advanced

This game was similar to the basic sentence puzzles game, but with the additional constraint that the team had to use a specific verb to construct the sentence. All teams received a puzzle piece representing the target verb, plus 15 further puzzle pieces, corresponding to different sentence elements (nouns, articles, adjectives, etc.). Their task was to construct the longest sentence possible, starting from the sentence elements at hand. As in Session 2, verbs of different valence were used. Once the sentence was generated, the team could develop it further by rolling dice and picking from different element boxes (e.g., nouns, adjectives, verbs boxes) as many sentence elements (nouns, articles, or other verbs) as the dice indicated. The team received as many points as the number of correctly used sentence elements.

Session 5

As in Session 4, in Session 5, students only played one game. This session aimed at consolidating the sentence generation skills developed in Sessions 2 and 4 (mainly sentence expansion).

Syntactic goose game, basic

In this game, two teams competed in a challenge. The first player generated a sentence based on a verb card picked up from a verb card deck. Then, the teammates picked up a card each from another card deck (including Wh- question cards such as “with whom?” or preposition cards, like of). The task was to develop the sentence by answering the Wh- question (e.g., “with whom?”) or using the target preposition (e.g., of) card. The number of cards correctly used to develop the sentence indicated how many steps along the track the team advanced. Then, a player on the opposing team picked up a verb card and generated a sentence containing the verb. The other members expanded the sentence by picking up a Wh- question or preposition card each.

In Sessions 6 and 7, the focus was on the use of inter-sentential links (connectives and pronominal references) and the construction of complex (main plus subordinate) sentences. As in sentence combining (Saddler, Behforooz, & Asaro, 2008; Saddler & Graham, 2005), the games were aimed at increasing students’ awareness of the logical meaning of connectives and their ability to use them (as well as pronominal references) to integrate information from different sentences.

Session 6

In Session 6, fifth graders played the advanced broken sentences game, whereas 10th graders played the advanced crazy sentences game. The two games had the same goal but matched the different learning skills of the two age groups. We decided to use different games for the two age groups to provide each group of students with an optimal challenge for their ability level.

Broken sentences, advanced

This game was the reverse of the basic broken sentences game. All teams were given scissors and a sheet of paper with a complex sentence printed on it: a main plus a subordinate clause, containing a temporal or causal connective. The task was to break the sentence into its basic elements and subsequently recombine them in different sentences, using the same temporal or causal connective used in the original sentence. One point was given for each element included in a grammatically and semantically correct reformulation.

Crazy sentences, advanced

This game was similar to the basic crazy sentences game. The teams had to generate sentences starting from two picture cards (four content elements) within 20 s. However, in the advanced version, the teams were also given a specific coordinating or subordinate connective (or a pronoun) to be used in the sentence. Additive connectives such as and were excluded. Teams were provided with a conjunctions table with connectives divided based on their logical function (causal, temporal, etc.) and were told that they could use the table to decide how to use the connective. The produced sentences were evaluated by the other teams and discussed with them.

Session 7

In this session, the students only played an advanced version of the syntactic goose game.

Syntactic goose game, advanced

This game was similar to the basic version, with the addition of conjunction and verb cards for sentence expansion. The expansion card deck contained Wh-questions (e.g., “why?”) and conjunctions (e.g., while), which replaced the preposition cards of the basic version. The task was to develop a sentence generated from a verb card, using Wh-questions (e.g., “with whom?”) or conjunctions (e.g., while), with the additional challenge of adding a new verb to the sentence. Thus, the sentences constructed were more complex and included inter-sentential links.

In Sessions 8 and 9, the goal was to stimulate the transfer of students’ sentence-combining skills to the context of discourse production. The focus was on the ability to use connectives to express logical relations between sentences and combine sentences in discourse.

Session 8

In Session 8, fifth graders played the story gaps game, whereas 10th graders played argumentative challenges. As in Session 6, the two games had the same goal but were adapted to match the different discourse skills of the two age groups.

Story gaps

In this game, knowledge of inter-sentential links was applied at the discourse level. A picture story sequence was given, with some gaps to fill. Students received a box of puzzle pieces corresponding to different conjunctions. The task was to tell the story, filling in the story gaps with appropriate logical connectives (e.g., then, so, but). One point was given for each correctly used connective.

Argumentative challenges

Taking turns, one member of each team confronted a member of another team to debate on a topic chosen by the group. The only constraint was to use some of the connectives from the conjunctions table used in the advanced crazy sentences game.

Session 9

In this final session, the groups worked together on a collaborative text construction. While the fifth-grade group worked on a narrative text, the 10th-grade group produced an argumentative text. At the beginning of the activity, each group received a set of connectives. To add a new idea to the text, each group had to generate at least one sentence using one of the connectives received.

Treatment fidelity

The first and second authors of the study jointly designed the training, and the second author administered the training and conducted all sessions with the assistance of the classroom teacher. The second author used a rubric to take note of any deviation from the intervention plan or problem with the planned training program. Deviations from the planned lesson steps were very rare and never exceeded 10% of the planned lesson. Instructional sessions were timed and lasted 90 min each to ensure equivalence of treatment time spans across classrooms. To guarantee balanced participation among students, teamwork sessions were designed so that each team member had a turn being in charge of leading the task. Once a week, the first and second authors met online or in person to discuss any problems that arose during the intervention. The only reported deviations from the planned intervention were some session shifts due to other scholastic activities, and all classrooms received the same sequence and length of intervention.

Data analyses

Gains in writing over the three time points (T1–T3) were assessed by computing writing gain scores (in sentence generation accuracy, sentence generation fluency, sentence reformulation, and written composition macrostructure, language, and fluency). Gain scores corresponded to the gains in students’ performance from the pretest (T1) to the posttest (T2) and to the gains between the posttest (T2) and the follow-up (T3). T1–T2 gains were calculated by subtracting the students’ scores at T1 from their scores at T2; T2–T3 gains were calculated by subtracting scores at T2 from those at T3. As the experimental group received the intervention between T1 and T2, the first gain score (T1–T2 gain) was considered a measure of students’ learning following the intervention, whereas the second (T2–T3 gain) was considered a measure of maintenance of learning effects. For the wait-list group, the first gain score (T1–T2 gain) reflected only improvements due to practice (i.e., test–retest effects), whereas the second (T2–T3 gain) was hypothesized to reflect gains in performance due to the intervention.
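The gain-score computation described above amounts to two subtractions per student and measure. A minimal sketch, with a hypothetical function name and made-up scores used purely for illustration:

```python
def gain_scores(t1, t2, t3):
    """Return the two gain scores for one student on one measure."""
    return {
        "T1-T2": t2 - t1,  # pretest to posttest
        "T2-T3": t3 - t2,  # posttest to follow-up
    }


# Illustrative values only, not data from the study.
example = gain_scores(t1=10, t2=14, t3=15)
```

For a student in the experimental group, the first entry would index learning after the intervention and the second its maintenance; for a wait-list student, the roles are practice effects and intervention gains, respectively.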

The statistical analyses tested the main effects of time (T1–T2 gain vs. T2–T3 gain), group (experimental vs. wait-list), and grade level (Grade 5 vs. Grade 10), as well as their interactions, on written sentence generation skills and text composition. Score distribution was checked first (and separately for each grade level), by inspecting skewness and kurtosis. Between-group differences at the pretest were assessed for each grade level (Grades 5 and 10) by independent t-tests or by chi-square analyses (for nominal variables).

As participants were nested within classes, a multilevel analysis was initially planned to control for random effects at the classroom level (Peugh, 2010). Intraclass correlation (ICC) statistics suggested that very little variance, between 0 and 6%, occurred across classrooms. We thus estimated the design effect (DE) to decide whether we needed to control for clustering at the classroom level. The design effect was computed by the following formula: DE = 1 + (n_c − 1) × ICC, where n_c is the average number of subjects per cluster. DEs above 2 indicate violations of the independence assumption that result from nested data and thus suggest the need for multilevel modelling (Peugh, 2010). In our study, all DEs were < 2. Consequently, we performed multiple regression analyses.
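The design effect check can be sketched in a few lines. The function names are hypothetical, and the average class size used in the example (15 students) is an assumed value chosen for illustration; only the ICC upper bound of .06 and the cutoff of 2 come from the text.

```python
def design_effect(avg_cluster_size, icc):
    """Design effect: DE = 1 + (n_c - 1) * ICC."""
    return 1 + (avg_cluster_size - 1) * icc


def needs_multilevel(avg_cluster_size, icc, threshold=2.0):
    """True when the design effect exceeds the cutoff for multilevel modelling."""
    return design_effect(avg_cluster_size, icc) > threshold


# Assumed class size of 15 with the largest reported ICC (.06).
de_example = design_effect(15, 0.06)
```

Under these assumed values the design effect stays below 2, which is the situation in which single-level multiple regression is considered acceptable.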

Six multiple regressions examined the contribution of grade level (Grades 5 and 10), group (experimental or wait-list), time (T1–T2 or T2–T3 gain), and their interactions (Group × Time, Group × Grade level, Grade level × Time, and the three-way interaction Group × Time × Grade level) on students’ gains in sentence generation accuracy, sentence generation fluency, sentence reformulation, text macrostructure, language, and fluency. As prior research highlighted the influence of spelling on text generation skills (Berninger et al., 2011; Limpo et al., 2017; Sumner, Connelly, & Barnett, 2016), spelling skills were also included in all regression models as a control measure. Moreover, as preliminary analyses revealed pretest differences in text fluency (text generation speed) between the experimental and wait-list groups (see Results), pretest (T1) text fluency was controlled too. Statistical significance was adjusted to p ≤ .008 (Bonferroni corrections) to control for Type 1 errors. Pairwise comparisons were used to explore the Time × Group and Time × Group × Grade level interactions. Effect sizes were estimated by Cohen’s d (Cohen, 1988). Follow-up maintenance at 5 weeks from the end of the training (T3) was tested by paired-samples t-tests for the experimental group only, who received the intervention between T1 and T2.
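The model specification can be made concrete with a small sketch. Two points are assumptions rather than facts from the text: the familywise alpha of .05 (which, divided by the six regressions, gives approximately .008) and the 0/1 dummy coding of grade, group, and time. The function names are hypothetical.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def predictor_row(grade10, experimental, second_interval, spelling, t1_fluency):
    """Build the nine predictors for one gain observation.

    Assumed coding: grade10 = 1 for Grade 10, experimental = 1 for the
    experimental group, second_interval = 1 for the T2-T3 gain.
    """
    g, x, t = int(grade10), int(experimental), int(second_interval)
    return [
        g, x, t,       # main effects
        x * t,         # Group x Time
        x * g,         # Group x Grade level
        g * t,         # Grade level x Time
        x * t * g,     # Group x Time x Grade level
        spelling,      # control: spelling skills
        t1_fluency,    # control: pretest (T1) text fluency
    ]
```

Three main effects, four interaction terms, and two control measures give nine predictors per model, which matches the numerator degrees of freedom of the F tests reported in the Results.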

Results

For both grade levels, all measures except spelling scores showed skewness and kurtosis values ≤ 1. Nonnormal distribution was expected in spelling, as the majority of Italian students make almost no spelling errors by the end of primary school. As shown in Table 1, the experimental and wait-list groups did not differ in gender distribution, number of students with learning disabilities, or years in Italy. The only difference observed was age between the fifth graders’ experimental and wait-list groups. Differences between the experimental and wait-list groups on pretest scores were examined by t-tests. These analyses revealed a significant difference between the experimental and wait-list groups at both grade levels (Grades 5 and 10) in compositional text fluency (i.e., text generation speed). In the fifth-grade sample, the wait-list group was more fluent in writing at the pretest. In the 10th-grade sample, the experimental group showed greater pretest fluency. The 10th graders’ experimental and wait-list groups also differed in spelling skills, although the difference only approached statistical significance.

Effectiveness of the intervention: time, group, and grade-level effects

Multiple regressions

The multiple regression models are reported in Table 2 (for dependent measures at the sentence level) and Table 3 (for dependent measures at the text level).

Table 2 Parameter estimates for gains at sentence level with the intervention
Table 3 Parameter estimates for gains at text level with the intervention

Written sentence generation accuracy Group (B = −3.15, p < .005), time (B = −2.67, p < .005), the two-way interaction Group × Time (B = 7.04, p < .001), and the three-way interaction Grade level × Group × Time (B = −5.78, p < .005) were significant. The regression model was significant, F(9,220) = 4.54, p < .001, accounting for 16% of variance in written sentence generation skills. Figure 2 displays the interaction effects observed for written sentence generation accuracy.

Fig. 2 Effects of the intervention on students’ sentence generation accuracy. Note. T1–T2 = training experimental group; T2–T3 = training wait-list group

Written sentence generation fluency After Bonferroni corrections were applied, none of the variables accounted for significant variance in written sentence fluency. The regression model was, however, significant, F(9,220) = 4.91, p < .001, accounting for 17% of variance in written sentence generation fluency.

Written sentence reformulation Group (B = −1.26, p < .001), time (B = −1.29, p < .001), and the two-way interaction Group × Time (B = 2.25, p < .001) were significant. The regression model was significant, F(9,220) = 10.21, p < .001, accounting for 30% of variance in written sentence reformulation skills.

Text composition quality Two measures of written text quality were considered: macrostructural quality and quality of language.

Macrostructure Group (B = −2.17, p < .001), time (B = −2.36, p < .001), the two-way interaction Group × Time (B = 4.60, p < .001), the two-way interactions Grade level × Time (B = 1.25, p < .005) and Grade level × Group (B = 1.20, p < .005), and the three-way interaction Grade level × Group × Time (B = −1.69, p < .005) were all significant. The regression model was significant, F(9,220) = 24.42, p < .001, and accounted for 50% of variance in text macrostructural quality. The interaction effects are displayed in Fig. 3.

Fig. 3 Effects of the intervention on students’ text macrostructure. Note. T1–T2 = training experimental group; T2–T3 = training wait-list group

Language As for changes in language quality, group (B = −0.48, p < .001), time (B = −0.67, p < .001), and the two-way interaction Time × Group (B = 0.68, p < .001) were significant. The regression model was significant, F(9,220) = 5.98, p < .001, and accounted for 20% of variance in language quality.

Text composition fluency Group (B = −50.37, p < .001), time (B = −58.79, p < .001), the two-way interaction Group × Time (B = 77.64, p < .001), and the two-way interaction Grade level × Time (B = 58.38, p < .005) accounted for significant variance in text writing fluency. The contribution of pretest (T1) text fluency was also significant (B = −0.27, p < .001). The regression model was significant, F(9,220) = 5.90, p < .001, accounting for 20% of variance in text writing fluency.

In synthesis, significant interaction effects between group and time were found for all dependent variables, except for written sentence generation fluency. A three-way interaction, indicating different effects of the intervention depending on grade level (Grades 5 and 10), was observed only for written sentence generation accuracy and macrostructural text quality.

Pairwise comparisons

Table 4 displays the mean gains of the two age groups and the experimental and wait-list groups between the first time interval (T1–T2) and the second time interval (T2–T3).

Table 4 Mean gain scores by age group (fifth and 10th graders) and group (experimental and wait list) at the two time intervals (T1–T2 and T2–T3)

Pairwise comparisons between T1–T2 gain scores and T2–T3 gain scores were run to clarify the two main interactions Time × Group and Time × Group × Grade level. The results for the younger (fifth graders) and older (10th graders) groups are summarized in Table 5. Table 5 reports the statistical tests and effect size (d) of the difference in gain between the two time intervals (T1–T2 gain minus T2–T3 gain). Positive t values indicate that greater gain occurred between T1 and T2 than between T2 and T3. Negative t values indicate greater improvement in the second time interval (i.e., T2–T3 gain > T1–T2 gain).
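The sign convention for these comparisons can be illustrated with a short sketch. The text does not say which variant of Cohen's d was computed; the version below (mean of the paired differences divided by their standard deviation) is one common choice for paired data and is used here only as an assumption. The function name and the gain values are hypothetical.

```python
from statistics import mean, stdev


def paired_gain_effect(gain_t1_t2, gain_t2_t3):
    """Effect size of the difference T1-T2 gain minus T2-T3 gain.

    Assumed variant: mean paired difference / SD of the paired differences.
    """
    if len(gain_t1_t2) != len(gain_t2_t3) or len(gain_t1_t2) < 2:
        raise ValueError("need at least two students with both gain scores")
    diffs = [first - second for first, second in zip(gain_t1_t2, gain_t2_t3)]
    return mean(diffs) / stdev(diffs)


# Made-up gains for four students, not data from the study.
early_gains = [4, 5, 3, 6]
late_gains = [1, 0, 1, 1]
d_experimental_like = paired_gain_effect(early_gains, late_gains)
d_waitlist_like = paired_gain_effect(late_gains, early_gains)
```

A group that improves mostly in the first interval yields a positive value, and a group that improves mostly in the second interval yields a negative value of the same magnitude, mirroring the positive and negative signs described for Table 5.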

Table 5 Pairwise comparisons: difference in gains between T1–T2 and T2–T3 by group: t-tests and Cohen’s d

Effects of the intervention on fifth graders’ writing skills

As shown in Table 5, the training led to significant improvement for both the experimental and wait-list groups in all the dependent measures, except for text language and text fluency, for which the gain was significant only for the wait-list group, and for sentence fluency, for which the wait-list group did not show significant improvement. Positive t values in Table 5 for the experimental group and negative t values for the wait-list group indicate that the gain in writing scores at the sentence and text level was greater between T1 and T2 than between T2 and T3 for the experimental group (i.e., T1–T2 gain > T2–T3 gain), and was greater between T2 and T3 than between T1 and T2 for the wait-list group (i.e., T1–T2 gain < T2–T3 gain; see also Figs. 2 and 3). This pattern confirms that improvement occurred as a consequence of the intervention in both groups. The effect sizes ranged from medium (d = −.49) to large (d = −1.33), with larger effect sizes in both groups on the sentence reformulation tasks (d = .81 for the experimental group and d = −.96 for the wait-list group) at the sentence level, and on text macrostructure (d = 1.12 for the experimental group and d = −1.33 for the wait-list group) at the text level. For the wait-list group, the effect size was also large for gains in text language (d = −.86) and text fluency (d = −.90).

Effects of the intervention on 10th graders’ writing skills

For these older writers, the training led to significant improvements in both the experimental and wait-list groups only in the macrostructural quality of the texts (see Fig. 3). The effect size was large for the experimental group (d = 1.04) and medium for the wait-list group (d = −.69). For text language, the training proved to be effective for the wait-list group only (the effect size was large: d = −1.03). Although the improvement in sentence reformulation skills was not significant after Bonferroni corrections, the size of the effect for sentence reformulation skills was moderate (d = .63 for the experimental group and d = −.41 for the wait-list group).

Summing up, the pairwise comparisons confirmed the existence of different training effects for students of different grade levels: The training was less effective in boosting sentence-level language skills for the older writers. By contrast, at the text level, it led to significant improvements for both age groups, especially when macrostructural quality was considered.

Follow-up

For the experimental group, who received the intervention between T1 and T2 (first time interval), we also tested the maintenance of the training effects at a 5-week follow-up. Paired samples t-tests between the students’ writing scores (sentence generation accuracy and fluency, sentence reformulation, text macrostructure, text language, and text writing fluency) at T2 and T3 revealed no significant decline in writing scores for any of the dependent variables. Table 4 shows that little change occurred between T2 and T3 for the experimental groups.

Discussion

Oral language is a largely neglected component of instructional writing interventions. In this study, we tested the effectiveness of a nine-session classroom-level writing intervention focused on students’ oral sentence construction and reformulation skills. The students involved in this trial had mostly automatized their transcription skills but were still struggling with text generation and oral language abilities, as noted by their teachers and demonstrated by the significant effects of the training.

The results revealed a general effectiveness of oral language training for students’ writing, especially for the younger participants (fifth graders), with significant improvements in both the experimental and wait-list groups in all dependent sentence-level writing measures after the intervention, except for sentence fluency. That is, the younger writers transferred the oral sentence generation skills acquired during the training to writing (written sentence generation and reformulation, representing close transfer). Only sentence fluency did not significantly improve for the wait-list group. The benefits of the training generalized for these fifth graders to text composition as well (far transfer), leading to significant improvements in the macrostructural quality of texts produced after the intervention, with large effect sizes (d = 1.12 for the experimental and d = −1.33 for the wait-list group).

For the older writers (10th graders), the effects of the training were overall not significant at the sentence level. However, these writers significantly benefitted from the intervention at the text level. The oral (sentence generation) intervention led to significant improvements in the macrostructural quality of their texts, with large (d = 1.04) and moderate (d = −.69) effect sizes for the experimental and wait-list groups, respectively. The results of the training were also significant for text language in the wait-list group only (d = −1.03).

Sentences represent the natural processing units of text generation (Berninger et al., 2011; Saddler & Graham, 2005), as ideas are translated into sentences when building a text. Developing the ability to translate ideas into sentences thus represents a foundational writing skill and a prerequisite for higher-level coherence-making writing processes. It is therefore hardly surprising that an intervention focused on sentence construction and reformulation led to significant improvements in the macrostructural quality of the texts produced. Similar findings have been obtained in other studies focused on sentence construction (sentence-combining) skills (Limpo & Alves, 2013; Saddler & Graham, 2005). However, these intervention studies did not test the transfer of oral sentence generation skills to writing. Instead, they trained students on written sentence-construction skills directly. Our intervention adds to this important research line by demonstrating that targeting oral language skills at the classroom level can benefit students’ writing.

The focus and nature of our training (oral vs written language) can explain the large effect sizes obtained in this study. The magnitude of the training effects (for the younger group in particular: d = .81 and −.96 for written sentence reformulation in the experimental and wait-list groups and d = 1.12 and −1.33 for macrostructural quality in the same groups) is larger than that reported by meta-analyses for interventions targeting grammar or sentence structure (ES = 0.50; Graham & Perin, 2007). Two differences distinguish the present training from those discussed in Graham and Perin’s meta-analysis. First, our training was based exclusively on oral language activities. Practicing in oral language, students could focus on idea translation without the additional burdens of transcription. Although relatively fluent in transcription, the participants in this study still made several spelling errors in their texts. This indicated that, under the pressure of text production, transcription skills could still represent a demand for them, as for many struggling writers. Second, all training activities in the present study were group-based and game-like (i.e., playful activities). Other authors have stressed the importance of engaging students in playful and rewarding language/writing activities (Boscolo et al., 2012). Writing is an extremely demanding task, and students must be deeply engaged and motivated to invest their cognitive and linguistic resources and time in this activity.

A finding of this study is that the effectiveness of the intervention was different for the two age groups. Targeting oral sentence generation skills did not lead to significant improvements in the written sentence generation skills of the older writers (10th graders); however, it did impact the development of their text-level writing: text macrostructural and (for the wait-list group) language quality. Table 1 shows that the older students were good at the sentence generation task and, as one would expect, performed better than the fifth graders on the sentence reformulation task as well. Thus, the oral language skills acquired with the training impacted their written production at higher (coherence-making) levels.

Comparing the effectiveness of an experimental intervention for students of different age, grade, or ability levels is important in making practical decisions on when and how to apply that specific intervention in the future (Jones et al., 2013). It is, for example, important to explore how specific instructional strategies or approaches match or can be adapted to the learning needs of various groups of students.

We expected significant effects on writing fluency, but our findings contradicted our initial hypotheses. Writing fluency increased significantly only in the younger wait-list group. As our training had a significant metalinguistic component, it is possible that the students simply spent more time reflecting on their linguistic choices (planning how to translate their ideas into text) at the expense of their writing fluency, which did not improve.

Another unexpected finding of this study was the different efficacy of the training for text language in the two groups (experimental and wait list). In both age groups (fifth and 10th graders), only the wait-list group improved significantly on this measure. Although no significant differences were found between the experimental and wait-list groups at the pretest (except for text fluency and age for the younger writers), three of the younger (fifth grade) students with learning disabilities and three of the older (10th grade) students who had lived in Italy fewer than 6 years were in the wait-list groups (see Table 1). It is possible that greater improvement of the wait-list groups in text language (vocabulary and syntax) was due to the fact that they included these (and perhaps other) students with greater language needs, who benefitted more from the intervention.

Instructional implications

The finding that oral language interventions can lead to improvements in writing has practical implications for the teaching of writing, especially considering that, for many students, poor transcription skills represent a significant barrier to their writing development (Graham & Santangelo, 2014; Hebert, Bohaty et al., 2018; Hebert, Kearns et al., 2018; Sumner et al., 2016). For these students, as well as for those with language-learning needs (Hoff, 2013), as were probably many of the students in this study, practicing text generation processes without the additional burden of transcription may represent an empowering learning experience that could lead to the development of both language and text generation skills.

Oral language activities like those proposed in this study can be particularly beneficial because they are based on immediate peer feedback, collaborative work, and communicative tasks. All these elements may increase students’ engagement in the language tasks, stimulating the development of linguistic and metalinguistic skills at the individual level.

The idea that writing instruction can also be realized through oral language activities, especially during the elementary school years, leads to consideration of the role of oral language teaching in teachers’ professional training. Viewing oral language as an actual component (not only a prerequisite) of written language instruction necessitates a shift in teaching methods and strategies. Should future studies confirm the effectiveness of oral language interventions for the development of writing skills, teachers’ professional training could start integrating oral language components into their programs.

Traditionally, in Europe and the United States, oral language intervention has been considered the field of speech-language pathologists, whose task has typically been to support the oral language underpinnings of reading and writing (Silliman, 2014). The preliminary findings of this study offer a different perspective to these practitioners as well, suggesting that integrated oral language-writing interventions could also be useful in supporting children’s language/literacy development.

Limitations and future directions

In assessing intervention effectiveness, we must consider both the strength of an intervention’s effects and the internal and external validity of the intervention (McMaster et al., 2018). In this study, the experimenter administered the training directly. Although this approach reduces the risks of low treatment fidelity significantly, it also represents a threat to the external validity of a study, as it is uncertain whether classroom teachers would be able to conduct the intervention themselves. In our study, teachers were involved and assisted the experimenter but were not responsible for conducting the training. Our next step will be to test whether these positive effects can extend to situations in which classroom teachers are in charge of interventions.

Other methodological limitations of this study concern the lack of a precise measure of treatment fidelity for students’ participation in the intervention at the individual level, and the high attrition recorded in the study. As for treatment fidelity, we checked that each classroom received instruction for the same length of time and designed the intervention so that all students were required to take turns leading the teamwork sessions. However, we did not monitor individual students’ participation in the training with systematic observations (Ledford et al., 2014). Systematic observations could better ensure that students’ engagement in oral language activities was similar among participants and classrooms. As noted in the Participants section, attrition in this study was also high, in particular among our older participants. For students with a low socioeconomic background, as were many of these high school students, school absenteeism can be high (Klein et al., 2020). Unfortunately, it was impossible to make up the missed assessment or training sessions of these students due to scheduling conflicts. This led to a high dropout rate.

A final limitation of this study concerns the long-term efficacy of the intervention. The training was largely effective, especially for younger writers (fifth graders), and its positive effects were maintained 5 weeks after the end of the intervention. However, a long-term follow-up would be needed to conclude that this relatively short (3-week) intervention produced stable gains in students’ writing skills. Although long-term follow-ups may have significant costs in terms of dropouts, they represent a unique means to ascertain the long-term efficacy of an intervention, which is what is most important in school.

Conclusions

Writing research has demonstrated that multicomponent interventions combining the strengths of focusing on transcription and text generation (Berninger et al., 2002), or on transcription and self-regulation (Limpo & Alves, 2018), are generally very effective in developing writing skills. The effects of a combined (multicomponent) intervention could be further emphasized when students are allowed to focus on (linguistic) text generation processes without the burdens of transcription while developing transcription skills in ad hoc activities. On the other hand, interventions combining self-regulation and (linguistic) text generation processes could be another promising avenue to support the development of high-level writing skills in older writers who have automatized spelling and handwriting. In school, some basic oral language skills, such as phonological skills, are currently overemphasized, while other higher-level language skills, such as vocabulary and syntactic skills, do not seem to receive sufficient attention. Evidence-based instruction of these skills is still lacking.