Reading is one of the most critical academic skills that children acquire. Early reading skills predict later reading ability, language ability and general knowledge (Rigney et al., 2020). In contrast, illiteracy costs the UK economy £36 billion/year in lost earnings, benefit payments and increased medical costs (Hazell, 2018). There are also links between low literacy and depression, obesity, and life expectancy (Griggs & Walker, 2008). One societal group at greater risk of experiencing reading difficulties is children from disadvantaged backgrounds. Such children are more likely to live in cramped housing where they share rooms, are more likely not to have eaten before attending school, are less likely to have completed their homework and suffer from poorer attendance than their peers (Kellett & Dar, 2007). They are also more likely to engage in challenging behavior, aligned with poorer concentration, and are more likely to be bullied for being poor (Thompson, 2020). Disadvantaged children are therefore more likely to suffer barriers to learning, and one result of this is a higher level of literacy difficulties. When people with poor literacy have children, there is likely to be a cycle of illiteracy whereby they are unable to help with homework, more likely to be unemployed, more likely to encounter social isolation, and suffer from lack of self-esteem and poor health (Hirsch, 2007).

In general, Northern Ireland (NI) has a lower level of educational attainment than the rest of the United Kingdom (UK). It also has a high number of people living in poverty: in 2014, around 17% of the population (370,000) lived in relative poverty of whom 110,000 were children (Joseph Rowntree Trust, 2018). Furthermore, there is a large gap in educational attainment between richer and poorer children, and NI had a consistently higher number of adults leaving education without qualifications compared to the rest of the UK (17% compared to 9% in 2014) (Barnard, 2018).

While there is unanimous agreement between educators and researchers that targeting literacy skills at an early age is crucial to closing the literacy gap between disadvantaged children and their non-disadvantaged peers, the most effective method of doing so is not agreed. To counter the impact of disadvantage, UK primary schools have received (since 2011) increased funding for every pupil who has been registered for free school meals at any time in the last 6 years. This money must be spent on activities designed to raise the attainment level of these disadvantaged pupils. Using these funds, many schools provide literacy activities in addition to regular, teacher-delivered classroom instruction, and invest in additional resources, including computer technology and equipment, in efforts to support the pupils who need it most. However, increased school resources do not necessarily result in improved pupil performance (Hanushek, 1997), and this may be indicative of the lack of a systematic, evidence-based approach to the purchase and use of educational resources. Levačić and Vignoles (2002) argue that until better empirical evidence on the impact of using resources in different ways becomes available, it is difficult to provide guidance to head teachers on how best to allocate their resources.

Computer-assisted instruction (CAI) can be used to provide an additional means of support to teach beginning reading skills, with most programs designed to supplement teacher instruction (Blok et al., 2002). Headsprout Early Reading (HER), the intervention reading program used in this study, and Headsprout Comprehension are two CAI programs developed in a rigorous manner based on the scientific principles of behavior analysis (Layng et al., 2003, 2004, 2011). There is an emerging research base that shows HER to be useful for teaching students key foundational reading skills which match those identified as critical by the National Reading Panel (2000). HER is designed to bring users to a proficient level of reading in 80 episodes of 20 min each. Previous studies of the impact of HER on literacy performance (Huffstetter et al., 2010; Twyman et al., 2011) indicated that partial completion of HER still resulted in significant literacy gains, especially for completion of the first 23 episodes, which concentrate on mastery of basic phonemic skills. HER has been evaluated in both school and home settings and with diverse populations of children, including typically developing children in home and school settings, those with intellectual disabilities, those with Autistic Spectrum Disorder and those from a care background (Layng et al., 2003, 2004; Clarfield & Stoner, 2005; Whitcomb et al., 2011; Grindle et al., 2013; Huffstetter et al., 2010; Tyler et al., 2015; Storey et al., 2017, 2020; Nally et al., 2021). However, a systematic review (Rigney et al., 2020) concluded that more rigorous research is required to support the widespread adoption of HER in mainstream schools, with publications to date showing only tentative support for effectiveness with this population. Rigney et al. suggested that future research should focus on school-based evaluations, employ more rigorous research designs, report effect sizes, and where possible use randomized assignment.

This study therefore aimed to extend the current evidence base of HER and fill a gap in knowledge by evaluating the impact on literacy skills of disadvantaged primary school children who were falling behind their peers in mainstream NI primary schools. Objectives were to investigate whether the addition of HER to existing school-based literacy instruction could bring about significant improvement in literacy skills, with particular focus on phonemic awareness, as well as word and sentence reading age. In a within and between groups design, children were randomly allocated to an intervention group, where the target was to work through 80 reading training episodes over 24 weeks of a school year, or a teaching as usual group. Reading skills were assessed in all children before and after intervention using a standardized reading assessment. This generated a sentence reading age and a phonics reading age. An additional assessment was carried out on pupils before, at the midpoint and after the intervention using a phonics identification test which contained three levels of difficulty. This was devised for this study in order to provide a measure that could show high and low rates, and could be applied several times during the study. It used sets of flashcards presented in one-minute timings. There were three levels of difficulty in the phonemes presented, and a correct and incorrect score was generated by each flashcard test. Ecological validity and sustainability of the intervention were planned to be enhanced by recruiting an HER coordinator at each participating school who was responsible for the children in the intervention group accessing and working through their HER episodes.

Existing research into the impact of HER has been carried out in various cohorts. This research is the first to look at the impact of HER on the literacy performance of primary school children in mainstream schools who are falling behind the expected reading level.

Methods

Participants and Design

Participants (n = 123) were recruited for inclusion in the study if they met the following criteria:

  i. They were attending a primary school with a higher-than-average uptake of free school meals at the beginning of the school year.

  ii. They were availing of free school meals at the beginning of the school year.

  iii. They had a reading age at least 1 year lower than their chronological age based on the results of the last annual literacy tests.

Each of eight participating schools was asked to select at least 15 potential participants who met the inclusion criteria.

Participants

As participants were under 18 years old, information and consent forms were distributed to their parents/guardians for completion. Schools were randomized to an intervention group or a waiting list control group at the pre-intervention stage by the process of simple randomization. This resulted in five schools (Schools 1–5) in the treatment group (n = 79, age range from 5 to 9 years, 43 females and 36 males, average age 90 months) and three schools (Schools 6–8) in the control group (n = 44, age range from 5 to 9 years, 21 females and 23 males, average age 88 months).

A repeated measure, between and within group design was used to evaluate the impact of engagement with the HER program on the literacy performance of disadvantaged primary school children. This was compared to the impact of teaching as usual. All participants were assessed before intervention, at midpoint of the HER intervention (approximately 12 weeks later) and post-intervention (approximately 26 weeks after the first assessment).

Measures

These were the Phonics Early Reading Assessment (PERA), a reading test providing a sentence reading age and a phonics reading age developed by McCarty and Ruttle (2012); and the Flashcard Identification Test (FIT), a bespoke phonics identification test which tested individual level of fluency for each pupil.

The PERA is a standardized phonics and early reading assessment which is based on the requirement to read a series of pre-determined words and non-words. PERA takes approximately 10 min to administer and tests phonic knowledge and fluency. It consists of two tests. The first test administered was the sentence reading age (SRA) assessment. Pupils were asked to read a short story from a pre-designed double-sided card which contained 50 words. No indication was given to the pupil if answers given were correct or incorrect. The words were scored as pronounced correctly or incorrectly. The word on which the pupil made their fifth error indicated their sentence reading age in years and months. The second PERA assessment used was the phonics reading age (PRA) test. Pupils were asked to read 50 words spread over three sides of pre-designed A4 cards. A combination of real and nonsense words was used. There was no time limit set to read each card. The total number of correct answers provided by a pupil out of 50 was cross-referenced to a scoring table of established norms which provided a phonics reading age in years and months.

The FIT was designed to measure a participant’s fluency in identifying the relationships between graphemes (written letters) and phonemes (spoken sounds) for the 44 phonemes that make up the English language. These 44 phonemes are represented by combinations of the 26 letters of the alphabet and are taught as part of the curriculum in NI primary schools by the end of the third primary school year. None of the phonemes in the FIT assessment should have been novel to any participants. Headsprout incorporates these 44 phonemes into the 80 episodes of its teaching system. Each of the three FIT tests contained 50 cards, and each pupil’s performance on seeing and saying the sound combinations was timed for one minute. Each 1-min timing provided a correct and incorrect rate of responding on the set tested. The level of difficulty of the phonemes used increased from FIT1 to FIT2 and again from FIT2 to FIT3. In each test, flashcards were held up in front of the pupil over a 60-s period, timed via a stopwatch app on a mobile phone. The pupils were asked to say the letter sounds rather than the letter names on each card. The pack of assessment cards was held in the researcher’s right hand. Each card was turned over and held up in front of the pupil by the researcher’s left hand where both the pupil and a research assistant (RA) could see it. Each RA was blind as to which schools and pupils were part of the treatment and control groups. The RA sat beside the pupil, opposite the researcher, so they could see the card and score the response. The researcher placed the card in a correct/incorrect pile based on the pupil’s answer. Correct and incorrect answers were tallied at the end of each 1-min timing, and the researcher’s scores (e.g., “21–4” with the first number the correct answers and the second the incorrect/no answer given score) were compared with the RA’s scores to calculate inter-observer agreement. During the test, no verbal feedback was delivered.
Copies of the FIT tests are available on application to any of the authors.
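
As an illustration of how each 1-min timing yields a correct and an incorrect rate of responding, the tally can be sketched in Python as follows. The function name `fit_rates` and the option to rescale timings of other lengths are assumptions made for this sketch and are not part of the published procedure; the counts used are the illustrative “21–4” score given in the description of the FIT.

```python
def fit_rates(correct, incorrect, seconds=60):
    """Return (correct, incorrect) responses per minute for one FIT timing.

    Hypothetical helper: the counts are taken from the researcher's
    correct and incorrect/no-answer card piles.
    """
    if seconds <= 0:
        raise ValueError("timing length must be positive")
    if correct < 0 or incorrect < 0:
        raise ValueError("counts cannot be negative")
    per_minute = 60 / seconds  # scale factor; 1.0 for a 60-s timing
    return correct * per_minute, incorrect * per_minute


# Example score "21-4": 21 correct, 4 incorrect or unanswered in 60 s.
correct_rate, incorrect_rate = fit_rates(21, 4)
```

Because every timing in the study lasted 60 s, the rates equal the raw tallies; the scaling only matters if a timing of a different length were used.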

Procedure

Pupils in the treatment group had access to a laptop, PC, or iPad/tablet with headphones and a wired/Wi-Fi internet connection to access the HER program. The school HER coordinators were present while pupils used the program. Treatment schools committed to arranging for four 30-min sessions per pupil per school week at the outset of the research. HER sessions were carried out at the five schools in the treatment group in computer suites, classrooms and corridors based on the available resources each day in each school. Each school HER coordinator was responsible for timetabling HER on a weekly basis and then informing relevant staff of the days and times sessions were scheduled for pupils.

Assessments and interventions were conducted as follows in all eight schools:

  1. Assessment period 1: Baseline performance assessment using PERA and FIT measures. The HER intervention then began in the five treatment group schools immediately following this assessment. HER sessions ran over a 6-month period from January to June. Pupils were expected to complete one lesson or episode per session. HER coordinators interacted with pupils only to help with technical difficulties as all instructions are contained within the HER program. While pupils in the treatment group were receiving the HER intervention, pupils in the control group received teaching as usual (TAU). The TAU content and methods used varied from school to school but generally involved phonics training using programs such as Jolly Phonics (Lloyd & Wernham, 2005), teacher word recognition tasks and sentence reading using the student’s current classroom reading book.

  2. Assessment period 2: This occurred at the midpoint of the intervention, around 12 weeks after Assessment period 1, and only the FIT measures were used.

  3. Assessment period 3: Post-intervention assessment occurred once HER intervention ended, using the PERA and FIT measures. This occurred about 26 weeks after Assessment period 1.

Pupils were given clear verbal instructions of what they were being asked to do for each section of the assessment. Short breaks were offered to any pupils who required them during the assessment.

Inter-Observer Agreement (IOA)

A research assistant observed and recorded all student responses during all assessments. IOA was calculated by dividing the number of observer agreements by the number of judgments and expressing the result as a percentage. IOA was above 98% for all baseline, midpoint and post-intervention assessments. All data collected from the FIT assessments were subject to IOA. No formal review of procedural integrity was carried out because the computer-based HER program itself monitored student progress and ensured lessons were completed in the correct order.
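
The IOA calculation described here (agreements divided by judgments) can be sketched as below. This is a minimal illustration only: the function name `ioa_percent`, the choice of each card response as the unit of judgment, and the example scores are all assumptions, since the text does not specify how individual judgments were paired.

```python
def ioa_percent(researcher, assistant):
    """Percentage of judgments on which two observers agree."""
    if len(researcher) != len(assistant):
        raise ValueError("both observers must score the same judgments")
    if not researcher:
        raise ValueError("no judgments to compare")
    # Count the positions where both observers recorded the same outcome.
    agreements = sum(r == a for r, a in zip(researcher, assistant))
    return 100 * agreements / len(researcher)


# Hypothetical card-by-card scores (True = correct) for five flashcards.
researcher_scores = [True, True, False, True, True]
assistant_scores = [True, True, False, False, True]
agreement = ioa_percent(researcher_scores, assistant_scores)  # 4 of 5 agree
```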

Program Completion

All participants completed at least the first 23 episodes of HER, which focus on mastery of basic phonemic skills. The average number of episodes completed within the treatment group was 49 at an average completion time of 31 min per episode. Pupils completed an average of approximately 2 episodes per week over the 24 weeks of the study.

Results

Of the 123 pupils in the study, 27 achieved baseline PERA sentence/phonics reading age scores at Assessment 1 that were within 12 months of their chronological age. The results for these pupils were not included in the PERA results analyses; for those analyses, n = 96. The FIT assessment results were analyzed for all participants, n = 123.

Figure 1 (upper panel) shows the group mean SRA scores taken at baseline and post-intervention for the pupils in the treatment group (n = 54) and the control group (n = 37). There was a marked increase in mean SRA of 17 months from 61.5 months to 78.5 months for the treatment group. There was also an increase in mean SRA for the control group of 7.6 months from 59.5 months to 67.1 months. A summary of all ANOVAs carried out is shown in Table 1. In each of these, there was one within-subjects factor, time of testing, and one (group) or two (group, and FIT level) between-subjects factors. As shown in the table, many of the effects reported were significant, with a large majority also having a large effect size. For SRA, there was a main effect of group, and of time of testing, with an interaction between these two factors. Independent sample t tests showed no significant difference in SRA between the treatment group and the control group at baseline, t(93) = 2.193, NS, but a significant post-intervention difference in SRA between treatment and control groups, t(89) = 5.98, p < 0.001, d = 1.31, a large effect size. Because both groups showed rises in mean SRA scores between the two times of testing, difference scores (post-intervention SRA–baseline SRA) were calculated and compared between groups with an independent samples t test. The mean increase for pupils in the intervention group (17.0, SD = 6.75) was significantly greater than for the control group (7.65, SD = 7.12), t(88) = 6.28, p < 0.001, d = 1.36, a large effect size.
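
The comparison of difference scores can be illustrated from the summary statistics alone. The sketch below uses the standard pooled-variance formulas for an independent samples t test and Cohen's d; the text does not state which software or formula variant was used, so this is an assumption. The group sizes (54 and 37) are taken from the Figure 1 description and are also an assumption for this test, which is why the output is close to, but not necessarily identical to, the reported t(88) = 6.28 and d = 1.36.

```python
import math


def pooled_t_and_d(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-samples t (equal variances) and Cohen's d from summaries."""
    df = n1 + n2 - 2
    # Pooled variance weights each group's variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    pooled_sd = math.sqrt(pooled_var)
    d = (mean1 - mean2) / pooled_sd
    t = d / math.sqrt(1 / n1 + 1 / n2)
    return t, d, df


# SRA difference scores (months) as reported; group sizes are assumed.
t, d, df = pooled_t_and_d(17.0, 6.75, 54, 7.65, 7.12, 37)
```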

Fig. 1

Group means (+/− SE) in months of sentence reading ages (upper panel) and phonic reading ages (lower panel) at baseline and at post-intervention test

Table 1 Results of ANOVAs

Figure 1 (lower panel) shows the group mean PRA scores taken at baseline and post-intervention for the treatment group (n = 71) and the children who had teaching as usual in the control group (n = 41). As seen in the figure, there was a marked increase in mean PRA of 12.1 months from 61.3 months to 73.4 months for the pupils in the treatment group. There was also an increase in mean PRA for pupils in the control group of 7.8 months from 55.8 months to 63.6 months (n = 39). As shown in Table 1, there was a main effect of group, and of time of testing, with an interaction between these two factors. Independent sample t tests showed a significant difference in PRA between treatment and control groups at baseline, t(119) = 3.54, p < 0.01, d = 0.55, a moderate effect size, and a significant difference in PRA between the treatment and control groups post-intervention, t(119) = 5.049, p < 0.001, d = 1.11, a large effect size. Because both groups showed rises in mean PRA scores between the two times of testing, difference scores (post-intervention PRA–baseline PRA) were calculated and compared between groups with an independent samples t test. The mean increase for pupils in the intervention group (12.2, SD = 5.43) was significantly greater than for the control group (7.80, SD = 7.36), t(110) = 3.13, p < 0.01, d = 0.82, a large effect size.

The FIT measures were conducted at baseline, midpoint and post-intervention. Mean scores for treatment and control groups of correct responses at these three time points for FIT1, FIT2 and FIT 3 are shown in Fig. 2. As is evident from the figure, all three measures increased with time, but these increases were more marked in the treatment group. These effects were confirmed by statistical analysis. As shown in Table 1, there was a main effect of group, of time of testing, and of FIT level, with two- and three-way interactions between these factors. By a least significant differences test, scores on FIT1 were significantly higher than those on FIT2 which were higher than those on FIT3. All other differences were explored by carrying out separate two-way ANOVAs for each FIT level. For the FIT1 measure, Table 1 shows there was a main effect of group, of time of testing, with an interaction between these two factors. Independent sample t tests showed no significant difference in FIT1 scores between the treatment group and the control group at baseline t(119) = 0.62, NS, but a significant difference at the midpoint test, t(116) = 6.11, p < 0.001, d = 1.58, a large effect size and at post-test, t(115) = 6.04, p < 0.001, d = 1.62, a large effect size. For the FIT2 measure, there was a main effect of group, of time of testing, with an interaction between these two factors. Independent sample t tests showed no significant difference in FIT2 scores between the treatment group and the control group at baseline t(119) = 2.59, NS, but a significant difference at the midpoint test, t(116) = 4.18, p < 0.001, d = 1.19, a large effect size and at post-test, t(115) = 4.32, p < 0.001, d = 1.27, a large effect size. For the FIT3 measure there was a main effect of group, and of time of testing, with no interaction between these two factors. 
Independent sample t tests showed no significant difference in FIT3 scores between the treatment group and the control group at baseline t(119) = 2.24, NS, or at the midpoint test, t(116) = 2.44, NS, but a significant difference at post-test, t(115) = 2.82, p < 0.01, d = 0.95, a large effect size.

Fig. 2

Group mean (+/− SE) correct scores on Flashcard Identification Tests, FIT1 (upper panel), FIT2 (middle panel), and FIT3 (lower panel) at baseline, midpoint, and post-intervention test

Mean scores for treatment and control groups of incorrect responses at the three time points for FIT1, FIT2 and FIT 3 are shown in Fig. 3. As is evident from the figure, all three measures decreased with time, but these decreases were more marked in the treatment group. These effects were confirmed by statistical analysis. As shown in Table 1, there was a main effect of group, of time of testing, and of FIT level, and an interaction between time and group, but no other interactions. By a least significant differences test, scores on FIT1 were lower than those on FIT2 which were lower than those on FIT3 (p < 0.001 in each case). All other differences were explored by carrying out separate two-way ANOVAs for each FIT level. For the FIT1 measure, there was a main effect of group, of time of testing, and an interaction between these two factors. Independent sample t tests showed no significant difference in FIT1 scores between the treatment group and the control group at baseline t(119) = 0.59, NS, but a significant difference at the midpoint test, t(116) = 6.04, p < 0.001, d = 1.01, a large effect size, and at post-test, t(115) = 4.60, p < 0.001, d = 0.69, a moderate effect size. For the FIT2 measure there was a main effect of group, and of time of testing, with no interaction between these two factors. Independent sample t tests showed no significant difference at baseline t(119) = 1.54, NS, but a significant difference at the midpoint test, t(116) = 3.33, p < 0.01, d = 0.35, a small effect size, and at post-test, t(115) = 4.17, p < 0.001, d = 0.58, a moderate effect size. For the FIT3 measure there was a main effect of group, a main effect of time of testing, and an interaction between these two factors. 
Independent sample t tests showed no significant difference in FIT3 scores between the treatment group and the control group at baseline t(119) = 0.82, NS, or at the midpoint test, t(116) = 2.90, NS, but a significant difference at post-test, t(115) = 4.07, p < 0.001, d = 0.51, a moderate effect size.

Fig. 3

Group mean (+/− SE) incorrect scores on Flashcard Identification Tests, FIT1 (upper panel), FIT2 (middle panel), and FIT3 (lower panel) at baseline, midpoint, and post-intervention test

Discussion

This study evaluated the impact of engagement with the HER literacy program on the core reading skills of disadvantaged primary school children in NI who were initially at least a year behind their peers in reading attainment. The researchers sought to make the study more sustainable by involving teachers from the schools as HER coordinators and using the equipment and facilities already in the schools where possible. The results indicate that the children in the treatment group who received the Headsprout intervention made significantly greater gains in the various measures used to test literacy performance than pupils in the control group receiving teaching as usual.

Importantly, average sentence reading age improved by 17.6 months for treatment group children in comparison to 7.1 months for control group children, while average phonics reading age improved by 12.1 months for treatment group children in comparison with 7.8 months for control group children (see Fig. 1). On average, children in the HER treatment group gained an additional 11.6 months on sentence reading age and an additional 6.1 months on phonics reading age (relative to standardized age group norms), while those in the control group gained on average only 1.1 and 1.8 months, respectively, in the 6-month time period. Similar findings were obtained by Storey et al. (2019) evaluating the potential of HER as a supplementary support for at-risk pupils in NI. However, the present study increased the sample size of study participants to 123 pupils across eight schools, with randomization to treatment or waiting list control condition (the control schools were offered access to HER training after the study reported here was completed). In addition, this study reports large effect sizes for both the main effect of treatment and the interaction between treatment and time, on both sentence reading age and phonics reading age. These findings add further evidence to a body of literature showing positive impacts on children’s reading using HER (Huffstetter et al., 2010; Pindiprolu & Forbush, 2009; Storey et al., 2019; Twyman et al., 2011; Tyler et al., 2015; Watkins et al., 2016) and expand on current literature by employing a larger sample size and randomization, and by calculating effect sizes as well as standardized scores. The accelerated learning seen for the HER intervention group is vital in the context of pupils who are struggling with literacy and falling behind the performance of their peers (Rose, 2009). These findings are very encouraging, especially in relation to efficiency and efficacy, considering the 6-month timeframe of the study.
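
The "additional" gains quoted in this paragraph are consistent with subtracting the 6 months of elapsed study time, over which age norms advance by the same amount, from each group's mean reading-age gain. The following minimal sketch makes that arithmetic explicit; the subtraction rule is an inferred reading of the text rather than a stated procedure, and the names used are hypothetical.

```python
STUDY_MONTHS = 6  # elapsed time; norms are assumed to advance by this much


def gain_beyond_norms(reading_age_gain, elapsed_months=STUDY_MONTHS):
    """Reading-age gain (months) in excess of the gain expected from time alone."""
    return round(reading_age_gain - elapsed_months, 1)


# Mean gains in months as quoted in this Discussion paragraph.
reported_gains = {
    ("treatment", "SRA"): 17.6,
    ("treatment", "PRA"): 12.1,
    ("control", "SRA"): 7.1,
    ("control", "PRA"): 7.8,
}
additional = {key: gain_beyond_norms(g) for key, g in reported_gains.items()}
```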

The study also incorporated repeated measures of fluency using 1-min timed tests with sets of flashcards, with phonemes at three levels of difficulty. The FIT measures showed good validity in that lower correct scores and higher incorrect scores were obtained at baseline as the level of difficulty increased from FIT1 to FIT3. Over the six months of the study, correct scores tended to increase (see Fig. 2) and incorrect scores tended to decrease for all participants (Fig. 3). However, children in the HER treatment group showed greater improvements in phoneme fluency across FIT1, FIT2, and FIT3 assessments with higher rates of correct scores and lower rates of incorrect scores. This pattern supports the results of the standardized reading test measures for both groups, and provides some validation of the measures developed for this study to test accuracy and speed of the key elements of reading fluency (Hudson et al., 2005; Rasinski, 2012; Shanahan, 2012). In addition, similar to findings reported by Nally et al. (2021), the significant increases in rate of word reading and non-word reading by participants in the treatment group suggest that HER is effective in promoting fluency, which is important for the generalization of decoding strategies (Kent & Street, 2013).

Based on the literacy benefit demonstrated in this study, the potential return on investment of use of HER is significant. Whereas existing research into the evidence base of reading programs and interventions used in the UK found a wide variety of gains, effect sizes and claims of efficacy (Brooks, 2016), the results from this study support the findings of previous studies which resulted in improved literacy performance from children in mainstream schools using Headsprout in the US, the UK and NI (Twyman et al., 2011; Tyler et al., 2015; Watkins et al., 2016; Storey et al., 2017; Storey et al., 2021). Results of this research and others, along with performance figures from the Department of Education for NI (Department of Education, 2015, 2019), demonstrate that teaching as usual incorporating an eclectic approach is not always effective for this population. The implications of ineffective strategies for children struggling with literacy in terms of education and future wellbeing are profound. The attainment gap appears early and continues to increase throughout school (Department of Education, 2015, 2019) and, as lessons become more difficult, it is likely the attainment gap will increase (Goodman & Gregg, 2010). Subsequent studies (Nally et al., 2021; Thompson et al., 2021; Pindiprolu & Forbush, 2009) have shown the use of evidence-based practice (EBP), such as HER, offers a possible solution to the attainment gap, indicating that well-targeted resources can have important effects for modest outlays. Although research suggests the impact of EBP is limited to between 10 and 30% of the variance in pupil outcomes (Teddlie & Reynolds, 2000; Muijs, 2009), an increase of up to 30% in performance would make a significant impact in closing the literacy attainment gap.

The small amount of qualitative information gathered from participating teachers showed that schools can utilize evidence-based CAI approaches easily and effectively when initial training and ongoing support are offered. While there were some minor issues reported, these mainly concerned the availability of school resources and adequate Wi-Fi signal, rather than issues with HER software. In fact, respondents reported that they would continue to use Headsprout in the future having seen the positive impact on pupil progress. These findings, although from a small sample of teachers, are important for informing future research and development of the training and support required by schools and services to better equip them in decision making and use of funding available. Further well-designed research is warranted into the impact of Headsprout Early Reading on literacy skills (Rigney et al., 2020). Equally important is a focus on what educators require in order to identify and select EBP methods, and effectively and efficiently utilize them to close the attainment gap.