Autism spectrum disorder (ASD) is a set of neurodevelopmental disorders characterised by deficits in social communication and social behaviours, and by restrictive and repetitive behaviours, activities, or interests (American Psychiatric Association 2013). Based on epidemiological studies conducted over the past 40 years, the prevalence of ASD appears to be increasing globally (WHO 2020). Recent estimates among children aged 8 years in the USA indicate a prevalence of 1 in 54, with 4.3 males diagnosed for every female (Maenner et al. 2020).

Autism has a significant and persistent impact on the lives of those receiving a diagnosis as well as their families (Begum and Mamin 2019). Also, in financial terms, the cost of supporting children with ASDs is estimated to be £2.7 billion each year in the UK alone (Knapp et al. 2009). The financial demands, as well as the persistent and the pervasive impact of ASD on a person’s current and future well-being, highlight the importance of providing support, and Social Stories™ is a widely used intervention that is liked by professionals and acceptable to children on the autism spectrum and their families.

Social Stories

The Social Story™ (SS) intervention was developed for, and is used frequently to assist, children with autism spectrum disorder (Kokina and Kern 2010; Pane et al. 2015). SS are narratives consisting of personalised text and illustrations. The intervention was introduced by Gray and Garand (1993) to provide individuals with ASD with the information they may need to learn new information and to understand and function appropriately in different social situations. Gray and Garand (1993) originally recommended that Social Stories™ be used only with higher functioning verbal pupils and that the entire story be presented on a single sheet of paper without other visual distractions. Over time, some of these recommendations, particularly those related to story style and format, have changed (Gray 1998, 2004, 2010, 2015; Gray and Garand 1993). Social Stories have become a widely used intervention due to their low cost and accessibility, as well as their capacity to address parents’ support needs, such as managing challenging behaviour (Derguy et al. 2015; Wahman et al. 2019). Given how widely the intervention is used, as well as the variability in the recommendations for how it should be delivered over the 25 years since it was developed, professionals must follow evidence-based practice and recommendations. This should ensure that the intervention is sound and delivered appropriately (Suhrheinrich et al. 2014; Will et al. 2018). However, despite decades of research, there is still a question as to whether or not SS interventions should be considered an evidence-based practice (e.g. Test et al. 2011).

Evidence-Based Practice

Intervention literature moves quickly, and so does evidence-based practice, which is an active and dynamic concept (Wong et al. 2015). The Canadian Psychological Association Task Force on Evidence-Based Practice of Psychological Treatments (Dozois et al. 2012) defines evidence-based practice (EBP) as the conscientious, explicit and judicious application of the best available research evidence to inform clinical practice and service delivery. Conversely, the use of treatments that are based on poor-quality research tends to waste time and money, and to “prey upon the emotional vulnerability of parents and caregivers” (Zane et al. 2008, p. 44). Reports by the National Clearinghouse on Autism Evidence and Practice Review Team (Odom et al. 2010; Steinbrenner et al. 2020), which aimed to document possible new EBPs whilst continuing to validate existing EBPs, did not define the SS intervention as an EBP in its own right. However, they placed Social Narratives within the EBP category. In this case, Social Narratives were defined as “interventions that describe social situations in order to highlight relevant features of a target behaviour” (p. 29). Here Social Narratives were not considered “tantamount” to Carol Gray’s Social Stories™ (1993); rather, they were defined as a distinct type of narrative. Nevertheless, Social Stories were deemed to “fit” within the Social Narratives category. These conclusions were challenged by Zimmerman and Ledford (2017), who report variable outcomes and an absence of a sufficient number of rigorous studies on Social Narratives. They also advise professionals to be cautious about using social narratives in isolation with children.

Purpose of the Current Review

Clinical observation, qualitative research, single-subject research (SSR), and randomised controlled trials (RCTs) are amongst the research designs that can contribute towards an “evidence base” (APA 2006). However, most of the published research on SSs has been undertaken within “the constraints of a wholly positivist or quantitative paradigm” (Styles 2011, p. 424). Very few descriptive and qualitative research designs have been published. Studies such as Sandt (2008), a descriptive report explaining how the author used a SS to help students with autism participate in physical education (PE) lessons, and Smith (2001), who examined the impact on children’s social behaviour of a two-session workshop for groups of parents and teachers, provide descriptive evidence about the effectiveness of SSs. Nevertheless, although useful, such evidence could be considered anecdotal, particularly when considering the effectiveness of the intervention, i.e. the degree of beneficial effect in real-world clinical practice (Godwin et al. 2003).

Several reviews of literature, in the form of systematic reviews, meta-analyses, and narrative reviews, have been published, all of which contribute towards the current knowledge base on SSs. An increase in the quality of the experimental research over time has been noted by a number of these reviews, such as Test et al. (2011). However, the extent, range, and nature of research activity remain unclear, whilst the question of efficacy remains unanswered, especially because of the high variability in research quality that these reviews have identified. Thus, the purpose of the current review was to provide up-to-date information about the current state of SS research, with a specific focus on the effectiveness of SS interventions and the factors which influence outcomes. In turn, these findings could contribute further to the debate on whether or not SSs should be considered an EBP.


As reviews of SS already exist, the aim of the present study was not to conduct yet another literature review of one form or another, but rather to conduct a scoping review of the existing reviews. A scoping review is an exercise in mapping the existing literature (Ehrich et al. 2002). An adapted version of the scoping process outlined by Arksey and O’Malley (2005) was utilised for this review. This entailed (1) the identification of the research aims; (2) the searching for relevant studies; (3) the systematic selection of studies; (4) the charting of the data; and (5) the presentation of the results.

Identifying Relevant Studies

The electronic databases searched were CINAHL (EBSCO), A+ Education (Informit), ERIC, Education Source, PsycINFO, PubMed, ScienceDirect, Scopus, Web of Science, and ABI/INFORM Global. The search was limited to English language publications. Search terms used were “Social Story” and “Social Stories”. The terms were combined within a string using the Boolean operator “OR” and across strings using the Boolean operator “AND”. The publication date was not restricted. The search results were managed and analysed using EndNote™ X9 (Endnote 2013).
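The way the search terms were combined can be sketched in a few lines of Python. This is only an illustration: the function name `build_query` is hypothetical, the exact syntax accepted by each database differs, and the sketch does not reproduce the strings actually submitted to any of the databases listed above.

```python
def build_query(term_groups):
    """Join the terms inside each group with OR, then join the groups with AND.

    `term_groups` is a list of lists of search terms. Each term is wrapped in
    double quotes so that it is treated as an exact phrase (an assumption about
    the target database's syntax).
    """
    clauses = []
    for group in term_groups:
        quoted = [f'"{term}"' for term in group]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


# The two search terms reported in the text form a single OR group.
query = build_query([["Social Story", "Social Stories"]])
print(query)
```

Because autism-related terms were deliberately left out of the search, only one group is needed here; a second group would be joined to the first with AND.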

Study Eligibility and Selection

The objective of the initial search was to investigate the current state of research on SS. The search informed by this goal yielded 459 citations following the removal of duplicates and the exclusion of citations that were not in English, were not peer-reviewed, were not related to ASD, or were not about SS research. Titles and keywords of the remaining articles were then screened in more detail. Autism was not included in the original search terms as there is a wide range of potential variations, and reading the titles and keywords ensured that relevant studies were not erroneously excluded. This resulted in 119 full-text articles. The reading of these articles highlighted that they were already included in several reviews of literature, which ranged from syntheses of literature and systematic reviews to meta-analyses, comparative reviews, and descriptive reviews. The publication dates of these reviews ranged from 2004 to 2019. Thus, this scoping review included exclusively peer-reviewed reviews of literature. The final number of articles that met the inclusion criteria, and that were included in the current scoping review, was 17. The study search and selection process is presented in Fig. 1.

Fig. 1 PRISMA flow diagram for scoping review


The search results were analysed in terms of the aims identified for this scoping review. An inductive, data-driven analysis was carried out to identify themes that could appraise the current state of SS research. A deductive analysis, which aimed to map the elements reported in the reviews of literature that specifically focused on outcomes (i.e. effectiveness of SS interventions) and factors which influence outcomes, was also carried out. NVivo 12 software (QSR International 1999) was employed for this stage of the review, whilst a semantic content analysis of the data, as described by Braun and Clarke (2006), led to the coding of data according to the pre-defined research aims. Descriptive characteristics including title, author(s), year of publication, type of review, inclusion criteria, and the number of studies included in the review were extracted and organised. Key findings and conclusions, as well as recommendations for future research, were added to the descriptive information to create detailed extraction tables (Tables 1, 2, 3, and 4).

Table 1 Details of studies included in scoping review
Table 2 Summary of findings and recommendations of the articles included in the scoping review
Table 3 Research design of studies included in each article included in this scoping review
Table 4 Characteristics of sample of participants included in each article included in this scoping review


The 17 reviews included in this scoping study reported on a total of 120 individual studies focusing on SSs, which were conducted from 1995 to 2018. Some of these studies are included in more than one of the research syntheses included in this scoping review. Two themes were identified as a result of the deductive analysis, whilst a further three themes were identified as a result of the inductive analysis of the 17 articles. The themes are (1) research design of SS research, (2) effectiveness of SS, (3) factors influencing outcomes of SSs, (4) social validity of goals of SS interventions, and (5) maintenance and generalisation of SS intervention outcomes. The “factors influencing outcomes” category consists of two further subthemes, which are (3.1) environmental factors and (3.2) within-child factors. These themes and subthemes are organised in hierarchical order (see Fig. 2), which illustrates their relevance in relation to the research objectives of this scoping review.

Fig. 2 Themes identified from inductive and deductive analysis of articles that were included in the scoping review

1) Research design

The research design most frequently encountered in SS research is single-subject research (Ali and Frederickson 2006; Bucholz 2012). The only exception to this is Karkhaneh et al.’s (2010) review, where only randomised controlled trials (RCTs) and controlled clinical trials (CCTs) are included. Table 3 presents an overview of the different research designs that have been included in the reviews. AB designs (where “A” describes a baseline phase and “B” an intervention phase) feature regularly in reviews conducted before 2012. After 2012, the reviews highlight the recurrent use of designs that, unlike AB designs, can address threats to internal validity, such as variations of ABAB designs, alternating treatment designs, and multiple-baseline designs (which involve the concurrent measurement of two or more behaviours in a baseline condition, followed by the application of the treatment variable to one of the behaviours).

The quality of the single-case research designs utilised in SS research has been a persistent concern (McGill et al. 2015). The quality of the individual studies included in any review will determine the quality of the outcomes of that review, and will consequently affect the strength and validity of the claims it makes (Schlosser et al. 2007). The quality of single-case research should be rigorously analysed if conclusions on causal relationships between interventions and outcomes are to be drawn. Quality appraisal tools for single-case research are relatively novel (Lobo et al. 2017), but are available in the work of Horner et al. (2005) and in the What Works Clearinghouse Standards (Kratochwill et al. 2013).

Of the 17 reviews selected, only 8 applied quality appraisal criteria to the studies they reviewed. Reynhout and Carter (2011), Test et al. (2011), and Mayton et al. (2013) utilised the Horner et al. (2005) criteria. McGill et al. (2015) and Qi et al. (2018) utilised the What Works Clearinghouse (WWC) Standards. Karal and Wolfe (2018) and Aldabas (2019) used the National Autism Centre’s (NAC 2015) Scientific Merit Rating Scale (SMRS) as a means of objectively assessing whether the methods used in each study were sufficiently rigorous to determine whether or not SS intervention was effective for participants on the autism spectrum. Karkhaneh et al. (2010) only reviewed RCTs and CCTs, and thus utilised the Jadad Scale (Jadad et al. 1996), which is a validated scale used for assessing the quality of reports of RCTs.

Nine reviews did not evaluate the methodological quality of the studies that they included. These included Sansosti et al. (2004); Kokina and Kern (2010); Reynhout and Carter (2006); Styles (2011); Bucholz (2012); Rhodes (2014); Saad (2016); and Rodríguez et al. (2019).

With the introduction of standards in 2005 (Horner et al. 2005) as well as the WWC standards (Kratochwill et al. 2013), an increase in study quality has been reported. In McGill et al.’s (2015) review, the authors explain that only one of the studies published from 1995 to 2004 that were included in their review met all seven of WWC’s design standards. In contrast, eight studies published from 2005 to 2012 met all seven of the design standards. Overall, the 15% increase in average indicators met, as well as the nominal increase in the number of studies meeting all of the standards across the periods, provide some evidence of systematic improvements in single-case research quality over time.

2) Effectiveness of SS

The main focus of the reviews of literature that were included in this scoping study was to investigate the effectiveness of SS intervention. Sansosti et al. (2004) concluded that the empirical foundation regarding the effectiveness of Social Stories is limited. Kokina and Kern (2010) argue that their findings are indicative of questionable effectiveness of Social Story interventions for students with ASD. Reynhout and Carter (2006), Sansosti et al. (2004), and Bucholz (2012) argue that the effects of SS are highly variable. More recent reviews, such as Karal and Wolfe (2018), Qi et al. (2018), and Aldabas (2019), indicate that the SS research published since 2013 has increased in quality and has also reported relatively higher effectiveness ratings.

The mean difference effect size statistic was used to interpret the outcomes of interventions evaluated using group designs (0.80 = large effect size, 0.50 = moderate, and 0.20 = small). Four metrics were used to interpret the efficacy of SS intervention in single-case experimental designs. Visual analysis ratings (VARs) range from +2 to −2 (+2 = significant decrease in target behaviours from baseline, +1 = moderate decrease, 0 = little to no decrease, −1 = moderate increase, −2 = significant increase). For the percentage of non-overlapping data (PND) statistic, PNDs > 80 are indicative of a strong effect, 60 to 79 of a moderate effect, and PNDs < 60 of a negligible effect. For the improvement rate difference (IRD; range 0 to 1.00), scores > 0.75 indicate very large effect sizes, scores between 0.70 and 0.75 large, scores between 0.51 and 0.70 moderate, and scores less than 0.50 small effect sizes. For points exceeding the median (PEM; range 0 to 1), a PEM < 0.7 reflects an intervention that is not effective, a PEM of 0.7 to 0.9 moderate effectiveness, and a PEM of 0.9 to 1 a highly effective treatment. Although the IRD is claimed to be the most strongly validated of these metrics when compared to PND and PEM (Parker et al. 2007), it seems to have been employed to a limited extent, whilst PND is the most used metric in the reviews.
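As a rough illustration of how two of the overlap metrics above are obtained, the following Python sketch computes PND and PEM for a single AB comparison and maps each result onto the bands quoted above. The function names and the data are hypothetical and are not drawn from any reviewed study. The sketch assumes a target behaviour that is expected to decrease, does not count points tied with the baseline extreme or median, and, because the quoted bands leave some boundary values unassigned, places a PND of exactly 80 in the moderate band and a PEM of exactly 0.9 in the highly effective band.

```python
from statistics import median


def pnd(baseline, intervention, decrease_expected=True):
    """Percentage (0-100) of intervention points beyond the most extreme baseline point."""
    if not baseline or not intervention:
        raise ValueError("both phases need at least one data point")
    if decrease_expected:
        lowest = min(baseline)
        beyond = sum(1 for x in intervention if x < lowest)
    else:
        highest = max(baseline)
        beyond = sum(1 for x in intervention if x > highest)
    return 100.0 * beyond / len(intervention)


def pem(baseline, intervention, decrease_expected=True):
    """Proportion (0-1) of intervention points on the improved side of the baseline median."""
    if not baseline or not intervention:
        raise ValueError("both phases need at least one data point")
    mid = median(baseline)
    if decrease_expected:
        beyond = sum(1 for x in intervention if x < mid)
    else:
        beyond = sum(1 for x in intervention if x > mid)
    return beyond / len(intervention)


def classify_pnd(score):
    """Bands as quoted in the text; values from 79 up to and including 80 are assumed moderate."""
    if score > 80:
        return "strong"
    if score >= 60:
        return "moderate"
    return "negligible"


def classify_pem(score):
    """Bands as quoted in the text; exactly 0.9 is assumed to be highly effective."""
    if score >= 0.9:
        return "highly effective"
    if score >= 0.7:
        return "moderately effective"
    return "not effective"


# Hypothetical counts of a target behaviour per session.
baseline = [8, 7, 9, 8]
intervention = [6, 5, 7, 4, 3, 4]

pnd_score = pnd(baseline, intervention)
pem_score = pem(baseline, intervention)
print(f"PND = {pnd_score:.1f}% ({classify_pnd(pnd_score)})")
print(f"PEM = {pem_score:.2f} ({classify_pem(pem_score)})")
```

With these made-up data, five of the six intervention points fall below the lowest baseline value (7), giving a PND of about 83%, and all six fall below the baseline median (8), giving a PEM of 1. The same data can therefore land in different bands depending on the metric, which is one reason the reviews summarised here reach different conclusions. Other evaluative criteria, such as those of Scruggs and Mastropieri (1998), use different cut-offs from the bands coded in this sketch.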

Reynhout and Carter (2006) obtained a mean PND of 43 (range 16–95), which indicates that SS intervention is ineffective according to PND evaluative criteria (refer to Scruggs and Mastropieri 1998, 2001; Scruggs et al. 1987). Kokina and Kern (2010) report a mean PND score of 60% (range 11–100%) for SS interventions. This score places SS in the low/questionable effectiveness category according to Scruggs and Mastropieri (1998). Reynhout and Carter (2011) report on PND (mean 51, range 0–100, SD = 30) and IRD (mean 0.57, range 0–1, SD = 0.26) metrics and suggest that the SS intervention is only mildly effective. On the other hand, the PEM metric resulted in a mean score of 72% (range 0–100, SD = 26), which is indicative of moderate effectiveness.

In Test et al.’s (2011) review, PNDs could be calculated only for 10 of the 28 studies reviewed because the research design utilised meant that a functional relation could not be determined. However, the PND scores for 6 of the studies indicated “very effective” or “effective” outcomes, whilst the findings of the remaining studies were indicative of questionable or ineffective outcomes.

McGill et al. (2015) obtained an overall mean VAR of 0.68 (range 0 to + 2) and a mean PND of 51% (range 0–100%). Such scores are indicative of small-to-negligible effects. However, the weighted effect size estimator of 0.79 indicated moderate-to-large treatment effects. Karal and Wolfe (2018) obtained an average IRD score of 0.61 which is indicative of moderate effectiveness of SS interventions.

Qi et al. (2018) found a median PND value of 70%, which, contrary to findings from previous reviews, indicates that overall SS interventions are deemed effective for individuals with ASD. Similarly, Aldabas (2019) reported a mean effect size of 0.70, which is a high effect size and suggests that SS interventions are effective for individuals with ASD.

3) Factors Influencing Outcomes

Effectiveness of SS intervention may vary depending on the environment as well as within-child variables (Rust and Smith 2006). Overall effect sizes indicate that social stories are moderately effective, but specific intervention characteristics are associated with stronger outcomes (Karal and Wolfe 2018). The factors that have been identified to influence the outcomes of a SS intervention could be grouped into two broad categories: environmental factors and within-child factors (refer to Fig. 2). Within-child factors refer to characteristics of the individual participants. Environmental factors refer to a set of variables related to the research context and setting which are not participant related.

Descriptions of participant characteristics are highly variable across the SS literature. The poor or limited descriptions of participants are reported to make it difficult to determine whether any specific participant characteristics were associated with intervention effectiveness. However, based on the articles included in this scoping study, age and gender, reading ability, verbal comprehension, and intellectual ability were within-child variables that were hypothesised to potentially influence outcomes of SS research. Environmental factors, by contrast, are pertinent to the environment in which the research was carried out rather than to the participants. These factors are intervention setting, delivery of the intervention, modality, Gray’s criteria, comprehension checks, treatment packages, treatment intensity, and treatment integrity.

Age and Gender

The analysis of the reviews synthesised in this scoping review indicated that each review included 22 to 227 participants (refer to Table 4). Most of these participants were males. The ages of these participants ranged from 2 to 57 years. However, the more common age range was that of 3 to 15 years. Karal and Wolfe’s (2018) findings support and highlight the positive effects of SSs for school-aged autistic children whose ages range from 8 to 11 years. Mayton et al.’s (2013) findings also support the view that the effect of SSs is lower in studies with participants older than age 9.

Reading Ability

Rhodes (2014) and Reynhout and Carter (2006) propose that a child’s reading ability is a characteristic that could be considered a confounding variable. Nevertheless, it seems that, in the few studies reviewed that have included standardised achievement scores (such as Bledsoe et al. 2003; Brownell 2002; Kuoch and Mirenda 2003; Staley 2002; Thiemann and Goldstein 2001), reading ability does not have a significant impact on the outcome of the intervention (Rhodes 2014).

Verbal Comprehension

Karkhaneh et al. (2010), Rhodes (2014), and Styles (2011) include in their reviews an article by Quirmbach et al. (2009) which highlights how a verbal comprehension index of at least 68 on the Wechsler Intelligence Scale for Children, 4th Edition (Wechsler 2003) was associated with better effectiveness outcomes.

Intellectual Ability

Gray and Garand’s (1993) original focus was for SS to be used with higher functioning individuals who possess basic language ability. However, Kokina and Kern (2010) report that the effects of SS intervention seem to be somewhat higher for participants with lower cognitive ability than for individuals with high or average intelligence. Nevertheless, this factor is one which is underresearched, as the intellectual ability of individuals participating in SS research is rarely included in the participants’ description (Reynhout and Carter 2006).

Intervention Setting

Most of the studies conducted were reported to have been carried out in school settings (Aldabas 2019; Qi et al. 2018). More specifically, most deployed SS in structured classroom or small group settings (Styles 2011). The setting where the SS intervention is carried out is reported to impact intervention outcome. Interventions in general education reportedly produce larger effects when compared to home settings (Kokina and Kern 2010; McGill et al. 2015).

Intervention Delivery

McGill et al. (2015) reported that SS interventions delivered by researchers produce larger effects than those delivered by teachers. Rodríguez et al. (2019) report that whilst the majority of studies included in their review report on SS intervention that is conducted in schools by teachers, the results show a promising and positive effect of the intervention if it is carried out by people such as family members and teachers.

Intervention Modality

Combining visual elements with verbal cues is a common practice in SS interventions. Visual elements which have been included in SS interventions are photographs of participants, peers, and the environment; computer-presented social stories; and video feedback. Texts, graphics, animations, images, videos, and sounds are also reported to have been used in SS interventions delivered through technological aids (Sani Bozkurt et al. 2017).

Gray’s Criteria

Criteria for SS interventions were officially introduced by Carol Gray in 2004. In 2010, and subsequently in 2014, these 10 criteria were revised. Gray’s criteria are reported to have been developed with the learning characteristics of people with ASD in mind (Gray 2004). However, it is unclear whether SS interventions that conform to these criteria are more or less effective than those that do not. Reynhout and Carter’s (2006) analysis concluded that, of the 16 studies predating 2004, a number deviated considerably from the criteria prescribed by Gray (2003). However, outcome measures indicated that a deviation from Gray’s criteria did not negatively impact PND.

Test et al.’s (2011) review reported that 75% of the studies included in their review stated that they had used Gray’s criteria for developing SS interventions. They report that five of the six studies that yielded “very effective” or “effective” PND scores used Gray’s criteria. On the other hand, both studies with intervention PNDs of 0% (i.e. not effective) also reported using Gray’s criteria. In his review of literature, Aldabas (2019) recommends that practitioners construct sound SSs by implementing sound guidelines such as Gray’s. However, as in Reynhout and Carter (2006), outcomes in terms of effect size indicate that adherence to Gray’s criteria alone may not necessarily guarantee effectiveness. This seems to indicate that the relation between Gray’s criteria and SS effectiveness is unclear.

Comprehension Checks

Comprehension checks may be an important part of the implementation of SS intervention. Indeed, early guidelines by Gray and Garand (1993) required a comprehension check to be a fixed component of the intervention, in order to prevent inaccurate interpretation of the situation. In Reynhout and Carter’s (2006) review, it is reported that studies in which authors reported a comprehension component yielded a higher mean PND than those that did not. Similarly, in Kokina and Kern’s (2010) meta-analysis, lower PND scores were obtained for the studies that did not involve comprehension checks. Furthermore, Styles (2011) reports that in studies where SS were read regularly, as the participant’s comprehension of the SS improved, so did the reported effectiveness of the intervention.

Treatment Packages

Ali and Frederickson (2006) report that the evidence base in 2006 suggested that SS interventions can be used alone or can be supported by combining them with other approaches. The use of SS interventions along with other approaches, such as prompting or reinforcement strategies, is reported in 3 of the reviews. Test et al. (2011) report that SS treatment packages were evaluated in 17 out of 28 studies. This indicates that approximately 60% of the studies were not evaluating the outcomes of SS alone, but of SS in combination with other interventions. Test et al. (2011) also report that, of the six studies included in their review that had “very effective” or “effective” PNDs, only two investigated Social Stories alone whilst four investigated treatment packages that included Social Stories. The need for clarity on exactly what is being investigated (i.e. SSs alone vs treatment packages that include SSs), as well as the need for more research on the efficacy of SS as part of a comprehensive treatment package, has been highlighted in a number of reviews such as Kokina and Kern (2010), McGill et al. (2015), and Aldabas (2019).

Treatment Intensity

Treatment intensity refers to the number of social stories an individual is exposed to, and the number of times each is read every day. Karkhaneh et al. (2010) remark that some studies describe treatment frequency and duration, but do not explore treatment dose for short-term and long-term maintenance. Kokina and Kern’s (2010) review notes that in the few studies that used several Social Stories per child, a higher treatment effect was reported. This could indicate that higher treatment intensity is associated with improved outcomes.

Treatment Integrity

Treatment integrity is a term that refers to the degree to which interventions are implemented as intended (Gresham 1989). Sansosti et al. (2004) report that few studies have assessed treatment integrity or procedural reliability. Test et al. (2011) report that 37.5% of studies published from 1995 to 2004 included procedural reliability, whilst 58.3% of studies published from 2005 to 2007 measured procedural reliability. Test et al.’s (2011) findings suggest that measures of treatment integrity may be associated with intervention effectiveness. Similarly, Bucholz (2012) reports that the ineffectiveness of SS intervention may be due to a lack of treatment integrity. Qi et al. (2018) report that for the studies that reported treatment fidelity, the median PND was 75%, and the means of PEM, PEM-T (i.e. percentage of data exceeding the median trend line), and PDO2 (i.e. pairwise data overlap squared) were 93%, 100%, and 92%, respectively. For studies that did not report fidelity, the median PND was 50%. Nevertheless, the lack of consistent reporting of treatment fidelity makes it difficult to conclude with any degree of certainty that treatment fidelity influences the effectiveness of the intervention. Interestingly, in Rhodes’s (2014) review, the two studies that had unsuccessful outcomes had treatment integrity of 100%.

4) Social Validity

Social validity refers to the observed social significance of the goals selected, the acceptability of procedures employed, and the effectiveness of the outcomes produced in interventions as perceived by service users (Snodgrass et al. 2018). Sansosti et al. (2004) report that most of the studies published at the time did not report on social validity. This made it difficult to determine whether caregivers and/or educators perceive such interventions to be acceptable for children with ASD. The lack of reporting on qualitative research is also highlighted in Reynhout and Carter’s (2006) review, where only three of the sixteen studies reviewed examined an aspect of social validity, and by Test et al. (2011).

Reynhout and Carter (2011) report on formal measures of social validity made using Martens et al.’s (1985) Intervention Rating Profile-15, and other ad hoc scales aimed at measuring family members’ reported perceptions. Similar measures were reported in 53% of the studies in Reynhout and Carter’s (2011) review, and in 59% of the studies included in McGill et al.’s (2015) review.

In more recent reviews, such as those carried out by Aldabas (2019) and Rodríguez et al. (2019), the authors note increased reporting on social validity and conclude that teachers found SS intervention to be one of the most effective methods to teach new behaviours and decrease inappropriate behaviours in schools. Furthermore, Rodríguez et al. (2019) report that SS intervention is a tool with great reported acceptability from professionals, family members, and people with ASD themselves.

5) Maintenance and Generalisation

It is clear, from the studies included in the reviews, that single-case research is focusing mostly on determining the functional relationship between SS intervention and behavioural change. However, this emphasis has shifted gradually to include response maintenance as well as generalisability of behaviour change. The term maintenance refers to the measurement of effectiveness when the intervention is withdrawn, whilst the term generalisation refers to the effect of the intervention outside the direct environment, or context, of the study.

Reviews by Sansosti et al. (2004) and Ali and Frederickson (2006) highlight the lack of programming for maintenance and generalisation in SS research that was published at the time. Reynhout and Carter (2006) also mention that maintenance and generalisation are inadequately addressed in the studies included in their review. This pattern of methodological shortcomings was again reported by Karkhaneh et al. (2010), Kokina and Kern (2010), Reynhout and Carter (2011), Test et al. (2011), and Bucholz (2012). These reviews emphasise the importance of including, within the research design, evaluation of response maintenance.

Styles’s (2011) descriptive review, however, argues that whilst the concept of maintenance of effects after the cessation of the intervention is important to research, the concept of generalisation is not an adequate measure of the effectiveness of SS. He argues that the goal of the SS intervention is not generalisation. Rather, SS interventions are context- and situation-specific. Thus, whilst it would be desirable to have learnt behaviours generalised beyond the specific context, the scope of SS, in the first place, is more context-specific. Nevertheless, Styles (2011) maintains that there is insufficient evidence to suggest that positive outcomes are routinely maintained after SS has been withdrawn. Such conclusions were also reached by Mayton et al. (2013) and Rhodes (2014), who argue that it is unclear whether maintenance of behaviour is dependent upon continuing the SS intervention or not.

Qi et al. (2018) also came to a similar conclusion with regard to maintenance and generalisation effects. From the studies included in their review, only 7 of the 22 studies provided generalisation data to other settings. However, from these limited studies, it is suggested that SS intervention was effective or very effective in the maintenance of target skills.


The scoping review produced a comprehensive synthesis of research on SS interventions as reported through the various literature reviews identified. The empirical research on SS interventions is relatively large and is mostly based on single-subject research (SSR) designs. The examination of the effectiveness of interventions is the area in which SSR studies are most well-represented (Morgan and Morgan 2001). SSR is experimental and aims to document causal, or functional, relationships between independent and dependent variables (Horner et al. 2005). Participants in a single-subject experiment provide their own control data for comparison in a within-subject rather than a between-subject design. Such controls are seen to address threats to internal validity (Krasny-Pacini and Evans 2018).

Sansosti et al. (2004) published the first review that focuses on SS. The objective of their research synthesis was to evaluate the effectiveness of SS intervention. According to Sansosti et al. (2004), AB designs presented limited control over threats to internal validity in SSR studies that were carried out. Furthermore, they argue that ABA or ABAB designs are also not adequate to ascertain the effectiveness of social stories. The first reason for this is related to the withdrawal of the intervention, an aspect of the design that could be harmful to the participant since it is unsafe for participants to return to the baseline phase of the intervention (Reynolds 2008). Secondly, the “reversal” of the behaviour to baseline conditions once the SS intervention is withdrawn may not even be possible since the objective of a SS is to attain long-standing behavioural change. Thus, when the target of the SS intervention is the decrease of inappropriate behaviours, the return to baseline conditions of the targeted behaviour could also be interpreted as an ineffective intervention outcome. Nevertheless, the use of single-case experiments, particularly multiple-baseline designs, seems to have presented researchers with the opportunity to address the issue of heterogeneity in autism symptomology as well as the issues of ethics and “irreversibility”.

The strength of a SS intervention is indeed its “customisability”. Thus, whilst every intervention is based on similar principles, no one social story is the same as another. Furthermore, great variability in the administration can lead to great variability in terms of outcomes. One of these “variables” is intervention setting. The majority of studies are carried out in schools and structured classroom settings. Furthermore, the number of SS interventions carried out by researchers and professionals outnumbers the interventions carried out by teachers. Parents and guardians are those who figure the least in the literature. This could be a result of the poor treatment fidelity reported in studies that centre around parents. Nevertheless, the small sample of studies in which parents administered the intervention has reported promising results. Yet, outcomes of such studies could have been confounded by administration procedures that might have deviated from the standard, and in so doing delivered some sort of treatment package that included verbal prompts, encouragement, and reinforcement.

Reports, such as McGill et al.’s (2015), of larger effect sizes being reported in interventions that were delivered by researchers should lead to questioning the accessibility of the intervention, i.e. whilst it is reported that social stories are largely popular as a result of their “ease of use”, McGill et al.’s (2015) report seems to indicate that outcomes of the administration are more positive in settings where the administrators are highly trained individuals. Thus, this finding could challenge the notion that SSs are easy to create and to use, as the outcomes are tied to professional preparation and knowledge of the intervention.

The need to adhere, or not, to Gray’s (2004) criteria for writing social stories is also a confounding variable. The evidence seems to indicate that whilst adherence to Gray’s criteria could yield sound SSs, adherence to these criteria alone may not guarantee effectiveness. Furthermore, the lack of reporting on the use of, or adherence to, Gray’s criteria has been highlighted in most of the reviews. This limits the conclusions that can be made on this issue.

The issue of the modality of delivery also seems to be central to the question of SS intervention effectiveness. Modality, in this case, refers to the mode of administration (using electronic devices or print) and also refers to the inclusion of photos, text, graphics, animations, and sounds. The literature identified does not attempt to answer the question of which modality produces more positive outcomes. However, Kokina and Kern (2010) conclude that Functional Behavioural Analysis (FBA) could guide Social Story interventions. Thus, it could be conceivable to argue that knowledge of the individual’s distinctive characteristics—which include intellectual and verbal abilities, language (expressive and receptive) skills, behaviours, needs, preferences, strengths and weaknesses—could be important when administering SS interventions.

Overall, it seems that the question of “is a SS intervention effective for children with autism” is still not adequately answered in the literature. The finding that the large majority of studies consist of male participants also puts into question the effectiveness of SS interventions with female participants. Early reviews, e.g. Sansosti et al. (2004), indicate that the evidence of the effectiveness of SS is limited. The PND scores reported by various reviews that are synthesised in this study are variable. These scores range from a mean PND of 43 (Reynhout and Carter 2006) to median PND scores of 70 (Qi et al. 2018). These scores are indicative of effects which range from negligible to moderate, respectively. These scores could be a result of the poor quality of SS research. There has been a reported improvement in the quality of the research since 1995, especially since the introduction of Horner et al.’s (2005) evaluative criteria for single-case research, National Autism Center’s (2009) Scientific Merit Rating Scale (SMRS), and Kratochwill et al.’s (2013) WWC standards. However, the variability in such quality suggests that notwithstanding the 120 studies included in the reviews, the evidence to support the effectiveness, or not, of the intervention is weak.

Furthermore, PND scores appear to be the most frequently used metric to evaluate the effect of the SS intervention. However, the PND metric is not necessarily the best, or even the only, way to measure outcomes of single-case research. Although it is claimed that the IRD metric is the strongest validated metric when compared to PND and PEM (Parker et al. 2009), IRD seems to have been employed only to a limited extent to measure outcomes, whilst PND is the most used metric in the reviews. It could be argued that PND lacks sensitivity or discrimination ability (Parker et al. 2007). Thus, the procedure used to “measure” the effectiveness of SS intervention could be rethought.
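To illustrate why PND may lack sensitivity, the following minimal sketch computes PND (the percentage of intervention-phase data points that exceed the most extreme baseline point) and PEM (the percentage of intervention-phase data points that exceed the baseline median) for a single AB comparison. The data are hypothetical and are not drawn from any study included in the reviews; the function names and the decision not to count ties are assumptions made for illustration only.

```python
from statistics import median


def pnd(baseline, intervention, increase=True):
    """Percentage of non-overlapping data (PND).

    Counts intervention points beyond the most extreme baseline point:
    above the maximum when the behaviour is expected to increase,
    below the minimum when it is expected to decrease.
    """
    if not baseline or not intervention:
        raise ValueError("both phases need at least one observation")
    if increase:
        threshold = max(baseline)
        count = sum(1 for x in intervention if x > threshold)
    else:
        threshold = min(baseline)
        count = sum(1 for x in intervention if x < threshold)
    return 100.0 * count / len(intervention)


def pem(baseline, intervention, increase=True):
    """Percentage of data points exceeding the baseline median (PEM).

    Ties with the median are not counted (an assumption of this sketch).
    """
    if not baseline or not intervention:
        raise ValueError("both phases need at least one observation")
    mid = median(baseline)
    if increase:
        count = sum(1 for x in intervention if x > mid)
    else:
        count = sum(1 for x in intervention if x < mid)
    return 100.0 * count / len(intervention)


# Hypothetical observations of a target behaviour expected to increase.
baseline = [2, 3, 2, 9, 3]          # one outlying baseline session (9)
intervention = [6, 7, 8, 7, 8, 6]

pnd_score = pnd(baseline, intervention)  # 0.0: no point exceeds the outlier
pem_score = pem(baseline, intervention)  # 100.0: all points exceed the median (3)
```

In this hypothetical case, a single outlying baseline session reduces PND to 0, whereas PEM, which relies on the baseline median, is unaffected. The example shows only one way in which the two metrics can diverge and does not imply that PEM is preferable in general.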

Finally, the question of whether or not SS interventions could be considered an EBP is not answered in the literature. Whilst the findings are promising, the most frequent reply to this question is that “further research is needed on the effectiveness of social stories”.

Recommendations for Future Research

Outcomes of this scoping review indicate that notwithstanding the relatively large body of research, there is great variability in reported outcomes of SS interventions. An improvement in the quality of research has been noted in the reviews following 2012. This could be attributed to the introduction of guidelines such as the NAC and WWC research guidelines for SSR. Together with this reported improvement, more positive outcomes have also been recorded. However, the implications of this scoping review go beyond the mere cataloguing of reported outcomes. Rather, the synthesis of literature has implications for aspects of research that should be addressed by future researchers to continue to improve quality as well as contribute towards answering the question of whether or not SS interventions could be considered an EBP.

As recommended by both NAC and WWC guidelines, the variables that are reported to possibly confound/affect intervention outcomes should be included and more thoroughly described in research reports. These variables include (1) SS adherence to Gray’s criteria, (2) modality of social story delivery, (3) data on maintenance and generalisation of behaviours/skills, (4) number of sessions carried out, (5) goal of intervention, (6) intervention setting, (7) information regarding treatment fidelity, and (8) information on who carries out the intervention.

Several considerations should also be made when reporting on participants’ characteristics. Details such as age, gender, intellectual ability, reading ability, and severity of difficulties should be adequately reported. Furthermore, Functional Behaviour Analysis (LaBelle and Charlop-Christy 2002) should also be carried out to ascertain the frequency and intensity of the behaviours that are going to be targeted. A better understanding of the target behaviour could also yield more apposite stories. Such information could also inform the social validity of the intervention, i.e. the degree to which the goal is important. To further inform this aspect of research, qualitative research strategies could also be employed, as well as quantitative measures of peers’ behaviours aimed at comparing changes in behaviour to neurotypical children’s behaviour. Such measures should be put in place to address threats to the internal validity of the research design.

The issue of gender in SS research should also be taken into consideration, since most of the literature available focuses on males, as most of those diagnosed with autism are male. Future research should be sensitive to potential gender differences. Finally, the issue of which measure should be used to summarise and evaluate SSR outcomes is still relevant. Thus, whilst it is argued, in this paper, that PND is not necessarily the most adequate standard to evaluate SSR, the use of other outcome metrics which include, but are not limited to, VAR, IRD, PEM, and PAND (see Parker et al. 2011) should be considered. Accordingly, researchers should present original data of baseline and intervention observations, both graphically and numerically, in the published report to ensure the accurate calculation of the various metrics.


Several limitations associated with this review must be recognised. Specifically, findings are limited to databases included in the scoping process, which means that not all available research may have been identified. Also, this study did not include an evaluation of the quality of reviews that were identified and included. Furthermore, two of the reviews, namely Reynhout and Carter (2006) and Reynhout and Carter (2011), also included studies with children who were not autistic. These reviews were included because more than 80% of the studies that they included were with participants with autism. Finally, since the scoping review only included syntheses and other reviews of literature, it could have excluded research published in 2019 and 2020 that had not been included in the identified papers.