Introduction

Outcome measures in speech-language pathology (SLP) are an essential component of assessing treatment efficacy, monitoring progress during intervention and planning future treatment [1]. The ASHA scope of practice in SLP has highlighted that a clinician’s use of outcome measures is central to evidence-based practice [2]. The measurement of treatment change is of interest to both clinicians and researchers in SLP [3]. It is not always clear from the research literature, however, how treatment change should be measured. Specifically, it is difficult to ascertain which behaviours should be measured and how measurement should be carried out.

The International Classification of Functioning, Disability and Health: Children and Youth Version (ICF-CY) provides a conceptual framework for measuring health and disability factors at individual and population levels [4]. The ICF-CY not only encompasses impairment level factors (body structure and function) but also considers the impact of these from a broader social perspective in terms of changes in children’s activities and participation. In addition, potential environmental and personal factors that interfere with a child’s ability to communicate and participate in their home and/or community are considered [4–10]. The ICF-CY has been applied broadly in many areas of SLP, including assessing the performance of children with speech impairment, children who stutter and children with developmental language impairment [6, 11, 12].

One area of SLP in which treatment change is challenging for clinicians and researchers to evaluate is the treatment of children with speech sound disorders (SSD) that have a speech motor control component [1]. Their speech difficulties arise from an impairment of the neuromuscular and/or motor control system and lead to difficulty in planning and executing speech sounds [13]. Their speech is characterized by deletions, substitutions and distortions, as well as inconsistent production in childhood apraxia of speech (CAS) [13, 14]. Inconsistent productions and approximations of speech sounds are unreliably captured in perceptual judgement of speech due to the categorical nature of perception [15]. For the purpose of this review, we will not include studies of SSDs arising from linguistically-based phonological issues. The primary focus, instead, is on phonetic-based articulation disorders arising from fine speech motor control issues, which include CAS, dysarthria and motor-speech disorders not otherwise specified [MSD-NOS; 13]. Children with these diagnoses are at increased risk for academic, social and emotional difficulties, and thus it is essential to monitor their speech performance during and subsequent to intervention, in order to assess intervention effectiveness [e.g. see 16].

To date, there has been no comprehensive and critical examination of methodology relating to outcome measures in children with SSD and speech motor control issues. There is one review of literature published between 1990 and 2006 relating to standardized speech/non-speech motor performance tests in children [1] and a handful of individual reviews of standardized tests in this area [17, 18]. To address this lack of summary information, we carried out a narrative review of the literature (between 1985 and 2014) that examined the use of measures of treatment change beyond standardized assessments. The purpose of this review was to evaluate the use of outcome measures to assess treatment change in children with SSD with a motor basis.

Method

Search Methodology

Seven databases were searched for journal articles published between January 1st, 1985 and December 31st, 2014 to identify intervention studies in children with SSD: AMED, CINAHL, Embase, Medline (including In-Process and Other Non-Indexed Citations), PsycINFO, Scopus and speechBITE. A preliminary search revealed that studies reporting motor speech treatment were first published in the mid-1980s; therefore, 1985 was selected as the start date for the search. Search terms relating to SSD were combined with terms relating to intervention. Specific keywords, syntax and refinements varied depending on database search criteria and limits. The results were further narrowed using the age search limit: child (0–18 years). The search strategy used in Medline is shown in Table 1. The completed search identified a total of 4029 articles.

Table 1 Medline search strategy

Screening

Figure 1 illustrates the screening process. All references were exported to RefWorks (Version 2.0; RefWorks-Cos). Duplicate records were removed and references were screened by title and abstract. Abstracts were included if they described studies measuring treatment of developmental SSDs with a motor basis. Abstracts describing treatment in children with phonologically-based SSD were only included if the treatment studied was an articulation-based intervention. Exclusion criteria included review articles, non-peer-reviewed sources, test validity papers, assessment/diagnostic papers, studies with no treatment administration/measurement, non-speech papers (e.g. hip dysplasia), language impairment, bilingualism, prosody/lexical deficits, phonological (linguistic-based) disorders, non-speech oro-motor exercises, augmentative and alternative communication (AAC), oral structural issues, traumatic brain injury/tumors, surgical-based intervention and publications not in English. Articles were not rated for quality and/or levels of evidence (e.g. Oxford Centre for Evidence-Based Medicine–Levels of Evidence) [19] as the focus of the review was to examine the outcome measures used and not the efficacy of interventions reported. Two hundred and fifty articles were randomly selected and screened for acceptance by a second author. Krippendorff’s alpha for reliability between the two independent coders was 0.85. Sixty-six articles were accepted for further analysis.
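For readers unfamiliar with the reliability statistic reported above, the following is a minimal sketch of how Krippendorff's alpha can be computed for the simplest case: two coders, nominal include/exclude decisions and no missing data. It is not the authors' analysis script. The function name and the example decisions are hypothetical and serve only to illustrate the calculation (alpha = 1 minus observed disagreement divided by expected disagreement).

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(coder_a, coder_b):
    """Krippendorff's alpha for two coders, nominal codes, no missing data."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must rate the same non-empty set of units")

    # Coincidence matrix: each unit contributes both ordered pairs of its two codes.
    coincidences = Counter()
    for a, b in zip(coder_a, coder_b):
        coincidences[(a, b)] += 1
        coincidences[(b, a)] += 1

    total = 2 * len(coder_a)  # number of pairable values
    marginals = Counter()
    for (code, _other), count in coincidences.items():
        marginals[code] += count

    # Observed disagreement: share of coincidences whose two codes differ.
    observed = sum(v for (c, k), v in coincidences.items() if c != k) / total
    # Expected disagreement: chance pairing of differing codes given the marginals.
    expected = sum(
        marginals[c] * marginals[k] for c, k in permutations(marginals, 2)
    ) / (total * (total - 1))

    if expected == 0:
        raise ValueError("alpha is undefined when only one code is ever used")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Hypothetical screening decisions (1 = accept, 0 = reject) for four abstracts.
    first_coder = [1, 0, 1, 0]
    second_coder = [1, 0, 0, 1]
    print(krippendorff_alpha_nominal(first_coder, second_coder))
```

With real screening data, the two lists would hold one decision per abstract from each coder; values close to 1 indicate agreement well beyond chance.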

Fig. 1 Screening and review process

Results

Publication Year and Methodological Characteristics

Publication Year

The review identified 66 published studies (see Appendix Table 3) that report outcome measures following treatment in children with SSD with a motor basis. The number of studies per year has varied over the last three decades, with a surge in publications in recent years (see Fig. 2).

Fig. 2 Number of published studies from 1985 to 2014

Participants

Table 2 shows the number of participants by speech disorder and the number of studies evaluating treatment in these populations included in the review. Most studies (81.8 %) included fewer than 10 participants. The age of participants ranged from 3;0 to 16;0 years.

Table 2 Number of participants and number of studies by speech disorder

Levels of Measurement

Figure 3 shows the types of measures used according to the classification outlined in the ICF-CY [4]. The outcome measures in the review relate to two of the ICF-CY categories: impairment level (body function) and participation level. Body function measures included perceptual, physiologic and acoustic measures (e.g. rating scales, transcription measures, tongue-palate contact patterns and formant frequencies). The majority of studies (68.2 %) used only perceptual measures to document change following treatment. About 25.8 % of studies used instrumental (acoustic and physiological) measures. Only three studies (4.5 %) used participation level measures, ranging from a parent/school questionnaire to standardized assessments, such as Focus on the Outcomes of Communication Under Six (FOCUS) and The Socialization Scale (from the Vineland Adaptive Behaviour Scales–Second Edition) [7, 85••, 86]. See Appendix Table 4 for a detailed record of outcome measures used in the selected studies.

Fig. 3 Number of studies reporting outcome measures used according to ICF-CY classification

Discussion

This article provided a narrative review of studies reporting on outcome measures in treatment of children with motor-based SSDs. The review examined 66 treatment studies published between 1985 and 2014 (see Appendix Table 3) and summarized the publication year and methodological information from these studies. In the past 10 years, there has been a steady increase in publications under the scope of this review. Fifty-two different outcome measures (see Appendix Table 4) were identified, which were categorized into body function (perceptual and instrumental) and participation level measures. The range of available measures combined with limited information relating to the appropriate use of these measures makes it challenging for clinicians and researchers to accurately measure change following treatment in this population. A synthesis of the findings is discussed below in addition to recommendations for future clinical and research application.

The participants in the reviewed studies varied in age and in the type and severity of speech disorder. Half of the studies included in the review evaluated treatment in children with an articulation disorder. The remaining studies included children with a phonological disorder, mixed articulation and phonological disorder, CAS or speech disorder secondary to other disorders. The majority of studies involved a small number of participants (n < 10), while one large-scale study (n = 730) examined outcomes of treatment of a whole speech and language therapy service cohort over a 12-year period [24]. The results highlight a need for larger-scale studies to ensure the generalizability of study findings.

Levels of Measurement

All papers in the review presented outcome measures at the ICF-CY body function level (body structure issues, such as oral structural issues, were excluded from analysis) (Fig. 3). The primary focus across studies was therefore impairment-based as studies aimed to increase accuracy of target sound productions, expand phonetic/phonemic inventories, decrease production variability and increase speech intelligibility.

Since the introduction of the ICF-CY in 2007, only three studies [32, 59, 65] have measured outcomes from a broader social perspective, indicating that applying the multiple levels of the ICF-CY framework has not yet taken hold in practice in the area of motor-based SSDs. Although the study by Mecrow et al. [59] showed some significant and positive changes relating to how much the child’s speech difficulties affected him/her at home and at school, it was not without limitations. These included a limited study design (e.g. the control group did not complete the questionnaires), a lack of information regarding tool validity and reliability, and reduced sensitivity of the questionnaire items. Pennington et al. [65] used FOCUS, a standardized tool, to examine communicative participation in young children with cerebral palsy (CP) following intensive therapy. Even though FOCUS scores increased following therapy (mean change scores: 30.3 for parents and 28.25 for teachers), these changes did not correlate with increases in intelligibility [65]. Another standardized, norm-referenced measure, The Socialization Scale (from the Vineland Adaptive Behaviour Scales–Second Edition), was used to assess activity and participation levels following PROMPT treatment for children with CAS [32]. The increase in scores post treatment was significant for three out of four participants based on confidence intervals provided in the test manual. This finding of limited reporting of treatment change at the level of activity and participation is similar to that reported in the recent review by Baker and McLeod [8] of studies on phonological intervention in children. In their review, the majority of 134 studies also evaluated treatment change only at the impairment level [8].

The lack of participation level measures is surprising, since from the late 1990s (1996–1997) onwards at least three outcome measures have been developed that focus on measuring change from a broader social perspective and can be used with pre-school children with speech and language disorders. These measures are the American Speech-Language-Hearing Association National Outcomes Measurement System (Pre-K NOMS), Therapy Outcome Measures (TOMs) and FOCUS [7, 85••, 88, 89]. Of these three measures, FOCUS is particularly recommended due to its sensitivity, published data on validity and reliability and its ability to capture changes across all of the ICF-CY levels [7, 85••]. In a recent study, the FOCUS measure was also shown to be sensitive to intensity of motor speech treatment in children with CAS, with larger effect sizes reported for higher (twice/week) than lower (once/week) intensity of treatment [90•]. In sum, both clinicians and researchers are strongly encouraged to adopt a more comprehensive intervention measurement and reporting strategy across all ICF-CY levels. A comprehensive review of assessment and intervention procedures as they relate to ICF-CY levels can be found in McLeod and Threats [10].

Transcription-Based Perceptual Procedures

Outcome measures using transcription-based approaches were very common across the reviewed articles (84.8 %). These measures include standardized tests, criterion-referenced measures and measures of intelligibility. Transcription measures were used across a range of speaking tasks from imitation to spontaneous speech at word, sentence and conversation level. In the reviewed studies, clinicians used either broad “phonemic” transcriptions (21.2 %, e.g. Goldman-Fristoe Test of Articulation-2 [GFTA-2; 91]) or narrow “phonetic” transcriptions (18.2 %, e.g. Khan-Lewis Phonological Analysis-2 [KLPA-2; 92]), while the remainder of the studies did not specify the type of transcription employed. As a perceptual procedure, however, transcription is susceptible to bias and error. For example, listeners may “fill in” information from the acoustic signal, a phenomenon known as phonemic restoration; listeners’ perception is influenced by stress and intonation patterns; and even expert judges have poor inter-rater reliability [15]. While narrow transcription provides a greater level of detail, it is less reliable than broad transcription [15]. Additionally, the finer discrimination required to describe distortions in motor-based speech disorders is limited due to the categorical nature of auditory perception [15].

Standardized Norm-Referenced Tests

Standardized assessments (e.g. norm-referenced) were used in 19.7 % of studies. As a general rule, the use of norm-referenced standardized tests to measure change following treatment is not recommended due to serious limitations such as regression to the mean (i.e. participants with low scores at pretest may improve more than those with high scores) and lack of sensitivity. Norm-referenced tests may sample a wide range of behaviours, of which those targeted in intervention may be only a subset; the test may therefore not be sensitive enough to document behavioural change following treatment [93]. Thus, use of norm-referenced tests may result in under- or overestimation of change [for excellent reviews on this topic, see 1, 93, 94].

One way to remediate these problems is to use norm-referenced tests in a criterion-referenced mode for assessing treatment progress. For example, Namasivayam et al. [62••] used pre–post scores from the GFTA-2 to investigate the effect of PROMPT therapy on speech production and intelligibility in children with moderate to severe SSD. They relied on the standard error of measurement (SEM) to determine significant change following treatment that is not a result of measurement error. The mean SEM for all pre-school age groups in the GFTA-2 is 3.7 and 3.0 for males and females, respectively [91]. Therefore, a minimum increase of 4 points was required at post testing to indicate meaningful improvement in articulation skills.
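The SEM-based criterion described above can be sketched as follows. This is an illustrative reading of the rule, assuming that scores are whole points and that a gain counts as meaningful only when it is strictly larger than the SEM; the function names are hypothetical and the two SEM values are those quoted in the text.

```python
import math

# Mean SEM values for pre-school age groups, as quoted in the text above.
SEM_MALES = 3.7
SEM_FEMALES = 3.0


def minimum_meaningful_gain(sem: float) -> int:
    """Smallest whole-point gain strictly larger than the SEM (assumed rule)."""
    return math.floor(sem) + 1


def is_meaningful_change(pre_score: float, post_score: float, sem: float) -> bool:
    """True when the pre-post gain is larger than the SEM."""
    return (post_score - pre_score) > sem


if __name__ == "__main__":
    # Both SEM values lead to the same 4-point threshold.
    print(minimum_meaningful_gain(SEM_MALES), minimum_meaningful_gain(SEM_FEMALES))
    # Hypothetical pre and post scores for one child.
    print(is_meaningful_change(40, 44, SEM_MALES))
```

Under these assumptions, both SEM values yield the 4-point threshold reported in the text, and a hypothetical gain from 40 to 44 would count as change beyond measurement error.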

Criterion-Referenced Procedures

Given the above difficulties with norm-referenced standardized tests, it is not surprising that researchers and clinicians most often use criterion-based scoring to assess intervention-related change in SSDs [94]. Our analysis reveals that the majority of studies (68.2 %) used criterion-referenced procedures (e.g. percentage of consonants correct (PCC), percentage of vowels correct (PVC) and accuracy of target sounds) alone or, in fewer instances, in combination with objective instrumental measures (25.8 %). Although transcription-based criterion-referenced procedures like PCC [e.g. 95] are better than using norm-referenced tests, they are not without limitations. First, PCC was originally designed to assess severity (in bands, e.g. 50–65 % = moderate-severe) rather than measure change subsequent to intervention [96]. Second, the original calculation of PCC required measuring all consonants in all word positions, so treatment of select phonemes/sounds did not significantly alter PCC scores. Several modifications to PCC have been made, such as using pre-determined subsets of sounds, or using a differential weighting approach (PCC-Revised) [48, 97]. These changes, however, still do not permit scoring of closer approximations within omitted or substituted sound categories [96].
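The second limitation can be made concrete with a small sketch. The code below computes PCC as the percentage of consonants judged correct, either over the whole sample or over a pre-determined subset of target sounds. The function name, the data layout and the 100-consonant sample are hypothetical and chosen only for illustration.

```python
def pcc(judgements, only=None):
    """Percentage of consonants judged correct.

    judgements: iterable of (target_consonant, is_correct) pairs.
    only: optional set of target consonants to restrict scoring to.
    """
    scored = [ok for target, ok in judgements if only is None or target in only]
    if not scored:
        raise ValueError("no consonants to score")
    return 100.0 * sum(scored) / len(scored)


# Hypothetical 100-consonant sample containing 4 tokens of the treated sound /r/.
pre = [("r", False)] * 4 + [("t", True)] * 50 + [("s", False)] * 46
post = [("r", True)] * 4 + [("t", True)] * 50 + [("s", False)] * 46

if __name__ == "__main__":
    print(pcc(pre), pcc(post))                          # whole-sample PCC
    print(pcc(pre, only={"r"}), pcc(post, only={"r"}))  # treated sound only
```

In this invented sample, complete mastery of the treated sound moves whole-sample PCC only from 50 % to 54 %, which stays within the same severity band, whereas the subset score moves from 0 % to 100 %. Neither score credits a closer approximation that is still judged incorrect.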

The limitations with PCC-type measures have led to alternative procedures like the probe-word scoring system [PSS; 96] that allow monitoring of “degrees of change” or approximations towards specific therapy targets. Early PSS systems (e.g. those used by Hall et al. [96]) utilized voice, place and manner judgements, where one point is deducted for each feature mismatch with the target. More recent versions of PSS are more sophisticated and use a 3-point scaled perceptual scoring (0 = incorrect production, 1 = close approximation and 2 = correct production) that includes both segmental and suprasegmental aspects of words and phrases [e.g. 32, 63, 72, 75, 76, 98••]. These newer PSS methods are a substantial departure from earlier auditory-perceptual scoring of distinctive feature errors as they include visual observation and reporting of movement gesture approximations [e.g. 32, 72] as well as sound distortions, and temporal and prosodic aspects of speech productions [98••].
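A minimal sketch of the 3-point scaled scoring is given below. The 0/1/2 point values follow the description above; the rating labels, the function names and the choice to summarize a probe list as a percentage of the maximum possible score are assumptions made for illustration, since the reviewed studies may aggregate scores differently.

```python
# Point values follow the 3-point scale described in the text.
PSS_POINTS = {"incorrect": 0, "close": 1, "correct": 2}


def probe_score(ratings):
    """Return (total points, percentage of the maximum possible score)."""
    if not ratings:
        raise ValueError("no probe ratings supplied")
    total = sum(PSS_POINTS[r] for r in ratings)
    return total, 100.0 * total / (2 * len(ratings))


def binary_percent_correct(ratings):
    """Correct/incorrect scoring of the same ratings, for comparison."""
    return 100.0 * sum(r == "correct" for r in ratings) / len(ratings)


# Hypothetical ratings for ten probe words before and during treatment.
before = ["incorrect"] * 10
during = ["close"] * 6 + ["incorrect"] * 4

if __name__ == "__main__":
    print(probe_score(before), probe_score(during))
    print(binary_percent_correct(before), binary_percent_correct(during))
```

In this invented example, binary scoring shows no change (0 % at both points), while the scaled score rises from 0 % to 30 %, reflecting movement towards the target.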

Nevertheless, PSS methods do not account for changes in articulatory/sound transitions, changes in movement trajectories, subtle changes in speech motor control, vowel productions or suprasegmentals, which may affect overall speech intelligibility scores [51, 62••, 99]. Further, speech intelligibility at both the word- and sentence-level was significantly correlated with speech motor control (measured using the Verbal Motor Production Assessment for Children (VMPAC) [100]) and not with articulatory proficiency (measured using the GFTA-2) [62••, 91].

Speech Intelligibility

Only a few studies (19.7 %) reviewed in this manuscript report changes in overall speech intelligibility as a treatment outcome measure, despite this being an important goal of speech therapy in general [33, 51, 101, 102]. Intelligibility is a measure of severity of speech impairment [103] and an index of body function in the ICF-CY [10, 104]. Speech samples in the reviewed studies ranged from spontaneous speech elicited during naturalistic play to word/sentence imitation or picture naming tasks. In children with severe SSDs and unintelligible speech, eliciting sufficient spontaneous speech in a naturalistic setting may not be possible as it may be difficult to quantify listener understanding when target words are not known. Thus, elicited procedures such as imitation or picture-naming were more frequently used with these children [51, 71, 105].

The speech intelligibility assessment procedures typically involved the listener either selecting a word from multiple alternatives (closed-set; e.g. Children’s Speech Intelligibility Measure (CSIM)) [106] or writing down what they hear (open-set; e.g. Beginner’s Intelligibility Test (BIT)) [107]. Impressionistic judgements and rating scales, given their reported lack of sensitivity, validity and reliability, were rarely reported in the research studies reviewed here. Nevertheless, these measures are popular with clinicians, as indicated in a recent survey [71, 108]. Overall, the findings of the current review are similar to those of a recent review of outcome measures for children with phonologically-based SSD, in which only 2 of 134 studies made reference to an intelligibility assessment [8].
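Both closed-set and open-set procedures reduce to comparing what the listener selected or wrote down with the word the child was attempting. The sketch below scores intelligibility as the percentage of target words identified, assuming exact word matches after trimming and lower-casing. The function names and word lists are hypothetical, and published tools such as the CSIM and BIT define their own scoring rules in their manuals.

```python
def percent_words_identified(targets, responses):
    """Percentage of target words the listener identified (exact match assumed)."""
    if len(targets) != len(responses) or not targets:
        raise ValueError("targets and responses must be non-empty and aligned")
    hits = sum(
        t.strip().lower() == r.strip().lower() for t, r in zip(targets, responses)
    )
    return 100.0 * hits / len(targets)


def mean_across_listeners(targets, responses_by_listener):
    """Average the score over several listeners judging the same recording."""
    scores = [percent_words_identified(targets, r) for r in responses_by_listener]
    return sum(scores) / len(scores)


# Hypothetical target words and two listeners' written (open-set) responses.
targets = ["boat", "cat", "sun", "key"]
listener_1 = ["boat", "hat", "sun", "tea"]
listener_2 = ["boat", "cat", "sun", "tea"]

if __name__ == "__main__":
    print(percent_words_identified(targets, listener_1))
    print(mean_across_listeners(targets, [listener_1, listener_2]))
```

For a closed-set task, the response lists would simply hold the alternative each listener selected for each item.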

Another area of speech intelligibility testing that requires further attention is the need for a behavioural standard to indicate that observed changes in speech intelligibility following treatment are not due to measurement error. Namasivayam et al. [62••] indicated that a change of ∼8 % in CSIM word-level speech intelligibility scores following motor speech treatment fell outside the 90 % confidence intervals (see CSIM test manual) [104], indicating an actual change in the child’s performance beyond measurement error. Of course, such behavioural standards are influenced by the type of elicitation procedure, the type of treatment and the nature and severity of SSD; having such cut-off scores, however, would be a step towards integrating more robust and valid speech intelligibility testing procedures in the clinic.

Instrumental Procedures

In the reviewed studies, only a small percentage (30.3 %) used instrumental analysis to evaluate change following treatment. Instrumental procedures were most frequently used in studies providing instrumentation-based treatment, including electropalatography (EPG), ultrasound and motion-tracking systems such as Vicon 460 (Vicon Motion Systems, LA, USA). It is argued that, in order to interact optimally with instruments and receive maximum benefit, children must have reached a certain level of maturity and cognitive development; hence, children under the age of 5 years are considered poor candidates. Further, due to the high cost of devices and their parts (e.g. a custom artificial palate for EPG), the use of instruments has been restricted to children with severe articulation disorders for whom conventional treatments have failed [52].

In the reviewed studies, instrumentation was used to objectively document pre–post changes [e.g. 29, 36, 38] and to continuously track intervention-related changes [43]. EPG measures concern tongue placement and closure interval duration during consonant production [e.g. 29, 36, 38, 43, 64]. Articulatory kinematic variables reported in two studies using the Vicon motion-tracking system included displacements, peak velocities and durations of movements of the lips and jaw [40, 79•]. These studies reported that changes in articulatory kinematics were associated with positive changes in PCC/PVC scores and visual improvements in speech movement accuracy and speech intelligibility following intervention. The instruments do not have to be very sophisticated or expensive. The importance of using accessible and available acoustic measures is highlighted in the study by Huer [43]. Huer tracked intervention over a 70-day period for a child with /w/ → /r/ substitution using both spectrographic (e.g. second formant transition rates, standard deviation of formant values) and perceptual (percent correct) approaches. Changes in acoustic-spectrographic measures were present earlier than changes in perceptual judgement and thus offered greater precision in measuring speech production change over the course of intervention. These findings highlight the importance of using instruments to track the results of response evocation strategies over time in order to modify treatment online as necessary. Considering the significant limitations of perceptual measures, we must move toward consistently using instrumentation to evaluate change during and following treatment.
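The two acoustic measures named above are simple to derive once formant values have been read from a spectrogram. The sketch below is illustrative only and does not reproduce Huer's procedure: it assumes the transition rate is the change in second formant (F2) frequency divided by the transition duration, and that variability is the sample standard deviation across repeated tokens. All function names and numeric values are hypothetical.

```python
from statistics import stdev


def transition_rate_hz_per_ms(f_start_hz, f_end_hz, duration_ms):
    """Formant transition rate: frequency change divided by transition duration."""
    if duration_ms <= 0:
        raise ValueError("duration must be positive")
    return (f_end_hz - f_start_hz) / duration_ms


def formant_variability(formant_values_hz):
    """Sample standard deviation of a formant measured across repeated tokens."""
    return stdev(formant_values_hz)


if __name__ == "__main__":
    # Hypothetical F2 onset and offset (Hz) over a 60 ms transition.
    print(transition_rate_hz_per_ms(1000, 1600, 60))
    # Hypothetical F2 values (Hz) from three repetitions of the same word.
    print(formant_variability([1500, 1600, 1700]))
```

Computing such values at each session would yield a time series that can be set beside percent-correct judgements to see which measure registers change first.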

Application to Practice

The importance of aligning theory, disorder classification and measurement cannot be over-emphasized [109] and is key to understanding the mechanism of treatment action. Treatment and measurement strategies should be aligned with underlying deficits. For example, if children with CAS have difficulty in planning and/or programming speech movements, then effective treatment and measures of treatment change should be focused on these components [55]. To illustrate, Pennington et al. [110] implemented a speech breathing and speaking rate treatment to support articulatory precision with six children with cerebral palsy. They chose to use speech intelligibility as their only measure of treatment change. Although these strategies improve speech intelligibility as a whole, as Pennington et al. [110] pointed out, without direct measures of change in speech breathing and articulation we cannot determine which factors contributed to the changes in speech intelligibility. Clinicians and researchers should routinely formulate a tentative hypothesis of why an intervention is expected to work (i.e. a possible mechanism of therapeutic action or effect) and then choose an outcome measure that best reflects this hypothesis.

Clinically, measurement of treatment change should not be restricted to posttreatment outcome measures. Ongoing measurement could guide decisions at every step of the clinical process [111]. The use of ongoing probes that assess multi-dimensional aspects of speech (e.g. movement trajectories and prosody) can be useful to guide treatment goals [32, 43, 72, 112]. The accurate evaluation of change during treatment will help clinicians to respond efficiently to the specific needs of a child and adjust treatment targets to optimise treatment effectiveness.

The majority of studies in the review focus on measuring specific aspects of speech, without taking into account the whole child and how they use speech to interact with their environment. As highlighted by Baker and McLeod [9], the ICF-CY framework provides a scaffold to think about the child from a broader perspective. Changes at the level of body function must also have an impact at the level of participation in order to determine that treatment is effective and functional to meet a child’s needs.

Conclusion

The narrative review identified a wide variety of measures used to document change following treatment in children with SSD with a motor basis. It is critical to first understand the nature of the underlying deficit before choosing a specific outcome measure [109]. Clinicians and researchers need to be aware of and address the limitations of perceptual measurement, for example, by using reference samples and reducing sources of variability [15]. Additionally, perceptual measures should be supplemented with instrumental measures of the same behaviours to increase reliability and precision of analysis [15]. Further studies using multiple levels of measurement (perceptual/instrumental and body function/participation) will strengthen our understanding of the relationship between measures and help evaluate the functional, meaningful impact treatment has on children with motor-based SSD.