From a pool of 19 studies, 7 were excluded, leaving 12 studies for summary and analysis. The appendix provides the details of excluded studies, while included studies are indicated in the reference list with an asterisk. Table 1 provides summaries of the participants, target skills, procedures, main findings, and certainty of evidence for each of the 12 included studies.
Within these 12 studies, a total of 147 participants received music therapy. Two studies did not report participants’ genders; the remaining studies included a combined total of 77 males and 8 females. Participant ages ranged from 3 to 38 years (M = 6.97 years). In the majority of studies, participants were between 3 and 5 years of age.
Sample sizes ranged from 1 to 50 participants (M = 12.25). Two studies included only 1 participant [Studies 4 and 9], three studies had 2 to 4 participants [Studies 1, 2, and 5], and four studies had 8 to 12 participants [Studies 3, 6, 7, and 8]. The remaining three studies had sample sizes of 22, 24, and 50 participants, respectively [Studies 10, 11, and 12]. All of the participants had been diagnosed with a type of ASD. One study specified that, of its 24 participants, 10 had diagnoses of autistic disorder, 12 had diagnoses of pervasive developmental disorder, not otherwise specified (PDD-NOS), and two had diagnoses of Asperger’s disorder [Study 11]. Four studies used the Childhood Autism Rating Scale (CARS; Schopler et al. 1988, 1998) to identify the severity of ASD for each participant [Studies 2, 5, 9, and 10]. Of the 57 participants in these four studies, 25 were categorized as having mild symptoms of autism, five as having mild/moderate symptoms, 26 as having moderate/severe symptoms, and one as having severe symptoms.
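The counts reported above can be cross-checked with simple arithmetic. The following Python sketch is illustrative only: the variable names are ours, and it uses only totals stated in the text. Per-study sample sizes are not all reported here, so they are not reconstructed.

```python
# Totals as reported in the text; variable names are illustrative.
total_participants = 147
n_studies = 12

# Mean sample size per study.
mean_sample_size = total_participants / n_studies

# Diagnostic breakdown reported for Study 11.
study_11_diagnoses = {
    "autistic disorder": 10,
    "PDD-NOS": 12,
    "Asperger's disorder": 2,
}

# CARS severity categories pooled across Studies 2, 5, 9, and 10.
cars_severity = {
    "mild": 25,
    "mild/moderate": 5,
    "moderate/severe": 26,
    "severe": 1,
}

study_11_total = sum(study_11_diagnoses.values())
cars_total = sum(cars_severity.values())

print(f"Mean sample size: {mean_sample_size:.2f}")
print(f"Study 11 diagnoses sum to {study_11_total}")
print(f"CARS categories sum to {cars_total}")
```

The three values agree with the reported mean of 12.25, the 24 participants in Study 11, and the 57 participants assessed with the CARS.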
Intervention settings were described for 7 of the 12 studies [Studies 1, 2, 4, 5, 6, 9, and 11]. Of these, one study was conducted at a private practice clinic [Study 6], one was conducted in a hospital [Study 11], one was conducted in participants’ homes [Study 1], three were conducted in a preschool [Studies 2, 4, and 5], and one was split between the participant’s home and a preschool [Study 9]. The research took place in a range of countries, including the USA, Canada, South Korea, Italy, Japan, and Brazil.
Target skills for intervention were coded into five categories: (a) decreasing undesirable behavior, (b) promoting social interaction and social communication, (c) improving independent functioning, (d) enhancing understanding of emotions, and (e) increasing verbal communication. Two studies, involving 11 participants [Studies 1 and 3], targeted decreasing undesirable behaviors, such as aberrant vocalizations, rewinding/fast forwarding video tapes, rummaging in the kitchen, or psychomotor agitation (although the authors did not operationally define psychomotor agitation). Social interaction and social communication were the broad foci for five studies, involving 49 participants [Studies 2, 6, 8, 9, and 11]. Specific targets in this category included increasing peer interaction and participation; facilitating joint attention behaviors and nonverbal communication skills; and increasing emotional, motivational, and interpersonal responsiveness in joint engagement.
Independent functioning featured in two studies, involving three participants [Studies 4 and 5]. The target skills involved increasing independent completion of multi-step tasks, such as hand washing, toileting, cleaning up, and performing a morning greeting routine at preschool. One study [Study 7], with 12 participants, focused on developing understanding of four emotions (i.e., happiness, sadness, anger, and fear) by measuring participants’ abilities to recognize facial expression in pictures and facially express corresponding emotions themselves. Finally, three studies, including 96 participants, focused on increasing verbal communication [Studies 10, 11, and 12]. Study 11 was the only study that was identified in two categories, namely verbal communication and social interaction/social communication.
Many of the studies implemented music therapy interventions featuring the use of specific songs with lyrics related to target skills [Studies 1, 2, 4, 5, 7, 9, 10, and 12]. Ninety-five of the 147 participants (65 %) received this type of intervention approach. Two studies used pre-composed songs that fit the purposes of the intervention, including a song about cleaning up, and children’s songs about emotions [Studies 4 and 7]. Three studies used adapted lyrics set to familiar melodies [Studies 1, 4, and 9], and six studies used originally composed lyrics and music [Studies 2, 4, 5, 7, 10, and 12]. In Study 1, a prescriptive song protocol was used to compose song lyrics based on social stories (Gray and Garand 1993). In Study 10, a video recording of the songs was made, which the participants watched in the intervention.
Several studies focused on music improvisation as the main music therapy approach [Studies 6, 7, 8, and 11]. Fifty-six of the 147 participants (38 %) received this type of intervention. Studies 6 and 8 divided improvised music therapy sessions into two halves. The first half involved following the child’s lead in musical play, which was then supported by the therapist. The second half was therapist-directed; that is, the therapist introduced modeling and turn-taking activities. In addition to pre-composed songs, Study 7 used recordings of piano improvisations to represent four emotions: happiness, sadness, anger, and fear. The recordings were then played as background music during verbal instruction for each emotion. Study 11 used relational music therapy, which was described as an approach in which sessions were mainly client-led and improvised activities were used. Study 3 involved active music therapy sessions including drumming, singing, and piano playing. However, it is unclear whether these were structured or improvisational activities.
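Because Study 7 used both song-based and improvisational components, the two participant counts overlap and do not sum to 147. The Python sketch below, offered as an illustrative check rather than part of the review's method, recomputes the reported percentages and the overlap. It assumes Study 7 (12 participants) is the only study counted in both groups, which is what the study lists above indicate; the variable and function names are ours.

```python
# Counts as reported in the text; names are illustrative.
total_participants = 147
song_based = 95      # Studies 1, 2, 4, 5, 7, 9, 10, and 12
improvisation = 56   # Studies 6, 7, 8, and 11
study_7 = 12         # assumed to be the only study counted in both groups


def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


song_pct = percent(song_based, total_participants)
improv_pct = percent(improvisation, total_participants)

# Participants covered by either approach, counting Study 7 only once.
covered = song_based + improvisation - study_7

# Participants in neither group (Study 3's approach was unclear).
unclassified = total_participants - covered

print(f"Song-based: {song_pct} %, improvisation: {improv_pct} %")
print(f"Covered by either approach: {covered}; remaining: {unclassified}")
```

The remainder of 8 participants is consistent with Study 3, which falls in the group of studies with 8 to 12 participants and whose activities could not be classified.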
Studies were classified as experimental or quasi-experimental. Quasi-experimental designs included A-B designs or a single-group design (Lang et al. 2012; Davis et al. 2013). Ten of the 12 included studies, involving a total of 138 participants, were classified as experimental [Studies 1, 2, 5–12]. Two studies, involving nine participants, were classified as quasi-experimental [Studies 3 and 4].
Four of the experimental studies used single-case experimental designs [Studies 1, 2, 5, and 9]. Studies 1 and 5 used an A-B-A-B design or modified A-B-A-B design, Study 2 used a multiple-baseline design, and Study 9 used an alternating treatments design with baseline and follow-up. The remaining experimental studies were randomized controlled trials, involving 74 participants [Studies 10 and 11], or repeated-measures designs with a control condition and counterbalancing, involving 54 participants [Studies 6, 7, 8, and 12]. Study 3 was classified as quasi-experimental because it utilized a pre-post measure without a control group, and Study 4 was classified as quasi-experimental because it seemingly employed an A-B design, although it was unclear whether baseline data were collected.
Follow-Up and Generalization
Only one study reported follow-up data after implementation of the intervention [Study 9]. In this study, follow-up took place two weeks after the intervention had ended. Two follow-up sessions, one week apart, were conducted. Additionally, Study 1 stated that the music therapist followed up with the families of each of the participants three weeks after completion of the intervention to obtain verbal reports of occurrences of target behaviors. None of the studies reported measures of generalization; however, the third phase of Study 9 appeared to have included an element of generalization. Specifically, in Phase B, an alternating treatments intervention was introduced, alternating between play sessions with three toys, and musical play sessions with three other toys. In Phase C, the music sessions were continued, as these sessions appeared to be the more effective of the two treatments. At this stage, the toys from the Phase B play session were used in the music sessions to see whether the positive effects observed in the Phase B music sessions would still be evident with the use of the other toys.
Reliability of Data and Treatment Integrity
Most of the studies reported assessing reliability of data collection using inter-observer agreement measures [Studies 1, 2, 4–11]. Of the inter-rater reliability data reported, most were above the generally accepted standard of 80 % agreement. It was unclear whether inter-rater reliability was collected for Study 7, but the authors stated that a researcher and three reliability observers matched photographs to emotions, with a criterion set at .75 for a photograph to be coded as a correct response. Study 11 reported a procedure where inter-rater agreement between two raters was determined by using the study’s dependent variables to rate seven children who were not part of the study. Measures of inter-rater reliability would have been appropriate in the two remaining studies [Studies 3 and 12], but such data do not seem to have been collected.
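For readers unfamiliar with the 80 % standard, the Python sketch below shows one common way to compute percent agreement between two observers: interval-by-interval agreement, that is, the number of intervals scored identically divided by the total number of intervals. This is an illustration only. The reviewed studies may have used other agreement formulas, and the function name and the data are hypothetical.

```python
def percent_agreement(observer_a, observer_b):
    """Interval-by-interval agreement: agreements / total intervals * 100."""
    if len(observer_a) != len(observer_b):
        raise ValueError("Both observers must score the same number of intervals")
    if not observer_a:
        raise ValueError("At least one interval is required")
    agreements = sum(a == b for a, b in zip(observer_a, observer_b))
    return 100 * agreements / len(observer_a)


# Hypothetical records for 10 intervals: 1 = behavior recorded, 0 = not recorded.
observer_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
observer_b = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]

agreement = percent_agreement(observer_a, observer_b)
meets_standard = agreement >= 80  # generally accepted minimum

print(f"Agreement: {agreement:.0f} %, meets 80 % standard: {meets_standard}")
```

In this hypothetical example the observers disagree on one of ten intervals, giving 90 % agreement, which would meet the standard.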
Treatment integrity data were reported in only one study [Study 2], in which teachers and peers were trained by the music therapist to implement a music intervention. The results varied, but the study reported that most teachers and peers demonstrated a high level of treatment fidelity. Some studies described the use of treatment protocols or guidelines [Studies 5, 6, 8, and 11].
Intervention outcomes were classified as positive, negative, or mixed, in accordance with the categories described by Lang et al. (2012). Seven of the studies (58 %), involving 99 of the total 147 participants (67 %), demonstrated positive outcomes [Studies 2, 5, 6, 8–10, and 12]. In these studies, either significant gains were found for the treatment condition relative to the control group or condition, or, for single-subject research designs, visual analysis of the data suggested improvement in all dependent variables for all participants. There were mixed results for the remaining five studies [Studies 1, 3, 4, 7, and 11], which involved a total of 48 participants.
Study 1 used an A-B-A-B design. The intervention effects appeared positive from baseline to intervention; however, no reversal of trends was observed in the second baseline for two of the three participants. In Study 3, significant improvements in dependent variables were observed for only some of the time periods. Study 4 demonstrated generally positive effects, but the results did not show one condition as consistently more effective than the other in the alternating treatments design. Generally positive effects were also observed in Study 7; however, the intervention conditions did not show evidence of significantly greater improvement compared with control conditions. A further analysis revealed that once participants’ pre-test scores were taken into account, the intervention conditions appeared to be more effective than the control conditions. Similarly, Study 11 did not demonstrate a significant improvement in the experimental group compared with the control group, but a further analysis showed a statistically significant improvement for a subset of the participants. Only participants in the experimental group with diagnoses of autistic disorder (rather than PDD-NOS or Asperger’s disorder) showed significant improvement compared with the control group.
Certainty of Evidence
The certainty of evidence was rated as insufficient, preponderant, or conclusive in accordance with Davis et al.’s (2013) definitions. Seven of the studies provided conclusive evidence [Studies 2, 5, 6, 8, 9, 10, and 11]. All of these except Study 11 had indicated positive outcomes. Three studies were rated as providing preponderant evidence [Studies 1, 7, and 12], and two studies were rated as providing insufficient evidence [Studies 3 and 4]. The preponderant ratings were due to the presence of confounding variables and possible carry-over effects [Study 1] and insufficient or absent inter-rater agreement data [Studies 7 and 12]. The two studies with insufficient evidence ratings were classified as such because they relied on quasi-experimental designs: Study 3 employed a pre-post test without a control group, while Study 4 employed what appeared to be an A-B design.