Journal of Autism and Developmental Disorders, Volume 43, Issue 11, pp 2491–2501

Measurement Tools and Target Symptoms/Skills Used to Assess Treatment Response for Individuals with Autism Spectrum Disorder


  • Erin Elizabeth Bolte
    • Center for Children and Families, University of Notre Dame
    • Department of Molecular and Cellular Biology, University of Guelph
    • Center for Children and Families, University of Notre Dame
Original Paper

DOI: 10.1007/s10803-013-1798-7

Cite this article as:
Bolte, E.E. & Diehl, J.J. J Autism Dev Disord (2013) 43: 2491. doi:10.1007/s10803-013-1798-7

Abstract

This study examined the measurement tools and target symptoms/skills used to assess treatment response during Autism Spectrum Disorder (ASD) intervention trials from 2001 through 2010. Data from 195 prospective trials were analyzed. There were 289 unique measurement tools, of which 61.6 % were used only once, and 20.8 % were investigator-designed. Only three tools were used in more than 2 % of the studies, and none were used in more than 7 % of studies. Studies investigated an average of 11.4 tool-symptom combinations per trial, with as many as 45 in one study. These results represent a lack of consistency in outcome measurements in ASD intervention trials. These findings highlight the need to set guidelines for appropriate outcome measurement in the ASD field.

Introduction

The dramatic increase in the prevalence of Autism Spectrum Disorder (ASD), most recently reported at one in 88 children in the United States (CDC 2012), has highlighted the need to understand the best treatment approaches. Until recently, most research on ASD measurement tools concentrated on accurate and early diagnosis rather than on assessing change during interventions. However, the question of which assessment tools to use has affected ASD research for many years (Campbell et al. 1991; Roseman et al. 2001; Kasari 2002; Lord et al. 2005; Matson 2007). An ever-increasing body of intervention data now exists, and with it a wide variety of measurement tools used to assess changes in symptoms or skills during treatment trials. This large number of tools complicates the comparison of results between ASD intervention studies (Magiati et al. 2011).

One challenge that researchers face is the heterogeneous nature of the symptoms displayed by individuals with ASD. Although the diagnostic categories of social interaction, communication/language, and stereotyped/repetitive behaviors (APA 2000) are most often targeted for improvement during therapies (e.g., Rogers and Vismara 2008), the vast range of symptom types and severity of symptom presentation makes outcome measurement difficult. For example, children diagnosed with ASD can exhibit social communication deficits in diverse areas such as social-emotional reciprocity, nonverbal communication behaviors used for social interaction, and developing and maintaining relationships that are appropriate to developmental level, among many others (Volkmar et al. 2004). Restricted, repetitive patterns of behavior, interests, or activities can vary greatly from child to child as well (Leekam et al. 2011). As an example, children may display excessive adherence to routines, echolalia, stereotyped motor movements, or an intense interest in a very specific topic, such as the educational pedigree of the United States presidents. Additionally, some of the most problematic symptoms frequently associated with ASD are not dictated by the diagnostic criteria, including (but not limited to) self-injurious behavior, sensory issues, aggression, irritability, insomnia, seizures, and anxiety (Levy et al. 2009). The diverse collection of core and associated behavioral symptoms results in a plethora of potential behaviors, any of which could be measured as study target symptoms.

An equally diverse assortment of measurement tools exists to assess these symptoms. Measurement tools range from broad to very narrow, and well-established to investigator-designed. Although the Autism Diagnostic Observation Schedule (ADOS; Lord et al. 2002, 2012) and the Autism Diagnostic Interview-Revised (ADI-R; Rutter et al. 2003) are the gold standard assessments for the diagnosis of ASD (Gray et al. 2008), no such comparable instruments exist for measuring response to interventions (Magiati et al. 2011). This review reports the wide range of assessment tools that have been used to measure changes in behavior during intervention trials. We provide a descriptive, quantitative representation of the measurement tools and target symptoms/skills used during ASD intervention trials in order to provide a framework for discussion and future clinical research on this population.

Purpose of this Study

The primary objective of this review was to identify and quantify the measurement tools and target symptoms/skills used in prospective intervention trials for individuals with ASD. We first examined the measurement tools used in these intervention trials, noting which measures were investigator-designed. We then evaluated the quantity and distribution of target symptoms/skills that were assessed by the measurement tools. Finally, we examined the total number of tool-symptom combinations used in these studies. These objectives were considered collectively for all ASD studies and also analyzed separately based on three different treatment classifications: psychological, pharmacological, and complementary/alternative medicine (CAM). We predicted that: (a) a large number of measurement tools would be used within and across fields of study, (b) a considerable portion of these would be investigator-designed or modified, (c) a large number and diversity of symptoms/skills would be targeted, given the heterogeneity of symptomatology in the autism spectrum, and (d) a correspondingly large number of tool-symptom combinations would be used per study.

Methods

The published research literature was systematically searched for studies on the treatment of ASD in human participants that were published from January 1, 2001 to December 31, 2010. The reason for this time restriction was twofold. First, it ensures that the data used in this review are recent and relevant to research currently being conducted. Second, it restricts the search to a manageable yet representative volume of data. The search was limited to the published literature accessible through the MEDLINE™ or PsycINFO® search engines. In MEDLINE™, the search was restricted to clinical trials, which are defined as the account “of a pre-planned clinical study of the safety, efficacy, or optimum dosage schedule of one or more diagnostic, therapeutic, or prophylactic drugs, devices, or techniques in humans selected according to predetermined criteria of eligibility and observed for predefined evidence of favorable and unfavorable effects” (NIH 2011). In PsycINFO®, the search was restricted to peer-reviewed journal articles, which are defined as “a publication that is authored by academics for a target audience that is mainly academic” (ProQuest LLC 2012). Studies were eligible if they listed the primary participants as having ASD(s), autism, Asperger syndrome, or Pervasive Developmental Disorder–Not Otherwise Specified.

Abstracts of these papers were then analyzed for inclusion in this study. If the abstract did not contain sufficient information to evaluate the inclusion criteria, the full article was obtained. To be incorporated in this review, studies met specific, predetermined inclusion criteria. These criteria were used to ensure adequate comparative value of the articles. First, the studies had to be a prospective trial of a psychological, pharmacological, or CAM intervention for the treatment of ASD. The studies were original research and implemented by the investigators themselves; reviews and meta-analyses were excluded from this sample. Additionally, the study had to be written in English, and it must have enrolled at least 20 participants with ASD.
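
As a rough illustration of how the inclusion criteria above could be applied programmatically, the following Python sketch filters a list of candidate studies. It is not the authors' actual screening procedure, which was carried out by reading abstracts and, where needed, full articles. The `Study` record, its field names, and the example entries are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical record for one candidate article; the field names are
# illustrative assumptions, not taken from the original review.
@dataclass
class Study:
    intervention: str        # "psychological", "pharmacological", "CAM", or other
    prospective: bool        # True for a prospective trial design
    original_research: bool  # False for reviews and meta-analyses
    language: str            # publication language
    n_asd_participants: int  # enrolled participants with ASD

ALLOWED_INTERVENTIONS = {"psychological", "pharmacological", "CAM"}
MIN_PARTICIPANTS = 20  # threshold stated in the text

def meets_inclusion_criteria(study: Study) -> bool:
    """Return True only if every stated criterion is satisfied."""
    return (
        study.intervention in ALLOWED_INTERVENTIONS
        and study.prospective
        and study.original_research
        and study.language == "English"
        and study.n_asd_participants >= MIN_PARTICIPANTS
    )

# Made-up candidates for demonstration only.
candidates = [
    Study("pharmacological", True, True, "English", 40),
    Study("psychological", True, True, "English", 12),  # too few participants
    Study("CAM", True, False, "English", 60),           # not original research
]
included = [s for s in candidates if meets_inclusion_criteria(s)]
```

In this made-up list only the first candidate passes; the other two fail on sample size and on originality, respectively.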

Once included, each trial was categorized as a psychological, pharmacological, or CAM intervention based upon the nature of the therapy. “Psychological” trials included interventions that attempted to modify cognition, behavior, or functional level. This broad category included psychological treatments such as cognitive, behavioral, or cognitive/behavioral therapy approaches, as well as other psychosocial, psychoeducational, and speech/language approaches with the same aims. “Pharmacological” trials included all medications that required a physician’s order to obtain. At this time, there are only two drugs approved for the treatment of irritability associated with ASD: aripiprazole and risperidone (Wink et al. 2010). Off-label use of all other drugs was included under the classification of pharmacological trials if a physician’s order was required. In cases where a trial involved both pharmacological and psychological treatments, the article was classified as pharmacological. “CAM” trials included all non-prescription dietary manipulations and nutraceuticals such as vitamins and supplements. In addition, therapies that directly stimulated the body, such as acupuncture, massage, and hyperbaric oxygen, were included in CAM. The trial of subcutaneous methylcobalamin, a form of vitamin B-12 (Bertoglio et al. 2010), was classified under CAM even though the substance required a prescription because it was administered by injection. We made this choice because the substance is essentially a vitamin and, based on our definitions, was better represented in the CAM category.
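
The classification rules above amount to a small precedence scheme, which can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the component encoding, the `kind` labels, and the exception set are hypothetical, and the handling of a CAM treatment combined with a psychological one is a guess, since the text does not address that case.

```python
# Hypothetical encoding of a trial as a list of treatment components.
# The "kind" values and the exception set are illustrative assumptions.
CAM_PRESCRIPTION_EXCEPTIONS = {"subcutaneous methylcobalamin"}

def classify_trial(components):
    """Return 'pharmacological', 'CAM', or 'psychological' for one trial.

    Each component is a dict with keys:
      name (str), kind ('drug', 'supplement', 'body', or 'psych'),
      needs_prescription (bool).
    """
    def is_cam(c):
        # Non-prescription supplements and body-stimulating therapies,
        # plus the single prescription exception named in the text.
        return (c["name"] in CAM_PRESCRIPTION_EXCEPTIONS
                or (c["kind"] in {"supplement", "body"}
                    and not c["needs_prescription"]))

    def is_pharm(c):
        return c["needs_prescription"] and not is_cam(c)

    # Any prescription treatment makes the whole trial pharmacological,
    # even when it is combined with a psychological treatment.
    if any(is_pharm(c) for c in components):
        return "pharmacological"
    # Assumption: a CAM treatment combined with a psychological one counts
    # as CAM; the source text does not cover this combination.
    if any(is_cam(c) for c in components):
        return "CAM"
    return "psychological"
```

For example, a trial combining risperidone with parent training would come out as pharmacological, while a trial of subcutaneous methylcobalamin alone would come out as CAM.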

The method sections of articles that met the inclusion criteria were analyzed for: (a) the measurement tools that were utilized to assess the target symptoms/skills, (b) the number of target symptoms/skills studied, and (c) the total number of tool-symptom combinations used in the study. In this review, the term “target symptom/skill” is used to mean any symptom or skill which a study focuses on as a measure of change. Any change in the target symptom/skill is determined by the measurement (or synonymously, assessment) tool. Consequently, only outcomes measured both before and after (or during) the trial with the intent of assessing change were considered target symptoms/skills. Characteristics that were used at baseline for trial-entry thresholds and group/population matching were not included as target symptoms/skills.

Target symptoms/skills were assigned to one of fourteen global categories, such as Communication/Language or Adaptive Functioning (all fourteen are described in Table 1). The fourteen categories cluster symptoms that are closely related in nature. This grouping resolves the issue of symptom/skill redundancy and overlap among closely related behaviors, thereby allowing overall analyses of target symptoms/skills. The investigators’ designation of target symptoms/skills as primary, secondary, tertiary, etc. could not be taken into account, as not all authors specified such a ranking. All analyses were conducted twice: first with all variables included, and second with only cognitive/behavioral symptoms (excluding symptoms in the Physical/Medical and Side Effect categories; see Table 1). This procedure gives a full picture of the overall patterns of symptoms/skills that were specifically measured for improvement in treatment trials. In addition, it ensures that the counts for studies that assessed a wide range of adverse events were not artificially inflated because those studies took such precautions.
Table 1

Descriptions of the fourteen categories for dividing the target symptoms/skills into subgroups

| Category name | Description |
|---|---|
| About the intervention | Target symptoms/skills that do not “stand alone,” meaning that they require context within the study or they can vaguely be applied to any random intervention, e.g., Improvement Overall and Satisfaction |
| Adaptive functioning | Target symptoms/skills that relate to self-care, daily living skills, and the ability to navigate one’s world |
| Autistic severity/symptom | Target symptoms/skills that refer to overall ASD symptoms |
| Behavior/dysfunctional behavior | Target symptoms/skills that concern behavior and behavior problems that are not directly associated with relationships. It excludes repetitive behavior, one of the core diagnostic features of ASD |
| Cognition | Target symptoms/skills that concern intelligence or specific aspects of cognitive functioning |
| Communication/language | Describes one of the three diagnostic criteria of ASD. Includes symptoms/skills that contained the words communication, language, or other key terms: words, gestures, vocalization, vocabulary, etc. |
| | Target symptoms/skills that relate to emotions, moods, and feelings |
| | Target symptoms/skills that do not directly measure the child, but instead monitor a change in the family or caregiver |
| | Target symptoms/skills that are both physical and behavioral in nature |
| Physical/medical^a | Target symptoms/skills that are bodily and constitutively exhibited (not behavioral) |
| Repetitive behavior/stereotyped interests | Describes one of the three diagnostic criteria of ASD. It includes repetitive behaviors and stereotypy |
| | Target symptoms/skills that concern the reception of or reaction to sensory information |
| Side effects^a | Target symptoms/skills that were monitored for safety, whether neurological, physical, or other. Most studies were very clear about what symptoms were being monitored for safety reasons |
| Social/social cognitive | Describes one of the three diagnostic criteria of ASD. It includes symptoms/skills that relate to the child’s ability to interact with people or his/her cognition of social interaction |

^a These categories were not included in analyses that examined cognitive/behavioral target symptoms and skills

For this paper, we also examined measurement tool–target symptom pairs. The purpose of this analysis was to examine how many different “variables” investigators were measuring in each study. For example, if a study measured hyperactivity (one target symptom) using three different measurement tools, then there were three measured “variables,” or three tool-symptom combinations. Conversely, if a study used one measurement tool to measure hyperactivity, irritability, anger, and language (four different target symptoms/skills), then there were four measured “variables,” or four combinations of tools and symptoms. Subscales of measurement tools were counted separately only if the study used the individual subscale paired with a defined target symptom or skill. Additionally, it was noted if the measurement tool was investigator-designed or modified for the specific study, as opposed to a published and/or marketed instrument.
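
The counting rule for tool-symptom combinations can be sketched in a few lines of Python. The snippet mirrors the two examples in the preceding paragraph (one symptom measured with three tools, and one tool measuring four symptoms). The study identifiers and tool names are made up, and treating a repeated identical pair within a study as a single combination is an assumption of this sketch.

```python
from collections import Counter

# Each record is (study_id, measurement_tool, target_symptom).
# All values are made up for illustration.
records = [
    ("study_A", "Tool 1", "Hyperactivity"),
    ("study_A", "Tool 2", "Hyperactivity"),
    ("study_A", "Tool 3", "Hyperactivity"),
    ("study_B", "Tool 1", "Hyperactivity"),
    ("study_B", "Tool 1", "Irritability"),
    ("study_B", "Tool 1", "Anger"),
    ("study_B", "Tool 1", "Language"),
    ("study_B", "Tool 1", "Language"),  # repeated pair, counted once
]

def combinations_per_study(rows):
    """Count distinct (tool, symptom) pairs within each study."""
    unique_rows = set(rows)  # drops exact duplicates within a study
    return Counter(study for study, _tool, _symptom in unique_rows)

counts = combinations_per_study(records)
print(counts["study_A"], counts["study_B"])  # prints: 3 4
```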

Results

Using the prescribed search criteria, the initial search for clinical trials involving the treatment of ASD yielded 397 MEDLINE™ and 2,367 PsycINFO® articles to be examined. Application of the specific eligibility criteria reduced this number to 195 (135 MEDLINE™ and 60 PsycINFO®) articles. The breakdown of the reasons for exclusion is provided in Table 2. Citations for these articles can be found in Appendix A in ESM. Of the total included articles, 52.8 % (103/195) were psychological, 31.8 % (62/195) were pharmacological, and 15.4 % (30/195) were CAM. In the following sections, we examine these results in the context of the measurement tools used, followed by the target symptoms/skills, and finally the tool-symptom combinations. In each section, we first present the overall results, followed by subcategories of analysis that are relevant to the section.
Table 2

Breakdown of the article exclusion process based on the six inclusion criteria

| Search engine | MEDLINE™ | PsycINFO® |
|---|---|---|
| Total articles considered | 397 | 2,367 |
| Not ASD specific | | |
| Not a treatment trial | | |
| Not a prospective design | | |
| Not an original work | | |
| Less than 20 participants | | |
| Overlap with MEDLINE™ search engine | | |

Measurement Tools

Overall Results

The articles yielded a total of 674 different measurement tools, including subscales that were analyzed independently and different editions of the same measurement tool. When subscales were collapsed, the number of measurement tools was reduced to 319. Further omission of repetitive editions/versions yielded 289 unique measurement tools (a complete list of these tools can be found in Appendix B in ESM). Of the 289 unique measurement tools, 61.6 % were used only once over the entire 10-year span. The average was 4.7 measurement tools per trial, and the median was four, not including subscales. The range was one to 21 measurement tools used per article. When only cognitive/behavioral measures were included (i.e., excluding the Physical/Medical and Side Effect categories from Table 1), the total number of measurement tools was 253, the mean number of measurement tools per trial was 3.5, and the range was zero to 21 (see Fig. 1).
Fig. 1

Percentage of articles that used a given number of unique measurement tools. Percentages are provided for all symptoms/skills measured, and also exclusively for cognitive/behavioral symptoms (excluding Side Effects and Physical/Medical symptoms)
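
The successive collapsing steps reported above (all tool entries, then subscales collapsed, then editions collapsed) can be illustrated with a short Python sketch. The entries below are made up, and representing each tool as a (name, edition, subscale) triple is an assumption made for the example, not the coding scheme used in the review.

```python
from statistics import mean, median

# Hypothetical entries: (study_id, tool_name, edition, subscale or None).
entries = [
    ("s1", "Scale X", "1st", "Irritability"),
    ("s1", "Scale X", "1st", "Hyperactivity"),
    ("s2", "Scale X", "2nd", None),
    ("s2", "Scale Y", "1st", None),
    ("s3", "Scale Z", "1st", None),
]

# Step 1: every distinct tool/edition/subscale entry counts separately.
with_subscales = {(name, ed, sub) for _s, name, ed, sub in entries}
# Step 2: collapse subscales, keeping editions apart.
subscales_collapsed = {(name, ed) for _s, name, ed, _sub in entries}
# Step 3: collapse editions to obtain unique tools.
unique_tools = {name for _s, name, _ed, _sub in entries}

# Unique tools per study (subscales and editions not counted separately).
tools_by_study = {}
for study, name, _ed, _sub in entries:
    tools_by_study.setdefault(study, set()).add(name)
per_study = [len(tools) for tools in tools_by_study.values()]

print(len(with_subscales), len(subscales_collapsed), len(unique_tools))
print(mean(per_study), median(per_study))
```

With these made-up entries the three counts shrink from 5 to 4 to 3, which is the same kind of reduction as the 674, 319, and 289 reported in the text.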

Analysis of the most-used measurement tools was based upon the 289 unique measurement tools (253 if only cognitive/behavioral symptom measurements are included). If a measurement tool was used more than once in a study, then it was only counted once. In other words, the results are not skewed by counting the analysis of measurement tool subscales. Overall, the most-used measurement tool was the Aberrant Behavior Checklist (ABC; Aman et al. 1985), used 5.0 % of the time (6.7 % when only cognitive/behavioral symptoms are included), followed by the Clinical Global Impression scales (CGI; Guy 1976) and the Vineland Adaptive Behavior Scales (VABS; Sparrow et al. 1984). These were the only tools used more than 3 % of the time across all studies (see Table 3, top), even if only cognitive/behavioral symptoms were considered.
Table 3

Most commonly used measurement tools, across all studies and by intervention classification

| Name of measurement tool^a | Pct of times used overall (% used to measure cognitive/behavioral symptoms/skills) |
|---|---|
| All studies | |
| Aberrant Behavior Checklist (Aman et al. 1985) | 5.0 (6.7) |
| Clinical Global Impressions (Guy 1976) | 4.6 (6.1) |
| Vineland Adaptive Behavior Scales (Sparrow et al. 1984) | 3.9 (5.3) |
| Videotape Observation (Investigator-designed) | 1.9 (2.5) |
| Bayley Scales of Infant Development (Bayley 1993) | 1.7 (2.3) |
| Psychological | |
| Vineland Adaptive Behavior Scales (Sparrow et al. 1984) | 6.3 (6.4) |
| Bayley Scales of Infant Development (Bayley 1993) | 3.9 (3.9) |
| Videotape Observation (Investigator-designed) | 3.6 (3.7) |
| Pharmacological | |
| Aberrant Behavior Checklist (Aman et al. 1985) | 10.1 (21.4) |
| Clinical Global Impressions (Guy 1976) | 8.8 (18.5) |
| Childhood Autism Rating Scale (Schopler et al. 1988) | 2.7 (5.6) |
| Complementary/alternative medicine | |
| Aberrant Behavior Checklist (Aman et al. 1985) | 4.7 (6.1) |
| Clinical Global Impressions (Guy 1976) | 3.9 (5.1) |
| Vineland Adaptive Behavior Scales (Sparrow et al. 1984) | 3.2 (4.0) |

Percentages provided for all symptoms/skills, and exclusively for symptoms/skills measuring cognitive/behavioral symptoms/skills

^a References provided in this table refer to the most-frequently used edition, although other editions might have been used in individual studies

Results by Intervention Classification

These analyses were repeated for the three intervention classifications (see Table 3, bottom). Among the 103 psychological intervention studies, 196 unique measurement tools were used at least once (189 if only cognitive/behavioral symptoms are considered). Of these tools, 65.3 % were used only once (64.0 % if only cognitive/behavioral symptoms are considered). Among the 62 pharmacological studies, 81 unique tools were used at least once (54 if only cognitive/behavioral symptoms are considered). Of the 81 unique tools in pharmacological studies, 49.4 % were used only once (51.9 % if only cognitive/behavioral symptoms are considered). Finally, 74 unique tools were used at least once among the 30 CAM clinical trials (61 if only cognitive/behavioral symptoms are considered). Of the 74 unique tools used in CAM trials, 62.2 % (or 62.3 % if only cognitive/behavioral symptoms are considered) were used only once. The number of tools identified from pharmacological studies (54 tools, only considering cognitive/behavioral measurements) was lower than the number of pharmacological articles (62 studies). This pattern was not observed in psychological or CAM articles, where the number of cognitive/behavioral tools identified (189 and 61, respectively) exceeded the number of articles (103 and 30, respectively). This finding suggests greater consistency and consensus in the choice of measurement tools in pharmacological studies.
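
Two of the quantities in this comparison can be sketched in Python: the share of tools used in only one study, and the ratio of cognitive/behavioral tools to articles. The `share_used_once` helper and its sample data are hypothetical; the tool and article counts in `tools_and_articles` are the figures reported in the paragraph above.

```python
from collections import Counter

def share_used_once(tool_uses):
    """Fraction of unique tools that appear in exactly one study.

    tool_uses: iterable of (study_id, tool_name) pairs; a tool used
    several times within one study is counted once for that study.
    """
    studies_per_tool = Counter(tool for _study, tool in set(tool_uses))
    used_once = sum(1 for n in studies_per_tool.values() if n == 1)
    return used_once / len(studies_per_tool)

# Made-up usage data: tools A and D appear in two studies, B and C in one.
sample = [("s1", "A"), ("s2", "A"), ("s1", "B"),
          ("s3", "C"), ("s3", "C"), ("s2", "D"), ("s3", "D")]
print(share_used_once(sample))  # prints: 0.5

# (cognitive/behavioral tools, articles) as reported in the text.
tools_and_articles = {
    "psychological": (189, 103),
    "pharmacological": (54, 62),
    "CAM": (61, 30),
}
ratios = {k: tools / articles for k, (tools, articles) in tools_and_articles.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} tools per article")
```

Only the pharmacological ratio falls below one tool per article, which is the observation the text reads as a sign of greater consistency.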

Investigator-Designed and Modified Measurement Tools

Investigator-designed and modified measurement tools were of particular interest because they are harder for other research groups to replicate. Of the 289 unique measurement tools identified, 20.8 % were investigator-designed or modified. If only the 253 cognitive/behavioral measurements are considered, then 21.4 % were investigator-designed or modified. The use of investigator-designed tools was similar across psychological studies (18.9 % of all measurement tools used, 19.6 % if only cognitive/behavioral symptoms/skills are considered), pharmacological studies (18.5 % of all tools used, and likewise 18.5 % of tools measuring cognitive/behavioral symptoms/skills), and CAM studies (21.6 % of all tools used, 21.3 % of tools measuring cognitive/behavioral symptoms/skills).

Target Symptoms/Skills

The 195 articles yielded 611 unique target symptoms or skills, which were divided into fourteen categories based upon the nature of the symptom (Table 1). Across the categories, Social/Social Cognitive contained the largest variety of target symptoms/skills (21.2 % of total target symptoms), followed by Communication/Language (11.6 %) and Cognition (9.7 %). When the Physical/Medical and Side Effect categories were excluded, these shares rose to 25.2 % for Social/Social Cognitive, 13.9 % for Communication/Language, and 11.5 % for Cognition. It is not surprising that a large number of studies examined these variables, because they are thought to be central to the characterization of ASD; however, the sheer number of different characterizations of those symptoms and skills highlights the problem of consistency across studies in these vital categories.

Specific Target Symptoms/Skills

Because of inconsistencies in the naming of target symptoms and skills, we selected a small number of outcomes that are specific in nature in order to highlight the different tools that were used to measure a single target symptom/skill. Unlike many symptoms/skills, those chosen were not referred to by other names. For example, a plethora of related target symptoms/skills were identified for expressive language: expressive language, expressive language in object labeling, expressive language semantics, expressive language syntax, expressive vocabulary, vocalizing, spontaneous language, and verbal/expressive communication. Five specific symptoms/skills were identified that were not characterized by similar names or other overlapping target symptoms/skills: Aggression, Anxiety, Compulsion, Eye Contact, and Hyperactivity. Table 4 displays the number of different measurement tools used to measure these target symptoms/skills. Even though these symptoms/skills are specific in nature, multiple measurement tools were still used to assess change. In particular, Eye Contact was assessed on eight different occasions across the 195 articles. On each of those occasions, it was measured by a different measurement tool, and seven of the eight tools were investigator-designed. Another interesting example was the target symptom Hyperactivity, which was measured on 62 different occasions by 14 different tools. One of these tools was the ABC, the most-used tool across the 195 articles. The ABC was used to measure Hyperactivity 35 out of the 62 times (56.5 %). The remaining 43.5 % of tool-symptom combinations assessing Hyperactivity were measured by the other 13 tools, five of which were investigator-designed. These examples demonstrate the surprising number of different measurement tools that are being used to assess specific target symptoms/skills, in addition to displaying the prevalence of investigator-designed tools in making these measurements.
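
The three per-symptom quantities discussed above (occasions, distinct tools, and investigator-designed tools) can be derived from a list of tool-symptom combinations as in the sketch below. The rows and tool names are invented for illustration; only the 35-of-62 figure for the ABC and Hyperactivity comes from the text.

```python
def summarize_symptom(rows, symptom):
    """Summarize one target symptom/skill.

    rows: one (symptom, tool, investigator_designed) tuple per
    tool-symptom combination.
    """
    selected = [(tool, inv) for s, tool, inv in rows if s == symptom]
    return {
        "occasions": len(selected),
        "tools": len({tool for tool, _inv in selected}),
        "investigator_designed": len({tool for tool, inv in selected if inv}),
    }

# Invented rows; tool names are placeholders.
rows = [
    ("Eye contact", "Observation A", True),
    ("Eye contact", "Observation B", True),
    ("Eye contact", "Published scale", False),
    ("Hyperactivity", "Checklist 1", False),
    ("Hyperactivity", "Checklist 1", False),
    ("Hyperactivity", "Observation A", True),
]
print(summarize_symptom(rows, "Eye contact"))
print(summarize_symptom(rows, "Hyperactivity"))

# Share of Hyperactivity measurements made with the ABC, from the text.
abc_share = round(100 * 35 / 62, 1)
other_share = round(100 - abc_share, 1)
print(abc_share, other_share)  # prints: 56.5 43.5
```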
Table 4

Number of different measurement tools used to assess five specific target symptoms/skills

| Target symptom/skill | Number of times the target symptom or skill was investigated^a | Number of different tools used to measure the target symptom or skill | Number of tools that were investigator-designed or modified |
|---|---|---|---|
| Aggression | | | |
| Anxiety | | | |
| Compulsion | | | |
| Eye contact | 8 | 8 | 7 |
| Hyperactivity | 62 | 14 | 5 |

^a This number is the number of tool-symptom combinations associated with the given target symptom/skill

Tool-Symptom Combinations

We then examined the frequency and distribution of tool-symptom combinations across studies. For the 195 articles, there were a total of 2,227 tool-symptom combinations (1,861 cognitive/behavioral combinations). Recall that a tool-symptom combination is the combination of a target symptom/skill as measured by a specific measurement tool. For example, “Hyperactivity as measured by the ABC” is a tool-symptom combination. On average, 11.4 tool-symptom combinations were used per trial, and the median was 10 tool-symptom combinations per trial with a range of one to 45. For only cognitive/behavioral variables, the average number of tool-symptom combinations used per trial shifted to 9.5, the median to seven, and the range maximum of 45 remained unchanged. In fact, one in 15 articles still examined 20 or more tool-symptom combinations.

The percentages of tool-symptom combinations in each category across the three intervention types were examined (see Fig. 2). Psychological studies tended to focus on Social/Social Cognitive and Communication/Language symptoms; pharmacological studies tended to focus on Behavior/Dysfunctional Behavior, Repetitive Behavior/Stereotyped Interests, and Side Effects; and CAM studies tended to focus on Physical/Medical and Communication/Language symptoms. The three intervention classifications thus focused on different target symptoms/skills, indicating different outcome priorities across ASD investigations.
Fig. 2

Distribution of tool-symptom combinations across the fourteen target symptom/skill categories per the three intervention classifications. Psych psychological studies, CAM complementary and alternative medicine studies, Pharm pharmacological studies

Psychological studies averaged 10.0 tool-symptom combinations per study (9.8 when only cognitive/behavioral symptoms are considered). Pharmacological studies averaged 12.5 tool-symptom combinations per study (8.2 when measuring only cognitive/behavioral symptoms). CAM studies had an average of 14.1 tool-symptom combinations per study (11.5 when measuring only cognitive/behavioral symptoms). Figure 3 displays the percentage of articles that measured a given number of tool-symptom combinations (only cognitive/behavioral measurements) with respect to the three intervention classifications. Notably, 16.7 % of CAM studies included 21 or more tool-symptom combinations even when only cognitive/behavioral symptoms are included.
Fig. 3

Distribution of the number of tool-symptom combinations used per article, separated according to intervention type. Percentages include cognitive/behavioral symptoms/skills, but exclude Side Effects and Physical/Medical symptoms/skills as described in Table 1. Psych psychological studies, Pharm pharmacological studies, CAM complementary and alternative medicine studies
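
As a consistency check on the figures reported in this section, the per-classification averages can be pooled, weighting each by its number of studies, and compared with the overall totals (2,227 combinations, or 1,861 cognitive/behavioral combinations, across 195 articles). The sketch below uses only numbers stated in the text; the variable and function names are arbitrary. Small differences are expected because the per-classification averages are rounded to one decimal place.

```python
# Number of studies and mean tool-symptom combinations per study,
# as reported in the text.
n_studies = {"psychological": 103, "pharmacological": 62, "CAM": 30}
mean_all = {"psychological": 10.0, "pharmacological": 12.5, "CAM": 14.1}
mean_cb = {"psychological": 9.8, "pharmacological": 8.2, "CAM": 11.5}

def pooled_mean(means, counts):
    """Study-weighted mean across the intervention classifications."""
    total = sum(counts.values())
    return sum(means[k] * counts[k] for k in counts) / total

pooled_all = pooled_mean(mean_all, n_studies)
pooled_cb = pooled_mean(mean_cb, n_studies)

# Overall means implied by the reported totals.
overall_all = 2227 / 195
overall_cb = 1861 / 195

print(f"all symptoms: pooled {pooled_all:.2f} vs overall {overall_all:.2f}")
print(f"cognitive/behavioral: pooled {pooled_cb:.2f} vs overall {overall_cb:.2f}")
```

Both pooled values agree with the overall means to within rounding error, so the per-classification and overall figures are mutually consistent.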

In addition, we examined the use of investigator-designed tools in tool-symptom combinations. Investigator-designed tools measured 16.1 % of all tool-symptom combinations (17.0 % if only tool-symptom combinations that used a cognitive/behavioral measure are counted). The use of investigator-designed tools differed across the three study classifications, and was higher in psychological studies (20.7 % of all tool-symptom combinations, 21.1 % of cognitive/behavioral combinations) and CAM studies (19.2 % of all tool-symptom combinations, 19.5 % of cognitive/behavioral combinations) than in pharmacological studies (only 8.4 % of all tool-symptom combinations, 6.5 % of cognitive/behavioral combinations).

Additionally, the exact use of the investigator-designed tools was analyzed in the context of the nature of the measured target symptom or skill. For example, tools could measure a specific phenomenon (e.g., eye contact, child-to-adult communicative initiations that received an adult’s response, picture to object correspondence, contact/touching adult, and syllable identification through lip reading), or broad observations (e.g., social interaction, communication, behavior, core autistic features, and cognitive ability). Measurements of specific skills represented approximately half of investigator-designed tool-symptom combinations, whereas measurements of broader categories represented approximately one-third. The remainder of the investigator-designed tool-symptom combinations (approximately one-sixth) could not be readily classified as specific or broad in nature.

Discussion

ASD is a neurobehavioral disorder characterized by impairments in communication/language, repetitive/restrictive behavior, and social interaction. Given the rising prevalence of ASD, it has become increasingly important for researchers to focus on the development of appropriate treatments for ASD. Unfortunately, there is little consensus on what measurement tools and target symptoms/skills to use in order to evaluate the efficacy of interventions (Roseman et al. 2001). Compounding this problem is the fact that ASD is an extensive spectrum of diagnostic and associated symptoms. These features range from narrowly-defined characteristics to broad observations of behavior that are difficult to delineate and measure. Any or all of these multidimensional symptoms could be subject to intervention. In this study, we attempted to quantify these issues to highlight the magnitude of these problems. In our survey of clinical trials in the previous decade, we found that there were a large number of measurement tools used (over half of which were only used once), and a remarkable diversity of target symptoms/skills, both of which contributed to a high number of tool-symptom combinations used per study. In the remainder of the discussion, we consider these findings, and suggest future directions for clinical research.

Measurement Tools

Of the 289 unique measurement tools used in studies, 61.6 % were used only once over the course of 10 years. Only three tools (ABC, CGI, and VABS) were used in more than 2 % of the studies, and none were used in more than 7 % of studies. Although we had predicted the review would result in a large number of unique measurement tools, the percentage of tools that were used only once highlights the need for a consensus on appropriate measures of symptom/skill change in order to facilitate comparisons between approaches, and between studies of the same approach. In the case of similar or competing therapies that attempt to address the same target symptoms, it is quite difficult to compare results between the trials if the investigators use different measurement tools to assess treatment response (Virués-Ortega 2010). The sheer number of unique measurement tools was higher than anticipated, particularly given the very restrictive inclusion criteria we used in this study. We only included studies with large sample sizes (20 or more participants with ASD), which excluded a substantial number of small trials and single case studies that might have greatly expanded the number of discovered measurement tools. Certainly, small trials and case studies are an important part of intervention research, and outcome measurement is equally important for them. Given their exclusion, this study provides a conservative estimate of the measurement tool problem, and even this estimate highlights the scope of the problem.

On the whole, there was little overlap between the three types of intervention classifications with respect to what measurement tools were used. The tools that were used most often across all intervention classifications were the ABC, the CGI, and the VABS. Of these, the VABS was most popular among psychological studies, whereas pharmacological and CAM studies tended to use the ABC and CGI. Pharmacological studies demonstrated considerably greater consistency in their choice of measurement tools than psychological or CAM studies. Whereas psychological and CAM studies did not use any single measure in more than 7 % of the studies, pharmacological studies used the ABC 21.4 % of the time and the CGI 18.5 % of the time (when considering only cognitive/behavioral tools). Even within the different fields of study, however, no single measure was used in more than 25 % of the studies within that field.

Investigator-designed tools represented one-fifth of the total number of unique measurement tools. Of these investigator-designed tools, approximately half were used as direct observations of specific phenomena (e.g., eye contact). One likely reason for their use is that standard measures of the desired symptom/skill may not exist, or may exist but not assess change in the symptom or skill with satisfactory specificity. It also should be noted that the use of investigator-designed tools can represent innovation in the measurement of symptoms/skills, and can be useful when examining the same symptom or skill using a multiple-levels-of-analysis approach (e.g., anxiety as measured by psychophysiological responses and self-report).

Still, many of the studies we reviewed used investigator-designed tools as the primary (or only) measure of a single symptom or skill (and in some cases very broad skills). This approach can be problematic for replication and cross-study comparisons. Measuring a broad symptom or skill (such as “behavioral problems” or “autistic features”) with an investigator-designed tool is questionable when established measurement tools are commonly used in other trials to assess the same symptom/skill.

Overall, it is likely that a number of factors contribute to the inconsistency in the choice of measurement tools. First, the inconsistency may reflect the absence of wide agreement on which assessment tools should be used to measure target symptoms/skills. It also may suggest that there are not enough readily available, appropriately sensitive, or standardized tools to measure ASD symptomatology. The ADOS and ADI-R are the gold standard assessments for the diagnosis of ASD (Gray et al. 2008), but these tools were not created to measure subtle improvements in behavioral change (Lord and Corsello 2005). Although the availability, sensitivity, and standardization of tools were beyond the scope of this review, these issues have been raised in previous work and are certainly worth noting in the context of these findings. Examples of studies that analyze the validity of specific measurement tools used in ASD research include Teal and Wiebe 1986; Garfin et al. 1988; Sevin et al. 1991; Aman et al. 1995; Conners et al. 1998; Cohen et al. 2003; and Cohen 2003. It should also be noted that there are different approaches to assessment, both within a particular field of study, and across different fields of study. Still, the extent of the differences in the measures used (even within the same general field) was striking, and we maintain that developing a consensus on measures of common symptoms (in addition to the important psychometric properties of a measure) is an important issue in the field.

Target Symptoms/Skills

We identified 611 unique target symptoms and skills. A large variety of symptoms/skills was especially noticeable in categories measuring social behavior and communication/language, which are core symptoms of the current and future ASD diagnostic categories. In addition to the diverse array of symptoms/skills, it was also noted that individual symptoms/skills were assessed using a variety of measurement tools. For example, in the eight studies that examined eye contact, eight different measurement tools were used to measure eye contact, seven of which were investigator-designed.

Why do we have so many target symptoms/skills? Because of the rapid growth of ASD research, the spectrum nature of ASD symptomatology, and disagreements about core symptoms, it is reasonable to expect a broad range and large number of target symptoms/skills across ASD studies. One reason is the broadly inclusive nature of symptom categories, given the complexity of the behaviors that define the disorder. Because of this, a wide variety of symptoms can be expected, to a certain extent. By nature, “social interaction” is a diverse construct, composed of generalized terminologies. However, the generalized terminology that contributes to the problem is used in the Diagnostic and Statistical Manual of Mental Disorders, 4th Edition, Text Revision (DSM-IV-TR; APA 2000) and the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10; WHO 2008), which provide the criteria used to characterize the populations in these studies. In the DSM-IV-TR criteria for Autistic Disorder, there are numerous examples of behaviors that are classified as abnormal social interaction, including nonverbal behavior (e.g., eye gaze, gestures, facial expressions), development of peer relationships, and social-emotional reciprocity, all of which contain a diverse set of behaviors that can represent characteristic deficits. Moreover, children vary greatly in the extent to which they exhibit these symptoms. There has been an attempt in the DSM-5 (APA 2012) to streamline the diagnostic criteria, but it is necessarily difficult to create precise criteria for constructs such as social interaction and communication.

Tool-Symptom Combinations

We believe that a number of factors contributed to the large number of tool-symptom combinations used in the studies we reviewed. The lack of consensus on measurement tools and the diversity of symptoms/skills that could be targeted in an intervention both contribute to a relatively large number of tool-symptom combinations per study. It is important to note that an excess of tool-symptom combinations can be problematic because each additional outcome comparison increases the likelihood of Type I statistical errors (false-positive findings). Certainly, there are a number of circumstances in which it is understandable that a researcher would choose to have a larger number of measurements. In particular, in a growing and relatively young field that lacks guidelines, it is important to be sure that investigators can capture symptom/skill change. Still, it is important for the field to move toward a context where symptom/skill change can be captured without compromising statistical analysis.
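
The scale of the Type I error concern can be illustrated with a short calculation. The sketch below is illustrative only and was not part of the analysis in this review. It assumes that every tool-symptom combination is tested separately and independently at a significance level of 0.05 with no correction for multiple comparisons, which will not hold exactly in real trials because outcome measures are usually correlated. The function names are hypothetical. The values 11.4 and 45 are the mean and maximum number of tool-symptom combinations per trial reported in this study.

```python
# Illustrative sketch: familywise Type I error rate under the ASSUMPTION of
# independent, uncorrected tests at a fixed alpha. Function names are
# hypothetical and chosen only for this example.

def familywise_error_rate(num_tests: float, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across num_tests
    independent tests, each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_alpha(num_tests: float, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the familywise error rate
    at or below alpha (Bonferroni correction)."""
    return alpha / num_tests


# 1 = a single outcome; 11.4 = mean and 45 = maximum number of
# tool-symptom combinations per trial reported in this review.
for m in (1, 11.4, 45):
    fwer = familywise_error_rate(m)
    per_test = bonferroni_alpha(m)
    print(f"{m:>5} combinations: familywise error rate = {fwer:.3f}, "
          f"Bonferroni per-test alpha = {per_test:.4f}")
```

Under these assumptions, a trial with the average number of combinations has roughly a 44 % chance of at least one false-positive result, and a trial with 45 combinations has roughly a 90 % chance. A Bonferroni correction is shown only as one familiar remedy. It is conservative when outcomes are correlated, so prespecifying a small set of primary outcomes may be preferable.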

Conclusions and Future Directions

Greater consistency in the use of measurement tools in ASD clinical trials is a worthwhile and achievable goal. With the release of the DSM-5 and ICD-11, there is an opportunity for workgroups to address the numerous issues related to measuring outcomes in ASD intervention studies. As a part of this process, it would be useful to establish a guide for researchers on the benefits and drawbacks of existing measures, and to generate a roadmap for the creation of a set of measures that would serve as the gold-standard for outcome measurement. Guidelines would also be useful as a set of standards for the review process of academic journals and granting agencies. We believe that this work would facilitate the reproduction of results, allow for the comparison of data between studies within and across fields, and decrease the number of tool-symptom combinations used in treatment studies.

Acknowledgments

The study was supported in part by the Glynn Family Honors Program at the University of Notre Dame and a grant from the Boler Family Foundation (J. Diehl, P.I.). We would like to thank Ellen Bolte, R.N., for her thoughtful comments as we developed this review, and Stephany Mazur for her feedback during the editing process.

Conflict of interest

The authors declare that they have no conflict of interest.

Supplementary material

10803_2013_1798_MOESM1_ESM.doc (330 kb)
Supplementary material 1 (DOC 330 kb)

Copyright information

© Springer Science+Business Media New York 2013