Measurement of core and associated features of ASD is complicated by developmental and phenotypic heterogeneity. There is an absence of reliable, sensitive endpoints for measuring clinically relevant changes in core symptoms of the disorder (Anagnostou et al. 2015; Aman et al. 2015; McConachie et al. 2015), which limits the development and evaluation of novel treatments that target core symptoms of ASD.

In recent reviews of rating scales for use as clinical endpoints, scales were classified on their clinical relevance and psychometric properties (Anagnostou et al. 2015; Lecavalier et al. 2014; Scahill et al. 2015). Examples of measures deemed to show high relevance for ASD and reasonably strong psychometric properties included: Child and Adolescent Symptom Inventory Anxiety Domain (CASI-Anxiety) (Hallett et al. 2013; Sukhodolsky et al. 2008), Repetitive Behavior Scale—Revised (RBS-R) (Bodfish et al. 1999), and the Aberrant Behavior Checklist (ABC) (Aman et al. 2004; Aman and Singh 2017). However, these scales tend to focus on specific features of ASD or were not designed to capture core features of ASD with granularity. Therefore, two or more assessment approaches might be needed to measure all relevant concepts in ASD. Additionally, not all scales are suitable for use in children and adults. For example, the Autism Impact Measure (Kanne et al. 2014; Mazurek et al. 2018) has been recently developed to assess core autism symptoms in children with ASD.

There remains a need for efficient scales that can document short-term sensitivity to change in core and associated symptoms of ASD, and that are validated in well-characterized samples of individuals with ASD across a range of ages, representative of participants who might be involved in clinical trials.

To address this gap, the Autism Behavior Inventory (ABI) was recently developed as a novel, web-based, parent or caregiver rating scale for assessing ASD core symptoms and associated behaviors over a 1-week recall period (Fig. 1). Our aim was to create a psychometrically sound and sensitive outcome measure for ASD clinical trials and other interventional studies. The scale is intended to be suitable for caregivers of people with ASD aged 3 years through adulthood. The design, development, and initial psychometric properties of the ABI v 1.0 and the short form (ABI-S) are described elsewhere (Bangerter et al. 2017).

Fig. 1 Sample ABI

In brief, the ABI was developed through an iterative process involving public-health and clinician experts, statistical validation, and parent feedback (Fig. 2). The clinical experts provided input to conceptualize the ABI by generating items, refining item wording, and evaluating completeness of item coverage across ASD domains; they also performed initial assessments of clarity and readability. After selection, the items were assigned to groups, forming domains and sub-domains of the ABI, which were confirmed with factor analysis in a sample (n = 353) of online survey responses. Evaluation of the reliability and validity of the 93-item ABI in a small clinical sample (n = 30) resulted in a reduction of items to a 73-item scale (Bangerter et al. 2017) and a 35-item short form. Short form items were selected based on their statistical performance and on clinical expert feedback obtained through a Delphi process, which required consideration of items that were likely to be of significance to caregivers and most likely to show signs of change in the short term. Following consultation with the Food and Drug Administration (FDA) and a series of cognitive interviews aimed at assessing caregiver understanding and perceived relevance of items, the scales were further refined and reduced (Pandina et al. 2018). The ABI v 1.1 is a 62-item scale, and the ABI-S v 1.1 has 24 items.

Fig. 2 Development of the Autism Behavior Inventory (ABI) Scale

The current observational study was designed to evaluate psychometric properties of the ABI (Table 1) and ABI-S in a larger, independent cohort of individuals with ASD (n = 144).

Table 1 Autism Behavior Inventory

Methods

Ethical Practices

Institutional Review Boards approved the study protocol and its amendments. Participants, their parents (for participants < 18 years old), or legally authorized representatives provided written informed consent before participating in the study. Minors who were participants provided assent. The study was registered at clinicaltrials.gov, NCT02299700.

Study Samples

The study enrolled males and females aged ≥ 6 years with a confirmed diagnosis of ASD based on clinical examination including the Autism Diagnostic Observation Schedule, 2nd edition (ADOS-2) (Lord et al. 2012). These participants were requested to maintain ongoing behavioral and/or pharmacologic treatments during the course of the study, and it was expected that some changes would be seen in behaviors over the 8–10 week period as a result of these interventions or other prevailing events. Participants either lived with a parent or primary caregiver, or spent at least 3 h a day on at least 4 days each week, or at least three weekends a month, with a parent or primary caregiver. Other components of the broader study (Ness et al. 2019) required participation in a biosensor task battery. The resulting exclusion criteria included a measured composite score on the Kaufman Brief Intelligence Test-2 (KBIT-2) (Kaufman and Kaufman 2004) of < 60 during screening (or other recent IQ evaluation), history of or current significant medical illness, and psychological and/or emotional problems that would render informed consent invalid or limit participant ability to comply with study requirements, based on clinical judgment.

The study also enrolled volunteer control participants through advertising across all sites. This sample included typically developing (TD) males and females aged ≥ 6 years with a score in the normal range on the Social Communication Questionnaire (SCQ), no DSM-5 defined major mental health disorder as assessed by the MINI-Kid v7.0.0 (Sheehan et al. 2010), no significant medical illness, and no current use of psychotropic medication. This TD cohort provided normative data for comparison with ASD participants.

The study population comprised 144 participants with ASD and 41 TD participants. The majority were male (ASD 77.8%; TD 65.9%), consistent with the higher male:female ratio in ASD (Loomes et al. 2017); their mean age was 15 years (Table 2). For the ASD participants, the mean (SD) ADOS Calibrated Severity Score (CSS) was 7.6 (1.7) and the mean (SD) IQ was 99.2 (19.6); all were verbal, based on parent report of language ability. Mean ABI scale scores at baseline showed clear differences between the ASD and TD cohorts for all domains, consistent with expectations (Table 1), and between younger (≤ 10 years old) and older (> 11) ASD participants for the Self-Regulation domain (Fig. 3).

Table 2 Participant characteristics
Fig. 3 Mean ABI Scale Scores for ASD and TD participants at baseline based on Caregiver Responses to ABI

Study Design

This was a non-intervention, multicenter study, conducted from 06 July 2015 to 14 October 2015 at 9 study sites in the US.

The study consisted of a 14-day screening phase followed by an 8-to-10-week data-collection phase during which parents/caregivers of ASD participants completed the ABI/ABI-S at baseline, 3–5 days later, at week 4, and study endpoint (8–10 weeks). Parents of TD participants completed the ABI at a single visit. Parents/caregivers of ASD participants completed the remaining instruments at baseline, midpoint, and study endpoint.

Instruments

Autism Behavior Inventory (ABI)

The ABI v 1.0 presented in the study consisted of 73 items across 5 domains, as follows: (a) Social Communication, (b) Restrictive Behaviors, and the co-occurring symptom domains of (c) Mood and Anxiety (items related to sadness, irritability, worry, and anxiety), (d) Self Regulation (inattentiveness, impulsiveness, overactivity, and sleep issues), and (e) Challenging Behavior (verbal and physical aggression, tantrums, absconding). Caregivers were asked to respond to items on 2 of 4 possible dimensions: Quality (how well behaviors are carried out), Context (the variety of situations in which the behaviors occur), Frequency, or Intensity (not present to very severe). “Quality and Context” and “Frequency and Intensity” are usually paired together. The data obtained in this study were used to evaluate the utility of 2 response dimensions and to evaluate the performance of items selected for the ABI v 1.1. The ABI (v 1.1) contains 61 of the 73 items, plus 1 new item (Table 4). The analysis presented here reflects the 61 retained items of the ABI (v 1.1).

Autism Behavior Inventory—Short Form (ABI-S)

The ABI-S contains a subset of 24 items drawn from each of the five domains in the ABI.

Aberrant Behavior Checklist (ABC)

ABC (Aman et al. 2004; Aman and Singh 2017) is a 58-item behavior rating scale used to measure behavior problems across five subscales: Irritability, Social Withdrawal, Stereotypic Behavior, Hyperactivity/Noncompliance, and Inappropriate Speech. Items are rated on a 4-point Likert scale (ranging from 0 [not at all a problem] to 3 [The problem is severe in degree]), with higher scores indicating more severe problems. The ABC has recently been validated for use in ASD (Kaat et al. 2014).

Zarit Burden Interview (ZBI)

ZBI—short version (Zarit et al. 1980) is a scale with 22 items designed to assess psychological burden experienced by caregivers. Items ask how the caregivers feel, and responses range from 0 to 4 (never to nearly always). The ZBI has been used to assess burden among caregivers of individuals with ASD (Cadman et al. 2012; Hérbert et al. 2000).

Social Responsiveness Scale (SRS-2)

Social Responsiveness Scale 2™ (SRS-2) (Constantino et al. 2003) identifies presence and severity of social impairment due to ASD. It contains 65 items intended to assess social communication and restricted and repetitive behaviors. Three forms are available, dependent on the age of the individual with ASD.

Child and Adolescent Symptom Inventory—CASI-Anxiety

CASI-Anxiety (Hallett et al. 2013; Sukhodolsky et al. 2008) is a 21-item anxiety scale that has been recommended as a possible outcome measure for anxiety symptoms in ASD (Lecavalier et al. 2014).

Repetitive Behavior Scale—Revised RBS-R Parent

RBS-R (parent) (Bodfish et al. 1999) is a 43-item report scale to indicate occurrence of repetitive behaviors and degree to which a behavior is a problem on a range from 0 to 3 (does not occur to severe problem).

Psychometric Analyses

Descriptive statistics were used to assess the measurement properties of the ABI, including evaluation of response variability and floor and ceiling effects. Single and dual response options were compared using Pearson’s correlation coefficient between domain scores based on the combined response options and domain scores based on the first response option alone. Internal consistency of each domain was assessed using Cronbach’s alpha and item-total correlations. A domain was generally considered to have adequate internal consistency if Cronbach’s alpha was > 0.70.
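As an illustration of the internal consistency criterion, the following is a minimal Python sketch of Cronbach’s alpha using the standard formula α = k/(k − 1) × (1 − Σ item variances / variance of the domain total). The function name and the ratings are hypothetical and are not study data; the software actually used for the analysis is not stated in this paper.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for one domain.

    `items` is a list of item columns; each column holds one score per
    respondent, and all columns have the same length.
    """
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Domain total per respondent: sum across the k items.
    totals = [sum(row) for row in zip(*items)]
    sum_item_variances = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - sum_item_variances / variance(totals))


# Hypothetical ratings: 3 items rated by 5 caregivers (not study data).
domain = [
    [0, 1, 2, 3, 3],
    [1, 1, 2, 3, 2],
    [0, 2, 2, 3, 3],
]
alpha = cronbach_alpha(domain)
# Apply the > 0.70 adequacy criterion stated in the text.
print(f"alpha = {alpha:.2f}, adequate = {alpha > 0.70}")
```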

Test–retest reliability between baseline and 3-to-5 days later was evaluated using the Intraclass Correlation Coefficient (ICC). An ICC value of 0.70 or greater was considered evidence of acceptable test–retest reliability for subscale means and for use in detecting group mean differences. The retest interval was selected as a compromise. A shorter interval between test and retest may increase the likelihood of memory effects; however, because the recall period for the ABI is one week, a short interval was chosen so that caregivers would be reporting on some of the behaviors within the same time frame as the original completion.
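The ICC form used is not specified here. As a sketch only, the code below assumes a two-way random-effects, absolute-agreement, single-measure ICC, that is, ICC(2,1) in the Shrout and Fleiss notation, computed from the usual ANOVA mean squares. The function name and the scores are hypothetical.

```python
def icc_absolute_agreement(pairs):
    """ICC(2,1) for test-retest data (an assumed choice of ICC form).

    `pairs` is a list of (test, retest) domain scores, one pair per caregiver.
    """
    n, k = len(pairs), 2
    grand = sum(a + b for a, b in pairs) / (n * k)
    row_means = [(a + b) / k for a, b in pairs]
    col_means = [sum(p[j] for p in pairs) / n for j in range(k)]

    # Sums of squares for subjects (rows), occasions (columns), and error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for p in pairs for x in p)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical domain scores at baseline and retest (not study data).
scores = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.3), (4.0, 3.8), (2.5, 2.6)]
icc = icc_absolute_agreement(scores)
print(f"ICC = {icc:.2f}, acceptable = {icc >= 0.70}")
```

Because this form measures absolute agreement, a systematic shift between test and retest lowers the coefficient even when the rank order of participants is unchanged.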

Scale-level convergent and discriminant validity were assessed by examining Pearson correlation coefficients between ABI domain scores and scores from related instruments at baseline. Convergent validity was established if at least moderate correlation (> 0.40) was observed between established measures and ABI scales hypothesized to measure the same or similar construct, and discriminant validity if correlations were lower than 0.40.
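The decision rule above can be written out as a short sketch. The 0.40 threshold comes from the text; the function names, the label strings, and the example values are illustrative assumptions.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def validity_label(r, same_construct, threshold=0.40):
    """Apply the stated rule to one ABI-domain / comparison-scale pair."""
    if same_construct:
        # Scales hypothesized to measure the same or a similar construct.
        return "convergent" if r > threshold else "not supported"
    # Scales hypothesized to measure different constructs.
    return "discriminant" if r < threshold else "not supported"


# Hypothetical scores for one ABI domain and one comparison scale.
abi_domain = [1.0, 2.0, 2.5, 3.0, 4.0]
comparison = [10, 14, 15, 19, 22]
r = pearson_r(abi_domain, comparison)
print(f"r = {r:.2f}: {validity_label(r, same_construct=True)}")
```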

Exploratory Analysis of Change Over Time

Though this was a non-interventional study, participants were instructed to continue treatment as usual, and so change in reported behaviors was measured between baseline and endpoint (8–10 weeks). Sensitivity to change was explored by comparing parent-reported change scores of participants whose health state did not change during this time with those of participants who showed improvement. Two definitions of improvement in health state were evaluated, each requiring improvement of at least one category on: (1) the SRS-2 Total Score severity category (within normal limits, mild, moderate, and severe) or (2) ZBI item 22 on overall caregiver burden (not at all, a little, moderately, quite a bit, and extremely). These measures were selected for comparison based on observed correlations between domains of interest. The magnitude of each within-group change was assessed using a paired t-test. Within-group effect sizes (ESs) were computed as the ratio of the mean change score to the pooled standard deviation of the change scores.
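The effect size and paired t statistic described above can be sketched as follows. Two points are assumptions rather than details from the text: the standard deviation is pooled across the anchor groups with weights of n − 1 per group, and change scores are signed so that positive values indicate improvement. All names and values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev, variance


def pooled_sd(groups):
    """Pooled SD of change scores across groups (weights: n - 1 per group)."""
    num = sum((len(g) - 1) * variance(g) for g in groups)
    den = sum(len(g) - 1 for g in groups)
    return sqrt(num / den)


def within_group_effect_size(changes, sd):
    """Mean change score divided by the (pooled) SD of change scores."""
    return mean(changes) / sd


def paired_t(changes):
    """Paired t statistic for the null hypothesis of zero mean change."""
    return mean(changes) / (stdev(changes) / sqrt(len(changes)))


# Hypothetical change scores for two anchor groups (not study data).
improved = [2, 4, 6]
unchanged = [-1, 0, 1]
sd = pooled_sd([improved, unchanged])
print(f"ES improved  = {within_group_effect_size(improved, sd):.2f}")
print(f"ES unchanged = {within_group_effect_size(unchanged, sd):.2f}")
print(f"t improved   = {paired_t(improved):.2f}")
```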

Results

Response Options

Pearson’s correlation coefficient between single and dual item responses was high (0.95–0.99) for each of the domains of the ABI. Since this suggested limited utility of the dual response, further analysis was based on scores generated from a single response option.

Internal Consistency

Internal consistency was high across domains, with Cronbach’s alpha ranging from 0.84 to 0.89 in the ABI (ABI-S: 0.69–0.79). Three items were identified through item-total correlations as having low correlation with their hypothesized domain score (r < 0.4 after adjusting for overlap); deleting each of them resulted in a higher coefficient alpha for the remaining items in its hypothesized domain. Two of these items, Shows inappropriate affection to unfamiliar people [ABI 24] and Attempts to harm him/herself [ABI 39], were identified previously as low prevalence behaviors but were maintained after review by clinical experts due to their seriousness when present. Wording changes were made to both of these items to provide clarification in future versions, as cognitive interviews also revealed potential confusion (Pandina et al. 2018). The third item was Has sleep problems, whose correlation with the Self-Regulation domain, adjusted for overlap, was 0.38. This item was moved to the Mood and Anxiety domain.
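The adjustment for overlap is commonly done by correlating each item with the sum of the other items in its domain, so that the item does not contribute to both sides of the correlation. The sketch below assumes that approach; the function names and ratings are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_total(items):
    """Correlate each item with the sum of the other items in its domain."""
    totals = [sum(row) for row in zip(*items)]
    correlations = []
    for col in items:
        # Remove the item's own score from the domain total (overlap adjustment).
        rest = [t - v for t, v in zip(totals, col)]
        correlations.append(pearson_r(col, rest))
    return correlations


def low_items(items, cutoff=0.4):
    """Indices of items whose corrected item-total correlation is below cutoff."""
    return [i for i, r in enumerate(corrected_item_total(items)) if r < cutoff]


# Hypothetical ratings: 3 items rated by 5 caregivers (not study data).
domain = [
    [1, 2, 3, 4, 5],
    [2, 3, 4, 5, 6],
    [5, 1, 4, 2, 3],
]
flagged = low_items(domain)
print("items flagged for review:", flagged)
```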

Test–Retest Reliability

Test–retest reliability of each domain score on the ABI 3–5 days after baseline was excellent, with ICC values ranging from 0.84 to 0.93. ABI-S test–retest reliability was good (0.77–0.88). Means did not change significantly between test and retest (Table 3).

Table 3 Test–retest correlations for all ABI/ABI-S Subscales based on Caregiver Responses to ABI

Convergent and Discriminant Validity

Pearson’s correlations between ABI domains and comparison instruments were strongly positive (Table 4), demonstrating good convergent validity between subscales. The numbers in Table 4 that appear in bold font demonstrate examples of pre-specified variables showing convergent validity for subscales assessing analogous constructs. Correlations between ABI domains and ADOS were small (Table 4).

Table 4 Pearson correlations between ABI domains and related instruments (N = 139 ASD participants)

Discriminant validity was generally established in that the correlations between analogous constructs exceeded correlations between non-analogous constructs. For example, the correlation between the CASI-Anxiety score and the Mood & Anxiety Domain (r = 0.77) exceeded correlations between the CASI-Anxiety score and the remaining ABI domains. An exception was that the SRS-2 Social Communication and Interaction domain was unexpectedly highly correlated with the Restrictive Repetitive Behaviors domain from the ABI (r = 0.68).

The ZBI Total Score was moderately correlated with all ABI domains except Social Communication. Pearson’s correlation coefficients for the ABI with the ABI-S domains are shown in the final row of Table 4. The relationship between the ABI-S and the other parent-rated scales was similar to that of the ABI (e.g. Core Symptoms ABI-S with SRS total 0.80, Mood & Anxiety ABI-S with CASI-Anx 0.76, Self Regulation ABI-S with ABC Hyperactivity and Non-Compliance 0.83).

Change Over the Duration of the Study

Supplemental Table 1 presents changes in ABI and other scales observed over the course of the study. A trend towards improvement was seen across all scales over the 8–10 week period.

Change Over Time

Changes in ABI scores between baseline and study endpoint were compared with changes in SRS Total Score severity category and changes in overall parent burden (ZBI item 22) in an exploratory analysis of change over time. Subscales responsive to improvement should have a large positive effect size for participants experiencing improvement and a smaller (close to 0) effect for those who did not experience change.
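As a sketch of how participants might be assigned to anchor groups before effect sizes are computed, the code below compares the ordered category at baseline with the category at endpoint. The category labels are those listed in the Methods; the function name, the group labels, and the assumption that a lower category always means improvement are illustrative.

```python
# Ordered from least to most severe, as listed in the Methods.
SRS2_CATEGORIES = ["within normal limits", "mild", "moderate", "severe"]
ZBI_ITEM22_CATEGORIES = [
    "not at all", "a little", "moderately", "quite a bit", "extremely",
]


def anchor_group(baseline, endpoint, categories):
    """Classify change on an ordered anchor by the shift in category index."""
    shift = categories.index(endpoint) - categories.index(baseline)
    if shift <= -1:
        return "improved"  # moved down at least one category
    if shift == 0:
        return "no change"
    return "worsened"


# Hypothetical participants (not study data).
print(anchor_group("severe", "moderate", SRS2_CATEGORIES))
print(anchor_group("a little", "a little", ZBI_ITEM22_CATEGORIES))
```

Under the expectation stated above, effect sizes would then be computed separately for the "improved" and "no change" groups and compared.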

Participants showing improvements in ASD severity based on category change in SRS-2 Total Scores showed analogous ABI domain score improvements in Core ASD Symptoms, Social Communication, and Restrictive Repetitive Behaviors (moderate to large within-group effect sizes of 0.63, 0.50, and 0.41, respectively) (Table 5). Similarly, participants showing improvements in overall burden based on category change in ZBI showed analogous ABI domain score improvements in Restrictive Repetitive Behaviors, Mood and Anxiety, Self Regulation, and Challenging Behavior (mild-to-moderate within-group effect sizes of 0.39, 0.27, 0.29, and 0.27, respectively). In both cases, these effects were not observed in groups with no documented change or in groups who had worsened.

Table 5 Summary of effect sizes of selected Patient Reported Outcomes at endpoint visit

Discussion

Internal consistency (α) was high for all ABI domains, and test–retest reliability was excellent based on established benchmarks (good = 0.64–0.74, excellent ≥ 0.75) (Cicchetti and Sparrow 1981). Strong positive correlations were observed with analogous parent-reported subscales, and mostly only moderate correlations with subscales assessing divergent constructs. Thus, the ABI and the ABI-S offer the potential to complete one instrument in place of the discrete alternatives commonly used in treatment outcome studies and clinical drug trials.

Analysis of response option performance indicated that scores based on a combination of 2 response options, such as frequency and intensity, were very closely related to scores based on a single option, which suggests that the second response may be redundant. We introduced the dual response options with the intention that this would result in increased sensitivity to change. While this is still possible, we cannot draw this conclusion based on the available observations. Given the increased burden to caregivers, essentially doubling the items on the scale for completion, and the potential for increased complexity, we ultimately selected a single response anchor: Quality or Frequency. Use of a single anchor response and possible response options were tested and received a favorable response from parents and caregivers in the cognitive interview study (Pandina et al. 2018).

The ABI-S also shows good psychometric properties. The intention is to use the short form of the ABI more frequently over the course of a clinical study to further reduce caregiver burden. Further data on change over time in response to intervention on the ABI compared to the ABI-S is required to determine which version is most useful as an outcome measure.

Our preliminary change over time analyses suggest that the ABI changes were consistent with corresponding changes across multiple categories in other parent-reported scales that occur over an 8–10 week period. This empirical, anchor-based approach is consistent with some FDA guidance for patient-reported outcome measures (FDA 2009). Based on observed correlations, the SRS-2 was selected as an appropriate anchor for the Social Communication and Restrictive Repetitive Behaviors domains, while parent burden assessed on the ZBI (Item 22 Overall Burden) was an appropriate anchor for the Restrictive Repetitive Behaviors, Mood and Anxiety, Self-Regulation, and Challenging Behavior domains, since it was correlated with these scales at baseline.

Changes in ABI scores were associated with changes of at least one severity category in SRS-2. Effect sizes for the group who improved exceeded 0.40 for both the Social Communication and Restrictive Repetitive Behaviors domains, whereas the largest effect size for participants whose SRS-2 severity did not change was 0.29. Reductions in the ABI were also associated with reductions of at least one category in parent burden, indicating that as symptoms were reducing, parent burden was reported as lower. This was an exploratory approach which aimed to link parent-reported change in child behavior to a meaningful quality-of-life indicator (in this case, level of burden felt by parents in caring for their children with ASD). In this group, burden was not related to Social Communication skills, but did relate to behaviors reported in other domains. However, we note that this approach is limited by the issue of “source or method variance” (Campbell and Fiske 1959; Podsakoff et al. 2003) (i.e. insofar as change is concerned, we cannot determine with certainty whether the parents were accurately reporting genuine alterations in behavior or perceived changes). We acknowledge the limitation, and we are currently evaluating the ABI’s performance in a placebo-controlled, randomized clinical trial of a rational therapeutic agent. This trial also includes clinician-reported measures, such as the Clinical Global Impressions Scale (CGI) (Arnold et al. 2000). In the meantime, these analyses suggest that the ABI is sensitive to change over time in a manner that is congruent with other clinical measures.

This study examined a well-characterized sample of participants with a clinical diagnosis of ASD confirmed by ADOS. However, there was a poor correlation between ABI, which is intended to measure changes in “states over time” based on parent observation in natural settings, and ADOS, a tool principally designed to capture patient “traits” and evaluate the presence/absence of ASD based on direct assessment (usually lasting an hour or less) in clinical settings. Discrepancies between parent report and direct assessment have been observed in other studies (see review by Achenbach et al. 1987; Kaat et al. 2014; Mirenda et al. 2010; Sturm et al. 2017), and the ADOS specifically (Mazurek et al. 2018). This further suggests that behaviors specific to autism and critical for diagnosis may not be the same as those that indicate changes in symptom severity over time. For example, the items in the ABI social communication domain may be more commensurate with measures of adaptive behavior.

The ABI was not developed as a diagnostic tool. It was designed to focus on behaviors that might be targets for change in ASD rather than those that might demonstrate greatest sensitivity and specificity for diagnosis. Therefore, we did not include comparison participants with intellectual disability or communication disorders, which were often a typical part of the validation process in the past for diagnostic scales. However, the ABI did show good discrimination between ASD and TD groups, suggesting that it can be used to define ASD symptom severity for use as an inclusion criterion in clinical trials.

Taken together, our findings support use of the ABI as a clinical endpoint with the potential to identify and measure short-term change in parent-reported behaviors. Our methodological approach included statistical and clinical review of items, and careful selection and consideration of response scales to provide appropriate response options for parents (Fok and Henry 2015). The 1-week time period for reporting, compared to other scales with longer recall, may enhance suitability of the scale for this purpose.

The cohort in this validation study covered a broad range of participant ages and ASD severity levels. However, it is likely that, given other requirements of the study, this group of individuals had less extreme challenging behaviors, which would explain near floor effects in reported items such as elopement and physical aggression. The lack of representativeness of this group is the reason for retaining these items. Cognitive interviewing indicated the appropriateness of these items. Further psychometric validation in populations including more minimally verbal participants and those with a broader range of challenging behaviors is planned. In addition, our sample included only individuals over the age of 6 years, whereas the ABI items were designed to be suitable for children aged 3 years and above. There were also fewer individuals over 18, and the cohort was of average IQ, and predominantly Caucasian. Further studies with younger children and older adults, as well as a sample with greater diversity in race/ethnicity and IQ are also planned. Translation and validation of the ABI in other languages and cultures are also in progress.

Though the ABI has been used by different groups of raters, there are currently insufficient interrater reliability data between caregivers for statistical analysis. A clinician-rated version is in development and will be reported elsewhere. A self-report version for individuals capable of responding is also planned. The ability of individuals with ASD to self-report and how this differs from a parent perspective are both important to determine in future research.

The ABI and ABI-S are available without charge for academic, research, and professional use, subject to terms and conditions. They can be downloaded in the USA from https://www.janssenmd.com/ (in the tools/psychiatry section) and accessed outside the USA via email request to autismbehaviorinventory@its.jnj.com.

Limitations of the study include a modest-size sample (for psychometric purposes), reliance on existing interventions to monitor change, and the source or method-variance issue.

In summary, the ABI continues to demonstrate good psychometric properties—sound structure and good reliability and validity—in two clinical populations of individuals with ASD. There is some evidence of change in the short term, congruent with changes in other measures, which is critical for clinical endpoint assessments. The next line of investigation is the use of ABI as a parent-reported measure in ASD treatment studies.