Increasingly, telemedicine-based assessment (i.e., tele-assessment) has been adopted as an approach for evaluation and diagnosis of autism spectrum disorder (ASD) in young children (Berger et al., 2021; Stavropoulos et al., 2022; Wagner et al., 2022). Over the past decade, tele-assessment has been described as a promising strategy for increasing access to diagnostic services (Juárez et al., 2018; Reese et al., 2013; Zwaigenbaum & Warren, 2020), particularly in the context of the recognized importance of early identification of ASD (Hyman et al., 2020). The COVID-19 pandemic hastened the uptake of ASD tele-assessment as service systems sought to meet clinical demand and family needs while in-person clinical care was paused (Dahiya et al., 2021; Stavropoulos et al., 2022; Wagner et al., 2020). This recent tele-assessment research has emphasized broad family and clinician satisfaction while also highlighting the need for ongoing studies of diagnostic accuracy, performance of tele-assessment tools, and characteristics of children and families best served by tele-assessment.

Initial research on ASD tele-assessment focused on either (1) caregiver administration of traditional ASD assessments tools (Reese et al., 2013, 2015) or (2) clinic-based tele-assessment that included a remote psychologist as well as a trained, on-site para-professional who completed assessment activities with the child (Juárez et al., 2018). These initial studies demonstrated feasibility and acceptability of tele-assessment, as well as diagnostic accuracy (Juárez et al., 2018) and system-level benefits related to reduced travel burden, decreased wait time, and a shift in referral patterns toward tele-assessment rather than assessment in tertiary care centers (Stainbrook et al., 2019). This early work also recognized barriers to large-scale replication, especially challenges related to involvement of multiple trained professionals or detailed training of caregivers to implement structured activities.

The next phase of tele-assessment investigation focused on the development and deployment of instruments specifically adapted or designed for clinician-guided, caregiver-mediated use (Corona, Wagner et al., 2021; Corona, Weitlauf, et al., 2021). Prior to the pandemic, our team published preliminary interim data from a clinical trial (NCT03847337) on the use of either the TAP (previously TELE-ASD-PEDS; Corona et al., 2020) or remote administration of the prompts of the Screening Tool for Autism in Toddlers and Young Children (STAT; Stone et al., 2000) in a clinical research setting (Corona, Weitlauf, et al., 2021). The remote STAT administration, an unstandardized use of the instrument beyond the scope intended by the authors (hereafter referred to as the TELE-STAT), was employed as an experimental procedure to benchmark accuracy of the TAP. These tools were also chosen for study based on their relatively brief administration time (i.e., approximately 15–20 min), accessibility of materials, use in previous work (Juárez et al., 2018; Stainbrook et al., 2019), and potential for clinicians to guide caregivers to complete activities without need for prior caregiver training or practice. Interim data highlighted caregiver satisfaction with caregiver-mediated tele-assessment procedures and noted diagnostic agreement of 86% when comparing remote clinicians’ diagnostic impressions to participants’ diagnoses from traditional in-person assessment.

At the onset of the COVID-19 pandemic, one of the tools included in this study, the TAP, was deployed in direct-to-home settings (i.e., within families’ homes, as opposed to tele-assessment completed within a medical or clinical research setting) and disseminated broadly across the United States and beyond. Analysis of clinician-reported data both within (Wagner et al., 2020) and external to our institution (Wagner et al., 2022) indicated that clinicians were able to make definitive diagnostic decisions via tele-assessment for a majority of children with a high degree of diagnostic certainty. Additional external studies of tele-assessment using the TAP highlighted caregiver satisfaction with tele-assessment procedures, including logistical convenience, positive interactions with clinicians, and satisfaction with diagnostic feedback and recommendations (Jones, 2022; Stavropoulos et al., 2022). However, the rapid roll-out of the TAP in response to the urgent need for telemedicine tools preceded controlled investigation of the TAP in comparison to traditional, in-person assessment.

To date, limited data are available regarding the psychometric properties of tele-assessment tools, the characteristics of children evaluated via tele-assessment, and the diagnostic agreement between tele- and traditional in-person assessments. This study presents complete data from our original TAP clinical trial (NCT03847337). Specifically, this study evaluates tele-assessment procedures incorporating either the TAP or remote administration of the STAT (TELE-STAT), with a focus on diagnostic accuracy in comparison to traditional in-person assessment, clinical characterization of children identified vs. not identified as having ASD via tele-assessment, and overall clinician and caregiver feedback on tele-assessment procedures. The primary purpose of this study was not to directly compare the performance of the TAP to the TELE-STAT, but rather to compare tele-assessment using clinician-guided, caregiver-led procedures to traditional in-person ASD assessment procedures. Given broad use and uptake of the TAP in recent years, analysis of TAP scoring procedures is also presented, including a comparison of the dichotomous and Likert scoring procedures.

Methods

Participants

Eligible participants were toddlers between 15 and 36 months of age with concerns for ASD or developmental delays and at least one primary caregiver with sufficient English facility to complete study measures. Exclusion criteria included significant sensory impairments or medical complexity that would preclude use of the study assessment battery.

Participants included 144 children (29% female, 71% male) between 17 and 36 months of age (mean = 2.5 years, SD = 0.33 years) and their caregivers. All children were recruited from a clinical waitlist in which they had been referred by a regional medical or early intervention provider specifically due to concern for possible ASD or other developmental differences. Participating caregivers included 124 mothers, 13 fathers, and seven other caregivers (e.g., grandparents, foster parents, legal guardians). Additional demographic information is presented in Table 1.

Table 1 Participant demographics

Tele-assessment Measures

TAP. The TAP (formerly TELE-ASD-PEDS; Corona et al., 2020) is designed for evaluating characteristics of ASD in toddlers via caregiver-mediated tele-assessment. It is intended for use with children between 14 and 36 months of age (Wagner et al., 2021). It includes eight activities, including free play, physical play routines, social responding (e.g., name calling, directing attention), and activities that may prompt a child to request (e.g., snack). Immediately following administration, clinicians rate child behavior on seven distinct behavioral anchors (i.e., socially directed speech and sounds; frequent and flexible eye contact; unusual vocalizations; unusual or repetitive play; unusual or repetitive body movements; combines gestures, eye contact, and speech/vocalization; unusual sensory exploration or reaction). Scoring instructions include both dichotomous ratings (presence/absence) and Likert scoring (3 = behaviors characteristic of ASD clearly present; 2 = possible atypical behavior; 1 = behaviors characteristic of ASD not present). Total Likert scores of ≥ 11 are considered to indicate increased likelihood of ASD.
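For illustration, the sketch below (in R, the language used for the study analyses) shows how a TAP Likert total could be computed and compared against the published threshold; the item labels and ratings are hypothetical shorthand for the seven behavioral anchors, not official TAP wording or real data.

```r
# Illustrative sketch of TAP Likert scoring; item labels and values are
# hypothetical shorthand, not official TAP anchor wording or real data.
# Each anchor is rated 1 (behaviors characteristic of ASD not present),
# 2 (possible atypical behavior), or 3 (behaviors clearly present).
tap_likert <- c(social_speech = 3, eye_contact = 2, unusual_vocalizations = 3,
                repetitive_play = 3, repetitive_movements = 2,
                combines_gestures = 3, sensory_exploration = 2)

tap_total <- sum(tap_likert)   # totals can range from 7 to 21
tap_total >= 11                # published threshold for increased ASD likelihood
```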

TELE-STAT. This experimental procedure was adapted from the Screening Tool for Autism in Toddlers and Young Children (STAT). The STAT is a Level 2 screening instrument validated for in-person use with children between 14 and 47 months of age (Stone et al., 2000, 2004, 2008). It includes 12 activities assessing a child’s skills related to play, requesting, directing attention, and imitating adult actions. Activities are scored as “pass” (score of 0), indicating that the behavior or skill was observed, or “fail” (score of 0.25, 0.5, or 1 depending on domain), indicating that the behavior or skill was not observed. Total scores range from 0 to 4, with scores of 2 or higher indicating increased likelihood of ASD.

In the TELE-STAT, participating caregivers were asked to complete STAT activities with guidance and instruction from remote clinicians. Remote clinicians scored each activity as it was completed, seeking clarification from caregivers about child behaviors including eye contact and vocalizations as needed. This was considered a non-standardized administration of the STAT, but thought to be a reasonable alternative to use in preliminary evaluation of the TAP, particularly given the dearth of caregiver-led tele-assessments available at the time (i.e., assessments such as the Brief Observation of Symptoms of Autism (BOSA; Dow et al., 2021) were not yet published or broadly available at the onset of this trial).

Clinical interview. Clinicians completed a clinical interview with caregivers to gather information about child social and communication skills, as well as restricted, repetitive behaviors. Clinical interviews were not standardized but were based on DSM-5 criteria for ASD.

Caregiver questionnaire. Following tele-assessment, caregivers completed a questionnaire designed specifically for the current study to assess their perceptions of the tele-assessment process. Questions asked about caregivers’ understanding of clinician-delivered instructions, their comfort during the tele-assessment, and their perceptions of the duration and content of the activities.

Clinician questionnaire. Following tele-assessment, clinicians completed questionnaires documenting their diagnostic impression, diagnostic certainty, and observed and reported characteristics of ASD according to DSM-5 criteria. Clinicians also provided information about any factors that impacted the assessment (e.g., child behavior, parent behavior, technology challenges).

In-person Assessment Measures

Mullen Scales of Early Learning (MSEL). The MSEL (Mullen, 1995) is a standardized, normed developmental assessment for children through age 68 months. It provides an overall index of ability, the Early Learning Composite, as well as subscale scores (Receptive/Expressive Language, Visual Reception, Gross/Fine Motor).

Vineland Adaptive Behavior Scales, Third Edition (VABS-3). The VABS-3 (Sparrow, 2016) is a clinician-administered caregiver interview that assesses adaptive functioning in social, communication, motor, and daily living skills. Norms-based standard scores are provided for birth through adulthood for each domain, as well as an overall Adaptive Behavior Composite score.

Autism Diagnostic Observation Schedule, Second Edition (ADOS-2). The ADOS-2 (Lord, 2012; Luyster et al., 2009) is a semi-structured, play-based interaction and observation designed to assess characteristics of ASD. In response to the ongoing COVID-19 pandemic, aspects of ADOS-2 administration were modified during the latter year of data collection (n = 59 participants). Modifications included use of masks by clinicians and caregivers, omission of the snack activity (substituting toys for food when requesting) and responsive social smile, and substitution of materials to allow for thorough cleaning (i.e., substitution of a remote control car for the rabbit toy in the joint attention activity). Given non-standardized administration and modification, standardized risk scores should be interpreted with caution (Dow et al., 2021). Of note, mean ADOS-2 severity scores were not significantly different (t = 0.19, p > 0.05) for administrations completed prior to COVID (n = 83; m = 7.78, SD = 2.64) and those completed with COVID modifications (n = 59; m = 7.86, SD = 2.24). Diagnostic agreement between tele- and in-person assessment did not differ prior to and following the introduction of COVID-19 precautions (χ² = 2.37; p > 0.05). As recommended, the ADOS-2 was used as part of a broader evaluation, including expert clinical judgment, when making diagnostic determinations (Lord et al., 2000). Specifically, psychometric findings of assessment instruments were combined with the totality of available information (i.e., clinician observations, caregiver report) when determining whether a child met criteria for the diagnosis of autism spectrum disorder, per DSM-5 diagnostic criteria.

Clinical interview. Clinicians completed a clinical interview with caregivers to gather information about child social and communication skills, as well as restricted, repetitive behaviors. Clinical interviews were not standardized but were based on DSM-5 criteria for ASD.

Caregiver questionnaire. Following in-person assessment, caregivers completed a questionnaire designed for the current study. This questionnaire asked them to compare the tele- and in-person assessments, including indicating whether they would prefer to play with their child as part of an assessment, observe a clinician interact with their child, or both. Caregivers were also given the opportunity to provide open-ended feedback on the tele-assessment and in-person assessment.

Clinician questionnaire. Following in-person assessment, clinicians completed questionnaires documenting their diagnostic impression, diagnostic certainty, and observed and reported characteristics of ASD according to DSM-5 criteria.

Procedures

Participants were randomized to receive one of two tele-assessment procedures: the TAP (n = 73) or the TELE-STAT (n = 71). All procedures were approved by the Institutional Review Board. No adverse events or participant withdrawals occurred. Clinical procedures (i.e., both tele-assessments and in-person assessments) were completed by psychological providers (licensed psychologists, licensed senior psychological examiners, supervised postdoctoral psychology fellows) with expertise in the evaluation and diagnosis of ASD in toddlers. All psychological providers had achieved research reliability on the ADOS-2 and completed training on the STAT. All providers participated in regular reliability discussions about TAP and TELE-STAT scoring and procedures.

Tele-assessment took place within a clinical research setting. Upon arriving, participating families were escorted to a tele-assessment room by a research assistant. Research assistants were responsible for orienting families to the assessment room, test materials (e.g., toys, bubbles, snacks), and tele-assessment technology. Research assistants accessed the Zoom platform via a wall-mounted monitor and speakers, allowing for two-way audiovisual communication and camera control by the remote assessor. The research assistant left the room after ensuring that the family and clinician could see and hear each other.

Remote clinicians (n = 15 psychological providers, as defined above) guided caregivers through structured interactions with their children, following the procedures described above for either the TELE-STAT or TAP. Each measure was coded according to its instructions regarding behaviors that the clinician observed. Clinicians also completed a clinical interview with caregivers focused on autism-related characteristics. Tele-assessments lasted an average of 39 min (SD = 11.12). Remote clinicians documented their diagnostic impressions (ASD vs. no ASD) and diagnostic certainty immediately following tele-assessment. Diagnostic impressions were informed by the totality of tele-assessment information (i.e., TAP or TELE-STAT plus clinical interview). Tele-assessment diagnostic impressions were not shared with families or in-person clinicians. Following tele-assessment, caregivers completed the caregiver questionnaire to share their thoughts on the tele-assessment process.

Immediately after completing tele-assessment, participating families moved into another clinic room to complete traditional in-person evaluation with a different clinician blind to tele-assessment results. Tele-assessment was always completed prior to in-person assessment for several reasons, including the anticipated length of in-person assessments, the receipt of diagnostic feedback following in-person assessment, and the desire for caregivers to provide initial feedback on tele-assessment procedures before completing in-person assessment. In-person evaluation included the MSEL, VABS-3, ADOS-2, and a clinical interview with caregivers. In-person clinicians (n = 8) were a subset of the providers who completed tele-assessments, including licensed psychologists and licensed senior psychological examiners. In-person clinicians shared assessment results and delivered diagnostic feedback following in-person assessment.

Analytic Plan

Diagnostic outcomes of assessments are reported descriptively. Independent sample t-tests and chi-square tests were used to compare TAP and TELE-STAT groups in terms of diagnostic agreement, diagnostic certainty, and clinician and caregiver perceptions of tele-assessment. To examine assessment scores as a function of diagnostic agreement status, the Brown-Forsythe or the Welch test was used to compare total scores on the TELE-STAT, TAP, MSEL, VABS-3, and ADOS-2. The Brown-Forsythe test rather than traditional one-way ANOVA was used when the assumptions of normality and homogeneity of variance were met because this test is robust in the presence of unequal sample sizes (Maxwell & Delaney, 2003). The Welch test was used when the assumption of normality was met, but the assumption of homogeneity of variance was violated (Tomarken & Serlin, 1986). The Bonferroni method was used to correct for multiple comparisons.
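As a minimal sketch of this strategy, assuming a data frame dat with a numeric assessment total (score) and a four-level diagnostic agreement factor (agreement_group), both hypothetical names: the Welch test is available in base R via oneway.test, the Brown-Forsythe test on means is shown here via the onewaytests package (an assumption, since the article does not name the routine used), and Bonferroni-corrected follow-ups can be obtained with pairwise.t.test.

```r
# Sketch of the group-comparison strategy; `dat`, `score`, and
# `agreement_group` are hypothetical names used only for illustration.
library(onewaytests)  # assumed source of bf.test() for the Brown-Forsythe test on means

# Welch test (normality met, homogeneity of variance violated): base R
oneway.test(score ~ agreement_group, data = dat, var.equal = FALSE)

# Brown-Forsythe test (robust with unequal group sizes)
bf.test(score ~ agreement_group, data = dat)

# Bonferroni-corrected pairwise follow-up comparisons
pairwise.t.test(dat$score, dat$agreement_group, p.adjust.method = "bonferroni")
```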

Analysis of the psychometric properties of the TAP focused on comparing the utility of dichotomous and Likert scoring procedures. Cronbach’s alpha for total scores was calculated to measure the internal consistency of both scoring procedures (Cronbach, 1951). A procedure derived by Feldt (1969, 1980) was used to test the statistical significance of the difference between the internal consistency coefficients of the different response formats of the TAP (Charter & Feldt, 1996).
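A sketch of how this comparison could be set up is shown below, assuming the seven TAP items are stored item-by-item under each response format in hypothetical data frames tap_dichot and tap_likert; psych::alpha is used here only as one convenient way to obtain the coefficients, and cocron.two.coefficients (from the cocron package cited later in this section) is assumed to accept the two dependent coefficients together with the correlation between the total scores.

```r
# Sketch of the internal-consistency comparison; tap_dichot and tap_likert
# are hypothetical data frames of the seven TAP items scored 0/1 and 1-3,
# one row per child in the TAP arm.
library(psych)   # alpha(); any Cronbach's alpha routine would serve
library(cocron)  # cocron.two.coefficients() for the Feldt-type comparison

alpha_dichot <- psych::alpha(tap_dichot)$total$raw_alpha
alpha_likert <- psych::alpha(tap_likert)$total$raw_alpha

# The two coefficients come from the same children, so they are treated as
# dependent; supplying the correlation between total scores is an assumption
# about the input required for the dependent case.
cocron.two.coefficients(alpha = c(alpha_dichot, alpha_likert),
                        n = nrow(tap_likert), dep = TRUE,
                        r = cor(rowSums(tap_dichot), rowSums(tap_likert)))
```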

The diagnostic accuracy of the dichotomous and Likert scoring procedures for the TAP was evaluated comparatively using receiver operating characteristic (ROC) curves and the area under the curve (AUC; Zhou et al., 2011). Confidence intervals for the AUCs were calculated using DeLong's method for paired ROC curves (DeLong et al., 1988). The AUCs of the dichotomous and Likert scoring procedures for the TAP were compared using the stratified bootstrap test for paired ROC curves.
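A minimal sketch of these ROC analyses with the pROC package cited below, assuming a data frame dat holding the in-person diagnosis coded 0/1 (dx) and the two TAP totals (hypothetical column names):

```r
# Sketch of the ROC analyses; `dat`, `dx`, `tap_dichot_total`, and
# `tap_likert_total` are hypothetical names used for illustration.
library(pROC)

roc_dichot <- roc(dat$dx, dat$tap_dichot_total)  # dichotomous scoring
roc_likert <- roc(dat$dx, dat$tap_likert_total)  # Likert scoring

ci.auc(roc_dichot, method = "delong")  # AUC with DeLong confidence interval
ci.auc(roc_likert, method = "delong")

# Paired comparison of the two AUCs using the stratified bootstrap
roc.test(roc_dichot, roc_likert, paired = TRUE,
         method = "bootstrap", boot.stratified = TRUE)
```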

To determine the optimal cutoff score for the dichotomous and Likert scoring procedures, the following indices were calculated at all possible cutoff points: sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), Cohen's kappa, and Youden's index. Cohen's kappa is a measure of interrater agreement and, in this context, represents the agreement between the TAP and the clinical diagnosis based on in-person assessments. Youden's index reflects the difference in the likelihood of a positive test result between individuals with and without the condition of interest (Zhou et al., 2011) and is calculated as the sum of the sensitivity and specificity minus one.
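These indices can be computed directly from the two-by-two table formed at each candidate cut-off; the base-R sketch below (hypothetical variable names) mirrors what the cutpointr package, cited below, automates.

```r
# Sketch of the cut-point search; `score` and `dx` (0/1 in-person diagnosis)
# are hypothetical names. Children with score >= cutoff are screen-positive.
cutoff_indices <- function(score, dx, cutoff) {
  pred <- as.integer(score >= cutoff)
  tp <- sum(pred == 1 & dx == 1); fn <- sum(pred == 0 & dx == 1)
  tn <- sum(pred == 0 & dx == 0); fp <- sum(pred == 1 & dx == 0)
  sens <- tp / (tp + fn); spec <- tn / (tn + fp)
  n <- length(dx)
  p_obs <- (tp + tn) / n
  p_exp <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
  c(cutoff = cutoff,
    sensitivity = sens, specificity = spec,
    ppv = tp / (tp + fp), npv = tn / (tn + fn),
    kappa = (p_obs - p_exp) / (1 - p_exp),  # agreement with in-person diagnosis
    youden = sens + spec - 1)               # sensitivity + specificity - 1
}

# Evaluate every possible TAP Likert total (7-21) as a candidate cut-off
sapply(7:21, function(k) cutoff_indices(dat$tap_likert_total, dat$dx, k))
```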

Once optimal cutoff scores were identified, the sensitivity, specificity, positive predictive value, and negative predictive value of each scoring procedure were calculated and compared. Specifically, differences in sensitivity and specificity between the two scoring procedures for the TAP were analyzed using McNemar's test, while differences in PPV and NPV were analyzed using the method proposed by Moskowitz and Pepe (2006).
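A sketch of this paired comparison is shown below, under the assumption that DTComPair's tab.paired, sesp.mcnemar, and pv.rpv (the package is cited in the next paragraph) provide the paired tabulation, the McNemar comparisons of sensitivity and specificity, and the relative predictive value comparison of Moskowitz and Pepe, respectively; cut_dichot and cut_likert stand in for the optimal cut-offs identified in the previous step.

```r
# Sketch of the paired accuracy comparison; variable names are hypothetical,
# and cut_dichot / cut_likert are the cut-offs selected in the prior step.
library(DTComPair)

dat$pos_dichot <- as.integer(dat$tap_dichot_total >= cut_dichot)
dat$pos_likert <- as.integer(dat$tap_likert_total >= cut_likert)

paired_tab <- tab.paired(d = dx, y1 = pos_dichot, y2 = pos_likert, data = dat)

sesp.mcnemar(paired_tab)  # McNemar tests comparing sensitivity and specificity
pv.rpv(paired_tab)        # relative PPV/NPV (Moskowitz & Pepe, 2006)
```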

All statistical analyses were performed in R (R Core Team, 2020). The cocron package was used to compare the Cronbach's alpha coefficients of the two scoring procedures for the TAP (Diedenhofen, 2016). The pROC package was used to conduct the ROC curve analyses (Robin et al., 2011). The cutpointr package was used to calculate the indices used to select the optimal cutoff points (Thiele & Hirschfeld, 2021). The DTComPair package was used to compare the sensitivity, specificity, PPV, and NPV of the optimal cutoff points of the TAP with dichotomous ratings and the TAP with Likert scoring (Stock & Hielscher, 2014).

Results

Preliminary Analyses

Following in-person assessment, 92% of toddlers were diagnosed with ASD. Diagnostic outcomes for children not meeting ASD criteria included speech delay (n = 3), global developmental delay (n = 3), unspecified developmental delays or behavioral concerns (n = 4), and typical development (n = 2).

No differences were observed between the TAP and TELE-STAT groups in terms of occurrence of diagnostic discrepancies (i.e., ASD vs. no ASD, χ² = 0.13; p > 0.05), clinician diagnostic certainty following tele-assessment (t = 1.16, p > 0.05), or total time spent on tele-assessment (t = 0.19, p > 0.05). Caregiver satisfaction also did not differ as a function of the tele-assessment tool used (all p-values > 0.05). See Table 2 for detailed caregiver satisfaction questions.

Table 2 Caregiver and provider satisfaction with tele-assessment

Diagnostic Accuracy of Tele-assessment

Comparing clinician diagnostic impressions following tele-assessment and in-person assessment indicated diagnostic agreement for 92% of participants. Remote clinicians correctly identified 124 toddlers as having ASD (ASD-ASD group). Remote clinicians correctly identified nine toddlers as not having ASD (No ASD-No ASD group). Eight children diagnosed with ASD following in-person assessment were missed on tele-assessment (ASD-No ASD group). Three children were inaccurately identified as having ASD following tele-assessment (No ASD-ASD group). Group comparisons were used to investigate differences between these four groups in terms of child age, sex, and race; telehealth clinician diagnostic certainty; and scores on both telehealth and in-person assessments. Results of these comparisons are reported below.

Group comparisons using the Brown-Forsythe test indicated that child age differed among groups (F*(3, 18.79) = 6.001, p = 0.005). Children in the No ASD-ASD group were significantly younger (m = 1.97, SD = 0.11) than children in all other groups (see Table 3). That is, the three children inaccurately identified as having ASD by tele-assessment were younger than all other children.

Table 3 Participant characteristics [m(SD)] by in-person diagnosis-telemedicine diagnosis

Fisher’s exact test was used to determine if there was a relationship between gender and diagnostic agreement status. There was not a significant relationship between gender and the four diagnostic agreement groups (p = 0.892).

Regarding diagnostic certainty, no differences were found for in-person clinicians across the four groups. However, telehealth clinician diagnostic certainty differed as a function of group (F*(3, 10.17) = 11.18, p = 0.001). Telehealth clinicians reported significantly greater diagnostic certainty for children in the ASD-ASD group. As seen in Table 3, telehealth clinicians reported lower diagnostic certainty for all other groups, including in cases of diagnostic disagreement (ASD-No ASD and No ASD-ASD groups), as well as for children for whom ASD was accurately ruled out by tele-assessment (No ASD-No ASD group).

Table 3 contains scores on tele-assessment and in-person assessment measures as a function of diagnostic outcome and diagnostic agreement status. Group comparisons using the Brown-Forsythe or Welch test revealed statistically significant group differences on the TAP (F*(2, 5.073) = 45.543, p < 0.001), TELE-STAT (F*(3, 2.680) = 11.111, p = 0.049), MSEL Early Learning Composite (W(3, 6.888) = 5.360, p = 0.032), VABS-3 Adaptive Behavior Composite (F*(3, 13.673) = 16.295, p < 0.001), and ADOS-2 severity scores (F*(3, 7.593) = 52.165, p < 0.001).

As expected, scores on tele-assessment instruments (i.e., TAP and TELE-STAT) were highest for children in the ASD-ASD group. Bonferroni post-hoc analyses of all possible between-group pairwise comparisons (see Table 3) indicated that, as expected, children in the No ASD-No ASD group had statistically significantly lower scores on both the TAP and TELE-STAT in comparison to the ASD-ASD group. Children diagnosed with ASD following in-person assessment who were not identified on tele-assessment (ASD-No ASD group) also had statistically significantly lower scores on both the TAP and TELE-STAT. Children inaccurately identified as having ASD on tele-assessment (No ASD-ASD group) had TELE-STAT scores that were not significantly different from those of the ASD-ASD group. This comparison (i.e., No ASD-ASD vs. ASD-ASD) could not be calculated for the TAP, as only one child was in the No ASD-ASD group.

Fisher’s exact test was used to determine whether children’s risk classifications, using recommended cut-scores on tele-assessment instruments (i.e., scores ≥ 12 on the TAP or > 2 on the STAT), were associated with diagnostic agreement status. For both tele-assessments, Fisher’s exact test indicated statistically significant differences. For the TAP, all children in the ASD-ASD group had scores ≥ 12. No children in the No ASD-No ASD group had scores ≥ 12. The one child inaccurately identified as having ASD in the TAP condition had a TAP score ≥ 12, and 50% (n = 2) of the children in the ASD-No ASD group (i.e., children whose ASD diagnosis was missed on tele-assessment) had a TAP score ≥ 12. For the STAT, 83% of children in the ASD-ASD group had STAT scores > 2. Half (n = 1) of the children in the No ASD-ASD group (i.e., those inaccurately identified as having ASD) had STAT scores > 2. No children in the No ASD-No ASD group or the ASD-No ASD group (i.e., those with ASD missed on tele-assessment) had STAT scores exceeding the risk threshold.

Scores on in-person assessment also revealed between group differences on the ADOS-2, the MSEL, and the VABS-3 (see Table 3). Post hoc analysis indicated that the ASD-ASD group had higher scores on the ADOS-2 than all other groups. Children in the ASD-No ASD group had statistically significantly higher ADOS-2 scores than children without ASD (No ASD-No ASD). Children in the No ASD-No ASD group had significantly higher MSEL-ELC scores, compared to both children in the ASD-ASD group and children in the ASD-No ASD group. Children inaccurately identified as having ASD by tele-assessment (No ASD-ASD group) also had higher MSEL-ELC scores, compared to the ASD-ASD group. Regarding adaptive behavior, children in the No ASD-No ASD and No ASD-ASD groups had statistically significantly higher VABS-3 composite scores than children in the ASD-ASD group. Statistically significant differences did not emerge for MSEL composite scores.

Comparison of TAP Scoring Procedures

Cronbach’s alpha was used to measure the internal consistency of both TAP scoring procedures. The Cronbach’s alpha associated with the Likert scoring method was significantly higher than that of the dichotomous scoring method (t(71) = 1.921, p = 0.029). The Cronbach’s alpha for the TAP with dichotomous ratings was 0.795 (95% CI: 0.767–0.821). The Cronbach’s alpha for the TAP with Likert scoring was 0.834 (95% CI: 0.811–0.855).

Receiver operating characteristic (ROC) curves were used to examine the diagnostic accuracy of the TAP using dichotomous and Likert scoring procedures (see Fig. 1). The area under the curve (AUC) can range from 0 to 1, with a value of 0.50 indicating that prediction is at chance level. Analyses indicated an AUC of 0.923 (95% CI: 0.801–1.000) for the dichotomous scoring and an AUC of 0.949 (95% CI: 0.860–1.000) for the Likert scoring. Both were significantly different from chance level (p < 0.001). The AUCs were not significantly different (p = 0.175). Using Youden’s index, the optimal cutoff point was found to be 15 for the dichotomous scoring method and 12 for the Likert scoring method. The sensitivity, specificity, positive predictive value, and negative predictive value associated with these cutoff points are presented in Table 4. The sensitivity of the Likert scoring method was significantly different from the sensitivity of the dichotomous rating method (p = 0.014). The negative predictive value (NPV) of the Likert scoring procedure was also significantly different from the NPV of the dichotomous scoring (p = 0.015).

Table 4 Predictive values of optimal TAP cutoff points
Fig. 1 ROC curves of the TAP with dichotomous ratings and Likert scoring methods

Discussion

The present study describes outcomes from a clinical trial investigating tele-assessment of autism in toddlers using clinician-guided, caregiver-led procedures. Together, results of this study provide additional evidence supporting the use of tele-assessment in the identification of ASD in young children. Assessments combining caregiver-led play activities (using either the TAP or an experimental remote administration of the STAT) and clinical interviews with caregivers yielded high rates of diagnostic agreement with in-person evaluation, high levels of diagnostic certainty, and satisfaction from both families and clinicians.

Analysis of the TAP, now in broad clinical use (Wagner et al., 2022), found that a slight increase in the cut-off score more accurately identified increased likelihood of ASD. Preliminary use of the scale recommended interpretation of scores ≥ 11 as indicative of increased ASD likelihood. Analysis of the present data indicates that a score of ≥ 12 optimally identified children who went on to receive a diagnosis of ASD. This slightly more conservative interpretation of scores may help to reduce the chances of inaccurately identifying a child as having ASD via telehealth.

Across both groups, clinicians’ diagnostic impressions following tele-assessment agreed with the results of traditional, in-person assessment for 92% of participants. Of these, the vast majority (93%) were diagnosed with ASD. This likely reflects research recruitment pathways that rely heavily upon pre-screening of participants by referring community providers. Additional ongoing work is evaluating TAP functionality in a non-referred community sample.

As expected, participants identified as having ASD following both in-person and tele-assessment had high scores on tele-assessment instruments as well as the ADOS-2. Of note, diagnostic agreement and ADOS-2 scores did not differ as a function of pandemic precautions, including use of face masks, instituted part-way through the present study. Across the course of the study, eight participants were diagnosed with ASD following in-person assessment but not identified by tele-assessment. These participants had significantly lower scores on the TAP, TELE-STAT, and ADOS-2 and higher adaptive behavior skills relative to children with clearly identified ASD. The three participants incorrectly identified as having ASD on tele-assessment had TAP or TELE-STAT scores that approached or exceeded cut-offs for risk classification. These three children were significantly younger than children in all other groups. Though a small sample, this finding may call for additional caution when assessing very young children for ASD via tele-assessment. Additional research, in a larger and broader sample, will allow for ongoing investigation of the characteristics of children for whom tele-assessment does not provide an accurate or definitive diagnosis. Ultimately, it is not unexpected that some children, particularly those with more complex presentations, may be more accurately classified via in-person assessment.

Clinicians’ diagnostic certainty was significantly lower for children without ASD and for children inaccurately identified as having ASD on tele-assessment. That clinicians reported uncertainty in cases of diagnostic disagreement may inform clinical practice, with clinical uncertainty helping to guide recommendations regarding children who may need further or in-person evaluation. This finding is also consistent with prior work from our team (Wagner et al., 2020) and may reflect a referral bias in which our sample is skewed toward children with high levels of concern related to developmental differences, including autism-specific concerns as well as significant developmental delays, attention concerns, or challenging behaviors. Past findings from a broader sample of clinicians, with samples including a higher number of children without ASD, indicate high levels of certainty when ruling out ASD on tele-assessment (Wagner et al., 2022).

Across tele-assessment tools, caregivers and providers reported broad satisfaction with tele-assessment procedures. Consistent with past work (Corona, Weitlauf, et al., 2021), caregivers reported that tele-assessment procedures were easy to understand, lasted the right amount of time, and often elicited the behaviors about which they were most concerned. Most caregivers reported that they would recommend the use of tele-assessment to others. Across this and prior work, a smaller number of caregivers have qualitatively reported challenges related to technology use, managing child behavior while interacting with the clinician, and concerns that clinicians may not have a complete picture of their child following tele-assessment alone (Corona, Weitlauf, et al., 2021). Tele-assessment also presents barriers for families without reliable access to technology or internet connections, many of whom may be from underrepresented groups or from rural communities. Ongoing work is investigating child and family characteristics that predict for whom tele-assessment works well and who may be best served by in-person assessment, as well as focusing on optimizing clinical procedures to address some of these concerns.

Finally, this study provides the most comprehensive opportunity to date for detailed analysis of the psychometric properties of the TAP, including comparison of Likert and dichotomous scoring procedures. This analysis suggests increasing the previous cut-off score of ≥ 11 to score of ≥ 12, when using Likert scoring procedures. Additionally, these analyses support continued use of Likert scoring rather than dichotomous scoring. Though both scoring procedures performed well in distinguishing children with ASD from those without, Likert scoring procedures afforded statistically significantly higher sensitivity and negative predictive value. In general, the increase in scoring options provided by Likert versus dichotomous procedures affords greater variability in scores, thereby strengthening the psychometric properties of the test (Finn et al., 2015).

Limitations and Future Directions

A significant and primary limitation of the current study is that participants were referred due to high levels of concern for ASD, resulting in a sample heavily weighted toward children who received an autism diagnosis. The vast majority of participants (92%) were diagnosed with ASD following in-person assessment, presenting barriers to detailed analysis of how tele-assessment functions for children without ASD. The current data support clinicians’ ability to readily and confidently identify ASD via tele-assessment when it is present; these data allow for less commentary on how well tele-assessment differentiates children with ASD from those with other concerns. To address this important limitation, ongoing work is intentionally recruiting a broader sample, including children screened within community settings and children referred for a broader range of developmental concerns.

A second limitation in this study is the use of controlled clinical settings and materials. Within this study, families came into a clinical lab space, and tele-assessment was completed using technology and assessment materials provided by the research team. Tele-assessments were completed in a small room, with few distractions, with only the child and one or two caregivers present. At the onset of the trial, direct-to-home tele-assessment was thought to be years away and was not a routine clinical care option. The goal of this trial was to investigate the use of tele-assessment in a controlled way, with a longer-term goal of replication in less controlled, home settings. The onset of the pandemic necessarily expedited the move from clinic-based to home-based tele-assessment, prior to investigation in home settings. The current study, then, does not account for the multitude of environmental factors present in home settings, including technology issues (e.g., devices, internet connectivity), people present (e.g., caregivers, other children), and other factors associated with the home setting (e.g., presence or absence of toys, distracting items or events) (Wagner et al., 2020). Ongoing work is studying the use of direct-to-home tele-assessment in comparison to in-person assessment and will provide a useful contrast to the present work, as well as greater generalizability to the current ways in which tele-assessment is used clinically.

Another possible limitation associated with study research procedures is that all families participated in tele-assessment immediately prior to in-person assessment, as opposed to use of a counter-balanced design. This design was meant to reduce the likelihood of child fatigue following in-person procedures, as well as to first probe families’ initial impressions of tele-assessment without comparison to in-person assessment. However, it is possible that both toddlers and caregivers experienced increased fatigue during in-person assessment, which may have impacted toddler scores on direct assessment. All clinicians engaged in dialogue with families to ask whether caregivers felt that in-person observations accurately captured their child’s usual behavior and to ensure that caregivers had the opportunity to share any behaviors that the clinician did not observe.

Finally, tele- and in-person assessments in this study both included autism-focused clinical interviews that were not standardized or coded across clinicians. Though the present research focused primarily on use and scoring of parent-administered, play-based assessment procedures, ongoing and future work may further probe information gained via clinical interviewing. Caregiver report is a vital part of any autism assessment, but limited work to date has discussed the relative impact of caregiver report and clinician observation in the context of tele-assessments. As clinical use of tele-assessment continues, it will be important to explore ways in which varying tele-assessment procedures are used to meet varying family and clinician needs.

Conclusion

In sum, this clinical trial represents one of the first controlled studies of caregiver-mediated tele-assessment for autism in toddlers. This work supports the claim that trained clinicians with expertise in the diagnosis of ASD can readily identify ASD characteristics via tele-assessment, and that this type of assessment is acceptable to both clinicians and families. This study further demonstrated that multiple, distinct caregiver-mediated assessment tools can facilitate remote ASD identification. In the context of broad uptake and use of tele-assessment, ongoing work is needed to optimize tele-assessment procedures, to ensure equitable access, and to understand for whom and in what situations tele-assessment can be most successful. Continued development and refinement of tele-assessment tools will also expand the reach of telemedicine in meeting the needs of various clinicians, families, and circumstances.