Background

A core feature of autism as characterized by the Diagnostic and Statistical Manual of Mental Disorders, 5th Edition (DSM-5; [1]) is differences in response to sensory stimuli, including hyper-reactivity (exaggerated response), hyporeactivity (reduced or absent response), and unusual interest in sensory aspects of the environment (commonly referred to as “sensory seeking”; [1]). Alterations in multisensory integration and broader sensation and perception are also commonly observed in persons on the autism spectrum [2,3,4,5]. To date, much of the literature on sensory function in autism has focused on exteroceptive senses (e.g., vision, audition, or somatosensation; [6,7,8]). More recently, however, researchers have also begun to examine whether autism may be associated with differences in interoception, the processing of internal stimuli such as heartbeats and gut distention [9,10,11].

Interoception can be understood as the sense of the internal state of the body and contributes to allostasis by providing information about visceral processes (e.g., the perception of hunger, pain, temperature, thirst, or a number of other sensations; [12,13,14]). Interoceptive signals have also been suggested as a physiological substrate of emotional experience [15,16,17], and disrupted interoception has been implicated in the pathophysiology of multiple psychiatric conditions, including autism [9,10,11]. Poor interoceptive ability has specifically been hypothesized as the psychophysiologic basis of alexithymia [18,19,20,21], a personality trait that is commonly observed in the autistic population and characterized by difficulties with the identification or interpretation of one’s own or others’ emotional states [25, 26]. The study of interoception in autism, thus, has the potential to inform our understanding of not only sensory processing alterations, but also a number of affective features frequently reported in this population, such as alexithymia and emotion regulation difficulties [27].

Garfinkel et al. [28] put forth a comprehensive theoretical framework for conceptualizing interoception, proposing three separable dimensions of interoceptive experience: interoceptive accuracy, interoceptive sensibility, and interoceptive awareness. Interoceptive accuracy is defined as objective accuracy in detecting internal bodily sensations (e.g., can one accurately report when one’s heart is beating). Interoceptive sensibility is defined as self-perceived dispositional tendency to be internally self-focused and interoceptively cognizant (e.g., measured by self-report questions such as, “To what extent do you believe you focus on and detect internal bodily sensations?”; [16]). Interoceptive awareness is defined as metacognitive awareness of interoceptive accuracy (e.g., the accuracy of one’s subjective evaluation of one’s own ability to count heartbeats). It is important to note that interoceptive accuracy is most often tested through empirical measures of perception with an objective “ground truth” (e.g., heartbeat detection tasks; [29,30,31,32]), whereas interoceptive sensibility and awareness are subjective and, thus, typically tapped via self-report measures. It is also relevant to note that reports of interoceptive awareness do not always correlate strongly with ratings of interoceptive sensibility or performance on interoceptive accuracy tasks [28]. This finding does not necessarily mean that the construct of interoception is invalid; rather, it suggests that multiple facets of interoception exist, each contributing different yet meaningful information to our overall understanding of this construct.

When assessing whether autism is associated with differences in interoceptive accuracy or sensibility, investigators have often obtained seemingly discrepant results between empirical and self-report measures. Some studies have found increased interoceptive sensibility in people on the autism spectrum versus neurotypical peers [33], whereas others have found the opposite [21, 34, 35] or failed to detect between-group differences [25]. A similar pattern of discrepant results has been obtained for differences in interoceptive accuracy [33, 36, 37]. These findings provide additional evidence to suggest that the three facets of interoception are not interchangeable when determining whether a clinical population has impaired interoceptive ability. The discrepancies across studies may also be explained by limitations of the measures being used, highlighting the need for better tools that have been comprehensively, psychometrically evaluated.

One reason the results of extant studies may be so varied is because of how interoceptive sensibility has been conceptualized in self-report measures. Although different measures of interoceptive sensibility aim to assess the same latent construct, correlations between these measures are often modest [38, 39]. The low convergent validity between such measures suggests that the overlap in the constructs being assessed by different questionnaires may be quite minimal. For example, scores on the Multidimensional Assessment of Interoceptive Awareness (MAIA; [40]) have relatively weak correlations with other scales that purport to measure the same construct, including the Body Awareness Scale and the Interoception Sensory Questionnaire (rs < 0.35; [38, 39]). Moreover, though these rating scales are based on theoretical models, there is generally a lack of psychometric work validating these measures, particularly in the clinical populations about which they are so often used to make inferences. By providing theory-based definitions of interoceptive constructs, measures developed to date have allowed us to refine our conceptualization of interoception and to gather preliminary data from clinical populations. However, research in this field would benefit greatly from systematic psychometric analyses in large samples, particularly within clinical groups of interest. Thus, in the current study, we complement the aforementioned theory-driven approach by quantitatively assessing the statistical properties of a promising measure of interoceptive sensibility, the Interoception Sensory Questionnaire (ISQ).

The ISQ was developed by Fiene and colleagues [38] as a research tool intended specifically to assess the differences in interoceptive sensations between individuals with and without autism. The authors of this instrument qualitatively analyzed the content of online video blogs and semi-structured interviews with adults on the autism spectrum, drafting a preliminary 60-item questionnaire that was further reduced based on empirical analyses. In brief, the authors of the ISQ tested each item on its ability to discriminate between individuals with high and low levels of autistic traits, excluding 30 items that did not exhibit at least moderate between-group differences (η² > 0.06). An exploratory factor analysis of the remaining 30 items (principal axis factoring of Pearson correlations) indicated that a single factor was sufficient to explain the covariance between item responses. A further 10 items were removed from the measure based on their low factor loadings (< 0.63), leaving 20 items in the final self-report tool. The final, 20-item version of the ISQ showed high internal consistency and adequate convergent/discriminative validity. Due to the manner in which items were selected, the ISQ total score necessarily differentiated between autistic and neurotypical participants quite strongly. Notably, however, due to the relatively small autism sample in this study (n = 52), the authors were unable to confirm the factor structure of the ISQ specifically within the population of autistic adults.

A potential concern with the 20-item instantiation of the ISQ is redundancy in item content, as the questionnaire contains several pairs of items that seem to be “asking the same question twice” [41] (e.g., “Sometimes I don’t know how to interpret sensations I feel within my body” and “I find it difficult to read the signs and signals within my body [e.g., when I have hurt myself or need rest]”). Although some questionnaires include redundant item pairs in order to detect inconsistent responses, the authors of the ISQ made no mention of this in their original paper, indicating that item redundancy on this form was not intentional. Notably, when combined together into a total scale score, such redundant item pairs can cause a number of issues with an assessment. First, redundant items over-weight certain questions when deriving scores, as the content tapped by both items is effectively counted twice. Additionally, redundant items violate the assumption of local independence needed to conduct factor analysis. This can cause factor loadings and reliability coefficients to be artificially inflated and introduce bias [42,43,44,45].

Building on the work of Fiene et al. [38], this study aims to examine the psychometric properties of the ISQ in a larger sample of adults diagnosed with autism than previously tested, evaluating the fit of the proposed factor structure in the measure’s target population using confirmatory factor analysis (CFA). Furthermore, we seek to identify and eliminate any redundant items from the measure, producing a shortened form that satisfies the assumption of local independence. This reduced form will be tested in an item response theory (IRT) framework and tested for differential item functioning (DIF) across different sociodemographic groups. Lastly, we will investigate whether the ISQ is valid for use in self-reporting autistic adolescents, testing for the presence of DIF between adolescents and adults in our sample. We hypothesize that the unidimensional structure will remain intact, that several items can be removed, and that the items will function equivalently across sociodemographic groups, including between adolescents and adults.

Methods

Participants

This study was a secondary analysis of the ISQ completed by 495 adults and 187 adolescents on the autism spectrum recruited from the Simons Powering Autism Research Knowledge cohort (SPARK; [46]) using the SPARK Research Match service. These participants were recruited as a part of a larger study on the genetic underpinnings of sensory aspects of autism (RM0035Woynaroski). Participants were included if they submitted a genetic sample to SPARK, agreed to be contacted about further research, indicated reading proficiency in English, and were 13 years of age or older. Exclusion criteria included a diagnosed genetic disorder concomitant with autism (e.g., fragile X syndrome), or significant sensory impairments (i.e., blindness and/or deafness). The full sample was 51.6% male, 82.2% non-Hispanic White, and had a mean age of 31.2 years (range: 13.1–77.8 years). Full demographic information for the sample and adolescent/adult subsamples can be found in Table 1. All participants gave informed consent or assent for participation in the study, and parental consent was obtained for minors. All study procedures were approved by the Institutional Review Board at Vanderbilt University Medical Center.

Table 1 Demographics for adult, adolescent, and combined samples

Procedures

Participants for the study were recruited as a part of the SPARK Research Match Process. Briefly, individuals enrolled in SPARK and meeting inclusion/exclusion criteria for the larger study on the genetic basis of sensory alterations in autism (RM0035Woynaroski) were contacted about participation in a supplemental research opportunity via email. Interested individuals subsequently consented for participation and completed a series of surveys regarding their sensory experiences, including the ISQ, via an online platform. Demographics were drawn from the larger SPARK study.

Measures

The ISQ [38] is a 20-item self-report questionnaire intended to measure interoceptive challenges in autistic adults using a single factor scale. The items aim to identify the broad ways in which individuals on the autism spectrum may experience differences in interoceptive processing using a 7-point Likert scale (1 = “Not true at all of me”, 7 = “Very true of me”) where a higher score indicates more difficulty registering or interpreting interoceptive sensations. Three items were reverse-scored to maintain scoring consistency.

The reliability of the ISQ in autistic individuals, as estimated by Cronbach’s alpha, is quite high, both in the sample reported by Fiene et al. (α = 0.96) and the current sample of adults on the autism spectrum (α = 0.96, 95% CI [0.95, 0.97]). Fiene et al. [38] found evidence for the questionnaire’s construct validity as evidenced by associations between the ISQ, the Toronto Alexithymia Scale [47], Big Five personality traits [48], and subscales from the MAIA [40]. Specifically, alexithymia scores from the Toronto Alexithymia Scale had a strong positive correlation with interoceptive difficulty as measured by the ISQ. Extraversion, body listening, emotional awareness, attention regulation, and self-regulation were all inversely correlated with interoceptive difficulty. Further correlational analyses showed that gender, age, and years of education were not associated with ISQ scores in a neurotypical group of 459 participants [38].

Statistical analysis

Descriptive statistics

All statistical analyses were conducted in the R programming environment [49]. Item-level descriptive statistics, including means, standard deviations, and skewness, were calculated. In addition, we analyzed the polychoric item correlation matrix, examining the magnitude of correlations between each item and all other items on the ISQ as a measure of item redundancy [50]. The mean (polychoric) correlation between each item and all other items, as well as the number of intercorrelations for each item exceeding 0.7, was reported. As correlations of 0.7 reflect approximately 50% shared variance between the latent continua underlying each item pair, correlations above this value are highly suggestive of item content redundancy [50].
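The redundancy screen described above can be sketched as follows. This is an illustrative Python sketch only (the analyses reported here were run in R); it assumes the polychoric correlation matrix has already been estimated elsewhere, and the function names and the toy 4-item matrix are hypothetical. The 0.7 threshold corresponds to 0.7 squared, or 0.49, shared variance.

```python
from itertools import combinations


def redundancy_summary(corr, threshold=0.7):
    """Per item: (mean correlation with all other items,
    number of correlations with other items above the threshold)."""
    n = len(corr)
    summary = []
    for i in range(n):
        others = [corr[i][j] for j in range(n) if j != i]
        mean_r = sum(others) / len(others)
        n_large = sum(1 for r in others if r > threshold)
        summary.append((mean_r, n_large))
    return summary


def count_large_pairs(corr, threshold=0.7):
    """Return (number of unique item pairs above the threshold,
    total number of unique item pairs)."""
    n = len(corr)
    large = [(i, j) for i, j in combinations(range(n), 2)
             if corr[i][j] > threshold]
    return len(large), n * (n - 1) // 2


# Hypothetical 4-item correlation matrix (made-up values, not study data)
toy_corr = [
    [1.00, 0.82, 0.40, 0.35],
    [0.82, 1.00, 0.45, 0.30],
    [0.40, 0.45, 1.00, 0.72],
    [0.35, 0.30, 0.72, 1.00],
]

if __name__ == "__main__":
    print(redundancy_summary(toy_corr))
    print(count_large_pairs(toy_corr))
```

For a 20-item scale, the second function returns a total of 190 unique pairs, which is the denominator used when reporting the proportion of large correlations.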

Confirmatory factor analysis

Confirmatory factor analysis (CFA) was used to fit the one-factor model proposed by Fiene et al. [38] in our sample of autistic adults in order to determine whether the ISQ conforms to a unidimensional structure in this population. We fit the model using a Diagonally Weighted Least Squares estimator [51] with a mean- and variance-corrected test statistic (i.e., “WLSMV” estimation), as implemented in the R package lavaan [52]. As very few of the item responses in our dataset contained missing values (0.004% missing item responses), we handled missing values in our model using pairwise deletion.

Model fit was evaluated using the chi-square test of exact fit. However, given the test’s high likelihood of rejecting models that differ trivially from the population structure (cf. [53]), several additional fit indices were also calculated, including the comparative fit index (CFI; [53]), Tucker-Lewis index (TLI; [54]), root mean square error of approximation (RMSEA; [55]), standardized root mean square residual (SRMR; [56]), correlation root mean square residual (CRMR; [57]), and weighted root mean square residual (WRMR; [58, 59]). Notably, we employed the categorical maximum likelihood (cML) estimators of the CFI, TLI, and RMSEA proposed by Savalei [60], as these indices better approximate the population values of the maximum likelihood-based fit indices used in linear CFA. Moreover, the SRMR and CRMR were calculated using the unbiased estimators (i.e., SRMRu and CRMRu) proposed by Maydeu-Olivares [57, 61] and implemented in lavaan for categorical estimators. We judged fit using the widely accepted guidelines of Hu & Bentler [56], which state that CFI/TLI values of > 0.95, SRMR (and by extension CRMR) values of < 0.08, and RMSEA values of < 0.06 indicate good model fit (though see [62,63,64] for limitations of standardized fit index cutoffs). Though the WRMR is a less well-studied index of fit, recent simulation work supports the assertions of Yu [59] that values below 1.0 generally suggest good model fit [58].
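As a compact restatement of the cutoffs listed above, the following illustrative Python sketch (not the R code used in the study) checks a set of fit indices against those thresholds. The dictionary keys, the function name, and the example values are hypothetical; the indices themselves would come from the fitted model.

```python
# Cutoffs as stated in the text: direction and threshold for each index
CUTOFFS = {
    "CFI": (">", 0.95),
    "TLI": (">", 0.95),
    "RMSEA": ("<", 0.06),
    "SRMR": ("<", 0.08),
    "CRMR": ("<", 0.08),
    "WRMR": ("<", 1.0),
}


def evaluate_fit(indices):
    """Map each supplied fit index to True (meets cutoff) or False."""
    results = {}
    for name, value in indices.items():
        direction, cutoff = CUTOFFS[name]
        results[name] = value > cutoff if direction == ">" else value < cutoff
    return results


if __name__ == "__main__":
    # Made-up values for illustration only
    print(evaluate_fit({"CFI": 0.97, "TLI": 0.96, "RMSEA": 0.08, "SRMR": 0.05}))
```

Such a check is only a bookkeeping aid; as noted above, fixed cutoffs have known limitations and should not replace judgment about local misfit.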

In addition to global fit indices, we checked for localized areas of model misfit using the approach proposed by Saris et al. [65]. In this approach, the modification index (MI) of a structural coefficient is considered alongside the expected parameter change and the power of the MI test to determine whether two items likely exhibited correlated error terms (as determined by an expected parameter change of ≥ 0.1). Information from this analysis and the analysis of inter-item correlations was combined to determine whether any items on the scale should be deemed redundant and eliminated. A model-based estimate of internal consistency reliability, McDonald’s [66] coefficient omega (ω), was calculated from the one-factor model using the categorical data estimator proposed by Green and Yang [67]. We constructed 95% confidence intervals for omega using the bias-corrected and accelerated bootstrap approach (1000 resamples) recommended by Kelley and Pornprasertmanit [68].

Item reduction

Using the information from the misspecification analysis and correlation matrix inspection, the set of items was reduced to the maximum number of items that satisfied the following criteria: (a) no polychoric correlation between two items exceeds 0.7 and (b) the Saris et al. [65] method does not flag any item pair as having correlated error terms with an expected parameter change (EPC) of 0.1 or greater. The reduced scale was re-fit using the same CFA methods, and its fit was compared to that of the longer form.
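Criterion (a) can be illustrated with a simple greedy procedure, sketched here in Python (the study itself used R, and item selection also drew on content judgment and the misspecification analysis, neither of which is modeled here). The function name and toy matrix are hypothetical, and a greedy pass is not guaranteed to retain the maximum possible number of items.

```python
def greedy_reduce(corr, threshold=0.7):
    """Repeatedly drop the item with the most above-threshold correlations
    among the remaining items until no such correlation is left.
    Ties are broken in favor of dropping the lowest-indexed item."""
    remaining = list(range(len(corr)))
    while True:
        counts = {
            i: sum(1 for j in remaining if j != i and corr[i][j] > threshold)
            for i in remaining
        }
        worst = max(remaining, key=lambda i: counts[i])
        if counts[worst] == 0:
            return remaining
        remaining.remove(worst)


# Hypothetical 5-item correlation matrix (made-up values, not study data)
toy_matrix = [
    [1.00, 0.80, 0.75, 0.30, 0.20],
    [0.80, 1.00, 0.50, 0.30, 0.20],
    [0.75, 0.50, 1.00, 0.40, 0.30],
    [0.30, 0.30, 0.40, 1.00, 0.75],
    [0.20, 0.20, 0.30, 0.75, 1.00],
]

if __name__ == "__main__":
    print(greedy_reduce(toy_matrix))  # indices of retained items
```

In the toy matrix, the first item is dropped because it has two large correlations, and one member of the remaining redundant pair is then dropped by the tie-breaking rule; in practice the choice within a pair would rest on item content.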

Item response theory analysis

After reducing the number of items on the ISQ, we analyzed the resulting short form within an item response theory framework, fitting data from those items to a unidimensional graded response model [69] in our adult sample. The model was fit using maximum marginal likelihood estimation via the Bock–Aitkin EM algorithm [70], as implemented in the mirt R package [71]. Model fit was assessed using the limited-information C2 statistic [72, 73], as well as C2-based approximate fit indices and SRMR. The guidelines for adequate fit proposed by Maydeu-Olivares and Joe [74] for the RMSEA2 and SRMR were used to establish adequate fit of the IRT model. To further confirm that item redundancy was not affecting IRT parameters, we calculated Chen and Thissen’s [75] standardized local dependency (LD) χ2 statistic for each item pair. Standardized LD-χ2 values greater than 10 are typically indicative of practically significant local dependence [76].
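To make the graded response model concrete, the following Python sketch (illustrative only; the model in this study was fit in R) computes category response probabilities for a single item. It assumes a slope-intercept parameterization in which the probability of responding in category k or higher is a logistic function of slope × θ + intercept_k; the function name and parameter values are hypothetical.

```python
import math


def grm_category_probs(theta, slope, intercepts):
    """Category probabilities for one graded-response item.

    intercepts must be strictly decreasing; with K-1 intercepts the item
    has K ordered response categories.
    """
    # Cumulative probabilities P(Y >= k), padded with 1 (lowest) and 0 (beyond highest)
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-(slope * theta + d))) for d in intercepts]
    cumulative.append(0.0)
    # P(Y = k) is the difference between adjacent cumulative probabilities
    return [cumulative[k] - cumulative[k + 1] for k in range(len(cumulative) - 1)]


if __name__ == "__main__":
    # Made-up parameters for a 5-category item
    probs = grm_category_probs(theta=0.0, slope=1.0, intercepts=[2.0, 1.0, -1.0, -2.0])
    print([round(p, 4) for p in probs])
```

Evaluating this function over a grid of θ values yields the category characteristic curves used to judge whether every response option is the most probable choice somewhere along the latent continuum.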

Once the adequacy of the model was established, we used information generated by the IRT parameters to further understand the psychometrics of the shortened ISQ form. Marginal reliability of the latent trait score was calculated, and the 95% confidence interval for this value was constructed using a simple percentile bootstrap (1000 resamples). Reliability coefficients for each individual respondent were also examined, with values greater than 0.7 being deemed sufficiently reliable for interpretation at the individual level. The performance of each item was also evaluated by examining item characteristic curves and item information curves, as well as testing for differential item functioning (DIF). Items were evaluated for DIF in the adult sample across groups based on age (> 40 vs. ≤ 40 years), biological sex, gender identity, and annual household income (> $50,000 vs. ≤ $50,000). Age and income cut-points were chosen based on approximate median splits. DIF by race/ethnicity was not able to be tested due to the small number of individuals identifying as categories other than non-Hispanic White. DIF was tested using the iterative Wald test procedure proposed by Cao et al. [77] and implemented by Williams [78], with p values < 0.05 (FDR-corrected; [79]) used to flag items for DIF. Significant omnibus Wald tests were followed up with tests of individual item parameters to determine which parameters significantly differed between groups.
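The false discovery rate correction applied to the DIF p values can be sketched as follows, assuming the correction is the Benjamini-Hochberg step-up procedure (an assumption; the text specifies only "FDR-corrected"). This Python sketch is illustrative, the function name is hypothetical, and the example p values are made up.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.20]  # made-up omnibus test p values
    adj = bh_adjust(raw)
    flagged = [i for i, p in enumerate(adj) if p < 0.05]
    print(adj, flagged)
```

Items whose adjusted p value falls below 0.05 would be flagged for follow-up tests of individual item parameters.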

In order to test the validity of the shortened ISQ in a population of adolescents on the autism spectrum, we fit a multiple-group graded response model to data in both the adolescent and adult samples, assessing overall model fit using the criteria described above. To determine whether scores in the two groups were comparable, we tested for DIF between adolescents and adults using the iterative Wald test procedure [77, 78] and an FDR-corrected p-value threshold of 0.05. As no significant DIF was found between the groups, we then re-fit the graded response model to the full dataset, using item parameters from this final model to calculate latent trait scores on the ISQ. Lastly, to examine the effects of demographics on ISQ latent trait scores, we then regressed the ISQ latent trait score on age (in years), sex (male vs. female), and the interaction between age and sex.

Results

Descriptive statistics

ISQ means, standard deviations, skewness, number of large correlations (r > 0.7), and mean correlations are displayed in Table 2. Several items (Items 6, 10, 11, 12, 13, 14, 16, 18) showed many (> 5) large correlations (> 0.7). Out of 190 unique correlations, there were 43 (22.6%) that were greater than 0.7, indicating that there was likely a high degree of item content overlap [50]. Several problematic item pairs (e.g., Item 5. I find it difficult to describe feelings like hunger, thirst, hot or cold and Item 13. It is difficult for me to describe what it feels like to be hungry, thirsty, hot, cold or in pain; Item 3. I have difficulty feeling my bodily need for food and Item 11. I have difficulty understanding when I am hungry or thirsty; Item 10. I find it difficult to read the signs and signals within my own body (e.g., when I have hurt myself or I need to rest) and Item 14. I am confused about my bodily sensations) had a very high degree of correlation (e.g., rpoly = 0.85 for Items 5 and 13).

Table 2 ISQ item content and descriptive statistics for adult sample

Confirmatory factor analysis

Model fit for the 20-item ISQ was inadequate based on conventional fit criteria (Table 3). The Chi-square test was significant (p < 0.001), rejecting the null hypothesis of exact model fit. Other fit indices also failed to meet a priori cutoff values (i.e., CFIcML/TLIcML > 0.95, RMSEAcML < 0.06, WRMR < 1.0, and SRMRu/CRMRu < 0.08), suggesting that this model did not fit the data in our sample well. Using McDonald’s omega, the model showed good reliability (ω = 0.966, 95% bootstrapped CI [0.961, 0.971]); however, as a model-based reliability coefficient is only as valid as the model it is based on [80], this coefficient should be interpreted with caution given the poor fit of the model. Factor loadings for the items in the CFA model are displayed in Table 4.

Table 3 Fit indices for original and revised ISQ confirmatory factor models
Table 4 Factor loadings for ISQ-20 and ISQ-8

Item reduction and short form construction

Misspecification analysis was conducted to identify the specific pairs of items driving the misfit of the unidimensional model. Based on this method, several pairs of items were found to have omitted error correlations (i.e., EPC > 0.1; [51]), indicating item content redundancy (e.g., Items 19/20, 5/13, and 3/11; see Additional file 1: Table S1 for a full list of flagged item pairs). Using the polychoric correlation matrix, the items were ordered by number of large correlations (> 0.7). First, the 6 items with the most intercorrelations were removed (Items 6, 10, 11, 13, 14, 16). Item 17 was then cut because of its high correlations with Items 12 and 1 (r values = 0.73 and 0.71, respectively; 17. I don’t tend to notice feelings in my body until they’re very intense; 12. I find it difficult to identify some of the signals that my body is telling me [e.g., If I’m about to faint or I’ve over exerted myself]; 1. I have difficulty making sense of my body’s signals unless they are very strong). After these reductions, several large correlations were still present among the 13 remaining items. To further reduce item redundancy, each of the flagged item pairs was compared, and the item whose content was more general was retained for the final scale. Using this criterion, Item 3 was kept over Item 8 (3. I have difficulty feeling my bodily need for food; 8. I only notice I need to eat when I’m in pain or feeling nauseous or weak), Item 20 was kept over Item 19 (20. Even when I know that I am physically uncomfortable, I do not act to change my situation; 19. Even when I know that I am hungry, thirsty, in pain, hot or cold, I don’t feel the need to do anything about it), and Item 5 was kept over Item 18 (5. I find it difficult to describe feelings like hunger, thirst, hot or cold; 18. I find it difficult to put my internal bodily sensations into words). This item reduction process resulted in a 10-item scale with all inter-item correlations less than 0.7.

Based on information from the misspecification analyses, item pairs 2/3 (2. I tend to rely on visual reminders (e.g., times on the clock) to help me know when to eat and drink; 3. I have difficulty feeling my bodily need for food) and 7/20 (7. If I injure myself badly, even though I can feel it, I don’t feel the need to do much about it; 20. Even when I know that I am physically uncomfortable, I do not act to change my situation) were further identified as misspecified, and Items 3 and 20 were retained due to their more general content. The final short form of the ISQ contained 8 items (ISQ Items 1, 3, 4, 5, 9, 12, 15, and 20; Additional file 1: Table S2).

The short form ISQ (ISQ-8) showed far better fit after item reduction using the same criteria (Table 3). The Chi-square test once again rejected the null hypothesis of exact model fit (p = 0.007), signaling at least some degree of model misspecification. Other fit indices met a priori criteria (i.e., CFIcML/TLIcML > 0.95, RMSEAcML < 0.06, WRMR < 1.0, and SRMRu/CRMRu < 0.08), demonstrating trivial levels of global misfit, and misspecification analysis of this reduced-item set showed no flagged pairs, indicating a low likelihood of item content redundancy. Reliability of the model was evaluated with coefficient omega (ω = 0.901, 95% bootstrapped CI [0.886, 0.913]) suggesting good internal consistency for this 8-item model.

Item response theory analyses

The model for the ISQ-8 showed overall good fit in the adult sample (C2(20) = 32.5, p = 0.038, CFIC2 = 0.997, RMSEAC2 = 0.036, SRMR = 0.040). Additionally, the standardized LD-χ2 values were all less than 5.79, providing no evidence for remaining item redundancies. The marginal reliability of the ISQ-8 was good (ρxx = 0.891, 95% bootstrapped CI [0.881, 0.890]), further demonstrating the psychometric adequacy of the reduced scale. Scores for individual participants all had reliability values greater than 0.7, indicating the 8-item form measured the construct with sufficient precision in all cases. Factor loadings and IRT slope/intercept parameters can be found in Table 4.

Based on an examination of the item category characteristic curves (Additional file 1: Figure S4), we concluded that a 7-point response scale was not optimal for the ISQ-8. For all 8 items, the plots showed that there were item responses that at no point on the latent continuum were the most probable choice, thus suggesting that there were too many response options. As a result, item responses were collapsed together to create a 5-point scale (i.e., the “2”/“3” responses were combined together into a single response option, as were the “5”/ “6” responses). Using this new 5-point scale, the IRT model was re-run in the adult sample. This model also showed good fit (C2(20) = 32.0, p = 0.043, CFIC2 = 0.997, RMSEAC2 = 0.035, SRMR = 0.038), no local dependencies (LD-χ2 values < 9.26), and good reliability (ρxx = 0.887, 95% bootstrapped CI [0.878, 0.897]). EAP-estimated latent trait scores derived from the recoded ISQ-8 correlated very highly with those derived from the original ISQ-8 (r > 0.997). The item trace lines for the 5-point scale indicated more consistent response utilization than those for the 7-point scale, but the middle response was still shown to be underutilized in a number of cases (Additional file 1: Figure S4).
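The recoding from the 7-point to the 5-point scale described above can be written out explicitly. This Python sketch is illustrative only (the analyses were run in R); the mapping follows the text, with responses "2"/"3" and "5"/"6" merged, while the function name and the pass-through handling of missing responses (None) are assumptions.

```python
# 7-point response -> 5-point response, per the collapsing described in the text
SEVEN_TO_FIVE = {1: 1, 2: 2, 3: 2, 4: 3, 5: 4, 6: 4, 7: 5}


def recode_responses(responses):
    """Recode a list of 7-point item responses to the 5-point scale.
    Missing responses (None) are passed through unchanged."""
    return [None if r is None else SEVEN_TO_FIVE[r] for r in responses]


if __name__ == "__main__":
    print(recode_responses([1, 2, 3, 4, 5, 6, 7, None]))
```

Because the endpoints and midpoint are preserved, the recoded scale keeps the original ordering of responses while removing the distinctions between adjacent options that respondents rarely used.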

Differential item functioning was also evaluated using the iterative Wald test procedure to identify differences in performance by age, sex, gender, and household income. No differential item functioning was found between any of the tested groups on any item (all p’s > 0.101, FDR corrected; see Additional file 1: Table S3 for full DIF results). Given that no difference was observed between the adult and adolescent groups, the two were combined and run together in another model using the 5-point scale. This model showed good overall fit (C2(20) = 48.2, p < 0.001, CFIC2 = 0.994, RMSEAC2 = 0.046, SRMR = 0.036), no local dependence (LD-χ2 values all < 9.14), and good reliability (ρxx = 0.880, 95% bootstrapped CI: [0.871, 0.889]). Latent trait scores from this model (EAP estimation) correlated very highly with total scores on the original ISQ-20 (r = 0.942). We, therefore, concluded that this short form adequately represented the longer measure from which it was derived. A regression of ISQ-8 score on age and sex across the full sample explained very little of the variance in interoceptive sensibility (R2 = 0.045), although a statistically significant main effect of sex indicated moderately higher levels of interoceptive difficulties in autistic women and girls compared to autistic men and boys (βF-M = 0.612, p < 0.001). The main effect of age and the age by sex interaction were not significant (p’s > 0.104). These results were the same whether participants were grouped by reported sex or by gender identity.

Discussion

The current study is the first to evaluate the latent structure of an interoceptive sensibility questionnaire in a large sample of autistic individuals, presenting preliminary data to support the use of a shortened version of the ISQ (ISQ-8) in this population. The unidimensional factor model of the full-length ISQ proposed by Fiene and colleagues [38] exhibited suboptimal fit to the data in our sample, likely driven by a large number of unmodeled correlated error terms. However, after removing a number of redundant items and reducing the number of response options from 7 to 5, we were able to create a psychometrically-improved version of the ISQ with unidimensional structure, excellent model-data fit, trivial levels of misspecification, and high score reliability. The ISQ-8 items did not function differently across sociodemographic groups, and the lack of DIF seen between adolescent and adult samples supports the validity of this measure in adolescents on the autism spectrum in addition to autistic adults. Although scores on the ISQ-8 were independent of age, we did find moderately higher levels of interoceptive difficulties in autistic females. This finding notably differed from the lack of ISQ score differences by gender found in the original study by Fiene et al. [38], potentially indicating a sex difference that is unique to individuals on the autism spectrum. Although further validation of the ISQ-8 is needed in both autism and neurotypical samples, our study provides a necessary first step toward developing a robust self-report measure of interoceptive sensibility in the autistic population.

Though Fiene et al. [38] reported that the original ISQ form was unidimensional in structure, the fit of our one-factor CFA model was inadequate, driven by the psychometric consequences of doublet factors (i.e., “asking the same question twice”; [41, 81]). Item pairs such as ISQ Items 5 (I find it difficult to describe feelings like hunger, thirst, hot or cold) and 13 (It is difficult for me to describe what it feels like to be hungry, thirsty, hot, cold or in pain) correlated extremely highly, reflecting shared variance due to the latent factor and additional shared variance due to overlap in item wording or semantic content. When not accounted for in a given model, item redundancy can artificially inflate factor loadings, IRT slope parameters, and model-based reliability coefficients [42,43,44,45], causing some authors to favor high item inter-correlations over the broader content coverage needed for an instrument to have construct validity [50]. Furthermore, as the use of a measure’s summed total score implies a latent trait model with uncorrelated errors [82], questionnaires such as the ISQ-20 with many redundant items produce total scores that are biased estimates of the underlying latent trait. Thus, in order to improve the psychometric adequacy of the ISQ, we felt justified in removing many of the questionnaire’s items to meet the assumption of local independence.

Item response theory models were then fit to the reduced form, confirming its unidimensionality, good reliability, and lack of local dependence. However, analysis of item trace lines demonstrated that the 7-point response scale originally proposed by Fiene et al. contained more response options than were meaningfully used by autistic participants. We thus re-coded the item responses along a 5-point scale, reducing the amount of between-subject error variance attributable to trait-unrelated tendencies to respond closer to the middle of a bipolar scale. Although item trace lines after re-coding indicated that the middle item response was still underutilized in most cases (see also [83] for an argument against the use of neutral response options), it is possible that this pattern would not be observed if participants were to respond to ISQ-8 items on a 5-point scale rather than a recoded 7-point scale. Thus, while this finding does provide preliminary support for the possible elimination of a neutral response option in future versions of the ISQ (see also: [72]), further research using the 5-point response scale is necessary to make conclusive recommendations.
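As a minimal sketch of what collapsing a 7-point scale to a 5-point scale involves, the snippet below merges adjacent non-extreme categories. The specific mapping (2 and 3 merged, 5 and 6 merged) is an assumption made for illustration; this section does not state which categories were combined, so the mapping and the name `recode_7_to_5` should be treated as hypothetical.

```python
# Hypothetical collapsing scheme (an assumption, not taken from the study):
# extremes and the midpoint are kept, adjacent intermediate options are merged.
SEVEN_TO_FIVE = {1: 1, 2: 2, 3: 2, 4: 3, 5: 4, 6: 4, 7: 5}

def recode_7_to_5(responses):
    """Map 7-point item responses onto a 5-point scale.

    Missing responses (None) are passed through unchanged.
    """
    recoded = []
    for r in responses:
        if r is None:
            recoded.append(None)
        elif r in SEVEN_TO_FIVE:
            recoded.append(SEVEN_TO_FIVE[r])
        else:
            raise ValueError(f"response {r!r} is not on the 1-7 scale")
    return recoded

example = recode_7_to_5([1, 2, 3, 4, 5, 6, 7])
print(example)
```

Any scheme of this kind preserves the ordering of responses, so the recoded items can still be analyzed with an ordinal model such as the graded response model.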

After confirming the psychometric adequacy of the ISQ-8 in our sample of autistic adults, we tested the factorial validity of the ISQ-8 in our adolescent sample. Our DIF analyses found that all ISQ-8 items functioned equivalently between adults and adolescents on the autism spectrum, supporting the decision to derive item parameters from a combined adolescent-adult sample. Although model fit was slightly reduced when compared to the adult-only sample (i.e., the C2-based RMSEA increased slightly), the unidimensional graded response model fit these data adequately, justifying the interpretation of estimated ISQ-8 latent trait scores in both adolescents and adults on the autism spectrum. To facilitate the use of these latent trait scores in future studies, we have created an easy-to-use online scoring tool that can convert patterns of ISQ-8 item responses (on either a 5- or 7-point scale) into calibrated latent trait estimates and corresponding T-scores (available at https://asdmeasures.shinyapps.io/ISQ_Score/). However, as these scores have only been validated in autistic adolescents and adults, future studies are necessary to validate these scores in adolescents and adults without autism diagnoses and to determine whether DIF exists between participants on the autism spectrum and the general population.
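For readers less familiar with the graded response model, the following sketch shows how category response probabilities (the item trace lines) follow from a slope and a set of ordered thresholds, and how a latent trait estimate maps onto the conventional T-score metric (mean 50, SD 10). The item parameters are invented for illustration and are not the calibrated ISQ-8 parameters; the T-score conversion assumes the trait is standardized (mean 0, SD 1) in the calibration sample, and the sketch makes no claim about how the online scoring tool is implemented.

```python
import numpy as np

def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under the graded response model.

    theta      : latent trait value
    a          : item slope (discrimination)
    thresholds : ascending thresholds b_1 < ... < b_{K-1} for K categories
    """
    b = np.asarray(thresholds, dtype=float)
    # P(response in category k or higher), for k = 2..K
    p_at_or_above = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    # Pad with P(>= lowest category) = 1 and P(> highest category) = 0
    cumulative = np.concatenate(([1.0], p_at_or_above, [0.0]))
    # Adjacent differences give the probability of each category
    return cumulative[:-1] - cumulative[1:]

def t_score(theta):
    """Convert a standardized latent trait estimate to a T-score."""
    return 50.0 + 10.0 * theta

# Hypothetical 5-category item whose two middle thresholds sit close together,
# which leaves little room for the middle response option.
probs = grm_category_probs(theta=0.0, a=2.0, thresholds=[-1.5, -0.2, 0.2, 1.5])
print(np.round(probs, 3))
print(t_score(1.5))
```

With closely spaced middle thresholds, the middle category is less likely than its neighbors even for a respondent at the trait mean, which is one way an underutilized neutral option appears in trace-line plots.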

This work has meaningful implications for the study of interoception in autistic people, as it provides strong psychometric support for the use of the ISQ-8 as a measure of interoceptive sensibility in this population. While research to date has demonstrated broad group differences in interoceptive constructs associated with autism, the lack of validation in many forms of measurement makes it challenging to identify exactly where these differences lie. The value of psychometric work on the ISQ specifically is that researchers can now employ this tool to examine how interoceptive traits manifest in persons on the autism spectrum, knowing that differences in interoceptive sensibility across this population are not driven by qualitatively different item responding across sociodemographic groups. This measure can also be used to test the convergent validity of other interoceptive sensibility questionnaires in the autistic population, allowing future research to identify whether other tools such as the Body Perception Questionnaire (BPQ; [81]) and the Multidimensional Assessment of Interoceptive Awareness (MAIA) are tapping similar interoceptive constructs in the autistic population. Perhaps most importantly, this work builds on the foundational work of Fiene et al. [27] to provide a robust measurement tool for use in autism interoception research, setting the stage for future investigations of the relations between self-reported interoceptive differences, autistic features, and co-occurring psychopathology.

This study had several notable strengths including its sample size, robust statistical analyses, inclusion of adolescents in the sample, and ability to test the psychometric properties of a measure within a specific clinical group of interest. Psychometric studies are crucial to the success of research in psychology, as the inferences that we can make about psychological constructs are limited by the validity of the tools used to measure them [84]. Given the large sample available through SPARK, we were able to test the psychometric properties of the ISQ in its target population, using that information to refine and validate the scale in both adolescents and adults on the autism spectrum. In our sample, the final form of the ISQ-8 demonstrates high reliability, unidimensionality, and a lack of item redundancy. This brief questionnaire has excellent psychometric properties in autistic individuals, and future studies will determine whether the ISQ-8 is suitable to quantify interoceptive sensibility in other psychiatric conditions thought to be associated with interoceptive deficits [10].

Limitations

One major limitation of this study is the lack of neurotypical individuals with whom to compare broad group differences or conduct differential item functioning analyses by diagnosis. Without this comparison, it is difficult to conclude how individuals with and without autism differ on the ISQ, and it remains possible that the diagnostic group differences observed by Fiene et al. were significantly distorted by DIF. It is also worth noting that our sample contained a relatively high proportion of female participants compared to estimates in the wider autism population (currently estimated at a 3:1 male to female ratio in research; [85]). Our finding that interoception may differ according to sex and gender is in accordance with other work in autism research suggesting sex-based differences in exteroceptive sensory functioning (e.g., [86, 87]). Furthermore, while this study proposes a 5-point response scale for the ISQ-8, our data were not collected using this format; thus, the psychometric properties of the 5-point instantiation of this instrument are not entirely known. Additionally, the ISQ-8 with a 5-point scale has not been validated in neurotypical or other clinical groups where this form may be of interest. Therefore, though the present results support the recommendation that future versions of the ISQ use a 5-point response scale, further work is needed to assess the adequacy of this response format in both autistic and neurotypical populations.

Another shortcoming of this study is the lack of tests of convergent and broader nomological validity. The present study did not test whether the ISQ converged with other measures of interoceptive sensibility (e.g., the BPQ) or showed theoretically-supported associations with related constructs, such as core autism symptoms, anxiety, or neuroticism. This type of research is necessary in the future to determine whether the ISQ taps the same construct that other interoceptive sensibility measures aim to assess and whether this measure can predict important clinical outcomes such as affective symptoms or anxiety.

Lastly, it remains unknown whether self-rated interoceptive sensibility on the ISQ correlates meaningfully with measures of interoceptive accuracy or interoceptive awareness. This limitation in particular makes it challenging to understand how the ISQ is situated within the nomological network of the superordinate interoception construct. While there is some ambiguity regarding the degree to which separable interoceptive subconstructs should correlate, general difficulties in interoceptive ability should theoretically cause all three aspects of interoception to covary to some degree.

Another limitation of the SPARK pool is that autism diagnoses are self-reported and are not verified. Although web-based autism registries have been shown to be reliable [88], the lack of confirmation of autism diagnoses limits the study’s ability to draw definitive psychometric conclusions about the performance of the ISQ in this population. This study, therefore, calls for replication in a large sample of individuals for whom autism diagnoses are independently confirmed via gold-standard measures.

In sum, the limitations of this study include the lack of a neurotypical control group, a sample that is unrepresentative of the wider autistic population, reliance of our findings on data derived from the longer ISQ-20, and the lack of tests of the nomological validity of the ISQ-8. Future work would benefit from using the ISQ-8 to compare autistic individuals with neurotypical individuals and with individuals who have other neuropsychiatric conditions, particularly testing whether significant differential item functioning exists across groups. Furthermore, it would be valuable to compare the scores on this measure with other measures of interoceptive sensibility, interoceptive awareness, and interoceptive accuracy. Doing so would not only help establish a fuller picture of interoceptive differences in autism, but also advance our understanding of the psychometrics of the various tools intended to tap various aspects of interoception across populations.

Conclusions

The ISQ is a recently developed measure intended to index interoceptive sensibility in autistic people. However, it has previously lacked robust psychometric evidence supporting its use when evaluating persons on the autism spectrum. Drawing upon data from a large sample obtained via partnership with SPARK, we sought to investigate the ISQ using CFA and proposed a new, short-form version (the ISQ-8) with superior psychometric properties for use in adolescents and adults on the autism spectrum. This revised questionnaire shows great promise as a tool for measuring interoceptive sensibility in autism going forward and would benefit from further studies testing its construct validity both within the autism population and across diagnostic groups.