The autism spectrum hypothesis proposes that autistic behaviours and personality expressions would be presented in clinical populations and within and along with the general population in different severity (Hoekstra et al., 2008). Inside this spectrum, the broad autism phenotype (BAP) is defined by elevated but nonclinical levels of autism-related symptoms expanding them beyond the threshold of the disorder towards the general population levels (De Groot & Van Strien, 2017). Thus, the BAP expressions inserted in the autism spectrum would contribute to understanding the disorder and its developmental trajectories (Landry & Chouinard, 2016). Also, the BAP opens the possibilities of carrying out studies with larger and more heterogeneous samples and, so, entailing a better representation of the population and guaranteeing greater statistical power (Mitchell & Jolley, 2013).

When studying the BAP, the Broad Autism Phenotype Questionnaire (BAPQ; Hurley et al., 2007) has been extendedly applied over different countries and cultures (e.g., China, Shi et al., 2015; Israel, Seidman et al., 2012). It was designed to measure the BAP in parents of children diagnosed with autism spectrum disorder (ASD; APA, 2013) and it has been subsequently applied to broader community samples (e.g.,Álvarez-Couto et al., 2021; Godoy-Giménez et al., 2018; Jakobson et al., 2018; Jamil et al., 2017; Morrison et al., 2018; Stojković et al., 2018). It represents the BAP original structure: Aloof personality, Rigid personality, and Pragmatic Language impairment (Piven et al., 1997a, b) which paralleled autism disorder domains of impairment of the prior DSM-IV definition (APA, 1994).

While research in psychometric properties of the BAPQ has provided validity evidence on the BAPQ scores’ inferences (based on their relations with other variables, Sasson et al., 2013b; on test-criterion relationships, Broderick et al., 2015; on differential scores among groups, Shi et al., 2015; and on comparing BAPQ versions, Sasson et al., 2014), little is known about BAPQ reliability at the different levels of the severity continuum, item severities or the adequacy of response categories. According to the autism spectrum hypothesis, understanding that autism expressions are spread somehow across the population implies that the items of the BAPQ could be scaled according to their severity. It would imply that the items conforming to each of the three BAPQ subscales should express different degrees of severity to grasp the essence of the spectrum of autism (at subclinical levels in the case of the BAP) and, at the same time, that every person may be located at some point inside this continuum even those who express mild severity BAP behaviours.

Item response theory (IRT) become of great relevance in this respect as they could bring many advantages in the study of BAPQ psychometric properties (e.g., Bond & Fox, 2015) providing, among others: (i) both interval-level scaling of persons and precision estimations for each severity level (essential for interpreting BAPQ’s scores according to their severity), (ii) test-independent scores, useful for comparing BAP levels measured with different tests, or (iii) conjoint scaling of items and respondents along the continuum and for inferring which behaviours are more likeable at different levels of severity. This conjoint estimation will also allow inferring the amount of BAP severity of each item (the magnitude of BAP that would be required to adhere to each expression, scaled in a severity continuum from above-average functional behaviours to severe autistic-associated behaviours which require substantial help) and will ease the discussion about the differential performance of the items in different groups (e.g., ASD relatives vs. the general population). Furthermore, the resulting severity order can be used as an additional source of validity evidence.

On the other hand, many authors have highlighted the importance of assessing differential test functioning (DTF) and differential item functioning (DIF) to assure that all the test adaptions preserve the equivalence of measurement (item hierarchies and/or severities) among cultures (Hambleton, 1994). Unlike other questionnaires that have analysed test stability between versions (the Social Responsiveness Scale—SRS; Bölte et al., 2008), the BAPQ DTF has never been tested across test adaptations. This is especially important considering that some social expressions linked to ASD vary their severity depending on the country (Kim, 2012).

This study aims to further analyse the psychometric properties of the BAPQ through IRT models. The importance of this study relies on the ASD dimensional approach in the DSM-V (APA, 2013) and the autism spectrum hypothesis (De Groot & Van Strien, 2017) summed to the necessity of exploring whether the BAP operationalization targeted in the BAPQ would follow this severity-dimensional approach. That is, whether the items of the three subscales of the BAPQ could be graded in severity and displayed creating three hierarchies of severity in three different domains (corresponding to Aloof, Pragmatic Language, and Rigid subscales).

For those purposes, the original BAPQ (Hurley et al., 2007) and its Spanish version, the BAPQ-SP (Godoy-Giménez et al., 2018) will be applied to two community samples (one from the United Kingdom and one from Spain) and analysed using a Rasch rating scale model (RSM, Masters & Wright, 1984).

Secondly, we will test whether the items’ hierarchies and severity of each BAPQ item remain stable in its Spanish version. DTF (equivalence in the functioning of sets of items) and DIF (the loss of item estimate invariance across subsamples of respondents; Bond & Fox, 2015) between both BAPQ versions will reveal that the members of different groups with the same level of BAP would answer in significantly different ways to the same items (e.g., Hambleton, 2006). Even if the severities of some autism-related characteristics could be affected by cultural idiosyncrasies, item invariance failures could also alert us to potential problems with the test (Bond & Fox, 2015). If so, we will consider whether these severity differences are due to cultural differences. This aim is also very important since the BAPQ is one of the most translated tests of the BAP (e.g., Bang et al., 2021 [Swedish version]; Godoy-Giménez et al., 2018 [Spanish version]; Shi et al., 2015 [Chinese version]) and no evidence has been documented about the invariance structure of the BAPQ either the functioning of its items across versions.

The following results would support the use of the original BAPQ and the BAPQ-SP: (i) an adequate item fit informing about the capability of the resulting item hierarchy to predict people patterns of responses (i.e., items statements should endorse progressively from the least to the most severe); (ii) separation indices indicating that it is possible to distinguish, at least, between two statistically different strata in the sample; (iii) item-person maps showing the BAPQ items are distributed covering the severity levels of the persons, and (iv) all the response categories being distinguishable for the respondents. We believe that our results will be in line with previous studies indicating adverse evidence based on the BAPQ internal structure (e.g., Lin et al., 2021; Stojković et al., 2018). Especially, a lack of unidimensionality of some BAPQ subscales is expected, mainly in the case of the Pragmatic Language or Rigid subscales (see Godoy-Giménez et al., 2018; Lin et al., 2021; Sasson et al., 2013a; Sharma and Bhushan, 2018; Stojković et al., 2018). If results matched with expectations, it would be impossible to perform RSM in these subscales. Additionally, we expect to find some item severity equivalence among the BAPQ items when comparing the English (BAPQ-EN) and the Spanish (BAPQ-SP) versions. Finally, but not less important, in this study we also present the correction of two items of the Spanish BAPQ (items 4 and 23; Godoy-Giménez et al., 2018). Precedent Spanish version of item 4 does not completely represent the original content of the item in the BAPQ-EN (the Spanish item 4 is expressing the idea of following the track of the general conversation while the original item 4 in the BAPQ-EN is expressing keeping the track of your point in a conversation). For item 23, its Spanish version seems to express a preference for superficial conversations (an aloof prototypical behaviour; corresponding to a direct item) whereas in the BAPQ-EN this item reflects the idea of being good at casual or spontaneous conversations (reversed item). Alternative translations for these items are provided (see instruments section). After adjusting both translations, we will expect that they will not exhibit adverse item fit. A closer inspection of those items will be performed as we expect that alternative translations will work adequately and they will not generate any discrepancy in their severity estimation and comparisons between both BAPQ versions.

Method

Participants

Quota sampling was performed for having access to representative samples, regarding age and sex, of the Spanish and English populations. In the case of Spain, seven quotas by age and gender were calculated on a sample of 600 participants. Through incidental snowball, university students helped us to have access to a larger and general sample that involved themselves, relatives, friends, and acquaintances (undergraduate received two-course credits for collaborating). Later on, incidental sampling was conducted until reaching 970 Spanish participants (Social Networks and Spanish associations of autism took part). The English sample was recruited from an international company (https://www.gfk.com) and it consisted of 533 participants. We prepared eight quotas on a sample of 500 participants and we gave them to the company. The enterprise posted a notice with the study information in an online panel of respondents and participants who met the criteria were given access to the survey. Accessibility was restricted once a quota was completed. Participants received €6 for their collaboration. We took nationality as an exclusion criterion in both cases.

Sample sizes adequacy was calculated for item calibrations. Considering a test with 36 polytomous items, a sample of 500 would produce robust statistically stable measures (item calibrations or person measures stables within ± 1.0 logits, robust confidence) in adverse circumstances (Aziza et al.., 2020; Linacre, 2018; Wright & Stone, 1979). Furthermore, in the present study item separation indices were tested to confirm that both sample sizes were adequate to estimate the item hierarchies (as exposed hereafter, indices > 3 assume that the sample size is large enough to verify the item severity hierarchy; Linacre, 2018; and to find the same item placements if identical items are applied to another same-sized sample behaving similarly; Bond & Fox, 2015). Finally, for conducting comparisons between groups (testing differential item or test functioning) and for longer than two-item scale, a sample size of 200 participants per group is required for adequate power (> 80%; Scott et al., 2009).

Instruments

The BAPQ self-report form (Hurley et al., 2007) is a 36-item screening questionnaire that demands participants to answer how frequently a statement applies to them on a 6-point scale from 1 “Very rarely” to 6 “Very often”. BAPQ items are grouped into three subscales: aloof personality, rigid personality, and pragmatic language problems (items conforming to each subscale and reversed items can be found in Hurley et al., 2007). The self-report BAPQ-SP (Godoy-Giménez et al., 2018) was applied in this study together with the original English self-report one. Alternatives translations of items 4 (“Me cuesta evitar irme por las ramas en una conversación”) and 23i (“Se me da bien la charla insustancial (estar de cháchara)”) were included this time.

Procedure

The data was compiled in a single session of approximately 30 min; both samples had equal conditions. A survey was created in the survey online administrator LimeSurvey (https://www.limesurvey.org/) and it includes the BAPQ together with another test out of scope of the present study and some sociodemographic questions always displayed at the end. The survey had two versions, the English-based language version and the Spanish one. A link drove each participant to an online and completely anonymous survey (links were different depending on the survey language). Personal links (anonymous tokens) were created for the English sample as the company required it to provide a refund for participants’ collaboration. All the participants read the study instructions, the data treatment ethics, and expressed their consent to participate in the study and to the use of their data only for scientific purposes before starting. They were asked to read each question carefully and answer them sincerely. They were also warned that aleatory patterns of responses and short time lapses will be tracked and were reasons enough to remove collaboration reinforcers.

Data analysis

Descriptive statistics and independent sample t-tests for continuous variables were calculated to compare both BAPQ versions’ means (statistical significance was set at p < 0.001 using Bonferroni correction). Effect sizes were estimated with Cohen's d using G*Power software v3.1 (d = 0.2 small effect size, 0.5 medium effect size, and 0.8 large effect size; Cohen, 1988). Dimensionality was checked using a principal components analysis of residuals (its interpretation differs from dimensionality studies using factor analysis). If unidimensionality was achieved, Rasch rating scale models (RSMFootnote 1) for tests that have the same rating scales in all items were applied to each subscale separately (6-option Likert in the BAPQ; see Wright & Masters, 1982). RSM analyses were conducted in WINSTEPS software version 3.63.2 (Linacre, 2018) and plots were drawn with R software version 3.6.1 (R Core Team, 2019).

In the first place, item separation indices were tested to confirm that the sample size was adequate to estimate the item hierarchy. Indices over 3 assume that the person sample is large enough to verify the item severity hierarchy (Linacre, 2018) and to consider that it is highly likeable to find the same item placements if identical items are applied to another same-sized sample behaving similarly (Bond & Fox, 2015). In a second place, item fit (i.e., the degree to which the proposed item location can predict consistently participants’ responses to these items) was analyzed (see Smith et al., 2008). Both, infit (sensitive to patterns of response of people with similar severity levels than one of the items) and outfit (sensitive to patterns of response of people with severity levels far from one of the items) mean-squared residual differences were estimated (values between 0.6 and 1.5 are interpreted as acceptable item fit; Lunz et al., 1990; Wright & Masters, 1982). In a third place, the locations of the items along each severity continuum were analyzed according to their content and the coherence of the item distribution was taken as a source of construct validity. In fourth place, the category probability curves of the items were analyzed. We explored whether the categories were ordered as expected and, whether they had the highest probability of adhesion, at least, at one point of the continuum. In fifth place, the efficacy of BAPQ to distinguish people along the different severity continuums was studied through people's separation indices and person reliability (i.e., the proportion of observed variance not due to measurement error; equivalent to the traditional test reliability). Person separation indices lower than 2 (person reliability < 0.80) indicate that the test cannot even distinguish between high and low severity levels (Bond & Fox, 2015). Furthermore, the precision of each set of items (i.e., Aloof, Rigid, and Pragmatic Language) throughout their corresponding severity continuum was examined using their test information functions (subscale in this case). Finally, we compared the performance between both versions of each subscale as depicted by their items; that is, DTF between the subscales of the English and the Spanish BAPQs with identity plots (for the whole analysis, consult Bond & Fox, 2015; see Fig. 5). These scatterplots represent item severity of each subscale version in each axis, an identity line symbolizing the expected values for severity invariance, and a pair of 95% quality control lines.

Results

The BAPQ-EN items showed higher averages and standard deviations than items in the BAPQ-SP (see Table 1 and Fig. 1). Particularly, nine, 10, and 11 items showed higher means in the Aloof, Pragmatic Language, and Rigid subscales respectively. The means of the items in the BAPQ-EN as well as in the BAPQ-SP were low, some of them were between 3–4 and only two items in the BAPQ-SP were beyond 4.

Table 1 Item descriptive statistics and IRT parameters
Fig. 1
figure 1

Item descriptive statistics: Percentage of responses to each response category in real groups

Twenty-four items showed significant differences between both samples (significance was set at p < 0.001 using Bonferroni correction). Cohen’s d revealed 22 items with small effect sizes (effect sizes; d = 0.2; items 1i, 2, 3i, 5, 6, 8, 9i, 10, 11, 13, 14, 18, 19i, 21i, 24, 25i, 28i, 29, 30i, 31, 33, 35), two mediums (d = 0.5; items 4 and 23i) and none item with large effect sizes (d = 0.8).

Subscale dimensionality and item separation indices

Results pointed to unidimensionality of both Aloof subscales (variances explained by measures were BAPQ-EN: 50.60% and BAPQ-SP: 55.00%). The BAPQ-EN Rigid subscale seemed to present lack of unidimensionality (variance explained by measure: 45.3%, eigenvalue = 2.769) with a secondary dimension (explained variance in the first contrast 12.60%, eigenvalue = 2.76, disattenuated correlation with de primary dimension = 0.41) formed by items 3i, 15i, 19i, and 30i. The BAPQ-SP Rigid subscale was unidimensional (variance explained by measures: 51.10%). Finally, lack of unidimensionality was also observed in both Pragmatic Language subscales (explained variances by measures were BAQ-EN: 37.90%, BAPQ-SP: 42.00%). In the case of the BAPQ-EN, the secondary dimension (explained variance in first contrast 14.2%, eigenvalues = 2.74, disattenuated correlation with the primary dimension = 0.21) was formed by items 7i, 21i, and 34i. In BAPQ-SP, secondary dimension (explained variance in first contrast 9.50%, eigenvalue = 1.97, disattenuated correlation with the primary dimension = 0.49) was formed by items 14, 17, 20, and 32. Lack of unidimensionality impeded to perform further RSM analysis on both Pragmatic Language subscales and Rigid of the BAPQ-EN.

Regarding item separation indices, for the BAPQ-EN, they were between [7.23, 7.64] for Aloof items. For the BAPQ-SP, they were between [7.98, 8.37] for Aloof items and [16.23, 16.94] for Rigid ones.

Correlations of the persons' IRT scores between subscales in BAPQ-SP was: Aloof-Rigidity 0.63, p < 0.001.

Item fit

Infit and outfit mean-squared residual differences for Aloof and Rigid subscales are displayed in Table 1. Regarding the BAPQ-EN Aloof subscale, only Item 18 was slightly upon (1.65) the established boundaries on its outfit index (i.e., [0.6, 1.5]). None item in the BAPQ-SP Aloof and Rigid subscales showed a misfit. Item-total correlations showed adequate values for the BAPQ-EN and BAPQ-SP. The correction of the mistranslation in item 23 resulted in good fit indices and a positive correlation between the item and the subscale (see Table 1) pointing to its adequate functioning similar to what was observed for its English version (BAPQ-EN Aloof subscale).

Item locations

The estimated item and person locations are exhibited in Fig. 2 while item locations and estimated parameters are presented in Table 1.

Fig. 2
figure 2

Person-item maps of the Aloof BAPQ-EN and Aloof and Rigid subscales BAPQ-SP

Overall, Aloof item locations showed higher levels of severity (about one item standard deviation above the item mean) than the average Aloof level in the samples for both BAPQ versions. Additionally, some items shared similar locations in the continuum although they refer to different contents (e.g., items 5 and 18 in the BAPQ-EN or items 5,12i, 31, and 36i in the BAPQ-SP). The item severity mean in the Rigid subscale in the BAPQ-SP was about one item standard deviation of the mean of the items. Items 15i, 22, and 35; items 6 and 13; and items 19i and 24 also clustered in the continuum. The remaining five items were spread in different positions below and above the item severity mean.

Category probability curves

Aloof subscale in the BAPQ-EN and Aloof and Rigid in the BAPQ-SP showed adequate category functioning (see Fig. 3).

People separation indices and person reliability

Fig. 3
figure 3

Category probability curves of the Aloof BAPQ-EN and Aloof and Rigid subscales BAPQ-SP

The separation index was between [2.60, 3.08] for the BAPQ-EN Aloof subscale ([2.81, 3.29] in the BAPQ-SP), and the scale reliability was between [0.87, 0.90] ([0.89, 0.92] in the BAPQ-SP) that means that both Aloof subscales were able to distinguish almost among three statistically different strata. In the BAPQ-SP Rigid subscale, the separation index was between [2.37, 2.70] and the scale reliability was between [0.85, 0.88], which means that it was able to statistically distinguish between two different strata.

Subscale information functions, as well as standard errors of measurements, can be observed in Fig. 4. The point where the Aloof subscale reached its highest precision level corresponded, approximately, with the raw scores 13.14 in BAPQ-EN and 13.22 in BAPQ-SP. The point where Rigid in BAPQ-SP reached its highest precision level corresponded, approximately, with the 13.39 raw score (Fig. 5).

Differential Test Functioning (DTF)

Fig. 4
figure 4

Test information and standard error functions of the Aloof BAPQ-EN and Aloof and Rigid subscales BAPQ-SP. Note. T. I. F. = test information function; S.E. = Standard Error

Fig. 5
figure 5

Identity plot with each Aloof subscale. Note: Scatterplots represent the severity location of the items scores regarding each BAPQ version (BAPQ-EN and BAPQ-SP). Points placed on the x-axis represent the respective item values along the BAPQ-EN severity continuum. Points placed on the y-axis represent the respective item values along the BAPQ-SP severity continuum

Finally, we checked the Aloof subscale DTF between the English and the Spanish BAPQs (see Fig. 5).

Ten out of twelve items of the Aloof subscale showed DIF. Only items 5 and 9i had similar functioning. With similar levels of Aloofness, items 1i, 18, 23i, and 25i would be less likely to be endorsed by the Spanish sample than by the English sample, while the opposite happened with items 12i, 16 i, 27, 28i, 31, and 36i.

Discussion

In this study, we applied IRT to further study two BAPQ versions (BAPQ-EN and BAPQ-SP; Hambleton et al., 1991). Additionally, we tested the stability of the items’ hierarchy and which BAPQ item severities function differentially across versions (Hambleton, 2006).

Significant differences were observed when the same items were applied to both samples (the larger ones are explained further down) being the BAPQ-EN sample the one with overall higher item means. However, only small and medium effect sizes indicated no critical relevance.

Unidimensionality problems in both Pragmatic Language subscales and Rigid one in the BAPQ-EN impeded further rating scale analysis (Bond & Fox, 2015) and, congruently with previous findings, provides negative validity evidence regarding subscales’ internal structures. Particularly, Pragmatic Language adverse evidence was reported in Godoy-Giménez et al. (2018), Sasson et al. (2013a), and Sharma and Bhushan (2018); Rigid adverse evidence was reported in Lin et al. (2021), and adverse evidence of both subscales in Stojković et al. (2018). These findings could be better understood under the light of the most updated BAP operationalization (Godoy-Giménez et al., 2021) which revolves around two core domains paralleling it with the last ASD definition (APA, 2013) and includes some variations in the test content.

Regarding item fit, in the BAPQ-EN, we consider that the item in the Aloof subscale (Item 18) that exhibited little misfit could be explained by a lack of concreteness in the item content (any person rating high or low in BAP could agree with it because being polite is independent of the preference for social interactions), yet this slight item misfit could not degrade the measure (Linacre, 2018). By contrast, in the BAPQ-SP, this item did not show any misfit. As this item is located in different parts of the severity continuums of Aloof in both samples, we could suggest that cultural aspects could be playing a role in how different cultures interpret the item. In this regard, if we consider that the item is characterized by a lack of concreteness, it could be possible that the Spanish group understands the item more concretely. However, it would be also reasonable to think that some cultures give more importance to politeness than others. Finally, it is worth mentioning that this item, as other items of the BAPQ, is linked to particularity as in the BAPQ some items are specifically referred to as casual interactions with acquaintances (see Godoy-Giménez et al., 2018; Hurley et al., 2007). Consequently, it could be possible that both samples have differed in how they applied this particularity at the moment to respond to the item. Any other item has not shown misfit nor item 23 from the Aloof subscale in the BAPQ-SP. This result is relevant since, together with a good item-total scale correlation, it supports the alternative translation of this item. In sum, we can conclude that the results of the present study have reinforced the alternative translations of the Spanish items 4 and 23 (item-total correlation values of Spanish item 4 [0.64] and 23 [0.57] spoke in favour of this).

Thirdly, average item severities for both Aloof and BAPQ-SP Rigid subscales, were upper than the average severity level of both community samples. Consequently, subscales’ reliability reaches its higher value always for upper scores than both samples mean. This confirms our hypotheses that the items will mainly be situated in middle-upper levels of the continuum indicating that the BAPQ severity levels are more targeted for assessing high levels of BAP, like those observed in parents of ASD children as originally intended in Hurley et al., 2007. Thus, for those studies focused on community or severity-diverse populations (e.g., Faso et al., 2016; Morrison et al., 2018), it could be advisable to use tests with subtler BAP indicators to enhance the measurement accuracy at lower BAP severity levels.

Regardless of this issue, the items and persons have congruently been scaled according to their severity in both Aloof subscales and the Spanish Rigid one and that implied that the items of the BAPQ subscales are susceptible to being scaled along a severity dimension (severity hierarchies will also be discussed below in terms of validity evidence). Furthermore, our analyses revelated adequate category functioning for both Aloof subscales and the Rigid BAPQ-SP. Thus, even if the test counts on a 6-point scale, we have reported that all the rating options functioned and are most probably chosen at some point in the continuum in both samples. In the same line, adequate reliability and separation indices of both Aloof and the Spanish Rigid subscales inform that subscale hierarchy can statistically differentiate at least among two different strata of BAP severity (i.e., high and low BAP). This represents an important highlight since it goes in line with the ASD severity dimensional approach and opens the door to locate the BAP inside the autism continuum. In this regard, BAP behaviours would not define a qualitatively different group but rather it would comprehend people who could express autistic-like behaviours, at least, at higher or milder severity that makes them more or less functionally independent.

Our second objective was to compare the differential functioning of both Aloof subscales between the Spanish adaptation of the BAPQ (Godoy-Giménez et al., 2018) and its original version (Hurley et al., 2007). Two items showed invariant severity and thus, they would be equally endorsed in both samples by persons with the same severity level of aloofness. By contrast, the lack of invariance in the rest of the items hindered the comparison of the severity hierarchies of both Aloof subscales between the English and Spanish versions.

On the other hand, the location and content of those invariant items need to be discussed. As such, we consider that a mild introverted behaviour could be not to enjoy being in social situations (item 9i: “I enjoy being in social situations”) which does not imply that the respondent avoids spending time with a few close friends or relatives. By contrast, expressing an instrumental use of the acts of socializing, in general (also involves interactions with relatives, close friends, and partners) could be taken as a more severe indicator of aloofness (item 5: “I would rather talk to people to get information than to socialize”). Nevertheless, it is worth mentioning that this last behaviour still locates the person in social interaction and surrounded by others. Coherently, authors have found that ASD parents have shown less interest in purely social interactions, and report having fewer and lower quality friendships (Faso et al., 2016).

Even though the rest of the items have not resulted invariant, both Aloof hierarchies share similar severity order in both groups. This also deserves to be highlighted as it could be taken as additional validity evidence. Specifically, six items of both Aloof hierarchies have shown the same order (from upper to lower locations in the continuum: item 27, 18, 5, 9i, 25i, 16i). Items that are sharing near placements in both Aloof hierarchies are going to be commented together (upper locations denote more aloofness-like behaviours or more Aloof severity). Item 27 (“Conversation bores me”) is located in upper locations in both severity continuums and it refers to an open expression of disinterest for establishing social communication in casual interactions with acquaintances. Items expressing less amount of Aloof were 18 (“When I make conversation it is just to be polite”), 5 (“I would rather talk to people to get information than to socialize”), and 9i (“I enjoy being in social situations”) which also express this disinterest for social interactions but more subtly and, even when people do not give chance for establishing longer interactions (and do it only for being polite), they do participate in social communication. Finally, items 25i and 16i are below the mentioned items. Coherently with severity scaling of the items, those two items refer to not having an active pursuit of social situations, social involvement, or social enjoyment, but still, it does not imply being in social situations and participating in them. We could consider that they express milder severity behaviours (e.g., item 25i “I feel like I am really connecting with other people”; item 16i “I look forward to situations where I can meet new people”).

Importantly, this order in both Aloof hierarchies is also congruent with ASD specifiers in the DSM-V (APA; 2013): Social impaired behaviours at ASD levels could also vary from a decreased interest in social interactions (severity level 1, individual are more functional), to limited initiation of social interactions (severity level 2), and to unusual social approaches restricted to meet needs (severity level 3, individual are less functional; ASD severity specifiers; APA, 2013). This is also an important highlight because both shared Aloof hierarchy orders would be susceptible of being reinserted in the autistic severity continuum (De Groot & Van Strien, 2017; Hoekstra et al., 2008) and also provide validity evidence about the fact that the BAP does not only shares the same autistic behaviours but also that the scalability of the severity of those behaviours follows the same path than in more severe levels of ASD behaviours (APA, 2013).

By contrast, some items were located in a different order in both Aloof hierarchies. For example, items 1i and 23i are situated below item 27 in upper locations in the BAPQ-SP and item 31i in milder locations between items 5 and 9i while these items are all located below item 5 in the Aloof BAPQ-EN. In the case of the BAPQ-EN, items 28i, 12i, and 36i are surrounding item 27 (upper locations of the severity continuum) while in the Aloof BAPQ-SP those items are targeting milder placements. Items with different placements may be referring to varying social situations. In the case of Aloof BAPQ-SP, severe item 11 (“I like being around other people”) and 23i (“I am good at making small talk”) refers to social enjoyment and involvement but with the nuance of being surrounded by many people or known people. In the BAPQ-EN, upper items 28i, 12i, and 36i express being warm and friendly in social interactions (28i “I am warm and friendly in my interactions with others”; 12i “People find it easy to approach me”; 36i “I enjoy chatting with people”).

For the item severities that did not remain stable between versions, Hambleton (1994) pointed to “genuine cultural specifics” as a possible explanation of item variance. In this regard, and linked to what we have explained earlier about item 18, we could speculate if this could be suggesting that being polite is an important aspect of social interactions for the English sample while social involvement could be more important in the case of the Spanish one. Nevertheless, these hypotheses should be studied in the future. In the same way, the hierarchical three-level framework that explains how culture affects the perception and diagnosis of psychiatric disorders proposed by Rogler (1993) could be applied to ASD and, in extension, to BAP levels of severity. Accordingly, cultural norms mediate both the endorsement of symptoms and how people rate the severity of symptoms. Therefore, two participants with the same BAP severity but from different countries could express different behaviours responding differently to the BAPQ.

Since the severities of some Aloof items are sensible to the influence of diverse cultural aspects, the direction, as well as the magnitude of these influences, should be established if authors want to further study the interactions among ASD and BAP expressions and the culture. This implies integrating a comprehensive approach taking into account both the invariant and the non-invariant item severities and hierarchies, paying special attention to the influence of the cultures on the development of BAP-related behaviours and preferences but also on their expression either in a natural context or in a test. Ethnically-based cultural norms would modulate the perception of socially undesirable mental symptoms, and, ultimately, the own endorsement of those symptoms (Matson et al., 2017). As such, how people report the behaviours they think they are expressing and their severity may be influenced by the aspects that are considered more worrying within the culture. For example, some studies carried out in the United States have suggested that American parents tend to be more concerned about language delays (Coonrod & Stone, 2004) while Indian parents would tend to have early concerns about social difficulties (Daley, 2004), and Latina mothers may be worried about their child temperament (Ratto et al., 2016).

On the other hand, the severity hierarchy of the Rigid BAPQ-SP should be equally commented. Items sharing near placements of the Rigid continuum are going to be discussed together (upper locations denote more rigidity or more Rigid severity). Items 26 (“People get frustrated by my unwillingness to bend”) and 8 (“I have to warm myself up to the idea of visiting an unfamiliar place”) which refers to strong stubbornness or an extreme need for sameness and daily routines were situated above the rest of the items. Below them, we can also find items 13 and 6 that also express a preference for sameness and routines (e.g., 13 “I feel a strong need for sameness from day to day”, 6 “People have to talk me into trying something new”) and 24 (“I act very set in my ways”) and 19i (“ I look forward to trying new things") which also reflect stubbornness, sameness, and routines but more subtly without the nuances of “warm myself” or “unwillingness to bend” that comprehend more severe items (i.e., 26 and 8). Further down in the severity continuum there are items concerning struggling with alternative work procedures (generally about how things must be done) or changes in daily routines (e.g., 35 “I keep doing things the way I know, even if another way might be better”, 22 “I have a hard time dealing with changes in my routine” and 15i “I am flexible about how things should be done”) but, importantly, items do not specify that those difficulties impede to conduct alternative protocols or alternative plans. In lower locations in the continuum, item 33 refers to a preference for fixed work protocol (“I like to closely follow a routine while working”) but does not express the difficulty component. Finally, following severity order, last placements are taken by items that express a preference for daily routines or not dealing with unexpected plan changes (3i “I am comfortable with unexpected changes in plans”; 30i “I alter my daily routine by trying something different”) which we consider behaviours to which almost every person would be comfortable to adhere to. Additionally, we should appreciate that the resulting order of Rigid hierarchy is aligned with the ASD.

Like in the case of Aloof subscales, the resulting order of the Rigid BAPQ-SP hierarchy could open the door to insert behaviours at BAP levels of severity inside the autism continuum (De Groot & Van Strien, 2017; Hoekstra et al., 2008). In this sense, stubbornness, persisting routines and a strong need for sameness (i.e., Item 26) could be nearer clinical levels of severity, mostly when inflexible behaviours cause significant interference with individual functioning in one or more contexts (Item 15i; level 1), it comprises difficulty coping with change (Item 8; level 2), and those behaviours marking interfering in all the spheres (level 3; ASD severity specifiers; APA, 2013). Directly linked with this, some autistic-behaviours found upper in the Rigid BAP-SP severity hierarchy are similar to those observed in Obsessive–Compulsive Personality Disorder (e.g., overly rigid and/or stubborn, perfectionism, and very strict work standards; APA, 2013) or even to obsessive–compulsive disorder ones (e.g., obsessional thoughts; APA, 2013). Of importance, obsessive–compulsive behaviours have been reported comorbid to autism spectrum disorder (e.g., Meier et al., 2015; Micali et al., 2004) and this would also support the resulting Rigid severity order and the reinsertion of this dimension inside the autism continuum.

Altogether, the presented results lead us to think that the BAPQ could have been a useful measurement tool for measuring the BAP in English populations with middle-high BAP severity levels according to the original BAP operationalization (Piven et al., 1997a). Nevertheless, this study, in conjunction with others, has provided adverse evidence about the internal structure of the BAPQ (e.g., Godoy-Giménez et al., 2018; Lin et al., 2021; Sasson et al., 2013a; Sharma and Bhushan, 2018; Stojković et al., 2018). These findings could be better understood considering the most updated BAP operationalization (Godoy-Giménez et al., 2021; Morrison et al., 2018; Sasson et al., 2013b) and suggest that a new BAP test based on an updated BAP definition (as the middle expression of the current two-dimensions operationalization of the ASD; APA, 2013; see Godoy-Giménez et al., 2021) should be built. The new test would aim to measure BAP covering all autism subthreshold levels of expression including those in the general population. Additionally, to accomplish the requirements of the current international research context we believe that this new test should aim to inform about how the severity of the items could vary among cultures.