The Hikikomori phenomenon, also known as social withdrawal, describes adolescents and young adults who remain locked in their homes, unable to work or go to school for months or years. This phenomenon was initially linked to a specific condition of the Japanese culture (Teo & Gaw, 2010), but it has recently become an object of attention in Western countries as well (Malagón-Amor et al., 2015). The current debate on the phenomenon is mainly focused on the absence of a clear definition and of a consensus on whether it is a syndrome or a specific cultural condition (Tajan, 2015; Teo & Gaw, 2010). Indeed, some authors differentiate between two types of Hikikomori. The primary type refers to a condition strictly related to behavioral problems, thus excluding mental illness. In line with this position, social withdrawal, which is considered a primary symptom (Suwa & Suzuki, 2013), characterizes those individuals who avoid the pressures of society, school, and parents by retiring to their own residence for at least six months (Saito, 1998). Two subtypes have been proposed: the ‘hard core’ subtype, which includes individuals who never leave their room and never talk to their families, and the ‘soft’ subtype, which characterizes those individuals who go out and talk to others occasionally (Heinze & Thomas, 2014). The secondary type of the Hikikomori condition is related to a pervasive developmental disorder (Heinze & Thomas, 2014; Suwa & Suzuki, 2013; Suwa et al., 2003), to preexisting psychological issues (Kondo et al., 2013), or to a form of “modern-type depression” (Kato et al., 2016).

Although there are no worldwide epidemiological data on Hikikomori, its prevalence has been found to be approximately 0.87–1.2% in Japan, 1.9% in Hong Kong, and 2.3% in Korea. The prevalence estimates ranging from 12.64% to 63.07% that have been reported in Eastern countries may be linked to differences in inclusion criteria, assessment tools, or recruitment strategies (Pozza et al., 2019).

Concerning construct measurement, three self-report tools have been proposed: the NEET-Hikikomori Risk (Uchida & Norasakkunkit, 2015), the Hikikomori Questionnaire (Teo et al., 2018), and the Hikikomori Risk Inventory (HRI-24; Loscalzo et al., 2022). Among these scales, only the HRI-24 was developed through the collaboration of Western (Italian) and Eastern (Japanese) researchers. This scale showed satisfactory psychometric properties in both contexts, and it was found to be invariant across Italian and Japanese respondents. This feature makes the HRI-24 a promising tool for the study of the Hikikomori phenomenon at an international level. Concerning gender differences, ANOVAs revealed that the HRI-24 total score did not differ between males and females in either the Italian or the Japanese sample. However, the authors did not test measurement invariance across gender. As pointed out in the literature, measurement invariance is a crucial scale property for the comparison between groups to be meaningful (Colledani, 2018; Colledani et al., 2019, 2022; Vandenberg & Lance, 2000).

The first aim of this work was the validation of the HRI-24 in the Italian context with an explicit focus on the adolescent population. Since the Hikikomori phenomenon often starts during adolescence (Koyama et al., 2010) and, once it develops, tends to persist (Hamasaki et al., 2021), an instrument specifically validated for the detection of this condition at early ages could play a pivotal role in reducing the chronicity risk (Koyama et al., 2010). This work took into consideration middle and high school students, who were recruited from schools located throughout the entire national territory.

To date, a short version of the HRI-24 does not exist. However, such an instrument could be useful in large epidemiological and screening studies in which several dimensions are investigated, as well as when the participant burden may be high. The second aim of this work was the development of a short version of the HRI-24.

To reach the objectives, two different studies were carried out. In Study 1, the structural validity of the HRI-24 was verified on a large sample of middle and high school students and measurement invariance across gender and school levels was tested. Moreover, a short version of the instrument was developed and validated. In Study 2, the short form was tested on a new large and independent sample of students.

Study 1

Method

Participants and Procedure

This study was based on the data collected within the project “Generation Z” (“Generazione Z”), a national survey conducted by the Italian National Institute of Health (Istituto Superiore di Sanità) in 2022. The survey included different sections that evaluated students’ demographic characteristics, lifestyles, habits, and subjective feelings. It was administered in electronic format, in the classroom during class time and in the presence of a trained experimenter who was instructed to help students if necessary. Students, their parents, and school directors were asked to consent to student participation in the study. A total of 312 middle and high schools were invited to participate. In the end, 10.1% of the invited middle schools (2 in North West, 4 in North East, 1 in Central Italy, 3 in the South, and 4 in the Islands) and 8.6% of the invited high schools (5 in North West, 3 in North East, 1 in Central Italy, 3 in the South, and 3 in the Islands) participated in this study. Data collection was anonymous, and it took place from 26 March to 13 April 2022. The study was approved by the Ethical Committee of the Italian National Institute of Health (prot. PREBIO CE 01.01, March 28, 2022).

The sample included 2,034 students (mean age = 13.97, SD = 2.05; females = 1,001, 49.2%; 56 students, 2.8%, did not report their gender) attending middle (first, second, and third grade) or high schools (first, second, third, and fourth grade). Students attending middle school were 1,037 (51.0%; mean age = 12.30, SD = 0.935; females = 517, 49.9%; 23 students, 2.2%, did not report their gender), while those attending high school were the remaining 997 (49.0%; mean age = 15.71, SD = 1.30; females = 484, 48.5%; 33 students, 3.3%, did not report their gender).

Measures

The HRI-24 (Loscalzo et al., 2022) was administered to all participants. It investigates the typical Hikikomori feelings and behaviors, and includes 24 items which measure five factors: Anthropophobia, Agoraphobia, Lethargy, Paranoia, and Depressive Mood. Anthropophobia is the fear of people and social contacts. Agoraphobia is the avoidance of places where it would be difficult to get assistance in case of a panic attack or a high state of anxiety. Lethargy is the feeling of having little energy or of being unable or unwilling to do anything. Paranoia describes feelings of suspicion and distrust of others. Depressive mood is a feeling of sadness and disinterest, with loss of pleasure in life. Anthropophobia, Agoraphobia, and Lethargy are measured by four items each, Paranoia by seven items, and Depressive Mood by five items. All items are scored on a 5-point Likert scale ranging from 1 (Strongly disagree) to 5 (Strongly agree). According to the authors, a score can be computed for each subscale as well as a total score representing a Hikikomori risk score. Construct validity, factor structure, and internal consistency were verified in Western (Italy) and Eastern (Japan) countries, and showed satisfactory results (Loscalzo et al., 2022).

Demographic variables such as gender and age were investigated with a few close-ended items. One yes–no item (“Have you ever experienced the tendency to lock yourself in your room for several months, never going out, not even to eat meals or to entertain social relations?”) was used to investigate the Hikikomori risk.

Data Analysis

A five-factor exploratory structural equation model (ESEM) was run on the total sample to explore the structure of the HRI-24. This model was used to test configural, metric, and scalar invariance across gender (males and females) and school level (middle and high schools). All models were run using the robust maximum likelihood estimator (MLR; Yuan & Bentler, 2000). To evaluate the goodness of fit of the models, several fit indexes were inspected: χ2, comparative fit index (CFI), standardized root mean square residual (SRMR), and root mean square error of approximation (RMSEA). A good fit is indicated by non-significant (p ≥ 0.05) χ2 values. Since this statistic is sensitive to sample size, the other fit measures were also inspected. CFI values close to 0.95 (0.90 to 0.95 for reasonable fit), and SRMR and RMSEA smaller than 0.06 (0.06 to 0.08 for reasonable fit) were considered indicative of adequate fit (Marsh et al., 2004). For testing the equivalence of nested models in measurement invariance, the tests of change in CFI, RMSEA, and SRMR (ΔCFI, ΔRMSEA, ΔSRMR; Chen, 2007; Cheung & Rensvold, 2002) were used. Invariance is supported by ΔCFI values ≤|.01|, paired with ΔRMSEA values ≤|.015| and ΔSRMR values ≤|.030| for metric invariance or ≤|.015| for scalar invariance. Mean differences in the scores of the five subscales and the total score across gender and school levels were tested through t-tests. Cohen’s d measures of effect size were also reported (d < 0.2, 0.2 ≤ d < 0.5, 0.5 ≤ d < 0.8, and d ≥ 0.8 denote very small, small, medium, and large effect size, respectively; Cohen, 1988).
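The decision rule for nested invariance models described above can be sketched in a few lines of Python. This is only an illustration of the cut-offs stated in the text, not the code used in the study; the function name and the fit values in the example are hypothetical.

```python
# Cut-offs as stated in the text: |dCFI| <= .01, |dRMSEA| <= .015,
# |dSRMR| <= .030 for the metric step or <= .015 for the scalar step.
SRMR_CUTOFF = {"metric": 0.030, "scalar": 0.015}


def invariance_supported(less_constrained, more_constrained, step):
    """Check whether the change in fit between two nested models is within the cut-offs.

    Both models are dicts with the keys "cfi", "rmsea", and "srmr".
    `step` is "metric" or "scalar" and selects the SRMR cut-off.
    """
    d_cfi = more_constrained["cfi"] - less_constrained["cfi"]
    d_rmsea = more_constrained["rmsea"] - less_constrained["rmsea"]
    d_srmr = more_constrained["srmr"] - less_constrained["srmr"]
    return (
        abs(d_cfi) <= 0.01
        and abs(d_rmsea) <= 0.015
        and abs(d_srmr) <= SRMR_CUTOFF[step]
    )


# Made-up fit indices, for illustration only (not results from this study).
configural = {"cfi": 0.960, "rmsea": 0.040, "srmr": 0.030}
metric = {"cfi": 0.955, "rmsea": 0.042, "srmr": 0.035}
poor_scalar = {"cfi": 0.940, "rmsea": 0.045, "srmr": 0.040}

metric_ok = invariance_supported(configural, metric, "metric")
scalar_ok = invariance_supported(metric, poor_scalar, "scalar")
print(metric_ok, scalar_ok)
```

In this sketch, each step is compared with the immediately less constrained model (configural vs. metric, metric vs. scalar), which is the usual sequence for nested invariance tests.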

Relying on the results of the ESEM, a short version of the HRI-24 was developed by selecting 15 items, three for each of the five subscales. The choice to select three items for each subscale was motivated by the intention to obtain an instrument shorter than the original one and comprising subscales of equal length. The items to be included in the short version, which is denoted as HRI-15, were selected based on several criteria: the strength of the factor loadings on the target factor (selecting items with substantial loadings, even if not necessarily the largest ones), the absence of cross-loadings, the absence of gender and school-level bias, and the relevance of the item content to the target factor. These criteria were employed to develop simple structured measures, ensuring high content validity and avoiding biases related to gender or age.

Concerning reliability, Cronbach’s α coefficients were computed for the total HRI-15 and its subscales and were compared with those of the HRI-24. The Spearman-Brown prophecy formula (Brown, 1910; Spearman, 1910) was used for estimating the internal consistency that was expected for a shortened instrument consisting of only three items for each subscale. In addition, composite reliability (CR; Bagozzi & Yi, 1988) was calculated. It is another measure of internal consistency, which is conceptually similar to Cronbach’s α as it represents the ratio of true variance to total variance. However, compared to Cronbach’s α, CR is often considered a better index of internal consistency (Raykov, 2001).
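As a minimal sketch of the two reliability computations mentioned above, the following Python functions implement the Spearman-Brown prophecy formula and a composite reliability coefficient. The CR formula assumes standardized loadings and uncorrelated residuals (a common simplification, not necessarily the exact computation used in the study), and all numeric inputs are made up for illustration.

```python
def spearman_brown(reliability, n_items_old, n_items_new):
    """Predict the reliability of a scale whose length changes from n_items_old to n_items_new."""
    k = n_items_new / n_items_old  # length ratio (k < 1 when the scale is shortened)
    return k * reliability / (1 + (k - 1) * reliability)


def composite_reliability(loadings):
    """Composite reliability from standardized loadings.

    Assumes uncorrelated residuals, so each error variance is 1 - loading**2.
    """
    sum_loadings = sum(loadings)
    error_variance = sum(1 - l ** 2 for l in loadings)
    return sum_loadings ** 2 / (sum_loadings ** 2 + error_variance)


# Hypothetical example: a 7-item subscale with alpha = .80 shortened to 3 items.
predicted_alpha = spearman_brown(0.80, n_items_old=7, n_items_new=3)

# Hypothetical standardized loadings of a 3-item subscale.
cr = composite_reliability([0.7, 0.7, 0.7])
print(round(predicted_alpha, 3), round(cr, 3))
```

An observed α for a short subscale that exceeds the Spearman-Brown prediction indicates that the retained items are, on average, more strongly interrelated than the items of the full-length subscale.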

Validity was evaluated through Pearson’s correlation coefficients which were calculated between the scores obtained on the full-length and abbreviated scales, with the correction for common items suggested by Levy (1967). Moreover, the average variance extracted (AVE) was calculated for each subscale. AVE is a measure of the proportion of variance captured by a construct relative to the variance attributed to measurement error. In general, values close to 0.50 are considered acceptable and indicative of convergent validity (Fornell & Larcker, 1981). The square root of the AVE was employed to evaluate discriminant validity. If this value exceeds the highest correlation with any other latent variable, discriminant validity is established at the construct level (Fornell & Larcker, 1981).
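The AVE and the discriminant validity check based on its square root can be illustrated with the short Python sketch below. It assumes standardized loadings, in which case the AVE reduces to the mean of the squared loadings; the function names, loadings, and correlations are hypothetical and serve only as an example.

```python
import math


def average_variance_extracted(loadings):
    """AVE for standardized loadings: the mean of the squared loadings."""
    return sum(l ** 2 for l in loadings) / len(loadings)


def fornell_larcker_ok(ave, correlations_with_other_factors):
    """True if sqrt(AVE) exceeds the largest absolute correlation with any other factor."""
    largest = max(abs(r) for r in correlations_with_other_factors)
    return math.sqrt(ave) > largest


# Hypothetical standardized loadings for a 3-item subscale.
ave = average_variance_extracted([0.8, 0.7, 0.6])

# Hypothetical latent correlations with the other factors.
passes = fornell_larcker_ok(ave, [0.62, 0.50])
fails = fornell_larcker_ok(ave, [0.71, 0.50])
print(round(ave, 3), passes, fails)
```

Because the criterion compares the square root of the AVE with the largest correlation, a single strongly correlated pair of factors is enough for the check to fail for both factors involved.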

The effectiveness of the HRI-15 in identifying at-risk individuals was explored using receiver operating characteristic (ROC) curve analysis. This method allows the detection of the total score, among all possible scores, that best discriminates the individuals based on the presence or absence of the measured characteristic (Zhou et al., 2011). To compute a ROC curve, two types of data are needed, one being represented by the total scores of a group of individuals on the instrument and the other being an external criterion (named gold standard), which indicates the condition of each individual (e.g., at-risk vs non-at-risk). Starting from these data, the ROC curve allows for computing sensitivity (i.e., the proportion of at-risk individuals who are correctly classified as being at-risk) and specificity (i.e., the proportion of non-at-risk individuals who are correctly classified as being non-at-risk) for each of the possible total scores. Once the total score maximizing both sensitivity and specificity is identified, it can be used as cut-off score to perform future classifications. In this study, ROC curve analyses were run on both the HRI-24 and the HRI-15. The external classification criterion used in the two analyses was the binary answer (Yes, No) to the self-report question asking individuals to indicate their tendency to lock themselves in their room for several months. The performance of the HRI-24 and the HRI-15 was compared in terms of sensitivity, specificity, accuracy, and area under the ROC curve (AUC). Accuracy is the proportion of individuals correctly classified as being at-risk or non-at-risk. AUC is a measure of classification accuracy that indicates how much an instrument is capable of distinguishing between at-risk and non-at-risk individuals. The four measures range from 0 to 1, with higher values indicating higher capability of the instrument to correctly classify individuals.
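The logic of the ROC analysis described above can be sketched in plain Python as follows. The sketch assumes that the optimal cut-off is the one maximizing Youden's J (sensitivity + specificity - 1), which is one common way to operationalize "maximizing both sensitivity and specificity" and may differ from the criterion used in the study. The scores and criterion values are made up for illustration.

```python
def roc_points(scores, at_risk):
    """Sensitivity, specificity, and accuracy for every candidate cut-off.

    A respondent is classified as at-risk when score >= cutoff.
    `at_risk` holds the external criterion (1 = at-risk, 0 = non-at-risk).
    """
    n_pos = sum(at_risk)
    n_neg = len(at_risk) - n_pos
    points = []
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, at_risk) if y and s >= cutoff)
        tn = sum(1 for s, y in zip(scores, at_risk) if not y and s < cutoff)
        points.append({
            "cutoff": cutoff,
            "sensitivity": tp / n_pos,
            "specificity": tn / n_neg,
            "accuracy": (tp + tn) / len(scores),
        })
    return points


def best_cutoff(points):
    """Cut-off with the largest Youden's J (an assumption of this sketch)."""
    return max(points, key=lambda p: p["sensitivity"] + p["specificity"] - 1)


def auc(scores, at_risk):
    """AUC as the share of (at-risk, non-at-risk) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, at_risk) if y]
    neg = [s for s, y in zip(scores, at_risk) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up total scores and yes/no criterion answers, for illustration only.
scores = [20, 25, 30, 35, 40, 45, 50, 55]
at_risk = [0, 0, 0, 0, 1, 0, 1, 1]

best = best_cutoff(roc_points(scores, at_risk))
area = auc(scores, at_risk)
print(best, round(area, 3))
```

The pairwise definition used in `auc` is equivalent to the area under the empirical ROC curve and makes explicit that the AUC does not depend on any single cut-off.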

Results

The ESEM run on the total sample reached a satisfactory fit (Table 1). All items loaded on the intended factor with large and significant coefficients (Table 2). Configural, metric, and scalar invariance across gender and school levels were also supported (Table 1). The t-tests revealed that females scored higher than males on all the subscales and on the total score (t from -12.185 to -16.998, df = 1,976, p < 0.001, Cohen’s d from 0.547 to 0.766), and that high school students scored higher than middle school students on the total score and all subscales (t from -4.045 to -8.137, df = 2,032, p < 0.001, Cohen’s d from 0.180 to 0.361), excluding Agoraphobia (t = -0.354, df = 2,014.316, p = 0.723, Cohen’s d = 0.016). The effect sizes showed that the mean score differences were medium for the comparisons between males and females, and from very small to small for the comparisons between middle and high school students.

Table 1 Fit Indices of the ESEM Run on the Total Sample and on Subsamples by School Level and Gender, and Results of Invariance Testing
Table 2 Results of the ESEM Run on the Total Sample

Based on the results of the ESEM on the total sample and on the subsamples by school level and gender, 15 items were selected to compose the HRI-15 (Table 2). Since no item showed school-level or gender bias, the selection was performed considering the other selection criteria. For the subscales measuring Anthropophobia and Agoraphobia, the items were selected considering the large loadings on the intended factor (that, in these two subscales, turned out to be the largest ones), the simple structure, and the relevance of the item content to the target factor. For the subscale measuring Paranoia, items 10 and 11 were selected for their large loadings on the intended factor (that turned out to be the largest ones), whereas item 15 was selected for its substantial loading on the target factor, the relevance of its content, and the simple structure. For the subscale measuring Lethargy, items 17 and 18 were selected considering their large loadings on the intended factor (that turned out to be the largest ones) and the simple structure, while item 19 was selected for its substantial loading on the target factor, the simple structure, and the relevance of its content relative to the intended dimension. Finally, for the subscale measuring Depressive Mood, items 23 and 24 were selected for their large loadings on the intended factor, while item 21 was chosen considering its relevant content and simple structure.

The short subscales composed of the selected items showed satisfactory reliability. Cronbach’s α coefficients ranged from 0.72 to 0.91 and were larger than the coefficients for 3-item subscales that were predicted using the Spearman-Brown prophecy formula (Table 3). Furthermore, CR coefficients were found to be satisfactory for both the short and full-length scales (Table 3).

Table 3 Descriptive Statistics, Internal Consistency Coefficients, Correlations and ROC Curve Analyses for Short and Full-Length Scales

The correlation coefficients (corrected for common items) between the short and full-length scales were positive and large (Table 3). A weaker coefficient was observed for the subscale measuring Paranoia (r = 0.60). This result was expected since this subscale exhibited the lowest Cronbach’s α (a value that enters Levy’s correction formula for common items) and since the full-length and short Paranoia subscales differed by four items (the full-length and short subscales pertaining to the other four factors differed by one or two items only). Concerning convergent validity, the AVE values of the full-length scales were close to 0.50 (ranging from 0.46 to 0.54), excluding those of the subscale measuring Paranoia and the total scale, which were lower (0.35 and 0.44, respectively). Concerning the short scales, all AVE values were close to or higher than 0.50 (ranging from 0.47 to 0.57), excluding that of the subscale measuring Paranoia, which was lower (0.44; Table 3; for the short scales, AVE was computed based on the factor loadings of an ESEM model including only the 15 selected items; rs ranging from 0.44 to 0.71; factor loadings from 0.29 to 0.93). These results suggest sufficient convergent validity for both the short and full-length scales. Concerning discriminant validity, the square roots of AVEs were larger than the correlations with the other latent variables for all the full-length scales, excluding those of Anthropophobia and Agoraphobia, which were smaller than the correlation between them (r = 0.71), Paranoia, which was smaller than the correlation with Depressive Mood (r = 0.62), and Depressive Mood, which was smaller than the correlation with Lethargy (r = 0.73). A better result was observed for the short scales, where all the square roots of AVEs were larger than the correlations with the other latent variables, excluding that of Depressive Mood, which was smaller than the correlation with Lethargy (r = 0.71). Overall, the results suggest a satisfactory discriminant validity of both the short and full-length scales.

ROC curve analyses revealed that both the HRI-24 and the HRI-15 can be useful instruments to identify at-risk adolescents who express the tendency to lock themselves in their room to avoid interactions and social relationships (Fig. 1). AUCs for both scales were good: 0.81 for the HRI-24 and 0.80 for the HRI-15 (based on the literature, AUCs between 0.80 and 0.90 can be interpreted as good; Safari et al., 2016). These values indicate that a randomly chosen at-risk individual has a probability of about 80% of obtaining a higher total score than a randomly chosen non-at-risk individual. The cut-off score of 59, defined on the HRI-24 total score, and the cut-off score of 37, defined on the HRI-15 total score, showed good accuracy, sensitivity, and specificity. The cut-off score defined on the HRI-15 total score slightly outperformed that defined on the HRI-24 total score in accuracy and specificity but fell slightly behind it in sensitivity (Table 3).

Fig. 1

ROC Curves for the HRI-24 and the HRI-15. Note. The figure depicts the ROC curves for the HRI-24 and the HRI-15. The closer the curves to the upper left corner, the larger the AUC and the more accurate the instrument. AUC = .81 for the HRI-24, AUC = .80 for the HRI-15.

Brief Discussion

The five-factor structure of the HRI-24 was supported and measurement invariance across gender and school levels was confirmed. A short version of the HRI-24 was developed which consists of 15 items, three for each of the five subscales. Both the HRI-24 and the HRI-15 revealed satisfactory reliability and validity coefficients. Compared with the HRI-24, the HRI-15 showed better convergent (i.e., greater AVEs) and discriminant validity.

The HRI-15 showed slightly larger accuracy and specificity in identifying at-risk adolescents, while the HRI-24 was slightly better in terms of sensitivity. This means that the HRI-24 slightly outperforms the HRI-15 in identifying at-risk individuals, while the latter slightly outperforms the former in identifying non-at-risk adolescents.

Study 2

Method

Participants

This study is based on the data collected within the project “Generation Z” (“Generazione Z”) conducted by the Italian National Institute of Health (Istituto Superiore di Sanità) in 2022. This second data collection took place from 6 May to 7 June 2022, with 21.7% of the invited middle schools (4 in North West, 5 in North East, 8 in Central Italy, 9 in the South, and 4 in the Islands) and 13.2% of the invited high schools (7 in North West, 4 in North East, 2 in Central Italy, 5 in the South, and 5 in the Islands) taking part in the study.

The sample included 1,599 students (mean age = 13.79, SD = 2.08; females = 726, 45.4%; 48 students, 3.0%, did not report their gender). The students attending middle school were 956 (59.8%; mean age = 12.35, SD = 0.98; females = 439, 45.9%; 30 students, 3.1%, did not report their gender), while those attending high school were the remaining 643 (40.2%; mean age = 15.93, SD = 1.31; females = 287, 44.6%; 18 students, 2.8%, did not report their gender).

Measure

The HRI-15 developed in Study 1 was administered to all participants. The 15 items of the instrument are scored on a 5-point Likert scale (from 1 = “Strongly disagree” to 5 = “Strongly agree”) and assess the five factors pertaining to Anthropophobia, Agoraphobia, Paranoia, Lethargy, and Depressive Mood (three items for each factor).

Data Analysis

The factor structure of the HRI-15 was verified through confirmatory factor analyses (CFAs). Three models were tested and compared. In the first model, five correlated factors were defined, each measured by three items (correlated five-factor model). In the second model, the five first-order factors were used as indicators of a second-order factor (second-order model). In the third model, a general factor, measured by all the 15 items of the HRI-15, was modeled together with five non-correlated specific factors, measured by three items each (bifactor model). A graphical representation of the three models is provided in Fig. 2. These models were tested to investigate the scale structure in depth and to determine whether a common underlying construct accounting for the variance in the observed indicators exists, which would suggest that the HRI-15 total score can be used to evaluate the Hikikomori risk. The goodness-of-fit of the three models was evaluated using the same fit indices described in Study 1 and compared using the Akaike information criterion (AIC; Akaike, 1974). A difference in AIC (∆AIC) of 10 or more was considered meaningful (Burnham et al., 2011). All the analyses were run using Mplus 7 (Muthén & Muthén, 2012) and the maximum likelihood estimator with adjusted means and variances (MLMV; Muthén & Muthén, 2012), which provides standard errors and statistical tests that are robust to non-normality.

Fig. 2

Graphical Representation of the Bifactor, Correlated Five-Factor, and Second-Order Models. Note. An. = Anthropophobia; Ag. = Agoraphobia; P = Paranoia; L = Lethargy; D = Depressive Mood. In the bifactor model, a general factor, measured by all the 15 items of the HRI-15, is modeled together with five non-correlated specific factors, measured by three items each. In the correlated five-factor model, five correlated factors were defined, each measured by three items. In the second-order model, the five first-order factors were used as indicators of a second-order factor

For the bifactor model, a series of indices were also calculated: explained common variance (ECV; Sijtsma, 2009; Ten Berge & Sočan, 2004), percent of uncontaminated correlations (PUC; Rodriguez et al., 2016b), and McDonald’s (1999) omega (ω) and omega hierarchical (ωh) coefficients. ECV is the ratio between the common variance explained by the general factor and the total common variance (Reise et al., 2013a, 2013b; Rodriguez et al., 2016a). PUC describes the percentage of covariance terms which only reflect the variance from the general factor (Rodriguez et al., 2016b), and measures the biasing effects of forcing bifactor data into a one-dimensional model. According to Rodriguez et al. (2016b), ECV values > 0.70 paired with PUC values > 0.70 can be taken as an indication that a scale, despite the presence of some multidimensionality, can be regarded as the measure of an essentially one-dimensional construct. McDonald’s (1999) ω and ωh coefficients are factor-analytic “model-based” estimates of internal consistency. The former represents the proportion of variance of the scores that can be attributed to all sources of common variance (i.e., general and domain-specific factors), whereas the latter quantifies the amount of variance accounted for by the general factor (Revelle & Zinbarg, 2009; Zinbarg et al., 2005, 2007). In the present study, ω was computed for the general factor and for each domain-specific factor, whereas ωh was computed for the general factor only. Concerning ω, values close to or greater than 0.70 are satisfactory. Concerning ωh, values larger than 0.75–0.80 indicate that the general factor can be interpreted as the measure of a single construct despite multidimensionality (Reise et al., 2013a, 2013b).
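The bifactor indices defined above can be computed from standardized loadings as in the following Python sketch. It assumes an orthogonal bifactor solution in which each item loads on the general factor and on exactly one specific factor; the function name and the loadings are hypothetical and do not reproduce the estimates of this study. Only the PUC, which depends on the structure of the scale rather than on the loadings, matches the HRI-15 layout (five specific factors with three items each).

```python
from math import comb


def bifactor_indices(groups):
    """ECV, PUC, omega, and omega hierarchical for an orthogonal bifactor model.

    `groups` is a list with one entry per specific factor; each entry is a list
    of (general_loading, specific_loading) pairs, one pair per item.
    Loadings are assumed to be standardized.
    """
    items = [pair for group in groups for pair in group]

    # ECV: common variance due to the general factor over total common variance.
    general_sq = sum(g ** 2 for g, _ in items)
    specific_sq = sum(s ** 2 for _, s in items)
    ecv = general_sq / (general_sq + specific_sq)

    # PUC: share of item pairs whose items belong to different specific factors.
    total_pairs = comb(len(items), 2)
    within_pairs = sum(comb(len(group), 2) for group in groups)
    puc = (total_pairs - within_pairs) / total_pairs

    # Omega and omega hierarchical for the total score.
    general_var = sum(g for g, _ in items) ** 2
    specific_var = sum(sum(s for _, s in group) ** 2 for group in groups)
    error_var = sum(1 - g ** 2 - s ** 2 for g, s in items)
    total_var = general_var + specific_var + error_var

    return {
        "ecv": ecv,
        "puc": puc,
        "omega": (general_var + specific_var) / total_var,
        "omega_h": general_var / total_var,
    }


# Made-up loadings: 5 specific factors x 3 items, general = .6, specific = .4.
example = [[(0.6, 0.4)] * 3 for _ in range(5)]
indices = bifactor_indices(example)
print({name: round(value, 3) for name, value in indices.items()})
```

With 15 items there are 105 item pairs, of which 15 fall within a specific factor, so the PUC of a scale with this layout is 90/105, regardless of the loadings.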

Metric (equality of factor loadings) and scalar (equality of both factor loadings and item intercepts) invariance across gender and school levels were tested. The same indices considered in Study 1 were used to test the equivalence of nested models.

Results and Brief Discussion

The correlated five-factor model showed an excellent fit (χ2(80) = 219.821, p < 0.001; RMSEA = 0.033 [0.028, 0.038]; CFI = 0.985; SRMR = 0.025; AIC = 66,975.664), with items strongly loading on the intended factor (Table 4) and latent factors positively and strongly correlated. The second-order model also reached a good fit (χ2(85) = 377.654, p < 0.001; RMSEA = 0.046 [0.042, 0.051]; CFI = 0.968; SRMR = 0.038; AIC = 67,211.839), with all first-order factors strongly loading on the higher-order dimension (Table 4). However, as indicated by the ∆AIC (|236.175|), the correlated five-factor model fitted the data better than the second-order model. The bifactor model also showed a good fit (χ2(75) = 297.231, p < 0.001; RMSEA = 0.043 [0.038, 0.048]; CFI = 0.976; SRMR = 0.031; AIC = 67,097.939), with all items significantly and meaningfully loading on both the general and the intended specific factors (Table 4). In this model, the factor loadings pertaining to the general factor were, on the whole, quite similar to those observed in the correlated five-factor model. The bifactor model fitted the data better than the second-order model (∆AIC = |113.900|), but the correlated five-factor model was superior to both the bifactor model (∆AIC = |122.275|) and the second-order model (∆AIC = |236.175|).
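The AIC comparison can be checked with a few lines of Python, using the AIC values reported above and the ∆AIC threshold of 10 stated in the Data Analysis section. The variable names are illustrative only.

```python
# AIC values as reported in the text for the three CFA models.
aic = {
    "correlated five-factor": 66975.664,
    "second-order": 67211.839,
    "bifactor": 67097.939,
}

# The model with the smallest AIC is preferred.
best_model = min(aic, key=aic.get)

# Difference between each model and the best one; differences of 10 or more
# are treated as meaningful, following the threshold given in the text.
delta_aic = {name: value - aic[best_model] for name, value in aic.items()}
meaningful = {name: d >= 10 for name, d in delta_aic.items() if name != best_model}

# Pairwise difference between the two non-preferred models.
second_order_vs_bifactor = aic["second-order"] - aic["bifactor"]

print(best_model)
print({name: round(d, 3) for name, d in delta_aic.items()})
print(round(second_order_vs_bifactor, 3))
```

All pairwise differences exceed the threshold of 10, so the ranking of the three models (correlated five-factor, then bifactor, then second-order) does not hinge on small AIC differences.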

Table 4 Results of the Bifactor, Correlated Five-Factor, and Second-Order Models

In the bifactor model, PUC was 0.86, and the ECV and ωh coefficients of the general factor were 0.73 and 0.89, respectively. Taken together, these results suggest that, despite the multidimensional nature of the scale, the common variance accounted for by the general factor can be regarded as essentially one-dimensional (Reise et al., 2013a, 2013b; Rodriguez et al., 2016a).

Since the correlated five-factor model was the best fitting model, it was used to test gender and school-level invariance. The results confirmed full metric and scalar invariance across gender and school levels (Table 5).

Table 5 Invariance of the Correlated Five-Factor Structure Across Gender and School Level

Conclusion

This work aimed to validate the HRI-24 on adolescents and to develop and validate a short version of the instrument. To these aims, two studies were carried out that used two large and nationwide samples of Italian adolescents attending middle and high schools. The results of Study 1 supported the structural, convergent, and discriminant validity of the HRI-24 in adolescent samples. Evidence of configural, metric, and scalar invariance was also found across gender and school levels, indicating that the scale functions in the same way across the considered groups. However, mean score differences were observed across gender and school levels. Another relevant contribution provided by this study is the development of a short version of the instrument, which was called HRI-15. Its psychometric properties were satisfactory and analogous to those of the full-length scale. A final merit of this study is that, for the first time, empirical cut-off scores were defined (59 and 37 for the HRI-24 and the HRI-15, respectively), which allow for discriminating between at-risk and non-at-risk adolescents. Both the HRI-24 and the HRI-15 showed satisfactory accuracy, specificity, and sensitivity. In this respect, the performance of the two scales was very similar. However, while the HRI-24 turned out to be slightly better in identifying at-risk individuals, the HRI-15 proved to be slightly better in identifying non-at-risk adolescents. Researchers should choose between the HRI-15 and the HRI-24 based on their specific needs. If reducing the respondent burden is a crucial consideration, the HRI-15 may be preferred. On the other hand, if greater sensitivity is desired, the HRI-24 could be the preferable option. Study 2 validated the HRI-15 on a different large sample of adolescents. The factor structure of the instrument was verified through CFAs. The results indicated that the correlated five-factor model outperformed the second-order and bifactor models. Moreover, the bifactor model allowed for observing that the general factor, despite the multidimensional nature of the instrument, accounts for a large portion of the variance. This provides empirical evidence that the HRI-15 total score can be used to evaluate the Hikikomori risk, together with the scores on the five subscales. Finally, the results of this study supported measurement invariance of the HRI-15 across gender and school levels.

Although the findings of this work are relevant, some limitations should be highlighted. A single self-report item (“Have you ever experienced the tendency to lock yourself in your room for several months, never going out, not even to eat meals or to entertain social relations?”) was used to investigate the Hikikomori risk. This was done because a consensus gold standard for Hikikomori risk does not yet exist (Teo et al., 2018). Future studies could be devoted to further validating the HRI-24 and the HRI-15, as well as the empirical cut-off scores defined on their total scores, using widely accepted gold standards, if they become available, or individuals from the clinical population. Another limitation is the lack of empirical evidence of validity in comparison to other measures of the same construct or related constructs, such as depression, poor emotional regulation, gaming addiction, or internet addiction (Lin et al., 2022). Future studies should examine this aspect of validity. In the present work, females were found to score higher than males on all HRI-24 subscales and on the total score, whereas in the work of Loscalzo et al. (2022), such gender differences were not observed. Future studies should further explore gender differences, as well as try to replicate our findings in cross-cultural contexts. Finally, future studies could aim to develop even shorter versions of the instrument to use in large epidemiological studies. For instance, it would be interesting to develop a very short scale, including only one item from each subscale. For this purpose, the most suitable items could be selected relying on the I-ECV (item-explained common variance) indices from a bifactor model since they allow for identifying the items that are most adequate to develop an essentially unidimensional measure (Stucky et al., 2013).

Overall, this work provides a relevant contribution to the literature on the Hikikomori phenomenon by validating, on adolescent samples, a scale that has been found to be adequate in both Eastern and Western contexts. This contribution could foster the worldwide expansion of the research on this increasingly alarming phenomenon. Moreover, the validation of the scale on adolescents would help professionals to screen young people at the first onset of the phenomenon in order to reduce the chronicity risk. For this purpose, this work provides two other relevant contributions by defining a short version of the instrument, which could be highly useful in large epidemiological and screening studies, and by providing cut-off scores that can be used to identify at-risk adolescents. After identifying them, the specific factors that are responsible for the onset of the Hikikomori condition can be singled out and, based on them, policymakers can develop appropriate prevention campaigns.