The Inventory of Callous-Unemotional Traits (ICU; Frick 2004) is the most commonly used measure of callous-unemotional traits (CU) in children and adolescents. The ICU has been applied in multiple studies of both community and at-risk samples in several countries and languages (Byrd et al. 2017; Essau et al. 2006; Hawes et al. 2014a; Hawes et al. 2014b; Kimonis et al. 2008; Roose et al. 2010). Research on the factor structure of the ICU is not consistent with regard to the dimensionality of the measure (e.g. Ciucci et al. 2014; Pechorro et al. 2016b; Hawes et al. 2014b). A recent meta-analytic paper argues that there is neither a strong theoretical background nor empirical correlates to substantiate subfactors on the ICU (Ray and Frick 2018). Three studies applying Item Response Theory (IRT) analysis have suggested that the empirically derived factor structures of the ICU are due to methodological variance caused by the use of standard- and reverse-scored items (Lin et al. 2018; Paiva-Salisbury et al. 2017; Ray et al. 2016). Typically, studies find two main factors related to item-scoring procedures: a Callousness factor consisting of standard-scored items like “I do not care who I hurt to get what I want”, and an Uncaring factor consisting of reverse-scored items like “I apologize to people I hurt”. IRT analyses suggest that endorsing the socially undesirable statements of the standard-scored items is harder (i.e. more difficult in IRT terminology) than not endorsing the socially encouraged statements of the reverse-scored items.

Another possible explanation for the variations observed in the ICU factor structure could be differential item functioning (DIF) across raters and age groups. In previous studies, the parent-report version has primarily been studied with children younger than 12 years, while the self-report version is most widely used in adolescent samples. The DIF between self- and parent-report on the standard- and reverse-scored ICU items observed in a large sample of delinquent youth (Lin et al. 2018) could give rise to different factor structures of the measure across raters. Additionally, a meta-analysis of studies that supported bi-factor models of the ICU found that the unique variance explained by the general CU factor exceeded that of the specific bi-factors (Ray and Frick 2018). These findings suggest that despite the observed factor structures created by method variance effects, DIF and other sample-based sources of variance, the ICU primarily measures a core unidimensional CU construct. Additional research across samples and responder groups is needed to assess the empirical support for such a unidimensional conceptualization of the ICU.

Study Aims and Hypotheses

The primary aim of this study was to further investigate the structural validity of the ICU. The parent-, self- and teacher-report versions of the ICU were tested in a Norwegian mixed-gender sample of adolescents with behavior problems. We hypothesized that all three versions of the Norwegian ICU represented a unidimensional construct of callous-unemotional traits (Paiva-Salisbury et al. 2017; Ray et al. 2016; Ray and Frick 2018). In line with previous IRT-based research, we assumed that any factor structure found in the data would primarily be due to method variance resulting from the increased difficulty of agreeing with standard-scored items compared to disagreeing with reverse-scored items. Consequently, we expected that the addition of a method-variance element to the tested structural models of the ICU would improve model fit.

The secondary aim of this study was to assess the concurrent validity of the Norwegian translations of the ICU in the same sample. Although the ICU has been studied in several other languages, psychological assessment tools should be re-assessed when translated and applied in other countries and contexts (de Vet et al. 2011). An instrument used to assess anti-social traits should be assessed particularly closely, as measurement errors could lead to wrongful labeling, stigma and flawed treatment interventions (Edens et al. 2001). The Norwegian ICU has not yet been subject to solid psychometric validation, and this study aimed to address this problem by using a multi-informant design to assess intra-, inter- and cross-rater validity. CU has been linked to increased and particularly recalcitrant behavior problems (Frick 2012), which in adolescents manifest as more pronounced aggressive, violent and criminal behavior (Frick et al. 2014; Frick and White 2008; Lawing et al. 2010). We therefore hypothesized that the ICU would show positive correlations of moderate strength with measures of externalizing problems, aggression, self-reported delinquency and problematic alcohol use. The increased level of anti-social acts in adolescents with CU is linked to a lack of prosocial emotions, guilt, fear and behavioral inhibition (Bjørnebekk 2007; Viding and Kimonis 2018), and we therefore hypothesized that the ICU would show negative correlations of moderate strength with measures of anxiety and behavioral inhibition (Ciucci et al. 2014). As the association between CU and anxiety is potentially moderated and thereby masked by co-occurring conduct problems (Frick 2012), we controlled for the concurrent level of behavior problems in this analysis.

To assess the discriminative validity of the ICU we hypothesized negligible correlations of the ICU to self-ratings of subjective well-being, and to parent- and teacher-ratings of withdrawn/depressive symptoms and somatic complaints.



Data for this study came from a randomized controlled trial of Functional Family Therapy (FFT) in Norway (Ogden 2013). Families referred to Child Welfare Services for FFT-treatment were asked to participate in the trial. The inclusion criteria were adolescents aged 11–19 years who displayed or were at risk for one or more of the following behavior problems: delinquency, aggressive or violent behavior, verbal aggression or threats, truancy, school-related problem behavior and/or drug use in relation to problem behaviors mentioned above. The following exclusion criteria also applied: adolescents living by themselves, autism, acute psychotic episode, imminent risk of suicide, home environments that pose a threat to therapist life or safety, ongoing investigations by the local child welfare service, and concurrent services that were incompatible with commencing FFT-treatment.

We recruited 160 families to the study comprising 160 adolescents, 152 mothers (including eight step- and nine foster-mothers) and 90 fathers (including 15 step- and five foster-fathers). The participants were all informed of their right to later revoke their consent to the study. The adolescents in the sample were 86 (53.7%) boys and 74 (46.3%) girls with a collective mean age of 14.7 years (SD = 1.47). The majority (77.5%) of the adolescents were on or above the culture-, age- and gender-appropriate clinical cut-off score on the externalizing scale of the Child Behavior Check List (CBCL; Achenbach and Rescorla 2007). A small fraction (8.8%) of the adolescents were in the borderline range, while the remaining (12.5%) scored in the normal range. CBCL-externalizing scores were missing for 2 (1.2%) adolescents as their parent(s) only partially completed the CBCL. Among the adolescents, 38.8% resided primarily with one biological parent, 25.5% with both biological parents, 24.8% with one biological parent and a step-parent, 6.4% in adoption- or foster-homes and 4.5% had joint physical custody arrangements. The mothers had a mean age of 43.6 years (SD = 6.80) and the average paternal age was 46.4 years (SD = 7.35). Most of the mothers (76.2%) and fathers (61.9%) were working full- or part-time, and 43.0% of the mothers and 34.4% of the fathers had a university or college degree.


The data used in this study was collected prior to randomization to treatment condition in the trial. All questionnaires for parents and adolescents were programmed in the Ci3 software (Sawtooth Software n.d.). The family met with a research assistant in their home or at a municipality office to complete the questionnaires on portable computers provided by the research assistant. The research assistant gave general instructions on how to use the Ci3 system and was available for assistance during questionnaire completion. The family received a light snack and a minor monetary compensation (approximately 50 US Dollars) for taking the time to complete the questionnaires. When consenting to the trial, parents and adolescents were asked to give consent for the research assistant to contact the adolescent’s teacher. If consent was obtained, teachers were contacted and asked to complete and mail back questionnaires on paper.


Callous-Unemotional Traits

The ICU is a 24-item questionnaire consisting of three standard- and three reverse-scored items derived from each of the four CU items on the Antisocial Process Screening Device (Kimonis et al. 2008). Examples of standard-scored items are “I seem very cold and uncaring to others” and “I do not care who I hurt to get what I want”. Examples of reverse-scored items are “I feel bad or guilty when I do something wrong” and “I try not to hurt others’ feelings”. Items are rated on a 4-point scale: 0 (not true at all), 1 (somewhat true), 2 (very true), and 3 (definitely true). A total score is calculated by adding all item scores together. The self-, parent- and teacher-report versions of the ICU were translated to Norwegian following general guidelines for questionnaire translation (de Vet et al. 2011). In this sample, the ICU means were 33.80 (SD = 12.48) for the parent version, 29.40 (SD = 10.70) for the self-report version and 35.66 (SD = 12.88) for the teacher version.
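The scoring rule described above can be illustrated with a minimal sketch. The set of reverse-scored item numbers below is a placeholder chosen only for illustration (the actual key comes from the ICU scoring instructions), and the function name `icu_total` is hypothetical.

```python
# Placeholder positions (1-24) of the 12 reverse-scored items. This is NOT the
# real ICU scoring key; it only mimics the 12 standard / 12 reverse split.
REVERSE_ITEMS = frozenset(range(1, 24, 2))
MAX_RATING = 3  # items are rated 0-3


def icu_total(ratings):
    """Sum 24 item ratings after flipping the reverse-scored items."""
    if len(ratings) != 24:
        raise ValueError("expected 24 item ratings")
    total = 0
    for item_number, rating in enumerate(ratings, start=1):
        if rating not in (0, 1, 2, 3):
            raise ValueError(f"item {item_number}: rating must be 0-3")
        if item_number in REVERSE_ITEMS:
            # A prosocial statement adds more to the CU score the less
            # it is endorsed.
            total += MAX_RATING - rating
        else:
            total += rating
    return total


# Endorsing every callous item fully and rejecting every prosocial item
# gives the maximum possible score, 24 * 3.
highest = [0 if i in REVERSE_ITEMS else 3 for i in range(1, 25)]
print(icu_total(highest))  # 72
```

Under this rule, a respondent who answers 0 to every item still scores 36, because rejecting the twelve prosocial statements counts toward CU.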

Externalizing Behavior

Parents and teachers in this study completed the widely used CBCL and Teacher Report Form (TRF; Achenbach and Rescorla 2001), respectively. Both questionnaires contain 120 items describing various child behaviors that are rated on a 3-point scale: 0 (not true), 1 (somewhat or sometimes true), and 2 (very true or often true). Raw scores were used to ensure analyses of the full range of variability on the scale and avoid loss of data resulting from the truncating process entailed in conversion to T-scores (Achenbach and Rescorla 2001). In this study the rule-breaking behavior and the aggressive behavior subscales showed acceptable reliability, αs = .813–.921, and were used as parent- and teacher-reported measures of externalizing behavior. In addition, parents were asked to report at what age the behavior problems of their child started, M = 9.32 years, SD = 4.57.

Self-Reported Delinquency

The Self-Reported Delinquency scale (SRD: Elliott et al. 1983) consists of items related to offences with a base rate above 1% as reported in the Uniform Crime Report (Elliott and Ageton 1980). In the present study, the adolescents reported how many times (from 0 to 9) in the past month they had engaged in each of the delinquent behaviors listed on the SRD. The total sum of all items on the SRD was used as a measure of the amount of delinquency, and this measure showed good reliability (α = .904). A sum score of dichotomized answers (yes/no) on each of the SRD questions was calculated as a measure of the diversity of delinquency. Both the SRD-total and the SRD-diversity scores were truncated due to floor effects and skewed distributions. Therefore, both scores underwent Box-Cox transformation to allow for parametric analysis within a range of .15–.95 and .15–.98 of the z-scores, respectively.
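As a rough illustration of the two SRD-derived scores, the sketch below computes the amount and diversity of delinquency for one respondent. The function name and the example counts are made up, and the Box-Cox transformation applied in the study is deliberately left out.

```python
def srd_scores(counts):
    """Return (amount, diversity) for one adolescent.

    counts: number of times each listed offence was committed in the
    past month, each between 0 and 9.
    """
    if any(not 0 <= c <= 9 for c in counts):
        raise ValueError("counts must lie between 0 and 9")
    amount = sum(counts)                         # total number of acts
    diversity = sum(1 for c in counts if c > 0)  # number of distinct act types
    return amount, diversity


# Made-up answers to six offence items.
example = [0, 3, 0, 9, 1, 0]
print(srd_scores(example))  # (13, 3)
```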


The Norwegian version of the Angry Aggression Scales (AAS; Bjørnebekk and Howard 2012) was used to assess four motivationally distinct types of aggression. The scales are based on Howard’s (2011) quadripartite model of aggression (QVT) designating four types of aggression: explosive/reactive, vengeful/ruminative, thrill seeking and coercive. Items from the explosive/reactive type of aggression include “Sometimes I get so angry that I don’t know what I’m doing”; items from the vengefulness type include “When I feel angry with somebody, I work out ways to get my own back”; items from the thrill seeking type include “When I make someone suffer I get ‘turned on’ and lose control”; and items from the coercive type include “Sometimes I use aggression to control others”. The four AAS subscales had acceptable reliability, with αs ranging from .835 (vengeful/ruminative) to .941 (thrill seeking). Floor effects and skewed distributions were observed for the subscales, which suggested the use of non-parametric tests.

Problematic Alcohol Use

The Alcohol Use Disorders Identification Test (AUDIT; Saunders et al. 1993) is a 10-item self-report measure of alcohol use. Each item is rated on a scale from 0 (never) to 4, where higher scores indicate more frequent alcohol use. Item scores are summed to a total score ranging from 0 to 40. If the first AUDIT question (“How often do you have a drink containing alcohol?”) was answered with a 0, all remaining items on the AUDIT were scored 0. This applied to 61.9% of our sample. A dichotomous variable indicating problematic alcohol use was computed based on a cut-off score of 5 or above on the total AUDIT score (Liskola et al. 2018). This applied to 18.7% of our sample, with the highest observed AUDIT score being 24.
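The AUDIT scoring steps above (skip rule, total score, dichotomization) could be sketched as follows. The function name and the convention of passing `None` for unanswered items are assumptions made for this illustration.

```python
AUDIT_CUTOFF = 5  # cut-off used in the text (Liskola et al. 2018)


def audit_score(items):
    """Return (total, problematic) for ten AUDIT item ratings.

    items: ratings 0-4; items 2-10 may be None when item 1 is 0
    (an assumed way of representing skipped questions).
    """
    if len(items) != 10:
        raise ValueError("expected 10 item ratings")
    if items[0] not in range(5):
        raise ValueError("item 1 must be rated 0-4")
    # Skip rule: a respondent who never drinks scores 0 on all items.
    if items[0] == 0:
        return 0, False
    if any(r not in range(5) for r in items[1:]):
        raise ValueError("items 2-10 must be rated 0-4")
    total = sum(items)
    return total, total >= AUDIT_CUTOFF


print(audit_score([0] + [None] * 9))               # (0, False)
print(audit_score([2, 1, 1, 0, 0, 0, 0, 0, 1, 0]))  # (5, True)
```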


Anxious-depressed symptoms were measured by self-, parent- and teacher-report. Adolescents completed the Symptom Checklist 10-item version (SCL-10; Strand et al. 2003), where 4 items related to anxiety (e.g. “Feeling fearful”) and 6 items related to depression (e.g. “Feeling blue”) are rated on a 4-point scale from 1 (not at all) to 4 (extremely). The full-scale SCL-10 score showed good reliability (α = .919) and was used as the self-report measure of anxiety. A cut-off score of 18.5 is proposed as representing symptoms of psychological problems (Strand et al. 2003), and 39.4% of our sample scored above this cut-off. Parent- and teacher-ratings of anxiety were obtained from the anxious-depressed subscale on the CBCL/TRF, which showed reliabilities of α = .865 and α = .825, respectively. When comparing the CBCL anxious-depressed scale scores to the multicultural, age- and gender-dependent cut-off scores, 37.5% of the sample were in the clinical range and 16.3% were in the borderline range.

Sensitivity to Punishment

The youth version of Carver and White’s BAS/BIS scales (Carver and White 1994) was used to assess sensitivity in the behavioral inhibition system (BIS). The BIS scale consists of 7 items that refer to the anticipation of punishment (e.g. “I worry about making mistakes”). Items are rated on a 4-point scale: 4 (very true for me), 3 (fairly true for me), 2 (partly true for me), and 1 (not at all true for me). Empirical studies conducted by Bjørnebekk (2009) on normal populations of middle school children, and Bjørnebekk and Howard (2012) on a sample of adolescent offenders, have demonstrated the reliability and validity of the Norwegian youth version of these scales. The reliability of the BIS-scale in the present study was acceptable (α = .798).

Measures Related to Discriminative Validity

Three measures were included to assess the discriminative validity of the ICU. Firstly, self-reported well-being was measured by the WHO-5 Well-being index (World Health Organization, Regional Office for Europe 1998). The WHO-5 scale consists of 5 items on well-being that are rated on a 6-point scale, ranging from 0 (Never) to 5 (All of the time). Within an adolescent sample a cutoff score of 9 or lower should advise further investigation of the patient (Allgaier et al. 2012). The scale showed acceptable reliability (α = .844) in our sample, and 47.7% of the adolescents reported scores of 9 or below. Secondly, the withdrawn-depressed scales on the CBCL/TRF were used as discriminative validity measures towards withdrawn-depressed symptoms. Thirdly, the somatic complaints scale on the CBCL/TRF was also included to test for discriminative validity. The reliabilities of these four CBCL/TRF scales were acceptable (αs = .781–.795) apart from the TRF Somatic complaints scale that showed poor reliability (α = .596).


Missing Data

Maternal data was the primary source of parental data, and if missing, paternal data was used when available and complete. Multiple Imputation (MI) was applied within each responder group to estimate missing values based on age, gender and the available scale scores. We ran 20 imputations, resulting in imputed data sets with 159 parental responses, 157 adolescent responses and 95 teacher responses available for the analyses. Given the uneven number of participants in each respondent group, cross-informant analyses created missing data points, which were handled by pairwise deletion.
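Pairwise deletion in a cross-informant analysis can be illustrated with a small sketch: each correlation uses only the adolescents for whom both informants provided a score. The function name and the toy scores are hypothetical, and `None` is an assumed marker for a missing informant.

```python
from math import sqrt


def pairwise_pearson(x, y):
    """Pearson r over cases where both scores are present.

    Returns (r, n), where n is the number of complete pairs used.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least three complete pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy), n


# Toy scores: the second adolescent lacks a teacher rating and the third
# lacks a parent rating, so only three pairs enter the correlation.
parent = [30, 42, None, 25, 38]
teacher = [28, None, 35, 27, 40]
r, n_pairs = pairwise_pearson(parent, teacher)
print(n_pairs)  # 3
```

Because the number of complete pairs differs between informant combinations, the degrees of freedom reported for cross-rater statistics vary from one analysis to the next.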

Data Analysis Plan

We applied Confirmatory Factor Analysis (CFA) to assess the structural validity of the ICU. All CFA analyses were run in Mplus Version 8 (Muthén and Muthén 2017), where we tested the fit of different models of the ICU to our data. We chose the mean- and variance-adjusted weighted least squares estimator (WLSMV) for categorical indicators, as the ICU uses a 4-point scale and the item response distributions in our sample rarely resembled a normal distribution. The fit of each model was assessed using the chi-square (χ²) fit statistic, the comparative fit index (CFI; Bentler 1990), the Tucker-Lewis index (TLI; McDonald and Marsh 1990), the root mean square error of approximation (RMSEA) and the weighted root mean square residual (WRMR). The CFI and TLI both indicate acceptable fit with values above .90 and good fit with values above .95 (Hu and Bentler 1999). RMSEA values between .05 and .08 indicate acceptable fit, whereas values below .05 are considered to represent good fit. The WRMR is considered acceptable when lower than 1.0 for both continuous and categorical variables (Yu 2002). The DIFFTEST function in Mplus was applied to assess the statistical significance of χ² differences between nested models.
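The cut-offs listed above can be gathered in a small helper that labels a set of fit indices. This is only an illustrative sketch: the function name, the label "inadequate" for values outside the stated ranges, and the treatment of exact boundary values are assumptions, and the example values are made up.

```python
def classify_fit(cfi, tli, rmsea, wrmr):
    """Label each fit index using the cut-offs stated in the text."""

    def incremental(value):
        # CFI and TLI: above .95 good, above .90 acceptable.
        if value > 0.95:
            return "good"
        if value > 0.90:
            return "acceptable"
        return "inadequate"

    def rmsea_label(value):
        # RMSEA: below .05 good, .05 to .08 acceptable.
        if value < 0.05:
            return "good"
        if value <= 0.08:
            return "acceptable"
        return "inadequate"

    return {
        "CFI": incremental(cfi),
        "TLI": incremental(tli),
        "RMSEA": rmsea_label(rmsea),
        "WRMR": "acceptable" if wrmr < 1.0 else "inadequate",
    }


# Made-up indices for a hypothetical model.
labels = classify_fit(cfi=0.96, tli=0.93, rmsea=0.09, wrmr=0.85)
print(labels)
```

A model can therefore look good on the incremental indices while failing the RMSEA criterion, which is why the indices are weighed jointly rather than read one at a time.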

All remaining statistical analyses were conducted in SPSS version 25 (IBM Corporation 2017). In the identified best fitting model of the ICU, Cronbach’s αs were calculated separately for each respondent group to analyze the reliabilities of the scales. Corrected item-total correlations (CITCs) were used to identify poorly functioning items, with the limit value set to .30 (Nunnally and Bernstein 1994). Inter- and cross-rater reliabilities were assessed by estimating a single measure, consistency, 2-way mixed effects model for intraclass correlations (ICC) and their 99% confidence intervals. Convergent validity of the ICU was investigated by its Pearson product-moment correlations to the hypothesized measures of relevance. Spearman’s rank-order correlation was used for non-normally distributed variables. Partial correlation was applied when analyzing ICU subfactors to control for the effect of correlating factors. Independent sample t-tests were used to assess gender differences. To control for age and gender effects, logistic regression was applied to assess the influence of CU on the dichotomous variable of problematic alcohol use. The level of statistical significance was set to α = .01, to adjust for multiple testing without being too restrictive when testing correlated hypotheses (Bender and Lange 2001). In the assessment of the concurrent validity, however, the strengths of the relationships were regarded as more relevant than the statistical significance of the results (de Vet et al. 2011). For partial correlation analyses, pooled p values are not computed in SPSS25 for MI-datasets, and we chose to restrictively report on the largest observed p value among the original and imputed datasets.
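For readers unfamiliar with the reliability statistics named above, the sketch below computes Cronbach's α and corrected item-total correlations (each item correlated with the sum of the remaining items) on a toy data set. It is a plain-Python illustration, not the SPSS procedure used in the study; all names and data are hypothetical.

```python
from math import sqrt
from statistics import variance


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cronbach_alpha(rows):
    """rows: one list of item scores per respondent."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def corrected_item_total(rows):
    """Correlate each item with the sum of the other items."""
    k = len(rows[0])
    return [
        pearson([row[j] for row in rows], [sum(row) - row[j] for row in rows])
        for j in range(k)
    ]


# Toy data: five respondents, three items rated 0-3.
rows = [[0, 1, 0], [1, 1, 2], [2, 2, 1], [3, 2, 3], [3, 3, 3]]
alpha = cronbach_alpha(rows)
citc = corrected_item_total(rows)
# Items below the .30 limit would be flagged as poorly functioning.
flagged = [j + 1 for j, c in enumerate(citc) if c < 0.30]
print(round(alpha, 3), flagged)
```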


Confirmatory Factor Analysis

Based on previous research on the parental- and self-report-versions of the ICU, four main models of the ICU were tested in a CFA: a 3-bifactor model (Essau et al. 2006; Roose et al. 2010; Waller et al. 2015), an IRT-shortened 2-factor model (S. W. Hawes et al. 2014b), a unidimensional model (Ray and Frick 2018) and an IRT-shortened unidimensional model (Ray et al. 2016). To account for any method-variance effect of standard- and reverse-scored items, the Unidimensional and 3-bifactor models were also tested with the addition of a method-variance bi-factor loading on all reverse-scored items (Paiva-Salisbury et al. 2017). This addition was not relevant for the IRT-shortened 2-factor model, as the Uncaring factor comprised all reversed-scored items. In the IRT-shortened Unidimensional model, eight of the ten items are reverse-scored, and it was therefore more parsimonious to test for method variance in this model by specifying shared method variance between the two standard-scored items on this scale.

The DIFFTEST function showed significant reductions of the χ² test statistic when method variance was accounted for. In the parent-reported data, this applied to the 3-bifactor model (Δχ² = 53.15, df = 12, p < .001) and the unidimensional model (Δχ² = 100.956, df = 12, p < .001), while the improvement for the unidimensional short model was only marginally significant (Δχ² = 5.194, df = 1, p = .023). In the self-reported data, it applied to the 3-bifactor model (Δχ² = 182.669, df = 12, p < .001), the unidimensional model (Δχ² = 163.030, df = 12, p < .001) and the unidimensional short model (Δχ² = 60.269, df = 1, p < .001). In the teacher-reported data, it applied to two of three models: the 3-bifactor model (Δχ² = 63.900, df = 12, p < .001) and the unidimensional model (Δχ² = 152.841, df = 12, p < .001), but not the unidimensional short model (Δχ² = 1.793, df = 1, p = .181). All other fit indices also improved when a method-variance element was included in the model, again except for the unidimensional short model in the teacher data. Consequently, all subsequent analyses and model comparisons were made between models accounting for method variance. Tables 1, 2 and 3 show the fit indices for these models when analyzing the ICU by parent-, self- and teacher-report, respectively. The fit indices of the models without a method-variance element can be found in Online Resource 1.

Table 1 Fit-indices in the parent-reported data of ICU models including a method variance element
Table 2 Fit indices in the self-reported data of ICU models including a method variance element
Table 3 Fit-indices in the teacher-reported data of ICU models including a method variance element

Across all responder groups, the full unidimensional model displayed overall poor model fit. The IRT-shortened unidimensional model showed adequate fit in relation to the CFI, the TLI and the WRMR indices in both parent- and self-data, while these indices showed marginal fit in the teacher data. In all datasets however, the RMSEA indicated inadequate fit of this model. The 2-factor model showed good fit in parent, self and teacher data based on the CFI, the TLI and the WRMR indices. The RMSEA indicated marginal fit for this model across datasets. The 3-bifactor model showed adequate fit in the parent data and marginal fit in the self and teacher data.

The RMSEA favors models with higher degrees of freedom, and in our study the 3-bifactor model with a method-variance factor has by far the highest degrees of freedom. This could explain why this model has the lowest RMSEA values. Typically, it would require a larger sample size to have enough power to detect significant findings related to RMSEA values on models such as the 2-factor model with 53 degrees of freedom (MacCallum et al. 1996). The CFI and TLI tend to be less sensitive to sample size. The 2-factor model was therefore assessed as the overall better fitting model for the ICU across datasets, despite the slightly lower RMSEA values observed for the 3-bifactor model.

The interpretation of the 2-factor model in relation to method variance effects is not clear-cut. The model embeds a method variance effect within the 2-factor structure, with standard- and reverse-scored items separately constituting each factor. Arguably, the 2-factor structure could be a method-variance artefact and could alternatively be modeled as a unidimensional CU construct comprising the 12 items, with a method variance bi-factor related to the five reverse-scored “Uncaring” items. Post-hoc CFA of this alternative model yielded fit indices very similar to those of the 2-factor model in parent- (χ² = 125.655, df = 51, CFI = .949, TLI = .934, RMSEA = .097, 90% CI = [.075–.118], WRMR = 0.87), self- (χ² = 101.730, df = 51, CFI = .957, TLI = .944, RMSEA = .080, 90% CI = [.057–.103], WRMR = 0.82), and teacher-reported data (χ² = 130.869, df = 51, CFI = .958, TLI = .946, RMSEA = .117, 90% CI = [.093–.142], WRMR = 0.87). Thus, the results of our confirmatory factor analyses alone cannot determine whether the 2-factor model represents two interrelated factors or a unidimensional construct with method-variance effects between items. Therefore, all subsequent analyses of convergent validity used both full-scale and factor scores to enable a comparison of the scores’ correlational patterns.

Reliability and Subscale Correlations

Reliability analyses of the ICU12 were run separately for each respondent group. For the ICU12-P (parent-report), Cronbach’s α was .869 for the full scale, .816 for the 7-item Callousness subscale, and .791 for the 5-item Uncaring subscale. For the ICU12-S (self-report), the Cronbach’s αs were .810 (total scale), .848 (Callousness-subscale), and .811 (Uncaring-subscale). For the ICU12-T (teacher-report), the Cronbach’s αs were .901 (total scale), .855 (Callousness-subscale), and .839 (Uncaring-subscale). For all ICU12 versions, all items showed corrected item-total correlations (CITCs) > .30 to both the total scale and its respective subscale. While the two subscales correlated strongly on the ICU12-P, r(155) = .623, p < .001, and the ICU12-T, r(88) = .707, p < .001, the correlation was weak and not statistically significant on the ICU12-S, r(151) = .174, p = .031. Due to the observed inter-correlation between the ICU12 subscales, partial correlation was applied when subsequently assessing the convergent validity of the subscale scores.
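The logic of the partial correlation can be sketched with the standard first-order formula, which removes the variance one subscale shares with the other before relating it to an outcome. In the example, only the subscale inter-correlation of .623 is taken from the text; the two subscale-outcome correlations are invented for illustration.

```python
from math import sqrt


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y with z partialled out of both."""
    denominator = sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denominator == 0:
        raise ValueError("z is perfectly correlated with x or y")
    return (r_xy - r_xz * r_yz) / denominator


# x = Callousness, z = Uncaring, y = some outcome measure.
r_callous_outcome = 0.40   # invented value
r_uncaring_outcome = 0.35  # invented value
r_subscales = 0.623        # parent-report subscale correlation from the text
adjusted = partial_correlation(r_callous_outcome, r_subscales, r_uncaring_outcome)
print(round(adjusted, 3))  # 0.248
```

With strongly inter-correlated subscales, as in the parent and teacher data, the adjusted association can be considerably smaller than the zero-order correlation.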

Cross- and Inter-Rater Reliability

The cross-rater reliability of the ICU12 was weak, both between parent- and self-report, ICC of .170, 99% CI [−.04, .37], F(149, 149) = 1.41, p = .018, and between teacher- and self-report, ICC of .226, 99% CI [−.05, .47], F(85, 85) = 1.59, p = .018. The cross-rater correlations between the ICU12-P and -S subscales were similarly weak for both the Callousness, ICC = .067, 99% CI [−.15, .27], F(149, 149) = 1.14, p = .209, and the Uncaring subscales, ICC = .141, 99% CI [−.07, .34], F(150, 150) = 1.33, p = .042. The inter-rater reliability of the ICU12 between parents and teachers was also weak, ICC = .204, 99% CI [−.07, .45], F(87, 87) = 1.51, p = .028.
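The single-measure consistency ICC reported above can be computed from the mean squares of a two-way layout (adolescents by raters), and it is tied to the F statistic by ICC = (F − 1) / (F + k − 1) for k raters. The sketch below is a plain-Python illustration with toy ratings, not the SPSS routine used in the study; the function names are hypothetical.

```python
def icc_consistency(ratings):
    """Single-measure consistency ICC and its F statistic.

    ratings: one row per adolescent, one column per rater.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    icc = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    return icc, ms_rows / ms_error


def icc_from_f(f, k):
    """Recover the consistency ICC from the F statistic and rater count."""
    return (f - 1) / (f + k - 1)


# Toy parent/self ratings for five adolescents.
toy = [[10, 12], [14, 13], [20, 22], [8, 11], [16, 15]]
icc, f_value = icc_consistency(toy)

# The parent/self result in the text, F(149, 149) = 1.41 with two raters.
print(round(icc_from_f(1.41, 2), 3))  # 0.17
```

Because this is a consistency definition, a constant shift in one rater's scores leaves the coefficient unchanged; only differences in rank ordering and spread lower it.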

Age and Gender Effects

While self-reported CU showed a small negative correlation to age, r(155) = −.192, p = .017, this was not observed for parent-report, r(157) = .061, p = .446, nor teacher-report, r(93) = .045, p = .665. The 99% CI for the gender difference in ICU12 scores between boys and girls was [−3.70, 2.43], p = .592 for parent-report, [0.73, 6.25], p = .001 for self-report and [−.50, 7.93], p = .023 for teacher-report. On average boys obtained scores that were approximately 0.5 SD higher on the ICU12-S and the ICU12-T when compared to girls, while they were 0.08 SD higher for boys on the ICU12-P.

Convergent, Divergent and Discriminative Validity

Externalizing Problems and Aggression

The results of the correlation analyses between parent, self- and teacher-reported CU and the main measures of externalizing problems are shown in Table 4. By parent- and teacher-report we found strong within-rater correlations between ICU and measures of aggression and rule breaking behavior. The cross-rater correlations were weak and for self- and teacher-reported CU only observed for the rule breaking scale. Age of onset had a weak negative correlation to the ICU-P only. Moderate correlations were seen between self-reported delinquency and the various ICU scores. On a subscale level, the associations appeared more marked for the Callousness subscale.

Table 4 Convergent validity of the ICU-scales to measures of behavior problems

Spearman rank-order correlations showed the expected convergent validity of self-reported CU to coercive and thrill-seeking aggression, r(155) = .286, p < .001 and r(155) = .285, p = .001, respectively. The ICU12-S also had a small to moderate positive correlation to the Vengeful aggression scale, r(155) = .224, p = .005, while the correlation to the Explosive aggression scale was smaller, r(155) = .137, p = .088. Parent-reported CU showed small to negligible correlations to the aggression scales (rs = .035–.149, ps = .065–.675), while teacher-reported CU correlated somewhat with thrill-seeking aggression only, r(91) = .226, p = .035.

Logistic regression controlling for age and gender effects was applied to study the relationship between the ICU12 scores and indicated problematic alcohol use (AUDIT score ≥ 5). Self- and teacher-reported, but not parent-reported, CU showed a significant association with problematic alcohol use: OR = 1.09, p = .008; OR = 1.11, p = .008; and OR = 1.04, p = .173, respectively. When both subscales were entered in the model instead of the total ICU12-S score, the Uncaring subscale had a higher odds ratio, OR = 1.13, p = .045, compared to the Callousness subscale, OR = 1.07, p = .101.
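An odds ratio from a logistic regression refers to a one-unit increase in the predictor, so its practical size depends on how many scale points separate two adolescents. The sketch below compounds a per-point odds ratio over a larger difference, assuming the log-odds are linear in the ICU12 score as the model specifies; the 10-point difference is an arbitrary illustration.

```python
from math import exp, log


def odds_ratio_for_increase(or_per_point, points):
    """Odds ratio implied by a `points`-unit increase in the predictor."""
    # The logistic coefficient is log(OR); it scales linearly with the
    # size of the increase before being exponentiated again.
    return exp(log(or_per_point) * points)


# Self-report OR of 1.09 per ICU12 point, compounded over 10 points.
ten_point_or = odds_ratio_for_increase(1.09, 10)
print(round(ten_point_or, 2))  # 2.37
```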

Anxiety and Punishment Sensitivity

The partial correlations of the ICU scores with measures of anxiety and punishment sensitivity, controlling for concurrent levels of externalizing problems, are shown in Table 5. While the ICU12-S and the ICU12-T showed some small to moderate negative correlations with both within- and cross-rater measures of anxiety and punishment sensitivity, the ICU12-P only demonstrated a within-rater negative relationship with anxiety. On a subscale level, the associations were observed mostly for the Uncaring subscale.

Table 5 Relationship of the ICU-scales to measures of anxiety and behavioral inhibition

Discriminant Validity

Overall, the ICU12-S demonstrated discriminative validity to the WHO-5 Well-being index and the withdrawn-depressed and somatic complaints scales on the CBCL/TRF (rs = −.090–.034, ps = .336–.918). Both the ICU12-P and ICU12-T showed discriminative validity to subjective well-being and parent- and teacher-reports of somatic complaints (rs = −.146–.091, ps = .088–.996), but moderate strength positive correlations to within-rater withdrawn-depressed scales were observed, r(157) = .323, p < .001, and r(93) = .415, p < .001, respectively.


This study sought to assess the structural validity of the ICU and the concurrent validity of the Norwegian ICU translation in a sample of adolescents with behavior problems. In line with previous research, our results indicated a method variance effect of the standard- and reverse-scored items of the ICU (Paiva-Salisbury et al. 2017; Ray et al. 2016). However, our test of models including this method variance failed to yield strong support for the ICU as a unidimensional measure of CU. Neither the full nor the IRT-shortened unidimensional model provided the best model fit in our data. The overall best fit was observed for the 12-item 2-factor model of the ICU across all responder groups. This is in line with some previous studies of the ICU self-report version in adolescent samples (Carvalho et al. 2018; Colins et al. 2015; Pechorro et al. 2016a). These findings do not fully undermine the idea of CU as unidimensional, as the ICU12 2-factor model can be remodeled as a unidimensional construct with a method variance bi-factor related to the reverse-scored items. Our post-hoc analyses showed that this would result in similar levels of model fit. Therefore, we will first discuss the concurrent validity of the ICU total score across responder groups, and then discuss whether the two factors can be meaningfully justified by any distinct pattern of correlations with the concurrent measures.

Cross-Rater Agreement and Concurrent Validity

The low levels of cross-rater agreement observed between parents, adolescents and teachers on the ICU12 were similar to or slightly weaker than those reported in previous studies (Berg et al. 2013; Gao and Zhang 2015; Latzman et al. 2013; Levy et al. 2017; Roose et al. 2010; White et al. 2009). Low levels of parent-child agreement on mental health measures could reflect different responder perspectives (Hemmingsson et al. 2016), as indicated by differential item functioning between observer-report (parent/teacher) and self-report on the ICU (Lin et al. 2018). These findings indicate that the ICU12 might not measure the same construct across respondents, and caution against relying on single-informant assessment of CU.

This study found support for convergent and divergent concurrent validity of the ICU12 when using within-rater measures, as commonly used in previous research on the ICU (Ciucci et al. 2014; Fanti et al. 2009; Kimonis et al. 2008; Pechorro et al. 2016a; Pihet et al. 2014). The ICU12-P showed convergent validity to parent-reported aggression, rule-breaking behavior and age of onset and divergent validity to parent-reported anxiety, with associations of similar strength to those seen in previous studies (Berg et al. 2013; Levy et al. 2017). The ICU12-S showed convergent validity to self-reported aggression, delinquency and problematic alcohol use, and divergent validity to self-reported anxiety and punishment sensitivity. Similarly, the concurrent validity of the ICU12-T was supported by expected associations with teacher-reported externalizing behaviors and anxiety. The ICU12 versions also showed appropriate discriminant validity, apart from the within-rater correlations to the CBCL/TRF withdrawn-depressed scale, possibly caused by shared method variance (Gao and Zhang 2015). These intra-rater results for the hypothesized relationships and the divergent validity support the construct validity of the ICU12 within each respondent group.

Consistent with previous multi-informant validation studies of the ICU, we found fewer and weaker cross- and inter-rater correlations between the ICU12 scores and hypothesized convergent constructs (Berg et al. 2013; Docherty et al. 2016; Lin et al. 2018; Roose et al. 2010). This might be related to the relatively weak inter- and cross-rater reliability of the ICU12 (Docherty et al. 2016; Gao and Zhang 2015; Lin et al. 2018), and to the lack of shared method variance that might inflate intra-rater correlations (Gao and Zhang 2015). Demonstrating cross-rater correlations for convergent validity could therefore point to particularly salient relationships between constructs.

In this study, cross-rater construct validity was most evident between self-reported delinquency and parent- and teacher-reported CU. This applied to both the total delinquency score and the diversity of delinquency score. These cross-rater relationships indicate how CU specifically relates to a higher likelihood of criminal behavior among adolescents with behavior problems (Frick et al. 2014; Frick and White 2008). The strength of the cross-rater relationships was similar to that of the within-rater correlations, indicating that parent-, teacher- and self-reported CU can account equally well for retrospectively self-reported delinquency. A prospectively designed study found parent-reported CU to be better at predicting youth detained status (Docherty et al. 2016), which also provides an argument for not basing the assessment of CU in youth on self-report alone.

For the remaining hypothesized correlates, only some cross-rater support for the concurrent validity of the ICU was observed. The ICU12-S had a small correlation to parent- and teacher-ratings of rule-breaking behavior, but none to parents’ and teachers’ aggression ratings. This contrasts with previous studies using these CBCL scales, where CU has been more strongly associated with aggression than with rule-breaking (Benesch et al. 2014; Berg et al. 2013), or the associations have been equal in strength (Pihet et al. 2014; Waller et al. 2015). The ICU12-S also showed the hypothesized negative relationship to both parent- and teacher-ratings of adolescent anxiety. Overall, the concurrent validity of the ICU12-S was supported by cross-rater relationships.

When comparing the cross-informant concurrent validity of the ICU12-P and the ICU12-T to self-reported aggression, alcohol use, anxiety and fear sensitivity, the ICU12-T displayed small correlations in the hypothesized direction, while the ICU12-P did not. A previous study found a combined teacher-parent CU-measure to have cross-rater relationships to self-reported fear sensitivity (Roose et al. 2010), but it is not known whether this result would have been found using parental CU-scores only. The lower ability of parent-reported CU to gain cross-rater support for its validity compared to teacher-reported CU could indicate that teachers have a more objective and less biased appraisal when assessing a potentially stigmatizing construct such as CU. Another possible reason why the ICU12-P was not related to self-reported anxiety and punishment sensitivity is the weak cross-rater agreement on the Uncaring subscale, which is the subscale of the ICU12-S most strongly related to anxiety and fear sensitivity. If the effects of differential item functioning on the ICU lead parents to assess a CU construct containing less of the Uncaring aspects experienced by the adolescent, this might result in a parent-based CU measure with a less clear relationship to self-reported anxiety and fear sensitivity.

The fact that the concurrent validity of the ICU12 was only partially supported by cross-rater measures could indicate biases in single-rater assessment of CU. While adolescents might underreport, be unaware of or strategically report on a negative character trait like CU (Levy et al. 2017), parents and teachers can only infer the internal states of their child/student and might thus be unaware or biased in their assessments. In addition, differential item functioning between parent- and self-report suggests that some ICU items are read, interpreted and rated differently by parents and adolescents (Lin et al. 2018). Although general recommendations on mental health assessments, supporting the use of comprehensive multi-informant data, might limit the risk of biased single-source scores in CU-assessments (Gao and Zhang 2015; Hemmingsson et al. 2016), there could still be a concern about the degree to which the ICU12 versions, individually or combined, appropriately measure CU.

Unidimensional vs Two Factors

Across raters, this study found the subscales of the ICU12 to show distinctive patterns of correlation to the measures of convergent validity. The Callousness subscale generally showed stronger associations to aggression, rule-breaking behavior and delinquency in our study, in line with studies linking this subscale to rule-breaking behavior (Colins et al. 2015), criminal behavior (Colins et al. 2015; Lin et al. 2018; Pechorro et al. 2016b) and proactive aggression (Fanti et al. 2009; Pihet et al. 2014). The increased item difficulty of this scale (i.e. the higher thresholds for endorsing these standard-scored items) could make it more informative within a group of at-risk adolescents. The Uncaring subscale, on the other hand, was more strongly related to lower levels of anxiety and punishment sensitivity, as seen elsewhere (Colins et al. 2015). These findings support a two-factor representation of the ICU12.

However, findings from other studies are often more mixed, as some studies link the Callousness scale to lower punishment sensitivity, and the Uncaring scale to aggression and externalizing problems (Ciucci et al. 2014; Colins et al. 2015; Fanti et al. 2009). Our results also showed that while the subfactors did show distinct correlation patterns, they rarely indicated correlations not observed for the full-scale ICU12. So, while a factor structure can be observed in the ICU12 due to varying item difficulty and item wording (standard- or reverse-scored), and a distinct correlational pattern of the factors can be seen, the full-scale ICU12 unidimensional score might in and of itself be an appropriate measurement of the CU construct.

Assessing Lack of Prosocial Emotions

Concern could be raised about whether the ICU12 appropriately assesses a lack of prosocial emotions. The ICU12 contains only one item from the Unemotional subscale, which has shown negative associations to emotional empathy and emotional responses to distress (Berg et al. 2013; Kimonis et al. 2008). Post-hoc analysis of our data showed that the Unemotional subscale was unrelated to self-reported delinquency, punishment sensitivity and aggression, and showed some negative correlations to parent- and teacher-reported externalizing problems (see Online Resource 2). Additionally, the self-reported Unemotional subscale showed significant positive correlations to both self- and parent-reported levels of anxious-depressed symptoms.

These findings suggest that the Unemotional subscale is more related to negative emotions in general than to lack of prosocial emotionality specifically (Berg et al. 2013). Several other studies point to weaknesses of this subscale (Carvalho et al. 2018; Ciucci et al. 2014; Latzman et al. 2013; Waller et al. 2015), and a meta-analytic study found it to lack appropriate reliability and construct validity (Cardinale and Marsh 2020). While the current findings indicate that the Uncaring factor might cover some aspects of lowered emotionality related to CU, the call for rewriting the Unemotional items to specifically capture lack of prosocial emotionality is warranted, as this could improve the ICU as an assessment of CU (Waller et al. 2015).

Furthermore, the value of assessing psychopathy in children and adolescents more broadly than CU should not be forgotten (Andershed et al. 2018; Colins et al. 2018; Salekin 2017). Recent studies have shown that the inclusion of measures from other domains of psychopathy leads to better predictions (Andershed et al. 2018; Colins et al. 2018). Within a framework of cumulative risk factors, this corresponds to the notion of improved risk assessment based on the number of risk factors measured. This does not necessarily call for abandoning further use and development of CU-measures when assessing adolescents with serious behavior problems. It does, however, suggest that more detailed and multi-faceted assessments within the broader psychopathic domain might increase our understanding of the multiple factors that contribute to persistent anti-sociality. Similarly, in line with work on psychopathy in adults, it could be relevant to include the multiple dimensions of psychopathy as specifiers for conduct disorder in the DSM-5 and ICD-11 diagnostic systems (American Psychiatric Association 2013; Salekin 2016; World Health Organization 2018). Future research should also prioritize studying whether multiple-informant measures of the various psychopathy dimensions can assist in selecting optimal treatment interventions for children and adolescents with severe behavior problems.


Although this study has strengths in evaluating the ICU using a multi-informant design with several validated measures in an at-risk adolescent population, it has some limitations worth mentioning. Firstly, the sample size of this study is somewhat small, as study recruitment did not reach the goal of 250 participants but was terminated at 160 participants after four years of recruiting. The sample size creates a risk of lacking the power to reach statistical significance for observed relationships and is a barrier to applying Item Response Theory to study item difficulty, item discrimination and differential item functioning. Secondly, the data used in this study were cross-sectional and do not provide information regarding the directionality of the associations found. Thirdly, the data in this study are based on questionnaire responses and did not include observational, neurological or biological data, nor a criterion-related measure of CU or the Conduct Disorder diagnosis. Lastly, in relation to the Unemotional subscale, this study is limited by not including a more focused measure of shallow and flattened affect or a measure of emotional empathy that might have been more relevant for the concurrent validity of this subscale.
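To illustrate the power concern raised in the first limitation, the sketch below estimates the smallest correlation detectable at the achieved and the planned sample size. The two-sided alpha of .05, the target power of .80, the Fisher z approximation and the function name are all assumptions made for this illustration. They are not taken from the study's own power analysis.

```python
import math
from statistics import NormalDist

def min_detectable_r(n: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest |r| detectable with the given power in a two-sided test.

    Uses the Fisher z approximation: atanh(r) * sqrt(n - 3) must reach
    the sum of the critical value and the power quantile.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = std_normal.inv_cdf(power)          # quantile for target power
    return math.tanh((z_alpha + z_power) / math.sqrt(n - 3))

# Achieved sample (160) versus the original recruitment goal (250).
for n in (160, 250):
    print(f"n = {n}: minimum detectable |r| is about {min_detectable_r(n):.2f}")
```

Under these assumptions, the detectable correlation rises from roughly .18 at the planned 250 participants to roughly .22 at 160. The threshold is higher still for analyses based on fewer respondents, such as those involving teacher reports.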