1 Introduction

Social and emotional competencies are recognised as fundamental to optimising childhood mental health and wellbeing, and to improving developmental trajectories (Denham & Brown, 2010; Greenberg et al., 2003; Guerra & Bradshaw, 2008; Weissberg et al., 2015). Many schools today incorporate social-emotional competencies as key educational targets within curricula, and deliver formal teaching of these competencies to all students via Social-Emotional Learning (SEL) programs (Organisation for Economic Co-operation and Development [OECD], 2015). To guide the provision of SEL, the Collaborative for Academic, Social, and Emotional Learning (CASEL) defined five competencies as core foci for these programs: self-awareness, self-management, social awareness, relationship skills, and responsible decision-making (CASEL, 2005). Despite increasing commitment from many education systems internationally to fostering these skills, few validated measures exist that afford large-scale, school-based assessment of students’ functioning on all five competencies (Denham, 2015; Tolan et al., 2016). Fewer still are instruments that have been validated for use in Australia (Frydenberg et al., 2017), where there is marked diversity in the SEL programs delivered and limited empirical evidence regarding their effectiveness (Collie et al., 2017; Dix et al., 2020; Laurens et al., 2022; National Mental Health Commission, 2021). The National Children’s Mental Health and Wellbeing Strategy (National Mental Health Commission, 2021) recently called for nationwide assessment and monitoring of students’ social and emotional strengths and needs. Validation of social-emotional competency measures is required to facilitate this monitoring, and to enable the SEL program evaluations needed to generate a local evidence base to guide SEL provision.

The late middle childhood period (aged 11–12 years) presents an opportune time for social-emotional competency assessment. Children at this age have attained sufficient cognitive development to reliably report on their own health and wellbeing (Bianchi & Robinson, 1997; Riley, 2004); yet, few measures have been developed that capture children’s unique and important perspectives on their social-emotional functioning according to the CASEL model (Denham, 2015). A brief (20-item), self-report measure of CASEL’s five competencies was recently developed and validated using a large representative sample (n = 26,837) of Australian children aged 11–12 years (Carpendale et al., in press) from the longitudinal New South Wales Child Development Study (NSW-CDS; Green et al., 2018). This new measure, the Middle Childhood Survey – Social-Emotional Learning (MCS-SEL), demonstrates sound internal consistency and construct validity, but the consistency of its functioning across key demographic profiles remains undetermined. For use in population-level assessment, measures of social-emotional competencies must function in a similar manner for all children, providing robust and accurate assessment regardless of demographic characteristics such as sex, language, and cultural background (McKown, 2019; Meredith, 1993; Schmitt & Kuljanin, 2008). Such validation is determined through measurement invariance testing (Putnick & Bornstein, 2016). Measurement non-invariance, wherein a measure functions differently across subgroups, indicates that the underlying constructs are not being assessed in the same way. Non-invariance compromises the interpretation of group mean differences and, as a result, can introduce bias and inequity into assessment (Pendergast et al., 2017; Putnick & Bornstein, 2016). According to McKown (2019), affirming measurement invariance is an especially important consideration for social-emotional assessments, where the relative value and presentation of skills may differ across demographic profiles.

Confirmation that an instrument assesses constructs and items in the same way across groups provides a necessary foundation for meaningful quantitative comparisons of social-emotional functioning between subgroups. Boys and girls, for example, demonstrate differences in mean social-emotional functioning and in patterns of social-emotional development across adolescence (Elliott et al., 2018; OECD, 2021; Romer et al., 2011; Ross et al., 2019). Whether these findings reflect true differences in functioning or are a consequence of test bias cannot be ascertained without first establishing measurement invariance according to sex/gender. Among a U.S. sample of children enrolled in the 5th to 11th grades, the WCSD Social-Emotional Competency Assessment (Crowder et al., 2019) showed differential item functioning (DIF) on 6 of 34 items (18%) according to gender. Although this is considered minimal DIF, these findings emphasise the value of establishing measurement invariance across demographic profiles for the newly developed Australian measure of social-emotional competencies, the MCS-SEL (Carpendale et al., in press).

In addition to sex, demonstrating that this English-language measure functions consistently for children from English and non-English language backgrounds is necessary to validate its use within multicultural Australian society. The 2016 Australian Census identified that approximately one-fifth (21%) of Australians spoke a language other than English at home, with over 300 individual languages reported (Australian Bureau of Statistics, 2017). Varying levels of proficiency in English may affect children’s interpretation of, and response to, questionnaire items. Moreover, several studies of social-emotional functioning have found interaction effects between demographic variables, such as sex/gender and race/ethnicity, indicating that measurement invariance should also be tested at the level of interaction subgroups (Crowder et al., 2019; McClure et al., 2010). Interaction subgroups provide a more sensitive means of evaluating invariance than assessing invariance separately for each variable. Crowder et al. (2019) explored the interaction between gender and race/ethnicity using the WCSD Social-Emotional Competency Assessment and observed DIF on four items according to specific subgroup interactions. Had race/ethnicity and gender been explored separately, DIF on these items may have been obscured.

The present study sought to determine the extent to which the brief, self-report MCS-SEL (Carpendale et al., in press) demonstrates invariance across sex-by-language background subgroups. We hypothesised that the measure would demonstrate full configural, metric, scalar, and residual measurement invariance across four demographic profiles (male, English; female, English; male, non-English; female, non-English).

2 Method

2.1 Participants

This study used data from a subsample of children in the NSW-CDS population cohort (Green et al., 2018) who completed the self-report Middle Childhood Survey (MCS; Laurens et al., 2017). The target sample for the MCS comprised all 6th grade students attending a government or non-government school in New South Wales, Australia, in 2015. Schools were recruited to the survey via an email sent to the Principal (head teacher) of every school in NSW with a 6th grade enrolment (n = 2,371); 889 Principals agreed to participate. The MCS was subsequently administered to students enrolled at 829 (93%) of these schools, with 60 schools not completing MCS administration. Child recruitment within participating schools was managed by school personnel. Following child or parent opt-out (n = 1,389) and absences or server failures (n = 3,192), the final sample comprised 26,671 students (31.4% of the 88,572 NSW children enrolled in the 6th grade) with complete MCS and demographic data (for further detail see Laurens et al., 2017). This subsample was representative of the NSW population across a range of demographic indices (Laurens et al., 2017).

The present study used a subsample of children who completed the MCS, excluding those who were not aged 11 or 12 years at the time of MCS completion (n = 330). The remaining 26,348 children were divided into four subgroups according to their sex and main language spoken at home, namely: male, English (n = 11,666); female, English (n = 11,621); male, language other than English (n = 1,542); and female, language other than English (n = 1,519). As large imbalances in the size of subgroups used in measurement invariance testing can produce misleading results and reduce power, random subsampling was used to equate group sample sizes for analysis (Yoon & Lai, 2018). Analyses were repeated across two independent samples (a test and a validation sample) to increase the robustness of study findings, and the representativeness of these samples relative to the unselected students was evaluated. The independent test and validation samples were derived using disproportionate probability pseudo-random sampling, with each comprising 2,000 children (500 from each of the four demographic strata), as sketched below.
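The study’s sampling code is not published; the following is a minimal sketch of a strata-equated, non-overlapping draw, assuming the cohort sits in a pandas DataFrame with hypothetical `sex` and `language` columns (the actual NSW-CDS variable names may differ):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2015)

# Synthetic stand-in for the cohort: one row per child, with hypothetical
# 'sex' and 'language' columns (actual NSW-CDS variable names may differ).
cohort = pd.DataFrame({
    "sex": rng.choice(["male", "female"], size=26348),
    "language": rng.choice(["english", "non_english"], size=26348, p=[0.88, 0.12]),
})

def draw_equated_samples(data: pd.DataFrame, n_per_stratum: int = 500, seed: int = 2015):
    """Draw two non-overlapping samples (test, validation), each containing
    n_per_stratum children from every sex-by-language stratum."""
    test_parts, validation_parts = [], []
    for _, stratum in data.groupby(["sex", "language"]):
        # Draw 2 * n_per_stratum children at once, then split in half,
        # so the two samples are independent and share no children.
        drawn = stratum.sample(n=2 * n_per_stratum, random_state=seed)
        test_parts.append(drawn.iloc[:n_per_stratum])
        validation_parts.append(drawn.iloc[n_per_stratum:])
    return pd.concat(test_parts), pd.concat(validation_parts)

test_sample, validation_sample = draw_equated_samples(cohort)  # 2,000 children each
```

Because each stratum contributes a fixed 500 children, children from the smaller non-English strata are selected with a higher probability than children from the English strata; this is the “disproportionate probability” aspect of the design.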

2.2 Procedure

The MCS survey was administered online by teachers to participating students during class time between July and September of 2015. Students were able to complete the survey over multiple sessions (mean survey completion time = 16.5 min) and children with special needs were able to receive assistance from their usual classroom support (e.g., adult helper) and/or use an audio recording of the survey. A third-party information technology contractor handled administration and email correspondence with school personnel, providing researchers with de-identified data only. Ethical approval for the MCS was obtained from the University of New South Wales Human Research Ethics Committee (UNSW HREC reference HC14307) and the NSW Department of Education State Education Research Applications Process (reference 2015082).

2.3 Measures

The present study used demographic indices and the 20 MCS items that comprise the recently developed MCS-SEL (Carpendale et al., in press). This instrument measures five subscales: Self-Awareness (4 items), Self-Management (3 items), Social Awareness (5 items), Relationship Skills (5 items), and Responsible Decision-Making (3 items). Each item has three response options: not true (scored 0), somewhat true (scored 1), and certainly true (scored 2). The measure has good model fit as a correlated five-factor model (CFI = 0.967, TLI = 0.961, RMSEA = 0.044), satisfactory internal consistency within scales (McDonald’s ω ranging from 0.653 to 0.764), and theoretically coherent construct validity patterns (Carpendale et al., in press). To test the representativeness of the sex-by-language subsamples derived for analysis against the total sample with available data, total subscale scores were generated by summing the items within each subscale.
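To make the scoring concrete, subscale totals amount to a row-wise sum of 0–2 item scores. A minimal sketch follows; the item labels are hypothetical placeholders, not the published MCS-SEL item names:

```python
import pandas as pd

# Hypothetical item-to-subscale mapping (4 + 3 + 5 + 5 + 3 = 20 items).
# Each item is scored 0 (not true), 1 (somewhat true), or 2 (certainly true).
SUBSCALES = {
    "self_awareness":              [f"sa_{i}" for i in range(1, 5)],
    "self_management":             [f"sm_{i}" for i in range(1, 4)],
    "social_awareness":            [f"soc_{i}" for i in range(1, 6)],
    "relationship_skills":         [f"rel_{i}" for i in range(1, 6)],
    "responsible_decision_making": [f"rdm_{i}" for i in range(1, 4)],
}

def score_subscales(items: pd.DataFrame) -> pd.DataFrame:
    """Sum the 0-2 item scores within each subscale to give subscale totals
    (possible ranges: 0-8, 0-6, 0-10, 0-10, and 0-6, respectively)."""
    return pd.DataFrame(
        {scale: items[cols].sum(axis=1) for scale, cols in SUBSCALES.items()}
    )
```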

2.4 Statistical Analysis

A series of one-way analysis of variance (ANOVA) tests was conducted first to determine the representativeness of the two independent samples (test and validation). For each of the four sex-by-language subgroups, the two samples were compared with the unselected sample of remaining children on their mean scores for each of the five subscales. To maximise sensitivity to potential subsample differences, the α level was held at 0.050 for each individual test rather than corrected for multiple comparisons.
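For each subgroup and subscale, this amounts to a three-group one-way ANOVA (test sample vs. validation sample vs. unselected children). A minimal sketch with synthetic scores, using scipy:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Synthetic subscale totals for one sex-by-language subgroup; in the study
# these would be the summed MCS-SEL subscale scores described above.
test_scores = rng.integers(0, 11, size=500)
validation_scores = rng.integers(0, 11, size=500)
unselected_scores = rng.integers(0, 11, size=10_000)

# One ANOVA per subgroup per subscale, each held at alpha = .050
# (no multiplicity correction, to stay sensitive to subsample differences).
f_stat, p_value = f_oneway(test_scores, validation_scores, unselected_scores)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
if p_value >= 0.05:
    print("No significant mean difference: selected samples look representative.")
```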

Measurement invariance was assessed using multi-group confirmatory factor analysis (MG-CFA) in Mplus version 8.0 (Muthén & Muthén, 1998). Relative to other techniques, MG-CFA allows simultaneous detection of uniform (constant across construct levels) and non-uniform (varying across construct levels) DIF, and demonstrates superior sensitivity for identifying DIF among ordinal (categorical) indicators (Kim & Yoon, 2011; Pendergast et al., 2017). The scale was defined by assigning a reference/anchor item within each of the five competency scales (the first item within each construct). The reference items had a fixed factor loading of 1.0 and their item thresholds were constrained to be equal across all groups (Johnson et al., 2009; Wells, 2021). As a sensitivity analysis, all measurement invariance testing was repeated using a different reference item for each construct (the second item on each construct) to confirm the consistent functioning of the original reference items across groups (Wells, 2021). Given the categorical nature of the indicators, models were fit using the mean- and variance-adjusted weighted least squares (WLSMV) estimator – the recommended estimation method for categorical indicators and non-normally distributed data (Wells, 2021). Theta parameterization was employed to avoid constraining the latent variable means, affording a more accurate assessment of measurement invariance relative to delta parameterization (Wells, 2021).
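For concreteness, the ordinal-indicator measurement model underlying this setup can be written in the standard latent-response form (the notation is ours, not reproduced from the study): for child i, item j, and group g,

```latex
\begin{align*}
  y^{*(g)}_{ij} &= \lambda^{(g)}_{j}\,\eta^{(g)}_{i} + \varepsilon^{(g)}_{ij},
      \qquad \varepsilon^{(g)}_{ij} \sim N\!\bigl(0,\, \theta^{(g)}_{j}\bigr), \\[4pt]
  y^{(g)}_{ij} &=
    \begin{cases}
      0~(\textit{not true}),       & y^{*(g)}_{ij} \le \tau^{(g)}_{j1} \\
      1~(\textit{somewhat true}),  & \tau^{(g)}_{j1} < y^{*(g)}_{ij} \le \tau^{(g)}_{j2} \\
      2~(\textit{certainly true}), & y^{*(g)}_{ij} > \tau^{(g)}_{j2}
    \end{cases}
\end{align*}
```

Here λ is the factor loading, τ1 and τ2 are the item’s two thresholds, and θ is the residual variance of the latent response. Reference items fix λ = 1 with thresholds equated across groups; the metric, scalar, and residual steps described below then constrain λ, τ, and θ, respectively, to equality across groups.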

Measurement invariance was evaluated using a four-step method, sequentially testing: configural (structural), metric (weak factorial), scalar (strong factorial), and residual (strict factorial) invariance (Widaman & Reise, 1997). All four sex-by-language subgroups were compared within one model, with ‘male, English’ serving as the reference group, as this was the largest subgroup in the full sample. Figure 1 illustrates a hypothetical example of non-invariance between the reference group (Male, English) and a comparison group (Female, non-English) at each of the four levels tested in the study (configural, metric, scalar, and residual). This sequential process involved fitting a model in which parameters were free to vary across groups (configural baseline), followed by the successive addition of model constraints so that parameters were made equal to those of the reference group. The fit of these nested models was then compared, and a model was deemed invariant if the model misfit (indexed by the reduction in fit of the constrained model relative to the freer model) fell below threshold. While the difference in the chi-square (χ2) fit statistic is a commonly used criterion for invariance, this statistic appears overly sensitive to small, trivial deviations within large samples and can lead to over-rejection of models demonstrating practical measurement invariance (Chen, 2007; Cheung & Rensvold, 2002; Putnick & Bornstein, 2016). As such, this study utilised the magnitude of change in alternative fit indices (i.e., the comparative fit index [CFI] and root mean square error of approximation [RMSEA]) to assess measurement invariance (Putnick & Bornstein, 2016). A criterion of −0.010 change in CFI for comparing nested models is most widely endorsed, alongside a criterion of 0.015 change in RMSEA values (Chen, 2007; Cheung & Rensvold, 2002; Putnick & Bornstein, 2016). In cases where full invariance was not achieved, partial invariance was sought by progressively freeing parameters within a group according to the indicator with the largest modification index (Byrne et al., 1989).
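At each step, the nested-model comparison reduces to a simple decision rule over fit indices. A sketch follows, with illustrative (not actual) fit values; the study’s models were estimated in Mplus:

```python
# Invariance decision rule for a constrained model versus the preceding,
# freer model (Chen, 2007; Cheung & Rensvold, 2002).
DELTA_CFI_CUTOFF = -0.010   # CFI may not drop by 0.010 or more
DELTA_RMSEA_CUTOFF = 0.015  # RMSEA may not rise by 0.015 or more

def is_invariant(freer: dict, constrained: dict) -> bool:
    """True if adding equality constraints produces only trivial misfit."""
    delta_cfi = constrained["cfi"] - freer["cfi"]
    delta_rmsea = constrained["rmsea"] - freer["rmsea"]
    return delta_cfi > DELTA_CFI_CUTOFF and delta_rmsea < DELTA_RMSEA_CUTOFF

configural = {"cfi": 0.960, "rmsea": 0.045}  # illustrative baseline fit
metric     = {"cfi": 0.958, "rmsea": 0.046}  # after equating factor loadings

print(is_invariant(configural, metric))  # True: the change in fit is trivial
```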

Fig. 1

Hypothetical example of non-invariance between the reference group (Male, English) and a comparison group (Female, non-English)

Note: l1–l5 = factor loadings; t1, t3, t5, t7, t9 = threshold 1 (between not true and somewhat true responses); t2, t4, t6, t8, t10 = threshold 2 (between somewhat true and certainly true responses); r1–r5 = residual errors.

2.4.1 Configural Invariance

Configural invariance determines whether the dimensional structure of the measure is consistent across groups. The same model structure was fit separately for each group and no equality constraints were placed on the model, excepting the reference item within each scale. The criterion for configural invariance was acceptable model fit for all groups (CFI > 0.900, Tucker-Lewis Index [TLI] > 0.900, RMSEA < 0.060) (Bentler & Bonett, 1980; Hu & Bentler, 1998). Figure 1a depicts a hypothetical example of configural non-invariance where the comparison group (Female, non-English) demonstrates a divergent model structure.

2.4.2 Metric Invariance

Metric invariance evaluates whether items load onto (contribute to) their specified factor to a similar degree in all groups. First, an unconstrained baseline model was specified, in which all factor loadings and thresholds were freed across the four groups. Residual variance was fixed to 1.0 for all items, as required when using theta parameterization (Millsap & Yun-Tein, 2004). A constrained metric invariance model was then fit, in which factor loadings were constrained to be equal across groups (equivalent to the reference group’s factor loadings). The fit of this metric invariance model was then compared to the unconstrained baseline model against the ΔRMSEA < 0.015 and ΔCFI > −0.010 thresholds. Figure 1b demonstrates an example of metric non-invariance whereby an item’s factor loading (item 5) differs across groups.

2.4.3 Scalar Invariance

Scalar invariance is a pre-requisite for comparing latent trait values between subpopulations, as it assesses the equivalence of item intercepts/thresholds (Putnick & Bornstein, 2016). With ordinal variables, scalar invariance determines whether groups differ in the probability of endorsing one response option over another at comparable levels of the underlying construct. Item thresholds were constrained to be equal across groups (equivalent to the thresholds in the reference group) and the fit of this model was compared to the previous metric invariance model, with trivial (i.e., below threshold) change in fit evidencing invariance. Figure 1c demonstrates a case of scalar non-invariance, with t10 (the threshold between the somewhat true and certainly true responses on item 5) differing meaningfully between groups.
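In the latent-response notation introduced above (again, notation ours), this has a direct probabilistic reading: a child in group g with latent score η responds at or above category c on item j with probability

```latex
\[
  P\bigl(y^{(g)}_{j} \ge c \,\big|\, \eta\bigr)
    = \Phi\!\left(
        \frac{\lambda^{(g)}_{j}\,\eta \;-\; \tau^{(g)}_{jc}}
             {\sqrt{\theta^{(g)}_{j}}}
      \right),
\]
```

so constraining the thresholds to equality across groups (scalar invariance) ensures that equally competent children are equally likely to choose each response option, whichever group they belong to.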

2.4.4 Residual Invariance

Residual invariance determines the similarity across groups of the sum of specific variance (variance in items that is not shared with the latent construct) and error variance (measurement error), and is a requirement for direct group comparisons using summed scores, which do not account for measurement error (Putnick & Bornstein, 2016). The fit of a model in which residual variances were freed (residual model) was compared to a model in which residual variances were constrained to be equal. This constrained model was identical to the scalar model, given that theta parameterization requires residual variances to be fixed to 1.0 in all preceding models. Figure 1d demonstrates an example of residual non-invariance where, when freed, the residual error of item 5 differs meaningfully between the reference and comparison groups.
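Expressed in the same notation (one plausible reading of the comparison just described), the two models differ only in the residual variances:

```latex
% Constrained (scalar) model: residual variances fixed for every item and group
\[
  \theta^{(g)}_{j} = 1 \qquad \text{for all items } j \text{ and groups } g;
\]
% Freed (residual) model: residual variances free to vary, except for the
% reference items, which stay at 1 for identification
\[
  \theta^{(g)}_{j} \ \text{free}, \qquad
  \theta^{(g)}_{j_{\mathrm{ref}}} = 1 .
\]
```

Equivalent fit between the two models indicates that the freed residual variances do not depart meaningfully from 1, supporting residual invariance.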

3 Results

3.1 Demographics

Supplementary Table S1 displays the demographic characteristics of the total sample of 11- and 12-year-olds with MCS data (n = 26,348) and the two independent subsamples (test and validation; each n = 2,000). Supplementary Table S2 presents the disproportionate probabilities with which children were randomly selected within each of the four sex-by-language subgroups. The multiple one-way ANOVAs performed to determine the representativeness of the selected children in each subgroup revealed no significant differences in mean social-emotional competency scale scores across the two selected samples and the unselected sample (Supplementary Table S3).

3.2 Measurement Invariance

Table 1 displays the measurement invariance results for the test sample. Unconstrained configural models, fit separately for each subgroup, revealed acceptable model fit (CFI and TLI > 0.90 and RMSEA < 0.06) for all subgroups; all but ‘male, non-English’ additionally achieved good fit (CFI and TLI > 0.95 and RMSEA < 0.05).

Table 1 Measurement Invariance in the Test sample (n = 2,000)

Following this confirmation of structural/configural invariance, fixing factor loadings across subgroups, with ‘male, English’ as the reference group, revealed improved CFI (∆CFI = 0.002) and RMSEA (∆RMSEA = −0.002) values, indicating metric invariance.

Following the constraint of item thresholds, scalar invariance was observed, with the scalar model demonstrating minimally poorer RMSEA values (∆RMSEA = 0.002) than the previous (metric) model and ∆CFI below threshold (−0.008).

Comparing the scalar model (in which residual variances were fixed to 1.0 for all items) to a model with freed residual variances for all items bar reference items revealed minimally poorer RMSEA and CFI (∆RMSEA = 0.001, ∆CFI = −0.001), supporting residual invariance across subgroups.

Replication of invariance testing using the validation sample affirmed the full configural, metric, scalar, and residual invariance achieved among the test sample (Table 2).

Table 2 Measurement Invariance in the Validation sample (n = 2,000)

Sensitivity analyses, where the second item of each factor was assigned as the reference item (rather than the first), supported full invariance across all four steps among both the test and validation samples (Supplementary Table S4 and S5).

4 Discussion

Using a four-group MG-CFA method, this study observed consistent structural and item-level functioning of the brief, self-report Middle Childhood Survey – Social-Emotional Learning (MCS-SEL; Carpendale et al., in press) across demographic subgroups defined by sex (male, female) and language background (main language English, main language non-English). Across both the test and validation samples, and in sensitivity analyses using different reference items, testing revealed a consistent correlated five-factor structure (configural invariance), as well as equivalent item-factor loadings (metric invariance), item thresholds (scalar invariance), and residual errors (residual invariance) across subpopulations. No evidence was found to suggest that any of the 20 items in this measure is systematically interpreted differently by subpopulations of children differentiated by sex and language background. As measurement invariance is a pre-requisite to mean comparisons across subpopulations (Pendergast et al., 2017; Putnick & Bornstein, 2016), these findings support the robust interpretation of mean differences in these competencies across subgroups. This study therefore validates the use of this measure, in its current form, as a universal assessment tool for determining the social-emotional functioning of Australian students in late middle childhood.

To date, measurement invariance across demographic indices has been reported for few social-emotional competency measures, limiting their potential for large-scale application and meaningful group comparisons. Despite methodological and contextual differences, our findings align with the equivalent item functioning observed across language background subgroups (differentiated on English language proficiency) in the United States for the WCSD Social-Emotional Competency Assessment (Crowder et al., 2019). However, where Crowder et al. (2019) observed differential item functioning on 6 of 34 items according to gender (i.e., in gender-by-race/ethnicity subgroups), the MCS-SEL demonstrated no non-invariant items. The invariance of the MCS-SEL may be a product of its item selection from established and previously validated psychological scales (e.g., Big Five Questionnaire for Children–Short Form: Bore et al., 2020). As measurement invariance is a function of specific item wording and content, as well as population characteristics, it is recommended that establishing invariance within local contexts be a key consideration in the validation of all social-emotional competency measures.

Drawing data from a large, representative population cohort (Green et al., 2018; Laurens et al., 2017), this study had sufficient power to replicate analyses across two independent samples. In a relatively novel approach, this study explored measurement invariance across interaction subgroups and thus accounted for potentially nuanced patterns in social-emotional functioning. As a result, this study validates mean difference comparisons at both the group (sex, language background) and subgroup (sex-by-language background) levels. Findings endorse the correlated five-factor dimensional structure across all four subgroups and reinforce CASEL’s five competencies as universal skills that should each be supported within whole-school SEL programs and initiatives.

Limitations of this study include the disproportionate subgroup sample sizes, which meant that children whose main language was not English had a higher probability of being represented in the test and validation samples than children whose main language was English, regardless of sex. However, testing confirmed that children in the selected samples were representative of their broader subpopulations. By nature, measurement invariance testing can only report on the equivalence of psychometric functioning for the indicators included within the assessment (McKown, 2019), such that the extent to which this measure represents all of the important group-specific social-emotional competencies cannot be determined. In this study, measurement invariance also could not be determined according to specific languages (other than English), cultures, or ages. To address these issues, future social-emotional measure development may benefit from consultation with children and parents from diverse backgrounds regarding their perspectives on these competencies (McKown, 2019). Within the NSW-CDS cohort, 130 languages were reported as a main language spoken at home. As such, invariance across all languages spoken could not be determined, and selecting specific languages for analysis would not do justice to the diverse, multicultural nature of Australian society. Information on children’s English language proficiency or confidence was not available to classify groups by learner status, and these effects are likely to vary among children whose main language is not English. Similarly, demographic items within the MCS did not assess ethnicity, restricting our capacity to explore measurement functioning across specific ethnic or cultural backgrounds. We therefore recommend that this measure be used cautiously for any inference regarding middle childhood social-emotional competency presentation across different cultural groups, until invariance by culture can be confirmed. This study was also limited to a sample of children in the 6th grade (aged approximately 11 to 12 years); the MCS-SEL has therefore been validated for use only among this late middle childhood age-group. Finally, this research was conducted in an Australian setting and, though generalisability to other multicultural contexts is likely, this assumption should be confirmed prior to use of the measure in other settings.

This study extends evidence of the validity of the 20-item self-report MCS-SEL (Carpendale et al., in press), specifically demonstrating that the measure functions invariantly across sex and language background profiles. Overall, the observed measurement invariance supports the utility of the MCS-SEL as a robust and sensitive classroom-based, population-level assessment of middle childhood social-emotional competencies. As the transition from middle childhood to adolescence is characterised by declines in social and emotional functioning (Eccles, 2004; OECD, 2021), this assessment may help educators and practitioners identify students’ areas of strength and areas requiring further instruction within SEL programming. The measure also provides a psychometrically sound foundation for future exploration of these middle childhood competencies in the context of the linked health, education, and social data available within the NSW-CDS (Green et al., 2018). Both avenues afford the critical research needed to improve present efforts to foster childhood social-emotional functioning within Australia and the broader international context.