Over the last ten years, an increasing number of youth have reported mental health challenges (Twenge et al., 2019), with a significant rise in rates since the onset of the COVID-19 pandemic (Radhakrishnan et al., 2022). Schools are a consistent setting for youth, allowing for early identification of mental health symptoms and, thus, early intervention services. However, the overall number of youth with mental health needs has strained existing school-based supports (Weir, 2020), and many students do not receive needed or timely treatment (Anderson et al., 2019). Without a comprehensive and universal prevention system in place, many youth may have unmet mental health needs, resulting in significant negative short- and long-term outcomes (Perfect et al., 2016). These challenges are particularly pronounced for adolescents with internalizing concerns, who are less likely to be identified for supports than youth with externalizing behavior concerns (Kalberg et al., 2011).

Over 15% of adolescents will display symptoms of internalizing concerns, including withdrawal, depression, and anxiety, which are among the most commonly reported mental health problems (Mojtabai et al., 2016). Many mental health prevalence studies (e.g., Merikangas et al., 2010) utilize data from assessment tools designed to identify psychopathology or presenting symptoms, reflecting a traditional, deficit-focused orientation. The shift to a modern conceptualization of mental health, the Dual-Factor Model of Mental Health (Suldo & Shaffer, 2008), which recognizes both the manifestation of psychopathology and the absence of well-being and prosocial skills, would suggest a substantially larger number of youth with needs. For example, in a sample of at-risk youth, nearly 30% reported low psychopathology and low well-being (i.e., “Vulnerable”; Suldo & Shaffer, 2008) and would be unlikely to be identified with traditional assessment methods.

Given the substantial numbers of students in need, there is an increased focus on prevention and early intervention for internalizing concerns in the setting in which youth have the most consistent access to critical support, schools (Collins et al., 2019). Schools have increasingly adopted multi-tiered systems of support (MTSS), focusing on prevention science and an integrated system of care, resulting in a treatment continuum based on student needs (Evans et al., 2013). For an integrated MTSS to be successful, effective tools and decision-making processes are necessary to identify pre-symptomatology and facilitate early intervention (Lyon et al., 2016). Universal screening, from a dual-factor perspective, offers an essential advancement toward realizing the potential of a preventative system of care (von der Embse et al., 2022).

Multi-Informant Universal Screening

Best practice guidance (see Romer et al., 2020) has emerged to inform how universal screening is implemented and used to inform service delivery. Research-based guidance is essential because the number of schools using screening has increased substantially (Bruhn et al., 2014; Auerbach et al., 2019). For example, recent research has examined the stability of risk to inform how often screening should occur (Dever et al., 2018; Jenkins et al., 2021) and the unique information contributed by teachers and students across grades (von der Embse et al., 2019). In other areas, research is notably lacking. There is a disconnect between best practice guidance for the use of multi-informant assessment (Achenbach et al., 1987) and the reality of rater discrepancies, which have been identified as potentially problematic for effective decision-making (Youngstrom et al., 2000), particularly for internalizing concerns (Baumgartner et al., 2020). As such, research is needed to inform both the theoretical conceptualization of multi-informant decision-making and practical guidance for educators and clinicians alike to improve the identification and treatment of internalizing concerns.

Research is needed to inform three areas of decision-making: (1) who is identified, (2) what type of need is determined, and (3) what type of treatment is necessary (De Los Reyes & Kazdin, 2004). Regarding who is identified, youth behavior is context-dependent, such that a child may display significant internalizing concerns (e.g., withdrawal) in one classroom while demonstrating age-appropriate social-emotional coping in another setting. This is particularly challenging in later adolescence, given increased rates of anxiety and depression, substantial variability in symptomatology, and inconsistent adult access for identifying changes in youth emotional functioning (e.g., students in later grades have multiple teachers; Miller et al., 2014; van der Ende et al., 2012). Given this variability, the need for multiple raters is pronounced when identifying internalizing concerns (Navarro et al., 2020). The increase in the frequency and intensity of internalizing concerns during the middle school years suggests the need to incorporate student self-report, yet rater discrepancies are expected to be even more pronounced because middle and high school teachers have less access to students than do elementary teachers. Best practice guidance indicates the need to incorporate both teacher and student ratings (von der Embse & De Los Reyes, 2022), with the teacher as the preferred rater of student social-emotional and behavioral concerns in kindergarten through fifth grade (elementary), student self-report preferred in high school, and a combination preferred during the middle school years (Romer et al., 2020). However, consensus on how to use discrepant ratings has yet to emerge.

Multiple models have been proposed to reconcile rater differences while acknowledging that, in some instances, rater differences are both expected and desired (De Los Reyes, 2013). These models generally fall within two domains: person-centered and variable-centered approaches. A person-centered approach uses data aggregation techniques such as latent profile analysis (LPA) or latent class analysis (LCA) to identify profiles of respondents (Herman et al., 2018). Research using these techniques has identified unique profiles of rater groups (e.g., student-low risk and teacher-low risk, student-high risk and teacher-low risk) that differentially predict distal outcomes (von der Embse et al., 2021). Makol and colleagues (2020) used a variable-centered, trait-score approach in which informants were modeled with the expectation that ratings would vary by rating context; this method shifts aggregation from multiple items to multiple rater scores. Results indicated superior predictive validity relative to a single composite score (Makol et al., 2020).

Recently, von der Embse and colleagues (2019) applied a trifactor model to multi-informant data. The trifactor model has the benefits of both a variable-centered approach (e.g., score aggregation) and a person-centered approach, allowing variance to be separated across raters, constructs (e.g., internalizing, externalizing concerns), and items (Bauer et al., 2013). Results indicated differences among raters based upon demographic characteristics of the ratee and domain (internalizing; von der Embse et al., 2019). However, there was limited evidence regarding how such differences would influence the applied decision-making necessary to determine who is at risk, in what domain, and what treatment is needed (De Los Reyes & Kazdin, 2004). Future research would be required to confirm the trifactor model structure and incorporate the heterogeneity of latent classes (e.g., differences across ratee and construct).

Kim and von der Embse (2020) extended this work through a trifactor mixture model of multi-informant universal screening. Specifically, the proposed trifactor model was examined across targets and subgroups to determine whether different models would be necessary; for example, the researchers examined whether the informant’s perspective varied for specific populations. Results indicated changes to the model (and to class membership) when predictors of the latent factors were incorporated. Thus, the trifactor mixture model could be useful for multi-informant decision-making by (1) delineating dis/agreement on latent factors, including common and informant-specific perspectives, (2) determining whether incongruence between informants reflects a fundamental difference in how the raters interpret and respond to items, and (3) identifying predictors of dis/concordance. However, additional research is necessary to extend this work by determining how the degree of difference in ratings (i.e., concordance) may be important for identifying risk. In other words, the degree of difference between student and teacher ratings may indicate either measurement problems or unique perspectives that would inform identification and treatment decisions. Determining predictors and interpretation of dis/concordance is particularly important within the internalizing domain, as rater differences are often more pronounced (Miller et al., 2014; van der Ende et al., 2012).

Purpose of the Present Study

Increasing mental health concerns have heightened the need for consistent guidance on aggregating potentially disparate data for internalizing problems. Previous studies have focused on inconsistencies among classes of raters (Herman et al., 2018). However, given extant research demonstrating discrepancies in ratings of internalizing behaviors among teachers, parents, and students (De Los Reyes et al., 2015), it is imperative to establish the degree of concordance between raters to inform intervention and data-based decision making. The present study addressed the following research questions: (1) What is the agreement between student and teacher raters on internalizing concerns (i.e., emotional behaviors)? and (2) Does agreement on internalizing behaviors change across ages/grades, and does this differentially impact distal outcomes?

Method

Sample

Researchers partnered with FastBridge Learning (FBL; www.fastbridge.org), a national assessment company, to extract a de-identified national sample. Schools with at least 80% of students rated on the Social, Academic, and Emotional Behavior Risk Screener (SAEBRS) were identified for inclusion, such that the tool was used schoolwide rather than with limited or targeted populations. All data were deemed exempt by the University Institutional Review Board. The final dataset comprised 24,094 student responses, of which slightly less than half (47.0%; n = 11,332) were female. The sample included students across all grade levels from kindergarten through twelfth grade, with an average student age of 11.07 years (SD = 2.61). The plurality of students were White (40.3%; n = 9,714), followed by African American (16.7%; n = 4,026) and Hispanic/Latino (6.9%; n = 1,657) students; the remaining students did not report race and/or ethnicity (30.5%; n = 7,335). A small proportion of students (4.6%; n = 1,116) received special education services relative to the national average (15%; National Center for Education Statistics, 2020). The dataset did not include identifying information on state or location; thus, the national representativeness of the sample is unknown.

Measures

Social, Academic, and Emotional Behavior Risk Screener (SAEBRS)

The SAEBRS (Kilgus & von der Embse, 2014), used in the present study, is a universal screening tool that measures risk for social, academic, and emotional behaviors from a Dual-Factor Model of Mental Health; items include indicators of problem behaviors as well as prosocial skills. The SAEBRS is a 19-item teacher-completed rating scale, and the mySAEBRS is a 20-item student-completed rating scale (von der Embse et al., 2017); both use a 4-point Likert-type rating. The SAEBRS and mySAEBRS include three subscales, Social (6 items on the SAEBRS, 7 items on the mySAEBRS), Academic (6 items), and Emotional (7 items), as well as a total behavior scale. Items reflect both positive skills and problem behaviors (reverse coded), such that higher scores are indicative of better functioning. The Emotional behavior subscale (item key words listed in Table 1) includes items that measure internalizing behavior problems as well as social-emotional skills. The Academic behavior subscale includes items that reflect academic enabler skills (DiPerna, 2006) as well as behaviors that limit the benefit of a high-quality instructional environment. The Social behavior subscale includes items that assess social skills as well as externalizing behavior problems. Nearly twenty published studies support the psychometric properties of the SAEBRS (e.g., Kilgus et al., 2015; Kilgus et al., 2016; von der Embse et al., 2016). The Emotional behavior (EB) subscale was of primary interest in the present investigation, given the variability in ratings of internalizing concerns across raters. In the present sample, Cronbach’s alpha and composite reliability (omega) for the EB subscale were 0.61 and 0.63 for student ratings and 0.83 and 0.84 for teacher ratings, respectively.
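As a concrete illustration of the reliability estimates reported above, the sketch below computes Cronbach’s alpha and a single-factor composite reliability (omega) for the teacher EB items in R. The data frame screen_dat, the item names t1–t7, and the use of the psych and lavaan packages are hypothetical placeholders and assumptions; this approximates the reported procedure rather than reproducing the authors’ exact code.

```r
library(lavaan)
library(psych)

# Hypothetical teacher-report EB items t1-t7 in the data frame screen_dat
teacher_eb <- screen_dat[, paste0("t", 1:7)]

# Cronbach's alpha for the teacher EB subscale
psych::alpha(teacher_eb)

# Composite reliability (omega) from a single-factor CFA of the same items
fit_eb_t <- cfa('EB_T =~ t1 + t2 + t3 + t4 + t5 + t6 + t7',
                data = screen_dat, std.lv = TRUE)
load <- standardizedSolution(fit_eb_t)
load <- load$est.std[load$op == "=~"]
sum(load)^2 / (sum(load)^2 + sum(1 - load^2))  # omega (assumes uncorrelated errors)
```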

Table 1 Descriptive Statistics

Reading and Math

Previous research suggests that poor emotional and behavioral functioning can result in poor academic outcomes (Eklund et al., 2017). Adaptive math (aMath) and adaptive reading (aReading) scores were used to measure student math and reading performance. The aMath and aReading measures are computer-adaptive assessments based on item response theory (IRT). The aMath assessment is a test of broad math proficiency for students in kindergarten through 12th grade; the difficulty of items and the proportion of items in each category are adjusted based on student performance throughout the assessment. The FastBridge Learning technical manual reports good reliability (0.96), validity (0.66-0.84), sensitivity (0.84-0.95), and specificity (0.80-1.00) for students in first through fifth grade (Christ et al., 2014). The aReading assessment is a test of broad reading proficiency for students in kindergarten through 12th grade; items assess phonological and phonemic awareness, vocabulary, and reading comprehension. The FastBridge Learning technical manual reports strong internal consistency (0.95), strong criterion-related validity (0.71-0.76), and specificity (0.71-0.93) (Christ et al., 2014).

Data Analysis

Data analysis proceeded in two phases. First, a trifactor model was used to examine the degree of agreement between students and teachers in emotional behavior ratings and the extent to which agreement and disagreement predicted the outcomes (reading and math achievement). The degree of agreement was quantified with the explained common variance (ECV) estimated from the trifactor model. To address the second research aim, a multigroup trifactor analysis was employed: findings from prior measurement invariance studies were first confirmed by replicating the trifactor model across grades, and, based on those results, the predictability of rater agreement on distal outcomes across grades was evaluated. Lastly, polynomial regression and response surface analysis were used as supplementary analyses to assess the relationship between agreement and the distal outcomes.

Trifactor Model

The trifactor model was proposed specifically for multiple-informant data to decompose variance into common variance shared between informants, unique variance specific to each informant, and item-specific variance (Bauer et al., 2013; von der Embse et al., 2019). The trifactor model is a confirmatory factor model in which three types of factors are specified with the constraints described below; the model specification in this study was based on prior research (von der Embse et al., 2019). The trifactor model for emotional behaviors is illustrated in Fig. 1. Unidimensionality of the observed variables was assumed. All emotional behavior variables from students and teachers loaded on the common factor. Because the common factor (emotional behavior, or EB) represents agreement between informants, factor loadings were constrained to be equal between informants for each variable (e.g., the factor loading of item 1 for students equaled the factor loading of item 1 for teachers). Two unique factors were specified to represent student-unique and teacher-unique perspectives on student emotional behaviors; because these perspectives were assumed to differ, the student factor loadings were allowed to differ from the teacher factor loadings. Finally, although each item was intended to measure the construct of emotional behaviors, there could be item-specific effects not explained by the common and unique factors; these were modeled as item-specific factors, with loadings constrained equal within each matched pair of items (e.g., item 1 for students and item 1 for teachers). Item-specific factor loadings were expected to be small and close to zero if the measure were well developed. Finally, measurement errors were estimated as usual in confirmatory factor analysis.
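As an illustration of this specification, a minimal sketch in R using the lavaan package is shown below (the original analyses were estimated in Mplus 7.4). The item names s1–s7 and t1–t7 and the data frame screen_dat are hypothetical placeholders; the syntax approximates the constraints described above rather than reproducing the authors’ exact model.

```r
library(lavaan)

trifactor <- '
  # Common (agreement) factor: loadings constrained equal across informants
  # for matched items via shared labels a1-a7
  EB  =~ a1*s1 + a2*s2 + a3*s3 + a4*s4 + a5*s5 + a6*s6 + a7*s7 +
         a1*t1 + a2*t2 + a3*t3 + a4*t4 + a5*t5 + a6*t6 + a7*t7

  # Informant unique perspective factors: loadings estimated freely
  Stu =~ s1 + s2 + s3 + s4 + s5 + s6 + s7
  Tea =~ t1 + t2 + t3 + t4 + t5 + t6 + t7

  # Item-specific factors: loadings constrained equal within each item pair
  I1 =~ c1*s1 + c1*t1
  I2 =~ c2*s2 + c2*t2
  I3 =~ c3*s3 + c3*t3
  I4 =~ c4*s4 + c4*t4
  I5 =~ c5*s5 + c5*t5
  I6 =~ c6*s6 + c6*t6
  I7 =~ c7*s7 + c7*t7
'

# orthogonal = TRUE fixes all factor covariances to zero; std.lv = TRUE fixes
# factor variances to one so that all loadings can be estimated
fit <- cfa(trifactor, data = screen_dat, std.lv = TRUE, orthogonal = TRUE,
           estimator = "MLR", missing = "fiml")
summary(fit, fit.measures = TRUE, standardized = TRUE)
```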

Fig. 1
figure 1

Trifactor model of Emotional Behaviors (EB) with Parameter Estimates. (Note. S and T represent students and teachers, respectively. Standardized factor loadings are presented. The factor loadings of each item were constrained equal between students and teachers for the common factor (Emotional Behaviors) and the item specific factors. However, after standardization, those constrained parameter estimates became different between student and teacher items. The dashed lines represent the factor loadings of student and teacher unique perspective factors. The residuals of each observed variable and the mean structure are not shown for simplicity)

Indices to Evaluate Agreement and Disagreement

In the trifactor model, all three types of factors (common, informant-unique, and item-specific) are orthogonal. This feature allows researchers to decompose the total variance of the observed variables (student and teacher scores in tandem) into three uncorrelated sources and evaluate each factor’s unique contribution. Based on this feature, several indices can be used to assess the degree of agreement and disagreement between informants. We adopted statistical indices derived from bifactor models (Reise et al., 2012; Rodriguez et al., 2016). For example, the Explained Common Variance (ECV) is used to estimate the degree of unidimensionality (the degree to which scores represent a single factor rather than multiple factors) in the bifactor model (Reise et al., 2010, 2013; Ten Berge & Sočan, 2004). In this study, ECV is used to evaluate the degree of agreement between informants (with common factor variance interpreted as agreement variance):

$$ECV=\frac{\sum {\lambda }_{iEB}^{2}}{\sum {\lambda }_{iEB}^{2}+\sum {\lambda }_{iStu}^{2}+\sum {\lambda }_{iTea}^{2}}$$

where \({\lambda }_{i}\) denotes the standardized factor loading of item i on the corresponding factor (EB for the common Emotional Behaviors factor, Stu for the student unique perspective factor, and Tea for the teacher unique perspective factor). Of note, ECV is a model-based index (Reise et al., 2013; Rodriguez et al., 2016), as it utilizes the standardized factor loadings estimated from the bifactor or trifactor model; misspecification of the trifactor model will therefore bias the computed index (Rodriguez et al., 2016). In addition, ECV is interpretable only when measurement invariance between informants holds for the common factor, such that the common factor is meaningful as agreement. When researchers intend to use composite scores of combined teacher and student ratings, the viability of such composites can be assessed with indices such as omega hierarchical (\({\omega }_{H}\); McDonald, 1999) along with omega for the unique perspective factors (Reise et al., 2013).
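As an illustration, ECV (and the corresponding shares for each informant unique factor) can be computed directly from the standardized loadings of a fitted trifactor model. The sketch below assumes the hypothetical lavaan object fit and the factor names EB, Stu, and Tea from the earlier illustration.

```r
# Pull standardized factor loadings from the fitted trifactor model
std <- standardizedSolution(fit)
loads <- std[std$op == "=~", ]

sum_sq <- function(f) sum(loads$est.std[loads$lhs == f]^2)

denom <- sum_sq("EB") + sum_sq("Stu") + sum_sq("Tea")
c(ECV            = sum_sq("EB")  / denom,   # agreement (common factor) share
  student_unique = sum_sq("Stu") / denom,
  teacher_unique = sum_sq("Tea") / denom)
```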

Trifactor Model with Outcomes

After the degree of dis/agreement was estimated, an outcome variable (reading or math scaled score) was added to the trifactor model. Specifically, the common factor representing agreement and the two informant-unique factors representing disagreement served as predictors of reading and math achievement scaled scores; each outcome was tested separately. The regression coefficients of the common factor and the informant-unique factors indicate how strongly emotional behaviors relate to reading (or math) achievement when informants agree about EB and when informants disagree and contribute their unique perspectives, respectively. The predictability of EB under agreement and disagreement was also evaluated with r2. Because the three predictors are orthogonal, relative predictability can be evaluated by comparing r2 values.
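A sketch of this extension, again using lavaan and the hypothetical names from above (with aReading as the reading scaled score), could look as follows; each outcome would be added in a separate model run.

```r
outcome_model <- paste(trifactor, '
  # Reading achievement regressed on the orthogonal agreement (EB) and
  # informant unique (Stu, Tea) factors
  aReading ~ EB + Stu + Tea
')

fit_read <- sem(outcome_model, data = screen_dat, std.lv = TRUE,
                orthogonal = TRUE, estimator = "MLR", missing = "fiml")

standardizedSolution(fit_read)   # standardized regression coefficients
lavInspect(fit_read, "rsquare")  # r-squared values, including the outcome
```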

Multigroup Trifactor Model with Outcomes

To address the second research question, the potential variability of dis/agreement on EB across grades and the resulting predictability of the outcome variables by grade, a multigroup trifactor analysis was conducted with grade as the grouping variable. Equality of factor loadings across grades would indicate no difference in EB agreement across grades. Because measurement invariance across grades was tested and established in previous research (von der Embse et al., 2019), measurement invariance is presented only briefly in the results section. The multigroup trifactor model was then extended with reading or math as an outcome of the three factors under measurement invariance across grades, and the effects (regression coefficients) of the common factor and the two informant factors on reading or math achievement were estimated for each grade. To determine whether these effects differed across grades, a model with the effects constrained equal (assuming no difference) was compared with a model in which the effects were freely estimated across grades. To this end, a likelihood ratio test was conducted, in which statistical significance indicates differences across groups; information criteria (AIC, BIC, and sample-size adjusted BIC), with smaller values favoring a model, and changes in model fit between models were also compared (see below for the model fit criteria). Grades 10, 11, and 12 were collapsed, given the relatively small samples in those grades.
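Under the same assumptions, the grade-level comparison could be sketched in lavaan as below, with grade as a hypothetical grouping variable. The shared loading labels in the model syntax are recycled across groups, which is consistent with the metric invariance imposed here; the constrained and relaxed models differ only in whether the structural regressions are held equal across grades.

```r
# Regression coefficients constrained equal across grades
fit_eq <- sem(outcome_model, data = screen_dat, group = "grade",
              std.lv = TRUE, orthogonal = TRUE, estimator = "MLR",
              missing = "fiml", group.equal = c("loadings", "regressions"))

# Regression coefficients freely estimated across grades
fit_free <- sem(outcome_model, data = screen_dat, group = "grade",
                std.lv = TRUE, orthogonal = TRUE, estimator = "MLR",
                missing = "fiml", group.equal = "loadings")

# Scaled (Satorra-Bentler type) likelihood ratio test and information criteria
anova(fit_eq, fit_free)
sapply(list(constrained = fit_eq, relaxed = fit_free), fitMeasures,
       fit.measures = c("aic", "bic", "bic2", "cfi", "rmsea", "srmr"))
```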

Commonly used criteria were applied to evaluate model fit: chi-square goodness-of-fit p-value ≥ 0.05, comparative fit index (CFI) ≥ 0.90, root mean square error of approximation (RMSEA) ≤ 0.08, and standardized root mean square residual (SRMR) ≤ 0.08 (Hu & Bentler, 1999). Maximum likelihood estimation with robust standard errors (MLR) was used for model estimation. Mplus version 7.4 was used for data analysis, and R 4.0.3 and SPSS version 26 were used for data management, missing data analysis, and descriptive statistics. The response surface plot was generated in Excel using a template downloaded from the Jeffrey Edwards website (see Footnote 1).

Supplementary Analysis: Polynomial Regression With Response Surface Plot

Although the trifactor model is advantageous for multiple-informant data analysis, because common and informant-unique factors are separated as orthogonal sources of score variance, it is difficult to evaluate how the degree of agreement (agreeing more or less) predicts the outcome. Polynomial regression (e.g., Edwards & Cable, 2009) was therefore used to model the degree of agreement and disagreement as predictors of the outcome. The polynomial regression model included teacher EB scores, student EB scores, their squared terms, and their interaction term as predictors of reading (and math) achievement, as shown below.

$$\widehat{read}={\beta }_{0}+{\beta }_{1}{EB}_{S}+{\beta }_{2}{EB}_{T}+{\beta }_{3}{EB}_{S}^{2}+{\beta }_{4}{EB}_{S}{EB}_{T}+{\beta }_{5}{EB}_{T}^{2}$$

where the subscripts S and T represent students and teachers, respectively, \({\beta }_{0}\) is the intercept, and \({\beta }_{1}\) through \({\beta }_{5}\) are regression coefficients. The response surface plot based on the polynomial regression model allows researchers to visually inspect the relations of agreement and disagreement with the outcome. For example, if the outcome increases as informants agree more and is maximized when the informants completely agree, it is expected that \(\left({\beta }_{3}-{\beta }_{4}+{\beta }_{5}\right)<0\) and that the intercept and slope of the first principal axis of the response surface are 0 and 1, respectively. For details and demonstrations of polynomial regression and response surface analysis, see Edwards and Cable (2009). The polynomial regression was estimated with maximum likelihood, and confidence intervals were created with bootstrapping using 5,000 random draws.
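A sketch of the supplementary analysis in R is shown below, assuming a complete-case data frame cc with composite scores EB_S (student), EB_T (teacher), and the outcome aReading; the bootstrap illustrates an interval for the disagreement-line curvature (β3 − β4 + β5) only. These names, and the use of base lm() with the boot package, are assumptions rather than the authors’ exact implementation.

```r
library(boot)

poly_fit <- lm(aReading ~ EB_S + EB_T + I(EB_S^2) + I(EB_S * EB_T) + I(EB_T^2),
               data = cc)
coef(poly_fit)                                   # beta_0 through beta_5

# Curvature along the line of disagreement (EB_S = -EB_T): beta3 - beta4 + beta5;
# a negative value would indicate the outcome is maximized when informants agree
curvature <- function(data, idx) {
  b <- coef(update(poly_fit, data = data[idx, ]))
  unname(b["I(EB_S^2)"] - b["I(EB_S * EB_T)"] + b["I(EB_T^2)"])
}

set.seed(2024)
boot_out <- boot(cc, curvature, R = 5000)
boot.ci(boot_out, type = "perc")                 # bootstrap percentile CI
```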

Missing Data

It should be noted that there were missing data in the teacher ratings of student emotional behaviors, reading scores, and math scores (missing rates = 40%, 39%, and 31%, respectively). Although the assumption of missing completely at random (MCAR) was rejected based on Little’s MCAR test, χ2(174) = 2121.30, p < .001, the missingness in each variable was only weakly correlated with the other study variables (r near zero in most cases, with a maximum of 0.14). Full information maximum likelihood (FIML) estimation was therefore used under the assumption of ignorable missingness. However, the supplementary polynomial regression analysis was conducted with a subsample of complete cases (n = 8,843 for reading) because FIML is not applied to exogenous observed variables in Mplus.
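A brief sketch of these missing-data checks, again with hypothetical variable names, is shown below; Little’s MCAR test is available in the naniar package, and FIML is requested in lavaan via missing = "fiml" (the authors’ models were estimated in Mplus).

```r
library(naniar)

# Little's MCAR test on the analysis variables (hypothetical column names)
mcar_test(screen_dat[, c(paste0("s", 1:7), paste0("t", 1:7),
                         "aReading", "aMath")])

# Under ignorable missingness, FIML is applied when fitting the lavaan models,
# e.g., sem(..., missing = "fiml", estimator = "MLR") as in the sketches above
```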

Results

Descriptive Statistics

Descriptive statistics and correlations between variables are presented in Tables 1, 2 and 3. There was no discernible deviation from normality across variables based on skewness and kurtosis. Correlations among the teacher rating variables were generally higher (0.27 ~ 0.63) than those among the student rating variables (-0.06 ~ 0.44). Correlations of teacher variables with reading (0.12 ~ 0.29) and math (0.11 ~ 0.29) were also higher than those of student variables (0.02 ~ 0.16 with reading; 0.02 ~ 0.12 with math). The correlations of teacher EB composite scores with reading and math were 0.22 and 0.23, respectively; those of student EB composite scores were 0.11 and 0.08, respectively.

Table 2 Descriptive Statistics of Emotional Behaviors Mean Composite Scores by Grade
Table 3 Correlations of Study Variables

The Trifactor Model and the Degree of Agreement

The fit of the trifactor model was acceptable overall: χ2(65) = 2765.221, p < .001; RMSEA = 0.041, 90% CI [0.040, 0.043]; CFI = 0.945; SRMR = 0.051. The standardized factor loadings are presented in Fig. 1. The computed ECV, the degree of agreement, was very low (0.12), indicating that students and teachers have limited agreement on internalizing behaviors. The EB common factor explained only 12% of the modeled common variance, whereas the remaining variance was explained by the student unique factor (32%) and the teacher unique factor (57%). The proportion of variance due to the common factor was very small compared with the informant unique factors. The EB rating scores of students and teachers thus mostly reflect each informant’s own perspective on EB, and either rating alone may not be sufficient for a complete understanding of student internalizing concerns. If composite scores combining student and teacher ratings were used for emotional behaviors, those composite scores would hardly reflect the common factor (\({\omega }_{H}\) = 0.06 for the common factor).

The Predictability of Agreement and Disagreement on Reading and Math

To assess the predictability of dis/agreement on student reading (or math) scores, the trifactor model was fit with reading as an outcome; that is, the effect of the common factor on reading was estimated for the predictability of agreement, and the effects of the teacher and student unique factors on reading were estimated for the predictability of disagreement. The trifactor model showed reasonable fit with reading as an outcome: χ2(76) = 3146.128, p < .001; RMSEA = 0.041, 90% CI [0.040, 0.042]; CFI = 0.940; SRMR = 0.052; and with math: χ2(76) = 3146.128, p < .001; RMSEA = 0.041, 90% CI [0.040, 0.042]; CFI = 0.940; SRMR = 0.052. The standardized regression coefficients are shown in Table 4.

Table 4 Standardized Regression Coefficients

All three factors were positively associated with reading and math. When students and teachers agreed, higher (i.e., more positive) emotional behaviors were related to higher reading and math performance. Students rated higher on emotional behaviors from the student or teacher unique perspectives also showed higher reading performance. However, the predictability of the student unique factor was lower than that of the common and teacher unique factors. Overall, emotional behaviors explained 11% of the variance in reading and math scores, with the common and teacher factors each explaining 5% and the student factor explaining 1% or less.

The Predictability of Agreement and Disagreement on Reading by Grade

First, the degrees of dis/agreement were evaluated for equality across grades by testing the equality of factor loadings across grades (i.e., metric invariance) in the multigroup trifactor model. As demonstrated in von der Embse et al. (2019), metric invariance was satisfied. The fit of the metric invariance model was acceptable: χ2(1144) = 5403.656, p < .001; RMSEA = 0.045, 90% CI [0.044, 0.046]; CFI = 0.916; SRMR = 0.070. Fit also did not deteriorate much from the configural invariance model, which further supported metric invariance: Δχ2(Δdf = 264) = 738.136, p < .001; ΔRMSEA = -0.003 (indicating the metric invariance model fit even better); ΔCFI = 0.009; ΔSRMR = 0.008. Thus, the degrees of dis/agreement on emotional behaviors did not differ across grades. Because von der Embse et al. (2019) found that one item intercept of EB was not invariant across grades, that item intercept was allowed to vary across grades (partial scalar invariance model: χ2(1237) = 5911.384, p < .001; RMSEA = 0.045, 90% CI [0.044, 0.046]; CFI = 0.908; SRMR = 0.071), and we proceeded to build the multigroup trifactor models with reading and math as outcomes, respectively. The predictability of dis/agreement on the outcome was compared across grades by comparing a model with the regression coefficients constrained equal with a model in which those coefficients were freely estimated. The results of these two models are shown in Table 5.

Table 5 Predictability of Agreement on Two Outcomes

For both reading and math, there was no strong evidence for the relaxed model in which the regression coefficients were allowed to differ across grade levels. Although the Satorra-Bentler scaled chi-square difference was statistically significant (p < .001) and the AIC favored the relaxed model, the other indices favored the constrained model in which all regression coefficients were constrained equal across grades. The AIC is known for its sensitivity to trivial differences when the sample size is sufficiently large (Akaike, 1974); thus, the constrained model was selected, giving greater weight to the BIC and saBIC. In addition, even though 27 parameters were constrained equal across groups, the deterioration of model fit was negligible (no or near-zero changes in CFI, RMSEA, and SRMR). Taken together, the predictability of dis/agreement on reading and math did not differ across grades. In the constrained model, the regression coefficients of the common, student unique, and teacher unique factors were 0.23, 0.11, and 0.21, respectively, for reading, and 0.24, 0.06, and 0.32, respectively, for math across all grades, confirming the findings of the single-group trifactor model shown in Table 4.

Supplementary Analysis Using Polynomial Regression

With the complete-case subsample, polynomial regression was conducted with reading (or math) as the outcome and the polynomial terms as predictors (student and teacher EB scores, their quadratic terms, and their interaction) to assess how the degree of agreement or disagreement related to reading (or math) performance. Because results were similar for reading and math, only the reading results are presented. Based on the estimated polynomial regression coefficients presented in Table 6, a response surface plot was created (Fig. 2), and results are interpreted from this plot. First, as illustrated by the nearly flat surface, student reading did not improve as teachers and students agreed more about EB: β3 − β4 + β5 = 0.01, 95% CI [-0.02, 0.03], which was not statistically significant; slope = 5.14, 95% CI [2.86, 23.70]. Although the trifactor model showed that students and teachers mostly disagreed about EB (very low ECV), when they did agree, EB linearly predicted reading performance, as illustrated along the line of agreement (EB S = EB T) in the upper right of Fig. 2: as EB increased, reading performance increased (β1 + β2 = 0.37, 95% CI [0.34, 0.40]). When they disagreed (EB S = -EB T), reading performance tended to decrease curvilinearly as students rated themselves higher on EB, and to increase curvilinearly as teachers rated students higher (see the lower right of Fig. 2). The R-squared of the polynomial regression model was 0.08.

Table 6 Polynomial Regression Coefficients for Reading
Fig. 2
figure 2

Response Surface Plot of Emotional Behaviors (EB) and Student Reading Achievement. (Note. EB T = teacher-rated student emotional behaviors; EB S = student self-rated emotional behaviors)

Discussion

The first purpose of this paper was to examine the degree of agreement between informants regarding ratings of internalizing concerns. The ECV from the trifactor model indicated low agreement between student and teacher informants, as reflected in the small proportion of variance attributable to the common factor (which represented informant agreement in the trifactor model). A more substantial proportion of variance was attributable to the informant unique factors (which represented informant disagreement in the model), suggesting ratings of student internalizing concerns are largely driven by informant perception.

Results from this investigation are not conclusive with respect to clinical utility. It could be that both teacher and student ratings are necessary to achieve a complete understanding of a student’s internalizing symptoms. Students offer information regarding their internal states, as well as their perception of the external manifestations of their internalizing concerns. Meanwhile, the teacher may provide a more accurate perception of the student’s external manifestations, grounded in more objective, third-party observation of the student’s behavior. In contrast, it could be that one informant (e.g., the teacher) affords unique but uninformative data regarding student internalizing concerns. Indeed, experience, stress, preconceived notions, and implicit biases influence teacher ratings of student social-emotional and behavioral needs, calling into question which rater offers more reliable information (Splett et al., 2018). Unfortunately, the current study does not provide a means of discerning whether unique information is also accurate or informative. In addition, some internalizing behaviors (e.g., anxiety) could be positively associated with academic performance. To this end, future research should employ external, objective indicators of student internalizing concerns to determine the extent to which each informant’s perspective contributes to an understanding of those concerns. Of course, the field remains somewhat divided on what such objective indicators should be, as most measures of internalizing concerns are grounded in rater perception, while observation-based tools can miss crucial elements of emotional functioning. This underscores the need for basic research such as the present study, along with measurement work that informs the creation of accurate and objective tools.

It should also be noted that the measurement invariance findings suggested the pattern of informant dis/agreement was equal across grades. This result is compelling given documented developmental differences in internalizing concerns. Young children are more likely to externalize their emotionality (Leadbeater & Homel, 2015), potentially allowing teachers to better perceive child emotions, thereby promoting greater informant agreement. This agreement might lessen over time as emotions move more inward for older children and adolescents, making them less observable to teachers (Zee & Moritz Rudasill, 2021). Yet, the current findings would suggest that the pattern of dis/agreement remains constant, suggesting students and teachers have unique information to provide across all grades, and thus that both informants should be considered when evaluating internalizing concerns.

The second purpose of this study was to examine whether informant concordance changed across grades and whether this differentially impacted distal outcomes. There were two particularly notable findings in relation to this purpose. First, the measurement invariance findings suggested the pattern of informant dis/agreement was equal across grades. This finding is significant given developmental differences in internalizing concerns. As noted above, young children are more likely than their older peers to externalize their emotionality, potentially allowing teachers to better perceive their internalizing concerns, thus promoting informant agreement. One might then expect the agreement to lessen over time as student emotions move more inward for older children and adolescents. Yet, the current findings would suggest that the pattern of dis/agreement remains constant, suggesting students and teachers have unique information to provide at all ages. Second, multigroup trifactor analyses indicated that the common factor was positively associated with reading and math scores, showing that agreement between teachers and students predicted student academic performance. This was further supported by the polynomial regression, which demonstrated that when teachers and students agreed, emotional behavior positively predicted reading and math performance. Yet, the rater-specific factors showed a slightly different pattern. Though the teacher unique factor was related to math and reading scores (explaining 5% of the variance in these scores), the student unique factor was weakly related (explaining only 1% of the variance in math and reading scores). These findings could suggest a teacher’s unique perceptions of student emotional functioning are in part related to their academic performance. This would align with previous research, which has demonstrated that teachers perceive student frustration and anxiety related to academic tasks (e.g., high-stakes tests; Lyneham et al., 2008), as well as enjoyment and motivation related to academics (Dicke et al., 2012).

Taken together, findings suggested the relationship between emotional behavior and academic outcomes was consistent across grades, which may hold implications for the nature of emotionality and its relationship with academic functioning. That is, student emotions, whether perceptions of them are shared or unique to each informant, do not become more or less tied to academics across grades. Of course, there could be exceptions to this pattern, with higher-achieving students experiencing greater academic-related emotions over time in accordance with increased academic demands (Suldo & Shaunessy-Dedrick, 2013). Results from the present study offer several potentially important benefits for research and practice. First, given the consistency in agreement across grades, students at all grade levels, including young children (kindergarten through third grade), may offer informative ratings of emotional (or internalizing) behaviors. Second, teacher ratings demonstrated a relatively stronger relation to distal academic outcomes compared with student self-ratings. In contrast to prior suggestions that students should be the preferred informant in later (high school) grades (Romer et al., 2020), results from this study suggest that even middle and high school teachers, despite relatively limited access to students, can offer potentially important information about student emotional functioning. Thus, when school teams consider how to implement universal screening across grades and settings, both student and teacher raters could offer important information for decision-making.

Limitations and Future Directions

Certain limitations to this study should be noted, as they present opportunities for future research. First, though the national dataset under examination was quite large, it contained sparse student demographic data. The current findings may therefore be broadly generalizable, but it is largely unknown to whom and where they generalize; future research should consider datasets with more complete demographic information to better establish generalizability. Second, the examined outcomes were limited to academic functioning. Future research should consider a broader range of outcomes, including other ecologically valid school-based variables of frequent interest to educators (e.g., office discipline referrals, attendance, nurse visits). Third, we employed a single indicator of internalizing concerns founded upon seven items completed by students and teachers. Future research should consider longer and more varied indicators of emotional functioning, while also incorporating caregiver perspectives. Fourth, the current dataset was cross-sectional, collected at a single time point. Subsequent studies should examine longitudinal datasets, which would permit examination of change in patterns of dis/agreement over time, as well as whether changes in emotional functioning and dis/agreement predict academic outcomes.

Conclusion

This study highlighted the expected, relatively low levels of agreement between teacher and student reports of internalizing concerns. However, when raters did agree, more positive emotional behavior ratings were associated with higher reading and math performance, and these relationships were consistent across grade levels. These results highlight the promise of multi-informant decision making to inform intervention supports and, ultimately, improve student outcomes.