Introduction

Social and emotional learning (SEL) seeks to promote the social-emotional competencies (SEC) essential to healthy relationships, education and job success, and engaged citizenry (Greenberg & Weissberg, 2018). SEL is often delivered universally (i.e., to all children) within a school or out-of-school-time setting through free-standing, skill-focused lessons. In a multi-tiered system of support, SEL often functions as a Tier I intervention, intended to prevent problem behaviors and encourage positive behaviors in all students (Collins et al., 2016; Kilgus et al., 2015). This model of SEL programming has a considerable evidence base (e.g., Corcoran et al., 2018; Durlak et al., 2011; Sklad et al., 2012; Taylor et al., 2017), which has been leveraged to call for the wide-scale adoption of programs, such as those featured in successive program guides (2005, 2013, 2015) created by the Collaborative for Academic, Social, and Emotional Learning (CASEL). Significant challenges have been noted, however, in the uptake of free-standing, skills-focused SEL programs (Fagan et al., 2015; Jones & Bouffard, 2012). Primary concerns are with their “fit” into the school day as “one more thing” to do (Weston et al., 2018), the diversity of students with whom effective programs have been tested (Rowe & Trickett, 2018), and whether lessons generalize beyond their specific instructional context, especially in the absence of positive school climates, relationships, and other scaffolds and supports (Darling-Hammond et al., 2020). Recently, SEL approaches have begun evolving to include (a) the integration of SEL into academic curricula, (b) fostering teachers’ own wellbeing and capacities to promote SEC as part of their instruction, and (c) whole-school strategies (Domitrovich et al., 2017; Dusenbury et al., 2015). This paper considers the effectiveness of a popular SEL program and explores student growth across two different implementation directives, which differentially emphasize lesson-based delivery and integration strategies.

TOOLBOX (Collin, 2015) is a universal, school-based SEL program that aims to promote elementary school children’s SEC through the instruction and reinforcement of 12 tools (e.g., Breathing Tool, Listening Tool, Courage Tool). The developer employed the metaphor of tools to provide students with rhetorical devices that call upon their intrinsic capacities to achieve emotional, social, and academic well-being and resilience. While TOOLBOX is taught formally as a curriculum, the essence of TOOLBOX is a common language and set of practices used across contexts throughout the school day. In TOOLBOX, adults are asked to “go first” and internalize the 12 tools for their own wellbeing. Then, teachers are trained to use an inquiry-based approach to support student discovery and decision-making. For example, teachers are trained to ask children “what three tools might you try?” during a classroom conversation or “what tools did you try?” when trying to unpack a playground conflict. Each child constructs a manila toolbox to keep at their workstation that can be personalized and referenced. Each tool has an icon and hand gesture, enabling teachers to non-verbally suggest a tool to students during community meetings or academic instruction, or to pause and “name” which tools children could use in the moment. Staff hang posters and wear fandecks on lanyards with the tool icons to facilitate quick references. The whole school community uses mantras for the tools (e.g., Patience Tool: I am strong enough to wait!) and invites families to reference the tools at home.

TOOLBOX promotes a flexible approach to program implementation, such that program delivery can occur with an emphasis on longer, free-standing lessons (in the tradition of classic social skills curricula) or a more integrative “common language” approach of referencing and reinforcing tools after only a brief introductory lesson. The Standard implementation strategy provides structured stand-alone lesson plans and comprehensive resources for delivery, while the Primer implementation strategy provides only brief, “light touch” lessons introducing the 12 tools and the most essential resources for delivery. Typically, district or school leaders choose the overall implementation approach that they believe to be best suited for their educational context. The choice between the Standard implementation strategy and the Primer implementation strategy depends on local resources, readiness, and preferences. Theoretically, higher dosage (i.e., how much of the program components have been delivered; Durlak & DuPre, 2008) is anticipated in the Standard implementation than in the Primer implementation. Within either of these leadership-selected implementation choices, classroom teachers vary in their delivery of program components through lesson-based methods (e.g., explicit instruction about the concepts of each tool) and non-lesson-based strategies (e.g., modeling how to use tools, incorporating tools into academic curriculum, and applying tools to daily classroom interactions) based on their individual strengths, resources, constraints, and preferences.

These flexible implementation features have made TOOLBOX appealing to many educators. More than 40 school districts in Northern California have implemented TOOLBOX. Yet, only two unpublished studies have been conducted to date to examine the program’s theory of change (see Fig. 1), exploring (a) the acceptability and utility of training and resource inputs, (b) the presence and strength of instructional output in the classroom, and (c) the proximal student outcomes perceived by teachers, all in contexts where educational leaders selected the Standard implementation protocol (De Long-Cotty, 2010; Dovetail Learning, 2013). These studies found that elementary school teachers positively rated the value of program materials and training resources, implemented lessons and other delivery strategies in their classrooms, and observed students using the “tools” and encouraging others to use them. Through pre/post comparison on the Behavioral and Emotional Rating Scale (BERS; Epstein, 2004; Epstein & Sharma, 1998), teacher ratings of students’ intrapersonal and affective strengths increased over 3 months. Change was not detected, however, on BERS teacher ratings of interpersonal strengths or school functioning, nor on any of the BERS parent rating scales (De Long-Cotty, 2010). No prior study has compared students experiencing TOOLBOX to a comparison group, that is, contrasted student development under TOOLBOX conditions with the typical maturation of social-emotional development.

Fig. 1

TOOLBOX theory of change as communicated in the 2014 TOOLBOX Project Administrator's Guide (Copyright Mark A. Collin. All rights reserved.)

This quasi-experimental study aims to examine the relationship between TOOLBOX, as implemented in routine school settings (i.e., with no additional coaching or technical assistance made available by virtue of being studied), and the development of K-2 students’ SEC. This study first explored the overall effectiveness of TOOLBOX by comparing student SEC growth trajectories between TOOLBOX and non-TOOLBOX conditions over one academic year. Then, within the TOOLBOX condition, this study examined the extent to which TOOLBOX was implemented differently across the two implementation directives (i.e., Standard and Primer), as well as the extent to which the two directives had differential effects on student SEC growth trajectories. The research questions and hypotheses are as follows:

  • Question 1: To what extent was the TOOLBOX intervention related to growth trajectories of students’ SEC? Hypothesis 1: Students in the TOOLBOX condition will have higher rates of growth in SEC than students in the non-TOOLBOX condition.

  • Question 2: To what extent was TOOLBOX implemented across the two different implementation directives? Hypothesis 2: Although there are no benchmarks to guide specific hypotheses about dosage, we hypothesize that Standard implementation teachers will report higher levels of implementation dosage than Primer implementation teachers. Indicators of implementation quality will not differ across TOOLBOX implementation directives.

  • Question 3: To what extent were the TOOLBOX implementation directives related to growth trajectories of students’ SEC? Hypothesis 3: Students in schools with the Standard implementation directive will have higher rates of growth in SEC compared to students in schools with the Primer implementation directive.

Method

Design and Sample

The TOOLBOX Implementation Research Project (TIRP) aimed to understand variation in the routine implementation of TOOLBOX, as distributed at the time of study by Dovetail Learning, and to explore the relationship between one academic year of TOOLBOX implementation and student outcomes. The TIRP is a quasi-experimental study situated within a single California school district (which initiated this practice-driven research). According to publicly available district statistics, 59.1% of elementary school students in this district were identified as Hispanic/Latinx, 39.8% were English language learners, 71.3% were eligible for free and reduced-price lunch, and 23.2% of students met or exceeded state educational standards in English language arts and 24.0% in mathematics (District, 2016).

Funding enabled four elementary schools to initiate TOOLBOX, and two comparison schools to participate in a measurement-only, practice-as-usual (non-TOOLBOX) condition, during the 2015–2016 academic year. Schools were assigned to conditions in a way that intentionally distributed student demographic characteristics as evenly as possible across TOOLBOX and non-TOOLBOX conditions. Of the four TOOLBOX schools, two were given resources to implement the TOOLBOX Standard package, and the other two were given resources to implement the TOOLBOX Primer package.

In August, prior to the beginning of the fall semester, a 6-hour training was provided to teachers and staff from the four TOOLBOX schools, and 94% of classroom teachers attended the training. More detailed descriptions of the training and findings from a post-training teacher survey (including teacher attitudes, capacities, expectations for implementation and impact, etc.) are presented in a paper by Shapiro et al. (2020). To monitor implementation, the SEL Implementation Survey (SEL-IS), a self-report survey on program implementation behaviors, was administered to teachers and staff in these four schools at three time points throughout the year (October, December, and May). Also, classroom teachers were asked to complete the Devereux Student Strengths Assessment-Mini (DESSA-Mini), a brief 8-item behavioral rating scale assessing student SEC, at three time points throughout the year (October, December/January, and April/May). All research protocols were approved by the institutional review board at the University of California, Berkeley.

The current study uses a sample of 1766 K-2 students. The mean age of students in the sample at the beginning of the year was 6.05 years (SD = 0.89), and 48.6% of students in the sample were female. Administrative records provided by the district indicated that more than half of the students (55.3%) in the sample were identified as Hispanic/Latinx, followed by Asian/Asian American (13.6%), Black/African American (10.9%), White (7.5%), and Others (7.5%, including Filipinx, Pacific Islander, and American Indian/Native American); about half of the students in the sample (48.2%) were identified as English language learners (ELL; primary language used at home includes Spanish [70.9%], Cantonese [12.8%], Tagalog [4.3%], Vietnamese [4.3%], and Arabic [3.1%]); 8% of children in the sample were receiving special education (SPED) services; and 67.8% were eligible for free and reduced price lunch (FRL) based on their household economic status.

There were 562 students in schools under the TOOLBOX Standard implementation directive, 608 in schools under the Primer implementation directive, and 596 in the non-TOOLBOX condition. No difference was observed in the distribution of gender, age, SPED, and FRL across the three study conditions. Race/ethnicity distribution differed between TOOLBOX and non-TOOLBOX conditions (χ2(5) = 20.72, p < 0.01, Cramer’s V = 0.11, indicating a small-to-medium difference): in the TOOLBOX condition, there were fewer Hispanic/Latinx (51.9% versus 62.1%) and more Black/African American (12.6% versus 7.6%) students. Within the TOOLBOX condition, race/ethnicity distribution (χ2(5) = 36.60, p < 0.001, Cramer’s V = 0.18, indicating a medium-to-large difference) as well as ELL status (χ2(1) = 10.91, p < 0.01, ϕ = 0.10, indicating a small difference) differed between Standard and Primer implementation directives: in TOOLBOX Standard, there were more Black/African American students (15.8% versus 9.5%), fewer Asian/Asian American students (10.5% versus 18.4%), fewer Others (5.2% versus 10.5%), and fewer ELL students (41.3% versus 51.6%). Table 1 describes student demographic characteristics for the entire student sample and disaggregated by study condition.
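For reference, the ϕ and Cramer’s V effect sizes reported above follow the standard chi-square-based formulas (not specific to this study):

```latex
\phi = \sqrt{\frac{\chi^2}{n}}, \qquad
V = \sqrt{\frac{\chi^2}{n\,(k-1)}}, \qquad k = \min(\#\text{rows}, \#\text{columns})
```

For example, the TOOLBOX versus non-TOOLBOX comparison yields V = √(20.72 / (1766 × 1)) ≈ 0.11, matching the value reported above.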

Table 1 Student demographic characteristics by study condition

Students were nested within 85 classrooms (26 Standard, 30 Primer, 29 non-TOOLBOX). The current study uses a sample of 80 K-2 classroom teachers (25 Standard, 28 Primer, 27 non-TOOLBOX), excluding five teachers who did not respond or did not consent to the research. The majority of teachers (93.8%) in the sample identified as female. Sixty-five percent of teachers in the sample identified as White, 17.5% as Hispanic/Latinx, and 8.8% as Asian/Asian American. No difference was found in the distribution of teachers’ gender and race/ethnicity across study conditions.

Measurement

Social-Emotional Competence

Teachers assessed students’ SEC using the DESSA-Mini, a strength-based behavioral rating scale (Naglieri et al., 2011a), at three time points throughout the 2015–2016 year: October (Fall), December/January (Winter), and April/May (Spring). Following the recommended use of the DESSA-Mini, teachers rated the frequency (never = 0, rarely = 1, occasionally = 2, frequently = 3, very frequently = 4) of students’ positive behaviors (e.g., do something nice for somebody) over the past 4 weeks (Simmons et al., 2016). The sum of the eight items—transformed into a T score based on national norms (an expected sample mean of 50 and standard deviation of 10)—yields a social-emotional total (SET) score for each student at each time point (Naglieri et al., 2013). The adequacy of the DESSA norms has been independently reviewed (e.g., Atlas, 2010; Malcomb, 2010) and determined to be sufficiently large and diverse (Merrell & Gueldner, 2010). The DESSA-Mini is a brief version of the DESSA, which assesses eight social and emotional competencies: self-awareness (e.g., describe how they were feeling; 7 items), social awareness (e.g., get along with different types of people; 9 items), self-management (e.g., stay calm when faced with a challenge; 11 items), goal-directed behavior (e.g., keep trying when unsuccessful; 10 items), relationship skills (e.g., express concern for another person; 10 items), personal responsibility (e.g., remember important information; 10 items), decision-making (e.g., learn from experience; 8 items), and optimistic thinking (e.g., look forward to classes or activities at school; 7 items). The DESSA-Mini has four alternative forms, each composed of eight different indicators of social-emotional competence from the full DESSA. The alternative forms can be used in rotation to limit practice effects (LeBuffe et al., 2018; Lee et al., 2022a). Prior studies have shown that alternative-form reliability meets or exceeds 0.90 across all forms. DESSA-Mini forms 1 (Fall), 2 (Winter), and 3 (Spring) were used in this study.
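To illustrate the T score metric, the sketch below shows the underlying standardization logic in Python. The norm parameters are hypothetical, and the published DESSA-Mini uses norm lookup tables rather than this simplified linear formula:

```python
import numpy as np

def raw_to_t(raw_sum, norm_mean, norm_sd):
    """Convert a DESSA-Mini raw sum (eight 0-4 items, so 0-32) to a
    T score scaled to mean 50, SD 10 in the norming sample.
    Simplified linear version; the published scale uses norm tables."""
    z = (np.asarray(raw_sum, dtype=float) - norm_mean) / norm_sd
    return 50 + 10 * z

# Hypothetical norm parameters, for illustration only
print(raw_to_t(raw_sum=24, norm_mean=21.0, norm_sd=5.0))  # -> 56.0
```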

The DESSA-Mini has been evaluated against commonly accepted criteria for brief behavior rating scales measuring social, emotional, and behavioral risks (e.g., Glover & Albers, 2007; Jenkins et al., 2014). DESSA-Mini scores have been shown to be reliable (e.g., Shapiro et al., 2017a); sensitive and specific (e.g., Naglieri et al., 2011b); able to discriminate between children with and without mental, emotional, and behavioral problems, impairments, and adaptive skills (e.g., Goldstein & Naglieri, 2016; Nickerson & Fishman, 2009; Shapiro & Lebuffe, 2006); and predictive of serious disciplinary infractions (Shapiro et al., 2017b) and academic achievement (Balfanz & Byrnes, 2020; Chain et al., 2017). The DESSA and DESSA-Mini were designed with strategies to avoid rating bias (Mahoney et al., 2022), have been used with diverse populations across diverse settings (Hwang et al., 2022), and have been empirically tested for measurement invariance across subgroups of students as characterized by gender, race and ethnicity, special education, English language learning, and socioeconomic status (Lee et al., 2022b). The vast majority of variance in DESSA-Mini scores is attributable to differences between students rather than to differences between teachers (Shapiro et al., 2016; Tanner et al., 2018).

Implementation Variables

Implementation dosage (e.g., count, frequency, duration) and quality were measured through the SEL-IS, a self-report survey on program implementation behaviors, at three time points throughout the 2015–2016 year: October (Fall), December (Winter), and May (Spring). We used both composite indicators (an average of multiple related items) and individual items in our comparisons to balance the discordant desires of limiting the likelihood of Type II errors (via item-level comparisons), reducing measurement error, and promoting interpretability and clear implications for practice. Where composite scores were generated, we report an assessment of their internal reliability (i.e., the extent to which items on the same scale measure the same underlying construct) using Cronbach’s alpha (Cronbach, 1951). Given that Cronbach’s alpha is highly influenced by the number of items on a scale, and that practical scales for applied use need to be brief, a 0.60 criterion was used to indicate acceptable internal reliability (Schmitt, 1996).
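Cronbach’s alpha can be computed directly from an item-response matrix; a minimal sketch (toy data, not the study’s responses):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents x k_items) array:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)   # variance of scale totals
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Toy data: five teachers responding to three 0-4 Likert items
toy = [[3, 4, 3], [2, 2, 1], [4, 4, 4], [1, 2, 2], [3, 3, 4]]
print(round(cronbach_alpha(toy), 2))  # judge against the 0.60 criterion
```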

Implementation dosage was measured using four distinct dose forms (i.e., ways of delivering program components): lesson delivery, modeling, incorporation, and application. Lesson delivery dosage was measured by summing the number of lessons teachers reported teaching by the time of survey administration (checking off those they had taught from a list of 17 possible lessons). Teachers were then asked to select their “most favorite” and “least favorite” lessons from those they reported teaching, as referents for some of the subsequent questions. Modeling dosage was measured by two items: the frequency of teachers (a) using tools themselves in the classroom and (b) telling students which tools they themselves needed in the moment. Incorporation dosage was measured by three items: the frequency of incorporating tools into (a) writing, (b) literature, and (c) arts. Application dosage was measured by three items: the frequency of (a) discussing and (b) asking how students can use tools in their daily lives, and (c) naming, in the moment, the tools students were using. Modeling, incorporation, and application dosage items were scored on a 5-point Likert scale (0 = never, 1 = rarely, 2 = occasionally, 3 = often, and 4 = very frequently). Cronbach’s alpha coefficients for these eight non-lesson-based (i.e., modeling, incorporation, and application) dosage items were 0.76 in Fall, 0.65 in Winter, and 0.86 in Spring. In addition to these count and frequency indicators of dosage, duration was measured by three items: time spent (a) teaching the most favorite lesson, (b) teaching the least favorite lesson, and (c) using other strategies outside of the lesson structure in a typical week (0 = less than 10 min, 1 = 20 min, 2 = 30 min, 3 = 40 min, 4 = 50 min or more). Cronbach’s alpha coefficients for the two lesson-based duration items were 0.55 in Fall, 0.74 in Winter, and 0.66 in Spring.
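To make this scoring concrete, the sketch below assembles the lesson count and a non-lesson-based dosage composite from hypothetical SEL-IS responses (all variable names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical SEL-IS responses for three teachers (0 = never ... 4 = very frequently)
survey = pd.DataFrame({
    "lessons_taught":  [5, 11, 8],   # count of lessons checked off (of 17 possible)
    "model_self_use":  [3, 4, 2],    # modeling items
    "model_tell_need": [2, 3, 2],
    "incorp_writing":  [1, 2, 0],    # incorporation items
    "incorp_lit":      [2, 3, 1],
    "incorp_arts":     [1, 2, 1],
    "apply_discuss":   [3, 4, 2],    # application items
    "apply_ask":       [2, 3, 3],
    "apply_name":      [3, 3, 2],
})

# Non-lesson-based dosage composite: mean of the eight frequency items
freq_items = survey.columns.drop("lessons_taught")
survey["nonlesson_dosage"] = survey[freq_items].mean(axis=1)
print(survey[["lessons_taught", "nonlesson_dosage"]])
```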

Implementation quality was measured by two items asking teachers to grade the quality of their teaching of their most favorite and least favorite lessons, plus one item asking about the quality of their use of strategies outside of the lesson structure (1 = F to 13 = A+). Cronbach’s alpha coefficients for the two lesson-based quality items were 0.74 in Fall, 0.63 in Winter, and 0.59 in Spring.

Student Demographic Characteristics

Student demographic characteristics were included as covariates when comparing student SEC growth trajectories by condition. Student age and gender were reported by teachers when they completed the DESSA-Mini. Other student characteristics including race/ethnicity, ELL status, SPED status, and FRL eligibility status were collected from the 2015–2016 district administrative records. Student age in years in Fall was included as a continuous variable. Variables measured dichotomously were dummy coded, including gender (0 = male, 1 = female), ELL status (0 = non-ELL, 1 = ELL), SPED status (0 = no SPED services, 1 = SPED services), and FRL eligibility status (0 = not eligible for FRL, 1 = eligible for FRL). The race/ethnicity variable was transformed using the effect coding method, such that the mean of each subgroup can be compared to the grand mean across all subgroups. This method is especially useful for examining variables like race/ethnicity without assuming any specific group is normative, against which all other groups are compared (Mayhew & Simonoff, 2015).
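Effect (deviation) coding can be implemented as below; the group labels are illustrative, and the choice of which group carries the −1 codes is arbitrary:

```python
import pandas as pd

race = pd.Series(["Latinx", "Asian", "Black", "White", "Other", "Latinx"])
levels = ["Latinx", "Asian", "Black", "White", "Other"]

# k-1 effect-coded columns; the final level is coded -1 on all columns,
# so each coefficient contrasts a subgroup mean with the grand mean
codes = pd.DataFrame(0, index=race.index, columns=levels[:-1])
for level in levels[:-1]:
    codes.loc[race == level, level] = 1
codes.loc[race == levels[-1], :] = -1
print(codes)
```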

Analytic Procedures

To investigate question 1, latent growth modeling (LGM) was conducted to assess the extent to which the TOOLBOX intervention was related to student SEC growth trajectory, including the initial level (i.e., intercept) and the rate of change (i.e., slope). The unconditional LGM was first performed to test whether a linear growth trajectory model fit our sample data. Then, the conditional LGM was performed to compare student SEC growth trajectories by intervention condition, while accounting for any variations in outcomes associated with student demographic characteristics. To investigate question 2, independent samples t tests were conducted to compare the means of implementation variables at the teacher level between Standard and Primer implementation directives. To investigate question 3, the conditional LGM was performed to compare student SEC growth trajectories between Standard and Primer directives, while accounting for any variations related to student demographic characteristics.
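In generic notation (our labels, not Mplus syntax), the conditional linear LGM can be written with λ_t as the time score at wave t and condition entered as a predictor of both growth factors:

```latex
\begin{aligned}
\text{SEC}_{ti} &= \eta_{0i} + \lambda_t\,\eta_{1i} + \varepsilon_{ti} \\
\eta_{0i} &= \alpha_0 + \gamma_0\,\text{Condition}_i + \boldsymbol{\beta}_0^{\top}\mathbf{x}_i + \zeta_{0i} \\
\eta_{1i} &= \alpha_1 + \gamma_1\,\text{Condition}_i + \boldsymbol{\beta}_1^{\top}\mathbf{x}_i + \zeta_{1i}
\end{aligned}
```

Here η0i and η1i are student i’s latent intercept and slope, x_i collects the demographic covariates, and γ0 and γ1 are the condition effects on initial level and growth, respectively.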

For LGM (questions 1 and 3), goodness-of-fit was assessed following Hair and colleagues’ (2009) guidelines for a moderately complex model (i.e., 12 to 30 observed variables) with a large sample (n > 250). They suggest that a significant chi-square statistic (p < 0.05) is expected, while a comparative fit index (CFI) or Tucker-Lewis index (TLI) higher than 0.92, a standardized root mean squared residual (SRMR) lower than 0.08, and a root mean square error of approximation (RMSEA) lower than 0.07 demonstrate goodness-of-fit. LGM was conducted using the full information maximum likelihood method with a sandwich estimator in Mplus version 8 in order to accommodate combinations of different types of predictor variables, compute standard errors that are robust to non-normality, and handle missing data efficiently (Muthén & Muthén, 1998). For the independent samples t tests (question 2), Levene’s tests for homogeneity of variances were first conducted. Where the homogeneity assumption was violated, the degrees of freedom were adjusted using the Welch-Satterthwaite method. An alpha level of 0.05 was applied to assess statistically significant differences between group means. The t tests were conducted in SPSS version 25.
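The t-test procedure is reproducible with standard tools; a minimal sketch in Python using scipy (the teacher-level values are invented for illustration):

```python
from scipy import stats

standard = [1.35, 0.80, 1.60, 1.10, 0.90]  # hypothetical teacher-level scores
primer = [0.64, 0.50, 0.70, 1.20, 0.40]

# Levene's test for homogeneity of variances
lev_stat, lev_p = stats.levene(standard, primer)

# If variances are unequal, equal_var=False applies the
# Welch-Satterthwaite degrees-of-freedom adjustment
t_stat, p_val = stats.ttest_ind(standard, primer, equal_var=(lev_p >= 0.05))
print(f"Levene p = {lev_p:.3f}; t = {t_stat:.2f}, p = {p_val:.3f}")
```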

Results

Table 2 presents descriptive statistics and missing rates of DESSA-Mini scores at each time point by condition. In our sample, the total average SEC score in Fall was M = 50.37 (SD = 11.08), which is close to the national norm of M = 50 (SD = 10), with no differences in Fall SEC across study conditions (F(2, 1622) = 0.33, p = 0.72). DESSA-Mini scores were 8% missing in Fall, 15.6% in Winter, and 15.1% in Spring. The missing rates did not differ by condition in Spring, but differed at the first two waves. In Fall, there were more missing data in the TOOLBOX condition versus non-TOOLBOX (10.3% versus 3.5%; χ2(1) = 24.36, p < 0.001) and in Primer versus Standard (18.4% versus 1.4%; χ2(1) = 91.67, p < 0.001). In Winter, more data were missing in the non-TOOLBOX sample versus TOOLBOX sample (30.2% versus 8.1%, χ2(1) = 146.45, p < 0.001) and in Primer versus Standard (12.7% versus 3.2%; χ2(1) = 35.05, p < 0.001). We present this information transparently, since the design is quasi-experimental.
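Each missing-rate comparison reduces to a chi-square test on a 2 × 2 table of missing versus observed counts by condition; a sketch with counts that approximate the Fall rates reported above:

```python
from scipy.stats import chi2_contingency

# Rows: TOOLBOX, non-TOOLBOX; columns: missing, observed (approximate counts)
table = [[120, 1050],
         [21, 575]]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}")
```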

Table 2 Student SEC descriptive statistics and missing rates by study condition

The unconditional linear growth trajectory model, with the actual measurement time span reflected in the slope loadings (i.e., 0, 0.28, 1) and equal residual variance assumed across the three time points, showed an acceptable fit (χ2(3) = 9.66, p = 0.02; CFI = 0.996; TLI = 0.996; SRMR = 0.032, RMSEA = 0.035, 90% CI = [0.012, 0.062]). In this unconditional model, the mean intercept was estimated to be 50.30, and the average rate of growth over the academic year was 2.60 T score points. The correlation between intercept and slope was not statistically significant (r = −0.07, p = 0.27), suggesting that, across the full sample, a student’s SEC level at the start of the year did not predict growth in SEC throughout the year.
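Plugging these estimates into the growth equation shows what the slope loadings imply for the model-implied mean at each wave:

```latex
\hat{\mu}_{\text{Fall}} = 50.30 + 2.60(0) = 50.30, \qquad
\hat{\mu}_{\text{Winter}} = 50.30 + 2.60(0.28) \approx 51.03, \qquad
\hat{\mu}_{\text{Spring}} = 50.30 + 2.60(1) = 52.90
```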

  • Question 1: To What Extent Was the TOOLBOX Intervention Related to K-2 Students’ Growth Trajectories of Social-Emotional Competence?

Table 3 presents the estimation results of these two conditional growth trajectory modeling approaches. The model without covariate adjustment showed an acceptable fit to our data (χ2(4) = 11.39, p = 0.02; CFI = 0.996; TLI = 0.994; SRMR = 0.029; RMSEA = 0.032, 90% CI = [0.01, 0.06]). The intercept did not differ by TOOLBOX intervention condition (b = 0.89, p = 0.10), but the slope differed by 1.60 points (p < 0.01) by Spring. The mean slope for non-TOOLBOX students was estimated to be 1.53 T score points over the year, and the mean slope for TOOLBOX students was estimated to be 3.13 T score points. After including demographic covariates, the model still showed an acceptable fit (χ2(13) = 21.22, p = 0.07; CFI = 0.996; TLI = 0.990; SRMR = 0.014; RMSEA = 0.020, 90% CI = [0.00, 0.04]). Holding all other covariates constant, the intercept did not differ by TOOLBOX intervention condition (b = 0.68, p = 0.19), but the slope differed by 1.74 points (p < 0.01) by Spring. These findings suggest that although student SEC started at a similar level in Fall, TOOLBOX students, on average, demonstrated a higher rate of growth in SEC than non-TOOLBOX students over the year.
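In the unadjusted model, the condition coefficient on the slope maps directly onto the two group trajectories:

```latex
\widehat{\text{slope}}_i = 1.53 + 1.60\,\text{TOOLBOX}_i
\;\;\Rightarrow\;\;
1.53 \text{ (non-TOOLBOX)} \quad \text{vs.} \quad 1.53 + 1.60 = 3.13 \text{ (TOOLBOX)}
```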

Table 3 Difference in student SEC growth trajectory by intervention condition
  • Question 2: To What Extent Was TOOLBOX Implemented Across Two Different Implementation Directives?

Table 4 presents descriptive statistics of implementation variables (both composite scales and individual items) by implementation directive and the corresponding t test results. On average, teachers taught about 5 lessons in Fall, 8 lessons by Winter, and 11 lessons by Spring. Beyond the cumulative number of lessons taught, no clear descriptive pattern was observed for other implementation variables. Results of the t tests showed no scale-level differences in teacher reports of their implementation behaviors between the Standard and Primer implementation directives. Item-level analysis revealed that Standard teachers tended to use the three incorporation strategies (i.e., incorporating tools within academic curricula) more frequently in Spring than Primer teachers: incorporation into writing (Standard M = 1.35, Primer M = 0.64, t(43) = 2.67, p < 0.05), into literature (Standard M = 2.17, Primer M = 1.41, t(43) = 2.23, p < 0.05), and into arts and crafts (Standard M = 1.43, Primer M = 0.73, t(43) = 2.59, p < 0.05). In addition, Standard teachers reported a higher quality of teaching their least favorite lesson in Spring (Standard M = 9.09, Primer M = 8.00, t(43) = 2.27, p < 0.05) relative to Primer teachers. Primer teachers reported naming tools that students were using in the moment more frequently in Winter than Standard teachers (Standard M = 2.33, Primer M = 2.76, t(43) = −2.05, p < 0.05). For all other implementation variables, no statistically significant difference was found between the two implementation directives at the p < 0.05 level.

Table 4 Teacher implementation comparisons between TOOLBOX Standard and Primer directives
  • Question 3: To What Extent Were the TOOLBOX Implementation Directives Related to K-2 Students’ Growth Trajectories of Social-Emotional Competence?

Table 5 presents the estimation results of these two conditional growth trajectory modeling approaches within the TOOLBOX sample (Standard versus Primer implementation directives). The model without covariate adjustment showed an acceptable fit to our data (χ2(4) = 1.69, p = 0.79; CFI = 1.00; TLI = 1.00; SRMR = 0.012; RMSEA = 0.000, 90% CI = [0.00, 0.03]). Neither the intercept (b = −0.10, p = 0.88) nor the slope (b = 0.01, p = 0.99) differed by implementation directive. After including demographic covariates, the model still showed an acceptable fit (χ2(13) = 19.37, p = 0.11; CFI = 0.996; TLI = 0.989; SRMR = 0.010; RMSEA = 0.022, 90% CI = [0.00, 0.04]), and neither the intercept (b = 0.92, p = 0.14) nor the slope (b = 0.06, p = 0.93) differed by implementation directive. These findings suggest that the two TOOLBOX implementation directives, which largely did not change implementation behavior, had no differential effects on student growth in SEC.

Table 5 Difference in student SEC growth trajectory by implementation directive

Discussion

This quasi-experimental study provides promising evidence to support TOOLBOX effects on K-2 students’ social and emotional growth in a routine practice setting. On average, students in TOOLBOX schools gained 3.13 T score points across the school year, 1.60 more T score points than non-TOOLBOX students. To interpret the magnitude of these gains, one might compare results from this study of TOOLBOX to a study in which the DESSA-Mini was used to measure the SEC of students receiving the Promoting Alternative Thinking Strategies (PATHS; Kusché & Greenberg, 1994) curriculum (Shapiro et al., 2018). The Blueprints for Healthy Youth Development (Mihalic & Elliott, 2015)—a clearinghouse created to help consumers determine “what works”—lists PATHS as a “model” program. Shapiro et al. (2018) observed that K-2 students exposed to PATHS, with robust technical assistance, gained an average of 3.66 DESSA-Mini T score points across the school year (no comparison group available). Therefore, available evidence suggests that TOOLBOX may also be a promising approach for augmenting student SEC.

In response to a secondary aim, this study fails to provide robust evidence that the school-level decision to purchase either the TOOLBOX Standard or the TOOLBOX Primer package (i.e., the school-level implementation directive) differentially shaped teacher implementation behavior or student outcomes. The extra resources provided with the Standard curriculum were not associated with reports of higher-quality instruction. Teachers in the Standard TOOLBOX directive did not report a higher level of quality, with one exception: the quality of their least favorite lesson in the springtime. Also counter to expectation, teachers did not report many differences in dosage across diverse dose forms. Item-level analysis indicated that teachers in the Standard TOOLBOX directive were more likely than teachers in the Primer directive to incorporate the program into academic lessons, despite the Standard directive emphasizing stand-alone lessons and the Primer directive emphasizing integration. These findings could imply that more highly scripted lessons have an under-acknowledged benefit as a proxy for professional development, enabling teachers to learn the material themselves before teaching it, and then to flexibly integrate SEL content into other curricular areas and to teach less-resonant lessons well. This is concordant with prior research indicating that a plurality of teachers prefer initial structure with increasing flexibility when implementing a new initiative (Shapiro et al., 2016).

The challenge of taking high-quality SEL to scale is formidable, and innovative approaches to adoption abound. Many are guided by the foot-in-the-door compliance tactic (Freedman & Fraser, 1966), and the diffusion theory premise that less disruption is better for adoption (Rogers, 1995). Some program developers and administrators have, in turn, conceived that a lower-burden, flexible approach (e.g., a “primer”) may prepare the implementation environment for the subsequent adoption of a more comprehensive (“standard”) curriculum. Yet, a “flexible” approach may not necessarily be a lower burden approach. Given the complexity of integrating SEL into academic instruction, often in the absence of ongoing training and technical assistance, and with an array of competing mandates, additional research should consider whether diffuse directives and flexibility are the best way to promote wide-scale implementation in schools. This research should be longitudinal, as it may also be the case that different implementation directives will lead to more disparate implementation behaviors or student outcomes over time. It might also be helpful for programs to be analyzed in micro-randomized trials to understand the direct and interactive effects of their component parts to optimize for implementation and effectiveness (Collins et al., 2014).

Although it is now well established that how a practice is put into place shapes SEL program outcomes (Durlak et al., 2011; Rojas-Andrade & Bahamondes, 2019), we did not find evidence that the TOOLBOX implementation directive ultimately shaped the growth of student SEC within the first year of implementation. This does not imply that actual implementation behavior does not shape the growth of student SEC, but rather, that the instructions and materials provided at the start of the academic year were not necessarily the most meaningful source of variance determining individual teacher implementation behaviors. Now that we have observed the relationship between the directives and student outcomes, future research should examine the direct relationship between teacher implementation behaviors and child outcomes through multi-level analysis. This analysis is beyond the scope of this paper, which sought to explore TOOLBOX effectiveness under different conditions created by school- or district-level adoption decisions.

The completion of this research project means it is no longer the case that students are receiving an SEL program (i.e., TOOLBOX) without demonstrated growth relative to a comparison group; nevertheless, several limitations should be considered. For teacher ratings of student SEC to be completed, teachers needed to be familiar with students. This means that our initial assessment was in October, following the August TOOLBOX training and approximately 1 month of instruction. We realize we may have missed some initial growth in SEC as a result, but differences by condition were not detected at the time of our initial assessment. The quasi-experimental design (i.e., lack of random assignment to condition), detection bias (i.e., informants were likely aware of their assignment to condition and therefore potentially biased evaluators of their own work), and slight differences in missing rates by condition limit the potential for strong causal claims, but the “routine” practice conditions enhance the study’s utility for informing practice-as-usual decisions. Although there were many students and teachers within each condition, the small number of schools assigned to each condition, and their origins within a single district, limits the generalizability of our findings. On the other hand, the diverse student body, thoroughly described in this study, is a strength relative to much of the SEL literature (Rowe & Trickett, 2018).

Finally, it is important to recognize the limitations of self-reported implementation variables. Third-party observations of implementation behaviors have typically shown a stronger relationship to outcomes than self-report data (Lillehoj et al., 2004). Third-party observations, however, are more typical of free-standing, lesson-based SEL programs than of SEL programs intended to be integrated, modeled, and applied throughout the school day (Shapiro et al., 2018). The direct observation of behavior is best suited for observing the frequency of high-prevalence, discrete behaviors, against their own baseline, for a minimum of five 30-min sessions to achieve reliable estimates (Doll & Elliott, 1994). The SEL-IS is designed to be a pragmatic alternative for monitoring implementation in routine practice, primarily for continuous quality improvement purposes. Its use for research is largely exploratory. For example, single-item indicators, used to enhance efficiency and problem-solving in practice settings, may contain more measurement error than is desired for research purposes, and some tactics used to normalize distributions of SEL-IS self-report data (e.g., asking how fully teachers are implementing TOOLBOX compared to other colleagues) could also obscure between-school comparisons. On the other hand, although many self-report efforts find teachers uniformly rating themselves favorably, the data from the SEL-IS were not particularly skewed.

Beyond these strengths and limitations, this study inspires an additional research direction. The finding that TOOLBOX was beneficial for student SEC growth, on average, and under various implementation directives, does not necessarily imply that it benefits all students equally. SEC growth trajectories may differ across diverse subgroups of students, and a universal SEL program like TOOLBOX may work differently for different students. A few prior studies have examined how student characteristics such as gender, race/ethnicity, and socioeconomic status relate to their SEC growth in general as well as in response to a universal SEL program (e.g., Frey et al., 2005; Holsen et al., 2009; Jones & Bouffard, 2012; Low et al., 2019; Malti et al., 2011). Nevertheless, current evidence is inadequate for understanding the extent to which socio-demographic disparities in SEC exist in student populations, and whether the benefits of a universal SEL program are distributed equally or differentially across diverse subgroups of students. Our analysis of covariates suggests that student SEC growth trajectories may vary by student characteristics such as age, gender, race/ethnicity, and special education services status, independent of the observed program effects. In order to understand the effectiveness of TOOLBOX, and in addition to randomized trials observing sustained effects, an important next step is to determine whether the program contributes to educational equity or to educational disparities. Further research is needed to examine program effects on student social-emotional growth across diverse subgroups of students.

Practicing school psychologists may consider lessons that can be derived from this analysis for delivering social and emotional learning in the context of a Multi-Tiered System of Support (MTSS). In the California MTSS framework, for example, evidence-based practices are provided to all students to support whole child development, a feature of which is inclusive, transformative social and emotional instruction (Orange County Department of Education, 2021). It is therefore important for school psychologists, who may be engaged in decisions to select and adopt programs, to understand whether SEL programs like TOOLBOX have an evidence base and transformative potential. The California MTSS framework further asserts that, in order to promote whole child development, there should be strong leadership, educator support, and organizational structures for integration, in the context of a positive school climate and trusting partnerships with the district, families, and other community institutions. We discovered that implementation behaviors and student growth did not systematically vary based on the school-wide directive to adopt the Standard or Primer version of TOOLBOX. Instead, it is likely that a school psychologist who contributes to strong implementation leadership (e.g., serves on an inclusive SEL leadership team to develop a written implementation plan and remove implementation obstacles; Lee et al., 2018), and who provides implementation support (through training, coaching, tools, and feedback loops; Wandersman et al., 2008), can more effectively shape the implementation behaviors associated with optimal growth of student social and emotional competence.