Higher education institutions are facing increasing pressure from accrediting bodies and from state and federal regulatory agencies to improve accountability by creating meaningful evaluations of student success. These bodies expect institutions to provide evidence of student retention and completion (Blankenberger & Williams, 2020; Conner & Rabovsky, 2011; Heller, 2001; Ochoa, 2011; Russell, 2011; Spellings, 2006; U.S. Department of Education, 2011), and of student learning (Blankenberger et al., 2017, 2020; Astin, 2012; Kuh & Ikenberry, 2009; National Governors' Association, 1986; Spellings, 2006). However, measuring student performance in meaningful ways is challenging. So many factors affect student performance that isolating the impact of participation in specific institutional programs is very difficult (Perna & Thomas, 2006). Higher education administrators and policymakers too often make decisions based on descriptive data analysis or even anecdote when better analysis options could produce more nuanced and more actionable results. With the advance of statistical matching techniques, researchers have greater opportunities to do just that. In this article, we compare analysis techniques that we have employed at a midwestern regional institution to improve program evaluation processes.

Assessment of Student Outcomes as Program Evaluation

Evaluating student outcomes is not easy, but it can help to conceptualize the process as a type of program evaluation (Blankenberger & Cantrell-Bruce, 2016; Blankenberger et al., 2017; Blankenberger, 2020; Cantrell-Bruce & Blankenberger, 2015). The classic systems theory model has been used for decades in organizational theory and provides a framework to aid those engaged in evaluations. The model emphasizes the inputs, processes, and outputs of a program, within the context of its environment (Rossi et al., 2003; Sylvia & Sylvia, 2012). Generally speaking, for an educational program the core inputs include the faculty, students, and the financial, administrative, material, and technological resources necessary to deliver the program. As these inputs are fed into a program's educational processes, several outputs are expected, such as student learning, retention, and degree completion, as well as broader objectives such as informed citizenship, appreciation for diversity, and research productivity. Moreover, evaluators need to consider the many environmental and personal factors that affect student success when evaluating programs.

As Perna and Thomas (2006) note in their conceptual framework for analyzing student outcomes, evaluation of student success should be contextualized within multiple layers including internal student context, family context, school context, and the social, economic, and policy context. Accounting for the numerous factors that can impact student and program outcomes can be daunting when designing an evaluation. If evaluators were able to employ randomized experimental design this would help to account for these factors. However, as with other types of program evaluation, the ideal of experimental design can rarely be employed when evaluating educational programs.

Non-Experimental, Experimental and Quasi-Experimental Strategies

Researchers often employ a between-subjects design to analyze the treatment effects of an educational intervention by comparing the outcomes of two different groups, or a within-subjects design to examine differences within the same group before and after treatment (Gravetter & Forzano, 2009). Ideally, to determine whether a program or treatment produced the desired outcome, an evaluator would employ an experimental research strategy: an independent variable manipulated by the researcher (i.e., two or more treatment conditions), a dependent variable that changes based on the treatment, and random assignment of participants to experimental and control conditions. Additionally, the researcher would need to control for other covariates that could affect the relationship between the independent and dependent variables (Gravetter & Forzano, 2009). However, these conditions typically cannot be satisfied in program evaluations: individuals cannot be randomly assigned to treatment and control groups because the program has already concluded or cannot ethically be denied to individuals, or there is no way to effectively isolate the experiment from important extraneous factors that could affect the relationship between the independent and dependent variables (Gravetter & Forzano, 2009; Rossi et al., 2003; Sylvia & Sylvia, 2012). This is typically the case in higher education program evaluations because most of the time students cannot be randomly assigned to control and treatment groups, they are not in controlled experimental settings, and they are subject to many internal and external factors that can affect their performance (Perna & Thomas, 2006).

Recognizing these limitations, researchers evaluating education programs should seek to achieve the strongest design possible to produce results that will allow administrators and policymakers to make better, more informed decisions. Ultimately, evaluators may be constrained in the evaluation techniques they are able to employ, but they should not be satisfied with non-experimental designs that do little to minimize threats to validity. They should attempt to emulate experimental conditions as much as possible by employing quasi-experimental designs that reduce threats to internal and external validity.

In a non-experimental educational program evaluation design, researchers may compare outcomes such as GPA or retention for two groups, either between subjects or within subjects. They then use traditional statistical techniques such as the chi-square test, t test, or ANOVA to compare the outcomes. If the researcher has additional data to use for control variables, they could conduct chi-square with layering, ANCOVA, or regression analyses to control for the impact of some factors and determine the strength of relationships between variables. For example, in a regression equation, the researcher is able to show how much of the change in the dependent variable is accounted for by the overall model and the various factors in the model (Gravetter & Forzano, 2009).

However, evaluation researchers can better approximate the experimental condition and improve the effectiveness of their analyses by using quasi-experimental approaches. With propensity score matching (PSM) or exact matching, the evaluator can create matched pairs to better mimic the randomization process (Austin, 2011; Thoemmes, 2012; Thoemmes & Kim, 2011). The choice of evaluation design can have important implications for the evaluation, and therefore for institutions. Although such designs cannot establish true causality, they better isolate the treatment effects associated with the intervention or program. Even the choice between traditional PSM and coarsened exact matching (CEM) can have ramifications for the results of an evaluation, and therefore for the choices made by those acting on those results. Researchers conducting these evaluations need to be aware of the options available to them as well as the possible impact of the design choices they make.

In this article we discuss three program evaluations we conducted at a midwestern regional public university, comparing the types of results we uncovered based on the use of traditional non-experimental techniques (t test, chi-square, regression), quasi-experimental approaches using propensity score matching, and a mixed design combining propensity score matching and exact matching. We compare the results of each type of design to show how the type and depth of the analysis can have serious implications for the evaluation and for the possible resulting actions taken by administrators and policymakers. We discuss these techniques in greater detail in the next section.

General Design and Research Questions

Like most universities, the institution in our study has attempted to increase the effectiveness of its program evaluations both to improve student and institutional success and to satisfy accreditation and government oversight expectations. The authors have conducted multiple analyses designed to measure the impact of educational and student service programs on student outcomes. However, as at most institutions, we faced the data access and structural limitations that accompany such evaluations. Available institutional data cannot account for many of the complex factors that affect student performance. Further, when interventions were introduced, randomly assigning students to experimental and control conditions was not an option. Therefore, to improve our evaluations, we employed propensity score matching to analyze existing datasets and imitate the randomization process of traditional experimental design, as other educational researchers have done (e.g., An, 2013; Blankenberger et al., 2017a; Dietrich & Lichtenberger, 2015; Gehlhausen Anderson & Blankenberger, 2020; Lichtenberger et al., 2014; Lane et al., 2012; Taylor, 2015).

Propensity Score Matching Techniques

In observational studies, propensity score matching involves applying statistical techniques to extant data to generate a propensity score, the predicted probability of participation in a treatment condition, in order to control for selection bias (Austin, 2011; Thoemmes, 2012; Thoemmes & Kim, 2011). The propensity scores are used to produce simulated control and treatment groups that are equally likely to have participated in the treatment. Once the comparison groups are created, traditional statistical analysis techniques are employed to assess differences in outcomes between the two groups. However, because the propensity score matching process condenses multiple factors, including categorical variables, into a single numeric propensity score, the process can easily yield groups that are unbalanced on factors evaluators consider important, such as race and gender (Iacus et al., 2012). Thus, some researchers have suggested that exact matching on categorical variables, coarsened exact matching, or even a mixed approach of exact matching on some variables and PSM for others may produce better results when creating matched groups (Bai & Clark, 2019; Burden et al., 2017; Imai et al., 2008; Stuart, 2010; Wells et al., 2013). Some categorical variables, such as gender or the presence of a treatment, may be matched exactly, while others, such as grade, age, or race, may need to be coarsened before exact matching (e.g., coarsening several reported racial/ethnic categories to white and students of color). In either case, exact matching on multiple characteristics can be challenging and may lead to very small matched groups, which increases the likelihood of yielding invalid findings.

These approaches provide program evaluators some valuable options, but each has its own implications that should be considered. In this article we discuss the results of three program analyses. For each, we employed between-subjects designs and analyzed the data with traditional analysis techniques, PSM, coarsened exact matching, and exact matching. We discuss the results of each and the associated strengths and weaknesses of the techniques. Our overall research questions encompass the individual program evaluations and the overarching comparison of approaches.

1. Controlling for academic, demographic, and non-cognitive covariates, is participation in a freshman seminar associated with greater retention and GPA?

2. Controlling for academic, demographic, and non-cognitive covariates, is participation in engaged citizenship common experience courses associated with improved color-blind racial attitude scale (racial bias measure) scores?

3. Controlling for academic, demographic, and non-cognitive covariates, is participation in a living learning community associated with greater retention and GPA?

4. What impacts on policy and institutional decision making can these different approaches yield?

5. What are the comparative strengths and weaknesses of employing propensity score matching to create matched groups compared to a mix of exact matching, coarsened exact matching, and PSM to create matched groups?

We will answer the first three questions in the sections dedicated to each separate study. We will consider the final two questions in the conclusion.

Methods and Results for Three Institutional Examples of Program Evaluations

Overall Approach to the Evaluations

We first conducted a simple preliminary statistical analysis, t test and chi-square, without attempting to control for covariates. Then we conducted regression analyses adding in available data on covariates to determine the extent of the relationship between the independent variable and the outcome variable while controlling for covariates. However, we were concerned that the preliminary analyses in each evaluation insufficiently controlled for possible confounders. To attempt to improve the analyses, we employed propensity score matching to create equivalent groups and control for factors that might impact the analysis of the relationship between participation in the education program and the relevant outcomes.

For all three of these evaluations, we used the propensity score matching tool in SPSS to conduct the matches. The first step was to compare the profiles of the student groups participating in the different treatment conditions (e.g., those who participated in a freshman seminar vs. those who did not) using the data available from the university's institutional research office. Running preliminary regressions enabled us to identify which factors were related to the outcome variables and which were not, and therefore which to include in the match. We also checked for collinearity to see which factors might be redundant.
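To make the screening step concrete, the following is a minimal sketch of it in Python. The analyses reported here were run in SPSS; the synthetic data, column names, and library choices below are illustrative assumptions, not the study's actual files or procedures.

```python
# Illustrative sketch only: screen candidate covariates with a preliminary
# logistic regression and a collinearity (VIF) check. Synthetic data and
# hypothetical column names stand in for the institutional research file.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 500
students = pd.DataFrame({
    "hs_gpa":   rng.normal(3.0, 0.5, n),     # high school GPA
    "act":      rng.normal(22, 4, n),        # cumulative ACT score
    "female":   rng.integers(0, 2, n),       # 1 = female
    "retained": rng.integers(0, 2, n),       # outcome: retained into year two
})

# Which candidate covariates are related to the outcome?
X = sm.add_constant(students[["hs_gpa", "act", "female"]])
prelim = sm.Logit(students["retained"], X).fit(disp=0)
print(prelim.summary())

# Collinearity check: large variance inflation factors flag redundant covariates.
covs = students[["hs_gpa", "act", "female"]].assign(const=1.0)
for i, name in enumerate(covs.columns[:-1]):
    print(name, round(variance_inflation_factor(covs.values, i), 2))
```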

Next, we generated the propensity scores, i.e., the predicted probabilities of student participation in a treatment group, using a logistic regression model with membership in the group as the dependent variable and the baseline attributes as the predictor variables (Austin, 2011). We included the relevant characteristics (aided by the regression and collinearity results), such as gender, race, and GPA, to generate the predicted probability. We then matched treatment and comparison group members on the propensity score to the nearest hundredth (i.e., a caliper of 0.01). This can be done using the PSM command in SPSS (or another statistical software program). Alternatively, the propensity score can be generated by running a logistic regression with participation in the treatment/program as the dependent variable and saving the predicted probabilities as a new variable. The comparison groups are then matched on that variable, using the nearest hundredth as the cutoff, to create the two groups: one that participated in the treatment and one that did not. Standard t-tests or chi-squares are then run to compare the "matched groups" on an outcome such as GPA or retention.
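A minimal sketch of this propensity score and caliper matching step, written in Python with synthetic data and hypothetical variable names (our matches were produced with the SPSS PSM tool), might look like the following:

```python
# Illustrative sketch: estimate propensity scores with logistic regression and
# perform greedy 1:1 nearest-neighbor matching without replacement within a
# 0.01 caliper. Synthetic data; not the study's actual records.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),        # e.g., took the first year seminar
    "hs_gpa":  rng.normal(3.0, 0.5, n),
    "act":     rng.normal(22, 4, n),
    "female":  rng.integers(0, 2, n),
})

# 1. Propensity score: predicted probability of treatment from the baseline attributes.
X = sm.add_constant(df[["hs_gpa", "act", "female"]])
df["pscore"] = sm.Logit(df["treated"], X).fit(disp=0).predict(X)

# 2. Greedy 1:1 matching without replacement, caliper = 0.01 (nearest hundredth).
caliper = 0.01
controls = df[df["treated"] == 0].copy()
pairs = []
for t_idx, t_score in df.loc[df["treated"] == 1, "pscore"].items():
    gaps = (controls["pscore"] - t_score).abs()
    if not gaps.empty and gaps.min() <= caliper:
        c_idx = gaps.idxmin()
        pairs.append((t_idx, c_idx))
        controls = controls.drop(c_idx)      # each control used at most once

matched = df.loc[[i for pair in pairs for i in pair]]
print(f"{len(pairs)} matched pairs from {int(df['treated'].sum())} treated students")
# The matched file is then analyzed with standard t tests or chi-square tests.
```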

Although using PSM to generate matched groups improves the overall balance of the two groups across key characteristics and creates the highest number of matched pairs, the process can still leave the groups unbalanced on individual factors. So, for each evaluation, we ran balancing diagnostics. We split the students into groups based on participation in the treatment and created output tables with the descriptive statistics. We then checked standardized differences between the two groups on each factor as an indicator of which factors might be unbalanced. Although there is disagreement on the cutoff for these differences, 0.2 is typically considered acceptable, especially with smaller groups, though 0.1 is more broadly accepted (Austin, 2011). Exceeding the chosen threshold indicates that the groups are unbalanced on that characteristic.
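The balance check itself amounts to computing an absolute standardized mean difference for each covariate, as in the brief sketch below (again a Python analogue with synthetic data and hypothetical column names, not the SPSS output we used):

```python
# Illustrative sketch: absolute standardized differences between matched treated
# and control groups, flagged against a 0.1 threshold (0.2 for small groups).
import numpy as np
import pandas as pd

def std_diff(x_t: pd.Series, x_c: pd.Series) -> float:
    """Absolute difference in means divided by the pooled standard deviation."""
    pooled_sd = np.sqrt((x_t.var(ddof=1) + x_c.var(ddof=1)) / 2)
    return abs(x_t.mean() - x_c.mean()) / pooled_sd

rng = np.random.default_rng(2)
matched = pd.DataFrame({
    "treated": np.repeat([1, 0], 150),
    "hs_gpa":  rng.normal(3.0, 0.5, 300),
    "act":     rng.normal(22, 4, 300),
    "female":  rng.integers(0, 2, 300),      # binary covariate treated the same way here
})

t = matched[matched["treated"] == 1]
c = matched[matched["treated"] == 0]
for cov in ["hs_gpa", "act", "female"]:
    d = std_diff(t[cov], c[cov])
    flag = "unbalanced" if d > 0.1 else "ok"
    print(f"{cov}: {d:.3f} ({flag})")
```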

However, even when matched by score this way, the groups may not match closely on key nominal variables such as race, gender, or income level, particularly when there is some collinearity between two factors (e.g., race and income quartile). In each of our evaluations, we therefore chose to combine PSM on ratio-level factors with exact matching and coarsened exact matching on some nominal factors. In some instances, an evaluator may consider it important to match exactly on certain characteristics (such as gender) to ensure the groups are matched in a way important to the evaluation, or to match by groups (a coarsened exact match) in which the researcher employs grouped traits, such as income level, region, or race. We used SPSS to transform variables into groups (for race, for example) so that we could create larger numbers of coarsened matches by white/students of color rather than breaking out each group separately, which would reduce the number of matched pairs. Lastly, we used the sort functions in SPSS to create matched pairs based on the combination of the propensity score to the nearest 0.01 and exact or coarsened matches on the nominal factors we deemed important to have exactly balanced. Again, we created balancing tables to ensure the balance between the two groups was within the appropriate tolerance level.
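One simple way to implement this mixed design outside SPSS is to coarsen the nominal variables, form exact-match strata, and then run the caliper match within each stratum, as in the hypothetical sketch below (synthetic data and assumed variable names; our implementation relied on SPSS transform and sort functions rather than this code):

```python
# Illustrative sketch of the mixed design: coarsen race/ethnicity, match exactly
# on gender and coarsened race, and apply PSM within each exact-match stratum.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 600
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "gender":  rng.choice(["F", "M"], n),
    "race":    rng.choice(["White", "Black", "Hispanic", "Asian", "Two+"], n),
    "hs_gpa":  rng.normal(3.0, 0.5, n),
    "act":     rng.normal(22, 4, n),
})

# Coarsen race/ethnicity to two groups to preserve enough potential matches.
df["race_coarse"] = np.where(df["race"] == "White", "White", "Students of color")

# Propensity score from the ratio-level covariates only.
X = sm.add_constant(df[["hs_gpa", "act"]])
df["pscore"] = sm.Logit(df["treated"], X).fit(disp=0).predict(X)

caliper, pairs = 0.01, []
# Exact match on gender and coarsened race: candidates come only from the same stratum.
for _, stratum in df.groupby(["gender", "race_coarse"]):
    controls = stratum[stratum["treated"] == 0].copy()
    for t_idx, t_score in stratum.loc[stratum["treated"] == 1, "pscore"].items():
        gaps = (controls["pscore"] - t_score).abs()
        if not gaps.empty and gaps.min() <= caliper:
            c_idx = gaps.idxmin()
            pairs.append((t_idx, c_idx))
            controls = controls.drop(c_idx)

print(f"{len(pairs)} matched pairs with exact matching on gender and coarsened race")
```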

First Year Seminar Evaluation

In 2017, one of the authors conducted an evaluation of the impact of the university's first year seminar (FYS) on student GPA and retention. This evaluation was one of several conducted at the same time as part of an institutional initiative to make better use of data. Freshman seminars are common, offered at almost 90% of institutions (Padgett & Keup, 2011). Typically, they incorporate orientation, study skills, and/or academic programming designed to improve student academic success. Permzadian and Credé (2016) conducted a meta-analysis of the first year seminar literature and found that, overall, these seminars have a small positive effect on retention (Cohen's d = 0.11; less than 0.2 is a small effect size) but "almost no effect" on first year GPA (Cohen's d = 0.02). Nonetheless, they argued that these programs are worthwhile since even small percentage gains in a campus's outcomes reflect improved success for hundreds or even thousands of students.

Data and Initial Analysis

The university had introduced the first year seminar in Spring 2011, implemented it in Fall 2012, and was ready to evaluate program impacts on student success. The institutional research office provided data on freshmen from Fall 2009 through Spring 2016 for the analysis, so we were able to compare students who entered prior to the introduction of FYS with those who entered after its introduction (n = 2215). We conducted several analyses to try to determine the relationship between FYS participation and two designated outcomes: retention into fall of year two and cumulative GPA in semester three. We started with a simple between-group analysis comparing students who took FYS course(s) with those who did not on these outcomes. We conducted the analysis for all freshmen, then split the analysis groups into honors and non-honors students. The preliminary results, without controlling for covariates, indicated slight gains in retention and GPA for students participating in FYS versus those who did not (see Table 1). We ran an initial logistic regression with FYS participation as the independent variable, several covariates, and retention into second semester as the dependent variable to explore the relationship among these variables. The model explained only 5.2% (Nagelkerke R² = 0.052) of the variation in the dependent variable, and only high school GPA and race were significantly related to retention, not participation in FYS. We also ran a regression with GPA in semester three as the dependent variable (adjusted R² = 0.406). High school GPA, ACT, race, gender, and the interaction variable of FYS participation/gender/race were significant.

Table 1 Basic descriptive analysis: first year seminar group comparisons

PSM Design

As noted in the "overall approach" section, we hoped to improve the accuracy of the evaluation by employing propensity score matching. In this evaluation, participation in FYS was the independent variable, and the outcomes of retention in year two and GPA in semester three were the dependent variables. After preliminary analyses we eliminated a few factors that were either shown through regression not to be associated with the outcomes or shown to be redundant through tests for collinearity. The variables used for the initial propensity score match included gender, race/ethnicity, high school GPA, and cumulative ACT score.

Reviewing the balancing table data, we found that balance on high school GPA and ACT was strong, but there was substantial imbalance on race and gender (see Table 2). We decided that, given the importance of possible differential impacts on outcomes by race and gender, we did not want to compare unbalanced groups. So, we changed the approach to a combination of exact matching and PSM. We matched exactly on gender and race/ethnicity, combined with propensity score matching based on cumulative ACT and high school GPA. With this approach, we were able to match 1602 students, about 300 fewer than with the overall PSM match.

Table 2 Balancing tables for freshmen seminar group participation comparisons

Results: Why the Different Approaches Mattered

If we had stopped at the basic descriptive analysis (see Table 1), we would have found that retention appeared not to be affected by first year seminar participation: a non-significant 0.8 percentage point difference, with 77.7% of those who did not take FYS retained in year two versus 78.5% of those who participated in FYS (n = 2215). Similarly, GPA was not significantly different (2.90 for those who did not participate in FYS, 2.94 for those who participated). Regressions controlling for multiple covariates showed that FYS participation was not significantly associated with either outcome.

Conducting the analysis with standard PSM on all four characteristics (N = 1904) yielded slightly different results, with nearly the same GPA for those participating in FYS (2.730) versus those who did not (2.734), but retention rates slightly higher for those participating in FYS (78.5%) compared to those who did not (77.6%). Thus, we see a difference compared with the "no effect" result when not matched, though even the slight positive association in retention was considered "negligible" with a Cramer's V below 0.1 (Kotrlick et al., 2011; Rea & Parker, 1992). However, white students were heavily overrepresented in the control group, and African American and Hispanic students were overrepresented in the treatment group. Gender representation was less uneven, but we felt the differences could still be important.
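For reference, an effect size of this kind comes from a chi-square test on the participation-by-retention table. The sketch below shows the calculation in Python with illustrative cell counts chosen only to approximate the matched retention rates reported above, not the study's actual frequencies:

```python
# Illustrative sketch: chi-square test and Cramer's V for a 2x2 table of FYS
# participation by second-year retention. Counts are hypothetical.
import numpy as np
from scipy.stats import chi2_contingency

#                   retained  not retained
table = np.array([[747,       205],    # took FYS
                  [739,       213]])   # did not take FYS

chi2, p, dof, _ = chi2_contingency(table)
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))
print(f"chi2 = {chi2:.3f}, p = {p:.3f}, Cramer's V = {cramers_v:.3f}")
# A Cramer's V below 0.1 is typically read as a negligible association.
```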

Conducting the analysis with exact matching on race and gender and PSM on ACT and high school GPA, we were able to get exact matches on key characteristics, though we were not able to create as many matched pairs (n = 1602). Our results changed as well. Overall, 78.2% of those who did not take FYS were retained in year two, while 77.9% of those who took FYS were retained (see Table 3). This is a substantively small change (three students out of 1000), but it is nonetheless noteworthy for the comparison of the three models. Instead of the slight improvement for those participating in FYS indicated in the descriptive model, we saw a slight decline. More notably, though, there were important differences by gender and racial groups that we would have missed in both the descriptive and PSM-only models. For women, there was a slight improvement in retention for those participating in FYS (79.8%) compared to those who did not (78.1%). For men there was a decline in retention for those participating in FYS (75.1%) compared to those not participating (78.2%). Although this is a small difference (about three out of 100 students), this result indicates that the program appeared to be associated with a negative impact for male students rather than a positive one. Furthermore, when broken out by race/ethnicity, some groups who took FYS saw increases in retention, but others did not, notably Hispanic male students (no FYS 95.5% vs FYS 72.7%, about 23 students per 100) and Hispanic female students (no FYS 76.1% vs FYS 73.9%, only about two per 100), as well as African American male students (no FYS 86.8% vs FYS 79.1%, i.e., almost eight students per 100). All other groups saw improved retention for FYS participants versus their matched non-FYS peers. Here again, the differences were small, with a Cramer's V below 0.1 indicating a "negligible association" (Kotrlick et al., 2011; Rea & Parker, 1992), so we do not want to overstate their importance. However, this result did indicate an apparent negative differential impact for some groups of students of color, which ran counter to the purpose of the program.

Table 3 Matched comparisons: freshman year seminar (FYS) participation group comparisons after matching

The findings were also mixed for GPA. Again, the differences in mean semester three GPA were essentially negligible overall (effect size = 0.041) (Cohen, 1988). However, we saw some differentiation by combined race and gender groups. African American women who participated in FYS had statistically significantly lower GPAs (2.473 vs. 2.278, with an effect size of 0.245 indicating a small association with FYS participation), while white female students who participated in FYS had significantly higher GPAs (2.989 vs 3.108, with a very small effect size of 0.149). Latina women and African American male students saw non-significant lower GPAs for FYS participants, though the difference for African American males reflected a small effect size. All other groups had higher GPAs.

Impact on University Policy

The results of our analysis were fairly consistent with the findings of Permzadian and Credé (2016), showing some small but important contributions to student outcomes, but the benefits were unevenly distributed across groups. If the university had relied on the basic descriptive analysis, it might have been satisfied with the small gains in retention and GPA. After all, the literature indicated the gains might be small. Even the standard PSM approach showed much the same results, so it might not have led to changes. However, the matched pair design enabled us to identify the more uneven results across gender and race/ethnicity groups that might not otherwise have been apparent. Discovering these differences in outcomes for underrepresented students was important. The first year seminar program was intended to improve outcomes for all students, and these disproportionate impacts for students of color were concerning. Not surprisingly, the university has sought to improve the program. Using the analysis results and feedback from stakeholders, the university launched a revision process for the First Year Seminar program in the hopes of better serving all students. Changes considered and implemented include refocusing the seminar to better align with general education goals, adding greater diversity awareness to the content, enhancing the orientation for FYS faculty, and connecting the courses more closely to the Living Learning Community experiences.

General Education Engaged Citizenship Common Experience Evaluation

Program Background

The second evaluation we have included is of an upper division educational program intended in part to help reduce racial bias and improve cultural understanding. Many institutions include general education outcomes related to reducing racial bias, developing citizenship, and improving cultural awareness (Blankenberger et al., 2017; Cohen, 2013; Furman, 2013), but assessing these is challenging. For this program analysis, one of the authors, along with several co-authors, analyzed the impact of the university's upper division engaged citizenship common experience (ECCE) coursework on racial bias (Blankenberger et al., 2017). At the university, the general education program includes ten semester hours of upper division ECCE courses: a one semester hour speakers series course, along with three courses in two of the following three categories, global awareness (GA), U.S. communities (USC), and engagement experience (mostly internships and research projects). These courses stress acquiring greater understanding of diverse cultures in the U.S. and in the world. For this analysis, we measured racial bias as one proximate indicator of ECCE outcomes. We selected the color-blind racial attitude scale (CoBRAS) as the instrument to collect student attitudes of racial bias. The CoBRAS assessment tool is intended to "examine the degree to which individuals deny, distort, and/or minimize the existence of institutional racism" (Neville et al., 2006, p. 280). The instrument is a self-report measure consisting of 20 items rated on a 6-point Likert-type scale ranging from 1 (strongly disagree) to 6 (strongly agree). Scores on the instrument (Neville et al., 2000) range from 20 to 120, with higher scores indicating greater racial bias.

Data and Initial Analysis

We collected CoBRAS results through an online survey and merged the results with student demographic data through the university's institutional research office. We obtained 584 usable records out of about 3000 total undergraduates. Similar to the first year seminar study, we employed a between-subjects design (Gravetter & Forzano, 2009) comparing two different groups, those who took ECCE courses and those who did not. We created several comparison groups to enable us to determine whether participation in certain ECCE courses was associated with better CoBRAS racial bias scores (see Tables 4, 5, 6). In the preliminary analyses, we conducted t-tests and ANOVAs to compare the mean CoBRAS racial bias scores of different groups, such as those who had taken an ECCE U.S. communities course versus those who had not, those who had taken an ECCE global awareness course versus those who had not, and those who had completed three of their required ECCE courses versus those who had not. Several of the group differences were significant, indicating an apparent benefit to taking ECCE courses (see Table 4), but with no covariates to control for group differences, these findings are dubious.

Table 4 Group comparisons with no matching or covariates
Table 5 Balancing tables for ECCE group comparison examples
Table 6 Matched group design: all matched students 25 and under

Following the preliminary analyses, we performed a multivariate regression analysis to control for the impact of covariates. Several variables were significantly associated with change in CoBRAS racial bias scores, including passing a global awareness course (β = − 0.193, p = 0.007) and passing a U.S. communities course (β = − 0.154, p = 0.021), as well as being a social science/humanities major (β = − 0.228, p = 0.014), gender (β = 0.147, p = 0.012), and race/ethnicity (β = 0.397, p = 0.000) (Blankenberger et al., 2017).

PSM Design

Initially, we conducted a matching procedure using propensity score matching on the characteristics of gender, race/ethnicity, major, age, high school GPA, and cumulative ACT score combined. The matching process resulted in unequal distributions across groups, particularly on race/ethnicity, age, and gender (see Table 5). Since race, age, and gender were significantly associated with CoBRAS scores, we changed our approach to a combination of exact and coarsened matching on these traits, along with propensity score matching. We used an exact match on gender and coarsened matches on race (white/non-white) and age (25 or younger/26 and older), combined with propensity score matching for prior academic ability incorporating high school GPA and cumulative ACT. The age cutoff separated those born before 1990 from those born after. We could have used the common 25+ cutoff employed in federal education data reporting, but our data file had birthdates for age, so it was easy to split by birthdate. Because of the limited overall cases, we already had to collapse or coarsen the racial groups into white/students of color, and we were concerned about diminishing the possible impact even further. We were also concerned that we might not be able to match many cases, limiting the results of our analysis. Fortunately, with the full PSM process we matched 122 students, and we were still able to match 102 when employing the mixed PSM and exact matching process. Differences in size for the global awareness and U.S. communities matched pairs were also minimal. Nevertheless, the small numbers limited the strength of the findings. Unfortunately, there were too few students in the over-25 subgroup to make those results very meaningful, and no relationships were significant. In part this was because missing data in the older students' records limited our ability to find matched pairs.

Results: Why the Different Approaches Mattered

The results for matched students 25 and under indicated a positive relationship between taking global awareness and U.S. communities courses and improved CoBRAS scores (see Table 6). This was especially true for science/business/math/computer science majors. We chose to compare this group of majors to social science/humanities majors since the latter were more likely to have already experienced repeated exposure to the type of course content that ECCE courses include. Students taking both a GA and a USC course had significantly lower (improved) racial bias scores (47.30, p = 0.000, Cohen's d = 0.857, indicating a large effect size) than their matched peers who did not participate in either (61.14). For science/business/math/computer science majors, the difference was even larger (41.82 vs. 66.82; p = 0.020; Cohen's d = 1.555, indicating a very large effect size) than for other majors (Blankenberger et al., 2017).

In this evaluation, if we had only conducted the basic descriptive analysis, we would have found evidence of small improvements in the CoBRAS racial bias scores, and the multivariate regression controlling for covariates would have shown that ECCE global awareness and U.S. communities course participation were significantly associated with improved CoBRAS scores. However, using the matching techniques enabled us to uncover further depth. By creating matched pairs based on gender, race, age, and major, we were able to show that the effect size was very high for students taking both global awareness and U.S. communities courses, that even taking one course in global awareness yielded a large effect size, and that one course in U.S. communities displayed a moderate effect size. Further, the improvement was much stronger for students in science/business/math/computer science majors than for others.

Impact on University Policy

The results of the analysis and its impact on the ECCE program are still being considered by the university, but overall it was important to see that the program appeared to be positively associated with an important university learning outcome: reducing racial bias. Of course, the small number of cases and the inherent limitations of using any such instrument to measure racial bias temper the results, and additional analyses with more students are needed, but these evaluation results provided some support for the benefits of the ECCE program. Although not surprising, it was also helpful to note the apparently greater role ECCE participation played for certain majors and for White students, particularly males, though clearly more study is needed.

The university is presently revisiting the need for the ECCE in its current form. ECCE currently involves a 10-credit hour commitment, which many departments feel is too much of a burden for students, especially since most of the university's undergraduates are transfer students. Prior to this analysis, little was known about student outcomes in the program. Without this analysis, it would be easier to dismiss the value of ECCE, so in that sense, any analysis showing that ECCE is associated with some positive outcome is important data. The descriptive analysis did indicate a small effect and small gender, major, and age differences, as well as larger racial differences in CoBRAS racial bias scores. Furthermore, the regression indicated significance for these factors. However, the PSM analysis provided a richer level of analysis and stronger controls for the covariate factors. Knowing the apparent benefit of taking both a global awareness and a U.S. communities course, that the global awareness course is associated with a larger effect size than the U.S. communities course, and that there is an extremely large effect size for science/business/math/computer science majors is critical to the discussion about what the university does next. The latter is especially important given that the push for reducing ECCE hours is coming from those majors. Additionally, if the university decides to reduce the ECCE hours but not eliminate the requirement, the effect size differences in the subgroups are valuable data. Again, we do need to keep in mind the limitations of the evaluation (especially the small "n") and that this is only one outcome, but this level of analysis has been considered more beneficial in the ongoing discussions than a purely descriptive non-experimental analysis.

Living Learning Community Evaluation

Program Background

The third evaluation is of the university's living-learning communities (LLCs). The university offers students the opportunity to participate in one of three LLCs: the capital scholars honors program (CAP), a traditional honors program; the necessary steps mentoring program (NS), created to assist first-generation college students; and students transitioning for academic retention and success (STARS), which is intended for students identified as academically at-risk (Gehlhausen Anderson, 2019; Gehlhausen Anderson & Blankenberger, 2020). Learning communities and living-learning communities are intended to improve student outcomes by building a sense of community and supporting students academically and socially (Inkelas & Weisman, 2003). They take different forms, but generally involve grouping targeted students together in classes, residence halls, and/or curricula in order to build a sense of community and give students the academic and social support they need (Inkelas et al., 2007; Zhao & Kuh, 2004). LLCs have shown some positive impacts on students' experiences and academic outcomes (Inkelas & Weisman, 2003; Inkelas et al., 2007; Pasque & Murphy, 2005; Stassen, 2003; Zhao & Kuh, 2004), providing a sense of community, additional academic support, and the opportunity to interact with peers, staff, and faculty (Bean & Eaton, 2001), and they have been associated with higher levels of college engagement, stronger academic outcomes, a greater sense of belonging, improved retention and graduation rates, and more positive perceptions of the college and residence hall environment (Cambridge-Williams et al., 2013; Inkelas & Weisman, 2003; Spanierman et al., 2013; Stassen, 2003; Zhao & Kuh, 2004). However, living-learning community participation has also shown mixed impacts on measures of academic achievement, retention, and graduation, particularly across racial and ethnic groups (Cambridge-Williams et al., 2013; Noble et al., 2007; Pasque & Murphy, 2005; Purdie & Rosser, 2011).

Data and Initial Analysis

We conducted an evaluation of the relationship between participation in Living Learning Communities (LLCs: capital scholars honors, necessary steps mentoring, and students transitioning for academic retention and success) and student retention and college GPA (Gehlhausen Anderson, 2019; Gehlhausen Anderson & Blankenberger, 2020). Students must meet eligibility criteria but are not required to take the program, so selection bias is an inherent validity concern. Hence, using PSM to try to simulate equivalent groups is especially important here, but this self-selection issue will always be a limitation. Additionally, the numbers are limited: the university was created as an upper division institution and only began admitting freshmen in 2001, and freshman numbers remain low, only about 300 of 3000 undergraduates.

We employed a between-subjects design for the analysis (Gravetter & Forzano, 2009), comparing LLC participants with non-participants. We used three cohorts (2013–2015) for the analysis (N = 577; no LLC = 208, Honors = 223, NS = 75, and STARS = 71). As with the other evaluations, we initially conducted a basic descriptive analysis without controlling for covariates, then conducted a regression analysis to determine which factors were related to the outcomes. We found mixed results regarding participation in an LLC and improved retention and GPA. For the preliminary analysis in our matched pair design, we used no covariates and did not create matched student pairs (n = 575). Without controlling for covariates, results were fairly consistent, with a significant difference only between Honors and no-LLC students in retention into semester seven (see Table 7). GPA differences were more common. However, we knew we were comparing unlike students, so there was little to gain from these results. So, we ran a series of regressions that included covariates. The binary logistic regression model for third semester retention was significant overall (Nagelkerke R² = 0.183), but participation in the living learning communities was not a significant predictor. Similarly, the model for seventh semester retention was significant overall (Nagelkerke R² = 0.216), but in this case participation in CAP honors was significantly associated (Exp(β) = 0.51), while the other two were not. The regression models for third and sixth semester GPA were both significant (adj. R² = 0.508 and 0.487, respectively). In both models, the only LLC significantly associated with the GPA outcomes was necessary steps participation (standardized β = 0.111 and 0.073, respectively). So, while controlling for other variables, LLC participation appeared to have little to do with improved student outcomes. However, several factors were significantly associated with the outcomes, providing key characteristics for the PSM analyses.

Table 7 Matched group design: initial comparisons without covariates

PSM Design

We followed up with an analysis using propensity score matching on the characteristics of gender, race/ethnicity (white vs. non-white), Pell status (ineligible vs. eligible), intent to complete (transfer vs unsure/intend to complete), validation score [using data from a Mid-Year Student Assessment™ (Ruffalo Noel Levitz, 2015) to assess students’ sense of validation after they had spent two months at the institution], high school GPA, and cumulative ACT score combined (Gehlhausen Anderson, 2019; Gehlhausen Anderson & Blankenberger, 2020). Again, we discovered that the matching process yielded unequal distributions in the comparison groups, particularly on race/ethnicity, gender and Pell eligibility, but we were able to match a much larger number of students since the pairs did not have to match exactly on the four key factors (see Table 8).

Table 8 Balancing tables for living learning communities group comparisons

Results were mixed in the full PSM analysis (see Tables 9, 10). Chi-square tests of association showed no or even negative associations between LLC participation and retention. For both STARS and Capital Scholars Honors participants, there were significant negative differences from their matched non-participant peers for third semester retention. Necessary Steps students showed an unforeseen result, with significantly lower seventh semester retention than non-Necessary Steps students. For all three significant negative associations, the Cramer's V was low, indicating weak to negligible strength of association (Kotrlick et al., 2011; Rea & Parker, 1992). Similarly, participation in the LLCs was not significantly positively associated with GPA at either third or sixth semester, although there were significant negative differences for CAP honors students for both third semester GPA (M = 3.014, compared with M = 3.299 for their non-CAP matched peers) and sixth semester GPA (M = 2.923 vs M = 3.303 for non-CAP students). The effect sizes indicated a weak strength of association for the former and a moderate one for the latter.

Table 9 Comparison of results by matching technique: retained/graduated semesters 3 and 7
Table 10 Comparison of results by matching technique: cumulative GPA semesters 3 and 6

Again, for this analysis we were concerned about the uneven group distributions on the nominal variables after the PSM match. These factors had been significant in the regression models and could have altered the results. For example, the control group included nearly ten more students who answered "intends to complete" (rather than transfer) than the CAP Honors group, nearly identical to the difference in retention numbers. Furthermore, we knew that many of the honors students begin at the university but intend to transfer to sister institutions in the system after a year or two, so we needed to account for that. Similarly, the race, gender, and income factors were correlated with the student outcomes and could change the results as well, particularly given the large proportion of non-white students in two of the LLCs. So, we altered our approach to match students exactly on gender, Pell status, and intent to complete, with a coarsened match on race/ethnicity (white versus students of color), combined with propensity score matching for validation score and prior academic ability based on cumulative ACT and high school GPA. We were very concerned that the small number of cases would limit the strength of our results with the coarsened/exact match, but we felt it would benefit the overall analysis to balance the groups and see what outcome differences might result.

Results: Why the Different Approaches Mattered

There were several differences in the revised analysis. The most notable was the elimination of the negative retention result we had found for the CAP Honors students. As we conjectured, it appears that the difference could simply reflect the rebalancing on the "intent to complete" factor. However, the challenge of exact/coarsened matching on so many factors severely curtailed the number of matches we could make. We will need to revisit the analysis in upcoming years once we have more data available. Similarly, the statistically significant difference for Honors students on GPA evaporated. Also, Necessary Steps students had significantly better retention in semester three and in semester seven compared to their matched peers, with moderate associations for each (Cramer's V = 0.300 and 0.265, respectively). None of the other matched groups saw superior performance. However, the STARS students were paired with a group of students who were all retained into the third semester. This is likely an anomaly resulting from random matching and very small numbers (N = 20), something we were concerned might happen before beginning the analysis. It illustrates a potential weakness of exact match designs that yield relatively small numbers of matched pairs. To try to improve the match result and address this anomaly, we ran a propensity score match that combined Necessary Steps and STARS participants. We felt justified in this merger because the NS and STARS groups have similar demographic backgrounds. This yielded 32 matched pairs (64 of 575 total students), and the LLC students performed better than their matched peers in both semesters three and seven, though not significantly so. The revised combined group match did not change the results for retention, but it did yield a statistically significant difference in semester three GPA with a medium effect size (no LLC 2.432 GPA, combined NS and STARS group 2.822 GPA, Cohen's d = 0.535).

Ultimately, if we had stopped at the basic descriptive analysis, we would have reached a different conclusion in the evaluation of the program. First, the initial group comparison was not very worthwhile since the groups in the Living Learning Communities are not typical of the overall student body. Second, the regression results indicated the LLCs were largely not associated with the outcomes, though several other factors were. Identifying these factors was useful in creating the PSM analyses. The PSM with all covariates yielded more useful results, but the Honors results were stronger after the exact/coarsened match, which included the match on "intent to complete," as were the results related to GPA gains for the combined Necessary Steps and STARS groups.

Impact on University Policy

We presented the results of the analysis to stakeholder groups, but the results must be considered preliminary. They are useful for suggesting areas for further study, but given the limited number of records available for analysis, we plan to conduct additional analyses before any stronger conclusions can be reached. However, there are several important themes that we noted. First, "intent to complete" appears to be an important factor for Honors students. The Honors LLC program would have preferred more positive results, but recognizing the transfer patterns of its students was valuable information. The data on reduced retention from the PSM-only model would have been especially troubling, so the more neutral effect found in the PSM with exact matching model was more positive. The finding from the PSM with exact matching model of a positive association of Necessary Steps with retention and higher GPA, which was not uncovered in the PSM-only model, was helpful, though the neutral findings related to STARS were disappointing. The administration has not made any major changes to the LLC programs related to the analysis, but it has contributed to discussions about the delivery and effectiveness of the programs.

Conclusion and Discussion

We have answered the first three research questions related to the program evaluation results themselves. We have also answered the fourth question regarding the impacts on policy and institutional decision making that result from the different approaches. In all three cases the evaluation results differed based on the techniques employed. We have summarized the results in Table 11 for easier side-by-side comparison. The basic descriptive non-experimental designs provided some, but very limited, information about the potential effectiveness of the programs. For the analysis of the first year seminar, the initial descriptive analysis revealed small improvements in retention and GPA. The regression models indicated no relationship between FYS participation and student retention or GPA. The quasi-experimental approach mixing propensity score matching with exact matching proved to be much more useful. The standard PSM approach revealed about the same results as the descriptive analysis, but the exact matched pair design enabled us to identify the uneven results across race and gender that would not otherwise have been apparent. Discovering these differences in student outcomes for underrepresented students was decidedly important and has led the university to launch a substantive change to the First Year Seminar program, one it would not likely have launched had it not been for the more detailed analysis.

Table 11 Comparison of the approaches and results across cases

For the engaged citizenship common experience evaluation, if we had only conducted the basic descriptive analysis, we would have found evidence of slight improvements in the CoBRAS racial bias scores. Similarly, the multivariate regression with covariates showed that global awareness and U.S. communities course participation were significantly associated with improved CoBRAS racial bias scores. The regression also indicated that race, age, gender, and major mattered, so using the unbalanced groups that resulted from the full-PSM analysis was not satisfactory for our study. Using the PSM, CEM, and exact matching techniques enabled us to add depth to the analysis: we were able to show that the effect size was large for students taking both global awareness and U.S. communities courses, that taking even one course in global awareness was associated with a medium-large effect size, and that one course in U.S. communities was associated with a small effect size. Most importantly, we found that the improvement was much stronger for students in science/business/math/computer science majors than for others. We were concerned about the small "n" sizes for the study and about the limitations of such attitudinal results. Nonetheless, these more nuanced results have proven to be very helpful, particularly since the university is currently involved in a contentious review of the ECCE program and there had been limited data regarding the relative success of the program.

In the living learning communities evaluation, the initial non-experimental descriptive analysis was not useful since the groups in the LLCs could not be appropriately compared to the overall student population. Regression models provided good data on other factors associated with the student outcomes of retention and improved GPA, but only the CAP Honors LLC was associated with retention and only the Necessary Steps LLC with GPA. The propensity score matching with all covariates yielded more useful results. However, the Honors results were more compelling after the exact/coarsened match was conducted, since this accounted for the imbalance in student "intent to complete" as well as other key traits that were not balanced. Similarly, in the coarsened/exact match design, the Necessary Steps students had significantly better retention in semesters three and seven compared to their matched peers, with moderate effect sizes. However, we were unable to match many students exactly since there were four important factors on which to match. This is a severe limitation, and we will need to conduct more analyses as more data become available. The university is still considering what the results mean for potential changes to the living learning communities, but if the negative results for the Honors program from the first PSM had held up in the coarsened/exact analysis, this might have resulted in quicker changes. Additionally, the STARS and Necessary Steps programs have greater reason to compare what they are doing differently, since one appears to have been more successful even though their student profiles are very similar.

As for the last research question, we have noted some comparative strengths and weaknesses among non-experimental designs, regression models, standard propensity score matching, and the mix of coarsened and exact matching with PSM to create matched groups for program evaluation. First, as expected, matched pair designs provided much more valuable information than simple unmatched group comparisons. In all three instances, the unmatched group comparisons yielded results that were not supported once the matched pair designs were conducted. Although program evaluations are often hampered by limited data access, when the data are available to control for important potential confounding factors by creating matched pairs, this should be done to improve the analysis results. Although evaluators are restricted by what data are available, better program evaluation includes the factors that could affect successful achievement of identified program outcomes.

Second, regression analysis alone was sufficient to show some significant relationships between the program treatments and the outcomes, as well as with some covariates, but in some instances it missed important relationships. In the ECCE general education evaluation, the regression showed essentially the same association with outcomes as the matched pair designs. However, in the freshman seminar evaluation it showed no relationship to the outcomes, unlike the matched pair designs. Further, in the living learning community evaluation, it indicated different results related to associations than the matched designs did. Nevertheless, we recommend running preliminary regressions in each case. These are useful not only for providing analysis of associations between the treatments and outcomes, but also for identifying which covariates are related to the outcomes, and hence give an evaluator guidance as to which factors are key for the matching process.

Third, using PSM to combine multiple ratio and nominal covariates into a single predicted probability score produced an improved analysis of impacts compared to basic comparisons. PSM matched pairs yielded more useful analysis results. However, checking for balance between the groups is critical, particularly on nominal variables. The PSM process often leaves groups unbalanced, and if the imbalance falls on important factors, it can alter the analysis.

Fourth, we found that using coarsened and/or exact matching, particularly on race and gender, but also on factors like age, income, and major, was a valuable addition to standard PSM. Using PSM on larger populations may make the differences in unbalanced groups negligible, but for program evaluations with more limited datasets, it is sometimes better to mix in different matching approaches for certain characteristics. For example, in our first year seminar study, we achieved a very different understanding of the outcomes associated with the program once we had created exact matches on race and gender. Similarly, without refining the matches on age and major, we may have missed critical information in the analysis of ECCE outcomes. This observation supports what some critics of PSM have suggested about the use of exact matching on categorical variables as a better option for creating matched groups (Iacus et al., 2012; Imai et al., 2008; Stuart, 2010).

However, there are limits to the usefulness of exact and/or coarsened matching when evaluating programs with relatively small numbers of individuals receiving the treatment or intervention and/or relatively small numbers of potential comparison group members. Each time we used exact matching, we lost cases, potentially diminishing the value of our results. In some instances, we lost only about one-fifth of our cases, but in others, we lost half or more. We also found that we had to collapse or coarsen what could be important data categories when the numbers became too small. This was particularly true for race: we were able to make separate comparisons for several racial groups in the First Year Seminar study, in which we had 2215 cases, but we had to collapse race to white/students of color for the other two studies. This is of course an oversimplification of the dynamics within these groups, and we may be missing important differences of the kind we found in the FYS analysis. Small numbers also increase the likelihood of reaching a spurious result, as in our Living Learning Community analysis, in which we had just 20 STARS students left after the exact match, too few to make this anything but a preliminary analysis. Lastly, adding too many exact categorical matches greatly increases the complexity of the matching process. A researcher must carefully weigh the importance of the categories being considered for exact matches against the time available for the analysis. Our analysis of the Living Learning Communities with four categorical exact matches was much more time-consuming than the analyses with two categorical matches. In program evaluation, the availability of time is often just as important a limitation as any other constraint.