Predicting STEM Major Choice: a Machine Learning Classification and Regression Tree Approach

Despite the increasing demand for professionals in science, technology, engineering, and mathematics (STEM), only a small portion of young people in the USA pursue a postsecondary degree in STEM. To identify the major predictors of STEM participation, this study uses a machine learning approach, a Classification and Regression Tree (CART), to analyze a wide range of individual, family, and school factors obtained from national survey data of US high school freshmen in fall 2009 who eventually enrolled in STEM college majors by 2016. The analytic results indicate that calculus credits, science identity, total STEM credits, and math achievement are the most predictive factors during the high school years of college STEM major selection. The CART-based tree also shows how these four variables interactively predict the likelihood of students enrolling in STEM college majors.

self-efficacy, and financial support, that are related to STEM college major choice and career aspirations (Mau & Li, 2018; Wang, 2013; Wille et al., 2020). Less clear is which factor plays a relatively more significant role for adolescents who choose a STEM college major in pursuit of a STEM-related career. Identifying the most predictive factor(s) of STEM college major choice among high school students has important implications for efforts to increase STEM participation. This study addresses that critical research gap by analyzing the US nationally representative High School Longitudinal Study of 2009-2016 (HSLS:09-16) (National Center for Education Statistics [NCES], 2018) to identify individual, family, and school factors in adolescence that are most predictive of students entering a postsecondary STEM degree program.
The HSLS:09-16 study began with more than 23,000 US ninth graders (high school freshmen), their parents, math and science teachers, school administrators, and school counselors in fall 2009 (NCES, 2018). It collected a broad range of STEM-related variables, including both intrinsic factors (such as math interest and science identity) and extrinsic/behavioral factors (such as STEM course-taking and afterschool program participation) (NCES, 2018). We employed the Classification and Regression Tree (CART) algorithm, which uses a machine learning approach that permits auto-selection and furnishes the results with a tree structure, to help visualize how STEM-related variables influence students' decision-making related to STEM major choice (Steinberg & Colla, 2009a, b). This study is one of the first to apply the CART method to uncover the most predictive factors that influence pursuit of STEM degrees based on hundreds of variables in a nationally representative, longitudinal study.

Literature Review
Increasing opportunities for learners to choose STEM careers is a national priority (National Science Foundation, 2020). Given that most STEM workers (72.3%) have a college degree in STEM (U.S. Census Bureau, 2019), investigating the factors that influence a high school student's choice to pursue a STEM college major can help address this priority. Previous studies have identified various pre-college factors that might be associated with STEM college major choice. Students' demographic and family backgrounds are major factors. Specifically, female students, racial and ethnic minorities, and economically disadvantaged students tend to show lower interest in pursuing careers in STEM (Riegle-Crumb & Morton, 2017). Parents' occupations and involvement also influence students' STEM learning and career development (Howard et al., 2019; Moakler & Kim, 2014). Other factors include students' performance and motivation in STEM (Eccles, 1983, 2009; Lent et al., 1994; Wang, 2013), as well as their learning experiences and context factors in high school, such as school location (Saw & Agger, 2021), teacher quality (Althauser, 2015; Lee et al., 2015; Park et al., 2019), extracurricular opportunities (Kitchen et al., 2018; Franco & Patel, 2017; Means et al., 2016), and STEM course-taking (Gottfried & Bozick, 2016).
Although previous studies have collectively identified a broad range of factors that could potentially affect STEM college major choice, each study has covered only limited aspects due to the scope of the research. Some studies suggest that future research should include more potential exogenous variables, such as science-related motivational factors rather than merely math-related expectancy-value constructs, to investigate the links between these factors and the pursuit of STEM career pathways (Gottfried & Bozick, 2016; Wang, 2013; Wille et al., 2020). In practice, all of these identified factors work simultaneously throughout the STEM career development process. Therefore, it is important not only to investigate what factors can predict the choice of a STEM college major, but also to identify which factors play a relatively more significant role than others in predicting that choice.
To fill this literature gap, the present study includes a wider range of predictors collected by the HSLS:09-16 study. For students' demographics, we included predictors such as socioeconomic status (SES), gender, and race/ethnicity, as previous studies have shown that female students, racial and ethnic minorities, and low-income students are less likely to pursue STEM careers (Riegle-Crumb & Morton, 2017). For students' family backgrounds and parental involvement, we selected predictors such as parents' occupations and their support for math and science homework, as well as in-school and out-of-school STEM activities. These parental factors could benefit students' learning and career development in STEM by providing them with greater exposure and opportunities (Howard et al., 2019; Moakler & Kim, 2014). For students' career aspirations, motivation, and performance in STEM, we selected variables including their career and education goals, math and science self-efficacy, utility, identity, interest, and cost, as well as a range of performance measures (e.g., math standardized scores, GPA, SAT, ACT, AP and IB scores in STEM), based on expectancy-value theory, social cognitive career theory, and prior research (Eccles, 1983, 2009; Lent et al., 1994; Wang, 2013).
For teacher quality, we included unobserved factors that are critical to students' STEM learning achievement, such as math and science teachers' perceptions of professional learning communities, self-efficacy, expectations, collective responsibility, and principal support (Althauser, 2015; Lee et al., 2015; Park et al., 2019). For school location, we selected urbanicity and geographic region as predictors, given the geographic disparities in postsecondary STEM participation (Saw & Agger, 2021). For extracurricular opportunities, we included variables such as whether a school offers STEM-related programs (e.g., supporting underrepresented students in STEM and informing parents about college majors and careers in STEM), which may benefit students pursuing careers in STEM (Kitchen et al., 2018; Franco & Patel, 2017; Means et al., 2016). Since high school STEM course completion is positively linked to college major choice (Gottfried & Bozick, 2016), we included a list of STEM courses taken as predictors.
No prior studies have included such a large number of relevant variables to explore factors that could predict the choice of a STEM college major among high school students. The CART algorithm is a powerful tool for identifying factors with the most predictive power while unveiling how the selected factors interactively predict STEM college major choice. It has never been applied in prior studies, which examined substantially smaller numbers of variables using traditional analytic approaches (e.g., logistic regression in Lee's (2015) study, multilevel logistic regression in Bottia and colleagues' (2017) study, and structural equation modeling in Wang's (2013) study). By examining a wide range of potential predictors and applying this advanced technique, this study could provide educators and policymakers with new perspectives and insights into which factors could be relatively more important in predicting the choice of a STEM college major among high school students.

Sample and Measures
The eligible sample from HSLS:09-16 is composed of 11,560 US high school students who participated in the 2009 base year, the 2012 first follow-up survey, and the 2013-2014 update and high school transcript collection, and who then reported their college majors in the 2016 follow-up survey. About 23% of these students majored in STEM. Guided by prior studies, we selected a wide range of 102 variables, including individual, family, and school factors. These variables are used simultaneously to predict students' college majors as either STEM or non-STEM. The list of variables for this study is provided in the Appendix.

Analytic Strategy
We employed the CART algorithm, implemented using the R package rpart (Therneau & Atkinson, 1997), to capture the complex mechanism of students' decision-making with regard to enrolling in STEM college majors by identifying a set of factors and explaining how those factors predict the students' decisions about enrolling in STEM majors. The algorithm was chosen due to its desirable properties: (a) it does not require strong model assumptions, which are typically needed when using traditional regression models; (b) it automatically identifies the important predictors and their linear/nonlinear relationships with outcomes (Lee et al., 2010; Steinberg & Colla, 2009a, b; Timofeev, 2004); (c) it is able to handle missing data without extra imputation procedures (Deconinck et al., 2005; Feelders, 1999; Verbyla, 1987); and (d) it is an interpretation-friendly algorithm compared with other "black-box" data-mining techniques.
First, to build a CART-based tree, the Gini index (Breiman et al., 1984; Steinberg & Colla, 2009a, b) was used to automatically select the important independent variables. The maximum depth of the tree was set at 30. Cost-complexity pruning (Breiman et al., 1984), with the complexity parameter set to 0.1, was used in the pruning process. Surrogate splitting (Feelders, 1999) was used to handle missing data for independent variables. Through these settings, the algorithm produced a pruned tree to predict the probability that a given student will declare a college major in STEM based on the selected predictors.
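To make the Gini-based variable selection concrete, the following is a minimal Python sketch of how a single binary split can be chosen by minimizing the weighted Gini impurity of the two child nodes. It is an illustration only, not the rpart implementation used in the study: the function names and the toy data are hypothetical, and survey weights, surrogate splits, and pruning are omitted.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Find the threshold for the rule `value < threshold` that minimizes
    the size-weighted Gini impurity of the two child nodes.

    Returns (threshold, weighted_gini); threshold is None if no split
    improves on the impurity of the unsplit node.
    """
    n = len(labels)
    best = (None, gini(labels))
    # Candidate thresholds: midpoints between consecutive distinct values.
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x < threshold]
        right = [y for x, y in zip(values, labels) if x >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical toy data: calculus credits and STEM major (1) vs. non-STEM (0).
calculus_credits = [0, 0, 0, 0, 1, 1, 2, 1]
stem_major = [0, 0, 0, 1, 1, 1, 1, 0]
threshold, impurity = best_split(calculus_credits, stem_major)
```

In a full CART fit, this search is repeated over every candidate predictor at every node, and the predictor with the largest impurity reduction is selected, which is how a small number of variables can emerge from a much larger pool.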
Second, to avoid model overfitting and to evaluate predictive accuracy, the sample was split into training and testing datasets using the 80/20 rule (Anis et al., 2015; Zheng, 2004). Specifically, we used random sampling without replacement to select 80% of the sample (N_train = 9,248) as the training data for developing the CART-based tree. The remaining 20% of the sample (N_test = 2,312), which was not exposed to tree development, served as the testing data to evaluate the predictive accuracy of the tree. In other words, we established the statistical model used to predict the outcome using the training dataset, and we used the testing data to validate the prediction through the established model. Predictive performance was measured by classification accuracy on the testing data. A sensitivity analysis using random forest analysis was conducted to evaluate the consistency of the CART results. The CART algorithm also applied the student longitudinal analytic weight provided by HSLS:09-16. Hence, the results are weighted to represent US ninth graders in fall 2009.

Results
Figure 1 shows the output of the final CART-based tree, which predicts the probability of a student declaring a STEM college major. Out of all the independent variables, only four are deemed relatively more important and are automatically selected to construct the final tree: credits earned in calculus during high school, science identity in grade 11, total STEM credits earned during high school, and math achievement in grade 11. These four variables therefore play relatively important roles in predicting a student's choice of a STEM college major.
With the final four predictors, the training sample was split into five groups. As illustrated in Fig. 1, Group 1 is students (accounting for 81% of the high school students) who did not earn any credits in calculus and have a low probability of majoring in STEM (prob. = 0.16). Group 2 is students (accounting for 7% of the high school students) who earned credits in calculus and had a percentile rank (PR) for science identity in 11th grade < 74, and also have a low probability of selecting a college major in STEM (prob. = 0.22). Group 3 is students (accounting for 5% of the high school students) who earned credits in calculus, had a PR for science identity in 11th grade ≥ 74, earned fewer than 9.8 STEM credits during high school, and had a PR for math achievement scores in 11th grade < 97, and have a probability of 0.37 of declaring a STEM college major. The remaining two groups have average probabilities larger than 0.5. Group 4 is students (accounting for 2% of the high school students) who had credits in calculus, a PR for science identity in 11th grade ≥ 74, total STEM credits < 9.8, and a PR for math achievement scores in 11th grade ≥ 97, and show the highest probability of enrolling in a STEM major in college (prob. = 0.68). Group 5 is students (accounting for 5% of the high school students) who earned credits in calculus, had a PR for science identity in 11th grade ≥ 74, and earned at least 9.8 credits in STEM (i.e., X3TCREDSTEM ≥ 9.8), and also have the highest probability of selecting a college major in STEM (prob. = 0.68).
Note on percentile ranks: The HSLS:09-16 provides Z scores for science identity (X2SCIID) and T scores for math achievement (X2TXMTSCOR). To make these two standardized scores more comprehensible when interpreting the results, while also keeping the interpretation consistent across these two measures, we present the percentile ranks (PR) for these two measures, converted from the Z scores and T scores.
In summary, if high school students do not earn any calculus credits, the likelihood of majoring in STEM disciplines will be only 16%. Furthermore, even if students earn calculus credit(s), their chance of pursuing a postsecondary STEM degree will still be low (22%) if they do not exhibit a high science identity in the 11th grade (PR < 74). On the other hand, if students earn calculus credit(s) and have a high level of science identity in the 11th grade (PR ≥ 74), the likelihood of enrolling in a STEM college major will increase substantially (from 16 to 37%). Interestingly, the probability of students majoring in STEM will be boosted to 68% if students earn calculus credit(s), have a high level of science identity in the 11th grade, and either earn at least 9.8 credits in STEM-related courses or have high math achievement in the 11th grade (PR ≥ 97).
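The five groups can be read as a short sequence of nested rules. The Python sketch below encodes the thresholds and probabilities reported above, together with one possible way of converting Z scores and T scores to percentile ranks. The conversion is an assumption for illustration: it treats the scores as normally distributed and assumes the conventional T-score scale (mean 50, SD 10), which may differ from the procedure actually used in the study. All function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, SD 1


def pr_from_z(z):
    """Percentile rank (0-100) of a Z score, ASSUMING normality."""
    return 100.0 * _STD_NORMAL.cdf(z)


def pr_from_t(t):
    """Percentile rank of a T score, ASSUMING mean 50, SD 10, normality."""
    return pr_from_z((t - 50.0) / 10.0)


def stem_major_group(calculus_credits, science_identity_pr,
                     stem_credits, math_pr):
    """Return (group, probability of a STEM major) using the reported tree."""
    if calculus_credits <= 0:
        return 1, 0.16  # no calculus credits
    if science_identity_pr < 74:
        return 2, 0.22  # calculus, but lower science identity
    if stem_credits >= 9.8:
        return 5, 0.68  # calculus, high identity, many STEM credits
    if math_pr >= 97:
        return 4, 0.68  # calculus, high identity, very high math score
    return 3, 0.37      # calculus, high identity, neither of the above


# Hypothetical student: 1 calculus credit, science identity Z = 1.0,
# 8 STEM credits, math T score = 60.
group, prob = stem_major_group(1, pr_from_z(1.0), 8, pr_from_t(60))
```

Under these normality assumptions, a PR of 74 corresponds to a Z score of roughly 0.64, and a PR of 97 corresponds to a T score of roughly 69, which gives a sense of how selective the math achievement threshold is.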
Using the CART algorithm for prediction based on the test dataset (i.e., 20% of the full data) led to a classification accuracy equaling 0.80. A sensitivity analysis using random forest analysis indicated that the four selected variables in the CART are also identified as important variables using the mean decrease accuracy method, which further strengthened our confidence in the CART results. The CART-based tree can also be converted into Fig. 2, a more understandable image demonstrating how these four variables interactively predict STEM college major choice.
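The holdout evaluation described in the Analytic Strategy can be sketched as follows. This is a simplified, unweighted illustration in Python rather than the study's R workflow: the seed, the function names, and the use of integer row identifiers are assumptions, and the longitudinal analytic weight is not applied.

```python
import random


def train_test_split(ids, train_frac=0.8, seed=0):
    """Randomly partition ids into training and testing lists
    (sampling without replacement; every id lands in exactly one list)."""
    rng = random.Random(seed)  # fixed seed is an arbitrary choice
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_train = round(train_frac * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


def accuracy(y_true, y_pred):
    """Share of cases whose predicted class equals the observed class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# 11,560 eligible students, identified here by hypothetical row numbers.
train_ids, test_ids = train_test_split(range(11560))
```

With 11,560 students, an 80/20 split yields 9,248 training and 2,312 testing cases, matching the sample sizes reported above. The tree is fit only on the training rows, and `accuracy` is then computed from the observed and predicted majors of the testing rows.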

Discussion
Consistent with prior studies (Gottfried & Bozick, 2016; Riegle-Crumb et al., 2012), our findings suggest that completing at least one calculus class during high school is highly predictive of entering STEM fields. More importantly, our CART analysis is the first to demonstrate that calculus course completion is the most predictive factor among 102 examined variables, including individual, family, and school factors. Specifically, the probability of selecting a STEM college major is only 16% for students who do not earn any calculus credits during high school. This set of findings underscores the importance of offering and supporting the completion of advanced math coursework for high school students, particularly the study of calculus. Alarmingly, only about 50% of high schools in the USA offer calculus (U.S. Department of Education, 2016). Although our study does not include school-level course offering variables, we speculate that students from schools that do not offer calculus might have a lower rate of STEM participation.
Our study also uncovers that science identity is the second most predictive variable for enrolling in a postsecondary STEM degree program. It is important to note that science identity is relatively more significant when compared to other STEM motivational factors, including math self-efficacy, science interest, and STEM career aspiration. Science identity reflects how students act to convince themselves and others that they are science students, which is a powerful source of persistence in science (Robinson et al., 2019; Stets et al., 2017). Our CART study indicates that if students earn calculus credit(s) and report a high level of science identity (PR ≥ 74), the likelihood of choosing a STEM college major will increase substantially, to 37% and higher. School administrators and policymakers might consider developing or adopting programs or curricula that can help students cultivate a science identity in high school or at an earlier stage.
Predictably, students who earn more credits in STEM-related courses and have excellent math achievement in high school are more likely to enroll in STEM majors in college. However, these two factors (3rd and 4th most predictive variables) are "conditional" on the first two. In other words, only if students earn 9.8 or more STEM credits or demonstrate excellent math achievement, in combination with earning calculus credit(s) and having a high level of science identity, will the probability of declaring a STEM college major increase from 37 to 68%. This "conditional" implication, uncovered by the CART method, is a novel finding and addition to the literature on STEM education and career development.
There are four limitations to our study. First, our study relies on public-use secondary data. Other important predictors that are not released (e.g., school-level course offerings) or collected (e.g., neighborhood STEM resources) might be omitted. For example, the inclusion of school- and district-level data could provide insight into how related policies, resources, and programs contribute to students' STEM learning and pursuit of STEM careers. Due to this limitation, we are unable to determine the relative importance of the four selected variables compared to the variables omitted from this study. Although we could not include all possible predictors in our model, our study covers a broader range of aspects for prediction than previous studies. Second, we restrict the initial sample (23,000+ ninth graders from 944 schools in 2009) to those students (n = 11,560) who participated in the follow-up surveys from 2009 to 2016. We acknowledge that attrition bias might be a threat to internal and external validity. Therefore, to reduce the threat, our analysis has applied the student longitudinal analytic weight provided by the NCES. Third, our study results could only represent the findings for the 2009 ninth-grade cohort. Nonetheless, this cohort is the latest nationally representative, longitudinal high school sample for STEM education research conducted by the NCES. Fourth, some of our measures (e.g., parental involvement and STEM motivational factors) involve items with repeated measures. However, each of these items is measured only twice (grades 9 and 11) in high school. Due to the limited number of repeated measures, we use the regular CART for this study. Future longitudinal studies with repeated measures at multiple time points could consider employing the promising longitudinal CART algorithm (Kundu & Harezlak, 2019).
Despite these limitations, the findings of this study contribute to the current STEM literature in the following important ways: (a) identifying the relatively important variables among a rich set of predictors associated with STEM college major choice, (b) presenting how these four most predictive variables interactively predict the likelihood of choosing a STEM college major, and (c) demonstrating the potential of using the CART algorithm to uncover previously unexamined nuances of STEM educational and career pathways. Well-developed and effectively implemented programs could increase STEM participation and motivation (Hudson et al., 2020; Pike & Robbins, 2019). Our findings provide educators and policymakers with new perspectives and insights on which relatively important factors could be targeted for intervention among young students.

Appendix (continued)
S1TEPOPULAR: S1 Time/effort in math/science means 9th grader won't be popular (Categorical)
S1TEMAKEFUN: S1 Time/effort in math/science means people will make fun of 9th grader