
Predicting University Students’ Academic Success and Major Using Random Forests

  • Cédric Beaulac
  • Jeffrey S. Rosenthal

Abstract

In this article, a large data set containing every course taken by every undergraduate student at a major university in Canada over 10 years is analysed. Modern machine learning algorithms can use such large data sets to build useful tools for the data provider, in this case the university. Two classifiers are constructed using random forests. First, the courses completed by a student in their first two semesters are used to predict whether they will obtain an undergraduate degree. Second, for the students who completed a program, their major is predicted, again using the first few courses in which they enrolled. A classification tree is an intuitive and powerful classifier, and building a random forest of trees improves it further. Random forests also allow for reliable variable importance measurements. These measures reveal which variables are useful to the classifiers and can be used to better understand what is statistically related to the students' outcomes. The results are two accurate classifiers and a variable importance analysis that provides useful information to university administrations.
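The approach the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the data are synthetic, and the feature names (grades in early courses) and model settings are assumptions made for the example. A random forest is fit on first-year features to predict degree completion, and its impurity-based variable importance is then inspected.

```python
# Minimal sketch of the abstract's approach on synthetic data:
# predict degree completion from early-course grades with a random
# forest, then read off variable importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Hypothetical features: grades in five first-year courses (0-100 scale).
grades = rng.uniform(50, 100, size=(n, 5))

# Synthetic outcome: completion driven mostly by the first two courses,
# plus noise -- purely illustrative, not drawn from the paper's data.
completed = (0.6 * grades[:, 0] + 0.4 * grades[:, 1]
             + rng.normal(0, 5, size=n) > 75).astype(int)

# Out-of-bag score gives an honest accuracy estimate without a held-out set.
forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=0)
forest.fit(grades, completed)

# Mean-decrease-in-impurity importances; they are normalised to sum to 1,
# so each value is the relative share of splits a feature contributes.
importances = forest.feature_importances_
```

Here the importance vector would flag the first two courses as the informative predictors, mirroring how the paper uses variable importance to identify which early courses relate to students' outcomes.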

Keywords

Higher education · Student retention · Academic success · Machine learning · Classification tree · Random forest · Variable importance

Acknowledgements

We are very grateful to Glenn Loney and Sinisa Markovic of the University of Toronto for providing us with students' grade data. The authors also gratefully acknowledge financial support from NSERC of Canada.

Copyright information

© Springer Nature B.V. 2019

Authors and Affiliations

  1. University of Toronto, Toronto, Canada
