Abstract
Early student dropout is one of the major problems universities currently face. Several machine learning techniques have been used to detect students at risk of dropping out. Using sociodemographic data and grades from the previous educational level, these predictive models are accurate enough to support retention programs, and adding grades from the first semesters increases their accuracy further. Nevertheless, their classification errors have a cost: at-risk students who go undetected are left out of retention programs, whereas students with no actual risk consume additional resources. To provide more accurate models, we propose a stacking ensemble technique that yields an improved combined dropout model while using relatively few variables. The model's results fall within the expected ranges for an early dropout model, but with considerably fewer features and less historical information, and we show that deploying the models would be cost-efficient for the institution if applied to an intervention program.
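To make the idea of a stacking ensemble concrete, the following is a minimal sketch and not the authors' implementation. It assumes two base learners (logistic regression and k-nearest neighbors), a logistic regression meta-learner, synthetic two-feature data with roughly 20% dropouts, and arbitrary hyperparameters; all function names are hypothetical. The meta-learner is trained on out-of-fold scores so that it never sees a base prediction made on a row the base model was fitted on.

```python
import math
import random

def sigmoid(z):
    # Clamp the input so math.exp cannot overflow.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

def predict_lr(w, X):
    # The last weight is the bias; zip stops at the feature count.
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + w[-1]) for xi in X]

def fit_lr(X, y, epochs=300, step=0.5):
    """Logistic regression by batch gradient descent (assumed settings)."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict_lr(w, [xi])[0] - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - step * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def predict_knn(X_train, y_train, X, k=5):
    """Share of dropouts among the k nearest training students."""
    scores = []
    for xi in X:
        nearest = sorted(zip(X_train, y_train),
                         key=lambda p: math.dist(p[0], xi))[:k]
        scores.append(sum(label for _, label in nearest) / k)
    return scores

def base_scores(X_train, y_train, X):
    """Level-0 features: one risk score per base learner."""
    p_lr = predict_lr(fit_lr(X_train, y_train), X)
    p_knn = predict_knn(X_train, y_train, X)
    return [list(pair) for pair in zip(p_lr, p_knn)]

def fit_stack(X, y, folds=5):
    """Train the meta-learner on out-of-fold base scores."""
    meta_X = [None] * len(X)
    for f in range(folds):
        held = [i for i in range(len(X)) if i % folds == f]
        rest = [i for i in range(len(X)) if i % folds != f]
        scores = base_scores([X[i] for i in rest], [y[i] for i in rest],
                             [X[i] for i in held])
        for i, s in zip(held, scores):
            meta_X[i] = s
    return fit_lr(meta_X, y)

def predict_stack(meta_w, X_train, y_train, X):
    # At prediction time the base learners are refit on all training rows.
    return predict_lr(meta_w, base_scores(X_train, y_train, X))

def make_students(n, rng):
    """Synthetic data (an assumption): two features, roughly 20% dropouts."""
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.2 else 0
        center = -1.0 if label else 1.0
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y

rng = random.Random(0)
X_train, y_train = make_students(300, rng)
X_test, y_test = make_students(100, rng)
meta_w = fit_stack(X_train, y_train)
risk = predict_stack(meta_w, X_train, y_train, X_test)  # dropout risk per student
```

The sketch uses only the standard library so that each step is visible. In practice a library implementation such as scikit-learn's `StackingClassifier` would be the more usual choice, and the threshold applied to `risk` would depend on the relative cost of missed students versus unnecessary interventions.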
Data availability
The datasets analyzed during the current study are available in the Institute for the Future of Education’s Educational Innovation collection of the Tecnologico de Monterrey’s Data Hub repository, https://doi.org/10.57687/FK2/PWJRSJ.
Abbreviations
- CRISP-DM: Cross Industry Standard Process for Data Mining
- RFE: Recursive Feature Elimination
- SMOTE: Synthetic Minority Oversampling Technique
- ROC: Receiver Operating Characteristic curve
- PRC: Precision Recall Curve
- AUC: Area Under the Curve
- LR: Logistic regression
- KNN: k-Nearest Neighbors
Acknowledgements
The authors would like to acknowledge the Living Lab & Data Hub of the Institute for the Future of Education, Tecnológico de Monterrey, Mexico, for the data published through the Call “Bringing New Solutions to the Challenges of Predicting and Countering Student Dropout in Higher Education” used in the production of this work.
Funding
The authors would like to thank Tecnológico de Monterrey for its financial support through the “Challenge-Based Research Funding Program 2022”. Project ID # I004 - IFE001 - C2-T3 – T.
Author information
Contributions
JAT performed the literature search and data analysis and wrote the first draft of the document. HC critically reviewed the work, provided commentary, supervised, and guided the final development of the article. All authors read and approved the final manuscript.
Ethics declarations
Institutional review board statement
Privacy issues related to the collection, curation, and publication of student data were validated with Tecnológico de Monterrey’s Data Owners and the Data Security and Information Management Departments.
Competing interests
Juan Talamás and Héctor Ceballos declare that they have no competing interests.
About this article
Cite this article
Talamás-Carvajal, J.A., Ceballos, H.G. A stacking ensemble machine learning method for early identification of students at risk of dropout. Educ Inf Technol 28, 12169–12189 (2023). https://doi.org/10.1007/s10639-023-11682-z
DOI: https://doi.org/10.1007/s10639-023-11682-z