An empirical assessment of smote variants techniques and interpretation methods in improving the accuracy and the interpretability of student performance models

Published in Education and Information Technologies

Abstract

Predicting student performance from educational data is a significant area of machine learning research. However, class imbalance in educational datasets can hinder accuracy, and the difficulty of interpreting complex models limits their adoption. This study compares variants of the Synthetic Minority Oversampling Technique (SMOTE), each combined with classification algorithms, to build prediction models. The results show that SMOTE with Edited Nearest Neighbors (SMOTE-ENN) is the superior variant, and that the balanced random forest classifier performs best when paired with SMOTE-ENN, achieving 96% accuracy, precision, and F-value; SMOTE-ENN also has a faster execution time. For model interpretability, combining Local Interpretable Model-agnostic Explanations (LIME) and SHapley Additive exPlanations (SHAP) provides deeper insight: LIME suits single-prediction interpretation, while SHAP is better for interpreting the model as a whole. This research offers guidelines for mitigating data imbalance and improving fairness in education through data-driven innovations such as early warning systems, and it introduces academics to explainability approaches so that machine learning methods can be more widely adopted.
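
To make the resampling-plus-classification pipeline concrete, here is a minimal sketch using the imbalanced-learn (imblearn) implementations of SMOTE-ENN and the balanced random forest; the synthetic dataset, class ratio, and hyperparameters are illustrative assumptions, not the study's actual data or settings.

```python
# A minimal, illustrative sketch (not the study's exact pipeline).
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from imblearn.combine import SMOTEENN
from imblearn.ensemble import BalancedRandomForestClassifier

# Synthetic stand-in for an imbalanced student dataset
# (~10% minority "at-risk" class; the ratio is an assumption).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Resample only the training split; the test set stays untouched
# so evaluation reflects the true class distribution.
X_res, y_res = SMOTEENN(random_state=42).fit_resample(X_train, y_train)

clf = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_res, y_res)
print(classification_report(y_test, clf.predict(X_test)))
```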

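Continuing from that sketch (and reusing its clf, X_train, and X_test), the example below contrasts the two explanation styles the abstract describes: LIME for interpreting a single prediction and SHAP for a model-wide view. The lime and shap calls are standard, but the class labels, background-sample size, and the handling of shap's version-dependent return shape are assumptions, not the authors' setup.

```python
# Continues from the sketch above (reuses clf, X_train, X_test).
import shap
from lime.lime_tabular import LimeTabularExplainer

feature_names = [f"feature_{i}" for i in range(X_train.shape[1])]

# LIME: fit a local surrogate model around one instance to explain a
# single prediction (class names here are hypothetical).
lime_explainer = LimeTabularExplainer(
    X_train, feature_names=feature_names,
    class_names=["pass", "at-risk"], mode="classification")
exp = lime_explainer.explain_instance(X_test[0], clf.predict_proba)
print(exp.as_list())  # per-feature weights for this one student

# SHAP: the model-agnostic KernelExplainer attributes predictions
# across all features; a small background sample keeps it tractable.
background = shap.sample(X_train, 100)
shap_explainer = shap.KernelExplainer(clf.predict_proba, background)
sv = shap_explainer.shap_values(X_test[:50])
# Older shap versions return a list per class; newer ones an array.
sv_pos = sv[1] if isinstance(sv, list) else sv[..., 1]
shap.summary_plot(sv_pos, X_test[:50], feature_names=feature_names)
```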

Data availability

The authors declare that all data used in the design and preparation of the manuscript are reported in the manuscript.

Author information

Corresponding author

Correspondence to Anand Nayyar.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflicts of interest to report regarding the present study.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Sahlaoui, H., Alaoui, E.A.A., Agoujil, S. et al. An empirical assessment of smote variants techniques and interpretation methods in improving the accuracy and the interpretability of student performance models. Educ Inf Technol 29, 5447–5483 (2024). https://doi.org/10.1007/s10639-023-12007-w

