1 Introduction

Predicting student behavior is a crucial task for educational institutions, as it can help to improve curriculum design and plan academic support interventions that are timely and personalized. As Chung and Lee [1] highlighted, “at-risk students” who drop out of school due to difficulties are more likely to adopt antisocial behaviors or face challenges in the labor market, making it harder for them to adapt to society.

Machine learning (ML) techniques, such as predictive modeling, have the potential to improve student retention by allowing educators to recognize students' weaknesses and provide learning metrics at any stage of educational progress [2]. The application of these techniques can be used to aid in the development of early warning systems, which can detect students who are at risk of dropping out in advance and offer the necessary support [3].

In this comparative study, we explore various methods for predicting student success and dropout in higher education institutions, building upon the work of Martins et al. [4], “Early Prediction of Student’s Performance in Higher Education: a Case Study.” Martins et al. used machine learning classification models to predict students who might be at risk of failing to complete their degrees on time at the Polytechnic Institute of Portalegre (IPP) in Portugal.

While their study focused solely on supervised classification algorithms, we expand upon their work by incorporating unsupervised learning into our analysis. We apply resampling methods such as SMOTE and ADASYN to enhance the representation of the minority classes in the dataset. Additionally, we employ various machine learning algorithms, including cost-sensitive learning algorithms, ensemble algorithms, and unsupervised anomaly detection algorithms. Furthermore, we combine SMOTE and ADASYN with the unsupervised algorithm Isolation Forest to identify outliers within the data.

In this paper, we present a comparative study that utilizes the dataset collected by Martins et al. [4].

Specifically, we aim to address the following research questions:

  • RQ1: How effective are different resampling techniques, SMOTE and ADASYN, in addressing class imbalance in the context of dropout prediction?

  • RQ2: Which machine learning algorithms are most effective in predicting student dropout in our dataset?

  • RQ3: How do boosting algorithms (Gradient Boosting, Extreme Gradient Boosting, CatBoost, and LightGBM) compare to traditional machine learning algorithms in predicting student dropout?

  • RQ4: What are the key factors that contribute to student success or failure, as revealed by the application of SHAP (SHapley Additive exPlanations)?

The remainder of this paper is organized as follows. Section 2 offers a concise review of related research, with a specific focus on the class imbalance problem in dropout prediction, and discusses the key findings of Martins et al. [4], which serve as the basis for our comparative study. In Sect. 3, we begin with a bibliometric analysis of the knowledge structure surrounding the efficacy of boosting algorithms in forecasting student academic success, and then present a systematic literature review of the most highly cited papers in this domain. Section 4 describes the methodology used to build and evaluate the machine learning (ML) models, including the data description, the methods used to handle the imbalanced dataset, and the procedures for training and evaluating the classification models. In Sect. 5, we present the experimental results of the supervised and unsupervised ML models, compare the F1-scores of the supervised algorithms before and after hyperparameter optimization with Optuna, and apply the SHAP (SHapley Additive exPlanations) method to explain the output of the best-performing algorithms, LightGBM and CatBoost with Optuna. Section 6 presents a comprehensive assessment of the supervised algorithms' performance in predicting student outcomes, and Sect. 7 concludes the paper and suggests future research directions.

2 Analyzing class imbalance in dropout prediction and building on prior research

Section 2 provides a condensed analysis of the related work. The first subsection discusses the class imbalance problem and the application of unsupervised machine learning techniques in classification, particularly in the context of dropout prediction. The second subsection provides a summary of Martins et al.'s [4] research, which serves as a foundation for our comparative study.

2.1 Addressing class imbalance in machine learning for classification

In educational data mining (EDM), class imbalance is a common issue, particularly when dealing with student retention data. This occurs because the number of students who drop out is significantly smaller than the number of students who stay in school or have good academic performance.

Class imbalance negatively impacts the accuracy of predictive models. Because data points for the minority class are scarce, models tend to be biased towards the majority class, resulting in poor predictions for the minority class [2, 5]. In such cases, standard classification algorithms tend to perform poorly on the minority class.

To address this issue, a number of techniques have been proposed in the literature. As noted by Islam et al. [6], in the past few decades, three popular approaches have been used to handle imbalanced data: data-driven, algorithm-based, and hybrid methods.

  • Data-driven methods focus on balancing the distribution of classes in the training set by either oversampling the minority class or undersampling the majority class (increasing the representation of the minority class or decreasing the representation of the majority class). Oversampling techniques include Random Over-Sampling Examples (ROSE), Synthetic Minority Over-Sampling Technique (SMOTE), and Adaptive Synthetic (ADASYN), among others. Undersampling techniques include Random Under-Sampling Examples (RUSE) and Tomek Links (TL).

  • Algorithm-based techniques, on the other hand, modify the learning algorithms to improve their performance on imbalanced data. Examples of such techniques include cost-sensitive learning, anomaly detection, and ensemble-based methods like bagging and boosting; a minimal cost-sensitive example is sketched after this list.

  • Hybrid methods combine data-driven and algorithm-based techniques to improve the performance of machine learning models on imbalanced data. For example, hybrid methods may use data-driven techniques to balance the distribution of classes in the training set and then apply algorithm-based techniques to further improve the performance of the model.
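
To make the algorithm-based approach concrete, the following minimal sketch (our illustration, not drawn from [6]) shows cost-sensitive learning in scikit-learn via the class_weight parameter; the three-class imbalanced data are synthetic stand-ins for a real student dataset, and the model and parameters are illustrative only.

```python
# Cost-sensitive learning sketch: penalize minority-class errors more heavily.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic 3-class data roughly mirroring the Success / Failure / Relative Success imbalance.
X, y = make_classification(n_samples=4000, n_features=20, n_informative=10,
                           n_classes=3, weights=[0.50, 0.32, 0.18], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# class_weight="balanced" rescales the loss inversely to class frequencies.
clf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=42)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```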

Class imbalance is a common problem in predicting student performance, and various techniques have been proposed in the literature to address it. Rastrollo-Guerrero et al. [7] conducted a review of nearly 70 papers to identify the modern techniques commonly used for predicting student performance. They found that supervised learning, particularly the Support Vector Machine algorithm, was the most widely used and provided accurate results. Decision Tree (DT), Naïve Bayes (NB), and Random Forest (RF) were also well-studied algorithms that produced good results.

According to Rastrollo-Guerrero et al. [7], unsupervised learning is often considered an unattractive technique for researchers due to its low accuracy in predicting students' behavior in certain cases. However, the authors suggest that this can serve as an incentive for further research to improve these techniques and obtain more reliable results.

Our study aims to contribute to this area by exploring the effectiveness of unsupervised machine learning using Isolation Forest (IF) for predicting student dropout, while also addressing class imbalance in the data. Despite our efforts, our study found that IF did not perform well in predicting student dropout. These findings highlight the need for continued research into alternative machine learning algorithms and techniques that can effectively handle class imbalance and improve dropout prediction.

Recent research has demonstrated the effectiveness of boosting algorithms in predicting student dropout [8, 9]. These studies are in line with our findings, which demonstrate that boosting algorithms outperformed traditional classification methods.

2.2 Summary of Martins et al. [4] research

In this section, we briefly describe the purpose of their study, the methods they used, the dataset they worked with, and their findings. In addition, we outline the limitations of their study and explain how our comparative study builds upon their work.

Martins et al. [4] aimed to develop a system that could identify students with potential difficulties in their academic path at an early stage, so that strategies to support the students could be put into place. Their research focused on building a system that generalizes to any course at IPP, rather than focusing on a specific field of study. The dataset included information from students enrolled in several courses from the four different schools belonging to IPP. Additionally, their paper only relied on information available at the time of enrollment and did not include any information on academic performance after enrollment. Another unique aspect of the paper was the use of a third intermediate class (relative success) in addition to the usual approach of restricting the model categories to failure/success. This allowed for different interventions for academic support and guidance for students who are at moderate risk versus those who are at high risk of being unsuccessful.

In their work, Martins et al. [4] highlighted the limitations of using accuracy as a performance metric for models trained on imbalanced datasets. This is because accuracy may give a false sense of good performance, as it tends to favor the majority class and can ignore the performance of the minority classes. To address this issue, Martins et al. used single-class metrics to evaluate the performance of the model on each class separately.

Specifically, they used the F1 measure, which takes into account the balance between precision and recall, as the performance metric for the three classes in the dataset. By computing F1 scores for each class, Martins et al. were able to gain insights into the performance of the model for both the majority and minority classes. They were then able to use the average F1 score for the three classes as the metric for hyperparameter tuning, ensuring that the model is optimized to perform well on all classes.

Additionally, Martins et al. computed the accuracy of the optimized model as an overall metric, which provides a useful summary of the model’s performance. By using both F1 scores and accuracy as performance metrics, Martins et al. were able to gain a more complete understanding of the performance of their model, especially for imbalanced datasets.

Martins et al. [4] aimed to improve classification performance using different algorithms. Initially, they applied four algorithms, namely DT, SVM, Logistic Regression (LR), and RF, to the dataset. Then, they applied the same four algorithms with hyperparameter tuning using grid search and compared the results. To deal with the class imbalance challenge, they applied the data sampling techniques SMOTE (Synthetic Minority Over-sampling Technique) and ADASYN (Adaptive Synthetic Sampling) to the dataset prior to model training. They found that SMOTE outperformed ADASYN. Thus, they applied SMOTE and used the same four algorithms with hyperparameter tuning using grid search and compared the results. Furthermore, Martins et al. [4] applied four boosting algorithms, namely Gradient Boosting, Extreme Gradient Boosting, CatBoost, and LogitBoost, using SMOTE. They then applied the same four boosting algorithms, again with SMOTE, using hyperparameter tuning with randomized grid search and compared the results.

In the study conducted by Martins et al. [4], the findings suggest that these boosting algorithms outperform standard methods when dealing with particular classification tasks. Among the four boosting algorithms assessed, Extreme Gradient Boosting emerged as the top classifier, although very similar to Gradient Boosting. However, despite the use of boosting algorithms, Martins et al. found that these models still struggled to accurately classify the minority classes in the imbalanced dataset. This is a common issue with imbalanced datasets, where the majority class dominates the dataset and makes it difficult for the model to learn from the minority class samples.

3 Systematic review and bibliometric analysis: boosting algorithms in student success prediction

In this section, we conducted a comprehensive bibliometric analysis using the Bibliometrix R-package. Table 1 presents the assessment criteria and Table 2 outlines our search strategy methodology. Our search yielded a total of 203 relevant articles.

Table 1 Assessment criteria
Table 2 Search strategy methodology

Expanding on the knowledge acquired from the Most Globally Cited Documents among the 203 articles, we conducted a Systematic Literature Review (SLR) detailed in Sect. 3.2. Our focus was on the 15 most frequently cited papers in the field. These papers illuminate the present state of research regarding boosting algorithms, particularly Gradient Boosting (GB) and Extreme Gradient Boosting (XGBoost), in the context of student performance. These algorithms consistently exhibited superior performance, highlighting their potential as valuable tools for advancing educational research and enhancing student outcomes.

3.1 Overview of the literature landscape

This subsection presents two informative graphs: Annual Scientific Production (Fig. 1) and the Most Globally Cited Documents (Fig. 2), both based on our dataset of 203 documents. Additionally, Table 3 provides key statistics, revealing a notable annual growth rate of 78.18% and an average document age of just 1.42 years, emphasizing the timeliness of research in this domain. Notably, each document in our dataset received an average of 6.241 citations, underscoring the significance and influence of the literature.

Fig. 1 Annual scientific production

Fig. 2 Most globally cited documents

Table 3 Main information about data

3.2 Systematic literature review: leveraging boosting algorithms for precision in predicting student performance

This section presents our Systematic Literature Review (SLR) based on the 15 most frequently cited papers in the field (Table 4). This table provides additional details for each document in the ranked list of our bibliometric research (Fig. 2), including the full title, the total number of citations per year (TC per year), and the normalized citation impact scores (normalized TC).

Table 4 List of the 15 most cited articles

In each of the following studies, boosting algorithms, particularly Gradient Boosting (GB) and Extreme Gradient Boosting (XGBoost), consistently showcased superior performance, underscoring their potential as valuable tools for enhancing educational research and improving student outcomes.

1. Predicting student satisfaction with MOOCs [10]

In their study, Hew et al. aimed to measure MOOC success based on student satisfaction, rather than conventional completion rates. Employing machine learning and sentiment analysis, the study examined 249 MOOCs and student perceptions. Notably, their research revealed that the gradient boosting trees model outperformed other candidate models, demonstrating acceptable values for accuracy, precision, recall, F1 score, and Cohen’s kappa, leading to its selection for the study.

2. Predicting academic performance in public schools [11]

Fernandes et al. developed classification models using the Gradient Boosting Machine (GBM) algorithm to predict student performance and compared their predictive capabilities at different points in the academic cycle. They began with a descriptive statistical analysis of datasets from the State Department of Education in the Federal District of Brazil, creating two datasets: one with attributes collected before the school year commenced and another with additional variables like ‘absences,’ ‘grades,’ and ‘school subjects’ gathered two months into the school year. The choice of GBM classification models stemmed from the algorithm's ability to combine multiple decision trees into a potent predictive model. Notably, the accuracy of these models was significantly enhanced through the boosting technique, which played a pivotal role in leveraging data and decision trees to offer valuable insights into student academic outcomes.

3. Identifying behavioral phenotypes of loneliness in college students [12]

Doryab et al. employed various analytic methods to understand loneliness in college students. They used machine learning classification to infer the level of loneliness and changes in loneliness levels, employing an ensemble of gradient boosting and logistic regression algorithms with feature selection. Notably, the boosting algorithm, gradient boosting, was chosen for its ability to perform well on noisy datasets and learn complex nonlinear decision boundaries. In their research, gradient boosting demonstrated superior performance in detecting outcomes related to loneliness, which highlights its effectiveness in analyzing and predicting behavioral phenotypes associated with loneliness and social isolation among college students.

4. Anxiety levels during COVID-19 in Chinese University students [13]

Wang et al. explored anxiety severity among Chinese non-graduating undergraduate students during the COVID-19 pandemic. The study applied machine learning techniques, specifically Extreme Gradient Boosting (XGBoost), to predict both the anxiety level and changes in anxiety levels four weeks later, using students’ initial scores on a sociodemographic questionnaire and the Self-Rating Anxiety Scale (SAS). Remarkably, the XGBoost model demonstrated superior prediction accuracy compared to the traditional multiple stepwise regression model in evaluating the anxiety status of university students.

5. Educational assessment process data analysis [14]

Qiao and Jiao addressed the application of data mining methods to analyze educational assessment process data. Among the supervised techniques used, the results indicated that gradient boosting outperformed the other methods, achieving the highest classification accuracy with a Kappa value of 0.94 and an overall accuracy of 0.96. Overall, all four methods demonstrated satisfactory classification accuracy, but the boosting algorithm, gradient boosting, exhibited superior performance in accurately analyzing the educational assessment process data, highlighting its efficacy in this context.

6. Predicting psychological state among Chinese undergraduate students [15]

Ge et al. aimed to develop predictive models for identifying Chinese undergraduate students at high risk of anxiety and insomnia during the COVID-19 pandemic using machine learning techniques. The gradient boosting algorithm demonstrated one of the highest accuracy rates among the models. Their study included 2009 students, with a response rate of 80.36%. The Extreme Gradient Boosting (XGBoost) algorithm demonstrated remarkable performance, accurately predicting 97.3% of students with probable anxiety and 96.2% of students with probable insomnia symptoms.

7. Enhancing student performance prediction with ensemble methods [16]

Asselman et al. aimed to improve the accuracy of predicting student performance by harnessing Ensemble Learning methods in the context of Knowledge Tracing (KT). They introduced a new Performance Factors Analysis (PFA) approach utilizing various models, including Random Forest, AdaBoost, and XGBoost. These models were evaluated across three distinct datasets, and the experimental results revealed that the scalable XGBoost algorithm outperformed the other models, leading to a substantial enhancement in performance prediction compared to the original PFA algorithm.

8. Predicting suicidal risk in Korean adolescents [17]

Jung et al. sought to develop predictive models for identifying adolescents at high risk of suicide using machine learning methods. They employed five different machine learning techniques, including logistic regression (LR), random forest (RF), support vector machine (SVM), artificial neural network (ANN), and extreme gradient boosting (XGB). Through their analysis, they found that factors such as sadness, violence, substance use, and stress were associated with an increased risk of suicidal ideation/attempt among adolescents. Using 26 predictor variables, the machine learning models demonstrated comparable accuracy to logistic regression, with XGB exhibiting the highest accuracy at 79.0%, followed by SVM at 78.7%, LR at 77.9%, RF at 77.8%, and ANN at 77.5%.

9. Predicting high school academic scores [18]

Costa-Mendes et al. aimed to predict high school academic scores using machine learning techniques. They employed several algorithms, including Random Forest (RF), support vector machine (SVM), artificial neural network (ANN), and extreme gradient boosting (XGB), stacking them as an ensemble. The results of their study demonstrated that all machine learning algorithms outperformed the classical multiple linear regression model in terms of predictive accuracy for high school academic scores. Specifically, XGB showed superior performance to RF, particularly when specific conditions related to the determinant of the relationship between matrices (Det [R]) were met.

10. Disparities in elementary students’ reading literacy [19]

Chen et al. examined disparities in reading literacy among elementary students and identified the key factors contributing to these disparities. They employed four machine learning algorithms: logistic regression, support vector machine (SVM), decision tree (DT), and extreme gradient boosting (XGBoost). These algorithms were utilized to classify and predict high- and low-proficiency readers while pinpointing the most influential factors in differentiating between high and low achievers. The study found that, compared to logistic regression and DT, both linear SVM and XGBoost exhibited outstanding and robust classification performances. Consequently, linear SVM and XGBoost were selected to identify the most important features or factors responsible for distinguishing high-achieving students from their low-achieving counterparts.

11. Early-stage prediction of student academic performance [20]

Nabil et al. addressed the challenge of early-stage prediction of student academic performance, with a focus on courses like “Programming” and “Data Structures” known for high failure and dropout rates. The study utilized various predictive models, including a Deep Neural Network (DNN), Decision Tree (DT), Random Forest (RF), Gradient Boosting (GB), Logistic Regression (LR), Support Vector Classifier, and K-Nearest Neighbor. They also tackled the issue of imbalanced datasets, employing resampling methods like SMOTE, ADASYN, ROS, and SMOTE-ENN to mitigate this challenge. Results from the experiments highlighted the effectiveness of the proposed DNN model in predicting students' performance in data structure courses and identifying students at risk of failure early in the semester, achieving an accuracy of 89%. This accuracy surpassed that of other models like decision tree, logistic regression, support vector classifier, and K-nearest neighbor. Notably, Gradient Boosting (GB) also exhibited strong performance, achieving one of the best accuracies at 91%, highlighting its effectiveness in addressing the imbalanced dataset issue and enhancing prediction results.

12. Identifying struggling students [21]

Abidi et al. used machine learning techniques to identify students who struggled with mastering algebra skills assigned as homework within an Intelligent Tutoring System (ITS). Gradient boosting, along with other models, exhibited high accuracy in classifying students attempting algebra homework. The study focused on ASSISTments, an ITS, and employed various machine learning techniques to analyze skill-builder data. Seven candidate models were utilized, including Naïve Bayes (NB), Generalized Linear Model (GLM), Logistic Regression (LR), Deep Learning (DL), Decision Tree (DT), Random Forest (RF), and Gradient Boosted Trees (XGBoost). The boosting algorithm, XGBoost, along with RF, GLM, and DL, exhibited high accuracy in classifying confused students attempting algebra homework within the ITS. The study emphasized the importance of leveraging machine learning methods to identify and support students who struggle with specific skills, ultimately contributing to their knowledge and promoting their role in sustainable development, with XGBoost showing one of the highest accuracies at 84.9%.

13. Predicting student dropout [22]

Niyogisubizo et al. proposed a novel stacked ensemble method for predicting student dropout, a pressing research topic, combining Random Forest (RF), Extreme Gradient Boosting (XGBoost), Gradient Boosting (GB), and Feed-forward Neural Networks (FNN). Employing a dataset spanning 2016 to 2020 from Constantine the Philosopher University in Nitra, their approach outperformed the base models, as demonstrated by testing accuracy and area under the curve (AUC) evaluation metrics. In particular, XGBoost exhibited an AUC score of 0.978, emphasizing its strong performance in identifying students at risk of dropping out of university classes.

14. Predicting suicide risk based on language in therapy sessions [25]

Cohen et al. addressed the pressing issue of rising adolescent suicide rates by exploring the feasibility of using machine learning to identify suicide risk based on language samples from therapy sessions. This study involved the testing of natural language processing machine learning models within outpatient therapy sessions for adolescents. Data collection encompassed language samples, standardized depression and suicidality scale scores, and therapists' impressions of the client's mental state. Among the Machine Learning (ML) models tested, Extreme Gradient Boosting (XGBoost) demonstrated the best performance with an area under the curve (AUC) of 0.78 during external validation.

15. Monitoring health effects during haze pollution [23]

Zhang et al. investigated the potential of using breath-borne volatile organic compounds (VOCs) as biomarkers for monitoring health effects related to air pollution, specifically during haze pollution episodes. The study involved 47 healthy college students, and their exhaled breath samples were collected and analyzed for VOCs before, during, and after two separate haze pollution episodes using gas chromatography-ion mobility spectrometry (GC-IMS). Machine learning, specifically the Gradient Boosting Machine (GBM) model, was employed to differentiate between different exposure periods. The GBM model demonstrated excellent performance in distinguishing pre- and on-exposure to haze pollution, achieving precision rates of 90–100% for both haze episodes.

4 Methodology used to build and evaluate the machine learning models

This methodology section starts with an overview of the approach used for building and evaluating the machine learning models, followed by detailed subsections covering the data description, the resampling methods used to address class imbalance, and the data preprocessing challenges.

4.1 Overview

The methodology employed in this study involved several sequential steps and algorithms to address the research objectives. An overview of our methodology is presented in Fig. 3, outlining the key steps undertaken.

Fig. 3 Methodology to build and evaluate the machine learning models

The first step involved applying the SMOTE (Synthetic Minority Over-sampling Technique) and ADASYN (Adaptive Synthetic Sampling) methods to resample the dataset. These techniques are commonly used for handling imbalanced datasets. To compare their performance, a decision tree classifier was employed, and the results were evaluated. Based on this comparison, the best-performing resampling technique was selected for further analysis.
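
A minimal sketch of this resampling comparison step is shown below, assuming the imbalanced-learn package; synthetic three-class data stand in for the IPP records, and the decision tree settings are illustrative rather than those used in the study.

```python
# Compare SMOTE and ADASYN by training the same decision tree on each resampled set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE, ADASYN

X, y = make_classification(n_samples=4000, n_features=20, n_informative=10,
                           n_classes=3, weights=[0.50, 0.32, 0.18], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

for name, sampler in [("SMOTE", SMOTE(random_state=0)), ("ADASYN", ADASYN(random_state=0))]:
    X_res, y_res = sampler.fit_resample(X_tr, y_tr)   # only the training split is resampled
    tree = DecisionTreeClassifier(random_state=0).fit(X_res, y_res)
    print(name, f1_score(y_te, tree.predict(X_te), average="weighted"))
```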

Next, the preprocessed dataset was subjected to classification using three algorithms: Decision Tree (DT), Support Vector Machines (SVM), and Random Forest (RF). These algorithms were chosen for their effectiveness in classification tasks. The performance of each algorithm was evaluated to identify any significant differences.

To enhance the classification performance, boosting algorithms were employed, namely Gradient Boosting (GB), Extreme Gradient Boosting (XGBoost), CatBoost (CB), and LightGBM (LB). These algorithms are known for their ability to improve predictive power. By incorporating these boosting techniques, the study aimed to optimize the classification results obtained from the initial algorithms.

Furthermore, the study employed the Isolation Forests algorithm, an unsupervised anomaly detection technique, to identify outliers or anomalies within the dataset. This step focused on identifying instances that deviated significantly from the majority of the dataset.

To optimize the performance of all the algorithms used in this study, hyperparameter tuning was conducted using Optuna. Optuna is a hyperparameter optimization framework that assists in finding the best combination of hyperparameters for machine learning models. This step aimed to fine-tune the algorithms' performance by optimizing their respective hyperparameters. To ensure the reliability of our findings, we integrated Optuna with the cross-validation function, cross_val_score(). This approach not only enabled us to achieve hyperparameter optimization but also allowed for a rigorous assessment of the models' performance.
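
The following sketch illustrates how Optuna can be coupled with cross_val_score and an f1_weighted objective, as described above; the model and search ranges are illustrative assumptions, not the exact configurations used in this study.

```python
# Optuna hyperparameter optimization with cross-validated weighted F1 as the objective.
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=3, weights=[0.50, 0.32, 0.18], random_state=0)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 20),
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
    }
    model = RandomForestClassifier(**params, random_state=0)
    # 5-fold cross-validation; the mean weighted F1 is what Optuna maximizes.
    return cross_val_score(model, X, y, cv=5, scoring="f1_weighted").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params, study.best_value)
```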

Additionally, the SHAP (SHapley Additive exPlanations) framework was applied to gain insights into the contribution of each feature in predicting the model's output. Specifically, SHAP was employed with the CatBoost (CB) and LightGBM (LB) algorithms to analyze the importance of individual features in the classification task. This analysis helped to enhance the understanding of how each feature influenced the model's predictions and identify key factors driving the classification results.

In the final phase of our study, we presented and thoroughly analyzed the experimental results obtained through various methods and algorithms. To evaluate the discriminative capabilities of our top-performing models, LightGBM (LB) and CatBoost (CB), we utilized ROC (Receiver Operating Characteristic) and AUC (Area Under the Curve) curves. Additionally, we conducted further assessments to explore potential significant differences in the results achieved by the LightGBM and CatBoost algorithms. To accomplish this, we employed the Statistical Tests for Algorithms Comparison (STAC) platform.

Table 5 compares the analysis methods we use in the present paper and the methodology used by Martins et al. [4], which was presented in Sect. 2.2.

Table 5 Comparison of Methodologies in Martins [4] and our study

4.2 Data description

In this paper, we present a comparative study with Martins et al. [4], utilizing the same dataset that they collected. The dataset is in CSV format and contains information about 4424 students across 35 columns. It covers a range of factors that may influence a student’s academic success, including demographic factors (age at enrollment, gender, marital status, nationality, address code, special needs), socio-economic factors (whether the student works, the educational and employment background of their parents, and whether they received a student grant or have student debt), and the student’s academic path (admission grade, retention years at high school, the order of choice for their enrolled course, and the type of course they took in high school).

Martins et al. [4] classified each record into three categories based on how long it took the student to obtain their degree: Success, Relative Success, and Failure. Success means the student obtained their degree in due time, Relative Success means they took up to three extra years to obtain their degree, and Failure means they took more than three extra years to obtain their degree or did not obtain it at all. These categories correspond to different levels of risk, with low-risk students having a high probability of succeeding, medium-risk students who might benefit from institution measures to succeed, and high-risk students having a high probability of failing.

The imbalanced distribution of the records among the three categories presented challenges for developing accurate predictive models that could effectively identify high-risk students who might benefit from interventions. As a result, predictive models that are trained on such imbalanced datasets tend to be biased towards the majority class and may have lower accuracy in predicting the minority classes. As shown in Fig. 4, in the distribution, “Failure” accounts for 32.15% of total records, and “Relative Success” accounts for 17.96% of total records, while the majority class, “Success”, accounts for 49.89% of the records.

Fig. 4 Data distribution

4.3 Resampling methods to address class imbalance

Class imbalance is a common challenge in classification tasks, such as predicting student dropout rates. To address this issue, we explore different resampling methods to improve the accuracy of our models in this study. Specifically, we compare the performance of two commonly used methods, SMOTE and ADASYN, to address the class imbalance in our dataset. We use decision trees to compare their performance, and it is worth noting that we do not use Logistic Regression due to the presence of multicollinearity in our data, as explained in Sect. 4.4.

Our findings indicate that both the SMOTE and ADASYN resampling methods effectively improve the accuracy of our models. However, SMOTE slightly outperforms ADASYN in addressing the class imbalance issue, as shown in Table 6. It is important to note that the dataset used in this study had no null or missing values, so all data points were complete and did not require imputation or other handling of missing data.

Table 6 SMOTE and ADASYN: performance comparison

4.4 Navigating challenges of data preprocessing

Martins et al. [4] employed a rigorous data preprocessing method to handle anomalies and unexplainable outliers, resulting in a final dataset consisting of 3623 records and 25 independent variables. In contrast, we decided not to remove anomalies in the preprocessing stage for several reasons, such as the impact on the representativeness of the dataset, the loss of valuable information, and the difficulty of distinguishing between true anomalies and rare but meaningful data points. We concluded that the benefits of retaining anomalies outweighed the potential drawbacks.

Our preprocessing analysis revealed significant multicollinearity in our dataset, meaning that predictor variables were highly correlated with each other, making it difficult to distinguish the individual effect of each variable on the outcome. Specifically, we employed the Variance Inflation Factor (VIF) method, a widely used approach for detecting multicollinearity in regression analysis. The VIF quantifies how strongly each predictor can be explained by the other predictors; high values indicate that multicollinearity is present in the data.

A VIF of 1 indicates that there is no multicollinearity, while a VIF greater than 1 indicates that multicollinearity may be present. A VIF of 5 or greater is often considered high. Table 7 shows the variables with a VIF higher than 10. Table 8 shows the variables with a VIF between 5 and 10.

Table 7 VIF score higher than 10
Table 8 VIF score between 5 and 10

When multicollinearity is present in a dataset, the regression coefficients can become unreliable or even meaningless, ultimately affecting the model's accuracy and validity. To mitigate this issue, we opted not to use Logistic Regression, which is not recommended in the presence of multicollinearity and can lead to biased results. Instead, we used a range of machine learning algorithms that are better suited for handling multicollinearity, such as decision trees, random forests, and gradient-boosting models.
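
As a point of reference, the VIF screening described above can be performed with statsmodels as in the following sketch; the toy predictor columns are hypothetical stand-ins for the dataset's numeric independent variables.

```python
# Variance Inflation Factor (VIF) screening for multicollinearity.
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Toy predictors; in practice df would hold the dataset's numeric independent variables.
rng = np.random.default_rng(0)
df = pd.DataFrame({"admission_grade": rng.normal(120, 15, 500)})
df["prev_grade"] = df["admission_grade"] * 0.9 + rng.normal(0, 5, 500)  # deliberately correlated
df["age_at_enrollment"] = rng.integers(17, 45, 500)

X = add_constant(df)  # VIF is computed against a design matrix that includes an intercept
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=df.columns,
)
print(vif.sort_values(ascending=False))  # values above 5 (or 10) flag multicollinearity
```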

5 Supervised ML algorithm comparison and Optuna hyperparameter optimization

This section presents the results obtained by applying three machine learning algorithms, namely Decision Trees, Support Vector Machines, and Random Forest, to predict student dropout and failure in our case study. Additionally, the section includes the results of Hyperparameter Optimization using Optuna.

5.1 Supervised algorithms

This section presents the F1-score results for seven supervised machine learning algorithms: Decision Tree, Random Forest, Support Vector Machine, Gradient Boosting, Extreme Gradient Boosting, CatBoost, and LightGBM. The F1-score metric is used to evaluate the performance of each algorithm. Additionally, the Optuna framework is applied to optimize the hyperparameters of each model, and the F1-score after applying Optuna is also reported. The tables provide a clear comparison of the performance of each algorithm, both before and after hyperparameter optimization.

The F1-score is a suitable evaluation metric for imbalanced datasets, as it balances precision and recall. In this study, the F1-score was chosen to assess the model’s accuracy, considering both false positives and false negatives. By combining precision and recall, the F1-score provides a comprehensive measure of the model's performance. Higher F1 scores indicate better classification performance for a specific class.
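
For reference, the per-class and weighted F1 scores reported in the tables below can be computed with scikit-learn as in this short sketch; the label vectors are placeholders rather than actual predictions from our models.

```python
# Per-class and weighted F1 scores, as reported in the result tables.
from sklearn.metrics import f1_score, classification_report

y_true = ["Graduate", "Dropout", "Enrolled", "Graduate", "Dropout", "Graduate"]
y_pred = ["Graduate", "Dropout", "Graduate", "Graduate", "Enrolled", "Graduate"]

print(f1_score(y_true, y_pred, average=None,
               labels=["Graduate", "Dropout", "Enrolled"]))   # one F1 per class
print(f1_score(y_true, y_pred, average="weighted"))           # frequency-weighted mean
print(classification_report(y_true, y_pred))
```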

5.1.1 Traditional algorithms

This section provides an overview of the results obtained through the application of traditional algorithms, specifically Decision Tree (DT), Random Forest (RF), and Support Vector Machine (SVM), to predict student dropout and failure within our case study. Additionally, we present the outcomes of our Hyperparameter Optimization using Optuna. DT, RF, and SVM are machine learning algorithms used for various classification and regression tasks, but they differ in their underlying principles and how they make decisions.

a) Decision trees

Decision Trees are a supervised learning algorithm that partitions the data into subsets based on the values of different input features, making decisions by following a tree-like structure of binary decisions. Table 9 shows the F1 scores obtained from the Decision Tree (DT) model applied to three different classes: Graduate, Dropout, and Enrolled. The results reveal that the DT model achieved an F1 score of 0.75 for the Graduate class, 0.78 for the Dropout class, and 0.69 for the Enrolled class. These scores represent the DT model's ability to accurately classify instances from each respective class.

Table 9 F1-Scores for Decision Tree with and without Optuna

To further optimize the Decision Tree model's performance, the Optuna framework was applied. The “f1_weighted” metric was used to evaluate the hyperparameters, which takes into account the F1 scores for each class, considering the number of samples in each class and assigning higher weight to the classes with more samples. The hyperparameter optimization process was run for 20 trials. Comparing the F1 scores obtained from the DT model and the DT model with Optuna, it is observed that the F1 score for the three classes has increased slightly. This suggests that the hyperparameter optimization process was successful in improving the model's performance, especially for the Dropout class.

b) Random forest

Random Forest is an ensemble learning technique designed to enhance accuracy and mitigate overfitting by combining multiple decision trees. It achieves this by constructing numerous decision trees, each trained on a subset of the data, and then consolidating their predictions. This method, rooted in decision trees, finds utility in classification and regression tasks, distinguished for its robustness and adeptness in managing high-dimensional data.

Table 10 shows the F1 scores obtained from the Random Forest (RF) model applied to three different classes: Graduate, Dropout, and Enrolled. The results show that the Random Forest model achieved an F1 score of 0.84 for the Graduate class, 0.87 for the Dropout class, and 0.81 for the Enrolled class. These scores represent the RF model’s ability to accurately classify instances from each respective class. To further optimize the RF model’s performance, the Optuna framework was applied, using the “f1_weighted” metric to evaluate the hyperparameters. The hyperparameter optimization process was run for 20 trials. Comparing the F1 scores obtained from the RF model and the RF model with Optuna, it is observed that the F1 score remained practically the same. This suggests that the hyperparameter optimization process did not have a significant impact on the model's performance.

Table 10 F1-Scores for Random Forest with and without Optuna

c) Support vector machines

Support Vector Machine (SVM) seeks the optimal hyperplane to separate data into distinct classes by maximizing the margin, which represents the distance between the hyperplane and the nearest data points of each class. This process entails mathematical optimization, where SVM strategically chooses support vectors, the data points closest to the hyperplane, to define the decision boundary. Table 11 presents the F1 scores obtained from the Support Vector Machines (SVM) model applied to three different classes: Graduate, Dropout, and Enrolled. Additionally, the table includes the results after applying the Optuna framework to optimize the SVM model's hyperparameters.

Table 11 F1-Scores for support vector machines with and without Optuna

The initial SVM model achieved an F1 score of 0.79 for the Graduate class, 0.76 for the Dropout class, and 0.68 for the Enrolled class. These scores represent the SVM model's ability to accurately classify instances from each respective class. To further improve the SVM model's performance, the Optuna framework was employed to optimize its hyperparameters. Optuna conducts a search for the best combination of hyperparameters to enhance the model’s performance. The results reflect the F1 scores after applying Optuna to the SVM model. However, the F1 scores for the SVM model with Optuna remain the same as the initial SVM model. The F1 score for the Graduate class is 0.79, 0.76 for the Dropout class, and 0.68 for the Enrolled class. This indicates that the hyperparameter optimization process did not have a substantial impact on the model's performance in this case.

5.1.2 Boosting algorithms

This section presents the results obtained by applying four boosting algorithms, namely Gradient Boosting (GB), Extreme Gradient Boosting (XGBoost), CatBoost (CB), and LightGBM (LB), to predict student dropout and failure in our case study. Additionally, the section includes the results of hyperparameter optimization using Optuna. These four algorithms belong to the family of gradient boosting methods: they share the common purpose of improving predictive performance by combining weak learners through an ensemble approach, and each has introduced distinct optimizations to enhance its performance and usability.
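
Before turning to each algorithm individually, the following hedged sketch shows how the four boosting models can be trained and compared on a common split; the xgboost, catboost, and lightgbm packages are assumed to be installed, synthetic data stand in for the real records, and default parameters are used rather than the tuned values reported below.

```python
# Train the four boosting classifiers on the same split and compare weighted F1.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=3000, n_features=20, n_informative=10,
                           n_classes=3, weights=[0.50, 0.32, 0.18], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "GB": GradientBoostingClassifier(random_state=0),
    "XGBoost": XGBClassifier(random_state=0),
    "CatBoost": CatBoostClassifier(verbose=0, random_state=0),
    "LightGBM": LGBMClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, round(f1_score(y_te, model.predict(X_te), average="weighted"), 3))
```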

a) Gradient boosting (GB)

Gradient Boosting (GB) serves as the foundational algorithm in this group of gradient boosting algorithms. It constructs an ensemble of decision trees sequentially, with each tree designed to rectify the errors of the preceding one. During training, it minimizes a loss function by adjusting the weights of data points.

Table 12 displays the F1 scores obtained from the Gradient Boosting (GB) model applied to three different classes: Graduate, Dropout, and Enrolled. Additionally, it includes the results after applying the Optuna framework to optimize the Gradient Boosting model’s hyperparameters. The initial Gradient Boosting model achieved an F1 score of 0.83 for the Graduate class, 0.85 for the Dropout class, and 0.76 for the Enrolled class. These scores represent the GB model's ability to accurately classify instances from each respective class.

Table 12 F1-Scores for gradient boosting with and without Optuna

To further improve the GB model’s performance, the Optuna framework was employed to optimize its hyperparameters. However, the F1 scores for the Gradient Boosting model with Optuna remain the same as the initial Gradient Boosting model. This suggests that the hyperparameter optimization process did not have a substantial impact on the model's performance in this case.

b) Extreme gradient boosting (XGBoost)

XGBoost represents an optimized and highly efficient implementation of gradient boosting, featuring several enhancements such as regularization, parallel processing, and a tree-pruning algorithm. These improvements contribute to its speed and often result in higher predictive accuracy compared to traditional gradient boosting.

Table 13 displays the F1 scores obtained from the Extreme Gradient Boosting (XGBoost) model applied to three different classes: Graduate, Dropout, and Enrolled. Additionally, the table includes the results after applying the Optuna framework to optimize the XGBoost model's hyperparameters. The initial XGBoost model achieved an F1 score of 0.86 for the Graduate class, 0.88 for the Dropout class, and 0.83 for the Enrolled class. These scores represent the XGBoost model's ability to accurately classify instances from each respective class.

Table 13 F1-Scores for extreme gradient Boosting with and without Optuna

To further improve the XGBoost model's performance, the Optuna framework was utilized to optimize its hyperparameters. Notably, the F1 scores for the XGBoost model with Optuna differ from the initial XGBoost model. The F1 score for the Graduate class is 0.81, 0.82 for the Dropout class, and 0.75 for the Enrolled class. In this scenario, the hyperparameter optimization process led to a slight decrease in the F1 scores for the XGBoost model.

c) CatBoost (CB)

CatBoost (CB) is another gradient boosting algorithm tailored for categorical feature support. It can automatically handle categorical variables without one-hot encoding. CB also incorporates robust handling of missing data and employs ordered boosting to enhance its performance.

Table 14 presents the F1 scores obtained from the CB model applied to three different classes: Graduate, Dropout, and Enrolled. It also includes the results after applying the Optuna framework to optimize the CB model’s hyperparameters. The initial CB model achieved an F1 score of 0.86 for the Graduate class, 0.87 for the Dropout class, and 0.82 for the Enrolled class. These scores reflect the CatBoost model's ability to accurately classify instances from each respective class.

Table 14 F1-Scores for CatBoost with and without Optuna

The results show that the F1 scores for the CB model with Optuna differ from the initial CatBoost model. The F1 score for the Graduate class is 0.86, 0.88 for the Dropout class, and 0.84 for the Enrolled class. This indicates that the hyperparameter optimization process had an impact on the model's performance, resulting in slightly higher F1 scores for the Dropout and Enrolled classes.

d) LightGBM (LB)

LightGBM (LB) is a machine learning algorithm prized for its remarkable speed and memory efficiency. This is achieved through its histogram-based tree-building approach and gradient-based one-sided sampling. LB proves invaluable for handling large datasets, where optimizing training times and resource utilization are critical factors.

Table 15 presents the F1 scores obtained from the LB model applied to three different classes: Graduate, Dropout, and Enrolled. The table also includes the results after applying the Optuna framework to optimize the LightGBM model’s hyperparameters. The initial LB model achieved an F1 score of 0.86 for the Graduate class, 0.87 for the Dropout class, and 0.83 for the Enrolled class. These scores reflect the LB model’s ability to accurately classify instances from each respective class.

Table 15 F1-Scores for LightGBM with and without Optuna

The F1 scores for the LB model with Optuna are nearly identical to those of the initial LB model: 0.86 for the Graduate class, 0.88 for the Dropout class, and 0.83 for the Enrolled class. This indicates that the hyperparameter optimization process had only a marginal impact on the model's performance in this case.

5.2 Unsupervised outlier detection with isolation forest

Isolation Forest (IF) is an anomaly detection algorithm frequently employed in machine learning and data mining to identify outliers or anomalies within a dataset. Its fundamental concept revolves around the isolation of anomalies by randomly selecting features and partitioning the data into subsets until the anomalies are singled out.

In our case study, we applied the Isolation Forest (IF) algorithm to assess its efficacy in outlier detection and classification within a dataset featuring three classes: Dropout, Enrolled, and Graduate. Our objective was to ascertain whether the minority classes, namely Dropout and Enrolled, would be identified as outliers. To evaluate the IF algorithm, we utilized the SMOTE test dataset, which consisted of 1325 instances. The application of the algorithm involved two stages: the first stage focuses on the overall results with the contamination parameter left at its default value, and in the second stage a contamination rate of 0.5 was applied to further evaluate the algorithm’s performance.
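
A minimal sketch of this two-stage Isolation Forest run is given below; the synthetic feature matrix stands in for the SMOTE test split, and the contamination settings mirror the default and the 0.5 value used in the second stage.

```python
# Unsupervised outlier detection with Isolation Forest, run at two contamination levels.
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest

# Stand-in for the 1325-instance test split used in the study.
X_test, _ = make_classification(n_samples=1325, n_features=20, n_informative=10,
                                n_classes=3, random_state=0)

for contamination in ("auto", 0.5):        # stage 1: default; stage 2: contamination = 0.5
    iso = IsolationForest(contamination=contamination, random_state=0).fit(X_test)
    labels = iso.predict(X_test)           # -1 = outlier, +1 = inlier
    print(contamination, "outliers:", int((labels == -1).sum()))
```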

In the first stage, the following results were obtained:

1. Outlier detection:

Total number of outliers: The IF algorithm successfully identified 197 instances as outliers, representing data points that significantly deviated from the majority of the dataset. The outliers account for approximately 14.86% of the total dataset.

2. Minority classes:

Dropout Prediction and Enrolled Prediction: The IF algorithm predicted instances from the minority classes a total of 129 times. This highlights the algorithm's ability to identify and classify instances within these minority classes as outliers. These instances constitute approximately 9.73% of the total dataset.

3. Graduates:

Accurate Classification: The IF algorithm identified and classified instances as graduates 68 times. These instances account for approximately 5.13% of the total dataset.

In the second stage, the following results were obtained:

1. Anomalies identified:

Dropout Prediction: The IF algorithm identified 357 instances of dropouts as anomalies, which represents 26.94%.

Enrolled Prediction: the IF algorithm identified 155 instances of enrolled students as anomalies, which represents 11.69%.

2. Misclassifications:

The IF algorithm classified 181 instances as anomalies when they were actually graduates, which represents 13.66%. These indicate instances where the algorithm considered them to be anomalous, but in reality, they belong to the graduate class.

The performance of the IF algorithm in outlier detection and classification was promising. The algorithm showed effectiveness in identifying most of the minority classes as outliers. However, further analysis and refinement may be necessary to address the misclassification of graduates as anomalies while maintaining high accuracy in identifying other anomalies within the dataset.

5.3 SHAP method application

In order to gain a deeper understanding of the contribution of each feature in predicting the model’s output, we employed the SHAP (“SHapley Additive exPlanations”) method. This method provides insights into the importance and influence of different features on the model’s predictions.

We applied the SHAP method in conjunction with two algorithms: LightGBM (LB) and CatBoost. By leveraging SHAP with the LB algorithm (Fig. 5) and the CatBoost algorithm (Fig. 6), we obtained visual representations of the top ten variables and their respective importance scores. The outputs from both the LB and CatBoost algorithms demonstrated similar results, with the same variables consistently appearing at the top in terms of their contribution to the model’s predictions. This consistency indicates the robustness and reliability of these variables in influencing the model's output across different algorithms.

Fig. 5 SHAP Method using LightGBM (LB)

Fig. 6 SHAP Method using CatBoost (CB)

By examining the top ten variables identified through SHAP analysis, we can gain valuable insights into the factors that have the most significant impact on the model’s predictions. These insights can guide us in understanding the underlying patterns and relationships within the dataset and help us make informed decisions related to feature selection, model interpretation, and potential areas for further investigation.
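
A sketch of the SHAP computation behind Figs. 5 and 6 is shown below, assuming the shap package and a LightGBM model fitted on synthetic stand-in data; a CatBoost model can be passed to TreeExplainer in the same way.

```python
# SHAP feature-importance analysis for a fitted gradient-boosting model.
import shap
from sklearn.datasets import make_classification
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
model = LGBMClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)      # also accepts CatBoost models
shap_values = explainer.shap_values(X)
# Depending on the shap version, multiclass output is a list of per-class matrices
# or a single array with the class axis last; select one class either way.
class_values = shap_values[1] if isinstance(shap_values, list) else shap_values[:, :, 1]
shap.summary_plot(class_values, X, max_display=10)  # top ten features, as in Figs. 5 and 6
```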

6 Discussion: performance evaluation and model comparison in predicting student outcomes

Table 16 summarizes the performance of the various supervised machine learning algorithms in the study. The models were evaluated based on their F1 scores for three target classes: Graduate, Dropout, and Enrolled. Boosting algorithms, specifically LightGBM (LB) and CatBoost (CB) with Optuna, consistently outperformed traditional classification algorithms in predicting student outcomes accurately, as evidenced by their higher F1 scores across the target classes. The success of boosting algorithms can be attributed to their ability to handle class imbalance effectively. Imbalanced class distributions are common in educational datasets, where minority classes, such as dropouts, require careful attention. Boosting algorithms, by nature, have mechanisms that address class imbalance issues and focus on learning from the minority classes, resulting in improved predictive performance.

Table 16 Comparative performance of supervised machine learning algorithms

To assess the discriminative capabilities of our top-performing models, LightGBM (LB) and CatBoost (CB), we employ ROC (Receiver Operating Characteristic) and AUC (Area Under the Curve) curves. The AUC value is a metric that ranges from 0 to 1, with higher values indicating superior model performance in classification tasks. In a broader context, an AUC value of 0.5 suggests random guessing (no discrimination), meaning the model struggles to distinguish between positive and negative cases. An AUC in the range of 0.7 to 0.8 is generally considered acceptable, while an AUC between 0.8 and 0.9 is regarded as excellent. Moreover, an AUC exceeding 0.9 is classified as outstanding, as indicated by Hosmer et al. [24].

Figure 7 (ROC Curve for Catboost—one-vs-one multiclass) and Fig. 8 (ROC Curve for LightGBM—one-vs-one multiclass) graphically represent the performance of CatBoost and LightGBM in a multi-class classification task. In such a scenario, ROC curves and AUC values are calculated for each class against every other class, providing insight into the model's ability to differentiate one class from the rest.
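
The one-vs-one multi-class AUC summarized in Figs. 7 and 8 can be computed from predicted class probabilities as in the following sketch; the data and model here are illustrative stand-ins.

```python
# One-vs-one multi-class AUC from predicted class probabilities.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=3000, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

model = LGBMClassifier(random_state=0).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)
# 'ovo' averages the AUC over every pairwise class combination.
print("one-vs-one macro AUC:", roc_auc_score(y_te, proba, multi_class="ovo", average="macro"))
```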

Fig. 7 ROC Curve for CatBoost (one-vs-one multiclass)

Fig. 8 ROC Curve for LightGBM (one-vs-one multiclass)

Our findings, depicted in Figs. 7 and 8, reveal an AUC exceeding 0.9. The position of the ROC curve on the graph reflects the model's performance. When the curve resides close to the top-left corner, it signifies that the model boasts a high true positive rate and a low false positive rate across various decision thresholds, a highly desirable trait. An AUC surpassing 0.9 is typically considered outstanding, indicating exceptional classification performance. This result underscores the exceptional discriminative prowess of these models, demonstrating their proficiency in distinguishing between different classes or groups within the dataset and affirming their reliability for the intended task.

To further evaluate potential significant differences between the results obtained using the LightGBM and CatBoost algorithms, we employ the Statistical Tests for Algorithms Comparison (STAC) platform. This method serves to determine whether the disparity between the algorithm yielding the best results and the other algorithms is statistically meaningful.

Based on the collected data, the STAC platform generates a decision tree to identify the most appropriate statistical test for the user-provided data. As presented in Fig. 9, the STAC assistant recommended a specific statistical test: the paired samples t-test.

Fig. 9 STAC assistant decision tree

The paired sample t-test is employed to assess whether the mean difference between two sets of observations is statistically different from zero. In this type of t-test, each subject or entity is measured twice, yielding pairs of observations. In our specific case, we utilize the paired sample t-test to compare the F1-scores of the LightGBM and CatBoost algorithms, which are the top-performing algorithms in our models. Our objective is to determine whether there exists a statistically significant difference between their results or whether one outperforms the other.
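
A minimal sketch of the paired samples t-test on matched F1 scores, using scipy, is given below; the score vectors are placeholders rather than the values reported in this study.

```python
# Paired samples t-test comparing matched F1 scores of two algorithms.
from scipy.stats import ttest_rel

# Placeholder paired observations; in practice these are matched F1 scores
# for LightGBM and CatBoost on the same evaluation splits.
f1_lightgbm = [0.86, 0.88, 0.83, 0.85, 0.87]
f1_catboost = [0.86, 0.88, 0.84, 0.84, 0.86]

t_stat, p_value = ttest_rel(f1_lightgbm, f1_catboost)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # compare p against the 0.05 significance level
```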

The outcome of this analysis yielded a p-value of 0.003, using a significance level of 0.05. This indicates that the null hypothesis is accepted, implying that the f1-scores of the LightGBM and Catboost algorithms have statistically identical average values. Consequently, we can confidently conclude that there is no substantial difference between the f1-scores of these two models.

7 Conclusion

This article presents a comprehensive study comparing supervised and unsupervised machine learning algorithms for predicting student dropout and academic success, with a primary focus on addressing class imbalance through SMOTE resampling. Notably, SMOTE significantly improved model accuracy. The study highlighted the exceptional performance of boosting algorithms, particularly LightGBM and CatBoost, surpassing traditional classification methods. However, these boosting algorithms often lack interpretability, which led us to employ the SHAP method for insightful feature analysis. We also explored the Isolation Forest algorithm for outlier detection, yielding promising results, although further refinement is needed to avoid misclassification.

These findings offer valuable guidance to researchers and practitioners in selecting suitable approaches for predicting student outcomes, optimizing model performance, and designing targeted interventions for at-risk students. Future research could explore alternative ensemble methods and advanced optimization techniques to enhance predictive model accuracy and interpretability in educational contexts.

As presented in our Systematic Literature Review (SLR) and corroborated in the present paper, boosting algorithms consistently demonstrate their superiority in predicting various student performance aspects across diverse educational settings. Their consistent outperformance of traditional methods like logistic regression and decision trees underscores their potential as valuable tools for advancing educational research and enhancing student outcomes.

In the discussion section, we employ the ROC and AUC curves for the top-performing algorithms, LightGBM (LB) and CatBoost (CB). AUC values exceeded 0.9, which indicated outstanding classification performance, confirming these models' proficiency in distinguishing between different classes within the dataset. To assess differences between LightGBM and CatBoost, we employ the Statistical Tests for Algorithms Comparison (STAC), which shows that there is no substantial difference in their predictive performance.

It is important to acknowledge that our study has limitations, primarily its reliance on a single dataset, which may limit generalizability to other contexts. Future research avenues could explore hybrid ensemble methods, combining boosting algorithms with traditional classifiers, and investigate advanced optimization techniques beyond Optuna for hyperparameter fine-tuning to unlock further performance improvements.