Predicting students’ academic performance by mining the educational data through machine learning-based classification model

Education and Information Technologies

Abstract

Predicting students’ academic performance is one of the most important applications of Educational Data Mining (EDM) and helps to improve the quality of the education process. In an Outcome-based Education (OBE) system, measuring the attainment of student outcomes is valuable because it enables corrective measures in the learning process. Furthermore, the rapid growth of e-learning platforms generates a large volume of data from which useful information must be extracted using up-to-date techniques. With this in mind, and to examine the impact of various features on student outcomes during online classes, we analyzed two datasets: the Kalboard 360 dataset (the larger of the two), which contains academic, demographic, and behavioral features observed and recorded during classes, and a local institute dataset that does not include behavioral features. We selected several machine learning algorithms, namely Decision Tree (J48), Naïve Bayes (NB), Random Forest (RF), and Multilayer Perceptron (MLP), to classify the students, and applied filter-based feature selection methods (information gain, gain ratio, and correlation-based selection) to select the key attributes. Finally, we fine-tuned the learning parameters of the MLP, yielding a model called “Opt-MLP”, and compared it with the other classification models. Our experimental results show that Opt-MLP outperforms the other classification models, achieving an accuracy of 87.14% without feature selection (WOFS) and 90.74% with feature selection (WFS) on dataset 1, and 79.37% without feature selection and 97.08% with feature selection on dataset 2. However, when the students’ behavioral features are considered along with the other features, the RF model provides 100% accuracy, indicating that students’ behavior during class hours has a great impact on the attainment of student outcomes.
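To illustrate two of the filter-based feature selection methods named in the abstract, the following is a minimal sketch of information gain and gain ratio for categorical features. It is not the authors' implementation; the function names and the toy feature values (`absences`, `gender`, `outcome`) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def info_gain(feature, labels):
    """Information gain: H(labels) minus H(labels | feature)."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    """Information gain normalized by the entropy of the feature itself."""
    split_info = entropy(feature)
    return info_gain(feature, labels) / split_info if split_info > 0 else 0.0


# Hypothetical toy data: six students, two candidate features, one class label.
absences = ["low", "low", "high", "high", "low", "high"]
gender = ["F", "M", "F", "M", "M", "F"]
outcome = ["pass", "pass", "fail", "fail", "pass", "fail"]

# Rank the features by information gain, highest first.
scores = {
    "absences": info_gain(absences, outcome),
    "gender": info_gain(gender, outcome),
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy example `absences` separates the two outcomes perfectly and therefore ranks above `gender`; a filter method keeps the top-ranked attributes before any classifier is trained.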

Data availability

1. The datasets generated and analysed during the current study are available in the Kaggle repository. https://www.kaggle.com/datasets/shaikvaheed91/kalboard360

2. The datasets generated and analysed during the current study are available in the Kaggle repository under the second author's name. https://www.kaggle.com/datasets/shaikvaheed91/griet2021

3. The datasets generated and analysed during the current study are available in the GitHub repository under the second author's name. https://github.com/vaheed4274/Student-Perforamance-Analyzer

Author information

Corresponding author

Correspondence to Padmalaya Nayak.

Ethics declarations

Conflict of interest

None.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

  • Confusion matrix: It summarizes the overall performance of the model in matrix form by tabulating the actual values against the predicted values, as shown in Table 1.

    Example of a Confusion Matrix

    • True positive (TP): the predicted value equals the actual value, and both are positive.

    • True negative (TN): the predicted value equals the actual value, and both are negative.

    • False positive (FP): the predicted value differs from the actual value; the model predicts a positive value, but the actual value is negative.

    • False negative (FN): the predicted value is negative, but the actual value is positive.

  • Accuracy: The ratio of the number of correctly predicted instances to the total number of instances. It is given by

$$\text{Accuracy}=\frac{\text{Sum of the diagonal of the confusion matrix}}{\text{Sum of all entries of the confusion matrix}}$$
  • Precision: For a class $\text{C}_\text{k}$, the ratio of correctly classified positive instances to all instances predicted as positive.

$$\text{Precision}\left(\text{C}_\text{k}\right)=\frac{\text{TP}}{\text{TP}+\text{FP}}$$
  • Recall: For a class $\text{C}_\text{k}$, the ratio of correctly predicted positive instances to all instances that actually belong to the class.

$$\text{Recall}\left(\text{C}_\text{k}\right)=\frac{\text{TP}}{\text{TP}+\text{FN}}$$
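The confusion-matrix metrics defined above (accuracy, precision, and recall) can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the class names `Low`, `Medium`, and `High` and the sample predictions are hypothetical, and the matrix is assumed to have actual classes as rows and predicted classes as columns.

```python
def confusion_matrix(actual, predicted, classes):
    """Build a matrix with actual classes as rows and predicted classes as columns."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Sum of the diagonal divided by the sum of all entries."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def precision(matrix, k):
    """TP / (TP + FP) for class k: diagonal entry over its column sum."""
    predicted_k = sum(row[k] for row in matrix)
    return matrix[k][k] / predicted_k if predicted_k else 0.0


def recall(matrix, k):
    """TP / (TP + FN) for class k: diagonal entry over its row sum."""
    actual_k = sum(matrix[k])
    return matrix[k][k] / actual_k if actual_k else 0.0


# Hypothetical labels for eight students.
classes = ["Low", "Medium", "High"]
actual = ["Low", "Low", "Medium", "Medium", "Medium", "High", "High", "High"]
predicted = ["Low", "Medium", "Medium", "Medium", "High", "High", "High", "Low"]

cm = confusion_matrix(actual, predicted, classes)
print(cm)
print(accuracy(cm), precision(cm, 1), recall(cm, 1))
```

For a multi-class problem, the TP, FP, and FN counts of a class are read off one row and one column of the matrix, which is why precision uses the column sum and recall uses the row sum.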
  • Root mean square error (RMSE): The square root of the mean of the squared differences between the actual values and the predicted values, where $N$ is the total number of predictions.

$$\text{RMSE}=\sqrt{\frac{\sum_{i=1}^{N}\left({\text{actual}}_{i}-{\text{predicted}}_{i}\right)^2}{N}}$$
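As a small worked sketch of the RMSE definition above, the function below squares each error, averages the squares over all predictions, and takes the square root. The sample values are hypothetical and are not taken from the paper's experiments.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length numeric sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical actual class indicators and predicted probabilities.
actual_values = [1.0, 0.0, 1.0, 1.0]
predicted_values = [0.9, 0.2, 0.6, 1.0]

print(rmse(actual_values, predicted_values))
```

Here the squared errors are 0.01, 0.04, 0.16, and 0, so the mean is 0.0525 and the RMSE is its square root.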
  • ROC area: The Receiver Operating Characteristic (ROC) curve is one of the most widely used tools for evaluating an ML model. It plots the true positive rate (TPR) against the false positive rate (FPR) as the decision threshold varies. The area under this curve is called the ROC area.
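The ROC area described above can be approximated by sweeping a decision threshold over the predicted scores, recording one (FPR, TPR) point per threshold, and integrating with the trapezoidal rule. The sketch below assumes a binary problem with labels 0 and 1 and at least one instance of each class; the labels and scores are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, one per distinct score threshold, starting at (0, 0)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def roc_area(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical binary labels (1 = positive class) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
print(curve)
print(roc_area(curve))
```

An area of 1.0 corresponds to a classifier that ranks every positive instance above every negative one, while 0.5 corresponds to random ranking. In the example, eight of the nine positive-negative pairs are ranked correctly, giving an area of 8/9.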

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Nayak, P., Vaheed, S., Gupta, S. et al. Predicting students’ academic performance by mining the educational data through machine learning-based classification model. Educ Inf Technol 28, 14611–14637 (2023). https://doi.org/10.1007/s10639-023-11706-8

  • DOI: https://doi.org/10.1007/s10639-023-11706-8
