On Semi-supervised Learning with Sparse Data Handling for Educational Data Classification

  • Vo Thi Ngoc Chau
  • Nguyen Hua Phung
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10646)


This paper investigates an educational data classification task at the program level. The task is to predict the final study status of each student from the second year to the fourth year of their study path, so that in-trouble students can be identified as early as possible. The task faces two main problems. The first is incomplete data, which arises whenever the prediction is made early; the second is the lack of labeled data for a supervised learning process. To overcome these difficulties, we propose a robust semi-supervised learning method with sparse data handling in either a sequential or an iterative approach. The sparse data handling process relies on k-nearest neighbors-based data imputation, and the semi-supervised learning process, with a random forest model as the base learner, exploits the larger set of unlabeled data available in the task. The two processes can be conducted in sequence or integrated into each other for robustness and effectiveness in educational data classification. Experimental results show that the resulting robust random forest-based self-training algorithm with the iterative approach to sparse data handling outperforms the other algorithms that use sequential and traditional approaches. The algorithm yields a more effective classifier as a practical solution for educational data over time.
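The sequential combination described above (impute first, then self-train) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: all function names, the toy data, the confidence heuristic, and the 0.8 threshold are hypothetical, and a nearest-centroid classifier stands in for the random forest base learner so that the sketch needs only the Python standard library. The iterative approach, whose details the abstract does not give, would interleave imputation with the self-training rounds instead of running it once up front.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that feature over the k nearest rows.
    Distance is the RMS difference over features observed in both rows."""
    def dist(a, b):
        sq = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
        return math.sqrt(sum(sq) / len(sq)) if sq else math.inf

    filled = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Donors are the other rows in which feature j is observed.
            donors = sorted(
                (dist(row, other), other[j])
                for t, other in enumerate(rows)
                if t != i and other[j] is not None
            )[:k]
            # Assumption: fall back to 0.0 when no row observes feature j.
            new_row[j] = sum(v for _, v in donors) / len(donors) if donors else 0.0
        filled.append(new_row)
    return filled

def fit_centroids(X, y):
    """Stand-in base learner: one mean vector per class."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {c: [sum(col) / len(col) for col in zip(*rs)] for c, rs in groups.items()}

def predict(centroids, row):
    """Return (label, confidence); the confidence is a simple distance-based heuristic."""
    d = {c: math.dist(row, mu) for c, mu in centroids.items()}
    label = min(d, key=d.get)
    total = sum(d.values())
    return label, (1.0 - d[label] / total if total > 0 else 1.0)

def self_train(X_lab, y_lab, X_unl, k=2, threshold=0.8, max_rounds=10):
    """Sequential variant: impute once over all rows, then self-train."""
    filled = knn_impute(list(X_lab) + list(X_unl), k)
    X, y = filled[:len(X_lab)], list(y_lab)
    pool = filled[len(X_lab):]
    for _ in range(max_rounds):
        model = fit_centroids(X, y)
        remaining = []
        for row in pool:
            label, conf = predict(model, row)
            if conf >= threshold:
                X.append(row)      # accept the confident pseudo-label
                y.append(label)
            else:
                remaining.append(row)
        if len(remaining) == len(pool):  # nothing was added this round
            break
        pool = remaining
    return fit_centroids(X, y), X, y

# Toy data (invented for illustration): two features, None marks a missing value.
X_lab = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [9.0, None]]
y_lab = ["A", "A", "B", "B"]
X_unl = [[0.5, None], [9.5, 9.0], [None, 0.5], [1.5, 0.5]]

model, X_all, y_all = self_train(X_lab, y_lab, X_unl)
print(y_all[len(y_lab):])  # pseudo-labels assigned to the unlabeled rows
```

Swapping in a real random forest would only require replacing `fit_centroids` and `predict` with a learner that reports a per-class confidence; the imputation step and the self-training loop would stay the same.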


Self-training; Random forest; Educational data classification; Sparse data; K-nearest neighbors



This research is funded by Vietnam National University Ho Chi Minh City, Vietnam, under grant number C2017-20-18.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Ho Chi Minh City University of Technology, Vietnam National University – HCMC, Ho Chi Minh City, Vietnam
