Abstract
Data pre-processing is the first step in any data mining process, and it is one of the most important yet least studied tasks in educational data mining research. Pre-processing transforms the available raw educational data into a format suitable for use by a data mining algorithm to solve a specific educational problem. However, most authors rarely describe this important step, and only a few works focus on the pre-processing of data. To address the lack of specific references on this topic, this paper surveys the task of preparing educational data. First, it describes different types of educational environments and the data they provide. Then, it presents the main tasks and issues in the pre-processing of educational data, using mainly Moodle data in the examples. Next, it describes some general and specific pre-processing tools. Finally, it outlines some conclusions and future research lines.
Notes
1. The full name column covers the identification of subjects.
2. The student column covers the identification of subjects.
Abbreviations
- AIHS: Adaptive and intelligent hypermedia system
- ARFF: Attribute-Relation File Format
- CBE: Computer-based education
- CSV: Comma-separated values
- DM: Data mining
- EDM: Educational data mining
- HTML: Hypertext Markup Language
- ID: Identifier
- IP: Internet Protocol
- ITS: Intelligent tutoring system
- KDD: Knowledge discovery in databases
- LMS: Learning management system
- MCQ: Multiple choice question
- MIS: Management information system
- MOOC: Massive Open Online Course
- OLAP: Online Analytical Processing
- SQL: Structured Query Language
- WUM: Web Usage Mining
- WWW: World Wide Web
- XML: Extensible Markup Language
Acknowledgments
This research is supported by projects of the Regional Government of Andalucía and the Ministry of Science and Technology, P08-TIC-3720 and TIN-2011-22408, respectively, and FEDER funds.
© 2014 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Romero, C., Romero, J.R., Ventura, S. (2014). A Survey on Pre-Processing Educational Data. In: Peña-Ayala, A. (eds) Educational Data Mining. Studies in Computational Intelligence, vol 524. Springer, Cham. https://doi.org/10.1007/978-3-319-02738-8_2
DOI: https://doi.org/10.1007/978-3-319-02738-8_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-02737-1
Online ISBN: 978-3-319-02738-8
eBook Packages: Engineering, Engineering (R0)