Abstract
A data mining algorithm may perform differently on datasets with different characteristics; for example, it might perform better on a dataset with continuous attributes than on one with categorical attributes, or the other way around. In practice, a dataset usually needs to be pre-processed. Taking into account all the possible pre-processing operators, the number of alternatives is staggeringly large, and inexperienced users are easily overwhelmed. We show that this problem can be addressed by an automated approach that leverages ideas from meta-learning. Specifically, we consider a wide range of data pre-processing techniques and a set of data mining algorithms. For each data mining algorithm and selected dataset, we are able to predict the transformations that improve the result of the algorithm on that dataset. Our approach will help non-expert users identify the transformations appropriate to their applications more effectively, and hence achieve improved results.
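The general idea of predicting useful transformations from dataset characteristics can be illustrated with a minimal sketch. This is not the authors' implementation: the three meta-features, the 1-nearest-neighbour meta-learner, the function names `meta_features` and `predict_improves`, and the contents of `meta_db` are all assumptions made up for illustration.

```python
from math import sqrt


def meta_features(rows):
    """Describe a dataset (a list of equal-length rows) with three simple meta-features:
    number of instances, number of attributes, and fraction of numeric attributes."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    # An attribute counts as numeric only if every value in its column is int or float.
    numeric_cols = sum(
        all(isinstance(row[j], (int, float)) for row in rows)
        for j in range(n_cols)
    )
    return (float(n_rows), float(n_cols), numeric_cols / n_cols)


def predict_improves(meta_db, query):
    """1-nearest-neighbour meta-learner: return the label stored for the
    meta-feature vector closest (Euclidean distance) to `query`."""
    def distance(a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    _, label = min(meta_db, key=lambda entry: distance(entry[0], query))
    return label


# Hypothetical meta-database for ONE (algorithm, transformation) pair.
# Each entry: (meta-features of a past dataset, did the transformation help?).
# These values are invented for illustration, not taken from the paper.
meta_db = [
    ((100.0, 4.0, 1.0), True),
    ((120.0, 5.0, 0.0), False),
]

# A new, fully numeric dataset for which we want a recommendation.
new_dataset = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [6.3, 3.3, 6.0, 2.5],
]

features = meta_features(new_dataset)
print(predict_improves(meta_db, features))
```

In this sketch the meta-features are left unscaled, so the instance count dominates the distance; a realistic system would normalise them, use many more dataset characteristics, and keep one such meta-model per algorithm and transformation.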
Acknowledgments
This research has been funded by the European Commission through the Erasmus Mundus Joint Doctorate “Information Technologies for Business Intelligence - Doctoral College” (IT4BI-DC).
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Bilalli, B., Abelló, A., Aluja-Banet, T., Wrembel, R. (2016). Automated Data Pre-processing via Meta-learning. In: Bellatreche, L., Pastor, Ó., Almendros Jiménez, J., Aït-Ameur, Y. (eds) Model and Data Engineering. MEDI 2016. Lecture Notes in Computer Science, vol 9893. Springer, Cham. https://doi.org/10.1007/978-3-319-45547-1_16
DOI: https://doi.org/10.1007/978-3-319-45547-1_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45546-4
Online ISBN: 978-3-319-45547-1
eBook Packages: Computer Science, Computer Science (R0)