Automated Data Pre-processing via Meta-learning

  • Besim Bilalli
  • Alberto Abelló
  • Tomàs Aluja-Banet
  • Robert Wrembel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9893)


A data mining algorithm may perform differently on datasets with different characteristics; for example, it might perform better on a dataset with continuous attributes than on one with categorical attributes, or the other way around. In practice, a dataset usually needs to be pre-processed. Considering all the possible pre-processing operators, the number of alternatives is staggeringly large, and inexperienced users become overwhelmed. We show that this problem can be addressed by an automated approach that leverages ideas from meta-learning. Specifically, we consider a wide range of data pre-processing techniques and a set of data mining algorithms. For each data mining algorithm and selected dataset, we are able to predict the transformations that improve the result of the algorithm on that dataset. Our approach will help non-expert users identify the transformations appropriate to their applications more effectively, and hence achieve improved results.
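
The general idea of predicting useful transformations from past experience can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify the meta-features, the meta-learner, or the transformations, so everything below is an assumption made for illustration. The sketch assumes two hypothetical meta-features (fraction of continuous attributes, fraction of missing values), two hypothetical transformation names, a toy history of past outcomes for a single mining algorithm, and a plain nearest-neighbour vote as the meta-learner.

```python
from math import dist

# Hypothetical meta-dataset for ONE mining algorithm. Each record is
# (meta-features of a past dataset, transformation tried, did it improve?).
# Meta-features (assumed): (fraction of continuous attributes,
#                           fraction of missing values).
# All values are toy numbers, not results from the paper.
META_DATASET = [
    ((0.90, 0.00), "discretize", True),
    ((0.10, 0.00), "discretize", False),
    ((0.85, 0.05), "discretize", True),
    ((0.80, 0.30), "impute_missing", True),
    ((0.70, 0.01), "impute_missing", False),
    ((0.20, 0.25), "impute_missing", True),
]


def predict_improves(meta_features, transformation, k=1):
    """Majority vote of the k most similar past datasets for one transformation."""
    records = [
        (dist(meta_features, past_features), improved)
        for past_features, tried, improved in META_DATASET
        if tried == transformation
    ]
    if not records:
        raise ValueError(f"no history for transformation {transformation!r}")
    nearest = sorted(records, key=lambda record: record[0])[:k]
    votes = sum(1 for _, improved in nearest if improved)
    # Strict majority: ties count as "not predicted to improve".
    return votes * 2 > len(nearest)


def recommend(meta_features, k=1):
    """Return the transformations predicted to improve the algorithm's result."""
    transformations = sorted({tried for _, tried, _ in META_DATASET})
    return [t for t in transformations if predict_improves(meta_features, t, k)]


if __name__ == "__main__":
    # A new, mostly continuous dataset with few missing values.
    print(recommend((0.88, 0.02)))
    # A new, mostly categorical dataset with many missing values.
    print(recommend((0.15, 0.28)))
```

In a realistic setting the toy history would be replaced by measured outcomes of applying each transformation to many datasets, and the nearest-neighbour vote by whichever meta-learner performs best on that history.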


Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Besim Bilalli (1)
  • Alberto Abelló (1)
  • Tomàs Aluja-Banet (1)
  • Robert Wrembel (2)

  1. Universitat Politècnica de Catalunya, Barcelona, Spain
  2. Poznan University of Technology, Poznan, Poland
