Abstract
Data quality is essential for database integration, machine learning and data science in general. Despite the increasing number of tools for data preparation, the most tedious tasks of data wrangling –and feature manipulation in particular– still resist automation partly because the problem strongly depends on domain information. For instance, if the strings “17th of August of 2017” and “2017-08-17” are to be formatted into “08/17/2017” to be properly recognised by a data analytics tool, humans usually process this in two steps: (1) they recognise that this is about dates and (2) they apply conversions that are specific to the date domain. However, the mechanisms to manipulate dates are very different from those to manipulate addresses. This requires huge amounts of background knowledge, which usually becomes a bottleneck as the diversity of domains and formats increases. In this paper we help alleviate this problem by using inductive programming (IP) with a dynamic background knowledge (BK) fuelled by a machine learning meta-model that selects the domain, the primitives (or both) from several descriptive features of the data wrangling problem. We illustrate these new alternatives for the automation of data format transformation, which we evaluate on an integrated benchmark and code for data wrangling, which we share publicly for the community.
This research was supported by the EU (FEDER) and the Spanish MINECO RTI2018-094403-B-C32, and the Generalitat Valenciana PROMETEO/2019/098. L. Contreras-Ochando was also supported by the Spanish MECD (FPU15/03219). J. Hernández-Orallo is also funded by FLI (RFP2-152). F. Martínez-Plumed was also supported by INCIBE (Ayudas para la excelencia de los equipos de investigación avanzada en ciberseguridad), the European Commission (JRC) HUMAINT project (CT-EX2018D335821-101), and UPV (Primeros Proyectos de lnvestigación PAID-06-18).
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
RapidMiner Studio - Feature List: https://goo.gl/oYypMh.
- 2.
Trifacta Wrangler - Wrangle Language: https://goo.gl/pJHSFw.
- 3.
Excel - Data types in Data Models: https://goo.gl/uWnbZh.
- 4.
Trifacta Documentation - Supported Data Types: https://goo.gl/pV1owi.
- 5.
We observed that the maximum number of functions needed to solve the most complex problem collected in our benchmark is \(k=12\).
- 6.
An application example of our system can be seen on: https://www.youtube.com/watch?v=wxFhXYyonOw.
- 7.
TDE Benchmark: https://github.com/Yeye-He/Transform-Data-by-Example.
References
Bhupatiraju, S., Singh, R., Mohamed, A.r., Kohli, P.: Deep API programmer: learning to program with APIs. arXiv preprint arXiv:1704.04327 (2017)
Contreras-Ochando, L.: DataWrangling-DSI: BETA - Extended Results (2019). https://doi.org/10.5281/zenodo.2557385
Contreras-Ochando, L., Ferri, C., Hernández-Orallo, J., Martínez-Plumed, F., Ramírez-Quintana, M.J., Katayama, S.: General-purpose declarative inductive programming with domain-specific background knowledge for data wrangling automation. arXiv preprint arXiv:1809.10054 (2018)
Cropper, A., Tamaddoni-Nezhad, A., Muggleton, S.H.: Meta-interpretive learning of data transformation programs. In: Inoue, K., Ohwada, H., Yamamoto, A. (eds.) ILP 2015. LNCS (LNAI), vol. 9575, pp. 46–59. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-40566-7_4
Devlin, J., Bunel, R.R., Singh, R., Hausknecht, M., Kohli, P.: Neural program meta-induction. In: NIPS, pp. 2077–2085 (2017)
Ferri-Ramírez, C., Hernández-Orallo, J., Ramírez-Quintana, M.J.: Incremental learning of functional logic programs. In: Kuchen, H., Ueda, K. (eds.) FLOPS 2001. LNCS, vol. 2024, pp. 233–247. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44716-4_15
Flener, P., Schmid, U.: An introduction to inductive programming. Artif. Intell. Rev. 29(1), 45–62 (2008)
Gulwani, S.: Automating string processing in spreadsheets using input-output examples. In: Proceedings of the 38th Principles of Programming Languages, pp. 317–330 (2011)
Gulwani, S., Harris, W.R., Singh, R.: Spreadsheet data manipulation using examples. Commun. ACM 55(8), 97–105 (2012)
Gulwani, S., Hernandez-Orallo, J., Kitzelmann, E., Muggleton, S.H., Schmid, U., Zorn, B.: Inductive programming meets the real world. Commun. ACM 58(11), 90–99 (2015)
He, Y., Chu, X., Ganjam, K., Zheng, Y., Narasayya, V., Chaudhuri, S.: Transform-data-by-example (TDE): an extensible search engine for data transformations. Proc. VLDB Endow. 11(10), 1165–1177 (2018)
Henderson, R.: Incremental learning in inductive programming. In: Schmid, U., Kitzelmann, E., Plasmeijer, R. (eds.) AAIP 2009. LNCS, vol. 5812, pp. 74–92. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-11931-6_4
Kandel, S., Paepcke, A., Hellerstein, J., Heer, J.: Wrangler: interactive visual specification of data transformation scripts. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 3363–3372. ACM (2011)
Kandel, S., et al.: Research directions in data wrangling: visualizations and transformations for usable and credible data. Inf. Vis. 10(4), 271–288 (2011)
Katayama, S.: An analytical inductive functional programming system that avoids unintended programs. In: Proceedings of the ACM SIGPLAN 2012 Workshop on Partial Evaluation and Program Manipulation PEPM, pp. 43–52. ACM (2012)
Kietz, J.U., Wrobel, S.: Controlling the complexity of learning in logic through syntactic and task-oriented models. In: Inductive Logic Programming. Citeseer (1992)
Menon, A., Tamuz, O., Gulwani, S., Lampson, B., Kalai, A.: A machine learning framework for programming by example. In: ICML, pp. 187–195 (2013)
Mitchell, T., et al.: Never-ending learning. Commun. ACM 61(5), 103–115 (2018)
Mitchell, T.M.: The need for biases in learning generalizations. Rutgers Univ., New Jersey (1980)
Mitchell, T.M., et al.: Theo: a framework for self-improving systems. In: Architectures for Intelligence: The Twenty-Second Carnegie Mellon Symposium on Congnition, pp. 323–355 (1991)
Parisotto, E., Mohamed, A.r., Singh, R., Li, L., Zhou, D., Kohli, P.: Neuro-symbolic program synthesis. arXiv preprint arXiv:1611.01855 (2016)
Shu, C., Zhang, H.: Neural programming by example. In: AAAI, pp. 1539–1545 (2017)
Singh, R., Gulwani, S.: Predicting a correct program in programming by example. In: Kroening, D., Păsăreanu, C.S. (eds.) CAV 2015. LNCS, vol. 9206, pp. 398–414. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-21690-4_23
Singh, R., Gulwani, S.: Transforming spreadsheet data types using examples. In: Proceedings of the 43rd Principles of Programming Languages, pp. 343–356 (2016)
Srinivasan, A., King, R.D., Bain, M.E.: An empirical study of the use of relevance information in inductive logic programming. JMLR 4, 369–383 (2003)
Wu, B., Szekely, P., Knoblock, C.A.: Learning data transformation rules through examples: preliminary results. In: Information Integration on the Web, p. 8 (2012)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Contreras-Ochando, L., Ferri, C., Hernández-Orallo, J., Martínez-Plumed, F., Ramírez-Quintana, M.J., Katayama, S. (2020). Automated Data Transformation with Inductive Programming and Dynamic Background Knowledge. In: Brefeld, U., Fromont, E., Hotho, A., Knobbe, A., Maathuis, M., Robardet, C. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Lecture Notes in Computer Science(), vol 11908. Springer, Cham. https://doi.org/10.1007/978-3-030-46133-1_44
Download citation
DOI: https://doi.org/10.1007/978-3-030-46133-1_44
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46132-4
Online ISBN: 978-3-030-46133-1
eBook Packages: Computer ScienceComputer Science (R0)