Automated Data Transformation with Inductive Programming and Dynamic Background Knowledge

Contreras-Ochando, Lidia; Ferri, Cèsar; Hernández-Orallo, José; Martínez-Plumed, Fernando; Ramírez-Quintana, María José; Katayama, Susumu

doi:10.1007/978-3-030-46133-1_44

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11908)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

Data quality is essential for database integration, machine learning and data science in general. Despite the increasing number of tools for data preparation, the most tedious tasks of data wrangling, and feature manipulation in particular, still resist automation, partly because the problem strongly depends on domain information. For instance, if the strings “17th of August of 2017” and “2017-08-17” are to be formatted into “08/17/2017” so that a data analytics tool recognises them properly, humans usually proceed in two steps: (1) they recognise that the strings are dates and (2) they apply conversions that are specific to the date domain. However, the mechanisms for manipulating dates are very different from those for manipulating addresses. Covering many domains therefore requires huge amounts of background knowledge, which usually becomes a bottleneck as the diversity of domains and formats increases. In this paper we help alleviate this problem by using inductive programming (IP) with a dynamic background knowledge (BK) fuelled by a machine learning meta-model that selects the domain, the primitives, or both, from several descriptive features of the data wrangling problem. We illustrate these new alternatives for the automation of data format transformation and evaluate them on an integrated benchmark for data wrangling, which we share publicly with the community together with the code.
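
The two-step idea in the abstract (recognise the domain, then search only among that domain's primitives) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' system: the rule-based `detect_domain` stands in for the learned meta-model, the brute-force enumeration in `induce` stands in for a real inductive programming engine, and every function and primitive name here is hypothetical.

```python
import re
from datetime import date
from itertools import product

MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

ISO = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")
VERBOSE = re.compile(r"^(\d{1,2})(?:st|nd|rd|th) of ([A-Za-z]+) of (\d{4})$")


def detect_domain(text):
    """Step 1: guess the domain (assumption: fixed rules replace the meta-model)."""
    if ISO.match(text) or VERBOSE.match(text):
        return "date"
    return "unknown"


def parse_date(text):
    """Date primitive: parse either supported string format into a date."""
    m = ISO.match(text)
    if m:
        year, month, day = (int(g) for g in m.groups())
        return date(year, month, day)
    m = VERBOSE.match(text)
    if m:
        day, month_name, year = m.groups()
        return date(int(year), MONTHS[month_name.lower()], int(day))
    raise ValueError(f"not a recognised date: {text!r}")


def format_us(d):
    """Date primitive: render a date as MM/DD/YYYY."""
    return f"{d.month:02d}/{d.day:02d}/{d.year:04d}"


def format_iso(d):
    """Date primitive: render a date as YYYY-MM-DD."""
    return d.isoformat()


# Dynamic background knowledge in miniature: only the primitives of the
# detected domain are offered to the search.
BACKGROUND = {
    "date": {"parse_date": parse_date, "format_us": format_us,
             "format_iso": format_iso},
}


def run(names, bk, value):
    """Apply the named primitives left to right; None if any step fails."""
    try:
        for name in names:
            value = bk[name](value)
    except (TypeError, ValueError, AttributeError, KeyError):
        return None
    return value


def induce(examples, max_len=3):
    """Step 2: return the shortest primitive sequence consistent with all examples."""
    bk = BACKGROUND.get(detect_domain(examples[0][0]), {})
    for length in range(1, max_len + 1):
        for names in product(bk, repeat=length):
            if all(run(names, bk, x) == y for x, y in examples):
                return names
    return None


examples = [("17th of August of 2017", "08/17/2017"),
            ("2017-08-17", "08/17/2017")]
program = induce(examples)
print(program)  # ('parse_date', 'format_us')
```

Restricting the search to one domain's primitives keeps the number of candidate compositions small, which is the motivation for selecting the background knowledge dynamically rather than searching over every primitive of every domain.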

This research was supported by the EU (FEDER) and the Spanish MINECO RTI2018-094403-B-C32, and the Generalitat Valenciana PROMETEO/2019/098. L. Contreras-Ochando was also supported by the Spanish MECD (FPU15/03219). J. Hernández-Orallo is also funded by FLI (RFP2-152). F. Martínez-Plumed was also supported by INCIBE (Ayudas para la excelencia de los equipos de investigación avanzada en ciberseguridad), the European Commission (JRC) HUMAINT project (CT-EX2018D335821-101), and UPV (Primeros Proyectos de Investigación PAID-06-18).

Notes

1. RapidMiner Studio - Feature List: https://goo.gl/oYypMh.
2. Trifacta Wrangler - Wrangle Language: https://goo.gl/pJHSFw.
3. Excel - Data types in Data Models: https://goo.gl/uWnbZh.
4. Trifacta Documentation - Supported Data Types: https://goo.gl/pV1owi.
5. We observed that the maximum number of functions needed to solve the most complex problem collected in our benchmark is \(k=12\).
6. An application example of our system can be seen at: https://www.youtube.com/watch?v=wxFhXYyonOw.
7. TDE Benchmark: https://github.com/Yeye-He/Transform-Data-by-Example.

Author information

Authors and Affiliations

Valencian Research Institute for Artificial Intelligence (vrAIn), Universitat Politècnica de València, Valencia, Spain
Lidia Contreras-Ochando, Cèsar Ferri, José Hernández-Orallo, Fernando Martínez-Plumed & María José Ramírez-Quintana
University of Miyazaki, Miyazaki, Japan
Susumu Katayama

Corresponding author

Correspondence to Lidia Contreras-Ochando.

Editor information

Editors and Affiliations

Leuphana University, Lüneburg, Germany
Ulf Brefeld
IRISA/Inria, Rennes, France
Elisa Fromont
University of Würzburg, Würzburg, Germany
Andreas Hotho
Leiden University, Leiden, The Netherlands
Arno Knobbe
ETH Zurich, Zurich, Switzerland
Marloes Maathuis
Institut National des Sciences Appliquées, Villeurbanne, France
Céline Robardet

About this paper

Cite this paper

Contreras-Ochando, L., Ferri, C., Hernández-Orallo, J., Martínez-Plumed, F., Ramírez-Quintana, M.J., Katayama, S. (2020). Automated Data Transformation with Inductive Programming and Dynamic Background Knowledge. In: Brefeld, U., Fromont, E., Hotho, A., Knobbe, A., Maathuis, M., Robardet, C. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Lecture Notes in Computer Science, vol 11908. Springer, Cham. https://doi.org/10.1007/978-3-030-46133-1_44

DOI: https://doi.org/10.1007/978-3-030-46133-1_44
Published: 30 April 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-46132-4
Online ISBN: 978-3-030-46133-1
eBook Packages: Computer Science, Computer Science (R0)
