Automated Data Transformation with Inductive Programming and Dynamic Background Knowledge

  • Conference paper
Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2019)

Abstract

Data quality is essential for database integration, machine learning and data science in general. Despite the increasing number of tools for data preparation, the most tedious tasks of data wrangling, and feature manipulation in particular, still resist automation, partly because the problem strongly depends on domain information. For instance, if the strings “17th of August of 2017” and “2017-08-17” are to be formatted into “08/17/2017” to be properly recognised by a data analytics tool, humans usually process this in two steps: (1) they recognise that the strings are dates and (2) they apply conversions that are specific to the date domain. However, the mechanisms to manipulate dates are very different from those to manipulate addresses. Covering many domains therefore requires huge amounts of background knowledge, which usually becomes a bottleneck as the diversity of domains and formats increases. In this paper we help alleviate this problem by using inductive programming (IP) with a dynamic background knowledge (BK) fuelled by a machine learning meta-model that selects the domain, the primitives, or both, from several descriptive features of the data wrangling problem. We illustrate these new alternatives for automating data format transformation and evaluate them on an integrated benchmark for data wrangling, which we share publicly, together with the code, with the community.
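The two-step process described in the abstract (recognise the domain, then apply domain-specific primitives) can be illustrated with a minimal Python sketch. It is not the system presented in the paper: the rule-based `select_domain` merely stands in for the learned meta-model, no inductive programming search is performed, and all names (`BACKGROUND_KNOWLEDGE`, `parse_iso`, `parse_long`, `format_mdy`, `transform`) are hypothetical and introduced only for illustration.

```python
import re

# Month names used by the hypothetical date primitives below.
MONTHS = {name: i + 1 for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"])}


def parse_iso(s):
    """Primitive: parse 'YYYY-MM-DD' into (year, month, day), else None."""
    m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", s)
    return (int(m[1]), int(m[2]), int(m[3])) if m else None


def parse_long(s):
    """Primitive: parse '17th of August of 2017' into (year, month, day)."""
    m = re.fullmatch(r"(\d{1,2})(?:st|nd|rd|th)? of ([A-Za-z]+) of (\d{4})", s)
    if not m or m[2].lower() not in MONTHS:
        return None
    return (int(m[3]), MONTHS[m[2].lower()], int(m[1]))


def format_mdy(ymd):
    """Primitive: render (year, month, day) as 'MM/DD/YYYY'."""
    year, month, day = ymd
    return f"{month:02d}/{day:02d}/{year:04d}"


# Illustrative background knowledge: one set of parsing primitives per domain.
BACKGROUND_KNOWLEDGE = {"date": [parse_iso, parse_long]}


def select_domain(s):
    """Crude rule standing in for the meta-model (an assumption of this sketch)."""
    has_year = re.search(r"\d{4}", s) is not None
    looks_iso = re.search(r"\d{4}-\d{2}-\d{2}", s) is not None
    has_month = any(name in s.lower() for name in MONTHS)
    return "date" if has_year and (looks_iso or has_month) else None


def transform(s):
    """Step 1: pick the domain. Step 2: try only that domain's primitives."""
    domain = select_domain(s)
    for parse in BACKGROUND_KNOWLEDGE.get(domain, []):
        ymd = parse(s)
        if ymd is not None:
            return format_mdy(ymd)
    return s  # no domain or no matching primitive: leave the input unchanged


if __name__ == "__main__":
    for example in ["17th of August of 2017", "2017-08-17", "221B Baker Street"]:
        print(example, "->", transform(example))
```

Restricting the candidate primitives to those of the selected domain is what keeps the search space small in this sketch; a string from another domain, such as an address, is simply left untouched here.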

This research was supported by the EU (FEDER) and the Spanish MINECO RTI2018-094403-B-C32, and the Generalitat Valenciana PROMETEO/2019/098. L. Contreras-Ochando was also supported by the Spanish MECD (FPU15/03219). J. Hernández-Orallo is also funded by FLI (RFP2-152). F. Martínez-Plumed was also supported by INCIBE (Ayudas para la excelencia de los equipos de investigación avanzada en ciberseguridad), the European Commission (JRC) HUMAINT project (CT-EX2018D335821-101), and UPV (Primeros Proyectos de Investigación PAID-06-18).

Notes

  1. RapidMiner Studio - Feature List: https://goo.gl/oYypMh.

  2. Trifacta Wrangler - Wrangle Language: https://goo.gl/pJHSFw.

  3. Excel - Data types in Data Models: https://goo.gl/uWnbZh.

  4. Trifacta Documentation - Supported Data Types: https://goo.gl/pV1owi.

  5. We observed that the maximum number of functions needed to solve the most complex problem collected in our benchmark is \(k=12\).

  6. An application example of our system can be seen at: https://www.youtube.com/watch?v=wxFhXYyonOw.

  7. TDE Benchmark: https://github.com/Yeye-He/Transform-Data-by-Example.

Corresponding author

Correspondence to Lidia Contreras-Ochando.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Contreras-Ochando, L., Ferri, C., Hernández-Orallo, J., Martínez-Plumed, F., Ramírez-Quintana, M.J., Katayama, S. (2020). Automated Data Transformation with Inductive Programming and Dynamic Background Knowledge. In: Brefeld, U., Fromont, E., Hotho, A., Knobbe, A., Maathuis, M., Robardet, C. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Lecture Notes in Computer Science, vol 11908. Springer, Cham. https://doi.org/10.1007/978-3-030-46133-1_44

  • DOI: https://doi.org/10.1007/978-3-030-46133-1_44

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-46132-4

  • Online ISBN: 978-3-030-46133-1

  • eBook Packages: Computer Science, Computer Science (R0)
