Automated Data Transformation with Inductive Programming and Dynamic Background Knowledge

Part of the Lecture Notes in Computer Science book series (LNAI, volume 11908)

Abstract

Data quality is essential for database integration, machine learning and data science in general. Despite the growing number of data preparation tools, the most tedious data wrangling tasks, feature manipulation in particular, still resist automation, partly because the problem depends strongly on domain information. For instance, if the strings “17th of August of 2017” and “2017-08-17” are to be formatted into “08/17/2017” so that a data analytics tool recognises them properly, humans usually proceed in two steps: (1) they recognise that the strings are dates and (2) they apply conversions that are specific to the date domain. However, the mechanisms for manipulating dates are very different from those for manipulating addresses. Covering many domains therefore requires large amounts of background knowledge, which usually becomes a bottleneck as the diversity of domains and formats increases. In this paper we help alleviate this problem by using inductive programming (IP) with dynamic background knowledge (BK), fuelled by a machine learning meta-model that selects the domain, the primitives, or both, from several descriptive features of the data wrangling problem. We illustrate these new alternatives for automating data format transformation and evaluate them on an integrated data wrangling benchmark, which we share publicly with the community together with the code.
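The two-step process described in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system presented in the paper: every name here (`parse_iso`, `parse_verbose`, `detect_domain`, `BACKGROUND_KNOWLEDGE`, `transform`) is hypothetical, a hand-written rule stands in for the learned meta-model, and the conversion is hard-coded rather than induced from examples.

```python
import re
from datetime import date

# Hypothetical date-domain primitives; names are illustrative, not the paper's.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june",
         "july", "august", "september", "october", "november", "december"],
        start=1,
    )
}


def parse_iso(text):
    """Parse 'YYYY-MM-DD'; return None when the string has another shape."""
    m = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", text)
    return date(int(m[1]), int(m[2]), int(m[3])) if m else None


def parse_verbose(text):
    """Parse '17th of August of 2017'; return None when it does not match."""
    m = re.fullmatch(r"(\d{1,2})(?:st|nd|rd|th) of ([A-Za-z]+) of (\d{4})", text)
    if m is None:
        return None
    month = MONTHS.get(m[2].lower())
    return date(int(m[3]), month, int(m[1])) if month else None


# Assumption: background knowledge is a mapping from domain name to primitives.
BACKGROUND_KNOWLEDGE = {"date": [parse_iso, parse_verbose]}


def detect_domain(text):
    """Step 1: guess the domain (a rule standing in for the meta-model)."""
    for domain, parsers in BACKGROUND_KNOWLEDGE.items():
        if any(parse(text) is not None for parse in parsers):
            return domain
    return "unknown"


def transform(text):
    """Step 2: apply only the primitives of the detected domain."""
    for parse in BACKGROUND_KNOWLEDGE.get(detect_domain(text), []):
        parsed = parse(text)
        if parsed is not None:
            return parsed.strftime("%m/%d/%Y")
    return text  # no known domain applies: leave the value unchanged


print(transform("17th of August of 2017"))
print(transform("2017-08-17"))
```

Restricting the second step to the primitives of the detected domain is what keeps the set of candidate conversions small as more domains are added to the mapping.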

Keywords

  • Inductive programming
  • Data wrangling automation
  • Declarative programming languages
  • Dynamic background knowledge

This research was supported by the EU (FEDER) and the Spanish MINECO RTI2018-094403-B-C32, and the Generalitat Valenciana PROMETEO/2019/098. L. Contreras-Ochando was also supported by the Spanish MECD (FPU15/03219). J. Hernández-Orallo is also funded by FLI (RFP2-152). F. Martínez-Plumed was also supported by INCIBE (Ayudas para la excelencia de los equipos de investigación avanzada en ciberseguridad), the European Commission (JRC) HUMAINT project (CT-EX2018D335821-101), and UPV (Primeros Proyectos de Investigación PAID-06-18).


Notes

  1. RapidMiner Studio - Feature List: https://goo.gl/oYypMh.

  2. Trifacta Wrangler - Wrangle Language: https://goo.gl/pJHSFw.

  3. Excel - Data types in Data Models: https://goo.gl/uWnbZh.

  4. Trifacta Documentation - Supported Data Types: https://goo.gl/pV1owi.

  5. We observed that the maximum number of functions needed to solve the most complex problem collected in our benchmark is \(k=12\).

  6. An application example of our system can be seen at: https://www.youtube.com/watch?v=wxFhXYyonOw.

  7. TDE Benchmark: https://github.com/Yeye-He/Transform-Data-by-Example.

Author information

Corresponding author

Correspondence to Lidia Contreras-Ochando.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Contreras-Ochando, L., Ferri, C., Hernández-Orallo, J., Martínez-Plumed, F., Ramírez-Quintana, M.J., Katayama, S. (2020). Automated Data Transformation with Inductive Programming and Dynamic Background Knowledge. In: Brefeld, U., Fromont, E., Hotho, A., Knobbe, A., Maathuis, M., Robardet, C. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Lecture Notes in Computer Science (LNAI), vol 11908. Springer, Cham. https://doi.org/10.1007/978-3-030-46133-1_44

  • DOI: https://doi.org/10.1007/978-3-030-46133-1_44

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-46132-4

  • Online ISBN: 978-3-030-46133-1

  • eBook Packages: Computer Science, Computer Science (R0)