Towards Automatic Data Format Transformations: Data Wrangling at Scale

  • Alex Bogatu
  • Norman W. Paton
  • Alvaro A. A. Fernandes
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10365)

Abstract

Data wrangling is the process whereby data is cleaned and integrated for analysis. Data wrangling, even with tool support, is typically a labour-intensive process. One aspect of data wrangling involves carrying out format transformations on attribute values, for example so that names or phone numbers are represented consistently. Recent research has developed techniques for synthesising format transformation programs from examples of the source and target representations. This is valuable, but it still requires a user to provide suitable examples, which may be challenging in applications with huge data sets or numerous data sources. In this paper we investigate the automatic discovery of examples that can be used to synthesise format transformation programs. In particular, we propose an approach to identifying candidate data examples and validating the transformations that are synthesised from them. The approach is evaluated empirically using data sets from open government data.
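To make the idea of synthesising a format transformation from source and target examples concrete, the following is a minimal sketch only. It is not the approach proposed in the paper, nor any published synthesiser: it simply picks, from a tiny hand-written set of candidate functions, the first one consistent with every example pair, and then checks that choice against held-out pairs. All function names, the candidate set, and the sample values are assumptions introduced for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Example = Tuple[str, str]  # (source representation, target representation)


def swap_comma_name(s: str) -> str:
    """Hypothetical candidate: 'Last, First' -> 'First Last'."""
    last, _, first = s.partition(",")
    return f"{first.strip()} {last.strip()}"


def digits_only(s: str) -> str:
    """Hypothetical candidate: keep only the digit characters."""
    return "".join(ch for ch in s if ch.isdigit())


def upper_case(s: str) -> str:
    """Hypothetical candidate: upper-case the whole value."""
    return s.upper()


# Assumed, deliberately tiny hypothesis space; real synthesisers search
# a far larger space of programs.
CANDIDATES: Dict[str, Callable[[str], str]] = {
    "swap_comma_name": swap_comma_name,
    "digits_only": digits_only,
    "upper_case": upper_case,
}


def synthesise(examples: List[Example]) -> Optional[str]:
    """Return the name of the first candidate consistent with all examples."""
    for name, fn in CANDIDATES.items():
        if all(fn(src) == tgt for src, tgt in examples):
            return name
    return None


def validate(name: str, held_out: List[Example]) -> float:
    """Fraction of held-out pairs that the chosen candidate reproduces."""
    fn = CANDIDATES[name]
    return sum(fn(src) == tgt for src, tgt in held_out) / len(held_out)


name_examples = [("Paton, Norman", "Norman Paton"), ("Bogatu, Alex", "Alex Bogatu")]
chosen = synthesise(name_examples)
score = validate(chosen, [("Fernandes, Alvaro", "Alvaro Fernandes"), ("Cher", "Cher")])
print(chosen, score)
```

In this sketch the held-out value without a comma is not reproduced, so the validation score drops below 1.0. That is the kind of signal a validation step can use to reject or refine a transformation learned from too few examples.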

Keywords

Format transformations, Data wrangling, Program synthesis

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Alex Bogatu (1)
  • Norman W. Paton (1)
  • Alvaro A. A. Fernandes (1)
  1. School of Computer Science, University of Manchester, Manchester, UK
