Towards Automatic Data Format Transformations: Data Wrangling at Scale

Bogatu, Alex; Paton, Norman W.; Fernandes, Alvaro A. A.

doi:10.1007/978-3-319-60795-5_4

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10365)

Included in the following conference series:

British International Conference on Databases

Abstract

Data wrangling is the process whereby data is cleaned and integrated for analysis. Data wrangling, even with tool support, is typically a labour-intensive process. One aspect of data wrangling involves carrying out format transformations on attribute values, for example so that names or phone numbers are represented consistently. Recent research has developed techniques for synthesising format transformation programs from examples of the source and target representations. This is valuable, but still requires a user to provide suitable examples, which may be challenging in applications with huge data sets or numerous data sources. In this paper we investigate the automatic discovery of examples that can be used to synthesise format transformation programs. In particular, we propose an approach to identifying candidate data examples and validating the transformations that are synthesised from them. The approach is evaluated empirically using data sets from open government data.

Notes

1. http://nyti.ms/1Aqif2X.
2. The complexity of Algorithm 1 is \(\mathcal{O}(nm)\), where n is the number of attributes of S and m is the number of attributes of T. This follows from the cross product between the columns of the two data sets (i.e. the two for loops at the beginning of the algorithm). We do not analyse the complexity of the other algorithms used in our experiments, as this has been done in the original papers, nor do we examine the impact of input size on the overall solution. In our experiments, the run-time of Algorithm 1, which covers example generation alone, did not exceed one second for any of the data sets used.
3. http://bit.ly/2fLVvtl.
4. http://bit.ly/2f5DwJW.
5. http://www.pentaho.com/.
6. https://www.talend.com/.
7. http://openrefine.org/.
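
The O(nm) bound stated in Note 2 can be illustrated with a minimal sketch. This is not the paper's Algorithm 1, which is not reproduced here; it only shows why looping over every attribute of S and every attribute of T performs n × m comparisons. The function name `candidate_column_pairs`, the attribute names, and the substring-based `is_match` predicate are all hypothetical placeholders chosen for illustration.

```python
from typing import Callable, List, Tuple


def candidate_column_pairs(
    source_attrs: List[str],
    target_attrs: List[str],
    is_match: Callable[[str, str], bool],
) -> Tuple[List[Tuple[str, str]], int]:
    """Compare every source attribute with every target attribute.

    Returns the matching pairs and the number of comparisons made,
    which is always len(source_attrs) * len(target_attrs).
    """
    pairs: List[Tuple[str, str]] = []
    comparisons = 0
    for s in source_attrs:          # n iterations
        for t in target_attrs:      # m iterations for each source attribute
            comparisons += 1
            if is_match(s, t):      # hypothetical matching predicate
                pairs.append((s, t))
    return pairs, comparisons


# Hypothetical attribute names; the predicate is a naive stand-in,
# not the matching technique used in the paper.
pairs, comparisons = candidate_column_pairs(
    ["name", "phone", "addr"],
    ["full_name", "telephone"],
    lambda s, t: s in t,
)
print(pairs, comparisons)
```

With three source attributes and two target attributes the loop body runs six times, regardless of how many pairs the predicate accepts.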

Acknowledgement

This work has been made possible by funding from the UK Engineering and Physical Sciences Research Council, whose support we are pleased to acknowledge.

Author information

Authors and Affiliations

School of Computer Science, University of Manchester, Manchester, M13 9PL, UK
Alex Bogatu, Norman W. Paton & Alvaro A. A. Fernandes

Corresponding author

Correspondence to Alex Bogatu.

Editor information

Editors and Affiliations

Birkbeck, University of London, London, United Kingdom
Andrea Calì, Peter Wood, Nigel Martin & Alexandra Poulovassilis

About this paper

Cite this paper

Bogatu, A., Paton, N.W., Fernandes, A.A.A. (2017). Towards Automatic Data Format Transformations: Data Wrangling at Scale. In: Calì, A., Wood, P., Martin, N., Poulovassilis, A. (eds) Data Analytics. BICOD 2017. Lecture Notes in Computer Science, vol 10365. Springer, Cham. https://doi.org/10.1007/978-3-319-60795-5_4

DOI: https://doi.org/10.1007/978-3-319-60795-5_4
Published: 14 June 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-60794-8
Online ISBN: 978-3-319-60795-5
eBook Packages: Computer Science, Computer Science (R0)