Advertisement

Wrangling messy CSV files by detecting row and type patterns

  • G. J. J. van den BurgEmail author
  • A. Nazábal
  • C. Sutton
Article

Abstract

Data scientists spend the majority of their time on preparing data for analysis. One of the first steps in this preparation phase is to load the data from the raw storage format. Comma-separated value (CSV) files are a popular format for tabular data due to their simplicity and ostensible ease of use. However, formatting standards for CSV files are not followed consistently, so each file requires manual inspection and potentially repair before the data can be loaded, an enormous waste of human effort for a task that should be one of the simplest parts of data science. The first and most essential step in retrieving data from CSV files is deciding on the dialect of the file, such as the cell delimiter and quote character. Existing dialect detection approaches are few and non-robust. In this paper, we propose a dialect detection method based on a novel measure of data consistency of parsed data files. Our method achieves 97% overall accuracy on a large corpus of real-world CSV files and improves the accuracy on messy CSV files by almost 22% compared to existing approaches, including those in the Python standard library. Our measure of data consistency is not specific to the data parsing problem, and has potential for more general applicability.

Keywords

Data wrangling Data parsing Comma separated values 

Notes

Acknowledgements

The authors would like to acknowledge the funding provided by the UK Government’s Defence & Security Programme in support of the Alan Turing Institute. The authors thank Chris Williams for useful discussions.

References

  1. Arnold VI (2003) Catastrophe theory. Springer, BerlinGoogle Scholar
  2. Codd EF (1970) A relational model of data for large shared data banks. Commun ACM 13(6):377–387CrossRefzbMATHGoogle Scholar
  3. Crockford D (2006) The application/json media type for javascript object notation (JSON). Tech. Rep. RFC 4627, Internet requests for commentsGoogle Scholar
  4. Crowdflower (2016) Data science report. visit.figure-eight.com/data-science-report.html. Accessed 19 Nov 2018
  5. Dasu T, Johnson T (2003) Exploratory data mining and data cleaning, vol 479. Wiley, HobokenCrossRefzbMATHGoogle Scholar
  6. Döhmen T, Mühleisen H, Boncz P (2017) Multi-hypothesis CSV parsing. In: Proceedings of the 29th international conference on scientific and statistical database management, ACM, pp. 16:1–16:12Google Scholar
  7. Eberius J, Werner C, Thiele M, Braunschweig K, Dannecker L, Lehner W (2013) DeExcelerator: a framework for extracting relational data from partially structured documents. In: Proceedings of the 22nd ACM international conference on information & knowledge management, ACM, pp. 2477–2480Google Scholar
  8. Evans C (2001) YAML draft 0.1. yaml.org. Accessed 19 Nov 2018
  9. Fisher K, Walker D, Zhu KQ, White P (2008) From dirt to shovels: fully automatic tool generation from ad hoc data. ACM SIGPLAN Not 43:421–434CrossRefGoogle Scholar
  10. Frictionless Data (2017) CSV dialect specification. frictionlessdata.io/specs/csv-dialect. Accessed 19 Nov 2018
  11. Guo PJ, Kandel S, Hellerstein JM, Heer J (2011) Proactive wrangling: Mixed-initiative end-user programming of data transformation scripts. In: Proceedings of the 24th annual ACM symposium on user interface software and technology, ACM, pp. 65–74Google Scholar
  12. Kaggle (2017) The state of data science & machine learning. www.kaggle.com/surveys/2017. Accessed 27 Sept 2018
  13. Kandel S, Paepcke A, Hellerstein J, Heer J (2011) Wrangler: interactive visual specification of data transformation scripts. In: Proceedings of the SIGCHI conference on human factors in computing systems, ACM, pp. 3363–3372Google Scholar
  14. Kleene SC (1956) Representation of events in nerve nets and finite automata. In: Shannon CE, McCarthy J (eds) Automata studies. Princeton University Press, PrincetonGoogle Scholar
  15. Koci E, Thiele M, Romero Moral Ó, Lehner W (2016) A machine learning approach for layout inference in spreadsheets. In: IC3K 2016: proceedings of the 8th international joint conference on knowledge discovery, knowledge engineering and knowledge management: volume 1: KDIR, pp. 77–88Google Scholar
  16. Le V, Gulwani S (2014) FlashExtract: a framework for data extraction by examples. ACM SIGPLAN Not 49:542–553CrossRefGoogle Scholar
  17. Lohr S (2014) For big-data scientists, “janitor work” is key hurdle to insights. The New York Times www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html. Accessed 22 Nov 2018
  18. Mitlöhner J, Neumaier S, Umbrich J, Polleres A (2016) Characteristics of open data CSV files. In: 2nd International conference on open and big data (OBD), pp. 72–79Google Scholar
  19. Ng HT, Lim CY, Koo JLT (1999) Learning to recognize tables in free text. In: Proceedings of the 37th annual meeting of the association for computational linguistics, ACL, pp. 443–450Google Scholar
  20. Petricek T, Geddes J, Sutton C (2018) Wrattler: reproducible, live and polyglot notebooks. In: 10th USENIX workshop on the theory and practice of provenance (TaPP 2018)Google Scholar
  21. Pinto D, McCallum A, Wei X, Croft WB (2003) Table extraction using conditional random fields. In: Proceedings of the 26th annual international ACM SIGIR conference on research and development in information retrieval, ACM, pp. 235–242Google Scholar
  22. Raymond ES (2003) The art of Unix programming. Addison-Wesley Professional, Boston Google Scholar
  23. Rovegno J, Fenner M (2015) CSVY: YAML frontmatter for CSV file format. csvy.org. Accessed 19 Nov 2018
  24. Shafranovich Y (2005) Common format and MIME type for comma-separated values (CSV) files. Tech. Rep. RFC 4180, Internet requests for commentsGoogle Scholar
  25. Sutton C, Hobson T, Geddes J, Caruana R (2018) Data diff: interpretable, executable summaries of changes in distributions for data wrangling. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, ACM, pp. 2279–2288Google Scholar
  26. Tennison J (2016) CSV on the web: a primer. Tech. rep., W3CGoogle Scholar
  27. Tennison J, Kellogg G (2015) Metadata vocabulary for tabular data. Tech. rep., W3CGoogle Scholar
  28. The Unicode Consortium (2018) The unicode standard. Version 11.0.0Google Scholar
  29. Wells C (2002) Python-DSV. python-dsv.sourceforge.net. Accessed 08 Nov 2018
  30. Wickham H (2014) Tidy data. J Stat Softw 59(10):1–23CrossRefGoogle Scholar

Copyright information

© The Author(s), under exclusive licence to Springer Science+Business Media LLC, part of Springer Nature 2019

Authors and Affiliations

  1. 1.The Alan Turing InstituteLondonUK
  2. 2.Google, Inc.Mountain ViewUSA
  3. 3.School of InformaticsThe University of EdinburghEdinburghUK

Personalised recommendations