RefDataCleaner: A Usable Data Cleaning Tool

  • Conference paper
Part of the book series: Communications in Computer and Information Science ((CCIS,volume 1051))

Abstract

While the democratization of data science may still be some way off, several vendors of data wrangling and analytics tools have recently emphasized the usability of their products, aiming to attract an ever broader range of users. In this paper, we carry out an experiment to compare user performance when cleaning data with two contrasting tools. The first is RefDataCleaner, a bespoke web-based tool that we created specifically for detecting and fixing errors in structured and semi-structured data files. The second is Microsoft Excel, a spreadsheet application in widespread use in organizations throughout the world for diverse tasks, including data cleaning. With RefDataCleaner, a user specifies rules to detect and fix data errors, either using hard-coded values or by retrieving values from a reference data file. In contrast, with Microsoft Excel, a non-expert user may clean data by specifying formulae and applying find/replace functions. The results of this initial study, carried out with a focus group of volunteers, show that users were able to clean dirty data sets more accurately using RefDataCleaner and that this tool was generally preferred for this purpose.
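The two kinds of rule mentioned in the abstract can be illustrated with a minimal Python sketch. It is not RefDataCleaner's implementation: the function names, the column names, and the sample rows are all hypothetical, chosen only to show the difference between a fix that uses a hard-coded value and a fix that retrieves the correct value from reference data.

```python
from typing import Dict, List

Row = Dict[str, str]


def apply_substitution_rule(rows: List[Row], column: str, wrong: str, right: str) -> List[Row]:
    """Hypothetical rule: replace one hard-coded wrong value with a hard-coded fix."""
    return [{**r, column: right} if r.get(column) == wrong else dict(r) for r in rows]


def apply_reference_rule(rows: List[Row], reference: List[Row], key: str, target: str) -> List[Row]:
    """Hypothetical rule: overwrite `target` with the value found in the
    reference data for the row's `key` (similar in spirit to a VLOOKUP)."""
    lookup = {ref[key]: ref[target] for ref in reference}
    cleaned = []
    for r in rows:
        fixed = dict(r)  # copy, so the dirty input is left untouched
        if r.get(key) in lookup:
            fixed[target] = lookup[r[key]]
        cleaned.append(fixed)
    return cleaned


# Illustrative data only (not taken from the paper).
dirty = [
    {"code": "CO", "country": "Columbia"},
    {"code": "PE", "country": "Peru"},
    {"code": "N/A", "country": "Chile"},
]
reference = [
    {"code": "CO", "country": "Colombia"},
    {"code": "PE", "country": "Peru"},
]

step1 = apply_substitution_rule(dirty, "code", "N/A", "CL")
step2 = apply_reference_rule(step1, reference, "code", "country")
```

Rows whose key is missing from the reference data are left unchanged in this sketch; a real tool might instead flag them for review.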

Notes

  1. http://hilda.io/2019/.

  2. http://poloclub.gatech.edu/idea2018/.

  3. https://github.com/refdatacleaner/version_1_0/.

  4. https://refdatacleaner.shinyapps.io/version_1_0/.

  5. In the case of Microsoft Excel, participants are shown how substitution rules may be mimicked using find/replace/copy/paste functionality, and reference rules using VLOOKUP formulae. However, participants are free to use any functionality available in Excel for the data cleaning process.

References

  1. Exploratory home page. https://exploratory.io/. Accessed 17 June 2019

  2. List of highest-grossing films. https://en.wikipedia.org/wiki/List_of_highest-grossing_films. Accessed 14 Apr 2019

  3. Tableau website. https://www.tableau.com/learn/whitepapers/make-everyone-your-organization-data-scientist. Accessed 17 June 2019

  4. Abedjan, Z., et al.: Detecting data errors: where are we and what needs to be done? Proc. VLDB Endow. 9(12), 993–1004 (2016)

  5. Bernstein, P.A., Haas, L.M.: Information integration in the enterprise. Commun. ACM 51(9), 72–79 (2008)

  6. Fan, W., Geerts, F.: Foundations of Data Quality Management (2012)

  7. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugen. 7(2), 179–188 (1936)

  8. Furche, T., Gottlob, G., Libkin, L., Orsi, G., Paton, N.W.: Data wrangling for big data: challenges and opportunities. In: EDBT, pp. 473–478 (2016)

  9. Galpin, I., Abel, E., Paton, N.W.: Source selection languages: a usability evaluation. In: Proceedings of the Workshop on Human-In-the-Loop Data Analytics, p. 8. ACM (2018)

  10. Kim, W., Choi, B.J., Hong, E., Kim, S.K., Lee, D.: A taxonomy of dirty data. Data Min. Knowl. Discov. 7(1), 81–99 (2003)

  11. Koehler, M., et al.: Data context informed data wrangling. In: 2017 IEEE International Conference on Big Data (Big Data), pp. 956–963. IEEE (2017)

  12. Konstantinou, N., et al.: The VADA architecture for cost-effective data wrangling. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 1599–1602. ACM (2017)

  13. Lohr, S.: For big-data scientists, ‘janitor work’ is key hurdle to insights. https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html. Accessed 15 May 2019

  14. Müller, H., Freytag, J.C.: Problems, Methods, and Challenges in Comprehensive Data Cleansing, pp. 1–23. Humboldt-Universität zu Berlin (2003)

  15. Oliveira, P., Rodrigues, F., Rangel Henriques, P., Galhardas, H.: A taxonomy of data quality problems. J. Data Inf. Qual. JDIQ (2005)

  16. Olson, D.L., Delen, D.: Advanced Data Mining Techniques, 1st edn. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-76917-0

  17. Orr, K.: Data quality and systems theory. Commun. ACM 41(2), 66–71 (1998)

  18. Rahm, E., Do, H.H.: Data cleaning: problems and current approaches. IEEE Data Eng. Bull. 23(4), 3–13 (2000)

  19. Redman, T.C.: The impact of poor data quality on the typical enterprise. Commun. ACM 41(2), 79–82 (1998)

  20. Sauro, J.: Measuring usability with the system usability scale (SUS). https://measuringu.com/sus/. Accessed 10 May 2019

  21. International Organization for Standardization: Software product quality. https://iso25000.com/index.php/en/iso-25000-standards/iso-25010. Accessed 21 May 2019

Author information

Corresponding author

Correspondence to Ixent Galpin.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Leon-Medina, J.C., Galpin, I. (2019). RefDataCleaner: A Usable Data Cleaning Tool. In: Florez, H., Leon, M., Diaz-Nafria, J., Belli, S. (eds) Applied Informatics. ICAI 2019. Communications in Computer and Information Science, vol 1051. Springer, Cham. https://doi.org/10.1007/978-3-030-32475-9_8

  • DOI: https://doi.org/10.1007/978-3-030-32475-9_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-32474-2

  • Online ISBN: 978-3-030-32475-9

  • eBook Packages: Computer Science, Computer Science (R0)
