RefDataCleaner: A Usable Data Cleaning Tool

Leon-Medina, Juan Carlos; Galpin, Ixent

doi:10.1007/978-3-030-32475-9_8

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1051)

Included in the following conference series:

International Conference on Applied Informatics

Abstract

While the democratization of data science may still be some way off, several vendors of tools for data wrangling and analytics have recently emphasized the usability of their products with the aim of attracting an ever broader range of users. In this paper, we carry out an experiment to compare user performance when cleaning data using two contrasting tools: RefDataCleaner, a bespoke web-based tool that we created specifically for detecting and fixing errors in structured and semi-structured data files, and Microsoft Excel, a spreadsheet application in widespread use in organizations throughout the world for diverse types of tasks, including data cleaning. With RefDataCleaner, a user specifies rules to detect and fix data errors, either by using hard-coded values or by retrieving values from a reference data file. In contrast, with Microsoft Excel, a non-expert user may clean data by specifying formulae and applying find/replace functions. The results of this initial study, carried out using a focus group of volunteers, show that users were able to clean dirty datasets more accurately using RefDataCleaner and, moreover, that this tool was generally preferred for this purpose.
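The two kinds of rule mentioned in the abstract can be illustrated with a minimal sketch. This is not RefDataCleaner's actual implementation or rule syntax, neither of which appears on this page; the function names, column names, and toy rows below are all hypothetical. The first function fixes an error with a hard-coded value, and the second retrieves the correct value from a reference data set by matching on a key column.

```python
from typing import Dict, List

Row = Dict[str, str]


def apply_substitution_rule(rows: List[Row], column: str, wrong: str, right: str) -> List[Row]:
    """Hypothetical rule: replace one hard-coded erroneous value with a hard-coded fix."""
    return [
        {**row, column: right} if row.get(column) == wrong else dict(row)
        for row in rows
    ]


def apply_reference_rule(rows: List[Row], reference: List[Row],
                         key_column: str, target_column: str) -> List[Row]:
    """Hypothetical rule: overwrite target_column with the value found in the
    reference data for the same key; rows without a match are left unchanged."""
    lookup = {ref[key_column]: ref[target_column] for ref in reference}
    cleaned = []
    for row in rows:
        fixed = dict(row)  # copy so the dirty input is not mutated
        key = fixed.get(key_column)
        if key in lookup:
            fixed[target_column] = lookup[key]
        cleaned.append(fixed)
    return cleaned


# Toy data invented for illustration only.
dirty = [
    {"title": "Avatar", "country": "USA", "year": "2090"},
    {"title": "Titanic", "country": "United States", "year": "1997"},
]
reference = [
    {"title": "Avatar", "year": "2009"},
    {"title": "Titanic", "year": "1997"},
]

step1 = apply_substitution_rule(dirty, "country", "USA", "United States")
cleaned = apply_reference_rule(step1, reference, "title", "year")
print(cleaned)
```

In spreadsheet terms, the first function plays the role of a find/replace operation and the second that of an exact-match lookup formula against a reference sheet.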

Notes

1. http://hilda.io/2019/
2. http://poloclub.gatech.edu/idea2018/
3. https://github.com/refdatacleaner/version_1_0/
4. https://refdatacleaner.shinyapps.io/version_1_0/
5. In the case of Microsoft Excel, participants are shown how substitution rules may be mimicked using find/replace/copy/paste functionality, and reference rules using VLOOKUP formulae. However, participants are free to use any functionality available in Excel for the data cleaning process.

Author information

Authors and Affiliations

Dpto. de Ingeniería, Universidad Jorge Tadeo Lozano, Bogotá, Colombia
Juan Carlos Leon-Medina & Ixent Galpin

Corresponding author

Correspondence to Ixent Galpin.

Editor information

Editors and Affiliations

Universidad Distrital Francisco Jose de Caldas, Bogota, Colombia
Hector Florez
Universidad Nacional de Loja, Loja, Ecuador
Marcelo Leon
Universidad a Distancia de Madrid, Madrid, Spain
Jose Maria Diaz-Nafria
Universidad Complutense de Madrid, Madrid, Spain
Simone Belli

About this paper

Cite this paper

Leon-Medina, J.C., Galpin, I. (2019). RefDataCleaner: A Usable Data Cleaning Tool. In: Florez, H., Leon, M., Diaz-Nafria, J., Belli, S. (eds) Applied Informatics. ICAI 2019. Communications in Computer and Information Science, vol 1051. Springer, Cham. https://doi.org/10.1007/978-3-030-32475-9_8

DOI: https://doi.org/10.1007/978-3-030-32475-9_8
Published: 28 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32474-2
Online ISBN: 978-3-030-32475-9
eBook Packages: Computer Science, Computer Science (R0)
