Models for Distributed, Large Scale Data Cleaning

Maccio, Vincent J.; Chiang, Fei; Down, Douglas G.

doi:10.1007/978-3-319-13186-3_34

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8643)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Poor data quality is a serious and costly problem affecting organizations across all industries. Real data is often dirty, containing missing, erroneous, incomplete, and duplicate values. Declarative data cleaning techniques have been proposed to resolve some of these underlying errors by identifying the inconsistencies and proposing updates to the data. However, much of this work has focused on cleaning data in static environments. In the Big Data era, modern applications operate in dynamic data environments where large scale data may change frequently. For example, consider data in sensor environments where there is a frequent stream of data arrivals, or financial data of stock prices and trading volumes. Data cleaning in such dynamic environments requires understanding the properties of the incoming data streams and configuring system parameters to maximize performance and improve data quality. In this paper, we present a set of queueing models, and analyze the impact of various system parameters on the output quality of a data cleaning system and its performance. We assume random routing in our models, and consider a variety of system configurations that reflect potential data cleaning scenarios. We present experimental results showing that our models are able to closely predict expected system behaviour.
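To make the phrase "queueing models with random routing" concrete, the following is a minimal illustrative sketch and not the authors' models, whose configurations are not described in the abstract. It assumes Poisson arrivals, exponential service times, and identical first-come-first-served servers; under those assumptions, uniform random routing splits the arrival stream into independent M/M/1 queues, a standard textbook result. All function and parameter names are hypothetical.

```python
import random


def mm1_mean_response_time(arrival_rate, service_rate):
    """Mean time in system for a stable M/M/1 queue: 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)


def random_routing_mean_response_time(total_arrival_rate, service_rate, num_servers):
    """Analytic prediction under the assumptions stated above.

    Routing each job uniformly at random splits a Poisson stream of rate
    lambda into num_servers independent Poisson streams of rate lambda / k.
    """
    return mm1_mean_response_time(total_arrival_rate / num_servers, service_rate)


def simulate_random_routing(total_arrival_rate, service_rate, num_servers,
                            num_jobs, seed=0):
    """Estimate the mean response time by simulating FCFS servers."""
    rng = random.Random(seed)
    clock = 0.0
    free_at = [0.0] * num_servers  # time at which each server clears its backlog
    total_response = 0.0
    for _ in range(num_jobs):
        clock += rng.expovariate(total_arrival_rate)   # next arrival time
        server = rng.randrange(num_servers)            # uniform random routing
        start = max(clock, free_at[server])            # wait if the server is busy
        free_at[server] = start + rng.expovariate(service_rate)
        total_response += free_at[server] - clock      # waiting plus service
    return total_response / num_jobs


if __name__ == "__main__":
    predicted = random_routing_mean_response_time(8.0, 5.0, 4)
    simulated = simulate_random_routing(8.0, 5.0, 4, num_jobs=200_000)
    print(f"predicted mean response time: {predicted:.4f}")
    print(f"simulated mean response time: {simulated:.4f}")
```

Comparing the analytic value with the simulated estimate mirrors, in a very simplified form, the kind of model-versus-experiment check the abstract describes.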

Author information

Authors and Affiliations

McMaster University, Hamilton, ON, Canada
Vincent J. Maccio, Fei Chiang & Douglas G. Down

Corresponding author

Correspondence to Fei Chiang.

Editor information

Editors and Affiliations

National Chiao Tung University, Hsinchu, Taiwan
Wen-Chih Peng
Google Research, Mountain View, California, USA
Haixun Wang
University of Melbourne, Melbourne, Victoria, Australia
James Bailey
National Cheng Kung University, Tainan, Taiwan
Vincent S. Tseng
Japan Advanced Institute of Science and Technology, Nomi City, Japan
Tu Bao Ho
Nanjing University, Nanjing, China
Zhi-Hua Zhou
National Chengchi University, Taipei, Taiwan
Arbee L.P. Chen

About this paper

Cite this paper

Maccio, V.J., Chiang, F., Down, D.G. (2014). Models for Distributed, Large Scale Data Cleaning. In: Peng, WC., et al. Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2014. Lecture Notes in Computer Science, vol 8643. Springer, Cham. https://doi.org/10.1007/978-3-319-13186-3_34

DOI: https://doi.org/10.1007/978-3-319-13186-3_34
Published: 26 November 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13185-6
Online ISBN: 978-3-319-13186-3
eBook Packages: Computer Science, Computer Science (R0)
