Skip to main content

Models for Distributed, Large Scale Data Cleaning

  • Conference paper
  • First Online:
Trends and Applications in Knowledge Discovery and Data Mining (PAKDD 2014)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 8643))

Included in the following conference series:

Abstract

Poor data quality is a serious and costly problem affecting organizations across all industries. Real data is often dirty, containing missing, erroneous, incomplete, and duplicate values. Declarative data cleaning techniques have been proposed to resolve some of these underlying errors by identifying the inconsistencies and proposing updates to the data. However, much of this work has focused on cleaning data in static environments. Given the Big Data era, modern applications are operating in dynamic data environments where large scale data may be frequently changing. For example, consider data in sensor environments where there is a frequent stream of data arrivals, or financial data of stock prices and trading volumes. Data cleaning in such dynamic environments requires understanding the properties of the incoming data streams, and configuration of system parameters to maximize performance and improved data quality. In this paper, we present a set of queueing models, and analyze the impact of various system parameters on the output quality of a data cleaning system and its performance. We assume random routing in our models, and consider a variety of system configurations that reflect potential data cleaning scenarios. We present experimental results showing that our models are able to closely predict expected system behaviour.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Raman, V., Hellerstein, J.M.: Potter’s wheel: An interactive data cleaning system. In: VLBD, pp. 381–390 (2001)

    Google Scholar 

  2. Galhardas, H., Florescu, D., Shasha, D., Simon, E., Saita, C.A.: Declarative data cleaning: Language, model, and algorithms. In: VLDB, pp. 371–380 (2001)

    Google Scholar 

  3. Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A., Ilyas, I.F., Ouzzani, M., Tang, N.: NADEEF: a commodity data cleaning system. In: SIGMOD, pp. 541–552 (2013)

    Google Scholar 

  4. Bohannon, P., Fan, W., Flaster, M., Rastogi, R.: A cost-based model and effective heuristic for repairing constraints by value modification. In: SIGMOD, pp. 143–154 (2005)

    Google Scholar 

  5. Yakout, M., Elmagarmid, A.K., Neville, J., Ouzzani, M., Ilyas, I.F.: Guided data repair. VLDB Endow. 4(5), 279–289 (2011)

    Article  Google Scholar 

  6. Chiang, F., Miller, R.J.: A unified model for data and constraint repair. In: ICDE, pp. 446–457 (2011)

    Google Scholar 

  7. Chiang, F., Wang, Y.: Repairing integrity rules for improved data quality. IJIQ 20 p. (2014)

    Google Scholar 

  8. Volkovs, M., Chiang, F., Szlichta, J., Miller, R.J.: Continuous data cleaning. In: ICDE, pp. 244–255 (2014)

    Google Scholar 

  9. Geerts, F., Mecca, G., Papotti, P., Santoro, D.: The LLUNATIC data-cleaning framework. PVLDB 6(9), 625–636 (2013)

    Google Scholar 

  10. Chiang, F., Miller, R.J.: Active repair of data quality rules. In: ICIQ, pp. 174–188 (2011)

    Google Scholar 

  11. Beskales, G., Ilyas, I.F., Golab, L., Galiullin, A.: On the relative trust between inconsistent data and inaccurate constraints. In: ICDE, pp. 541–552 (2013)

    Google Scholar 

  12. Gross, D., Harris, C.M.: Fundamentals of Queueing Theory, 3rd edn. Wiley-Interscience, New York (1998)

    MATH  Google Scholar 

  13. Harchol-Balter, M.: Performance Modeling and Design of Computer Systems: Queueing Theory in Action. Cambridge University Press, New York (2013)

    Google Scholar 

  14. Kleinrock, L.: Queueing Systems, vol. 1. Wiley-Interscience, New York (1975)

    MATH  Google Scholar 

  15. Rubinovitch, M.: The slow server problem. J. Appl. Probab. 22(4), 205–213 (1985)

    Article  MATH  MathSciNet  Google Scholar 

  16. Mesquite Software CSIM 19. http://www.mesquite.com/

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Fei Chiang .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Maccio, V.J., Chiang, F., Down, D.G. (2014). Models for Distributed, Large Scale Data Cleaning. In: Peng, WC., et al. Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2014. Lecture Notes in Computer Science(), vol 8643. Springer, Cham. https://doi.org/10.1007/978-3-319-13186-3_34

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-13186-3_34

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-13185-6

  • Online ISBN: 978-3-319-13186-3

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics