One-Pass Inconsistency Detection Algorithms for Big Data

  • Meifan Zhang
  • Hongzhi Wang
  • Jianzhong Li
  • Hong Gao
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9642)

Abstract

Real-world data is often dirty, and inconsistency is an important kind of dirty data. Inconsistencies must be detected before they can be repaired. The time complexity of existing inconsistency detection algorithms is super-linear in the size of the data, which makes them unsuitable for big data. To address this, we develop an algorithm that detects inconsistencies in a single pass over the data with respect to both functional dependencies (FDs) and conditional functional dependencies (CFDs). We compare our detection algorithm with existing approaches experimentally. Results on real datasets show that our approach detects inconsistencies effectively and efficiently.
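To make the idea of one-pass detection concrete, the following is a minimal sketch and not the algorithm of the paper, whose details are not given in this abstract. It assumes tuples are dictionaries, that an FD X → Y is given as two attribute lists, and that a CFD is approximated by an FD plus a constant pattern on some attributes. The names detect_violations, lhs, rhs, and condition are hypothetical and chosen only for illustration.

```python
from typing import Dict, Hashable, Iterable, List, Mapping, Optional, Sequence, Tuple

Row = Mapping[str, Hashable]


def detect_violations(
    rows: Iterable[Row],
    lhs: Sequence[str],
    rhs: Sequence[str],
    condition: Optional[Mapping[str, Hashable]] = None,
) -> List[Tuple[int, int]]:
    """Scan rows once and report pairs (first_index, current_index) of rows
    that agree on the lhs attributes but differ on the rhs attributes.

    If `condition` is given (illustrative stand-in for a CFD pattern with
    constants), only rows matching every attribute/value pair are checked.
    """
    # Maps each lhs value to the first row index and rhs value seen for it.
    seen: Dict[Tuple[Hashable, ...], Tuple[int, Tuple[Hashable, ...]]] = {}
    violations: List[Tuple[int, int]] = []
    for i, row in enumerate(rows):
        # Skip rows outside the scope of the (assumed) constant pattern.
        if condition and any(row[a] != v for a, v in condition.items()):
            continue
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if key not in seen:
            seen[key] = (i, val)
        elif seen[key][1] != val:
            violations.append((seen[key][0], i))
    return violations


# Usage with made-up sample data.
rows = [
    {"country": "44", "zip": "EH8", "street": "Mayfield"},
    {"country": "44", "zip": "EH8", "street": "Crichton"},
    {"country": "01", "zip": "EH8", "street": "Main"},
    {"country": "01", "zip": "EH8", "street": "Elm"},
]
# Plain FD: [country, zip] -> [street]
print(detect_violations(rows, ["country", "zip"], ["street"]))
# Conditional variant: the dependency is enforced only where country = "44"
print(detect_violations(rows, ["country", "zip"], ["street"], {"country": "44"}))
```

Each row is read once and compared against a hash table, so the running time is linear in the number of rows under the usual constant-time hashing assumption. The price is memory proportional to the number of distinct left-hand-side values, which a method aimed at big data would likely need to bound.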

Keywords

Inconsistency detection, Big data, One-pass algorithm, Data quality

Notes

Acknowledgment

This paper was supported by NGFR 973 grant 2012CB316200, NSFC grants U1509216, 61472099, and 61133002, and National Sci-Tech Support Plan 2015 BAH10F01.

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Meifan Zhang (1)
  • Hongzhi Wang (1, email author)
  • Jianzhong Li (1)
  • Hong Gao (1)
  1. Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
