Source Selection for Inconsistency Detection

  • Lingli Li
  • Xu Feng
  • Hongyu Shao
  • Jinbao Li
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10828)

Abstract

Inconsistencies in a database can be detected from violations of integrity constraints such as functional dependencies (FDs). In the big data era, the availability of many related data sources makes it possible to detect inconsistencies more extensively: even when a single data set D contains no violations, other data sources can be used to discover potential ones. A significant challenge for source-based violation detection is that accessing too many data sources incurs a huge cost, while involving too few may miss serious violations. Motivated by this, we investigate how to select a proper subset of sources for inconsistency detection. We formulate a gain model for sources and introduce the optimization problem of source selection, called SSID, in which the gain is maximized subject to the cost staying under a threshold. We show that SSID is NP-hard and propose a greedy approximation approach. To avoid accessing data sources, we also present a randomized technique for gain estimation with theoretical guarantees. Experimental results on both real and synthetic data show that our algorithm is both effective and efficient.
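
The selection idea in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: the abstract does not spell out the gain model, so the code below assumes a simple stand-in in which the gain of a set of sources is the number of left-hand-side values in D that map to conflicting right-hand-side values for a single FD once the sources are merged in. The names `violating_keys` and `greedy_select`, the per-source costs, and the toy data are all hypothetical, and the greedy rule (best newly exposed violations per unit cost, within a budget) is a generic heuristic for budgeted coverage problems.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

# One tuple projected onto a single FD lhs -> rhs (assumed representation).
Row = Tuple[Hashable, Hashable]


def violating_keys(d: Iterable[Row], sources: Iterable[Iterable[Row]]) -> Set[Hashable]:
    """Return lhs values of d that map to more than one rhs value
    once the given sources are merged in (assumed gain definition)."""
    seen: Dict[Hashable, Set[Hashable]] = {}
    for lhs, rhs in d:
        seen.setdefault(lhs, set()).add(rhs)
    for src in sources:
        for lhs, rhs in src:
            if lhs in seen:  # only tuples matching D can expose a violation in D
                seen[lhs].add(rhs)
    return {lhs for lhs, rhs_values in seen.items() if len(rhs_values) > 1}


def greedy_select(
    d: List[Row],
    sources: Dict[str, List[Row]],
    costs: Dict[str, float],
    budget: float,
) -> List[str]:
    """Repeatedly pick the affordable source with the best ratio of
    newly exposed violations to cost (costs are assumed positive)."""
    chosen: List[str] = []
    spent = 0.0
    covered = violating_keys(d, [])  # violations already visible in D alone
    remaining = set(sources)
    while remaining:
        best, best_ratio, best_new = None, 0.0, set()
        for name in sorted(remaining):  # sorted for deterministic tie-breaking
            if spent + costs[name] > budget:
                continue
            new = violating_keys(d, [sources[name]]) - covered
            ratio = len(new) / costs[name]
            if ratio > best_ratio:
                best, best_ratio, best_new = name, ratio, new
        if best is None:  # nothing affordable adds any new violation
            break
        chosen.append(best)
        spent += costs[best]
        covered |= best_new
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    # Toy data for the FD zip -> city; D itself is consistent.
    D = [("z1", "NYC"), ("z2", "LA"), ("z3", "SF")]
    SOURCES = {
        "A": [("z1", "Boston"), ("z2", "LA")],
        "B": [("z1", "Boston"), ("z2", "SD"), ("z3", "Oakland")],
        "C": [("z3", "Oakland")],
    }
    COSTS = {"A": 1.0, "B": 4.0, "C": 1.0}
    print(greedy_select(D, SOURCES, COSTS, budget=2.0))
```

Under this assumed gain, a key of D becomes a violation exactly when some single source disagrees with D on it, so the gain behaves like set coverage and the marginal gain of a source can be computed from that source alone. Note that this sketch reads every source in full to compute gains, which is precisely the cost that the randomized gain estimation mentioned in the abstract is meant to avoid.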

Notes

Acknowledgements

This work was supported by NSFC61602159, 61370222 and Program for Group of Science Harbin technological innovation 2015RAXXJ004. The authors wish to thank Hongzhi Wang, Rong Zhu and Ran Bi for helpful discussions of this paper.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Computer Science and Technology, Heilongjiang University, Harbin, China
