Drawing Density Core-Sets from Incomplete Relational Data

  • Yongnan Liu
  • Jianzhong Li
  • Hong Gao
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10178)


Incompleteness is ubiquitous in data and makes it challenging to answer queries with guaranteed completeness. A density core-set is a subset of an incomplete dataset whose completeness approximates the completeness of the entire dataset. Density core-sets are an effective mechanism for estimating the completeness of queries on incomplete datasets. This paper studies the problems of drawing density core-sets from incomplete relational data. To the best of our knowledge, no such proposal has been made before. (1) We study the problems of drawing density core-sets under different requirements, and prove that the problems are all NP-Complete whether or not functional dependencies are given. (2) We propose an efficient approximate algorithm for drawing an approximate density core-set, in which an approximate Knapsack algorithm and weighted sampling techniques are employed to select important candidate tuples. (3) Analysis of the proposed approximate algorithm shows that, with high probability, the relative error between the completeness of the approximate density core-set and that of a density core-set of the same size is within a given relative error bound. (4) Experiments on both real-world and synthetic datasets demonstrate the effectiveness and efficiency of the algorithm.
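
As a rough illustration of the quantities named in the abstract, the following minimal Python sketch measures the completeness of a relation and of a sampled subset, and the relative error between them. Everything in it is an assumption made for illustration, not the paper's method: completeness is taken to be the fraction of non-missing cells, the helper names `density`, `relative_error` and `weighted_sample` are hypothetical, the tuple weights are supplied by the caller, and the sampler is a standard weighted sampling-without-replacement scheme (Efraimidis-Spirakis) rather than the algorithm proposed in the paper. The Knapsack step is not modelled.

```python
import heapq
import random
from typing import Optional, Sequence

Row = Sequence[Optional[object]]


def density(rows: Sequence[Row]) -> float:
    """Fraction of non-missing (non-None) cells.

    Assumed stand-in for 'completeness'; the paper's definition may differ.
    """
    total = sum(len(r) for r in rows)
    if total == 0:
        return 0.0
    filled = sum(1 for r in rows for v in r if v is not None)
    return filled / total


def relative_error(subset: Sequence[Row], full: Sequence[Row]) -> float:
    """Relative error of the subset's density with respect to the full dataset."""
    d_full = density(full)
    if d_full == 0.0:
        # Convention chosen here: no error if both densities are zero.
        return 0.0 if density(subset) == 0.0 else float("inf")
    return abs(density(subset) - d_full) / d_full


def weighted_sample(rows: Sequence[Row], weights: Sequence[float],
                    k: int, seed: int = 0) -> list:
    """Draw up to k distinct tuples, favouring larger weights.

    Uses Efraimidis-Spirakis keys u ** (1 / w); tuples with non-positive
    weight are never selected. The weights are hypothetical inputs.
    """
    rng = random.Random(seed)
    keyed = [(rng.random() ** (1.0 / w), i)
             for i, w in enumerate(weights) if w > 0]
    return [rows[i] for _, i in heapq.nlargest(k, keyed)]


# Toy relation with missing values encoded as None.
relation = [(1, None, 3), (None, None, 2), (4, 5, 6), (7, 8, None)]
subset = weighted_sample(relation, weights=[1.0, 1.0, 1.0, 1.0], k=2)
print(density(relation), density(subset), relative_error(subset, relation))
```

With uniform weights the sampler reduces to simple random sampling without replacement; a non-uniform weighting would be needed to prefer "important" tuples, and how such weights should be chosen is not specified in this section.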


Keywords: Data quality; Density core-sets; Incomplete data; Query completeness estimation



This work is supported in part by the Key Research and Development Plan of the National Ministry of Science and Technology under Grant No. 2016YFB1000703, and by the Key Program of the National Natural Science Foundation of China under Grant Nos. 61190115, 61632010 and U1509216.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Harbin Institute of Technology, Harbin, China
