
Aggregate Query Processing on Incomplete Data

  • Anzhen Zhang (email author)
  • Jinbao Wang
  • Jianzhong Li
  • Hong Gao
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10987)

Abstract

Incomplete data has been a longstanding issue in the database community, and yet the subject is poorly handled in both theory and practice. In this paper, we propose to directly estimate the aggregate query result on incomplete data, rather than imputing the missing values. An interval estimate, composed of the upper and lower bounds of the aggregate query result over all possible interpretations of the missing values, is presented to end users. The ground-truth aggregate result is guaranteed to lie within the interval. Experimental results are consistent with the theoretical results, and suggest that the estimate is valuable for assessing the results of aggregate queries on incomplete data.
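
The interval idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not describe; it only shows what "bounds over all possible interpretations" means under a simplifying assumption that every missing value is known to fall in a fixed domain range [lo, hi]. The function name `aggregate_bounds`, its parameters, and the choice to count imputed rows in AVG are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def aggregate_bounds(
    values: Sequence[Optional[float]],
    lo: float,
    hi: float,
    agg: str = "sum",
) -> Tuple[float, float]:
    """Return (lower, upper) bounds of an aggregate over a column.

    Missing entries are given as None. Assumption (not from the paper):
    each missing value may take any value in the closed range [lo, hi].
    """
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    if not values:
        raise ValueError("values must be non-empty")

    known = [v for v in values if v is not None]
    missing = len(values) - len(known)
    base = sum(known)

    if agg == "sum":
        # Extremes occur when every missing value sits at lo (or at hi).
        return base + missing * lo, base + missing * hi
    if agg == "avg":
        # Every row counts in the denominator, since each missing value
        # stands for some real value (unlike SQL AVG, which skips NULLs).
        n = len(values)
        return (base + missing * lo) / n, (base + missing * hi) / n
    if agg == "max":
        if missing == 0:
            m = max(known)
            return m, m
        # Lowest possible maximum: all missing values at lo.
        # Highest possible maximum: at least one missing value at hi.
        return max(known + [lo]), max(known + [hi])
    raise ValueError(f"unsupported aggregate: {agg}")


if __name__ == "__main__":
    column = [10, None, 30, None]
    for name in ("sum", "avg", "max"):
        print(name, aggregate_bounds(column, lo=0, hi=100, agg=name))
```

Whatever the true values of the two missing entries are, as long as they fall in [0, 100] the true SUM, AVG, and MAX lie inside the returned intervals, which is the guarantee the abstract refers to.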

Keywords

Aggregate query · Incomplete data · Estimation

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Anzhen Zhang (1), email author
  • Jinbao Wang (1)
  • Jianzhong Li (1)
  • Hong Gao (1)
  1. Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
