Aggregate Query Processing on Incomplete Data
Incomplete data has been a longstanding issue in the database community, yet the subject remains poorly handled in both theory and practice. In this paper, we propose to directly estimate the aggregate query result on incomplete data, rather than imputing the missing values. An interval estimate, composed of the upper and lower bounds of the aggregate query result over all possible interpretations of the missing values, is presented to end users. The ground-truth aggregate result is guaranteed to lie within the interval. Experimental results are consistent with the theoretical results and suggest that the estimate is invaluable for better assessing the results of aggregate queries on incomplete data.
Keywords: Aggregate query, Incomplete data, Estimation
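The idea of bounding an aggregate over all possible interpretations of the missing values can be illustrated with a minimal sketch. It is not the paper's algorithm, which the abstract does not describe. It assumes that every missing value (represented here as `None`) is known to lie in a closed domain `[lo, hi]`; the function names `sum_bounds` and `avg_bounds` and the domain parameters are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence, Tuple


def sum_bounds(
    values: Sequence[Optional[float]], lo: float, hi: float
) -> Tuple[float, float]:
    """Lower and upper bound of SUM, assuming each missing value is in [lo, hi]."""
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    known_total = sum(v for v in values if v is not None)
    missing_count = sum(1 for v in values if v is None)
    # SUM is monotone in each value, so the extremes are reached when
    # every missing value takes the domain minimum or the domain maximum.
    return known_total + missing_count * lo, known_total + missing_count * hi


def avg_bounds(
    values: Sequence[Optional[float]], lo: float, hi: float
) -> Tuple[float, float]:
    """Bounds of AVG over all rows, with missing rows counted in the denominator."""
    if not values:
        raise ValueError("values must not be empty")
    sum_lo, sum_hi = sum_bounds(values, lo, hi)
    n = len(values)
    return sum_lo / n, sum_hi / n


# Two of four values are missing; the assumed domain is [0, 100].
column = [10.0, None, 30.0, None]
print(sum_bounds(column, 0.0, 100.0))  # (40.0, 240.0)
print(avg_bounds(column, 0.0, 100.0))  # (10.0, 60.0)
```

Whatever the true values of the two missing entries are, as long as they fall within the assumed domain, the true SUM and AVG lie inside the reported intervals. The width of the interval grows with the number of missing values and with the width of the domain, which is why tighter domain knowledge yields a more useful estimate.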