The VLDB Journal, Volume 19, Issue 4, pp 477–501

Supporting ranking queries on uncertain and incomplete data

  • Mohamed A. Soliman
  • Ihab F. Ilyas
  • Shalev Ben-David
Regular Paper

Abstract

Large databases with uncertain information are becoming more common in many applications including data integration, location tracking, and Web search. In these applications, ranking records with uncertain attributes introduces new problems that are fundamentally different from conventional ranking. Specifically, uncertainty in records’ scores induces a partial order over records, as opposed to the total order that is assumed in the conventional ranking settings. In this paper, we present a new probabilistic model, based on partial orders, to encapsulate the space of possible rankings originating from score uncertainty. Under this model, we formulate several ranking query types with different semantics. We describe and analyze a set of efficient query evaluation algorithms. We show that our techniques can be used to solve the problem of rank aggregation in partial orders under two widely adopted distance metrics. In addition, we design sampling techniques based on Markov chains to compute approximate query answers. Our experimental evaluation uses both real and synthetic data. The experimental study demonstrates the efficiency and effectiveness of our techniques under various configurations.
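The idea that uncertain scores induce a partial order, rather than a total order, can be illustrated with a small sketch. This is not the authors' model or algorithm: it assumes each record's score is known only as an interval, treats one record as dominating another when its lowest possible score exceeds the other's highest, and enumerates the rankings consistent with that dominance by brute force. The record names, the intervals, and the helpers `dominates`, `possible_rankings`, and `kendall_tau` are all hypothetical, introduced only for illustration.

```python
from itertools import permutations

# Hypothetical records: name -> (lowest possible score, highest possible score).
records = {
    "a": (6.0, 6.0),  # score known exactly
    "b": (4.0, 8.0),  # uncertain score
    "c": (1.0, 5.0),  # uncertain score
}


def dominates(x, y):
    """True if x outranks y for every possible pair of scores."""
    return records[x][0] > records[y][1]


def possible_rankings():
    """Enumerate, by brute force, the rankings consistent with dominance."""
    names = sorted(records)
    result = []
    for perm in permutations(names):
        # Reject a ranking if a lower-placed record dominates a higher-placed one.
        consistent = all(
            not dominates(perm[j], perm[i])
            for i in range(len(perm))
            for j in range(i + 1, len(perm))
        )
        if consistent:
            result.append(perm)
    return result


def kendall_tau(r1, r2):
    """Count the record pairs that the two rankings order differently."""
    pos1 = {r: i for i, r in enumerate(r1)}
    pos2 = {r: i for i, r in enumerate(r2)}
    items = list(r1)
    return sum(
        1
        for i in range(len(items))
        for j in range(i + 1, len(items))
        if (pos1[items[i]] - pos1[items[j]]) * (pos2[items[i]] - pos2[items[j]]) < 0
    )


rankings = possible_rankings()
for ranking in rankings:
    print(ranking)
```

Here only `a` dominates `c`, so `b` can appear at any position and three rankings remain possible. Brute-force enumeration is practical only for a handful of records, since the number of consistent rankings can grow factorially. The paper's model additionally assigns probabilities to these rankings, which this sketch does not attempt.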

Keywords

Ranking, Top-k, Uncertain data, Probabilistic data, Partial orders, Rank aggregation, Kendall tau

Copyright information

© Springer-Verlag 2010

Authors and Affiliations

  • Mohamed A. Soliman (1)
  • Ihab F. Ilyas (1)
  • Shalev Ben-David (1)
  1. School of Computer Science, University of Waterloo, Waterloo, Canada
