The VLDB Journal, Volume 20, Issue 2, pp 249–275

A unified approach to ranking in probabilistic databases

Special Issue Paper

Abstract

Ranking is a fundamental operation in data analysis and decision support and plays an even more crucial role if the dataset being explored exhibits uncertainty. This has led to much work in understanding how to rank the tuples in a probabilistic dataset in recent years. In this article, we present a unified approach to ranking and top-k query processing in probabilistic databases by viewing it as a multi-criterion optimization problem and by deriving a set of features that capture the key properties of a probabilistic dataset that dictate the ranked result. We contend that a single, specific ranking function may not suffice for probabilistic databases, and we instead propose two parameterized ranking functions, called PRFω and PRFe, that generalize or can approximate many of the previously proposed ranking functions. We present novel generating functions-based algorithms for efficiently ranking large datasets according to these ranking functions, even if the datasets exhibit complex correlations modeled using probabilistic and/xor trees or Markov networks. We further propose that the parameters of the ranking function be learned from user preferences, and we develop an approach to learn those parameters. Finally, we present a comprehensive experimental study that illustrates the effectiveness of our parameterized ranking functions, especially PRFe, at approximating other ranking functions and the scalability of our proposed algorithms for exact or approximate ranking.
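To make the generating-function idea concrete, here is a minimal sketch for the simplest case only: independent tuples with distinct scores. The abstract does not state the definitions, so the sketch assumes that PRFω scores a tuple t as the sum over ranks i of ω(i) · Pr(t has rank i), and that PRFe is the special case ω(i) = α^i. Under those assumptions, the rank distribution of a tuple can be read off the coefficients of the product of (1 − p + p·x) over all higher-scoring tuples. The function names and the (score, probability) input layout are hypothetical, and the and/xor tree and Markov network cases are not covered.

```python
from typing import Callable, List, Tuple

ProbTuple = Tuple[float, float]  # (score, existence probability); hypothetical layout


def rank_distributions(tuples: List[ProbTuple]):
    """Return tuples in descending score order with their rank distributions.

    dist[j] = Pr(tuple exists and exactly j higher-scoring tuples exist),
    i.e. Pr(rank == j + 1). Assumes independent tuples and distinct scores.
    """
    ordered = sorted(tuples, key=lambda t: -t[0])
    poly = [1.0]  # coefficients of prod over higher-scoring tuples of (1 - p + p*x)
    dists = []
    for _, p in ordered:
        dists.append([p * c for c in poly])
        # Multiply the running polynomial by (1 - p + p*x).
        nxt = [0.0] * (len(poly) + 1)
        for k, c in enumerate(poly):
            nxt[k] += c * (1.0 - p)
            nxt[k + 1] += c * p
        poly = nxt
    return ordered, dists


def prf_omega(tuples: List[ProbTuple], omega: Callable[[int], float]):
    """Score each tuple as sum_i omega(i) * Pr(rank == i); quadratic time here."""
    ordered, dists = rank_distributions(tuples)
    return [
        (t, sum(omega(j + 1) * pr for j, pr in enumerate(d)))
        for t, d in zip(ordered, dists)
    ]


def prf_e(tuples: List[ProbTuple], alpha: float):
    """Special case omega(i) = alpha**i: evaluate the polynomial at x = alpha."""
    ordered = sorted(tuples, key=lambda t: -t[0])
    prod = 1.0  # value of the running product at x = alpha
    scored = []
    for score, p in ordered:
        scored.append(((score, p), p * alpha * prod))
        prod *= 1.0 - p + p * alpha
    return scored


data = [(10.0, 0.5), (8.0, 0.4), (5.0, 1.0)]
by_prf_e = sorted(prf_e(data, 0.5), key=lambda r: -r[1])
```

In this sketch the exponential weights let `prf_e` evaluate the product at a single point instead of expanding it, so it needs one pass over the sorted tuples, whereas `prf_omega` expands every coefficient. Choosing ω(i) = 1 for i ≤ k and 0 otherwise makes the score equal to the probability that the tuple appears among the top k.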

Keywords

Probabilistic databases, Ranking, Learning to rank, Approximation techniques, Graphical models

Copyright information

© Springer-Verlag 2011

Authors and Affiliations

  1. Computer Science Department, University of Maryland, College Park, USA
