The VLDB Journal, 18:1065

PrDB: managing and exploiting rich correlations in probabilistic databases

Special Issue Paper

Abstract

Because numerous applications produce noisy data (e.g., sensor data, experimental data, data from uncurated sources, and information extraction), there has been a surge of interest in the development of probabilistic databases. Most probabilistic database models proposed to date, however, fail to meet the challenges of real-world applications on two counts: (1) they often restrict the kinds of uncertainty that the user can represent; and (2) their query processing algorithms often cannot scale up to the needs of the application. In this work, we define a probabilistic database model, PrDB, that uses graphical models, a state-of-the-art probabilistic modeling technique developed within the statistics and machine learning community, to model uncertain data. We show how this results in a rich, complex, yet compact probabilistic database model that captures the commonly occurring uncertainty models (tuple uncertainty, attribute uncertainty) as well as more complex models (correlated tuples and attributes), and that allows compact representation (shared and schema-level correlations). In addition, we show how query evaluation in PrDB translates into inference in an appropriately augmented graphical model. This allows us to use any of the many exact and approximate inference algorithms developed within the graphical modeling community. While probabilistic inference provides a generic approach to solving queries, we show how the use of shared correlations, together with a novel inference algorithm that we developed based on bisimulation, can speed up query processing significantly. We present a comprehensive experimental evaluation of the proposed techniques and show that even with a few shared correlations, significant speedups are possible.
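To make the idea of query evaluation as inference more concrete, the following is a minimal Python sketch, not the PrDB implementation. It assumes a hypothetical toy database with two uncertain tuples whose Boolean existence variables `x1` and `x2` are described by factors, including one factor that makes the tuples mutually exclusive. The factor values, the variable names, and the helpers `world_weight` and `query_probability` are illustrative assumptions. The sketch answers a query by brute-force enumeration of possible worlds, which stands in for the exact and approximate inference algorithms mentioned in the abstract and does not use the bisimulation-based algorithm.

```python
from itertools import product

# Hypothetical toy data: two uncertain tuples with Boolean existence variables.
variables = ["x1", "x2"]

# Each factor is (scope, table); the table maps an assignment of the scope
# variables to a non-negative weight. The first two factors encode per-tuple
# uncertainty; the third encodes a correlation (mutual exclusion).
factors = [
    (("x1",), {(True,): 0.6, (False,): 0.4}),
    (("x2",), {(True,): 0.5, (False,): 0.5}),
    (("x1", "x2"), {(True, True): 0.0, (True, False): 1.0,
                    (False, True): 1.0, (False, False): 1.0}),
]


def world_weight(world):
    """Unnormalized weight of one possible world: the product of all factors."""
    weight = 1.0
    for scope, table in factors:
        weight *= table[tuple(world[v] for v in scope)]
    return weight


def query_probability(predicate):
    """Probability that a Boolean query holds, by enumerating every world."""
    total = 0.0
    satisfied = 0.0
    for values in product([True, False], repeat=len(variables)):
        world = dict(zip(variables, values))
        weight = world_weight(world)
        total += weight
        if predicate(world):
            satisfied += weight
    return satisfied / total  # normalize by the sum over all worlds


# "Does at least one tuple exist?" and "Do both tuples exist?"
p_any = query_probability(lambda w: w["x1"] or w["x2"])
p_both = query_probability(lambda w: w["x1"] and w["x2"])
print(p_any, p_both)
```

Enumeration is exponential in the number of variables, so this sketch only illustrates the semantics; it says nothing about how a real system would scale.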

Keywords

Probabilistic databases, Uncertain databases, Graphical models, Query processing, Lifted inference, Bisimulation


Copyright information

© Springer-Verlag 2009

Authors and Affiliations

  1. Computer Science Department, University of Maryland, College Park, USA
