The VLDB Journal, Volume 26, Issue 1, pp 5–30

Dissociation and propagation for approximate lifted inference with standard relational database management systems

Special Issue Paper

Abstract

Probabilistic inference over large data sets is a challenging data management problem: exact inference is generally #P-hard, so today it is most often solved approximately with sampling-based methods. This paper proposes an alternative approach for approximate evaluation of conjunctive queries with standard relational databases: in our approach, every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, and then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known results on PTIME self-join-free conjunctive queries: a query is in PTIME if and only if our algorithm returns a single plan. Furthermore, our approach is a generalization of a family of efficient ranking methods from graphs to hypergraphs. We also adapt three relational query optimization techniques to evaluate all necessary plans efficiently. We give a detailed experimental evaluation of our approach and, in the process, provide a new way of thinking about the value of probabilistic methods over non-probabilistic methods for ranking query answers. We also note that the techniques developed in this paper apply immediately to lifted inference from statistical relational models, since lifted inference corresponds to PTIME plans in probabilistic databases.
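
The idea of bounding a query probability by the minimum over several plans can be illustrated on a toy instance. The following is a minimal sketch, not the paper's algorithm or implementation: it assumes a tuple-independent database, uses the Boolean query q :- R(x), S(x,y), T(y) as a stand-in example, and evaluates two hand-written plans in plain Python rather than in a database engine. All names (`exact_probability`, `plan_group_by_x`, `plan_group_by_y`) and the data are hypothetical.

```python
from itertools import product

# Hypothetical toy tuple-independent database: tuple -> marginal probability.
R = {1: 0.5, 2: 0.5}
S = {(1, 1): 0.8, (1, 2): 0.6, (2, 1): 0.9}
T = {1: 0.5, 2: 0.7}


def independent_or(ps):
    """Probability that at least one of several independent events occurs."""
    q = 1.0
    for p in ps:
        q *= 1.0 - p
    return 1.0 - q


def exact_probability(R, S, T):
    """Brute-force P(q) for q :- R(x), S(x,y), T(y) over all possible worlds."""
    tuples = [("R", k) for k in R] + [("S", k) for k in S] + [("T", k) for k in T]
    probs = [R[k] for k in R] + [S[k] for k in S] + [T[k] for k in T]
    total = 0.0
    for world in product([False, True], repeat=len(tuples)):
        present = {t for t, keep in zip(tuples, world) if keep}
        weight = 1.0
        for keep, p in zip(world, probs):
            weight *= p if keep else 1.0 - p
        satisfied = any(
            ("R", x) in present and ("S", (x, y)) in present and ("T", y) in present
            for (x, y) in S
        )
        if satisfied:
            total += weight
    return total


def plan_group_by_x(R, S, T):
    """Plan that joins S with T, projects away y per x, joins R, projects away x.

    Each T(y) is treated as an independent copy for every x it joins with.
    """
    per_x = []
    for x, r in R.items():
        inner = independent_or(
            s * T.get(y, 0.0) for (x2, y), s in S.items() if x2 == x
        )
        per_x.append(r * inner)
    return independent_or(per_x)


def plan_group_by_y(R, S, T):
    """Symmetric plan: each R(x) is treated as an independent copy for every y."""
    per_y = []
    for y, t in T.items():
        inner = independent_or(
            s * R.get(x, 0.0) for (x, y2), s in S.items() if y2 == y
        )
        per_y.append(t * inner)
    return independent_or(per_y)


exact = exact_probability(R, S, T)
bound_x = plan_group_by_x(R, S, T)
bound_y = plan_group_by_y(R, S, T)
print("exact:", exact)
print("plan bounds:", bound_x, bound_y, "-> min:", min(bound_x, bound_y))
```

In this sketch each plan treats a tuple that participates in several joins as several independent copies, which can only overestimate the probability of this monotone query, so the smaller of the two values is the tighter upper bound. The brute-force enumeration is exponential in the number of tuples and is included only to check the bounds on tiny inputs.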

Keywords

Probabilistic inference · Lifted inference · Probabilistic databases · Problem relaxation · Ranking · Query plans · Query optimization

Notes

Acknowledgments

This work was supported in part by NSF Grants IIS-0513877, IIS-0713576, IIS-0915054, IIS-1115188, IIS-1247469, and CAREER IIS-1553547. We would like to thank Abhay Jha for help with the experiments in the workshop version of this paper, Alexandra Meliou for suggesting the name “dissociation”, and Vibhav Gogate for guidance in using his tool SampleSearch. WG would also like to thank Manfred Hauswirth for a small comment in 2007 that was crucial for the development of the ideas in this paper.

Supplementary material

Supplementary material 1: 778_2016_434_MOESM1_ESM.pdf (PDF, 644 KB)


Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. Tepper School of Business, Carnegie Mellon University, Pittsburgh, USA
  2. Department of Computer Science and Engineering, University of Washington, Seattle, USA
