In-database batch and query-time inference over probabilistic graphical models using UDA–GIST

  • Regular Paper
  • The VLDB Journal

Abstract

To meet customers’ pressing demands, enterprise database vendors have been pushing advanced analytical techniques into databases. Most major DBMSes use user-defined aggregates (UDAs), a data-driven operator, to implement analytical techniques in parallel. However, UDAs alone are not sufficient to implement statistical algorithms in which most of the work is performed by iterative transitions over a large state that cannot be naively partitioned because of data dependencies. Typically, this type of statistical algorithm requires pre-processing to set up the large state in the first place and post-processing after the statistical inference. This paper presents general iterative state transition (GIST), a new database operator for parallel iterative state transitions over large states. GIST receives a state constructed by a UDA and then performs rounds of transitions on the state until it converges. A final UDA performs post-processing and result extraction. We argue that the combination of UDA and GIST (UDA–GIST) unifies data-parallel and state-parallel processing in a single system, thus significantly extending the analytical capabilities of DBMSes. We exemplify the framework through two high-profile batch applications, cross-document coreference and image denoising, and one query-time inference application, marginal inference queries over probabilistic knowledge graphs. All three applications use probabilistic graphical models, which encode complex relationships among variables and are powerful for a wide range of problems. We show that the in-database framework allows us to tackle a 27 times larger problem than a scalable distributed solution for the first application and achieves a 43 times speedup over the state of the art for the second application. For the third application, we implement query-time inference using the UDA–GIST framework and apply it to a probabilistic knowledge graph, achieving a 10 times speedup over sequential inference. To the best of our knowledge, this is the first in-database query-time inference engine over a large probabilistic knowledge base. We show that the UDA–GIST framework for data- and graph-parallel computations can support both batch and query-time inference efficiently in databases.
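
The three-stage flow described in the abstract (a UDA builds the state, GIST iterates over it until convergence, a final step extracts results) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the names `BuildGraphUDA`, `gist`, `extract_components` and `run_pipeline` are hypothetical and are not the paper's API, min-label propagation over a graph stands in for real inference over a graphical model, and the loop runs sequentially rather than in parallel.

```python
from collections import defaultdict


class BuildGraphUDA:
    """Hypothetical UDA: folds edge tuples into an adjacency-list state."""

    def __init__(self):
        self.adj = defaultdict(set)

    def accumulate(self, edge):
        # Called once per input tuple; each data partition gets its own instance.
        u, v = edge
        self.adj[u].add(v)
        self.adj[v].add(u)

    def merge(self, other):
        # Combines the partial states built over different partitions.
        for node, nbrs in other.adj.items():
            self.adj[node] |= nbrs

    def terminate(self):
        # The large state handed to the iterative stage: the graph plus one label per node.
        return {"adj": dict(self.adj), "label": {n: n for n in self.adj}}


def gist(state, max_rounds=100):
    """Hypothetical GIST stage: rounds of transitions until no label changes."""
    for round_no in range(1, max_rounds + 1):
        changed = False
        for node, nbrs in state["adj"].items():
            # Transition: adopt the smallest label seen in the neighbourhood.
            best = min([state["label"][node]] + [state["label"][n] for n in nbrs])
            if best != state["label"][node]:
                state["label"][node] = best
                changed = True
        if not changed:  # convergence test
            return state, round_no
    return state, max_rounds


def extract_components(state):
    """Stand-in for the final post-processing step: group nodes by converged label."""
    groups = defaultdict(list)
    for node, label in state["label"].items():
        groups[label].append(node)
    return sorted(sorted(g) for g in groups.values())


def run_pipeline(partitions):
    # Data-parallel part: one UDA instance per partition, then merge.
    partials = []
    for part in partitions:
        uda = BuildGraphUDA()
        for edge in part:
            uda.accumulate(edge)
        partials.append(uda)
    merged = partials[0]
    for p in partials[1:]:
        merged.merge(p)
    # State-parallel part (sequential here), followed by result extraction.
    state, rounds = gist(merged.terminate())
    return extract_components(state), rounds


components, rounds = run_pipeline([[(1, 2), (2, 3)], [(4, 5)], [(3, 6)]])
print(components)  # [[1, 2, 3, 6], [4, 5]]
```

The point of the sketch is the division of labour: `accumulate` and `merge` only ever see tuples or partial states, so they partition cleanly by data, whereas `gist` needs the whole state at once because each transition reads its neighbours' labels.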

Acknowledgments

This work was partially supported by NSF IIS Award No. 1526753, DARPA under FA8750-12-2-0348-2 (DEFT/CUBISM), and a generous gift from Google.

Author information

Corresponding author

Correspondence to Xiaofeng Zhou.

Additional information

Kun Li and Xiaofeng Zhou contributed equally to this paper.

About this article

Cite this article

Li, K., Zhou, X., Wang, D.Z. et al. In-database batch and query-time inference over probabilistic graphical models using UDA–GIST. The VLDB Journal 26, 177–201 (2017). https://doi.org/10.1007/s00778-016-0446-1

  • DOI: https://doi.org/10.1007/s00778-016-0446-1
