The VLDB Journal, Volume 26, Issue 2, pp 177–201

In-database batch and query-time inference over probabilistic graphical models using UDA–GIST

  • Kun Li
  • Xiaofeng Zhou
  • Daisy Zhe Wang
  • Christan Grant
  • Alin Dobra
  • Christopher Dudley
Regular Paper

Abstract

To meet customers’ pressing demands, enterprise database vendors have been pushing advanced analytical techniques into databases. Most major DBMSes use the user-defined aggregate (UDA), a data-driven operator, to implement analytical techniques in parallel. However, UDAs alone are not sufficient to implement statistical algorithms in which most of the work is performed by iterative transitions over a large state that cannot be naively partitioned because of data dependencies. Typically, this type of statistical algorithm requires pre-processing to set up the large state and post-processing after the statistical inference. This paper presents general iterative state transition (GIST), a new database operator for parallel iterative state transitions over large states. GIST receives a state constructed by a UDA and then performs rounds of transitions on the state until it converges. A final UDA performs post-processing and result extraction. We argue that the combination of UDA and GIST (UDA–GIST) unifies data-parallel and state-parallel processing in a single system, thus significantly extending the analytical capabilities of DBMSes. We exemplify the framework through two high-profile batch applications, cross-document coreference and image denoising, and one query-time inference application, marginal inference queries over probabilistic knowledge graphs. All three applications use probabilistic graphical models, which encode complex relationships among variables and are powerful for a wide range of problems. We show that the in-database framework allows us to tackle a problem 27 times larger than that handled by a scalable distributed solution for the first application and achieves a 43-fold speedup over the state of the art for the second application. For the third application, we implement query-time inference using the UDA–GIST framework and apply it to a probabilistic knowledge graph, achieving a 10-fold speedup over sequential inference. To the best of our knowledge, this is the first in-database query-time inference engine over a large probabilistic knowledge base. We show that the UDA–GIST framework for data- and graph-parallel computations can support both batch and query-time inference efficiently in databases.
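
The three-stage shape described in the abstract (a UDA builds the state, GIST applies rounds of transitions until convergence, a final UDA extracts the result) can be illustrated with a toy sketch. This is not the paper's API or any of its inference algorithms: the names `BuildGraphUDA`, `gist`, `extract_uda`, and `run_pipeline` are hypothetical, and minimum-label propagation over a small graph stands in for the state transitions only because it is simple and deterministic.

```python
from collections import defaultdict


class BuildGraphUDA:
    """Hypothetical UDA: aggregates edge tuples into one large state."""

    def __init__(self):
        self.adj = defaultdict(set)

    def accumulate(self, u, v):
        # Called once per input tuple (data-parallel phase).
        self.adj[u].add(v)
        self.adj[v].add(u)

    def merge(self, other):
        # Combines partial states built by different partitions.
        for node, nbrs in other.adj.items():
            self.adj[node] |= nbrs

    def terminate(self):
        # The finished state: the graph plus one mutable label per node.
        return {"adj": dict(self.adj), "label": {n: n for n in self.adj}}


def gist(state, max_rounds=100):
    """Hypothetical GIST loop: rounds of transitions until nothing changes."""
    for _ in range(max_rounds):
        changed = False
        for node, nbrs in state["adj"].items():
            # Transition: adopt the smallest label among the node and its neighbors.
            best = min([state["label"][node]] + [state["label"][n] for n in nbrs])
            if best != state["label"][node]:
                state["label"][node] = best
                changed = True
        if not changed:  # convergence test
            break
    return state


def extract_uda(state):
    """Hypothetical final UDA: post-processes the converged state into a result."""
    groups = defaultdict(list)
    for node, label in state["label"].items():
        groups[label].append(node)
    return sorted(sorted(g) for g in groups.values())


def run_pipeline(edges, partitions=2):
    # Each partition gets its own UDA instance; the partial states are then merged.
    udas = [BuildGraphUDA() for _ in range(partitions)]
    for i, (u, v) in enumerate(edges):
        udas[i % partitions].accumulate(u, v)
    root = udas[0]
    for other in udas[1:]:
        root.merge(other)
    return extract_uda(gist(root.terminate()))
```

Calling `run_pipeline([(1, 2), (2, 3), (4, 5)])` groups the nodes into their connected components. The sketch runs the transitions sequentially; the parallel execution of transitions over the shared state, which is the point of GIST, is not modeled here.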

Keywords

  • In-database analytics
  • Query-time inference
  • Batch inference
  • Data-parallel analytics
  • Graph-parallel analytics

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. Department of Computer and Information Science and Engineering, University of Florida, Gainesville, USA
  2. School of Computer Science, University of Oklahoma, Norman, USA
