The VLDB Journal

, Volume 26, Issue 2, pp 177–201

In-database batch and query-time inference over probabilistic graphical models using UDA–GIST

  • Kun Li
  • Xiaofeng Zhou
  • Daisy Zhe Wang
  • Christan Grant
  • Alin Dobra
  • Christopher Dudley
Regular Paper

DOI: 10.1007/s00778-016-0446-1

Cite this article as:
Li, K., Zhou, X., Wang, D.Z. et al. The VLDB Journal (2017) 26: 177. doi:10.1007/s00778-016-0446-1
  • 128 Downloads

Abstract

To meet customers’ pressing demands, enterprise database vendors have been pushing advanced analytical techniques into databases. Most major DBMSes use user-defined aggregates (UDAs), a data-driven operator, to implement analytical techniques in parallel. However, UDAs alone are not sufficient to implement statistical algorithms where most of the work is performed by iterative transitions over a large state that cannot be naively partitioned due to data dependency. Typically, this type of statistical algorithm requires pre-processing to set up the large state in the first place and demands post-processing after the statistical inference. This paper presents general iterative state transition (GIST), a new database operator for parallel iterative state transitions over large states. GIST receives a state constructed by a UDA and then performs rounds of transitions on the state until it converges. A final UDA performs post-processing and result extraction. We argue that the combination of UDA and GIST (UDA–GIST) unifies data-parallel and state-parallel processing in a single system, thus significantly extending the analytical capabilities of DBMSes. We exemplify the framework through two high-profile batch applications: cross-document coreference, image denoising and one query-time inference application: marginal inference queries over probabilistic knowledge graphs. The 3 applications use probabilistic graphical models, which encode complex relationships of different variables and are powerful for a wide range of problems. We show that the in-database framework allows us to tackle a 27 times larger problem than a scalable distributed solution for the first application and achieves 43 times speedup over the state-of-the-art for the second application. For the third application, we implement query-time inference using the UDA–GIST framework and apply over a probabilistic knowledge graph, achieving 10 times speedup over sequential inference. To the best of our knowledge, this is the first in-database query-time inference engine over large probabilistic knowledge base. We show that the UDA–GIST framework for data- and graph-parallel computations can support both batch and query-time inference efficiently in databases.

Keywords

In-database analytics Query-time inference Batch inference Data-parallel analytics Graph-parallel analytics 

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. 1.Department of Computer and Information Science and EngineeringUniversity of FloridaGainesvilleUSA
  2. 2.School of Computer ScienceUniversity of OklahomaNormanUSA