FACTORBASE: multi-relational structure learning with SQL all the way

  • Regular Paper
International Journal of Data Science and Analytics

Abstract

FactorBase is a new SQL-based framework that leverages a relational database management system to support multi-relational model discovery. A multi-relational statistical model provides an integrated analysis of the heterogeneous and interdependent data resources in the database. We adopt the BayesStore design philosophy: Statistical models are stored and managed as first-class citizens inside a database (Wang et al., in: PVLDB, pp 340–351, 2008). Whereas previous systems like BayesStore support multi-relational inference, FactorBase supports multi-relational learning. A case study on six benchmark databases evaluates how our system supports a challenging machine learning application, namely learning a first-order Bayesian network model for an entire database. Model learning in this setting has to examine a large number of potential statistical associations across data tables. Our implementation shows how the SQL constructs in FactorBase facilitate the fast, modular, and reliable development of scalable model learning systems.

Notes

  1. A par-factor can also include constraints on possible groundings.

  2. The schema assumes that all relationships are binary.

  3. Essentially, the same concept is called a slot chain in PRM modeling [9].

  4. www.grouplens.org, 1M version.

  5. www.imdb.com, July 2013.

References

  1. Chickering, D.: Optimal structure identification with greedy search. J. Mach. Learn. Res. 3, 507–554 (2003)

  2. Contributors, A.S.P.: Apache Spark. http://spark.apache.org/. Accessed 9 Mar 2016

  3. Deshpande, A., Madden, S.: MauveDB: supporting model-based user views in database systems. In: SIGMOD, pp. 73–84. ACM (2006)

  4. Domingos, P., Lowd, D.: Markov Logic: An Interface Layer for Artificial Intelligence. Morgan and Claypool Publishers, San Rafael (2009)

  5. Dzeroski, S., Lavrac, N. (eds.): Relational Data Mining. Springer, Berlin (2001)

  6. Feng, X., Kumar, A., Recht, B., Ré, C.: Towards a unified architecture for in-RDBMS analytics. In: SIGMOD Conference, pp. 325–336 (2012)

  7. Friedman, N., Getoor, L., Koller, D., Pfeffer, A.: Learning probabilistic relational models. In: IJCAI, pp. 1300–1309. Springer (1999)

  8. Geiger, D., Heckerman, D.: Knowledge representation and inference in similarity networks and Bayesian multinets. Artif. Intell. 82(1–2), 45–74 (1996)

  9. Getoor, L., Friedman, N., Koller, D., Pfeffer, A., Taskar, B.: Probabilistic relational models. In: Introduction to Statistical Relational Learning [10], chap. 5, pp. 129–173

  10. Getoor, L., Taskar, B.: Introduction to Statistical Relational Learning. MIT Press, Cambridge (2007)

  11. Getoor, L., Taskar, B., Koller, D.: Selectivity estimation using probabilistic models. ACM SIGMOD Rec. 30(2), 461–472 (2001)

  12. Graefe, G., Fayyad, U.M., Chaudhuri, S.: On the efficient gathering of sufficient statistics for classification from large SQL databases. In: KDD, pp. 204–208 (1998)

  13. Heckerman, D., Meek, C., Koller, D.: Probabilistic entity-relationship models, PRMs, and plate models. In: Getoor and Taskar [10]

  14. Hellerstein, J.M., Ré, C., Schoppmann, F., Wang, D.Z., Fratkin, E., Gorajek, A., Ng, K.S., Welton, C., Feng, X., Li, K., Kumar, A.: The MADlib analytics library: Or MAD skills, the SQL. PVLDB 5(12), 1700–1711 (2012)

  15. Jampani, R., Xu, F., Wu, M., Perez, L.L., Jermaine, C.M., Haas, P.J.: MCDB: a Monte Carlo approach to managing uncertain data. In: SIGMOD Conference, pp. 687–700 (2008)

  16. Khosravi, H., Schulte, O., Man, T., Xu, X., Bina, B.: Structure learning for Markov logic networks with many descriptive attributes. In: AAAI, pp. 487–493 (2010)

  17. Khot, T., Shavlik, J., Natarajan, S.: Boostr. http://pages.cs.wisc.edu/~tushar/Boostr/. Accessed 21 Nov 2012

  18. Kimmig, A., Mihalkova, L., Getoor, L.: Lifted graphical models: a survey. Mach. Learn. 99(1), 1–45 (2015). https://doi.org/10.1007/s10994-014-5443-2

  19. Kraska, T., Talwalkar, A., Duchi, J.C., Griffith, R., Franklin, M.J., Jordan, M.I.: MLbase: a distributed machine-learning system. In: CIDR (2013)

  20. Lavrač, N., Perovšek, M., Vavpetič, A.: Propositionalization online. In: ECML, pp. 456–459. Springer (2014)

  21. Lv, Q., Xia, X., Qian, P.: A fast calculation of metric scores for learning Bayesian network. Int. J. Autom. Comput. 9, 37–44 (2012)

  22. Milch, B., Marthi, B., Russell, S.J., Sontag, D., Ong, D.L., Kolobov, A.: BLOG: probabilistic models with unknown objects. In: IJCAI, pp. 1352–1359 (2005)

  23. Moore, A.W., Lee, M.S.: Cached sufficient statistics for efficient machine learning with large datasets. JAIR 8, 67–91 (1998)

  24. Natarajan, S., Khot, T., Kersting, K., Gutmann, B., Shavlik, J.W.: Gradient-based boosting for statistical relational learning: the relational dependency network case. Mach. Learn. 86(1), 25–56 (2012)

  25. Niu, F., Ré, C., Doan, A., Shavlik, J.W.: Tuffy: scaling up statistical inference in Markov logic networks using an RDBMS. PVLDB 4(6), 373–384 (2011)

  26. Niu, F., Zhang, C., Ré, C., Shavlik, J.: Felix: Scaling Inference for Markov Logic with an Operator-Based Approach. ArXiv e-prints (2011)

  27. Peralta, V.: Extraction and integration of MovieLens and IMDb data. Technical report, Laboratoire PRiSM (2007)

  28. Poole, D.: First-order probabilistic inference. In: Gottlob, G., Walsh, T. (eds.) IJCAI-03, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico, August 9–15, 2003, pp. 985–991. Morgan Kaufmann (2003)

  29. Popescul, A., Ungar, L.: Feature generation and selection in multi-relational learning. In: Introduction to Statistical Relational Learning [10], chap. 16, pp. 453–476

  30. Qian, Z., Schulte, O.: The Bayes base system (2015). http://www.cs.sfu.ca/~oschulte/BayesBase/BayesBase.html. Accessed 6 May 2016

  31. Qian, Z., Schulte, O., Sun, Y.: Computing multi-relational sufficient statistics for large databases. In: CIKM, pp. 1249–1258. ACM (2014)

  32. Quakkelaar, R.: Exploiting relational database technology for statistical machine learning in factor base. Master's thesis, Open Universiteit Nederland (2017)

  33. Ramakrishnan, R., Gehrke, J.: Database Management Systems, 3rd edn. McGraw-Hill, New York (2003)

  34. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice Hall, Upper Saddle River (2010)

  35. Schulte, O., Khosravi, H.: Learning graphical models for relational data via lattice search. Mach. Learn. 88(3), 331–368 (2012)

  36. Schulte, O., Luo, W., Greiner, R.: Mind-change optimal learning of Bayes net structure from dependency and independency data. Inf. Comput. 208, 63–82 (2010)

  37. Schulte, O., Qian, Z.: Factorbase: SQL for learning a multi-relational graphical model. arXiv preprint (2015). arXiv:1508.02428

  38. Singh, A.P., Gordon, G.J.: Relational learning via collective matrix factorization. In: SIGKDD, pp. 650–658. ACM (2008)

  39. Singh, S., Graepel, T.: Automated probabilistic modeling for relational data. In: CIKM, pp. 1497–1500. ACM (2013)

  40. Sun, Y., Han, J.: Mining Heterogeneous Information Networks: Principles and Methodologies, vol. 3. Morgan & Claypool Publishers, San Rafael (2012)

  41. Walker, T., O’Reilly, C., Kunapuli, G., Natarajan, S., Maclin, R., Page, D., Shavlik, J.W.: Automating the ILP setup task: converting user advice about specific examples into general background knowledge. In: ILP, pp. 253–268 (2010)

  42. Wang, D.Z., Michelakis, E., Garofalakis, M., Hellerstein, J.M.: BayesStore: managing large, uncertain data repositories with probabilistic graphical models. In: PVLDB, pp. 340–351 (2008)

  43. Wick, M.L., McCallum, A., Miklau, G.: Scalable probabilistic databases with factor graphs and MCMC. In: PVLDB, pp. 794–804 (2010)

  44. Wong, S.M., Butz, C.J., Xiang, Y.: A method for implementing a probabilistic model as a relational database. In: UAI, pp. 556–564 (1995)

Acknowledgements

This research was supported by a Discovery Grant to Oliver Schulte by the Natural Sciences and Engineering Research Council of Canada. Zhensong Qian was supported by a grant from the China Scholarship Council. We are indebted to anonymous reviewers for the Journal of Data Science and Analytics for helpful comments that improved the paper presentation substantially.

Author information

Corresponding author

Correspondence to Zhensong Qian.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Appendix: The random variable database layout

We provide details about the Schema Analyzer. A complete SQL script that implements the Schema Analyzer is available [37]. Table 21 shows the relational schema of the Random Variable Database. Figure 9 shows dependencies between the tables of this schema.

Table 21 Schema for random variable database
Fig. 9 Table dependencies in the random variable database \(VDB\)
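
As a rough illustration of what a schema analyzer can do, the sketch below reads the system catalog of a small SQLite database and derives metadata about random variables from it. It is not the SQL script referenced in [37]: the example tables (`Student`, `Course`, `Registration`), the function names, and the rule "a table with exactly two foreign-key columns is a binary relationship" are all assumptions made for this example, chosen to match the note that all relationships are assumed to be binary.

```python
import sqlite3


def build_demo_db():
    """Create a hypothetical university database in memory (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Student (s_id INTEGER PRIMARY KEY, intelligence TEXT);
        CREATE TABLE Course  (c_id INTEGER PRIMARY KEY, difficulty TEXT);
        CREATE TABLE Registration (
            s_id  INTEGER REFERENCES Student(s_id),
            c_id  INTEGER REFERENCES Course(c_id),
            grade TEXT,
            PRIMARY KEY (s_id, c_id)
        );
        """
    )
    return conn


def analyze_schema(conn):
    """Return (name, kind, source_table) rows describing random variables.

    Assumed rules for this sketch:
      * a table with exactly two foreign-key columns yields one
        'relationship' variable;
      * every column that is neither a primary key nor a foreign key
        yields one 'attribute' variable.
    """
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    rvars = []
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        fkeys = conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
        fk_columns = {fk[3] for fk in fkeys}

        if len(fk_columns) == 2:
            rvars.append((table, "relationship", table))

        for col in columns:
            name, is_pk = col[1], col[5]
            if is_pk or name in fk_columns:
                continue  # key columns identify entities; they are not variables
            rvars.append((f"{table}.{name}", "attribute", table))
    return rvars


conn = build_demo_db()
for rv in analyze_schema(conn):
    print(rv)
```

The returned rows could themselves be written to a table, so that the metadata about random variables lives in the database next to the data it describes.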

Cite this article

Schulte, O., Qian, Z. FACTORBASE: multi-relational structure learning with SQL all the way. Int J Data Sci Anal 7, 289–309 (2019). https://doi.org/10.1007/s41060-018-0130-1

  • DOI: https://doi.org/10.1007/s41060-018-0130-1
