FACTORBASE: multi-relational structure learning with SQL all the way

Schulte, Oliver; Qian, Zhensong

doi:10.1007/s41060-018-0130-1

Regular Paper
Published: 02 June 2018

Volume 7, pages 289–309, (2019)
International Journal of Data Science and Analytics

Abstract

FactorBase is a new SQL-based framework that leverages a relational database management system to support multi-relational model discovery. A multi-relational statistical model provides an integrated analysis of the heterogeneous and interdependent data resources in the database. We adopt the BayesStore design philosophy: Statistical models are stored and managed as first-class citizens inside a database (Wang et al., in: PVLDB, pp 340–351, 2008). Whereas previous systems like BayesStore support multi-relational inference, FactorBase supports multi-relational learning. A case study on six benchmark databases evaluates how our system supports a challenging machine learning application, namely learning a first-order Bayesian network model for an entire database. Model learning in this setting has to examine a large number of potential statistical associations across data tables. Our implementation shows how the SQL constructs in FactorBase facilitate the fast, modular, and reliable development of scalable model learning systems.
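To make the idea of storing a model "inside a database" concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not the FactorBase implementation: the toy schema (Student, Course, Registered), the table names bn_edges, grade_counts, and grade_cpt, and the helper functions are all hypothetical and chosen only for illustration. The sketch stores a small Bayesian network structure as an edge table, gathers counts across a join with GROUP BY, and derives a conditional probability table from those counts, all in SQL.

```python
import sqlite3


def build_demo_db(conn):
    """Create a toy two-entity, one-relationship database (hypothetical schema)."""
    conn.executescript("""
        CREATE TABLE Student(s_id INTEGER PRIMARY KEY, intelligence TEXT);
        CREATE TABLE Course(c_id INTEGER PRIMARY KEY, difficulty TEXT);
        CREATE TABLE Registered(s_id INTEGER, c_id INTEGER, grade TEXT);
        INSERT INTO Student VALUES (1, 'high'), (2, 'low'), (3, 'high');
        INSERT INTO Course VALUES (10, 'hard'), (20, 'easy');
        INSERT INTO Registered VALUES
            (1, 10, 'A'), (1, 20, 'A'), (2, 10, 'C'), (2, 20, 'B'), (3, 10, 'B');
    """)


def store_model(conn):
    """Store structure, counts, and parameters of a tiny model as ordinary tables."""
    conn.executescript("""
        -- Model structure: one row per directed edge (parent -> child).
        CREATE TABLE bn_edges(child TEXT, parent TEXT);
        INSERT INTO bn_edges VALUES
            ('grade', 'intelligence'), ('grade', 'difficulty');

        -- Counts (sufficient statistics) gathered across the relationship join.
        CREATE TABLE grade_counts AS
        SELECT s.intelligence AS intelligence,
               c.difficulty   AS difficulty,
               r.grade        AS grade,
               COUNT(*)       AS n
        FROM Registered r
        JOIN Student s ON r.s_id = s.s_id
        JOIN Course  c ON r.c_id = c.c_id
        GROUP BY s.intelligence, c.difficulty, r.grade;

        -- Conditional probabilities: each count divided by the total
        -- for its parent configuration (maximum likelihood, no smoothing).
        CREATE TABLE grade_cpt AS
        SELECT g.intelligence     AS intelligence,
               g.difficulty       AS difficulty,
               g.grade            AS grade,
               1.0 * g.n / t.total AS prob
        FROM grade_counts g
        JOIN (SELECT intelligence, difficulty, SUM(n) AS total
              FROM grade_counts
              GROUP BY intelligence, difficulty) t
          ON g.intelligence = t.intelligence
         AND g.difficulty = t.difficulty;
    """)


def cpt_lookup(conn, intelligence, difficulty, grade):
    """Return P(grade | intelligence, difficulty), or None if no such row exists."""
    row = conn.execute(
        "SELECT prob FROM grade_cpt "
        "WHERE intelligence = ? AND difficulty = ? AND grade = ?",
        (intelligence, difficulty, grade),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_demo_db(conn)
    store_model(conn)
    print(cpt_lookup(conn, "high", "hard", "A"))
```

Because the structure, counts, and parameters are ordinary tables, they can be queried, joined, and updated with the same SQL tools as the data itself, which is the sense in which a model is a first-class citizen of the database.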

Notes

A par-factor can also include constraints on possible groundings.
The schema assumes that all relationships are binary.
Essentially, the same concept is called a slot chain in PRM modeling [9].
www.grouplens.org, 1M version.
www.imdb.com, July 2013.

References

1. Chickering, D.: Optimal structure identification with greedy search. J. Mach. Learn. Res. 3, 507–554 (2003)
2. Contributors, A.S.P.: Apache Spark. http://spark.apache.org/. Accessed 9 Mar 2016
3. Deshpande, A., Madden, S.: MauveDB: supporting model-based user views in database systems. In: SIGMOD, pp. 73–84. ACM (2006)
4. Domingos, P., Lowd, D.: Markov Logic: An Interface Layer for Artificial Intelligence. Morgan and Claypool Publishers, San Rafael (2009)
5. Dzeroski, S., Lavrac, N. (eds.): Relational Data Mining. Springer, Berlin (2001)
6. Feng, X., Kumar, A., Recht, B., Ré, C.: Towards a unified architecture for in-RDBMS analytics. In: SIGMOD Conference, pp. 325–336 (2012)
7. Friedman, N., Getoor, L., Koller, D., Pfeffer, A.: Learning probabilistic relational models. In: IJCAI, pp. 1300–1309. Springer (1999)
8. Geiger, D., Heckerman, D.: Knowledge representation and inference in similarity networks and Bayesian multinets. Artif. Intell. 82(1–2), 45–74 (1996)
9. Getoor, L., Friedman, N., Koller, D., Pfeffer, A., Taskar, B.: Probabilistic relational models. In: Introduction to Statistical Relational Learning [10], chap. 5, pp. 129–173
10. Getoor, L., Taskar, B.: Introduction to Statistical Relational Learning. MIT Press, Cambridge (2007)
11. Getoor, L., Taskar, B., Koller, D.: Selectivity estimation using probabilistic models. ACM SIGMOD Rec. 30(2), 461–472 (2001)
12. Graefe, G., Fayyad, U.M., Chaudhuri, S.: On the efficient gathering of sufficient statistics for classification from large SQL databases. In: KDD, pp. 204–208 (1998)
13. Heckerman, D., Meek, C., Koller, D.: Probabilistic entity-relationship models, PRMs, and plate models. In: Getoor and Taskar [10]
14. Hellerstein, J.M., Ré, C., Schoppmann, F., Wang, D.Z., Fratkin, E., Gorajek, A., Ng, K.S., Welton, C., Feng, X., Li, K., Kumar, A.: The MADlib analytics library: Or MAD skills, the SQL. PVLDB 5(12), 1700–1711 (2012)
15. Jampani, R., Xu, F., Wu, M., Perez, L.L., Jermaine, C.M., Haas, P.J.: MCDB: a Monte Carlo approach to managing uncertain data. In: SIGMOD Conference, pp. 687–700 (2008)
16. Khosravi, H., Schulte, O., Man, T., Xu, X., Bina, B.: Structure learning for Markov logic networks with many descriptive attributes. In: AAAI, pp. 487–493 (2010)
17. Khot, T., Shavlik, J., Natarajan, S.: Boostr. http://pages.cs.wisc.edu/~tushar/Boostr/. Accessed 21 Nov 2012
18. Kimmig, A., Mihalkova, L., Getoor, L.: Lifted graphical models: a survey. Mach. Learn. 99(1), 1–45 (2015). https://doi.org/10.1007/s10994-014-5443-2
19. Kraska, T., Talwalkar, A., Duchi, J.C., Griffith, R., Franklin, M.J., Jordan, M.I.: MLbase: a distributed machine-learning system. In: CIDR (2013)
20. Lavrač, N., Perovšek, M., Vavpetič, A.: Propositionalization online. In: ECML, pp. 456–459. Springer (2014)
21. Lv, Q., Xia, X., Qian, P.: A fast calculation of metric scores for learning Bayesian network. Int. J. Autom. Comput. 9, 37–44 (2012)
22. Milch, B., Marthi, B., Russell, S.J., Sontag, D., Ong, D.L., Kolobov, A.: BLOG: probabilistic models with unknown objects. In: IJCAI, pp. 1352–1359 (2005)
23. Moore, A.W., Lee, M.S.: Cached sufficient statistics for efficient machine learning with large datasets. JAIR 8, 67–91 (1998)
24. Natarajan, S., Khot, T., Kersting, K., Gutmann, B., Shavlik, J.W.: Gradient-based boosting for statistical relational learning: the relational dependency network case. Mach. Learn. 86(1), 25–56 (2012)
25. Niu, F., Ré, C., Doan, A., Shavlik, J.W.: Tuffy: scaling up statistical inference in Markov logic networks using an RDBMS. PVLDB 4(6), 373–384 (2011)
26. Niu, F., Zhang, C., Ré, C., Shavlik, J.: Felix: Scaling Inference for Markov Logic with an Operator-Based Approach. ArXiv e-prints (2011)
27. Peralta, V.: Extraction and integration of MovieLens and IMDb data. Technical report, Laboratoire PRiSM (2007)
28. Poole, D.: First-order probabilistic inference. In: Gottlob, G., Walsh, T. (eds.) IJCAI-03, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, Acapulco, Mexico, August 9–15, 2003, pp. 985–991. Morgan Kaufmann (2003)
29. Popescul, A., Ungar, L.: Feature generation and selection in multi-relational learning. In: Introduction to Statistical Relational Learning [10], chap. 16, pp. 453–476
30. Qian, Z., Schulte, O.: The Bayes base system (2015). http://www.cs.sfu.ca/~oschulte/BayesBase/BayesBase.html. Accessed 6 May 2016
31. Qian, Z., Schulte, O., Sun, Y.: Computing multi-relational sufficient statistics for large databases. In: CIKM, pp. 1249–1258. ACM (2014)
32. Quakkelaar, R.: Exploiting relational database technology for statistical machine learning in factor base. Master thesis, Open Universiteit Nederland (2017)
33. Ramakrishnan, R., Gehrke, J.: Database Management Systems, 3rd edn. McGraw-Hill, New York (2003)
34. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice Hall, Upper Saddle River (2010)
35. Schulte, O., Khosravi, H.: Learning graphical models for relational data via lattice search. Mach. Learn. 88(3), 331–368 (2012)
36. Schulte, O., Luo, W., Greiner, R.: Mind-change optimal learning of Bayes net structure from dependency and independency data. Inf. Comput. 208, 63–82 (2010)
37. Schulte, O., Qian, Z.: Factorbase: SQL for learning a multi-relational graphical model. arXiv preprint (2015). arXiv:1508.02428
38. Singh, A.P., Gordon, G.J.: Relational learning via collective matrix factorization. In: SIGKDD, pp. 650–658. ACM (2008)
39. Singh, S., Graepel, T.: Automated probabilistic modeling for relational data. In: CIKM, pp. 1497–1500. ACM (2013)
40. Sun, Y., Han, J.: Mining Heterogeneous Information Networks: Principles and Methodologies, vol. 3. Morgan & Claypool Publishers, San Rafael (2012)
41. Walker, T., O’Reilly, C., Kunapuli, G., Natarajan, S., Maclin, R., Page, D., Shavlik, J.W.: Automating the ILP setup task: converting user advice about specific examples into general background knowledge. In: ILP, pp. 253–268 (2010)
42. Wang, D.Z., Michelakis, E., Garofalakis, M., Hellerstein, J.M.: BayesStore: managing large, uncertain data repositories with probabilistic graphical models. In: PVLDB, pp. 340–351 (2008)
43. Wick, M.L., McCallum, A., Miklau, G.: Scalable probabilistic databases with factor graphs and MCMC. In: PVLDB, pp. 794–804 (2010)
44. Wong, S.M., Butz, C.J., Xiang, Y.: A method for implementing a probabilistic model as a relational database. In: UAI, pp. 556–564 (1995)

Acknowledgements

This research was supported by a Discovery Grant to Oliver Schulte from the Natural Sciences and Engineering Research Council of Canada. Zhensong Qian was supported by a grant from the China Scholarship Council. We are indebted to the anonymous reviewers of the International Journal of Data Science and Analytics for helpful comments that substantially improved the presentation of the paper.

Author information

Authors and Affiliations

Simon Fraser University, Burnaby, Canada
Oliver Schulte & Zhensong Qian

Corresponding author

Correspondence to Zhensong Qian.

Ethics declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Appendix: The random variable database layout

We provide details about the Schema Analyzer. A complete SQL script that implements the Schema Analyzer is available [37]. Table 21 shows the relational schema of the Random Variable Database. Figure 9 shows dependencies between the tables of this schema.

Table 21 Schema for random variable database

About this article

Cite this article

Schulte, O., Qian, Z. FACTORBASE: multi-relational structure learning with SQL all the way. Int J Data Sci Anal 7, 289–309 (2019). https://doi.org/10.1007/s41060-018-0130-1

Received: 30 April 2017
Accepted: 19 May 2018
Published: 02 June 2018
Issue Date: 01 June 2019
DOI: https://doi.org/10.1007/s41060-018-0130-1
