The VLDB Journal, Volume 25, Issue 6, pp 893–918

ScaLeKB: scalable learning and inference over large knowledge bases

Regular Paper


Recent years have seen a drastic rise in the construction of web knowledge bases (e.g., Freebase, YAGO, DBpedia). These knowledge bases store structured information about real-world people, places, organizations, etc. However, due to the limitations of human knowledge, web corpora, and information extraction algorithms, the knowledge bases are still far from complete. To infer the missing knowledge, we propose the Ontological Pathfinding (OP) algorithm to mine first-order inference rules from these web knowledge bases. The OP algorithm scales up via a series of optimization techniques, including a new parallel rule mining algorithm, a pruning strategy that eliminates unsound and inefficient rules before applying them, and a novel partitioning algorithm that breaks the learning task into smaller independent sub-tasks. Combining these techniques, we develop the first rule mining system that scales to Freebase, the largest public knowledge base with 112 million entities and 388 million facts. We mine 36,625 inference rules in 34 h; no existing system achieves this scale.

Based on the mining algorithm and the optimizations, we develop an efficient inference engine. As a result, we infer 0.9 billion new facts from Freebase in 17.19 h. We use cross validation to evaluate the inferred facts and estimate a degree of expansion of 0.6 over Freebase, with a precision approaching 1.0. Our approach outperforms state-of-the-art mining algorithms and inference engines in terms of both performance and quality.
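To make the idea of a first-order inference rule concrete, the following is a minimal sketch of applying one chain-shaped rule, nationality(x, y) ← bornIn(x, z), cityOf(z, y), to a toy set of facts and scoring it with a plain support and confidence measure. It is an illustration only, not the OP algorithm or its inference engine: the facts, the predicate names, the helper functions `apply_chain_rule` and `score_rule`, and the choice of confidence measure are all assumptions made for this example.

```python
from collections import defaultdict

# Toy knowledge base of (predicate, subject, object) facts.
# All entities and predicates here are hypothetical.
FACTS = {
    ("bornIn", "alice", "paris"),
    ("bornIn", "bob", "lyon"),
    ("bornIn", "carol", "rome"),
    ("cityOf", "paris", "france"),
    ("cityOf", "lyon", "france"),
    ("cityOf", "rome", "italy"),
    ("nationality", "alice", "france"),
    ("nationality", "carol", "spain"),
}


def apply_chain_rule(facts, head, body1, body2):
    """Derive head(x, y) for every x, z, y with body1(x, z) and body2(z, y)."""
    # Index the second body atom by its subject so the join is a lookup.
    by_subject = defaultdict(set)
    for pred, subj, obj in facts:
        if pred == body2:
            by_subject[subj].add(obj)

    derived = set()
    for pred, subj, obj in facts:
        if pred == body1:
            for y in by_subject.get(obj, ()):
                derived.add((head, subj, y))
    return derived


def score_rule(facts, derived):
    """Return (support, confidence, new_facts) for a set of derived facts.

    support    = derived facts that already appear in the knowledge base
    confidence = support / number of derived facts
    new_facts  = derived facts that are missing from the knowledge base
    """
    support = len(derived & facts)
    confidence = support / len(derived) if derived else 0.0
    return support, confidence, derived - facts


derived = apply_chain_rule(FACTS, "nationality", "bornIn", "cityOf")
support, confidence, new_facts = score_rule(FACTS, derived)
print(support, round(confidence, 3), sorted(new_facts))
```

In this toy setting, a mining step would keep or discard the rule based on such scores, and an inference step would add `new_facts` to the knowledge base. At the scale described above, the join itself is the expensive part, which is presumably where parallelism, pruning, and partitioning come in.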


Keywords: Knowledge bases, Databases, Rule mining, Probabilistic reasoning



This work was partially supported by NSF IIS Award #1526753, DARPA under FA8750-12-2-0348-2 (DEFT/CUBISM), and a generous gift from Google. We also thank Dr. Milenko Petrovic and Dr. Alin Dobra for helpful discussions on query optimization.



Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  1. Department of Computer and Information Science and Engineering, University of Florida, Gainesville, USA
