The VLDB Journal, Volume 26, Issue 1, pp 81–105

Incremental knowledge base construction using DeepDive

  • Christopher De Sa
  • Alex Ratner
  • Christopher Ré
  • Jaeho Shin
  • Feiran Wang
  • Sen Wu
  • Ce Zhang
Special Issue Paper

Abstract

Populating a database with information from unstructured sources—also known as knowledge base construction (KBC)—is a long-standing problem in industry and research that encompasses problems of extraction, cleaning, and integration. In this work, we describe DeepDive, a system that combines database and machine learning ideas to help develop KBC systems, and we present techniques to make the KBC process more efficient. We observe that the KBC process is iterative, and we develop techniques to incrementally produce inference results for KBC systems. We propose two methods for incremental inference, based, respectively, on sampling and variational techniques. We also study the trade-off space of these methods and develop a simple rule-based optimizer. DeepDive includes all of these contributions, and we evaluate DeepDive on five KBC systems, showing that it can speed up KBC inference tasks by up to two orders of magnitude with negligible impact on quality.

Keywords

Knowledge base construction, Incremental, Performance

Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  • Christopher De Sa (1)
  • Alex Ratner (1)
  • Christopher Ré (1)
  • Jaeho Shin (1)
  • Feiran Wang (1)
  • Sen Wu (1)
  • Ce Zhang (1)

  1. Stanford University, Stanford, USA
