DAVE: Extracting Domain Attributes and Values from Text Corpus

  • Yongxin Shen
  • Zhixu Li
  • Wenling Zhang
  • An Liu
  • Xiaofang Zhou
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10987)


Open Information Extraction (OpenIE) has been studied extensively, with the aim of extracting structured information from free text. In this paper, we work on a novel OpenIE problem defined as Domain-specified Attribute-Value Extraction: given a text corpus and a domain Knowledge Base (KB) with a number of domain attributes and corresponding attribute values, the task is to extend the KB by identifying more domain attributes and attribute values from the text corpus. Existing solutions adapted from other OpenIE problems rely heavily on either deep linguistic parsing or the identification of effective lexical patterns. However, linguistic parsing does not always work well, especially on short texts, while learned lexical patterns are too strict to reach a high extraction recall. We propose an effective graph-based iterative extraction approach based on the co-occurrence of attribute terms and attribute value terms in the same sentences. Our experiments on two large real-world data collections demonstrate that our method outperforms state-of-the-art approaches, reaching 10% higher extraction precision and recall.
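To make the idea of sentence-level co-occurrence concrete, the following is a minimal Python sketch of how a co-occurrence graph between attribute terms and attribute value terms might be built. It is only an illustration and not the method proposed in the paper: the abstract does not specify the graph construction or the iterative procedure, so the function name `build_cooccurrence_graph`, the whitespace tokenization, and the edge weighting by sentence count are all assumptions made here.

```python
from collections import defaultdict
from itertools import product


def build_cooccurrence_graph(sentences, attribute_terms, value_terms):
    """Count sentence-level co-occurrences of attribute and value terms.

    Hypothetical helper (not from the paper). Returns a dict that maps
    each (attribute, value) pair to the number of sentences in which
    both terms appear.
    """
    graph = defaultdict(int)
    for sentence in sentences:
        # Assumption: naive lowercase whitespace tokenization.
        tokens = set(sentence.lower().split())
        attrs = tokens & attribute_terms
        values = tokens & value_terms
        # Each attribute/value pair seen together in this sentence
        # strengthens the corresponding edge by one.
        for attr, value in product(attrs, values):
            graph[(attr, value)] += 1
    return dict(graph)


if __name__ == "__main__":
    corpus = [
        "The phone color is black",
        "Screen size is 5 inch",
        "color black looks good",
    ]
    edges = build_cooccurrence_graph(corpus, {"color", "size"}, {"black", "5"})
    for (attr, value), weight in sorted(edges.items()):
        print(attr, value, weight)
```

In a sketch like this, edge weights could serve as evidence for linking a candidate value to an attribute; how the paper scores and iterates over such evidence is not stated in the abstract.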



This research is partially supported by the National Natural Science Foundation of China (Grant Nos. 61632016, 61572336, 61472263, 61232006), the Postdoctoral Scientific Research Funding of Jiangsu Province (No. 1501090B), the National Postdoctoral Funding (Nos. 2015M581859, 2016T90493) and the Natural Science Research Project of Jiangsu Higher Education Institution (No. 17KJA520003).



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Yongxin Shen (1)
  • Zhixu Li (1)
  • Wenling Zhang (1)
  • An Liu (1)
  • Xiaofang Zhou (1, 2)

  1. School of Computer Science and Technology, Soochow University, Suzhou, China
  2. The University of Queensland, Brisbane, Australia
