To Extend or Not to Extend? Context-Specific Corpus Enrichment

  • Felix Kuhr
  • Tanya Braun
  • Magnus Bender
  • Ralf Möller
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11919)


An agent pursuing a task may work with a corpus of documents with linked subjective content descriptions. Faced with a new document, the agent has to decide whether to include that document in its corpus. Basing the decision on words, topics, or entities alone has been shown not to yield balanced performance across varying documents. Therefore, this paper presents an approach that lets an agent decide whether a new document adds value to its existing corpus by combining texts and content descriptions. Furthermore, an agent can use the approach as a starting point for high-quality content descriptions for new documents. A case study shows the effectiveness of the approach given varying types of new documents.


Keywords: Subjective content description, Text mining



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Felix Kuhr (1)
  • Tanya Braun (1)
  • Magnus Bender (1)
  • Ralf Möller (1)
  1. Institute of Information Systems, University of Lübeck, Lübeck, Germany
