Explore and Exploit. Dictionary Expansion with Human-in-the-Loop

  • Anna Lisa Gentile
  • Daniel Gruhl
  • Petar Ristoski
  • Steve Welch
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11503)


Many knowledge extraction systems rely on semantic resources (dictionaries, ontologies, lexical resources) to extract information from unstructured text. A key to successful information extraction is to treat such resources as evolving artifacts and keep them up to date. In this paper, we tackle the problem of dictionary expansion and propose a human-in-the-loop approach: we couple neural language models with tight human supervision to assist the user in building and maintaining domain-specific dictionaries. The approach works on any given input text corpus and is based on the explore and exploit paradigm: starting from a few seeds (or an existing dictionary), it discovers new instances in the text corpus (explore) and uses the current dictionary entries to predict new potential instances that are not in the corpus, i.e. “unseen” (exploit). We evaluate our approach on five real-world dictionaries, achieving high accuracy with a rapid expansion rate.
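
To make the explore step more concrete, the following is a minimal sketch and not the method of the paper: the abstract does not specify the models or the ranking function. It assumes each corpus term already has an embedding vector and ranks unreviewed terms by cosine similarity to the centroid of the current dictionary, asking a reviewer to accept or reject the top candidates in each round. The names `explore`, `accept`, `top_k`, `max_rounds`, and the toy embeddings are all hypothetical. The exploit step (predicting terms that do not occur in the corpus) is not covered.

```python
import math
from typing import Callable, Dict, List, Set

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def explore(
    seeds: Set[str],
    embeddings: Dict[str, Vector],
    accept: Callable[[str], bool],
    top_k: int = 2,
    max_rounds: int = 5,
) -> Set[str]:
    """Grow a dictionary from seeds under human supervision (assumed scheme)."""
    # Only seeds that occur in the corpus vocabulary can anchor the search.
    dictionary = {t for t in seeds if t in embeddings}
    rejected: Set[str] = set()
    for _ in range(max_rounds):
        if not dictionary:
            break
        center = centroid([embeddings[t] for t in dictionary])
        # Candidates are corpus terms the reviewer has not judged yet.
        candidates = [
            t for t in embeddings if t not in dictionary and t not in rejected
        ]
        candidates.sort(key=lambda t: cosine(embeddings[t], center), reverse=True)
        added = False
        for term in candidates[:top_k]:
            if accept(term):  # the human-in-the-loop decision
                dictionary.add(term)
                added = True
            else:
                rejected.add(term)
        if not added:  # stop once a round yields nothing new
            break
    return dictionary


# Toy, hand-made 2-d embeddings (illustrative values, not real model output).
TOY_EMBEDDINGS = {
    "aspirin": [0.9, 0.1],
    "ibuprofen": [0.8, 0.2],
    "paracetamol": [0.85, 0.15],
    "headache": [0.2, 0.9],
    "nausea": [0.1, 0.8],
}
DRUGS = {"aspirin", "ibuprofen", "paracetamol"}

# The lambda stands in for a human reviewer who accepts only drug names.
result = explore({"aspirin"}, TOY_EMBEDDINGS, accept=lambda t: t in DRUGS)
print(sorted(result))  # ['aspirin', 'ibuprofen', 'paracetamol']
```

In this sketch the reviewer's rejections are remembered so that the same candidate is never shown twice, and the centroid is recomputed after every round so that accepted terms steer the next ranking.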



Copyright information

© Springer Nature Switzerland AG 2019

Open Access. This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Authors and Affiliations

  • Anna Lisa Gentile (1)
  • Daniel Gruhl (1)
  • Petar Ristoski (1, corresponding author)
  • Steve Welch (1)

  1. IBM Research Almaden, San Jose, USA
