Explore and Exploit. Dictionary Expansion with Human-in-the-Loop

Gentile, Anna Lisa; Gruhl, Daniel; Ristoski, Petar; Welch, Steve

doi:10.1007/978-3-030-21348-0_9

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11503)

Included in the following conference series:

European Semantic Web Conference

Abstract

Many knowledge extraction systems rely on semantic resources (dictionaries, ontologies, lexical resources) to extract information from unstructured text. A key to successful information extraction is to treat such resources as evolving artifacts and keep them up to date. In this paper, we tackle the problem of dictionary expansion and propose a human-in-the-loop approach: we couple neural language models with tight human supervision to assist the user in building and maintaining domain-specific dictionaries. The approach works on any given input text corpus and is based on the explore-and-exploit paradigm: starting from a few seeds (or an existing dictionary), it discovers new instances in the text corpus (explore) and uses the current dictionary entries to predict new potential instances that are not in the corpus, i.e. “unseen” (exploit). We evaluate our approach on five real-world dictionaries, achieving high accuracy with a rapid expansion rate.
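
The explore and exploit loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the paper uses neural language models, whereas the sketch below substitutes simple co-occurrence counts for the explore step and token recombination of two-word entries for the exploit step. All function names (`cooccurrence_vectors`, `explore`, `exploit`, `expand`), the `accept` callback standing in for the human reviewer, and the example corpus are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Build a sparse co-occurrence vector (a Counter) for every token.
    Assumption: a stand-in for the neural embeddings used in the paper."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            vectors.setdefault(token, Counter()).update(context)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def explore(vectors, dictionary, top_k=3):
    """Explore: rank corpus terms not yet in the dictionary by their
    similarity to the centroid of the current dictionary entries."""
    centroid = Counter()
    for term in dictionary:
        centroid.update(vectors.get(term, {}))
    scored = [(cosine(vec, centroid), term)
              for term, vec in vectors.items() if term not in dictionary]
    return [term for score, term in sorted(scored, reverse=True)[:top_k] if score > 0]


def exploit(dictionary):
    """Exploit: propose phrases that need not occur in the corpus by
    recombining the tokens of existing two-word entries (an assumed heuristic)."""
    pairs = [entry.split() for entry in dictionary if len(entry.split()) == 2]
    heads = {p[0] for p in pairs}
    tails = {p[1] for p in pairs}
    return sorted(f"{h} {t}" for h, t in product(heads, tails)
                  if f"{h} {t}" not in dictionary)


def expand(sentences, seeds, accept, rounds=2):
    """Alternate explore and exploit; `accept` plays the human reviewer."""
    dictionary = set(seeds)
    vectors = cooccurrence_vectors(sentences)
    for _ in range(rounds):
        candidates = explore(vectors, dictionary) + exploit(dictionary)
        accepted = {c for c in candidates if accept(c)}
        if not accepted:
            break  # nothing new was approved, so stop early
        dictionary |= accepted
    return dictionary


# Hypothetical corpus, seeds, and reviewer decisions.
corpus = [
    "patients report headache after the dose",
    "patients report nausea after the dose",
    "patients report dizziness after the dose",
]
seeds = {"headache", "nausea", "severe headache", "mild nausea"}
approved = {"dizziness", "severe nausea", "mild headache"}
result = expand(corpus, seeds, accept=lambda c: c in approved)
```

In this toy run the explore step surfaces "dizziness" alongside noisy candidates such as "dose", which the simulated reviewer rejects, while the exploit step proposes "severe nausea" and "mild headache" even though neither phrase appears in the corpus. Routing every candidate through `accept` reflects the tight human supervision the abstract describes.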

Notes

1. A forum where patients report their experience with medications.
2. https://twitter.com/
3. https://www.tensorflow.org/

Author information

Authors and Affiliations

IBM Research Almaden, San Jose, CA, USA
Anna Lisa Gentile, Daniel Gruhl, Petar Ristoski & Steve Welch

Corresponding author

Correspondence to Petar Ristoski.

Editor information

Editors and Affiliations

Wright State University, Dayton, OH, USA
Pascal Hitzler
KMi, The Open University, Milton Keynes, UK
Miriam Fernández
University of California, Santa Barbara, CA, USA
Krzysztof Janowicz
Maastricht University, Maastricht, The Netherlands
Amrapali Zaveri
Heriot-Watt University, Edinburgh, UK
Alasdair J.G. Gray
IBM Research, Dublin, Ireland
Vanessa Lopez
The Australian National University, Canberra, ACT, Australia
Armin Haller
Jönköping University, Jönköping, Sweden
Karl Hammar

About this paper

Cite this paper

Gentile, A.L., Gruhl, D., Ristoski, P., Welch, S. (2019). Explore and Exploit. Dictionary Expansion with Human-in-the-Loop. In: Hitzler, P., et al. The Semantic Web. ESWC 2019. Lecture Notes in Computer Science, vol 11503. Springer, Cham. https://doi.org/10.1007/978-3-030-21348-0_9

DOI: https://doi.org/10.1007/978-3-030-21348-0_9
Published: 25 May 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-21347-3
Online ISBN: 978-3-030-21348-0
eBook Packages: Computer Science, Computer Science (R0)
