A Distributional Semantics Approach to Simultaneous Recognition of Multiple Classes of Named Entities

Jonnalagadda, Siddhartha; Leaman, Robert; Cohen, Trevor; Gonzalez, Graciela

doi:10.1007/978-3-642-12116-6_19

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6008)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Named Entity Recognition and Classification has been studied for the last two decades. Because semantic features take a huge amount of training time and are slow at inference, existing tools apply features and rules mainly at the word level or use lexicons. Recent advances in distributional semantics allow us to efficiently create paradigmatic models that encode word order. We used Sahlgren et al.’s permutation-based variant of the Random Indexing model to create a scalable and efficient system that simultaneously recognizes multiple entity classes mentioned in natural language. The system is validated on the GENIA corpus, which has annotations for 46 biomedical entity classes and supports nested entities. Using distributional semantics features only, it achieves an overall micro-averaged F-measure of 67.3% based on fragment matching, with performance ranging from 7.4% for “DNA substructure” to 80.7% for “Bioentity”.
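
The permutation-based Random Indexing idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' system: the function names, the vector dimension, the number of nonzero entries, the window size, and the use of a cyclic shift (`np.roll`) as the permutation are all assumptions chosen for illustration. Each word gets a sparse random index vector, and a word's order-aware vector is the sum of its neighbours' index vectors, each shifted by the neighbour's relative position.

```python
import numpy as np


def index_vectors(vocab, dim=512, nonzero=8, seed=0):
    """Assign each word a sparse ternary random vector (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    vectors = {}
    for word in vocab:
        v = np.zeros(dim)
        positions = rng.choice(dim, size=nonzero, replace=False)
        v[positions[: nonzero // 2]] = 1.0   # half of the entries are +1
        v[positions[nonzero // 2:]] = -1.0   # the other half are -1
        vectors[word] = v
    return vectors


def train_order_vectors(sentences, dim=512, window=2, seed=0):
    """Accumulate position-permuted index vectors of neighbouring words."""
    vocab = sorted({w for sent in sentences for w in sent})
    index = index_vectors(vocab, dim=dim, seed=seed)
    context = {w: np.zeros(dim) for w in vocab}
    for sent in sentences:
        for i, focus in enumerate(sent):
            for offset in range(-window, window + 1):
                j = i + offset
                if offset == 0 or j < 0 or j >= len(sent):
                    continue
                # Assumption: a cyclic shift by the relative offset stands in
                # for the permutation, so "left of" and "right of" differ.
                context[focus] += np.roll(index[sent[j]], offset)
    return context


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# Toy corpus (invented for illustration, not taken from GENIA).
sentences = [
    ["the", "il2", "gene", "is", "expressed"],
    ["the", "gata1", "gene", "is", "expressed"],
    ["cells", "express", "the", "receptor"],
]
vectors = train_order_vectors(sentences)
print(cosine(vectors["il2"], vectors["gata1"]))
print(cosine(vectors["il2"], vectors["receptor"]))
```

In this toy corpus, "il2" and "gata1" occur in identical ordered contexts, so their vectors coincide, while "receptor" shares only part of that context. Words that fill the same slot in this way are what a paradigmatic model is meant to group together.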

References

Byrne, K.: Nested Named Entity Recognition in Historical Archive Text. In: Proceedings of International Conference on Semantic Computing (2007)
Cohen, T., Widdows, D.: Empirical distributional semantics: methods and biomedical applications. Journal of Biomedical Informatics 42 (2009)
Clark, A.: Inducing Syntactic Categories by Context Distribution Clustering. In: Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning (2000)
David, B., Lloyd, T.: Numerical linear algebra. Society for Industrial and Applied Mathematics, Philadelphia (1997)
Dietterich, T.G.: Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Comput. 10 (1998)
Eddy, S.R.: Hidden Markov Models. Curr. Opin. Struct. Biol. 6 (1996)
Finkel, J.R., Manning, C.D.: Joint Parsing and Named Entity Recognition. In: Proceedings of NAACL HLT (2009)
Finkel, J.R., Manning, C.D.: Nested Named Entity Recognition. In: EMNLP (2009)
Fox, C.: A Stop List for General Text. ACM SIGIR Forum 24 (199)
Gu, B.: Recognizing Nested Named Entities in GENIA Corpus. In: Proceedings of the Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis (2006)
Harris, Z.S.: The structure of science information. Journal of Biomedical Informatics 35 (2002)
Johnson, W.B., Lindenstrauss, J.: Extensions of Lipschitz Mappings into a Hilbert Space. Contemporary Mathematics 26 (1984)
Jones, M.N., Mewhort, D.J.K.: Representing Word Meaning and Order Information in a Composite Holographic Lexicon. Psychol. Rev. 114 (2007)
Kanerva, P., Kristofersson, J., Holst, A.: Random Indexing of Text Samples for Latent Semantic Analysis. In: Proceedings of the 22nd Annual Conference of the Cognitive Science Society (2000)
Kim, J.D., Ohta, T., Tateisi, Y., et al.: GENIA Corpus-a Semantically Annotated Corpus for Bio-Textmining. Bioinformatics-Oxford 19 (2003)
Kim, J.D., Ohta, T., Tsujii, J.: Corpus Annotation for Mining Biomedical Events from Literature. BMC Bioinformatics 9 (2008)
Lafferty, J., McCallum, A., Pereira, F.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: Proceedings of ICML (2001)
Landauer, T.K., Dumais, S.T.: A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychol. Rev. 104, 211–240 (1997)
Leaman, R., Gonzalez, G.: BANNER: An Executable Survey of Advances in Biomedical Named Entity Recognition. In: Proceedings of PSB (2008)
Lund, K., Burgess, C.: Hyperspace Analog to Language (HAL): A General Model of Semantic Representation. Language and Cognitive Processes (1996)
Màrquez, L., Villarejo, L., Martí, M.A., et al.: Semeval-2007 Task 09: Multilevel Semantic Annotation of Catalan and Spanish. In: Proceedings of the 4th International Workshop on Semantic Evaluations (2007)
McCallum, A., Li, W.: Early Results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. In: Proceedings of CoNLL (2003)
McDonald, R., Fernando, P.: Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics (2005)
Rau, L., Res, G., Center, D., et al.: Extracting Company Names from Text. In: Proceedings of IEEE Conference on Artificial Intelligence Applications (1991)
Sahlgren, M., Holst, A., Kanerva, P.: Permutations as a Means to Encode Order in Word Space. In: Proceedings of CogSci. (2008)
Sahlgren, M.: The Word-Space Model. Doctoral Dissertation in Computational Linguistics. Stockholm University (2006)
Saussure, F., Bally, C., Séchehaye, A., et al.: Cours de linguistique générale. Payot, Paris (1922)
Schütze, H.: Automatic Word Sense Discrimination. Computational Linguistics 24, 97–123 (1998)
Settles, B.: ABNER: An Open Source Tool for Automatically Tagging Genes, Proteins and Other Entity Names in Text. Bioinformatics 21 (2005)
Shen, D., Zhang, J., Zhou, G., et al.: Effective Adaptation of a Hidden Markov Model-Based Named Entity Recognizer for Biomedical Domain. In: Proceedings of ACL (2003)
Song, Y., Kim, E., Lee, G.G., et al.: POSBIOTM-NER in the Shared Task of BioNLP/NLPBA 2004. In: Proceedings of IJNLPBA (2004)
Tsai, R.T., Wu, S.H., Chou, W.C., et al.: Various Criteria in the Evaluation of Biomedical Named Entity Recognition. BMC Bioinformatics 7 (2006)
Widdows, D., Ferraro, K.: Semantic Vectors: A Scalable Open Source Package and Online Technology Management Application. In: Proceedings of LREC (2008)
Widdows, D., Cohen, T.: Semantic Vector Combinations and the Synoptic Gospels. In: Proceedings of the Third Quantum Interaction Symposium (2009)
Zhou, G., Zhang, J., Su, J., et al.: Recognizing Names in Biomedical Texts: A Machine Learning Approach. Bioinformatics 20 (2004)
Zhou, G.D.: Recognizing Names in Biomedical Texts using Mutual Information Independence Model and SVM Plus Sigmoid. Int. J. Med. Inf. 75 (2006)

Author information

Authors and Affiliations

Arizona State University, USA
Siddhartha Jonnalagadda, Robert Leaman & Graciela Gonzalez
The University of Texas Health Science Center at Houston, USA
Trevor Cohen

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, 07738, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Jonnalagadda, S., Leaman, R., Cohen, T., Gonzalez, G. (2010). A Distributional Semantics Approach to Simultaneous Recognition of Multiple Classes of Named Entities. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2010. Lecture Notes in Computer Science, vol 6008. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12116-6_19

DOI: https://doi.org/10.1007/978-3-642-12116-6_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12115-9
Online ISBN: 978-3-642-12116-6
eBook Packages: Computer Science, Computer Science (R0)
