Abstract
Most technical and scientific terms are complex, multi-word noun phrases, but certainly not all noun phrases are technical or scientific terms. Specific terminology can be distinguished from common, non-specific noun phrases on the basis of the observation that terms exhibit a much lower degree of distributional variation than non-specific noun phrases. We formalize this limited paradigmatic modifiability of terms and then test the corresponding algorithm on bigram, trigram and quadgram noun phrases extracted from a 104-million-word biomedical text corpus. Using an existing, community-curated biomedical terminology as the evaluation gold standard, we show that our algorithm significantly outperforms standard term identification measures and therefore qualifies as a high-performance building block for any terminology identification system.
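The abstract does not state the formula behind limited paradigmatic modifiability, so the following is only an illustrative sketch of the general idea, not the authors' definition. It assumes a table of n-gram frequencies and scores an n-gram by how rarely its word slots are filled by other words: for every non-empty choice of slots, it divides the n-gram's frequency by the total frequency of all n-grams that agree with it outside those slots, then multiplies the ratios. The function name `modifiability_score` and the toy counts are hypothetical.

```python
from collections import Counter
from itertools import combinations


def modifiability_score(ngram, counts):
    """Illustrative score in [0, 1]; higher means the slots vary less.

    For each non-empty subset of slot positions, compute
    freq(ngram) / freq(all n-grams matching ngram outside that subset)
    and multiply the ratios together. This is an assumed formulation,
    not the formula from the paper.
    """
    n = len(ngram)
    freq = counts[ngram]
    if freq == 0:
        return 0.0
    score = 1.0
    for k in range(1, n + 1):
        for slots in combinations(range(n), k):
            # Positions that stay fixed while the chosen slots vary freely.
            fixed = [i for i in range(n) if i not in slots]
            total = sum(
                c
                for g, c in counts.items()
                if len(g) == n and all(g[i] == ngram[i] for i in fixed)
            )
            score *= freq / total
    return score


# Toy bigram frequencies (invented for illustration only).
sample_counts = Counter({
    ("gene", "expression"): 9,
    ("gene", "product"): 1,
    ("new", "result"): 2,
    ("new", "method"): 4,
    ("recent", "result"): 4,
})

term_like = modifiability_score(("gene", "expression"), sample_counts)
generic = modifiability_score(("new", "result"), sample_counts)
print(term_like > generic)
```

In this toy data the slots of "gene expression" are rarely filled by other words, so it scores higher than "new result", whose slots vary freely. A real system would compute the counts from noun phrases extracted from a corpus and rank candidates by score.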
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wermter, J., Hahn, U. (2005). Massive Biomedical Term Discovery. In: Hoffmann, A., Motoda, H., Scheffer, T. (eds) Discovery Science. DS 2005. Lecture Notes in Computer Science, vol 3735. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11563983_24
DOI: https://doi.org/10.1007/11563983_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29230-2
Online ISBN: 978-3-540-31698-5
eBook Packages: Computer Science (R0)