Abstract
Subject heading systems are knowledge organization tools that libraries have developed over many years. The Simple Knowledge Organization System (SKOS) provides a practical way to represent subject heading systems using the Resource Description Framework (RDF), and several libraries have taken the initiative to publish their subject heading systems as open linked data. Each subject heading describes a concept. In the majority of cases, however, a single subject heading is actually a combination of several concepts, such as a topic bounded by geographical and temporal scopes. In these cases, the label of the concept carries several sub-concepts that are not represented in structured form. Our work explores machine learning techniques to recognize the sub-concepts represented in the labels of SKOS subject headings. This paper describes a language-independent named entity recognition technique based on conditional random fields, a machine learning algorithm for sequence labelling. The technique was evaluated on a subset of the Library of Congress Subject Headings, where we measured the recognition of geographic concepts, topics, time periods and historical periods. It achieved an overall F1 score of 0.98.
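To make the sequence-labelling framing concrete, the following minimal sketch shows how a subject heading label might be split into tokens and given simple, language-independent features of the kind a conditional random field could consume. It is an illustration only, not the feature set used in the paper: the tokenizer pattern, the feature names, the function names `tokenize`, `token_features` and `featurize`, and the example heading are all assumptions.

```python
import re

# Hypothetical tokenizer: keeps the "--" subdivision separator as one token,
# then word/digit runs, then any other single punctuation character.
TOKEN_RE = re.compile(r"--|\w+|[^\w\s]")


def tokenize(label):
    """Split a subject heading label into a flat list of tokens."""
    return TOKEN_RE.findall(label)


def token_features(tokens, i):
    """Build surface features for token i (illustrative, not the paper's)."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_digit": tok.isdigit(),            # e.g. years in a time period
        "is_capitalized": tok[:1].isupper(),  # weak cue for proper names
        "is_separator": tok == "--",          # subdivision boundary
        "is_punct": not tok.isalnum(),
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }


def featurize(label):
    """Return the tokens and one feature dict per token."""
    tokens = tokenize(label)
    return tokens, [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    # Assumed example heading: a place, a topic and a dated historical period.
    tokens, feats = featurize("France--History--Revolution, 1789-1799")
    for tok, f in zip(tokens, feats):
        print(tok, f)
```

A sequence labeller would then assign one label per token (for instance place, topic, or time period), so that a single heading label yields several structured sub-concepts.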
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Freire, N., Borbinha, J., Calado, P. (2011). A Language Independent Approach for Named Entity Recognition in Subject Headings. In: Gradmann, S., Borri, F., Meghini, C., Schuldt, H. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2011. Lecture Notes in Computer Science, vol 6966. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24469-8_7
DOI: https://doi.org/10.1007/978-3-642-24469-8_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24468-1
Online ISBN: 978-3-642-24469-8
eBook Packages: Computer Science (R0)