Automatic Annotation of Bibliographical References for Descriptive Language Materials

Hammarström, Harald

doi:10.1007/978-3-642-23708-9_8

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6941)

Included in the following conference series:

International Conference of the Cross-Language Evaluation Forum for European Languages

Abstract

The present paper considers the problem of annotating bibliographical references with labels (classes), given training data of references already annotated with labels. The problem is an instance of document categorization where the documents are short and written in a wide variety of languages. The skewed distributions of title words and labels call for special care when choosing a Machine Learning approach. The present paper describes how to induce Disjunctive Normal Form formulae (DNFs), which have several advantages over Decision Trees. The approach is evaluated on a large real-world collection of bibliographical references.
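
As an illustration only (the abstract gives no algorithmic detail), the following minimal Python sketch shows what a DNF over title words can look like when used as a classifier: a disjunction of conjunctions, where each conjunction is a set of words that must all occur in the title. The label "grammar", the rule itself, the example titles, and the names `tokenize` and `matches` are hypothetical and are not taken from the paper. The sketch only applies a given DNF and does not attempt to show how the paper induces one.

```python
from typing import FrozenSet, List, Set

# A DNF is a disjunction (list) of conjunctions. Each conjunction is a set of
# title words that must all be present. Negated literals are left out here
# to keep the sketch short (an assumption, not a claim about the paper).
DNF = List[FrozenSet[str]]


def tokenize(title: str) -> Set[str]:
    """Lowercase a title and split it on whitespace."""
    return set(title.lower().split())


def matches(dnf: DNF, title: str) -> bool:
    """Return True if at least one conjunction has all its words in the title."""
    words = tokenize(title)
    return any(conjunction <= words for conjunction in dnf)


# Hypothetical rule for a hypothetical label "grammar":
# (grammar) OR (grammaire) OR (description AND language)
grammar_rule: DNF = [
    frozenset({"grammar"}),
    frozenset({"grammaire"}),
    frozenset({"description", "language"}),
]

# Made-up example titles, not entries from the evaluated collection.
for title in [
    "A Grammar of Examplish",
    "Description of the Examplish language",
    "Dictionary of Examplish",
]:
    print(title, "->", matches(grammar_rule, title))
```

Because each conjunction can be read on its own, a rule of this form can be inspected and edited one clause at a time. Words from several languages (here "grammar" and "grammaire") can also sit side by side in the same formula.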

Author information

Authors and Affiliations

Department of Linguistics, Max Planck Institute for Evolutionary Anthropology, Deutscher Platz 6, D-04 150, Leipzig, Germany
Harald Hammarström

Editor information

Editors and Affiliations

Center for the Evaluation of Language and Communication Technologies (CELCT), Via alla Casata 56/c, 38123, Povo, Italy
Pamela Forner
National University of Distance Education, E.T.S.I. Informática de la UNED, c/Juan del Rosal 16, 28040, Madrid, Spain
Julio Gonzalo
School of Information Sciences, University of Tampere, Kanslerinrinne 1, 33014, Tampere, Finland
Jaana Kekäläinen
Yahoo! Research, Avinguda Diagonal 177, 8th Floor, 08018, Barcelona, Spain
Mounia Lalmas
Intelligent Systems Laboratory, University of Amsterdam, Science Park 107, 1098 XG, Amsterdam, The Netherlands
Maarten de Rijke

Cite this paper

Hammarström, H. (2011). Automatic Annotation of Bibliographical References for Descriptive Language Materials. In: Forner, P., Gonzalo, J., Kekäläinen, J., Lalmas, M., de Rijke, M. (eds) Multilingual and Multimodal Information Access Evaluation. CLEF 2011. Lecture Notes in Computer Science, vol 6941. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23708-9_8

DOI: https://doi.org/10.1007/978-3-642-23708-9_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23707-2
Online ISBN: 978-3-642-23708-9
eBook Packages: Computer Science, Computer Science (R0)
