Abstract
This paper considers the problem of annotating bibliographical references with labels (classes), given training data of references that are already labelled. The problem is an instance of document categorization in which the documents are short and written in a wide variety of languages. The skewed distributions of title words and labels call for special care when choosing a Machine Learning approach. The paper describes how to induce Disjunctive Normal Form formulae (DNFs), which have several advantages over Decision Trees. The approach is evaluated on a large real-world collection of bibliographical references.
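To make the idea of a DNF over title words concrete, the following is a minimal sketch, not the paper's algorithm. The abstract does not specify the induction procedure, so a generic greedy sequential-covering heuristic is used as a stand-in. The names `matches`, `induce_dnf`, `max_term_len` and the toy titles are all assumptions made for illustration. A term is a set of words that must all occur in a title, and a DNF is a list of such terms, any one of which suffices.

```python
from typing import FrozenSet, List, Set, Tuple

Term = FrozenSet[str]   # conjunction: every word must occur in the title
DNF = List[Term]        # disjunction: any one term may match


def matches(dnf: DNF, words: Set[str]) -> bool:
    """A title satisfies the DNF if it contains every word of some term."""
    return any(term <= words for term in dnf)


def induce_dnf(examples: List[Tuple[Set[str], bool]], max_term_len: int = 3) -> DNF:
    """Greedy sequential covering (illustrative stand-in, not the paper's method)."""
    negatives = [w for w, y in examples if not y]
    uncovered = [w for w, y in examples if y]
    dnf: DNF = []
    while uncovered:
        term: Set[str] = set()
        pos, neg = list(uncovered), list(negatives)
        # Grow one conjunction until it no longer covers any negative title.
        while neg and len(term) < max_term_len:
            candidates = sorted({w for t in pos for w in t} - term)
            if not candidates:
                break

            def score(word: str) -> Tuple[float, int]:
                p = sum(word in t for t in pos)
                n = sum(word in t for t in neg)
                return (p / (p + n), p)  # precision first, then coverage

            best = max(candidates, key=score)
            term.add(best)
            pos = [t for t in pos if best in t]
            neg = [t for t in neg if best in t]
        if neg or not term:
            break  # no pure term found; stop rather than overgeneralise
        dnf.append(frozenset(term))
        uncovered = [t for t in uncovered if not term <= t]
    return dnf


# Toy training data (hypothetical): title word sets, labelled "is a grammar".
examples = [
    ({"swahili", "grammar"}, True),
    ({"wolof", "grammar"}, True),
    ({"grammaire", "wolof"}, True),
    ({"swahili", "dictionary"}, False),
    ({"wolof", "phonology"}, False),
]
dnf = induce_dnf(examples)
```

On this toy data the induced formula reads "grammar OR grammaire", so one readable rule covers the same label across two languages. A decision tree would encode the same information as nested tests along a path.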
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hammarström, H. (2011). Automatic Annotation of Bibliographical References for Descriptive Language Materials. In: Forner, P., Gonzalo, J., Kekäläinen, J., Lalmas, M., de Rijke, M. (eds) Multilingual and Multimodal Information Access Evaluation. CLEF 2011. Lecture Notes in Computer Science, vol 6941. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23708-9_8
DOI: https://doi.org/10.1007/978-3-642-23708-9_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23707-2
Online ISBN: 978-3-642-23708-9
eBook Packages: Computer Science (R0)