Abstract
We present an approach to calculating the semantic similarity of documents written in the same or in different languages. Document contents are represented in a language-independent way, using the descriptor terms of the multilingual thesaurus EUROVOC, and similarity is then computed as the distance between these representations. While EUROVOC is a carefully handcrafted knowledge structure, the procedure itself relies on statistical techniques. The method was applied to a collection of 5990 English and Spanish parallel texts and evaluated by counting how often the translation of a given document was identified as the most similar document. The results demonstrate the feasibility and usefulness of the approach.
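The idea of comparing language-independent representations, and the evaluation by translation retrieval, can be illustrated with a minimal sketch. It makes several assumptions not stated in the abstract: each document is taken to be already mapped to a weighted set of descriptors, the measure is cosine similarity (the abstract only speaks of a "distance"), and the descriptor labels, weights, and function names are hypothetical toy values rather than the authors' data or implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse descriptor vectors (dicts)."""
    dot = sum(weight * b[desc] for desc, weight in a.items() if desc in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty representation is similar to nothing
    return dot / (norm_a * norm_b)

def most_similar(query, candidates):
    """Return the id of the candidate whose vector is closest to `query`."""
    return max(candidates,
               key=lambda doc_id: cosine_similarity(query, candidates[doc_id]))

def translation_hit_rate(source_docs, target_docs):
    """Fraction of source documents whose translation (same id in
    `target_docs`) is ranked as the most similar target document."""
    hits = sum(1 for doc_id, vec in source_docs.items()
               if most_similar(vec, target_docs) == doc_id)
    return hits / len(source_docs)

# Hypothetical toy data: descriptor labels stand in for thesaurus
# descriptors; the weights are invented for illustration only.
english = {
    "doc1": {"fisheries": 0.9, "trade policy": 0.2},
    "doc2": {"public health": 0.8, "research": 0.4},
}
spanish = {
    "doc1": {"fisheries": 0.7, "trade policy": 0.3},
    "doc2": {"public health": 0.6, "research": 0.5, "fisheries": 0.1},
}

print(translation_hit_rate(english, spanish))
```

Because both language versions are expressed over the same descriptor inventory, no translation step is needed at comparison time; the cross-lingual work is done entirely by the mapping from text to descriptors, which this sketch takes as given.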
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Steinberger, R., Pouliquen, B., Hagman, J. (2002). Cross-Lingual Document Similarity Calculation Using the Multilingual Thesaurus EUROVOC. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2002. Lecture Notes in Computer Science, vol 2276. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45715-1_44
DOI: https://doi.org/10.1007/3-540-45715-1_44
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43219-7
Online ISBN: 978-3-540-45715-2
eBook Packages: Springer Book Archive