Cross-Lingual Document Similarity Calculation Using the Multilingual Thesaurus EUROVOC

Steinberger, Ralf; Pouliquen, Bruno; Hagman, Johan

doi:10.1007/3-540-45715-1_44

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2276)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

We present an approach to calculating the semantic similarity of documents written in the same or in different languages. The documents' contents are first represented in a language-independent way, using the descriptor terms of the multilingual thesaurus EUROVOC, and the similarity is then obtained by calculating the distance between these representations. While EUROVOC is a carefully handcrafted knowledge structure, our procedure uses statistical techniques. The method was applied to a collection of 5990 English and Spanish parallel texts and evaluated by measuring the number of times the translation of a given document was identified as the most similar document. The results demonstrate the feasibility and usefulness of the approach.
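
The following is a minimal sketch of the general idea, not the authors' implementation. It assumes each document has already been mapped to a sparse vector of weighted EUROVOC descriptors, and it uses cosine similarity as the measure, which is an assumption because the abstract does not name the distance function. The descriptor labels, weights, document identifiers, and the function names `cosine_similarity` and `translation_retrieval_rate` are all hypothetical. The second function mirrors the evaluation described above: it counts how often the translation of a document is ranked as its most similar document.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {descriptor: weight} dicts."""
    dot = sum(w * b[d] for d, w in a.items() if d in b)
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty representation is similar to nothing
    return dot / (norm_a * norm_b)


def translation_retrieval_rate(source_docs, target_docs):
    """Fraction of source documents whose most similar target document
    carries the same identifier, i.e. is its translation."""
    hits = 0
    for doc_id, vec in source_docs.items():
        best = max(target_docs,
                   key=lambda t: cosine_similarity(vec, target_docs[t]))
        if best == doc_id:
            hits += 1
    return hits / len(source_docs)


# Hypothetical descriptor assignments; real EUROVOC descriptors and
# weights would come from a separate indexing step.
english = {
    "doc1": {"FISHERIES_POLICY": 0.9, "CATCH_QUOTA": 0.6},
    "doc2": {"CUSTOMS_DUTY": 0.8, "IMPORT": 0.5},
}
spanish = {
    "doc1": {"FISHERIES_POLICY": 0.7, "CATCH_QUOTA": 0.8, "IMPORT": 0.1},
    "doc2": {"CUSTOMS_DUTY": 0.6, "IMPORT": 0.7},
}

rate = translation_retrieval_rate(english, spanish)
print(f"Translation ranked first for {rate:.0%} of documents")
```

Because both languages share the same descriptor vocabulary, no translation step is needed at comparison time: the English and Spanish vectors live in the same space and can be compared directly.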

Author information

Authors and Affiliations

European Commission, Joint Research Centre Institute for the Protection and Security of the Citizen (IPSC), Cybersecurity and New Technologies for Combating Fraud Unit (CSCF), 21020, Ispra, VA, Italy
Ralf Steinberger, Bruno Pouliquen & Johan Hagman

Editor information

Editors and Affiliations

CIC Centro de Investigacion en Computacion, IPN Instituto Politecnico Nacional, Col Zacateno, CP 07738, Mexico DF, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Steinberger, R., Pouliquen, B., Hagman, J. (2002). Cross-Lingual Document Similarity Calculation Using the Multilingual Thesaurus EUROVOC. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2002. Lecture Notes in Computer Science, vol 2276. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45715-1_44

DOI: https://doi.org/10.1007/3-540-45715-1_44
Published: 05 February 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43219-7
Online ISBN: 978-3-540-45715-2
eBook Packages: Springer Book Archive
