Cross-Lingual Document Similarity Calculation Using the Multilingual Thesaurus EUROVOC

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2002)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2276)

Abstract

We present an approach to calculating the semantic similarity of documents written in the same or in different languages. Document contents are represented in a language-independent way, using the descriptor terms of the multilingual thesaurus EUROVOC, and similarity is then calculated as the distance between these representations. While EUROVOC is a carefully handcrafted knowledge structure, our procedure uses statistical techniques. The method was applied to a collection of 5990 English and Spanish parallel texts and evaluated by counting how often the translation of a given document was identified as the most similar document. The results show that the approach is feasible and useful.
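The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the distance measure, so cosine similarity is assumed here, and the descriptor identifiers (`D_FISHERIES` and so on), the weights, and the function names are all hypothetical. Each document is a sparse vector of thesaurus descriptor weights, and the evaluation counts how often the translation of a document is ranked as its most similar counterpart.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors (dict: descriptor id -> weight).

    Assumption: the source only speaks of a "distance between
    representations"; cosine is used here purely for illustration.
    """
    dot = sum(w * b[d] for d, w in a.items() if d in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty representation is similar to nothing
    return dot / (norm_a * norm_b)


def translation_hit_rate(source_docs, target_docs):
    """Fraction of source documents whose most similar target document
    carries the same key, i.e. is the known translation."""
    hits = 0
    for key, vec in source_docs.items():
        best = max(target_docs,
                   key=lambda k: cosine_similarity(vec, target_docs[k]))
        if best == key:
            hits += 1
    return hits / len(source_docs)


# Hypothetical descriptor ids and weights; because descriptors are
# language-independent, both languages share the same vector space.
english = {
    "doc1": {"D_FISHERIES": 0.9, "D_QUOTA": 0.4},
    "doc2": {"D_TRADE": 0.8, "D_TARIFF": 0.5},
}
spanish = {
    "doc1": {"D_FISHERIES": 0.7, "D_QUOTA": 0.6},
    "doc2": {"D_TRADE": 0.6, "D_TARIFF": 0.7, "D_QUOTA": 0.1},
}

hit_rate = translation_hit_rate(english, spanish)
print(f"translation identified as most similar: {hit_rate:.0%}")
```

Because both language versions are mapped onto the same descriptor set, no translation step is needed before the comparison; how the descriptor weights are assigned statistically is outside the scope of this sketch.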




Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Steinberger, R., Pouliquen, B., Hagman, J. (2002). Cross-Lingual Document Similarity Calculation Using the Multilingual Thesaurus EUROVOC. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2002. Lecture Notes in Computer Science, vol 2276. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45715-1_44

  • DOI: https://doi.org/10.1007/3-540-45715-1_44

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43219-7

  • Online ISBN: 978-3-540-45715-2

  • eBook Packages: Springer Book Archive
