Cross-Lingual Semantic Similarity Measure for Comparable Articles

  • Motaz Saad
  • David Langlois
  • Kamel Smaïli
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8686)


Finding and comparing cross-lingual articles on a specific topic requires a similarity measure. Such a measure can be based on bilingual dictionaries or on numerical methods such as Latent Semantic Indexing (LSI). In this paper, we use LSI in two ways to retrieve Arabic-English comparable articles. The first is monolingual: the English article is translated into Arabic and then mapped into the Arabic LSI space. The second is cross-lingual: Arabic and English documents are mapped into an Arabic-English LSI space. We then compare both LSI approaches with the dictionary-based approach on several English-Arabic parallel and comparable corpora. Results indicate that our cross-lingual LSI approach is competitive with the monolingual approach and even outperforms it on some corpora. Moreover, both LSI approaches outperform the dictionary-based approach.
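The cross-lingual variant can be illustrated with a minimal sketch. It assumes the common construction in which each training document is the concatenation of an aligned English-Arabic pair, the resulting term-document matrix is reduced by a truncated SVD, and new documents in either language are folded into the reduced space and compared by cosine similarity. The toy corpus, the raw-count weighting, the rank k = 2 and all function names are illustrative assumptions; the abstract does not specify the preprocessing, term weighting, or dimensionality used in the paper.

```python
import numpy as np

def build_vocab(docs):
    """Map every whitespace-separated token seen in training to a row index."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}

def bow(doc, vocab):
    """Raw term-count vector; tokens outside the vocabulary are ignored."""
    vec = np.zeros(len(vocab))
    for tok in doc.split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec

def train_lsi(docs, k):
    """Build a terms-by-documents matrix and keep the top-k SVD factors."""
    vocab = build_vocab(docs)
    X = np.stack([bow(d, vocab) for d in docs], axis=1)
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return vocab, U[:, :k], s[:k]

def project(doc, vocab, U_k, s_k):
    """Fold a document into the LSI space: Sigma_k^{-1} U_k^T d."""
    return (U_k.T @ bow(doc, vocab)) / s_k

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy training data (assumption): each entry concatenates an English text
# with its aligned Arabic counterpart, so both vocabularies share one space.
training_pairs = [
    "team match goal فريق مباراة هدف",
    "team goal ball فريق هدف كرة",
    "bank market price بنك سوق سعر",
    "bank money market بنك مال سوق",
]

vocab, U_k, s_k = train_lsi(training_pairs, k=2)

english_doc = project("team goal", vocab, U_k, s_k)
arabic_sport = project("فريق مباراة", vocab, U_k, s_k)
arabic_finance = project("بنك سوق", vocab, U_k, s_k)

sim_sport = cosine(english_doc, arabic_sport)
sim_fin = cosine(english_doc, arabic_finance)
print(sim_sport, sim_fin)
```

On this toy data the English sports text should score much higher against the Arabic sports text than against the Arabic finance text, even though the two documents share no surface tokens. Under the same assumptions, the monolingual variant would differ only in that the English document is first translated into Arabic and the space is trained on Arabic text alone.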


Cross-lingual latent semantic indexing; corpus comparability; cross-lingual information retrieval





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Motaz Saad (1, 2, 3)
  • David Langlois (1, 2, 3)
  • Kamel Smaïli (1, 2, 3)

  1. SMarT Group, LORIA INRIA, Villers-lès-Nancy, France
  2. Université de Lorraine, LORIA, UMR 7503, Villers-lès-Nancy, France
  3. CNRS, LORIA, UMR 7503, Villers-lès-Nancy, France
