Multi-Lingual LSA with Serbian and Croatian: An Investigative Case Study

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10546)

Abstract

One challenge in information retrieval is searching a corpus whose documents may be written in more than one language. This exploratory study builds on earlier research that applies Latent Semantic Analysis to this problem (so-called Multi-Lingual Latent Semantic Indexing, or ML-LSI/LSA). We experiment with this approach, and with a new one, in a multi-lingual setting involving two closely related languages, Serbian and Croatian. Traditionally, an LSA approach requires a parallel corpus to train the system, with each pair of equivalent documents in the two languages combined into a single document. We repeat that approach and also build a semantic space from the parallel corpus on its own, without merging the documents, to test the hypothesis that, for very similar languages, merging may not be required for good results.
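
The difference between the two training set-ups described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy sentence pairs, the function names, and the use of raw term counts without weighting are all assumptions made for illustration. It covers only the corpus construction and term-document matrix steps and leaves out the singular value decomposition that LSA applies afterwards.

```python
from collections import Counter

# Hypothetical toy parallel corpus: (Serbian, Croatian) sentence pairs.
# Illustrative only; not taken from the paper's data.
PARALLEL = [
    ("vlada je usvojila novi zakon", "vlada je usvojila novi zakon"),
    ("voz kasni zbog snega", "vlak kasni zbog snijega"),
    ("hleb i mleko su poskupeli", "kruh i mlijeko su poskupjeli"),
]


def merged_corpus(pairs):
    """Classic ML-LSI training set: each pair becomes one combined document."""
    return [f"{sr} {hr}" for sr, hr in pairs]


def unmerged_corpus(pairs):
    """Alternative training set: every translation stays a separate document."""
    return [doc for pair in pairs for doc in pair]


def term_document_matrix(docs):
    """Return the sorted vocabulary and a terms-by-documents raw count matrix."""
    counts = [Counter(doc.split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


merged_vocab, merged_matrix = term_document_matrix(merged_corpus(PARALLEL))
plain_vocab, plain_matrix = term_document_matrix(unmerged_corpus(PARALLEL))
```

In the merged matrix, translation equivalents such as "voz" and "vlak" share a column, so they co-occur directly. In the unmerged matrix they fall in separate columns and can be related only through vocabulary the two languages share, which is the condition the hypothesis depends on.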

Keywords

Latent Semantic Analysis (LSA); Parallel Corpus; Semantic Space; Multilingual Information Retrieval; Multi-lingual IR

Notes

Acknowledgements

This article is based upon work from COST Action KEYSTONE IC1302, supported by COST (European Cooperation in Science and Technology).

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. University of Malta, Msida, Malta
  2. University of Novi Sad, Novi Sad, Serbia
