Combining Wikipedia-Based Concept Models for Cross-Language Retrieval

  • Benjamin Roth
  • Dietrich Klakow
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6107)

Abstract

As a low-cost resource that is up to date, Wikipedia has recently gained attention as a means of providing cross-language bridging for information retrieval. Contrary to a previous study, we show that standard Latent Dirichlet Allocation (LDA) can extract cross-language information that is valuable for IR, simply by normalizing the training data. Furthermore, we show that LDA and Explicit Semantic Analysis (ESA) complement each other, yielding significant improvements when combined. Such a combination can significantly contribute to retrieval based on machine translation, especially when query translations contain errors. The experiments were performed on the Multext JOC corpus and a CLEF dataset.

Keywords

Latent Dirichlet allocation, explicit semantic analysis, cross-language information retrieval, machine translation

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Benjamin Roth (1)
  • Dietrich Klakow (1)
  1. Spoken Language Systems, Saarland University, Germany
