Extracting Multilingual Topics from Unaligned Comparable Corpora

  • Jagadeesh Jagarlamudi
  • Hal Daumé III
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5993)


Topic models have been studied extensively in the context of monolingual corpora. Though there have been some attempts to mine topical structure from cross-lingual corpora, they require clues about document alignments. In this paper we present a generative model called JointLDA, which uses a bilingual dictionary to mine multilingual topics from an unaligned corpus. Experiments conducted on different data sets confirm our conjecture that jointly modeling the cross-lingual corpora offers several advantages over individual monolingual models. Because the JointLDA model merges related topics in different languages into a single multilingual topic, (a) it can fit the data with relatively fewer topics, and (b) it can predict related words from a language different from that of the given document. In fact, it has better predictive power than the bag-of-words based translation model, leaving open the possibility that JointLDA could be preferred over the bag-of-words model for cross-lingual IR applications. We also found that the monolingual models learnt while optimizing over the cross-lingual corpora are more effective than the corresponding LDA models.
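The abstract does not spell out the generative process, so the following is only a minimal sketch of the general idea that a bilingual dictionary can tie two unaligned corpora to a shared vocabulary. It is not the JointLDA model itself. The toy dictionary, the language codes, and the function names `build_concept_index` and `to_concepts` are all hypothetical, and treating each dictionary entry as one shared "concept" is an assumption made for illustration.

```python
# Hypothetical toy dictionary of (English word, German word) pairs.
TOY_DICTIONARY = [("house", "haus"), ("water", "wasser"), ("book", "buch")]


def build_concept_index(pairs, src_lang="en", tgt_lang="de"):
    """Map (language, word) keys to a shared concept, one per dictionary entry.

    Assumption for illustration: if a word occurs in several entries,
    only its first entry is used.
    """
    index = {}
    for src_word, tgt_word in pairs:
        concept = ("pair", src_word, tgt_word)
        index.setdefault((src_lang, src_word), concept)
        index.setdefault((tgt_lang, tgt_word), concept)
    return index


def to_concepts(tokens, lang, index):
    """Rewrite a tokenized document over the shared concept vocabulary.

    Words missing from the dictionary stay language-specific: they are
    represented by their own (language, word) key.
    """
    return [index.get((lang, tok), (lang, tok)) for tok in tokens]


index = build_concept_index(TOY_DICTIONARY)
en_doc = to_concepts(["house", "water", "garden"], "en", index)
de_doc = to_concepts(["haus", "wasser", "garten"], "de", index)

# Concepts that both documents share, although the documents are not aligned.
shared = set(en_doc) & set(de_doc)
print(sorted(shared))
```

Once documents from both languages are expressed over one vocabulary like this, a single topic can place probability on words from either language, which is the intuition behind fitting the data with fewer topics and predicting related words across languages.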


Keywords (added by machine, not by the authors): Topic Model; Dictionary Entry; Topic Distribution; Parallel Corpus; Bilingual Dictionary





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Jagadeesh Jagarlamudi (1)
  • Hal Daumé III (1)
  1. School of Computing, University of Utah
