Monolingual and Cross-Lingual Probabilistic Topic Models and Their Applications in Information Retrieval

  • Conference paper
Advances in Information Retrieval (ECIR 2013)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7814)

Abstract

Probabilistic topic models are a group of unsupervised generative machine learning models that can be effectively trained on large text collections. They model document content as a two-step generation process: documents are observed as mixtures of latent topics, while topics are probability distributions over vocabulary words. Recently, significant research effort has been invested in transferring the probabilistic topic modeling concept from monolingual to multilingual settings. Novel topic models have been designed to work with parallel and comparable multilingual data (e.g., Wikipedia or news data discussing the same events). Probabilistic topic models offer an elegant way to represent content across different languages. Their probabilistic framework allows for easy integration into a language modeling framework for monolingual and cross-lingual information retrieval. Moreover, we present how to use the knowledge from the topic models in the tasks of cross-lingual event clustering, cross-lingual document classification, and the detection of cross-lingual semantic similarity of words. The tutorial also demonstrates how semantically similar words across languages are integrated as useful additional evidence in cross-lingual information retrieval models.
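
As a rough illustration of how topic-model output can plug into a language modeling framework for retrieval, the sketch below scores documents by the query likelihood P(q | d) = Π_w Σ_k P(w | z_k) · P(z_k | d). It is a minimal sketch under stated assumptions, not the method presented in the tutorial: the distributions `PHI_EN` and `THETA`, the document identifiers, and the function names are all hypothetical, and the probabilities are made-up toy values rather than the result of training a model.

```python
import math

# Hypothetical per-topic word distributions P(w | z_k) for the query language.
# Toy values for illustration only; a real model would learn these from data.
PHI_EN = [
    {"football": 0.6, "goal": 0.3, "election": 0.1},  # topic 0: sports
    {"football": 0.1, "goal": 0.1, "election": 0.8},  # topic 1: politics
]

# Hypothetical per-document topic mixtures P(z_k | d). The documents may be
# written in another language, assuming the latent topics are shared across
# languages.
THETA = {
    "doc_sports_nl": [0.9, 0.1],
    "doc_politics_nl": [0.2, 0.8],
}


def query_log_likelihood(query, theta_d, phi):
    """Return log P(q | d) = sum over query words of log sum_k P(w | z_k) P(z_k | d)."""
    total = 0.0
    for word in query:
        # Marginalise over topics to get the probability of this word given the document.
        p_word = sum(phi_k.get(word, 0.0) * t_k for phi_k, t_k in zip(phi, theta_d))
        if p_word == 0.0:
            # Word unseen under every topic; no smoothing in this sketch.
            return float("-inf")
        total += math.log(p_word)
    return total


def rank(query, theta, phi):
    """Return document identifiers sorted by decreasing query log-likelihood."""
    scores = {doc: query_log_likelihood(query, t, phi) for doc, t in theta.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    print(rank(["football", "goal"], THETA, PHI_EN))
    print(rank(["election"], THETA, PHI_EN))
```

In practice such a topic-based score would typically be smoothed and combined with other evidence, since an unseen query word drives the unsmoothed likelihood to zero.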

© 2013 Springer-Verlag Berlin Heidelberg

Cite this paper

Moens, MF., Vulić, I. (2013). Monolingual and Cross-Lingual Probabilistic Topic Models and Their Applications in Information Retrieval. In: Serdyukov, P., et al. Advances in Information Retrieval. ECIR 2013. Lecture Notes in Computer Science, vol 7814. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36973-5_106

  • DOI: https://doi.org/10.1007/978-3-642-36973-5_106

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-36972-8

  • Online ISBN: 978-3-642-36973-5

  • eBook Packages: Computer Science, Computer Science (R0)
