Monolingual and Cross-Lingual Probabilistic Topic Models and Their Applications in Information Retrieval

Moens, Marie-Francine; Vulić, Ivan

doi:10.1007/978-3-642-36973-5_106

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7814)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

Probabilistic topic models are a group of unsupervised generative machine learning models that can be effectively trained on large text collections. They model document content as a two-step generation process, i.e., documents are observed as mixtures of latent topics, while topics are probability distributions over vocabulary words. Recently, a significant research effort has been invested in transferring the probabilistic topic modeling concept from monolingual to multilingual settings. Novel topic models have been designed to work with parallel and comparable multilingual data (e.g., Wikipedia or news data discussing the same events). Probabilistic topic models offer an elegant way to represent content across different languages. Their probabilistic framework allows for their easy integration into a language modeling framework for monolingual and cross-lingual information retrieval. Moreover, we present how to use the knowledge from the topic models in the tasks of cross-lingual event clustering, cross-lingual document classification and the detection of cross-lingual semantic similarity of words. The tutorial also demonstrates how semantically similar words across languages are integrated as useful additional evidence in cross-lingual information retrieval models.
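
As a rough illustration of the two-step view described in the abstract, the sketch below scores documents for a query by marginalizing over latent topics, P(w | d) = Σ_k P(w | z_k) P(z_k | d), and ranking by query log-likelihood. It is a minimal sketch, not the tutorial's own implementation: the names `phi`, `theta`, `word_prob`, `query_log_likelihood` and `rank`, the two toy topics, and all probability values are assumptions made up for illustration, and no smoothing with a background language model is applied.

```python
import math

# Hypothetical toy model with K = 2 topics over a 4-word vocabulary.
# phi[k][w] = P(w | topic k): each topic is a distribution over words.
phi = [
    {"goal": 0.50, "match": 0.40, "vote": 0.05, "party": 0.05},  # sports-like topic
    {"goal": 0.05, "match": 0.05, "vote": 0.50, "party": 0.40},  # politics-like topic
]

# theta[d][k] = P(topic k | document d): each document is a mixture of topics.
theta = {
    "doc_sports": [0.9, 0.1],
    "doc_politics": [0.2, 0.8],
}


def word_prob(word, doc):
    """P(w | d) = sum over topics k of P(w | z_k) * P(z_k | d)."""
    return sum(p_topic * phi[k].get(word, 0.0)
               for k, p_topic in enumerate(theta[doc]))


def query_log_likelihood(query, doc):
    """log P(Q | d) under a unigram assumption: sum of log P(q_i | d)."""
    total = 0.0
    for word in query:
        p = word_prob(word, doc)
        if p == 0.0:
            # Unsmoothed sketch: an out-of-vocabulary word rules the document out.
            return float("-inf")
        total += math.log(p)
    return total


def rank(query):
    """Return document ids sorted from most to least likely to generate the query."""
    return sorted(theta, key=lambda d: query_log_likelihood(query, d), reverse=True)


if __name__ == "__main__":
    print(rank(["vote", "party"]))
```

In a cross-lingual setting one could, under the same assumptions, keep a separate `phi` per language over a shared set of topics, so that a query in one language is scored against `theta` inferred for documents in another; whether a given multilingual topic model supports this directly depends on how it was trained.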

Author information

Authors and Affiliations

Department of Computer Science, KU Leuven, Celestijnenlaan 200A, B-3001, Heverlee, Belgium
Marie-Francine Moens & Ivan Vulić

Editor information

Editors and Affiliations

Yandex, Leo Tolstoy, 16, 119021, Moscow, Russia
Pavel Serdyukov & Ilya Segalovich
Kontur Labs and Ural Federal University, Fonvizina 3-27, 620078, Yekaterinburg, Russia
Pavel Braslavski
National Research University Higher School of Economics (HSE), Pokrovskii bd 11, 109028, Moscow, Russia
Sergei O. Kuznetsov
University of Amsterdam, Turfdraagsterpad 9, 1012 XT, Amsterdam, The Netherlands
Jaap Kamps
Knowledge Media Institute, The Open University, Walton Hall, MK7 6AA, Milton Keynes, UK
Stefan Rüger
Mathematics & Computer Science Department, Emory University, 400 Dowman Drive, 30329, Atlanta, GA, USA
Eugene Agichtein
Department of Computer Science, University College London, Gower Street, WC1E 6BT, London, UK
Emine Yilmaz

About this paper

Cite this paper

Moens, MF., Vulić, I. (2013). Monolingual and Cross-Lingual Probabilistic Topic Models and Their Applications in Information Retrieval. In: Serdyukov, P., et al. Advances in Information Retrieval. ECIR 2013. Lecture Notes in Computer Science, vol 7814. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36973-5_106

DOI: https://doi.org/10.1007/978-3-642-36973-5_106
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36972-8
Online ISBN: 978-3-642-36973-5
eBook Packages: Computer Science, Computer Science (R0)
