ECIR 2013: Advances in Information Retrieval, pp. 874–877
Monolingual and Cross-Lingual Probabilistic Topic Models and Their Applications in Information Retrieval
Abstract
Probabilistic topic models are a group of unsupervised generative machine learning models that can be effectively trained on large text collections. They model document content as a two-step generation process: documents are modeled as mixtures of latent topics, while topics are probability distributions over vocabulary words. Recently, significant research effort has been invested in transferring the probabilistic topic modeling concept from monolingual to multilingual settings. Novel topic models have been designed to work with parallel and comparable multilingual data (e.g., Wikipedia or news data discussing the same events). Probabilistic topic models offer an elegant way to represent content across different languages. Their probabilistic framework allows for easy integration into a language modeling framework for monolingual and cross-lingual information retrieval. Moreover, we present how to use the knowledge from the topic models in the tasks of cross-lingual event clustering, cross-lingual document classification, and the detection of cross-lingual semantic similarity of words. The tutorial also demonstrates how semantically similar words across languages can be integrated as useful additional evidence in cross-lingual information retrieval models.
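The two-step view described in the abstract can be illustrated with a minimal sketch. It is not the tutorial's own implementation: the names `phi`, `theta`, `word_prob`, `query_log_likelihood` and `rank`, as well as all numbers, are assumptions chosen for illustration. The sketch computes the probability of a word given a document as a topic mixture, P(w | d) = sum over k of P(w | z_k) P(z_k | d), and ranks documents by the log-likelihood of the query under that mixture, which is one simple way a topic model could plug into a language modeling retrieval framework.

```python
import math

# Toy, hand-picked numbers (assumptions for illustration only).
# phi[k][w]: P(word w | topic k); each topic's probabilities sum to 1.
phi = {
    0: {"goal": 0.5, "match": 0.4, "vote": 0.1},
    1: {"goal": 0.1, "match": 0.1, "vote": 0.8},
}

# theta[d][k]: P(topic k | document d); each document's probabilities sum to 1.
theta = {
    "doc_sports": [0.9, 0.1],
    "doc_politics": [0.2, 0.8],
}


def word_prob(word, doc):
    """P(word | doc) as a mixture over latent topics."""
    return sum(theta[doc][k] * phi[k][word] for k in phi)


def query_log_likelihood(query, doc):
    """Log-probability that the document's topic mixture generates the query."""
    return sum(math.log(word_prob(w, doc)) for w in query)


def rank(query):
    """Return document ids sorted from most to least likely to generate the query."""
    return sorted(theta, key=lambda d: query_log_likelihood(query, d), reverse=True)


if __name__ == "__main__":
    for q in (["goal", "match"], ["vote"]):
        print(q, "->", rank(q))
```

In a cross-lingual setting, one plausible variant (again an assumption, not a claim about the tutorial's exact models) keeps the per-document topic distributions shared across languages and uses a separate set of per-topic word distributions for each language, so that a query in one language can be scored against documents written in another.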
Keywords
Probabilistic topic models, Cross-lingual retrieval, Ranking models, Cross-lingual text mining