Data Mining and Knowledge Discovery, Volume 30, Issue 5, pp 1299–1323

C-BiLDA: extracting cross-lingual topics from non-parallel texts by distinguishing shared from unshared content

Abstract

We study the problem of extracting cross-lingual topics from non-parallel multilingual text datasets with partially overlapping thematic content (e.g., aligned Wikipedia articles in two different languages). To this end, we develop a new bilingual probabilistic topic model called comparable bilingual latent Dirichlet allocation (C-BiLDA), which can handle such comparable data and, unlike the standard bilingual LDA model (BiLDA), does not assume the availability of document pairs with identical topic distributions. We present a full overview of C-BiLDA and show its utility in the task of cross-lingual knowledge transfer for multi-class document classification on two benchmark datasets for three language pairs. The proposed model outperforms the baseline LDA model, the standard BiLDA model, and two standard low-rank approximation methods (CL-LSI and CL-KCCA) used in previous work on this task.

Keywords

Cross-lingual text mining; Multilingual topic modeling; Multilinguality; Comparable data; Cross-lingual knowledge transfer; Unsupervised modeling of text data; Representation learning

Copyright information

© The Author(s) 2015

Authors and Affiliations

  • Geert Heyman (1)
  • Ivan Vulić (1)
  • Marie-Francine Moens (1)

  1. Department of Computer Science, KU Leuven, Heverlee, Belgium
