Applying Word Embeddings to Leverage Knowledge Available in One Language in Order to Solve a Practical Text Classification Problem in Another Language

Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 661)


A text classification problem in the Kazakh language is examined. The amount of training data for the task in Kazakh is very limited, but plenty of labeled data in Russian is available. A language vector space transform is built and used to transfer knowledge from Russian into Kazakh. The resulting classification quality is comparable to that of an approach that employs a sophisticated automatic translation system.
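
The abstract does not say how the vector space transform is constructed. As an illustration only, one common construction is a linear map between two embedding spaces, fitted by least squares on vectors of dictionary-aligned word pairs. The sketch below assumes that construction and uses synthetic vectors; the names `fit_linear_map`, `doc_vector`, `kk_emb` and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def fit_linear_map(src_vecs, tgt_vecs):
    """Return W minimizing ||src_vecs @ W - tgt_vecs|| in the least-squares sense."""
    W, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return W

def doc_vector(tokens, embeddings, dim):
    """Average the embeddings of known tokens; zero vector if none are known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Synthetic stand-ins for real embeddings (assumption: 4-dimensional vectors).
rng = np.random.default_rng(0)
dim = 4
true_map = rng.normal(size=(dim, dim))  # hidden "ground truth" for the toy data

# Hypothetical Kazakh vocabulary and its vectors.
kk_emb = {f"kk_word_{i}": rng.normal(size=dim) for i in range(20)}
kk_matrix = np.stack(list(kk_emb.values()))

# Vectors of the dictionary translations in the Russian space (toy: exactly linear).
ru_matrix = kk_matrix @ true_map

# Fit the transform on the aligned pairs.
W = fit_linear_map(kk_matrix, ru_matrix)

# Map a Kazakh document into the Russian space; out-of-vocabulary tokens are skipped.
kk_doc = doc_vector(["kk_word_1", "kk_word_7", "unknown"], kk_emb, dim)
ru_space_doc = kk_doc @ W
```

Under these assumptions, a classifier trained on averaged Russian document vectors could be applied to `ru_space_doc` without any labeled Kazakh data. With real embeddings the relation between the spaces is only approximately linear, so the fit would be inexact.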


Language vector space · Word embedding · Text classification · Low resource



This work was financially supported by the Ministry of Education and Science of the Russian Federation, Contract 14.579.21.0008, ID RFMEFI57914X0008.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. STC-Innovations, Saint Petersburg, Russia
  2. Speech Technology Center, Saint Petersburg, Russia
  3. ITMO University, Saint Petersburg, Russia
