Noun Compositionality Detection Using Distributional Semantics for the Russian Language

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11832)


In this paper, we present the first gold-standard corpus of Russian noun compounds annotated with compositionality information. We used Universal Dependency treebanks to collect noun compounds according to part of speech patterns, such as ADJ-NOUN or NOUN-NOUN and annotated them according to the following schema: a phrase can be either compositional, non-compositional, or ambiguous (i.e., depending on the context it can be interpreted both as compositional or non-compositional). Next, we conduct a series of experiments to evaluate both unsupervised and supervised methods for predicting compositionality. To expand this manually annotated dataset with more non-compositional compounds and streamline the annotation process we use active learning. We show that not only the methods, previously proposed for English, are easily adapted for Russian, but also can be exploited in active learning paradigm, that increases the efficiency of the annotation process.



Dmitry Puzyrev and Ekaterina Artemova were supported by the framework of the HSE University Basic Research Program and Russian Academic Excellence Project “5–100”.


  1. 1.
    Aharodnik, K., Feldman, A., Peng, J.: Designing a Russian idiom-annotated corpus. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (2018)Google Scholar
  2. 2.
    Anke, L.E., Schockaert, S.: Seven: augmenting word embeddings with unsupervised relation vectors. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 2653–2665 (2018)Google Scholar
  3. 3.
    Baldwin, T., Bannard, C., Tanaka, T., Widdows, D.: An empirical model of multiword expression decomposability. In: Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, vol. 18, pp. 89–96 (2003)Google Scholar
  4. 4.
    Baldwin, T., Villavicencio, A.: Extracting the unextractable: a case study on verb-particles. In: Proceedings of CoNLL, pp. 1–7 (2002)Google Scholar
  5. 5.
    Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information (2016)Google Scholar
  6. 6.
    Breiman, L.: Classification and regression trees (2017)CrossRefGoogle Scholar
  7. 7.
    Cordeiro, S., Ramisch, C., Idiart, M., Villavicencio, A.: Predicting the compositionality of nominal compounds: giving word embeddings a hard time. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 1986–1997 (2016)Google Scholar
  8. 8.
    Farahmand, M., Smith, A., Nivre, J.: A multiword expression data set: annotating non-compositionality and conventionalization for English noun compounds. In: Proceedings of the 11th Workshop on Multiword Expressions, pp. 29–33 (2015)Google Scholar
  9. 9.
    Gurrutxaga, A., Alegria, I.: Combining different features of idiomaticity for the automatic classification of noun+verb expressions in Basque. In: Proceedings of the 9th Workshop on Multiword Expressions, pp. 116–125 (2013)Google Scholar
  10. 10.
    Hinton, G.E.: Connectionist learning procedures. Artif. Intell. 40(1–3), 185–234 (1989)CrossRefGoogle Scholar
  11. 11.
    Jana, A., Puzyrev, D., Panchenko, A., Goyal, P., Biemann, C., Mukherjee, A.: On the compositionality prediction of noun phrases using poincaré embeddings. In: The 57th Annual Meeting of the Association for Computational Linguistics (ACL) (2019)Google Scholar
  12. 12.
    Lewis, D.D., Gale, W.A.: A sequential algorithm for training text classifiers. In: SIGIR 1994, pp. 3–12 (1994)CrossRefGoogle Scholar
  13. 13.
    McCarthy, D., Keller, B., Carroll, J.: Detecting a continuum of compositionality in phrasal verbs. In: Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, MWE 2003, vol. 18, pp. 73–80 (2003)Google Scholar
  14. 14.
    Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)Google Scholar
  15. 15.
    Mitchell, J., Lapata, M.: Vector-based models of semantic composition. In: Proceedings of ACL 2008: HLT, pp. 236–244 (2008)Google Scholar
  16. 16.
    Nickel, M., Kiela, D.: Poincaré embeddings for learning hierarchical representations. In: Advances in Neural Information Processing Systems, pp. 6338–6347 (2017)Google Scholar
  17. 17.
    Peng, J., Feldman, A.: Automatic idiom recognition with word embeddings. In: Lossio-Ventura, J.A., Alatrista-Salas, H. (eds.) SIMBig 2015-2016. CCIS, vol. 656, pp. 17–29. Springer, Cham (2017). Scholar
  18. 18.
    Platt, J.C.: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Advances in Large Margin Classifiers, pp. 61–74 (1999)Google Scholar
  19. 19.
    Ramisch, C., Cordeiro, S., Zilio, L., Idiart, M., Villavicencio, A.: How naked is the naked truth? A multilingual lexicon of nominal compound compositionality. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (2016)Google Scholar
  20. 20.
    Reddy, S., McCarthy, D., Manandhar, S.: An empirical study on compositionality in compound nouns. In: Proceedings of the 5th International Joint Conference on Natural Language Processing, pp. 210–218 (2011)Google Scholar
  21. 21.
    Rehurek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP frameworks, pp. 45–50 (2010)Google Scholar
  22. 22.
    Roller, S., Schulte im Walde, S., Scheible, S.: The (un)expected effects of applying standard cleansing models to human ratings on compositionality. In: Proceedings of the 9th Workshop on Multiword Expressions, pp. 32–41 (2013)Google Scholar
  23. 23.
    Savary, A., et al.: PARSEME - PARSing and Multiword Expressions within a European multilingual network. In: 7th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC 2015) (2015)Google Scholar
  24. 24.
    Segalovich, I.: A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search engine. In: MLMTA (2003)Google Scholar
  25. 25.
    Settles, B.: Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences (2009)Google Scholar
  26. 26.
    Suvorov, R., Shelmanov, A., Smirnov, I.: Active learning with adaptive density weighted sampling for information extraction from scientific papers. In: Filchenkov, A., Pivovarova, L., Žižka, J. (eds.) AINL 2017. CCIS, vol. 789, pp. 77–90. Springer, Cham (2018). Scholar
  27. 27.
    Tong, S., Koller, D.: Support vector machine active learning with applications to text classification. J. Mach. Learn. Res. 2, 45–66 (2001)zbMATHGoogle Scholar
  28. 28.
    Venkatapathy, S., Joshi, A.K.: Measuring the relative compositionality of verb-noun (V-N) collocations by integrating features. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 899–906 (2005)Google Scholar
  29. 29.
    Weston, J., Bordes, A., Yakhnenko, O., Usunier, N.: Connecting language and knowledge bases with embedding models for relation extraction. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1366–1371 (2013)Google Scholar
  30. 30.
    Zhang, H.: The optimality of naive bayes. In: Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference, FLAIRS 2004, vol. 2 (2004)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. 1.National Research University Higher School of EconomicsMoscowRussia
  2. 2.Skolkovo Institute of Science and TechnologyMoscowRussia

Personalised recommendations