Abstract
In this paper, we present the first gold-standard corpus of Russian noun compounds annotated with compositionality information. We used Universal Dependencies treebanks to collect noun compounds that match part-of-speech patterns such as ADJ-NOUN or NOUN-NOUN, and annotated them according to the following schema: a phrase is compositional, non-compositional, or ambiguous (i.e., depending on the context, it can be interpreted as either compositional or non-compositional). Next, we conduct a series of experiments to evaluate both unsupervised and supervised methods for predicting compositionality. To expand this manually annotated dataset with more non-compositional compounds and to streamline the annotation process, we use active learning. We show that the methods previously proposed for English not only adapt easily to Russian but can also be exploited in an active learning paradigm, which increases the efficiency of the annotation process.
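The abstract does not spell out how compositionality is scored or how candidates are chosen for annotation. As a rough illustration only, the sketch below shows one common distributional approach: compare the embedding of the whole compound with the sum of its component embeddings, and pick the items a classifier is least sure about for the next annotation round (uncertainty sampling). The function names, the additive composition, and the 0.5 decision boundary are assumptions made for this example, not necessarily the setup used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def compositionality_score(compound_vec, head_vec, modifier_vec):
    """Unsupervised score: similarity between the compound's own vector and
    the sum of its component vectors (additive composition is an assumption).
    Higher values suggest a more compositional phrase."""
    composed = [h + m for h, m in zip(head_vec, modifier_vec)]
    return cosine(compound_vec, composed)


def most_uncertain(probabilities, k):
    """Uncertainty sampling: return the k compounds whose predicted probability
    of being non-compositional lies closest to 0.5.

    `probabilities` is a hypothetical dict mapping compound -> probability,
    e.g. produced by any supervised classifier."""
    ranked = sorted(probabilities, key=lambda c: abs(probabilities[c] - 0.5))
    return ranked[:k]


if __name__ == "__main__":
    # Toy 2-dimensional vectors, invented purely for illustration.
    print(compositionality_score([1.0, 1.0], [1.0, 0.0], [0.0, 1.0]))
    print(most_uncertain({"a": 0.9, "b": 0.55, "c": 0.1, "d": 0.4}, k=2))
```

In practice the vectors would come from a trained embedding model in which each compound is also represented as a single token, and the probabilities from a classifier trained on the annotated compounds; both are left abstract here.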
Notes
- 1. The dataset and the code: https://github.com/slangtech/ru-comps.
- 5. We used a Wikipedia dump from 02.05.2019, which consists of 1,542,621 articles.
Acknowledgements
Dmitry Puzyrev and Ekaterina Artemova were supported within the framework of the HSE University Basic Research Program and the Russian Academic Excellence Project “5–100”.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Puzyrev, D., Shelmanov, A., Panchenko, A., Artemova, E. (2019). Noun Compositionality Detection Using Distributional Semantics for the Russian Language. In: , et al. Analysis of Images, Social Networks and Texts. AIST 2019. Lecture Notes in Computer Science, vol 11832. Springer, Cham. https://doi.org/10.1007/978-3-030-37334-4_20
DOI: https://doi.org/10.1007/978-3-030-37334-4_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-37333-7
Online ISBN: 978-3-030-37334-4
eBook Packages: Computer Science, Computer Science (R0)