Abstract
Word sense induction (WSI) is the problem of grouping occurrences of an ambiguous word according to the sense expressed in each occurrence. Recently, a new approach to this task was proposed that generates possible substitutes for the ambiguous word in a particular context using neural language models, and then clusters sparse bag-of-words vectors built from these substitutes. In this work, we apply this approach to the Russian language and improve it in two ways. First, we propose methods of combining the left and right contexts, which result in better generated substitutes. Second, instead of a fixed number of clusters for all ambiguous words, we propose a technique for selecting an individual number of clusters for each word. Our approach establishes a new state of the art, improving the current best WSI results for the Russian language on two RUSSE 2018 datasets by a large margin.
References
Alagić, D., Šnajder, J., Padó, S.: Leveraging lexical substitutes for unsupervised word sense induction. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
Amplayo, R.K., Hwang, S.W., Song, M.: AutoSense model for word sense induction. In: AAAI (2019)
Amrami, A., Goldberg, Y.: Word sense induction with neural biLM and symmetric patterns. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4860–4867. Association for Computational Linguistics (2018). https://www.aclweb.org/anthology/D18-1523
Apresjan, V.: Active dictionary of the Russian language: theory and practice. In: Meaning-Text Theory 2011, pp. 13–24 (2011)
Arefyev, N., Ermolaev, P., Panchenko, A.: How much does a word weigh? Weighting word embeddings for word sense induction. In: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue”, Moscow, Russia, pp. 68–84. RSUH (2018)
Bartunov, S., Kondrashkin, D., Osokin, A., Vetrov, D.: Breaking sticks and ambiguities with adaptive skip-gram. In: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS) (2016)
Baskaya, O., Sert, E., Cirik, V., Yuret, D.: AI-KU: using substitute vectors and co-occurrence modeling for word sense induction and disambiguation. In: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), vol. 2, pp. 300–306 (2013)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Hope, D., Keller, B.: UoS: a graph-based system for graded word sense induction. In: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), no. 1, Atlanta, Georgia, USA, pp. 689–694 (2013). http://www.aclweb.org/anthology/S13-2113
Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 328–339. Association for Computational Linguistics (2018). https://www.aclweb.org/anthology/P18-1031
Jurgens, D., Klapaftis, I.: SemEval-2013 task 13: word sense induction for graded and non-graded senses. In: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), vol. 2, pp. 290–299 (2013)
Kutuzov, A.: Russian word sense induction by clustering averaged word embeddings. CoRR abs/1805.02258 (2018). http://arxiv.org/abs/1805.02258
Lau, J.H., Cook, P., Baldwin, T.: unimelb: topic modelling-based word sense induction. In: Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), vol. 2, Atlanta, Georgia, USA, pp. 307–311 (2013). http://www.aclweb.org/anthology/S13-2051
Manandhar, S., Klapaftis, I.P., Dligach, D., Pradhan, S.S.: SemEval-2010 task 14: word sense induction & disambiguation. In: Proceedings of the 5th International Workshop on Semantic Evaluation, pp. 63–68. Association for Computational Linguistics (2010)
Melamud, O., Goldberger, J., Dagan, I.: context2vec: learning generic context embedding with bidirectional LSTM. In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 51–61 (2016)
Panchenko, A., et al.: RUSSE’2018: a shared task on word sense induction for the Russian language. In: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue”, Moscow, Russia, pp. 547–564. RSUH (2018). http://www.dialog-21.ru/media/4324/panchenkoa.pdf
Peters, M.E., et al.: Deep contextualized word representations. In: Proceedings of the NAACL (2018)
Struyanskiy, O., Arefyev, N.: Neural networks with attention for word sense induction. In: Supplementary Proceedings of the Seventh International Conference on Analysis of Images, Social Networks and Texts (AIST 2018), Moscow, Russia, 5–7 July 2018, pp. 208–213 (2018). http://ceur-ws.org/Vol-2268/paper23.pdf
Tang, G., Müller, M., Rios, A., Sennrich, R.: Why self-attention? A targeted evaluation of neural machine translation architectures. arXiv preprint arXiv:1808.08946 (2018)
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
Véronis, J.: HyperLex: lexical cartography for information retrieval. Comput. Speech Lang. 18(3), 223–252 (2004)
Wang, A., Cho, K.: BERT has a mouth, and it must speak: BERT as a Markov random field language model. CoRR abs/1902.04094 (2019). http://arxiv.org/abs/1902.04094
Wang, J., Bansal, M., Gimpel, K., Ziebart, B.D., Yu, C.T.: A sense-topic model for word sense induction with unsupervised data enrichment. TACL 3, 59–71 (2015)
Acknowledgements
We are grateful to Dima Lipin, Artem Grachev and Alex Nevidomsky for their valuable help.
Appendices
A Examples of Substitutes Generated
Table 5 provides examples of discriminative substitutes with their relative frequencies for each of the two most frequent senses of several words. A substitute is called discriminative if it is frequently generated for one sense of an ambiguous word but rarely for the other. Formally, we take the substitutes with the largest \(\frac{P(w|sense_1)}{P(w|sense_2)}\), where \(P(w|sense_i)\) is estimated using add-one smoothing:
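The smoothed estimator itself is not reproduced in this excerpt; under standard add-one (Laplace) smoothing over the substitute vocabulary \(V\), where \(c(w, sense_i)\) counts how often substitute \(w\) was generated for examples of \(sense_i\) and \(N_i = \sum_{w'} c(w', sense_i)\), it would take the form:

\[
P(w \mid sense_i) = \frac{c(w, sense_i) + 1}{N_i + |V|}
\]

The added count of one keeps the ratio \(P(w|sense_1)/P(w|sense_2)\) finite for substitutes never generated for one of the senses.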
Additionally, we keep only substitutes that were generated at least 10 times for one of the senses.
Table 6 lists the ten most probable substitutes according to the combined distribution, and according to the forward and backward LM distributions separately, for several examples. Substitutes from the unidirectional distributions are very sensitive to the position of the target word: when either the left or the right context does not contain enough information, at least half of the substitutes are unrelated to the target word. The combined distribution provides more relevant substitutes.
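The effect of combining the two unidirectional distributions can be sketched as follows. The paper's exact combination scheme (with the \(\beta\)-controlled discounting from Appendix C) is not reproduced in this excerpt, so this sketch assumes a simple renormalized geometric mean of the forward and backward substitute distributions:

```python
import math

def combine_distributions(p_forward, p_backward):
    """Combine forward and backward LM substitute distributions by a
    renormalized geometric mean (equal weights for both directions)."""
    vocab = set(p_forward) | set(p_backward)
    eps = 1e-12  # floor for words missing from one of the distributions
    combined = {w: math.sqrt(p_forward.get(w, eps) * p_backward.get(w, eps))
                for w in vocab}
    z = sum(combined.values())
    return {w: p / z for w, p in combined.items()}

# A substitute strongly predicted by both LMs dominates the combination,
# while words supported by only one direction are suppressed.
p_fwd = {"bank": 0.6, "river": 0.3, "the": 0.1}
p_bwd = {"bank": 0.5, "shore": 0.4, "the": 0.1}
p = combine_distributions(p_fwd, p_bwd)
best = max(p, key=p.get)  # "bank"
```

The geometric mean acts like a soft intersection: a substitute needs support from both contexts to stay probable, which matches the observation that either context alone can produce irrelevant substitutes.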
B The Number of Clusters Selected
Figure 4 plots the distributions of the differences between the true number of senses, the number of clusters in the submissions, and the optimal number of clusters. The silhouette score yields a number of clusters that is usually larger than the number of senses, but is near the optimum with respect to ARI given our vectors. The previous best submissions estimate the true number of senses better.
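Per-word cluster-count selection by silhouette score can be sketched as follows; the clustering algorithm and the candidate range here are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def select_num_clusters(vectors, max_k=10):
    """Return the number of clusters maximizing the silhouette score
    over candidate values k = 2..max_k."""
    best_k, best_score = 2, -1.0
    for k in range(2, min(max_k, len(vectors) - 1) + 1):
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(vectors)
        score = silhouette_score(vectors, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Two well-separated blobs of context vectors -> silhouette peaks at k = 2.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 5), rng.randn(20, 5) + 10.0])
k = select_num_clusters(X)  # 2
```

Because the silhouette score is computed per target word, each word gets its own number of clusters instead of a single fixed value shared across the dataset.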
C Hyperparameters
Table 7 shows the selected hyperparameters for the methods described in Sect. 3. For the bts-rnc and active-dict datasets, hyperparameters were selected using grid search on the corresponding train sets. For wiki-wiki we used the hyperparameters from bts-rnc due to the very small size of the wiki-wiki train set. We selected the following hyperparameters.
1. Add bias (True/False). Ignoring the bias in the softmax layer of the LM was proposed in [3] to improve substitutes, because adding the bias results in the prediction of frequent words instead of rare but relevant substitutes.
2. Normalize output embeddings (True/False). Similarly to ignoring the bias, this may result in the prediction of more relevant substitutes.
3. K (10–400). The number of substitutes taken from each distribution.
4. Exclude target (True/False). We want the substitutes for different senses of the target word to be non-overlapping, so it may be beneficial to exclude the target word from the substitutes.
5. TFIDF (True/False). Applying a TF-IDF transformation to the bag-of-words vectors of substitutes sometimes improves performance.
6. S (=20). The number of representatives for each example. It did not affect performance, so we use the value from [3].
7. L (4–30). The number of substitutes to sample from the top K predictions.
8. z (1.0–3.0). The parameter of the Zipf distribution.
9. \(\varvec{\beta }\) (0.1–0.5). The relative length of the left or right context after which the discounting of the corresponding LM begins.
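The interaction of K, L, and z above can be sketched as follows. The exact sampling scheme is not spelled out in this appendix, so this is an illustrative assumption: each of the L substitutes is drawn from the top-K predictions with probability proportional to its rank raised to the power \(-z\):

```python
import random

def sample_substitutes(top_k_substitutes, L=10, z=1.5, rng=None):
    """Sample L substitutes from a ranked list of top-K predictions, where
    the word at rank r (1-based) is drawn with probability ~ r ** (-z)."""
    rng = rng or random.Random()
    weights = [(r + 1) ** (-z) for r in range(len(top_k_substitutes))]
    return rng.choices(top_k_substitutes, weights=weights, k=L)

rng = random.Random(0)
top_k = [f"w{i}" for i in range(50)]  # ranked substitute candidates, best first
sampled = sample_substitutes(top_k, L=8, z=2.0, rng=rng)
# Higher-ranked substitutes dominate the sample; larger z concentrates
# the samples on the very top predictions, smaller z flattens the choice.
```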
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Arefyev, N., Sheludko, B., Aleksashina, T. (2019). Combining Neural Language Models for Word Sense Induction. In: van der Aalst, W., et al. Analysis of Images, Social Networks and Texts. AIST 2019. Lecture Notes in Computer Science(), vol 11832. Springer, Cham. https://doi.org/10.1007/978-3-030-37334-4_10
DOI: https://doi.org/10.1007/978-3-030-37334-4_10
Print ISBN: 978-3-030-37333-7
Online ISBN: 978-3-030-37334-4