
Improving Morpheme Segmentation Using BERT Embeddings

Conference paper in: Analysis of Images, Social Networks and Texts (AIST 2021)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13217)


Abstract

We propose a method for incorporating BERT embeddings into neural morpheme segmentation. We show that our method significantly improves over the baseline on 6 typologically diverse languages (English, Finnish, Turkish, Estonian, Georgian, and Zulu). Moreover, it establishes a new state of the art on the 4 languages for which language-specific BERT models are available. We demonstrate that the key contributor to performance is not only the BPE vocabulary of BERT but also the embeddings themselves. Additionally, we show that a simpler pretraining task optimizing a word2vec-like subword objective also reaches state-of-the-art performance on 4 of the 6 languages considered.


Notes

  1. See http://morpho.aalto.fi/events/morphochallenge2010/datasets.shtml as an example.

  2. The hyperparameters are in the Appendix.

  3. It is freely available at https://github.com/AlexeySorokin/MorphemeBert.

  4. https://huggingface.co/models.

  5. https://github.com/AlexeySorokin/NeuralMorphemeSegmentation. The models are retrained for 50 epochs with the parameters provided in the repository.

  6. https://morfessor.readthedocs.io/en/latest/.

  7. Training parameters are in the Appendix.

  8. https://github.com/VProv/BPE-Dropout.

  9. Obviously, when the BERT weights are available, training is rather cheap. However, the pretraining cost of BERT is several orders of magnitude higher than that of word2vec-like embeddings.

  10. https://radimrehurek.com/gensim/models/word2vec.html.

  11. https://github.com/google/sentencepiece.

References

  1. Andreev, N.D. (ed.): Statistical and combinatorial language modelling (Statistiko-kombinatornoe modelirovanie iazykov, in Russian). Nauka (1965)

  2. Creutz, M., Lagus, K.: Unsupervised models for morpheme segmentation and morphology learning. ACM Trans. Speech Lang. Process. (TSLP) 4(1), 3 (2007)

  3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019)

  4. Eskander, R., Callejas, F., Nichols, E., Klavans, J.L., Muresan, S.: MorphAGram, evaluation and framework for unsupervised morphological segmentation. In: Proceedings of the 12th Language Resources and Evaluation Conference, pp. 7112–7122 (2020)

  5. Grönroos, S.A., Virpioja, S., Kurimo, M.: North Sámi morphological segmentation with low-resource semi-supervised sequence labeling. In: Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages, pp. 15–26 (2019)

  6. Grönroos, S.A., Virpioja, S., Kurimo, M.: Morfessor EM+Prune: improved subword segmentation with expectation maximization and pruning. arXiv preprint arXiv:2003.03131 (2020)

  7. Grönroos, S.A., Virpioja, S., Kurimo, M.: Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation. arXiv preprint arXiv:2004.04002 (2020)

  8. Harris, Z.S.: Morpheme boundaries within words: report on a computer test. In: Papers in Structural and Transformational Linguistics. Formal Linguistics Series, pp. 68–77. Springer, Dordrecht (1970). https://doi.org/10.1007/978-94-017-6059-1_3

  9. Heinzerling, B., Strube, M.: BPEmb: tokenization-free pre-trained subword embeddings in 275 languages. In: Calzolari, N., et al. (eds.) Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 7–12 May 2018. European Language Resources Association (ELRA), Miyazaki, Japan (2018)

  10. Hofmann, V., Pierrehumbert, J.B., Schütze, H.: Superbizarre is not superb: improving BERT's interpretations of complex words with derivational morphology. arXiv preprint arXiv:2101.00403 (2021)

  11. Huang, K., Huang, D., Liu, Z., Mo, F.: A joint multiple criteria model in transfer learning for cross-domain Chinese word segmentation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3873–3882 (2020)

  12. Johnson, M., Griffiths, T.L., Goldwater, S.: Adaptor grammars: a framework for specifying compositional nonparametric Bayesian models. In: Advances in Neural Information Processing Systems, pp. 641–648 (2007)

  13. Kann, K., Mager, M., Meza-Ruiz, I., Schütze, H.: Fortification of neural morphological segmentation models for polysynthetic minimal-resource languages. arXiv preprint arXiv:1804.06024 (2018)

  14. Ke, Z., Shi, L., Meng, E., Wang, B., Qiu, X., Huang, X.: Unified multi-criteria Chinese word segmentation with BERT. arXiv preprint arXiv:2004.05808 (2020)

  15. Kudo, T.: Subword regularization: improving neural network translation models with multiple subword candidates. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 66–75 (2018)

  16. Kudo, T., Richardson, J.: SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66–71 (2018)

  17. Kurimo, M., Virpioja, S., Turunen, V., Lagus, K.: Morpho Challenge competition 2005–2010: evaluations and results. In: Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pp. 87–95. Association for Computational Linguistics (2010)

  18. Mager, M., Maier, E., Medina-Urrea, A., Meza-Ruiz, I., Kann, K.: Lost in translation: analysis of information loss during machine translation between polysynthetic and fusional languages. In: Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages, pp. 73–83 (2018)

  19. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)

  20. Provilkov, I., Emelianenko, D., Voita, E.: BPE-dropout: simple and effective subword regularization. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1882–1892 (2020)

  21. Ruokolainen, T., Kohonen, O., Virpioja, S., Kurimo, M.: Supervised morphological segmentation in a low-resource learning setting using conditional random fields. In: Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 29–37 (2013)

  22. Ruokolainen, T., Kohonen, O., Virpioja, S., et al.: Painless semi-supervised morphological segmentation using conditional random fields. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers, pp. 84–89 (2014)

  23. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725 (2016)

  24. Shafey, L.E., Soltau, H., Shafran, I.: Joint speech recognition and speaker diarization via sequence transduction. arXiv preprint arXiv:1907.05337 (2019)

  25. Sorokin, A.: Convolutional neural networks for low-resource morpheme segmentation: baseline or state-of-the-art? In: Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pp. 154–159 (2019)

  26. Sorokin, A., Kravtsova, A.: Deep convolutional networks for supervised morpheme segmentation of Russian language. In: Ustalov, D., Filchenkov, A., Pivovarova, L., Žižka, J. (eds.) AINL 2018. CCIS, vol. 930, pp. 3–10. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01204-5_1

  27. Tian, Y., Song, Y., Xia, F., Zhang, T., Wang, Y.: Improving Chinese word segmentation with wordhood memory networks. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8274–8285 (2020)

  28. Ulčar, M., Robnik-Šikonja, M.: FinEst BERT and CroSloEngual BERT: less is more in multilingual models. arXiv preprint arXiv:2006.07890 (2020)

  29. Virpioja, S., Smit, P., Grönroos, S.A., Kurimo, M., et al.: Morfessor 2.0: Python implementation and extensions for Morfessor Baseline (2013)

  30. Yao, Y., Huang, Z.: Bi-directional LSTM recurrent neural network for Chinese word segmentation. In: Hirose, A., Ozawa, S., Doya, K., Ikeda, K., Lee, M., Liu, D. (eds.) ICONIP 2016. LNCS, vol. 9950, pp. 345–353. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46681-1_42


Acknowledgements

The author thanks Alexander Panin for helpful discussions and ideas. He is also grateful to Natalia Loukachevitch for communication during the preparation of this paper.

Author information

Correspondence to Alexey Sorokin.

Appendix

8.1 Preprocessing Details

We remove accents from the Finnish training data, as was done during the training of Finnish BERT. In all wordlists we keep only words of at least 5 letters that consist of alphabetic characters and hyphens.
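
A minimal sketch of this filtering, assuming Unicode NFD decomposition for accent removal and Python's isalpha test for "alphabetic characters" (both details are our assumptions, not taken from the paper):

```python
import unicodedata

def strip_accents(word):
    # Decompose characters and drop combining marks
    # (used for the Finnish data, mirroring Finnish BERT preprocessing).
    return "".join(ch for ch in unicodedata.normalize("NFD", word)
                   if not unicodedata.combining(ch))

def keep_word(word):
    # Keep only words with at least 5 letters that consist of
    # alphabetic characters and hyphens.
    return (sum(ch.isalpha() for ch in word) >= 5
            and all(ch.isalpha() or ch == "-" for ch in word))
```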

8.2 BERT Models

Table 7. BERT models used in the paper.

All the BERT models used in the paper are either language-specific or multilingual BERT-base models. In particular, the embedding dimension is 768 (Fig. 8).
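
As an illustration of accessing such a model from the HuggingFace hub (the checkpoint name below is an example, not necessarily one of the models listed in Table 7):

```python
from transformers import AutoModel, AutoTokenizer

# "bert-base-multilingual-cased" is an illustrative choice of checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

print(model.config.hidden_size)      # 768 for BERT-base models
print(tokenizer.tokenize("unbelievable"))  # subword pieces of the word
embeddings = model.get_input_embeddings()  # (vocab_size x 768) matrix
```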

8.3 Convolutional Network Parameters

Table 8. Model parameters.

Network parameters were selected by manual search, using word accuracy on the Finnish data, and transferred to the other languages without change.
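
A rough sketch of the kind of character-level convolutional segmenter described in [25, 26], predicting a boundary tag per character; all layer sizes, kernel widths, and names here are illustrative assumptions, and the actual parameters are those of Table 8:

```python
import torch
import torch.nn as nn

class CharCNNSegmenter(nn.Module):
    """Character-level CNN emitting per-character boundary-tag logits."""

    def __init__(self, vocab_size, num_tags, emb_dim=64, channels=128,
                 kernel_sizes=(3, 5), extra_dim=0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Parallel convolutions over the character sequence.
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, channels, k, padding=k // 2)
            for k in kernel_sizes
        )
        # extra_dim allows concatenating external per-character features,
        # e.g. BERT-derived embeddings, as in the paper's main model.
        self.out = nn.Linear(len(kernel_sizes) * channels + extra_dim, num_tags)

    def forward(self, chars, extra=None):
        x = self.embed(chars).transpose(1, 2)           # (batch, emb, length)
        h = torch.cat([torch.relu(c(x)) for c in self.convs], dim=1)
        h = h.transpose(1, 2)                           # (batch, length, feats)
        if extra is not None:
            h = torch.cat([h, extra], dim=-1)
        return self.out(h)                              # per-character logits
```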

8.4 Ablation Studies Details

We train the word2vec model with the default parameters of Gensim (see footnote 10) for 10 epochs on small datasets (50,000 words) and for 5 epochs on larger ones. In particular, the algorithm is CBOW, the embedding dimension is 300, and the window size is 2. We learn subword vocabularies with the BPE [23] algorithm using the SentencePiece (see footnote 11) library. We use the command below to train the model:

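The exact invocation appears as a code figure in the original; a hedged reconstruction of the two training steps consistent with the description above (file names and the vocabulary size are illustrative assumptions) is:

```python
import sentencepiece as spm
from gensim.models import Word2Vec

# word2vec pretraining: CBOW, 300-dimensional embeddings, window size 2,
# otherwise default Gensim parameters; "corpus.txt" is an illustrative name.
sentences = [line.split() for line in open("corpus.txt", encoding="utf-8")]
w2v = Word2Vec(sentences, vector_size=300, window=2, sg=0, epochs=10)

# BPE vocabulary learning with SentencePiece; vocab_size is an assumption.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="bpe",
    model_type="bpe",
    vocab_size=10000,
)
```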


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Sorokin, A. (2022). Improving Morpheme Segmentation Using BERT Embeddings. In: Burnaev, E., et al. Analysis of Images, Social Networks and Texts. AIST 2021. Lecture Notes in Computer Science, vol 13217. Springer, Cham. https://doi.org/10.1007/978-3-031-16500-9_13


  • DOI: https://doi.org/10.1007/978-3-031-16500-9_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16499-6

  • Online ISBN: 978-3-031-16500-9

  • eBook Packages: Computer Science, Computer Science (R0)
