
Improving Morpheme Segmentation Using BERT Embeddings

Conference paper in: Analysis of Images, Social Networks and Texts (AIST 2021)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13217)


Abstract

We propose a method for incorporating BERT embeddings into neural morpheme segmentation. We show that our method significantly improves over the baseline on 6 typologically diverse languages (English, Finnish, Turkish, Estonian, Georgian, and Zulu). Moreover, it establishes a new state of the art on the 4 languages for which language-specific BERT models are available. We demonstrate that the key contributor to performance is not only the BPE vocabulary of BERT but also the embeddings themselves. Additionally, we show that a simpler pretraining task optimizing a word2vec-like subword objective also reaches state-of-the-art performance on 4 of the 6 languages considered.


Notes

  1. See http://morpho.aalto.fi/events/morphochallenge2010/datasets.shtml as an example.

  2. The hyperparameters are in the Appendix.

  3. It is freely available at https://github.com/AlexeySorokin/MorphemeBert.

  4. https://huggingface.co/models.

  5. https://github.com/AlexeySorokin/NeuralMorphemeSegmentation. The models are retrained for 50 epochs with the parameters provided in the repository.

  6. https://morfessor.readthedocs.io/en/latest/.

  7. Training parameters are in the Appendix.

  8. https://github.com/VProv/BPE-Dropout.

  9. Obviously, when the BERT weights are available, training is rather cheap. However, the pretraining cost of BERT is several orders of magnitude higher than that of word2vec-like embeddings.

  10. https://radimrehurek.com/gensim/models/word2vec.html.

  11. https://github.com/google/sentencepiece.

References

  1. Andreev, N.D. (ed.): Statistical and combinatorial language modelling (Statistiko-kombinatornoe modelirovanie iazykov, in Russian). Nauka (1965)

  2. Creutz, M., Lagus, K.: Unsupervised models for morpheme segmentation and morphology learning. ACM Trans. Speech Lang. Process. (TSLP) 4(1), 3 (2007)

  3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186 (2019)

  4. Eskander, R., Callejas, F., Nichols, E., Klavans, J.L., Muresan, S.: MorphAGram, evaluation and framework for unsupervised morphological segmentation. In: Proceedings of the 12th Language Resources and Evaluation Conference, pp. 7112–7122 (2020)

  5. Grönroos, S.A., Virpioja, S., Kurimo, M.: North Sámi morphological segmentation with low-resource semi-supervised sequence labeling. In: Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages, pp. 15–26 (2019)

  6. Grönroos, S.A., Virpioja, S., Kurimo, M.: Morfessor EM+Prune: improved subword segmentation with expectation maximization and pruning. arXiv preprint arXiv:2003.03131 (2020)

  7. Grönroos, S.A., Virpioja, S., Kurimo, M.: Transfer learning and subword sampling for asymmetric-resource one-to-many neural translation. arXiv preprint arXiv:2004.04002 (2020)

  8. Harris, Z.S.: Morpheme boundaries within words: report on a computer test. In: Papers in Structural and Transformational Linguistics. Formal Linguistics Series, pp. 68–77. Springer, Dordrecht (1970). https://doi.org/10.1007/978-94-017-6059-1_3

  9. Heinzerling, B., Strube, M.: BPEmb: tokenization-free pre-trained subword embeddings in 275 languages. In: Calzolari, N., et al. (eds.) Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 7–12 May 2018. European Language Resources Association (ELRA), Miyazaki, Japan (2018)

  10. Hofmann, V., Pierrehumbert, J.B., Schütze, H.: Superbizarre is not superb: improving BERT's interpretations of complex words with derivational morphology. arXiv preprint arXiv:2101.00403 (2021)

  11. Huang, K., Huang, D., Liu, Z., Mo, F.: A joint multiple criteria model in transfer learning for cross-domain Chinese word segmentation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3873–3882 (2020)

  12. Johnson, M., Griffiths, T.L., Goldwater, S.: Adaptor grammars: a framework for specifying compositional nonparametric Bayesian models. In: Advances in Neural Information Processing Systems, pp. 641–648 (2007)

  13. Kann, K., Mager, M., Meza-Ruiz, I., Schütze, H.: Fortification of neural morphological segmentation models for polysynthetic minimal-resource languages. arXiv preprint arXiv:1804.06024 (2018)

  14. Ke, Z., Shi, L., Meng, E., Wang, B., Qiu, X., Huang, X.: Unified multi-criteria Chinese word segmentation with BERT. arXiv preprint arXiv:2004.05808 (2020)

  15. Kudo, T.: Subword regularization: improving neural network translation models with multiple subword candidates. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 66–75 (2018)

  16. Kudo, T., Richardson, J.: SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66–71 (2018)

  17. Kurimo, M., Virpioja, S., Turunen, V., Lagus, K.: Morpho Challenge competition 2005–2010: evaluations and results. In: Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, pp. 87–95. Association for Computational Linguistics (2010)

  18. Mager, M., Maier, E., Medina-Urrea, A., Meza-Ruiz, I., Kann, K.: Lost in translation: analysis of information loss during machine translation between polysynthetic and fusional languages. In: Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages, pp. 73–83 (2018)

  19. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)

  20. Provilkov, I., Emelianenko, D., Voita, E.: BPE-dropout: simple and effective subword regularization. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 1882–1892 (2020)

  21. Ruokolainen, T., Kohonen, O., Virpioja, S., Kurimo, M.: Supervised morphological segmentation in a low-resource learning setting using conditional random fields. In: Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 29–37 (2013)

  22. Ruokolainen, T., Kohonen, O., Virpioja, S., et al.: Painless semi-supervised morphological segmentation using conditional random fields. In: Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers, pp. 84–89 (2014)

  23. Sennrich, R., Haddow, B., Birch, A.: Neural machine translation of rare words with subword units. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725 (2016)

  24. Shafey, L.E., Soltau, H., Shafran, I.: Joint speech recognition and speaker diarization via sequence transduction. arXiv preprint arXiv:1907.05337 (2019)

  25. Sorokin, A.: Convolutional neural networks for low-resource morpheme segmentation: baseline or state-of-the-art? In: Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology, pp. 154–159 (2019)

  26. Sorokin, A., Kravtsova, A.: Deep convolutional networks for supervised morpheme segmentation of Russian language. In: Ustalov, D., Filchenkov, A., Pivovarova, L., Žižka, J. (eds.) AINL 2018. CCIS, vol. 930, pp. 3–10. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01204-5_1

  27. Tian, Y., Song, Y., Xia, F., Zhang, T., Wang, Y.: Improving Chinese word segmentation with wordhood memory networks. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8274–8285 (2020)

  28. Ulčar, M., Robnik-Šikonja, M.: FinEst BERT and CroSloEngual BERT: less is more in multilingual models. arXiv preprint arXiv:2006.07890 (2020)

  29. Virpioja, S., Smit, P., Grönroos, S.A., Kurimo, M., et al.: Morfessor 2.0: Python implementation and extensions for Morfessor Baseline (2013)

  30. Yao, Y., Huang, Z.: Bi-directional LSTM recurrent neural network for Chinese word segmentation. In: Hirose, A., Ozawa, S., Doya, K., Ikeda, K., Lee, M., Liu, D. (eds.) ICONIP 2016. LNCS, vol. 9950, pp. 345–353. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46681-1_42


Acknowledgements

The author thanks Alexander Panin for helpful discussions and ideas. He is also grateful to Natalia Loukachevitch for communication during the preparation of this paper.

Author information

Correspondence to Alexey Sorokin.

Appendix

8.1 Preprocessing Details

We remove accents from the Finnish training data, as was done during the training of Finnish BERT. In all wordlists we keep only words of at least 5 letters that consist of alphabetic characters and hyphens.
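
A minimal sketch of this filtering, assuming Unicode NFD decomposition for accent removal and Python's isalpha test for "alphabetic characters" (both details are our assumptions, not taken from the paper):

```python
import unicodedata

def strip_accents(word):
    # Decompose characters and drop combining marks
    # (used for the Finnish data, mirroring Finnish BERT preprocessing).
    return "".join(ch for ch in unicodedata.normalize("NFD", word)
                   if not unicodedata.combining(ch))

def keep_word(word):
    # Keep only words with at least 5 letters that consist of
    # alphabetic characters and hyphens.
    return (sum(ch.isalpha() for ch in word) >= 5
            and all(ch.isalpha() or ch == "-" for ch in word))
```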

8.2 BERT Models

Table 7. BERT models used in the paper.

All the BERT models used in the paper are either language-specific or multilingual BERT-base models. In particular, the embedding dimension is 768 (Fig. 8).
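
As an illustration of accessing such a model from the HuggingFace hub (the checkpoint name below is an example, not necessarily one of the models listed in Table 7):

```python
from transformers import AutoModel, AutoTokenizer

# "bert-base-multilingual-cased" is an illustrative choice of checkpoint.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

print(model.config.hidden_size)      # 768 for BERT-base models
print(tokenizer.tokenize("unbelievable"))  # subword pieces of the word
embeddings = model.get_input_embeddings()  # (vocab_size x 768) matrix
```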

8.3 Convolutional Network Parameters

Table 8. Model parameters.

Network parameters were selected by manual search, using word accuracy on the Finnish data, and transferred to the other languages without change.
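
A rough sketch of the kind of character-level convolutional segmenter described in [25, 26], predicting a boundary tag per character; all layer sizes, kernel widths, and names here are illustrative assumptions, and the actual parameters are those of Table 8:

```python
import torch
import torch.nn as nn

class CharCNNSegmenter(nn.Module):
    """Character-level CNN emitting per-character boundary-tag logits."""

    def __init__(self, vocab_size, num_tags, emb_dim=64, channels=128,
                 kernel_sizes=(3, 5), extra_dim=0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Parallel convolutions over the character sequence.
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, channels, k, padding=k // 2)
            for k in kernel_sizes
        )
        # extra_dim allows concatenating external per-character features,
        # e.g. BERT-derived embeddings, as in the paper's main model.
        self.out = nn.Linear(len(kernel_sizes) * channels + extra_dim, num_tags)

    def forward(self, chars, extra=None):
        x = self.embed(chars).transpose(1, 2)           # (batch, emb, length)
        h = torch.cat([torch.relu(c(x)) for c in self.convs], dim=1)
        h = h.transpose(1, 2)                           # (batch, length, feats)
        if extra is not None:
            h = torch.cat([h, extra], dim=-1)
        return self.out(h)                              # per-character logits
```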

8.4 Ablation Studies Details

We train the word2vec model with the default parameters of Gensim (see footnote 10) for 10 epochs on small datasets (50,000 words) and for 5 epochs on larger ones. In particular, the algorithm is CBOW, the embedding dimension is 300, and the window size is 2. We learn subword vocabularies with the BPE [23] algorithm using the SentencePiece (see footnote 11) library. We use the command below to train the model:

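The exact invocation appears as a code figure in the original; a hedged reconstruction of the two training steps consistent with the description above (file names and the vocabulary size are illustrative assumptions) is:

```python
import sentencepiece as spm
from gensim.models import Word2Vec

# word2vec pretraining: CBOW, 300-dimensional embeddings, window size 2,
# otherwise default Gensim parameters; "corpus.txt" is an illustrative name.
sentences = [line.split() for line in open("corpus.txt", encoding="utf-8")]
w2v = Word2Vec(sentences, vector_size=300, window=2, sg=0, epochs=10)

# BPE vocabulary learning with SentencePiece; vocab_size is an assumption.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="bpe",
    model_type="bpe",
    vocab_size=10000,
)
```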


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Sorokin, A. (2022). Improving Morpheme Segmentation Using BERT Embeddings. In: Burnaev, E., et al. Analysis of Images, Social Networks and Texts. AIST 2021. Lecture Notes in Computer Science, vol 13217. Springer, Cham. https://doi.org/10.1007/978-3-031-16500-9_13


  • DOI: https://doi.org/10.1007/978-3-031-16500-9_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16499-6

  • Online ISBN: 978-3-031-16500-9

  • eBook Packages: Computer Science, Computer Science (R0)
