Canonical and Surface Morphological Segmentation for Nguni Languages

Part of the Communications in Computer and Information Science book series (CCIS, volume 1551)

Abstract

Morphological segmentation involves decomposing words into morphemes, the smallest meaning-bearing units of language. This is an important NLP task for morphologically rich agglutinative languages such as the Southern African Nguni language group. In this paper, we investigate supervised and unsupervised models for two variants of morphological segmentation: canonical and surface segmentation. We train sequence-to-sequence models for canonical segmentation, where the underlying morphemes may not be equal to the surface form of the word, and Conditional Random Fields (CRFs) for surface segmentation. Transformers outperform LSTMs with attention on canonical segmentation, obtaining an average F1 score of 72.5% across four languages. Feature-based CRFs outperform bidirectional LSTM-CRFs, obtaining an average F1 score of 97.1% on surface segmentation. In the unsupervised setting, an entropy-based approach using a character-level LSTM language model fails to outperform a Morfessor baseline, and on some of the languages neither approach performs much better than a random baseline. We hope that the high accuracy of the supervised segmentation models will help to facilitate the development of better NLP tools for Nguni languages.
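To make the surface segmentation task concrete, the following is a minimal Python sketch of how a segmented word can be framed as per-character sequence labelling (the kind of input a CRF tagger consumes) and how a morpheme-level F1 score might be computed. The B/M label scheme, the function names, the span-based matching rule, and the example segmentation of isiZulu "ngiyabonga" are illustrative assumptions; the abstract does not specify the paper's label set or its exact evaluation protocol.

```python
from typing import List, Set, Tuple


def segments_to_labels(segments: List[str]) -> List[Tuple[str, str]]:
    """Turn a surface segmentation into (character, label) pairs.

    Assumed scheme: B = first character of a morpheme, M = any later character.
    """
    labelled = []
    for morpheme in segments:
        for i, ch in enumerate(morpheme):
            labelled.append((ch, "B" if i == 0 else "M"))
    return labelled


def labels_to_segments(chars: List[str], labels: List[str]) -> List[str]:
    """Rebuild morphemes from characters and their B/M labels."""
    segments: List[str] = []
    for ch, label in zip(chars, labels):
        if label == "B" or not segments:
            segments.append(ch)  # start a new morpheme
        else:
            segments[-1] += ch   # extend the current morpheme
    return segments


def _spans(segments: List[str]) -> Set[Tuple[int, int]]:
    """Character (start, end) offsets of each morpheme within the word."""
    out, start = set(), 0
    for seg in segments:
        out.add((start, start + len(seg)))
        start += len(seg)
    return out


def morpheme_f1(predicted: List[str], gold: List[str]) -> float:
    """F1 over morphemes, counting a morpheme as correct if its span matches.

    Span matching only makes sense for surface segmentation, where the
    morphemes concatenate back to the original word.
    """
    p, g = _spans(predicted), _spans(gold)
    correct = len(p & g)
    if correct == 0:
        return 0.0
    precision = correct / len(p)
    recall = correct / len(g)
    return 2 * precision * recall / (precision + recall)


# Illustrative example (not taken from the paper's datasets).
gold = ["ngi", "ya", "bong", "a"]
pairs = segments_to_labels(gold)
print("".join(label for _, label in pairs))
print(morpheme_f1(["ngi", "ya", "bonga"], gold))
```

Canonical segmentation does not fit this labelling scheme directly, because the canonical morphemes need not concatenate to the surface word; that is consistent with the abstract's use of sequence-to-sequence models for that variant.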

Keywords

  • Natural language processing
  • Morphology
  • Nguni languages
  • Conditional random fields
  • Sequence to sequence models
  • Unsupervised learning

Notes

  1. The Sotho-Tswana languages, the other major South African language group, are written disjunctively: morphemes are generally written as separate words, despite the languages being agglutinative.

  2. Datasets are available at https://repo.sadilar.org/handle/20.500.12185/7.

  3. https://github.com/bentrevett/pytorch-seq2seq.

  4. Available at https://repo.sadilar.org/handle/20.500.12185/7/discover?filtertype=type&filter_relational_operator=equals&filter=Modules.

  5. https://github.com/TeamHG-Memex/sklearn-crfsuite.

  6. https://github.com/jidasheng/bi-lstm-crf.

Acknowledgements

This work is based on research supported in part by the National Research Foundation of South Africa (Grant Number: 129850) and the South African Centre for High Performance Computing. We thank Zola Mahlaza for valuable feedback, and Francois Meyer for running an additional baseline.

Author information

Corresponding author

Correspondence to Jan Buys.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Moeng, T., Reay, S., Daniels, A., Buys, J. (2022). Canonical and Surface Morphological Segmentation for Nguni Languages. In: Jembere, E., Gerber, A.J., Viriri, S., Pillay, A. (eds) Artificial Intelligence Research. SACAIR 2021. Communications in Computer and Information Science, vol 1551. Springer, Cham. https://doi.org/10.1007/978-3-030-95070-5_9

  • DOI: https://doi.org/10.1007/978-3-030-95070-5_9

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-95069-9

  • Online ISBN: 978-3-030-95070-5

  • eBook Packages: Computer Science, Computer Science (R0)