Abstract
Morphological segmentation involves decomposing words into morphemes, the smallest meaning-bearing units of language. This is an important NLP task for morphologically rich agglutinative languages such as the Southern African Nguni language group. In this paper, we investigate supervised and unsupervised models for two variants of morphological segmentation: canonical and surface segmentation. We train sequence-to-sequence models for canonical segmentation, in which the underlying morphemes may differ from the surface form of the word, and Conditional Random Fields (CRFs) for surface segmentation. Transformers outperform LSTMs with attention on canonical segmentation, obtaining an average F1 score of 72.5% across four languages. Feature-based CRFs outperform bidirectional LSTM-CRFs, obtaining an average F1 score of 97.1% on surface segmentation. In the unsupervised setting, an entropy-based approach using a character-level LSTM language model fails to outperform a Morfessor baseline, and on some of the languages neither approach performs much better than a random baseline. We hope that the high accuracy of the supervised segmentation models will help to facilitate the development of better NLP tools for Nguni languages.
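To make the surface segmentation setting concrete, the sketch below shows one common way of casting it as character-level sequence labelling, the form a CRF can be trained on, together with a morpheme-level F1 score. It is an illustration under stated assumptions, not the paper's implementation: the two-tag B/M label scheme, the function names, the hyphen separator, and the span-based definition of F1 are all assumptions, since the abstract does not specify them. The isiZulu word `ngiyabonga`, segmented as `ngi-ya-bong-a`, is used purely as an illustrative example. Canonical segmentation cannot in general be expressed this way, because its morphemes need not be substrings of the word.

```python
from typing import List, Set, Tuple


def to_labels(segmented: str, sep: str = "-") -> List[str]:
    """Label each character: B begins a morpheme, M continues one (assumed scheme)."""
    labels: List[str] = []
    for morpheme in segmented.split(sep):
        if not morpheme:  # skip empty pieces from doubled separators
            continue
        labels.extend(["B"] + ["M"] * (len(morpheme) - 1))
    return labels


def from_labels(word: str, labels: List[str], sep: str = "-") -> str:
    """Rebuild a segmented string from a word and its per-character labels."""
    pieces: List[str] = []
    for i, (char, label) in enumerate(zip(word, labels)):
        if label == "B" and i > 0:
            pieces.append(sep)
        pieces.append(char)
    return "".join(pieces)


def morpheme_spans(segmented: str, sep: str = "-") -> Set[Tuple[int, int]]:
    """Return the (start, end) character span of every morpheme."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for morpheme in segmented.split(sep):
        if not morpheme:
            continue
        spans.add((start, start + len(morpheme)))
        start += len(morpheme)
    return spans


def morpheme_f1(gold: str, predicted: str, sep: str = "-") -> float:
    """Span-level F1 between a gold and a predicted segmentation (assumed metric)."""
    gold_spans = morpheme_spans(gold, sep)
    pred_spans = morpheme_spans(predicted, sep)
    true_positives = len(gold_spans & pred_spans)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_spans)
    recall = true_positives / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    labels = to_labels("ngi-ya-bong-a")
    print(labels)
    print(from_labels("ngiyabonga", labels))
    print(morpheme_f1("ngi-ya-bong-a", "ngi-ya-bonga"))
```

In this sketch a prediction that misses the final boundary (`ngi-ya-bonga`) recovers two of the four gold morphemes, so it is penalised on both precision and recall.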
Keywords
- Natural language processing
- Morphology
- Nguni languages
- Conditional random fields
- Sequence to sequence models
- Unsupervised learning
Notes
- 1. The Sotho-Tswana languages, the other major South African language group, are written disjunctively: morphemes are generally written as separate words, despite the languages being agglutinative.
- 2. Datasets are available at https://repo.sadilar.org/handle/20.500.12185/7.
Acknowledgements
This work is based on research supported in part by the National Research Foundation of South Africa (Grant Number: 129850) and the South African Centre for High Performance Computing. We thank Zola Mahlaza for valuable feedback, and Francois Meyer for running an additional baseline.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Moeng, T., Reay, S., Daniels, A., Buys, J. (2022). Canonical and Surface Morphological Segmentation for Nguni Languages. In: Jembere, E., Gerber, A.J., Viriri, S., Pillay, A. (eds) Artificial Intelligence Research. SACAIR 2021. Communications in Computer and Information Science, vol 1551. Springer, Cham. https://doi.org/10.1007/978-3-030-95070-5_9
DOI: https://doi.org/10.1007/978-3-030-95070-5_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-95069-9
Online ISBN: 978-3-030-95070-5
eBook Packages: Computer Science, Computer Science (R0)