Tswana finite state tokenisation

  • Original Research
Language Resources and Evaluation

Abstract

Tswana, a Bantu language in the Sotho group, is characterised by an agglutinative morphology and a disjunctive orthography, which mainly affects the verb category. In particular, verbal prefixes are usually written disjunctively, while suffixes are written conjunctively. Tswana tokenisation therefore cannot be based solely on whitespace, as it can in many alphabetic, segmented languages, including the conjunctively written Nguni group of South African Bantu languages. This paper shows how two finite state tokeniser transducers and a finite state morphological analyser are combined to solve the Tswana (verb) tokenisation problem. The approach has the important advantage of bringing the processing of Tswana, beyond the morphological analysis level, in line with what is appropriate for the Nguni languages. The challenge of the disjunctive orthography is thus met at the tokenisation/morphological analysis level and does not in principle propagate to subsequent levels of analysis such as POS tagging and shallow parsing. The tokenisation approach is novel and, when implemented and evaluated, yields an F1-score of 95% with respect to a hand-tokenised gold standard.
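To illustrate what an F1-score against a hand-tokenised gold standard measures, the following is a minimal sketch and not the authors' evaluation code. It assumes that tokens are compared as character spans over the whitespace-free text; the abstract does not state the exact matching criterion. The function names `token_spans` and `tokenisation_f1` are hypothetical, and the sample phrase is an illustrative Tswana string that is not taken from the article.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]


def token_spans(tokens: List[str]) -> Set[Span]:
    """Map tokens to (start, end) offsets over the text with spaces removed.

    Removing spaces lets a multi-part token such as a disjunctively written
    verb be compared with whitespace-delimited pieces of the same text.
    """
    spans: Set[Span] = set()
    pos = 0
    for tok in tokens:
        length = len(tok.replace(" ", ""))
        spans.add((pos, pos + length))
        pos += length
    return spans


def tokenisation_f1(predicted: List[str], gold: List[str]) -> float:
    """Span-level F1 of a predicted tokenisation against a gold tokenisation."""
    pred_spans = token_spans(predicted)
    gold_spans = token_spans(gold)
    correct = len(pred_spans & gold_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Illustrative input (an assumption, not data from the article).
text = "ke a reka dibuka"

# Baseline: split on whitespace only.
whitespace_tokens = text.split()

# Assumed gold standard: the verb with its disjunctively written
# prefixes counts as a single token.
gold_tokens = ["ke a reka", "dibuka"]

baseline_f1 = tokenisation_f1(whitespace_tokens, gold_tokens)
perfect_f1 = tokenisation_f1(gold_tokens, gold_tokens)
```

Under these assumptions the whitespace baseline recovers only the final token, which is the kind of shortfall the finite state approach in the abstract is designed to avoid.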

Notes

  1. Private communication from Anderson (2014).

  2. This is an example of an auxiliary verb (kake) used in a negative verb construction in the future tense. “Negative forms in which auxiliaries are employed are emphatic in significance.” (Krüger 2006:261).

References

  • Anderson, W. N. (2014). Private communication.

  • Anderson, W. N. & Kotzé, P. M. (2006). Finite state tokenisation of an orthographical disjunctive agglutinative language: The verbal segment of Northern Sotho. In Proceedings of the 5th international conference on language resources and evaluation, Genoa, Italy, May 22–28, 2006.

  • Beesley, K. R. & Karttunen, L. (2003). Finite state morphology. Cambridge: Cambridge University Press.

  • Cole, D. T. & Moncho-Warren, L. (2012). Setswana and English illustrated dictionary. Northlands, Gauteng, SA: MacMillan South Africa.

  • Dixon, R. M. W. & Aikhenvald, A. Y. (2002). Word: A cross-linguistic typology. Cambridge: Cambridge University Press.

  • Farghaly, A. (2003). Handbook for language engineers. Stanford University: CSLI Publications.

  • Forst, M. & Kaplan, R. M. (2006). The importance of precise tokenization for deep grammars. In Proceedings of the 5th international conference on language resources and evaluation, Genoa, Italy, May 22–28, 2006.

  • Hurskainen, A., Louwrens, L. & Poulos, G. (2005). Computational description of verbs in disjoining writing systems. Nordic Journal of African Studies, 14(4), 438–451.

  • Jurafsky, D. & Martin, J. H. (2009). Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition (2nd ed.). New Jersey: Pearson Education.

  • Kosch, I. M. (2006). Topics in morphology in the African language context. Pretoria: Unisa Press.

  • Kotzé, P. M. (2011). Tokenization rules for the disjunctively written verbal segment of Northern Sotho. South African Journal of African Languages, 31(1), 121–137.

  • Krüger, C. J. H. (2006). Introduction to the morphology of Tswana. München: Lincom Europe.

  • Mikheev, A. (2003). Text segmentation. In R. Mitkov (Ed.), The Oxford handbook of computational linguistics (pp. 201–218). Oxford: Oxford University Press.

  • Otlogetswe, T. J. (2007). Corpus design for Tswana lexicography. Ph.D. thesis. University of Pretoria, Pretoria, South Africa.

  • Palmer, D. D. (2000). Tokenisation and sentence segmentation. In R. Dale, H. Moisl & H. Somers (Eds.), Handbook of natural language processing (pp. 11–35). New York: Marcel Dekker Inc.

  • Poulos, G. & Louwrens, L. J. (1994). A linguistic analysis of Northern Sotho. Pretoria, South Africa: Via Africa.

  • Poulos, G. & Msimang, C. T. (1998). A linguistic analysis of Zulu. Pretoria, South Africa: Via Africa.

  • Pretorius, R. S. (1997). Auxiliary verbs as a sub-category of the verb in Tswana. Ph.D. thesis. Potchefstroom University for CHE, Potchefstroom, South Africa.

  • Pretorius, R., Berg, A. & Pretorius, L. (2012). Multiple object agreement morphemes in Tswana: A computational approach. Southern African Linguistics and Applied Language Studies, Special issue: Language technology in Southern Africa: Subject and object marking in Bantu, 30(2), 203–218.

  • Pretorius, R., Berg, A., Pretorius, L. & Viljoen, B. (2009). Setswana tokenisation and computational verb morphology: Facing the challenge of a disjunctive orthography. In G. De Pauw, G. M. de Schryver & L. Levin (Eds.), Proceedings of the first workshop on language technologies for African Languages (AfLaT ’09) (pp. 66–73). Stroudsburg, PA: Association for Computational Linguistics.

  • Pretorius, R., Viljoen, B. & Pretorius, L. (2005). A finite-state morphological analysis of Tswana nouns. South African Journal of African Languages, 25(1), 48–58.

  • Pretorius, L., Viljoen, B., Pretorius, R. & Berg, A. (2008). Towards a computational morphological analysis of Tswana compounds. Literator, 29(1), 1–20.

  • Taljard, E. & Bosch, S. E. (2006). A comparison of approaches towards word class tagging: Disjunctively versus conjunctively written Bantu languages. Nordic Journal of African Studies, 15(4), 428–442.

  • Van Wyk, E. B. (1958). Woordverdeling in Noord-Sotho en Zoeloe: ’n Bydrae tot die vraagstuk van woordidentifikasie in die Bantoetale. Pretoria: University of Pretoria.

  • Van Wyk, E. B. (1967). The word classes of Northern Sotho. Lingua, 17(2), 230–261.

Author information

Corresponding author

Correspondence to Laurette Pretorius.

Appendix

See Appendix Table 13.

Table 13 Morphological tags and their descriptions

About this article

Cite this article

Pretorius, L., Viljoen, B., Berg, A. et al. Tswana finite state tokenisation. Lang Resources & Evaluation 49, 831–856 (2015). https://doi.org/10.1007/s10579-014-9292-1

  • DOI: https://doi.org/10.1007/s10579-014-9292-1
