Context-Driven Corpus-Based Model for Automatic Text Segmentation and Part of Speech Tagging in Setswana Using OpenNLP Tool

  • Mary Ambrossine Dibitso
  • Pius Adewale Owolawi
  • Sunday Olusegun Ojo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11939)


Setswana is an under-resourced, morphologically rich Bantu language with a disjunctive writing system. Developing Natural Language Processing (NLP) pipeline tools for such a language is challenging because the linguistic and semantic robustness of a tool must be balanced against computational parsimony. A Part-of-Speech (POS) tagger is one such tool: it assigns a lexical category such as noun, verb, or pronoun to each word in a text corpus. POS tagging is an important task in NLP applications such as information extraction, machine translation, and word prediction. For a morphologically rich language such as Setswana, building a POS tagger poses computational-linguistic challenges that can affect the effectiveness of the entire NLP system, because some contextual semantic features of the language demand a fine-grained POS tagset. In this paper, a context-driven corpus-based model for text segmentation and POS tagging of Setswana is presented. The tagger is developed using the Apache OpenNLP tool and achieves an accuracy of 96.73%.


Keywords: NLP, Context, Text segmentation, POS tagging, Setswana, OpenNLP


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Mary Ambrossine Dibitso (1)
  • Pius Adewale Owolawi (1)
  • Sunday Olusegun Ojo (1), corresponding author
  1. Faculty of ICT, Tshwane University of Technology, Pretoria, South Africa
