
Data-Driven Part-of-Speech Tagging of Kiswahili

  • Guy De Pauw
  • Gilles-Maurice de Schryver
  • Peter W. Wagacha
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4188)

Abstract

In this paper we present experiments with data-driven part-of-speech taggers trained and evaluated on the annotated Helsinki Corpus of Swahili. Comparing four current state-of-the-art data-driven taggers, TnT, MBT, SVMTool and MXPOST, we find that the last of these, MXPOST, is the most accurate tagger for the Kiswahili dataset. We further improve on the performance of the individual taggers by combining them into a committee of taggers. We observe that the more naive combination methods, like the novel plural voting approach, outperform more elaborate schemes like cascaded classifiers and weighted voting. This paper is the first publication to present experiments on data-driven part-of-speech tagging for Kiswahili and Bantu languages in general.
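To make the idea of a committee of taggers concrete, the following is a minimal sketch of two generic combination schemes: simple majority voting and weighted voting. It is not the paper's implementation, and it does not reproduce the plural voting method, whose details are not given in the abstract. The function names, the tag labels and the weights are all hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def majority_vote(tags):
    """Return the tag proposed by most taggers.

    Ties are broken in favour of the tagger listed first (an assumption).
    """
    counts = Counter(tags)
    best = max(counts.values())
    for tag in tags:
        if counts[tag] == best:
            return tag


def weighted_vote(tags, weights):
    """Return the tag with the highest summed weight.

    Each tagger's vote counts for its weight, e.g. a held-out accuracy.
    Ties are again broken in favour of the tagger listed first.
    """
    scores = defaultdict(float)
    for tag, weight in zip(tags, weights):
        scores[tag] += weight
    best = max(scores.values())
    for tag in tags:
        if scores[tag] == best:
            return tag


def combine(outputs, weights=None):
    """Combine per-tagger tag sequences token by token.

    outputs: one tag sequence per tagger, all of the same length.
    weights: optional list with one weight per tagger.
    """
    combined = []
    for token_tags in zip(*outputs):
        if weights is None:
            combined.append(majority_vote(token_tags))
        else:
            combined.append(weighted_vote(token_tags, weights))
    return combined


# Hypothetical output of four taggers on a three-token sentence.
outputs = [
    ["PRON", "V", "N"],
    ["PRON", "N", "N"],
    ["N", "V", "N"],
    ["PRON", "V", "ADJ"],
]

majority = combine(outputs)
# Hypothetical weights that strongly favour the second tagger.
weighted = combine(outputs, weights=[0.1, 0.9, 0.1, 0.1])
```

With these made-up inputs the two schemes disagree on the second token only: the majority picks the tag proposed by three of the four taggers, while the weighted scheme follows the single heavily weighted tagger.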

Keywords

Combination Method, Weighted Vote, Word Class, System Combination, Unknown Word
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Guy De Pauw (1)
  • Gilles-Maurice de Schryver (2, 3)
  • Peter W. Wagacha (4)

  1. CNTS – Language Technology Group, University of Antwerp, Belgium
  2. African Languages and Cultures, Ghent University, Belgium
  3. Xhosa Department, University of the Western Cape, South Africa
  4. School of Computing and Informatics, University of Nairobi, Kenya
