Educating Lia: The Development of a Linguistically Accurate Memory-Based Lemmatiser for Afrikaans

  • Hendrik J. Groenewald
Part of the IFIP International Federation for Information Processing book series (IFIPAICT, volume 228)


This paper describes the development of Lia, a memory-based lemmatiser for Afrikaans. It opens with a brief overview of Afrikaans lemmatisation, which is treated here as a simplified form of morphological analysis. The overview is followed by an introduction to memory-based learning, the machine learning technique used to develop the lemmatiser. The deployment of Lia is then discussed, with specific emphasis on the format of the training and testing data. In the evaluation, Lia achieves a linguistic accuracy of over 90%. The paper concludes with ideas for future work to improve the linguistic accuracy of the lemmatiser.
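The abstract does not specify the feature layout or class encoding that Lia uses, so the following is only a rough illustration of how memory-based learning can be applied to lemmatisation in general. It assumes that each word is represented by its last six letters (right-aligned), that each class is a "strip these letters, append those letters" instruction, and that a new word takes the class of its nearest stored example under a simple overlap distance. The names `MemoryBasedLemmatiser`, `features`, `make_class`, and `apply_class`, as well as the tiny training set, are hypothetical and are not taken from the paper.

```python
from collections import Counter

WIDTH = 6  # number of right-aligned letter features (an assumption, not Lia's setting)


def features(word, width=WIDTH):
    """Right-align the last `width` letters of the word, padding with '_'."""
    return tuple(word[-width:].rjust(width, "_"))


def make_class(word, lemma):
    """Encode the word-to-lemma change as (letters to strip, letters to append)."""
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return (word[i:], lemma[i:])


def apply_class(word, cls):
    """Apply a (strip, append) class; leave the word unchanged if it does not fit."""
    strip, append = cls
    if strip and not word.endswith(strip):
        return word
    return word[: len(word) - len(strip)] + append


class MemoryBasedLemmatiser:
    """Hypothetical k-nearest-neighbour lemmatiser that stores every training example."""

    def __init__(self, k=1):
        self.k = k
        self.memory = []  # list of (feature tuple, class) pairs

    def train(self, pairs):
        for word, lemma in pairs:
            self.memory.append((features(word), make_class(word, lemma)))

    def lemmatise(self, word):
        target = features(word)
        # Overlap distance: count of feature positions that differ.
        ranked = sorted(
            self.memory,
            key=lambda inst: sum(a != b for a, b in zip(target, inst[0])),
        )
        votes = Counter(cls for _, cls in ranked[: self.k])
        return apply_class(word, votes.most_common(1)[0][0])


# Illustrative training pairs (inflected form, lemma); not the paper's data.
sketch = MemoryBasedLemmatiser(k=1)
sketch.train([
    ("katte", "kat"),
    ("honde", "hond"),
    ("boeke", "boek"),
    ("tafels", "tafel"),
    ("kat", "kat"),
    ("hond", "hond"),
])
print(sketch.lemmatise("matte"), sketch.lemmatise("stoele"))
```

In this sketch nothing is abstracted away from the training data: every example is kept in memory and classification is deferred until a new word arrives, which is the defining trait of memory-based learning. A real system would use far more data and weighted features.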


Keywords: Natural Language Processing, Machine Learning, Lemmatisation, Afrikaans, Memory-Based Learning



Copyright information

© International Federation for Information Processing 2006

Authors and Affiliations

  • Hendrik J. Groenewald, Centre for Text Technology, North-West University, Potchefstroom, South Africa
