Statistical unicodification of African languages

  • Kevin P. Scannell
Original Paper


Many languages in Africa are written using Latin-based scripts, but often with extra diacritics (e.g. dots below in Igbo: ị, ọ, ụ) or modifications to the letters themselves (e.g. open vowels “e” and “o” in Lingala: ɛ, ɔ). While it is possible to render these characters accurately in Unicode, keyboard input methods are often not easily accessible or are cumbersome to use, and so the vast majority of electronic texts in many African languages are written in plain ASCII. We call the process of converting an ASCII text to its proper Unicode form unicodification. This paper describes an open-source package that performs automatic unicodification, implementing a variant of an algorithm described in previous work by De Pauw, Wagacha, and de Schryver. We have trained models for more than 100 languages using web data, and have evaluated each language using a range of feature sets.
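To make the task concrete, the following is a minimal sketch of unicodification as a word-level lookup. It is not the package's algorithm, which the abstract describes only as a variant of the method of De Pauw, Wagacha, and de Schryver. The function names (asciify, train, unicodify), the EXTRA mapping, and the toy word list are all assumptions introduced for illustration.

```python
import unicodedata
from collections import Counter, defaultdict

# Assumed fallback table: letters such as ɛ and ɔ have no Unicode
# decomposition, so stripping combining marks alone does not reach ASCII.
EXTRA = {"ɛ": "e", "ɔ": "o", "Ɛ": "E", "Ɔ": "O"}

def asciify(text):
    """Reduce text to its plain-ASCII look-alike (illustrative only)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(EXTRA.get(c, c) for c in stripped)

def train(corpus_words):
    """Map each ASCII form to its most frequent Unicode spelling."""
    counts = defaultdict(Counter)
    for word in corpus_words:
        counts[asciify(word)][unicodedata.normalize("NFC", word)] += 1
    return {key: forms.most_common(1)[0][0] for key, forms in counts.items()}

def unicodify(text, model):
    """Replace each whitespace-separated token by its learned spelling.

    Tokens never seen in training are left unchanged.
    """
    return " ".join(model.get(token, token) for token in text.split())

# Toy training data (hypothetical, not drawn from the paper's corpora).
toy_corpus = ["ụlọ", "ụlọ", "ulo", "mɔkɔ"]
toy_model = train(toy_corpus)
restored = unicodify("ulo moko xyz", toy_model)
```

A pure lookup like this cannot restore words absent from the training data, which is one reason a practical system would need some fallback for unseen words.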


Diacritic restoration, Unicodification, Under-resourced languages, African languages, Machine learning



We are grateful to Nuance Communications, and especially Ann Aoki Becker, for their support and for their ongoing commitment to developing input technology for under-resourced languages around the world. Thanks also to my student Michael Schade for making this work much more accessible to language communities through his Firefox add-on, and to my many collaborators on the Crúbadán project for their help preparing the web corpora which were used to train the language models, especially Tunde Adegbola (Yoruba), Denis Jacquerye (Lingala), Chinedu Uchechukwa (Igbo), Thapelo Otlogetswe (Setswana), Abdoul Cisse and Mohomodou Houssouba (Songhay), and Outi Sané (Diola). Alexandru Szasz gave helpful feedback on Romanian, as did Jean Came Poulard on Haitian Creole. Finally, thanks to Guy De Pauw, Peter Wagacha, and Gilles-Maurice de Schryver for their encouragement of this work. This paper is dedicated to the memory of my friend and collaborator on Frisian, Eeltje de Vries (1938–2008).


  1. Caldwell, M. E. (2009). Development of psychometrically equivalent speech audiometry materials for testing children in Mongolian, M.S. Thesis, Brigham Young University, December.
  2. De Pauw, G., Wagacha, P. W., & de Schryver, G.-M. (2007). Automatic diacritic restoration for resource-scarce languages. In V. Matousek & P. Mautner (Eds.), Proceedings of text, speech and dialogue conference 2007, pp. 170–179.
  3. De Pauw, G., Wagacha, P. W., & de Schryver, G.-M. (2011). Collection and deployment of a parallel corpus English-Swahili, Language resources and evaluation, this volume.
  4. Fairon, C., et al. (Eds.) (2007). Building and Exploring Web Corpora, Proceedings of the 3rd web as corpus Workshop, Louvain-la-Neuve, Belgium.
  5. Haslam, V. N. (2009). Psychometrically equivalent monosyllabic words for word recognition testing in Mongolian, M.S. Thesis, Brigham Young University, August.
  6. Iftene, A., & Trandabăţ, D. (2009). Recovering diacritics using Wikipedia and Google. In Knowledge engineering: Principles and techniques, Proceedings of the international conference on knowledge engineering KEPT2009, pp. 37–40.
  7. Mihalcea, R. (2002). Diacritics restoration: Learning from letters versus learning from words. In Proceedings of the third international conference on intelligent text processing and computational linguistics.
  8. Mihalcea, R., & Nastase, V. (2002). Letter level learning for language independent diacritics restoration. In Proceedings of CoNLL-2002, pp. 105–111.
  9. Moran, S. (2011). An ontology for accessing transcription systems, Language resources and evaluation, this volume.
  10. Scannell, K. P. (2007). The Crúbadán project: Corpus building for under-resourced languages. In Building and Exploring Web Corpora. Proceedings of the 3rd web as corpus workshop, pp. 5–15.
  11. Simard, M. (1998). Automatic insertion of accents in French text. In Ide & Voutilainen (Eds.), Proceedings of the third conference on empirical methods in natural language processing, pp. 27–35.
  12. Simard, M., & Deslauriers, A. (2001). Real-time automatic insertion of accents in French text. Natural Language Engineering, 7(2), 143–165.
  13. Spriet, T., & El-Bèze, M. (1997). Réaccentuation Automatique de Textes. In FRACTAL 97, Besançon.
  14. Streiter, O., & Stuflesser, M. (2006). Design features for the collection and distribution of basic NLP-resources for the world’s writing systems. In Proceedings of LREC 2006, Genova, Italy.
  15. Tufiş, D., & Chiţu, A. (1999). Automatic diacritics insertion in Romanian texts. In Proceedings of the 5th international workshop on computational lexicography COMPLEX ’99, pp. 185–194.
  16. Tufiş, D., & Ceauşu, A. (2008). DIAC+: A professional diacritics recovering system. In Proceedings of the sixth international language resources and evaluation (LREC’08).
  17. Wagacha, P. W., De Pauw, G., & Githinji, P. W. (2006). A grapheme-based approach for accent restoration in Gĩkũyũ. In Proceedings of LREC’06, pp. 1937–1940.
  18. Yarowsky, D. (1994). A comparison of corpus-based techniques for restoring accents in Spanish and French text. In Proceedings of the 2nd annual workshop on very large text corpora, pp. 99–120.

Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  1. Department of Mathematics and Computer Science, Saint Louis University, St. Louis, USA
