Automatic Restoration of Diacritics for Igbo Language

  • Conference paper
Text, Speech, and Dialogue (TSD 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9924)

Abstract

Igbo is a low-resource African language with orthographic and tonal diacritics, which capture distinctions between words that are important for both meaning and pronunciation and are therefore potentially valuable for a range of language processing tasks. Such diacritics, however, are often largely absent from the electronic texts we might want to process or assemble into corpora, so effective methods for automatic diacritic restoration for Igbo are needed. In this paper, we experiment with an Igbo Bible corpus, which is extensively marked for vowel distinctions and partially marked for tonal distinctions, and attempt to reinstate these diacritics after they have been deleted. We investigate a number of word-level diacritic restoration methods, based on n-grams, under a closed-world assumption, achieving an accuracy of 98.83% with our most effective method.
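To make the word-level idea concrete, here is a minimal sketch of a unigram restorer: each stripped word is replaced by the most frequent diacritic variant seen in training. It is an illustration only, not the paper's implementation. The function names, the toy tokens, and the choice to pass unseen words through unchanged are all assumptions of this sketch.

```python
import unicodedata
from collections import Counter, defaultdict


def wordkey(word):
    """Strip diacritics: decompose the word, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train_unigram(tokens):
    """Count how often each diacritic variant occurs for every wordkey."""
    variants = defaultdict(Counter)
    for token in tokens:
        variants[wordkey(token)][unicodedata.normalize("NFC", token)] += 1
    return variants


def restore(tokens, variants):
    """Replace each stripped token with its most frequent known variant."""
    restored = []
    for token in tokens:
        candidates = variants.get(wordkey(token))
        if candidates:
            restored.append(candidates.most_common(1)[0][0])
        else:
            # Assumption of this sketch: unseen wordkeys pass through unchanged.
            restored.append(token)
    return restored


# Toy training tokens (hypothetical, not drawn from the paper's corpus).
training = ["ákwá", "ákwá", "àkwà", "ọ", "na"]
model = train_unigram(training)
print(restore(["o", "na", "akwa"], model))
```

A bigram or trigram variant would condition the choice of candidate on neighbouring words instead of relying on overall frequency alone.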

Notes

  1. http://www.merriam-webster.com/dictionary/diacritic.

  2. http://www.columbia.edu/itc/mealac/pritchett/00fwp/igbo/txt_onwu_1961.pdf.

  3. Note that the nasal consonants m and n are sometimes treated as tone-marked vowels.

  4. LexDif is the average number of candidates per wordkey, calculated by dividing the total number of word types by the number of unique wordkeys. A wordkey is obtained by stripping the diacritics off a word.

  5. For example, strings like “na” (mostly a conjunction), “na-” (auxiliary) or “n’” (preposition) are treated as valid tokens because of the special roles these symbols play in distinguishing word classes.

  6. This corpus was originally processed by Onyenwe et al. [6].

  7. Since we did not deal with unknown words, we simplified our models by assuming that words not found in our dictionary do not exist.

  8. We recognise that this might be counterproductive, as words correctly restored in the previous step may be wrongly replaced in the next.
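The LexDif measure defined in the notes (word types divided by unique wordkeys) can be sketched as follows. This is a hedged illustration: the function names and the toy word list are assumptions, and diacritics are stripped here via Unicode decomposition, which may differ from how the authors built their wordkeys.

```python
import unicodedata


def wordkey(word):
    """Strip diacritics: decompose the word, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def lexdif(tokens):
    """Average number of diacritic variants (word types) per wordkey."""
    # Normalise first so that visually identical forms count as one type.
    types = {unicodedata.normalize("NFC", token) for token in tokens}
    keys = {wordkey(word_type) for word_type in types}
    if not keys:
        raise ValueError("lexdif is undefined for an empty token list")
    return len(types) / len(keys)


# Toy word list (hypothetical): 6 word types sharing 3 wordkeys.
sample = ["ákwá", "àkwà", "àkwá", "na", "ọ", "o"]
print(lexdif(sample))
```

A value of 1.0 would mean that no wordkey is ambiguous; higher values indicate more candidates to choose from per stripped word.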

References

  1. Achebe, I., Ikekeonwu, C., Eme, C., Emenanjo, N., Wanjiku, N.: A Composite Synchronic Alphabet of Igbo Dialects (CSAID). IADP, New York (2011)

  2. Cocks, J., Keegan, T.: A word-based approach for diacritic restoration in Māori. In: 2011 Proceedings of the Australasian Language Technology Association Workshop, Canberra, Australia, pp. 126–130, December 2011

  3. Crandall, D.: Automatic Accent Restoration in Spanish Text (2016). http://www.cs.indiana.edu/~djcran/projects/674_final.pdf. Accessed 7 Jan 2016

  4. De Pauw, G., De Schryver, G.M., Pretorius, L., Levin, L.: Introduction to the special issue on African language technology. Lang. Resour. Eval. 45, 263–269 (2011). Springer Online

  5. Mihalcea, R.F.: Diacritics restoration: learning from letters versus learning from words. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276, pp. 339–348. Springer, Heidelberg (2002)

  6. Onyenwe, I.E., Uchechukwu, C., Hepple, M.R.: Part-of-speech tagset and corpus development for Igbo, an African language. In: LAW VIII - The 8th Linguistic Annotation Workshop, pp. 93–98. ACL, Dublin (2014)

  7. Šantić, N., Šnajder, J., Dalbelo Bašić, B.: Automatic diacritics restoration in Croatian texts. In: Stančić, H., Seljan, S., Bawden, D., Lasić-Lazić, J., Slavić, A. (eds.) The Future of Information Sciences, Digital Resources and Knowledge Sharing, pp. 126–130 (2009). ISBN 978-953-175-355-5

  8. Scannell, K.P.: Statistical unicodification of African languages. Lang. Resour. Eval. 45(3), 375–386 (2011). Springer New York Inc., Secaucus, NJ, USA

  9. Simard, M.: Automatic insertion of accents in French texts. In: Proceedings of the Third Conference on Empirical Methods in Natural Language Processing, pp. 27–35 (1998)

  10. Tufiş, D., Chiţu, A.: Automatic diacritics insertion in Romanian texts. In: Proceedings of International Conference on Computational Lexicography, Pecs, Hungary, pp. 185–194 (1999)

  11. Wagacha, P.W., De Pauw, G., Githinji, P.W.: A grapheme-based approach to accent restoration in Gĩkũyũ. In: Proceedings of 5th International Conference on Language Resources and Evaluation (2006)

  12. Yarowsky, D.: Corpus-based techniques for restoring accents in Spanish and French text. In: Armstrong, S., Church, K., Isabelle, P., Manzi, S., Tzoukermann, E., Yarowsky, D. (eds.) Natural Language Processing Using Very Large Corpora. Text, Speech and Language Technology, vol. 11, pp. 99–120. Springer, Netherlands (1999). Kluwer Academic Publishers

Acknowledgments

Many thanks to Nnamdi Azikiwe University and TETFund Nigeria for the funding, and to my colleagues at the IgboNLP Project, University of Sheffield, UK, and Prof. Kevin P. Scannell, Saint Louis University, USA.

Author information

Corresponding author

Correspondence to Ignatius Ezeani.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Ezeani, I., Hepple, M., Onyenwe, I. (2016). Automatic Restoration of Diacritics for Igbo Language. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2016. Lecture Notes in Computer Science (LNAI), vol 9924. Springer, Cham. https://doi.org/10.1007/978-3-319-45510-5_23

  • DOI: https://doi.org/10.1007/978-3-319-45510-5_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45509-9

  • Online ISBN: 978-3-319-45510-5

  • eBook Packages: Computer Science, Computer Science (R0)
