Automatic Restoration of Diacritics for Igbo Language

Ezeani, Ignatius; Hepple, Mark; Onyenwe, Ikechukwu

doi:10.1007/978-3-319-45510-5_23

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9924)

Included in the following conference series: International Conference on Text, Speech, and Dialogue

Abstract

Igbo is a low-resource African language with orthographic and tonal diacritics that capture distinctions between words important for both meaning and pronunciation, and hence of potential value for a range of language processing tasks. Such diacritics, however, are often largely absent from the electronic texts we might want to process or assemble into corpora, so there is a need for effective methods of automatic diacritic restoration for Igbo. In this paper, we experiment with an Igbo Bible corpus, which is extensively marked for vowel distinctions and partially marked for tonal distinctions, and attempt the task of reinstating these diacritics after they have been deleted. We investigate a number of word-level diacritic restoration methods based on n-grams, under a closed-world assumption, achieving an accuracy of 98.83% with our most effective method.
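
As an illustration of what word-level, n-gram-based restoration under a closed-world assumption can look like, the following is a minimal sketch and not the authors' implementation. It assumes a training corpus of diacritically marked sentences, picks for each stripped token the marked candidate seen most often after the previously restored word, and falls back to the most frequent candidate overall. The class name `NgramRestorer`, the helper `strip_diacritics`, and the toy tokens are hypothetical and chosen only for this example.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(word):
    """Decompose the word, drop combining marks, and recompose."""
    nfd = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in nfd if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", base)


class NgramRestorer:
    """Hypothetical unigram/bigram restorer (illustrative sketch only)."""

    def __init__(self, sentences):
        # stripped form -> Counter of marked forms seen in training
        self.candidates = defaultdict(Counter)
        # (previous marked form, marked form) -> count
        self.bigrams = Counter()
        for sentence in sentences:
            prev = "<s>"
            for word in sentence:
                word = unicodedata.normalize("NFC", word)
                self.candidates[strip_diacritics(word)][word] += 1
                self.bigrams[(prev, word)] += 1
                prev = word

    def restore(self, tokens):
        restored, prev = [], "<s>"
        for token in tokens:
            options = self.candidates.get(strip_diacritics(token))
            if not options:
                # Closed world: forms never seen in training are left as-is.
                best = token
            else:
                # Prefer the bigram count; break ties with the unigram count.
                best = max(
                    options,
                    key=lambda c: (self.bigrams[(prev, c)], options[c]),
                )
            restored.append(best)
            prev = best
        return restored


# Toy training data (made up for illustration, not taken from the paper).
training = [
    ["o", "ri", "àkwá"],
    ["ọ", "na", "ákwà"],
    ["o", "ri", "àkwá"],
]
restorer = NgramRestorer(training)
result_a = restorer.restore(["o", "ri", "akwa"])
result_b = restorer.restore(["na", "akwa"])
```

In this toy setting the same stripped token `akwa` is restored differently depending on the preceding word, which is the kind of context a bigram model can exploit and a unigram model cannot.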

Notes

1. http://www.merriam-webster.com/dictionary/diacritic.
2. http://www.columbia.edu/itc/mealac/pritchett/00fwp/igbo/txt_onwu_1961.pdf.
3. Observe that m and n, nasal consonants, are sometimes treated as tone-marked vowels.
4. LexDif is the average number of candidates per wordkey, calculated by dividing the total number of word types by the number of unique wordkeys. A wordkey is obtained by stripping the diacritics off a word.
5. For example, strings like “na” (mostly conjunction), “na-” (auxiliary) or “n’ ” (preposition) are treated as valid tokens because of the special roles the symbols play in distinguishing the word classes.
6. This corpus was originally processed by Onyenwe et al. [6].
7. Since we did not deal with unknown words, we simplified our models by assuming that words not found in our dictionary do not exist.
8. We recognise that this might be counterproductive, as words correctly restored in the previous step may be wrongly replaced in the next.
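
The LexDif measure defined in note 4 can be sketched in a few lines. This is a minimal illustration, assuming that diacritics are represented as Unicode combining marks after canonical decomposition; the function names `wordkey` and `lexdif` and the toy word list are hypothetical and not taken from the paper.

```python
import unicodedata
from collections import defaultdict


def wordkey(word):
    """Return the word with its diacritics stripped."""
    nfd = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in nfd if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", base)


def lexdif(words):
    """Average number of distinct word types per unique wordkey."""
    types_by_key = defaultdict(set)
    for word in words:
        # Normalise first so one type is not counted twice under two encodings.
        word = unicodedata.normalize("NFC", word)
        types_by_key[wordkey(word)].add(word)
    if not types_by_key:
        raise ValueError("lexdif needs at least one word")
    total_types = sum(len(types) for types in types_by_key.values())
    return total_types / len(types_by_key)


# Toy list (made up for illustration): 6 word types over 3 wordkeys.
toy_words = ["akwa", "àkwà", "ákwà", "o", "ọ", "na", "na"]
score = lexdif(toy_words)
```

Here the wordkeys are `akwa`, `o` and `na`, with 3, 2 and 1 candidate types respectively, so the toy list gives 6 / 3 = 2.0 candidates per wordkey.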

References

1. Achebe, I., Ikekeonwu, C., Eme, C., Emenanjo, N., Wanjiku, N.: A Composite Synchronic Alphabet of Igbo Dialects (CSAID). IADP, New York (2011)
2. Cocks, J., Keegan, T.: A word-based approach for diacritic restoration in Māori. In: 2011 Proceedings of the Australasian Language Technology Association Workshop, Canberra, Australia, pp. 126–130, December 2011
3. Crandall, D.: Automatic Accent Restoration in Spanish Text (2016). http://www.cs.indiana.edu/~djcran/projects/674_final.pdf. Accessed 7 Jan 2016
4. De Pauw, G., De Schryver, G.M., Pretorius, L., Levin, L.: Introduction to the special issue on African language technology. Lang. Resour. Eval. 45, 263–269 (2011). Springer Online
5. Mihalcea, R.F.: Diacritics restoration: learning from letters versus learning from words. In: Gelbukh, A. (ed.) CICLing 2002. LNCS, vol. 2276, pp. 339–348. Springer, Heidelberg (2002)
6. Onyenwe, I.E., Uchechukwu, C., Hepple, M.R.: Part-of-speech tagset and corpus development for Igbo, an African language. In: LAW VIII - The 8th Linguistic Annotation Workshop, pp. 93–98. ACL, Dublin (2014)
7. Šantić, N., Šnajder, J., Dalbelo Bašić, B.: Automatic diacritics restoration in Croatian texts. In: Stančić, H., Seljan, S., Bawden, D., Lasić-Lazić, J., Slavić, A. (eds.) The Future of Information Sciences, Digital Resources and Knowledge Sharing, pp. 126–130 (2009). ISBN 978-953-175-355-5
8. Scannell, K.P.: Statistical unicodification of African languages. Lang. Resour. Eval. 45(3), 375–386 (2011). Springer New York Inc., Secaucus, NJ, USA
9. Simard, M.: Automatic insertion of accents in French texts. In: Proceedings of the Third Conference on Empirical Methods in Natural Language Processing, pp. 27–35 (1998)
10. Tufiş, D., Chiţu, A.: Automatic diacritics insertion in Romanian texts. In: Proceedings of International Conference on Computational Lexicography, Pecs, Hungary, pp. 185–194 (1999)
11. Wagacha, P.W., De Pauw, G., Githinji, P.W.: A grapheme-based approach to accent restoration in Gĩkũyũ. In: Proceedings of 5th International Conference on Language Resources and Evaluation (2006)
12. Yarowsky, D.: Corpus-based techniques for restoring accents in Spanish and French text. In: Armstrong, S., Church, K., Isabelle, P., Manzi, S., Tzoukermann, E., Yarowsky, D. (eds.) Natural Language Processing Using Very Large Corpora. Text, Speech and Language Technology, vol. 11, pp. 99–120. Springer, Netherlands (1999). Kluwer Academic Publishers

Acknowledgments

Many thanks to Nnamdi Azikiwe University & TETFund Nigeria for the funding, my colleagues at the IgboNLP Project, University of Sheffield, UK, and Prof. Kevin P. Scannell, Saint Louis University, USA.

Author information

Authors and Affiliations

NLP Group, Department of Computer Science, The University of Sheffield, Sheffield, UK
Ignatius Ezeani, Mark Hepple & Ikechukwu Onyenwe

Corresponding author

Correspondence to Ignatius Ezeani.

Editor information

Editors and Affiliations

Masaryk University, Brno, Czech Republic
Petr Sojka, Aleš Horák, Ivan Kopeček & Karel Pala

About this paper

Cite this paper

Ezeani, I., Hepple, M., Onyenwe, I. (2016). Automatic Restoration of Diacritics for Igbo Language. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2016. Lecture Notes in Computer Science, vol 9924. Springer, Cham. https://doi.org/10.1007/978-3-319-45510-5_23

DOI: https://doi.org/10.1007/978-3-319-45510-5_23
Published: 03 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-45509-9
Online ISBN: 978-3-319-45510-5
eBook Packages: Computer Science, Computer Science (R0)
