
Multi-task Projected Embedding for Igbo

  • Ignatius Ezeani
  • Mark Hepple
  • Ikechukwu Onyenwe
  • Chioma Enemuo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11107)

Abstract

NLP research on low-resource African languages is often impeded by the unavailability of basic resources: tools, techniques, annotated corpora, and datasets. Funding for the manual development of these resources is scarce, and building them from scratch would amount to reinventing the wheel. Adapting existing techniques and models from well-resourced languages is therefore often an attractive option. Word embeddings are among the most widely applied NLP models, but training them usually requires large amounts of data, which is not available for most African languages. In this work, we adopt an alignment-based projection method to transfer trained English embeddings to the Igbo language. Various English embedding models were projected and evaluated intrinsically on the odd-word, analogy, and word-similarity tasks, and also on the diacritic restoration task. Our results show that the projected embeddings performed very well across these tasks.
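The abstract does not spell out the projection step, so the following is only a minimal sketch of one common form of alignment-based projection: each target-language word receives the alignment-count-weighted average of the vectors of the source-language words it is aligned to. The function name `project_embeddings`, the toy vectors, and the alignment counts are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def project_embeddings(alignment_counts, source_vectors):
    """Project source-language vectors onto target-language words.

    alignment_counts: dict mapping (target_word, source_word) -> count,
        e.g. gathered from a word-aligned parallel corpus (assumed input).
    source_vectors: dict mapping source_word -> list of floats.
    Returns a dict mapping target_word -> count-weighted average vector.
    """
    sums = {}
    totals = defaultdict(float)
    for (target, source), count in alignment_counts.items():
        vector = source_vectors.get(source)
        # Skip alignments to source words that have no trained vector.
        if vector is None or count <= 0:
            continue
        if target not in sums:
            sums[target] = [0.0] * len(vector)
        for i, value in enumerate(vector):
            sums[target][i] += count * value
        totals[target] += count
    return {
        target: [value / totals[target] for value in acc]
        for target, acc in sums.items()
    }


# Hypothetical two-dimensional English vectors and alignment counts.
english_vectors = {"water": [1.0, 0.0], "rain": [0.0, 1.0]}
alignments = {("mmiri", "water"): 3, ("mmiri", "rain"): 1}

igbo_vectors = project_embeddings(alignments, english_vectors)
print(igbo_vectors["mmiri"])
```

Under this scheme a target word that is never aligned to a source word with a trained vector simply gets no embedding; how such words are handled is a design choice this sketch leaves open.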

Keywords

  • Low-resource
  • Igbo
  • Diacritics
  • Embedding models
  • Transfer learning

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Ignatius Ezeani (1)
  • Mark Hepple (1)
  • Ikechukwu Onyenwe (1)
  • Chioma Enemuo (1)

  1. Department of Computer Science, The University of Sheffield, Sheffield, UK
