Automatic diacritization of Arabic text using recurrent neural networks

  • Gheith A. Abandah
  • Alex Graves
  • Balkees Al-Shagoor
  • Alaa Arabiyat
  • Fuad Jamour
  • Majid Al-Taee
Original Paper

Abstract

This paper presents a sequence transcription approach for the automatic diacritization of Arabic text. A recurrent neural network is trained to transcribe undiacritized Arabic text into fully diacritized sentences. We use a deep bidirectional long short-term memory network that builds high-level linguistic abstractions of text and exploits long-range context in both input directions. This approach differs from previous approaches in that no lexical, morphological, or syntactic analysis is performed on the data before it is processed by the network. Nonetheless, when the network output is post-processed with our error correction techniques, the approach achieves state-of-the-art performance, yielding average diacritic and word error rates of 2.09 % and 5.82 %, respectively, on samples from 11 books. For the LDC ATB3 benchmark, this approach reduces the diacritic error rate by 25 %, the word error rate by 20 %, and the last-letter diacritization error rate by 33 % relative to the best published results.
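
To make the two headline metrics concrete, the following is a minimal Python sketch of how a diacritic error rate (DER) and a word error rate (WER) could be computed for an aligned reference/hypothesis pair. The counting conventions are assumptions made for illustration and are not taken from the paper: every non-diacritic character counts as one letter, a letter is wrong if its set of diacritics differs from the reference, and a word is wrong if any of its letters is wrong. The function names `split_diacritics` and `error_rates` are hypothetical.

```python
# Illustrative sketch only: the counting conventions here are assumptions,
# not the evaluation procedure used in the paper.

# Arabic diacritics U+064B (fathatan) through U+0652 (sukun).
DIACRITICS = frozenset(chr(c) for c in range(0x064B, 0x0653))


def split_diacritics(word):
    """Return a list of (base_char, diacritic_set) pairs for one word."""
    pairs = []
    for ch in word:
        if ch not in DIACRITICS:
            pairs.append((ch, frozenset()))
        elif pairs:
            # Attach the mark to the preceding base character; using a set
            # makes "shadda + fatha" equal to "fatha + shadda".
            base, marks = pairs[-1]
            pairs[-1] = (base, marks | {ch})
        # A diacritic with no preceding base character is ignored.
    return pairs


def error_rates(reference, hypothesis):
    """Return (DER, WER) as fractions for two whitespace-tokenized strings."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("reference and hypothesis must have the same word count")
    letters = letter_errors = word_errors = 0
    for ref_word, hyp_word in zip(ref_words, hyp_words):
        ref_pairs = split_diacritics(ref_word)
        hyp_pairs = split_diacritics(hyp_word)
        if [b for b, _ in ref_pairs] != [b for b, _ in hyp_pairs]:
            raise ValueError("base letters differ; inputs are not aligned")
        wrong = sum(r != h for (_, r), (_, h) in zip(ref_pairs, hyp_pairs))
        letters += len(ref_pairs)
        letter_errors += wrong
        word_errors += 1 if wrong else 0
    if letters == 0:
        raise ValueError("reference contains no letters")
    return letter_errors / letters, word_errors / len(ref_words)


# Example: "kataba waladu" (reference) vs. "kataba walada" (hypothesis).
# Only the case ending of the second word differs: 1 of 6 letters, 1 of 2 words.
reference = "\u0643\u064E\u062A\u064E\u0628\u064E \u0648\u064E\u0644\u064E\u062F\u064F"
hypothesis = "\u0643\u064E\u062A\u064E\u0628\u064E \u0648\u064E\u0644\u064E\u062F\u064E"
der, wer = error_rates(reference, hypothesis)
```

Under these assumptions a single wrong case ending raises the WER much more than the DER, which is consistent with the word error rate reported above being higher than the diacritic error rate.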

Keywords

Automatic diacritization · Arabic text · Machine learning · Sequence transcription · Recurrent neural networks · Deep neural networks · Long short-term memory

Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Gheith A. Abandah (1)
  • Alex Graves (2)
  • Balkees Al-Shagoor (1)
  • Alaa Arabiyat (1)
  • Fuad Jamour (3)
  • Majid Al-Taee (1)

  1. Computer Engineering Department, University of Jordan, Amman, Jordan
  2. Google DeepMind, London, UK
  3. King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
