International Journal of Speech Technology, Volume 21, Issue 3, pp 589–600

Text normalization with convolutional neural networks

  • Sevinj Yolchuyeva
  • Géza Németh
  • Bálint Gyires-Tóth


Text normalization is a critical step in a variety of tasks involving speech and language technologies. It is a vital component of natural language processing, text-to-speech synthesis, and automatic speech recognition. Convolutional neural networks (CNNs) have shown superior performance to recurrent architectures in various application scenarios, such as neural machine translation; however, their potential for text normalization has not yet been exploited. In this paper we investigate and propose a novel CNN-based text normalization method. Training and inference times, accuracy, precision, recall, and F1-score were evaluated on an open-source dataset. The performance of the CNNs is compared with that of several long short-term memory (LSTM) and bidirectional LSTM (Bi-LSTM) architectures on the same dataset.


Text normalization, Convolutional neural networks (CNN), Long short-term memory (LSTM), Residual architecture



The research presented in this paper has been supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.2-16-2017-00013), by the VUK project (AAL 2014-183), the DANSPLAT project (Eureka 9944) and by the BME-Artificial Intelligence FIKP grant of EMMI (BME FIKP-MI/SC). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Sevinj Yolchuyeva (1)
  • Géza Németh (1)
  • Bálint Gyires-Tóth (1)
  1. Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary
