A Comparison of Modeling Units in Sequence-to-Sequence Speech Recognition with the Transformer on Mandarin Chinese

  • Shiyu Zhou
  • Linhao Dong
  • Shuang Xu
  • Bo Xu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11305)


The choice of modeling units is critical to automatic speech recognition (ASR) tasks. Conventional ASR systems typically use context-dependent states (CD-states) or context-dependent phonemes (CD-phonemes) as their modeling units. However, this choice has been challenged by sequence-to-sequence attention-based models. On English ASR tasks, previous work has shown that graphemes can outperform phonemes as the modeling unit in sequence-to-sequence attention-based models. In this paper, we investigate modeling units for Mandarin Chinese ASR using sequence-to-sequence attention-based models with the Transformer. Five modeling units are explored: context-independent phonemes (CI-phonemes), syllables, words, sub-words and characters. Experiments on the HKUST datasets demonstrate that lexicon-free modeling units can outperform lexicon-related modeling units in terms of character error rate (CER). Among the five modeling units, the character-based model performs best and establishes a new state-of-the-art CER of \(26.64\%\) on the HKUST datasets.
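The abstract reports results as character error rate (CER). As a rough illustration of how such a figure is usually computed, the sketch below takes the character-level edit distance (substitutions, insertions, deletions) between each reference and hypothesis, summed over all utterances and divided by the total number of reference characters. This is the common definition and is assumed here; the paper's own scoring setup is not described in this section. The function names and the sample sentences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions and
    deletions needed to turn `hyp` into `ref`.
    """
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total character edits / total reference characters.

    Assumption: each transcript is a plain string whose characters are
    the scoring units (no whitespace or punctuation normalization).
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_errors = sum(edit_distance(r, h)
                       for r, h in zip(references, hypotheses))
    return total_errors / total_chars


if __name__ == "__main__":
    # Hypothetical example: one substitution and one deletion
    # against a six-character reference.
    refs = ["今天天气很好"]
    hyps = ["今天天汽很"]
    print(f"CER = {cer(refs, hyps):.2%}")
```

Because the score is computed on characters, models that emit other units (phonemes, syllables, words, sub-words) would need their output converted to characters before scoring.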


Keywords: ASR, Multi-head attention, Modeling units, Sequence-to-sequence, Transformer



The research work is supported by the National Key Research and Development Program of China under Grant No. 2016YFB1001404.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Institute of Automation, Chinese Academy of Sciences, Beijing, China
  2. University of Chinese Academy of Sciences, Beijing, China
