
Joint CTC-Attention End-to-End Speech Recognition with a Triangle Recurrent Neural Network Encoder

Published in: Journal of Shanghai Jiaotong University (Science)

Abstract

Traditional speech recognition systems built on the deep neural network (DNN) and the hidden Markov model (HMM) are complex, multi-module pipelines in which the optimization goals of the individual modules may differ. They also require additional language resources, such as a pronunciation dictionary and a language model. To eliminate these drawbacks, we propose an end-to-end speech recognition method in which connectionist temporal classification (CTC) and attention are integrated for decoding, so that the complex modules are replaced by a single deep network. The model consists mainly of an encoder and a decoder. The encoder is built from bidirectional long short-term memory (BLSTM) layers arranged in a triangular structure for feature extraction. The decoder, based on joint CTC-attention decoding, uses the high-level features extracted by the shared encoder for training and decoding. Experimental results on the VoxForge dataset show that the end-to-end method outperforms both plain CTC decoding and attention-based encoder-decoder decoding, reducing the character error rate (CER) to 12.9% without using any language model.
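The abstract describes the architecture in prose; the following minimal sketch, written in PyTorch, illustrates the general idea of a shared encoder feeding both a CTC branch and an attention decoder trained with an interpolated loss. It is an assumption-laden illustration, not the authors' implementation: the layer widths, the blank/pad/<sos> token conventions, the simplified dot-product attention, and the weight lam are placeholders, and the triangular encoder is approximated here by BLSTM layers whose hidden width shrinks layer by layer.

# Hedged sketch of joint CTC-attention multi-task training with a shared
# BLSTM encoder. All sizes and token conventions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TriangleBLSTMEncoder(nn.Module):
    """BLSTM stack with shrinking hidden widths (assumed 'triangle' shape)."""

    def __init__(self, input_dim=80, widths=(320, 256, 192)):
        super().__init__()
        layers, in_dim = [], input_dim
        for w in widths:
            layers.append(nn.LSTM(in_dim, w, batch_first=True, bidirectional=True))
            in_dim = 2 * w                        # bidirectional output size
        self.layers = nn.ModuleList(layers)
        self.output_dim = in_dim

    def forward(self, x):                         # x: (batch, time, feat)
        for lstm in self.layers:
            x, _ = lstm(x)
        return x                                  # (batch, time, output_dim)


class JointCTCAttention(nn.Module):
    """Shared encoder feeding a CTC branch and an attention decoder,
    trained with lam * L_CTC + (1 - lam) * L_attention."""

    def __init__(self, vocab_size, input_dim=80, dec_dim=256, lam=0.5, sos=1):
        super().__init__()
        self.encoder = TriangleBLSTMEncoder(input_dim)
        enc_dim = self.encoder.output_dim
        self.ctc_proj = nn.Linear(enc_dim, vocab_size)        # CTC branch
        self.enc_proj = nn.Linear(enc_dim, dec_dim)
        self.embed = nn.Embedding(vocab_size, dec_dim)
        self.att = nn.MultiheadAttention(dec_dim, num_heads=1, batch_first=True)
        self.dec_rnn = nn.LSTM(2 * dec_dim, dec_dim, batch_first=True)
        self.out = nn.Linear(dec_dim, vocab_size)
        self.lam, self.sos = lam, sos                          # index 0 = blank/pad

    def forward(self, feats, feat_lens, ys, y_lens):
        h = self.encoder(feats)                                # shared encoder states
        # CTC branch: frame-wise label posteriors over the encoder output.
        ctc_logp = F.log_softmax(self.ctc_proj(h), dim=-1)     # (B, T, V)
        ctc_loss = F.ctc_loss(ctc_logp.transpose(0, 1), ys,
                              feat_lens, y_lens, blank=0, zero_infinity=True)
        # Attention branch: teacher-forced decoder with simplified
        # dot-product attention over the projected encoder states.
        dec_in = F.pad(ys, (1, 0), value=self.sos)[:, :-1]     # shift right, prepend <sos>
        e = self.embed(dec_in)                                 # (B, L, D)
        he = self.enc_proj(h)                                  # (B, T, D)
        ctx, _ = self.att(e, he, he)
        dec, _ = self.dec_rnn(torch.cat([e, ctx], dim=-1))
        att_logits = self.out(dec)                             # (B, L, V)
        att_loss = F.cross_entropy(att_logits.reshape(-1, att_logits.size(-1)),
                                   ys.reshape(-1), ignore_index=0)
        # Joint objective for multi-task training.
        return self.lam * ctc_loss + (1 - self.lam) * att_loss

At decoding time the two branches are typically combined as well, with CTC scores used to rescore or prune the attention decoder's beam-search hypotheses; this is how joint CTC-attention decoding is commonly realised in toolkits such as ESPnet.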

Author information

Corresponding author

Correspondence to Chunling Cheng  (程春玲).


About this article

Cite this article

Zhu, T., Cheng, C. Joint CTC-Attention End-to-End Speech Recognition with a Triangle Recurrent Neural Network Encoder. J. Shanghai Jiaotong Univ. (Sci.) 25, 70–75 (2020). https://doi.org/10.1007/s12204-019-2147-6
