A transformer-based network for speech recognition

International Journal of Speech Technology

Abstract

In automatic speech recognition (ASR), noisy audio data and ambiguity among homophones degrade model performance. To address these problems, this study proposes DMRS-Transformer, a Transformer-based network. Beyond the standard Transformer, DMRS-Transformer adds two components: a denoising module and a Mandarin recognition supplementary (MRS) module. The denoising module prunes the trivial features introduced by noisy input audio, while the MRS module tackles the recognition of Mandarin speech signals that contain homophones. Empirical evaluations were conducted on two widely used datasets, Aishell-1 and HKUST. The experimental results validate the effectiveness of the proposed DMRS-Transformer: compared with the Transformer baseline, it achieves CER improvements of 0.8% and 1.5% on the two datasets, respectively.
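The paper's implementation is not reproduced on this page. As a rough, hypothetical sketch of the architecture the abstract describes, the following PyTorch snippet wraps a standard Transformer encoder with a denoising front-end and an auxiliary branch standing in for the MRS module. Every design choice here (channel-wise soft thresholding, layer sizes, the auxiliary character head, and all names) is an assumption for illustration, not the authors' exact method.

```python
import torch
import torch.nn as nn


class DenoisingModule(nn.Module):
    """Channel-wise soft thresholding: one plausible way to prune near-zero
    ("trivial") features from noisy inputs. Hypothetical design, not the
    paper's exact module."""

    def __init__(self, d_model: int):
        super().__init__()
        # Small gate that predicts a shrinkage ratio in (0, 1) per channel.
        self.gate = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, d_model),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        scale = x.abs().mean(dim=1, keepdim=True)  # per-utterance channel energy
        tau = self.gate(scale) * scale             # learned threshold, 0 <= tau <= scale
        # Soft threshold: sign(x) * max(|x| - tau, 0) zeroes out weak activations.
        return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)


class DMRSTransformerEncoder(nn.Module):
    """Standard Transformer encoder plus the two extra components the
    abstract describes (all sizes are illustrative assumptions)."""

    def __init__(self, n_feats=80, d_model=256, n_heads=4,
                 n_layers=12, vocab_size=4000):
        super().__init__()
        self.proj = nn.Linear(n_feats, d_model)
        self.denoise = DenoisingModule(d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Auxiliary character head standing in for the MRS module: its loss
        # could add extra supervision on homophone-prone characters.
        self.mrs_head = nn.Linear(d_model, vocab_size)

    def forward(self, feats: torch.Tensor):
        h = self.denoise(self.proj(feats))  # denoise the projected features
        h = self.encoder(h)                 # contextualize with self-attention
        return h, self.mrs_head(h)          # encoder states + auxiliary logits


# Usage: 80-dim filterbank features for a batch of two 200-frame utterances.
model = DMRSTransformerEncoder()
states, aux_logits = model(torch.randn(2, 200, 80))
print(states.shape, aux_logits.shape)  # (2, 200, 256) (2, 200, 4000)
```

In a design like this, the front-end and auxiliary head would typically be trained jointly with the usual ASR objective (e.g., CTC or attention-based cross-entropy); the paper's actual training setup may differ. The CER reported in the abstract is the standard character error rate: the edit distance between hypothesis and reference character sequences divided by the reference length.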


Data availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.


Author information

Corresponding author

Correspondence to Lina Tang.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tang, L. A transformer-based network for speech recognition. Int J Speech Technol 26, 531–539 (2023). https://doi.org/10.1007/s10772-023-10034-z
