
Improving Automatic Speech Recognition for Non-native English with Transfer Learning and Language Model Decoding

Chapter in: Analysis and Application of Natural Language and Speech Processing

Abstract

ASR systems designed for native English (L1) usually underperform on non-native English (L2). To address this performance gap, (1) we extend our previous work to investigate fine-tuning of a pre-trained wav2vec 2.0 model [2, 56] under a rich set of L1 and L2 training conditions, and (2) we incorporate language model decoding in the ASR system alongside the fine-tuning method. Quantifying the gains from each of these two approaches separately, together with an error analysis, allows us to identify different sources of improvement within our models. We find that while the large self-trained wav2vec 2.0 model may internalize sufficient decoding knowledge for clean L1 speech [56], this does not hold for L2 speech, which accounts for the utility of language model decoding on L2 data.
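To make the two components concrete, below is a minimal sketch (not the chapter's exact fairseq pipeline; see Notes 3 and 11 later in this page) that pairs the self-trained wav2vec 2.0 checkpoint with KenLM-backed beam-search decoding via pyctcdecode. The KenLM path "lm.arpa" is a placeholder, and depending on the tokenizer the vocabulary may need remapping (e.g., the word delimiter "|" to a space).

```python
# A minimal sketch, assuming the Hugging Face `transformers` and `pyctcdecode`
# packages; "lm.arpa" is a hypothetical n-gram LM path, not the chapter's model.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from pyctcdecode import build_ctcdecoder

MODEL_ID = "facebook/wav2vec2-large-960h-lv60-self"  # self-trained checkpoint (Note 11)
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID).eval()

# pyctcdecode expects the vocabulary ordered by token id; special tokens and
# the "|" word delimiter may need adjustment for a given tokenizer.
vocab = [tok for tok, _ in sorted(processor.tokenizer.get_vocab().items(),
                                  key=lambda kv: kv[1])]
decoder = build_ctcdecoder(vocab, kenlm_model_path="lm.arpa")

def transcribe(waveform, sampling_rate=16_000):
    """Return (greedy, lm_decoded) transcripts for a 1-D float waveform."""
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits[0]          # (time, vocab)
    greedy = processor.decode(torch.argmax(logits, dim=-1))    # no language model
    log_probs = torch.log_softmax(logits, dim=-1).numpy()
    lm_decoded = decoder.decode(log_probs)                     # beam search + KenLM
    return greedy, lm_decoded
```

Comparing the two outputs per utterance mirrors the chapter's strategy of attributing improvements to the fine-tuned acoustic model versus the language model decoder.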


Notes

  1. Although sometimes referred to as "unsupervised," these models employ a self-supervised objective.

  2. https://github.com/pytorch/fairseq/tree/master/examples/wav2vec

  3. https://github.com/kensho-technologies/pyctcdecode

  4. http://www.openslr.org/12/

  5. https://github.com/facebookresearch/libri-light

  6. https://librivox.org

  7. http://www.gutenberg.org

  8. We use the term "accent" here to loosely refer to variation in speakers with an L1 other than English.

  9. https://github.com/pytorch/fairseq

  10. https://github.com/UBC-NLP/L2ASR

  11. https://huggingface.co/facebook/wav2vec2-large-960h-lv60-self

  12. We use 10% of the utterances from these 18 speakers for development (Dev).

  13. For those speakers whose TOEFL scores are known [59], we observed a strong negative correlation between the speaker-specific WERs of Baseline-I and the speakers' TOEFL scores, r(8) ≈ −0.77, p < 0.01 (a sketch of this computation follows these notes).
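For readers who want to reproduce a statistic of the form in Note 13, the sketch below computes Pearson's r with SciPy. The arrays are invented placeholders, not the chapter's data; with n = 10 speakers, the degrees of freedom are n − 2 = 8, matching the reported r(8).

```python
# Hedged illustration of the correlation in Note 13; the values below are
# made-up placeholders, not the chapter's per-speaker measurements.
from scipy.stats import pearsonr

toefl_scores = [85, 88, 90, 92, 95, 97, 99, 101, 105, 110]                   # hypothetical
speaker_wers = [0.38, 0.35, 0.33, 0.31, 0.28, 0.26, 0.24, 0.22, 0.18, 0.15]  # hypothetical

r, p = pearsonr(toefl_scores, speaker_wers)
print(f"r({len(toefl_scores) - 2}) = {r:.2f}, p = {p:.4f}")  # df = n - 2 for Pearson's r
```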

References

  1. Baevski, A., Schneider, S., Auli, M.: vq-wav2vec: Self-supervised learning of discrete speech representations (2019). Preprint arXiv:1910.05453

  2. Baevski, A., Zhou, H., Mohamed, A., Auli, M.: wav2vec 2.0: A framework for self-supervised learning of speech representations (2020). Preprint arXiv:2006.11477

  3. Bearman, A., Josund, K., Fiore, G.: Accent conversion using artificial neural networks. Technical report, Stanford University (2017)

  4. Chan, W., Jaitly, N., Le, Q., Vinyals, O.: Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 4960–4964 (2016)

  5. Chorowski, J.K., Bahdanau, D., Serdyuk, D., Cho, K., Bengio, Y.: Attention-based models for speech recognition. In: Advances in Neural Information Processing Systems, pp. 577–585 (2015)

  6. Chung, Y.A., Hsu, W.N., Tang, H., Glass, J.: An unsupervised autoregressive model for speech representation learning (2019). Preprint arXiv:1904.03240

  7. Crystal, D.: English as a Global Language. Ernst Klett Sprachen, Stuttgart (2003)

  8. Das, N., Bodapati, S., Sunkara, M., Srinivasan, S., Chau, D.H.: Best of both worlds: Robust accented speech recognition with adversarial transfer learning (2021). Preprint arXiv:2103.05834

  9. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding (2018). Preprint arXiv:1810.04805

  10. Futami, H., Inaguma, H., Ueno, S., Mimura, M., Sakai, S., Kawahara, T.: Distilling the knowledge of BERT for sequence-to-sequence ASR (2020). Preprint arXiv:2008.03822

  11. Graves, A.: Connectionist temporal classification. In: Supervised Sequence Labelling with Recurrent Neural Networks, pp. 61–93. Springer, Berlin (2012)

  12. Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: International Conference on Machine Learning, pp. 1764–1772 (2014)

  13. Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376 (2006)

  14. Gulati, A., Qin, J., Chiu, C.C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y., et al.: Conformer: Convolution-augmented transformer for speech recognition (2020). Preprint arXiv:2005.08100

  15. Guliani, D., Beaufays, F., Motta, G.: Training speech recognition models with federated learning: A quality/cost framework. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 3080–3084 (2021)

  16. Hannun, A.Y., Maas, A.L., Jurafsky, D., Ng, A.Y.: First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs (2014). Preprint arXiv:1408.2873

  17. Heafield, K.: KenLM: Faster and smaller language model queries. In: Proceedings of the Sixth Workshop on Statistical Machine Translation, pp. 187–197 (2011)

  18. Hori, T., Cho, J., Watanabe, S.: End-to-end speech recognition with word-based RNN language models. In: 2018 IEEE Spoken Language Technology Workshop (SLT), IEEE, pp. 389–396 (2018)

  19. Hou, J., Guo, P., Sun, S., Soong, F.K., Hu, W., Xie, L.: Domain adversarial training for improving keyword spotting performance of ESL speech. In: ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 8122–8126 (2019)

  20. Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for NLP. In: International Conference on Machine Learning, PMLR, pp. 2790–2799 (2019)

  21. Hu, H., Yang, X., Raeesy, Z., Guo, J., Keskin, G., Arsikere, H., Rastrow, A., Stolcke, A., Maas, R.: ReDAT: Accent-invariant representation for end-to-end ASR by domain adversarial training with relabeling. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 6408–6412 (2021)

  22. Hwang, K., Sung, W.: Character-level incremental speech recognition with recurrent neural networks. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 5335–5339 (2016)

  23. Jain, A., Upreti, M., Jyothi, P.: Improved accented speech recognition using accent embeddings and multi-task learning. In: Interspeech, pp. 2454–2458 (2018)

  24. Kahn, J., Rivière, M., Zheng, W., Kharitonov, E., Xu, Q., Mazaré, P.E., Karadayi, J., Liptchinsky, V., Collobert, R., Fuegen, C., et al.: Libri-light: A benchmark for ASR with limited or no supervision. In: ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 7669–7673 (2020)

  25. Kominek, J., Black, A.W.: CMU ARCTIC databases for speech synthesis (2003)

  26. Kunze, J., Kirsch, L., Kurenkov, I., Krug, A., Johannsmeier, J., Stober, S.: Transfer learning for speech recognition on a budget (2017). Preprint arXiv:1706.00290

  27. Li, X., Wang, C., Tang, Y., Tran, C., Tang, Y., Pino, J., Baevski, A., Conneau, A., Auli, M.: Multilingual speech translation with efficient finetuning of pretrained models (2020). Preprint arXiv:2010.12829

  28. Ling, S., Liu, Y., Salazar, J., Kirchhoff, K.: Deep contextualized acoustic representations for semi-supervised speech recognition. In: ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 6429–6433 (2020)

  29. Liu, A.T., Yang, S.W., Chi, P.H., Hsu, P.C., Lee, H.Y.: Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders. In: ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 6419–6423 (2020)

  30. Liu, S., Wang, D., Cao, Y., Sun, L., Wu, X., Kang, S., Wu, Z., Liu, X., Su, D., Yu, D., et al.: End-to-end accent conversion without using native utterances. In: ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 6289–6293 (2020)

  31. Livescu, K., Glass, J.: Lexical modeling of non-native speech for automatic speech recognition. In: 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100), IEEE, vol. 3, pp. 1683–1686 (2000)

  32. Lowerre, B.T.: The Harpy Speech Recognition System. Carnegie Mellon University (1976)

  33. Maas, A., Xie, Z., Jurafsky, D., Ng, A.Y.: Lexicon-free conversational speech recognition with neural networks. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 345–354 (2015)

  34. Matassoni, M., Gretter, R., Falavigna, D., Giuliani, D.: Non-native children speech recognition through transfer learning. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 6229–6233 (2018)

  35. Meister, C., Vieira, T., Cotterell, R.: If beam search is the answer, what was the question? (2020). Preprint arXiv:2010.02650

  36. Miao, Y., Gowayyed, M., Metze, F.: EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), IEEE, pp. 167–174 (2015)

  37. Oord, A.V.D., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding (2018). Preprint arXiv:1807.03748

  38. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2009)

  39. Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: An ASR corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 5206–5210 (2015)

  40. Park, D.S., Chan, W., Zhang, Y., Chiu, C.C., Zoph, B., Cubuk, E.D., Le, Q.V.: SpecAugment: A simple data augmentation method for automatic speech recognition (2019). Preprint arXiv:1904.08779

  41. Ping, T.T.: Automatic speech recognition for non-native speakers. PhD thesis, Université Joseph-Fourier-Grenoble I (2008)

  42. Radzikowski, K., Wang, L., Yoshie, O., Nowak, R.: Accent modification for speech recognition of non-native speakers using neural style transfer. EURASIP J. Audio Speech Music Process. 2021(1), 1–10 (2021)

  43. Sak, H., Senior, A., Rao, K., Irsoy, O., Graves, A., Beaufays, F., Schalkwyk, J.: Learning acoustic frame labeling for speech recognition with recurrent neural networks. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 4280–4284 (2015)

  44. Schneider, S., Baevski, A., Collobert, R., Auli, M.: wav2vec: Unsupervised pre-training for speech recognition (2019). Preprint arXiv:1904.05862

  45. Shi, X., Yu, F., Lu, Y., Liang, Y., Feng, Q., Wang, D., Qian, Y., Xie, L.: The Accented English Speech Recognition Challenge 2020: Open datasets, tracks, baselines, results and methods. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 6918–6922 (2021)

  46. Shibano, T., Zhang, X., Li, M.T., Cho, H., Sullivan, P., Abdul-Mageed, M.: Speech technology for everyone: Automatic speech recognition for non-native English with transfer learning (2021). Preprint arXiv:2110.00678

  47. Sun, S., Yeh, C.F., Hwang, M.Y., Ostendorf, M., Xie, L.: Domain adversarial training for accented speech recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 4854–4858 (2018)

  48. Synnaeve, G., Xu, Q., Kahn, J., Likhomanenko, T., Grave, E., Pratap, V., Sriram, A., Liptchinsky, V., Collobert, R.: End-to-end ASR: From supervised to semi-supervised learning with modern architectures (2019). Preprint arXiv:1911.08460

  49. Turan, M.A.T., Vincent, E., Jouvet, D.: Achieving multi-accent ASR via unsupervised acoustic model adaptation. In: INTERSPEECH 2020 (2020)

  50. Viglino, T., Motlicek, P., Cernak, M.: End-to-end accented speech recognition. In: Interspeech, pp. 2140–2144 (2019)

  51. Wang, D., Zheng, T.F.: Transfer learning for speech and language processing. In: 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), IEEE, pp. 1225–1237 (2015)

  52. Wang, Z., Schultz, T., Waibel, A.: Comparison of acoustic model adaptation techniques on non-native speech. In: 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), IEEE, vol. 1, pp. I–I (2003)

  53. Wang, Y., Luan, H., Yuan, J., Wang, B., Lin, H.: LAIX corpus of Chinese learner English: Towards a benchmark for L2 English ASR. In: INTERSPEECH, pp. 414–418 (2020)

  54. Wang, Y., Mohamed, A., Le, D., Liu, C., Xiao, A., Mahadeokar, J., Huang, H., Tjandra, A., Zhang, X., Zhang, F., et al.: Transformer-based acoustic modeling for hybrid speech recognition. In: ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 6874–6878 (2020)

  55. Watanabe, S., Hori, T., Kim, S., Hershey, J.R., Hayashi, T.: Hybrid CTC/attention architecture for end-to-end speech recognition. IEEE J. Sel. Top. Signal Process. 11(8), 1240–1253 (2017)

  56. Xu, Q., Baevski, A., Likhomanenko, T., Tomasello, P., Conneau, A., Collobert, R., Synnaeve, G., Auli, M.: Self-training and pre-training are complementary for speech recognition. In: ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp. 3030–3034 (2021)

  57. Yu, W., Freiwald, J., Tewes, S., Huennemeyer, F., Kolossa, D.: Federated learning in ASR: Not as easy as you think. In: Speech Communication; 14th ITG Conference, VDE, pp. 1–5 (2021)

  58. Zenkel, T., Sanabria, R., Metze, F., Niehues, J., Sperber, M., Stüker, S., Waibel, A.: Comparison of decoding strategies for CTC acoustic models (2017). Preprint arXiv:1708.04469

  59. Zhao, G., Sonsaat, S., Silpachai, A.O., Lucic, I., Chukharev-Hudilainen, E., Levis, J., Gutierrez-Osuna, R.: L2-ARCTIC: A non-native English speech corpus. In: INTERSPEECH (2018)


Acknowledgements

We would like to thank Mia Li, Jeremy Zhang, and Haejin Cho for contributing to an initial phase of this work.

Author information

Corresponding author

Correspondence to Peter Sullivan.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter


Cite this chapter

Sullivan, P., Shibano, T., Abdul-Mageed, M. (2023). Improving Automatic Speech Recognition for Non-native English with Transfer Learning and Language Model Decoding. In: Abbas, M. (ed.) Analysis and Application of Natural Language and Speech Processing. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-031-11035-1_2


  • DOI: https://doi.org/10.1007/978-3-031-11035-1_2


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-11034-4

  • Online ISBN: 978-3-031-11035-1

  • eBook Packages: Computer Science, Computer Science (R0)
