
ConWST: Non-native Multi-source Knowledge Distillation for Low Resource Speech Translation

  • Conference paper
Cognitive Systems and Information Processing (ICCSIP 2021)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1515)


Abstract

Most end-to-end speech translation (ST) models perform poorly in the absence of source speech information. For low-resource non-native speech translation, we therefore propose a self-supervised bidirectional distillation system. It improves ST performance by exploiting large amounts of unlabeled speech and text in a complementary way, without adding source information. The framework is built on an attentional sequence-to-sequence (seq2seq) model in which wav2vec 2.0 pre-training guides a Conformer encoder to reconstruct the acoustic representation, and the decoder generates target tokens by fusing out-of-domain embeddings. We investigate byte-pair encoding (BPE) and compare it with several fusion techniques. Within the ConWST framework, we conduct experiments on Swahili-to-English translation. The experimental results show that the framework outperforms the baseline model, making it one of the strongest of the transcription methods evaluated.
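The distillation component named in the title follows the general knowledge-distillation recipe: a student model is trained to match a teacher's temperature-softened output distribution. The following is a minimal NumPy sketch of that generic loss, not the paper's bidirectional multi-source scheme; all function names and the temperature value are illustrative assumptions.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradients keep a comparable magnitude across temperatures."""
    p = softmax(teacher_logits, T)  # soft targets from the teacher
    q = softmax(student_logits, T)  # student predictions
    return float(np.sum(p * (np.log(p) - np.log(q))) * T * T)
```

When student and teacher agree exactly the loss is zero; any mismatch in the softened distributions yields a positive penalty, which is what drives the student toward the teacher's behavior.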
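The abstract mentions byte-pair encoding for the target side. As a toy illustration of what BPE learning does (repeatedly merging the most frequent adjacent symbol pair), here is a self-contained sketch in the style of Sennrich et al.; the corpus and function name are made up for the example and this is not the paper's implementation.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a word-frequency dict.
    Words are split into characters plus an end-of-word marker."""
    vocab = {tuple(word) + ("</w>",): count for word, count in corpus.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, 10)
```

Frequent subword units such as "est" emerge after a few merges, which is why BPE gives a compact open vocabulary for morphologically rich targets.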



Author information


Correspondence to Wenbo Zhu.


Copyright information

© 2022 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Zhu, W. et al. (2022). ConWST: Non-native Multi-source Knowledge Distillation for Low Resource Speech Translation. In: Sun, F., Hu, D., Wermter, S., Yang, L., Liu, H., Fang, B. (eds) Cognitive Systems and Information Processing. ICCSIP 2021. Communications in Computer and Information Science, vol 1515. Springer, Singapore. https://doi.org/10.1007/978-981-16-9247-5_10


  • DOI: https://doi.org/10.1007/978-981-16-9247-5_10

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-9246-8

  • Online ISBN: 978-981-16-9247-5

  • eBook Packages: Computer Science, Computer Science (R0)
