Abstract
In the absence of source-language transcriptions, most end-to-end speech translation (ST) models perform poorly. For low-resource, non-native speech translation, we therefore propose ConWST, a self-supervised bidirectional distillation framework. It improves ST performance by exploiting large amounts of unlabeled speech and text in a complementary way, without requiring additional source-side supervision. The framework is built on an attentional sequence-to-sequence model: a wav2vec 2.0 pre-trained model guides the Conformer encoder in reconstructing acoustic representations, and the decoder generates target tokens by fusing out-of-domain embeddings. We investigate the use of byte pair encoding (BPE) and compare several fusion techniques. We conduct experiments on Swahili-to-English translation under the ConWST framework. The results show that ConWST outperforms the baseline model, suggesting it is among the stronger approaches to low-resource speech translation.
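The abstract mentions byte pair encoding (BPE) for subword segmentation of the target text. As background, a minimal sketch of the classic BPE merge-learning step is shown below; this is a generic illustration of the technique, not the authors' implementation, and the corpus, symbol format (space-separated symbols with a `</w>` end-of-word marker), and function names are illustrative assumptions.

```python
from collections import Counter

def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the corpus vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every adjacent occurrence of `pair` into a single new symbol."""
    new_vocab = {}
    for word, freq in vocab.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged = ' '.join(out)
        new_vocab[merged] = new_vocab.get(merged, 0) + freq
    return new_vocab

def learn_bpe(vocab, num_merges):
    """Greedily learn a list of BPE merge operations from word frequencies."""
    merges = []
    for _ in range(num_merges):
        stats = get_pair_stats(vocab)
        if not stats:
            break
        best = max(stats, key=stats.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Toy corpus: words pre-split into characters, with frequencies.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges, final_vocab = learn_bpe(corpus, 10)
```

The learned merge list is then applied in order to segment unseen words, so frequent substrings (and eventually frequent whole words) become single tokens while rare words fall back to smaller subword units.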
Copyright information
© 2022 Springer Nature Singapore Pte Ltd.
Cite this paper
Zhu, W. et al. (2022). ConWST: Non-native Multi-source Knowledge Distillation for Low Resource Speech Translation. In: Sun, F., Hu, D., Wermter, S., Yang, L., Liu, H., Fang, B. (eds) Cognitive Systems and Information Processing. ICCSIP 2021. Communications in Computer and Information Science, vol 1515. Springer, Singapore. https://doi.org/10.1007/978-981-16-9247-5_10
Print ISBN: 978-981-16-9246-8
Online ISBN: 978-981-16-9247-5