Abstract
In the absence of source-language transcriptions, most end-to-end speech translation (ST) models perform poorly. For low-resource, non-native speech translation, we therefore propose ConWST, a self-supervised bidirectional distillation framework. It improves ST performance by exploiting large amounts of unlabeled speech and text in a complementary way, without requiring additional source-side supervision. The framework is built on an attentional sequence-to-sequence model: a wav2vec 2.0 pre-trained model guides the Conformer encoder in reconstructing acoustic representations, and the decoder generates target tokens by fusing out-of-domain embeddings. We investigate the use of byte pair encoding (BPE) and compare several fusion techniques. We conduct experiments on Swahili-to-English translation under the ConWST framework. The results show that ConWST outperforms the baseline model, suggesting it is among the stronger approaches to low-resource speech translation.
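The abstract mentions byte pair encoding (BPE) for subword segmentation of the target text. As background, a minimal sketch of the classic BPE merge-learning step is shown below; this is a generic illustration of the technique, not the authors' implementation, and the corpus, symbol format (space-separated symbols with a `</w>` end-of-word marker), and function names are illustrative assumptions.

```python
from collections import Counter

def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the corpus vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every adjacent occurrence of `pair` into a single new symbol."""
    new_vocab = {}
    for word, freq in vocab.items():
        symbols = word.split()
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged = ' '.join(out)
        new_vocab[merged] = new_vocab.get(merged, 0) + freq
    return new_vocab

def learn_bpe(vocab, num_merges):
    """Greedily learn a list of BPE merge operations from word frequencies."""
    merges = []
    for _ in range(num_merges):
        stats = get_pair_stats(vocab)
        if not stats:
            break
        best = max(stats, key=stats.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Toy corpus: words pre-split into characters, with frequencies.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2,
          "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges, final_vocab = learn_bpe(corpus, 10)
```

The learned merge list is then applied in order to segment unseen words, so frequent substrings (and eventually frequent whole words) become single tokens while rare words fall back to smaller subword units.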
Copyright information
© 2022 Springer Nature Singapore Pte Ltd.
Cite this paper
Zhu, W. et al. (2022). ConWST: Non-native Multi-source Knowledge Distillation for Low Resource Speech Translation. In: Sun, F., Hu, D., Wermter, S., Yang, L., Liu, H., Fang, B. (eds) Cognitive Systems and Information Processing. ICCSIP 2021. Communications in Computer and Information Science, vol 1515. Springer, Singapore. https://doi.org/10.1007/978-981-16-9247-5_10
Print ISBN: 978-981-16-9246-8
Online ISBN: 978-981-16-9247-5