Abstract
Far-field conditions, noise, reverberation, and overlapping speech make the cocktail party problem one of the greatest challenges in speech recognition. In this paper, we focus on the overlapping-speech problem and present a pipelined architecture with serialized output training (SOT). The baseline and the proposed method are evaluated on artificially mixed speech datasets generated from the AliMeeting corpus. Experimental results demonstrate that our proposed model outperforms the baseline even at high overlap ratios, yielding relative CER reductions of 10.8% at an overlap ratio of 0.5 and 4.9% on average.
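Serialized output training, as referenced above, concatenates the transcriptions of overlapping speakers into a single target sequence, typically in first-in-first-out order of speaking start time, separated by a speaker-change token. A minimal sketch of target construction, assuming this FIFO convention (the `<sc>` token name and the `build_sot_target` helper are illustrative, not from the paper):

```python
def build_sot_target(utterances, sc_token="<sc>"):
    """Serialize overlapping utterances into one SOT target sequence.

    utterances: list of (start_time, transcript) pairs for one mixture.
    Speakers are ordered first-in-first-out by start time and their
    transcripts joined with a speaker-change token.
    """
    ordered = sorted(utterances, key=lambda u: u[0])
    return f" {sc_token} ".join(text for _, text in ordered)


# Two overlapping speakers in one mixture; the earlier speaker comes first.
target = build_sot_target([(1.2, "how are you"), (0.0, "hello there")])
print(target)  # hello there <sc> how are you
```

Because the target order is fixed by start time, SOT avoids the permutation ambiguity that permutation-invariant training must resolve explicitly.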
Acknowledgements
This research was funded by the National Natural Science Foundation of China (Grant No. 61876160 and No. 62001405) and in part by the Science and Technology Key Project of Fujian Province, China (Grant No. 2020HZ020005).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Li, T., Huang, L., Wang, F., Li, S., Hong, Q., Li, L. (2023). A Pipelined Framework with Serialized Output Training for Overlapping Speech Recognition. In: Zhenhua, L., Jianqing, G., Kai, Y., Jia, J. (eds) Man-Machine Speech Communication. NCMMSC 2022. Communications in Computer and Information Science, vol 1765. Springer, Singapore. https://doi.org/10.1007/978-981-99-2401-1_10
DOI: https://doi.org/10.1007/978-981-99-2401-1_10
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-2400-4
Online ISBN: 978-981-99-2401-1
eBook Packages: Computer Science (R0)