A Pipelined Framework with Serialized Output Training for Overlapping Speech Recognition

  • Conference paper
Man-Machine Speech Communication (NCMMSC 2022)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1765)

Abstract

Far-field conditions, noise, reverberation, and overlapping speech make the cocktail party problem one of the greatest challenges in speech recognition. In this paper, we focus on the problem of overlapping speech and present a pipelined architecture with serialized output training (SOT). The baseline and the proposed method are evaluated on artificially mixed speech datasets generated from the AliMeeting corpus. Experimental results demonstrate that our proposed model outperforms the baseline even at high overlap ratios, yielding relative gains in character error rate (CER) of 10.8% at a 0.5 overlap ratio and 4.9% on average.
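
As background for the serialized output training scheme the proposed pipeline builds on, the sketch below shows the core idea of SOT: the reference transcriptions of all speakers in a mixed segment are serialized into a single target sequence, concatenated in first-in-first-out order of their start times and separated by a speaker-change token. This is a minimal sketch of the general SOT formulation; the token name "<sc>" and the helper function are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of SOT target construction for overlapped speech
# (illustrative assumption, not the paper's implementation).

SC_TOKEN = "<sc>"  # assumed name for the speaker-change token

def build_sot_target(utterances):
    """Serialize per-speaker references into one SOT target sequence.

    utterances: list of (start_time_seconds, transcription) pairs,
    one entry per speaker in the mixed segment.
    """
    # Concatenate references in first-in-first-out order of start time,
    # inserting the speaker-change token between speakers.
    ordered = sorted(utterances, key=lambda u: u[0])
    return f" {SC_TOKEN} ".join(text for _, text in ordered)

# Two overlapping speakers in one mixed segment:
print(build_sot_target([(1.2, "how are you"), (0.0, "good morning everyone")]))
# -> "good morning everyone <sc> how are you"
```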
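The abstract describes the evaluation data only as artificially mixed speech generated from AliMeeting at controlled overlap ratios. A hedged sketch of what such mixing might look like is given below; the mixing procedure, gain handling, and the definition of overlap ratio as overlapped duration divided by the shorter utterance's duration are all assumptions for illustration, not the paper's recipe.

```python
import numpy as np

def mix_with_overlap(wav_a, wav_b, overlap_ratio):
    """Overlay wav_b onto the tail of wav_a at a given overlap ratio.

    Assumption: overlap_ratio = overlapped duration / duration of the
    shorter utterance; the paper may define it differently.
    """
    overlap = int(overlap_ratio * min(len(wav_a), len(wav_b)))
    total = len(wav_a) + len(wav_b) - overlap
    mixed = np.zeros(total, dtype=np.float32)
    mixed[: len(wav_a)] += wav_a              # speaker A starts first
    mixed[len(wav_a) - overlap :] += wav_b    # speaker B overlaps A's tail
    return mixed

# Example at a 0.5 overlap ratio with dummy 16 kHz signals:
a = np.random.randn(16000).astype(np.float32)
b = np.random.randn(8000).astype(np.float32)
print(mix_with_overlap(a, b, 0.5).shape)  # (20000,)
```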

Acknowledgements

This research was funded in part by the National Natural Science Foundation of China (Grants No. 61876160 and No. 62001405) and in part by the Science and Technology Key Project of Fujian Province, China (Grant No. 2020HZ020005).

Author information

Correspondence to Qingyang Hong or Lin Li.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Li, T., Huang, L., Wang, F., Li, S., Hong, Q., Li, L. (2023). A Pipelined Framework with Serialized Output Training for Overlapping Speech Recognition. In: Zhenhua, L., Jianqing, G., Kai, Y., Jia, J. (eds) Man-Machine Speech Communication. NCMMSC 2022. Communications in Computer and Information Science, vol 1765. Springer, Singapore. https://doi.org/10.1007/978-981-99-2401-1_10

Download citation

  • DOI: https://doi.org/10.1007/978-981-99-2401-1_10

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-2400-4

  • Online ISBN: 978-981-99-2401-1

  • eBook Packages: Computer Science, Computer Science (R0)
