
Meta-reinforcement learning based few-shot speech reconstruction for non-intrusive speech quality assessment

Published in: Applied Intelligence

Abstract

Speech quality assessment (SQA) is important for modern communication systems and Quality of Service (QoS). Non-intrusive SQA has become a major research direction because it does not require the original speech. However, intrusive algorithms still outperform non-intrusive methods, since prior information about the original signal is available at test time. The objective of this paper is to perform non-intrusive evaluation of noisy speech quality in "an intrusive way". To reconstruct the original speech, a meta-reinforcement learning method, MetaRL-SR, is proposed, focusing on quasi-clean speech reconstruction from noisy speech with few training samples. First, a reinforcement learning based meta-learner is proposed, which initializes its actions with a finite set of T-F masks, and the related action-value function is developed. Second, to optimize the model, a reward calculation for reinforcement learning is developed based on user perception. Third, the model-agnostic meta-learning (MAML) algorithm is applied to fully utilize the limited data, improving the generalization of the meta-learner to new tasks. Finally, the quasi-clean speech is used as the reference in the International Telecommunication Union (ITU) standard PESQ intrusive model, and the distortion error between the noisy and quasi-clean speech is computed to estimate the Mean Opinion Score (MOS) of the noisy speech. Experimental results show that, in terms of Pearson correlation and standard deviation of error, this work achieves improvements of at least 5.8%~7.3% for 1-shot cases and 5.4%~6.8% for 5-shot cases over state-of-the-art DNN based SQA methods in challenging conditions, where the environment noises are diverse and the signals are non-stationary.
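The few-shot adaptation step described above can be illustrated with a minimal first-order MAML sketch. This is not the paper's implementation: it assumes a toy linear T-F mask estimator, a finite-difference gradient, and an MSE surrogate in place of the perception-based reward (the actual method uses a PESQ-style perceptual reward and a reinforcement-learning action-value function). All names and dimensions here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
F = 8  # toy number of frequency bins per T-F frame


def predict_mask(w, noisy):
    # Sigmoid mask in [0, 1] per T-F bin (hypothetical linear model).
    return 1.0 / (1.0 + np.exp(-(noisy * w)))


def loss(w, noisy, clean):
    # MSE surrogate for the (negated) perceptual reward:
    # distance between masked noisy magnitudes and clean magnitudes.
    return np.mean((predict_mask(w, noisy) * noisy - clean) ** 2)


def grad(w, noisy, clean, eps=1e-4):
    # Central finite-difference gradient -- adequate for a sketch.
    g = np.zeros_like(w)
    for i in range(len(w)):
        d = np.zeros_like(w)
        d[i] = eps
        g[i] = (loss(w + d, noisy, clean) - loss(w - d, noisy, clean)) / (2 * eps)
    return g


def maml_outer_step(w, tasks, inner_lr=0.5, outer_lr=0.2, inner_steps=1):
    # First-order MAML: adapt on each task's support set,
    # then update meta-parameters with the query-set gradients.
    meta_grad = np.zeros_like(w)
    for (sup_x, sup_y), (qry_x, qry_y) in tasks:
        w_task = w.copy()
        for _ in range(inner_steps):
            w_task -= inner_lr * grad(w_task, sup_x, sup_y)
        meta_grad += grad(w_task, qry_x, qry_y)
    return w - outer_lr * meta_grad / len(tasks)


def make_task():
    # One noise environment = one task; 1-shot, support == query here.
    clean = rng.uniform(0.5, 1.0, F)
    noisy = clean + rng.uniform(0.0, 0.5, F)
    return (noisy, clean), (noisy, clean)


w = np.zeros(F)
tasks = [make_task() for _ in range(4)]
before = np.mean([loss(w, qx, qy) for _, (qx, qy) in tasks])
for _ in range(20):
    w = maml_outer_step(w, tasks)
after = np.mean([loss(w, qx, qy) for _, (qx, qy) in tasks])
```

After meta-training, `after` should be lower than `before`: the meta-learner's initial mask parameters have moved toward a point from which one inner-loop step suffices on a new noise environment, which is the behaviour the 1-shot and 5-shot experiments measure.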



Acknowledgments

This work is supported by the Foshan University Research Foundation for Advanced Talents (GG07005), the Natural Science Foundation of Guangdong Province (2018A0303130082, 2019A1515111148), Guangdong Province Colleges and Universities Young Innovative Talent Project (2019KQNCX168).

Author information

Correspondence to Weili Zhou.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhou, W., Lai, J., Liao, Y. et al. Meta-reinforcement learning based few-shot speech reconstruction for non-intrusive speech quality assessment. Appl Intell 53, 14146–14161 (2023). https://doi.org/10.1007/s10489-022-04165-0

