Non-intrusive speech quality prediction based on the blind estimation of clean speech and the i-vector framework

Avila, Anderson R.; O’Shaughnessy, Douglas; Falk, Tiago H.

doi:10.1007/s41233-020-00040-3

Research Article
Published: 06 October 2020

Volume 5, article number 11, (2020)
Quality and User Experience

Anderson R. Avila ORCID: orcid.org/0000-0002-3088-5116¹,
Douglas O’Shaughnessy¹ &
Tiago H. Falk¹

Abstract

Output-based instrumental speech quality assessment relies only on the received (processed) signal to predict quality. Such methods are called non-intrusive and are crucial in speech applications where reference clean signals are not accessible. In this paper, we propose a new non-intrusive instrumental quality measure based on the similarity between two i-vectors. As the reference clean signal is not available, the reference i-vector representation cannot be extracted directly from it. Therefore, we propose the use of a clean speech Gaussian mixture model to estimate the clean speech spectrum from its degraded counterpart. Next, the two respective i-vector representations are extracted, and either the cosine or the Euclidean similarity metric is computed as a correlate of speech quality. Here, the clean speech model is trained using RASTA-filtered mel-frequency cepstral coefficients extracted from a pool of clean speech files, thus allowing us to attain a model of clean spectrum characteristics. The proposed method is evaluated on noisy, reverberant, and enhanced speech conditions. Experimental results show that the proposed system provides higher correlations with perceptual speech quality than several benchmark non-intrusive measures, especially for noisy and enhanced speech.
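The final scoring step described in the abstract, comparing two i-vectors with a cosine or Euclidean metric, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the variable names, and the four-dimensional toy vectors are assumptions made for illustration, and the blind clean-speech estimation and i-vector extraction stages are not shown. The Euclidean variant is written here as a distance, so smaller values mean the two vectors are closer.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy vectors standing in for the two i-vectors:
# one from the blindly estimated clean speech, one from the degraded signal.
w_estimated_clean = [0.9, -0.2, 0.4, 0.1]
w_degraded = [0.7, 0.1, 0.5, -0.3]

cos_score = cosine_similarity(w_estimated_clean, w_degraded)
l2_dist = euclidean_distance(w_estimated_clean, w_degraded)
print(f"cosine similarity:  {cos_score:.4f}")
print(f"Euclidean distance: {l2_dist:.4f}")
```

Under this reading, a cosine score near 1 or a small distance would suggest that the degraded signal stays close to the estimated clean reference. How such a score is related to subjective quality ratings is established in the full article and is not reproduced here.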

Notes

In the case of P.563, speech samples were resampled to 8 kHz.
This is an out-of-scope use of POLQA, as the input and reference signals are wideband, whereas reference signals are expected to be super-wideband [36].

References

Liotou E et al (2015) Quality of experience management in mobile cellular networks: key issues and design challenges. IEEE Commun Mag 53(7):145–153
Cauchi B et al (2016) Perceptual and instrumental evaluation of the perceived level of reverberation. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 629–633. IEEE
Gastaldo P, Zunino R, Redi J (2013) Supporting visual quality assessment with machine learning. EURASIP J Image Video Process 2013(1):54
Issa O et al (2012) Quality-of-experience perception for video streaming services: Preliminary subjective and objective results. In Proceedings of The 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference, pp 1–9. IEEE
Jin C, Kubichek R (1996) Vector quantization techniques for output-based objective speech quality. In 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, vol. 1, pp 491–494. IEEE
Möller S et al (2011) Speech quality estimation: models and trends. IEEE Signal Process Mag 28(6):18–28
ITU-T Recommendation P.563. (2004) Single-ended method for objective speech quality assessment in narrow-band telephony applications
Falk TH et al (2015) Objective quality and intelligibility prediction for users of assistive listening devices: advantages and limitations of existing tools. IEEE Signal Process Mag 32(2):114–124
Avila AR et al (2019) Non-intrusive speech quality assessment using neural networks. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 631–635. IEEE
Gamper H et al (2019) Intrusive and non-intrusive perceptual speech quality assessment using a convolutional neural network. In 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp 85–89. IEEE
Soni MH, Patil HA (2016) Novel subband autoencoder features for non-intrusive quality assessment of noise suppressed speech. In INTERSPEECH, pp 3708–3712
Cauchi B et al (2019) Non-intrusive speech quality prediction using modulation energies and lstm-network. IEEE/ACM Trans Audio Speech Lang Process 27(7):1151–1163
Avila AR et al (2019) Blind channel response estimation for replay attack detection. Proc. Interspeech 2893–2897
Avila AR et al (2019) Intrusive quality measurement of noisy and enhanced speech based on i-vector similarity. In Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pp 1–5. IEEE
Hermansky H, Morgan N (1994) Rasta processing of speech. IEEE Trans Speech Audio Process 2(4):578–589
Falk TH, Chan W-Y (2006) Single-ended speech quality measurement using machine learning methods. IEEE Trans Audio Speech Lang Process 14(6):1935–1947
Gaubitch ND, Brookes M, Naylor PA (2013) Blind channel magnitude response estimation in speech using spectrum classification. IEEE Trans Audio Speech Lang Process 21(10):2162–2171
Kenny P et al (2007) Joint factor analysis versus eigenchannels in speaker recognition. IEEE Trans Audio Speech Lang Process 15(4):1435–1447
Dehak N et al (2011) Front-end factor analysis for speaker verification. IEEE Trans Audio Speech Lang Process 19(4):788–798
Hansen JHL, Hasan T (2015) Speaker recognition by machines and humans: a tutorial review. IEEE Signal Process Mag 32(6):74–99
Kenny P, Boulianne G, Dumouchel P (2005) Eigenvoice modeling with sparse training data. IEEE Trans Speech Audio Process 13(3):345–354
Garcia-Romero D, Espy-Wilson CY (2011) Analysis of i-vector length normalization in speaker recognition systems. In Twelfth Annual Conference of the International Speech Communication Association
Van der Maaten L, Hinton G (2008) Visualizing data using t-sne. Journal of Machine Learning Research 9(Nov)
Series B (2014) Recommendation itu-r bs. 1534-3 method for the subjective assessment of intermediate quality level of audio systems. International Telecommunication Union Radio Communication Assembly
Santos JF, Falk TH (2019) Towards the development of a non-intrusive objective quality measure for dnn-enhanced speech. In 2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), pp 1–6. IEEE
Schoeffler M et al (2018) webmushra-a comprehensive framework for web-based listening tests. J Open Res Softw 6(1)
Valentini-Botinhao C et al (2017) Noisy speech database for training speech enhancement algorithms and tts models. University of Edinburgh. School of Informatics, Centre for Speech Technology Research (CSTR)
Pascual S, Bonafonte A, Serrà J (2017) Segan: Speech enhancement generative adversarial network. arXiv preprint arXiv:1703.09452
Veaux C, Yamagishi J, King S (2013) The voice bank corpus: Design, collection and data analysis of a large regional accent speech database. In 2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), pp 1–4. IEEE
Varga A, Steeneken HJM (1993) Assessment for automatic speech recognition: Ii. noisex-92: a database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun 12(3):247–251
Santos JF, Falk TH (2018) Speech dereverberation with context-aware recurrent neural networks. IEEE/ACM Trans Audio Speech Lang Process 26(7):1236–1246
Williamson DS, Wang D (2017) Time-frequency masking in the complex domain for speech dereverberation and denoising. IEEE/ACM Trans Audio Speech Lang Process 25(7):1492–1501
Wu B et al (2016) A reverberation-time-aware approach to speech dereverberation based on deep neural networks. IEEE/ACM Trans Audio Speech Lang Process 25(1):102–111
ITU-T Recommendation P.862. Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs, February 2001
Rix A, Beerends J, Hollier M, Hekstra A (2001) Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs. pp 749–752
ITU-T Recommendation P.863. (2008) Perceptual objective listening quality assessment: An advanced objective perceptual method for end-to-end listening speech quality evaluation of fixed, mobile, and IP-based networks and speech codecs covering narrowband, wideband, and super-wideband signals. Technical report
Beerends J et al (2013) Perceptual objective listening quality assessment (polqa), the third generation itu-t standard for end-to-end speech quality measurement part ii: Perceptual model. Audio Eng Soc 61(6)
Ma J, Hu Y, Loizou PC (2009) Objective measures for predicting speech intelligibility in noisy conditions based on new band-importance functions. J Acoust Soc Am 125(5):3387–3405
Janssen JH (1957) A method for the calculation of the speech intelligibility under conditions of reverberation and noise. Acta Acustica united with Acustica 7(5):305–310
Taal CH et al (2010) A short-time objective intelligibility measure for time-frequency weighted noisy speech. pp 4214–4217
Malfait L, Berger J, Kastner M (2006) P.563 - the ITU-T standard for single-ended speech quality assessment. IEEE Trans Audio Speech Lang Process 14(6):1924–1934
ITU-T Recommendation P.830. (1996) Subjective performance assessment of telephone-band and wideband digital codecs
Falk TH, Zheng C, Chan W-Y (2010) A non-intrusive quality and intelligibility measure of reverberant and dereverberated speech. IEEE Trans Audio Speech Lang Process 18(7):1766–1774
Santos JF, Senoussaoui M, Falk TH (2014) An improved non-intrusive intelligibility metric for noisy and reverberant speech. pp 55–59
Fu SW et al (2018) Quality-net: An end-to-end non-intrusive speech quality assessment model based on blstm. arXiv preprint arXiv:1808.05344
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436
ITU-T Recommendation P.800. (1998) Recommendation P.800: Methods for subjective determination of transmission quality
Rix AW (2003) Comparison between subjective listening quality and p. 862 pesq score. Proc. Measurement of Speech and Audio Quality in Networks (MESAQIN’03), Prague, Czech Republic
Shcherbakov MV et al (2013) A survey of forecast error measures. World Appl Sci J 24(24):171–176
ITU-T Recommendation P.862.1. (2003) Mapping function for transforming p.862 raw result scores to mos-lq
Falk TH, Chan W-Y (2010) Temporal dynamics for blind measurement of room acoustical parameters. IEEE Trans Instrum Meas 59(4):978–989
Kenny P et al (2008) A study of interspeaker variability in speaker verification. IEEE Trans Audio Speech Lang Process 16(5):980–988
ITU-T Recommendation P.835. (2003) Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm. International Telecommunication Union, Geneva
Cauchi B et al (2015) Combination of MVDR beamforming and single-channel spectral processing for enhancing noisy and reverberant speech. EURASIP J Adv Signal Process 2015:1–12
Thiemann J et al (2016) Speech enhancement for multimicrophone binaural hearing aids aiming to preserve the spatial auditory scene. EURASIP J Adv Signal Process 2016(1):12

Acknowledgements

The authors would like to thank the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), the Fonds de recherche du Québec - Nature et Technologies (FRQNT), and the Natural Sciences and Engineering Research Council of Canada (NSERC) for their financial support.

Author information

Authors and Affiliations

Institut national de la recherche scientifique, 800, rue de la Gauchetière Ouest, Montréal (Quebec), H5A 1K6, Canada
Anderson R. Avila, Douglas O’Shaughnessy & Tiago H. Falk

Corresponding author

Correspondence to Anderson R. Avila.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Avila, A.R., O’Shaughnessy, D. & Falk, T.H. Non-intrusive speech quality prediction based on the blind estimation of clean speech and the i-vector framework. Qual User Exp 5, 11 (2020). https://doi.org/10.1007/s41233-020-00040-3

Received: 19 December 2019
Published: 06 October 2020
DOI: https://doi.org/10.1007/s41233-020-00040-3
