Abstract
The ideal binary mask (IBM) is widely considered to be the benchmark for time–frequency-based sound source separation techniques such as computational auditory scene analysis (CASA). However, it is well known that binary masking introduces objectionable distortion, especially musical noise. This can make binary masking unsuitable for sound source separation applications where the output is auditioned. It has been suggested that soft masking reduces musical noise and leads to a higher quality output. A previously defined soft mask, the ideal ratio mask (IRM), is found to have similar properties to the IBM, may correspond more closely to auditory processes, and offers additional computational advantages. Consequently, the IRM is proposed as the goal of CASA. To further support this position, a number of studies are reviewed that show soft masks to provide superior performance to the IBM in applications such as automatic speech recognition and speech intelligibility. A brief empirical study provides additional evidence demonstrating the objective and perceptual superiority of the IRM over the IBM.
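The two masks contrasted in the abstract can be sketched using their standard literature definitions, which this chapter's formulations are assumed to follow: the IBM assigns 1 to a time–frequency unit whose local SNR exceeds a local criterion (LC, in dB) and 0 otherwise, while the IRM is a soft gain given by the ratio of target energy to total (target plus noise) energy. The snippet below is a minimal illustration on synthetic magnitude spectrograms, not the authors' implementation:

```python
import numpy as np

def ideal_binary_mask(target_mag, noise_mag, lc_db=0.0):
    """IBM: 1 where the local target-to-noise ratio exceeds the
    local criterion (LC, in dB), 0 elsewhere."""
    snr_db = 20.0 * np.log10(target_mag / np.maximum(noise_mag, 1e-12))
    return (snr_db > lc_db).astype(float)

def ideal_ratio_mask(target_mag, noise_mag):
    """IRM: soft gain in [0, 1] given by the ratio of target energy
    to total (target + noise) energy in each time-frequency unit."""
    t2, n2 = target_mag ** 2, noise_mag ** 2
    return t2 / np.maximum(t2 + n2, 1e-12)

# Toy example: random magnitude "spectrograms" (freq bins x frames)
# stand in for STFT magnitudes of the target and interfering sources.
rng = np.random.default_rng(0)
target = rng.rayleigh(1.0, size=(257, 100))
noise = rng.rayleigh(0.5, size=(257, 100))

ibm = ideal_binary_mask(target, noise)
irm = ideal_ratio_mask(target, noise)

# Applying each mask to the mixture magnitude attenuates
# noise-dominated units; the IRM does so with graded gains,
# which is what avoids the abrupt on/off switching that
# produces musical noise under the IBM.
mixture = target + noise  # crude magnitude-domain approximation
separated_hard = ibm * mixture
separated_soft = irm * mixture
```

The hard mask discards every unit below LC outright, whereas the soft mask retains a proportionate share of each unit, which is the property the abstract credits with reducing musical noise.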
References
Anzalone, M., Calandruccio, L., Doherty, K., Carney, L.: Determination of the potential benefit of time-frequency gain manipulation. Ear Hear. 27(5), 480 (2006)
Araki, S., Makino, S., Sawada, H., Mukai, R.: Underdetermined blind separation of convolutive mixtures of speech with directivity pattern based mask and ICA. In: Puntonet, C., Prieto, A. (eds.) Independent Component Analysis and Blind Signal Separation. Lecture Notes in Computer Science, vol. 3195, pp. 898–905. Springer, Berlin (2004)
Araki, S., Makino, S., Sawada, H., Mukai, R.: Reducing musical noise by a fine-shift overlap-add method applied to source separation using a time-frequency mask. IEEE Int. Conf. Acoust. Speech Signal Proc. (ICASSP) III, 81–84 (2005)
Araki, S., Nesta, F., Vincent, E., Koldovský, Z., Nolte, G., Ziehe, A., Benichoux, A.: The 2011 signal separation evaluation campaign (SiSEC2011): audio source separation. In: Theis, F., Cichocki, A., Yeredor, A., Zibulevsky, M. (eds.) Latent Variable Analysis and Signal Separation. Lecture Notes in Computer Science, vol. 7191, pp. 414–422. Springer, Berlin, Heidelberg (2012)
Araki, S., Sawada, H., Mukai, R., Makino, S.: Blind sparse source separation with spatially smoothed time-frequency masking. In: International Workshop on Acoustic, Echo and Noise Control. Paris (2006)
Barker, J., Josifovski, L., Cooke, M.P., Green, P.D.: Soft decisions in missing data techniques for robust automatic speech recognition. In: Proceedings of International Conference on Spoken Language Processing, pp. 373–376 (2000)
Bell, A.J., Sejnowski, T.J.: An information-maximization approach to blind separation and blind deconvolution. Neural Comput. 7(6), 1129–1159 (1995)
Bregman, A.: The meaning of duplex perception: sounds as transparent objects. In: Schouten, M.E.H. (ed.) The Psychophysics of Speech Perception, pp. 95–111. Martinus Nijhoff, Dordrecht (1987)
Bregman, A.S.: Auditory Scene Analysis. MIT Press, Cambridge (1990)
Brons, I., Houben, R., Dreschler, W.A.: Perceptual effects of noise reduction by time-frequency masking of noisy speech. J. Acoust. Soc. Am. 132(4), 2690–2699 (2012)
Brungart, D.S., Chang, P.S., Simpson, B.D., Wang, D.: Isolating the energetic component of speech-on-speech masking with ideal time-frequency segregation. J. Acoust. Soc. Am. 120(6), 4007–4018 (2006)
Christensen, H., Barker, J., Ma, N., Green, P.: The CHiME corpus: a resource and a challenge for computational hearing in multisource environments. In: Proceedings of Interspeech (2010)
Coy, A., Barker, J.: An automatic speech recognition system based on the scene analysis account of auditory perception. Speech Commun. 49(5), 384–401 (2007)
Emiya, V., Vincent, E., Harlander, N., Hohmann, V.: Subjective and objective quality assessment of audio source separation. IEEE Trans. Audio Speech Lang. Proc. 19(7), 2046–2057 (2011)
Ephraim, Y., Malah, D.: Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Proc. 32(6), 1109–1121 (1984)
Ephraim, Y., Malah, D.: Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Proc. 33(2), 443–445 (1985)
Erkelens, J., Hendriks, R., Heusdens, R., Jensen, J.: Minimum mean-square error estimation of discrete Fourier coefficients with generalized gamma priors. IEEE Trans. Audio Speech Lang. Proc. 15(6), 1741–1752 (2007)
Grais, E., Erdogan, H.: Single channel speech music separation using nonnegative matrix factorization and spectral masks. In: The 17th International Conference on Digital Signal Processing, pp. 1–6 (2011)
Hartmann, W., Fosler-Lussier, E.: Investigations into the incorporation of the ideal binary mask in ASR. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4804–4807 (2011)
Hendriks, R., Heusdens, R., Jensen, J.: MMSE based noise PSD tracking with low complexity. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4266–4269 (2010)
Hu, Y., Loizou, P.C.: Techniques for estimating the ideal binary mask. In: Proceedings 11th International Workshop on Acoustic Echo and Noise Control (2008)
Jensen, J., Hendriks, R.: Spectral magnitude minimum mean-square error estimation using binary and continuous gain functions. IEEE Trans. Audio Speech Lang. Proc. 20(1), 92–102 (2012)
Jutten, C., Hérault, J.: Independent component analysis versus principal component analysis. In: Signal Processing IV: Theories and Applications—Proceedings of EUSIPCO, pp. 643–646. North-Holland, Grenoble (1988)
Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791 (1999)
Li, M., McAllister, H., Black, N., De Perez, T.: Perceptual time-frequency subtraction algorithm for noise reduction in hearing aids. IEEE Trans. Biomed. Eng. 48(9), 979–988 (2001)
Li, N., Loizou, P.C.: Effect of spectral resolution on the intelligibility of ideal binary masked speech. J. Acoust. Soc. Am. 123(4), 59–64 (2008)
Li, N., Loizou, P.C.: Factors influencing intelligibility of ideal binary-masked speech: implications for noise reduction. J. Acoust. Soc. Am. 123(3), 1673–1682 (2008)
Li, Y., Wang, D.: On the optimality of ideal binary time-frequency masks. Speech Commun. 51(3), 230–239 (2009)
Madhu, N., Breithaupt, C., Martin, R.: Temporal smoothing of spectral masks in the cepstral domain for speech separation. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 45–48 (2008)
Madhu, N., Spriet, A., Jansen, S., Koning, R., Wouters, J.: The potential for speech intelligibility improvement using the ideal binary mask and the ideal Wiener filter in single channel noise reduction systems: Application to auditory prostheses. IEEE Trans. Audio Speech Lang. Proc. 21(1), 63–72 (2013)
Makkiabadi, B., Sanei, S., Marshall, D.: A k-subspace based tensor factorization approach for under-determined blind identification. In: Forty Fourth Asilomar Conference on Signals, Systems and Computers, pp. 18–22 (2010)
Moore, B.C.J.: An Introduction to the Psychology of Hearing, 5th edn. Academic Press, London (2004)
Mowlaee, P., Saeidi, R., Martin, R.: Model-driven speech enhancement for multisource reverberant environment (signal separation evaluation campaign (SiSEC) 2011). In: Theis, F., Cichocki, A., Yeredor, A., Zibulevsky, M. (eds.) Latent Variable Analysis and Signal Separation. Lecture Notes in Computer Science, vol. 7191, pp. 454–461. Springer, Berlin, Heidelberg (2012)
Naik, G.R., Kumar, D.K.: An overview of independent component analysis and its applications. Informatica 35, 63–81 (2011)
Ozerov, A., Vincent, E., Bimbot, F.: A general flexible framework for the handling of prior information in audio source separation. IEEE Trans. Audio Speech Lang. Proc. 20(4), 1118–1133 (2012)
Patterson, R., Nimmo-Smith, I., Holdsworth, J., Rice, P.: An efficient auditory filterbank based on the gammatone function. Technical report, MRC Applied Psychology Unit, Cambridge (1987)
Pedersen, M., Wang, D., Larsen, J., Kjems, U.: Overcomplete blind source separation by combining ICA and binary time-frequency masking. In: IEEE Workshop Machine Learning Signal Processing, pp. 15–20 (2005)
Peterson, W., Birdsall, T.G., Fox, W.C.: The theory of signal detectability. In: Proceedings of the IRE Professional Group on Information Theory 4, pp. 171–212 (1954)
Rangachari, S., Loizou, P.C.: A noise-estimation algorithm for highly non-stationary environments. Speech Commun. 48(2), 220–231 (2006)
Roman, N., Wang, D.: Pitch-based monaural segregation of reverberant speech. J. Acoust. Soc. Am. 120(1), 458–469 (2006)
Shannon, R., Zeng, F., Kamath, V., Wygonski, J., Ekelid, M.: Speech recognition with primarily temporal cues. Science 270, 303–304 (1995)
Srinivasan, S., Roman, N., Wang, D.: Binary and ratio time-frequency masks for robust speech recognition. Speech Commun. 48(11), 1486–1501 (2006)
Stokes, T., Hummersone, C., Brookes, T.: Reducing binary masking artefacts in blind audio source separation. In: Proceedings of the 134th Audio Engineering Society Convention, Rome (2013)
Swets, J.A.: Is there a sensory threshold? Science 134(3473), 168–177 (1961)
Swets, J.A.: Signal Detection and Recognition by Human Observers. Wiley, New York (1964)
Tanner Jr, W.P., Swets, J.A.: A decision-making theory of visual detection. Psychol. Rev. 61(6), 401–409 (1954)
Vincent, E., Gribonval, R., Févotte, C.: Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Proc. 14(4), 1462–1469 (2006)
Wang, D.: On ideal binary mask as the computational goal of auditory scene analysis. In: Divenyi, P. (ed.) Speech Separation by Humans and Machines, pp. 181–197. Kluwer Academic, Norwell (2005)
Wang, D.: Time-frequency masking for speech separation and its potential for hearing aid design. Trends Amplif. 12(4), 332–353 (2008)
Wang, D., Brown, G.J.: Fundamentals of computational auditory scene analysis. In: Wang, D., Brown, G.J. (eds.) Computational Auditory Scene Analysis: Principles, Algorithms and Applications, pp. 1–44. Wiley, Hoboken (2006)
Wiener, N.: Extrapolation, Interpolation, and Smoothing of Stationary Time Series: with Engineering Applications. MIT Press, Cambridge (1950)
© 2014 Springer-Verlag Berlin Heidelberg
Cite this chapter
Hummersone, C., Stokes, T., Brookes, T. (2014). On the Ideal Ratio Mask as the Goal of Computational Auditory Scene Analysis. In: Naik, G., Wang, W. (eds) Blind Source Separation. Signals and Communication Technology. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-55016-4_12
Print ISBN: 978-3-642-55015-7
Online ISBN: 978-3-642-55016-4