Abstract
The ideal binary mask (IBM) is widely considered to be the benchmark for time–frequency-based sound source separation techniques such as computational auditory scene analysis (CASA). However, it is well known that binary masking introduces objectionable distortion, especially musical noise. This can make binary masking unsuitable for sound source separation applications where the output is auditioned. It has been suggested that soft masking reduces musical noise and leads to a higher quality output. A previously defined soft mask, the ideal ratio mask (IRM), is found to have similar properties to the IBM, may correspond more closely to auditory processes, and offers additional computational advantages. Consequently, the IRM is proposed as the goal of CASA. To further support this position, a number of studies are reviewed that show soft masks to provide superior performance to the IBM in applications such as automatic speech recognition and speech intelligibility. A brief empirical study provides additional evidence demonstrating the objective and perceptual superiority of the IRM over the IBM.
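The two masks contrasted in the abstract can be sketched using their standard literature definitions, which this chapter's formulations are assumed to follow: the IBM assigns 1 to a time–frequency unit whose local SNR exceeds a local criterion (LC, in dB) and 0 otherwise, while the IRM is a soft gain given by the ratio of target energy to total (target plus noise) energy. The snippet below is a minimal illustration on synthetic magnitude spectrograms, not the authors' implementation:

```python
import numpy as np

def ideal_binary_mask(target_mag, noise_mag, lc_db=0.0):
    """IBM: 1 where the local target-to-noise ratio exceeds the
    local criterion (LC, in dB), 0 elsewhere."""
    snr_db = 20.0 * np.log10(target_mag / np.maximum(noise_mag, 1e-12))
    return (snr_db > lc_db).astype(float)

def ideal_ratio_mask(target_mag, noise_mag):
    """IRM: soft gain in [0, 1] given by the ratio of target energy
    to total (target + noise) energy in each time-frequency unit."""
    t2, n2 = target_mag ** 2, noise_mag ** 2
    return t2 / np.maximum(t2 + n2, 1e-12)

# Toy example: random magnitude "spectrograms" (freq bins x frames)
# stand in for STFT magnitudes of the target and interfering sources.
rng = np.random.default_rng(0)
target = rng.rayleigh(1.0, size=(257, 100))
noise = rng.rayleigh(0.5, size=(257, 100))

ibm = ideal_binary_mask(target, noise)
irm = ideal_ratio_mask(target, noise)

# Applying each mask to the mixture magnitude attenuates
# noise-dominated units; the IRM does so with graded gains,
# which is what avoids the abrupt on/off switching that
# produces musical noise under the IBM.
mixture = target + noise  # crude magnitude-domain approximation
separated_hard = ibm * mixture
separated_soft = irm * mixture
```

The hard mask discards every unit below LC outright, whereas the soft mask retains a proportionate share of each unit, which is the property the abstract credits with reducing musical noise.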
References
Anzalone, M., Calandruccio, L., Doherty, K., Carney, L.: Determination of the potential benefit of time-frequency gain manipulation. Ear Hear. 27(5), 480 (2006)
Araki, S., Makino, S., Sawada, H., Mukai, R.: Underdetermined blind separation of convolutive mixtures of speech with directivity pattern based mask and ICA. In: Puntonet, C., Prieto, A. (eds.) Independent Component Analysis and Blind Signal Separation. Lecture Notes in Computer Science, vol. 3195, pp. 898–905. Springer, Berlin (2004)
Araki, S., Makino, S., Sawada, H., Mukai, R.: Reducing musical noise by a fine-shift overlap-add method applied to source separation using a time-frequency mask. IEEE Int. Conf. Acoust. Speech Signal Proc. (ICASSP) III, 81–84 (2005)
Araki, S., Nesta, F., Vincent, E., Koldovský, Z., Nolte, G., Ziehe, A., Benichoux, A.: The 2011 signal separation evaluation campaign (SiSEC2011): audio source separation. In: Theis, F., Cichocki, A., Yeredor, A., Zibulevsky, M. (eds.) Latent Variable Analysis and Signal Separation. Lecture Notes in Computer Science, vol. 7191, pp. 414–422. Springer, Berlin, Heidelberg (2012)
Araki, S., Sawada, H., Mukai, R., Makino, S.: Blind sparse source separation with spatially smoothed time-frequency masking. In: International Workshop on Acoustic, Echo and Noise Control. Paris (2006)
Barker, J., Josifovski, L., Cooke, M.P., Green, P.D.: Soft decisions in missing data techniques for robust automatic speech recognition. In: Proceedings of International Conference on Spoken Language Processing, pp. 373–376 (2000)
Bell, A.J., Sejnowski, T.J.: An information-maximization approach to blind separation and blind deconvolution. Neural Comput. 7(6), 1129–1159 (1995)
Bregman, A.: The meaning of duplex perception: sounds as transparent objects. In: Schouten, M.E.H. (ed.) The Psychophysics of Speech Perception, pp. 95–111. Martinus Nijhoff, Dordrecht (1987)
Bregman, A.S.: Auditory Scene Analysis. MIT Press, Cambridge (1990)
Brons, I., Houben, R., Dreschler, W.A.: Perceptual effects of noise reduction by time-frequency masking of noisy speech. J. Acoust. Soc. Am. 132(4), 2690–2699 (2012)
Brungart, D.S., Chang, P.S., Simpson, B.D., Wang, D.: Isolating the energetic component of speech-on-speech masking with ideal time-frequency segregation. J. Acoust. Soc. Am. 120(6), 4007–4018 (2006)
Christensen, H., Barker, J., Ma, N., Green, P.: The CHiME corpus: a resource and a challenge for computational hearing in multisource environments. In: Proceedings of Interspeech (2010)
Coy, A., Barker, J.: An automatic speech recognition system based on the scene analysis account of auditory perception. Speech Commun. 49(5), 384–401 (2007)
Emiya, V., Vincent, E., Harlander, N., Hohmann, V.: Subjective and objective quality assessment of audio source separation. IEEE Trans. Audio Speech Lang. Proc. 19(7), 2046–2057 (2011)
Ephraim, Y., Malah, D.: Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Proc. 32(6), 1109–1121 (1984)
Ephraim, Y., Malah, D.: Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Proc. 33(2), 443–445 (1985)
Erkelens, J., Hendriks, R., Heusdens, R., Jensen, J.: Minimum mean-square error estimation of discrete Fourier coefficients with generalized gamma priors. IEEE Trans. Audio Speech Lang. Proc. 15(6), 1741–1752 (2007)
Grais, E., Erdogan, H.: Single channel speech music separation using nonnegative matrix factorization and spectral masks. In: The 17th International Conference on Digital Signal Processing, pp. 1–6 (2011)
Hartmann, W., Fosler-Lussier, E.: Investigations into the incorporation of the ideal binary mask in ASR. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4804–4807 (2011)
Hendriks, R., Heusdens, R., Jensen, J.: MMSE based noise PSD tracking with low complexity. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4266–4269 (2010)
Hu, Y., Loizou, P.C.: Techniques for estimating the ideal binary mask. In: Proceedings 11th International Workshop on Acoustic Echo and Noise Control (2008)
Jensen, J., Hendriks, R.: Spectral magnitude minimum mean-square error estimation using binary and continuous gain functions. IEEE Trans. Audio Speech Lang. Proc. 20(1), 92–102 (2012)
Jutten, C., Hérault, J.: Independent component analysis versus principal component analysis. In: Signal Processing IV: Theories and Applications—Proceedings of EUSIPCO, pp. 643–646. North-Holland, Grenoble (1988)
Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791 (1999)
Li, M., McAllister, H., Black, N., De Perez, T.: Perceptual time-frequency subtraction algorithm for noise reduction in hearing aids. IEEE Trans. Biomed. Eng. 48(9), 979–988 (2001)
Li, N., Loizou, P.C.: Effect of spectral resolution on the intelligibility of ideal binary masked speech. J. Acoust. Soc. Am. 123(4), 59–64 (2008)
Li, N., Loizou, P.C.: Factors influencing intelligibility of ideal binary-masked speech: implications for noise reduction. J. Acoust. Soc. Am. 123(3), 1673–1682 (2008)
Li, Y., Wang, D.: On the optimality of ideal binary time-frequency masks. Speech Commun. 51(3), 230–239 (2009)
Madhu, N., Breithaupt, C., Martin, R.: Temporal smoothing of spectral masks in the cepstral domain for speech separation. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 45–48 (2008)
Madhu, N., Spriet, A., Jansen, S., Koning, R., Wouters, J.: The potential for speech intelligibility improvement using the ideal binary mask and the ideal Wiener filter in single channel noise reduction systems: Application to auditory prostheses. IEEE Trans. Audio Speech Lang. Proc. 21(1), 63–72 (2013)
Makkiabadi, B., Sanei, S., Marshall, D.: A k-subspace based tensor factorization approach for under-determined blind identification. In: Forty Fourth Asilomar Conference on Signals, Systems and Computers, pp. 18–22 (2010)
Moore, B.C.J.: An Introduction to the Psychology of Hearing, 5th edn. Academic Press, London (2004)
Mowlaee, P., Saeidi, R., Martin, R.: Model-driven speech enhancement for multisource reverberant environment (signal separation evaluation campaign (SiSEC) 2011). In: Theis, F., Cichocki, A., Yeredor, A., Zibulevsky, M. (eds.) Latent Variable Analysis and Signal Separation. Lecture Notes in Computer Science, vol. 7191, pp. 454–461. Springer, Berlin, Heidelberg (2012)
Naik, G.R., Kumar, D.K.: An overview of independent component analysis and its applications. Informatica 35, 63–81 (2011)
Ozerov, A., Vincent, E., Bimbot, F.: A general flexible framework for the handling of prior information in audio source separation. IEEE Trans. Audio Speech Lang. Proc. 20(4), 1118–1133 (2012)
Patterson, R., Nimmo-Smith, I., Holdsworth, J., Rice, P.: An efficient auditory filterbank based on the gammatone function. Technical report, MRC Applied Psychology Unit, Cambridge (1987)
Pedersen, M., Wang, D., Larsen, J., Kjems, U.: Overcomplete blind source separation by combining ICA and binary time-frequency masking. In: IEEE Workshop Machine Learning Signal Processing, pp. 15–20 (2005)
Peterson, W., Birdsall, T.G., Fox, W.C.: The theory of signal detectability. In: Proceedings of the IRE Professional Group on Information Theory 4, pp. 171–212 (1954)
Rangachari, S., Loizou, P.C.: A noise-estimation algorithm for highly non-stationary environments. Speech Commun. 48(2), 220–231 (2006)
Roman, N., Wang, D.: Pitch-based monaural segregation of reverberant speech. J. Acoust. Soc. Am. 120(1), 458–469 (2006)
Shannon, R., Zeng, F., Kamath, V., Wygonski, J., Ekelid, M.: Speech recognition with primarily temporal cues. Science 270, 303–304 (1995)
Srinivasan, S., Roman, N., Wang, D.: Binary and ratio time-frequency masks for robust speech recognition. Speech Commun. 48(11), 1486–1501 (2006)
Stokes, T., Hummersone, C., Brookes, T.: Reducing binary masking artefacts in blind audio source separation. In: Proceedings of the 134th Audio Engineering Society Convention, Rome (2013)
Swets, J.A.: Is there a sensory threshold? Science 134(3473), 168–177 (1961)
Swets, J.A.: Signal Detection and Recognition by Human Observers. Wiley, New York (1964)
Tanner Jr, W.P., Swets, J.A.: A decision-making theory of visual detection. Psychol. Rev. 61(6), 401–409 (1954)
Vincent, E., Gribonval, R., Févotte, C.: Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Proc. 14(4), 1462–1469 (2006)
Wang, D.: On ideal binary mask as the computational goal of auditory scene analysis. In: Divenyi, P. (ed.) Speech Separation by Humans and Machines, pp. 181–197. Kluwer Academic, Norwell (2005)
Wang, D.: Time-frequency masking for speech separation and its potential for hearing aid design. Trends Amplif. 12(4), 332–353 (2008)
Wang, D., Brown, G.J.: Fundamentals of computational auditory scene analysis. In: Wang, D., Brown, G.J. (eds.) Computational Auditory Scene Analysis: Principles, Algorithms and Applications, pp. 1–44. Wiley, Hoboken (2006)
Wiener, N.: Extrapolation, Interpolation, and Smoothing of Stationary Time Series: with Engineering Applications. MIT Press, Cambridge (1950)
© 2014 Springer-Verlag Berlin Heidelberg
Cite this chapter
Hummersone, C., Stokes, T., Brookes, T. (2014). On the Ideal Ratio Mask as the Goal of Computational Auditory Scene Analysis. In: Naik, G., Wang, W. (eds) Blind Source Separation. Signals and Communication Technology. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-55016-4_12
Print ISBN: 978-3-642-55015-7
Online ISBN: 978-3-642-55016-4