
On the Ideal Ratio Mask as the Goal of Computational Auditory Scene Analysis

Chapter in: Blind Source Separation

Abstract

The ideal binary mask (IBM) is widely considered the benchmark for time–frequency sound source separation techniques such as computational auditory scene analysis (CASA). However, binary masking is known to introduce objectionable distortion, especially musical noise, which can make it unsuitable for separation applications where the output is auditioned. It has been suggested that soft masking reduces musical noise and yields higher-quality output. A previously defined soft mask, the ideal ratio mask (IRM), is found to have properties similar to those of the IBM, may correspond more closely to auditory processes, and offers additional computational advantages. Consequently, the IRM is proposed as the goal of CASA. To support this position further, a number of studies are reviewed showing that soft masks outperform the IBM in applications such as automatic speech recognition and speech intelligibility. A brief empirical study provides additional evidence of the objective and perceptual superiority of the IRM over the IBM.
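The two masks contrasted in the abstract can be sketched numerically. The snippet below is a minimal illustration, not the chapter's implementation: it uses synthetic stand-in signals, a simple framed-FFT magnitude spectrogram, the common 0 dB local-SNR criterion for the IBM, and the power-ratio form of the IRM found in the literature.

```python
import numpy as np

def stft_mag(x, frame=256, hop=128):
    # Magnitude spectrogram via a Hann-windowed framed FFT (illustrative only).
    win = np.hanning(frame)
    n = 1 + (len(x) - frame) // hop
    frames = np.stack([x[i * hop : i * hop + frame] * win for i in range(n)])
    return np.abs(np.fft.rfft(frames, axis=1))

rng = np.random.default_rng(0)
t = np.arange(4096) / 8000.0
speech = np.sin(2 * np.pi * 440 * t)            # stand-in "target" signal
noise = 0.5 * rng.standard_normal(4096)          # stand-in interferer

S = stft_mag(speech)   # target magnitude spectrogram
N = stft_mag(noise)    # interferer magnitude spectrogram

# Ideal binary mask: 1 where the local SNR exceeds 0 dB, else 0.
ibm = (S > N).astype(float)

# Ideal ratio mask: a soft gain in [0, 1] from the local power ratio
# (one common definition; exponent conventions vary in the literature).
irm = S**2 / (S**2 + N**2 + 1e-12)

# Applying either mask to the mixture spectrogram estimates the target.
mix = stft_mag(speech + noise)
est_ibm = ibm * mix
est_irm = irm * mix
```

The hard 0/1 decisions of `ibm` are what give rise to musical noise, while `irm` varies smoothly between 0 and 1, which is the quality advantage the chapter argues for.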




Acknowledgments

The authors would like to thank Nicoleta Roman and colleagues for providing the data for Table 12.4, Nilesh Madhu for providing the data for Table 12.6, and Jesper Jensen for providing the data for Table 12.7.

Author information

Correspondence to Christopher Hummersone.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Hummersone, C., Stokes, T., Brookes, T. (2014). On the Ideal Ratio Mask as the Goal of Computational Auditory Scene Analysis. In: Naik, G., Wang, W. (eds) Blind Source Separation. Signals and Communication Technology. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-55016-4_12


  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-55015-7

  • Online ISBN: 978-3-642-55016-4
