On Ideal Binary Mask As the Computational Goal of Auditory Scene Analysis

  • DeLiang Wang


In his famous treatise on computational vision, Marr (1982) makes a compelling argument for separating different levels of analysis in order to understand complex information processing. In particular, the computational theory level, concerned with the goal of computation and the general processing strategy, must be separated from the algorithm level; that is, the what must be separated from the how. This chapter attempts a computational-theory analysis of auditory scene analysis, where the main task is to understand the character of the computational auditory scene analysis (CASA) problem.

My analysis results in the proposal of the ideal binary mask as a main goal of CASA. This goal is consistent with characteristics of human auditory scene analysis. It is also consistent with more specific objectives such as enhancing automatic speech recognition (ASR) and speech intelligibility. The resulting evaluation metric has the properties of simplicity and generality, and is easy to apply when the premixing target is available. The goal of the ideal binary mask has led to effective speech separation algorithms that attempt to explicitly estimate such masks.
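The abstract does not spell out how the mask is defined. As a minimal sketch, assume the commonly used formulation in which a time-frequency unit is kept (mask value 1) when the premixing target energy exceeds the interference energy by more than a local criterion in dB, and discarded (mask value 0) otherwise. The function name `ideal_binary_mask`, the parameter `lc_db`, and the toy energy values are illustrative assumptions, not taken from the chapter.

```python
import numpy as np


def ideal_binary_mask(target_energy, interference_energy, lc_db=0.0):
    """Hypothetical helper: 1 where the local SNR exceeds lc_db, else 0.

    Both inputs are arrays of per-unit energies (frequency x time),
    computed from the premixing target and interference signals.
    """
    eps = np.finfo(float).eps  # guards against division by zero and log(0)
    local_snr_db = 10.0 * np.log10(
        (target_energy + eps) / (interference_energy + eps)
    )
    return (local_snr_db > lc_db).astype(int)


# Toy 2 x 2 energy grids (made-up numbers, for illustration only).
target = np.array([[4.0, 0.5],
                   [1.0, 9.0]])
interference = np.array([[1.0, 2.0],
                         [1.0, 3.0]])

mask = ideal_binary_mask(target, interference)
```

Because the mask is computed from the premixing signals, it can only be constructed when those signals are available, which matches the abstract's remark that the resulting metric is easy to apply when the premixing target is available. An estimated mask from a separation algorithm could then be compared against it unit by unit.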


Keywords: Automatic Speech Recognition, Speech Enhancement, Binary Mask, Speech Intelligibility, Stream Segregation
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.




  1. Bird, J. and Darwin, C.J., 1997, Effects of a difference in fundamental frequency in separating two sentences, in: Psychophysical and Physiological Advances in Hearing, A.R. Palmer, et al., ed., Whurr, London.
  2. Bodden, M., 1993, Modeling human sound-source localization and the cocktail-party-effect, Acta Acust. 1: 43–55.
  3. Bregman, A.S., 1990, Auditory Scene Analysis, MIT Press, Cambridge MA.
  4. Brown, G.J. and Cooke, M., 1994, Computational auditory scene analysis, Computer Speech and Language 8: 297–336.
  5. Brungart, D., Chang, P., Simpson, B., and Wang, D. L., in preparation.
  6. Carlyon, R.P. and Shackleton, T.M., 1994, Comparing the fundamental frequencies of resolved and unresolved harmonics: Evidence for two pitch mechanisms?, J. Acoust. Soc. Am. 95: 3541–3554.
  7. Cherry, E.C., 1953, Some experiments on the recognition of speech, with one and with two ears, J. Acoust. Soc. Am. 25: 975–979.
  8. Cooke, M., 1993, Modelling Auditory Processing and Organization, Cambridge University Press, Cambridge U.K.
  9. Cooke, M., Green, P., Josifovski, L., and Vizinho, A., 2001, Robust automatic speech recognition with missing and unreliable acoustic data, Speech Comm. 34: 267–285.
  10. Cowan, N., 2001, The magic number 4 in short-term memory: a reconsideration of mental storage capacity, Behav. Brain Sci. 24: 87–185.
  11. Ellis, D.P.W., 1996, Prediction-driven computational auditory scene analysis, Ph.D. Dissertation, MIT Department of Electrical Engineering and Computer Science.
  12. Gibson, J.J., 1966, The Senses Considered as Perceptual Systems, Greenwood Press, Westport CT.
  13. Glotin, H., 2001, Elaboration et étude comparative de systèmes adaptatifs multi-flux de reconnaissance robuste de la parole: incorporation d’indices de voisement et de localisation, Ph.D. Dissertation, Institut National Polytechnique de Grenoble.
  14. Helmholtz, H., 1863, On the Sensation of Tone (A.J. Ellis, Trans.), Dover Publishers, Second English ed., New York.
  15. Hu, G. and Wang, D.L., 2001, Speech segregation based on pitch tracking and amplitude modulation, in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 79–82.
  16. Hu, G. and Wang, D.L., 2003, Monaural speech separation, in: Advances in Neural Information Processing Systems (NIPS’02), MIT Press, Cambridge MA, pp. 1221–1228.
  17. Hu, G. and Wang, D.L., 2004, Monaural speech segregation based on pitch tracking and amplitude modulation, IEEE Trans. Neural Net., in press.
  18. Hyvärinen, A., Karhunen, J., and Oja, E., 2001, Independent Component Analysis, Wiley, New York.
  19. Jourjine, A., Rickard, S., and Yilmaz, O., 2000, Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures, in Proceedings of IEEE ICASSP, pp. 2985–2988.
  20. Krim, H. and Viberg, M., 1996, Two decades of array signal processing research: The parametric approach, IEEE Sig. Proc. Mag. 13: 67–94.
  21. Lee, T.-W., 1998, Independent Component Analysis: Theory and Applications, Kluwer Academic, Boston.
  22. Lim, J., ed., 1983, Speech Enhancement, Prentice Hall, Englewood Cliffs NJ.
  23. Marr, D., 1982, Vision, Freeman, New York.
  24. McCabe, S.L. and Denham, M.J., 1997, A model of auditory streaming, J. Acoust. Soc. Am. 101: 1611–1621.
  25. Moore, B.C.J., 1998, Cochlear Hearing Loss, Whurr Publishers, London.
  26. Moore, B.C.J., 2003, An Introduction to the Psychology of Hearing, Academic Press, 5th ed., San Diego, CA.
  27. Nakatani, T. and Okuno, H.G., 1999, Harmonic sound stream segregation using localization and its application to speech stream segregation, Speech Comm. 27: 209–222.
  28. Norris, M., 2003, Assessment and extension of Wang’s oscillatory model of auditory stream segregation, Ph.D. Dissertation, University of Queensland School of Information Technology and Electrical Engineering.
  29. O’Shaughnessy, D., 2000, Speech Communications: Human and Machine, IEEE Press, 2nd ed., Piscataway NJ.
  30. Pashler, H.E., 1998, The Psychology of Attention, MIT Press, Cambridge MA.
  31. Roman, N., Wang, D.L., and Brown, G.J., 2001, Speech segregation based on sound localization, in Proceedings of IJCNN, pp. 2861–2866.
  32. Roman, N., Wang, D.L., and Brown, G.J., 2003, Speech segregation based on sound localization, J. Acoust. Soc. Am. 114: 2236–2252.
  33. Rosenthal, D.F. and Okuno, H.G., ed., 1998, Computational Auditory Scene Analysis, Lawrence Erlbaum, Mahwah NJ.
  34. Roweis, S.T., 2001, One microphone source separation, in: Advances in Neural Information Processing Systems (NIPS’00), MIT Press.
  35. Stubbs, R.J. and Summerfield, Q., 1988, Evaluation of two voice-separation algorithms using normal-hearing and hearing-impaired listeners, J. Acoust. Soc. Am. 84: 1236–1249.
  36. Stubbs, R.J. and Summerfield, Q., 1990, Algorithms for separating the speech of interfering talkers: Evaluations with voiced sentences, and normal-hearing and hearing-impaired listeners, J. Acoust. Soc. Am. 87: 359–372.
  37. Treisman, A., 1999, Solutions to the binding problem: progress through controversy and convergence, Neuron 24: 105–110.
  38. van der Kouwe, A.J.W., Wang, D.L., and Brown, G.J., 2001, A comparison of auditory and blind separation techniques for speech segregation, IEEE Trans. Speech Audio Process. 9: 189–195.
  39. van Veen, B.D. and Buckley, K.M., April 1988, Beamforming: A versatile approach to spatial filtering, IEEE ASSP Magazine, pp. 4–24.
  40. Wang, D.L., 1996, Primitive auditory segregation based on oscillatory correlation, Cognit. Sci. 20: 409–456.
  41. Wang, D.L. and Brown, G.J., 1999, Separation of speech from interfering sounds based on oscillatory correlation, IEEE Trans. Neural Net. 10: 684–697.
  42. Weintraub, M., 1985, A theory and computational model of auditory monaural sound separation, Ph.D. Dissertation, Stanford University Department of Electrical Engineering.
  43. Wrigley, S.N. and Brown, G.J., 2004, A computational model of auditory selective attention, IEEE Trans. Neural Net., in press.

Copyright information

© Springer Science + Business Media, Inc. 2005

Authors and Affiliations

  • DeLiang Wang (1)
  1. Department of Computer Science and Engineering and Center of Cognitive Science, The Ohio State University, Columbus
