Abstract
The term auditory scene analysis (ASA) refers to the ability of human listeners to form perceptual representations of the constituent sources in an acoustic mixture, as in the well-known ‘cocktail party’ effect. Accordingly, computational auditory scene analysis (CASA) is the field of study that attempts to replicate ASA in machines. Some CASA systems are closely modelled on the known stages of auditory processing, whereas others adopt a more functional approach. However, all are broadly based on the principles underlying the perception and organization of sound by human listeners, and in this respect they differ from independent component analysis (ICA) and other approaches to sound separation. In this chapter, we review the principles underlying ASA and show how they can be implemented in CASA systems. We also consider the link between CASA and automatic speech recognition, and draw distinctions between the CASA and ICA approaches.
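A construct commonly taken as the computational goal of CASA is the ideal binary mask: each time-frequency cell of a mixture is labelled 1 where the target source dominates the interference and 0 elsewhere. The sketch below is illustrative only and is not taken from the chapter; the `stft` helper, window sizes, and the 0 dB local criterion are assumptions of this example.

```python
import numpy as np

def stft(x, win=256, hop=128):
    """Naive Hann-windowed magnitude STFT (illustrative helper, not a
    library routine): frame the signal, window it, take |rFFT|."""
    w = np.hanning(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

def ideal_binary_mask(target, interference, lc_db=0.0):
    """Label a time-frequency cell 1 where the target's local SNR (in dB)
    exceeds the criterion lc_db, else 0."""
    eps = 1e-12  # avoid division by / log of zero
    snr = 20.0 * np.log10((stft(target) + eps) / (stft(interference) + eps))
    return (snr > lc_db).astype(int)

# Toy scene: a 440 Hz target tone mixed with a 2 kHz interfering tone.
fs = 8000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 440 * t)
interference = 0.5 * np.sin(2 * np.pi * 2000 * t)

mask = ideal_binary_mask(target, interference)
# Cells near 440 Hz are assigned to the target; cells near 2 kHz are not.
```

Applying such a mask to the mixture's spectrogram and resynthesizing yields the separated target; CASA systems differ mainly in how they estimate this mask from the mixture alone, using grouping cues such as harmonicity, common onset, and spatial location.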
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
Cite this chapter
Brown, G.J., Wang, D. (2005). Separation of Speech by Computational Auditory Scene Analysis. In: Speech Enhancement. Signals and Communication Technology. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-27489-8_16
Print ISBN: 978-3-540-24039-6
Online ISBN: 978-3-540-27489-6