Abstract
The term auditory scene analysis (ASA) refers to the ability of human listeners to form perceptual representations of the constituent sources in an acoustic mixture, as in the well-known ‘cocktail party’ effect. Accordingly, computational auditory scene analysis (CASA) is the field of study that attempts to replicate ASA in machines. Some CASA systems are closely modelled on the known stages of auditory processing, whereas others adopt a more functional approach. However, all are broadly based on the principles underlying the perception and organization of sound by human listeners, and in this respect they differ from independent component analysis (ICA) and other approaches to sound separation. In this chapter, we review the principles underlying ASA and show how they can be implemented in CASA systems. We also consider the link between CASA and automatic speech recognition, and draw distinctions between the CASA and ICA approaches.
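A construct commonly taken as the computational goal of CASA is the ideal binary mask: each time-frequency cell of a mixture is labelled 1 where the target source dominates the interference and 0 elsewhere. The sketch below is illustrative only and is not taken from the chapter; the `stft` helper, window sizes, and the 0 dB local criterion are assumptions of this example.

```python
import numpy as np

def stft(x, win=256, hop=128):
    """Naive Hann-windowed magnitude STFT (illustrative helper, not a
    library routine): frame the signal, window it, take |rFFT|."""
    w = np.hanning(win)
    frames = [x[i:i + win] * w for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

def ideal_binary_mask(target, interference, lc_db=0.0):
    """Label a time-frequency cell 1 where the target's local SNR (in dB)
    exceeds the criterion lc_db, else 0."""
    eps = 1e-12  # avoid division by / log of zero
    snr = 20.0 * np.log10((stft(target) + eps) / (stft(interference) + eps))
    return (snr > lc_db).astype(int)

# Toy scene: a 440 Hz target tone mixed with a 2 kHz interfering tone.
fs = 8000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 440 * t)
interference = 0.5 * np.sin(2 * np.pi * 2000 * t)

mask = ideal_binary_mask(target, interference)
# Cells near 440 Hz are assigned to the target; cells near 2 kHz are not.
```

Applying such a mask to the mixture's spectrogram and resynthesizing yields the separated target; CASA systems differ mainly in how they estimate this mask from the mixture alone, using grouping cues such as harmonicity, common onset, and spatial location.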
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
Cite this chapter
Brown, G.J., Wang, D. (2005). Separation of Speech by Computational Auditory Scene Analysis. In: Speech Enhancement. Signals and Communication Technology. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-27489-8_16
Print ISBN: 978-3-540-24039-6
Online ISBN: 978-3-540-27489-6