Separation of Speech by Computational Auditory Scene Analysis

Chapter in Speech Enhancement

Part of the book series: Signals and Communication Technology (SCT)

Abstract

The term auditory scene analysis (ASA) refers to the ability of human listeners to form perceptual representations of the constituent sources in an acoustic mixture, as in the well-known ‘cocktail party’ effect. Accordingly, computational auditory scene analysis (CASA) is the field of study that attempts to replicate ASA in machines. Some CASA systems are closely modelled on the known stages of auditory processing, whereas others adopt a more functional approach. However, all are broadly based on the principles underlying the perception and organization of sound by human listeners, and in this respect they differ from independent component analysis (ICA) and other approaches to sound separation. In this chapter, we review the principles underlying ASA and show how they can be implemented in CASA systems. We also consider the link between CASA and automatic speech recognition, and draw distinctions between the CASA and ICA approaches.
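As a point of orientation, many CASA systems in this literature frame separation as the estimation of a binary mask over a time-frequency representation of the mixture, retaining the units dominated by the target speech. The sketch below illustrates that masking strategy with the oracle ‘ideal binary mask’, computed from the premixed signals; it is a minimal Python illustration, not the chapter's algorithm, and the function name and parameters are illustrative. A real CASA system must estimate such a mask from the mixture alone, using grouping cues such as harmonicity, common onset, and spatial location.

```python
import numpy as np
from scipy.signal import stft, istft

def ibm_separate(target, interference, fs, lc_db=0.0, nperseg=512):
    """Oracle separation with an ideal binary mask (IBM).

    Retains each time-frequency unit in which the target exceeds the
    interference by at least lc_db decibels and discards the rest.
    A CASA system must estimate such a mask from the mixture alone.
    """
    mixture = target + interference
    _, _, T = stft(target, fs, nperseg=nperseg)        # target spectrogram
    _, _, I = stft(interference, fs, nperseg=nperseg)  # interference spectrogram
    _, _, M = stft(mixture, fs, nperseg=nperseg)       # mixture spectrogram

    # Local SNR in each time-frequency unit; eps guards against log(0).
    eps = 1e-12
    local_snr_db = 20.0 * np.log10((np.abs(T) + eps) / (np.abs(I) + eps))
    mask = local_snr_db > lc_db

    # Zero the interference-dominated units and resynthesize the target.
    _, estimate = istft(mask * M, fs, nperseg=nperseg)
    return estimate, mask

# Toy usage: a harmonic "target" tone mixed with broadband noise.
fs = 16000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 220.0 * t)
interference = 0.5 * np.random.randn(fs)
estimate, mask = ibm_separate(target, interference, fs)
```

Because the mask is computed from the premixed signals, the ideal binary mask serves as a performance ceiling and evaluation benchmark for mask-estimation systems rather than as a deployable separator.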



Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Brown, G.J., Wang, D. (2005). Separation of Speech by Computational Auditory Scene Analysis. In: Speech Enhancement. Signals and Communication Technology. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-27489-8_16


  • DOI: https://doi.org/10.1007/3-540-27489-8_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-24039-6

  • Online ISBN: 978-3-540-27489-6

