Binaural Technology for Machine Speech Recognition and Understanding

The Technology of Binaural Understanding

Part of the book series: Modern Acoustics and Signal Processing (MASP)

Abstract

It is well known that binaural processing is very useful for separating incoming sound sources as well as for improving speech intelligibility in reverberant environments. This chapter describes and compares a number of ways in which automatic-speech-recognition accuracy in difficult acoustical environments can be improved through the use of signal processing techniques that are motivated by our understanding of binaural perception and binaural technology. These approaches are all based on the exploitation of interaural differences in arrival time and intensity of the signals arriving at the two ears to separate signals according to direction of arrival and to enhance the desired target signal. Their structure is motivated by classic models of binaural hearing as well as the precedence effect. We describe the structure and operation of a number of methods that use two or more microphones to improve the accuracy of automatic-speech-recognition systems operating in cluttered, noisy, and reverberant environments. The individual implementations differ in the methods by which binaural principles are imposed on speech processing, and in the precise mechanism used to extract interaural time and intensity differences. Algorithms that exploit binaural information can provide substantially improved speech-recognition accuracy in noisy, cluttered, and reverberant environments compared to baseline delay-and-sum beamforming. The type of signal manipulation that is most effective for improving performance in reverberation is different from what is most effective for ameliorating the effects of degradation caused by spatially separated interfering sound sources.
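
The abstract names two building blocks: extracting an interaural time difference and the delay-and-sum beamforming baseline. The sketch below illustrates both for two microphone signals. It is not the chapter's own algorithm. The function names `estimate_lag` and `delay_and_sum`, the use of a plain cross-correlation peak as the time-difference estimate, the circular shifts, and the synthetic white-noise signals are all assumptions made for illustration.

```python
import numpy as np


def estimate_lag(left, right, max_lag):
    """Estimate how many samples `right` lags `left` (negative if it leads).

    Picks the peak of the cross-correlation over lags in [-max_lag, max_lag].
    Assumes max_lag >= 1 and two signals of equal length.
    """
    lags = np.arange(-max_lag, max_lag + 1)
    # Ignore the edge samples that a circular shift wraps around.
    core = slice(max_lag, -max_lag)
    scores = [np.dot(left[core], np.roll(right, -lag)[core]) for lag in lags]
    return int(lags[int(np.argmax(scores))])


def delay_and_sum(left, right, lag):
    """Undo the estimated lag on `right`, then average the two channels.

    The shift is circular, a simplification that is only reasonable when
    the lag is small compared with the signal length.
    """
    aligned = np.roll(right, -lag)
    return 0.5 * (left + aligned)


# Synthetic example (assumed data): one target reaching the second
# microphone 5 samples later, with independent noise at each microphone.
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
true_lag = 5
left = target + 0.5 * rng.standard_normal(target.size)
right = np.roll(target, true_lag) + 0.5 * rng.standard_normal(target.size)

lag = estimate_lag(left, right, max_lag=16)
enhanced = delay_and_sum(left, right, lag)

# Mean-squared error against the clean target, before and after combining.
print("estimated lag:", lag)
print("single-mic error:", np.mean((left - target) ** 2))
print("delay-and-sum error:", np.mean((enhanced - target) ** 2))
```

Averaging the aligned channels reinforces the target while the independent noise components partially cancel, which is why delay-and-sum serves as a natural baseline for the binaural methods the abstract describes.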

Acknowledgements

Preparation of this manuscript was partially supported by grants from Honeywell, Google, and Afeka University. A. Menon has been supported by the Prabhu and Poonam Goel Graduate Fellowship Fund and the Jack and Mildred Bowers Scholarship in Engineering. R. Stern is deeply grateful to the many mentors, colleagues, and friends in the binaural-hearing and speech-recognition communities who have informed this analysis, including especially H. S. Colburn, C. Trahiotis, B. Raj, and R. Singh. The authors also thank E. Gouvêa, C. Kim, A. Moghimi, H.-M. Park, and T. M. Sullivan for many experimental contributions and general insight into these phenomena. Thanks are further due to two anonymous reviewers for valuable comments and suggestions.

Corresponding author

Correspondence to Richard M. Stern.

Copyright information

© 2020 Springer Nature Switzerland AG

Cite this chapter

Stern, R.M., Menon, A. (2020). Binaural Technology for Machine Speech Recognition and Understanding. In: Blauert, J., Braasch, J. (eds) The Technology of Binaural Understanding. Modern Acoustics and Signal Processing. Springer, Cham. https://doi.org/10.1007/978-3-030-00386-9_18

  • DOI: https://doi.org/10.1007/978-3-030-00386-9_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00385-2

  • Online ISBN: 978-3-030-00386-9

  • eBook Packages: Engineering, Engineering (R0)
