Binaural Speech Separation Using Recurrent Timing Neural Networks for Joint F0-Localisation Estimation
A speech separation system is described in which sources are represented in a joint interaural time difference-fundamental frequency (ITD-F0) cue space. Traditionally, recurrent timing neural networks (RTNNs) have been used only to extract periodicity information; in this study, the network is extended in two ways. Firstly, a coincidence detector layer is introduced, each node of which is tuned to a particular ITD; secondly, the RTNN is made two-dimensional so that periodicity analysis can be performed at each best-ITD. Thus, one axis of the RTNN represents F0 and the other represents ITD, allowing sources to be segregated on the basis of their separation in ITD-F0 space. Source segregation is performed within individual frequency channels, without recourse to the across-channel estimates of F0 or ITD that are commonly used in auditory scene analysis approaches. The system is evaluated on spatialised speech signals using energy-based metrics and automatic speech recognition.
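The idea of a two-dimensional ITD-F0 map for a single frequency channel can be illustrated with a minimal sketch. It is not the RTNN described here: the recurrent delay-loop update is replaced by a plain autocorrelation-style product, and the function names (`shift`, `itd_f0_map`), the half-wave rectification, the sampling rate, and the test signal are all assumptions chosen for illustration. Each row of the map corresponds to one coincidence detector tuned to an ITD (in samples), and each column to one candidate pitch period (in samples).

```python
import numpy as np

def shift(x, d):
    """Delay x by d samples (d may be negative), zero-padding the ends."""
    y = np.zeros_like(x)
    if d > 0:
        y[d:] = x[:-d]
    elif d < 0:
        y[:d] = x[-d:]
    else:
        y[:] = x
    return y

def itd_f0_map(left, right, itd_lags, period_lags):
    """Hypothetical joint ITD-period activity map for one frequency channel.

    Rows index ITD (samples by which the right ear lags the left),
    columns index candidate pitch periods (samples).
    """
    # Half-wave rectify as a crude stand-in for hair-cell transduction (assumption).
    left = np.maximum(np.asarray(left, dtype=float), 0.0)
    right = np.maximum(np.asarray(right, dtype=float), 0.0)
    activity = np.zeros((len(itd_lags), len(period_lags)))
    for i, d in enumerate(itd_lags):
        # Coincidence detector tuned to ITD d: delay the left ear by d and
        # multiply with the right ear, so matching delays give large products.
        coincidence = shift(left, d) * right
        for j, p in enumerate(period_lags):
            # Periodicity analysis at this ITD: compare the coincidence
            # output with a copy of itself delayed by one candidate period.
            activity[i, j] = np.sum(coincidence * shift(coincidence, p))
    return activity

# Illustrative input: a 200 Hz tone at 16 kHz, with the right ear delayed by 5 samples.
fs = 16000
t = np.arange(1600)
source = np.sin(2 * np.pi * 200 * t / fs)
left, right = source, shift(source, 5)

itd_lags = list(range(-10, 11))     # candidate ITDs in samples
period_lags = list(range(40, 121))  # candidate periods in samples
activity = itd_f0_map(left, right, itd_lags, period_lags)

# The largest response marks the (ITD, period) pair that best explains the input.
i, j = np.unravel_index(np.argmax(activity), activity.shape)
best_itd, best_period = itd_lags[i], period_lags[j]
```

With two concurrent sources that differ in ITD, F0, or both, a map of this kind would be expected to show separate peaks, which is the property that separation in ITD-F0 space relies on.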
Keywords: Automatic Speech Recognition; Binary Mask; Coincidence Detector; Automatic Speech Recognition System; Pitch Period