Binaural Speech Separation Using Recurrent Timing Neural Networks for Joint F0-Localisation Estimation
Abstract
A speech separation system is described in which sources are represented in a joint interaural time difference-fundamental frequency (ITD-F0) cue space. Traditionally, recurrent timing neural networks (RTNNs) have been used only to extract periodicity information; in this study, this type of network is extended in two ways. Firstly, a coincidence detector layer is introduced, each node of which is tuned to a particular ITD; secondly, the RTNN is extended to become two-dimensional, allowing periodicity analysis to be performed at each best-ITD. Thus, one axis of the RTNN represents F0 and the other ITD, allowing sources to be segregated on the basis of their separation in ITD-F0 space. Source segregation is performed within individual frequency channels, without recourse to the across-channel estimates of F0 or ITD that are commonly used in auditory scene analysis approaches. The system is evaluated on spatialised speech signals using energy-based metrics and automatic speech recognition.
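To make the joint cue space concrete, the sketch below shows one hypothetical way a single-channel, two-dimensional RTNN of this kind could be realised: a Jeffress-style coincidence-detector layer supplies the ITD axis, and a bank of recurrent delay loops, one per candidate pitch period, supplies the F0 axis. This is a minimal illustration under stated assumptions, not the authors' implementation; the function and parameter names (`rtnn_itd_f0`, `itds_s`, `f0s_hz`, `alpha`) are invented for the example.

```python
import numpy as np

def rtnn_itd_f0(left, right, fs, itds_s, f0s_hz, alpha=0.9):
    """Hypothetical sketch of a 2-D recurrent timing network for one
    frequency channel.

    ITD axis : Jeffress-style coincidence detection between the left and
               right signals after a relative delay.
    F0 axis  : recurrent delay loops whose loop delay equals a candidate
               pitch period; a loop reinforces input that repeats at its
               own period.

    left, right : half-wave-rectified output of one auditory filter channel
    fs          : sample rate in Hz
    itds_s      : candidate ITDs in seconds (the "best-ITD" axis)
    f0s_hz      : candidate fundamental frequencies in Hz (the F0 axis)
    alpha       : recurrent loop gain (0 < alpha < 1 for stability)
    Returns an (n_itd, n_f0) activation map for this channel.
    """
    n = len(left)
    # Loop delays in samples for each candidate pitch period.
    periods = (fs / np.asarray(f0s_hz)).round().astype(int)
    act = np.zeros((len(itds_s), len(f0s_hz)))

    for i, itd in enumerate(itds_s):
        d = int(round(abs(itd) * fs))
        # Coincidence detection: multiply the two ears after delaying
        # one relative to the other by the node's best ITD.
        if itd >= 0:
            coinc = left[d:n] * right[:n - d]
        else:
            coinc = left[:n - d] * right[d:n]

        # One recurrent timing loop per candidate pitch period: the loop
        # output is fed back after p samples and multiplicatively
        # reinforces input that recurs at that period.
        for j, p in enumerate(periods):
            loop = np.zeros(len(coinc))
            loop[:p] = coinc[:p]
            for t in range(p, len(coinc)):
                loop[t] = coinc[t] * (1.0 + alpha * loop[t - p])
            act[i, j] = loop.sum()

    return act
```

In a full system of the kind the abstract describes, a map like this would be computed independently in each frequency channel, and channels would be assigned to sources according to where their activation peaks fall in ITD-F0 space, without any across-channel pooling of F0 or ITD estimates.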
Keywords
Automatic Speech Recognition, Binary Mask, Coincidence Detector, Automatic Speech Recognition System, Pitch Period