Binaural Speech Separation Using Recurrent Timing Neural Networks for Joint F0-Localisation Estimation

  • Stuart N. Wrigley
  • Guy J. Brown
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4892)


A speech separation system is described in which sources are represented in a joint interaural time difference-fundamental frequency (ITD-F0) cue space. Traditionally, recurrent timing neural networks (RTNNs) have been used only to extract periodicity information; in this study, this type of network is extended in two ways. Firstly, a coincidence detector layer is introduced, each node of which is tuned to a particular ITD; secondly, the RTNN is made two-dimensional so that periodicity analysis can be performed at each best-ITD. One axis of the RTNN therefore represents F0 and the other ITD, allowing sources to be segregated on the basis of their separation in ITD-F0 space. Source segregation is performed within individual frequency channels, without recourse to the across-channel estimates of F0 or ITD commonly used in auditory scene analysis approaches. The system is evaluated on spatialised speech signals using energy-based metrics and automatic speech recognition.
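The joint cue-space analysis summarised above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: a Jeffress-style coincidence layer evaluates each candidate ITD lag, and the two-dimensional recurrent timing layer is approximated here by autocorrelation-based periodicity analysis of the coincidence output at each lag. The peak of the resulting ITD-period map gives a joint localisation and F0 estimate. All signal parameters (sample rate, harmonic complex, delay) are invented for the example.

```python
import numpy as np

def itd_f0_map(left, right, max_itd=16, periods=range(40, 201)):
    """Build a joint ITD-period activation map: a Jeffress-style
    coincidence layer over ITD lags, with the recurrent timing layer
    approximated by autocorrelation at each candidate pitch period."""
    lags = range(-max_itd, max_itd + 1)
    amap = np.zeros((len(lags), len(periods)))
    for i, tau in enumerate(lags):
        # Coincidence detection: multiply left against right delayed by tau.
        coinc = left * np.roll(right, tau)
        for j, p in enumerate(periods):
            # Periodicity energy at delay-loop period p (autocorrelation
            # lag p), standing in for a recurrent delay loop of period p.
            amap[i, j] = np.dot(coinc[:-p], coinc[p:])
    return np.array(lags), np.array(periods), amap

# Demo: a 100 Hz harmonic complex (period 80 samples at 8 kHz),
# lateralised with a 5-sample interaural delay.
fs = 8000
t = np.arange(0, 0.1, 1.0 / fs)
sig = sum(np.sin(2 * np.pi * k * 100.0 * t) for k in (1, 2, 3))
itd = 5
left = sig
right = np.roll(sig, -itd)  # right channel advanced by itd samples

lags, periods, amap = itd_f0_map(left, right)
i, j = np.unravel_index(np.argmax(amap), amap.shape)
print("ITD lag:", lags[i], "samples; pitch period:", periods[j], "samples")
# → ITD lag: 5 samples; pitch period: 80 samples
```

The map peaks where the coincidence output is both strongest (correct ITD) and most periodic (correct pitch period), which is the separability property the paper exploits: two voices with different ITDs or F0s occupy distinct peaks in this space.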


Keywords: Automatic Speech Recognition, Binary Mask, Coincidence Detector, Automatic Speech Recognition System, Pitch Period





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Stuart N. Wrigley¹
  • Guy J. Brown¹

  1. Department of Computer Science, University of Sheffield, Sheffield, United Kingdom
