Unsupervised Auditory Saliency Enabled Binaural Scene Analyzer for Speaker Localization and Recognition

  • R. Venkatesan
  • A. Balaji Ganesh
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 678)


The paper presents an unsupervised binaural scene analyzer that simultaneously localizes, detects and recognizes a target speaker in a reverberant environment with interfering noise. The proposed technique has three main stages: sound source localization, speaker recognition and an auditory saliency-based indexing system. In the first stage, the target speaker is localized by feeding binaural cues into an azimuth-dependent GMM-EM classifier. In the second stage, the study proposes a Gabor-Hilbert Envelope Coefficient (GHEC) based spectro-temporal feature extractor as an efficient speaker recognition method that shows improved robustness at low computational complexity. The Hilbert envelope provides relevant acoustic information and improves speaker identification performance across different reverberant environments and SNR values. In the third stage, auditory saliency-based diarization is proposed as a way of indexing speech content by the image identity of the target speaker. The proposed system may be used to catalogue entire speech content against the corresponding image of the target speaker, which has a wide range of applications, including teleconferencing, hands-free communication, meeting hall content localization and fast audio retrieval from a repository.
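As a rough illustration of the binaural cues mentioned in the first stage, the sketch below estimates an interaural time difference (ITD) and an interaural level difference (ILD) for a single frame of a two-channel signal. It is not the authors' implementation: the function name `binaural_cues`, the broadband cross-correlation approach, the sample rate and the synthetic test signal are all assumptions made for the example. A practical system would more likely compute such cues per frequency band before passing them to the azimuth-dependent classifier.

```python
import math


def binaural_cues(left, right, sample_rate, max_lag):
    """Return (itd_seconds, ild_db) for one frame of a two-channel signal.

    Hypothetical helper for illustration only. A positive ITD means the
    right channel lags the left channel.
    """
    n = len(left)
    best_lag, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        # Cross-correlate left[i] with right[i + lag] over the valid overlap.
        score = sum(left[i] * right[i + lag]
                    for i in range(max(0, -lag), min(n, n - lag)))
        if score > best_score:
            best_lag, best_score = lag, score
    itd = best_lag / sample_rate

    # ILD as the energy ratio of the two channels in decibels.
    eps = 1e-12  # guards against division by zero on silent frames
    e_left = sum(x * x for x in left) + eps
    e_right = sum(x * x for x in right) + eps
    ild = 10.0 * math.log10(e_left / e_right)
    return itd, ild


# Synthetic frame (assumed values): a windowed sine burst padded with silence.
sample_rate = 16000
burst = [math.sin(2 * math.pi * 0.1 * k) * 0.5 * (1 - math.cos(2 * math.pi * k / 31))
         for k in range(32)]
left = [0.0] * 10 + burst + [0.0] * 10

# Right channel: the same burst delayed by 3 samples at half the amplitude.
delay = 3
right = [0.0] * delay + [0.5 * x for x in left[:-delay]]

itd, ild = binaural_cues(left, right, sample_rate, max_lag=8)
print(f"ITD = {itd * 1e6:.1f} microseconds, ILD = {ild:.2f} dB")
```

In this toy setup the right channel is quieter and arrives later, which is the pattern expected for a source on the listener's left. Cue pairs like these, collected over many frames, are the kind of feature a per-azimuth Gaussian mixture model could be trained on.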


Binaural cues; Computational auditory scene analysis; Automatic speaker recognition; GHEC; Short-term Fourier transform; Auditory saliency



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. Electronic System Design Laboratory, TIFAC-CORE, Velammal Engineering College, Chennai, India
