Abstract
This paper presents an auditory scene analyser comprising two jointly operating modules: binaural speech segregation and speaker recognition. Binaural speech segregation is realized by feeding interaural time and level differences, interaural phase difference, and interaural coherence, together with the direct-to-reverberant ratio, into a deep recurrent neural network. The segregation performance of the deep recurrent network is validated in terms of source-to-interference, source-to-distortion, and source-to-artifacts ratios and compared with existing architectures, including a deep neural network. It is observed that the performance of a conventional deep recurrent neural network can be further improved by incorporating discriminative objectives along with a soft time–frequency masking layer into the network structure. The system also introduces a spectro-temporal feature extractor, referred to as Gabor–Hilbert envelope coefficients (GHEC). This monaural feature extracts discriminative acoustic information from the segregated speech sources. The performance of GHEC is validated under various noisy and reverberant environments, and the results are compared with existing monaural features. The binaural speech segregation achieves a signal-to-noise ratio improvement averaging 0.7 dB over baseline algorithms, even at a high reverberation time of 0.89 s.
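To illustrate two of the binaural cues named above, the sketch below estimates a broadband interaural time difference (ITD, via cross-correlation) and interaural level difference (ILD, as an energy ratio in dB) from a stereo frame. The function name `binaural_cues` and the broadband, single-frame formulation are simplifications chosen here for illustration; the system described in the paper extracts these cues per time–frequency unit inside an auditory front end.

```python
import numpy as np

def binaural_cues(left, right, fs):
    """Estimate a broadband ITD (seconds, positive when the left ear
    leads) via cross-correlation, and ILD (dB) as the energy ratio
    between the two ear signals."""
    corr = np.correlate(left, right, mode="full")
    # Index len(right)-1 corresponds to zero lag; convert the peak
    # index to a lead of the left channel in samples, then to seconds.
    itd = ((len(right) - 1) - np.argmax(corr)) / fs
    eps = 1e-12  # guard against log of zero for silent frames
    ild = 10.0 * np.log10((np.sum(left ** 2) + eps) /
                          (np.sum(right ** 2) + eps))
    return itd, ild

# Toy check: the right channel is the left channel delayed by
# 5 samples and attenuated by half (roughly -6 dB).
fs = 16000
rng = np.random.default_rng(0)
left = rng.standard_normal(1024)
right = np.zeros_like(left)
right[5:] = 0.5 * left[:-5]
itd, ild = binaural_cues(left, right, fs)
```

For this toy signal the estimated ITD is 5 samples (left leading) and the ILD is roughly 6 dB, matching the constructed delay and attenuation.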
Acknowledgements
The authors wish to thank the Department of Science and Technology for awarding a project under the Cognitive Science Initiative Programme (DST File No. SR/CSI/09/2011), through which this work was implemented. The authors are also grateful to the anonymous reviewers for their valuable and constructive suggestions, which improved the quality of the manuscript.
Cite this article
Venkatesan, R., Balaji Ganesh, A. Binaural Classification-Based Speech Segregation and Robust Speaker Recognition System. Circuits Syst Signal Process 37, 3383–3411 (2018). https://doi.org/10.1007/s00034-017-0712-5