Circuits, Systems, and Signal Processing, Volume 37, Issue 8, pp 3383–3411

Binaural Classification-Based Speech Segregation and Robust Speaker Recognition System

Abstract

The paper presents an auditory scene analyser that comprises two modules operating jointly: binaural speech segregation and speaker recognition. Binaural speech segregation is realized by feeding interaural time and level differences, the interaural phase difference and interaural coherence, along with the direct-to-reverberant ratio, into a deep recurrent neural network. The performance of the deep recurrent network-based speech segregation is evaluated in terms of source-to-interference ratio, source-to-distortion ratio and source-to-artifacts ratio, and compared with existing architectures including the deep neural network. It is observed that the performance of a conventional deep recurrent neural network can be improved further by combining discriminative objectives with soft time–frequency masking incorporated as a layer in the network structure. The paper also proposes a spectro-temporal feature extractor, referred to as Gabor–Hilbert envelope coefficients (GHEC). The proposed monaural feature extracts discriminative acoustic information from the segregated speech sources. The performance of GHEC is validated under various noisy and reverberant environments, and the results are compared with existing monaural features. The binaural speech segregation achieves an average signal-to-noise ratio gain of 0.7 dB over other baseline algorithms, even at a high reverberation time of 0.89 s.
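
As context for the binaural front end, the sketch below shows one common way to estimate the interaural cues named above (ITD, ILD, IPD and interaural coherence) from STFT frames of a left/right signal pair. The frame size, lag range and GCC-PHAT weighting are illustrative assumptions; the paper's own auditory filterbank and direct-to-reverberant ratio estimator are not reproduced here.

```python
# Illustrative frame-level binaural cue extraction (not the paper's exact
# front end): ITD via GCC-PHAT peak picking, ILD as a log energy ratio,
# IPD as cross-spectrum phase, and a broadband interaural coherence value.
import numpy as np

def binaural_cues(left, right, fs, n_fft=512, hop=256, max_lag=32):
    win = np.hanning(n_fft)
    n_frames = 1 + (len(left) - n_fft) // hop
    itd = np.empty(n_frames)
    ild = np.empty(n_frames)
    ipd = np.empty((n_frames, n_fft // 2 + 1))
    ic = np.empty(n_frames)
    for t in range(n_frames):
        l = left[t * hop : t * hop + n_fft] * win
        r = right[t * hop : t * hop + n_fft] * win
        L, R = np.fft.rfft(l), np.fft.rfft(r)
        cross = L * np.conj(R)
        # ITD: lag of the GCC-PHAT peak within +/- max_lag samples
        gcc = np.fft.irfft(cross / (np.abs(cross) + 1e-12))
        lags = np.concatenate((gcc[-max_lag:], gcc[: max_lag + 1]))
        itd[t] = (np.argmax(lags) - max_lag) / fs
        # ILD: log energy ratio between the two ears, in dB
        ild[t] = 10 * np.log10((l @ l + 1e-12) / (r @ r + 1e-12))
        # IPD: phase of the cross-spectrum in each frequency bin
        ipd[t] = np.angle(cross)
        # IC: normalized broadband cross-correlation coefficient in [0, 1]
        ic[t] = np.abs(cross.sum()) / np.sqrt(
            (np.abs(L) ** 2).sum() * (np.abs(R) ** 2).sum() + 1e-12)
    return itd, ild, ipd, ic
```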
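
The abstract's central architectural idea is that soft time–frequency masking is embedded as a layer of the recurrent network and trained with a discriminative objective. Below is a minimal PyTorch sketch of that idea; the GRU depth, layer widths and discriminative weight gamma are illustrative assumptions, not the paper's configuration.

```python
# Illustrative sketch of soft time-frequency masking as a network layer
# trained with a discriminative objective; all sizes and the weight gamma
# are assumptions for illustration.
import torch
import torch.nn as nn

class MaskingDRNN(nn.Module):
    def __init__(self, n_bins=257, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_bins, hidden, num_layers=2, batch_first=True)
        self.out1 = nn.Linear(hidden, n_bins)  # magnitude estimate, source 1
        self.out2 = nn.Linear(hidden, n_bins)  # magnitude estimate, source 2

    def forward(self, mix_mag):                # (batch, frames, bins)
        h, _ = self.rnn(mix_mag)
        y1 = torch.relu(self.out1(h))
        y2 = torch.relu(self.out2(h))
        # Soft mask layer: the two masks sum to one in every T-F unit,
        # so the mixture magnitude is rescaled rather than regenerated.
        m1 = y1 / (y1 + y2 + 1e-8)
        return m1 * mix_mag, (1.0 - m1) * mix_mag

def discriminative_loss(s1_hat, s2_hat, s1, s2, gamma=0.05):
    # Penalize the error to each target source while rewarding distance
    # from the competing source (the discriminative term).
    mse = nn.functional.mse_loss
    return (mse(s1_hat, s1) + mse(s2_hat, s2)
            - gamma * (mse(s1_hat, s2) + mse(s2_hat, s1)))
```

In training, the two masked magnitudes would be compared against the clean source spectrograms with discriminative_loss, so the mask and the recurrent layers are optimized jointly.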
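
For the proposed monaural feature, the sketch below assumes GHEC follows a pipeline similar to mean Hilbert envelope coefficients, with a Gabor filterbank front end: subband filtering, Hilbert envelope extraction, frame averaging, log compression and a DCT. The filter design and all parameter values here are assumptions for illustration, not the paper's recipe.

```python
# Illustrative Gabor-Hilbert envelope feature: Gabor bandpass filterbank,
# subband Hilbert envelopes, frame averaging, log compression and DCT.
# Filter spacing, counts and frame sizes are assumptions, not the paper's.
import numpy as np
from scipy.fft import dct
from scipy.signal import hilbert

def gabor_filterbank(n_filters, n_taps, fs, f_lo=100.0, f_hi=7000.0):
    # Gaussian-windowed cosines with log-spaced centre frequencies
    centers = np.geomspace(f_lo, f_hi, n_filters)
    t = (np.arange(n_taps) - n_taps // 2) / fs
    sigma = n_taps / (6.0 * fs)  # Gaussian decays to ~0 at the filter edges
    return np.cos(2 * np.pi * centers[:, None] * t) * np.exp(-0.5 * (t / sigma) ** 2)

def ghec(signal, fs, n_filters=24, n_ceps=13, frame=400, hop=160):
    bank = gabor_filterbank(n_filters, 512, fs)
    n_frames = 1 + (len(signal) - frame) // hop
    env_feats = np.empty((n_frames, n_filters))
    for b, h in enumerate(bank):
        sub = np.convolve(signal, h, mode="same")   # subband signal
        env = np.abs(hilbert(sub))                  # Hilbert envelope
        for t in range(n_frames):
            env_feats[t, b] = env[t * hop : t * hop + frame].mean()
    # Log compression followed by a DCT yields cepstral-style coefficients
    return dct(np.log(env_feats + 1e-8), type=2, axis=1, norm="ortho")[:, :n_ceps]
```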

Keywords

Binaural cues · Computational auditory scene analysis · Automatic speaker recognition · Gabor–Hilbert envelope features · Deep recurrent neural network · Soft time–frequency masking

Acknowledgements

The authors wish to thank the Department of Science and Technology for awarding a project under the Cognitive Science Initiative Programme (DST File No. SR/CSI/09/2011), through which this work was implemented. The authors are also very grateful to the anonymous reviewers for their valuable and constructive suggestions, which improved the quality of the manuscript.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2017

Authors and Affiliations

  1. Electronic System Design Laboratory, Department of Electrical and Electronics Engineering, Velammal Engineering College, Chennai, India
