Circuits, Systems, and Signal Processing

Volume 38, Issue 11, pp 5253–5288

Robust Methods for Text-Dependent Speaker Verification

  • Ramesh K. Bhukya
  • S. R. Mahadeva Prasanna
  • Biswajit Dev Sarma


In this work, we explore noise-robust techniques at different stages of a Text-Dependent Speaker Verification (TDSV) system. At the signal level, noise compensation is performed by a robust end point detection technique based on speech-specific knowledge. At the feature level, compensation uses robust features extracted from the Hilbert Spectrum (HS) of the Intrinsic Mode Functions (IMFs) obtained from Modified Empirical Mode Decomposition (MEMD) of speech. We also explore a combined temporal and spectral speech enhancement technique, applied prior to end point detection, for enhancing speech regions embedded in noise. All experiments are conducted on two databases, RSR2015 and the IITG database. Robust end point detection improves the performance of the TDSV system compared to energy-based end point detection in both clean and degraded speech conditions. Noise-robust HS features augmented with Mel-frequency cepstral coefficients (MFCCs) further improve the performance of the system. Applying speech enhancement prior to signal-level and feature-level compensation yields a further improvement in the low-SNR cases. The final system, which combines the three robust methods, provides a relative improvement in equal error rate (EER) of 6 to 25% on the RSR2015 database corrupted with Babble noise of varying strength, and of around 30 to 45% on the IITG database.
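The results above are stated as relative improvements in EER. As a minimal sketch of how such figures are commonly computed (not the authors' evaluation code), the Python below estimates the EER from two lists of trial scores and then the relative improvement between two EER values. The function names `equal_error_rate` and `relative_improvement` are hypothetical, and the sketch assumes similarity scores where higher means "more likely the claimed speaker"; distance scores such as DTW distances would need to be negated first.

```python
from typing import Sequence, Tuple


def equal_error_rate(genuine: Sequence[float],
                     impostor: Sequence[float]) -> Tuple[float, float]:
    """Return (eer, threshold) for similarity scores (higher = genuine).

    Sweeps every observed score as a candidate threshold and picks the
    one where false acceptance and false rejection rates are closest.
    """
    best_gap, best_eer, best_thr = float("inf"), 1.0, 0.0
    for thr in sorted(set(genuine) | set(impostor)):
        # Impostor trials accepted at this threshold.
        far = sum(s >= thr for s in impostor) / len(impostor)
        # Genuine trials rejected at this threshold.
        frr = sum(s < thr for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, best_eer, best_thr = gap, (far + frr) / 2, thr
    return best_eer, best_thr


def relative_improvement(baseline_eer: float, improved_eer: float) -> float:
    """Relative EER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_eer - improved_eer) / baseline_eer


# Toy scores for illustration only; they are not taken from the paper.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
```

Under this definition, a system that lowers the EER from 10% to 7.5% shows a 25% relative improvement, even though the absolute reduction is only 2.5 percentage points.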


End point detection, VLRs, Dominant resonant frequency, Glottal activity detection, Foreground speech segmentation, MEMD, IMFs, Hilbert spectrum, MFCCs, TDSV, DTW




Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Electro Medical and Speech Technology Laboratory, Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati, India
  2. Department of Electrical Engineering, Indian Institute of Technology Dharwad, Dharwad, India
  3. Bay Area Advanced Analytics India (P) Ltd., a Kaliber.AI company, Guwahati, India
