End Point Detection Using Speech-Specific Knowledge for Text-Dependent Speaker Verification

Bhukya, Ramesh K.; Sarma, Biswajit Dev; Prasanna, S. R. Mahadeva

doi:10.1007/s00034-018-0827-3

Published: 04 May 2018

Circuits, Systems, and Signal Processing, Volume 37, pages 5507–5539 (2018)

Ramesh K. Bhukya (ORCID: 0000-0002-6221-5627), Biswajit Dev Sarma and S. R. Mahadeva Prasanna

Abstract

This paper proposes a method that uses speech-specific knowledge to detect the begin and end points of speech under degraded conditions. The method is based on vowel-like region (VLR) detection and uses both excitation source and vocal tract system information. The existing method for VLR detection uses excitation source information. Vocal tract system information, obtained from the dominant resonant frequency, is used to eliminate spurious VLRs in background noise. Foreground speech segmentation using excitation and vocal tract system information is carried out to remove spurious VLRs in the background speech region. The end points are localized more precisely by using more detailed excitation source information, in terms of glottal activity, to detect sonorant consonants and missed VLRs. To include unvoiced consonants, obstruent region detection is performed at the beginning of the first VLR and at the end of the last VLR. The detected begin and end points are evaluated by comparing them with manually marked end points and by conducting text-dependent speaker verification experiments. The proposed method performs better than some of the existing techniques.
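
The last stages described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the VLR detector, the spurious-region checks and the obstruent detector are not reproduced here. The `Region` class, its `is_spurious` flag, and the `obstruent_before` and `obstruent_after` durations are hypothetical stand-ins for the outputs of those stages.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Region:
    """A candidate vowel-like region, in seconds (hypothetical container)."""
    start: float
    end: float
    # Hypothetical flag standing in for the spurious-VLR checks
    # (dominant resonant frequency, foreground speech segmentation).
    is_spurious: bool = False


def end_points(
    regions: List[Region],
    obstruent_before: float = 0.0,
    obstruent_after: float = 0.0,
) -> Optional[Tuple[float, float]]:
    """Return (begin, end) spanning the retained regions, or None if none remain.

    obstruent_before / obstruent_after are assumed durations of obstruent
    regions found just before the first and just after the last region.
    """
    kept = [r for r in regions if not r.is_spurious]
    if not kept:
        return None
    begin = min(r.start for r in kept) - obstruent_before
    end = max(r.end for r in kept) + obstruent_after
    return max(begin, 0.0), end  # the begin point cannot precede the signal


def deviation(
    detected: Tuple[float, float], manual: Tuple[float, float]
) -> Tuple[float, float]:
    """Absolute begin and end deviations from manually marked end points."""
    return abs(detected[0] - manual[0]), abs(detected[1] - manual[1])


if __name__ == "__main__":
    candidates = [
        Region(0.10, 0.20, is_spurious=True),  # e.g. rejected as background noise
        Region(0.50, 0.70),
        Region(0.90, 1.20),
        Region(1.80, 1.90, is_spurious=True),  # e.g. rejected as background speech
    ]
    detected = end_points(candidates, obstruent_before=0.05, obstruent_after=0.08)
    print(detected)
    print(deviation(detected, manual=(0.40, 1.30)))
```

The sketch only shows how rejecting spurious regions and extending the first and last retained regions moves the detected end points; how those decisions are actually made is described in the full article.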

Author information

Authors and Affiliations

Electro Medical and Speech Technology Laboratory, Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati, 781039, India
Ramesh K. Bhukya, Biswajit Dev Sarma & S. R. Mahadeva Prasanna

Corresponding author

Correspondence to Ramesh K. Bhukya.

About this article

Cite this article

Bhukya, R.K., Sarma, B.D. & Prasanna, S.R.M. End Point Detection Using Speech-Specific Knowledge for Text-Dependent Speaker Verification. Circuits Syst Signal Process 37, 5507–5539 (2018). https://doi.org/10.1007/s00034-018-0827-3

Received: 24 November 2017
Revised: 22 April 2018
Accepted: 23 April 2018
Published: 04 May 2018
Issue Date: December 2018
DOI: https://doi.org/10.1007/s00034-018-0827-3
