Abstract
This paper proposes a method that uses speech-specific knowledge to detect the begin and end points of speech under degraded conditions. The method is based on vowel-like region (VLR) detection and uses both excitation source and vocal tract system information. The existing method for VLR detection uses only excitation source information. Vocal tract system information, in the form of the dominant resonant frequency, is used to eliminate spurious VLRs detected in background noise. Foreground speech segmentation using excitation source and vocal tract system information is carried out to remove spurious VLRs in the background speech region. The end points are localized more precisely by using more detailed excitation source information, in terms of glottal activity, to detect sonorant consonants and missed VLRs. To include unvoiced consonants, obstruent region detection is performed at the beginning of the first VLR and at the end of the last VLR. The detected begin and end points are evaluated by comparing them with manually marked end points and by conducting text-dependent speaker verification experiments. The proposed method performs better than some of the existing techniques.
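As a rough illustration of the final steps described in the abstract, the sketch below takes the begin point as the start of the first retained region and the end point as the end of the last one, then compares them with manually marked end points. It is a minimal sketch, not the authors' implementation: the `Region` type, the function names, the example timings, and the 50 ms tolerance are all hypothetical and chosen only for illustration, and the region detection itself (VLR, glottal activity, and obstruent detection) is assumed to have been done elsewhere.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    """A detected speech region (hypothetical type); times in seconds."""
    start: float
    end: float


def end_points(regions: List[Region]) -> Tuple[float, float]:
    """Begin point = earliest region start, end point = latest region end."""
    if not regions:
        raise ValueError("no regions detected")
    return min(r.start for r in regions), max(r.end for r in regions)


def deviation_ms(detected: Tuple[float, float],
                 manual: Tuple[float, float]) -> Tuple[float, float]:
    """Absolute deviation of (begin, end) from the manual marks, in ms."""
    return (abs(detected[0] - manual[0]) * 1000.0,
            abs(detected[1] - manual[1]) * 1000.0)


def within_tolerance(detected: Tuple[float, float],
                     manual: Tuple[float, float],
                     tol_ms: float = 50.0) -> bool:
    """True if both end points lie within tol_ms of the manual marks.

    The 50 ms default is an assumed value, not one taken from the paper.
    """
    return all(dev <= tol_ms for dev in deviation_ms(detected, manual))


if __name__ == "__main__":
    # Made-up regions that survived spurious-region removal.
    regions = [Region(0.42, 0.55), Region(0.60, 0.91), Region(1.10, 1.38)]
    detected = end_points(regions)
    manual = (0.40, 1.41)  # made-up manual marks
    print(detected, deviation_ms(detected, manual),
          within_tolerance(detected, manual))
```

Reporting the deviation for the begin and end points separately keeps the two kinds of error visible, which matters because a missed initial consonant and a late end point affect verification differently.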
Cite this article
Bhukya, R.K., Sarma, B.D. & Prasanna, S.R.M. End Point Detection Using Speech-Specific Knowledge for Text-Dependent Speaker Verification. Circuits Syst Signal Process 37, 5507–5539 (2018). https://doi.org/10.1007/s00034-018-0827-3
DOI: https://doi.org/10.1007/s00034-018-0827-3