Abstract
Automatic speaker recognition (ASR) is a challenging task when the duration of the test speech is very short, i.e., a few seconds. Source features extracted from short speech utterances have been shown to be effective in such cases. This paper proposes a text-independent speaker recognition system based on the LP residual. The discrete wavelet transform (DWT) and the stationary wavelet transform (SWT) are investigated for parameterizing the LP residual. DWT works well for denoising and compression, whereas SWT reconstructs a noisy signal better at higher levels of decomposition than DWT. The SWT/DWT coefficients of the LP residual are used to implement an i-vector/PLDA based speaker recognition system. The effectiveness of the system is evaluated on the 10 s–10 s task of the NIST speaker recognition evaluation (SRE) 2010 database. To evaluate robustness in degraded environments, the speech files are mixed with white noise from the NOISEX-92 database. Speaker recognition using SWT level-3 coefficients yields an equal error rate (EER) of 40 and a decision cost function (DCF) of 0.3956 for the voiced part of the signal on the 10 s training, 10 s testing data set. The proposed method is shown to give robust speaker recognition performance in terms of DCF.
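The front end described above (LP residual extraction followed by wavelet analysis) can be sketched in plain NumPy. The snippet below is a minimal, illustrative sketch, not the authors' implementation: it computes the LP residual by the autocorrelation method (Levinson-Durbin recursion plus inverse filtering) and applies a single-level Haar DWT and SWT. The paper's level-3 SWT decomposition, pitch synchronisation, and the i-vector/PLDA backend are omitted, and all function names here are hypothetical.

```python
import numpy as np

def lp_residual(x, order=10):
    """LP residual via the autocorrelation method: Levinson-Durbin
    recursion for the predictor, then inverse filtering."""
    n = len(x)
    r = np.correlate(x, x, mode="full")[n - 1:n + order]  # r[0..order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    e = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / e                     # reflection coefficient
        rev = a[:i][::-1].copy()         # a[i-1], ..., a[0]
        a[1:i + 1] += k * rev
        e *= (1.0 - k * k)               # prediction error update
    # inverse filter: residual[t] = sum_j a[j] * x[t - j]
    return np.convolve(a, x)[:n]

def haar_dwt(x):
    """One level of the decimated Haar DWT (output length halves)."""
    x = x[:len(x) // 2 * 2]              # truncate to even length
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def haar_swt(x):
    """One level of the stationary (undecimated) Haar wavelet
    transform with circular extension: output length equals input."""
    xs = np.roll(x, -1)
    approx = (x + xs) / np.sqrt(2.0)
    detail = (x - xs) / np.sqrt(2.0)
    return approx, detail

# Demo on a synthetic voiced-like signal (sinusoid plus noise).
rng = np.random.default_rng(0)
t = np.arange(512)
x = np.sin(2 * np.pi * 0.03 * t) + 0.1 * rng.standard_normal(512)
res = lp_residual(x, order=10)
ca, cd = haar_swt(res)                   # same length as the residual
a1, d1 = haar_dwt(res)                   # half the length
```

Because the Haar DWT is orthonormal it preserves signal energy, while the undecimated SWT (being redundant) doubles it; these invariants make the sketch easy to sanity-check before swapping in a higher-order wavelet or more decomposition levels.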
Acknowledgements
The authors would like to thank the Kerala State Council for Science, Technology and Environment (KSCSTE), India, for its support. They would also like to thank the Indian Institute of Technology Hyderabad (IITH) for technical support.
Cite this article
Sreehari, V.R., Mary, L. Automatic short utterance speaker recognition using stationary wavelet coefficients of pitch synchronised LP residual. Int J Speech Technol 25, 147–161 (2022). https://doi.org/10.1007/s10772-021-09895-z