
Automatic short utterance speaker recognition using stationary wavelet coefficients of pitch synchronised LP residual

Published in: International Journal of Speech Technology

Abstract

Automatic speaker recognition (ASR) is a challenging task when the duration of the test speech is very short, i.e., only a few seconds. Source features extracted from short speech utterances have been shown to be effective in such cases. This paper proposes a system based on the linear prediction (LP) residual for text-independent speaker recognition. The discrete wavelet transform (DWT) and the stationary wavelet transform (SWT) are investigated for parameterising the LP residual. DWT performs well for denoising and compression, whereas SWT reconstructs a noisy signal better at higher levels of decomposition than DWT. SWT/DWT coefficients of the LP residual are used to implement an i-vector/PLDA based speaker recognition system. The effectiveness of the system is evaluated on the 10 s–10 s task of the NIST speaker recognition evaluation (SRE) 2010 database. To evaluate robustness in degraded environments, the speech files are mixed with white noise from the NOISEX-92 database. Speaker recognition using SWT level 3 results in an equal error rate (EER) of 40% and a decision cost function (DCF) of 0.3956 for the voiced part of the signal on the 10 s training–10 s testing data set. The proposed method is shown to give robust speaker recognition performance in terms of DCF.
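The front end described in the abstract (inverse-filter the speech to get the LP residual, then decompose the residual with an undecimated wavelet transform) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses a 1-level Haar SWT instead of the level-3 decomposition reported in the paper, skips pitch synchronisation entirely, substitutes a synthetic AR(2) process for a voiced speech frame, and all function names (`lpc_levinson`, `lp_residual`, `swt_haar_level1`) are hypothetical.

```python
import numpy as np

def lpc_levinson(x, order):
    """LP coefficients a[0..order] (a[0] = 1) via Levinson-Durbin recursion."""
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[1:i][::-1]
        k = -acc / err          # reflection coefficient
        prev = a.copy()
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k      # prediction-error update
    return a

def lp_residual(x, order=10):
    """LP residual: inverse-filter the signal with its own LP polynomial."""
    a = lpc_levinson(x, order)
    return np.convolve(x, a)[: len(x)]

def swt_haar_level1(x):
    """One level of the undecimated (stationary) Haar transform.
    Circular extension; approximation and detail bands keep the input
    length because, unlike the DWT, the SWT does not downsample."""
    s = 1.0 / np.sqrt(2.0)
    xm = np.roll(x, 1)
    return s * (x + xm), s * (x - xm)

# Toy voiced-frame stand-in: an AR(2) process driven by white noise.
# Inverse filtering should recover the (nearly flat) excitation.
rng = np.random.default_rng(0)
n = 2048
e = rng.standard_normal(n)
x = np.zeros(n)
for t in range(2, n):
    x[t] = 1.3 * x[t - 1] - 0.4 * x[t - 2] + e[t]

resid = lp_residual(x, order=10)
approx, detail = swt_haar_level1(resid)
```

Because the SWT is shift invariant and redundant, the two bands together reconstruct the residual exactly (here, `(approx + detail) / sqrt(2)`), which is the property that makes it better suited than the DWT for analysing the residual at higher decomposition levels.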

(Figures 1–12 are available in the full-text article.)



Acknowledgements

The authors would like to thank the Kerala State Council for Science Technology and Environment (KSCSTE), India, for its support, and the Indian Institute of Technology Hyderabad (IITH) for its technical support.

Author information

Correspondence to V. R. Sreehari.


Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Sreehari, V.R., Mary, L. Automatic short utterance speaker recognition using stationary wavelet coefficients of pitch synchronised LP residual. Int J Speech Technol 25, 147–161 (2022). https://doi.org/10.1007/s10772-021-09895-z

