
Speaker recognition under stressed condition

International Journal of Speech Technology

Abstract

This paper presents a feature analysis and the design of compensators for speaker recognition under stressed speech conditions. Any condition that causes a speaker to vary his or her speech production from the normal or neutral condition is called a stressed speech condition. Stressed speech can be induced by emotion, high workload, sleep deprivation, frustration and environmental noise. Under stress, the characteristics of the speech signal differ from those of normal or neutral speech, and these changes can degrade the performance of a speaker recognition system. First, six speech features widely used for speaker recognition (mel-frequency cepstral coefficients (MFCC), linear prediction (LP) coefficients, linear prediction cepstral coefficients (LPCC), reflection coefficients (RC), arc-sin reflection coefficients (ARC) and log-area ratios (LAR)) are analyzed to evaluate their characteristics under stressed conditions. Second, a Vector Quantization (VQ) classifier and a Gaussian Mixture Model (GMM) are used to evaluate speaker recognition performance with each feature. This analysis helps select the best feature set for speaker recognition under stressed conditions. Finally, four novel VQ-based compensation techniques are proposed and evaluated for improving speaker recognition under stressed conditions: speaker and stressed information based compensation (SSIC), compensation by removal of stressed vectors (CRSV), cepstral mean normalization (CMN) and combination of MFCC and sinusoidal amplitude (CMSA) features. Speech data from the SUSAS (Speech Under Simulated and Actual Stress) database for four speaking conditions, Angry, Lombard, Question and Neutral, are used for the analysis.
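Five of the six features listed in the abstract (LP, LPCC, RC, ARC and LAR) come from the same linear prediction analysis of a speech frame; they differ mainly in how that information is represented. The following is a minimal sketch, assuming the autocorrelation method with the Levinson-Durbin recursion and the predictor convention s[n] ~ sum_j a_j s[n-j]. The function names, the LAR sign convention, the model order and the synthetic frame are illustrative assumptions, not the authors' implementation, and MFCC is not covered.

```python
import math
import random

# Illustrative sketch only; not the implementation used in the paper.


def autocorrelation(frame, order):
    """Biased autocorrelation r[0..order] of one (already windowed) frame."""
    n = len(frame)
    return [sum(frame[t] * frame[t - lag] for t in range(lag, n))
            for lag in range(order + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations with the Levinson-Durbin recursion.

    Returns (a, k, err): predictor coefficients a_1..a_p for the model
    s[n] ~ sum_j a_j * s[n - j], reflection coefficients k_1..k_p, and the
    final prediction-error energy. Assumes r[0] > 0 (a non-silent frame).
    """
    a = [0.0] * (order + 1)  # index 0 unused
    k = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k[i] = acc / err
        prev = a[:]
        a[i] = k[i]
        for j in range(1, i):
            a[j] = prev[j] - k[i] * prev[i - j]
        err *= 1.0 - k[i] ** 2
    return a[1:], k[1:], err


def arcsin_reflection(k):
    """ARC features: arcsin of each reflection coefficient (|k_i| < 1)."""
    return [math.asin(ki) for ki in k]


def log_area_ratios(k):
    """LAR features, using one common sign convention (others negate it)."""
    return [math.log((1.0 - ki) / (1.0 + ki)) for ki in k]


def lp_cepstrum(a, n_ceps):
    """LPCC from predictor coefficients via the standard recursion.

    c_n = a_n + sum_{m=max(1, n-p)}^{n-1} (m/n) c_m a_{n-m}, with a_n = 0
    for n > p. The gain term c_0 is omitted.
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for m in range(max(1, n - p), n):
            acc += (m / n) * c[m] * a[n - m - 1]
        c[n] = acc
    return c[1:]


if __name__ == "__main__":
    # Hypothetical 240-sample frame: a sinusoid plus seeded noise,
    # Hamming-windowed. It stands in for real speech, not SUSAS data.
    rng = random.Random(0)
    N, p = 240, 12
    frame = [(math.sin(0.3 * t) + 0.1 * rng.gauss(0.0, 1.0))
             * (0.54 - 0.46 * math.cos(2 * math.pi * t / (N - 1)))
             for t in range(N)]
    a, k, _ = levinson_durbin(autocorrelation(frame, p), p)
    print("LP  :", [round(x, 3) for x in a])
    print("RC  :", [round(x, 3) for x in k])
    print("ARC :", [round(x, 3) for x in arcsin_reflection(k)])
    print("LAR :", [round(x, 3) for x in log_area_ratios(k)])
    print("LPCC:", [round(x, 3) for x in lp_cepstrum(a, 16)])
```

Because RC, ARC and LAR are pointwise transforms of the same reflection coefficients, any difference in their robustness to stress comes from how each transform weights values of k near plus or minus one.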
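The VQ classifier mentioned in the abstract represents each enrolled speaker by a codebook of feature-space centroids and identifies a test utterance by the codebook that quantizes its feature vectors with the lowest average distortion. The sketch below assumes plain k-means training (a simplification of the LBG splitting procedure commonly used for VQ speaker recognition) and squared Euclidean distance; the codebook size, iteration count and function names are hypothetical choices, not the paper's settings.

```python
import random

# Illustrative VQ speaker identification; not the authors' implementation.


def sq_dist(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def train_codebook(vectors, size, iters=20, seed=0):
    """Plain k-means codebook (a simplification of LBG training).

    vectors: list of equal-length feature vectors from one speaker.
    size: number of codewords; must not exceed len(vectors).
    """
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, size)]
    for _ in range(iters):
        cells = [[] for _ in range(size)]
        for v in vectors:
            nearest = min(range(size), key=lambda i: sq_dist(v, codebook[i]))
            cells[nearest].append(v)
        for i, members in enumerate(cells):
            if members:  # an empty cell keeps its previous codeword
                dim = len(members[0])
                codebook[i] = [sum(m[d] for m in members) / len(members)
                               for d in range(dim)]
    return codebook


def avg_distortion(vectors, codebook):
    """Mean distance from each vector to its nearest codeword."""
    return sum(min(sq_dist(v, c) for c in codebook) for v in vectors) / len(vectors)


def identify_speaker_vq(test_vectors, codebooks):
    """Return the speaker whose codebook gives the lowest average distortion."""
    return min(codebooks, key=lambda spk: avg_distortion(test_vectors, codebooks[spk]))
```

In use, `codebooks` maps each speaker label to a codebook trained on that speaker's (for example neutral) feature vectors, and the test utterance's vectors, computed with the same feature type, are passed to `identify_speaker_vq`.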
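For the GMM classifier named in the abstract, each speaker is modelled by a weighted mixture of Gaussians over feature vectors, and a test utterance is assigned to the speaker model with the highest average log-likelihood. The sketch below shows only the scoring step, assuming diagonal covariances; the model parameters are hypothetical inputs that would normally come from EM training, which is omitted, and the function names are illustrative.

```python
import math

# Illustrative GMM scoring; EM training and the paper's settings are not shown.


def log_gaussian_diag(x, mean, var):
    """Log density of x under a diagonal-covariance Gaussian."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def gmm_avg_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood of frames under one speaker's GMM.

    Uses log-sum-exp over mixture components for numerical stability.
    """
    total = 0.0
    for x in frames:
        comps = [math.log(w) + log_gaussian_diag(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        top = max(comps)
        total += top + math.log(sum(math.exp(c - top) for c in comps))
    return total / len(frames)


def identify_speaker_gmm(frames, models):
    """models maps speaker -> (weights, means, variances); pick the best scorer."""
    return max(models, key=lambda spk: gmm_avg_loglik(frames, *models[spk]))
```

Averaging per-frame log-likelihoods (rather than summing them) keeps scores comparable across utterances of different lengths; the ranking of speakers for a single utterance is the same either way.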
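Cepstral mean normalization, one of the compensation techniques listed in the abstract, rests on a simple property: a time-invariant linear filter multiplies the short-time spectrum, which becomes an additive constant in the log-spectral and cepstral domains, so subtracting an utterance's mean cepstral vector cancels that constant offset. The abstract does not state how the authors compute the mean; the sketch below assumes a per-utterance mean, and the function name is hypothetical.

```python
# Illustrative per-utterance CMN; the paper's exact variant may differ.


def cepstral_mean_normalize(frames):
    """Subtract the per-utterance mean from every cepstral vector.

    frames: non-empty list of equal-length cepstral (e.g. MFCC) vectors
    from a single utterance.
    """
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]
```

The same normalization would be applied to both training and test features, so that a constant offset present in only one of them does not shift the comparison.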

Author information

Corresponding author

Correspondence to G. Senthil Raja.

About this article

Cite this article

Senthil Raja, G., Dandapat, S. Speaker recognition under stressed condition. Int J Speech Technol 13, 141–161 (2010). https://doi.org/10.1007/s10772-010-9075-z

  • DOI: https://doi.org/10.1007/s10772-010-9075-z
