Processing of linear prediction residual in spectral and cepstral domains for speaker information

Pati, Debadatta; Prasanna, S. R. Mahadeva

doi:10.1007/s10772-015-9273-9

Published: 24 February 2015

Volume 18, pages 333–350, (2015)

International Journal of Speech Technology

Debadatta Pati¹ &
S. R. Mahadeva Prasanna¹

Abstract

In this work, the linear prediction (LP) residual is processed in the spectral and cepstral domains to model speaker-specific excitation information. In the spectral domain, the excitation energy information is modeled by subband energies (SBE), and the excitation periodicity information is modeled by the power differences of spectrum in subband (PDSS) measure. This work refines the existing methods of extracting SBE and PDSS by exploiting the nature of the excitation spectrum. The SBE and PDSS values are computed from the mel-warped residual subband spectrum and are called residual mel subband energies (R-MSE) and mel power differences of subband spectra (M-PDSS), respectively. Speaker recognition studies performed on the NIST-99 and NIST-03 databases demonstrate that the R-MSE and M-PDSS features carry good speaker information. It is also demonstrated that the excitation energy information can be better modeled in the cepstral domain by residual mel frequency cepstral coefficients (R-MFCC). Further, the evidence provided by the M-PDSS and R-MFCC features is different, and combining the two improves recognition performance. Combining the evidence from M-PDSS and R-MFCC with the vocal tract information improves the performance further. Finally, a comparative study of processing the LP residual in the temporal, spectral and cepstral domains demonstrates that, at a small cost in recognition performance, processing the LP residual in the spectral and cepstral domains provides a more compact and effective way of representing the excitation information than temporal processing.
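The processing chain named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no parameter values or exact definitions, so everything below is an assumption chosen for illustration. That includes the LP order, frame length, number of mel filters, the triangular filterbank, the mel formula 2595·log10(1 + f/700), and the reading of PDSS as one minus the ratio of geometric to arithmetic mean of the power spectrum within a subband. The function names (`lp_residual`, `mel_filterbank`, `residual_features`) are hypothetical.

```python
import numpy as np

def lp_residual(frame, order=10):
    """LP residual of one frame (autocorrelation method; order is an assumption)."""
    x = frame * np.hamming(len(frame))
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    # Toeplitz normal equations R a = r[1:], with a tiny ridge for stability
    idx = np.abs(np.arange(order)[:, None] - np.arange(order)[None, :])
    R = r[idx] + (1e-9 * r[0] + 1e-12) * np.eye(order)
    a = np.linalg.solve(R, r[1:])
    # Inverse filter: e[n] = x[n] - sum_k a[k] * x[n - k]
    return np.convolve(x, np.concatenate(([1.0], -a)))[:len(x)]

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters with centres equally spaced on the mel scale."""
    hz_edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    fb = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, mid, hi = hz_edges[i:i + 3]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def residual_features(frame, fs=8000, order=10, n_filters=24, n_ceps=13, n_fft=512):
    """Return (mel subband energies, PDSS-like values, cepstra) of the LP residual."""
    e = lp_residual(frame, order)
    power = np.abs(np.fft.rfft(e, n_fft)) ** 2
    fb = mel_filterbank(n_filters, n_fft, fs)
    energies = fb @ power  # energy-type feature per mel subband
    # Periodicity-type feature (assumed form): 1 - geometric mean / arithmetic mean
    pdss = np.zeros(n_filters)
    for i in range(n_filters):
        band = power[fb[i] > 0] + 1e-12
        if band.size:
            pdss[i] = 1.0 - np.exp(np.mean(np.log(band))) / np.mean(band)
    # Cepstral-type feature: DCT-II of the log mel subband energies
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), np.arange(n_filters) + 0.5) / n_filters)
    ceps = basis @ np.log(energies + 1e-12)
    return energies, pdss, ceps

# Synthetic 20 ms frame at 8 kHz: a 120 Hz harmonic series plus a little noise
rng = np.random.default_rng(0)
t = np.arange(160) / 8000.0
frame = sum(np.sin(2 * np.pi * 120.0 * h * t) / h for h in range(1, 20))
frame = frame + 0.01 * rng.standard_normal(len(t))
energies, pdss, ceps = residual_features(frame)
```

In this sketch the three outputs loosely correspond to the three feature types discussed above: subband energies for the excitation energy, the flatness-based measure for periodicity, and cepstral coefficients as a decorrelated view of the same energy information. A real system would apply this frame by frame and feed the resulting vectors to a speaker model.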

Author information

Authors and Affiliations

Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati, 781039, India
Debadatta Pati & S. R. Mahadeva Prasanna

Corresponding author

Correspondence to Debadatta Pati.

About this article

Cite this article

Pati, D., Prasanna, S.R.M. Processing of linear prediction residual in spectral and cepstral domains for speaker information. Int J Speech Technol 18, 333–350 (2015). https://doi.org/10.1007/s10772-015-9273-9

Received: 01 December 2013
Accepted: 05 February 2015
Published: 24 February 2015
Issue Date: September 2015
DOI: https://doi.org/10.1007/s10772-015-9273-9
