Source and system features for phone recognition

Manjunath, K. E.; Sreenivasa Rao, K.

doi:10.1007/s10772-014-9266-0

Published: 09 December 2014

Volume 18, pages 257–270 (2015)

International Journal of Speech Technology


Abstract

In this work, we explore excitation source features in addition to vocal tract system features to improve the performance of phone recognition systems (PRSs). The excitation source information is derived by processing the linear prediction (LP) residual of the speech signal. The vocal tract information is captured using Mel-frequency cepstral coefficient (MFCC) features. The PRSs are developed using hidden Markov models (HMMs) on the TIMIT and Bengali speech databases. The tandem PRSs are developed using the phone posteriors obtained from feedforward neural networks. The robustness of the proposed excitation source features is demonstrated using speech samples corrupted by white and babble noise. The results show that the tandem PRSs developed using the combination of excitation source and vocal tract system features outperform the conventional tandem systems developed using system features alone. The PRSs developed using the combination of excitation source and vocal tract features are also more robust to noise than the PRSs developed using vocal tract features alone.
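To make the notion of an LP residual concrete, the following is a minimal sketch of how one might compute it for a single speech frame: LP coefficients are estimated with the autocorrelation method (Levinson-Durbin recursion), and the frame is then passed through the resulting inverse filter. This is an illustration only, not the authors' implementation. The function names `lp_coefficients` and `lp_residual`, the LP order, and the frame handling are assumptions; the abstract does not state the analysis settings used in the paper.

```python
import numpy as np


def lp_coefficients(frame, order):
    """Autocorrelation-method LP analysis via the Levinson-Durbin recursion.

    Returns [1, a_1, ..., a_p], the coefficients of the inverse filter
    A(z) = 1 + a_1 z^-1 + ... + a_p z^-p.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    # Autocorrelation lags r[0..order] of the frame.
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]  # prediction error energy, updated at each order
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: stop early
            break
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err  # reflection coefficient for order i
        prev = a.copy()
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lp_residual(frame, order):
    """Inverse-filter the frame with A(z) to obtain the LP residual."""
    frame = np.asarray(frame, dtype=float)
    a = lp_coefficients(frame, order)
    # FIR filtering with zero initial state; keep the first len(frame) samples.
    return np.convolve(frame, a)[: len(frame)]


# Usage on a synthetic frame (assumed values: 400 samples, LP order 10).
rng = np.random.default_rng(0)
demo_frame = rng.standard_normal(400)
demo_residual = lp_residual(demo_frame, order=10)
print(demo_residual.shape)
```

The residual is what remains after the spectral envelope modelled by the LP filter (the vocal tract contribution) is removed, which is why it is commonly treated as an approximation of the excitation source. How the residual is further processed into features for the recognizer is described in the full article and is not reproduced here.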

Acknowledgments

The work presented in this paper was performed at IIT Kharagpur as part of the project “Prosodically guided phonetic engine for searching speech databases in Indian languages”, supported by the Department of Information Technology, Government of India.

Author information

Authors and Affiliations

School of Information Technology, Indian Institute of Technology Kharagpur, Kharagpur, 721302, India
K. E. Manjunath & K. Sreenivasa Rao

Corresponding author

Correspondence to K. Sreenivasa Rao.

About this article

Cite this article

Manjunath, K.E., Sreenivasa Rao, K. Source and system features for phone recognition. Int J Speech Technol 18, 257–270 (2015). https://doi.org/10.1007/s10772-014-9266-0

Received: 30 July 2014
Accepted: 22 November 2014
Published: 09 December 2014
Issue Date: June 2015
DOI: https://doi.org/10.1007/s10772-014-9266-0
