Abstract
In this work, we explore excitation source features in addition to vocal tract system features to improve the performance of phone recognition systems (PRSs). The excitation source information is derived by processing the linear prediction residual of the speech signal, and the vocal tract information is captured using Mel-frequency cepstral coefficient (MFCC) features. The PRSs are developed using hidden Markov models on the TIMIT and Bengali speech databases. Tandem PRSs are developed using phone posteriors obtained from feedforward neural networks. The robustness of the proposed excitation source features is demonstrated using speech samples corrupted by white and babble noise. The results show that tandem PRSs developed using the combination of excitation source and vocal tract system features outperform conventional tandem systems developed using system features alone. PRSs developed using the combined features are also more robust to noise than PRSs developed using vocal tract features alone.
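As an illustration of the linear prediction residual mentioned in the abstract, the following is a minimal sketch of one standard way to obtain it: autocorrelation-method LP analysis solved with the Levinson-Durbin recursion, followed by inverse filtering. The abstract does not state the LP order, framing, or any further processing applied to the residual, so the order, the synthetic test signal, and the function names (`autocorrelation`, `levinson_durbin`, `lp_residual`) are assumptions made for this example and not the authors' configuration.

```python
import random


def autocorrelation(frame, max_lag):
    """Return the unnormalized autocorrelation r[0..max_lag] of a frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations.

    Returns [a_1, ..., a_order] for the predictor
    s_hat[n] = sum_k a_k * s[n - k].
    """
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]                       # prediction error energy so far
    for i in range(1, order + 1):
        if err <= 0.0:               # degenerate (e.g. all-zero) frame
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:]


def lp_residual(frame, order=10):
    """Return (residual, coefficients) for one frame.

    The residual is the frame minus its linear prediction; samples before
    the start of the frame are treated as zero.
    """
    coeffs = levinson_durbin(autocorrelation(frame, order), order)
    residual = []
    for n in range(len(frame)):
        pred = sum(coeffs[k] * frame[n - 1 - k]
                   for k in range(min(order, n)))
        residual.append(frame[n] - pred)
    return residual, coeffs


# Hypothetical test signal: a second-order autoregressive process standing
# in for a speech frame (not data from the paper).
rng = random.Random(0)
signal = [0.0, 0.0]
for _ in range(2000):
    signal.append(1.3 * signal[-1] - 0.6 * signal[-2] + rng.gauss(0.0, 1.0))
signal = signal[2:]

residual, coeffs = lp_residual(signal, order=2)
signal_energy = sum(x * x for x in signal)
residual_energy = sum(e * e for e in residual)
```

For this synthetic signal the estimated coefficients should land near the generating values (1.3 and -0.6), and the residual carries noticeably less energy than the signal, since the predictor has removed most of the spectral-envelope (system) contribution. Real speech would normally be processed frame by frame with a higher LP order.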
Acknowledgments
The work presented in this paper was performed at IIT-Kharagpur as part of the project “Prosodically guided phonetic engine for searching speech databases in Indian languages”, supported by the Department of Information Technology, Government of India.
Cite this article
Manjunath, K.E., Sreenivasa Rao, K. Source and system features for phone recognition. Int J Speech Technol 18, 257–270 (2015). https://doi.org/10.1007/s10772-014-9266-0
DOI: https://doi.org/10.1007/s10772-014-9266-0