
Articulatory and excitation source features for speech recognition in read, extempore and conversation modes

International Journal of Speech Technology

Abstract

In our previous work, we explored articulatory and excitation source features for improving the performance of phone recognition systems (PRSs) using read speech corpora. In this work, we extend the use of articulatory and excitation source features to develop PRSs for extempore and conversation modes of speech, in addition to read speech. It is well known that the overall performance of a speech recognition system depends heavily on the accuracy of phone recognition. The objective of this paper is therefore to improve the accuracy of phone recognition using articulatory and excitation source features in addition to conventional spectral features. The articulatory features (AFs) are derived from the spectral features using feedforward neural networks (FFNNs). We consider five AF groups, namely manner, place, roundness, frontness and height. Five AF-based tandem PRSs are developed using the combination of Mel frequency cepstral coefficients (MFCCs) and the AFs derived from FFNNs. Hybrid PRSs are developed by combining the evidences from the AF-based tandem PRSs using a weighted combination approach. The excitation source information is derived by processing the linear prediction (LP) residual of the speech signal, while the vocal tract information is captured using MFCCs. The combination of vocal tract and excitation source features is used for developing PRSs. The PRSs are developed using hidden Markov models. A Bengali speech database is used for developing the PRSs for read, extempore and conversation modes of speech. The results are analyzed and the performance is compared across the three modes of speech. It is observed that using either articulatory or excitation source features along with MFCCs improves the performance of PRSs in all three modes of speech, and that the improvement obtained using AFs is much higher than that obtained using excitation source features.
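The abstract describes two complementary feature streams: vocal tract information captured by MFCCs and excitation source information derived from the LP residual. The following is a minimal sketch of how these two streams could be extracted and combined, not the authors' exact pipeline; it assumes librosa and scipy are available, and the file name, LP order, frame sizes and residual feature are illustrative placeholders.

```python
import librosa
import numpy as np
from scipy.signal import lfilter

# Hypothetical input file; the paper uses a Bengali speech corpus.
y, sr = librosa.load("utterance.wav", sr=8000)

# Vocal tract information: 13-dimensional MFCCs per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Excitation source information: inverse-filter the speech with its
# LP coefficients so that mostly the excitation remains. For brevity a
# single LP model is fit over the whole signal; real systems typically
# fit LPC frame by frame.
a = librosa.lpc(y, order=10)      # prediction polynomial A(z), a[0] == 1
residual = lfilter(a, [1.0], y)   # LP residual e[n] = A(z) applied to s[n]

# One simple residual-derived feature: log energy per 25 ms frame
# (the paper derives richer features from the residual).
frames = librosa.util.frame(residual,
                            frame_length=int(0.025 * sr),
                            hop_length=int(0.010 * sr))
res_energy = np.log(np.sum(frames ** 2, axis=0) + 1e-10)

# Append the residual feature to the MFCC stream, truncated to a common
# number of frames since the two streams use different hop sizes here.
n = min(mfcc.shape[1], res_energy.shape[0])
combined = np.vstack([mfcc[:, :n], res_energy[None, :n]])
```

Likewise, the hybrid PRS is described as combining the evidences from the five AF-based tandem systems through a weighted combination. A hedged sketch of a frame-level weighted combination of phone posteriors, where the posterior arrays and weights are hypothetical inputs:

```python
import numpy as np

def combine_evidences(posteriors, weights):
    """Weighted combination of frame-level phone posteriors from several
    tandem PRSs; each array in `posteriors` has shape (T, n_phones)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalise so the weights sum to 1
    return sum(wi * p for wi, p in zip(w, posteriors))

# Hypothetical usage with the five AF groups named in the abstract:
# combined = combine_evidences(
#     [p_manner, p_place, p_round, p_front, p_height],
#     weights=[0.2, 0.2, 0.2, 0.2, 0.2])
```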



Acknowledgments

The work presented in this paper was performed at IIT-Kharagpur as part of the project 11(6)/2011-HCC(TDIL), dated 23-12-2011, "Prosodically guided phonetic engine for searching speech databases in Indian languages", supported by the Department of Information Technology, Government of India.

Author information


Corresponding author

Correspondence to K. Sreenivasa Rao.


Cite this article

Manjunath, K.E., Sreenivasa Rao, K. Articulatory and excitation source features for speech recognition in read, extempore and conversation modes. Int J Speech Technol 19, 121–134 (2016). https://doi.org/10.1007/s10772-015-9329-x

