Application of prosody models for developing speech systems in Indian languages

Rao, K. Sreenivasa

doi:10.1007/s10772-010-9086-9

Published: 11 December 2010

Volume 14, pages 19–33 (2011)
International Journal of Speech Technology

Abstract

In this paper, we demonstrate the use of prosody models for developing speech systems in Indian languages. Duration and intonation models developed using feedforward neural networks are considered as the prosody models. Labelled broadcast news data in Hindi, Telugu, Tamil and Kannada are used to develop the neural network models for predicting duration and intonation. Features representing positional, contextual and phonological constraints are used for developing the prosody models. The use of the prosody models is illustrated with speech recognition, speech synthesis, speaker recognition and language identification applications. Autoassociative neural networks and support vector machines are used as classification models for developing the speech systems. The performance of the speech systems is shown to improve when the prosodic features are combined with a popular spectral feature set consisting of weighted linear prediction cepstral coefficients (WLPCCs).

Author information

Authors and Affiliations

School of Information Technology, Indian Institute of Technology Kharagpur, Kharagpur, 721302, West Bengal, India
K. Sreenivasa Rao

Corresponding author

Correspondence to K. Sreenivasa Rao.

About this article

Cite this article

Rao, K.S. Application of prosody models for developing speech systems in Indian languages. Int J Speech Technol 14, 19–33 (2011). https://doi.org/10.1007/s10772-010-9086-9

Received: 30 September 2010
Accepted: 02 December 2010
Published: 11 December 2010
Issue Date: March 2011
DOI: https://doi.org/10.1007/s10772-010-9086-9
