Application of prosody models for developing speech systems in Indian languages

International Journal of Speech Technology

Abstract

In this paper we demonstrate the use of prosody models for developing speech systems in Indian languages. Duration and intonation models developed using feedforward neural networks serve as the prosody models. Labelled broadcast news data in Hindi, Telugu, Tamil and Kannada is used to train the neural network models that predict duration and intonation. The models take as input features representing positional, contextual and phonological constraints. The use of the prosody models is illustrated in four applications: speech recognition, speech synthesis, speaker recognition and language identification. Autoassociative neural networks and support vector machines are used as classification models for developing the speech systems. The performance of the speech systems is shown to improve when the prosodic features are combined with a popular spectral feature set, Weighted Linear Prediction Cepstral Coefficients (WLPCCs).
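As a rough illustration of the kind of model the abstract describes, the following minimal sketch passes a feature vector built from positional, contextual and phonological groups through a small feedforward network to produce a single duration value. The feature values, layer sizes, random initial weights and all function names are assumptions made for illustration; none of them come from the paper, and the network here is untrained.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for one layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, activation=None):
    """Apply one fully connected layer, optionally followed by an activation."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(v) for v in out] if activation else out


def predict_duration(features, hidden, output):
    """Forward pass: tanh hidden layer, then a single linear output unit."""
    h = dense(features, hidden, math.tanh)
    return dense(h, output)[0]


rng = random.Random(0)

# Hypothetical, hand-picked feature groups for one syllable (illustration only).
positional = [0.2, 0.5]        # e.g. relative position in word and phrase
contextual = [1.0, 0.0, 0.0]   # e.g. coded identity of a neighbouring syllable
phonological = [0.0, 1.0]      # e.g. coded syllable structure

features = positional + contextual + phonological

hidden = init_layer(len(features), 4, rng)   # 4 hidden units: an assumed size
output = init_layer(4, 1, rng)

# The weights are untrained, so this number carries no meaning as a duration.
duration = predict_duration(features, hidden, output)
print(duration)
```

In practice the weights would be fitted to labelled syllable durations, and an intonation model could reuse the same structure with a pitch target in place of the duration.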

Author information

Corresponding author

Correspondence to K. Sreenivasa Rao.

About this article

Cite this article

Rao, K.S. Application of prosody models for developing speech systems in Indian languages. Int J Speech Technol 14, 19–33 (2011). https://doi.org/10.1007/s10772-010-9086-9

  • DOI: https://doi.org/10.1007/s10772-010-9086-9
