Abstract
In this paper we demonstrate the use of prosody models for developing speech systems in Indian languages. The prosody models considered are duration and intonation models built with feedforward neural networks. Labelled broadcast news data in Hindi, Telugu, Tamil and Kannada is used to develop the neural network models that predict duration and intonation. Features representing positional, contextual and phonological constraints are used to develop the prosody models. The use of the prosody models is illustrated with four applications: speech recognition, speech synthesis, speaker recognition and language identification. Autoassociative neural networks and support vector machines are used as classification models for developing the speech systems. The performance of the speech systems is shown to improve when the prosodic features are combined with a popular spectral feature set, Weighted Linear Prediction Cepstral Coefficients (WLPCCs).
Cite this article
Rao, K.S. Application of prosody models for developing speech systems in Indian languages. Int J Speech Technol 14, 19–33 (2011). https://doi.org/10.1007/s10772-010-9086-9
DOI: https://doi.org/10.1007/s10772-010-9086-9