Abstract
The speech signal carries characteristics of the speaker, the language, the emotional state, and the underlying sound units, and it is difficult to separate out the features specific to each of these. Human beings recognize the speaker, language, emotion, and speech content using multiple cues present in the signal, combining the evidence to arrive at a decision; several of these cues are prosodic. Conventional automatic speaker, language, emotion, and speech recognition systems, however, rely mostly on spectral/cepstral features, which are affected by channel mismatch and noise. Incorporating prosody into these automatic recognition tasks can therefore make them more robust and more human-like. This chapter discusses the term prosody and its significance for speaker, language, emotion, and speech recognition. The human approach to recognition is described, followed by the speaker-specific, language-specific, emotion-specific, and speech-specific aspects of prosody.
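The prosodic cues contrasted above with spectral/cepstral features are typically derived from the fundamental frequency (F0) contour, durations, and energy. As a minimal illustration only (not taken from the chapter, which describes its own extraction methods later), a frame-level F0 estimate can be sketched with a simple autocorrelation method on a synthetic voiced signal:

```python
import numpy as np

def estimate_f0_autocorr(frame, sr, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) of one frame
    by locating the autocorrelation peak in the voiced range."""
    frame = frame - frame.mean()
    # One-sided autocorrelation: index 0 corresponds to zero lag
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)   # shortest plausible pitch period
    lag_max = int(sr / fmin)   # longest plausible pitch period
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sr / lag

# Synthetic "voiced" signal: a 200 Hz tone at 16 kHz sampling rate
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 200 * t)
f0 = estimate_f0_autocorr(tone[:1024], sr)  # close to 200 Hz
```

In a real system the F0 values of successive frames would be stitched into a contour, from which features such as pitch range, slope, and reset patterns are computed; robust pitch trackers (e.g. lognormal tied-mixture models, as in the speaker-recognition literature cited by this chapter) replace this toy estimator.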
Copyright information
© 2019 The Author(s), under exclusive licence to Springer International Publishing AG, part of Springer Nature
Cite this chapter
Mary, L. (2019). Significance of Prosody for Speaker, Language, Emotion, and Speech Recognition. In: Extraction of Prosody for Automatic Speaker, Language, Emotion and Speech Recognition. SpringerBriefs in Speech Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-91171-7_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91170-0
Online ISBN: 978-3-319-91171-7