Abstract
The speech signal carries characteristics of the speaker, the language, the emotional state, and the underlying sound units, and it is difficult to separate out the features specific to each of these. Human beings recognize the speaker, language, emotion, and speech content using multiple cues present in the signal, combining the evidence to arrive at a decision; several of these cues are prosodic. Conventional automatic speaker, language, emotion, and speech recognition systems, however, rely mostly on spectral/cepstral features, which are affected by channel mismatch and noise. Incorporating prosody into these automatic recognition tasks can therefore make them more robust and more human-like. This chapter discusses the term prosody and its significance for speaker, language, emotion, and speech recognition. The human approach to recognition is described, followed by the speaker-specific, language-specific, emotion-specific, and speech-specific aspects of prosody.
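The prosodic cues contrasted above with spectral/cepstral features are typically derived from the fundamental frequency (F0) contour, durations, and energy. As a minimal illustration only (not taken from the chapter, which describes its own extraction methods later), a frame-level F0 estimate can be sketched with a simple autocorrelation method on a synthetic voiced signal:

```python
import numpy as np

def estimate_f0_autocorr(frame, sr, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) of one frame
    by locating the autocorrelation peak in the voiced range."""
    frame = frame - frame.mean()
    # One-sided autocorrelation: index 0 corresponds to zero lag
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sr / fmax)   # shortest plausible pitch period
    lag_max = int(sr / fmin)   # longest plausible pitch period
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sr / lag

# Synthetic "voiced" signal: a 200 Hz tone at 16 kHz sampling rate
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 200 * t)
f0 = estimate_f0_autocorr(tone[:1024], sr)  # close to 200 Hz
```

In a real system the F0 values of successive frames would be stitched into a contour, from which features such as pitch range, slope, and reset patterns are computed; robust pitch trackers (e.g. lognormal tied-mixture models, as in the speaker-recognition literature cited by this chapter) replace this toy estimator.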
Copyright information
© 2019 The Author(s), under exclusive licence to Springer International Publishing AG, part of Springer Nature
Cite this chapter
Mary, L. (2019). Significance of Prosody for Speaker, Language, Emotion, and Speech Recognition. In: Extraction of Prosody for Automatic Speaker, Language, Emotion and Speech Recognition. SpringerBriefs in Speech Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-91171-7_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91170-0
Online ISBN: 978-3-319-91171-7