Significance of Prosody for Speaker, Language, Emotion, and Speech Recognition

Part of the book series: SpringerBriefs in Speech Technology (BRIEFSSPEECHTECH)

Abstract

The speech signal carries characteristics of the speaker, the language, the emotion, and the sound units, and it is difficult to separate out the features specific to each of these. Human beings recognize speaker, language, emotion, and speech using multiple cues present in the speech and combine the evidence to arrive at a decision; prosodic cues play an important role among them. Conventional automatic speaker, language, emotion, and speech recognition systems, by contrast, rely mostly on spectral/cepstral features, which are affected by channel mismatch and noise. Incorporating prosody into these automatic recognition tasks can therefore make them more robust and human-like. In this chapter, the term prosody and its significance for speaker, language, emotion, and speech recognition tasks are discussed. The human approach to recognition is described first, followed by the speaker-specific, language-specific, emotion-specific, and speech-specific aspects of prosody.
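
As a concrete illustration of the two feature families contrasted above, the sketch below extracts a fundamental frequency (F0) contour and short-time energy, two of the core prosodic cues, alongside conventional MFCC spectral/cepstral features. This is a minimal sketch assuming the librosa library; the input file name utterance.wav is a hypothetical placeholder, and the feature choices are illustrative rather than the chapter's own pipeline.

```python
# Minimal sketch: prosodic features (F0 contour, energy) versus
# conventional spectral/cepstral features (MFCCs).
# Assumes librosa is installed; "utterance.wav" is a hypothetical file.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=None)  # hypothetical input file

# Prosodic cue 1: fundamental frequency (pitch) contour via the pYIN tracker.
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

# Prosodic cue 2: short-time energy via frame-wise RMS.
energy = librosa.feature.rms(y=y)[0]

# Utterance-level prosodic statistics over voiced frames only.
voiced_f0 = f0[voiced_flag]
print("mean F0 (Hz):", np.nanmean(voiced_f0))
print("F0 range (Hz):", np.nanmax(voiced_f0) - np.nanmin(voiced_f0))
print("mean energy:", energy.mean())

# Conventional spectral/cepstral features: 13 MFCCs per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print("MFCC matrix shape (coeffs x frames):", mfcc.shape)
```

Unlike the frame-level MFCC matrix, the F0 and energy contours are suprasegmental: they span syllables and phrases, which is why they are less sensitive to channel mismatch and additive noise.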

Copyright information

© 2019 The Author(s), under exclusive licence to Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Mary, L. (2019). Significance of Prosody for Speaker, Language, Emotion, and Speech Recognition. In: Extraction of Prosody for Automatic Speaker, Language, Emotion and Speech Recognition. SpringerBriefs in Speech Technology. Springer, Cham. https://doi.org/10.1007/978-3-319-91171-7_1

  • DOI: https://doi.org/10.1007/978-3-319-91171-7_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-91170-0

  • Online ISBN: 978-3-319-91171-7

  • eBook Packages: Engineering, Engineering (R0)
