
Exploring human voice prosodic features and the interaction between the excitation signal and vocal tract for Assamese speech

Published in: International Journal of Speech Technology

Abstract

Speech conveys information about the emotive state of the speaker. Emotions can be extracted by analyzing the speech signal in every segment of an utterance. Prosody is a critical indicator of emotion and stress in a speech signal; nonetheless, emotion extraction remains a difficult task because of the variability of the speech signal. A tonal language such as Assamese relies on prosodic cues such as intonation: the same expression conveys a different meaning with the slightest change in the tone of voice. Speakers use prosody to encode a message, and listeners use it to interpret that message. Extracting prosodic features such as F0, articulation rate, duration, and intensity, and establishing the relationships among them, helps identify the significant differences between emotions. Variation in prosody across emotions also leads to variation in the glottal closure instants (GCIs). A GCI, or epoch, refers to the instant of significant excitation of the vocal tract produced by the sudden closing of the glottis. Emotions influence not only the prosody but also the strength of excitation of a speech signal. This paper is an attempt to understand the effect of emotions on prosody, how the prosodic features are related to one another, and how they affect the strength of excitation (epochs). This is investigated by computing the prosodic features with the speech-analysis tools PRAAT, Python, and Matlab. A statistical analysis of the computed prosodic features is then performed using the ANOVA test to establish the relationships among the prosodic features and to determine how they change across emotions. Finally, the GCIs of the excitation source signal are extracted, and their variation across positions within the same utterance is examined using Matlab. Proper investigation of the emotive content of speech through prosodic and excitation source features will improve human–machine interface systems and advance speech emotion recognition for the Assamese language in real-time applications such as treating mental illness.
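For readers who want to reproduce the prosodic analysis outlined above, the following is a minimal Python sketch of the feature-extraction and ANOVA steps. It assumes the praat-parselmouth and SciPy packages; the file names and the three-emotion grouping are hypothetical examples, not the paper's actual corpus.

```python
# Minimal sketch: extract F0, intensity, and duration with Praat (via
# praat-parselmouth), then compare a feature across emotions with ANOVA.
# File names and the three-emotion grouping are hypothetical examples.
import numpy as np
import parselmouth
from scipy.stats import f_oneway

def prosodic_features(wav_path):
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch()
    f0 = pitch.selected_array['frequency']           # Hz; 0 where unvoiced
    f0 = f0[f0 > 0]                                  # keep voiced frames only
    intensity = snd.to_intensity().values.flatten()  # dB
    return {
        'mean_f0': float(np.mean(f0)),
        'mean_intensity': float(np.mean(intensity)),
        'duration': snd.duration,                    # seconds
    }

# One list of utterances per induced emotion (hypothetical paths).
groups = {
    'neutral': ['neutral_01.wav', 'neutral_02.wav'],
    'anger':   ['anger_01.wav', 'anger_02.wav'],
    'sadness': ['sadness_01.wav', 'sadness_02.wav'],
}
mean_f0_per_group = [
    [prosodic_features(p)['mean_f0'] for p in paths]
    for paths in groups.values()
]

# One-way ANOVA: does mean F0 differ significantly across emotions?
f_stat, p_value = f_oneway(*mean_f0_per_group)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

The abstract does not name the epoch-detection algorithm used in the Matlab analysis. Zero-frequency filtering is one standard technique for locating GCIs directly from the speech signal, and the sketch below illustrates it under that assumption; the window length is a tunable parameter, not a value from the paper.

```python
# Sketch of glottal closure instant (epoch) detection by zero-frequency
# filtering -- one standard method; illustrative only, since the paper
# does not specify which algorithm its Matlab analysis uses.
import numpy as np
from scipy.signal import lfilter

def zff_gcis(s, fs, avg_pitch_period_s=0.005):
    # Difference the signal to remove any DC offset.
    x = np.diff(s.astype(np.float64), prepend=0.0)
    # Cascade of two zero-frequency resonators (double integrators):
    # y[n] = x[n] + 2*y[n-1] - y[n-2].
    y = x
    for _ in range(2):
        y = lfilter([1.0], [1.0, -2.0, 1.0], y)
    # Remove the polynomial trend by repeated local-mean subtraction over
    # a window of roughly one to two average pitch periods.
    win = int(2 * avg_pitch_period_s * fs) | 1   # force odd window length
    kernel = np.ones(win) / win
    for _ in range(3):
        y = y - np.convolve(y, kernel, mode='same')
    # Epochs are the positive-going zero crossings of the ZFF signal.
    gci = np.where((y[:-1] < 0) & (y[1:] >= 0))[0]
    return gci, y

# Usage (hypothetical): sample indices -> times in seconds
# gci, _ = zff_gcis(signal, fs); gci_times = gci / fs
```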



Acknowledgements

The authors wish to thank The Assam Kaziranga University and the students of Dhing Govt. College for their cooperation in recording and collecting the speech data in different induced emotions. We also thank the laboratory assistants of The Assam Kaziranga University for their utmost cooperation.

Author information


Corresponding author

Correspondence to Sippee Bharadwaj.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Bharadwaj, S., Acharjee, P.B. Exploring human voice prosodic features and the interaction between the excitation signal and vocal tract for Assamese speech. Int J Speech Technol 26, 77–93 (2023). https://doi.org/10.1007/s10772-021-09946-5


