
Exploring human voice prosodic features and the interaction between the excitation signal and vocal tract for Assamese speech

Published in: International Journal of Speech Technology

Abstract

Speech conveys information about the emotive state of the speaker. Emotions can be extracted by analyzing the speech signal in every segment of an utterance. Prosody is a critical indicator of emotion and stress in a speech signal; nonetheless, emotion extraction remains a difficult task because of the variability of the speech signal. A tonal language such as Assamese relies on prosodic cues such as intonation: the same expression conveys a different meaning with the slightest change in the tone of voice. Speakers use prosody to encode a message, and listeners use it to interpret that message. Extracting prosodic features such as F0, articulation rate, duration, and intensity, and establishing the relationships among them, helps identify the significant differences between emotions. Variation in prosody across emotions also leads to variation in the glottal closure instants (GCIs). A GCI, or epoch, refers to the instant of significant excitation of the vocal tract produced by the sudden closing of the glottis. Emotions influence not only the prosody but also the strength of excitation of a speech signal. This paper is an attempt to understand the effect of emotions on prosody, how the prosodic features are related to one another, and how they affect the strength of excitation (epochs). This is investigated by computing the prosodic features with the speech-analysis tools PRAAT, Python, and Matlab. A statistical analysis of the computed prosodic features is then performed using the ANOVA test to establish the relationships among the prosodic features and to determine how they change across emotions. Finally, the GCIs of the excitation source signal are extracted, and their variation across positions within the same utterance is examined using Matlab. Proper investigation of the emotive content of speech through prosodic and excitation source features will improve human–machine interface systems and advance speech emotion recognition for the Assamese language in real-time applications such as treating mental illness.
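For readers who want to reproduce the prosodic analysis outlined above, the following is a minimal Python sketch of the feature-extraction and ANOVA steps. It assumes the praat-parselmouth and SciPy packages; the file names and the three-emotion grouping are hypothetical examples, not the paper's actual corpus.

```python
# Minimal sketch: extract F0, intensity, and duration with Praat (via
# praat-parselmouth), then compare a feature across emotions with ANOVA.
# File names and the three-emotion grouping are hypothetical examples.
import numpy as np
import parselmouth
from scipy.stats import f_oneway

def prosodic_features(wav_path):
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch()
    f0 = pitch.selected_array['frequency']           # Hz; 0 where unvoiced
    f0 = f0[f0 > 0]                                  # keep voiced frames only
    intensity = snd.to_intensity().values.flatten()  # dB
    return {
        'mean_f0': float(np.mean(f0)),
        'mean_intensity': float(np.mean(intensity)),
        'duration': snd.duration,                    # seconds
    }

# One list of utterances per induced emotion (hypothetical paths).
groups = {
    'neutral': ['neutral_01.wav', 'neutral_02.wav'],
    'anger':   ['anger_01.wav', 'anger_02.wav'],
    'sadness': ['sadness_01.wav', 'sadness_02.wav'],
}
mean_f0_per_group = [
    [prosodic_features(p)['mean_f0'] for p in paths]
    for paths in groups.values()
]

# One-way ANOVA: does mean F0 differ significantly across emotions?
f_stat, p_value = f_oneway(*mean_f0_per_group)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

The abstract does not name the epoch-detection algorithm used in the Matlab analysis. Zero-frequency filtering is one standard technique for locating GCIs directly from the speech signal, and the sketch below illustrates it under that assumption; the window length is a tunable parameter, not a value from the paper.

```python
# Sketch of glottal closure instant (epoch) detection by zero-frequency
# filtering -- one standard method; illustrative only, since the paper
# does not specify which algorithm its Matlab analysis uses.
import numpy as np
from scipy.signal import lfilter

def zff_gcis(s, fs, avg_pitch_period_s=0.005):
    # Difference the signal to remove any DC offset.
    x = np.diff(s.astype(np.float64), prepend=0.0)
    # Cascade of two zero-frequency resonators (double integrators):
    # y[n] = x[n] + 2*y[n-1] - y[n-2].
    y = x
    for _ in range(2):
        y = lfilter([1.0], [1.0, -2.0, 1.0], y)
    # Remove the polynomial trend by repeated local-mean subtraction over
    # a window of roughly one to two average pitch periods.
    win = int(2 * avg_pitch_period_s * fs) | 1   # force odd window length
    kernel = np.ones(win) / win
    for _ in range(3):
        y = y - np.convolve(y, kernel, mode='same')
    # Epochs are the positive-going zero crossings of the ZFF signal.
    gci = np.where((y[:-1] < 0) & (y[1:] >= 0))[0]
    return gci, y

# Usage (hypothetical): sample indices -> times in seconds
# gci, _ = zff_gcis(signal, fs); gci_times = gci / fs
```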



Acknowledgements

The authors wish to thank The Assam Kaziranga University and the students of Dhing Govt. College for their cooperation in recording and collecting the speech data in different induced emotions. We also thank the laboratory assistants of The Assam Kaziranga University for their utmost cooperation.

Author information


Corresponding author

Correspondence to Sippee Bharadwaj.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Bharadwaj, S., Acharjee, P.B. Exploring human voice prosodic features and the interaction between the excitation signal and vocal tract for Assamese speech. Int J Speech Technol 26, 77–93 (2023). https://doi.org/10.1007/s10772-021-09946-5


