
Handling emotional speech: a prosody based data augmentation technique for improving neutral speech trained ASR systems

International Journal of Speech Technology

Abstract

In this paper, the effect of emotional speech on the performance of ASR systems trained on neutral speech is studied. Prosody-modification based data augmentation is explored to compensate for the ASR performance degradation caused by emotional speech. The primary motive is to develop a Telugu ASR system that is least affected by these emotion-based intrinsic speaker-related acoustic variations. Two factors contributing to this intrinsic speaker-related variability are the focus of this research: the fundamental frequency (\(F_0\), or pitch) and the speaking rate. To simulate the ASR task, the system was trained on neutral speech and tested on data from both emotional and neutral speech. Compared to the test performance on neutral speech, performance on emotional speech is severely degraded. This degradation stems from the differences in the prosody and speaking-rate parameters of neutral and emotional speech. To overcome it, these parameters are varied to create augmented versions of the training data. The original and augmented versions of the training data are then pooled and the system is re-trained in order to capture greater emotion-specific variation. For the Telugu ASR experiments, the Microsoft Speech Corpus for Indian Languages (MSC-IL) is used for neutral-speech training and the Indian Institute of Technology Kharagpur Simulated Emotion Speech Corpus (IITKGP-SESC) for emotional-speech evaluation. The basic emotions of anger, happiness, and sadness are considered for evaluation along with neutral speech.
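
A minimal sketch of this augmentation step is given below in Python. It uses the librosa library's pitch_shift and time_stretch utilities as a stand-in for the epoch-based prosody modification the paper builds on (Prasanna et al., 2010); the augment_utterance helper, the output file names, and the modification factors are hypothetical placeholders, not the values used in the paper.

```python
# Minimal sketch of prosody-based data augmentation: generate pitch-
# and speaking-rate-modified copies of a neutral utterance so that they
# can be pooled with the original training data. librosa's effects are
# used here as a stand-in for epoch-based prosody modification; the
# factors below are illustrative placeholders.
import librosa
import soundfile as sf

PITCH_STEPS = [-2, 2]       # F0 shifts in semitones (hypothetical values)
RATE_FACTORS = [0.9, 1.1]   # speaking-rate scaling (hypothetical values)

def augment_utterance(wav_path, out_prefix, sr=16000):
    """Write one augmented copy of the utterance per pitch/rate factor."""
    y, _ = librosa.load(wav_path, sr=sr)
    for steps in PITCH_STEPS:
        shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)
        sf.write(f"{out_prefix}_pitch{steps:+d}.wav", shifted, sr)
    for rate in RATE_FACTORS:
        stretched = librosa.effects.time_stretch(y, rate=rate)
        sf.write(f"{out_prefix}_rate{rate:.1f}.wav", stretched, sr)
```

The augmented copies produced this way would then be pooled with the original neutral-speech training set before re-training the acoustic model, for example with the Kaldi toolkit cited in the references (Povey et al., 2011).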


References

  • Dhananjaya, N., & Yegnanarayana, B. (2010). Voiced/nonvoiced detection based on robustness of voiced epochs. IEEE Signal Processing Letters, 17(3), 273–276.

  • Gangamohan, P., Mittal, V., & Yegnanarayana, B. (2012). Relative importance of different components of speech contributing to perception of emotion. In Speech Prosody.

  • Gerosa, M., Giuliani, D., Narayanan, S., & Potamianos, A. (2009). A review of ASR technologies for children's speech. In Proceedings of the 2nd workshop on child, computer and interaction (pp. 1–8).

  • Hinton, G., Deng, L., Yu, D., Dahl, G. E., & Mohamed, A.-R. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6), 82–97.

  • Hsiao, R., Ma, J., Hartmann, W., Karafiát, M., Grézl, F., Burget, L., Szöke, I., Černockỳ, J. H., Watanabe, S., Chen, Z., et al. (2015). Robust speech recognition in unknown reverberant and noisy conditions. In Workshop on automatic speech recognition and understanding (ASRU) (pp. 533–538). IEEE.

  • Ko, T., Peddinti, V., Povey, D., Seltzer, M. L., & Khudanpur, S. (2017). A study on data augmentation of reverberant speech for robust speech recognition. In International conference on acoustics, speech and signal processing (ICASSP) (pp. 5220–5224). IEEE.

  • Koolagudi, S. G., Maity, S., Kumar, V. A., Chakrabarti, S., & Rao, K. S. (2009). IITKGP-SESC: Speech database for emotion analysis. In International conference on contemporary computing (pp. 485–492). Springer.

  • Lippmann, R., Martin, E., & Paul, D. (1987). Multi-style training for robust isolated-word speech recognition. In International conference on acoustics, speech, and signal processing (ICASSP) (Vol. 12, pp. 705–708). IEEE.

  • Murty, K. S. R., & Yegnanarayana, B. (2008). Epoch extraction from speech signals. IEEE Transactions on Audio, Speech, and Language Processing, 16(8), 1602–1613.

  • Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). LibriSpeech: An ASR corpus based on public domain audio books. In International conference on acoustics, speech and signal processing (ICASSP) (pp. 5206–5210). IEEE.

  • Peddinti, V., Chen, G., Manohar, V., Ko, T., Povey, D., & Khudanpur, S. (2015). JHU ASpIRE system: Robust LVCSR with TDNNs, i-vector adaptation and RNN-LMs. In Workshop on automatic speech recognition and understanding (ASRU) (pp. 539–546). IEEE.

  • Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., et al. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society.

  • Prasanna, S., Govind, D., Rao, K. S., & Yegnanarayana, B. (2010). Fast prosody modification using instants of significant excitation. In Proceedings of the fifth international conference on speech prosody.

  • Raju, V. V., Gangamohan, P., Gangashetty, S. V., & Vuppala, A. K. (2016). Application of prosody modification for speech recognition in different emotion conditions. In IEEE Region 10 conference (TENCON) (pp. 951–954). IEEE.

  • Raju, V. V., Vydana, H. K., Gangashetty, S. V., & Vuppala, A. K. (2017). Importance of non-uniform prosody modification for speech recognition in emotion conditions. In Asia-Pacific signal and information processing association annual summit and conference (APSIPA ASC) (pp. 573–576). IEEE.

  • Rao, K. S., Prasanna, S. M., & Yegnanarayana, B. (2007). Determination of instants of significant excitation in speech using Hilbert envelope and group delay function. IEEE Signal Processing Letters, 14(10), 762–765.

  • Rousseau, A., Deléglise, P., & Esteve, Y. (2012). TED-LIUM: An automatic speech recognition dedicated corpus. In LREC (pp. 125–129).

  • Russell, M., & D'Arcy, S. (2007). Challenges for computer recognition of children's speech. In Workshop on speech and language technology in education.

  • Schalkwyk, J., Beeferman, D., Beaufays, F., Byrne, B., Chelba, C., Cohen, M., Kamvar, M., & Strope, B. (2010). "Your word is my command": Google search by voice: A case study. In Advances in speech recognition (pp. 61–90). Springer.

  • Shahnawazuddin, S., Adiga, N., Kathania, H. K., & Sai, B. T. (2020). Creating speaker independent ASR system through prosody modification based data augmentation. Pattern Recognition Letters, 131, 213–218.

  • Smits, R., & Yegnanarayana, B. (1995). Determination of instants of significant excitation in speech using group delay function. IEEE Transactions on Speech and Audio Processing, 3(5), 325–333.

  • Srivastava, B. M. L., Sitaram, S., Mehta, R. K., Mohan, K. D., Matani, P., Satpal, S., Bali, K., Srikanth, R., & Nayak, N. (2018). Interspeech 2018 low resource automatic speech recognition challenge for Indian languages. In SLTU (pp. 11–14).

  • Vegesna, V. V. R., Gurugubelli, K., & Vuppala, A. K. (2018). Prosody modification for speech recognition in emotionally mismatched conditions. International Journal of Speech Technology, 21(3), 521–532.

  • Vegesna, V. V. R., Gurugubelli, K., & Vuppala, A. K. (2019). Application of emotion recognition and modification for emotional Telugu speech recognition. Mobile Networks and Applications, 24(1), 193–201.

  • Vishnu Vidyadhara Raju, V. (2020). Towards building a robust Telugu ASR system for emotional speech. PhD thesis. http://hdl.handle.net/10603/306355

  • Yu, D., & Deng, L. (2016). Automatic speech recognition. Springer.


Author information


Corresponding author

Correspondence to Pavan Raju Kammili.



About this article


Cite this article

Kammili, P.R., Ramakrishnam Raju, B.H.V.S. & Krishna, A.S. Handling emotional speech: a prosody based data augmentation technique for improving neutral speech trained ASR systems. Int J Speech Technol 25, 197–204 (2022). https://doi.org/10.1007/s10772-021-09897-x
