1 Introduction

The goal of speech technology is to make human–machine interaction as natural as possible. Two important modules of speech technology are automatic speech recognition and text-to-speech synthesis. The naturalness of the interaction depends on the ability of the system to recognize and synthesize emotions in speech.

In recent years, considerable effort has been devoted to the field of emotion recognition. From the literature, it is observed that there are many interrelated issues, such as databases, features, approaches and evaluation procedures, that need to be considered for the development of an emotion recognition system. Ideally, databases consisting of natural ‘spontaneous’ emotions should be used in the analysis of vocal emotions. However, it is difficult to collect such speech data due to privacy and copyright issues. Therefore, different research groups have collected several databases of emotional speech that can be categorized as simulated, seminatural and (near to) natural [12, 29, 47]. A simulated emotion corpus is recorded from professional speakers (actors) by prompting them to enact emotions through specified text in a given language. There are many examples of simulated databases, such as the Berlin Emotional Speech Database (EMO-DB) [6] and the Danish Emotional Speech Database (DES) [14]. A seminatural database is also a kind of enacted data, where the context is given to the speakers. Examples of databases in this category are the USC-IEMOCAP corpus [30] (in English), and the German and Russian databases described in [12, 29, 47]. The third type of emotional speech database is the (near to) natural database, where recordings do not involve any prompting or obvious eliciting of emotional responses. Sources for such natural data are mostly talk shows in TV broadcasts, interviews, group interactions, etc. [20]. The important aspects in collecting emotional databases and descriptions of the various types of databases are discussed in [12, 63].

The set of features used for emotion recognition can be broadly characterized as prosodic and spectral features. The trend of the prosody features (including fundamental frequency (\(F_0\)), energy and speaking rate) in three emotion categories (anger, happiness and sadness) with respect to neutral state is given in Table 1 [37, 46]. Similarly, the trend of the spectral features (including changes in formant frequencies and spectral tilt) is given in Table 2 [37, 46]. There are some interconnections between the choice of features and the type of the database. For example, the deviations in spectral features such as formant frequencies and spectral tilt are analyzed in simulated parallel corpora. This is because the deviations in formant frequencies and spectral tilt can be compared only when the utterances of different emotion categories are of the same lexical content [37, 43, 46, 59].

Table 1 Trend in prosody features in emotional speech compared to neutral speech

The existing emotion recognition approaches are motivated by applications such as speech recognition, speaker recognition and language identification [23, 28, 35]. In most of the studies [17, 28, 31, 33, 46, 47], vectors consisting of spectral features like mel-frequency cepstral coefficients (MFCCs) and linear prediction cepstral coefficients (LPCCs), prosody features, energy features and their statistics are extracted from overlapping/non-overlapping segments of speech. For example, a large number of features are extracted in the open-source toolkit OpenEAR [24, 47, 49, 64]. Emotions are modeled using discriminative/non-discriminative models such as Gaussian mixture models (GMMs), auto-associative neural networks (AANNs), multilayer feedforward neural networks (MLFFNNs) and deep neural networks (DNNs) [28, 29, 32, 51]. Binary classification techniques such as Bayesian logistic regression (BLR) and support vector machines (SVMs) are also used to address the multi-class problem by adopting a hierarchical binary decision tree framework [23, 24, 30]. In [30], the authors used a binary decision tree approach with the features of the OpenSmile toolbox [48] (a 384-dimensional feature set). In that study, neutral state was first distinguished from three emotions (anger, happiness and sadness), then sadness was distinguished from anger, and in the final stage happiness was distinguished from sadness. Recently, raw speech signals were used with deep neural networks for emotion recognition in [45, 55].

Table 2 Trend in spectral features in emotional speech compared to neutral speech

It is to be noted that the performance, in terms of recognition accuracy, of emotion recognition systems using simulated parallel databases is high compared to that of systems using seminatural and natural databases [19, 30, 47, 60]. This is because utterances of the same lexical content are used for training and testing, and the data come from a limited number of speakers. As per the analysis reported in [37, 46], for speech segments of the same lexical content, there are deviations in spectral features such as formants and spectral tilt. These deviations might help in the discrimination of emotions in the case of simulated parallel corpora.

A cross-corpora study was reported in [10, 49], where emotion models were trained using one corpus and tested with another. In [10], the two corpora consisted of real-life call center data in French, and the accuracy in the cross-corpora evaluation was reported to be 47% for three emotions (anger, neutral state and positive valence). Similarly, in [49], intra-corpus and inter-corpus recognition of emotions was studied, and it was shown that the recognition accuracy depends on the specific group of emotions and feature combinations considered [40]. Most recent studies take advantage of various sets of features in the recognition of emotions using sophisticated classification mechanisms [17, 23, 64].

The present study proposes a set of excitation features that are independent of language and lexical content. An approach for emotion recognition is proposed by characterizing emotions as deviations from neutral state. The objective is to analyze and capture these deviations using features related to the excitation component of the speech production system. The paper is organized as follows: In Sect. 2, background and motivation for exploring excitation features are discussed. Section 3 describes the emotional speech databases used in this study, and the extraction of the excitation features. Analysis of the excitation features is given in Sect. 4. In Sect. 5, the proposed emotion recognition system is discussed. Experimental results are discussed in Sect. 6. Finally, Sect. 7 provides a summary of the work and a scope for further studies.

2 Background and Motivation for Exploring Excitation Features

In studying emotion recognition, it is necessary to process the speech signal suitably to capture emotion-specific information. Since emotional speech is produced by the human speech production mechanism, emotions can be analyzed using both the excitation (voice source) parameters and the vocal tract system parameters. In the literature, emotion recognition systems have mostly been studied using features representing vocal tract system characteristics. Only a few studies have analyzed emotional speech using voice source features [1, 29, 41, 53, 54, 56, 57]. Most of these studies [1, 52, 53, 54, 57] have focused mainly on specific utterances like vowels. For the extraction of these voice source features, glottal flow estimates have been computed using iterative adaptive inverse filtering (IAIF) [2].

In [56, 57], the role of the voice source was analyzed in the perception of valence (positive and negative) and arousal (active and passive) from short vowels (150 ms), and it was shown that the normalized amplitude quotient (NAQ) correlates better with arousal than with valence for both genders. In [56, 57], it was observed that in the vowels [i:] and [u:], the equivalent sound level was the only statistically significant variable in emotional expressions for synthetic data. Similarly, emotions in short segments of the vowel [a:] extracted from continuous speech were analyzed in [1], and it was shown that NAQ yielded significant differences for most of the emotions studied. Even though NAQ correlates with emotions, it has to be noted that NAQ by itself is not sufficient to discriminate different emotions accurately [1]. The interdependencies among the voice source features in emotional speech were studied in [54] for the sustained vowel [a:] in five emotions using six voice source parameters extracted from the glottal flow. In [52, 53], the robustness of glottal source features was studied in a cross-database scenario using four emotions (anger, happiness, sadness and neutral state).

Most of the studies that utilize voice source features in the analysis or recognition of emotions use glottal inverse filtering (GIF) to estimate the glottal flow from specific types of utterances, such as vowels. Ideally, it would be preferable to derive the excitation features from the speech signal directly. Moreover, it has been observed in many studies (e.g., [2, 13, 58]) that the performance of GIF deteriorates in high-pitched speech, such as utterances produced by female or child speakers, and in emotional speech of high arousal. In addition, GIF might not work as well in continuous speech as in sustained vowel utterances, and its performance is also affected when processing degraded speech [2, 13, 58]. Hence, it is justified to derive excitation features from the speech signal directly for the analysis and recognition of emotions.

2.1 Relation to Prior Work

In [21, 22, 36, 39, 62], attempts were made to derive some of the excitation features directly from the speech signal without computing the source-filter decomposition. In [38, 39, 62], excitation features such as epochs/glottal closure instants, the strength of glottal closure and the instantaneous fundamental frequency were derived using the zero frequency filtering (ZFF) method [39]. In [21, 22, 36], the loudness feature was derived to capture the sharpness of glottal closure. To measure the changes in the closed to open phase regions of the glottis, a ratio between the high-frequency and the low-frequency spectral energies was proposed in [36]. In [15, 16, 27, 41, 42], some of these excitation features were used to study emotions in speech. In [16, 27], the authors analyzed excitation features [instantaneous fundamental frequency (\(F_0\)), strength of excitation (SoE), energy of excitation (EoE) [16] and loudness (\(\eta \))], extracted at the sub-segmental level of speech, for four emotions (anger, happiness, sadness and neutral state). The SoE parameter and the ratio of spectral energies between the high-frequency and the low-frequency ranges were used for discriminating angry and happy speech in [15]. In [41, 42], features such as \(F_0\) and SoE, and their first and second derivatives, were used for the analysis and discrimination of emotions. In addition, in [18, 41, 44], the effect of emotions on the excitation component of speech production was studied through prosody modification, by converting speech in one emotion to another.

The present study is based on studying the relations among the parameters of the speech production mechanism from a physiologically motivated perspective. The study involves features of the excitation component of speech, namely the nature of the vocal fold vibration (at the glottis) [62], the strength of the impulse-like excitation at the epoch [39], the energy around the epoch [21, 22] and changes in spectral features caused by the excitation, such as the low-frequency spectral energy (LFSE) and the high-frequency spectral energy (HFSE) [36]. All these features are extracted from the speech signal directly, without using GIF to estimate the glottal flow waveform as in [1, 54, 56, 57]. An approach for emotion recognition is proposed by characterizing emotions as deviations of the features from those of neutral speech.

In [21, 22, 36, 38, 39, 62], methods were developed to derive excitation features from the speech signal. In [15, 16, 18, 27, 41, 42, 44], the authors analyzed excitation features [instantaneous fundamental frequency (\(F_0\)), strength of excitation (SoE), energy of excitation (EoE) and loudness (\(\eta \))], extracted at the sub-segmental level of speech, for four emotions (anger, happiness, sadness and neutral state). Motivated by the good results achieved in [15, 16, 18, 27, 41, 42, 44], where it was shown that excitation features capture significant information about emotions, we carry out a systematic analysis of these features on two databases (covering cases where the lexical content is the same and where it differs) and develop an emotion recognition system using neutral speech features as reference. The recognition system is based on the observation, made in the feature analysis part of the study, that the 2-D feature distributions of emotional speech deviate from the corresponding 2-D feature distributions of neutral speech.

More specifically, the present study is an extension to the preliminary investigation in emotion recognition published in [27]. The extensions are as follows:

  • A systematic analysis of the excitation features is carried out in four emotions (anger, happiness, sadness and neutral state).

  • An emotion recognition system framework is developed using neutral speech as reference.

  • The proposed emotion recognition system processes speech in short segments (2 s) and can therefore be used in real-time applications.

  • The excitation features studied are shown to be independent of lexical content.

  • The effectiveness of the excitation features is also investigated in a cross-language scenario, where the system is trained using a database of one language and tested using a database of another language.

3 Emotional Speech Databases and Feature Extraction

Two types of emotional speech databases (seminatural and simulated) are used in this study. The description of the databases and the feature extraction procedure are discussed in Sects. 3.1 and 3.2, respectively.

3.1 Databases

3.1.1 The IIIT-H Telugu Emotional Speech Database

The IIIT-H Telugu Emotional Speech Database [16] is a seminatural database consisting of speech in the Indian language Telugu, collected from students of IIIT-Hyderabad. The data were collected from seven speakers (two females and five males) producing speech in four emotions (anger, happiness, sadness and neutral state). The students were asked to script the text themselves, which helped them to generate emotional speech by recalling past situations and memories. All the recordings were carried out in a laboratory environment using a close-talking microphone and electroglottography (EGG). For each emotion and for each speaker, the lexical content is different. The recordings were carried out in 2–3 sessions for each speaker, and the entire data set consists of around 200 utterances. The database was evaluated in a perceptual listening test by 10 listeners for the recognizability of the emotions. A total of 130 utterances were used in the current study, consisting of 35, 27, 34 and 34 utterances in anger, happiness, neutral state and sadness, respectively. The mean duration of each utterance is approximately 3 s.

3.1.2 The Berlin Emotional Database (EMO-DB)

The Berlin Emotional Speech Database (EMO-DB) [6] is a German database that was recorded in an anechoic chamber at the Technical University of Berlin. Ten professional native actors (five males and five females) were asked to speak 10 sentences in seven emotions (anger, happiness, neutral state, sadness, fear, disgust and boredom) in one or more sessions. The entire data set consists of around 800 utterances. The database was evaluated in a perception test with 20 listeners regarding the recognizability of emotions. Utterances were selected that had a recognition rate better than 80% and a naturalness rating better than 60%. The mean duration of each utterance is approximately 3 s. In this study, utterances in four emotions (anger, happiness, neutral state and sadness) are considered, and the number of utterances in each emotion is 127, 71, 79 and 62, respectively.

3.2 Extraction of Excitation Features

Motivated by the studies in [16], excitation features are used to develop an emotion recognition system. The excitation features used consist of the following parameters: the instantaneous fundamental frequency (\(F_0\)) [62], the strength of excitation (SoE) [39], the energy of excitation (EoE) [21, 22] and the ratio between the high-frequency and the low-frequency spectral energies (\(\beta \)) [36]. These features are extracted using the zero frequency filtering (ZFF) method [38, 62], linear prediction (LP) analysis [34] and short-time Fourier transform (STFT) [3].

The glottal closure instants (GCIs) of speech are obtained using the ZFF method [39]. In this method, the speech signal is passed through a cascade of two ideal digital resonators located at 0 Hz, followed by trend removal. The resultant signal is called the ZFF signal. The negative-to-positive zero crossings of the ZFF signal correspond to the GCIs. The interval between two successive GCIs gives the fundamental period \(T_0\), and the instantaneous fundamental frequency is given by \(F_0 = 1/T_0\). The slope of the ZFF signal at each GCI is called the strength of excitation (SoE), which in most cases is related to the amplitude of the impulse-like excitation [39]. As the ZFF signal exhibits high energy in the voiced regions, the energy of the ZFF signal is used to detect voiced and unvoiced regions [11].
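
A minimal Python sketch of this procedure is given below. The first-order differencing of the input and the 10-ms trend-removal window are assumptions not specified above; practical ZFF implementations often use a window of one to two average pitch periods and may repeat the trend removal.

```python
import numpy as np
from scipy.signal import lfilter

def zff_excitation_features(speech, fs, trend_win_ms=10.0):
    """Sketch of zero frequency filtering: GCIs, instantaneous F0 and SoE."""
    s = np.asarray(speech, dtype=float)
    # First-order differencing removes any slowly varying bias in the recording.
    x = np.diff(s, prepend=s[:1])
    # Cascade of two ideal digital resonators at 0 Hz:
    # each resonator realizes y[n] = x[n] + 2*y[n-1] - y[n-2].
    y = x
    for _ in range(2):
        y = lfilter([1.0], [1.0, -2.0, 1.0], y)
    # Trend removal: subtract the local mean over a short window
    # (repeating this step can further clean up the polynomial trend).
    win = max(3, int(fs * trend_win_ms / 1000.0))
    zff = y - np.convolve(y, np.ones(win) / win, mode="same")
    # GCIs are the negative-to-positive zero crossings of the ZFF signal.
    gci = np.where((zff[:-1] < 0) & (zff[1:] >= 0))[0]
    # SoE is the slope of the ZFF signal at each GCI.
    soe = zff[gci + 1] - zff[gci]
    # Instantaneous F0 from the interval between successive GCIs (T0 in samples).
    f0 = fs / np.diff(gci)
    return gci, f0, soe
```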

The linear prediction (LP) residual gives an approximation of the excitation component of the speech signal [34]. The energy of excitation (EoE) parameter is computed from the samples of the Hilbert envelope of the LP residual over a 2-ms region around each GCI, and it gives a measure of vocal effort [21, 22]. A 10th-order LP analysis is carried out for each 16-ms frame with a 2-ms frame shift.
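
A sketch of the LP-residual and EoE computation is given below. The hop-by-hop assembly of the residual, the Hamming window and the use of the mean squared Hilbert-envelope samples as the energy measure are simplifying assumptions introduced for illustration; the GCI locations are those obtained from the ZFF step above.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import hilbert, lfilter

def lp_residual(speech, fs, order=10, frame_ms=16.0, hop_ms=2.0):
    """Frame-wise LP residual (autocorrelation method), assembled hop by hop."""
    s = np.asarray(speech, dtype=float)
    frame, hop = int(fs * frame_ms / 1000.0), int(fs * hop_ms / 1000.0)
    residual = np.zeros_like(s)
    window = np.hamming(frame)
    for start in range(0, len(s) - frame, hop):
        seg = s[start:start + frame] * window
        r = np.correlate(seg, seg, mode="full")[frame - 1:frame + order]
        if r[0] < 1e-8:            # skip (near-)silent frames
            continue
        a = solve_toeplitz(r[:order], r[1:order + 1])     # predictor coefficients
        e = lfilter(np.concatenate(([1.0], -a)), [1.0], s[start:start + frame])
        # Keep the residual only for the central hop of this frame.
        off = (frame - hop) // 2
        residual[start + off:start + off + hop] = e[off:off + hop]
    return residual

def energy_of_excitation(residual, gci, fs, region_ms=2.0):
    """EoE around each GCI from the Hilbert envelope of the LP residual."""
    half = int(fs * region_ms / 2000.0)
    env = np.abs(hilbert(residual))
    return np.array([np.mean(env[max(0, g - half):g + half] ** 2) for g in gci])
```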

A segmental feature, the ratio between the high-frequency and low-frequency spectral energies (\(\beta \)), was proposed in [36] for discriminating shouted and neutral speech. It was shown that \(\beta \) is related to the effects caused by the changes in the vocal fold vibration characteristics between the two styles of vocalization. As the \(\beta \) feature captures the arousal characteristics of speech, it is expected to be useful also in emotion recognition. The \(\beta \) measure is computed as the ratio of the energy in the high-frequency band (800–4000 Hz) to the energy in the low-frequency band (0–550 Hz), obtained from the short-time Fourier magnitude spectrum of the speech signal.
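
A sketch of the \(\beta \) computation is given below; only the band limits are specified above, so the 25-ms frame length and 10-ms hop are assumed values.

```python
import numpy as np
from scipy.signal import stft

def spectral_band_energy_ratio(speech, fs, frame_ms=25.0, hop_ms=10.0):
    """Per-frame beta: energy in 800-4000 Hz divided by energy in 0-550 Hz."""
    nperseg = int(fs * frame_ms / 1000.0)
    hop = int(fs * hop_ms / 1000.0)
    f, _, spec = stft(speech, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
    power = np.abs(spec) ** 2
    low = power[(f >= 0.0) & (f <= 550.0)].sum(axis=0)
    high = power[(f >= 800.0) & (f <= 4000.0)].sum(axis=0)
    return high / np.maximum(low, 1e-12)
```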

4 Analysis of Excitation Features

The impulse-like excitation produced by the abrupt closure of the vocal folds is an important characteristic of the speech excitation [61]. Moreover, temporal regions around GCIs correspond to regions of high SNR in the speech signal. Hence, in this study, we focus on features extracted in these high-SNR regions. The features (\(F_0\), SoE, EoE and \(\beta \)) are computed from speech using the ZFF method [38, 62], LP analysis [34] and the short-time Fourier transform [3]. The means and standard deviations of the distributions of the excitation features for two speakers (one female and one male), computed using five utterances per emotion from the IIIT-H Telugu and German EMO-DB databases, are given in Tables 3 and 4.

Table 3 Mean and standard deviation (SD) of the excitation parameters for emotional speech in the IIIT-H database
Table 4 Mean and standard deviation (SD) of the excitation parameters for emotional speech in the EMO-DB database

From Tables 3 and 4, it is observed that \(F_0\) in anger and happiness is high compared to neutral state [16, 41]. However, the average \(F_0\) in happiness is slightly lower than in anger. For sadness, the average \(F_0\) is mostly lower than that in neutral speech. This is in line with previous studies published in [4, 7, 43, 59].

It is interesting to note that the strength of the impulse-like excitation (SoE) in anger appears to be lower than that in neutral speech. This is due to the decrease in the length of the pitch period (\(T_0\)) in anger. In order to maintain the high rate of vibration, the vocal folds may not close with high suction, which results in lower values of SoE in anger. For the same reason, happiness also shows a lower SoE, although it is still higher than in anger. The variance of SoE is much lower in anger than in happiness, even though the mean values of SoE are similar [41]. In the case of sadness, SoE is higher due to the longer pitch period (\(T_0\)) associated with it.

The EoE parameter, computed from the Hilbert envelope of the LP residual over a 2-ms region around each GCI (Sect. 3.2), is higher in anger than in happiness, and lower in sadness, compared to neutral speech [16]. Note that this is different from the energy computed from the speech signal directly; the energy of the excitation component is a better indicator of vocal effort.

The spectral band energy ratio (\(\beta \)) is related to the effects of changes in the vocal fold vibration characteristics, and it captures loudness or arousal characteristics of speech [36]. It can be observed that \(\beta \) is high in anger and happiness and low in sadness. The reason for the high \(\beta \) values in anger and happiness is that the high-frequency band energy is large due to a longer glottal closed phase, whereas the opposite holds in sadness. The mean and standard deviation values of the first formant frequency (\(F_1\)) are also given in Tables 3 and 4. The mean of \(F_1\) (400–750 Hz) is slightly larger in anger compared to neutral state, but there is no clear difference for happiness and sadness compared to neutral speech. The standard deviations of \(\beta \) and \(F_1\) are similar in all emotions.

The above observations are with respect to neutral speech of the speaker. From Tables 3 and 4, it is important to note that the dynamic ranges of the features are speaker specific. For example, \(F_0\) in neutral speech of male speaker 2 is similar to that of speech in sadness by female speaker 1. But for a given speaker, the main trends in the feature values are emotion specific.

Thus, the analysis results of the excitation features in emotional speech with respect to neutral speech can be summarized as shown in Table 5. The excitation features show discrimination among the emotions even though there exists some correlation between anger and happiness.

Table 5 Characteristics of the excitation features for emotional speech with respect to neutral speech

In order to capture the relations among the features, two features are considered at a time to form a two-dimensional (2-D) feature space. As \(F_0\), SoE and EoE are extracted around GCIs, they are considered in pairs, while the segmental features \(\beta \) and \(F_1\) are used as another pair. Hence, four 2-D feature spaces (C1 to C4) are formed as follows:

C1: (\(F_0\) vs SoE),
C2: (EoE vs \(F_0\)),
C3: (EoE vs SoE), and
C4: (\(\beta \) vs \(F_1\)).

To analyze the emotion-specific deviations in these 2-D feature spaces, the reference (neutral) and test (emotional) utterances of the same speaker are considered together. For each reference utterance (neutral state), four 2-D feature spaces (corresponding to anger, happiness, sadness and neutral state) are obtained. As an illustration, Fig. 1 shows the 2-D distributions for a reference utterance (neutral state, indicated by ‘o’) and test utterance (anger, indicated by ‘*’) of the same speaker. The deviations in feature spaces between anger and neutral state can be observed from the figure. For example, in the feature space C1 \((F_0 ~ \hbox {vs} ~ SoE)\) in Fig. 1a, \(F_0\) increases and SoE decreases in anger. Similarly, the changes in the feature spaces for all emotions can be observed with respect to neutral speech.

Fig. 1 Distribution for four combinations of the 2-D feature pairs between a male speaker’s reference (neutral) utterance (marked by ‘o’) and emotional (anger) utterance (marked by ‘*’). Figures 1(a)–(d) are computed before the normalization, and Figs. 1(e)–(h) after the normalization (where ‘N’ refers to normalization)

From the analysis of the excitation features of emotional speech (Tables 3, 4), it is observed that the variances of the features show discrimination even though the mean values indicate less discrimination. Hence, it is useful to capture the divergence between the features extracted from neutral and emotional speech signals. To quantify this divergence, the Kullback–Leibler (KL) distance [9] is used. The distribution in the 2-D feature space of each utterance is modeled by a Gaussian probability density function, represented by its mean vector and covariance matrix. The KL distance is computed between the corresponding 2-D feature distributions of the reference and test utterances as follows:

$$\begin{aligned} D_\mathrm{KL} = \frac{1}{2}\left( \mathrm{tr}\left( \varSigma _1^{-1}\varSigma _0\right) + \left( \mu _1-\mu _0\right) ^T\varSigma _1^{-1}\left( \mu _1-\mu _0\right) \right) - \frac{1}{2}\left( k + \ln \left( \frac{\det \varSigma _0}{\det \varSigma _1}\right) \right) \end{aligned}$$
(1)

where \(D_\mathrm{KL}\) is the KL distance, k is the dimension of the distribution, \(\varSigma _0\), \(\varSigma _1\) are the covariance matrices of the distributions of the feature pair of reference (neutral) and test (emotional) utterances, respectively, and \(\mu _0\), \(\mu _1\) are the corresponding mean vectors.
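
Equation (1) can be evaluated directly from the sample statistics of the two feature pairs, as in the following sketch. It assumes each utterance's feature pair is stacked as an \(N\times 2\) array (one row per GCI or frame); the function name is illustrative.

```python
import numpy as np

def kl_distance_2d(ref_pair, test_pair):
    """KL distance of Eq. (1) between Gaussians fitted to the reference
    (neutral) and test (emotional) feature pairs, each of shape (N, 2)."""
    mu0, sigma0 = ref_pair.mean(axis=0), np.cov(ref_pair, rowvar=False)
    mu1, sigma1 = test_pair.mean(axis=0), np.cov(test_pair, rowvar=False)
    k = ref_pair.shape[1]
    inv1 = np.linalg.inv(sigma1)
    diff = mu1 - mu0
    quad = np.trace(inv1 @ sigma0) + diff @ inv1 @ diff
    return 0.5 * quad - 0.5 * (k + np.log(np.linalg.det(sigma0) / np.linalg.det(sigma1)))
```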

Using the 2-D feature spaces, the KL distances between the reference (neutral) utterances and all other emotions are shown in Table 6 for two speakers of the IIIT-H database. It can be clearly seen that the KL distances between the reference neutral utterances and the test neutral utterances are lower than the KL distances between the reference neutral utterances and the test utterances in anger, happiness and sadness.

Table 6 The average KL distances between reference (neutral) utterances and test utterances in different emotions (anger, happiness, sadness and neutral state) of the IIIT-H database involving different lexical contents

Results of the excitation feature analysis obtained for two speakers of the EMO-DB database are given in Tables 7 and 8. Table 7 corresponds to the case where the utterances in all emotions have the same lexical content, whereas Table 8 corresponds to utterances of different lexical content in the four emotion categories. Similar observations can be made as in the case of the IIIT-H database (Table 6). It appears that the characteristics of the excitation features are independent of the lexical content.

Table 7 The average KL distances between reference (neutral) utterances and test utterances in different emotions (anger, happiness, sadness and neutral state) of the EMO-DB database involving same lexical contents
Table 8 The average KL distances between reference (neutral) utterances and test utterances in different emotions (anger, happiness, sadness and neutral state) of the EMO-DB database involving different lexical contents

It is important to note that the KL distances vary between speakers (i.e., variability due to the speaker) and also between the emotions of a speaker (i.e., variability due to the emotion). The speaker variability is mainly due to variations in the dynamic ranges of the feature values between speakers. From Tables 6, 7 and 8, it can also be observed that the KL distances of all feature combinations for anger (test) utterances are high most of the time for both databases. This indicates that anger shows large deviations from neutral speech in both Telugu and German. When the test utterance corresponds to sadness, the KL distances for all four feature combinations are closer to those of neutral state, indicating that sadness may not deviate much from neutral state. This has also been observed in other studies on emotion recognition [28, 47, 48]. For developing an emotion recognition system using these excitation features, it is therefore necessary to capture both the speaker variability and the emotion variability.

5 Emotion Recognition System Based on Excitation Features

In order to capture the variability within the speakers, the distributions of neutral utterances are normalized as follows. Let us denote the values of \(F_0\), SoE, EoE, \(\beta \) and \(F_1\) for a reference neutral utterance by \(R_{F_0}\), \(R_\mathrm{SoE}\), \(R_\mathrm{EoE}\), \(R_{\beta }\) and \(R_{F_1}\), respectively, and for an emotional utterance by \(E_{F_0}\), \(E_\mathrm{SoE}\), \(E_\mathrm{EoE}\), \(E_{\beta }\) and \(E_{F_1}\), respectively. Let \(R_{m_{F_0}}\), \(R_{m_\mathrm{SoE}}\), \(R_{m_\mathrm{EoE}}\), \(R_{m_{\beta }}\) and \(R_{m_{F_1}}\), respectively, represent the mean values of the distributions of \(R_{F_0}\), \(R_\mathrm{SoE}\), \(R_\mathrm{EoE}\), \(R_{\beta }\) and \(R_{F_1}\). Likewise, let \(R_{\sigma _{F_0}}\), \(R_{\sigma _\mathrm{SoE}}\), \(R_{\sigma _\mathrm{EoE}}\), \(R_{\sigma _{\beta }}\) and \(R_{\sigma _{F_1}}\) represent the standard deviations of the distributions of \(R_{F_0}\), \(R_\mathrm{SoE}\), \(R_\mathrm{EoE}\), \(R_{\beta }\) and \(R_{F_1}\), respectively.

The distributions of neutral utterances are normalized with respect to mean and standard deviation as follows. The normalized distributions for \(R_{F_0}\) are given by:

$$\begin{aligned} N_{R_{F_0}}=\frac{{R_{F_0}}-{R_{m_{F_0}}}}{R_{\sigma _{F_0}}} . \end{aligned}$$
(2)

Similarly, the values of the normalized distributions \(N_{R_\mathrm{SoE}}\), \(N_{R_\mathrm{EoE}}\), \(N_{R_{\beta }}\) and \(N_{R_{F_1}}\) are obtained for \(R_\mathrm{SoE}\), \(R_\mathrm{EoE}\), \(R_{\beta }\) and \(R_{F_1}\), respectively. The normalized distributions for the neutral utterance are shown by ‘o’ in Fig. 1e–h for the distributions of the neutral utterance in Fig. 1a–d, respectively.

To capture the variability due to emotions of a speaker, the distributions of features of an emotion utterance are normalized with respect to the neutral utterance as follows. The normalized distribution of \(E_{F_0}\) is given by:

$$\begin{aligned} N_{E_{F_0}}=\frac{{E_{F_0}}-{R_{m_{F_0}}}}{R_{\sigma _{F_0}}}. \end{aligned}$$
(3)

Similarly, the values of the normalized distributions \(N_{E_\mathrm{SoE}}\), \(N_{E_\mathrm{EoE}}\), \(N_{E_{\beta }}\) and \(N_{E_{F_1}}\) are obtained for \(E_\mathrm{SoE}\), \(E_\mathrm{EoE}\), \(E_{\beta }\) and \(E_{F_1}\), respectively. The normalized distributions for the emotional (anger) utterance are shown by ‘*’ in Fig. 1e–h for the distributions of the emotional (anger) utterance in Fig. 1a–d, respectively.

The normalization is done in a speaker-specific manner using the speaker’s neutral utterance. This helps in reducing the variability among different speakers.
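
Equations (2) and (3) amount to a z-score transform using the statistics of the speaker's reference neutral utterance, as in the following sketch (the feature arrays are assumed to hold one value per GCI or frame).

```python
import numpy as np

def normalize_with_neutral(reference_values, test_values):
    """Eqs. (2)-(3): normalize both distributions with the mean and standard
    deviation of the speaker's reference (neutral) feature values."""
    m, s = np.mean(reference_values), np.std(reference_values)
    return (reference_values - m) / s, (test_values - m) / s

# Example: normalize the F0 values of an anger utterance against the same
# speaker's neutral F0 values.
# n_ref_f0, n_test_f0 = normalize_with_neutral(neutral_f0, anger_f0)
```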

Four two-dimensional (2-D) feature distributions are formed by using the following combinations:

D1: (\(N_{E_{F_0}}\) versus \(N_{E_\mathrm{SoE}}\)),
D2: (\(N_{E_\mathrm{EoE}}\) versus \(N_{E_{F_0}}\)),
D3: (\(N_{E_\mathrm{EoE}}\) versus \(N_{E_\mathrm{SoE}}\)), and
D4: (\(N_{E_{\beta }}\) versus \(N_{E_{F_1}}\)).

Each of these 2-D feature distributions is modeled by a Gaussian distribution, represented by mean vector and covariance matrix.

The training and testing phases of the proposed emotion recognition system are as shown in Figs. 2 and 3, respectively.

Fig. 2 Training phase (template generation process) of the emotion recognition system

Fig. 3 Testing phase of the emotion recognition system

The training process involves the generation of templates. Reference templates are generated using three utterances for each of the four emotions (anger, happiness, sadness and neutral state) from seven speakers of the IIIT-H database. An additional neutral utterance from each of the seven speakers is also used. Therefore, a total of 91 utterances are used (\(3\times 4\times 7=84\) emotional utterances plus 7 neutral utterances). For each of the 84 utterances, four normalized distributions are generated using a neutral utterance of the corresponding speaker, and these distributions are called templates. Hence, \(84\times 4=336\) templates are created. Each stored template consists of the mean vector and covariance matrix of a normalized 2-D emotion distribution. As an illustration, the layout of the stored templates for two of the seven speakers of the IIIT-H database is shown in Fig. 4.

Fig. 4 An illustration of the stored templates for two speakers of the IIIT-H database. Utt1, Utt2 and Utt3 refer to Utterances 1, 2 and 3 of the corresponding emotion, and D1, D2, D3 and D4 refer to the normalized 2-D emotion distributions

For testing, a neutral utterance and an emotional utterance are collected from the test speaker to derive the normalized emotion features. For each test case, the distributions of the features of the emotional utterance are normalized with respect to the neutral utterance, as in Fig. 1e–h. The normalized features of the test utterance (test templates) are compared with each of the corresponding normalized features of the trained templates using the KL distance.

With three utterances for each of the four emotions per reference (trained) speaker, there are 12 utterances. For each utterance, there are four distributions (D1, D2, D3 and D4). Thus, for each test utterance, we get \(12\times 4=48\) KL distances per reference (trained) speaker. The KL distances for each 2-D feature pair and emotion category are averaged over the three utterances, giving a total of \(4\times 4=16\) averaged KL distances per speaker (for the four 2-D feature pairs and the four emotion categories). For a given 2-D feature pair, the lowest of the averaged KL distances across the four emotions determines the emotion label for that feature pair. Thus, we get four emotion labels for each reference speaker. Excluding the test speaker from the reference set, the six remaining reference speakers provide \(6\times 4=24\) emotion labels for a given test utterance.
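
The matching step can be sketched as follows. The data layout (nested dictionaries of stored mean/covariance pairs), the function names and the direction in which the asymmetric KL distance of Eq. (1) is evaluated (stored template as reference, test template as test) are assumptions made for illustration.

```python
import numpy as np

EMOTIONS = ["anger", "happiness", "sadness", "neutral"]
PAIRS = ["D1", "D2", "D3", "D4"]

def kl_from_params(mu0, sigma0, mu1, sigma1):
    """Eq. (1) evaluated directly from stored Gaussian parameters."""
    k = len(mu0)
    inv1 = np.linalg.inv(sigma1)
    diff = np.asarray(mu1) - np.asarray(mu0)
    quad = np.trace(inv1 @ sigma0) + diff @ inv1 @ diff
    return 0.5 * quad - 0.5 * (k + np.log(np.linalg.det(sigma0) / np.linalg.det(sigma1)))

def labels_from_reference_speaker(test_templates, speaker_templates):
    """Four emotion labels (one per 2-D feature pair) from one reference speaker.

    test_templates:    {pair: (mu, cov)} of the normalized test utterance.
    speaker_templates: {emotion: [ {pair: (mu, cov)}, ... ]} for 3 utterances.
    """
    labels = []
    for pair in PAIRS:
        avg_dist = {}
        for emotion in EMOTIONS:
            dists = [kl_from_params(tpl[pair][0], tpl[pair][1],
                                    test_templates[pair][0], test_templates[pair][1])
                     for tpl in speaker_templates[emotion]]
            avg_dist[emotion] = np.mean(dists)          # average over 3 utterances
        labels.append(min(avg_dist, key=avg_dist.get))  # lowest averaged KL distance
    return labels

# Pooling these labels over the six reference speakers (excluding the test
# speaker) gives the 24 labels used for the final decision.
```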

6 Results and Discussion

From the 24 emotion labels obtained for each test utterance, the emotion with the maximum number of labels is selected as the emotion category of the test utterance. The resulting confusion matrix is given in Table 9. Note that all the experiments are carried out with leave-one-speaker-out (LOSO) cross-validation.

Table 9 Confusion matrix for emotions using the maximum number of output emotion labels for the IIIT-H database

From the results given in Table 9, it is observed that the confusion between anger and happiness is high. A similar observation is made between sadness and neutral state. This is because features such as \(F_0\) show an increasing trend and SoE shows a decreasing trend for both anger and happiness when compared to neutral speech [16, 41]. In the case of sadness, these excitation features do not change remarkably compared to neutral state.

Fig. 5 Block diagram of the binary tree decision logic [27, 30]

In order to improve the performance, a two-stage binary decision logic [30] is implemented as shown in Fig. 5. In Stage 1, anger and happiness are grouped into one class, and sadness and neutral state are grouped into another class. The final decision on the emotion category is obtained in Stage 2, where comparisons are made between neutral state and sadness, and between anger and happiness, using the following decision criteria. Between neutral state and sadness, neutral state is chosen if the number of neutral labels > (the number of sadness labels \(+\) 3). This is because the features of neutral test utterances correlate more strongly with the reference neutral features than those of sad utterances do. This is also evident from the KL distances of the feature combinations given in Tables 6, 7 and 8. Similarly, between anger and happiness, anger is chosen if the number of anger labels > (the number of happiness labels \(+\) 2).
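
The two-stage decision on the pooled labels can be written compactly as below. The thresholds (+3 and +2) follow the text, whereas the Stage-1 tie-breaking rule (the low-activation branch wins on equal counts) is an assumption.

```python
from collections import Counter

def two_stage_decision(labels):
    """Two-stage binary decision on the pooled per-speaker emotion labels."""
    counts = Counter(labels)
    high = counts["anger"] + counts["happiness"]   # high-activation class
    low = counts["sadness"] + counts["neutral"]    # low-activation class
    if low >= high:                                # Stage 1: low-activation branch
        # Stage 2: neutral only if it clearly outnumbers sadness.
        return "neutral" if counts["neutral"] > counts["sadness"] + 3 else "sadness"
    # Stage 1: high-activation branch.
    # Stage 2: anger only if it clearly outnumbers happiness.
    return "anger" if counts["anger"] > counts["happiness"] + 2 else "happiness"

# Example: two_stage_decision(["anger"] * 14 + ["happiness"] * 6 + ["neutral"] * 4)
# returns "anger".
```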

Table 10 Confusion matrix after Stage 1 in binary tree decision logic for the IIIT-H database
Table 11 Confusion matrix after Stage 2 in binary tree decision logic for the IIIT-H database

The confusion matrices after Stage 1 and Stage 2 are given in Tables 10 and 11, respectively. From the results given in Table 10, the binary classification at Stage 1 gives an accuracy of 96%; the number of neutral/sad utterances recognized as angry/happy is reduced, and vice versa. This is in line with previous studies which have investigated acoustic features that are effective in discriminating emotions of high activation (anger, happiness) from emotions of low activation (sadness, boredom) [26, 30, 50]. From Table 11, it is observed that the confusion between anger and happiness remains high, whereas the ability to discriminate sadness and neutral state has improved. The recognition accuracy for neutral state, sadness, anger and happiness is 94.1%, 82.4%, 85.7% and 66.7%, respectively, giving an average recognition accuracy of 82.3% for the 4-class problem.

The proposed emotion recognition system was also evaluated using the EMO-DB database, and the results are given in Tables 12 and 13 after Stage 1 and Stage 2, respectively.

Table 12 Confusion matrix after Stage 1 in binary tree decision logic for the EMO-DB database
Table 13 Confusion matrix after Stage 2 in binary tree decision for the EMO-DB database
Table 14 Emotion recognition results obtained for EMO-DB with the proposed method and with the SVM classifier using baseline feature sets based on spectral features (MFCC [60], MSF [60] and PLP [60]), prosody features [60] and the combination of the excitation features and MFCCs

For the EMO-DB database, the recognition accuracy at Stage 1 is 98%, and the recognition accuracy for the 4-class problem after Stage 2 is 76%. The performance of the system on the EMO-DB database is lower because of confusions between anger and happiness. The proposed excitation features were compared with prosody features [60] and three short-term spectral features (mel-frequency cepstral coefficients (MFCCs) [60], perceptual linear predictive coefficients (PLPs) [25, 60] and modulation spectral features (MSFs) [60]) using an SVM classifier [8] with leave-one-speaker-out (LOSO) cross-validation [60]. Table 14 shows the emotion recognition results obtained using the baseline feature sets (MFCCs, PLPs, MSFs and prosody features) [60] with the SVM classifier, the results of the proposed system with the excitation features, and the results for the combination of the proposed excitation features with the MFCCs using the SVM classifier with LOSO cross-validation. From Table 14, it can be observed that the results obtained using the excitation features are comparable to or better than those of the existing prosody and spectral features (MFCCs, PLPs and MSFs). Furthermore, it can be observed that there is complementary information between the proposed excitation features and the MFCC features. In [24], a large number of feature sets (6552 features extracted using the openEAR toolkit) and various SVM schemes were used for a language-dependent and speaker-independent system on the EMO-DB database, and the study reported a recognition accuracy of 79.5%. It is to be noted that the focus of the present study is on the excitation features and their behavior in different emotions, rather than on unraveling which combination of feature toolkits and back-ends results in the best emotion recognition accuracy.

The proposed emotion recognition system is also used for online testing. For this, the speech utterances of each speaker are concatenated with the emotion labels kept intact. This is done for all the speakers in both databases. A neutral utterance of the corresponding speaker is used as a reference. The testing is carried out by processing 2-s buffers of the test speech signal. The confusion scores for the online testing with the IIIT-H and EMO-DB databases are shown in Tables 15 and 16, respectively.
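
Online testing applies the same pipeline to consecutive 2-s buffers of the concatenated signal, as in the following minimal sketch, where classify_buffer is a hypothetical stand-in for the template-matching and decision steps described above.

```python
def classify_stream(speech, fs, classify_buffer, buffer_s=2.0):
    """Run the emotion classifier on consecutive 2-s buffers of a test signal."""
    size = int(fs * buffer_s)
    decisions = []
    for start in range(0, len(speech) - size + 1, size):
        segment = speech[start:start + size]
        decisions.append(classify_buffer(segment, fs))  # one emotion label per buffer
    return decisions
```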

Table 15 Confusion matrix for the emotion classifier processing a 2-s speech buffer using the IIIT-H database
Table 16 Confusion matrix of the emotion classifier processing a 2-s speech buffer using the EMO-DB database

From the results given in Tables 15 and 16, the recognition accuracy for the IIIT-H and EMO-DB databases is 81.5% and 73.9%, respectively. Although there is some loss of suprasegmental information because of the 2-s speech buffering, there is not much reduction in performance, since the proposed features use only the sub-segmental information around the epochs. One reason for the reduction in performance is that not all segments of an utterance show similar distributions of emotional information. This is also evident from [4, 5, 7], where it was shown that the emotionally salient aspects of speech are important in the recognition and synthesis of emotional speech.

To test the effectiveness of the excitation features across languages, the reference templates created (i.e., trained) using the EMO-DB database are used to test data from the IIIT-H database and vice versa. From the results given in Tables 17 and 18, the recognition accuracy for the 4-class problem is about 68% in the former case and 61% in the latter. This indicates that, to some extent, language and cultural aspects of expressing vocal emotions affect the recognition of emotions. However, it is worth emphasizing that the extracted excitation features are independent of the lexical content.

Table 17 Confusion matrix of emotion classifier for training with the German EMO-DB database and testing with the IIIT-H Telugu database
Table 18 Confusion matrix of emotion classifier for training with the IIIT-H Telugu database and testing with the German EMO-DB database

The results of the proposed method indicate that the features corresponding to vocal effort seem to carry emotion-specific information. The performance of the system may be improved by increasing the number of reference (trained) templates and speakers. As there are confusions between anger and happiness, and between sadness and neutral state, deriving features which are more emotion specific may reduce the confusion between them.

7 Conclusions

In this paper, features corresponding to the speech excitation were studied for the analysis and recognition of vocal emotions. An emotion recognition system based on features related to the excitation component of speech production was developed by considering emotional states as deviations from neutral state. The deviations were captured through 2-D feature spaces. A template-based representation of the normalized 2-D feature distributions of emotions, using neutral speech as reference, was generated from training examples. The emotion recognition system uses reference templates derived from utterances in anger, happiness, sadness and neutral state. Although the system is speaker independent, a neutral speech utterance of the speaker is required for registration before testing, because of the variability of the dynamic ranges of the excitation features across speakers. One advantage of the proposed method is that it can be used to recognize emotions from short segments (2 s) of speech.

Ideally, an emotion recognition system should recognize the emotion category of speech without having access to neutral speech of the speaker; in this sense, the current study is limited. Existing emotion recognition systems have been developed mainly using features representing the vocal tract system characteristics. Since the present study demonstrates that the excitation features capture emotion-specific characteristics of speech effectively, it may be possible to combine features from the excitation and the vocal tract system to improve the overall performance of emotion recognition systems. In addition, exploring the relations among the excitation features in an emotion-specific way might help in developing more robust emotion recognition systems.