1 Introduction

The goal of speech technology is to make human–machine interaction as natural as possible. Two important modules of speech technology are automatic speech recognition and text-to-speech synthesis. The naturalness of the interaction depends on the ability of the system to recognize and synthesize emotions in speech.

In recent years, considerable effort has been devoted to the field of emotion recognition. From the literature, it is observed that there are many interrelated issues, such as databases, features, approaches and evaluation procedures, that need to be considered for the development of an emotion recognition system. Ideally, databases consisting of natural ‘spontaneous’ emotions should be used in the analysis of vocal emotions. However, it is difficult to collect such speech data due to privacy and copyright issues. Therefore, different research groups have collected several databases of emotional speech that can be categorized as simulated, seminatural and (near to) natural [12, 29, 47]. A simulated emotion corpus is recorded from professional speakers (actors) by prompting them to enact emotions through specified text in a given language. There are many examples of simulated databases, such as the Berlin Emotional Speech Database (EMO-DB) [6] and the Danish Emotional Speech Database (DES) [14]. A seminatural database is also a kind of enacted data, where the context is given to the speakers. Examples of databases in this category are the USC-IEMOCAP corpus [30] (in English), and the German and Russian databases described in [12, 29, 47]. The third type of emotional speech database is the (near to) natural database, where recordings do not involve any prompting or obvious eliciting of emotional responses. Sources for such natural data are mostly talk shows in TV broadcasts, interviews, group interactions, etc. [20]. The important aspects in collecting emotional databases and descriptions of the various types of databases are discussed in [12, 63].

The set of features used for emotion recognition can be broadly characterized as prosodic and spectral features. The trend of the prosody features (including fundamental frequency (\(F_0\)), energy and speaking rate) in three emotion categories (anger, happiness and sadness) with respect to neutral state is given in Table 1 [37, 46]. Similarly, the trend of the spectral features (including changes in formant frequencies and spectral tilt) is given in Table 2 [37, 46]. There are some interconnections between the choice of features and the type of the database. For example, the deviations in spectral features such as formant frequencies and spectral tilt are analyzed in simulated parallel corpora. This is because the deviations in formant frequencies and spectral tilt can be compared only when the utterances of different emotion categories are of the same lexical content [37, 43, 46, 59].

Table 1 Trend in prosody features in emotional speech compared to neutral speech

The existing emotion recognition approaches are motivated by applications such as speech recognition, speaker recognition and language identification [23, 28, 35]. In most of the studies [17, 28, 31, 33, 46, 47], vectors consisting of spectral features like mel-frequency cepstral coefficients (MFCCs) and linear prediction cepstral coefficients (LPCCs), prosody features, energy features and their statistics are extracted from overlapping/non-overlapping segments of speech. For example, a large number of features are extracted in the open-source toolkit OpenEAR [24, 47, 49, 64]. Emotions are modeled using discriminative/non-discriminative models such as Gaussian mixture models (GMMs), auto-associative neural networks (AANNs), multilayer feedforward neural networks (MLFFNNs) and deep neural networks (DNNs) [28, 29, 32, 51]. Binary classification techniques such as Bayesian logistic regression (BLR) and support vector machines (SVMs) are also used to address the multi-class problem by adopting a hierarchical binary decision tree framework [23, 24, 30]. In [30], the authors used a binary decision tree approach with the features of the OpenSmile toolbox [48] (a 384-dimensional feature set). In that study, neutral state was first distinguished from three emotions (anger, happiness and sadness), then sadness was distinguished from anger, and in the final stage happiness was distinguished from sadness. Recently, raw speech signals were used with deep neural networks for emotion recognition in [45, 55].

Table 2 Trend in spectral features in emotional speech compared to neutral speech

It is to be noted that the performance, in terms of recognition accuracy, of emotion recognition systems using simulated parallel databases is high compared to that of systems using seminatural and natural databases [19, 30, 47, 60]. This is because utterances of the same lexical content are used for training and testing, and the data come from a limited number of speakers. As per the analysis reported in [37, 46], for speech segments of the same lexical content, there are deviations in spectral features such as formants and spectral tilt. These deviations might help in the discrimination of emotions in the case of simulated parallel corpora.

A cross-corpora study was reported in [10, 49], where emotion models were trained using one corpus and tested with another. In [10], the two corpora consisted of real-life call center data in French, and the accuracy in the cross-corpora evaluation was reported to be 47% for three emotions (anger, neutral state and positive valence). Similarly, in [49], intra-corpus and inter-corpus recognition of emotions was studied, and it was shown that the recognition accuracy depends on the specific group of emotions and feature combinations considered [40]. Most recent studies take advantage of various sets of features in the recognition of emotions using sophisticated classification mechanisms [17, 23, 64].

The present study proposes a set of excitation features that are independent of language and lexical content. An approach for emotion recognition is proposed by characterizing emotions as deviations from neutral state. The objective is to analyze and capture these deviations using features related to the excitation component of the speech production system. The paper is organized as follows: In Sect. 2, background and motivation for exploring excitation features are discussed. Section 3 describes the emotional speech databases used in this study, and the extraction of the excitation features. Analysis of the excitation features is given in Sect. 4. In Sect. 5, the proposed emotion recognition system is discussed. Experimental results are discussed in Sect. 6. Finally, Sect. 7 provides a summary of the work and a scope for further studies.

2 Background and Motivation for Exploring Excitation Features

In studying emotion recognition, it is necessary to process the speech signal suitably to capture emotion-specific information. Since emotional speech is produced by the human speech production mechanism, emotions can be analyzed using both the excitation (voice source) parameters and the vocal tract system parameters. In the literature, emotion recognition systems have mostly been studied using features representing vocal tract system characteristics. Only a few studies have analyzed emotional speech using voice source features [1, 29, 41, 53, 54, 56, 57]. Most of these studies [1, 52, 53, 54, 57] have focused mainly on specific utterances like vowels. For the extraction of these voice source features, glottal flow estimates have been computed using iterative adaptive inverse filtering (IAIF) [2].

In [56, 57], the role of the voice source was analyzed in the perception of valence (positive and negative) and arousal (active and passive) from short vowels (150 ms), and it was shown that the normalized amplitude quotient (NAQ) correlates better with arousal than with valence for both genders. In [56, 57], it was observed that in the vowels [i:] and [u:], the equivalent sound level was the only statistically significant variable in emotional expressions for synthetic data. Similarly, emotions in short segments of the vowel [a:] extracted from continuous speech were analyzed in [1], and it was shown that NAQ yielded significant differences for most of the emotions studied. Even though NAQ correlates with emotions, it has to be noted that NAQ by itself is not sufficient to discriminate different emotions accurately [1]. The interdependencies among the voice source features in emotional speech were studied in [54] for the sustained vowel [a:] in five emotions using six voice source parameters extracted from the glottal flow. In [52, 53], the robustness of glottal source features was studied in a cross-database scenario using four emotions (anger, happiness, sadness and neutral state).

Most of the studies that utilize voice source features in the analysis or recognition of emotions use glottal inverse filtering (GIF) to estimate the glottal flow from specific types of utterances, such as vowels. Ideally, it would be preferable to derive the excitation features from the speech signal directly. Moreover, it has been observed in many studies (e.g., [2, 13, 58]) that the performance of GIF deteriorates in high-pitched speech, such as utterances produced by female or child speakers, and in emotional speech of high arousal. In addition, GIF might not work as well in continuous speech as in sustained vowel utterances, and its performance is also affected when processing degraded speech [2, 13, 58]. Hence, it is justified to derive excitation features from the speech signal directly for the analysis and recognition of emotions.

2.1 Relation to Prior Work

In [21, 22, 36, 39, 62], attempts were made to derive some of the excitation features directly from the speech signal without computing the source-filter decomposition. In [38, 39, 62], excitation features such as epochs/glottal closure instants, the strength of glottal closure and the instantaneous fundamental frequency were derived using the zero frequency filtering (ZFF) method [39]. In [21, 22, 36], the loudness feature was derived to capture the sharpness of glottal closure. To measure the changes in the closed to open phase regions of the glottis, a ratio between the high-frequency and the low-frequency spectral energies was proposed in [36]. In [15, 16, 27, 41, 42], some of these excitation features were used to study emotions in speech. In [16, 27], the authors analyzed excitation features [instantaneous fundamental frequency (\(F_0\)), strength of excitation (SoE), energy of excitation (EoE) [16] and loudness (\(\eta \))], extracted at the sub-segmental level of speech, for four emotions (anger, happiness, sadness and neutral state). The SoE parameter and the ratio of spectral energies between the high-frequency and the low-frequency ranges were used for discriminating angry and happy speech in [15]. In [41, 42], features such as \(F_0\) and SoE, and their first and second derivatives, were used for the analysis and discrimination of emotions. In addition, in [18, 41, 44], the effect of emotions on the excitation component of speech production was studied through prosody modification, by converting speech in one emotion to another.

The present study is based on studying the relations among the parameters of the speech production mechanism from a physiologically motivated perspective. The study involves features of the excitation component of speech, namely the nature of the vocal fold vibration (at the glottis) [62], the strength of the impulse-like excitation at the epoch [39], the energy around the epoch [21, 22] and changes in spectral features caused by the excitation, such as the low-frequency spectral energy (LFSE) and the high-frequency spectral energy (HFSE) [36]. All these features are extracted from the speech signal directly, without using GIF to estimate the glottal flow waveform as in [1, 54, 56, 57]. An approach for emotion recognition is proposed by characterizing emotions as deviations of the features from those of neutral speech.

In [21, 22, 36, 38, 39, 62], methods were developed to derive excitation features from the speech signal. In [15, 16, 18, 27, 41, 42, 44], the authors analyzed excitation features [instantaneous fundamental frequency (\(F_0\)), strength of excitation (SoE), energy of excitation (EoE) and loudness (\(\eta \))], extracted at the sub-segmental level of speech, for four emotions (anger, happiness, sadness and neutral state). Motivated by the good results achieved in [15, 16, 18, 27, 41, 42, 44], where it was shown that excitation features capture significant information about emotions, we carry out a systematic analysis of these features on two databases (covering cases where the lexical content is the same and where it differs) and develop an emotion recognition system using neutral speech features as reference. The recognition system is based on the observation, made in the feature analysis part of the study, that the 2-D feature distributions of emotional speech deviate from the corresponding 2-D feature distributions of neutral speech.

More specifically, the present study is an extension to the preliminary investigation in emotion recognition published in [27]. The extensions are as follows:

  • A systematic analysis of the excitation features is carried out in four emotions (anger, happiness, sadness and neutral state).

  • An emotion recognition system framework is developed using neutral speech as reference.

  • The proposed emotion recognition system processes speech in short segments (2 s) and can therefore be used in real-time applications.

  • The excitation features studied are shown to be independent of lexical content.

  • The effectiveness of the excitation features is also investigated in a cross-language scenario, where the system is trained using a database of one language and tested using a database of another language.

3 Emotional Speech Databases and Feature Extraction

Two types of emotional speech databases (seminatural and simulated) are used in this study. The description of the databases and the feature extraction procedure are discussed in Sects. 3.1 and 3.2, respectively.

3.1 Databases

3.1.1 The IIIT-H Telugu Emotional Speech Database

The IIIT-H Telugu Emotional Speech Database [16] is a seminatural database consisting of speech in the Indian language Telugu, collected from students of IIIT-Hyderabad. The data were collected from seven speakers (two females and five males) producing speech in four emotions (anger, happiness, sadness and neutral state). The students were asked to script the text themselves, which helped them to generate emotional speech by recalling past situations and memories. All the recordings were carried out in a laboratory environment using a close-talking microphone and electroglottography (EGG). For each emotion and for each speaker, the lexical content is different. The recordings were carried out in 2–3 sessions for each speaker, and the entire data set consists of around 200 utterances. The database was evaluated in a perceptual listening test by 10 listeners for the recognizability of the emotions. A total of 130 utterances were used in the current study, consisting of 35, 27, 34 and 34 utterances in anger, happiness, neutral state and sadness, respectively. The mean duration of each utterance is approximately 3 s.

3.1.2 The Berlin Emotional Database (EMO-DB)

The Berlin Emotional Speech Database (EMO-DB) [6] is a German database that was recorded in an anechoic chamber at the Technical University of Berlin. Ten professional native actors (five males and five females) were asked to speak 10 sentences in seven emotions (anger, happiness, neutral state, sadness, fear, disgust and boredom) in one or more sessions. The entire data set consists of around 800 utterances. The database was evaluated in a perception test with 20 listeners regarding the recognizability of emotions. Utterances were selected that had a recognition rate better than 80% and a naturalness rating better than 60%. The mean duration of each utterance is approximately 3 s. In this study, utterances in four emotions (anger, happiness, neutral state and sadness) are considered, and the number of utterances in each emotion is 127, 71, 79 and 62, respectively.

3.2 Extraction of Excitation Features

Motivated by the studies in [16], excitation features are used to develop an emotion recognition system. The excitation features used consist of the following parameters: the instantaneous fundamental frequency (\(F_0\)) [62], the strength of excitation (SoE) [39], the energy of excitation (EoE) [21, 22] and the ratio between the high-frequency and the low-frequency spectral energies (\(\beta \)) [36]. These features are extracted using the zero frequency filtering (ZFF) method [38, 62], linear prediction (LP) analysis [34] and short-time Fourier transform (STFT) [3].

The glottal closure instants (GCIs) of speech are obtained using the ZFF method [39]. In this method, the speech signal is passed through a cascade of two ideal digital resonators located at 0 Hz, followed by trend removal. The resultant signal is called the ZFF signal. The negative-to-positive zero crossings of the ZFF signal correspond to the GCIs. The interval between two successive GCIs gives the fundamental period \(T_0\), and the instantaneous fundamental frequency is given by \(F_0 = 1/T_0\). The slope of the ZFF signal at each GCI is called the strength of excitation (SoE), which in most cases is related to the amplitude of the impulse-like excitation [39]. As the ZFF signal exhibits high energy in the voiced regions, the energy of the ZFF signal is used to detect voiced and unvoiced regions [11].
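
A minimal Python sketch of this procedure is given below. The first-order differencing of the input and the 10-ms trend-removal window are assumptions not specified above; practical ZFF implementations often use a window of one to two average pitch periods and may repeat the trend removal.

```python
import numpy as np
from scipy.signal import lfilter

def zff_excitation_features(speech, fs, trend_win_ms=10.0):
    """Sketch of zero frequency filtering: GCIs, instantaneous F0 and SoE."""
    s = np.asarray(speech, dtype=float)
    # First-order differencing removes any slowly varying bias in the recording.
    x = np.diff(s, prepend=s[:1])
    # Cascade of two ideal digital resonators at 0 Hz:
    # each resonator realizes y[n] = x[n] + 2*y[n-1] - y[n-2].
    y = x
    for _ in range(2):
        y = lfilter([1.0], [1.0, -2.0, 1.0], y)
    # Trend removal: subtract the local mean over a short window
    # (repeating this step can further clean up the polynomial trend).
    win = max(3, int(fs * trend_win_ms / 1000.0))
    zff = y - np.convolve(y, np.ones(win) / win, mode="same")
    # GCIs are the negative-to-positive zero crossings of the ZFF signal.
    gci = np.where((zff[:-1] < 0) & (zff[1:] >= 0))[0]
    # SoE is the slope of the ZFF signal at each GCI.
    soe = zff[gci + 1] - zff[gci]
    # Instantaneous F0 from the interval between successive GCIs (T0 in samples).
    f0 = fs / np.diff(gci)
    return gci, f0, soe
```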

The linear prediction (LP) residual gives an approximation of the excitation component of the speech signal [34]. The energy of excitation (EoE) parameter is computed from the samples of the Hilbert envelope of the LP residual over a 2-ms region around each GCI, and it gives a measure of vocal effort [21, 22]. A 10th-order LP analysis is carried out for each 16-ms frame with a 2-ms frame shift.
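
A sketch of the LP-residual and EoE computation is given below. The hop-by-hop assembly of the residual, the Hamming window and the use of the mean squared Hilbert-envelope samples as the energy measure are simplifying assumptions introduced for illustration; the GCI locations are those obtained from the ZFF step above.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import hilbert, lfilter

def lp_residual(speech, fs, order=10, frame_ms=16.0, hop_ms=2.0):
    """Frame-wise LP residual (autocorrelation method), assembled hop by hop."""
    s = np.asarray(speech, dtype=float)
    frame, hop = int(fs * frame_ms / 1000.0), int(fs * hop_ms / 1000.0)
    residual = np.zeros_like(s)
    window = np.hamming(frame)
    for start in range(0, len(s) - frame, hop):
        seg = s[start:start + frame] * window
        r = np.correlate(seg, seg, mode="full")[frame - 1:frame + order]
        if r[0] < 1e-8:            # skip (near-)silent frames
            continue
        a = solve_toeplitz(r[:order], r[1:order + 1])     # predictor coefficients
        e = lfilter(np.concatenate(([1.0], -a)), [1.0], s[start:start + frame])
        # Keep the residual only for the central hop of this frame.
        off = (frame - hop) // 2
        residual[start + off:start + off + hop] = e[off:off + hop]
    return residual

def energy_of_excitation(residual, gci, fs, region_ms=2.0):
    """EoE around each GCI from the Hilbert envelope of the LP residual."""
    half = int(fs * region_ms / 2000.0)
    env = np.abs(hilbert(residual))
    return np.array([np.mean(env[max(0, g - half):g + half] ** 2) for g in gci])
```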

A segmental feature, the ratio between the high-frequency and low-frequency spectral energies (\(\beta \)), was proposed in [36] for discriminating shouted and neutral speech. It was shown that \(\beta \) is related to the effects caused by the changes in the vocal fold vibration characteristics between the two styles of vocalization. As the \(\beta \) feature captures the arousal characteristics of speech, it is expected to be useful also in emotion recognition. The \(\beta \) measure is computed as the ratio of the energy in the high-frequency band (800–4000 Hz) to the energy in the low-frequency band (0–550 Hz), obtained from the short-time Fourier magnitude spectrum of the speech signal.
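
A sketch of the \(\beta \) computation is given below; only the band limits are specified above, so the 25-ms frame length and 10-ms hop are assumed values.

```python
import numpy as np
from scipy.signal import stft

def spectral_band_energy_ratio(speech, fs, frame_ms=25.0, hop_ms=10.0):
    """Per-frame beta: energy in 800-4000 Hz divided by energy in 0-550 Hz."""
    nperseg = int(fs * frame_ms / 1000.0)
    hop = int(fs * hop_ms / 1000.0)
    f, _, spec = stft(speech, fs=fs, nperseg=nperseg, noverlap=nperseg - hop)
    power = np.abs(spec) ** 2
    low = power[(f >= 0.0) & (f <= 550.0)].sum(axis=0)
    high = power[(f >= 800.0) & (f <= 4000.0)].sum(axis=0)
    return high / np.maximum(low, 1e-12)
```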

4 Analysis of Excitation Features

The impulse-like excitation produced by the abrupt closure of the vocal folds is an important characteristic of the speech excitation [61]. Moreover, temporal regions around GCIs correspond to regions of high SNR in the speech signal. Hence, in this study, we focus on features extracted in these high-SNR regions. The features (\(F_0\), SoE, EoE and \(\beta \)) are computed from speech using the ZFF method [38, 62], LP analysis [34] and the short-time Fourier transform [3]. The means and standard deviations of the distributions of the excitation features for two speakers (one female and one male), computed using five utterances per emotion from the IIIT-H Telugu and German EMO-DB databases, are given in Tables 3 and 4.

Table 3 Mean and standard deviation (SD) of the excitation parameters for emotional speech in the IIIT-H database
Table 4 Mean and standard deviation (SD) of the excitation parameters for emotional speech in the EMO-DB database

From Tables 3 and 4, it is observed that \(F_0\) in anger and happiness is high compared to neutral state [16, 41]. However, the average \(F_0\) in happiness is slightly lower than in anger. For sadness, the average \(F_0\) is mostly lower than that in neutral speech. This is in line with previous studies published in [4, 7, 43, 59].

It is interesting to note that the strength of the impulse-like excitation (SoE) in anger appears to be lower than that in neutral speech. This is due to the decrease in the length of the pitch period (\(T_0\)) in anger. In order to maintain the high rate of vibration, the vocal folds may not close with high suction, which results in lower values of SoE in anger. For the same reason, happiness also shows a lower SoE, although it is still higher than in anger. The variance of SoE is much lower in anger than in happiness, even though the mean values of SoE are similar [41]. In the case of sadness, SoE is higher due to the longer pitch period (\(T_0\)) associated with it.

The EoE parameter, computed from the Hilbert envelope of the LP residual over a 2-ms region around each GCI (Sect. 3.2), is higher in anger than in happiness, and lower in sadness, compared to neutral speech [16]. Note that this is different from the energy computed from the speech signal directly; the energy of the excitation component is a better indicator of vocal effort.

The spectral band energy ratio (\(\beta \)) is related to the effects of changes in the vocal fold vibration characteristics, and it captures loudness or arousal characteristics of speech [36]. It can be observed that \(\beta \) is high in anger and happiness and low in sadness. The reason for the high \(\beta \) values in anger and happiness is that the high-frequency band energy is large due to a longer glottal closed phase, whereas the opposite holds in sadness. The mean and standard deviation values of the first formant frequency (\(F_1\)) are also given in Tables 3 and 4. The mean of \(F_1\) (400–750 Hz) is slightly larger in anger compared to neutral state, but there is no clear difference for happiness and sadness compared to neutral speech. The standard deviations of \(\beta \) and \(F_1\) are similar in all emotions.

The above observations are with respect to neutral speech of the speaker. From Tables 3 and 4, it is important to note that the dynamic ranges of the features are speaker specific. For example, \(F_0\) in neutral speech of male speaker 2 is similar to that of speech in sadness by female speaker 1. But for a given speaker, the main trends in the feature values are emotion specific.

Thus, the analysis results of the excitation features in emotional speech with respect to neutral speech can be summarized as shown in Table 5. The excitation features show discrimination among the emotions even though there exists some correlation between anger and happiness.

Table 5 Characteristics of the excitation features for emotional speech with respect to neutral speech

In order to capture the relations among the features, two features are considered at a time to form a two-dimensional (2-D) feature space. As \(F_0\), SoE and EoE are extracted around GCIs, they are considered in pairs, while the segmental features \(\beta \) and \(F_1\) are used as another pair. Hence, four 2-D feature spaces (C1 to C4) are formed as follows:

C1: (\(F_0\) vs SoE),
C2: (EoE vs \(F_0\)),
C3: (EoE vs SoE), and
C4: (\(\beta \) vs \(F_1\)).

To analyze the emotion-specific deviations in these 2-D feature spaces, the reference (neutral) and test (emotional) utterances of the same speaker are considered together. For each reference utterance (neutral state), four 2-D feature spaces (corresponding to anger, happiness, sadness and neutral state) are obtained. As an illustration, Fig. 1 shows the 2-D distributions for a reference utterance (neutral state, indicated by ‘o’) and test utterance (anger, indicated by ‘*’) of the same speaker. The deviations in feature spaces between anger and neutral state can be observed from the figure. For example, in the feature space C1 \((F_0 ~ \hbox {vs} ~ SoE)\) in Fig. 1a, \(F_0\) increases and SoE decreases in anger. Similarly, the changes in the feature spaces for all emotions can be observed with respect to neutral speech.

Fig. 1 Distribution for four combinations of the 2-D feature pairs between a male speaker’s reference (neutral) utterance (marked by ‘o’) and emotional (anger) utterance (marked by ‘*’). Figures 1(a)–(d) are computed before the normalization, and Figs. 1(e)–(h) after the normalization (where ‘N’ refers to normalization)

From the analysis of the excitation features of emotional speech (Tables 3, 4), it is observed that the variances of the features show discrimination even though the mean values indicate less discrimination. Hence, it is useful to capture the divergence between the features extracted from neutral and emotional speech signals. To quantify this divergence, the Kullback–Leibler (KL) distance [9] is used. The distribution in the 2-D feature space of each utterance is modeled by a Gaussian probability density function, represented by its mean vector and covariance matrix. The KL distance is computed between the corresponding 2-D feature distributions of the reference and test utterances as follows:

$$\begin{aligned} D_\mathrm{KL} = \frac{1}{2}\left( \mathrm{tr}\left( \varSigma _1^{-1}\varSigma _0\right) + \left( \mu _1-\mu _0\right) ^T\varSigma _1^{-1}\left( \mu _1-\mu _0\right) \right) - \frac{1}{2}\left( k + \ln \left( \frac{\det \varSigma _0}{\det \varSigma _1}\right) \right) \end{aligned}$$
(1)

where \(D_\mathrm{KL}\) is the KL distance, k is the dimension of the distribution, \(\varSigma _0\), \(\varSigma _1\) are the covariance matrices of the distributions of the feature pair of reference (neutral) and test (emotional) utterances, respectively, and \(\mu _0\), \(\mu _1\) are the corresponding mean vectors.
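
Equation (1) can be evaluated directly from the sample statistics of the two feature pairs, as in the following sketch. It assumes each utterance's feature pair is stacked as an \(N\times 2\) array (one row per GCI or frame); the function name is illustrative.

```python
import numpy as np

def kl_distance_2d(ref_pair, test_pair):
    """KL distance of Eq. (1) between Gaussians fitted to the reference
    (neutral) and test (emotional) feature pairs, each of shape (N, 2)."""
    mu0, sigma0 = ref_pair.mean(axis=0), np.cov(ref_pair, rowvar=False)
    mu1, sigma1 = test_pair.mean(axis=0), np.cov(test_pair, rowvar=False)
    k = ref_pair.shape[1]
    inv1 = np.linalg.inv(sigma1)
    diff = mu1 - mu0
    quad = np.trace(inv1 @ sigma0) + diff @ inv1 @ diff
    return 0.5 * quad - 0.5 * (k + np.log(np.linalg.det(sigma0) / np.linalg.det(sigma1)))
```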

Using the 2-D feature spaces, the KL distances between the reference (neutral) utterances and all other emotions are shown in Table 6 for two speakers of the IIIT-H database. It can be clearly seen that the KL distances between the reference neutral utterances and the test neutral utterances are lower than the KL distances between the reference neutral utterances and the test utterances in anger, happiness and sadness.

Table 6 The average KL distances between reference (neutral) utterances and test utterances in different emotions (anger, happiness, sadness and neutral state) of the IIIT-H database involving different lexical contents

Results of the excitation feature analysis obtained for two speakers of the EMO-DB database are given in Tables 7 and 8. Table 7 corresponds to the case where the utterances in all emotions have the same lexical content, whereas Table 8 corresponds to utterances of different lexical content in the four emotion categories. Similar observations can be made as in the case of the IIIT-H database (Table 6). It appears that the characteristics of the excitation features are independent of the lexical content.

Table 7 The average KL distances between reference (neutral) utterances and test utterances in different emotions (anger, happiness, sadness and neutral state) of the EMO-DB database involving same lexical contents
Table 8 The average KL distances between reference (neutral) utterances and test utterances in different emotions (anger, happiness, sadness and neutral state) of the EMO-DB database involving different lexical contents

It is important to note that the KL distances vary between speakers (i.e., variability due to the speaker) and also between the emotions of a speaker (i.e., variability due to the emotion). The speaker variability is mainly due to variations in the dynamic ranges of the feature values between speakers. From Tables 6, 7 and 8, it can also be observed that the KL distances of all feature combinations for anger (test) utterances are high most of the time for both databases. This indicates that anger shows large deviations from neutral speech in both Telugu and German. When the test utterance corresponds to sadness, the KL distances for all four feature combinations are closer to those of neutral state, indicating that sadness may not deviate much from neutral state. This has also been observed in other studies on emotion recognition [28, 47, 48]. For developing an emotion recognition system using these excitation features, it is therefore necessary to capture both the speaker variability and the emotion variability.

5 Emotion Recognition System Based on Excitation Features

In order to capture the variability within the speakers, the distributions of neutral utterances are normalized as follows. Let us denote the values of \(F_0\), SoE, EoE, \(\beta \) and \(F_1\) for a reference neutral utterance by \(R_{F_0}\), \(R_\mathrm{SoE}\), \(R_\mathrm{EoE}\), \(R_{\beta }\) and \(R_{F_1}\), respectively, and for an emotional utterance by \(E_{F_0}\), \(E_\mathrm{SoE}\), \(E_\mathrm{EoE}\), \(E_{\beta }\) and \(E_{F_1}\), respectively. Let \(R_{m_{F_0}}\), \(R_{m_\mathrm{SoE}}\), \(R_{m_\mathrm{EoE}}\), \(R_{m_{\beta }}\) and \(R_{m_{F_1}}\), respectively, represent the mean values of the distributions of \(R_{F_0}\), \(R_\mathrm{SoE}\), \(R_\mathrm{EoE}\), \(R_{\beta }\) and \(R_{F_1}\). Likewise, let \(R_{\sigma _{F_0}}\), \(R_{\sigma _\mathrm{SoE}}\), \(R_{\sigma _\mathrm{EoE}}\), \(R_{\sigma _{\beta }}\) and \(R_{\sigma _{F_1}}\) represent the standard deviations of the distributions of \(R_{F_0}\), \(R_\mathrm{SoE}\), \(R_\mathrm{EoE}\), \(R_{\beta }\) and \(R_{F_1}\), respectively.

The distributions of neutral utterances are normalized with respect to mean and standard deviation as follows. The normalized distributions for \(R_{F_0}\) are given by:

$$\begin{aligned} N_{R_{F_0}}=\frac{{R_{F_0}}-{R_{m_{F_0}}}}{R_{\sigma _{F_0}}} . \end{aligned}$$
(2)

Similarly, the values of the normalized distributions \(N_{R_\mathrm{SoE}}\), \(N_{R_\mathrm{EoE}}\), \(N_{R_{\beta }}\) and \(N_{R_{F_1}}\) are obtained for \(R_\mathrm{SoE}\), \(R_\mathrm{EoE}\), \(R_{\beta }\) and \(R_{F_1}\), respectively. The normalized distributions for the neutral utterance are shown by ‘o’ in Fig. 1e–h for the distributions of the neutral utterance in Fig. 1a–d, respectively.

To capture the variability due to emotions of a speaker, the distributions of features of an emotion utterance are normalized with respect to the neutral utterance as follows. The normalized distribution of \(E_{F_0}\) is given by:

$$\begin{aligned} N_{E_{F_0}}=\frac{{E_{F_0}}-{R_{m_{F_0}}}}{R_{\sigma _{F_0}}}. \end{aligned}$$
(3)

Similarly, the values of the normalized distributions \(N_{E_\mathrm{SoE}}\), \(N_{E_\mathrm{EoE}}\), \(N_{E_{\beta }}\) and \(N_{E_{F_1}}\) are obtained for \(E_\mathrm{SoE}\), \(E_\mathrm{EoE}\), \(E_{\beta }\) and \(E_{F_1}\), respectively. The normalized distributions for the emotional (anger) utterance are shown by ‘*’ in Fig. 1e–h for the distributions of the emotional (anger) utterance in Fig. 1a–d, respectively.

The normalization is done in a speaker-specific manner using the speaker’s neutral utterance. This helps in reducing the variability among different speakers.
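
Equations (2) and (3) amount to a z-score transform using the statistics of the speaker's reference neutral utterance, as in the following sketch (the feature arrays are assumed to hold one value per GCI or frame).

```python
import numpy as np

def normalize_with_neutral(reference_values, test_values):
    """Eqs. (2)-(3): normalize both distributions with the mean and standard
    deviation of the speaker's reference (neutral) feature values."""
    m, s = np.mean(reference_values), np.std(reference_values)
    return (reference_values - m) / s, (test_values - m) / s

# Example: normalize the F0 values of an anger utterance against the same
# speaker's neutral F0 values.
# n_ref_f0, n_test_f0 = normalize_with_neutral(neutral_f0, anger_f0)
```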

Four two-dimensional (2-D) feature distributions are formed by using the following combinations:

D1: (\(N_{E_{F_0}}\) versus \(N_{E_\mathrm{SoE}}\)),
D2: (\(N_{E_\mathrm{EoE}}\) versus \(N_{E_{F_0}}\)),
D3: (\(N_{E_\mathrm{EoE}}\) versus \(N_{E_\mathrm{SoE}}\)), and
D4: (\(N_{E_{\beta }}\) versus \(N_{E_{F_1}}\)).

Each of these 2-D feature distributions is modeled by a Gaussian distribution, represented by mean vector and covariance matrix.

The training and testing phases of the proposed emotion recognition system are as shown in Figs. 2 and 3, respectively.

Fig. 2 Training phase (template generation process) of the emotion recognition system

Fig. 3 Testing phase of the emotion recognition system

The training process involves the generation of templates. Reference templates are generated using three utterances for each of the four emotions (anger, happiness, sadness and neutral state) from seven speakers of the IIIT-H database. An additional neutral utterance from each of the seven speakers is also used. Therefore, a total of 91 utterances are used (\(3\times 4\times 7=84\) emotional utterances plus 7 neutral utterances). For each of the 84 utterances, four normalized distributions are generated using a neutral utterance of the corresponding speaker, and these distributions are called templates. Hence, \(84\times 4=336\) templates are created. Each stored template consists of the mean vector and covariance matrix of a normalized 2-D emotion distribution. As an illustration, the layout of the stored templates for two of the seven speakers of the IIIT-H database is shown in Fig. 4.

Fig. 4 An illustration of the stored templates for two speakers of the IIIT-H database. Utt1, Utt2 and Utt3 refer to Utterances 1, 2 and 3 of the corresponding emotion, and D1, D2, D3 and D4 refer to the normalized 2-D emotion distributions

For testing, a neutral utterance and an emotional utterance are collected from the test speaker to derive the normalized emotion features. For each test case, the distributions of the features of the emotional utterance are normalized with respect to the neutral utterance, as in Fig. 1e–h. The normalized features of the test utterance (test templates) are compared with each of the corresponding normalized features of the trained templates using the KL distance.

With three utterances for each of the four emotions per reference (trained) speaker, there are 12 utterances. For each utterance, there are four distributions (D1, D2, D3 and D4). Thus, for each test utterance, we get \(12\times 4=48\) KL distances per reference (trained) speaker. The KL distances for each 2-D feature pair and emotion category are averaged over the three utterances, giving a total of \(4\times 4=16\) averaged KL distances per speaker (for the four 2-D feature pairs and the four emotion categories). For a given 2-D feature pair, the lowest of the averaged KL distances across the four emotions determines the emotion label for that feature pair. Thus, we get four emotion labels for each reference speaker. Excluding the test speaker from the reference set, the six remaining reference speakers provide \(6\times 4=24\) emotion labels for a given test utterance.
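
The matching step can be sketched as follows. The data layout (nested dictionaries of stored mean/covariance pairs), the function names and the direction in which the asymmetric KL distance of Eq. (1) is evaluated (stored template as reference, test template as test) are assumptions made for illustration.

```python
import numpy as np

EMOTIONS = ["anger", "happiness", "sadness", "neutral"]
PAIRS = ["D1", "D2", "D3", "D4"]

def kl_from_params(mu0, sigma0, mu1, sigma1):
    """Eq. (1) evaluated directly from stored Gaussian parameters."""
    k = len(mu0)
    inv1 = np.linalg.inv(sigma1)
    diff = np.asarray(mu1) - np.asarray(mu0)
    quad = np.trace(inv1 @ sigma0) + diff @ inv1 @ diff
    return 0.5 * quad - 0.5 * (k + np.log(np.linalg.det(sigma0) / np.linalg.det(sigma1)))

def labels_from_reference_speaker(test_templates, speaker_templates):
    """Four emotion labels (one per 2-D feature pair) from one reference speaker.

    test_templates:    {pair: (mu, cov)} of the normalized test utterance.
    speaker_templates: {emotion: [ {pair: (mu, cov)}, ... ]} for 3 utterances.
    """
    labels = []
    for pair in PAIRS:
        avg_dist = {}
        for emotion in EMOTIONS:
            dists = [kl_from_params(tpl[pair][0], tpl[pair][1],
                                    test_templates[pair][0], test_templates[pair][1])
                     for tpl in speaker_templates[emotion]]
            avg_dist[emotion] = np.mean(dists)          # average over 3 utterances
        labels.append(min(avg_dist, key=avg_dist.get))  # lowest averaged KL distance
    return labels

# Pooling these labels over the six reference speakers (excluding the test
# speaker) gives the 24 labels used for the final decision.
```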

6 Results and Discussion

From the 24 emotion labels obtained for each test utterance, the emotion with the maximum number of labels is selected as the emotion category of the test utterance. The resulting confusion matrix is given in Table 9. Note that all the experiments are carried out with leave-one-speaker-out (LOSO) cross-validation.

Table 9 Confusion matrix for emotions using the maximum number of output emotion labels for the IIIT-H database

From the results given in Table 9, it is observed that the confusion between anger and happiness is high. A similar observation is made between sadness and neutral state. This is because features such as \(F_0\) show an increasing trend and SoE shows a decreasing trend for both anger and happiness when compared to neutral speech [16, 41]. In the case of sadness, these excitation features do not change remarkably compared to neutral state.

Fig. 5 Block diagram of the binary tree decision logic [27, 30]

In order to improve the performance, a two-stage binary decision logic [30] is implemented as shown in Fig. 5. In Stage 1, anger and happiness are grouped into one class, and sadness and neutral state are grouped into another class. The final decision on the emotion category is obtained in Stage 2, where comparisons are made between neutral state and sadness, and between anger and happiness, using the following decision criteria. Between neutral state and sadness, neutral state is chosen if the number of neutral labels > (the number of sadness labels \(+\) 3). This is because the features of neutral test utterances correlate more strongly with the reference neutral features than those of sad utterances do. This is also evident from the KL distances of the feature combinations given in Tables 6, 7 and 8. Similarly, between anger and happiness, anger is chosen if the number of anger labels > (the number of happiness labels \(+\) 2).
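
The two-stage decision on the pooled labels can be written compactly as below. The thresholds (+3 and +2) follow the text, whereas the Stage-1 tie-breaking rule (the low-activation branch wins on equal counts) is an assumption.

```python
from collections import Counter

def two_stage_decision(labels):
    """Two-stage binary decision on the pooled per-speaker emotion labels."""
    counts = Counter(labels)
    high = counts["anger"] + counts["happiness"]   # high-activation class
    low = counts["sadness"] + counts["neutral"]    # low-activation class
    if low >= high:                                # Stage 1: low-activation branch
        # Stage 2: neutral only if it clearly outnumbers sadness.
        return "neutral" if counts["neutral"] > counts["sadness"] + 3 else "sadness"
    # Stage 1: high-activation branch.
    # Stage 2: anger only if it clearly outnumbers happiness.
    return "anger" if counts["anger"] > counts["happiness"] + 2 else "happiness"

# Example: two_stage_decision(["anger"] * 14 + ["happiness"] * 6 + ["neutral"] * 4)
# returns "anger".
```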

Table 10 Confusion matrix after Stage 1 in binary tree decision logic for the IIIT-H database
Table 11 Confusion matrix after Stage 2 in binary tree decision logic for the IIIT-H database

The confusion matrices after Stage 1 and Stage 2 are given in Tables 10 and 11, respectively. From the results given in Table 10, the binary classification at Stage 1 gives an accuracy of 96%; the number of neutral/sad utterances recognized as angry/happy is reduced, and vice versa. This is in line with previous studies which have investigated acoustic features that are effective in discriminating emotions of high activation (anger, happiness) from emotions of low activation (sadness, boredom) [26, 30, 50]. From Table 11, it is observed that the confusion between anger and happiness remains high, whereas the ability to discriminate sadness and neutral state has improved. The recognition accuracy for neutral state, sadness, anger and happiness is 94.1%, 82.4%, 85.7% and 66.7%, respectively, giving an average recognition accuracy of 82.3% for the 4-class problem.

The proposed emotion recognition system was also evaluated using the EMO-DB database, and the results are given in Tables 12 and 13 after Stage 1 and Stage 2, respectively.

Table 12 Confusion matrix after Stage 1 in binary tree decision logic for the EMO-DB database
Table 13 Confusion matrix after Stage 2 in binary tree decision for the EMO-DB database
Table 14 Emotion recognition results obtained for EMO-DB with the proposed method and with the SVM classifier using baseline feature sets based on spectral features (MFCC [60], MSF [60] and PLP [60]), prosody features [60] and the combination of the excitation features and MFCCs

For the EMO-DB database, the recognition accuracy at Stage 1 is 98%, and the recognition accuracy for the 4-class problem after Stage 2 is 76%. The performance of the system on the EMO-DB database is lower because of confusions between anger and happiness. The proposed excitation features were compared with prosody features [60] and three short-term spectral features (mel-frequency cepstral coefficients (MFCCs) [60], perceptual linear predictive coefficients (PLPs) [25, 60] and modulation spectral features (MSFs) [60]) using an SVM classifier [8] with leave-one-speaker-out (LOSO) cross-validation [60]. Table 14 shows the emotion recognition results obtained using the baseline feature sets (MFCCs, PLPs, MSFs and prosody features) [60] with the SVM classifier, the results of the proposed system with the excitation features, and the results for the combination of the proposed excitation features with the MFCCs using the SVM classifier with LOSO cross-validation. From Table 14, it can be observed that the results obtained using the excitation features are comparable to or better than those of the existing prosody and spectral features (MFCCs, PLPs and MSFs). Furthermore, it can be observed that there is complementary information between the proposed excitation features and the MFCC features. In [24], a large number of feature sets (6552 features extracted using the openEAR toolkit) and various SVM schemes were used for a language-dependent and speaker-independent system on the EMO-DB database, and the study reported a recognition accuracy of 79.5%. It is to be noted that the focus of the present study is on the excitation features and their behavior in different emotions, rather than on unraveling which combination of feature toolkits and back-ends results in the best emotion recognition accuracy.

The proposed emotion recognition system is also used for online testing. For this, the speech utterances of each speaker are concatenated with the emotion labels kept intact. This is done for all the speakers in both databases. A neutral utterance of the corresponding speaker is used as a reference. The testing is carried out by processing 2-s buffers of the test speech signal. The confusion scores for the online testing with the IIIT-H and EMO-DB databases are shown in Tables 15 and 16, respectively.
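
Online testing applies the same pipeline to consecutive 2-s buffers of the concatenated signal, as in the following minimal sketch, where classify_buffer is a hypothetical stand-in for the template-matching and decision steps described above.

```python
def classify_stream(speech, fs, classify_buffer, buffer_s=2.0):
    """Run the emotion classifier on consecutive 2-s buffers of a test signal."""
    size = int(fs * buffer_s)
    decisions = []
    for start in range(0, len(speech) - size + 1, size):
        segment = speech[start:start + size]
        decisions.append(classify_buffer(segment, fs))  # one emotion label per buffer
    return decisions
```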

Table 15 Confusion matrix for the emotion classifier processing a 2-s speech buffer using the IIIT-H database
Table 16 Confusion matrix of the emotion classifier processing a 2-s speech buffer using the EMO-DB database

From the results given in Tables 15 and 16, the recognition accuracy for the IIIT-H and EMO-DB databases is 81.5% and 73.9%, respectively. Although there is some loss of suprasegmental information because of the 2-s speech buffering, there is not much reduction in performance, since the proposed features use only the sub-segmental information around the epochs. One reason for the reduction in performance is that not all segments of an utterance show similar distributions of emotional information. This is also evident from [4, 5, 7], where it was shown that the emotionally salient aspects of speech are important in the recognition and synthesis of emotional speech.

To test the effectiveness of the excitation features across languages, the reference templates created (i.e., trained) using the EMO-DB database are used to test data from the IIIT-H database and vice versa. From the results given in Tables 17 and 18, the recognition accuracy for the 4-class problem is about 68% in the former case and 61% in the latter. This indicates that, to some extent, language and cultural aspects of expressing vocal emotions affect the recognition of emotions. However, it is worth emphasizing that the extracted excitation features are independent of the lexical content.

Table 17 Confusion matrix of emotion classifier for training with the German EMO-DB database and testing with the IIIT-H Telugu database
Table 18 Confusion matrix of emotion classifier for training with the IIIT-H Telugu database and testing with the German EMO-DB database

The results of the proposed method indicate that the features corresponding to vocal effort seem to carry emotion-specific information. The performance of the system may be improved by increasing the number of reference (trained) templates and speakers. As there are confusions between anger and happiness, and between sadness and neutral state, deriving features which are more emotion specific may reduce the confusion between them.

7 Conclusions

In this paper, features corresponding to the speech excitation were studied for the analysis and recognition of vocal emotions. An emotion recognition system based on features related to the excitation component of speech production was developed by considering emotional states as deviations from neutral state. The deviations were captured through 2-D feature spaces. A template-based representation of the normalized 2-D feature distributions of emotions, using neutral speech as reference, was generated from training examples. The emotion recognition system uses reference templates derived from utterances in anger, happiness, sadness and neutral state. Although the system is speaker independent, a neutral speech utterance of the speaker is required for registration before testing, because of the variability of the dynamic ranges of the excitation features across speakers. One advantage of the proposed method is that it can be used to recognize emotions from short segments (2 s) of speech.

Ideally, an emotion recognition system should recognize the emotion category of speech without having access to neutral speech of the speaker; in this sense, the current study is limited. Existing emotion recognition systems have been developed mainly using features representing the vocal tract system characteristics. Since the present study demonstrates that the excitation features capture emotion-specific characteristics of speech effectively, it may be possible to combine features from the excitation and the vocal tract system to improve the overall performance of emotion recognition systems. In addition, exploring the relations among the excitation features in an emotion-specific way might help in developing more robust emotion recognition systems.