1 Introduction

Deep learning has been intensively applied to recent emotion recognition research and achieved considerable progress [23, 24, 27, 34, 36, 40, 41]. However, a well-known challenge underpinning deep learning-based emotion recognition approaches is the lack of labelled data to train deep models. Collecting and accurately labelling emotion categories for large-scale datasets is not only costly and time consuming, but also requires specific skills and knowledge [39]. Remedies for these limitations are being sought to exploit the full potential of deep learning techniques in emotion recognition, especially on limited and source-poor datasets.

To address the problem of data scarcity in emotion recognition, transfer learning has been widely adopted [11, 37]. As shown in the literature, existing transfer learning methods make use of a pre-trained model that has been well trained on a large-scale dataset and then fine-tune it on novel ones. Experimental results show that the knowledge captured by a pre-trained model on a source dataset can be transferred well to target ones via fine-tuning. However, these efforts have only attempted to transfer the emotional knowledge learnt across various datasets within a single domain. It has been shown that different domains, e.g., the visual and auditory domains, provide complementary information for understanding human emotion and thus could enrich emotion recognition models [11, 13, 23, 37]. However, transferring knowledge across domains, e.g., from the visual to the auditory domain and vice versa, is a challenging task. This kind of transfer learning poses an even greater challenge when the training and testing procedures are performed on different datasets; a considerable drop in performance is often observed due to distribution shift across the datasets [31]. Furthermore, fine-tuning a pre-trained model on an unbalanced, source-poor dataset may negatively affect both feature quality and classification accuracy, making the model even worse.

In this paper we address the challenge of transferring emotional knowledge across multiple domains and multiple source-poor datasets without the performance loss caused by distribution shift. Specifically, we formulate the transfer of emotional knowledge in a cross-domain setting as a simultaneous learning task over three supervisory channels: two emotion classifiers and one emotion matcher. The two emotion classifiers aim to accurately detect all types of emotions from different domains, whereas the emotion matcher matches pairs of samples via a contrastive-like loss to determine whether they correspond to the same emotion.

Feature learning across domains/tasks has been explored in previous work. For example, in the face recognition method by Sun et al. [29], facial features are learnt from two supervisory channels: face identification and face verification. Unlike this method, we take pairs of training samples from two different source-poor datasets rather than from the same rich dataset, which allows our method to effectively learn features across datasets. In [13], an intra-category common feature representation (IC) channel, an inter-category distinction feature representation (ID) channel, and an ICID fusion network are proposed to learn features for cross-dataset facial expression recognition. Our method differs from [13] in the following respects. First, features from the same emotion are matched via the Euclidean distance in [13], whereas we integrate the learning of those features into a unified framework using a contrastive-like loss. Second, the IC, ID, and ICID fusion networks in [13] are trained sequentially, whereas the three supervisory channels in our method are trained jointly, making the training process end-to-end. Additionally, we show that proper application of normalisation can further boost the performance of cross-domain transfer learning. As shown in the experimental results, the proposed method effectively learns cross-domain features and generalises well across different disjoint datasets. In summary, we make the following contributions.

  • We investigate a relatively unexplored problem: how to effectively train an emotion recognition model, in an end-to-end fashion, on multiple multi-domain datasets where some or all of the datasets are source-poor. We address this problem by proposing a joint deep cross-domain learning method that jointly performs emotion recognition and emotion matching in a multi-domain, multi-dataset setting.

  • We study the impact of normalisation in cross-domain transfer learning. Specifically, we investigate the Group Normalisation technique proposed in [32] with different mini-batch sizes for emotion recognition. We report an interesting finding that Group Normalisation achieves very competitive recognition accuracy with small mini batches of size 2. We found that using small batch sizes did not negatively affect cross-domain transfer, yet significantly reduced the memory consumed in training.

  • We conduct extensive experiments, implementing various emotion recognition baselines on the visual and auditory domains as well as cross-domain transfer learning with various off-the-shelf backbones, to validate the proposed method.

The remainder of this paper is organised as follows. Section 2 briefly reviews related work. Section 3 describes our proposed method. Sections 4 and 5 present experiments and results respectively. Section 6 concludes our paper and discusses future work.

2 Related work

In this section, we review existing research on three aspects relevant to our work: fine-tuning, cross-domain transfer learning, and deep model training on multiple datasets. Progress and open issues in each aspect are also discussed.

Fine-tuning::

Transfer learning has been intensively adopted in emotion recognition from visual data [20, 37]. The most common approach is to fine-tune a pre-trained model such as ResNet or AlexNet on specific visual emotion datasets. This approach is inspired by the self-taught learning mechanism in [7], and aims to exploit rich representations learnt from a source dataset to improve the generalisation ability of the model on a target dataset. An advantage of this approach is its ability to alleviate the over-fitting that occurs when training from scratch with a small amount of training data. For example, in [20], an emotion recognition model was fine-tuned from a model pre-trained on ImageNet. In [37], a 3D convolutional network, encoding both spatial and temporal information, was initially trained on a large-scale video dataset and subsequently fine-tuned on a much smaller emotion dataset to learn both audio and visual features. However, it is also well known that fine-tuning may worsen a pre-trained model when applied to source-poor datasets.

Cross-domain transfer::

Cross-domain transfer was investigated in [11, 13]. Specifically, the authors of [11] initially trained their model on the Large-scale Subtle Emotions and Mental States in the Wild database, then transferred the learnt knowledge to a traditional (non-subtle) expression dataset. Similarly, the pre-trained model in [13] was trained on two different domains and fine-tuned by fusing the pre/post-trained models with a classification loss. In [1], facial features learnt from a face image dataset were transferred to the speech domain using distillation [5]. ResNet-50 was adopted in [14] to build an emotion recognition framework, and deep supervision was then imposed on the framework to further enhance its recognition performance. Modality-specific and shared generative adversarial networks were introduced in [33] for cross-modal retrieval. An open issue with cross-domain transfer is how to ensure that the transfer is applied only to relevant and domain-invariant features.

Training on multiple datasets::

While exploiting cross-domain transfer was a key component in [11] to overcome insufficient data, that work also aimed to address the domain shift problem between multiple datasets. Specifically, distribution alignment was adopted in [11] to leverage tasks including subtle facial expression recognition and landmark detection on disjoint datasets. It was pointed out in [35] that a straightforward combination of multiple datasets cannot lead to any improvement in recognition performance due to the bias and inconsistency in the annotation of the datasets and the large amount of unlabelled data. To address this issue, an Inconsistent Pseudo Annotations to Latent Truth (IPA2LT) scheme was proposed in [35]. In this scheme, each sample was initially assigned more than one label, either manually or automatically through prediction. An end-to-end LTNet was then developed to discover the latent truth from input face images and inconsistent pseudo labels. In [2], a training scheme with dual objectives was proposed, including multi-domain image synthesis for data augmentation and domain adaptation for transferring visual information between different domains while preserving facial characteristics (e.g., identity). In this method, an expression recognition model learnt from the source domain could generalise to images from the target domain without re-training on the target domain. However, it remains unclear how to define training batches containing samples from multiple datasets; inappropriate batch design may deteriorate the performance of transfer learning due to domain shift and/or imbalance in domain-target datasets.

3 Proposed method

For convenience in describing our proposed joint cross-domain transfer learning method, we illustrate its pipeline in Section 3.1. Section 3.2 presents the pre-processing of visual and audio data from multiple data sources. Our joint learning algorithm is then described in Section 3.3.

3.1 Pipeline

Our method is designed to learn emotional knowledge across the visual and auditory domains, and to transfer the cross-domain knowledge from a source dataset to multiple source-poor datasets of a target domain.

Let M be a pre-trained model that has been well trained on a source dataset D. In our experiments, M is a Convolutional Neural Network (CNN). Let \(D^{\mathcal {S}_{v}}\) be a dataset in the target domain, e.g., the visual domain. We first transfer M to the target domain by fine-tuning M on \(D^{\mathcal {S}_{v}}\) to obtain a model \(M^{\mathcal {S}_{v}}\). To incorporate audio knowledge, we perform cross-domain transfer of \(M^{\mathcal {S}_{v}}\) on an auditory dataset \(D^{\mathcal {S}_{a}}\), resulting in a cross-domain model \(M^{\mathcal {S}_{v,a}}\). The cross-domain model \(M^{\mathcal {S}_{v,a}}\) is finally fine-tuned on N different datasets in the target domain, denoted as \(D^{\mathcal {T}_{v}}_{1}\), ..., \(D^{\mathcal {T}_{v}}_{N}\), to obtain a cross-dataset fine-tuned model \(M^{\mathcal {S},\mathcal {T}_{v}}\). To transfer the common knowledge shared by all target datasets, the final cross-dataset model \(M^{\mathcal {S},\mathcal {T}_{v}}\) is fine-tuned simultaneously on the multiple target datasets. The pipeline of our method is illustrated in Fig. 1.

Fig. 1
figure 1

Our joint cross-domain transfer learning. Given a pre-trained model, we first transfer the pre-trained model on a domain-target dataset. We then enrich the model by cross-domain transfer on different modalities. Finally, we simultaneously fine-tune the model on multiple source-poor domain-target datasets

The source dataset D chosen to train the initial model M is a large-scale dataset and is assumed to contain a sufficient amount of annotated data. The choice of domain for building the pre-trained model depends on the availability of domain data. Visual data is usually more accessible than audio data; the model can therefore learn better on the visual domain, making it more useful for later transfer learning.

The emotional knowledge captured by \(M^{\mathcal {S}_{v}}\) is learnt from the target domain and can be re-used in cross-domain transfer. The reason we conduct this cross-domain transfer, i.e., transferring the learnt emotional knowledge from the visual domain to the auditory domain prior to carrying out joint learning, is to enrich \(M^{\mathcal {S}_{v}}\) with complementary features from both the visual and auditory domains. The resulting cross-domain model \(M^{\mathcal {S}_{v,a}}\) can therefore accumulate useful cross-domain emotional knowledge, which is then transferred to multiple target datasets using our proposed joint learning algorithm described in Section 3.3.
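For concreteness, the staged schedule in Fig. 1 can be sketched in Python as follows; the helper functions and dataset handles are illustrative placeholders standing in for ordinary training loops rather than the actual implementation.

```python
def pretrain(model, source_data):
    """Train the backbone M from scratch on the large-scale source dataset D."""
    return model  # placeholder body

def fine_tune(model, data, trainable="fc"):
    """Fine-tune either the fully-connected head only or all layers on one dataset."""
    return model  # placeholder body

def joint_fine_tune(model, target_datasets):
    """Algorithm 1: simultaneous fine-tuning on N disjoint target datasets."""
    return model  # placeholder body

def run_pipeline(M, D, D_Sv, D_Sa, target_sets):
    M = pretrain(M, D)                              # stage 1: pre-train M on D
    M_Sv = fine_tune(M, D_Sv, trainable="fc")       # stage 2: adapt to the target (visual) domain
    M_Sva = fine_tune(M_Sv, D_Sa, trainable="all")  # stage 3: cross-domain transfer to audio
    return joint_fine_tune(M_Sva, target_sets)      # stage 4: joint learning on D_1^T, ..., D_N^T
```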

3.2 Data pre-processing

Video stream::

Given a video stream, image frames are first extracted from the stream. Facial regions are then detected in each image frame using the improved version of the Viola-Jones algorithm in [22]. The detected facial regions are finally resized to 64 × 64 × 3 (as 3 colour channels are used) for further processing.
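As an illustration, this step can be sketched with OpenCV's stock Haar-cascade detector standing in for the improved Viola-Jones detector of [22]; the detector parameters shown are illustrative rather than the exact settings used.

```python
import cv2

def extract_faces(video_path, size=(64, 64)):
    """Extract frames from a video, detect the largest face per frame,
    and resize the facial crop to 64 x 64 x 3."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    faces = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(boxes) == 0:
            continue
        x, y, w, h = max(boxes, key=lambda b: b[2] * b[3])  # keep the largest detected face
        crop = cv2.resize(frame[y:y + h, x:x + w], size)    # 64 x 64 x 3 facial region
        faces.append(crop)
    cap.release()
    return faces
```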

Audio Stream::

Given an audio stream, we adopt the method in [36] to extract three channels from the log Mel-spectrograms of the input audio stream. Specifically, Mel-spectrogram segments of size F × T × C are calculated from the input audio signal, where F, T, and C denote the number of Mel-filter banks, the segment length corresponding to the number of frames in a context window, and the number of channels of the Mel-spectrogram, respectively. In our experiments, we set F = 64, T = 64, and C = 3. We select these values for F and T to match the counterpart pre-processed video data for cross-domain transfer. The three channels (i.e., C = 3) of the Mel-spectrograms are the static, delta, and delta-delta coefficients. Next, we convert 64 Mel-filter banks spanning 20 to 8000 Hz into a log Mel-spectrogram using 25 ms Hamming windows with a 10 ms shift between two consecutive windows. A context window of 64 frames (length of 10 ms × 63 + 25 ms = 655 ms) is then applied to the whole log Mel-spectrogram to extract static 2-D Mel-spectrogram segments (64 × 64) with an overlap of 30 frames.
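A minimal sketch of this pipeline is given below; librosa, the 16 kHz sampling rate, and the log offset are illustrative assumptions, while the window, hop, Mel-bank, and segmentation parameters follow the description above.

```python
import numpy as np
import librosa

def mel_segments(wav_path, sr=16000, n_mels=64, context=64, overlap=30):
    """Compute log Mel-spectrogram segments with static, delta and delta-delta
    channels (F = T = 64, C = 3)."""
    y, sr = librosa.load(wav_path, sr=sr)
    win = int(0.025 * sr)                      # 25 ms Hamming window
    hop = int(0.010 * sr)                      # 10 ms shift between windows
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=win, win_length=win, hop_length=hop,
        window="hamming", n_mels=n_mels, fmin=20, fmax=8000)
    log_mel = np.log(mel + 1e-6)               # static channel
    delta1 = librosa.feature.delta(log_mel, order=1)
    delta2 = librosa.feature.delta(log_mel, order=2)
    feats = np.stack([log_mel, delta1, delta2], axis=-1)   # shape (64, frames, 3)

    step = context - overlap                   # 64-frame windows, 30-frame overlap
    return [feats[:, t:t + context, :]         # each segment is 64 x 64 x 3
            for t in range(0, feats.shape[1] - context + 1, step)]
```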

3.3 Joint learning

Our proposed joint learning algorithm aims to simultaneously fine-tune \(M^{\mathcal {S}_{v,a}}\) on N target datasets \(D^{\mathcal {T}_{v}}_{1}\), ..., \(D^{\mathcal {T}_{v}}_{N}\). Technically, the joint learning algorithm solves a joint optimisation problem: minimising intra-class emotion variance while maximising inter-class emotion variance on the target datasets. For simplicity and ease of presentation, we consider the case N = 2 with target datasets in the visual domain. However, our method is general and can be applied to any number of target datasets, in either the visual or auditory domain.

For notational convenience and without ambiguity, we hereafter use \(D_{1}^{\mathcal {T}}\) and \(D_{2}^{\mathcal {T}}\) in place of \(D_{1}^{\mathcal {T}_{v}}\) and \(D_{2}^{\mathcal {T}_{v}}\), and \(M^{\mathcal {S},\mathcal {T}}\) in place of \(M^{\mathcal {S},\mathcal {T}_{v}}\). Suppose that \(M^{\mathcal {S}_{v,a}}\) has been well trained on the visual and auditory source datasets \(D^{\mathcal {S}_{v}}\) and \(D^{\mathcal {S}_{a}}\). We now adapt \(M^{\mathcal {S}_{v,a}}\) to two target datasets \(D_{1}^{\mathcal {T}}\) and \(D_{2}^{\mathcal {T}}\) to achieve the final cross-dataset fine-tuned model \(M^{\mathcal {S},\mathcal {T}}\). Let Θ = {𝜃f,𝜃c} be the set of learnable parameters of \(M^{\mathcal {S},\mathcal {T}}\), where 𝜃f denotes the parameters used for cross-domain feature learning (i.e., the parameters in the layers before the fully-connected layers in the CNN architecture) and 𝜃c denotes the parameters used for classification (i.e., the parameters in the fully-connected layers and the soft-max layer).

Let x denote an image/audio signal and lx denote its corresponding emotion label, lx ∈ {1,...,L}. Let fx = Conv(x,𝜃f) be the feature vector of the signal x generated by the CNN of the model \(M^{\mathcal {S},\mathcal {T}}\) with parameters 𝜃f. The cross-entropy loss for fine-tuning the model \(M^{\mathcal {S},\mathcal {T}}\) is defined as:

$$ \mathcal{L}_{c}(f_{x}, l_{x}, \theta_{c}) = -\sum\limits_{l=1}^{L} p_{l}(f_{x})\log(\hat{p}_{l}(f_{x}))=-\log(\hat{p}_{l_{x}}(f_{x})) $$
(1)

where pl(fx) is the target probability distribution of fx, i.e., pl(fx) = 1 if l = lx and 0 otherwise, and \(\hat {p}_{l}(f_{x})\) is the probability distribution predicted by the model \(M^{\mathcal {S},\mathcal {T}}\). The parameters in 𝜃c are randomly initialised and then learnt during pre-training.

Conventionally, when fine-tuning a pre-trained model on multiple target datasets, one often tunes 𝜃c only, while freezing some or even all parameters in 𝜃f [18]. However, we argue that 𝜃f can be tuned effectively by fine-tuning the pre-trained model with pairs of training samples drawn simultaneously from different datasets. Specifically, let x1 and x2 be two signals sampled from \(D_{1}^{\mathcal {T}}\) and \(D_{2}^{\mathcal {T}}\) respectively, i.e., \(x^{1} \in D_{1}^{\mathcal {T}}\) and \(x^{2} \in D_{2}^{\mathcal {T}}\). We define a new loss \({\mathscr{L}}_{f}\) which measures the similarity between two feature vectors \(f_{x^{1}}=\text {Conv}(x^{1},\theta _{f})\) and \(f_{x^{2}}=\text {Conv}(x^{2},\theta _{f})\). The loss \({\mathscr{L}}_{f}\) takes a contrastive-like form and follows the principle that two feature vectors of the same emotion should have high similarity. In particular, we define:

$$ \mathcal{L}_{f}(f_{x^{1}}, f_{x^{2}}, l_{x^{1}}, l_{x^{2}}, \theta_{f})= \begin{cases} \frac{1}{2}\|f_{x^{1}}-f_{x^{2}}\|_{2}^{2} & \text{if } l_{x^{1}}=l_{x^{2}},\\ \frac{1}{2}\max(0, m-\|f_{x^{1}}-f_{x^{2}}\|_{2})^{2} & \text{otherwise} \end{cases} $$
(2)

where m is a pre-defined margin.
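The loss in (2) admits a direct implementation; the following TensorFlow sketch operates on a batch of feature pairs, and the default margin value shown is illustrative.

```python
import tensorflow as tf

def matching_loss(f1, f2, l1, l2, margin=1.0):
    """Contrastive-like loss of Eq. (2): pull features of the same emotion
    together, push features of different emotions at least `margin` apart."""
    d = tf.norm(f1 - f2, axis=-1)                       # Euclidean distance per pair
    same = tf.cast(tf.equal(l1, l2), tf.float32)        # 1 if l_{x^1} == l_{x^2}, else 0
    pos = 0.5 * tf.square(d)                            # same-emotion term
    neg = 0.5 * tf.square(tf.maximum(0.0, margin - d))  # different-emotion term
    return tf.reduce_mean(same * pos + (1.0 - same) * neg)
```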

Note that fine-tuning using a contrastive loss has also been explored in existing works, e.g., [19, 29]. However, unlike previous methods, our joint learning algorithm takes (x1,x2) from two different datasets \(D_{1}^{\mathcal {T}}\) and \(D_{2}^{\mathcal {T}}\).

Finally, we define the joint loss of our joint learning algorithm as,

$$ \mathcal{L}(x^{1}, x^{2}, {\Theta}) = \sum\limits_{n=1}^{2} \mathcal{L}^{n}_{c}(f_{x^{n}}, l_{x^{n}}, \theta_{c}) + \lambda \mathcal{L}_{f}(f_{x^{1}}, f_{x^{2}}, l_{x^{1}}, l_{x^{2}}, \theta_{f}) $$
(3)

where \({\mathscr{L}}^{n}_{c}(f_{x^{n}}, l_{x^{n}}, \theta _{c}), n=1,2\) is defined in (1), and λ is a hyper-parameter.

Intuitively, the optimisation in (3) aims to maximise inter-class variations while minimising intra-class variations. In other words, our joint learning algorithm learns to correctly classify signals into their emotion categories while embedding signals of the same emotion into similar feature vectors. We adopt stochastic gradient descent to solve (3). Our joint learning algorithm is summarised in Algorithm 1.

Algorithm 1
figure a

Joint learning algorithm
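To make Algorithm 1 concrete, one joint update implementing the loss in (3) might look like the following TensorFlow sketch. It assumes a model that returns both the feature vector and the soft-max probabilities, reuses the matching_loss sketch above, and uses illustrative hyper-parameter defaults; it is not the exact implementation.

```python
import tensorflow as tf

cce = tf.keras.losses.SparseCategoricalCrossentropy()   # cross-entropy channel, Eq. (1)

@tf.function
def joint_step(model, optimiser, x1, l1, x2, l2, lam=0.01, margin=1.0):
    """One SGD update of Algorithm 1 on paired mini batches from D_1^T and D_2^T."""
    with tf.GradientTape() as tape:
        f1, p1 = model(x1, training=True)                    # features and class probabilities
        f2, p2 = model(x2, training=True)
        loss = cce(l1, p1) + cce(l2, p2)                     # two classification channels
        loss += lam * matching_loss(f1, f2, l1, l2, margin)  # emotion-matching channel, Eq. (2)
    grads = tape.gradient(loss, model.trainable_variables)
    optimiser.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```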

To generate paired samples for the joint learning algorithm, all possible pairs of data samples from the two datasets \(D_{1}^{\mathcal {T}}\) and \(D_{2}^{\mathcal {T}}\) could be considered. However, this approach would create a huge number of pairs and consequently slow down the fine-tuning process. We instead generate paired samples from mini batches. Specifically, at each iteration of the joint learning, we take two mini batches of size K, one from each target dataset (i.e., \(D_{1}^{\mathcal {T}}\) and \(D_{2}^{\mathcal {T}}\)). We then create all possible pairs of samples from these two batches, resulting in K2 pairs per iteration. We have empirically found that our model learns better when all pairs from the two mini batches are used at each iteration.
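The K × K pairs can be formed by pairing every index of one mini batch with every index of the other, as in the following illustrative TensorFlow sketch.

```python
import tensorflow as tf

def all_pairs(x1, l1, x2, l2):
    """Form all K x K cross-dataset pairs from two mini batches of size K."""
    k = tf.shape(x1)[0]
    i, j = tf.meshgrid(tf.range(k), tf.range(k), indexing="ij")
    i, j = tf.reshape(i, [-1]), tf.reshape(j, [-1])          # K^2 index pairs
    return (tf.gather(x1, i), tf.gather(l1, i),
            tf.gather(x2, j), tf.gather(l2, j))
```

With the mini-batch size of 2 used in our experiments, this yields only four pairs per iteration, and the resulting aligned tensors can be fed directly to the joint update sketched above.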

Batch Normalisation (BN) [12] performs well only with large batches, as it requires enough data samples to estimate its statistics. In contrast, we have observed that Group Normalisation (GN) [32] achieves similar performance with batch sizes varying from 2 to 512. Following this finding, we exploit GN with a mini-batch size of 2 in our experiments. We note that GN with such a small mini-batch size does not affect the transfer learning process, since GN computes its statistics per sample, independently of the batch size.
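The following sketch illustrates GN applied after a convolutional layer with a mini batch of size 2; it assumes the Keras GroupNormalization layer available in recent TensorFlow releases (older releases provide it via TensorFlow Addons), and the group count shown is illustrative.

```python
import tensorflow as tf

# GN normalises over channel groups within each sample, so its statistics
# do not depend on the batch size.
block = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, padding="same"),
    tf.keras.layers.LeakyReLU(),
    tf.keras.layers.GroupNormalization(groups=8),   # behaves the same with batch size 2 or 512
])

y = block(tf.random.normal([2, 64, 64, 3]))         # mini batch of size 2 -> output (2, 64, 64, 64)
```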

4 Experiments

In this section, we present the datasets used in our experiments (Section 4.1), the detailed implementation of our method, and the evaluation and comparison protocols (Section 4.2).

4.1 Datasets

We evaluated our proposed joint cross-domain transfer learning for emotion recognition on several benchmark datasets: eNTERFACE [17], SAVEE [6], EMO-DB [3], and RAVDESS [15]. Other datasets exist for emotion recognition from audio and/or visual data, e.g., CREMA-D [4]. However, the chosen datasets have several advantages for showcasing cross-domain transfer learning, including public availability, varying scale (from source-rich to source-poor data), and recency.

eNTERFACE

dataset [17] consists of 1,283 video sequences recorded from 44 subjects, of whom 77% are male and 23% female. Subjects were asked to express six discrete emotions: anger, disgust, fear, happiness, sadness, and surprise.

SAVEE

dataset [6] was recorded from four native English male speakers, postgraduate students and researchers at the University of Surrey aged 27 to 31 years. All subjects were required to speak and express seven discrete emotions: anger, disgust, fear, happiness, sadness, surprise, and neutral. The dataset contains 120 utterances per speaker.

EMODB

dataset [3] is an acted speech corpus containing 535 emotional utterances covering seven acted emotions: anger, disgust, fear, happiness, sadness, surprise, and neutral. The emotions were simulated by five male and five female professional native German-speaking actors, who were asked to read ten predefined German sentences (five long and five short) drawn from everyday communication in the targeted emotions.

RAVDESS

dataset [15] is a multimodal dataset of emotional speech and song. The dataset is gender balanced and contains 24 professional actors vocalising lexically-matched statements in a neutral North American accent. Speech emotions include sad, angry, calm, happy, fearful, disgust, and surprise expressions, while song emotions consist of calm, happy, sad, angry, and fearful expressions. Each expression is produced at two levels of emotional intensity, with an additional neutral expression. All conditions are available in voice-only, face-only, and face-and-voice formats. The dataset includes 7,356 recordings, each rated 10 times on emotional validity, intensity, and genuineness. Ratings were provided by 247 individuals representative of untrained research participants from North America, and an additional set of 72 participants provided test-retest data. High levels of emotional validity and test-retest intra-rater reliability are reported.

Note that our chosen datasets are diverse in data domain (i.e., audio vs. visual) and content (i.e., emotion types), and contain data source-based variations (i.e., language-based expressions). Furthermore, the datasets also vary in the characteristics of their speech signals. For example, the amplitude of the audio signals varies from 6000 (RAVDESS) to 30000 (EMODB), and the frequency range, depending on emotion types, varies from [0, 5000 Hz] (RAVDESS) to [0, 8000 Hz] (EMODB). The datasets are therefore well suited to demonstrating the capabilities of cross-domain transfer learning.

4.2 Implementation details

VGG-16 [28] was used as the backbone of the pre-trained model M and its succeeding models, i.e., \(M^{\mathcal {S}_{v}}\), \(M^{\mathcal {S}_{v,a}}\), and \(M^{\mathcal {S},\mathcal {T}_{v}}\), in our experiments. The network included four convolutional layers with 64, 128, 256, and 512 3 × 3 filters, respectively, each with stride 1. The output of each convolutional layer was activated using leaky rectified linear units (LReLU) [16] before being normalised by GN [32]. The network used four fully-connected layers with 512, 128, 32, and 8 (for the RAVDESS dataset) or 6 (for the other datasets) hidden units, respectively. An LReLU activation function was also employed after each fully-connected layer.
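A Keras sketch of this backbone is given below. The pooling and dropout placement and the group count are assumptions where the description does not pin them down; the model returns both the feature vector and the class probabilities, as consumed by the joint loss in Section 3.3.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_backbone(num_classes=6, groups=8):
    """VGG-style backbone sketched from the description above."""
    inputs = tf.keras.Input(shape=(64, 64, 3))
    x = inputs
    for filters in (64, 128, 256, 512):                 # four conv layers, 3x3 filters, stride 1
        x = layers.Conv2D(filters, 3, strides=1, padding="same")(x)
        x = layers.LeakyReLU()(x)                       # LReLU before GN
        x = layers.GroupNormalization(groups=groups)(x)
        x = layers.MaxPooling2D(2)(x)                   # assumed down-sampling between layers
        x = layers.Dropout(0.5)(x)
    x = layers.Flatten()(x)
    for units in (512, 128, 32):                        # fully-connected layers with LReLU
        x = layers.Dense(units)(x)
        x = layers.LeakyReLU()(x)
    features = x                                        # used by the matching loss, Eq. (2)
    probs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, [features, probs])
```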

For pre-training, we applied the variance scaling initialisation method proposed in [38] to initialise the weights of the convolutional layers in the backbone network. GN [32] and Dropout (0.5) were used in all convolutional layers. We trained the model M for 100,000 iterations with a learning rate of \(10^{-5}\). We adapted M to the visual domain to obtain \(M^{\mathcal {S}_{v}}\) by fine-tuning only the fully-connected layers in M, since both M and \(M^{\mathcal {S}_{v}}\) are trained on the same (visual) domain. For cross-domain transfer, we fine-tuned the entire network of \(M^{\mathcal {S}_{v}}\) to obtain \(M^{\mathcal {S}_{v,a}}\). For cross-dataset transfer, i.e., fine-tuning \(M^{\mathcal {S}_{v,a}}\) using the proposed joint learning in Algorithm 1 with the loss function in (3), we set λ = 0.01 and the learning rate to \(10^{-6}\). The final model \(M^{\mathcal {S},\mathcal {T}_{v}}\) was fine-tuned for 20,000 iterations with the mini-batch size fixed at 2. These parameters were determined empirically.
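The stage-dependent freezing described above can be expressed, for example, as in the following sketch; treating Dense layers as 𝜃c and all remaining layers as 𝜃f is an assumed convention for the backbone sketched earlier, not the exact implementation.

```python
import tensorflow as tf

def set_trainable(model, stage):
    """Freeze/unfreeze layers per fine-tuning stage."""
    for layer in model.layers:
        if stage == "within_domain":   # M -> M^{S_v}: fine-tune the fully-connected layers only
            layer.trainable = isinstance(layer, tf.keras.layers.Dense)
        else:                          # cross-domain and joint stages: fine-tune the entire network
            layer.trainable = True
```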

For each experiment, we compared our method with relevant existing works. For example, we adopted the method in [25] as a baseline on the eNTERFACE and SAVEE datasets. For fair comparison, we applied the same evaluation settings used in the baselines. Specifically, 10-fold cross-validation was used for evaluations on the eNTERFACE and SAVEE datasets, 3-fold cross-validation was adopted for experiments on the RAVDESS dataset, and Leave-One-Subject-Out cross-validation was applied for evaluations on the EMODB dataset. All experiments in this paper were implemented in TensorFlow and conducted on 10 computing nodes with a total of 3,780 64-bit Intel Xeon cores.

5 Results

Following the pipeline presented in Fig. 1, we present the experiments and results of our proposed cross-domain transfer learning in stages. First, we report the performance of the pre-trained model M and its domain-target fine-tuned model \(M^{\mathcal {S}_{v}}\) (in the same domain) in Section 5.1. Second, we present the results of cross-domain transfer learning between the visual and auditory domains in Section 5.2. Finally, we show the effectiveness of the joint learning on multiple datasets in Section 5.3.

5.1 Emotion recognition via fine-tuning

We investigate emotion recognition using visual data from various data sources. In this case study, we evaluate the pre-trained model M in two scenarios: M is pre-trained and tested on the same data source, and on different data sources with fine-tuning. Specifically, in the first scenario, we initially trained M from scratch on the training set of eNTERFACE (i.e., D), and then tested it on the test set of eNTERFACE. We performed a similar experiment with eNTERFACE and SAVEE swapped as the source dataset D. In the second scenario, we fine-tuned M on the training set of the visual data of SAVEE (i.e., \(D^{\mathcal {S}_{v}}\)) to obtain the domain fine-tuned model \(M^{\mathcal {S}_{v}}\), which was then tested on the test set of the visual data of SAVEE. Note that, at this stage, we only fine-tuned the fully-connected layers in M. We investigated domain fine-tuning in two cases: fine-tuning on the entire training set, and on one part of the training set of a target domain. For example, we took one part (of 10 parts in total) of eNTERFACE to perform domain fine-tuning; one part of eNTERFACE contains 9,456 samples (i.e., images) and is much smaller than the training set of visual SAVEE, which includes 67,888 samples. Again, we conducted the same experiment with eNTERFACE and SAVEE swapped as \(D^{\mathcal {S}_{v}}\). We present the results of these experiments and of other existing methods in Table 1.

Table 1 Emotion recognition using visual data on various data sources

As shown in Table 1, training and testing on the same data source always achieves the best performance. Fine-tuning also shows its importance, e.g., the fine-tuned models drop only a few percentage points of performance on the target domain datasets. However, while the fine-tuned models work quite well on the datasets that were used to fine-tune them, their performance falls back on the source datasets used for initial training after fine-tuning. We discuss this phenomenon in detail in Section 5.3.

Table 1 also shows that, despite being fine-tuned on only one part of the training data of the target domain, the domain fine-tuned models \(M^{\mathcal {S}_{v}}\) achieve competitive recognition performance compared with existing methods; e.g., the method in [25] was trained on 9 parts of eNTERFACE.

The experimental results also show that our pre-trained models M achieve state-of-the-art performance, with 96.0% and 99.3% emotion recognition accuracy on eNTERFACE and visual SAVEE, respectively. To gain insight into these results, per-emotion performance metrics including precision, recall, and F1-score of the pre-trained models M on eNTERFACE and visual SAVEE are presented in Table 2. The confusion matrices of M are also reported in Fig. 2.

Table 2 Detailed performance of the pre-trained models M on various data sources
Fig. 2
figure 2

Confusion matrices of the pre-trained models M on various data sources

5.2 Cross-domain transfer

As described in Fig. 1, before joint learning we perform cross-domain transfer to enhance \(M^{\mathcal {S}_{v}}\) with complementary features from both the visual and auditory domains. For cross-domain transfer, we fine-tuned all the layers in \(M^{\mathcal {S}_{v}}\) to learn cross-domain features, obtaining the cross-domain transfer model \(M^{\mathcal {S}_{v,a}}\). We show the benefit of cross-domain transfer on two datasets, the audio part of SAVEE and EMODB, in Table 3. As shown in the results, compared with training and testing on the same domain (e.g., M trained and tested on the audio part of SAVEE), cross-domain transfer significantly improves the recognition accuracy. In addition, the cross-domain transfer model \(M^{\mathcal {S}_{v,a}}\) achieves very promising recognition accuracy (97.7% on the audio part of SAVEE), ranking first among all baselines. The benefit of cross-domain transfer is also clearly demonstrated on the EMODB dataset. To further investigate the results, we report detailed performance metrics and confusion matrices of the different models in Table 4 and Fig. 3, respectively.

Table 3 Emotion recognition using audio data on various data sources
Table 4 Detailed performance of the cross-domain model \(M^{\mathcal {S}_{v,a}}\) on various data sources
Fig. 3
figure 3

Confusion matrices of the cross-domain model \(M^{\mathcal {S}_{v,a}}\) on various data sources

We also compared our cross-domain transfer model \(M^{\mathcal {S}_{v,a}}\) with other cross-domain methods built on off-the-shelf models, including ResNet [8] (and its variants), Inception [30], MobileNet [10], and meta transfer learning [21]. As presented in Table 5, our model achieves the lowest errors when fine-tuned on either eNTERFACE or visual SAVEE. Moreover, while ResNet [8], Inception [30], and MobileNet [10] over-fit (shown as ‘-’ in the table) when fine-tuned on eNTERFACE, our model performs steadily and achieves the lowest error (0.136).

Table 5 Error rates of our cross-domain transfer model \(M^{\mathcal {S}_{v,a}}\) and existing cross-domain transfer models when fine-tuned on eNTERFACE and visual SAVEE

As shown in the literature, cross-domain transfer is usually applied from the visual to the audio domain rather than in the reverse direction, for several reasons. First, emotional annotation of speech data is extremely expensive, and most datasets consist of elicited or acted speech. Second, due to the subjective nature of emotions, labelled datasets often suffer from annotator disagreement, as well as from the use of varying labelling schemes (i.e., dimensional or categorical labelling) that require careful alignment. Finally, cost and time constraints often result in datasets with low speaker diversity, which makes speaker adaptation difficult. On the other hand, it is more feasible to obtain large-scale facial video datasets, and recent advances in deep learning have made it possible to automatically map faces to emotion labels that consistently match a pool of human annotators. Our aim is therefore to transfer emotional knowledge from source-rich data (e.g., large-scale facial data) to source-poor data (speech data) to address the insufficient-data issue in deep learning. Nevertheless, fine-tuning from the audio modality to the visual modality to improve visual emotion recognition should also be examined when a large-scale multimodal emotion dataset is available. We carried out this set of experiments on the RAVDESS dataset and report the results in Table 6.

Table 6 Results of cross-domain transfer from audio RAVDESS to visual RAVDESS vs training and testing on visual RAVDESS

As shown in the results, the role of cross-domain transfer is clearly confirmed. Specifically, cross-domain transfer from audio RAVDESS to visual RAVDESS boosts the recognition accuracy by 4.6% compared with training and testing a recognition model only on visual RAVDESS. Detailed performance metrics and confusion matrices of cross-domain transfer from audio RAVDESS to visual RAVDESS are presented in Table 7 and Fig. 4.

Table 7 Detailed performance of cross-domain transfer from audio RAVDESS to visual RAVDESS
Fig. 4
figure 4

Confusion matrix of cross-domain transfer from audio RAVDESS to visual RAVDESS

5.3 Joint learning

As previously mentioned, transfer learning poses a greater challenge when training and testing are performed on different datasets: a significant drop in performance is observed due to distribution shift across the datasets. To demonstrate this issue, we conducted the following experiment. We tested the pre-trained model M and the domain fine-tuned model \(M^{\mathcal {S}_{v}}\) on two datasets of the target domain: \(D^{\mathcal {T}_{v}}_{1}\) and \(D^{\mathcal {T}_{v}}_{2}\). Recall that M is trained only on the source dataset D, while \(M^{\mathcal {S}_{v}}\) is initially trained on D and then fine-tuned on \(D^{\mathcal {S}_{v}}\). In this experiment, we adopted the training set of eNTERFACE as D due to the large scale of this dataset. We report the recognition performance of M and \(M^{\mathcal {S}_{v}}\) in Table 8. As shown by the recognition results, both M and \(M^{\mathcal {S}_{v}}\) drop significantly in performance when tested on data sources other than their training sets.

Table 8 Results of our proposed joint learning and other baselines on eNTERFACE and visual SAVEE

We also performed another test in which the pre-trained model M was trained on a training set created from the training data of both eNTERFACE and visual SAVEE. We observed that the model M trained in this setting (the 3rd row in Table 8) slightly degrades in performance compared with the one trained only on eNTERFACE (the 1st row in Table 8). This clearly shows that simply training a model on a training set combined from various sources does not guarantee improvement on any source and may worsen the model overall. Moreover, if there is imbalance between the training sets, the model can be biased towards the source containing more training data. For example, since eNTERFACE is much larger than visual SAVEE, the model M trained on both eNTERFACE and visual SAVEE performs similarly to the one trained only on eNTERFACE. More importantly, exhaustively re-training a model from scratch to adapt it to a new data source is not always practical, as this type of training requires expensive computational resources and takes a long time to complete.

As can be seen from Table 8, the cross-dataset model \(M^{\mathcal {S},\mathcal {T}_{v}}\) trained using our proposed joint learning algorithm significantly improves the overall performance. Specifically, our cross-dataset model \(M^{\mathcal {S},\mathcal {T}_{v}}\) achieves 66% and 94% emotion recognition accuracy on visual SAVEE and eNTERFACE respectively, and 81.6% overall accuracy, the best performance among all baselines. Moreover, it is worth noting that our joint learning is carried out by fine-tuning the pre-trained model with a small mini-batch size of 2 rather than training from scratch. The joint learning algorithm therefore saves a large amount of training time and memory without sacrificing emotion recognition accuracy.

6 Discussion and conclusion

Human emotion understanding from multi-source, multi-domain data is an important research problem in multimedia signal processing and analysis. Thanks to the proven power of deep learning in various applications, deep models have been adopted for human emotion understanding. In addition, transferring deep models in a cross-domain setting is crucial for scaling deep models up to handle multi-source, multi-domain data. In this paper, we address this research problem by proposing a joint deep cross-domain transfer learning method that learns emotional knowledge from multiple disjoint emotion datasets across the visual and auditory domains using both a cross-entropy loss and a contrastive-like loss. By integrating cross-domain transfer via fine-tuning, our proposed framework successfully transfers emotional knowledge learnt between different modalities. To the best of our knowledge, our joint learning algorithm is the first study addressing the issue of learning on multiple emotion datasets. We verified the effectiveness of our framework and its superiority over existing works in visual and speech emotion recognition on various benchmark datasets.

Although the experimental results confirmed the contribution of transferred emotional knowledge to improving the overall emotion recognition performance, it is still unclear how much of this knowledge is retained in the framework to produce such improvement. In addition, there is a semantic gap in transferring visual features (extracted from facial images) to the auditory domain, where features are extracted from spectrograms, and vice versa. We believe these are important research questions and will consider them in our future work. We also believe that our technique could be beneficial for cross-language transfer tasks.