Meta-transfer learning for emotion recognition

Deep learning has been widely adopted in automatic emotion recognition and has led to significant progress in the field. However, due to insufficient training data, pre-trained models are limited in their generalisation ability, leading to poor performance on novel test sets. To mitigate this challenge, transfer learning performed by fine-tuning pre-trained models on novel domains has been applied. However, fine-tuning may overwrite and/or discard important knowledge learnt by the pre-trained models. In this paper, we address this issue by proposing a PathNet-based meta-transfer learning method that is able to (i) transfer emotional knowledge learnt from one visual/audio emotion domain to another domain and (ii) transfer emotional knowledge learnt from multiple audio emotion domains to another audio emotion domain to improve overall emotion recognition accuracy. To show the robustness of our proposed method, extensive experiments on facial expression-based emotion recognition and speech emotion recognition are carried out on three benchmark data sets: SAVEE, EMO-DB, and eNTERFACE. Experimental results show that our proposed method achieves superior performance compared with existing transfer learning methods.


Introduction
Human emotions manifest in facial expressions, voice, gestures, and posture. An accurate emotion recognition system based on one or a combination of these modalities would be useful in various applications including surveillance, medicine, robotics, human-computer interaction, affective computing, and automobile safety (Nguyen et al., 2017). Researchers in this area have focused mainly on facial expression recognition to build reliable emotion recognition systems. This is still a challenging problem since very subtle emotional changes manifested in the facial expression could go undetected (Nguyen et al., 2017). Recently, several approaches based on deep learning techniques have contributed to progress in this area (Fan et al., 2016; Abbasnejad et al., 2017; Hasani and Mahoor, 2017).
In addition to the facial expression stream, speech signals, which are regarded as one of the most natural media of human communication, carry both the explicit linguistic content and the implicit paralinguistic information expressed by a speaker (Zhang et al., 2018). Owing to this rich information, over the last two decades numerous studies have been devoted to the automatic and accurate detection of human emotions from speech signals (Zhang et al., 2018). Speech emotion recognition presently plays an essential role in a wide range of applications such as automobile safety, surveillance, human-computer interaction, and robotics, and is attracting a great deal of attention within the affective computing research community (Nguyen et al., 2018).
To develop solutions for speech emotion recognition, a number of methodologies have been proposed, in which researchers have primarily used hand-engineered features based on acoustic and paralinguistic information (Shen et al., 2011).
Nevertheless, such hand-designed features seem not to be discriminative enough to boost the performance of speech emotion recognition (Zhang et al., 2018). Recently, algorithms based on deep learning techniques, which are capable of automatically learning features and of modelling high-level information, have become the focus of most research and are gaining prominence.
Although the aforementioned deep learning approaches have made a great contribution to progress in emotion recognition, we point out a key issue that plagues the advancement of emotion recognition research: the lack of sufficient quantities of annotated emotion data. This issue has become more critical with the advent of deep learning techniques, which promise major improvements in emotion recognition accuracy in both single- and multi-modal settings; yet, because deep learning techniques require large amounts of training data, their full potential cannot be exploited for emotion recognition while annotated emotion data remain scarce. Transfer learning methods, which commonly fine-tune pre-trained/off-the-shelf CNN models on emotion datasets, have been widely investigated to overcome this problem. However, representational features that are unrelated to emotion are still retained in off-the-shelf/pre-trained models, and the extracted features are also vulnerable to identity variations, which degrades the performance of emotion recognition systems that fine-tune off-the-shelf/pre-trained models on an emotion dataset.
To resolve these drawbacks, Gideon et al. (2017) exploited the progressive network originally proposed by Rusu et al. (2016), which can support transferring knowledge across sequences of tasks. Gideon et al. (2017) successfully transferred learning between three paralinguistic tasks (emotion, speaker, and gender recognition), with an emphasis on speech emotion detection as the target application, without the catastrophic forgetting effect. Their system outperformed recent speech emotion recognition approaches that fine-tune pre-trained models and also performed significantly better than deep learning models without transfer learning (Rusu et al., 2016). Nonetheless, an unavoidable limitation of this approach is that it is computationally intensive, since the number of networks keeps growing with the number of new tasks to be learned (Lee et al., 2017). More recently, to alleviate these downsides, Fernando et al. (2017) proposed PathNet as an alternative learning algorithm for transfer learning. PathNet was designed as a neural network in which agents (i.e., pathways through different layers of the neural network) are embedded to discover which parts of the network to re-use for new tasks (Fernando et al., 2017). Agents are also responsible for determining which subset of parameters is used and updated for subsequent learning. During learning, pathways through the neural network are selected for replication and mutation by a tournament selection genetic algorithm proposed by Harvey (2011). The parameters along an optimal path evolved on the source task are fixed, and a new population of paths is re-evolved for the destination task. This manner of transfer learning enables the destination task to be learned faster than learning from scratch or fine-tuning, and has therefore succeeded in several supervised learning classification problems (Fernando et al., 2017).
Motivated by the success of PathNet in those applications, in this paper we explore utilizing PathNet for facial expression-based and speech-based emotion recognition. We first investigate how effective PathNet is in transferring emotional knowledge from one visual emotion domain to another visual emotion domain to improve overall performance. Building on the experience gained with the knowledge-transfer ability of PathNet within the visual domain in our previous work (Nguyen et al., 2018), we next investigate whether similar techniques can be used for transferring knowledge within the speech domain. Specifically, we investigate the use of PathNet for speech emotion recognition (i) by exploring how well emotional knowledge learned from one speech emotion dataset can be transferred to another speech emotion dataset, and (ii) by examining how well emotional knowledge learned from multiple speech emotion datasets can be transferred to a single speech emotion dataset.
The contributions of our paper are as follows:
• We introduce a novel transfer learning approach for the emotion recognition task that utilizes PathNet to deal with the problem of insufficient annotated emotion data as well as the catastrophic forgetting issue commonly experienced with traditional transfer learning techniques.
• We confirm, through experimental results, that our proposed system has significant potential to detect emotions accurately and succeeds in transferring learned knowledge between different emotion datasets, as well as in transferring learned emotional knowledge from multiple speech emotion datasets to a single speech emotion dataset.
• We conduct various sets of within-corpus experiments on three commonly used benchmark emotion datasets (EMO-DB, eNTERFACE, and SAVEE) and show that the performance of our proposed transfer learning approach for emotion recognition exceeds that of recent state-of-the-art transfer learning schemes based on fine-tuning pre-trained models.
The remainder of this paper is organized as follows: Section 2 describes related research; Section 3 presents our proposed system; Section 4 reports our experimental results; and Section 5 concludes the paper.

Related work
A number of studies using transfer learning approaches have recently been proposed for facial expression and speech emotion recognition. Since the literature on facial expression recognition using deep learning and transfer learning methods was reviewed and discussed in our previous work, the interested reader is referred to that work for detailed discussion and analysis; this section focuses only on speech emotion recognition.

Speech Emotion Recognition using Deep Learning Techniques
Deep learning techniques have emerged as powerful solutions in a wide variety of applications, including natural language processing and computer vision, owing to their inherent capability of directly learning a hierarchical feature representation from the input data (LeCun et al., 2015). Inspired by their success in multiple fields, many deep learning approaches have been investigated for speech emotion recognition. Kim et al. (2017a) proposed a novel architecture for speech emotion recognition in which long short-term memory (LSTM), fully convolutional neural network (FCN), and convolutional neural network (CNN) components were combined with the aim of extracting local invariant features from the spectral domain. With this combination, long-term dependencies were well captured, thereby making utterance-level features more discriminative (Kim et al., 2017a). Moreover, by embedding identity skip-connection techniques in their temporal architecture, the proposed system avoided the over-fitting problem caused by training on small amounts of data (Kim et al., 2017a). In another study, to handle the large mismatch between training and testing data, Kim et al. (2017b) proposed utilizing multi-task learning and investigated gender and naturalness as auxiliary tasks of their proposed system. This can significantly enhance the generalization capability of speech emotion recognition models.
The experimental results, evaluated in within-corpus and cross-corpus scenarios, showed good performance of their proposed system.
In other approaches, Nguyen et al. (2017, 2018) proposed learning spatio-temporal features with C3Ds from audio and video for multimodal emotion recognition. Kim et al. (2017c) also proposed three-dimensional convolutional neural networks (C3Ds) to address the challenge of modeling the spectro-temporal dynamics for speech emotion recognition by simultaneously extracting short-term and long-term spectral features with a moderate number of parameters. Sahu et al. (2017) exploited adversarial auto-encoders, focusing on (i) compressing high-dimensional vectors encoding emotional utterances into low-dimensional vectors (referred to as code vectors) without sacrificing discriminability when classifying the original vectors, and (ii) generating synthetic samples with the adversarial auto-encoder, which were subsequently used for emotion classification. Their system concentrated on detecting emotions from utterance-level features rather than frame-level ones. Badshah et al. (2017) proposed a speech emotion recognition system in which spectrograms were initially extracted from speech signals at different frequencies. These spectrograms were subsequently fed into a CNN for emotion prediction. Similarly, Chang and Scherer (2017) addressed emotional valence in human speech by directly learning from spectrograms of emotional speech using a CNN. To further improve the performance, this architecture was extended to a deep convolutional generative neural network trained in an unsupervised fashion.
Researchers have also explored whispered speech emotion recognition, where different feature transfer learning approaches have been investigated using shared-hidden-layer auto-encoders, extreme learning machine auto-encoders, and denoising auto-encoders (Deng et al., 2017a). The key ideas of these approaches were to develop a transformation for automatically capturing useful features hidden in the data and to transfer the knowledge from the target domain, i.e. testing (whispered speech), to the source domain, i.e. training (normal phonated speech), which brings a great benefit in optimizing all parameters with support from the test set. Extensive experiments were conducted, focusing entirely on binary classification (i.e. valence/arousal) (Deng et al., 2017a). In another study, Deng et al. (2017b) also pointed out that many speech emotion recognition systems perform poorly when there are significant differences between training and test speech arising from variations in linguistic content, speaker accents, and domain/environmental conditions. To further improve such systems under mismatched training and testing conditions, a novel unsupervised domain adaptation algorithm was introduced, trained by simultaneously learning discriminative information from labeled data and incorporating prior knowledge from unlabeled data into the learning.

Speech Emotion Recognition using Transfer Learning Techniques
However, recent studies of emotion recognition have been hindered by the lack of large databases for learning (Zhang et al., 2017; Kaya et al., 2017). To address the lack of large emotion datasets, fine-tuning of pre-trained models has recently been widely investigated for emotion recognition (Kaya et al., 2017; Zhang et al., 2018; Kim et al., 2017a,b,c; Latif et al., 2018), in which CNN architectures were pre-trained on the generic ImageNet dataset and fine-tuned on emotion datasets (Ng et al., 2015; Kaya et al., 2017). To learn audio features, Zhang et al. (2017) used C3D models pre-trained on large-scale image and video classification datasets and then fine-tuned them on emotion recognition tasks. To improve the performance of a speech emotion recognition system under challenging conditions such as cross-corpus and cross-language scenarios, Latif et al. (2018) proposed a transfer learning technique using deep belief networks (DBNs). The experimental results, evaluated on five different corpora in three different languages, demonstrated the robustness of their system. These results also indicated that using a large number of languages and a small part of the target data during training could dramatically improve emotion recognition accuracy.
However, since they validated the system on five datasets that were annotated differently, their system only addressed binary classification of positive/negative valence.
In more recent studies, Zhang et al. (2018) reconfirmed that low-level hand-engineered features seem not to be discriminative enough to recognize subjective emotions. To address this disadvantage, Zhang et al. (2018) proposed extracting three channels of log Mel-spectrograms (static, delta, and delta-delta, corresponding to red, green, and blue in the RGB model of images) from segments over all utterances, and the pre-trained AlexNet model was then fine-tuned on those extracted features. A discriminant temporal pyramid matching technique was subsequently combined with optimal Lp-norm pooling, before a linear support vector machine was used to classify the final speech emotion score. Although this architecture demonstrated sufficient performance on discrete speech emotion recognition, the system was unable to address the continuous emotion recognition task (Zhang et al., 2018). Nguyen et al. (2020) proposed a joint deep cross-domain transfer learning method for emotion recognition that was able to jointly transfer the knowledge learned from rich datasets to source-poor datasets. Moreover, as discussed earlier, all of these fine-tuning-based systems (Zhang et al., 2018, 2017; Ng et al., 2015; Kaya et al., 2017) still have the drawbacks previously alluded to, such as discarding previously learned information, which were detailed by Rusu et al. (2016). To mitigate these drawbacks, Gideon et al. (2017) introduced a learning algorithm using the progressive networks proposed by Rusu et al. (2016) to effectively transfer knowledge captured from one emotion dataset to another. Although this approach handled the above-mentioned limitations with some success, its computational cost, which keeps increasing as new tasks are added, makes it less applicable to emotion recognition in real-world applications.

Proposed Methodology
Our main goal in this paper is to investigate techniques to improve the accuracy of deep learning-based emotion recognition, which is hindered by the non-availability of large annotated emotion datasets. We achieve this goal by using PathNet, a recently proposed technique, for transfer learning. We propose a novel transfer learning approach that adopts PathNet for the facial expression emotion recognition task and the speech emotion recognition task. Our proposed system is illustrated by a simple block diagram consisting of two main blocks: an input pre-processing block (for video and speech), followed by our proposed PathNet block together with the output classification block (as illustrated in Fig. 3). The video and audio streams are initially pre-processed. For the video stream, we use a Viola-Jones-based algorithm (Nguyen et al., 2017) to extract all face regions from both the SAVEE and eNTERFACE (Martin et al., 2006) datasets. This is described in detail in Section 3.1. For the audio stream, we extract three channels of log Mel-spectrograms (static, delta, and delta-delta, corresponding to red, green, and blue) from segments over all utterances from eNTERFACE (Martin et al., 2006), SAVEE (Haq and Jackson, 2010), and EMO-DB (Burkhardt et al., 2005); these steps are described in detail in Section 3.2. The extracted video and audio features are subsequently fed into our PathNet to classify a final facial expression and speech emotion score, respectively. To the best of our knowledge, the use of PathNet has not previously been investigated for dealing with the dearth of suitable emotion databases for the development of emotion recognition systems. The feature extraction procedures and our PathNet architecture are described in more detail in the following subsections.
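The three-channel audio input (static, delta, and delta-delta) can be sketched as follows. This is a minimal illustration only, assuming a log Mel-spectrogram is already available as a list of frames, each a list of Mel-band values. The regression-based delta formula, the window half-width `n=2`, the edge handling, and the function names `delta` and `three_channels` are assumptions made for this sketch and are not settings reported in this paper.

```python
def delta(frames, n=2):
    """Regression-based time derivative of a feature sequence.

    frames: list of frames, each a list of per-band values.
    Edges are handled by repeating the first/last frame (an assumption).
    """
    denom = 2 * sum(k * k for k in range(1, n + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for b in range(len(frames[t])):
            acc = 0.0
            for k in range(1, n + 1):
                nxt = frames[min(t + k, last)][b]
                prv = frames[max(t - k, 0)][b]
                acc += k * (nxt - prv)
            row.append(acc / denom)
        out.append(row)
    return out


def three_channels(log_mel):
    """Stack static, delta, and delta-delta features as three 'image' channels."""
    d1 = delta(log_mel)
    d2 = delta(d1)
    return [log_mel, d1, d2]
```

Treating the three feature maps as the channels of an image is what allows image-style networks to consume the spectrogram segments directly.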

Video pre-processing
All frames are initially extracted from the visual signal for further steps. Since such extracted frames still contain considerable redundant information for emotion detection, we extract only the face regions using the simple algorithm (Nguyen et al., 2017) as follows:
1. All bounding boxes containing face regions in each frame are extracted using the Viola-Jones algorithm (Viola and Jones, 2004), and a face region is then detected.
2. In cases where the Viola-Jones algorithm detects no face, or more than one face, the location of the previously detected face region is used.
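The fallback rule in the two steps above can be sketched as follows. The sketch assumes that a face detector has already been run on every frame and has returned a list of `(x, y, w, h)` boxes per frame; the function name `select_face_boxes` is hypothetical, and returning `None` for frames that precede the first unambiguous detection is an assumption, since that case is not specified here.

```python
def select_face_boxes(detections_per_frame):
    """Pick one face box per frame, reusing the previous box when detection is ambiguous.

    detections_per_frame: list (one entry per frame) of lists of (x, y, w, h) boxes.
    """
    selected = []
    previous = None  # no face seen yet (assumed behaviour: emit None)
    for boxes in detections_per_frame:
        if len(boxes) == 1:
            previous = boxes[0]
        # zero or several detections: keep the previously detected location
        selected.append(previous)
    return selected
```

For example, a frame with no detection between two frames with single detections inherits the box of the earlier frame.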
Therefore, in order to boost the performance of speech emotion recognition system, instead of exploiting such hand-crafted features, (Zhang et al., 2018)

PathNets
Our PathNet architecture and its settings follow PathNet (Fernando et al., 2017), which was used there to conduct experiments on CIFAR (Krizhevsky and Hinton, 2009) and SVHN (Netzer et al., 2011). The following sections describe the architecture in detail and explain how we train our system and how learned emotional knowledge is transferred between emotion datasets.

PathNet architecture
Our PathNet consists of L = 3 layers, each containing M = 20 modules of 20 neurons each. Each module is itself a small neural network consisting of linear units followed by a transfer function (rectified linear units are adopted). For each layer, the outputs of the modules of that layer are averaged before being fed into the active modules of the subsequent layer. A module is active if it appears in the path genotype currently being evaluated (shown in Fig. 3 (b)). A maximum of 4 distinct modules per layer is typically allowed in a pathway. The final layer is not shared between tasks; each task being learned has its own final layer (Fernando et al., 2017).
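A forward pass through such a modular network can be sketched as follows. This is a minimal pure-Python illustration using the sizes stated above (3 layers, 20 modules per layer, 20 units per module, at most 4 distinct active modules per layer). The weight initialisation, the input size, the averaging over active modules only, and all function names are assumptions made for the sketch; the task-specific final layer is omitted.

```python
import random

L, M, N_UNITS, MAX_ACTIVE = 3, 20, 20, 4


def make_module(n_in, n_out, rng):
    """One module: a linear map (weights, bias) later followed by ReLU."""
    return {
        "w": [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)],
        "b": [0.0] * n_out,
    }


def module_forward(module, x):
    # linear units followed by a rectified linear transfer function
    return [
        max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + b)
        for row, b in zip(module["w"], module["b"])
    ]


def make_pathnet(n_in, rng):
    layers = []
    for layer in range(L):
        fan_in = n_in if layer == 0 else N_UNITS
        layers.append([make_module(fan_in, N_UNITS, rng) for _ in range(M)])
    return layers


def forward(layers, path, x):
    """path: one list of module indices per layer (the genotype of a pathway)."""
    for modules, genes in zip(layers, path):
        active = sorted(set(genes))  # duplicates in the genotype count once
        assert 1 <= len(active) <= MAX_ACTIVE
        outs = [module_forward(modules[i], x) for i in active]
        # average the outputs of the active modules before the next layer
        x = [sum(vals) / len(outs) for vals in zip(*outs)]
    return x  # a task-specific final layer would be applied to this vector
```

Only the modules named in the genotype are evaluated, so changing the genotype changes which parameters take part in the computation without changing the network itself.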

Pathway Evolution and Transfer Learning Approach
The system is first trained on one emotion dataset (source data) for a fixed number of generations, with the goal of finding an optimal pathway using the binary tournament selection algorithm proposed by Harvey (2011) (see Fig. 1), which is responsible for eliminating bad configurations, mutating good ones, and subsequently training them further. This pathway is then fixed: its parameters may no longer be modified, while the remaining parameters, which do not appear in this best-fit path, are reinitialized and then trained/evolved again on another emotion dataset (destination data). Through this knowledge-transfer approach, the destination data can be learned faster than by learning from scratch or by fine-tuning. The performance measure of our proposed system is the recognition accuracy achieved after this fixed training time. Evidence of positive transfer is given by a better final recognition accuracy on the destination data than that achieved by learning from scratch.
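The hand-over from source to destination data can be sketched as follows. The sketch assumes that each module's parameters are stored as a flat list of weights indexed by `[layer][module]`; the function name `prepare_for_destination`, the Gaussian reinitialisation, and its `scale` are hypothetical choices for illustration rather than the settings used in our experiments.

```python
import random


def prepare_for_destination(params, best_path, rng, scale=0.1):
    """Keep the modules on the best source pathway and reinitialise all others.

    params[layer][module] is a flat list of weights.
    best_path[layer] lists the module indices on the best pathway.
    Returns the new parameters and the set of frozen (layer, module) pairs,
    which a training loop would exclude from further updates.
    """
    frozen = {(l, m) for l, mods in enumerate(best_path) for m in mods}
    new_params = []
    for l, layer in enumerate(params):
        new_layer = []
        for m, weights in enumerate(layer):
            if (l, m) in frozen:
                new_layer.append(list(weights))  # reused unchanged
            else:
                new_layer.append([rng.gauss(0, scale) for _ in weights])
        new_params.append(new_layer)
    return new_params, frozen
```

A new population of pathways is then evolved on the destination data; those pathways may pass through frozen modules, which is how the source knowledge is reused.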
When PathNet is trained on source data, a population of genotypes is first randomly generated (see Fig. 4). In each generation, two pathways are randomly selected to train on source data (see Fig. 5). Only two pathways are selected because binary tournament selection is used to choose the pathways. The better of the two is the winner, and the winner is then mutated: each element of its genotype is selected with equal probability 1/(4×3) (see Fig. 6 and Fig. 7), and a new random integer from the range [-2, 2] is added to the current value of the selected element. We repeat step 2 (Fig. 5) and step 3 (Fig. 6) for a fixed number of generations. When training on source data is complete, we obtain the best pathway (see Fig. 8).
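The evolutionary loop described above can be sketched as follows. This is a simplified illustration, not our training code: `fitness` is a hypothetical callable standing in for "train the pathway for a while and measure its accuracy", the loser of each tournament is overwritten by a mutated copy of the winner (one common reading of binary tournament selection, assumed here), and mutated module indices wrap around modulo the number of modules, which is also an assumption.

```python
import random

L, M, N = 3, 20, 4  # layers, modules per layer, genes (modules) per layer in a path


def random_genotype(rng):
    return [[rng.randrange(M) for _ in range(N)] for _ in range(L)]


def mutate(genotype, rng):
    """Mutate each gene independently with probability 1/(N*L)."""
    p = 1.0 / (N * L)
    child = []
    for layer in genotype:
        new_layer = []
        for gene in layer:
            if rng.random() < p:
                # add a random integer from [-2, 2]; wrap-around is an assumption
                gene = (gene + rng.randint(-2, 2)) % M
            new_layer.append(gene)
        child.append(new_layer)
    return child


def evolve(fitness, generations, pop_size, rng):
    population = [random_genotype(rng) for _ in range(pop_size)]
    for _ in range(generations):
        a, b = rng.sample(range(pop_size), 2)  # binary tournament
        fa, fb = fitness(population[a]), fitness(population[b])
        winner, loser = (a, b) if fa >= fb else (b, a)
        population[loser] = mutate(population[winner], rng)
    return max(population, key=fitness)
```

With 4 genes per layer and 3 layers, the mutation probability of 1/(4×3) means that on average one gene changes per mutated genotype.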
The parameters in this best-fit pathway are fixed and reused for training on the destination data, and the rest of the parameters are randomly reinitialized. When training PathNet on the destination data, the procedures for this training stage are sim-  by only reading textual explanation that is presented in this section. Therefore, to make our PathNet-based transfer learning approach more understandable, we have added further explanation of the progression towards the best pathway by visualizing every step in a corresponding figure. To make these figures easy to follow step by step, we simplify the architecture by drawing only ten modules in each layer, with up to two modules activated in each layer and a population of four randomly initialized pathways. A further set of figures visualizes how the best pathway is obtained for one particular set of experiments; these figures show the exact parameters and architecture used in that set of experiments. We believe that this explanation is more comprehensive than that of the original PathNet paper (Fernando et al., 2017), which did not illustrate these steps using figures.

Experiments & Results
Dataset Details: The eNTERFACE dataset (Martin et al., 2006) is an audio-visual dataset which has 44 subjects and includes a total of 1293 video sequences, of which 23% were recorded by women and 77% by men. The subjects were asked to express 6 discrete emotions: anger, disgust, fear, happiness, sadness, and surprise (Martin et al., 2006).
The SAVEE dataset (Haq and Jackson, 2010) is an audio-visual dataset recorded from four native British male speakers, higher-degree researchers (aged from 27 to 31 years) at the University of Surrey. All of them were required to speak and express seven discrete emotions: anger, disgust, fear, happiness, sadness, surprise, and neutral. The dataset comprises 120 utterances per speaker, resulting in a total of 480 sentences (Haq and Jackson, 2010).
The EMO-DB dataset (Burkhardt et al., 2005) is an acted speech corpus containing 535 emotional utterances with seven different acted emotions: disgust, anger, happiness, neutral, sadness, boredom, and fear. These emotions were simulated by five male and five female professional native German-speaking actors, who produced five long and five short German sentences used in daily communication. The actors were asked to read predefined sentences in the seven targeted emotions. The audio files, which are on average around 3 seconds long, were recorded in an anechoic chamber with high-quality recording equipment at a sampling rate of 16 kHz with 16-bit resolution and a mono channel (Zhang et al., 2018).
Performance Measures: We evaluate our proposed system on all three datasets: SAVEE, eNTERFACE, and EMO-DB. We apply k-fold cross-validation: the original training data is randomly divided into k equal parts, one of which is held out as validation data for testing the model while the other k-1 parts are used as training data. The cross-validation process is then repeated 5 times. We also apply the leave-one-subject-out cross-validation protocol, which means that we conduct N experiments if the dataset consists of N subjects. For each experiment, N-1 subjects are included in the training set and the remaining subject is used as the test set, so all of our sets of experiments are carried out in a subject-independent manner. Apart from these, Weighted Averaged Recall (WAR) has recently become a widely used standard measure for evaluating the performance of emotion recognition systems (Zhang et al., 2018). In addition, Unweighted Averaged Recall (UAR) (Eyben et al., 2016; Eyben, 2016; Zhang et al., 2018) has also been widely adopted because it reflects imbalance between emotional classes (Zhang et al., 2018). Therefore, in order to compare fairly with some recent state-of-the-art speech emotion recognition schemes that use the same measures (WAR, UAR), in this paper we also compute and compare both Weighted Averaged Recall and Unweighted Averaged Recall to validate the performance of our proposed system. Table 1 gives the detailed formulas for calculating these evaluation measures.
Table 1: Different measures for multi-class classification. For class C_i, tp_i is the number of true positives, fp_i the number of false positives, fn_i the number of false negatives, and tn_i the number of true negatives; l is the number of classes (Sokolova and Lapalme, 2009).
Measure Formula Evaluation focus
Unweighted Averaged Precision (UAP)
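The two recall-based measures can be computed from a confusion matrix as sketched below. The sketch assumes the standard definitions: WAR is the overall accuracy (equivalently, per-class recall weighted by class frequency), and UAR is the plain mean of the per-class recalls tp_i / (tp_i + fn_i). The function name `war_uar` is hypothetical, and every class is assumed to have at least one test sample.

```python
def war_uar(confusion):
    """Compute (WAR, UAR) from a confusion matrix.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    war = correct / total  # recall weighted by class frequency
    recalls = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]
    uar = sum(recalls) / n_classes  # every class counts equally
    return war, uar
```

On an imbalanced test set the two values can differ considerably: a large, well-recognised class raises WAR, while UAR is pulled down by any poorly recognised class regardless of its size.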
As our baseline systems, we use the methodology proposed by Zhang et al. (2018) and an off-the-shelf CNN (AlexNet). Additionally, we implement further baseline systems, each of which is trained from scratch on a single emotion dataset using PathNet.
The main focus of this paper is to address the lack of emotion data for deep learning techniques, as introduced in the Abstract and discussed further in the Introduction. To be consistent with this problem, we choose small emotion datasets to show that our proposed methodology performs robustly on such limited data. Furthermore, large-scale datasets such as CK+, RAF-DB, and AffectNet can only be used for facial expression recognition, which is not the main task of this paper; our main concern is the issue of insufficient data for speech emotion recognition.
We conduct various sets of experiments using our proposed system in examining

Method WAR
Fine-tuned AlexNet (Zhang et al., 2018) 0.88
V eNTER 0.88
as follows: the parameters in the best-fit pathway, which is evolved on the source data, are fixed; the rest of the parameters are randomly reinitialized; and a new population of 20 genotypes is randomly initialized and then trained/evolved further on the destination data.

Experimental Results
In this section, we report, analyze, and compare experimental results of our proposed system on aforementioned sets of experiments.In all the experiments, we have taken meticulous care to ensure that the test data is never used for training.
In the first two sets of experiments, we show the robustness of our proposed system under the condition of insufficient data for facial expression recognition. Our proposed system is initially trained on visual eNTERFACE and then evolved on visual SAVEE, and vice versa. Since the SAVEE dataset contains an additional emotion (neutral) compared to the eNTERFACE dataset, only the emotions shared by the two datasets (anger, surprise, disgust, fear, happiness, and sadness) are detected in these two sets of experiments, for consistency. The experimental results of our proposed system on this evaluation are shown in the first two rows of Table 3.
Visual eNTERFACE → Visual SAVEE: As shown in Table 3, our proposed system (V eNTER→SAV), which transfers the emotional knowledge from visual eNTERFACE to visual SAVEE, achieves 94% facial expression recognition accuracy in terms of WAR. This accuracy is 5% and 9% higher than those achieved by our facial expression recognition baseline systems: V SAV, which is trained on visual SAVEE from scratch, and the system fine-tuning AlexNet proposed by (Zhang et al., 2018), respectively (see Table 4).
Visual SAVEE → Visual eNTERFACE: To further show the efficiency of our proposed system in solving the issue of insufficient facial expression data, we explore transferring the emotional knowledge from visual SAVEE to visual eNTERFACE. The experimental results are shown in the second row of Table 3. In this setting, our proposed system also performs significantly better than our facial expression recognition baseline systems, achieving the best facial expression recognition accuracy in terms of WAR (94%), which is 6.5% and 6.35% better than those obtained by our facial expression recognition baseline systems: V eNTER, which is trained on visual eNTERFACE from scratch, and the system fine-tuning AlexNet proposed by (Zhang et al., 2018), respectively.
Table 6: Results of our proposed system when transferring the emotional knowledge from audio eNTERFACE to audio SAVEE in comparison with the best baseline speech emotion recognition system.
Method WAR
AlexNet (Zhang et al., 2018) 0.69
A SAV 0.81
To depict the performance for individual emotions of these two systems (V eNTER→SAV and V SAV→eNTER), their confusion matrices are illustrated in Fig. 14 (a) and Fig. 14 (b), respectively. We also visualize the robust performance of each emotion of these two systems by plotting their ROC curves, which are illustrated in Fig.  6. It can be seen that our proposed system (A eNTER→SAV, with the best speech emotion recognition accuracy of 85%) demonstrates significantly superior performance over the speech emotion recognition system proposed by (Zhang  7 that our proposed system (97%) performs significantly better than the transfer learning approach based on pre-trained/fine-tuning models (Zhang et al., 2018) by 14% and our baseline system (A EMO, which has been trained on audio EMO-DB from scratch using PathNet) by 8% in regard to WAR.
Similarly, to gain further insight into the per-emotion performance of both speech emotion recognition systems (A eNTER→SAV and A eNTER+SAV→EMO), we illustrate their confusion matrices (see Fig. 18).

This transfer learning approach is also considered one of the most recent state-of-the-art methods. However, as pointed out earlier, pre-trained models are still limited in their generalization capability and thus lead to poor performance on novel test sets.
The main objective of this paper is to propose an alternative transfer learning approach that addresses these disadvantages. Therefore, we have primarily compared our transfer learning method using PathNet with a state-of-the-art transfer learning method relying on fine-tuning/pre-trained models, as reported in the Experiments & Results Section. The transfer learning method that we have compared against our PathNet-based approach on the same datasets is fine-tuned AlexNet. Unfortunately, we could not find any other state-of-the-art transfer learning method that uses the same datasets to allow a fair comparison. Our proposed system outperforms the state-of-the-art emotion recognition system relying on the transfer learning method that fine-tunes off-the-shelf/pre-trained models. We believe that this is because representational features unrelated to emotion are still retained in off-the-shelf/pre-trained models, and the extracted features are also vulnerable to identity variations in these approaches, which degrades the performance of emotion recognition systems that fine-tune off-the-shelf/pre-trained models on the emotion dataset.
As presented in Proposed Methodology Section, we have designed our PathNet ar-

Conclusion
Progress in emotion recognition research has been hindered by the lack of large amounts of labeled emotion data. To overcome this problem, various studies have explored transfer learning based on pre-trained/fine-tuning models; however, these approaches still suffer from issues such as discarding previously learned information. In this paper, we have proposed an alternative transfer learning technique using PathNet, a neural network algorithm that uses agents embedded in the neural network whose task is to discover which parts of the network to reuse for new tasks, thereby addressing the abovementioned challenges. To verify the performance of our proposed architecture, we have conducted various sets of experiments, including transferring the emotional knowledge from one emotion dataset to another and transferring the learned emotional knowledge from multiple emotion datasets to another. Experimental results on three datasets (eNTERFACE, SAVEE, and EMODB) indicate that our proposed system performs well under the condition of insufficient emotion data, and significantly better than recent transfer learning techniques exploiting fine-tuning/pre-trained models.

Acknowledgement
This research was supported by an Australian Research Council (ARC) Discovery grant DP140100793.

Figure 1: The genotypes of the population are viewed as a pool of strings. A single cycle of the Microbial GA proceeds by randomly picking two genotypes, comparing their fitnesses to determine the Winner and the Loser, and then recombining, where some proportion of the Winner's genetic material infects the Loser, before mutating the revised version of the Loser (from Harvey, 2011).
Figure 2: The pre-processing steps for the video and audio streams.
Figure 3: Our proposed emotion recognition system.

Figure 4: A number of pathways are randomly initialized when PathNet learns on the source data.

Figure 6: A pathway with bad performance (loser) is replaced by a pathway with better performance (winner).

Figure 9: A number of pathways are randomly initialized when PathNet is trained on the destination data.

Figure 11: A pathway with bad performance (loser) is replaced by a pathway with better performance (winner) when learning with the destination data.

l
An average per-class effectiveness of a classifier to identify class labels

For the next two sets of experiments, we conduct experiments for speech emotion recognition by transferring emotional knowledge from the audio eNTERFACE dataset (source data) to the audio SAVEE dataset (destination data), and by transferring emotional knowledge from multiple audio emotion datasets (eNTERFACE and SAVEE) to audio EMODB.

PathNet Settings: Our proposed PathNet architecture has L = 3 layers and M = 20 linear units per layer of 20 neurons each, followed by rectified linear units. An average function is used to combine the activated units between two layers, and a maximum of 4 units per layer are included in a pathway, which is represented by a 4×3 matrix of integers in the range [1, 13]. Our proposed PathNet architecture is trained on visual eNTERFACE, visual SAVEE, audio eNTERFACE, or audio eNTERFACE together with audio SAVEE as source data, and is then trained/evolved on visual SAVEE, visual eNTERFACE, audio SAVEE, or audio EMODB, respectively. In both the source tasks and the destination tasks of these sets of experiments, our proposed systems are trained for 200 generations. At the beginning of each task, a population of 20 pathways is randomly generated and then trained on the source data. In each generation, two pathways are randomly selected to train on the source data and then on the destination data for validation. To evaluate a pathway, it is trained with stochastic gradient descent with a learning rate of 0.02 and a mini-batch size of 64 for T epochs (T equals the number of samples in the training set divided by the mini-batch size of 64). The fitness of the pathway is the proportion of correctly classified training samples during that period of training. Once the fitness of the two pathways has been calculated, the pathway with the lower fitness is overwritten by the pathway with the higher fitness, which is then mutated: each element of the genotype is selected independently with equal probability 1/(4×3), and a random integer from the range [-2, 2] is added to the current value of the selected element in the loser's genotype. Therefore, the network between tasks is modified

Figure 14: Confusion matrices of our proposed system evaluated on visual SAVEE and visual eNTERFACE.
Figure 16: Receiver operating characteristic (ROC) curves of our proposed system on visual SAVEE and visual eNTERFACE.

The ROC curves are illustrated in Fig. 16 (a) and Fig. 16 (b), respectively. Moreover, to further visualize the effectiveness of our proposed systems, the testing learning curves are plotted in Fig. 15. As demonstrated in Fig. 15, the performance of both proposed systems with transfer learning considerably surpasses that of the systems without transfer learning. In the first two sets of experiments, we have focused on facial expression recognition. The experimental results show that the performance of our proposed facial expression recognition system is substantially boosted when transferring the emotional knowledge from one visual emotion domain to another visual emotion domain using PathNet. The performance of our system is also superior to that of recent state-of-the-art facial expression recognition systems that fine-tune off-the-shelf/pre-trained models. To further demonstrate the effectiveness of our proposed system, we carry out extensive experiments for the speech emotion recognition task.

Audio eNTERFACE → Audio SAVEE: In this set of experiments, the emotional knowledge is transferred from audio eNTERFACE to audio SAVEE. Experimental results are shown in Table 6.

Figure 17: Receiver operating characteristic (ROC) curves of individual emotions evaluated on audio SAVEE and audio EMODB.
Figure 19: A population of 20 pathways is randomly initialized when learning on the source data (audio eNTERFACE).
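The binary tournament described in the PathNet settings can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the index range [1, 13] is taken from the stated genotype range, the wrap-around that keeps mutated indices in range is an assumption, and `toy_fitness` is a hypothetical stand-in for the training accuracy obtained after training a pathway with SGD.

```python
import random

N_LAYERS = 3      # L = 3 layers
N_ACTIVE = 4      # at most 4 units per layer in a pathway
MAX_INDEX = 13    # genotype entries lie in [1, 13] as stated in the settings
POP_SIZE = 20     # pathways per population
MUT_PROB = 1.0 / (N_ACTIVE * N_LAYERS)  # 1/(4x3) per genotype element


def random_genotype(rng):
    """A pathway stored as N_LAYERS lists of N_ACTIVE unit indices."""
    return [[rng.randint(1, MAX_INDEX) for _ in range(N_ACTIVE)]
            for _ in range(N_LAYERS)]


def mutate(genotype, rng):
    """Copy a genotype and add an integer from [-2, 2] to each element
    with probability MUT_PROB."""
    child = [row[:] for row in genotype]
    for row in child:
        for i in range(len(row)):
            if rng.random() < MUT_PROB:
                # Assumption: wrap around so the index stays in [1, MAX_INDEX].
                row[i] = (row[i] - 1 + rng.randint(-2, 2)) % MAX_INDEX + 1
    return child


def tournament_step(population, fitness_fn, rng):
    """One generation: pick two pathways, overwrite the loser with a
    mutated copy of the winner, and return their population indices."""
    a, b = rng.sample(range(len(population)), 2)
    fit_a, fit_b = fitness_fn(population[a]), fitness_fn(population[b])
    winner, loser = (a, b) if fit_a >= fit_b else (b, a)
    population[loser] = mutate(population[winner], rng)
    return winner, loser


def toy_fitness(genotype):
    """Hypothetical placeholder; the paper uses training accuracy instead."""
    return sum(sum(row) for row in genotype)


rng = random.Random(0)
population = [random_genotype(rng) for _ in range(POP_SIZE)]
for _ in range(200):  # 200 generations, as in the settings
    tournament_step(population, toy_fitness, rng)
```

In an actual run, `toy_fitness` would be replaced by a routine that trains the selected pathway for T steps and returns its training accuracy.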
It can also be seen that our A eNTER+SAV→EMO system is highly successful at detecting speech emotions such as anger, boredom, disgust, fear, and sadness, with corresponding AUC values of 0.99, 0.99, 0.99, 0.98, and 0.99, whereas it performs less effectively on happiness, with an AUC of only 0.93 (as shown in Fig. 17 (b)).

Discussion: The reason we achieve significantly better performance when transferring the emotional knowledge from one emotion dataset to another using PathNet is that the emotional knowledge captured in the best pathway, obtained once training on one emotion dataset (source data) is complete, is reused as a part of the initial emotional knowledge and is always involved in the training stage of another set of pathways of PathNet on another emotion dataset (destination data). In contrast, when PathNet is trained on the same emotion dataset (destination data) from scratch, the emotional knowledge is randomly initialized and is learned with only one set of pathways on only one emotion dataset (destination data).
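The reuse of the best source pathway described in the discussion can be illustrated with a small bookkeeping sketch. It assumes that, on the destination task, the units of the best source pathway are always evaluated together with the candidate pathway's units while their parameters stay fixed, and that unit outputs within a layer are averaged. All function names, unit indices, and toy units below are hypothetical.

```python
from typing import Callable, Dict, List

Vector = List[float]
Genotype = List[List[int]]  # one list of unit indices per layer


def active_units(candidate: Genotype, source_best: Genotype) -> Genotype:
    """Units evaluated per layer: the candidate's plus the frozen source units."""
    return [sorted(set(c) | set(s)) for c, s in zip(candidate, source_best)]


def trainable_units(candidate: Genotype, source_best: Genotype) -> Genotype:
    """Units that may be updated; units from the source pathway stay fixed."""
    return [sorted(set(c) - set(s)) for c, s in zip(candidate, source_best)]


def layer_output(x: Vector,
                 units: Dict[int, Callable[[Vector], Vector]],
                 active: List[int]) -> Vector:
    """Average the outputs of the active units of one layer."""
    outputs = [units[u](x) for u in active]
    return [sum(values) / len(outputs) for values in zip(*outputs)]


# Toy genotypes with hypothetical unit indices (3 layers, 4 units each).
source_best = [[1, 2, 5, 7], [3, 3, 8, 9], [2, 4, 6, 6]]
candidate = [[2, 10, 11, 12], [1, 3, 4, 13], [6, 7, 8, 9]]

print(active_units(candidate, source_best)[0])     # [1, 2, 5, 7, 10, 11, 12]
print(trainable_units(candidate, source_best)[0])  # [10, 11, 12]

# Toy units: each maps an input vector to an output vector.
toy_units = {1: lambda v: [2 * a for a in v], 2: lambda v: [a + 1 for a in v]}
print(layer_output([1.0, 3.0], toy_units, [1, 2]))  # [2.0, 5.0]
```

Training from scratch corresponds to an empty `source_best`, in which case every active unit is trainable and nothing learned on the source data is carried over.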

Figure 20: The optimal pathway (highlighted in red) is obtained when the training stage is completed on audio eNTERFACE as the source data, and the parameters in this pathway are then fixed.

Figure 21: A new population of pathways is generated when learning on the destination data (audio SAVEE).

Figure 22: The pathway highlighted in red, which has been transferred from the model trained on the source data, is always activated while evolving on the destination task (audio SAVEE); however, its parameters are fixed.

Figure 23: The optimal pathway highlighted in blue is obtained when the training stage is completed on audio SAVEE (the destination data). The pathway highlighted in red is the best pathway obtained from the source data.

Table 2: Description of the notation of all models for all sets of experiments of our proposed system conducted on eNTERFACE, SAVEE, and EMODB.

Table 3: Results of our proposed system evaluated on visual SAVEE, visual eNTERFACE, audio SAVEE, and audio EMODB. In the first two rows, we explore transferring the emotional knowledge from visual eNTERFACE to visual SAVEE and vice versa. In the last two rows, the emotional knowledge is transferred from audio eNTERFACE to audio SAVEE, and from multiple audio emotion domains (audio eNTERFACE and audio SAVEE) to one audio emotion domain (audio EMODB).

Table 4: Results of our proposed system when transferring the emotional knowledge from visual eNTERFACE to visual SAVEE in comparison with the best baseline system.

Table 5: Results of our proposed system when transferring the emotional knowledge from visual SAVEE to visual eNTERFACE in comparison with the best baseline system.

Table 7 :
Results of our proposed system when transferring the emotional knowledge from audio eNTER-