Personalized models for facial emotion recognition through transfer learning

Emotions represent a key aspect of human life and behavior. In recent years, automatic recognition of emotions has become an important component in the fields of affective computing and human-machine interaction. Among many physiological and kinematic signals that could be used to recognize emotions, acquiring facial expression images is one of the most natural and inexpensive approaches. The creation of a generalized, inter-subject, model for emotion recognition from facial expression is still a challenge, due to anatomical, cultural and environmental differences. On the other hand, using traditional machine learning approaches to create a subject-customized, personal, model would require a large dataset of labelled samples. For these reasons, in this work, we propose the use of transfer learning to produce subject-specific models for extracting the emotional content of facial images in the valence/arousal dimensions. Transfer learning allows us to reuse the knowledge assimilated from a large multi-subject dataset by a deep-convolutional neural network and employ the feature extraction capability in the single subject scenario. In this way, it is possible to reduce the amount of labelled data necessary to train a personalized model, with respect to relying just on subjective data. Our results suggest that generalized transferred knowledge, in conjunction with a small amount of personal data, is sufficient to obtain high recognition performances and improvement with respect to both a generalized model and personal models. For both valence and arousal dimensions, quite good performances were obtained (RMSE = 0.09 and RMSE = 0.1 for valence and arousal, respectively). Overall results suggested that both the transferred knowledge and the personal data helped in achieving this improvement, even though they alternated in providing the main contribution. 
Moreover, in this task, we observed that the benefits of transferring knowledge are so remarkable that no specific active or passive sampling techniques are needed for selecting images to be labelled.


Introduction
Emotions play a key role in how people think and behave. Emotional states affect how actions are taken and influence decisions. Moreover, emotions play an important role in human-human communication and, in many situations, emotional intelligence, i.e. the ability to correctly appraise, express, understand, and regulate emotions in the self and others [58], is crucial for a successful interaction. Affective computing research aims to furnish computers with emotional intelligence [51] to allow them to be genuinely intelligent and support natural human-machine interaction (HMI). Emotion recognition has several applications in different areas such as marketing [18], safe and autonomous driving [22], mental health monitoring [17], brain-computer interfaces [65], social security [75], and robotics [55].
Among the signals that can be used to recognize emotions, facial expression is one of the most natural sources of information and is the main channel for nonverbal communication [30]. Moreover, RGB sensors for acquiring images of the face cost significantly less than other sensors and do not need to be worn, which makes Facial Emotion Recognition (FER) a good candidate for commercial-grade systems. Traditional FER approaches are based on the detection of faces and landmarks, followed by the extraction of hand-crafted features, such as facial Action Units (AUs) [3]. Machine learning algorithms, such as Support Vector Machines (SVMs), are then trained and employed on these features. In contrast, deep learning approaches aim to provide an end-to-end learning mechanism, reducing the pre-processing of input images [33]. Among deep learning models, Convolutional Neural Networks (CNNs) are particularly suited for facial image processing and greatly reduce the dependence on physics-based models [78]. Moreover, recurrent nets, in particular Long Short-Term Memory (LSTM) networks, could take advantage of temporal features if the recognition is performed on videos instead of single images [9].
FER can be formulated as a classification or a regression problem. The distinction mainly depends on the emotional model employed for representing emotions. In categorical representations, emotions consist of discrete entities, associated with labels. Ekman observed six basic emotions (anger, disgust, fear, happiness, sadness, and surprise) characterized by distinctive universal signals and physiology [11]. Tomkins characterized nine biologically based affects: seven that can be described in pairs (interest-excitement, enjoyment-joy, surprise-startle, distress-anguish, anger-rage, fear-terror, shame-humiliation), representing the mild and the intense form of each affect, plus dissmell (a neologism coined by Tomkins) and disgust [71]. In contrast with discrete representations, dimensional models aim to describe emotions by quantifying some of their features over a continuous range. Russell developed a circumplex model suggesting that emotions can be represented over a two-dimensional circular space, where axes correspond to valence and arousal dimensions [57]. The circumplex model could be extended by adding axes, in order to describe complex emotions: Mehrabian proposed the Pleasure-Arousal-Dominance (PAD) model, where the Dominance-Submissiveness axis represents how much one feels in control of, versus controlled by, the emotion [45]; Trnka and colleagues developed a four-dimensional model describing valence, intensity, controllability, and utility [72]. Between discrete and dimensional models, Plutchik and Kellerman proposed a hybrid three-dimensional model where 8 primary bipolar emotions are arranged in concentric circles (equivalent to a cone in space), such that inner circles contain more intense emotions [52]. Which model is more suitable for representing emotions, and whether facial expressions of emotions are universal, are still debated [2,12,21,23,38,41]. Our article steps back from these debates and focuses on the usability of the models.
We decided to adopt the dimensional model, in particular the Valence-Arousal (VA) approach, because we observed that most of the datasets that rely on discrete emotions do not refer to the same emotion labels, thus making it difficult to conduct experiments that involve multiple datasets. Conversely, almost all available datasets based on the dimensional representations are compatible with each other, after normalizing values if needed, and the VA model represents their common denominator.
Independently of the model used to represent emotions, building a general FER model is still a challenge. The "one-fits-all" approach requires the predictor to be able to generalize acquired knowledge to unseen data, but this is prevented by subjects' differences in anatomies, cultures, and variable environment setups [59]. A possible solution could be the creation of subject/environment-specific models, but, on the other hand, it would require a considerable amount of labelled data to train the predictor. Therefore, it could be infeasible in practice since emotional labelling is an expensive process that has to be performed by expert annotators [47].
In this work, we propose a transfer learning approach to work around the problem of training a high-capacity classifier over a small, subject-specific dataset. First, we trained a general-purpose CNN, namely AlexNet [36], over the AffectNet (Affect from the InterNet) database [47], and then we exploited the assimilated knowledge to perform fine-tuning over small personal datasets of images extracted from videos, obtained from the AMIGOS dataset [46]. Our approach belongs to the category of transductive transfer learning, since the source and target domains are different but related (AffectNet images are not acquired in a controlled environment, whereas AMIGOS images are) and the source and target tasks are the same [49]. Our results show that transfer learning in this domain helps in improving the emotion recognition performance with respect to both personal models (i.e., trained on subject-specific data) and generalized models (i.e., trained on all the subjects). Moreover, they show that for the considered dataset, while the valence dimension is more generalizable across different subjects, arousal depends more on individual characteristics.
Additionally, we investigated whether it would be possible to reduce the number of samples needed for fine-tuning the net. To this aim, we trained the net with sets of increasing size. We observed that a very small amount of personal data is needed to achieve quite good performance, regardless of the strategy employed to select, from the whole set, the samples to be labelled. Our test employed both active and passive sampling techniques developed for regression, namely Greedy Sampling (GS) [79] and Monte-Carlo Dropout Uncertainty Estimation (MCDUE) [73], to select the samples characterized by higher uncertainty to be added to the training set. The tested approaches rely on the assumption that there is an unlabeled dataset from which some samples must be selected and labeled (pool-based sampling). Passive sampling techniques explore the input space of unlabeled samples, while active sampling approaches also take into account a previously trained regression model to estimate sample uncertainty. In particular, the GSx variant of GS is passive, while GSy, iGS, and MCDUE follow the active sampling paradigm. Our results show that the error of the trained model decreases very fast with respect to the size of the training set, so that there are no significant differences between using random sampling and active/passive sampling.
The rest of the paper is structured as follows: Section 2 discusses related work and recent results in FER from literature; Section 3 describes our approach; Section 4 exhibits and discusses the results; Section 5 contains conclusions and future developments.

Related work
Considerable effort has been devoted to FER during the last decades, employing both traditional machine learning and deep learning methods. The work by Ko and colleagues [33] offers a recent and comprehensive review of this topic. Here, we report some of the most significant methods and results, also summarized in Table 1.
Among the traditional machine learning approaches, Shan and colleagues [61] extracted Local Binary Pattern (LBP) to be used as features for SVM. The authors tested the method in a cross-dataset experiment, involving Cohn-Kanade (CK) [29], Japanese Female Facial Expression (JAFFE) [43], and M&M Initiative (MMI) [50] datasets and achieving 95.1% of accuracy (6 classes) on CK. In [10], the authors proposed the use of Kernel Subclass Discriminant Analysis (KSDA), classifying 7 basic and 15 compound emotions based on a Facial Action Coding System (FACS). Average accuracy was 88.7% for basic emotions and 76.9% for compound emotions. Suk and Prabhakaran [66] classified seven emotions with 86.0% accuracy employing SVM on data from extended Cohn-Kanade dataset (CK+) [42]. Features were obtained by using Active Shape Model (ASM) fitting landmarks and displacement between landmarks. The same dataset was used in [15], where 95.2% and 97.4% of accuracy was obtained by AdaBoost and SVM with boosted features, respectively. In [82], Multiple Kernel Learning (MKL) for SVM was used to combine multiple facial features, a histogram of oriented gradient (HOG), and local binary pattern histogram (LBPH). The authors evaluated the proposed approach on multiple datasets, outperforming many of the state-of-the-art algorithms.
FER was addressed with a deep learning approach by Jung and colleagues [27], who combined two deep networks that extract appearance features from images (using convolutional layers) and geometric features from facial landmarks (using fully connected layers), obtaining 97.3% accuracy on CK+ data. In [28], the authors showed that a hybrid CNN-RNN architecture can outperform simple CNNs. Hasani and Mahoor [19] combined two well-known CNN architectures [68,69] in a 3D Inception-ResNet, followed by an LSTM layer, to extract both spatial and temporal features. Arriaga and colleagues [1] implemented a CNN with four convolutional levels and a final level of global average pooling with a soft-max activation function. They classified images from the Facial Expression Recognition 2013 (FER-2013) database [16], achieving 66% accuracy.
The previously mentioned approaches deal with the classification of categorical emotions. In general, the literature offers many more contributions regarding the classification of categorical emotions than the regression of emotional dimensions. For this task, visual data are often employed also for improving the recognition power of multi-modal approaches [5,64].
Using a dimensional model, Khorrami and colleagues [31] obtained a Root Mean Squared Error (RMSE) equal to 0.107 in predicting valence, by using a CNN-RNN architecture over the AV+EC 2015 (Audio/Visual+Emotion Challenge) dataset [54]. In [47], the authors employed the AlexNet CNN [36] to classify and regress emotions from the AffectNet [47] dataset, which contains images collected by querying internet search engines, annotated both with the  [47], since our aim is to take advantage of the knowledge learned from AffectNet. In this sense, the choice of AlexNet is motivated by the promising results achieved on such dataset. However, to the best of our knowledge, the effect of transfer learning on FER has been poorly investigated and has mainly focused on categorical emotion recognition and AU detection [6,7,48,81]. The work by Feffer and Picard [13] addressed the personalization of deep neural networks for valence and arousal estimation. The authors proposed the use of a Mixture-of-Experts (MoEs) technique [24]. In MoEs frameworks, each expert represents a subnetwork trained on a subset of the available data, thus tuned for a specific context. A gating network weights the contribution of the experts in the inference step. In [13], a ResNet architecture was followed by an experts' subnetwork, where each expert corresponded to a different subject of the training set. A supervised domain adaptation approach [26] was adopted to fine-tune the expert network to unseen subjects. Results obtained on the RECOLA dataset [53] demonstrated the efficacy of the proposed approach.

Datasets preparation
We used two datasets of emotionally labeled images. AffectNet is a database of facial expressions in the wild, i.e. images are not acquired in the controlled environment of a laboratory. AffectNet contains more than 1 million samples collected by querying internet search engines with emotion-related keywords (Fig. 1 shows an example set of images). Images have been preprocessed to obtain bounding boxes of faces and rescaled to a common resolution. About half of the dataset has been manually annotated, both with respect to the categorical model (seven basic emotions) and the dimensional model (valence and arousal). The remaining images are not annotated and are not provided, in order to be used for future challenges. Hence, in this work, we employed just the annotated subset. AMIGOS (A dataset for Multimodal research of affect, personality traits, and mood on Individuals and GrOupS) [46] is a dataset consisting of multimodal signal recordings of 40 participants in response to emotional fragments of videos. Signals include EEG, ECG, GSR, frontal and full-body videos, and depth camera videos. All participants took part in a first experimental session, in which they watched short video fragments (<250 s), while some of them participated in a second session, where they had to watch long videos (>14 min), individually or in groups. The dataset also contains emotional annotations both by three external annotators and by the participants themselves. We included in our study individual frontal videos from 10 participants recorded during the first experiment and, since participants could be unreliable at reporting on their own emotions or willing to hide them [63], we considered the related external annotations of valence and arousal. For each short video, we extracted a frame every 4 s, obtaining about 1000 frames for each subject. We manually removed frames where the participant's face was not visible.
Each frame has then been preprocessed, by using a pre-trained Cascade Classifier (CC) [76] to extract the bounding box of the user's face and resize it (see Fig. 2). Labels associated with each frame have been computed as the average of external annotators' scores.
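The frame sampling and label averaging described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the layout of the annotation lists (one list of scores per external annotator, aligned by frame) are hypothetical and do not reflect the actual AMIGOS file format or the authors' code.

```python
def frame_timestamps(duration_s, step_s=4.0):
    """Timestamps (in seconds) at which one frame is taken every step_s seconds."""
    n_frames = int(duration_s // step_s) + 1
    return [i * step_s for i in range(n_frames)]


def average_labels(annotations):
    """Average the scores of several annotators, frame by frame.

    annotations: one list of scores per annotator (hypothetical layout),
    all of the same length and aligned on the same frames.
    """
    return [sum(scores) / len(scores) for scores in zip(*annotations)]


# Toy usage: a 10 s clip and three annotators scoring two frames.
timestamps = frame_timestamps(10.0)  # -> [0.0, 4.0, 8.0]
labels = average_labels([[0.2, 0.4], [0.4, 0.6], [0.6, 0.8]])
```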

CNN architecture and training
AlexNet [36] is a well-known CNN designed for general-purpose image classification and localization. It won the ImageNet LSVRC-2012 [56] by a large margin both in Task 1 (Classification) and in Task 2 (Localization and Classification). The neural network consists of five convolutional layers (the first, the second, and the fifth are followed by max-pooling layers) and three fully connected layers. AlexNet uses the Rectified Linear Unit (ReLU), instead of the hyperbolic tangent function (tanh), as activation to introduce nonlinearity. To reduce overfitting in the fully connected layers, hidden-unit dropout [77] is also employed. In its original form, AlexNet was developed to perform a classification between 1000 classes (a 1000-way softmax layer produces the output). Here, we built two twin CNNs by adding a single output unit with a linear activation function (to perform regression). The CNNs were trained to estimate valence and arousal separately. AffectNet images have been split into training (60%) and validation (40%) sets. We selected mean square error (MSE) as the loss function and trained the net by using Mini-Batch Gradient Descent (MBGD) [39], with batch size set to 16, learning rate equal to 0.001, and Nesterov momentum equal to 0.9. Training iterated for 50 epochs over the whole training set. The net achieved RMSE values of 0.279 and 0.242 for valence and arousal, respectively, over the validation set.
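As a rough illustration of the optimizer settings reported above (MSE loss, batch size 16, learning rate 0.001, Nesterov momentum 0.9, 50 epochs), the following pure-Python sketch applies mini-batch gradient descent with Nesterov momentum to a single linear output unit. It is not the authors' training code: the one-feature model, the toy data, and all function names are assumptions made for illustration, whereas the real experiments train the full AlexNet.

```python
import random


def mse(w, b, xs, ys):
    """Mean squared error of the linear unit y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def batch_gradient(w, b, xs, ys):
    """Gradient of the MSE with respect to w and b on one mini-batch."""
    n = len(xs)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb


def train_nesterov(xs, ys, lr=0.001, momentum=0.9, batch_size=16,
                   epochs=50, seed=0):
    """Mini-batch gradient descent with Nesterov momentum."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0      # parameters of the linear output unit
    vw, vb = 0.0, 0.0    # velocity terms
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            bx = [xs[i] for i in batch]
            by = [ys[i] for i in batch]
            # Nesterov: evaluate the gradient at the look-ahead point.
            gw, gb = batch_gradient(w + momentum * vw, b + momentum * vb, bx, by)
            vw = momentum * vw - lr * gw
            vb = momentum * vb - lr * gb
            w += vw
            b += vb
    return w, b


# Toy regression problem (assumed data, exactly linear).
xs = [i / 63 for i in range(64)]
ys = [0.5 * x - 0.2 for x in xs]
initial_loss = mse(0.0, 0.0, xs, ys)
w, b = train_nesterov(xs, ys)
final_loss = mse(w, b, xs, ys)
```

With such a small learning rate the loss decreases slowly, which is why the paper trains for many epochs over a large dataset.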

Fine-tuning
After the CNNs were trained over a large dataset, we could take advantage of two aspects: a) the feature extraction capability of the nets (i.e. the pre-trained convolutional layers) and b) the current setting of the task-related (dense) layers, which could be used as a starting guess for the fine-tuning phase to ease convergence. To this aim, we applied the following steps to each CNN. First, we split the net into its convolutional (including the flatten layer) and dense parts. From this point on, we operated just on the dense layers, treating them as a new net whose weights were initialized with the previously computed values. This allowed us to "freeze" the learning of the convolutional part and to perform further experiments changing weights and biases just in the dense part. The frozen convolutional part is thus used only to obtain the features of each sample. Figure 3 shows the procedure.
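The splitting idea can be illustrated with a deliberately tiny stand-in model. In the sketch below, a fixed ReLU projection plays the role of the frozen convolutional part and a single linear unit plays the role of the dense part; both are assumptions chosen to keep the example self-contained, and the actual networks have several dense layers. The point shown is that features are computed once by the frozen part, while only the dense parameters, initialized from their pre-trained values, are updated.

```python
def conv_features(x, conv_weights):
    """Stand-in for the frozen convolutional part: a fixed ReLU projection."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in conv_weights]


def dense_predict(features, dense):
    """Stand-in for the dense part: one linear output unit."""
    return sum(w * f for w, f in zip(dense["w"], features)) + dense["b"]


def mse(feats, labels, dense):
    return sum((dense_predict(f, dense) - y) ** 2
               for f, y in zip(feats, labels)) / len(feats)


def fine_tune(samples, labels, conv_weights, pretrained_dense, lr=0.1, epochs=200):
    """Update only the dense parameters; conv_weights are never modified."""
    # Features are extracted once, since the convolutional part is frozen.
    feats = [conv_features(x, conv_weights) for x in samples]
    # Start from the pre-trained dense parameters (copied, not re-initialized).
    dense = {"w": list(pretrained_dense["w"]), "b": pretrained_dense["b"]}
    n = len(feats)
    for _ in range(epochs):
        grad_w = [0.0] * len(dense["w"])
        grad_b = 0.0
        for f, y in zip(feats, labels):
            err = dense_predict(f, dense) - y
            for k, fk in enumerate(f):
                grad_w[k] += 2.0 * err * fk / n
            grad_b += 2.0 * err / n
        dense["w"] = [w - lr * g for w, g in zip(dense["w"], grad_w)]
        dense["b"] -= lr * grad_b
    return dense, feats


# Toy personal dataset (assumed values).
conv_weights = [[1.0, 0.0], [0.0, 1.0]]
samples = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
labels = [0.2, 0.4, 0.6]
pretrained = {"w": [0.0, 0.0], "b": 0.1}

tuned, feats = fine_tune(samples, labels, conv_weights, pretrained)
loss_before = mse(feats, labels, pretrained)
loss_after = mse(feats, labels, tuned)
```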

Greedy sampling
The basic GS algorithm [79], GSx, is based on the exploration of the feature space of the unlabeled pool. It is the only passive sampling approach we tested in this work. Its counterpart, GSy, follows the same steps, but aims to explore the output space of the pool (starting from a pre-trained regression model inferred from samples selected by means of GSx). We employed a simplified version of the latter since a) our aim was to update the regression model only after the selection of all the samples that had to be queried and b) in the fine-tuning scenario the pre-trained regression model is already available. The steps of the algorithms are described in the pseudocode shown in Table 2. The idea underlying these methods is the exploration of the feature and the output spaces, respectively, by computing the Euclidean distances between samples. The initialization phase selects the starting element to be placed in the output set (labeled set) as the one closest to the centroid of the set of elements to be labeled. Then, the iterative phase chooses, at each step, the element furthest from the labeled set, by computing for each candidate sample its minimum distance from the labeled set and selecting the candidate with the maximum computed value (see Fig. 4 for a graphical representation of the strategy).
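The greedy selection described above (centroid-based initialization, then repeatedly picking the candidate whose minimum distance from the selected set is largest) can be sketched as follows. This is a minimal illustration and not the pseudocode of Table 2: the function names are hypothetical. Passing feature vectors gives a GSx-style selection; passing the one-dimensional predictions of a pre-trained model as points would give a simplified GSy-style selection, under the assumption that predictions are used in place of labels.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def greedy_sampling(points, n_select):
    """Return the indices of n_select points chosen by greedy sampling."""
    n = len(points)
    dim = len(points[0])
    # Initialization: the point closest to the centroid of the pool.
    centroid = [sum(p[d] for p in points) / n for d in range(dim)]
    first = min(range(n), key=lambda i: euclidean(points[i], centroid))
    selected = [first]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [euclidean(p, points[first]) for p in points]
    while len(selected) < min(n_select, n):
        candidates = [i for i in range(n) if i not in selected]
        # Pick the candidate furthest from the selected set.
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], euclidean(points[i], points[nxt]))
    return selected


# Toy pool of 2-D feature vectors (assumed values).
pool = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (1.0, 1.0)]
chosen = greedy_sampling(pool, 3)  # -> [4, 3, 0]
```

In the toy pool, the point (1, 1) is closest to the centroid, the outlier (10, 10) is then the furthest from it, and (0, 0) is the furthest from both.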
GSx and GSy can be combined in a further method (iGS) that takes advantage of the knowledge from both the feature and the output spaces (see Table 3 for the pseudocode). For each candidate, iGS computes the minimum of the element-wise product of the distances calculated in the feature and the output spaces. This implies that the search is pair-oriented, i.e. the best candidate would be the one that has the same labeled element very close in both spaces.
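A minimal sketch of an iGS-style selection is given below. It follows the usual formulation of iGS, in which each candidate is scored by the minimum, over the already selected samples, of the product of its feature-space and output-space distances, and the candidate with the largest score is selected. Two points are assumptions rather than details taken from Table 3: the first sample is chosen as in GSx (closest to the feature centroid), and the outputs are predictions of the pre-trained model, since true labels are not available before querying.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def igs(features, outputs, n_select):
    """Return indices selected by an iGS-style rule.

    features: list of feature vectors of the unlabeled pool.
    outputs:  predicted output for each pool sample (assumed to come
              from the pre-trained regression model).
    """
    n = len(features)
    dim = len(features[0])
    centroid = [sum(f[d] for f in features) / n for d in range(dim)]
    first = min(range(n), key=lambda i: euclidean(features[i], centroid))
    selected = [first]
    while len(selected) < min(n_select, n):
        best, best_score = None, -1.0
        for i in range(n):
            if i in selected:
                continue
            # Minimum over selected samples of (feature dist * output dist).
            score = min(
                euclidean(features[i], features[j]) * abs(outputs[i] - outputs[j])
                for j in selected
            )
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected


# Toy pool: 1-D features and predicted outputs (assumed values).
features = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
outputs = [0.5, 0.1, 0.5, 0.5, 0.9]
chosen = igs(features, outputs, 3)  # -> [2, 4, 1]
```

In this toy pool, sample 0 is far from the first selected sample in feature space but has the same predicted output, so its score is zero and it is skipped in favour of samples 4 and 1.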

Monte-Carlo dropout uncertainty estimation
MCDUE [73] is based on a popular regularization technique for neural networks, namely Dropout [77]. Dropout randomly "turns off" some of the hidden layer neurons, so that they will output 0 regardless of their input. Dropout can also be used as an active sampling technique, since it provides a method for computing sample uncertainty [14], starting from a pre-trained neural network. For each sample, this is obtained by iteratively (T times) muting a set of neurons (based on a Dropout probability π) and predicting the regression output. Starting from the assumption that the Standard Deviation (STD) of the predictions could be used as a metric of the sample's uncertainty, samples with the largest STD are selected to be queried. Table 4 describes the steps of the procedure. In our experiment, we set T and π to 50 and 0.5, respectively.
Fig. 3 Net splitting procedure for fine-tuning. Separating convolutional and dense layers allowed us to freeze the training of the former, maintaining the feature extraction capability of the net. Note that two different networks are trained, one for valence and one for arousal
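The MCDUE selection rule can be sketched on a toy network as follows. The sketch keeps the values T = 50 and π = 0.5 reported above, but everything else is an assumption for illustration: the network is a single hidden ReLU layer with hand-picked weights, dropout is implemented in its "inverted" form (surviving activations are rescaled by 1 / (1 − π)), and the function names are hypothetical.

```python
import random
import statistics


def predict_with_dropout(x, w_hidden, w_out, p_drop, rng):
    """One stochastic forward pass of a tiny ReLU network with dropout."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    keep = 1.0 - p_drop
    # Each hidden unit is muted with probability p_drop; survivors are rescaled.
    dropped = [h / keep if rng.random() >= p_drop else 0.0 for h in hidden]
    return sum(w * h for w, h in zip(w_out, dropped))


def mcdue_select(pool, w_hidden, w_out, n_select, T=50, p_drop=0.5, seed=0):
    """Rank pool samples by the STD of T dropout predictions; return the top ones."""
    rng = random.Random(seed)
    stds = []
    for x in pool:
        preds = [predict_with_dropout(x, w_hidden, w_out, p_drop, rng)
                 for _ in range(T)]
        stds.append(statistics.pstdev(preds))
    ranked = sorted(range(len(pool)), key=lambda i: stds[i], reverse=True)
    return ranked[:n_select], stds


# Toy pre-trained weights and unlabeled pool (assumed values).
w_hidden = [[1.0, 0.5], [0.3, 1.0]]
w_out = [1.0, -0.5]
pool = [(0.0, 0.0), (1.0, 2.0)]

selected, stds = mcdue_select(pool, w_hidden, w_out, n_select=1)
```

The all-zero sample produces the same prediction under every dropout mask, so its STD is zero, while the second sample yields varying predictions and is the one selected for querying.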

Results
In order to observe how our approach affects the performance of the recognition of a specific subject, we tested three conditions: in the first, used for comparison, the untrained CNN (including the convolutional layers) was trained just on subject-specific data from AMIGOS, thus no transfer learning was performed ("No Transfer" label in Fig. 5); in the second condition, the net was pre-trained on AffectNet data, but no personal data were employed during training ("Transfer 0% Labeling" label in Fig. 5); in the third condition, the net, pre-trained on AffectNet, was fine-tuned by using the whole available personal training set ("Transfer 100% Labeling" label in Fig. 5).
For each subject from AMIGOS, we performed training and testing in a k-fold cross-validation framework [34]. In k-fold cross-validation, the choice of the k parameter often represents a trade-off between bias and variance of the model. k is usually set to 5 or 10, but there is no formal rule [25,37]. In our specific case, k = 5 and k = 10 would lead to a fold size of about 200 and 100 samples, respectively. We opted for k = 5 in order to obtain a test set better representing the variability in the underlying distribution. Results are averaged over the 10 selected subjects. Results show that there is a substantial difference between valence and arousal. For valence, the first test obtained poor results (average RMSE = 0.37), meaning that learning valence only from subject-specific data is a hard task. The pre-trained net (Transfer 0% Labeling) obtained better results (average RMSE = 0.14), suggesting that the network trained on the AffectNet database (the transferred knowledge) allowed us to create quite a good model, able to generalize positive and negative valence among all the subjects. Conversely, for arousal, the net fed with subject-specific data achieved a better initial performance (average RMSE = 0.22), while the pre-trained net performed very poorly, meaning that the correct recognition of arousal levels is more dependent on the specific subject and so more difficult to generalize. Indeed, these differences can be explained by the intrinsic characteristics of valence and arousal and by the different nature of the data used for the training. In general, learning from subject-specific data seems harder for valence than for arousal (see the "No Transfer" scenario). Moreover, while moving from an in-the-wild context to a controlled environment does not negatively affect the learning process for valence, it is indeed quite confounding for arousal.
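The evaluation protocol above can be sketched with plain Python. With about 1000 frames per subject and k = 5, each test fold holds about 200 samples and each training set about 800, consistent with the fold size mentioned in the text. The function names and the shuffling seed are assumptions, not details from the paper.

```python
import math
import random


def kfold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


# About 1000 frames per subject, k = 5.
splits = list(kfold_indices(1000, 5))
fold_sizes = [(len(train), len(test)) for train, test in splits]
```

Per-subject results would then be obtained by averaging the RMSE over the five folds, and the reported values by averaging over subjects.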
For both dimensions, the proposed method (Transfer 100% Labeling) improved recognition performances. This can be observed from the third scenario results, where quite good performances were obtained (RMSE = 0.09 and RMSE = 0.1 for valence and arousal, respectively). Overall results suggested that both the transferred knowledge and the personal data helped in achieving this improvement, even though they alternated in providing the main contribution.
To assess how many instances are needed by the fine-tuning process for developing an accurate personal model, for each fold, we considered different percentages of the training set, from 5% (meaning that ~40 samples were used for fine-tuning) up to 90% (~720 samples). Moreover, we were also interested in discovering if, in the context of FER, it would be convenient to adopt "smart" sampling techniques, in order to further reduce the number of samples. To this aim, we tested the different sampling techniques and compared the results with those obtained by randomly sampling the set of available instances.
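The random sampling baseline at increasing training set sizes can be sketched as below. Only the endpoints (5% and 90%) come from the text; the intermediate percentages, the nesting of the subsets (each larger subset contains the smaller ones), and the function names are assumptions made for illustration.

```python
import random

# Assumed grid of percentages; only 5% and 90% are stated in the text.
PERCENTAGES = [5, 10, 20, 30, 40, 50, 60, 70, 80, 90]


def subset_size(n_train, pct):
    """Number of samples corresponding to pct percent of the training set."""
    return n_train * pct // 100


def random_subsets(train_idx, percentages, seed=0):
    """Nested random subsets of the training indices, one per percentage."""
    rng = random.Random(seed)
    shuffled = list(train_idx)
    rng.shuffle(shuffled)
    return {
        pct: shuffled[:subset_size(len(train_idx), pct)] for pct in percentages
    }


# A training fold of about 800 samples.
subsets = random_subsets(range(800), PERCENTAGES)
sizes = {pct: len(idx) for pct, idx in subsets.items()}
# sizes[5] == 40 and sizes[90] == 720, matching the counts in the text.
```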
Results obtained by testing the considered sampling techniques and random sampling at several steps of the training set size further highlight small differences between valence and arousal (Fig. 6). We observed a rapid decrease in the error in both dimensions, which makes it impossible to appreciate significant differences between sampling approaches, including random sampling. Surprisingly, the steeper slope corresponds to valence, even though it started with a lower RMSE than arousal in the "Transfer 0% Labeling" case.
Finally, in Fig. 6, the presence of an elbow point is evident, both in valence (20%) and in arousal (30%), meaning that querying from ~240 to ~320 personal samples is enough to obtain quite good results, with RMSE values that are comparable with the results obtained in the Transfer 100% Labeling case (~800 instances). These results support the creation of personal models for FER, since the improvement in performance is obtained with a limited number of labelled data.

Conclusions
In this paper, we addressed the FER challenge, proposing a transfer learning approach for regression, to exploit information learned by a CNN (AlexNet) on a large in the wild dataset (AffectNet) and then create a subject-specific model.
In summary, the contribution of this work is the following:  [73,79] for the personal dataset has been evaluated.
The results suggested that both transfer learning and subject-specific data are needed, as demonstrated by the fact that the value of RMSE obtained with transfer learning is far better both than the value obtained by simply evaluating the network pre-trained on AffectNet and than the value obtained by training the network, with random initialization of weights, just on the user's images. Interestingly, valence and arousal exhibited quite different behaviors, probably due to intrinsic differences between them. Arousal proved to be less generalizable between subjects, but easier to detect by using a small amount of personal data, with respect to valence. This difference between valence and arousal could also be affected by the differences between datasets (in the wild with respect to controlled environments). Moreover, by considering these results together with the slope of the RMSE while changing the number of personal samples in the training, it is evident that for arousal very few samples are enough to fine-tune the network. This confirms the need but also the feasibility of training personal models for efficient Facial Emotion Recognition. Moreover, this different behavior of valence and arousal, and its impact on the emotion recognition performance, could be highlighted only by considering a dimensional approach instead of a categorical one.
Finally, in order to learn a solid single-subject model with minimum demand for new target data, we evaluated different active learning algorithms for regression. The experiment showed that our approach can significantly improve recognition performance with a limited number of target samples, regardless of the sampling techniques employed for querying samples. In particular, it would be sufficient to have a number of samples between ~240 and ~320, which is very low if we consider that the annotation in the subjective dataset was performed once for an entire videoclip. Moreover, taking into account that the dataset used external annotations, the process could be automated by using pre-annotated stimuli, thus sparing the user from annotating the data themselves. This could be extremely useful in situations where the number of images available for a single person is small or one wants to optimize interaction with the user.
Fig. 6 Average learning results between subjects at different training set sizes, for valence and arousal. The dashed lines represent the RMSE value obtained by using the whole training set