1 Introduction

Affective artificial intelligence (AI) technologies are becoming increasingly prevalent [1,2,3,4]. Such technologies can be deployed in various contexts for different purposes, ranging from assistive technology [5] to more personalised user experiences and behaviour manipulation [6]. Facial images are one of the main data sources for developing these technologies [2, 3]. These images capture facial expressions that can be used to build systems that predict the emotional states of individuals. Common categories of emotional states identified from facial expressions are: happy, neutral, sad, surprised, angry, disgusted and anxious [7]. These categories are intuitive and simple (i.e. easy to observe and describe), but fail to represent different degrees of emotion [8]. To describe these degrees, continuous scales with multiple dimensions have been introduced [8], as shown in Fig. 1.

Fig. 1 Two-dimensional description of emotions using continuous scales

In some cases, the use of affective AI raises privacy- and security-related concerns. For example, the development of affective AI technologies relies on the processing of large amounts of facial images, which poses significant privacy and security challenges if users’ facial identities are stolen in the process. Also, with the ubiquitous use of biometric authentication systems, illegally obtained facial images could be exploited to unlock users’ devices, assume their identities, make electronic payments, and access sensitive data. Moreover, McStay [9] noted that data derived from someone’s emotional state may be “intimate” and sensitive, and that this information can easily be linked back to the person using their facial image.

In this work, we implement and evaluate the potential of two privacy-preserving strategies for predicting the valence and arousal dimensions of emotion. The first strategy extracts Action Units (AUs) from users’ facial images on their local machines, discards the images and sends only the AUs to the main processing machine. AUs are anonymised facial features that safeguard the facial identities of users in case of unauthorised access or misuse of data, yet still contain sufficient information for analysis. The second strategy employs a Federated Learning (FL) approach in which raw images are processed on users’ local machines and the locally trained models are sent to the main processing machine for aggregation. The main contributions of this study are:

  • An application of federated learning strategy for affect recognition using facial images to protect users’ facial identities.

  • A comprehensive analysis of affective computing using non-federated processing of facial images, non-federated processing of anonymised facial features (AUs) and federated processing of facial images using deep learning architectures.

2 Related works

In this section, we first review the literature on the processing of facial images and AUs using non-federated deep learning strategies. We then outline the works that explore an FL approach to protect users’ identities while detecting emotions.

2.1 Non-federated deep learning methods

Different non-federated deep learning architectures have been successfully used to process facial images and predict emotional states [10,11,12,13]. For example, Tzirakis et al. [10] reported the best valence and arousal recognition performance after training Convolutional Neural Networks (CNNs) coupled with Long Short-Term Memory networks (LSTMs) on images from the RECOLA database, while Lee et al. [13] combined features extracted from RECOLA images using 3D CNNs with spatio-temporal features extracted using Convolutional LSTMs to predict valence scores. These non-federated CNN approaches require the developer to maintain a database of facial images for processing and, as such, could be susceptible to privacy and security issues if the facial images are accessed by malicious users or organisations.

To protect participants’ facial identities and still develop accurate affective AI technologies, researchers have explored AUs and facial landmarks extracted from facial expressions in images [11, 14,15,16]. AUs represent human-observable facial muscle movements whose intensities are estimated from facial landmarks. For example, AUs 12 (raising lip corners), 15 (lowering lip corners) and 20 (lip stretch) can be estimated using the facial landmarks on the lips. In typical privacy-preserving non-federated AU approaches, AUs are extracted from users’ facial images on their local machines and sent to the developer’s main machine for processing; the facial images are then discarded to protect users’ identities. This approach protects the facial identity of users, as it is difficult to reconstruct facial images from the AUs. However, studies that apply this approach report lower accuracy than those that process the facial images directly.

Even though it is difficult to reconstruct faces from the AUs or facial landmarks, the remarkable performance of auto-encoders and generative adversarial networks in image reconstruction could make face reconstruction possible [17, 18]. In addition, sensitive user information could still be extracted from AUs and facial landmarks [19, 20].

2.2 Federated deep learning methods

To prevent access to users’ identities and sensitive information, one possible solution is to employ an FL approach [21]. FL is a technique that allows an ML model to be trained without centrally collecting users’ data. This is achieved by collaboratively training multiple ML models on users’ local machines (local models), where their personal data resides, and sending the trained models (i.e., model weights) back to the developer’s machine (central model) for aggregation. Different ensemble methods can be employed to aggregate the model weights depending on the problem, such as the mean, median, or a weighted average. The central model updates its weights using the aggregated weights and sends the updated weights back to the local models. This process keeps the local training data private and confidential. FL methods have been employed in speech emotion recognition [22], stress level detection using physiological signals [23], and detection of depression using mobile health data [24], but not in Facial Emotion Recognition (FER). To the best of our knowledge, the only study that mentions FL for FER is Chhikara et al. [25]. However, that study neither implements FL nor presents any FL results; it simply mentions FL as a privacy solution for its multi-modal affect recognition approach.

In this paper, we implement and evaluate the performance of two privacy-preserving strategies for predicting the valence and arousal dimensions of emotion, i.e., non-federated processing of anonymised facial features (AUs) and federated processing of facial images, in comparison with the conventional non-federated processing of facial images. We evaluate their performance on the RECOLA affect recognition database.

3 Methodology

In this section, we describe the two privacy-preserving schemes as well as the conventional non-federated scheme that processes facial images. The processing modules for the non-federated and federated processing of facial images use CNNs coupled with recurrent networks to learn the temporal dynamics within the videos (i.e., sequences of facial images), whilst the processing module for the non-federated processing of AUs uses only RNNs on the structured AU sequences extracted from the image sequences.

3.1 Scheme 1: non-federated processing of facial images

In this approach, the developer collects facial images from users and creates a database, as shown in Fig. 2. A deep learning architecture made up of CNNs and recurrent networks (RNNs) with fully connected neurons is then trained end-to-end on the images to predict the valence and arousal dimensions of emotion. The CNNs learn discriminative features in the images, represented as feature maps. The feature maps from the last convolutional layer are flattened, concatenated and fed sequentially to the RNNs, which learn the temporal dynamics in the image sequences with respect to the valence and arousal dimensions of emotion. The output of the last RNN memory cell is passed to a fully connected layer with two output neurons representing the valence and arousal predictions.
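The sketch below illustrates this scheme in PyTorch, assuming a pretrained ResNet18 backbone feeding a bidirectional LSTM; the layer sizes shown are placeholders rather than the tuned values reported in Table 1.

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNRNNAffect(nn.Module):
    """CNN feature extractor followed by an RNN over image sequences (sketch)."""

    def __init__(self, hidden_size=64, num_layers=2):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the final FC layer
        self.rnn = nn.LSTM(input_size=512, hidden_size=hidden_size,
                           num_layers=num_layers, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, 2)  # two outputs: valence and arousal

    def forward(self, x):
        # x: (batch, seq_len, 3, H, W) sequence of facial images
        b, t, c, h, w = x.shape
        feats = self.cnn(x.view(b * t, c, h, w)).flatten(1)  # (b*t, 512) CNN features
        out, _ = self.rnn(feats.view(b, t, -1))              # temporal dynamics over the sequence
        return self.fc(out[:, -1, :])                        # prediction from the last time step
```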

Fig. 2 A non-federated deep learning strategy for affect recognition using facial images

3.2 Scheme 2: non-federated processing of action units

Figure 3 illustrates the non-federated processing of AUs. AUs are extracted locally from users’ facial images using facial landmark detectors (e.g. OpenFace AU [26]), and the images are discarded to protect users’ identities and sensitive information, as the AUs are free of human faces. The AUs are then sent to the developer’s machine for processing. RNNs are employed in the processing module to capture the temporal dynamics among the sequential AUs, and the output of the last RNN memory unit is fed to fully-connected neurons. The fully-connected neurons learn the non-linear relationships between the temporal features and the valence and arousal dimensions of emotion.
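A minimal PyTorch sketch of this processing module is shown below, assuming the 40-dimensional feature vectors described in Sect. 4.1; the hidden size and number of layers are illustrative rather than the tuned values in Table 1.

```python
import torch.nn as nn

class AUAffect(nn.Module):
    """RNN over sequences of anonymised facial features (sketch)."""

    def __init__(self, n_features=40, hidden_size=64, num_layers=2):
        super().__init__()
        self.rnn = nn.GRU(input_size=n_features, hidden_size=hidden_size,
                          num_layers=num_layers, batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden_size, 10),  # fully connected layer of 10 neurons
            nn.ReLU(),
            nn.Linear(10, 2),                # two outputs: valence and arousal
        )

    def forward(self, x):
        # x: (batch, seq_len, n_features) sequence of AU feature vectors
        out, _ = self.rnn(x)
        return self.fc(out[:, -1, :])        # prediction from the last memory unit
```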

Fig. 3 A non-federated deep learning strategy for affect recognition using action units

3.3 Scheme 3: federated processing using facial images

The federated approach processes users’ facial images on their local machines and sends the locally trained models to the central processing module for aggregation, as shown in Fig. 4. The local and central processing machines should share the same model implementation to enable easy aggregation of the trained models. For simplicity, we adopt a mean aggregation strategy in which the weights of the trained models are averaged to form the global weights of the main model. It is important to note that different aggregation strategies can be explored to merge the weights; e.g. the central processing module could maintain n sets of global weights for the n local machines, where each set is a weighted average of the local machines’ weights. We implement CNNs coupled with RNNs to process the images on the local machines, and the CNNs and RNNs are trained together end-to-end.

Training occurs simultaneously across the machines. After each training iteration, the locally trained models (i.e. model weights) are sent to the central processing module for aggregation. The centrally aggregated weights are sent back to the local machines to update their weights for the next training iteration. The local training, central aggregation and local weight update processes are repeated until the training process is completed.
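The sketch below outlines one such training round with mean aggregation. It assumes a hypothetical helper `local_train` that trains a copy of the model on one participant's data and returns its weights; it is a simplified illustration of the scheme in Fig. 4 rather than the exact implementation.

```python
import copy
import torch

def federated_round(global_model, local_loaders, local_train):
    """One FL round: local training, mean aggregation, broadcast (sketch)."""
    local_states = []
    for loader in local_loaders:                   # each loader represents one local machine
        local_model = copy.deepcopy(global_model)  # start from the current global weights
        local_states.append(local_train(local_model, loader))  # returns a trained state_dict

    # Mean aggregation: average each parameter across the locally trained models
    global_state = copy.deepcopy(local_states[0])
    for key in global_state:
        global_state[key] = torch.stack(
            [state[key].float() for state in local_states]).mean(dim=0)

    global_model.load_state_dict(global_state)     # broadcast the aggregated weights back
    return global_model
```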

Fig. 4 A federated learning approach for affect recognition using images

4 Experimental design

This section first describes the RECOLA database and presents the hyper-parameter configurations of the deep learning methods used in the different schemes. It then defines the evaluation metric, the Concordance Correlation Coefficient (CCC), and the evaluation protocol for the experiments.

4.1 RECOLA database

The RECOLA (Remote COLlaborative and Affective interactions) database [27] is a popular and comprehensive affective database with continuous response variables (i.e. valence and arousal). The database consists of video, AU, audio, ECG and EDA datasets for 23 participants. The data were collected during spontaneous and naturalistic interactions between the participants while performing collaborative tasks. The database also contains the ground truth continuous labels for valence and arousal, which range from −1 to +1. The annotations were carried out by six annotators with a step size of 0.04 s. In this paper, we explore the facial images extracted from the RECOLA videos at a frame rate of 25 fps and the AUs extracted from these facial images. A total of 7500 images per participant were extracted, and 15 AUs were extracted from each image, i.e. AUs 1, 2, 4, 5, 6, 7, 9, 11, 12, 15, 17, 20, 23, 24 and 25. In addition, movements of the face in the X-Y-Z directions (i.e. pitch, roll and yaw, respectively), the mean and standard deviation of the optical flow in the region of the face, and the changes of the AUs, facial movements and optical flow statistics from the previous time stamp (delta coefficients) were computed and added to the 15 AUs to produce a total of 40 human-understandable facial features per image.
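For illustration, this feature assembly can be sketched as follows; the array names and shapes are assumptions for exposition, with 15 AUs, 3 head-pose values and 2 optical-flow statistics per frame doubled by their delta coefficients to give 40 features.

```python
import numpy as np

def assemble_frame_features(aus, head_pose, flow_stats):
    """Builds 40-dimensional per-frame feature vectors (sketch).

    aus: (T, 15) AU intensities, head_pose: (T, 3) pitch/roll/yaw,
    flow_stats: (T, 2) mean and standard deviation of the optical flow.
    """
    static = np.concatenate([aus, head_pose, flow_stats], axis=1)  # (T, 20) static features
    deltas = np.diff(static, axis=0, prepend=static[:1])           # change from the previous frame
    return np.concatenate([static, deltas], axis=1)                # (T, 40) features per frame
```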

4.2 Model selection and hyper-parameter configuration

We explore three state-of-the-art RNN models to detect valence and arousal: simple RNNs [3], Bi-directional Gated Recurrent Units (BiGRUs) [28], and Bi-directional Long Short-Term Memory networks (BiLSTMs) [29]. We choose these models due to their remarkable performance in time series and sequential analysis [30]. To process the facial images, we employ shallow residual convolutional networks (i.e. ResNet18) [31] due to their training efficiency (fewer layers compared to other state-of-the-art CNNs) and prediction performance [30]. The ResNets are pre-trained on the ImageNet dataset [32] to take advantage of its large size (transfer learning). We then remove the fully connected layers of the networks and use their output feature maps as inputs to the RNN networks.

When training the networks, we minimise the Mean Squared Error (MSE) between the predicted valence and arousal and their annotated values, using the Adam stochastic gradient descent algorithm, a fast optimiser for deep neural networks. The RNN networks have the following hyper-parameters: learning rate, hidden size, sequence length, number of recurrent layers, and fully connected layers. The learning rate controls how the weights are updated with respect to the estimated error. If the learning rate is very low, the learning process will be slow as the updates are very small; if it is very high, the weight updates will be very large, which can lead to divergence. We train the models using popular learning rates: 0.001, 0.0001 and 0.00001. The hidden size represents the number of hidden units within each recurrent memory cell; we explored hidden sizes of 8, 16, 64, 128, 256 and 512. We also explored AU sequence lengths of 50, 100, 200, 400, 600, 800, 1000 and 2000, and image sequence lengths of 4, 8, 16 and 32. The following numbers of recurrent memory cells (recurrent layers) were evaluated: 1, 2, 4, 6 and 8. Lastly, one fully connected layer of 10 neurons with 2 output neurons for valence and arousal was used in the networks. Table 1 presents the optimal hyper-parameter configurations of the architectures after evaluating the validation loss over the hyper-parameter values listed above.
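A minimal training step under this setup might look as follows; it is a sketch rather than our exact training code, and the optimiser shown in the usage comment uses one learning rate from the grid above, not necessarily the tuned value in Table 1.

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, optimiser):
    """One epoch minimising the MSE between predicted and annotated valence/arousal (sketch)."""
    criterion = nn.MSELoss()
    model.train()
    for inputs, targets in loader:                # targets: (batch, 2) valence and arousal labels
        optimiser.zero_grad()
        loss = criterion(model(inputs), targets)  # MSE over the two emotion dimensions
        loss.backward()
        optimiser.step()

# Example usage: optimiser = torch.optim.Adam(model.parameters(), lr=0.0001)
```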

Table 1 Hyper-parameter configuration of models

4.3 Evaluation metrics

For performance evaluation, we use the Concordance Correlation Coefficient (CCC). CCC measures the agreement between two variables with respect to the 45\(^{\circ }\) line through the origin. Like Pearson’s correlation coefficient, CCC measures how closely two variables are linearly related, but it also quantifies the degree of correspondence (agreement) between them by measuring how well they fit the line passing through the origin with a slope of 1. It is considered more robust than Pearson’s correlation as it measures both co-variation and correspondence. Figure 5 shows two plots (orange and green), both with a Pearson’s correlation coefficient of 1, but the orange plot has a CCC of 1 while the green plot has a CCC of 0.403 due to its disagreement with the 45\(^{\circ }\) line. CCC ranges from −1 to 1, with perfect concordance at 1 and perfect discordance at −1.

CCC is calculated as follows:

$$\begin{aligned} CCC = \frac{2\rho \sigma _x\sigma _y}{\sigma _x^2 + \sigma _y^2 + \left( \mu _x - \mu _y\right) ^2}, \end{aligned}$$

where \(\mu _x\) and \(\mu _y\) are the means of the two variables, \(\sigma _x^2\) and \(\sigma _y^2\) are their corresponding variances, and \(\rho\) is Pearson’s correlation coefficient between them.
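A direct NumPy implementation of this metric is sketched below for clarity.

```python
import numpy as np

def ccc(x, y):
    """Concordance Correlation Coefficient between two 1-D arrays (sketch)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))        # equals rho * sigma_x * sigma_y
    return 2 * cov_xy / (var_x + var_y + (mu_x - mu_y) ** 2)
```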

Fig. 5 An example to compare Pearson’s correlation and CCC

4.4 Evaluation protocol

First, the ground truth valence and arousal values are obtained by averaging the annotations from the six annotators. Second, we employ k-fold cross-validation to evaluate the models. The dataset is split by participant so that no participant appears in both the training and test folds. In our experiments, we select \(\textit{k} = 8\), i.e. the data are split into 8 folds, with each fold consisting of data for 2–3 participants depending on the split. The training process is repeated k times to produce k trained models; during each training process, one fold is left out for evaluating the model and the remaining folds are used for training. The average CCC across the k evaluated models gives the overall performance of the method on the entire dataset. The higher the value of k, the more computationally expensive the training process, but the more robust and accurate the estimate of the model’s performance.

For a more realistic implementation of FL, we treat each participant as a local machine and divide the total training time by the number of participants to represent synchronous local processing. For example, using 8-fold cross-validation on 23 participants, we have data for 20 or 21 participants (local machines) for training, with the data for the remaining 2 or 3 participants kept aside for evaluating the global model. The experiments were run on 4 machines on the University’s remote cluster, with one machine aggregating the results obtained from the remaining three. Each machine had a Graphics Processing Unit (GPU), 4 CPU cores and 6 GB of RAM. Our code is implemented in PyTorch, and each experiment was run for 100 epochs. The code for the implementation of all three schemes can be found in our GitHub repository. It is important to mention that federated learning frameworks exist for PyTorch [33] and TensorFlow [34].
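Such a participant-disjoint split can be obtained, for example, with scikit-learn's GroupKFold, as in the sketch below; the variable names are assumptions for illustration.

```python
from sklearn.model_selection import GroupKFold

def participant_folds(features, labels, participant_ids, k=8):
    """Yields participant-disjoint k-fold train/test index splits (sketch).

    participant_ids assigns every sample to one of the 23 RECOLA participants,
    so no participant contributes data to both the training and test folds.
    """
    splitter = GroupKFold(n_splits=k)
    for train_idx, test_idx in splitter.split(features, labels, groups=participant_ids):
        yield train_idx, test_idx
```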

Table 2 Average CCC for predicting valence and arousal using variations of RNN models on RECOLA datasets (best performance in bold)
Table 3 Model training time, inference time, and size for the best performance RNNs (best performance in bold)
Table 4 Comparison of valence and arousal predictions between our proposed methods and other studies using RECOLA datasets (best performance in bold)

5 Results and discussion

5.1 Comparison of the different schemes

We implemented three state-of-the-art RNN models (i.e., RNN, BiGRU, and BiLSTM) for each scheme and evaluated their performance using CCC coupled with cross-validation on the RECOLA image and AU datasets. Table 2 shows the average CCC for valence and arousal after evaluating the models using the best hyper-parameters shown in Table 1. The bold values represent the best model performance for valence and arousal. Overall, the non-federated processing of facial images shows the best valence and arousal predictions, followed by the federated processing of facial images. The strategies that process facial images outperform the processing of AUs due to the loss of spatial information in the AUs. CNNs coupled with BiLSTMs show the best performance for the non-federated processing of images, with an average CCC of 0.476 for valence and 0.515 for arousal. The processing of AUs shows arousal prediction performance similar to that of the federated processing of images. In addition, we observe that LSTMs outperform GRUs when processing the images, consistent with results from other studies that analyse raw images [30]. However, for AU processing, GRUs show better performance than LSTMs. This is due to the efficiency of GRUs in processing smaller datasets or feature sets compared to LSTMs, as only 40 facial features are extracted by the facial landmark extractor while 512 features are extracted by the convolutional networks.

Table 3 presents the efficiency results of the best performing models in terms of training time, inference time and model size. We observe that processing AUs has the lowest training and inference times due to the smaller feature set (which reduces the complexity of the network) and the absence of a convolutional feature extraction stage. This makes the AU processing modules more suitable for real-time prediction of valence and arousal, for example real-time monitoring of patients to identify apparently aggressive or threatening patients, or non-cooperative patients who may decline care. However, the predictive accuracy of processing AUs is lower than that of the non-federated processing of images for both valence and arousal. The non-federated processing of images shows better accuracy in predicting valence and arousal compared to AUs and FL, at the cost of potentially exposing users’ facial identities. FL best preserves users’ identities and sensitive information compared to the other methods, as data remain on users’ local machines; however, its training time is significantly higher, and would increase further if the processing at the local machines were not done synchronously. Lastly, FL’s CCC results are inferior to those of the non-federated processing of images due to the limited data available at each local machine.

5.2 Comparison with other studies

In Table 4, we compare the performance of our models with other studies that employ machine learning methods on the RECOLA image and AU datasets for affect recognition. For the non-federated processing of facial images, we observe that [10, 11, 13, 14] report better valence recognition results than our model, with Tzirakis et al. [10] having the best valence CCC (0.620). However, our model shows the best arousal accuracy, with a CCC value of 0.514. Those studies also explored different architectures of CNNs coupled with LSTMs; however, their model evaluation strategy (a single train-test split) prevents a comprehensive exploration of the data and may lead to less reliable results.

Furthermore, the processing of AUs and facial landmarks by previous studies [11, 15, 16] shows better CCC results in predicting the valence dimension. Valstar et al. [15] presented the best valence CCC of 0.507 using support vector machines. Our model shows the opposite pattern: its arousal predictions are better than its valence predictions and outperform the arousal accuracy of the other studies (CCC of 0.401). This is due to the remarkable performance of GRUs in processing small feature sets. Moreover, storing the anonymised AUs is more secure in terms of privacy than storing facial images; maintaining a database of facial images requires appropriate security levels and systems to safeguard the data, which can be challenging to implement. There is thus a clear trade-off between efficiency and privacy. Consequently, from a privacy-compliance and data protection perspective, it could be argued that storing facial images may not be necessary when alternative methods are available, and that extracting AUs can be considered both a data anonymisation technique that protects the identities of users and an alternative method for affect recognition.

Lastly, we could not find any study in the literature that explores FL to process facial images for FER. As a result, we present the performance of our FL architecture, which uses CNNs coupled with BiLSTMs, as a benchmark for future research on FL and privacy-preserving deep learning techniques for FER. Despite the privacy benefits of FL and its promising results (best CCC of 0.426) relative to the best results of the non-federated processing of facial images (CCC of 0.620), further research is required before FL provides an acceptable affect recognition solution, as its performance is still less than half of perfect agreement (a CCC of 1).

6 Conclusion

In this paper, we prioritised facial identity protection in facial emotion recognition by presenting two privacy-preserving schemes: (1) a non-federated deep learning approach that processes anonymised facial features (Action Units), and (2) a federated deep learning approach that aggregates models trained locally on facial images. We implemented three variations of RNNs and compared the models’ performance, including against the conventional non-federated processing of images, on the RECOLA database. Using the privacy-preserving schemes, our results show state-of-the-art performance of 0.426 for valence and 0.401 for arousal under the Concordance Correlation Coefficient evaluation metric.

For future work, we plan to improve the performance of these models by combining and fusing data from other modalities while maintaining the privacy-proactive nature of our system and promoting responsible technology through “data protection by design and by default”. For example, acoustic features could be extracted and combined with the AUs, or a federated learning approach could aggregate models trained locally on audio-visual data. We also intend to explore other state-of-the-art computer vision models, such as vision transformers, to improve performance.