1 Introduction

Affective artificial intelligence (AI) technologies are becoming increasingly prevalent [1,2,3,4]. Such technologies can be deployed in various contexts for different purposes, ranging from assistive technology [5] to more personalised user experiences and behaviour manipulation [6]. Facial images are one of the main data sources for developing these technologies [2, 3]. These images capture facial expressions that can be used to build systems that predict the emotional states of individuals. Common categories of emotional states identified from facial expressions are: happy, neutral, sad, surprised, angry, disgusted and anxious [7]. These categories are intuitive and simple (i.e. easy to observe and describe), but fail to represent different degrees of emotion [8]. To describe these degrees, continuous scales with multiple dimensions have been introduced [8], as shown in Fig. 1.

Fig. 1 Two-dimensional description of emotions using continuous scales

In some cases, the use of affective AI raises privacy- and security-related concerns. For example, the development of affective AI technologies relies on the processing of large amounts of facial images, which poses significant privacy and security challenges if users’ facial identities are stolen in the process. Also, with the ubiquitous use of biometric authentication systems, illegally obtained facial images could be exploited to unlock users’ devices, assume their identities, make electronic payments, and access sensitive data. Moreover, McStay [9] noted that data derived from someone’s emotional state may be “intimate” and sensitive, and that this information can easily be linked back to the person using their facial image.

In this work, we implement and evaluate the potential of two privacy-preserving strategies for predicting the valence and arousal dimensions of emotion. The first strategy extracts Action Units (AUs) from users’ facial images on their local machines, discards the images and sends only the AUs to the main processing machine. AUs are anonymised facial features that safeguard the facial identities of users in case of unauthorised access or misuse of data, yet still contain sufficient information for analysis. The second strategy employs a Federated Learning (FL) approach in which raw images are processed on users’ local machines and the locally trained models are sent to the main processing machine for aggregation. The main contributions of this study are:

  • An application of federated learning strategy for affect recognition using facial images to protect users’ facial identities.

  • A comprehensive analysis of affective computing using non-federated processing of facial images, non-federated processing of anonymised facial features (AUs) and federated processing of facial images using deep learning architectures.

2 Related works

In this section, we first review the literature on the processing of facial images and AUs using non-federated deep learning strategies. We then outline the works that explore an FL approach to protect users’ identities while detecting emotions.

2.1 Non-federated deep learning methods

Different non-federated deep learning architectures have been successfully used to process facial images and predict emotional states [10,11,12,13]. For example, Tzirakis et al. [10] reported the best valence and arousal recognition performance after training Convolutional Neural Networks (CNNs) coupled with Long Short-Term Memory networks (LSTMs) on images from the RECOLA database, while Lee et al. [13] combined features extracted from RECOLA images using 3D CNNs with spatio-temporal features extracted using Convolutional LSTMs to predict valence scores. These non-federated CNN approaches require the developer to maintain a database of facial images for processing and, as such, could be susceptible to privacy and security issues if the facial images are accessed by malicious users or organisations.

To protect participants’ facial identities and still develop accurate affective AI technologies, researchers have explored AUs and facial landmarks extracted from facial expressions in images [11, 14,15,16]. AUs represent human-observable facial muscle movements whose intensities are estimated from facial landmarks. For example, AUs 12 (raising lip corners), 15 (lowering lip corners) and 20 (lip stretch) can be estimated using the facial landmarks on the lips. In typical privacy-preserving non-federated AU approaches, AUs are extracted from users’ facial images on their local machines and sent to the developer’s main machine for processing; the facial images are then discarded to protect users’ identities. This approach protects the facial identity of users, as it is difficult to reconstruct facial images from the AUs. However, studies that apply this approach report lower accuracy than those that process the facial images directly.

Even though it is difficult to reconstruct faces from the AUs or facial landmarks, the remarkable performance of auto-encoders and generative adversarial networks in image reconstruction could make face reconstruction possible [17, 18]. In addition, sensitive user information could still be extracted from AUs and facial landmarks [19, 20].

2.2 Federated deep learning methods

To prevent access to users’ identities and sensitive information, one possible solution is to employ an FL approach [21]. FL is a technique that allows an ML model to be trained without centrally collecting users’ data. This is achieved by collaboratively training multiple ML models on users’ local machines (local models), where their personal data resides, and sending the trained models (i.e., model weights) back to the developer’s machine (central model) for aggregation. Different ensemble methods can be employed to aggregate the model weights depending on the problem, such as the mean, median, or a weighted average. The central model updates its weights using the aggregated weights and sends the updated weights back to the local models. This process keeps the local training data private and confidential. FL methods have been employed in speech emotion recognition [22], stress level detection using physiological signals [23], and detection of depression using mobile health data [24], but not in Facial Emotion Recognition (FER). To the best of our knowledge, the only study that mentions FL for FER is Chhikara et al. [25]. However, that study neither implements FL nor presents any FL results; it simply mentions FL as a privacy solution for its multi-modal affect recognition approach.

In this paper, we implement and evaluate the performance of two privacy-preserving strategies for predicting the valence and arousal dimensions of emotion, i.e., non-federated processing of anonymised facial features (AUs) and federated processing of facial images, in comparison with the conventional non-federated processing of facial images. We evaluate their performance on the RECOLA affect recognition database.

3 Methodology

In this section, we describe the two privacy-preserving schemes as well as the conventional non-federated scheme that processes facial images. The processing modules for the non-federated and federated processing of facial images use CNNs coupled with recurrent networks to learn the temporal dynamics within the videos (i.e., sequences of facial images), whilst the processing module for the non-federated processing of AUs uses only RNNs on the structured AU sequences extracted from the image sequences.

3.1 Scheme 1: non-federated processing of facial images

In this approach, the developer collects facial images from users and creates a database, as shown in Fig. 2. A deep learning architecture made up of CNNs and recurrent networks (RNNs) with fully connected neurons is then trained end-to-end on the images to predict the valence and arousal dimensions of emotion. The CNNs learn discriminative features in the images, represented as feature maps. The feature maps from the last convolutional layer are flattened, concatenated and fed sequentially to the RNNs, which learn the temporal dynamics in the image sequences with respect to the valence and arousal dimensions of emotion. The output of the last RNN memory cell is passed to a fully connected layer with two output neurons representing the valence and arousal predictions.
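The sketch below illustrates this scheme in PyTorch, assuming a pretrained ResNet18 backbone feeding a bidirectional LSTM; the layer sizes shown are placeholders rather than the tuned values reported in Table 1.

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNRNNAffect(nn.Module):
    """CNN feature extractor followed by an RNN over image sequences (sketch)."""

    def __init__(self, hidden_size=64, num_layers=2):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the final FC layer
        self.rnn = nn.LSTM(input_size=512, hidden_size=hidden_size,
                           num_layers=num_layers, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_size, 2)  # two outputs: valence and arousal

    def forward(self, x):
        # x: (batch, seq_len, 3, H, W) sequence of facial images
        b, t, c, h, w = x.shape
        feats = self.cnn(x.view(b * t, c, h, w)).flatten(1)  # (b*t, 512) CNN features
        out, _ = self.rnn(feats.view(b, t, -1))              # temporal dynamics over the sequence
        return self.fc(out[:, -1, :])                        # prediction from the last time step
```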

Fig. 2 A non-federated deep learning strategy for affect recognition using facial images

3.2 Scheme 2: non-federated processing of action units

Figure 3 illustrates the non-federated processing of AUs. AUs are extracted locally from users’ facial images using facial landmark detectors (e.g. OpenFace AU [26]), and the images are discarded to protect users’ identities and sensitive information, as the AUs are free of human faces. The AUs are then sent to the developer’s machine for processing. RNNs are employed in the processing module to capture the temporal dynamics among the sequential AUs, and the output of the last RNN memory unit is fed to fully-connected neurons. The fully-connected neurons learn the non-linear relationships between the temporal features and the valence and arousal dimensions of emotion.
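A minimal PyTorch sketch of this processing module is shown below, assuming the 40-dimensional feature vectors described in Sect. 4.1; the hidden size and number of layers are illustrative rather than the tuned values in Table 1.

```python
import torch.nn as nn

class AUAffect(nn.Module):
    """RNN over sequences of anonymised facial features (sketch)."""

    def __init__(self, n_features=40, hidden_size=64, num_layers=2):
        super().__init__()
        self.rnn = nn.GRU(input_size=n_features, hidden_size=hidden_size,
                          num_layers=num_layers, batch_first=True, bidirectional=True)
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden_size, 10),  # fully connected layer of 10 neurons
            nn.ReLU(),
            nn.Linear(10, 2),                # two outputs: valence and arousal
        )

    def forward(self, x):
        # x: (batch, seq_len, n_features) sequence of AU feature vectors
        out, _ = self.rnn(x)
        return self.fc(out[:, -1, :])        # prediction from the last memory unit
```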

Fig. 3 A non-federated deep learning strategy for affect recognition using action units

3.3 Scheme 3: federated processing using facial images

The federated approach processes users’ facial images on their local machines and sends the locally trained models to the central processing module for aggregation, as shown in Fig. 4. The local and central processing machines should share the same model implementation to enable easy aggregation of the trained models. For simplicity, we adopt a mean aggregation strategy in which the weights of the trained models are averaged to form the global weights of the main model. It is important to note that different aggregation strategies can be explored to merge the weights; e.g. the central processing module could maintain n sets of global weights for the n local machines, where each set is a weighted average of the local machines’ weights. We implement CNNs coupled with RNNs to process the images on the local machines, and the CNNs and RNNs are trained together end-to-end.

Training occurs simultaneously across the machines. After each training iteration, the locally trained models (i.e. model weights) are sent to the central processing module for aggregation. The centrally aggregated weights are sent back to the local machines to update their weights for the next training iteration. The local training, central aggregation and local weight update processes are repeated until the training process is completed.
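The sketch below outlines one such training round with mean aggregation. It assumes a hypothetical helper `local_train` that trains a copy of the model on one participant's data and returns its weights; it is a simplified illustration of the scheme in Fig. 4 rather than the exact implementation.

```python
import copy
import torch

def federated_round(global_model, local_loaders, local_train):
    """One FL round: local training, mean aggregation, broadcast (sketch)."""
    local_states = []
    for loader in local_loaders:                   # each loader represents one local machine
        local_model = copy.deepcopy(global_model)  # start from the current global weights
        local_states.append(local_train(local_model, loader))  # returns a trained state_dict

    # Mean aggregation: average each parameter across the locally trained models
    global_state = copy.deepcopy(local_states[0])
    for key in global_state:
        global_state[key] = torch.stack(
            [state[key].float() for state in local_states]).mean(dim=0)

    global_model.load_state_dict(global_state)     # broadcast the aggregated weights back
    return global_model
```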

Fig. 4 A federated learning approach for affect recognition using images

4 Experimental design

This section first describes the RECOLA database and presents the hyper-parameter configurations of the deep learning methods used in the different schemes. It then defines the evaluation metric, the Concordance Correlation Coefficient (CCC), and the evaluation protocol for the experiments.

4.1 RECOLA database

The RECOLA (Remote COLlaborative and Affective interactions) database [27] is a popular and comprehensive affective database with continuous response variables (i.e. valence and arousal). The database consists of video, AU, audio, ECG and EDA datasets for 23 participants. The data were collected during spontaneous and naturalistic interactions between the participants while performing collaborative tasks. The database also contains the ground truth continuous labels for valence and arousal, which range from −1 to +1. The annotations were carried out by six annotators with a step size of 0.04 s. In this paper, we explore the facial images extracted from the RECOLA videos at a frame rate of 25 fps and the AUs extracted from these facial images. A total of 7500 images per participant were extracted, and 15 AUs were extracted from each image, i.e. AUs 1, 2, 4, 5, 6, 7, 9, 11, 12, 15, 17, 20, 23, 24 and 25. In addition, movements of the face in the X-Y-Z directions (i.e. pitch, roll and yaw, respectively), the mean and standard deviation of the optical flow in the region of the face, and the changes of the AUs, facial movements and optical flow statistics from the previous time stamp (delta coefficients) were computed and added to the 15 AUs to produce a total of 40 human-understandable facial features per image.
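For illustration, this feature assembly can be sketched as follows; the array names and shapes are assumptions for exposition, with 15 AUs, 3 head-pose values and 2 optical-flow statistics per frame doubled by their delta coefficients to give 40 features.

```python
import numpy as np

def assemble_frame_features(aus, head_pose, flow_stats):
    """Builds 40-dimensional per-frame feature vectors (sketch).

    aus: (T, 15) AU intensities, head_pose: (T, 3) pitch/roll/yaw,
    flow_stats: (T, 2) mean and standard deviation of the optical flow.
    """
    static = np.concatenate([aus, head_pose, flow_stats], axis=1)  # (T, 20) static features
    deltas = np.diff(static, axis=0, prepend=static[:1])           # change from the previous frame
    return np.concatenate([static, deltas], axis=1)                # (T, 40) features per frame
```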

4.2 Model selection and hyper-parameter configuration

We explore three state-of-the-art RNN models to detect valence and arousal: simple RNNs [3], Bi-directional Gated Recurrent Units (BiGRUs) [28], and Bi-directional Long Short-Term Memory networks (BiLSTMs) [29]. We choose these models due to their remarkable performance in time series and sequential analysis [30]. To process the facial images, we employ shallow residual convolutional networks (i.e. ResNet18) [31] due to their training efficiency (fewer layers compared to other state-of-the-art CNNs) and prediction performance [30]. The ResNets are pre-trained on the ImageNet dataset [32] to take advantage of its large size (transfer learning). We then remove the fully connected layers of the networks and use their output feature maps as inputs to the RNN networks.

When training the networks, we minimise the Mean Squared Error (MSE) between the predicted valence and arousal and their annotated values, using the Adam stochastic gradient descent algorithm, a fast optimiser for deep neural networks. The RNN networks have the following hyper-parameters: learning rate, hidden size, sequence length, number of recurrent layers, and fully connected layers. The learning rate controls how the weights are updated with respect to the estimated error. If the learning rate is very low, the learning process will be slow as the updates are very small; if it is very high, the weight updates will be very large, which can lead to divergence. We train the models using popular learning rates: 0.001, 0.0001 and 0.00001. The hidden size represents the number of hidden units within each recurrent memory cell; we explored hidden sizes of 8, 16, 64, 128, 256 and 512. We also explored AU sequence lengths of 50, 100, 200, 400, 600, 800, 1000 and 2000, and image sequence lengths of 4, 8, 16 and 32. The following numbers of recurrent memory cells (recurrent layers) were evaluated: 1, 2, 4, 6 and 8. Lastly, one fully connected layer of 10 neurons with 2 output neurons for valence and arousal was used in the networks. Table 1 presents the optimal hyper-parameter configurations of the architectures after evaluating the validation loss over the hyper-parameter values listed above.
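A minimal training step under this setup might look as follows; it is a sketch rather than our exact training code, and the optimiser shown in the usage comment uses one learning rate from the grid above, not necessarily the tuned value in Table 1.

```python
import torch
import torch.nn as nn

def train_epoch(model, loader, optimiser):
    """One epoch minimising the MSE between predicted and annotated valence/arousal (sketch)."""
    criterion = nn.MSELoss()
    model.train()
    for inputs, targets in loader:                # targets: (batch, 2) valence and arousal labels
        optimiser.zero_grad()
        loss = criterion(model(inputs), targets)  # MSE over the two emotion dimensions
        loss.backward()
        optimiser.step()

# Example usage: optimiser = torch.optim.Adam(model.parameters(), lr=0.0001)
```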

Table 1 Hyper-parameter configuration of models

4.3 Evaluation metrics

For performance evaluation, we use the Concordance Correlation Coefficient (CCC). CCC measures the agreement between two variables with respect to the 45\(^{\circ }\) line through the origin. Like Pearson’s correlation coefficient, CCC measures how closely two variables are linearly related, but it also quantifies the degree of correspondence (agreement) between them by measuring how well they fit the line passing through the origin with a slope of 1. It is considered more robust than Pearson’s correlation as it measures both co-variation and correspondence. Figure 5 shows two plots (orange and green), both with a Pearson’s correlation coefficient of 1, but the orange plot has a CCC of 1 while the green plot has a CCC of 0.403 due to its disagreement with the 45\(^{\circ }\) line. CCC ranges from −1 to 1, with perfect concordance at 1 and perfect discordance at −1.

CCC is calculated as follows:

$$\begin{aligned} CCC = \frac{2\rho \sigma _x\sigma _y}{\sigma _x^2 + \sigma _y^2 + \left( \mu _x - \mu _y\right) ^2}, \end{aligned}$$

where \(\mu _x\) and \(\mu _y\) are the means of the two variables, \(\sigma _x^2\) and \(\sigma _y^2\) are their corresponding variances, and \(\rho\) is Pearson’s correlation coefficient between them.
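A direct NumPy implementation of this metric is sketched below for clarity.

```python
import numpy as np

def ccc(x, y):
    """Concordance Correlation Coefficient between two 1-D arrays (sketch)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))        # equals rho * sigma_x * sigma_y
    return 2 * cov_xy / (var_x + var_y + (mu_x - mu_y) ** 2)
```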

Fig. 5 An example to compare Pearson’s correlation and CCC

4.4 Evaluation protocol

First, the ground truth valence and arousal values are obtained by averaging the annotations from the six annotators. Second, we employ k-fold cross-validation to evaluate the models. The dataset is split by participant so that no participant appears in both the training and test folds. In our experiments, we select \(\textit{k} = 8\), i.e. the data are split into 8 folds, with each fold consisting of data for 2–3 participants depending on the split. The training process is repeated k times to produce k trained models; during each training process, one fold is left out for evaluating the model and the remaining folds are used for training. The average CCC across the k evaluated models gives the overall performance of the method on the entire dataset. The higher the value of k, the more computationally expensive the training process, but the more robust and accurate the estimate of the model’s performance.

For a more realistic implementation of FL, we treat each participant as a local machine and divide the total training time by the number of participants to represent synchronous local processing. For example, using 8-fold cross-validation on 23 participants, we have data for 20 or 21 participants (local machines) for training, with the data for the remaining 2 or 3 participants kept aside for evaluating the global model. The experiments were run on 4 machines on the University’s remote cluster, with one machine aggregating the results obtained from the remaining three. Each machine had a Graphics Processing Unit (GPU), 4 CPU cores and 6 GB of RAM. Our code is implemented in PyTorch, and each experiment was run for 100 epochs. The code for the implementation of all three schemes can be found in our GitHub repository. It is important to mention that federated learning frameworks exist for PyTorch [33] and TensorFlow [34].
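Such a participant-disjoint split can be obtained, for example, with scikit-learn's GroupKFold, as in the sketch below; the variable names are assumptions for illustration.

```python
from sklearn.model_selection import GroupKFold

def participant_folds(features, labels, participant_ids, k=8):
    """Yields participant-disjoint k-fold train/test index splits (sketch).

    participant_ids assigns every sample to one of the 23 RECOLA participants,
    so no participant contributes data to both the training and test folds.
    """
    splitter = GroupKFold(n_splits=k)
    for train_idx, test_idx in splitter.split(features, labels, groups=participant_ids):
        yield train_idx, test_idx
```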

Table 2 Average CCC for predicting valence and arousal using variations of RNN models on RECOLA datasets (best performance in bold)
Table 3 Model training time, inference time, and size for the best performance RNNs (best performance in bold)
Table 4 Comparison of valence and arousal predictions between our proposed methods and other studies using RECOLA datasets (best performance in bold)

5 Results and discussion

5.1 Comparison of the different schemes

We implemented three state-of-the-art RNN models (i.e., RNN, BiGRU, and BiLSTM) for each scheme and evaluated their performance using CCC coupled with cross-validation on the RECOLA image and AU datasets. Table 2 shows the average CCC for valence and arousal after evaluating the models using the best hyper-parameters shown in Table 1. The bold values represent the best model performance for valence and arousal. Overall, the non-federated processing of facial images shows the best valence and arousal predictions, followed by the federated processing of facial images. The strategies that process facial images outperform the processing of AUs due to the loss of spatial information in the AUs. CNNs coupled with BiLSTMs show the best performance for the non-federated processing of images, with an average CCC of 0.476 for valence and 0.515 for arousal. The processing of AUs shows arousal prediction performance similar to that of the federated processing of images. In addition, we observe that LSTMs outperform GRUs when processing the images, consistent with results from other studies that analyse raw images [30]. However, for AU processing, GRUs show better performance than LSTMs. This is due to the efficiency of GRUs in processing smaller datasets or feature sets compared to LSTMs, as only 40 facial features are extracted by the facial landmark extractor while 512 features are extracted by the convolutional networks.

Table 3 presents the efficiency results of the best performing models in terms of training time, inference time and model size. We observe that processing AUs has the lowest training and inference times due to the smaller feature set (which reduces the complexity of the network) and the absence of a convolutional feature extraction stage. This makes the AU processing modules more suitable for real-time prediction of valence and arousal, for example real-time monitoring of patients to identify apparently aggressive or threatening patients, or non-cooperative patients who may decline care. However, the predictive accuracy of processing AUs is lower than that of the non-federated processing of images for both valence and arousal. The non-federated processing of images shows better accuracy in predicting valence and arousal compared to AUs and FL, at the cost of potentially exposing users’ facial identities. FL best preserves users’ identities and sensitive information compared to the other methods, as data remain on users’ local machines; however, its training time is significantly higher, and would increase further if the processing at the local machines were not done synchronously. Lastly, FL’s CCC results are inferior to those of the non-federated processing of images due to the limited data available at each local machine.

5.2 Comparison with other studies

In Table 4, we compare the performance of our models with other studies that employ machine learning methods on the RECOLA image and AU datasets for affect recognition. For the non-federated processing of facial images, we observe that [10, 11, 13, 14] report better valence recognition results than our model, with Tzirakis et al. [10] having the best valence CCC (0.620). However, our model shows the best arousal accuracy, with a CCC value of 0.514. Those studies also explored different architectures of CNNs coupled with LSTMs; however, their model evaluation strategy (a single train-test split) prevents a comprehensive exploration of the data and may lead to less reliable results.

Furthermore, the processing of AUs and facial landmarks by previous studies [11, 15, 16] shows better CCC results in predicting the valence dimension. Valstar et al. [15] presented the best valence CCC of 0.507 using support vector machines. Our model shows the opposite pattern: its arousal predictions are better than its valence predictions and outperform the arousal accuracy of the other studies (CCC of 0.401). This is due to the remarkable performance of GRUs in processing small feature sets. Moreover, storing the anonymised AUs is more secure in terms of privacy than storing facial images; maintaining a database of facial images requires appropriate security levels and systems to safeguard the data, which can be challenging to implement. There is thus a clear trade-off between efficiency and privacy. Consequently, from a privacy-compliance and data protection perspective, it could be argued that storing facial images may not be necessary when alternative methods are available, and that extracting AUs can be considered both a data anonymisation technique that protects the identities of users and an alternative method for affect recognition.

Lastly, we could not find any study in the literature that explores FL to process facial images for FER. As a result, we present the performance of our FL architecture, which uses CNNs coupled with BiLSTMs, as a benchmark for future research on FL and privacy-preserving deep learning techniques for FER. Despite the privacy benefits of FL and its promising results (best CCC of 0.426) relative to the best results of the non-federated processing of facial images (CCC of 0.620), further research is required before FL provides an acceptable affect recognition solution, as its performance is still less than half of perfect agreement (a CCC of 1).

6 Conclusion

In this paper, we prioritised facial identity protection in facial emotion recognition by presenting two privacy-preserving schemes: (1) a non-federated deep learning approach that processes anonymised facial features (Action Units), and (2) a federated deep learning approach that aggregates models trained locally on facial images. We implemented three variations of RNNs and compared the models’ performance, including against the conventional non-federated processing of images, on the RECOLA database. Using the privacy-preserving schemes, our results show state-of-the-art performance of 0.426 for valence and 0.401 for arousal under the Concordance Correlation Coefficient evaluation metric.

For future work, we plan to improve the performance of these models by combining and fusing data from other modalities while maintaining the privacy-proactive nature of our system and promoting responsible technology through “data protection by design and by default”. For example, acoustic features could be extracted and combined with the AUs, or a federated learning approach could aggregate models trained locally on audio-visual data. We also intend to explore other state-of-the-art computer vision models, such as vision transformers, to improve performance.