1 Introduction

Robots are no longer just machines used in factories and industrial settings. There is a growing demand for robots that share space with humans, whether as collaborative or assistive robots [35, 63]. Robots are now increasingly deployed in a variety of domains, as receptionists [120], educational tutors [49, 59], household supporters [111] and caretakers [25, 49, 67, 125]. These social robots therefore need to interact effectively with humans, both verbally and non-verbally. Facial expressions are non-verbal signals that can be used to indicate one’s current status in a conversation, e.g., via backchanneling or rapport [3, 31].

Fig. 1 Publications on emotion recognition of human faces during HRI and generation of facial expressions of robots

Perceived sociability is an important aspect of human–robot interaction (HRI), and users want robots to behave in a friendly and emotionally intelligent manner [28, 48, 99, 105]. For social robots to be more anthropomorphic, and for HRI to resemble human–human interaction (HHI), robots need to understand human emotions and respond to them appropriately. Stock and Merkle show that emotional expressions of anthropomorphic robots are becoming increasingly important in business settings as well [118, 121]. The authors of [119] emphasize that robotic emotions are particularly important for user acceptance of a robot. Thus, emotions are pivotal for HRI [122]. In any interaction, 7% of the affective information is conveyed through words, 38% through tone of voice, and 55% through facial expressions [92]. This makes facial expressions an indispensable mode of affective communication. Accordingly, numerous studies have examined facial expressions of emotions during HRI [e.g. 2, 8, 15, 17, 18, 19, 33, 38, 50, 81, 91, 110, 116].

In any HHI, human beings first infer the emotional state of the other person and then generate facial expressions in response. The generated emotion can be the result of parallel empathy (feeling the same emotion as the peer) or reactive empathy (generating an emotion in response to the peer’s emotion) [26]. Similarly, in HRI, we would like to study both robots recognizing human emotions and robots generating their own emotions in response to human emotions.

The number of publications on facial expressions in HRI has grown steadily between 2000 and 2020 (see Fig. 1). Thus, the overarching research question is: What has been done so far on facial emotion expressions in human–robot interaction, and what still needs to be done?

Fig. 2 Flowchart of the literature screening process

Fig. 3 Framework of the overview

In Sect. 2, the framework of the overview is outlined, followed by the method used to select studies in Sect. 3. Recognition of human facial expressions and generation of facial expressions by robots are covered in Sects. 4 and 5, respectively. The current state of the art and future research are discussed in Sect. 6, and the conclusion follows in Sect. 7.

2 Framework of the Overview

This overview focuses on two aspects: (1) the recognition of human facial expressions and (2) the generation of facial expressions by robots. The review framework (Fig. 3) is based on these two streams. (1) Recognition of human facial expressions is further subdivided according to whether recognition takes place on (a) a predefined dataset or (b) in real-time. (2) Generation of facial expressions by robots is subdivided according to whether generation is (a) hand-coded or (b) automated, i.e., whether the robot’s facial features (eyes, mouth) are moved by hand-coded commands or learned automatically using machine learning techniques.

3 Method

Studies published between 2000 and 2020 were retrieved from Google Scholar using the keywords “facial expression recognition AND human–robot interaction / HRI”, “facial expression recognition” and “facial expression generation AND human–robot interaction / HRI”.

This overview excludes studies that use voice or body gestures as the only modality for emotional expression without involving facial expressions. Studies involving HRI with participants with conditions such as autism are also excluded, as are studies that address only a single emotion, such as smile recognition or the generation of an angry expression alone. In total, 175 of 276 studies were rejected (Fig. 2).

Fig. 4 Process of facial expression recognition in machine learning (adapted from Canedo and Neves [13])

Fig. 5 Process of facial expression recognition in deep learning (adapted from Li and Deng [71])

Table 3 lists various studies on facial expression recognition. For predefined datasets, only studies achieving a recognition accuracy greater than 90% are included. For real-time facial expression recognition, all studies that perform facial expression recognition in a human–robot interaction scenario are listed.

4 Recognition of Human Facial Expressions

Traditionally, facial expression recognition (FER) consisted of the following steps: face detection, image pre-processing, extraction of important features and classification of the expression (Fig. 4). Since deep learning algorithms have become popular, the pre-processed image is instead fed directly into a deep network (such as a CNN or RNN) to predict the output [71] (Fig. 5).

Among the machine learning approaches, the Viola–Jones algorithm and OpenCV were popular choices for face detection, although the dlib face detector and the AdaBoost algorithm were also used. To pre-process the images, greyscale conversion, image normalization and image augmentation (such as flipping, zooming and rotation) were usually applied. Further, some studies extract the important facial regions such as the eyebrows, eyes, nose and mouth, whose movements relate to the facial action units (AUs) that play an important role in FER; others use local binary patterns (LBP) or histograms of oriented gradients (HOG) to extract featural information. Finally, the classification is performed. Most studies classify the six universally recognized emotions (happiness, sadness, disgust, anger, fear and surprise) and sometimes include a neutral expression. For the final classification, k-Nearest Neighbors (KNN), Hidden Markov Models (HMM), Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), Support Vector Machines (SVM) and Long Short-Term Memory (LSTM) networks are used.
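
As an illustration of such a classical pipeline, the following minimal sketch uses OpenCV’s bundled Haar cascade for face detection, histogram equalization for pre-processing, HOG features and an SVM classifier; the training data, file paths and parameter values are placeholders rather than settings from any reviewed study.

```python
# Minimal sketch of a classical FER pipeline: face detection -> pre-processing
# -> HOG feature extraction -> SVM classification. Data and parameters are illustrative.
import cv2
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

# Viola-Jones-style detector shipped with OpenCV
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_features(bgr_image):
    """Detect the largest face, normalize it, and return its HOG descriptor."""
    gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])    # keep the largest detection
    face = cv2.resize(gray[y:y + h, x:x + w], (64, 64))
    face = cv2.equalizeHist(face)                          # simple illumination normalization
    return hog(face, orientations=8, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

# With a labelled dataset (e.g. face crops and emotion labels), training and prediction would be:
# clf = SVC(kernel="rbf", C=10.0).fit(X_train, y_train)
# predicted_label = clf.predict([extract_features(test_image)])[0]
```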

Among the deep learning approaches, the input images are first pre-processed by face alignment, data augmentation and normalization. The images are then fed directly into deep networks such as CNNs and RNNs, which predict the emotion. The most commonly used classification methods are explained in more detail below, arranged in the order in which they were invented.
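
As a hedged illustration of such an end-to-end network (not an architecture from any reviewed study), the sketch below defines a small CNN in PyTorch that maps 48 × 48 grayscale face crops to seven emotion classes; the layer sizes are arbitrary.

```python
# Illustrative end-to-end CNN for FER on 48x48 grayscale face crops (7 emotion classes).
import torch
import torch.nn as nn

class FerCNN(nn.Module):
    def __init__(self, num_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 48 -> 24
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 24 -> 12
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 12 -> 6
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(128 * 6 * 6, 256), nn.ReLU(),
            nn.Dropout(0.5), nn.Linear(256, num_classes),
        )

    def forward(self, x):                        # x: (batch, 1, 48, 48)
        return self.classifier(self.features(x))

model = FerCNN()
logits = model(torch.randn(8, 1, 48, 48))        # dummy batch of pre-processed face crops
probs = torch.softmax(logits, dim=1)             # per-emotion probabilities
```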

KNN: The nearest-neighbor classifier was first introduced in the 1950s [37]. In KNN [57], given the training instances and their class labels, the class label of an unknown instance is predicted from its nearest neighbors. KNN is based on a distance function that measures the difference between two instances. While Euclidean distance is used most often, other distance measures such as the Hamming distance can also be applied.
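
A brief, hedged illustration with scikit-learn; the feature vectors and labels are random placeholders standing in for extracted face features.

```python
# KNN over face feature vectors (e.g. HOG descriptors); labels index into an emotion list.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.random.rand(100, 128)           # placeholder feature vectors
y_train = np.random.randint(0, 7, size=100)  # placeholder emotion labels
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X_train, y_train)
print(knn.predict(np.random.rand(1, 128)))   # majority label of the 5 nearest neighbours
```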

HMM: The HMM [104] was introduced in the late 1960s. It is a doubly embedded stochastic process: a hidden stochastic process (a Markov chain) that is only observable through another stochastic process producing a sequence of observations. The most likely hidden state sequence can be recovered with the Viterbi algorithm, and the model parameters can be learned with an Expectation-Maximization (EM) algorithm.
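
For illustration, a minimal Viterbi decoder for a discrete HMM is sketched below; the states, observation symbols and probabilities are invented and not taken from any FER study.

```python
# Minimal Viterbi decoder: recovers the most likely hidden state sequence given
# start, transition and emission probabilities (all values illustrative).
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    n_states, T = trans_p.shape[0], len(obs)
    log_delta = np.full((T, n_states), -np.inf)
    backptr = np.zeros((T, n_states), dtype=int)
    log_delta[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
    for t in range(1, T):
        for j in range(n_states):
            scores = log_delta[t - 1] + np.log(trans_p[:, j])
            backptr[t, j] = np.argmax(scores)
            log_delta[t, j] = scores[backptr[t, j]] + np.log(emit_p[j, obs[t]])
    path = [int(np.argmax(log_delta[-1]))]        # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Two hidden states, three observation symbols (invented probabilities):
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2, 2], start, trans, emit))  # most likely hidden state sequence
```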

RNN: The RNN [78] was introduced in the 1980s. An RNN is a feed-forward network extended with recurrent edges between adjacent time steps, which introduces a notion of time. Hence, RNNs are mainly used for dynamic input data with a temporal sequence. In an RNN, the state depends on the current input as well as on the state of the network at the previous time step, which makes it possible to retain information over a long time window.

CNN: Convolutional networks [70] were invented in 1989 [69]. CNNs are trainable architectures composed of multiple stages; the input and output of each stage are sets of arrays called feature maps. Each stage consists of three layers: a filter bank layer, a non-linearity layer and a feature pooling layer. The network is trained using backpropagation. CNNs are used for end-to-end recognition, where the output is predicted directly from the input image, and also as feature extractors whose outputs feed further layers such as an LSTM or RNN for the prediction.

SVM: The SVM was invented by Vapnik [128]. An SVM [129] finds a hyperplane that separates the training data with maximum margin; kernel functions allow it to handle data that are not linearly separable.

LSTM: The LSTM [39] was invented by Hochreiter and Schmidhuber [47]. Like the RNN it has recurrent connections, but unlike a plain RNN it is capable of learning long-term dependencies.
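
The combination mentioned above, a CNN as per-frame feature extractor followed by an LSTM over the frame sequence, can be sketched as follows; this is a hedged, illustrative architecture for video-based FER, not a model from any reviewed study.

```python
# Illustrative CNN + LSTM for video-based FER: a CNN encodes each frame,
# an LSTM aggregates the sequence, and a linear layer predicts the emotion.
import torch
import torch.nn as nn

class CnnLstmFER(nn.Module):
    def __init__(self, num_classes: int = 7, feat_dim: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, 64, batch_first=True)
        self.head = nn.Linear(64, num_classes)

    def forward(self, clips):                    # clips: (batch, time, 1, 48, 48)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.reshape(b * t, *clips.shape[2:])).reshape(b, t, -1)
        _, (h_n, _) = self.lstm(feats)           # final hidden state summarizes the clip
        return self.head(h_n[-1])                # logits per clip

logits = CnnLstmFER()(torch.randn(4, 16, 1, 48, 48))   # 4 dummy clips of 16 frames each
```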

Table 1 summarizes the main purpose, application areas, advantages, disadvantages and frequency of use of the commonly used algorithms. For the frequency of use, only papers that implement facial expression recognition during HRI or in real-time scenarios were counted. Although RNNs have not been used for facial expression recognition during HRI or in real-time, some studies use them for facial expression recognition on predefined datasets.

In total, 27 studies on facial expression recognition during HRI were reviewed. Some of them were not performed on a robot platform; these perform emotion recognition in real-time and mention HRI as their intended application. The studies are summarized in Table 2. Studies that perform facial expression recognition only on predefined datasets, or not in real-time, were not included.

Table 1 Details about the commonly used algorithms
Table 2 Detailed information about studies on emotion recognition and Human Robot Interaction (HRI)
Table 3 Studies on FER; Note: Studies listed according to accuracy level

4.1 FER on Predefined Dataset

Fig. 6 Facial expression generation techniques

Although the focus of this overview is FER in real-time and during HRI, the studies on real-time FER are compared with FER on predefined datasets. FER has been carried out both on static human images and on dynamic video clips. Datcu and Rothkrantz [24] show that there is an advantage in using data from video frames over still images, because videos contain temporal information that still images lack.

Results of studies achieving above 90% accuracy in FER on still images are summarized in Table 3a and on videos in Table 3b; both serve as a comparison for Table 3c. Studies are arranged according to their accuracy level. It should be noted that these studies were carried out on predefined datasets of human images and videos and do not involve robots. A considerable number of studies achieve accuracy greater than 90% on the CK+, JAFFE and Oulu-CASIA datasets, on both still images and videos.

4.2 FER in Real-Time

It is easier to achieve high accuracy when performing emotion recognition on predefined datasets, as they are recorded under controlled environmental conditions. It is much harder to achieve the same level of accuracy in real-time, when the movements are spontaneous. It should be noted that even the studies performing facial expression recognition in real-time were carried out under controlled laboratory conditions, with little variation in lighting and head pose.

As this overview is about facial expressions in HRI, and a robot must recognize emotions as they occur, emotion recognition has to be performed in real-time. Table 3c lists studies with real-time facial expression recognition for HRI. Here, the accuracies are comparatively lower than those on predefined datasets; as can be seen in Table 3c, only two studies exceed 90% accuracy. The robots used in these studies are either robotic heads or humanoid robots such as Pepper, Nao and iCub. Many studies that perform facial expression recognition in real-time use CNNs, making them a popular choice [2, 8, 15, 133]. However, the highest real-time accuracies are achieved with Bayesian and Artificial Neural Network (ANN) methods.
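
A hedged sketch of such a real-time recognition loop is given below; it assumes the FerCNN class from the earlier sketch, hypothetical trained weights and a hypothetical robot hook, and omits platform-specific details.

```python
# Sketch of a real-time FER loop: grab camera frames, detect the face, classify the
# expression, and hand the label to the robot's behaviour layer. Assumes the FerCNN
# class from the earlier sketch; weights and the robot hook are placeholders.
import cv2
import torch

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]
detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
model = FerCNN()                 # hypothetical: load trained weights before real use
model.eval()

cap = cv2.VideoCapture(0)
try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in detector.detectMultiScale(gray, 1.1, 5):
            crop = cv2.resize(gray[y:y + h, x:x + w], (48, 48)) / 255.0
            tensor = torch.tensor(crop, dtype=torch.float32).reshape(1, 1, 48, 48)
            with torch.no_grad():
                label = EMOTIONS[int(model(tensor).argmax())]
            # robot.react_to(label)   # placeholder hook into the robot's behaviour layer
finally:
    cap.release()
```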

Table 4 Detailed information about studies on emotion expression and Human Robot Interaction (HRI)

5 Facial Emotion Expression by Robots

For robots to be empathic, they must not only recognize human emotions but also generate emotions using facial expressions. Several studies enable robots to generate facial expressions in either a hand-coded or an automated manner (Fig. 6). By hand-coded, we mean that the facial expressions are programmed by moving the robot’s eyes and mouth in the desired manner; automated means that the expressions are learned automatically using machine learning techniques.

In total, 16 studies on facial emotion expression in robots were reviewed; they are summarized in Table 4.

5.1 Facial Expression Generation is Hand-Coded

Earlier studies started by hand-coding the facial expressions of robots. Facial expressions have been generated on robots both statically and dynamically.

Among the static methods, the humanoid social robot “Alice” imitates human facial expressions in real-time [91]. Kim et al. [61] introduced an artificial facial expression imitation system using a robot head, Ulkni. Ulkni is composed of 12 RC servos, with four degrees of freedom (DoFs) to control its gaze direction, two DoFs for its neck, and six DoFs for its eyelids and lips, so it can make the basic facial expressions once position commands for the actuators are sent from a PC. Bennett and Sabanovic [7] identified minimal features, i.e. movements of the eyes, eyebrows, mouth and neck, that are sufficient to identify a facial expression.

In their study, the main program called functions that specified facial expressions according to a direction (used to make or undo an expression) and a degree (the strength of the expression, i.e. smaller vs. larger). The facial expression functions would in turn call lower-level functions that moved specific facial components given a direction and degree, following the movements related to specific action units (AUs) in the Facial Action Coding System (FACS).
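
The following hedged sketch illustrates this kind of layered, hand-coded control (not Bennett and Sabanovic’s actual code): an expression function scales per-component servo offsets by a degree and a direction; the servo names, neutral pose and offsets are invented.

```python
# Hedged sketch of a hand-coded expression layer: an expression function scales
# per-component servo offsets by a degree in [0, 1] and a direction (+1 to make,
# -1 to undo the expression). Servo names, neutral pose and offsets are invented.
NEUTRAL = {"brow_left": 90, "brow_right": 90, "mouth_corner_left": 90, "mouth_corner_right": 90}

# Per-expression servo offsets (degrees) at full intensity, loosely inspired by AUs.
EXPRESSION_OFFSETS = {
    "happiness": {"mouth_corner_left": +20, "mouth_corner_right": +20},
    "sadness":   {"brow_left": -10, "brow_right": -10,
                  "mouth_corner_left": -15, "mouth_corner_right": -15},
}

def set_expression(send_servo_command, expression, degree=1.0, direction=+1):
    """Move each involved facial component towards (or back from) an expression."""
    for servo, offset in EXPRESSION_OFFSETS[expression].items():
        target = NEUTRAL[servo] + direction * degree * offset
        send_servo_command(servo, target)      # platform-specific actuator call

# Example: a half-intensity smile on a hypothetical robot face.
# set_expression(robot.move_servo, "happiness", degree=0.5)
```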

Breazeal’s [9] robot Kismet generated emotions using an interpolation-based technique over a 3-D affect space whose dimensions correspond to valence, arousal and stance. The expressions become more intense as the affective state moves towards extreme values of the affect space. Park et al. [102] produced diverse facial expressions by varying their dynamics and increased the lifelikeness of a robot by adding secondary actions such as physiological movements (eye blinking and sinusoidal breathing motions). A second-order differential equation based on the linear affect-expression space model is used to achieve the dynamic motion of the expressions. Prajapati et al. [103] used a dynamic emotion generation model to convert facial expressions derived from the human face into a more natural form before rendering them on a robotic face: the model receives the facial expression of the person interacting with the system, and the corresponding synthetic emotions are fed to the robotic face.
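
A hedged sketch of interpolation over a (valence, arousal, stance) affect space is given below; the anchor points and basis poses are invented for illustration and do not reproduce Kismet’s implementation.

```python
# Hedged sketch of affect-space interpolation: the facial pose is blended from basis
# poses anchored at extreme points of the (valence, arousal, stance) space, weighted
# by the proximity of the current affect state. All values are illustrative.
import numpy as np

ANCHORS = np.array([[ 1,  0,  0],    # positive valence -> "happy" basis pose
                    [-1,  0,  0],    # negative valence -> "sad" basis pose
                    [ 0,  1,  0],    # high arousal     -> "surprised" basis pose
                    [ 0,  0,  1]])   # open stance      -> "interested" basis pose
BASIS_POSES = np.array([[0.8, 0.2, 0.9],
                        [0.1, 0.7, 0.1],
                        [0.9, 0.9, 0.5],
                        [0.6, 0.4, 0.6]])   # rows: poses, columns: face joints (made up)

def blend_pose(affect_state):
    """Interpolate a facial pose from the current (valence, arousal, stance) point."""
    d = np.linalg.norm(ANCHORS - np.asarray(affect_state), axis=1)
    w = 1.0 / (d + 1e-6)              # closer anchors contribute more
    w /= w.sum()
    return w @ BASIS_POSES            # weighted blend of the basis poses

print(blend_pose([0.9, 0.2, 0.1]))    # dominated by the "happy" basis pose
```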

Summary of findings The robot faces contain enough DoFs in the eyes and mouth to make the basic facial expressions. Some generate static expressions [7, 61, 91], while others also generate dynamic expressions [9, 102, 103].

5.2 Facial Expression Generation is Automated

Some studies generate facial expressions on robots automatically. Unlike hand-coded techniques, where the position commands for features such as the eyes and mouth are sent from a computer, here the facial expressions are learned using machine learning techniques such as neural networks and reinforcement learning (RL).

Breazeal et al. [10] presented the robot Leonardo, which can imitate human facial expressions. They use neural networks to learn a direct mapping from a human’s facial expressions onto Leonardo’s own joint space. In Horii et al. [50], the robot does not directly imitate the human but estimates the underlying emotion and generates the estimated emotion using a Restricted Boltzmann Machine (RBM). The RBM [46] is a generative model that represents the generative process of the data distribution and a latent representation, and it can generate data from latent signals [98, 117, 123].
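
In the spirit of learning such a mapping (not Breazeal et al.’s actual model), the sketch below trains a small network that maps a human facial feature vector onto robot joint positions; the input/output dimensions and the data are invented.

```python
# Hedged sketch: a small neural network mapping a human facial feature vector
# (e.g. action-unit activations or landmark coordinates) to robot face joint angles.
import torch
import torch.nn as nn

mapper = nn.Sequential(
    nn.Linear(17, 64), nn.ReLU(),     # 17 AU activations (illustrative input size)
    nn.Linear(64, 12), nn.Tanh(),     # 12 normalized joint positions in [-1, 1]
)
optimizer = torch.optim.Adam(mapper.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(au_batch, joint_batch):
    """One supervised step on (human feature vector, demonstrated robot pose) pairs."""
    optimizer.zero_grad()
    loss = loss_fn(mapper(au_batch), joint_batch)
    loss.backward()
    optimizer.step()
    return loss.item()

# train_step(torch.rand(32, 17), torch.rand(32, 12) * 2 - 1)   # placeholder batch
```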

Li and Hashimoto [73] developed a KANSEI communication system based on emotional synchronization. KANSEI is a Japanese term meaning emotion, feeling or sensitivity. The system first recognizes the human’s emotion and maps it into an emotion generation space; the robot then expresses its own emotion synchronized with the human’s emotion in that space. When the human changes his or her emotion, the robot synchronizes its emotion accordingly, establishing continuous communication between human and robot. Subjects were found to be more comfortable with the robot, and to communicate more with it, when there was emotional synchronization.

In Churamani et al. [17], the robot Nico learned the correct combination of eyebrow and mouth wavelet parameters to express its mood using RL. The learned expressions looked slightly distorted but were sufficient to distinguish between the various expressions. The robot could also generate expressions beyond the five basic expressions it was trained on: for a mixed emotional state (for example, anger mixed with sadness), the model was able to generate novel expression representations reflecting the mixed mood.
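
As a simplified stand-in for this idea (not Churamani et al.’s actual model), the sketch below uses a REINFORCE-style update of a Gaussian policy over a handful of expression parameters, with an invented reward standing in for how reliably the displayed face is recognized as the target emotion.

```python
# Simplified REINFORCE-style sketch: a Gaussian policy over a few expression
# parameters (e.g. eyebrow/mouth wavelet amplitudes) is nudged towards settings
# that earn a higher recognition reward. Reward and optimum are invented.
import numpy as np

rng = np.random.default_rng(0)
mean = np.zeros(4)                  # policy mean over 4 expression parameters (illustrative)
sigma, lr = 0.2, 0.02
baseline = 0.0                      # running average of the reward

def reward(params):
    """Placeholder: in practice, how reliably observers read the face as the target emotion."""
    return -np.linalg.norm(params - np.array([0.5, -0.3, 0.8, 0.1]))   # invented optimum

for episode in range(2000):
    sample = mean + sigma * rng.standard_normal(4)               # sampled expression parameters
    r = reward(sample)
    baseline += 0.05 * (r - baseline)
    mean += lr * (r - baseline) * (sample - mean) / sigma ** 2   # policy-gradient step

print(np.round(mean, 2))   # approaches the parameters maximizing the placeholder reward
```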

Summary of findings In all of the above studies, the robots learn to generate facial expressions automatically using machine learning techniques. While Breazeal et al. [10] and Li and Hashimoto [73] used a direct mapping of human facial expressions, Horii et al. [50] generated the estimated human emotion on the robot. In Churamani et al. [17], the robot was able to associate the learned expressions with the context of the conversation.

Table 5 Possible categories for facial recognition in the wild

6 Discussion

6.1 Summary of the State of the Art

Several studies already achieve high accuracy (greater than 90%) in facial expression recognition on the CK+, JAFFE and Oulu-CASIA datasets (see Table 3a, b), with accuracies as high as 100%, 99.8% and 92.89%, respectively. In comparison, the accuracy of facial expression recognition in real-time is not as high.

Zhang et al. [146] used a deep convolutional network (DCN) that achieved an accuracy of 98.9% on the CK+ dataset but only 55.27% on the Static Facial Expressions in the Wild (SFEW) dataset; the same network produced very different results on the two datasets. SFEW [29] consists of close-to-real-world material extracted from movies; the database covers unconstrained facial expressions, varied head poses, a large age range, occlusions, varied focus, different face resolutions, and close-to-real-world illumination. The considerably lower accuracy of Zhang et al. [146] in the “in the wild” setting compared to CK+ implies that expression recognition algorithms still cannot handle the variations in environment, head pose etc. found in real-world settings.

Table 5 lists possible categories for facial expression recognition in the wild: the basic emotional facial expressions, situation-specific face occlusions, permanent facial features, face movements, situation-specific expressions, and side activities during facial expressions.

Most current research on facial expression recognition relates to the first category, basic emotional facial expressions. Survey articles on facial expression recognition are cited in Table 5 [11, 13, 21, 42, 43, 71, 88, 109, 112]; for details on individual studies, refer to Table 3. Facial expression recognition in the presence of situation-specific face occlusions such as a mouth–nose mask, glasses or a hand in front of the face has also been studied [74, 75, 131]. Pose-invariant facial expression recognition, when the face is moving or turned sideways, has been partially studied [96, 113, 143, 144].

For facial expression generation, robots can make certain basic facial expressions by moving their eyes, mouth and neck. However, they cannot make as many expressions as human beings because of the limited number of DoFs in a robot’s face. There are comparatively few studies on automated facial expression generation in robots [10, 17, 50, 73]: while robots can display facial expressions whose eye and mouth movements are coded manually, fewer studies let a robot learn to display its facial expressions automatically.

Most studies on facial expression generation have been carried out on robotic heads or humanoid robots such as iCub and Nico [e.g. 9, 10, 17, 50]. In Becker-Asano and Ishiguro [5], Geminoid F’s facial actuators were tuned so that the readability of its facial expressions is comparable to a real person’s static display of emotional expression. It was found that the android’s emotional expressions were more ambiguous than those of a real person, and ‘fear’ was often confused with ‘surprise’.

An advantage of automated over hand-coded facial expression generation is that a robot can learn mixed expressions rather than only the expressions it was explicitly given. Unlike hand-coded generation, where a robot can only express the emotions that have been programmed, in Churamani et al. [17] the robot could express complex emotions composed of a combination of emotions.

6.2 Future Research

Although facial expression recognition achieves high accuracy under specific settings, and robots can express basic emotions through facial expressions, there are several possible directions for future research in this area.

Suggestion 1: Facial expression recognition in the wild needs to be emphasized.

To recognize facial expressions efficiently in real-time and in real-world environments, a robot should be able to perform facial expression recognition under varied head poses, varied focus, occlusions, different face resolutions and varied illumination conditions. The studies that perform facial expression recognition in real-time are limited to laboratory environments, which differ considerably from real-world scenarios. A good study would be one in which facial expression recognition in the wild is performed.

Some studies perform facial expression recognition in the wild, but their accuracy is much lower than on predefined datasets such as CK+, JAFFE and MMI. To make facial expression recognition effective in real-world scenarios, its performance in the wild needs to be improved; this would also benefit recognition in real-time. Building on this, a direct adaptation to the user’s emotions would make HRI smoother.

Suggestion 2: Facial expressions during activities such as talking, nodding etc. need to be studied.

Situation-specific expressions (nodding, yawning, blinking, looking down) and side activities during facial expressions (talking, eating, drinking, sneezing) in Table 5 have not yet been studied. To understand vivid expressions, it is necessary to recognize facial expressions across all of these categories. Humans also express emotions while interacting verbally, for example smiling while speaking when they are happy; in this case, it should be possible to recognize the smile during speech.

Suggestion 3: Combine facial expression recognition with the data from other modalities such as voice, text, body gestures and physiological data to improve the emotion recognition rate.

Although this overview focuses on facial expression recognition, people may control their face and not express the emotion they are truly experiencing. Some studies combine facial expression recognition with audio data, body gestures or physiological data for improved emotion recognition [41, 53, 83]. Very few studies combine facial data with both audio and physiological data [106, 107], and no studies were found that analyze all modalities (face, voice, text, body gestures and physiological signals). Humans recognize the emotion of a person quickly and effectively by taking into account their facial expression, body gestures, voice and words. Combining facial, audio, text and body-gesture data with physiological data could allow machine learning algorithms to reach a higher emotion recognition rate than humans.
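
A hedged sketch of decision-level (late) fusion is shown below: per-modality classifiers each output a probability distribution over emotions, and a weighted average yields the fused estimate; the per-modality outputs and weights are invented, and the underlying classifiers are placeholders.

```python
# Hedged sketch of decision-level (late) fusion of modality-specific emotion classifiers.
import numpy as np

EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

def fuse(modal_probs, weights):
    """modal_probs: dict modality -> probability vector; weights: dict modality -> float."""
    total = sum(weights.values())
    fused = sum(weights[m] * np.asarray(p) for m, p in modal_probs.items()) / total
    return EMOTIONS[int(np.argmax(fused))], fused

# Example with made-up per-modality model outputs for one time window:
probs = {
    "face":       [0.05, 0.02, 0.03, 0.70, 0.05, 0.10, 0.05],
    "voice":      [0.10, 0.05, 0.05, 0.50, 0.10, 0.10, 0.10],
    "physiology": [0.10, 0.10, 0.10, 0.30, 0.15, 0.10, 0.15],
}
label, fused = fuse(probs, weights={"face": 0.5, "voice": 0.3, "physiology": 0.2})
print(label)   # "happiness" for these illustrative numbers
```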

Suggestion 4: How should a robot react to a given human emotion?

In HHI, a human’s reaction to a given emotion is the result of either parallel empathy or reactive empathy [26]. It should be studied with which emotion a robot should appropriately react to a given human emotion. Moreover, it needs to be studied whether a robot should be able to express negative emotions. Most existing studies allow a robot to express the basic emotions (anger, fear, happiness, neutral, sadness, surprise). It may be reasonable for a robot to react with a sad expression when a human expresses anger, but should a robot be able to express extreme emotions such as anger?

For facial expression generation, while robots are capable of displaying both static and dynamic facial expressions, they are unable to generate facial expressions while they are speaking. For example, robots could smile while talking to express happiness, or speak with a frown when angry. Robots could also express their emotions through partial facial or bodily gestures instead of a full-face expression, for example tilting the head down to express sadness, frowning to express anger, or opening the eyes wide and raising the eyebrows to express surprise.

Suggestion 5: Robots should be able to recognize and generate facial expressions with various intensities.

Emotions form a continuous range and can have various intensities: someone who is slightly happy will smile less, while someone who is very happy will smile broadly. It should be possible to recognize not just the emotion but also its intensity. Moreover, in most existing studies, robots express each emotion with only a single configuration; robots should also be able to express their emotions with different intensities. Finally, it needs to be studied whether the intensity with which a robot reacts to a given human emotion has any effect on the human, and whether this effect is positive or negative.
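
As a hedged illustration of intensity-aware mirroring, the sketch below uses the recognizer’s confidence for the predicted emotion as a crude intensity proxy and scales the displayed expression accordingly; it reuses the hypothetical set_expression function from the sketch in Sect. 5.1, and the threshold is invented.

```python
# Hedged sketch: scale the robot's expression degree by a recognized intensity, using
# the classifier's confidence as a crude proxy. `set_expression` refers to the
# hypothetical hand-coded sketch in Sect. 5.1; the 0.4 threshold is invented.
import numpy as np

def mirror_with_intensity(send_servo_command, probs, emotions):
    """probs: per-emotion probability vector from the recognizer for one detected face."""
    idx = int(np.argmax(probs))
    intensity = float(probs[idx])     # crude proxy; estimated AU intensities would be better
    if intensity < 0.4:               # too uncertain: keep a neutral face
        return None
    set_expression(send_servo_command, emotions[idx], degree=intensity)
    return emotions[idx], intensity
```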

Suggestion 6: Robots should be able to express their emotions through a combination of body gestures and facial expressions.

While this overview focuses on robotic facial expressions, other articles address emotional expression through the robot’s body postures [4, 20, 22, 55, 86, 90]. A potential future study could compare the robot’s facial expressions, its bodily expressions, and a combination of the two, to see whether there is any difference in how well they are recognized.

Suggestion 7: Robots should be able to both recognize and generate complex emotional states such as thinking, calm and bored.

For both facial expression recognition and generation, there is a need to go beyond the basic seven emotions towards recognizing and generating more complex states such as calm, fatigued or bored. Generating complex emotions may be difficult given the hardware limitations of a robot, but if it were made possible, robots could express a range of emotions closer to that of human beings.

7 Conclusion

This overview has emphasized the recognition of human facial expressions and the generation of robotic facial expressions. There are already plenty of studies achieving high accuracy for facial expression recognition on pre-existing datasets, but accuracy in the wild is considerably lower than in experiments conducted under controlled laboratory conditions. For human facial emotion recognition, future work should improve emotion recognition for non-frontal head poses and in the presence of occlusions (i.e. emotion recognition in the wild). It should also become possible to recognize emotions during speech, as well as emotions of varying intensity. Regarding facial expression generation, robots are capable of making the basic facial expressions, but few studies perform autonomous facial expression generation. Future studies could compare robotic facial expressions with the robot’s bodily expressions, and with a combination of the two, to see whether there is any difference in how well they are recognized. Robots should be able to express their emotions with partial bodily or facial gestures while speaking, and they should be able to express their emotions with various intensities instead of a single configuration per emotion. Lastly, there is a need to go beyond the basic seven expressions for both facial expression recognition and generation.