1 Introduction

1.1 Emotion and e-learning

Emotions are a significant factor in the process of learning [58]. Current instructional methods for online learning increasingly address emotional dimensions by accommodating challenge, excitement, ownership, and responsibility, among other things, in the learning environment [25, 80]. Educational games [18] are a case in point: they offer a challenging and dynamic learning setting that effortlessly combines emotion and cognition [78]. As online learning has expanded radically over the past years, there is renewed interest in adaptive methods and personalization that explicitly adjust instruction and support to the learners’ mental states and requirements. Such personalization is conventionally based on producing and maintaining a model of the learner, which is mainly based on individual characteristics and validated performances [13, 14]. Emotion has systematically been ignored as a learner model variable because it was hard, if not impossible, to detect. Now that technology is becoming capable of automatically recognising the learners’ emotional states, learner models could readily include emotions and thereby improve the quality of personalization.

Emotion recognition in e-learning environments could, provided that issues of ethics and privacy are taken into account, offer a valuable source for improving the quality of learning [11]. Responses based on emotional states [35] could enhance learners’ understanding of their own performance [10].

Educational game development [79] can likewise take advantage of learners’ emotional data [4] to optimise experiences and the flow of events. In this study, we offer an accurate and reliable technology for emotion recognition that can be easily applied in digital educational games and other e-learning environments.

1.2 Approaches to emotion recognition

Technologies related to emotion recognition date back to the early 1900s, when blood pressure tests were used for lie detection during the questioning of criminal suspects [53]. Although lie detectors are occasionally admitted as evidence in court, they are generally considered unreliable. Over the years, the accuracy of emotion recognition software has improved considerably. Bettadapura [12] reports accuracies for existing emotion recognition solutions ranging from 55% to 98% since 2001. Basically, six different approaches to emotion recognition are available: 1) facial expressions [7], 2) speech and vocal intonations [8], 3) physiological signals [65], 4) body gesture and pose [59], 5) text [66], and 6) a combination of two or more of these approaches [9, 60, 64].

Facial expressions provide the most informative data for computer awareness of emotions [64]. However, software applications that use facial expressions have a number of restrictions that mostly limit their accuracy and applicability. Usually, they can only manage a small set of expressions from a frontal view of faces without facial hair and glasses, and they require good and stable lighting conditions. Also, most software applications cannot be used in real-time, but require extensive post-processing for the analysis of videos and images [57].

Emotion recognition can also be performed based on the audio and vocal intonations in recorded speech [19]. However, the analysis of vocal intonations produces less accurate results than facial expression approaches. Vocal intonation analysis can only manage a subset of basic emotions from recorded audio files or from speech streams coming from microphones, and it requires post-processing through various speech analysis methods [75].

Physiological sensors allow for capturing a variety of physiological responses, such as body temperature, heart rate, blood volume, and skin conductance of an individual [42]. These sensors are sometimes offered in the form of wearable devices [69]. They have been used to study the experienced emotions of learners in schools [3], and as an add-on to intelligent tutoring systems [31]. Although such technologies show promising results in emotion recognition, they are scarcely applied, because they are obtrusive to learners and require expensive and dedicated equipment [46].

Body movements and gestures are an additional source for emotion recognition [40, 76]. D’Mello and Graesser report that there are significant relationships between emotion and body movements, and that the combination of posture and conversational dialogue reveals a modest amount of redundancy among them [17]. Kyriakos and colleagues developed a real-time dynamic hand gesture and posture recognition system that forms a visual dictionary by merging hand postures and dynamic gestures [43]. Recently, the video games industry has introduced sensory devices as a commodity, mainly meant for interaction control in entertainment games [61]. Nintendo introduced its Wii game console with a movement and gesture recognition sensor, and Microsoft introduced its Kinect in 2010 to provide optical sensor technology for body recognition and motion tracking [84]. Although both the Wii and the Kinect have greatly enhanced gaming interaction modes by capturing gestures and bodily movements, they are not capable of extracting users’ emotions [68].

Emotion recognition through text or speech analysis is applied to a set of words in a specific language [48]. Such analysis is called sentiment analysis and uses natural language processing techniques to extract the affective state represented in the text and thereby the affective state and attitude of its author [81]. Dependency on a specific language is the main obstacle to developing a worldwide software application for recognising emotions from text [48]. Another obstacle is that speakers or authors do not necessarily express their own emotions, but may describe somebody else’s emotion [51], or may not express the emotion in a sentence explicitly [16]. Some studies report that these issues could be solved using semantic technologies (see for example [16]), which add metadata to the textual data and encode the meaning of the text [5, 6].

The accuracy of emotion recognition can be greatly improved by combining two or more of the previous approaches. Jaimes and Sebe [34] have shown improved performance by combining visual and audio information. They showed that multimodal data fusion could raise accuracy levels from 72% up to 85% if the following conditions are met: 1) clean audio-visual input, such as a noise-free dataset, a closed and fixed microphone, and non-occluded portraits, 2) actors’ performances, 3) speakers uttering single words, and 4) exaggerated facial expressions of the six basic emotions (happiness, sadness, surprise, fear, disgust, and anger) (cf. [20]).

1.3 Emotion recognition in e-learning

Notwithstanding the limitations of emotion recognition described above, recent hardware developments in regular computer equipment [23] now enable emotion recognition at a larger scale [7]. A typical example is the use of common webcams for emotion recognition from facial expressions [7, 52]. It has been suggested that e-learning applications can benefit from such emotion recognition devices for more natural interactions [71], because they collect learners’ data continuously and unobtrusively [7, 8]. For a long time, approaches for collecting emotional data of learners have been either obtrusive or discontinuous [67]. For example, physiological sensors and questionnaires can fundamentally hinder the learning process [24] and are not convenient or appropriate to use in e-learning environments [63]. Webcam-based approaches to emotion recognition would overcome these problems. Various problems have been reported, though. Facial emotions often could not be detected in real-time from the frontal view of faces [82]. Intensive post-processing is often needed to analyse recorded video files or stored image files of learners [38]. Occasional solutions for real-time recognition of facial emotions produced low accuracies that are not comparable to emotion recognition by humans [7]. It has been difficult to accurately detect faces and facial emotions when a beard, glasses, hair over the face, wounds, or other objects cover parts of the face [47]. Moreover, recognition is hampered when disturbing light shines directly into the face of the learner.

1.4 Several techniques for classification of facial emotion recognition

Researchers have proposed many methods for recognising and classifying emotions from facial expressions. Prior studies show that there are many different techniques to distinguish facial expressions; here we report eight of the most notable methods: 1) pixel-based recognition [77], 2) local binary pattern [50], 3) wavelet transform [36], 4) discrete cosine transform [37], 5) Gabor filter [56], 6) edge and skin detection [32], 7) facial contour [26], and 8) fuzzy logic model [22]. Each of these studies has shown that facial emotion recognition can reach an average level of success, but the performance remains below human judgement. These studies have also shown that accurate automatic classification of facial emotions remains challenging because of inconstancy, complexity, implementation difficulty, and inadequate tracking of facial features in real-time or recorded video streams. As a result, we introduce a new approach using fuzzy logic rules to generate better, faster, more accurate, and more reliable results.

Recent studies have shown that researchers can recognise and classify facial emotional expressions more appropriately. For example, Ali and her colleagues [1] proposed an application of nonlinear and non-stationary data analysis techniques, named Empirical Mode Decomposition (EMD), that can classify facial emotions with better accuracy than the methods stated above. They used static images as input to their application and extracted facial features accordingly. They applied an ANOVA test as the statistical data analysis technique to obtain the facial features that were statistically significant, and then fed these features into algorithms such as K-NN and SVM for the classification of seven categories of facial emotions. In another study, Gunes and Pantic [27] considered Russell’s circular configuration, the Circumplex of Affect [62]. In this method, every primary emotion is conceived as a bipolar entity on a shared emotional continuum. The proposed poles are valence (pleasant versus unpleasant) and arousal (relaxed versus aroused). The resulting emotional space comprises four quadrants: high arousal positive, low arousal positive, high arousal negative, and low arousal negative. Consequently, every emotion can be represented by its valence and arousal. Gunes and Pantic investigated automatic, dimensional, and continuous emotion recognition using visual, audio, physical, and brainwave methods in their study. Their findings revealed that representing emotions continuously is not a problem that can be ignored or handled easily. In another study, Anisetti and his colleagues [2] proposed a semi-supervised fuzzy facial emotional classification system based on Russell’s circumplex model. Their system works only on face-related features classified with the Facial Action Coding System (FACS) and extracts facial emotional expressions from streaming videos. To evaluate the quality of their system, they used the Cohn-Kanade database and the MMI database to apply Russell’s space mapping. To classify along Russell’s axes, they created an emotional inference space, mapped action units to axes values, and then exploited some well-defined rules from this mapping. Despite the novelty of the system, it requires expert tuning to guarantee context awareness. They concluded that researchers should further investigate tuning the system to obtain better outputs for facial emotional classification in complex scenarios.

1.5 Starting point

In this paper, we present a new methodology for webcam-based emotion recognition, along with a full technical implementation that was used for its validation. The approach is based on fuzzy logic, using unordered fuzzy rule induction (the FURIA algorithm; [29]). Compared to the statistical data analysis approach proposed by Ali and her colleagues [1], our fuzzy logic approach uses supervised machine learning and provides more favourable output, because fuzzy logic rules can be easily generated from a dataset of recorded emotions, whereas alternative machine learning approaches, such as neural networks, Bayesian networks, and decision trees, would require extensive implementation. Moreover, our approach can use single image files, recorded video files, and live webcam streams to provide accurate recognition of facial expressions, whereas the approach suggested by Ali and her colleagues can only use single image files [1]. We follow the emotion classification approach of Ekman and Friesen [20], which has been frequently used over the past decades for classifying the six basic emotions: happiness, sadness, surprise, fear, disgust, and anger.

We do not follow Russell’s Circumplex of Affect as used by Gunes and Pantic [27]; therefore we do not calculate bipolar entities such as high arousal positive, low arousal positive, high arousal negative, and low arousal negative in our approach. Instead, we use facial features extracted by tracking a human face in real-time and classify facial emotions directly. Moreover, compared to the semi-supervised fuzzy facial emotional classification system based on Russell’s circumplex model proposed by Anisetti and his colleagues [2], we propose a new approach that classifies emotions based on FURIA fuzzy rules using a supervised machine learning technique. Our rules do not need to be produced from an emotional inference space with action units mapped to axes values. Instead, we generate our rules from the cosine values of the most significant triangles created from the most significant facial feature points. We describe our approach in the coming sections.

Although similar to our previous approach [7], which used Principal Component Analysis, the fuzzy logic-based approach produces better, more accurate, and more reliable results. To allow maximal portability, the software is implemented as a RAGE-compliant software component [73, 74]: the RAGE software architecture omits dependencies on platforms and operating systems and accommodates the easy reuse and integration of software in a variety of video game engines. In the rest of this paper, we first describe the creation of a facial emotion database and the functionalities of our software. Thereafter, we explain the validation method used in this study, discuss the results, and provide suggestions for future work.

2 Database, fuzzy rules, and software

2.1 Creating a database of the facial emotions

We started from an existing database, the extended Cohn-Kanade AU-coded expression database (CK+), as the reference for this study [49]. This database is used for automatic facial image analysis and includes an annotated set of human facial images with validated emotion labels for each image. Based on this, we created a database of emotions that also includes the rotated images of each subject, and computed cosine values of facial landmarks for training and testing purposes. This database was then used to induce fuzzy rules. To this end, we developed a small software application that used DLIB [39], a widely used C++ toolkit that includes machine learning tools and algorithms. After loading the images and their related emotion labels from the CK+ database, face recognition and face tracking functionalities from DLIB were used and extended to develop the facial emotion classification functionality. From each image, we extracted 68 facial landmarks and formed 18 relevant triangles (54 vertices) from triples of important landmarks. For example, two important triangles with 6 vertices are the triangles between the eyebrows and the eyes (see Fig. 1, facial landmarks 17, 36, and 39 & 22, 42, and 45). We then calculated the cosine values (54 values) of all vertices of all triangles. Next, we stored all the cosine values along with the related emotion label of each image of the CK+ database in our database in the form of a WEKA attribute-relation file format (arff) file [70]. WEKA is a tool that provides a number of machine learning algorithms for data mining tasks. The arff file is a textual database that defines a list of instances sharing a set of attributes: each instance is represented by 55 attributes, called Cosine0, Cosine1, …, Cosine53, and Emotion. By loading the database in WEKA, 37 so-called FURIA fuzzy rules (see Appendix 1) could be generated, allowing us to automatically detect and classify emotions from facial expressions. FURIA is a fuzzy rule-based classification method which offers simple and comprehensible rule sets [29]. WEKA does not provide the FURIA rule-based classifier by default; users must install it via the package manager in the WEKA GUI Chooser before running the WEKA Explorer application. Once the FURIA classifier has been added to the list of WEKA classifiers, it can be run to produce the FURIA fuzzy rules. The mechanism of fuzzy rules is briefly explained in the next section.
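
For illustration, the sketch below shows what this feature extraction step could look like. The original tool was built on the C++ DLIB toolkit; here we use dlib’s Python bindings instead. The cosine of the interior angle at each triangle vertex, the two eyebrow-eye triangles, and the arff row layout follow the description above, but the remaining landmark triples, the file names, and any scaling of the cosine values are illustrative assumptions rather than the authors’ exact implementation.

```python
# Illustrative sketch (not the authors' exact implementation): extract 68 facial
# landmarks with dlib, compute cosine features for the triangle vertices, and
# append one labelled training instance to a WEKA .arff file.
import math
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

# Two of the 18 triangles mentioned above (eyebrow/eye corners); the remaining
# 16 landmark triples are not listed in the paper and would be added here.
TRIANGLES = [(17, 36, 39), (22, 42, 45)]

def cosine_at_vertex(a, b, c):
    """Cosine of the interior angle at vertex a of triangle (a, b, c)."""
    abx, aby = b[0] - a[0], b[1] - a[1]
    acx, acy = c[0] - a[0], c[1] - a[1]
    return (abx * acx + aby * acy) / (math.hypot(abx, aby) * math.hypot(acx, acy))

def cosine_features(image_path):
    """Cosine values (3 per triangle) for the first detected face in an image."""
    img = dlib.load_rgb_image(image_path)
    face = detector(img)[0]
    pts = [(p.x, p.y) for p in predictor(img, face).parts()]
    feats = []
    for i, j, k in TRIANGLES:
        tri = (pts[i], pts[j], pts[k])
        for v in range(3):
            feats.append(cosine_at_vertex(tri[v], tri[(v + 1) % 3], tri[(v + 2) % 3]))
    return feats

def append_arff_instance(arff_path, feats, emotion):
    """Append one data row: Cosine0, ..., Cosine53, Emotion."""
    with open(arff_path, "a") as f:
        f.write(",".join(f"{v:.5f}" for v in feats) + f",{emotion}\n")

# Hypothetical file names for one CK+ image and the output database.
append_arff_instance("emotions.arff", cosine_features("ck_subject_frame.png"), "Happy")
```

The resulting arff file can then be opened in the WEKA Explorer (with the FURIA package installed, as described above) to induce the fuzzy rules.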

Fig. 1 A detected face, facial landmarks, the vertices, and the relevant triangles of the face

2.2 Fuzzy rules

A fuzzy rule is obtained by replacing binary logic intervals with fuzzy intervals. For example, a binary interval would be represented as a step or block function (with a discrete value of 1 (“true”) if the parameter under consideration lies inside the interval, and 0 (“false”) elsewhere). A fuzzy rule, however, can be shaped as a trapezium, allowing for “fuzzy” truth-values between 0 and 1 (Fig. 2). This can be formalised as follows: the trapezoidal membership function for a fuzzy set F on the universe of discourse X is defined as μF: X ➔ [0,1], where each element of X is mapped to a value between 0 and 1. This function is defined by four parameters [29]: a lower limit LL, an upper limit UL, a lower support limit LSL, and an upper support limit USL, where LL < UL < LSL < USL:

Fig. 2 Binary logic and the trapezoidal membership function of a fuzzy interval

$$ \mu_F(X)=\begin{cases} 0, & (X \le \mathrm{LL})\ \text{or}\ (X > \mathrm{USL})\\ (X-\mathrm{LL})/(\mathrm{UL}-\mathrm{LL}), & \mathrm{LL} \le X \le \mathrm{UL}\\ 1, & \mathrm{UL} \le X \le \mathrm{LSL}\\ (\mathrm{USL}-X)/(\mathrm{USL}-\mathrm{LSL}), & \mathrm{LSL} \le X \le \mathrm{USL} \end{cases} $$

The four parameters of the trapezoidal membership function are indicated on the horizontal axis in Fig. 2.

We have generated 37 FURIA fuzzy rules in this study. Appendix 1 presents all the rules. As an example, we explain one of our generated FURIA fuzzy rules (rule number 10) to show how the emotion recognition logic is expressed. Fuzzy rule number 10 reads as follows:

$$ (\mathrm{Cosine1}\ \text{in}\ [-\infty,\ -\infty,\ 6.82602,\ 7.03498])\ \text{and}\ (\mathrm{Cosine15}\ \text{in}\ [14.5889,\ 15.0512,\ \infty,\ \infty]) \Rightarrow \mathrm{Emotions}=\mathrm{Sad}\ (\mathrm{CF}=0.53) $$

The antecedent of the rule includes two trapezoidal conditions. The arguments between brackets represent the four trapezoidal parameters. As inf denotes infinity, both trapeziums in this example are one-sided (degenerate). The overall rule can be interpreted as follows (a small evaluation sketch is given after the list):

1. Cosine1 in [−inf, −inf, 6.82602, 7.03498]: this condition is completely valid for Cosine1 <= 6.82602, invalid for Cosine1 > 7.03498, and partially valid in between.

2. Cosine15 in [14.5889, 15.0512, inf, inf]: this condition is invalid for Cosine15 < 14.5889, completely valid for Cosine15 >= 15.0512, and partially valid in between.

3. If both conditions are met, the emotion is classified as “sad”, with rule certainty factor CF = 0.53.
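
To make the mechanics concrete, the following sketch implements the trapezoidal membership function defined above and applies it to the two conditions of rule 10. It is only an illustration: the way FURIA combines the membership degrees of the antecedents and weighs them by the certainty factor is specified in [29]; the product combination used here is a simplifying assumption.

```python
# Minimal sketch of rule 10, assuming the trapezoidal membership of Section 2.2
# and a product combination of the antecedent degrees (see [29] for FURIA's
# exact rule scoring).
INF = float("inf")

def trapezoid(x, ll, ul, lsl, usl):
    """Trapezoidal membership with support [LL, USL] and core [UL, LSL];
    infinite limits make the corresponding edge degenerate (one-sided)."""
    if ul <= x <= lsl:
        return 1.0                               # inside the core
    if ll < x < ul:
        return (x - ll) / (ul - ll)              # rising edge
    if lsl < x < usl:
        return (usl - x) / (usl - lsl)           # falling edge
    return 0.0                                   # outside the support

def rule_10(cosine1, cosine15):
    """(Cosine1 in [-inf,-inf,6.82602,7.03498]) and
    (Cosine15 in [14.5889,15.0512,inf,inf]) => Emotions = Sad (CF = 0.53)."""
    degree = (trapezoid(cosine1, -INF, -INF, 6.82602, 7.03498) *
              trapezoid(cosine15, 14.5889, 15.0512, INF, INF))
    return "Sad", 0.53 * degree

print(rule_10(6.5, 15.4))    # both conditions fully valid  -> ('Sad', 0.53)
print(rule_10(6.93, 15.4))   # Cosine1 on the falling edge  -> ('Sad', ~0.27)
print(rule_10(7.5, 15.4))    # Cosine1 outside the support  -> ('Sad', 0.0)
```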

2.3 Implementation of emotion recognition from facial expressions

The software was developed in accordance with the RAGE client asset architecture [73, 74], which prohibits direct access to the operating system and hardware. As a result, the software accepts raw image data that can originate from various sources, such as pictures, screenshots, and frames from either pre-recorded video or live webcam streams, which makes it very versatile in its application. The process of emotion recognition starts with face recognition. This is done using DLIB [39], which provides real-time face tracking so that the face is not lost between frames. It also provides a set of 68 landmarks that reflect the significant positions on the individual’s face and that are dynamically updated. Once the 68 facial landmarks of a face are extracted, we overlay the 18 relevant triangles on the face and calculate the 54 cosine values of all vertices of the triangles. Next, the fuzzy rules come into play: all 54 cosine values are passed into the rule set to extract and classify the expressed emotion. Figure 3 shows the software with three detected faces.
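
As an illustration of how this pipeline could be driven by a live webcam stream, the sketch below captures frames with OpenCV, extracts the cosine values with dlib, and scores each emotion by a CF-weighted sum of rule activations. It is not the RAGE component itself: the rule list contains only rule 10 as a placeholder (the full set of 37 rules is in Appendix 1), most landmark triples are hypothetical stand-ins for the real 18 triangles, and the aggregation into a final class is an assumed scheme rather than FURIA’s exact inference.

```python
# Illustrative real-time loop (not the RAGE component): webcam frame -> dlib
# landmarks -> 54 cosine values -> CF-weighted vote over the fuzzy rules.
import math
from collections import defaultdict
import cv2
import dlib

INF = float("inf")
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

# The first two triangles are the eyebrow/eye triangles from Section 2.1;
# the remaining 16 triples are hypothetical placeholders for the real list.
TRIANGLES = [(17, 36, 39), (22, 42, 45),
             (36, 39, 48), (42, 45, 54), (48, 51, 54), (48, 57, 54),
             (31, 33, 35), (27, 31, 35), (21, 22, 27), (17, 21, 36),
             (22, 26, 45), (36, 48, 8), (45, 54, 8), (48, 54, 8),
             (19, 37, 41), (24, 44, 46), (50, 52, 33), (56, 58, 8)]

# Each rule: ([(cosine_index, LL, UL, LSL, USL), ...], emotion, CF).
# Only rule 10 is shown; the full 37-rule set would be listed here.
RULES = [([(1, -INF, -INF, 6.82602, 7.03498),
           (15, 14.5889, 15.0512, INF, INF)], "Sad", 0.53)]

def trapezoid(x, ll, ul, lsl, usl):
    if ul <= x <= lsl:
        return 1.0
    if ll < x < ul:
        return (x - ll) / (ul - ll)
    if lsl < x < usl:
        return (usl - x) / (usl - lsl)
    return 0.0

def cosines_for(pts):
    """54 cosine values: the interior angle cosine at every triangle vertex."""
    vals = []
    for tri in TRIANGLES:
        for a, b, c in ((0, 1, 2), (1, 2, 0), (2, 0, 1)):
            ax, ay = pts[tri[a]]
            v1 = (pts[tri[b]][0] - ax, pts[tri[b]][1] - ay)
            v2 = (pts[tri[c]][0] - ax, pts[tri[c]][1] - ay)
            vals.append((v1[0] * v2[0] + v1[1] * v2[1]) /
                        (math.hypot(*v1) * math.hypot(*v2)))
    return vals

def classify(cosines):
    """Assumed aggregation: sum CF-weighted rule activations per emotion."""
    scores = defaultdict(float)
    for conditions, emotion, cf in RULES:
        degree = 1.0
        for idx, ll, ul, lsl, usl in conditions:
            degree *= trapezoid(cosines[idx], ll, ul, lsl, usl)
        if degree > 0.0:
            scores[emotion] += cf * degree
    return max(scores, key=scores.get) if scores else "Neutral"

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for face in detector(gray):                      # handles multiple faces
        pts = [(p.x, p.y) for p in predictor(gray, face).parts()]
        print(classify(cosines_for(pts)))
    if cv2.waitKey(1) == 27:                         # Esc quits
        break
cap.release()
```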

Fig. 3 The software with three detected faces: two faces of real persons captured in real time and one drawing of Michael Jackson’s face printed on a T-shirt

3 Validation method

The validation of the approach was arranged by asking test persons to express a series of emotions and comparing the judgements of the fuzzy logic approach with the judgements made by experts. For this, we used the recorded video files of test persons from a previous study [7] as input to the fuzzy logic system. The whole procedure is described below.

3.1 Participants

We sent an email to employees of the Open University of the Netherlands to recruit participants for this study. The email mentioned the estimated time investment of 20 min for enrolling in the study. The activities entailed the active expression of a series of facial emotions; no specific background knowledge was required. Ten participants, all employees of the Open University of the Netherlands (8 male, 2 female; age M = 42, SD = 11), volunteered to participate in the study. This small number of participants was nevertheless sufficient for generating a dataset of 1000 facial expressions. By signing an agreement form, the participants allowed us to capture their facial expressions and to use their data anonymously for future research. We assured the participants that their raw data would not be made available to the public, would not be used for commercial or similar purposes, and would not be made available to third parties. Participants were told that participation in the study might help them become more aware of their emotions while communicating through a webcam with our software.

3.2 Tasks

Five consecutive tasks were given to the participants. Participants were asked to display the six basic facial expressions as well as the neutral one. In total, one hundred facial expressions were requested from each participant, uniformly distributed over the six emotions and the neutral expression. Each task served a different purpose. The first task was meant to calibrate the user’s facial expressions. In the second task, participants were asked to mimic a pre-set emotion presented in an image shown to them. There were 35 images presented subsequently through PowerPoint slides; the participant scrolled through the slides. Each image illustrated a single emotion. All six basic facial expressions and the neutral one appeared five times, in the following order: happy, sad, surprise, fear, disgust, anger, neutral, happy, etcetera. In the third task, participants were requested to mimic the seven facial expressions twice: first through slides that each presented the keyword of the requested emotion, and second through slides that each presented the keyword and a picture of the requested emotion, in the following order: anger, disgust, fear, happy, neutral, sad, and surprise. The fourth task presented 14 slides with a text transcript (both sender and receiver) taken from a good-news conversation. The transcript also included instructions on which facial expression should accompany the current text slide. Here, participants were requested to read the sender text of the slides aloud and show the accompanying facial expression. The fifth task, with 30 slides, was similar to task 4, but in this case the text transcript was taken from a bad-news conversation. The transcripts and instructions for tasks 4 and 5 were taken from an existing Open University of the Netherlands (OUNL) training course [45] and a communication book (Van der [72]).

3.3 Hardware and software

Participants performed individually on a single computer. The computer screen was divided into two panes. The tasks and the PowerPoint file were presented in the right pane, while the participants could see in the left pane how the software classified their facial expressions. An integrated webcam and a 1080HD external camera were used to capture and record the emotions of the participants as well as their interactions with mouse and keyboard on the computer screen. The integrated webcam was used to capture and recognise the participants’ emotions, while the external camera and screen-recording software (Silverback version 2.0) captured the facial expressions of the participants and recorded the complete session. The raters used these recordings to validate our software.

3.4 Procedure

Each participant signed the agreement form before his or her session started. Participants performed all five tasks individually in a single session of about 20 min. The sessions were conducted in a silent room with good lighting conditions. The moderator of the session was present in the room but did not intervene. All sessions were conducted on two consecutive days. The participants were requested not to talk to each other between sessions so that they could not influence each other. The moderator gave a short instruction at the beginning of each task; for example, participants were asked to show mild rather than overly intense expressions while mimicking the emotions. All tasks were recorded and captured by our software. After the session, each participant filled out an online questionnaire to gather their opinions about their learning experience and the setup of the study.

3.5 Validation

Two expert raters analysed the recorded video streams to provide a validation reference for the software output. The raters, both associate professors at the psychology department of the Open University of the Netherlands, were invited to individually rate the participants’ emotions in the recorded video streams. Both raters are familiar and skilled with the Facial Action Coding System [20].

Firstly, they received an instruction package for individually rating participants’ emotions in one out of the ten video recordings. Secondly, both raters participated in a training session together with the main researcher, in which the ratings of this first participant were discussed to identify possible issues with the rating task and to improve common understanding of the rating categories. Thirdly, the raters resumed their individual ratings of the participants’ emotions in the nine remaining video streams. Fourthly, they participated in a negotiation session together with the main researcher, in which all ratings were discussed to check whether negotiation about dissimilar ratings could lead to agreement or to sustained disagreement. Finally, the ratings resulting from the negotiation session were taken as input for the data analysis; the data of the training session were also included in the final analysis. The raters received: 1) a laptop, 2) a user manual, 3) an instruction guide on how to use ELAN, a professional tool for making complex annotations on video and audio resources, and 4) an Excel file with ten data sheets, each of which contained one participant’s information.

4 Analysis of data and results

In this section, we first describe how the total sample size for this study was calculated. We then present the results of the raters. Finally, we report the agreement between the requested emotions and the emotions recognised by the software.

4.1 The required sample size

We used the G*Power tool [21], which computes statistical power analyses for several statistical tests. We applied an a priori analysis for a one-tailed t-test (point-biserial correlation model) to compute the required sample size given the significance level (alpha), the power, and the effect size. We used the following input parameters: one tail, effect size = .11, alpha error probability = .05, and power (1 − beta error probability) = .95 (i.e., beta = .05 for Type II errors). The required total sample size for this study turned out to be 885 occurrences with an actual power of .95. We used 1000 occurrences for sampling the ‘requested emotions’, so this criterion was met.
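
As a rough cross-check of this number, the required n for a one-tailed correlation test can be approximated with the common Fisher z transformation; this is not how G*Power computes it, and the shortcut lands slightly higher (about 891 instead of the reported 885), as one would expect from an approximation.

```python
# Approximate a priori sample size for a one-tailed correlation test
# (effect size r = .11, alpha = .05, power = .95) using the Fisher z
# transformation. G*Power reports 885; this shortcut is slightly more
# conservative (~891).
import math
from scipy.stats import norm

def n_for_correlation(r, alpha=0.05, power=0.95):
    z_alpha = norm.ppf(1 - alpha)   # one-tailed critical value
    z_beta = norm.ppf(power)
    c = math.atanh(r)               # Fisher z of the effect size
    return math.ceil(((z_alpha + z_beta) / c) ** 2 + 3)

print(n_for_correlation(0.11))      # -> 891
```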

4.2 Results of the raters for recognising emotions

Hereafter, we describe how the raters detected the participants’ emotions from the recorded video streams. The disagreement between the raters, which was 34% before the negotiation session, was reduced to 22% at the end of the negotiation session. In order to determine consistency among the raters, we performed a cross-tabulation between the raters as well as an inter-rater reliability analysis using the Kappa (ϰ) statistic [44], which measures inter-rater agreement for qualitative items. We calculated and present the ϰ value for the original ratings before negotiation. We have 1000 displayed emotions (see Table 1) rated by two raters as being one of the six basic emotions or the neutral emotion. The cross-tabulation data are given in Table 1. Each emotion recognised by rater 1 is separated into two rows that intersect with the emotions recognised by rater 2: the first row indicates the number of occurrences of the recognised emotion and the second row displays the percentage of agreement about the identified emotions. In addition to the ϰ value, we also calculated the overall agreement (α) for each table. This α value is the average of the diagonal percentages of the related table, assuming a uniform distribution of emotions. For instance, the α value in Table 1 is calculated as: α = (90.6 + 53.3 + 53.3 + 39.7 + 68.2 + 73.4 + 95.2)/7 = 67.7.

Table 1 Rater1 * Rater2 Cross-tabulation – All 1000 emotions are rated by both raters. (ϰ = .715 and α = 67.7%)

The cross-tabulation analysis between the raters indicates that the neutral expression has the highest agreement (95.2%) and the fear expression the lowest (39.7%) (Table 1). According to Murthy and Jadon [54], people have more difficulty in recognising the fear expression, which explains why fear is the most confused expression. Sadness is the next most confused category and is often recognised as neutral (26.7%). Analysis of the ϰ statistic underlines the high degree of agreement among the raters. The inter-rater reliability was calculated to be ϰ = .715 (p < 0.001), which qualifies as substantial agreement according to the interpretation of ϰ values by Landis and Koch [44].
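
For completeness, both agreement measures can be reproduced from the raw rating lists. The snippet below uses Cohen’s ϰ from scikit-learn and computes α as the mean of the per-emotion diagonal percentages of the cross-tabulation; the short label lists are placeholders, the real input being the 1000 pre-negotiation ratings of each rater.

```python
# Sketch of the two agreement measures: Cohen's kappa for the two raters and
# the overall agreement alpha (mean of the per-emotion diagonal percentages).
# The label lists below are placeholders for the 1000 ratings per rater.
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

EMOTIONS = ["happy", "sad", "surprise", "fear", "disgust", "anger", "neutral"]
rater1 = ["happy", "sad", "fear", "neutral", "anger", "disgust", "surprise"]
rater2 = ["happy", "neutral", "fear", "neutral", "disgust", "disgust", "surprise"]

kappa = cohen_kappa_score(rater1, rater2, labels=EMOTIONS)

cm = confusion_matrix(rater1, rater2, labels=EMOTIONS)   # rows: rater 1
row_totals = cm.sum(axis=1)
present = row_totals > 0                                 # skip unused rows
per_emotion = 100.0 * np.diag(cm)[present] / row_totals[present]
alpha = per_emotion.mean()

print(f"kappa = {kappa:.3f}, alpha = {alpha:.1f}%")
```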

4.3 Emotion recognition by the software

Table 2 shows the requested emotions of the participants contrasted with the software recognition results. These numbers are taken from all 1000 emotions (10 test persons displaying 100 emotions each), including the cases in which one or both raters judged that the test person was unable to mimic the requested emotion correctly. Each requested emotion is separated into two rows that intersect with the emotions recognised by the software. Our software has the highest recognition rate for the happy expression (93.3%) and the lowest recognition rate for the fear expression (43.8%) (see Table 2).

Table 2 Requested emotions and emotions recognised by the software – These numbers are taken from all 1000 emotions including ‘unable to mimic’ by the participants (ϰ = .716 and α = 71.5%)

Please note that the differences between the software output and the requested emotions are not necessarily software faults; they may also indicate that participants were sometimes unable to mimic the requested emotions (30.6%). The software had particular problems distinguishing sad from neutral; fear from neutral, anger, disgust, and surprise; disgust from anger and neutral; and anger from disgust and sad. Error rates of the software are typically between 0.8% and 31.1%.

The numbers in Table 2 show that the six basic emotions and the neutral one have different distributions of being confused with the other emotions; in other words, they have different discrimination rates. Apart from neutral, the emotions that are best discriminated from the others are happiness, surprise, and anger. Happiness has the highest accuracy rate of 93.3% and is not confused with fear and anger at all; surprise has the next highest accuracy rate of 86.3% and is not confused with happiness and sadness at all. The most difficult emotion is fear, which scores only 43.8% and is easily confused with neutral (17.5%), anger (11.3%), surprise (10.0%), disgust (10.0%), sadness (6.3%), and happiness (1.1%). This is in accordance with Murthy and Jadon [54] and Zhang [83]. Moreover, Murthy and Jadon [54] state that sadness, disgust, and anger are difficult to distinguish from each other and are therefore often wrongly classified.

Taking the raters’ analysis results as a reference, Table 3 shows that the participants were able to mimic the requested emotions correctly in 69.4% of the occurrences. In 200 occurrences (20%) there was disagreement between the raters. In the remaining 10.6% of the cases (106 occurrences), the raters agreed that the participants were unable to mimic the requested emotions. Participants were best at mimicking neutral (87.4%) and worst at mimicking fear (21.3%), which is in accordance with Murthy and Jadon [54].

Table 3 Raters’ agreements and disagreements about 1000 mimicked emotions

Table 4 shows the requested emotions of the participants contrasted with the software recognition results while excluding from the dataset both the ‘unable to mimic’ records and the records on which the raters disagreed. We therefore recalculated the results for each emotion separately and in total.

Table 4 Requested emotions and recognised emotions by the software – These numbers are taken by the raters from 694 emotions of the participants that were able to mimic the requested emotions (ϰ = .837 and α = 83.2%)

In 306 out of 1000 cases, at least one of the raters indicated that the participants were unable to mimic the requested emotions properly. We only counted occurrences in which both raters agreed that the displayed emotion was the same as the requested emotion. The results improve for all emotions. Table 5 shows the comparison between the accuracy results of Tables 2 and 4.

Table 5 The comparison between the accuracy results of Tables 2 and 4. Each emotion is independently compared

The overall accuracy of 83.2% and the associated ϰ value of .837 are the final results, based entirely on the comparison of requested emotions and recognised emotions.

4.4 Comparison of our software output with the extended Cohn-Kanade database

Table 6 shows the labelled emotions of the Cohn-Kanade subjects contrasted with the output of the FURIA classifier in our software. These numbers are taken from all 432 labelled emotions of the subjects plus the 432 rotated images of the subjects (864 emotions in total). Each labelled emotion is separated into two rows that intersect with the emotions recognised by the FURIA classifier. The FURIA classifier in our software has the highest recognition rate for the surprise expression (95.2%) and the lowest recognition rate for the fear expression (34%).

Table 6 Recognition rate of the FURIA classifier in our software over the labelled emotions of the Cohn-Kanade database – These numbers are taken from all 432 labelled emotions of the subjects plus the 432 rotated images of the subjects

Based on our calculation using WEKA, 678 instances of the Cohn-Kanade database are correctly classified (an accuracy rate of 78.5%) and 186 instances are incorrectly classified (an error rate of 21.5%). The corresponding Kappa value is ϰ = .737. Table 7 shows the comparison between the accuracy results of Tables 4 and 6 for each individual emotion.

Table 7 The comparison between the recognition results of Tables 4 and 6. Each emotion is independently compared

The results show that the accuracy of our software exceeds the accuracy obtained on the Cohn-Kanade database. While the precision for the sad, fear, and anger emotions shows significant increases, the precision for the happy, disgust, and neutral emotions shows small improvements. The surprise emotion shows less precision in our results. This might be because the total sample size required for this study was a minimum of 885 occurrences with an actual power of .95, whereas for analysing the Cohn-Kanade database we used only 432 frontal faces plus 432 rotated faces, so this criterion was not fully met.

5 Discussion

This study presented an analysis establishing the accuracy of facial emotion recognition based on a fuzzy logic model. The results showed ϰ = .837 and an average accuracy of α = 83.2%, based on the comparison of recognised emotions and requested emotions. The data show that the more intense emotions (e.g., happiness, surprise) can be detected better than the less intense emotions, with the exception of neutral and fear. This is in accordance with Murthy and Jadon [54] and Zhang [83], who found that the most difficult emotion to mimic accurately is fear. Moreover, this result suggests that fear is interpreted differently from the other basic facial emotions. Furthermore, our data analysis confirms Murthy and Jadon’s [54] finding that sadness, disgust, and anger are difficult to distinguish from each other and are therefore often wrongly classified. Anger and disgust share many similar facial actions [20], which is probably the reason why they are often confused. In the 137 cases of disgust from joint Tables 2 and 4, 14 cases were detected as anger; in the 132 cases of anger from Tables 2 and 4, 16 cases were detected as disgust. Hence the confusion of anger and disgust is well over 8.9%.

Some potential limitations of the study should be pointed out. First, we considered only the six basic emotions and the neutral emotion in this study, although a larger diversity might be opportune. Nevertheless, the fuzzy logic approach could easily be extended to more emotions, provided that an annotated reference database is available. Second, to validate the fuzzy-logic approach we used the recorded data of non-actors. A previous study by Krahmer and Swerts has shown that actors, although they evidently have better acting skills than laymen, will produce more realistic (i.e., authentic, spontaneous) expressions [41]. Third, given our sample of middle-aged participants, we did not take participants’ age into account as a disturbing factor. Existing research shows that youngsters and older adults are not equally good at mimicking different basic emotions; for example, older adults are less good at mimicking sadness and happiness than youngsters, but better at mimicking disgust [30]. Likewise, potential gender differences have not been taken into account.

6 Conclusion

The presented approach to fuzzy-logic based emotion recognition offers high quality, reliable recognition and categorisation of emotions. The approach fulfils the requirements of being 1) unobtrusive, 2) an objective method that can be verified by researchers, 3) a method that requires only inexpensive and ubiquitous equipment (a webcam), and 4) one that outperforms existing approaches. Compared to our previous study, with an accuracy of 72% and lower reliability [7], this study achieves an 83.2% average accuracy (α) level, which is comparable with human performance [15, 55]. Moreover, multiple faces in a picture can be detected at the same time. Furthermore, being compliant with the RAGE software architecture, the emotion recognition component created in this study can easily be ported to a variety of game engines and e-learning environments.

Emotion recognition technology can now easily be added to educational games or e-learning environments to enhance overall support for learning. It opens up new possibilities for including the learners’ emotional states in the user profiles needed for adaptive and personalised feedback, and for the dedicated training of communication skills and other soft skills that rely heavily on emotion [28]. This technology can also easily be used in other domains, such as healthcare; for instance, it can be used to support social adaptation skills for children with autism spectrum disorder [33].