1 Introduction

Emotion recognition systems measure and analyze complex human feelings, such as comfort, discomfort, convenience, and inconvenience, to guide the design of products and environments that improve the quality of human life. In practice, emotion recognition technology can facilitate emotion-based services for users by detecting emotions in entertainment, education, medicine, and other domains. This technology enables the analysis of immediate user reactions at the time of service, thereby improving service quality [11].

Emotion recognition has attracted significant attention from the computer vision and affective computing communities over the past decade because it is an important front-end task in many applications. The majority of existing techniques focus on classifying seven basic expressions: anger, disgust, fear, happiness, neutral, sadness, and surprise [8, 18]. A few methods follow a dimensional approach in which emotional expressions are treated as regression data in the arousal-valence space [16, 20].

Recently, deep learning approaches have been proposed to classify emotions [1, 12, 14], and emotion recognition models based on them have achieved remarkable accuracy. However, emotion recognition remains a very difficult task for deep learning architectures. In particular, one of its major limitations is the lack of appropriate emotion databases.

The majority of existing emotion databases focus on Western people, whose tone and energy of speech, facial expressions, lip movements, and gaze during natural communication differ slightly from those of Eastern people. Benitez-Garcia et al. [2] presented a methodical analysis of Western and Eastern prototypical expressions with a focus on four facial regions: the forehead, eyes-eyebrows, mouth, and nose. They determined that the major differences between Western and Eastern facial expressions occur in the mouth and eyes-eyebrows regions, specifically for expressions of disgust and fear. Similarly, the authors of [10] reported marked differences between Eastern and Western observers in the decoding of universal facial expressions: Eastern observers were unable to reliably distinguish the universal facial expressions of fear and disgust, and whereas Western observers distributed their fixations evenly across the face, Eastern observers persistently fixated on the eye region. Additionally, emotion is not only biological but is also influenced by the environment, which differs significantly between Western and Eastern cultures [13]. Therefore, datasets focusing on Eastern people are required for studying emotion recognition in Eastern cultures.

In this paper, we propose a novel emotion video database containing over 1200 video clips of Eastern people (mostly Korean people). Our database contains video clips collected from 41 Korean movies, annotated by six Korean evaluators. A main face is designated in every frame of each clip to provide clearer information about the recognition target, and each clip satisfies clear emotion conditions that facilitate the training and development of emotion recognition models. The dataset captures changes in both the face and the background, making it suitable for emotion recognition models based on facial expression, head pose, and background, and it provides both visual and audio streams for multimodal studies. We also propose a semi-automatic authoring tool to support the rapid creation and labelling of emotion video clips. This authoring tool automatically cuts an original video into a list of clips for a particular face and provides default emotion predictions for the clips.

The remainder of this paper is organized as follows. Section 2 briefly describes well-known datasets for emotion recognition. Section 3 introduces the proposed video labelling tool and presents the detailed process used for generating and annotating the proposed emotion database. Finally, we present a baseline model for evaluating the proposed dataset in Section 4 and discuss the conclusions of this study in Section 5.

2 Related work

Emotion recognition methods can be categorized based on two types of environments, namely “lab controlled” and “in the wild”. Most human facial expression databases and methods largely depend on lab-controlled environments, meaning that most existing databases were created with controlled backgrounds, illumination, and head movement, so they are not representative of real-world scenarios. We compare the KVDERW dataset with other emotion recognition datasets, namely MMI [15], IEMOCAP [3], AFEW [5], and Belfast [7], as shown in Table 1. The MMI [15] database is a facial expression database containing both images and videos of 75 subjects in a lab-controlled environment; it does not accurately capture the conditions of real-world scenarios. The IEMOCAP [3] dataset is one of the most commonly used databases for research on emotion recognition systems. Although it is used by many researchers, it is heavily limited by its data having been collected in a lab-controlled environment.

Table 1 Comparison of the KVDERW with existing emotion recognition datasets such as AFEW [5], Belfast [7], MMI [15], IEMOCAP [3]

Based on the limitations of natural human expression in lab-controlled environments, several databases and methods representing close-to-real-world environments have been proposed recently. These databases cover unconstrained facial expressions, different age ranges, different face resolutions, varying focus levels, and real-world illumination conditions. The AFEW [5] dataset is a facial expression database with close-to-real-world conditions collected from 54 movies using a recommender system based on subtitles. This system extracts subtitles containing expression-related keywords, and the length of each video clip equals the display time of the corresponding subtitle. The video clips were annotated by human labelers with seven basic expressions, namely anger, disgust, fear, happiness, neutral, sadness, and surprise. The database contains 330 subjects with a large age range of 1–77 years and addresses the issue of temporal facial expressions in the wild. The SFEW [4] dataset was created by selecting specific frames from the AFEW [5] dataset; it contains 700 images labeled with the seven basic emotions by two annotators. Because AFEW [5] clips can contain multiple labeled subjects in the same frame, they do not provide rich information about a single target for training, such as changes in facial expression and head pose. The Belfast [7] database consists of 239 clips collected from TV programmes and interview recordings labeled with particular expressions, but the number of TV clips in this database is sparse. Compared with the manual methods used to construct and annotate these databases, our labelling tool is faster and more effective.

Recently, deep learning technology has advanced significantly and achieved remarkable results, and organized challenges related to automatic emotion recognition have grown in popularity. Two of the most difficult recent challenges are the Emotion Recognition in the Wild (EmotiW) challenge [6] and the Aff-Challenge [19]. Both challenges use in-the-wild datasets: EmotiW uses clips from movies annotated with six universal emotions plus a neutral emotion, while the Aff-Challenge uses YouTube videos annotated on dimensional arousal and valence scales. These datasets include extremely difficult conditions for recognizing the emotions of targets.

3 Video labelling tool and database creation procedure

3.1 Video labelling tool

The proposed semi-automatic labelling tool was designed to support dataset creation and annotation. The authoring tool was developed using the C# Windows Presentation Foundation framework. It provides an application interface for easily creating video clips and annotating emotions. The recorded annotations can be exported in comma-separated values (.csv) format for data analysis via majority voting or multiple labels.

The proposed labelling tool supports two main tasks: automatically cutting an original video into a list of video clips based on faces, and annotating emotion labels by multiple annotators. Figure 1 presents the user interface of the proposed labelling tool with its components indexed by number. Box one is the interface for cutting an original video. Box two contains control buttons for video playback. Boxes three and four contain interfaces for categorical and continuous emotion annotation, respectively. Box five presents information about the current video. In the following subsection, we describe our data generation procedure.

Fig. 1
figure 1

User interface for the proposed semi-automatic authoring tool for creating and annotating emotion video clips

3.2 Data generation

The process of creating and annotating emotion video clips has three main steps. First, we create a list of video clips by automatically cutting an original video based on a face identification process; this step is performed using the interface in box one in Fig. 1. Next, we eliminate low-quality video clips from the list, examples of which are shown in Fig. 3. Finally, evaluators annotate the emotions in the video clips. To obtain a final emotion label, one can use either majority voting or multiple labels via the interface in box three in Fig. 1. In the real world, the same emotion can be perceived differently by different observers; however, in this study, only one emotion label was assigned to each video clip, and multiple labels will be examined in future research. Figure 2 presents an overview of the procedure for creating and annotating emotion video clips.

Fig. 2
figure 2

Flow chart illustrating the process of creating and annotating emotion video clips

Fig. 3
figure 3

Examples of low-quality video clips

3.2.1 Cutting the original video

Cutting an original video into clips manually requires significant time and effort. To handle this problem, we designed an algorithm that automatically extracts video clips from an original video. Video clips must contain at least one face in every frame. A video clip is defined as a set of consecutive frames containing the same face.

For detecting faces in videos, we integrated several face detection algorithms, including OpenCV, Dlib, Mtcnn [21], and Tinyface [9], as shown in the face detection section of box one in Fig. 1. The proposed interface allows users to select the desired algorithm for face detection. To determine whether the faces in consecutive frames represent the same person, we calculate the Euclidean distance between face features extracted by a VGG16 model [17] pre-trained on the VGGFace dataset. In the proposed system, cutting one video into clips takes approximately 10 hours when using Mtcnn face detection with a clip length of 2 seconds. The video cutting algorithm is outlined in Algorithm 1.

To test the cutting algorithm, we generated approximately fifty thousand video clips from the 41 movies. Most of these clips were eliminated in the subsequent deletion step to produce a final dataset with good conditions for recognizing emotions.

figure a
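As a complement to Algorithm 1, the snippet below is a minimal sketch of the same-face check described above, i.e., comparing the Euclidean distance between VGG16 face features of consecutive frames. It is illustrative only: the framework (Keras) is an assumption, ImageNet weights stand in for the VGGFace weights used by the tool, and the distance threshold is a hypothetical value that the paper does not report.

```python
import cv2
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

# Feature extractor: the tool uses VGG16 pre-trained on VGGFace; ImageNet
# weights are substituted here purely for illustration.
extractor = VGG16(weights="imagenet", include_top=False, pooling="avg")

def face_embedding(face_img):
    """Embed a cropped face image (H x W x 3, uint8) as a 512-D feature vector."""
    face = cv2.resize(face_img, (224, 224)).astype("float32")
    face = preprocess_input(face[np.newaxis, ...])  # add a batch dimension
    return extractor.predict(face, verbose=0)[0]

def same_person(feat_a, feat_b, threshold=100.0):
    """Return True if two consecutive-frame faces are treated as the same person.

    The Euclidean-distance threshold is hypothetical; the paper does not
    report the value used in the authoring tool.
    """
    return float(np.linalg.norm(feat_a - feat_b)) < threshold
```

Running such a deep feature comparison for consecutive detections, on top of deep face detection, is consistent with the reported cutting time of roughly 10 hours per video.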

3.2.2 Deleting low-quality video clips

The quality of the video clips created automatically by the tool varies significantly. One of the most common problems is that the same content is repeated in consecutive video clips; in other words, there are many nearly identical clips. We eliminate as many similar video clips as possible to enhance the quality of the dataset. Furthermore, because our dataset focuses on facial expressions, another problem is the presence of factors that degrade a clip's usefulness for emotion recognition, such as part of a face disappearing, illumination variation on a face, incorrectly identified faces, faces covered by helmets or masks, faces appearing behind glass or a fence, and faces covered by blood or paint. In such cases, it is difficult to recognize the seven emotions from the detected faces. We remove these video clips to obtain a final set of clips suitable for seven-emotion recognition. Figure 3 presents some examples of low-quality video clips generated by the proposed tool.
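The removal of near-duplicate consecutive clips can be sketched as follows. This is only one plausible realization under stated assumptions (comparing one representative face embedding per clip against a hypothetical distance threshold); the paper does not specify the exact similarity measure used.

```python
import numpy as np

def drop_near_duplicates(clip_features, threshold=10.0):
    """Keep only clips whose representative face embedding differs sufficiently
    from the previously kept clip.

    `clip_features` is a list of 1-D embeddings (e.g. the VGG16 feature of a
    clip's first main face); the distance `threshold` is hypothetical.
    """
    kept_indices = []
    last_kept = None
    for idx, feat in enumerate(clip_features):
        if last_kept is None or np.linalg.norm(feat - last_kept) >= threshold:
            kept_indices.append(idx)
            last_kept = feat
    return kept_indices
```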

The final dataset used in this study contained 1316 video clips with good conditions for recognition. These clips were manually annotated to obtain the final emotion labels. The annotation process is described in the following section.

3.2.3 Categorical emotion annotation

Six native Korean evaluators (two female and four male university students) were asked to annotate the emotional content of the database using seven emotional categories. The database was divided into 41 folders corresponding to the 41 movies, and the folders were annotated in sequence. The six evaluators assigned labels separately, with each folder annotated in a single session. To avoid any bias, the evaluators could not see the annotations of the others, and no discussion between evaluators was allowed. The evaluators were also instructed to take suitable rests between folders. The database targets seven categorical emotions: anger, disgust, fear, happiness, neutral, sadness, and surprise. The annotation process required approximately 12 hours and was conducted in three sessions over three days, with four hours of annotation per day.

The evaluators were asked to select only a single emotion label for each video clip. For simplicity, majority voting was used for emotion assignment, with a label requiring at least four votes to be selected as the final label. Table 2 summarizes the parameters for clip annotation. Figure 4 and Table 3 present the distribution of emotion labels assigned by at least four people, where the “other” label in Fig. 4 indicates labels assigned by fewer than four people. One can see that the emotion database has a reasonably balanced distribution of target emotions overall, although the disgust emotion has relatively few samples. The total number of video clips in Table 3 is 1246, which excludes the “other” category in Fig. 4.
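The majority-voting rule can be stated concisely in code. The short sketch below simply counts the six annotators' votes per clip and returns None for clips that fall into the “other” category (fewer than four matching votes).

```python
from collections import Counter

def majority_label(votes, min_votes=4):
    """Return the final emotion label for a clip, or None (the 'other'
    category) when no label reaches `min_votes` among the six annotators."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_votes else None

# Example: four of six annotators agree on "happiness".
print(majority_label(["happiness", "happiness", "neutral",
                      "happiness", "surprise", "happiness"]))  # -> happiness
```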

Fig. 4
figure 4

Distribution of emotional categories, where the total number of video clips is 1316

Table 2 Parameters for video clip annotation
Table 3 Emotion distribution for KVDERW

Table 4 summarizes the metadata of the proposed database. Figure 5 presents some example emotions from the proposed database, which are very similar to real-world scenarios.

Fig. 5
figure 5

Example emotions in database. From left to right, top to bottom: Anger, Disgust, Fear, Happiness, Neutral, Sadness, and Surprise

Table 4 Dataset metadata

4 Baseline experiments

In this section, we present baseline experiments that were performed to determine the quality of the proposed dataset. The dataset was divided into training (50%), validation (20%), and testing sets (30%) consisting of 639, 271, and 336 clips, respectively.
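One way to reproduce such a partition is sketched below. The stratified splitting, the random seed, and the placeholder labels are assumptions; the paper does not state how the partition was made, and its exact clip counts (639/271/336) differ slightly from a strict 50/20/30 split, so this only approximates it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder labels standing in for the 1246 annotated clips.
labels = np.random.randint(0, 7, size=1246)
clip_ids = np.arange(len(labels))

# 50% for training, then 20% / 30% of the whole set for validation / testing.
train_ids, rest_ids, y_train, y_rest = train_test_split(
    clip_ids, labels, test_size=0.5, stratify=labels, random_state=0)
val_ids, test_ids, y_val, y_test = train_test_split(
    rest_ids, y_rest, test_size=0.6, stratify=y_rest, random_state=0)
```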

To predict the emotion labels of video clips, we first extract the main face from each frame of a clip using the Tinyface detector. Figure 6 depicts an example of the main face in each frame of a video clip. The list of main faces is then down- or up-sampled to generate a final list of 20 faces per clip. Features are extracted from each face in this list using a VGG16 model [17] pre-trained on the VGGFace dataset, and the per-face features are concatenated into a single feature vector. Finally, a multi-layer perceptron (MLP) classifier is trained to classify video emotions. During training, the weights of the pre-trained model were frozen, meaning that only the weights of the MLP were adjusted. We used the Adam optimizer with a learning rate of 1e-6 and a batch size of 8. Figure 7 illustrates the pipeline of our baseline model.

Fig. 6
figure 6

Example of main face in each frame of a video clip

Fig. 7
figure 7

Pipeline for the baseline model for emotion recognition
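A minimal sketch of this baseline pipeline is given below. The framework (Keras), the ImageNet weights standing in for the VGGFace weights, and the MLP hidden-layer size and dropout are assumptions; only the 20-face sampling, the frozen backbone, the feature concatenation, the Adam optimizer with a learning rate of 1e-6, and the batch size of 8 come from the text.

```python
import cv2
import numpy as np
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

NUM_FACES = 20      # faces sampled per clip (from the paper)
NUM_CLASSES = 7     # seven emotion categories
FEAT_DIM = 512      # VGG16 average-pooled feature size

# Frozen feature extractor; ImageNet weights stand in for the VGGFace
# weights used in the paper (illustration only).
backbone = VGG16(weights="imagenet", include_top=False, pooling="avg")
backbone.trainable = False

def resample_faces(faces, n=NUM_FACES):
    """Down- or up-sample a list of face crops to exactly n entries."""
    idx = np.linspace(0, len(faces) - 1, n).round().astype(int)
    return [faces[i] for i in idx]

def clip_feature(faces):
    """Concatenate per-face VGG16 features into one clip-level vector."""
    batch = np.stack([cv2.resize(f, (224, 224)) for f in resample_faces(faces)])
    feats = backbone.predict(preprocess_input(batch.astype("float32")), verbose=0)
    return feats.reshape(-1)   # shape: (NUM_FACES * FEAT_DIM,)

# MLP head trained on the concatenated clip features.
mlp = models.Sequential([
    layers.Input(shape=(NUM_FACES * FEAT_DIM,)),
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
mlp.compile(optimizer=optimizers.Adam(learning_rate=1e-6),
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
# mlp.fit(X_train, y_train, batch_size=8, validation_data=(X_val, y_val))
```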

The baseline model obtained accuracies of 44.28% and 39.58% on the validation and testing sets, respectively. Figures 8 and 9 present the corresponding confusion matrices. The model performed well on the anger, happiness, and surprise emotions, whereas the disgust, fear, and neutral categories were difficult to recognize. This may be caused by the imbalanced distribution of the data across categories: as shown in Table 3, each of these three categories accounts for less than 10% of the entire dataset (disgust 2.73%, fear 9.87%, and neutral 5.86%), while the anger, happiness, sadness, and surprise categories make up most of the dataset.

Fig. 8
figure 8

Confusion matrix for the Korean validation set

Fig. 9
figure 9

Confusion matrix for the testing set of KVDERW

Additionally, we created an English emotion dataset by collecting video clips from 23 movies, using the same pipeline as for KVDERW. This dataset consists of 925 clips, and its emotion label distribution is presented in Table 5. The English video dataset was divided into training (50%), validation (20%), and testing (30%) sets consisting of 489, 192, and 244 clips, respectively. We evaluated this dataset using a baseline model trained in the same manner as described above. The confusion matrices for the validation and testing sets are presented in Figs. 10 and 11, with accuracies of 43.23% and 37.71%, respectively. One can see that the fear and sadness emotions are predicted accurately, whereas disgust and neutral are often misclassified.

Fig. 10
figure 10

Confusion matrix for the English validation set

Fig. 11
figure 11

Confusion matrix for the English testing set

Table 5 Emotion distribution for the English dataset

We also tested a baseline model trained on KVDERW on the English testing set and vice versa. The experimental results are listed in Table 6. There are clear differences between the confusion matrices for the KVDERW testing set presented in Figs. 9 and 12, and between the confusion matrices for the English testing set presented in Figs. 11 and 13. For example, the fear emotion is predicted more accurately in Fig. 9 than in Fig. 12, and it is predicted accurately in Fig. 11 but poorly in Fig. 13. These large differences demonstrate that the expressions of emotions of Eastern and Western people differ. Similar to the disgust, fear, and neutral categories in KVDERW, the results for the disgust, neutral, and surprise categories are poor, which may be caused by the imbalanced distribution of the data across categories (shown in Table 5) and the limited data size.

Fig. 12
figure 12

Confusion matrix for the testing set of KVDERW with pre-training on the English set

Fig. 13
figure 13

Confusion matrix for the English testing set with pre-training on KVDERW

Table 6 Experiments on different testing sets

5 Conclusion

In this study, we proposed a novel Korean emotion dataset containing emotion video clips under close-to-real-world conditions. The clips were extracted from movies for emotion recognition, and the database was assessed using a baseline system. The main purpose of this dataset is to help researchers develop and improve emotion recognition models for people of various ethnicities. To reduce the cost and time of creating and labelling videos, we also developed a semi-automatic tool that supports the creation and annotation of emotion video clips. This tool can help users identify difficult cases when assigning emotion labels by providing clearer information about a video clip as a whole. In the future, we will improve this tool and consider an additional dataset based on the arousal-valence space.