1 Introduction

Emotions produce distinct physiological, behavioral, and cognitive changes. Traditionally, children's emotions have been assessed indirectly by teachers, yet teachers are often too busy to observe and rate every child. Wei (2011) developed the standardized ECSYC, which comprises 40 emotional competencies across four subscales: understanding self, understanding others, emotional adjustment, and self-motivation. Can children's emotional competencies instead be detected through child–computer interaction? Recent advances in big data and artificial intelligence have begun to save resources, including time, human effort, and cost.

At present, children's emotional ability is evaluated by teachers on the basis of long-term observation, and most studies have likewise relied on indirect assessment with rating scales (Darling-Churchill and Lippman 2016). This study instead detected children's emotional competencies directly from facial expressions, voices, and drawings. The purpose of this study is to explore the correlation between the emotional scores obtained with the ECSYC and the YCEL.

2 Literature review

2.1 ECSYC

The ECSYC is a standardized test with norms for the emotional development of 4–5-year-old children and strong evidence of reliability and validity. The reliability study was conducted on a stratified random sample of 1067 children in Taiwan: Cronbach's α reached 0.98, and the Kaiser–Meyer–Olkin (KMO) measure in the validity study reached 0.951 (Wei 2011). The instrument is a 5-level rating scale comprising four subscales, each composed of ten emotional competencies, and it is completed by teachers who know the children's emotions through long-term observation.

2.2 YCEL

Visual emotion detection for “natural” human–robot interaction (HRI) generates an appropriate reaction according to the detected emotion of the communication partner (Udochukwu and He 2015). The rationale for developing the YCEL is to keep the emotional-development average for 4–6-year-old children continuously updated by applying artificial intelligence and big-data analysis. To this end, the researcher set out to develop a standardized YCEL based on the ECSYC. Accordingly, children's emotional competencies are assessed with 40 emotional theater games constructed from the 40 emotional competencies of the ECSYC. Finally, internal-consistency reliability is examined with Cronbach's alpha, and criterion-related validity is examined by analyzing the correlation between the ECSYC and the YCEL.

2.3 Emotional expression

A related study used emotional pictures for expression recognition and found that the widely used facial emotion pictures (PoFA, the “Ekman faces”) and the Radboud Faces Database (RaFD) are generally not considered to show genuine emotions (Dawel et al. 2017; Mehta et al. 2018). Another study captured children's facial expressions on video so that emotions could be detected continuously at a time unit determined by the researcher (Cunha 2018). Because one second is the most frequently used unit, this study also adopts a one-second interval.

2.4 Emotional detection

Most existing emotion recognition techniques rely on convolutional neural network (CNN) models. Karen et al. proposed the VGG16 architecture based on the CNN (Delplanque 2017). VGG16 emphasizes the importance of network depth: each convolutional filter is reduced to 3 × 3, replacing the larger filters commonly used in earlier CNNs. This improves the efficiency of the convolution operation, but the parameters of the final fully connected layers still account for about 90% of the whole network. Christian et al. therefore proposed the Inception V3 architecture, which uses global average pooling (Joseph and Strain 2003). Averaging the feature maps in the pooling layer alleviates the problem of excessive parameters and features captured by the fully connected layers.

CNN-based emotion recognition has largely solved the problem of excessive parameters. However, when a convolutional layer has too many filters, the time required for the convolution operation increases greatly. François et al. therefore proposed the Xception architecture, whose depth-wise separable convolutions further speed up the convolution operation and reduce the amount of computation (Daunic 2015). In a standard CNN, every filter is convolved over every channel of the input. In a depth-wise separable convolution, each input channel is first convolved with a single spatial filter, and a set of 1 × 1 (point-wise) filters then combines the outputs across channels. In this way, the computation of the convolution operation can be reduced to roughly 1/8 to 1/9 of that of a standard CNN.
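
To make the parameter saving concrete, the following minimal sketch (not taken from the cited work) compares a standard convolution with a depth-wise separable convolution in tf.keras; the 48 × 48 input size and the 64/128 channel counts are illustrative assumptions.

# Minimal sketch, not the authors' implementation: parameter counts of a
# standard convolution versus a depth-wise separable convolution.
import tensorflow as tf

inputs = tf.keras.Input(shape=(48, 48, 64))                    # 64-channel feature map (assumed)

standard = tf.keras.layers.Conv2D(128, 3, padding="same")(inputs)
separable = tf.keras.layers.SeparableConv2D(128, 3, padding="same")(inputs)

std_params = tf.keras.Model(inputs, standard).count_params()   # 3*3*64*128 + 128 = 73,856
sep_params = tf.keras.Model(inputs, separable).count_params()  # 3*3*64 + 64*128 + 128 = 8,896

print(std_params, sep_params, round(sep_params / std_params, 3))  # ratio ~0.12, i.e. about 1/8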

A related study found that successful emotion recognition depends on the transfer of information between the computer and the face (Mehta et al. 2018). The computer should be able to obtain facial information, such as emotion and gender, instantly. However, because facial information is highly complex, too many parameters arise during machine learning, which makes instant recognition impossible. This study therefore uses a real-time CNN-based emotion recognition program. Global average pooling compresses the information in each image while retaining its important features, and depth-wise separable convolutions reduce the computational cost of the convolution operation, making real-time emotion recognition of images possible.

Visual emotion detection for “natural” human–robot interaction (HRI) generates an appropriate reaction according to the detected emotion of the communication partner. HRI has become practical thanks to developments in hardware and increasingly complex software applications (Strupp et al. 2008; Natarajan and Muthuswamy 2015). Dixit and Gaikwad (2016) used patch-based face features and an SVM classifier to detect emotion from facial expressions. Their experiments on the Japanese Female Facial Expression database achieved 91% average accuracy for the basic emotions of happiness, anger, sadness, surprise, disgust, fear, and neutral, with an average detection time of 1.1 s; a feasible follow-up study would be to identify which emotions are detected most easily. Vaish et al. (2019) maintained the highest possible accuracy while keeping computational cost minimal; their system, tested on subjects not present in the training dataset, gave results comparable to other real-time emotion detection systems. In short, researchers have steadily improved the accuracy of facial emotion detection. Accordingly, this study used the Microsoft Azure services to detect facial expressions rather than developing its own recognition system (Microsoft Azure 2018).

2.5 Statistical methods

The underlying methodology of previous studies is lexicon based: a lexicon is used to detect emotions in text obtained through speech recognition (Pajupuu et al. 2012; Kato et al. 2006). With a single modality of speech recognition, correct emotion recognition reaches 73% for happiness, 60% for anger, and 55% for sadness, with an overall accuracy of 62% across emotions (Hsu and Chen 2012). Bimodal emotion recognition, by contrast, can reach an accuracy of 86.85%, an increase of about 5% over a single modality (Song et al. 2015; Chuang and Wu 2004; Kessous et al. 2010). Previous studies have likewise indicated that satisfactory results cannot be achieved with a single modality of either speech or facial expression alone (Ma et al. 2019; Yang et al. 2017). Accordingly, this study applied a bimodal emotion recognition system combining facial expression recognition (55%) and speech recognition (45%). Speech recognition comprised two parts. Part one, scored against the lexicon, accounted for 38% because it consumed about 80% of the total time. Part two, in which each child draws and talks about the story, accounted for 7% because it consumed about 20% of the total time. Reliability was analyzed by examining the correlation between part one and part two.

This study developed an emotional lexicon by analyzing 200 children's answers to 40 questions derived from the ECSYC. In addition to speech recognition, the study applied automatic facial action analysis (Kapoor 2002) of seven emotion expressions: anger, contempt, disgust, happiness, neutrality, sadness, and surprise. The frequency of each of the seven emotions was identified and counted during each emotional theater game. According to the database, each emotion's occurrences are converted to the five-level score, multiplied by the number of occurrences, and summed into the numerator; the total number of occurrences of the seven emotions forms the denominator, yielding the weight distribution of each emotion within each question. The resulting percentage is then mapped onto the five levels of the database; for example, a percentage of 80–100% is given five points. This study used Microsoft's Project Oxford tools to implement expression recognition as a sub-criterion (Zhao et al. 2016).

In addition to facial expression detection, Udochukwu and He (2015) developed a rule-based approach to implicit emotion detection in text, which achieved an average F-measure of 82.7% for “Happy”, “Angry-Disgusted” and “Sad”. Moreover, Cho et al. (2008) proposed a Bayesian method for detecting emotion in voice, based on Bayesian networks that represent the dependence, and its strength, between the dialogist's utterance and his or her emotion. Darekar and Dhande (2017) applied an algorithm for emotion detection that achieved a higher accuracy in speech analysis.

Karpouzis et al. (2007) applied facial, vocal, and bodily expression recognition to emotion detection, describing a multi-cue, dynamic approach for naturalistic video sequences. The hybrid approach proposed in this study therefore also combines facial, vocal, and language recognition.

3 Methods

This study attempts to answer the major research question: Is there a significant positive correlation between the YCEL and the ECSYC? To achieve this goal, the study combined in-depth interviews, focus group discussions, observation, and an experimental method. The implementation process included: (1) developing emotional theater games based on 40 emotional scripts, (2) designing 40 questions for children to answer, and (3) developing five indicators for assessing each competency. For the experiments, (4) the researcher randomly selected 200 children aged 4–6 years and (5) conducted 40 emotional theater experiments. After the experiments, (6) the researcher carried out observer training and a reliability study of the consistency of the four observers, and (7) the validity study was undertaken by analyzing the correlation between the ECSYC and the YCEL.

This study uses three analysis technologies: Microsoft Azure Bing Speech-to-Text, Microsoft Azure Text Analytics, and Microsoft Azure Emotion. First, Microsoft Azure Bing Speech-to-Text converts each child's spoken answer into a sentence. The researcher then uses Microsoft Azure Text Analytics to analyze the emotions contained in the text, and the keywords of the sentence are compared with the YCEL to determine the emotional score. Finally, Microsoft Azure Emotion analyzes the child's facial emotional changes throughout the process.
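
The pipeline can be pictured with the following illustrative sketch. It is not the study's actual code; the endpoint URLs, subscription-key handling, and response field names are assumptions that would need to be checked against the Azure Cognitive Services version in use.

# Illustrative sketch only: chaining speech-to-text, text sentiment and face
# emotion analysis through REST calls. Endpoints and response fields are
# assumptions, not the study's verified configuration.
import requests

SPEECH_URL = "https://<speech-endpoint>"   # placeholder for the Bing Speech-to-Text endpoint
TEXT_URL = "https://<region>.api.cognitive.microsoft.com/text/analytics/v2.1/sentiment"
FACE_URL = "https://<region>.api.cognitive.microsoft.com/emotion/v1.0/recognize"
KEY = "<subscription-key>"

def speech_to_text(wav_bytes: bytes) -> str:
    """Send the child's recorded answer; 'DisplayText' is the assumed transcript field."""
    r = requests.post(SPEECH_URL, data=wav_bytes,
                      headers={"Ocp-Apim-Subscription-Key": KEY, "Content-Type": "audio/wav"})
    return r.json().get("DisplayText", "")

def text_sentiment(text: str) -> float:
    """Score the transcript's sentiment; the YCEL lexicon comparison happens separately."""
    body = {"documents": [{"id": "1", "language": "zh-Hant", "text": text}]}
    r = requests.post(TEXT_URL, json=body, headers={"Ocp-Apim-Subscription-Key": KEY})
    return r.json()["documents"][0]["score"]

def face_emotions(jpg_bytes: bytes) -> dict:
    """Return per-emotion scores for one captured frame."""
    r = requests.post(FACE_URL, data=jpg_bytes,
                      headers={"Ocp-Apim-Subscription-Key": KEY,
                               "Content-Type": "application/octet-stream"})
    faces = r.json()
    return faces[0]["scores"] if faces else {}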

3.1 System design

This study established the YCEL for children and developed 40 emotional theater games. First, young children identified facial expressions while enjoying the games and produced their own facial expressions (55% = A). Second, the teachers presenting the emotional theater games asked the children questions and elicited responses; the transcribed text of the children's answers was compared with the YCEL, producing a speech recognition sentiment score (38% = B). Finally, the children were asked to draw what they had watched and to interpret their drawings verbally, and the transcription was compared with the YCEL to produce a drawing-explanation sentiment score (7% = C). The emotional scores obtained by the children (A + B + C) were then related to the standardized ECSYC developed by the investigator. The purposes of the study are: (1) to develop the YCEL based on the ECSYC (Wei 2011); and (2) to detect emotion and score each competency with the formula: facial expression recognition score × 55% + speech recognition score × 38% + drawing explanation score × 7%.

The study created a database to store the results of the Microsoft Azure Emotion analysis. The researcher generated a QR code as the ID for each participant; the ID records the participant's name, age, and gender, as well as the teacher's name. The database stores the answer text (Speech-to-Text), the text analysis results (TextAnalyticsResult), and the emotional data obtained after analysis (FaceEmotion). The system architecture is shown in Fig. 1.

Fig. 1

System architecture

In Fig. 1, the left-hand column represents how a child interacts with the computer, the middle column shows the front-end interface design, and the right-hand column shows the back-end database design.
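
A hypothetical schema sketch corresponding to the records named above (Speech-to-Text, TextAnalyticsResult, FaceEmotion) is given below; table and column names beyond those mentioned in the text are assumptions.

# Hypothetical schema sketch (SQLite); columns beyond those named in the text
# are assumptions added for illustration.
import sqlite3

conn = sqlite3.connect("ycel.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS Participant (
    id      TEXT PRIMARY KEY,   -- encoded in the participant's QR code
    name    TEXT,
    age     INTEGER,
    gender  TEXT,
    teacher TEXT
);
CREATE TABLE IF NOT EXISTS SpeechToText (
    participant_id TEXT REFERENCES Participant(id),
    item_no        INTEGER,     -- which of the 40 emotional theater games
    transcript     TEXT
);
CREATE TABLE IF NOT EXISTS TextAnalyticsResult (
    participant_id TEXT REFERENCES Participant(id),
    item_no        INTEGER,
    sentiment      REAL,
    lexicon_level  INTEGER      -- 1-5 level matched against the YCEL lexicon
);
CREATE TABLE IF NOT EXISTS FaceEmotion (
    participant_id TEXT REFERENCES Participant(id),
    item_no        INTEGER,
    second         INTEGER,     -- one sample per second
    emotion        TEXT         -- e.g. happiness, sadness, anger
);
""")
conn.commit()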

3.2 Development of YCEL

  1.

    Develop 40 emotional theater scripts based on ECSYC.

  2.

    Determine the scoring standards of five rating scales for each emotional competency.

  3.

    Conduct a focus group discussion with the kindergarten teachers to decide the scoring standards of five rating scales, as shown in Fig. 2.

    Fig. 2

    Focus group discussion

  4.

    Integrate the 40 emotional theater scripts into teaching activities and videotape the activities, as shown in Fig. 3.

    Fig. 3

    Emotional theater in progress

  5.

    Undertake observer training using videos and establish inter-rater reliability (we reached a rate of 93.9%).

  6.

    Develop the YCEL by analyzing the 200 children's emotional competencies according to the videos and scoring standards.

3.3 Emotion detection model

The emotion detection software and a triangulation verification method were applied (Zhang et al. 2018). The emotion detection technologies included Microsoft Azure Bing Speech-to-Text, Microsoft Azure Text Analytics, and the Microsoft Azure Emotion API, as shown in Fig. 4.

Fig. 4

Emotion detection model

The detailed steps of the emotion detection model are described as follows.

3.3.1 Formula of emotional score

The emotional score is calculated as FES × 55% + VTS × 38% + LES × 7%, as shown in Fig. 5.

Fig. 5

Emotional score calculation

Figure 5 indicates that the facial expression score accounts for 55% of a child's overall emotional score. The Microsoft Azure Emotion API is a mature technology with an accuracy rate of over 99% (Salvaris et al. 2018).
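
As a minimal sketch of the weighting in Fig. 5 (assuming the three component scores lie on the same 1–5 level scale; the example values are illustrative only):

# Weighting formula of Fig. 5; the example level scores are illustrative only.
def emotional_score(fes: float, vts: float, les: float) -> float:
    """Combine the facial (FES), voice-to-text (VTS) and language (LES) scores."""
    return 0.55 * fes + 0.38 * vts + 0.07 * les

print(emotional_score(2, 4, 3))   # 0.55*2 + 0.38*4 + 0.07*3 = 2.83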

3.3.2 Facial expression score (FES) calculation

The facial expression score accounts for 55% of a child's overall emotional score. Four children were analyzed with Microsoft Azure Emotion: the occurrences of each emotion were counted and the scores were then calculated according to the formula. Voice recognition converts each description into text and saves it to the database, as shown in Fig. 6.

Fig. 6

Expression recognition score calculation method

The expression recognition software identifies the emotion type once per second, and the occurrences of each emotion are then totalled. The number of occurrences of the target emotion (a) is divided by the total number of samples (N) and multiplied by 100%, i.e., (a/N) × 100% = A. The resulting percentage is mapped onto the five-level score (e.g., 81–100% corresponds to 5 points) to determine the expression recognition score.
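
The calculation can be sketched as follows, assuming one emotion label per second; only the 81–100% cut-off is stated in the text, so the intermediate cut-offs below are assumptions.

# Sketch of the FES calculation: (a / N) x 100%, then mapping to a 5-level score.
from collections import Counter

def facial_expression_score(labels, target_emotion):
    """labels: per-second emotion labels for one emotional theater game."""
    counts = Counter(labels)
    a, n = counts[target_emotion], len(labels)
    pct = 100.0 * a / n                      # (a / N) x 100%
    if pct >= 81: return 5                   # stated cut-off
    if pct >= 61: return 4                   # assumed intermediate cut-offs
    if pct >= 41: return 3
    if pct >= 21: return 2
    return 1

print(facial_expression_score(["happiness"] * 50 + ["neutral"] * 10, "happiness"))  # ~83% -> 5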

3.3.3 Voice to text score (VTS) calculation

The voice-to-text emotion score accounts for 38% of a child's overall emotional score. Each child's answer was converted from voice to text and analyzed with Microsoft Azure Text Analytics. Manually comparing the transcripts with the vocabulary database was impractical, so the application of AI was essential for determining the VTS emotion score.
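
A simplified sketch of the lexicon comparison is shown below; the lexicon entries are invented for illustration, and a real pipeline would also segment the Chinese transcript before matching.

# Sketch only: match the transcribed answer against an (invented) lexicon that
# maps child vocabulary to 1-5 competency levels.
EXAMPLE_LEXICON = {
    "share with my friend": 5,
    "ask the teacher for help": 4,
    "cry": 2,
}

def voice_to_text_score(transcript: str, lexicon: dict) -> int:
    """Return the highest lexicon level whose phrase appears in the transcript."""
    levels = [level for phrase, level in lexicon.items() if phrase in transcript]
    return max(levels, default=1)            # assumed fallback when nothing matches

print(voice_to_text_score("I would ask the teacher for help", EXAMPLE_LEXICON))  # 4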

3.3.4 Language emotional score (LES) calculation

The language emotional score accounts for 7% of a child's overall emotional score. Upon finishing each item, the child was asked to explain what he or she had drawn; the explanation was converted from voice to text and compared with the vocabulary database to obtain the language emotional score. A pilot experiment was conducted to confirm that the data obtained met the requirements.

3.4 Implementation of the pilot study

To verify the emotional scoring formula, the researcher implemented a pilot study to discover any problems.

3.4.1 Register ID

Each child's ID number is used to index the child's photo, drawing, and database records. Pressing Enter moves to the photo-taking step: the image is captured automatically and stored under the child's ID number, after which the child's photo and the teacher's picture automatically appear on the screen.

In Fig. 7, the avatar shown on the screen represents the kindergarten teacher. The child's image was captured by the camera and projected onto the screen.

Fig. 7

Teacher–child interaction

3.4.2 Select the emotional theater game and start the video display

Clicking the corresponding number plays the matching video. After the video finishes, the system automatically proceeds to the question step.

3.4.3 Set up the system and try the connection

The researcher sets up the system back end, prepares the microphone and the Google speech recognition service, and ensures that Wi-Fi is connected.

3.4.4 Implement child-computer interaction

When the system asks a question, an animation of the child–computer interaction is displayed on the screen. Voice recognition converts the child's answer into text and compares it with the database; if the text is not in the database, it is saved for later judgment by machine learning. Finally, the teacher asks the child to draw what he or she watched, and the system saves the drawing as a jpg file. The teacher then asks the child to describe what he or she has drawn, as shown in Fig. 8; the children's replies to the teacher's question are shown in Fig. 9.

Fig. 8

Teacher asks child to talk about his drawing

Fig. 9

The children reply to the teacher’s question

3.4.5 Light up the LED display according to facial expression score

At the end of each theater game, the system shows a score representing the child's emotion, and the LED display lights up accordingly.

4 Results

Table 1 shows that the Cronbach's α of the YCEL is 0.75, and Table 2 shows that the α of the ECSYC is 0.98. Table 4 shows a significant correlation between the YCEL and the ECSYC, r = 0.41 (p = 0.026). These results indicate a significant positive correlation between the YCEL and the ECSYC and suggest that the YCEL is suitable for machine-based measurement. The norm of children's emotional development can be updated automatically and continuously when the YCEL is applied with machine learning.

Table 1 Reliability statistics of YCEL
Table 2 Reliability statistics of ECSYC

4.1 Reliability analysis of YCEL

The reliability analysis evaluates the reliability of the whole scale. In this study, Cronbach's alpha was computed on the data used to develop the common emotional vocabulary database for children. The reliability of the YCEL was estimated as follows.

Table 1 shows that the YCEL reliability coefficient (Cronbach's α) was 0.747, and the standardized reliability coefficient was 0.710. The standardized α corrects for the influence of unequal variances across items.

Table 2 shows that the ECSYC reliability coefficient (Cronbach's α) was 0.976, and the standardized reliability coefficient was also 0.976.
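
For reference, Cronbach's α can be computed from the respondents-by-items score matrix with the standard formula below; this is a generic sketch with simulated data, not the study's statistical procedure.

# Generic Cronbach's alpha from a respondents-by-items matrix (simulated data;
# real scale data with correlated items yields values such as those in Tables 1-2).
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: shape (n_respondents, n_items), e.g. 200 children x 40 items."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

demo = np.random.default_rng(0).integers(1, 6, size=(200, 40))  # illustrative data only
print(round(cronbach_alpha(demo), 3))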

4.2 Correlation analysis between YCEL and ECSYC

Table 3 shows that the sample averages of YCEL and ECSYC are 35.6 and 42.7, respectively.

Table 3 Descriptive statistics

Table 4 shows that Spearman's rho reached 0.406 (p = 0.026), indicating a significant correlation between the YCEL and the ECSYC. Because the ECSYC is a standardized test with high reliability and validity, this correlation supports the reliability and validity of the YCEL developed in this study. The findings suggest that the YCEL is feasible for detecting emotional competency through child–computer interaction in the form of emotional theater games.

Table 4 Correlation analysis
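
The correlation check itself reduces to a single call; the scores below are illustrative placeholders, not the study's raw data.

# Spearman's rho between total YCEL and ECSYC scores (placeholder values only).
from scipy.stats import spearmanr

ycel_totals = [35.6, 30.2, 41.0, 38.5, 33.1]     # illustrative
ecsyc_totals = [42.7, 37.0, 45.1, 39.8, 41.2]    # illustrative

rho, p_value = spearmanr(ycel_totals, ecsyc_totals)
print(rho, p_value)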

4.3 Database of children’s frequently used emotional vocabulary

Figure 10 is an example of the lexicon for one YCEL competency (#1). The percentages represent the frequency of each vocabulary item at each level of the 5-level rating scale. The five-level scoring standards were developed through the following steps. First, this study developed 40 scenarios based on the ECSYC. Second, the five-level criteria were developed and categorized by kindergarten teachers. Third, observer training was implemented and inter-rater consistency reliability was calculated. Fourth, the teacher asked the children questions designed to reflect their emotional competencies. Fifth, four observers categorized the 200 children's replies into the five levels. Sixth, the study ranked the vocabulary by frequency within each level and completed the emotional lexicon.

Fig. 10

An example of a lexicon for emotion competency (#1)
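
The sixth step above amounts to a frequency count within each level; a sketch with invented replies is shown below (a real pipeline would first apply Chinese word segmentation).

# Sketch of ranking reply vocabulary by frequency within each of the five levels.
from collections import Counter, defaultdict

replies_by_level = {                         # level -> observers' categorized replies (invented)
    5: ["share the toy", "share the toy", "comfort my friend"],
    4: ["ask the teacher", "ask the teacher", "wait for my turn"],
}

lexicon = defaultdict(list)
for level, replies in replies_by_level.items():
    total = len(replies)
    for phrase, count in Counter(replies).most_common():
        lexicon[level].append((phrase, round(100.0 * count / total, 1)))  # % frequency

print(dict(lexicon))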

4.4 Emotional detection by child–computer interaction

Results of a pilot study of emotion detection are summarized in Table 5 as follows.

Table 5 Emotional detection scores

Table 5 is an example of the verification of the emotion detection formula: facial expression recognition score × 55% + speech recognition score × 38% + drawing explanation score × 7%. The ECSYC emotional competency scores are 3.59, 2.98, 1.44, and 3.56 for child #1, child #2, child #3, and child #4, respectively. The ECSYC scores were assessed by teachers on the basis of long-term observation and knowledge of each child.

As for the accuracy of facial expression recognition scores, child #1 achieved (2/3.59) × 100% = 55.71%. Child #2, child #3, and child #4 achieved scores of 33.56%, 69.44%, and 28.09%, respectively. In sum, the average of the four children’s facial expression recognition accuracies is 46.7%.

For the accuracies of speech recognition scores and drawing explanation scores, child #1 achieved [(4 × 38% + 3 × 7%)/3.59]  × 100% = 78.83%. The accuracies of speech recognition scores and drawing explanation scores were 60.40%, 40.97%, 63.2% for child #2, child #3, and child #4, respectively. In sum, the average speech recognition accuracy of the four children was 60.85%.

In conclusion, the emotion detection accuracies achieved were 46.7% for facial expression recognition, 60.85% for speech recognition, and 78.73% for bimodal emotion recognition. The findings confirm that the YCEL is feasible for speech recognition. Bimodal emotion recognition improved accuracy by 32.03% (78.73% − 46.7%) and 17.88% (78.73% − 60.85%) compared with single-modal facial expression recognition and speech recognition, respectively.

5 Discussion

5.1 Emotional lexicon for children

The findings of this study support the emotional theater experiments, and the proposed approach offers a feasible and effective design for improving children's emotional ability. The results are consistent with those of related studies (Joseph and Strain 2003; Poventud et al. 2015). In the present study, the teacher told the children that the puppet was frustrated and asked them to pay attention to the story-telling and to what the teacher said to the puppet. A related study likewise indicated that using seven effective strategies to build children's vocabulary had a significant relationship with students' social-emotional vocabulary scores (Daunic 2015).

Emotion recognition combining facial expression and speech (Truong et al. 2007) has been applied in health monitoring and e-learning, and there is a growing need for agreed standards in automatic emotion recognition research. This study therefore developed five levels of assessment standards for each emotional competency and, based on big data, built a database of emotional development with an incrementally updated norm.

5.2 Correlation between YCEL and ECSYC

This study aimed to develop the YCEL, based on the standardized ECSYC, for autonomous emotion detection using a bimodal emotion recognition approach that combines speech recognition and facial expression recognition. In this design, five-level scoring standards are programmed based on the emotional competencies of the ECSYC. Using the emotional theater games, the bimodal emotion recognition analysis showed that the YCEL and the ECSYC are significantly correlated: Spearman's rho reached 0.406 (p = 0.026). The ECSYC is a standardized test whose reliability and validity study was published by Psychological Publishers in Taiwan, and it serves as the criterion supporting the YCEL as a standardized test with high reliability and validity.

5.3 Emotion detection with bimodal emotion recognition

The first purpose involved developing the criteria for each level of the 5-level rating scale. In this process, we developed 40 emotional games based on the 40 emotion competencies, with each question reflecting one competency. An example of the lexicon database is shown in Fig. 10.

The findings of the study parallel those of a related study of bimodal emotion recognition, which reported accuracy increases of 12.15% and 3.54% compared with single-modal speech recognition and facial expression recognition, respectively (Song et al. 2015). The proposed method and architecture outperform other previously proposed approaches.

For real-time facial emotion detection, the system design of this study applied an open-source approach: real-time image emotion recognition was executed by combining OpenCV with a trained CNN model. The training dataset used here is fer2013. The CNN performs multiple convolution, pooling, and fully connected operations to determine the expressed emotion.
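
A typical open-source loop of this kind looks roughly as follows; the pre-trained model file and its 48 × 48 grayscale input are assumptions common to fer2013-trained mini-Xception models, not the study's exact code.

# Illustrative real-time loop: OpenCV face detection feeding a fer2013-trained CNN.
import cv2
import numpy as np
from tensorflow.keras.models import load_model

EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
model = load_model("fer2013_mini_xception.h5")        # assumed pre-trained weights

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.3, 5):
        roi = cv2.resize(gray[y:y + h, x:x + w], (48, 48)) / 255.0
        probs = model.predict(roi.reshape(1, 48, 48, 1), verbose=0)[0]
        label = EMOTIONS[int(np.argmax(probs))]       # one label per sampled frame
        cv2.putText(frame, label, (x, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 255, 0), 2)
    cv2.imshow("emotion", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()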

At the beginning, when the system does not yet know where the characteristic features lie, it compares candidate features against all regions of the image. The mechanism used in this step is the convolution operation: the pixels of each fixed-size block are convolved with a feature detector, and the resulting degree of coincidence relates the block to the feature.

After repeated convolution and pooling, the system enters the fully connected stage. The pooled result matrix is flattened into one dimension, and its values effectively vote on the outcome; the class with the highest number of votes is the result of the current identification. Different values carry different degrees of discrimination; for example, some features better reflect the emotion of happiness. This is expressed as a weight or connection strength.

Related studies have developed robots with human-like emotional abilities, giving them the capacity to generate and express emotions (Mehrabian 2017). Based on the literature review, this study assumes that facial expression accounts for 55% of the emotional score, voice-to-text recognition for 38%, and language recognition for the remaining 7%. Correctly recognizing facial expressions is therefore critical: as long as a facial expression is recognized correctly, most of a child's emotional expression can be identified.

Other studies have pointed out that human expression involves the muscle movements of the face, and a specific expression can be assigned to each action unit according to the intensity of its contraction. By detecting facial expressions, we can therefore observe the subject's emotion. However, this kind of coding is very time consuming, so child–computer interaction must be used to speed up the facial expression recognition process (Delplanque 2017).

6 Conclusion

The ECSYC is a standardized emotion test with high reliability and validity that provides norms of emotional development for 4–6-year-old children in Taiwan; it is completed by the children's teachers. Is it possible to develop a standardized automatic emotion recognition system based on facial and speech recognition? It is a challenging task, and it relies on the emotional lexicon developed in this study.

To achieve this goal, the study first developed 40 child–computer emotional theater games based on the standardized ECSYC. The study has demonstrated that it is feasible to detect children's emotional development automatically through these child–computer emotional theater games. The contribution of this study is the completed YCEL, with which children's emotions can be detected automatically and the norm kept up to date. A follow-up study is in progress that uses deep learning to automatically discover emotionally relevant features, so that the average emotional competency score is calculated automatically and the norm is always kept current.