A method for evaluating the learning concentration in head-mounted virtual reality interaction

In education, learning concentration is closely related to the quality of learning, and teachers can adjust their teaching methods accordingly to improve the learning outcomes of students. Particularly in head-mounted virtual reality interactions, current methods for assessing learning concentration cannot be fully applied to new interactive environments because immersion shaping and cognitive formation differ from those in conventional education. Therefore, in this study, a learning concentration assessment method is proposed to measure the learning concentration of students in head-mounted virtual interaction, using the expression score, visual focus rate, and task mastery as evaluation indicators. In addition, the weights of the evaluation indicators used in the calculation of learning concentration can be configured according to the characteristics of different types of courses. The results of a usability evaluation indicate that the learning concentration of students can be effectively evaluated using the proposed method. After strategies for optimizing learning effects were developed and implemented, the learning concentration and assessment scores of students increased by 18% and 15.39%, respectively.


Introduction
Virtual reality (VR) technology can overcome the time and space limitations of conventional education. The immersive learning experience provided by this technology can promote learning motivation and situation cognition, and enhance the learning experience. Therefore, VR technology has been extensively applied in education to improve the quality of teaching in recent years (Kim et al. 2020; Sutjarittham et al. 2019; Tsai et al. 2020). Learning concentration is a crucial factor affecting the learning effect in conventional education, and it reflects students' degree of learning attention (Arana-Llanes et al. 2018; Castelló et al. 2020). Learning concentration likewise influences the learning effect in a virtual environment. If the learning concentration of students during the interactions in VR education can be effectively evaluated, their learning status can be adjusted and the learning effect can thus be improved. Accordingly, in this study, the learning concentration of students in head-mounted VR interaction was determined by analyzing the characteristics of VR education.
Significant progress has been made in detecting learning concentration in conventional classrooms. Guo et al. (2018) proposed a convolutional neural network (CNN)-based analysis method for evaluating the learning concentration suitable for conventional teaching. The learning concentration of students was quantified using micro-expressions as a quantitative index. In 2020, using the head-up rate and facial expression recognition (FER) results as evaluation indicators, Shi (2020) developed an analysis method to assess the learning concentration for conventional education. Although these methods have achieved better results in the concentration analysis of conventional education, they are not applicable to the evaluation of learning concentration in head-mounted VR environments. Notably, students' attention is usually focused on blackboards or instructors in conventional classrooms. Conversely, the students' attention changes as the focus of interest alters when interacting with head-mounted displays (HMD) in a VR environment, which means that it is no longer focused on a fixed area. In addition, the eyes, eyebrows, part of the nose, and other essential expression features of students are obscured by the HMD, resulting in a significant reduction in the accuracy of FER using conventional methods.
To address the unsuitability of these conventional methods, concentration analysis methods for VR interaction have been explored. Concentration analysis methods based on FER primarily use sensors to capture data (Sutjarittham et al. 2019; Tsai et al. 2020). Facial and eye data collected from electromyography (EMG) sensors were analyzed according to the electrical signals generated by facial muscle movement. However, in these methods, students have to wear various sensors, such as brain wave sensors, EMG sensors, and skin electrical sensors, which cause discomfort. In addition, because the sampling points of the sensors cannot cover the entire human body, FER has low precision (Shen et al. 2019).
Considering these problems, more CNN-based FER methods have been developed. Teng (2017) proposed an FER process in a virtual environment based on LeNet. Image data were used as the FER data source to avoid uncomfortable experiences caused by wearing EMG sensors. However, the FER network constructed using this method lacked sufficient training owing to the limited dataset. Thus, the average FER accuracy was 69.39%, which can be further improved.
To further enhance the FER rate in a VR environment, Wu (2019) developed an FER method based on face image reconstruction. The face image obscured by an HMD was reconstructed using generative adversarial networks, and visual geometry group 16 (VGG16) was used to improve the feature extraction effect. The evaluation results indicated that the FER rate for the CK+ and restored CK+ datasets reached 98.8% and 94.8%, respectively. Although the FER accuracy is higher in a virtual environment when the deflection angle is small, the method has inherent defects. First, a frontal unblocked picture of the user should be obtained in advance as a reference picture. Otherwise, the blocked face image cannot be accurately reconstructed, which affects the accuracy of FER. Second, owing to the high degree of freedom (DOF) of VR interaction, the image captured by the camera may contain a more extensive adjustment range of head posture, further reducing the accuracy of the FER method.
Among the methods based on interaction data, Yeh et al. (2020) constructed an automatic assessment system for attention deficit disorder, using the number of missed errors and reaction time as the main concentration evaluation indicators and also selecting focus time and total rotation angle as concentration evaluation metrics. By analyzing the interaction data, this system automatically determines whether the user has attention deficit hyperactivity disorder. However, the system primarily targets children suspected to have attention deficit hyperactivity disorder, and its task design is relatively simple, which makes the system somewhat less scalable.
In this study, we proposed a method for evaluating the learning concentration in head-mounted virtual reality interaction (VRLC). The VRLC has the following characteristics:
1. According to the high DOF characteristics of the operations in VR interactions and the diversity of virtual scenes, expression score, visual focus rate, and task mastery were set as comprehensive indicators for evaluating learning concentration. Moreover, the graded valence emotion can be set based on the characteristics of different types of VR education systems, and the corresponding expression weight can be assigned to calculate the learning concentration score of students. The optimization strategy was formulated by studying the user's interaction behavior pattern from an analysis of the learning concentration score to improve the interactive experience of the VR education system, which ultimately enhances the learning effect.
2. In a head-mounted VR environment, the recognition rate of existing FER methods is reduced owing to the larger adjustment range of head posture. Thus, by simplifying the attention mechanism (Abdullah et al. 2019; Maraza et al. 2020), we proposed an FER method suitable for head-mounted VR interaction (FERVR). By fusing the global and local features, the weights of the unobscured local areas were increased. Thus, the influence of HMD occlusion and a more extensive head posture on FER was reduced, and the reliability of the learning concentration score in VRLC was enhanced by improving the FER rate in a VR environment.

Method
As shown in Fig. 1, the VRLC process includes the calculation of learning concentration in head-mounted VR interaction and research on FER methods. The optimization strategy was formulated by studying the user's interaction behavior patterns derived from the analysis of the learning concentration and assessment scores to improve the interactive experience of VR education systems. In conventional education, facial expressions and head-up rate can be used as indicators for measuring learning concentration (Guo and Zhang 2019; Shi 2020). However, the evaluation of learning concentration when wearing an HMD for VR interaction differs from that in a conventional classroom. First, students' engagement and experience with VR interaction are more valuable in achieving knowledge acquisition and skill consolidation through practicing and reflecting. Second, because interactive objects in a virtual environment exist anywhere in a three-dimensional (3D) space, the student's visual focus is not limited to a specific area. The evaluation indicators for measuring learning concentration in a VR environment are shown in box ① using the blue line in Fig. 1; in this study, expression score, visual focus rate, and task mastery were proposed as comprehensive evaluation indicators for measuring the learning concentration.
To provide the FER results for the expression score from the first step, we simplified the attention mechanism according to the characteristics of occlusion and the larger adjustment range of head posture in a VR environment, and then proposed FERVR. Global and local features were fused using FERVR to recognize facial expressions. As a result, the influence caused by HMD occlusion was reduced and the robustness to the larger adjustment range of head posture was improved. As shown in Fig. 1 ②, because few expression datasets from the VR environment using HMD are available, we generated a new dataset suitable for head-mounted VR interaction by adding the HMD device mask to the eye position of the face images in the Radboud Faces Database (RaFD) (Langner et al. 2010). After the FERVR training was completed, the facial expression data during virtual learning were used as the input data for FERVR. Finally, the FER result was applied to the expression score calculation.
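The dataset generation step described above can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of pixel rows and approximates the HMD device mask with a filled horizontal band; the function name `add_hmd_mask`, the band fractions, and the fill value are hypothetical choices for illustration and are not taken from the original study, which overlays an actual HMD device mask at the eye position.

```python
def add_hmd_mask(image, top_frac=0.25, bottom_frac=0.55, fill=0):
    """Return a copy of a grayscale image with a horizontal band masked out.

    image: list of rows, each row a list of pixel intensities.
    top_frac, bottom_frac: assumed vertical extent of the occluded eye region,
    given as fractions of the image height (illustrative values only).
    """
    height = len(image)
    top = int(height * top_frac)
    bottom = int(height * bottom_frac)
    masked = [row[:] for row in image]  # copy so the source image is untouched
    for y in range(top, bottom):
        masked[y] = [fill] * len(masked[y])
    return masked


# Toy 20 x 8 "face" image with uniform intensity.
face = [[255] * 8 for _ in range(20)]
occluded = add_hmd_mask(face)
```

In practice the mask position would presumably be aligned to detected eye landmarks rather than to fixed fractions of the image height.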
The steps for analyzing the learning concentration are shown in Fig. 1 ③. After the learning concentration score has been derived by steps 1 and 2, the learning concentration of students combined with the assessment score of the VR education system can be analyzed to improve the learning effect. Further, psychological counseling and VR education system optimization were used for the development of learning effect optimization strategies. Students with a low concentration score were guided to actively participate in the experience of the VR education system using psychological counseling. According to the analysis results of the learning concentration, the shortcomings of the VR interaction design can be found. By formulating and executing VR education system optimization, interactive experiences and teaching quality can be improved.

Evaluation indicators for measuring the learning concentration
Learning concentration is a meta-construct that includes emotional, behavioral, and cognitive focus (Fredricks and McColskey 2012). Emotional focus is related to the students' emotional involvement during learning activities (Christenson and Reschly 2012). Positive emotions include enthusiasm, interest, and enjoyment while learning (Renninger and Hidi 2016), whereas negative emotional components include boredom, sadness, and frustration in the classroom (Skinner et al. 2008; Skinner 2016). Theories of motivation, including the self-determination and control-value theories of academic emotions (Deci and Ryan 1985; Pekrun and Linnenbrink-Garcia 2012), emphasize the role of both positive and negative emotions in the students' involvement in learning activities and underscore how affective dynamics can sustain or disrupt learners' [...]
Behavioral focus is the degree to which students are active in learning activities (Fredricks et al. 2004). This is reflected in the students' ability to effectively execute cognitive strategies and put action and effort into achieving learning goals (Sinatra et al. 2015; Alemdag and Cagiltay 2018). Moreover, behavioral focus is considered key to success (Sinatra et al. 2015).
Cognitive focus is the student's level of investment in learning (Meece et al. 1988; Parong and Mayer 2021). It includes being thoughtful, strategic, and willing to exert the necessary effort to comprehend complex ideas or master difficult skills (Fredricks et al. 2004). Cognitive focus measures are considered to have self-regulation and motivation components (Fredricks et al. 2004; Ainley 2012; Christenson et al. 2012), and they have been found to affect various positive outcomes, including motivation and learning achievements (Guthrie et al. 2004; Chi and Wylie 2014; Greene 2015).
In summary, emotional, behavioral, and cognitive focus are the three dimensions that can effectively reflect the learning concentration of students. Therefore, as shown in Fig. 2, we used the expression score, visual focus rate, and task mastery as the indicators to quantify the impact of these dimensions and used a set of formulas to calculate the learning concentration.
Emotional focus and expression score
Psychologist Mehrabian suggested that 7% of emotional information is conveyed by language, 38% by voice, and 55% by facial expressions. Therefore, facial expressions have an essential role in emotional expression.
The basic emotions within the valence-arousal-dominance model were constructed by Arya et al. (2021), as shown in Fig. 3a. They described emotions along three dimensions: valence, arousal, and dominance. Valence describes how negative or positive a feeling is. Arousal defines the strength of an emotion, that is, how intensely a person feels it, for example when excited. Dominance, the third dimension, is also related to the strength of an emotion and represents the degree of control generated by the stimulus, that is, whether a person feels controlling or submissive about something (Mitruţ et al. 2019; Arya et al. 2021). The most commonly used model is the circumplex model of affect spanned by the valence and arousal dimensions (Russell and Barrett 1999; Zangeneh Soroush et al. 2018), in which emotions are arranged in a two-dimensional circular space. As shown in Fig. 3b, the vertical axis represents arousal and the horizontal axis represents valence. In this model, the various types of expressions were classified as low-valence negative affect and high-valence positive affect, which were set to negative and positive values, respectively (Russell and Barrett 1999; Zangeneh Soroush et al. 2018). Guo and Zhang (2019) combined the 3D learning state space with the affective dimension theory and proposed an evaluation model of classroom attention. The model sets the weight values of expressions not related to the classroom (fear, expressionless) to 0, and the weight value of emotions shown when students are very dissatisfied with the classroom content (disgust, contempt, anger, sadness, and confusion) to −2. In addition, the emotions of happiness and surprise, which were considered to indicate satisfaction with the class content, were given a weight value of 2.
Because students showed no significant emotional expressions (neutral expressions) when they listened attentively, Shi (2020) believed that setting the weight value of neutral expressions to 1 can improve the accuracy of learning concentration assessment.
During the design of the VR education system, an explicit and carefully thought-out educational purpose is essential (Boutefara and Mahdaoui 2020). VR can stimulate students' motivation and interest in learning, and their evaluation of VR learning content can reflect whether the design of the VR education system is reasonable (Suhaimi et al. 2020; Tai et al. 2022). We set the expression score as the evaluation indicator of emotional focus to measure the students' interest in the VR learning content. In addition, considering the diversity of VR education scenarios, we classified the graded valence emotion in head-mounted VR interaction as high-valence positive affect, medium-valence neutral affect, and low-valence negative affect.
High-valence positive affect indicates a category of expressions that are consistent with the instructional purpose of the VR education system. The presence of this type of expression suggests that students were immersed in the virtual experience and they focused on the interactive content, resulting in higher expression scores.
Conversely, the expressions that designers of VR education systems do not expect to appear are set as low-valence negative affect. Expressing this type of affect indicates that students are not interested in the current virtual experience.
Medium-valence neutral affect is a state of emotional ambiguity between high-valence positive affect and low-valence negative affect, and emotions in this state cannot provide a clear valence (Guo and Zhang 2019; Shi 2020). Medium-valence neutral affect is not what the designers of VR education systems expect from students, and it does not accurately measure the level of interest in the virtual experience. In conventional classroom education, expressionless usually falls into the category of medium-valence neutral affect. However, in a head-mounted VR experience, expressions other than expressionless can be set as medium-valence neutral affect in specific situations. The configuration rules for expression weights are summarized in Table 1.
Referring to the measures of learning concentration developed by Shi (2020) and Guo and Zhang (2019), we propose a set of formulas for calculating the expression score. Each student's expression score ($f_k$) is calculated by multiplying the proportions of the various expressions by the corresponding expression weights. In Eq. (1), the eight categories of expressions (happiness, sadness, disgust, surprise, fear, anger, contempt, and expressionless) are represented by the expression number $i$ ($0 \le i \le 7$). The frequency of each expression is represented by $T_i$ ($0 \le i \le 7$), $N$ is the total number of expressions, and $C_i$ ($0 \le i \le 7$) represents the weight corresponding to each expression category.
To formulate the evaluation criteria for measuring expression score, the expression score of students is normalized by the polar linear method (Pedram et al. 2020). As shown in Eq. (2), by comparing the maximum and minimum expression scores, the expression score of each student is normalized to the range [0,1].
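Equations (1) and (2) are not reproduced in this text, so the following is a minimal sketch of one reading of the definitions above: the expression score is taken as the weighted sum of expression proportions, $\sum_i (T_i/N)\,C_i$, and the normalization is taken as min-max scaling over the students' scores. The function names, the index order of the expressions, and the example counts are assumptions. The example weights reuse the values reported above for Guo and Zhang (2019) and Shi (2020) purely for illustration; the weights actually configured per course follow Table 1.

```python
# Assumed index order i = 0..7, following the order in which the text lists them.
EXPRESSIONS = ["happiness", "sadness", "disgust", "surprise",
               "fear", "anger", "contempt", "expressionless"]

# Illustrative weights C_i only (not the configuration of Table 1).
EXAMPLE_WEIGHTS = [2, -2, -2, 2, 0, -2, -2, 1]


def expression_score(counts, weights):
    """Weighted sum of expression proportions: sum_i (T_i / N) * C_i."""
    total = sum(counts)  # N, the total number of recognized expressions
    if total == 0:
        raise ValueError("no expressions recorded")
    return sum((t / total) * c for t, c in zip(counts, weights))


def min_max_normalize(scores):
    """Scale scores to [0, 1] using the minimum and maximum over all students."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]  # degenerate case: all students equal
    return [(s - lo) / (hi - lo) for s in scores]


# Hypothetical expression counts T_i for three students.
counts = {
    "A": [6, 0, 0, 2, 0, 0, 0, 2],
    "B": [0, 5, 0, 0, 0, 0, 0, 5],
    "C": [2, 2, 0, 0, 2, 0, 0, 4],
}
raw = [expression_score(c, EXAMPLE_WEIGHTS) for c in counts.values()]
normalized = min_max_normalize(raw)
```

With these hypothetical counts, student A obtains the highest raw score and is mapped to 1, and student B obtains the lowest and is mapped to 0.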

Behavioral focus and visual focus rate
In conventional classroom teaching, the presentation area of knowledge points is fixed on a blackboard or a projection screen. Consequently, students' degree of learning concentration can be measured using indicators such as facial expressions and head-up rate. However, owing to the high DOF of VR interactive operation, the position of the student's visual focus constantly changes, making conventional metrics inapplicable for measuring learning concentration in HMD-wearing contexts. A visual focus data channel is another objective means of discerning cognitive engagement fluctuations during learning (D'Mello et al. 2017). By monitoring the visual focus, feedback can be provided on the learner's state in response to specific stimuli. When students' vision is steadily focused on a point, it indicates that they are taking more action to accomplish their learning goals (D'Mello et al. 2017). In addition, visual focus can reveal the decision-making process (Krejtz et al. 2016). Choice behavior and gaze allocation are related: people tend to look longer at the item they will choose than at the item they will reject, thereby generating a gaze bias effect (Thomas et al. 2019). Accordingly, the visual focus rate was set as one of the indicators for evaluating the degree of learning concentration.
In a virtual environment, explaining knowledge points using an avatar is a common form of interaction. During the explanation, the time that the students' sight stayed on the avatar or the presentation area was set as the focused learning time. The visual focus rate, which is the ratio of focused learning time to the total time spent on explaining knowledge, was set as another measure of learning concentration. The process of locating knowledge points is shown in Fig. 4. When constructing the VR interaction, a ray is emitted from the position of a student wearing the HMD in the forward direction to simulate the line of sight. The visual focus rate can then be measured by determining whether the ray falls within the knowledge presentation area. Figure 4a shows that the focused learning time starts counting when students observe the presentation area of knowledge points, indicating that they are in a state of focused learning. As shown in Fig. 4b, students are not in a focused learning state when their line of sight stays outside the presentation area of knowledge points; thus, the focused learning time stops counting.
The visual focus rate ($R_c$) is calculated as in Eq. (3):
$$R_c = \frac{T_c}{T_t} \tag{3}$$
where $T_c$ is the focused learning time and $T_t$ is the total knowledge explanation time.
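The ray-based measurement can be sketched as follows, under simplifying assumptions that are not part of the original system: the presentation area is modeled as an axis-aligned rectangle on the plane z = const, and head pose is sampled once per frame as an origin, a forward direction, and a frame duration. The names `ray_hits_area` and `visual_focus_rate` and the dictionary layout of the area are hypothetical; a game engine's built-in ray cast against the presentation object's collider would normally replace the manual intersection test.

```python
def ray_hits_area(origin, direction, area):
    """Return True if the forward ray intersects the rectangular presentation area.

    area: dict with keys z, x_min, x_max, y_min, y_max (assumed layout).
    """
    ox, oy, oz = origin
    dx, dy, dz = direction
    if dz == 0:
        return False  # ray is parallel to the presentation plane
    t = (area["z"] - oz) / dz
    if t <= 0:
        return False  # presentation plane lies behind the student
    x = ox + t * dx
    y = oy + t * dy
    return (area["x_min"] <= x <= area["x_max"]
            and area["y_min"] <= y <= area["y_max"])


def visual_focus_rate(samples, area):
    """Compute focused learning time divided by total explanation time.

    samples: iterable of (origin, direction, dt) tuples, one per frame.
    """
    focused = 0.0  # focused learning time
    total = 0.0    # total knowledge explanation time
    for origin, direction, dt in samples:
        total += dt
        if ray_hits_area(origin, direction, area):
            focused += dt
    return focused / total if total > 0 else 0.0


board = {"z": 5.0, "x_min": -1.0, "x_max": 1.0, "y_min": 0.0, "y_max": 2.0}
frames = [
    ((0.0, 1.0, 0.0), (0.0, 0.0, 1.0), 0.5),   # looking straight at the board
    ((0.0, 1.0, 0.0), (1.0, 0.0, 1.0), 0.5),   # looking far to the side
    ((0.0, 1.0, 0.0), (0.0, 0.0, -1.0), 0.5),  # looking away from the board
    ((0.0, 1.0, 0.0), (0.1, 0.0, 1.0), 0.5),   # slightly off-center, still on board
]
rate = visual_focus_rate(frames, board)
```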

Cognitive focus and task mastery
In a conventional classroom lecture-teaching setting, students' mastery of learning content affects the final learning outcome. It is difficult for instructors to ensure that all students master the skills being taught when the number of students is large and the course duration is limited. A common method for testing students' task mastery is through individual and group questioning. However, any approach that relies on the instructors' subjective judgment inevitably leads to deviations. Cognitive focus, which is a complex process of cognitive processing and information handling (Liu and Wang 2017), reflects students' mastery of learning content (Kim and Schatschneider 2017; Wong 2018). This requires students to activate prior knowledge and access memory, and then analyze, integrate, transfer, and create new knowledge, information, and problems (Liu and Wang 2017). The longer students spend on a knowledge point, the more difficult its cognitive processing is and the more cognitive effort they invest (Liu and Chuang 2011; Krejtz et al. 2016). Kruger and Doherty (2016) discovered that the time students spent on a reading task was significantly related to their degree of cognitive focus. Furthermore, the degree of cognitive focus can be predicted by the duration of a reading task (Kim and Schatschneider 2017; Wong 2018). The aforementioned knowledge construction and cognitive focus theories are also applicable to designing VR educational scenarios. The knowledge acquired in virtual scenarios is usually divided into multiple task prompts; following the guidance of these prompts, students deepen their cognitive understanding by comparing and analyzing such knowledge. We recorded interaction data related to students' reading of task prompts to assess their cognitive focus on the knowledge points in the virtual scenario.
The learning task reminders in a virtual environment appear in sequence according to a pre-planned trigger mechanism that continuously guides the student to learn stepwise. Additionally, every task prompt is presented through a combination of text, voice, and icons. Thus, students can confirm the task reminder according to their mastery of the knowledge points or re-watch it later. The time at which the student confirms the current task reminder after reading it is set as the reading completion time of the learning task, and task mastery is the ratio of this parameter to the total time of the learning task prompt. The task mastery ($R_m$) is calculated as in Eq. (4):
$$R_m = \frac{T_r}{T_g} \tag{4}$$
where $T_r$ is the reading completion time of the learning task, and $T_g$ is the total time of the learning task prompt. In summary, the expression score, visual focus rate, and task mastery are used as comprehensive evaluation indicators for learning concentration.
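As a small illustration of Eq. (4), the sketch below computes task mastery from a log of task prompts. The text does not state how several prompts are combined, so summing the reading completion times and the total prompt times over all prompts is an assumption, as are the function name `task_mastery` and the log format.

```python
def task_mastery(prompt_log):
    """Compute R_m = T_r / T_g from (reading_completion_time, prompt_total_time) pairs.

    Times are summed across prompts before the ratio is taken (assumed aggregation).
    """
    reading_time = sum(t_r for t_r, _ in prompt_log)  # T_r
    prompt_time = sum(t_g for _, t_g in prompt_log)   # T_g
    if prompt_time <= 0:
        raise ValueError("total prompt time must be positive")
    return reading_time / prompt_time


# Hypothetical log: two prompts, each displayed for 10 s in total.
log = [(4.0, 10.0), (6.0, 10.0)]
r_m = task_mastery(log)
```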

Learning concentration calculation
The calculation process of the learning concentration for head-mounted VR interaction is shown in Fig. 5. It mainly comprises three steps: facial expression score calculation, interaction data analysis, and learning concentration score calculation.
As shown in Eq. (5), the learning concentration score is weighted by the expression score, visual focus rate, and task mastery, which are the comprehensive evaluation indicators of virtual learning concentration. α, β, and γ are set as the weights for the above three indicators, respectively. Among them, the expression score reflects students' concentration on VR experience content, and the visual focus rate and task mastery reflect students' attention on the presentation area of knowledge points.
In this study, the analytic hierarchy process (AHP) was used to determine the weight distribution of the expression score, visual focus rate, and task mastery in calculating learning concentration (Shete et al. 2020). AHP is a decision-making method that decomposes elements related to decision-making into levels of decision goals, intermediate-level elements, and alternatives. On this basis, qualitative and quantitative analyses were performed. In this method, experts derive a comparative hierarchy by comparing the indicators pairwise according to the meaning of the weights; thus, it has high reliability and low error.
A total of 21 senior engineers and six interaction designers from the VR development departments of Netdragon and Huayu Education Technology were invited to participate in the expert review to evaluate the indicator weights of virtual learning concentration. In this expert scoring review, the relative importance of the expression score, visual focus rate, and task mastery dimensions was compared pairwise. The scale values were set as 1, 3, 5, 7, and 9, which indicate equal, slight, significant, vital, and extreme importance, respectively. The consistency ratio of the expert rating was 0.0516, which is less than 0.1; thus, the rating passed the consistency test.
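The AHP weighting and consistency test can be sketched as follows. The experts' pairwise comparison matrix is not given in the text, so the matrix below is purely hypothetical and does not reproduce the reported weights or the consistency ratio of 0.0516. The sketch uses the row geometric mean method to estimate the priority vector, which is one common way to compute AHP weights and not necessarily the procedure used in this study, together with Saaty's random index of 0.58 for a 3 × 3 matrix.

```python
import math

# Saaty's random consistency index for a 3 x 3 comparison matrix.
RANDOM_INDEX = {3: 0.58}


def ahp_weights(matrix):
    """Return (weights, consistency_ratio) for a reciprocal pairwise matrix."""
    n = len(matrix)
    # Row geometric means, normalized to sum to 1, estimate the priority vector.
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    weights = [g / total for g in geo_means]
    # Estimate the principal eigenvalue from A w = lambda_max * w.
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return weights, consistency_index / RANDOM_INDEX[n]


# Hypothetical judgments on the 1-3-5-7-9 scale, in the order
# (expression score, visual focus rate, task mastery).
pairwise = [
    [1.0, 3.0, 5.0],
    [1.0 / 3.0, 1.0, 3.0],
    [1.0 / 5.0, 1.0 / 3.0, 1.0],
]
weights, cr = ahp_weights(pairwise)
passes_consistency_test = cr < 0.1
```

When several experts are involved, their judgments are typically aggregated (for example, element-wise) before this step; how the 27 expert ratings were combined is not stated in the text.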
The final weights of the indicators calculated by the AHP were 0.4074, 0.3735, and 0.2191. Therefore, as shown in Eq. (6), the calculated weight values are substituted into α, β, and γ to obtain the formula for learning concentration.
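A minimal sketch of the weighted combination is given below, using the AHP weights reported above (0.4074, 0.3735, and 0.2191 for the expression score, visual focus rate, and task mastery, respectively). Equations (5) and (6) are not reproduced in this text, so the linear form is inferred from the description of a weighted score; the function name and the assumption that all three indicators are already scaled to [0, 1] are illustrative.

```python
# AHP weights reported in the text: alpha (expression score),
# beta (visual focus rate), gamma (task mastery).
ALPHA, BETA, GAMMA = 0.4074, 0.3735, 0.2191


def learning_concentration(expression_score, visual_focus_rate, task_mastery,
                           weights=(ALPHA, BETA, GAMMA)):
    """Weighted sum of the three indicators, each assumed to lie in [0, 1]."""
    indicators = (expression_score, visual_focus_rate, task_mastery)
    for value in indicators:
        if not 0.0 <= value <= 1.0:
            raise ValueError("indicators are expected to be scaled to [0, 1]")
    return sum(w * v for w, v in zip(weights, indicators))


# Hypothetical indicator values for one student.
score = learning_concentration(0.8, 0.5, 0.6)
```

Passing a different `weights` tuple corresponds to reconfiguring the indicator weights for another type of course.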
Based on the results of the weight values, the following conclusions can be drawn. First, experiential and instructional contents are essential components of VR education because it is a novel teaching practice of experiential learning. The expression score and visual focus rate were relatively more important for the assessment of learning concentration. In addition, although task mastery can assess the students' understanding of learning tasks, it is influenced by their learning and understanding abilities. Therefore, the weight value of task mastery was relatively low to reduce the impact of the subjective ability factors of students on the calculation of learning concentration.

Research of the FER method
The expression score was calculated from the results of FER and the expression weights. To address the larger adjustment range of head posture in head-mounted VR interaction, we proposed an FERVR framework to improve the accuracy of FER in a VR environment.

FERVR framework
In recent years, attention mechanisms have been applied to achieve better FER results in the presence of partially occluded faces (Abdullah et al. 2019; Maraza et al. 2020). The attention mechanism in FER draws on the human visual selective attention mechanism; that is, the human eye quickly scans global images to obtain the target region to be focused on (Jiao et al. 2021). More attentional resources are then invested in this focus region to obtain more detailed features, and useless information is suppressed. The attention mechanism in FER divides a global image into several local images and adaptively adjusts the region weights according to the degree of occlusion of each local image. In a VR environment, because the areas occluded by the HMD are relatively fixed, the local regions that should receive higher weights can be determined in advance. Although the HMD occludes important facial recognition features, such as the eyes, eyebrows, and part of the nose, the mouth area remains visible. If the mouth area is used as the main target area for FER, the influence of the HMD occlusion is suppressed. However, the images captured by the video devices have a large deflection angle when the rotation angle of the head is wide. Although FER can effectively reduce the influence of HMD occlusion by using local features, the recognition rate is reduced when the head deflection angle is wide. By contrast, the global image from the VR environment contains the HMD occlusion, the local FER feature region, and the overall pose feature. If the global and local features are fused, the robustness of the FER network in the presence of a wide head deflection angle can be improved while reducing the effect of HMD occlusion.
Accordingly, by simplifying the attention mechanism, we proposed the FERVR framework by fusing global and local features. An overview of the proposed framework is shown in Fig. 6. First, the face image of the HMD wearer was imported into the FERVR. Second, the input image was divided into a global area containing all facial information and an unobscured local area. Thereafter, both areas were imported into the feature extraction network to obtain global and local features. Finally, after the fusion features were normalized, the facial expression classification results were obtained.
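The data flow of the framework can be sketched as follows. This is only a structural illustration under stated assumptions: `toy_features` is a hypothetical placeholder standing in for the CNN feature extractor, the unobscured local area is taken as the rows below an assumed `local_start_row`, fusion is assumed to be concatenation, and L2 normalization is assumed for the normalization step. None of these choices is specified in the text.

```python
import math


def split_regions(image, local_start_row):
    """Return (global_area, local_area); the local area is the lower, unobscured part."""
    return image, image[local_start_row:]


def toy_features(region):
    """Placeholder for the CNN feature extractor: mean and maximum intensity."""
    pixels = [v for row in region for v in row]
    return [sum(pixels) / len(pixels), float(max(pixels))]


def fuse_and_normalize(global_feat, local_feat):
    """Concatenate global and local features, then apply L2 normalization."""
    fused = list(global_feat) + list(local_feat)
    norm = math.sqrt(sum(v * v for v in fused))
    return [v / norm for v in fused] if norm > 0 else fused


# Toy 10 x 4 image whose upper rows are dark (occluded) and lower rows are bright.
image = [[0] * 4 for _ in range(6)] + [[200] * 4 for _ in range(4)]
global_area, local_area = split_regions(image, local_start_row=6)
fused = fuse_and_normalize(toy_features(global_area), toy_features(local_area))
```

In the actual framework the fused features would then be passed to the classification stage to obtain one of the eight expression categories.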

Feature extraction
Extracting more effective expression features improves the FER rate; therefore, the design of the feature extraction network directly influences the final recognition accuracy. The low-dimensional information obtained using conventional feature extraction methods, such as supervised latent Dirichlet allocation (sLDA) (Rajan et al. 2019) and multi-support vector machine (multi-SVM) (Guo and Zhang 2019), has insufficient expressiveness, resulting in limited recognition capacity. By contrast, FER methods based on CNN, which extract a hierarchy of nonlinear facial features using multiple layers of convolution and pooling, can achieve higher accuracy on several facial expression benchmarks. Hence, feature extraction based on deep learning is better than conventional methods. Accordingly, to enhance the feature extraction effect in a VR environment and improve the accuracy of FER, we developed a novel feature extraction method by optimizing VGG16 (Simonyan and Zisserman 2014).
For feature extraction, a stack of two 3 × 3 convolutional (conv.) layers has an effective receptive field of 5 × 5, and a stack of three 3 × 3 conv. layers has an effective receptive field of 7 × 7. Therefore, using filters with a very small receptive field (3 × 3) reduces the number of parameters of the CNN and leads to better feature extraction results. In addition, too few conv. layers lead to poor performance of the extracted features, whereas too many conv. layers cause overfitting. Thus, following the conv. layer configuration of VGG16 (Simonyan and Zisserman 2014), the architecture of the model used for facial emotion feature extraction in FERVR is shown in Fig. 7; it includes five convolutional blocks and one fully connected (FC) block. First, 3 × 3 conv. layers were used in the convolutional blocks. Second, rectified linear units were used as activation functions after each convolution to enhance the nonlinear capability. The number of channels in the FC layers is relatively low because FERVR is an eight-class classification model. Through parameter tuning, the channels of FC1, FC2, FC3, and FC4 in the FC block were set to 1024, 512, 256, and 256, respectively. Decreasing the number of CNN parameters in this way reduced the fitting time of the network. A dropout layer with a 50% probability of deactivating neurons was introduced after FC3 and FC4 to prevent overfitting of the feature extraction network.
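The receptive field and parameter argument for stacked 3 × 3 convolutions can be checked with a short calculation. The sketch assumes stride-1 convolutions with the same number of input and output channels and ignores biases; the channel count of 64 is an arbitrary illustrative value, and the function names are hypothetical.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Effective receptive field of num_layers stacked stride-1 convolutions."""
    return num_layers * (kernel - 1) + 1


def conv_weight_count(kernel, channels, num_layers=1):
    """Weights in num_layers conv. layers with `channels` inputs and outputs (no bias)."""
    return num_layers * kernel * kernel * channels * channels


channels = 64  # illustrative channel count
rf_two = stacked_receptive_field(2)                       # two 3 x 3 layers
rf_three = stacked_receptive_field(3)                     # three 3 x 3 layers
stacked = conv_weight_count(3, channels, num_layers=3)    # three 3 x 3 layers
single = conv_weight_count(7, channels)                   # one 7 x 7 layer
```

Three stacked 3 × 3 layers cover the same 7 × 7 receptive field as a single 7 × 7 layer with 27C² instead of 49C² weights for C channels, while also applying three nonlinearities instead of one.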

System construction
To verify the usability of the VRLC, evaluation experiments were conducted in this study. Taking safety education as an example, a VR elevator safety education system (VRESE) was developed to capture the expression and interaction data required by the system usability evaluation.

Composition of the VRESE system
The composition of the VRESE is shown in Fig. 8, including the user interface (UI), virtual interaction, and data acquisition and analysis modules.

UI module
The UI module acts as a bridge for exchanging information between VRESE and students, including learning task prompts, height prompts, and remaining time. Learning task prompts provide instructions for learning tasks and interactive operations, which are sequentially triggered according to the operational progress of students. The height prompt appears after the user enters the virtual elevator scene, providing relevant hints for students to make self-help judgments based on the current height and enhancing the immersive experience. The remaining time is displayed when the elevator is about to fall, heightening the user's tension by showing a countdown to the crash.

Virtual interaction module
The virtual interaction module of VRESE is responsible for designing human-computer interaction logic to realize real-time interaction in the virtual environment, including operation guidance, teaching, and assessment modes. The operation guidance mode instructs learners to quickly master basic operations such as moving, turning, picking up, and using items in a virtual environment because proper operation guidance can reduce learning costs. In the teaching mode, the instructor's avatar explains the elevator safety knowledge and guides students to quickly grasp the correct emergency handling methods when the elevator falls. The assessment mode is used to evaluate the students' learning effects. In the assessment mode, interactive operation guidance and learning task reminders are no longer provided by VRESE, and students have to complete the assessment using the elevator safety knowledge they have learned.

Data acquisition and analysis module
Students' interaction data, including expression data, focused learning time, total time of knowledge explanation, reading completion time of learning task, total time of learning task prompt, and assessment score, are acquired in real time by VRESE. Expression data are acquired in real time by cameras during students' virtual interaction and used as input data for FERVR after data acquisition. In addition to facial expressions, channels such as language, voice, and context can be used to identify the students' learning emotions. However, emotion recognition based on multi-modal data will make the proposed computation of the learning concentration more complicated. Therefore, the algorithm design in this work temporarily does not consider the combined effect of the above factors and only calculates expression scores by collecting data on facial expressions.
In the teaching mode of VRESE, the avatar explains the elevator safety knowledge to demonstrate correct avoidance actions when the elevator falls. Therefore, VRESE uses a ray-based approach to simulate the attention and eyesight of students. An intersection of the ray with the region in which the avatar is located determines whether the focused learning time starts to be measured. Moreover, the total time of the learning task prompt is derived from the time when the student confirms the learning task prompt in VRESE. The student's score in the assessment mode is set as the assessment score, which is automatically calculated by VRESE according to the assessing standard.
By accumulating the duration of the avatar's skeletal animation and the duration of teaching audio files, the VRESE system counts the total duration of the knowledge explanation and learning task reminder interface after the VR experience is completed. Thereafter, the visual focus rate can be calculated according to the focused learning time and total time of knowledge explanation. Moreover, the reading completion time of learning task and total time of the learning task prompt were used to calculate task mastery.
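The ray-based measurement of focused learning time can be sketched as follows. This is an illustration under stated assumptions rather than the VRESE implementation: the avatar region is approximated by an axis-aligned box, the gaze ray is sampled at a fixed interval `dt`, and the visual focus rate is taken as the ratio of focused learning time to the total time of knowledge explanation. All function names and coordinates are hypothetical.

```python
def ray_hits_box(origin, direction, box_min, box_max):
    """Slab test: does the ray origin + t * direction (t >= 0) hit the axis-aligned box?"""
    t_near, t_far = 0.0, float("inf")
    for o, d, low, high in zip(origin, direction, box_min, box_max):
        if abs(d) < 1e-12:
            # Ray is parallel to this pair of planes: it must start between them.
            if o < low or o > high:
                return False
            continue
        t1, t2 = (low - o) / d, (high - o) / d
        if t1 > t2:
            t1, t2 = t2, t1
        t_near, t_far = max(t_near, t1), min(t_far, t2)
        if t_near > t_far:
            return False
    return True


def focused_learning_time(gaze_samples, box_min, box_max, dt):
    """Accumulate dt for every sampled gaze ray that intersects the avatar region."""
    return sum(dt for origin, direction in gaze_samples
               if ray_hits_box(origin, direction, box_min, box_max))


def visual_focus_rate(focused_time, total_explanation_time):
    """Assumed form: focused learning time divided by total explanation time."""
    if total_explanation_time <= 0:
        return 0.0
    return min(focused_time / total_explanation_time, 1.0)


# Hypothetical avatar region and head position (metres).
AVATAR_MIN, AVATAR_MAX = (-0.5, 0.0, 2.0), (0.5, 2.0, 3.0)
HEAD = (0.0, 1.6, 0.0)
samples = [
    (HEAD, (0.0, 0.0, 1.0)),   # looking at the avatar
    (HEAD, (0.0, 0.0, 1.0)),
    (HEAD, (0.0, 0.0, 1.0)),
    (HEAD, (1.0, 0.0, 0.0)),   # looking sideways
]
focused = focused_learning_time(samples, AVATAR_MIN, AVATAR_MAX, dt=0.5)
rate = visual_focus_rate(focused, total_explanation_time=2.0)
print(focused, rate)
```

In a game engine, the engine's own ray-cast query would normally replace `ray_hits_box`; the accumulation logic stays the same.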

Setting the facial expression weight
The characteristics of VR education systems must be analyzed in advance because the degree of concentration indicated by each type of facial expression differs across VR education systems. Subsequently, the weights of the various expressions of the participants in the virtual scenario are determined and used in the calculation of the learning concentration score. By simulating the scene of a malfunctioning elevator sliding down, VRESE enables students to master correct self-rescue skills during an emergency. The external and internal perspectives of the elevator are illustrated in Fig. 9a and b, respectively. As the relative displacement of the observation elevator and other buildings can visually enhance the tension of the participant during the interaction, the observation elevator was regarded as the main interactive scene for safety education training. The observation elevator rises to an altitude of 180 m, which is equivalent to the height of a 45-story building, and then simulates the process of stalling and falling. By creating a sense of falling and weightlessness through the friction between the car and rails, the VRESE system simulates a situation in which the participants' emotions fluctuate and tests their emergency performance.
Depending on the atmosphere created by the virtual scenario, expressions such as fear, surprise, sadness, and happiness represented the students' high-valence positive affect toward the current activity and were given a high weighting factor. By contrast, expressions such as disgust, anger, and contempt indicated that students were dissatisfied with the experiential content of the VRESE system and treated the interaction negatively. Thus, these expressions were categorized as low-valence negative affect toward VRESE. Expressions that cannot clearly reflect the level of concentration of students on the experience content were classified as medium-valence neutral affect.
In summary, the facial expression weights for VRESE are listed in Table 2, which gives the expression type, graded valence emotion, and weight of each expression. The weight configuration in Table 2 is only applicable to VRESE. The setting of expression weights in other VR education systems should consider the characteristics of the teaching theme and follow the expression weight configuration rules.
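A configurable weight table of this kind might be sketched as below. The valence grouping follows the description above, with expressionless assumed to be the medium-valence class. The numeric weights are hypothetical placeholders and are not the values of Table 2. Averaging the weights over the recognized frames is likewise an assumed aggregation, used here only to show how a different course could swap in its own table.

```python
from collections import Counter

# Valence grouping described for VRESE; "expressionless" as medium valence is an assumption.
VALENCE_OF = {
    "fear": "high", "surprise": "high", "sadness": "high", "happiness": "high",
    "expressionless": "medium",
    "disgust": "low", "anger": "low", "contempt": "low",
}

# Hypothetical weights for illustration only; they are NOT the values of Table 2.
WEIGHT_OF_VALENCE = {"high": 1.0, "medium": 0.5, "low": 0.0}


def expression_score(frame_labels, valence_of=VALENCE_OF, weights=WEIGHT_OF_VALENCE):
    """Frequency-weighted mean of the expression weights over all recognized frames."""
    if not frame_labels:
        return 0.0
    counts = Counter(frame_labels)
    total = sum(counts.values())
    return sum(weights[valence_of[label]] * n for label, n in counts.items()) / total


frames = ["fear", "fear", "surprise", "expressionless", "anger"]
score = expression_score(frames)
print(score)
```

Passing a different `valence_of` mapping or `weights` table adapts the same function to another teaching theme.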

Evaluation
To measure the usability and reliability of VRLC, we conducted a usability evaluation experiment using VRESE as an example. The usability evaluation process is illustrated in Fig. 10. First, we evaluated the expression recognition rate of FERVR to ensure the reliability of the expression data. Next, we set the assessing standard according to the learning content of the VR education system; the assessing standard of VRESE was formulated based on the importance of the elevator self-help steps. Then, the participants' expression data were captured by cameras, and the interaction data were collected by VRESE to analyze the learning concentration. After acquiring the data, the learning concentration score was calculated according to the learning concentration formula. If the overall assessment score meets the assessing standard, the evaluation experiment is completed. Otherwise, a corresponding learning effect optimization strategy is formulated, and another cycle of experiments is started until the assessing standard is met.
The valid range of FERVR expression recognition is between −90° and 90°. However, to avoid losing expression data when the head posture changes beyond this range, two image capture devices were deployed at the experimental site, which extended the range to between −180° and 180°. The positions of the cameras are shown as solid-line boxes ① and ② in Fig. 11, located in front of and behind the participants, respectively. The resolution of the image capture devices was 1920 × 1080 pixels, and the frame rate was 30 FPS.
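The two-camera arrangement can be read as a simple selection rule on the head yaw angle. The sketch below is an assumption about how such a rule could look, not the authors' code: yaw is measured relative to the front camera (0° means facing it), and the sign convention of the angle reported relative to the rear camera is chosen arbitrarily.

```python
def wrap_degrees(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def select_camera(head_yaw_deg):
    """Pick the camera whose valid range (-90 deg to 90 deg) contains the head yaw.

    Returns the camera name and the yaw relative to that camera.
    """
    yaw = wrap_degrees(head_yaw_deg)
    if -90.0 <= yaw <= 90.0:
        return "front", yaw
    return "back", wrap_degrees(yaw - 180.0)


for yaw in (30.0, 135.0, -170.0, 180.0):
    print(yaw, select_camera(yaw))
```

With this rule, every yaw in the full circle maps to a relative angle inside the range that a single camera can handle.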

Participants
A total of 103 valid participants, consisting of undergraduate and graduate students recruited from a technical university in China, were enrolled in this experiment. Informed consent was obtained from all voluntary participants prior to the start of the experiment. Basic information such as gender, age, and prior VR knowledge is summarized in Table 3. There were 55 men and 48 women, with a male to female ratio of 1.15:1, and 82 people aged 18-25, accounting for 79.61%. The remaining 20.39% were in the 26-30 age group. Participants with prior knowledge of VR interaction accounted for 44.66%.
The participants were divided into 12 groups, with no more than 10 participants in each group. Each participant was allowed to wear an HMD device and experience VRESE using two controllers. Expression and interaction data were collected to examine the usability of the VRLC throughout the participants' entire learning process. In addition, each participant was asked to use the VRESE for no more than 15 min, and it took 20 days to complete data collection.

Expression recognition rate of FERVR
Before conducting instance validation, the accuracy of FERVR must be checked because it determines the data reliability of VRLC. This check includes dataset preprocessing and network training.

Dataset preprocessing
Images collected from various sources cannot fully meet the experimental requirements owing to limitations in image size, color, orientation, or angle. Therefore, preprocessing operations, including face detection, segmentation, graying, and normalization, are necessary to eliminate the interference of non-expression features and make the expression image as clear and rich as possible.
Because published datasets for facial recognition in VR environments are relatively few, existing datasets must be processed to meet the requirements. As shown in Fig. 12a, RaFD (Langner et al. 2010), with 8040 facial expression images, contains 67 performers of different ages, genders, and skin tones showing eight types of expressions (happiness, sadness, disgust, surprise, fear, anger, contempt, and expressionless) in five types of postures (− 90°, − 45°, 0°, 45°, and 90°), which meets the training requirements for robustness to head poses. However, facial images from this dataset are not directly applicable to FER in VR scenarios because the influence of HMD occlusion on face recognition has not been considered. To solve this problem, an HMD occlusion mask was added to the images from RaFD during data preprocessing. In addition, the pitch angle of the head in the pictures from RaFD is 0°; therefore, we set a pitch angle of 0° as a qualifying condition for using FERVR. The dataset preprocessing steps are shown in Fig. 13, including face detection, HMD occlusion simulation, graying, and normalization.
(1) Step 1: By performing facial detection on the images from the dataset, the face region is preserved to remove irrelevant background information from the FER features.
(2) Step 2: As shown in Fig. 12b, the dataset suitable for VR interaction (SRaFD) is generated by adding HMD occlusion masks of different angles to the eye positions of the image from RaFD.
(3) Step 3: To reduce the system overhead, images from the SRaFD are grayed to improve the network training efficiency.
(4) Step 4: The gray images are normalized to obtain face images of the same scale.
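The four steps can be sketched as a small pipeline on plain nested lists. This is an illustration under several assumptions: the face and eye bounding boxes are supplied by some external detector (none is implemented here), the HMD mask is a flat rectangle rather than an angle-dependent shape, graying uses the ITU-R BT.601 luma weights, and "normalization" is interpreted as a nearest-neighbour resize to a common size plus intensity scaling to [0, 1]. All function names are hypothetical.

```python
def crop(image, top, bottom, left, right):
    """Step 1: keep only the face region (rows top..bottom-1, columns left..right-1)."""
    return [row[left:right] for row in image[top:bottom]]


def add_hmd_mask(image, top, bottom, left, right, fill=(0, 0, 0)):
    """Step 2: overwrite a rectangular eye region with a flat occlusion mask."""
    return [
        [fill if top <= y < bottom and left <= x < right else pixel
         for x, pixel in enumerate(row)]
        for y, row in enumerate(image)
    ]


def to_gray(image):
    """Step 3: convert RGB pixels to luma with the ITU-R BT.601 weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in image]


def normalize(gray, size, max_value=255.0):
    """Step 4: nearest-neighbour resize to size x size and scale intensities to [0, 1]."""
    height, width = len(gray), len(gray[0])
    return [
        [gray[y * height // size][x * width // size] / max_value for x in range(size)]
        for y in range(size)
    ]


def preprocess(image, face_box, eye_box, size):
    """Run the four steps in order; eye_box is given in face-crop coordinates."""
    face = crop(image, *face_box)
    masked = add_hmd_mask(face, *eye_box)
    return normalize(to_gray(masked), size)


# Toy 6 x 6 all-white RGB image with hypothetical face and eye boxes.
image = [[(255, 255, 255)] * 6 for _ in range(6)]
sample = preprocess(image, face_box=(1, 5, 1, 5), eye_box=(1, 3, 0, 4), size=2)
print(sample)
```

In practice an image library would replace the list operations; the order of the steps is the point of the sketch.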

Network training
Network training of FERVR was performed after dataset preprocessing. First, data augmentation by random cropping and filling with a small-scale occlusion mask was used to expand the number of images in the dataset. Thereafter, the dataset was randomly divided into training and testing datasets using a data-splitting routine written in Python to reduce the influence of human factors on dataset classification. Finally, the processed data were fed into the FERVR for training. Network initialization used default parameters; the learning rate was set to 0.001, the momentum was set to 0.9, and stochastic gradient descent was used as the optimizer. After all the parameters of this network were set, the FERVR was trained for 200 epochs. The expression recognition accuracy of FERVR should be evaluated in real time during network training. Moreover, the hyperparameters can be optimized iteratively based on the evaluation results to improve the recognition accuracy. One of the metrics used to evaluate network recognition effectiveness is the loss value (Checa and Bustillo 2020), which evaluates the degree of error between the output value of FERVR and the real value of the dataset. The lower the loss value, the better the robustness of this method.
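A seeded random split and the momentum update used by stochastic gradient descent can be written compactly. The sketch below uses only the hyperparameters reported above (learning rate 0.001, momentum 0.9, 200 epochs). The 80/20 split ratio, the seed, and the function names are assumptions, since the split proportion is not stated here, and the update rule shown is the common textbook form rather than any specific framework's implementation.

```python
import random

# Values reported in the text.
TRAIN_CONFIG = {"optimizer": "sgd", "learning_rate": 0.001, "momentum": 0.9, "epochs": 200}


def split_dataset(samples, train_fraction=0.8, seed=0):
    """Shuffle with a fixed seed and split; the 0.8 fraction is a hypothetical choice."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def sgd_momentum_step(weight, grad, velocity, lr, momentum):
    """One SGD-with-momentum update: v <- m * v - lr * g, then w <- w + v."""
    velocity = momentum * velocity - lr * grad
    return weight + velocity, velocity


train_set, test_set = split_dataset(range(100))
new_weight, new_velocity = sgd_momentum_step(
    weight=1.0, grad=2.0, velocity=0.0,
    lr=TRAIN_CONFIG["learning_rate"], momentum=TRAIN_CONFIG["momentum"],
)
print(len(train_set), len(test_set), new_weight, new_velocity)
```

Fixing the seed makes the split reproducible while still removing manual choices from the assignment of images to the two sets.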
The change curves of the loss value and FER accuracy during the FERVR training process are shown in Figs. 14 and 15, respectively. At the start of network training, the loss value was above 2.00. Thereafter, the loss value gradually decreased with the increase in epochs and stabilized at approximately 0.1 after 150 epochs. Conversely, the accuracy rate was only 0.2 in the early training period and increased gradually with the increase in epochs. When the number of epochs reached 150, the curve became stable. Moreover, the accuracy rate was high because the loss value was small.
The confusion matrix (Mohammed and Al-Ani 2020) allows the recognition performance to be analyzed for each label (in our case, each emotion). Using the confusion matrix, the recognition rate of each emotion can be analyzed, and the easiest and most difficult emotions to recognize can be identified. The diagonal of the confusion matrix represents the average accuracy of each class of expressions, and the remaining entries indicate confusion with other expressions. A comparison of the confusion matrices is presented in Fig. 16. Figure 16a and b shows the confusion matrices for the eight emotions of RaFD and SRaFD, respectively. The horizontal axis in Fig. 16 indicates the class predicted by FERVR, the vertical axis represents the true class in RaFD or SRaFD, and the color depth indicates the degree of FER accuracy.
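How the diagonal of a row-normalized confusion matrix yields the per-class accuracy can be shown on toy data. The labels and predictions below are invented for illustration and are unrelated to the results in Fig. 16; the function names are likewise hypothetical. Rows are true classes and columns are predicted classes, matching the axes described above.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """Count matrix with true classes as rows and predicted classes as columns."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, prediction in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[prediction]] += 1
    return matrix


def row_normalize(matrix):
    """Divide each row by its sum so that every row describes one true class."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


def per_class_accuracy(matrix, classes):
    """The diagonal of the row-normalized matrix is the accuracy of each class."""
    normalized = row_normalize(matrix)
    return {name: normalized[i][i] for i, name in enumerate(classes)}


# Toy data, invented for illustration only.
CLASSES = ["sadness", "disgust", "happiness"]
y_true = ["sadness"] * 4 + ["disgust"] * 2 + ["happiness"] * 2
y_pred = ["sadness", "sadness", "sadness", "disgust",
          "disgust", "disgust", "happiness", "happiness"]

counts = confusion_matrix(y_true, y_pred, CLASSES)
accuracy = per_class_accuracy(counts, CLASSES)
hardest = min(accuracy, key=accuracy.get)
print(counts, accuracy, hardest)
```

Off-diagonal entries in a row show which other expressions a class is mistaken for, which is how the sadness and disgust confusion is read from the figure.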
As shown in Fig. 16a, the accuracy rates of recognizing fear, disgust, anger, happiness, expressionless, and contempt were all above 0.92. The facial expression characteristics of sadness and disgust are occasionally similar, and 14% of sadness was classified as disgust; therefore, the accuracy of sadness was relatively low. Similarly, 5% of surprise was predicted as happiness; thus, the recognition rate of the surprise expression was 0.86. Accordingly, the average accuracy of FERVR for RaFD reached 0.9275. In addition, as shown in Fig. 16b, the accuracy rates of anger, happiness, expressionless, and contempt were above 0.90. In particular, the accuracy rate of expressionless reached 0.99, and those of fear, disgust, sadness, and surprise were approximately 0.85. In summary, the average accuracy of FERVR for SRaFD reached 0.9004, which is 20.65% higher than that of FER based on LeNet for VR interaction.
A comparison of Fig. 16a and b shows that the overall FER rate decreased by 2.71% in the VR interaction environment because the features of the eyes, eyebrows, and parts of the nose were obscured by the HMD. In particular, the recognition rates of fear and disgust decreased by approximately 10% because, without these key features, these expressions were confused with others. The probabilities of judging fear as sadness and surprise were 6% and 5%, respectively, and the probability of judging disgust as sadness was 9%.
The multi-angle expression recognition accuracy results for RaFD are summarized in Table 4. In multi-angle FER, 0° denotes no deflection; − 45° and − 90° (45° and 90°) denote 45° and 90° head deflection to the left (right), respectively. According to the table, the average accuracy of the eight types of expressions at 0° was above 0.95, indicating that FERVR has an ideal recognition effect for various facial expressions. Among them, the accuracy rates of anger, happiness, and expressionless reached 0.99, as the expression features of the frontal direction were the most abundant. When the head deflection angle reached 45°, most of the effective facial features could still be extracted, and the effect of FER decreased only slightly compared with the 0° head deflection angle. When the head deflection angle reached 90° and − 90°, the recognition accuracy of all types of expressions was significantly reduced owing to the sharp reduction in effective expression information. In particular, the recognition accuracy of fear, disgust, sadness, and surprise decreased by approximately 10%.
The results of expression recognition using deep learning methods include true positives, true negatives, false positives, and false negatives (Chicco et al. 2021). The elements that the algorithm correctly identifies as positive are called true positives, while the positive elements wrongly classified as negative are called false negatives. Conversely, the negative elements that are correctly labeled negative are called true negatives, while those wrongly predicted as positive are called false positives (Chicco et al. 2021). These recognition results influence the expression score as follows. First, a true positive or true negative result indicates that the recognition result is correct. Second, if the expressions involved in a false positive or false negative belong to the same graded valence emotion, they have no effect on the overall expression score. Finally, as shown in Table 5, if the expressions involved in a false positive or false negative do not belong to the same graded valence emotion, the change tendency of the expression score should be determined according to the graded valence emotions and the recognition results.
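The three cases above can be expressed as a lookup on the graded valence of the true and the predicted expression. The sketch below does not reproduce Table 5. It assumes only that the expression weight grows with valence grade (high > medium > low), and that expressionless is the medium-valence class. The function name and the tendency labels are illustrative.

```python
# Valence grades for VRESE; ranks are ordinal only (2 = high, 1 = medium, 0 = low).
VALENCE_RANK = {
    "fear": 2, "surprise": 2, "sadness": 2, "happiness": 2,
    "expressionless": 1,  # assumed to be the medium-valence class
    "disgust": 0, "anger": 0, "contempt": 0,
}


def score_change_tendency(true_expression, predicted_expression):
    """Direction in which a recognition result shifts the expression score."""
    difference = VALENCE_RANK[predicted_expression] - VALENCE_RANK[true_expression]
    if difference > 0:
        return "overestimated"   # error lands in a higher valence grade
    if difference < 0:
        return "underestimated"  # error lands in a lower valence grade
    return "unchanged"           # correct result or error within the same grade


for truth, prediction in [("fear", "fear"), ("fear", "sadness"),
                          ("disgust", "sadness"), ("fear", "disgust")]:
    print(truth, prediction, score_change_tendency(truth, prediction))
```

Under this reading, the frequent confusion of fear with sadness or surprise on SRaFD leaves the expression score unchanged, whereas disgust judged as sadness would raise it.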
The accuracy comparison of the different FER methods for RaFD is presented in Table 6. Specifically, the conventional FER methods of sLDA and multi-SVM have significant shortcomings in real-time performance, accuracy, and robustness. Many researchers have used CNN algorithms for FER (Liong   You et al. 2020), enabling computers to read the meanings expressed in face images more quickly and accurately. DenseNet121, ResNet50, VGG16, and VGG19 are FER methods based on deep learning that can overcome the shortcomings of conventional expression recognition methods. Accordingly, the deep CNN model can effectively extract features from data, which is beyond the capability of many conventional machine learning recognition algorithms.
However, if these methods are not improved according to the characteristics of the recognized objects, overfitting may occur because of the high number of convolutional layers in the methods. As presented in Table 6, the average accuracy of the FERVR for RaFD was 92.75%. Compared with sLDA and multi-SVM, the average FER rate of FERVR for this dataset improved by 29.45% and 26.62%, respectively. Compared with DenseNet121, ResNet50, and VGG16, the average accuracy of FERVR for RaFD was improved by 5.03% to 25.21%.

Assessing standard setting
The interaction design of VRESE referenced the self-rescue steps in the event of an elevator falling, as proposed by Shi et al. (2019). First, when an elevator falls, participants should quickly press as many floor buttons as possible to improve the possibility of preventing the elevator from falling. Second, the participant's back should be pressed against the elevator wall to protect the spine. Finally, they should cover their heads with both hands, bend their knees, and point their toes to the ground to relieve the impact of the fall.
The assessing standard of VRESE, which directly evaluates the learning effect of participants, is summarized in Table 7; the participant's interactive behavior was key to this assessment. First, depending on the importance of each self-rescue step, participants should press at least seven floor buttons. Second, the participants should stand in the safe area of the elevator. Additionally, at least one protective action must be taken. Accordingly, the full VRESE score was set to 100, and the approval threshold was 85 points.

Results analysis
The usability evaluation results of VRLC are presented in Fig. 17. The average learning concentration score was 0.63, and the average values for the expression score, visual focus rate, and task mastery were 0.69, 0.62, and 0.57, respectively. The expression score was higher than the visual focus rate and task mastery scores, indicating that participants were attracted by the experience content of VRESE; thus, high-valence positive affect appeared more frequently than low-valence negative affect. However, the low visual focus rate shows that the time participants spent focusing on the learning content in the teaching mode was insufficient. Therefore, some key points of knowledge were neglected. Similarly, the low degree of task mastery means that participants had an insufficient understanding of the VRESE learning task, leading to cognitive errors in the follow-up operation.
The distribution of the learning concentration and assessment scores is shown in Fig. 18. Thirty-four participants met the assessing standard, and their average learning concentration score reached 0.80. By contrast, the assessment scores of participants whose learning concentration scores were less than 0.6 were clearly low. However, although the learning concentration scores of some participants were high, their assessment scores were low. For instance, the participant with ID number 16 had a learning concentration score of 0.88 but an assessment score of only 65. This is inconsistent with the expectation that a higher learning concentration leads to a higher assessment score. We concluded that this participant was unable to complete the tasks in the assessment section because they did not fully understand the operation of the system. Owing to the overall low learning concentration, the average assessment score for the VRESE system was only 70.10. This indicates that the results of the first usability assessment did not meet the assessing standard. The few individual abnormal results arose because some participants had not fully mastered the operation method of the VRESE system. In summary, deploying optimization strategies based on the analysis of the learning concentration score can improve the learning effect and interaction experience. The following two learning effect optimization strategies were formulated by analyzing the results of the learning concentration evaluation:

Psychological counseling
For participants with low concentration scores, following cognitive comprehension and cognitive behavior therapies (Sarioglan 2020), we guided them to realize the importance of elevator safety self-help knowledge by playing elevator accident videos. As a result, the learning state of these participants was adjusted, and they actively engaged in the learning of elevator safety-related knowledge.

Optimization of the system interaction design
In response to the problems in interaction design, the scene realism and interaction mechanism of the VRESE system were comprehensively optimized as follows: (1) To alleviate dizziness, the movement mode in the virtual scene was changed from walking to curve blinking; (2) Provided that the picture remained smooth, rendering filters were added to the camera in the virtual scene to enhance the realism of the virtual scene; (3) A new scene that can be experienced multiple times was created for learning the system operation method to ensure that the basic interactive operations were mastered by participants. Furthermore, by reducing the size of the confirmation button on the task reminder prompt, the problem of the prompt being easily closed by incorrect manipulation was solved. Thus, the completion time of task reminder reading was counted more accurately; (4) When the avatar explained the knowledge points of elevator safety, a number of 3D arrows were added to guide participants to turn their attention to the presentation area of the knowledge points. Therefore, the visual focus rate increased.
Applying the same evaluation process as in the first test, the second usability evaluation experiment was conducted over 32 days using the learning effect optimization strategies. The results of the second evaluation experiment are presented in Fig. 19. Overall, the average values of the expression score, visual focus rate, and task mastery were 0.79, 0.83, and 0.85, respectively. The average expression score of 0.79 shows that the proportion of high-valence positive affect was significantly higher than that of low-valence negative affect. Participants were satisfied with the experience content and, consequently, were immersed in the elevator scene created by VRESE. The average visual focus rate of 0.83 indicates that 83% of participants' attention was focused on the presentation area of the knowledge points; thus, the focused learning time was high. The average task mastery score reached 0.85; therefore, the proportion of the learning task reminder prompts read by the participants reached 85%, and the correct self-help operation could be completed in the follow-up operation. The distribution of the learning concentration and assessment scores in the second experiment is shown in Fig. 20. Seventy-five participants met the assessing standard, and their average learning concentration score reached 0.86. Accordingly, as more attention was focused on the learning content of VRESE, the average learning concentration score of all participants reached 0.81. As a result, the average assessment score reached 85.49 and met the assessing standard of VRESE. Thus, the usability evaluation experiment was terminated.

Discussion
As mentioned earlier, the learning concentration and examination results of the participants improved significantly. A comparison of the two experiments is presented in Fig. 21a. Compared with the first experiment, the average expression score, visual focus rate, and task mastery in the second experiment increased by 10%, 21%, and 28%, respectively (absolute differences on the 0-1 scale). Thus, the learning concentration of the participants on the learning content of the VRESE was significantly improved by implementing the learning effect optimization strategy. The average learning concentration score in the second experiment was 0.81, an increase of 18% compared to that of the first experiment. The average assessment scores of the two experiments are shown in Fig. 21b. The average assessment score of the second experiment reached 85.49, which is 15.39% higher than that of the first experiment. Consequently, in head-mounted VR interaction, learning concentration is an essential factor for improving the learning effect. Moreover, formulating the corresponding optimization strategy according to the learning concentration score can effectively improve the participants' learning effects.
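The reported improvements can be reproduced from the averages of the two experiments. The short check below uses only the values stated in the text and shows that the quoted figures correspond to absolute differences between the two rounds (for example, 0.81 − 0.63 = 0.18), not to relative growth. The helper names are illustrative.

```python
# Average values reported for the first and second usability experiments.
FIRST = {"expression_score": 0.69, "visual_focus_rate": 0.62, "task_mastery": 0.57,
         "learning_concentration": 0.63, "assessment_score": 70.10}
SECOND = {"expression_score": 0.79, "visual_focus_rate": 0.83, "task_mastery": 0.85,
          "learning_concentration": 0.81, "assessment_score": 85.49}


def absolute_gain(before, after):
    """Difference between the two rounds, in the unit of the score itself."""
    return round(after - before, 4)


def relative_gain(before, after):
    """Growth relative to the first round, as a fraction."""
    return round((after - before) / before, 4)


absolute = {name: absolute_gain(FIRST[name], SECOND[name]) for name in FIRST}
relative = {name: relative_gain(FIRST[name], SECOND[name]) for name in FIRST}

for name in FIRST:
    print(f"{name}: absolute {absolute[name]}, relative {relative[name]}")
```

The absolute gains match the 10%, 21%, 28%, 18%, and 15.39 figures quoted above; the relative gains are larger because the first-round baselines are below 1 (or below 100 for the assessment score).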
In this study, we proposed a VRLC with expression score, visual focus rate, and task mastery as evaluation indicators to improve the learning effect and interactive experience of virtual interaction. The results of this study have reference value for VR education researchers and VR system developers.
For these researchers and developers, optimizing the interactive experience is a problem that needs to be addressed. VRLC can help to analyze students' learning behavior during VR interaction and provide objective data for the optimization of interactive experiences. First, the expression weights of VRLC can be adjusted according to the teaching content and characteristics of the VR education system to fit most VR education topics. In addition, the evaluation indicators of VRLC can quickly locate the deficiencies of the VR education system in the interaction design, so that optimization strategies can be formulated based on the analysis results of the learning concentration to improve the learning effect and interactive experience.

Conclusion
As head-up rate and FER results are used as evaluation metrics, conventional concentration analysis methods are not suitable for VR interaction. Solutions based on sensing devices and interaction data have been proposed to analyze learning concentration. However, these methods provide users with uncomfortable experiences. First, the recognition rate of the sensing-device-based method is relatively low. In addition, the method based on interaction data has specific evaluation populations and poor scalability of the evaluation indexes.
To promote the learning effect and interactive experience, we proposed a VRLC method to solve the above problems. Depending on the different characteristics of the types of VR education applications, the learning concentration scores of students for head-mounted VR interaction can be calculated by adjusting expression weights. The expression score, visual focus rate, and task mastery were used as the evaluation indicators for measuring the learning concentration. The expression score effectively evaluates the students' concentration in VR learning content. The proportion of focus on learning and degree of knowledge mastery were evaluated using the visual focus rate and task mastery, respectively. Therefore, the analysis result of virtual learning concentration provides a comprehensive and objective basis for formulating learning effect optimization strategies.
The results of the evaluation experiment showed that the learning concentration can be effectively estimated. By formulating and implementing the corresponding learning effect optimization strategy, the learning concentration of students increased by 18%. The substantial increase in learning concentration reflects that students can better immerse themselves in interactive scenes and focus on the learning content, and that knowledge from VRESE can be mastered. Accordingly, the average assessment score improved by 15.39%. The experimental results indicated that the learning concentration is an essential influencing factor for measuring the VR learning effect, and the learning effect and interactive experience can be effectively improved by formulating and implementing a corresponding optimization strategy.
To obtain the FER data required to calculate the learning concentration score, we proposed FERVR for head-mounted VR interaction. By simplifying the attention mechanism, FERVR reduces the influence of HMD occlusions on FER results. In addition, by fusing global and local features, robustness was improved in the presence of a larger adjustment range of head posture. The FER evaluation results indicated that the FER rate of FERVR for RaFD reached 92.75%. Compared to the conventional FER methods of sLDA and multi-SVM, FERVR achieved 29.45% and 26.62% higher average accuracies, respectively. Compared with other FER methods based on deep learning, including DenseNet121, ResNet50, and VGG16, FERVR achieved 5.03% to 25.21% higher average accuracy. In addition, the average accuracy of FERVR for SRaFD was 90.04%. Compared with the FER method based on LeNet for VR, the FER rate increased by 20.65%. Consequently, the data reliability of the VRLC was significantly enhanced.
Currently, VRLC should be further improved in three aspects. First, FERVR relies on local features that are not occluded by an HMD for recognizing facial expressions. When users bow their heads, their facial features may be completely occluded by the HMD. In this case, the expression cannot be accurately recognized by FERVR, and some of the students' expression data will be lost. Therefore, the effect of pitch angle on expression recognition will be addressed in the next step of our research. Second, the calculation of learning concentration also requires consideration of the joint effects of language, voice, and context in learning emotion recognition. The reliability of the calculation results can be effectively improved by constructing a multi-modal emotion recognition model. Finally, in the usability evaluation process, it was established that the learning concentration was also affected by some hardware factors. For instance, some students are unable to adapt to the dizziness caused by long-term virtual interaction, resulting in a decline in learning concentration. Accordingly, our next research goal is to determine whether hardware factors need to be included as an evaluation indicator for measuring learning concentration and to quantify the impact of hardware factors on concentration in virtual learning.