First-person Video Analysis for Evaluating Skill Level in the Humanitude Tender-Care Technique

In this paper, we describe a wearable first-person video (FPV) analysis system for evaluating the skill levels of caregivers. It is part of our project, which aims to quantify and analyze the tender-care technique known as Humanitude by using wearable sensing and AI technologies. Using our system, caregivers can evaluate and elevate their care skills by themselves. From FPVs of care sessions taken by wearable cameras worn by caregivers, we obtained the 3D facial distances, poses and eye contact states between caregivers and receivers by using facial landmark detection and deep neural network (DNN)-based eye contact detection. We applied statistical analysis to these features and developed algorithms that provide scores for tender-care skill. In experiments, we first evaluated the performance of our DNN-based eye contact detection by using eye contact datasets prepared from YouTube videos and FPVs of conversational scenes. We then performed skill evaluations using Humanitude training scenes involving three novice caregivers, two Humanitude experts and seven middle-level students. The results showed that our eye contact detection outperforms existing methods and that our skill evaluations can estimate care skill levels.


Introduction
As the elderly population increases, the number of people suffering from dementia continues to grow. As a result, the care that must be administered to them is becoming increasingly important in social terms [8,25,42]. The population of people in Japan afflicted with dementia is expected to exceed seven million by 2025. A more serious problem is the shortage of caregivers: the number of caregivers needed in 2025 is estimated to be 2.53 million, whereas the number actually available is estimated to be only 2.15 million [29].
Dementia occurs when the brain is damaged by maladies such as Alzheimer's disease and Lewy body dementia and produces a set of symptoms that include memory loss and difficulties with thinking, problem-solving, and verbal communication. Dementia can be accompanied by psychosis, agitation and aggression; thus, caring for people with dementia is quite difficult [7].
Two approaches can be cited as ways to alleviate the difficulties this poses for caregivers. The first is providing patients with customized treatment, which can slow the progression of dementia and prevent side effects such as infections. The second is reducing the burden on caregivers to prevent them from "burning out." Dementia can cause symptoms similar to those of mental illnesses, known as the behavioral and psychological symptoms of dementia (BPSD), so caregivers' working conditions can be harsh. As a result, the number of caregiving staff members who burn out and leave is increasing.
Humanitude tender-care style: Against this social background, the caregiving style Humanitude has been spotlighted by care professionals and family caregivers, since it can reduce the occurrence of BPSD events and caregivers' burden [20]. Humanitude was developed by Y. Gineste and R. Marescotti 35 years ago [18] and has been introduced in more than 600 hospitals and nursing homes in Europe. Humanitude primarily uses a combination of four communication skills: gaze, verbal communication, touch, and helping care receivers to stand up. Several studies have reported that the cost-efficiency of introducing Humanitude is around 20 times that of care without it, because of a 40% decrease in the use of psychotropic drugs and in the number of care staff members who leave [22]. In recent years, Humanitude has become popular in Japan; in the past three years, more than 2,600 people have taken Humanitude training over the course of more than 30 training sessions.
Computational tender-care science project: Since we believe that improved care techniques based on robotics and computer vision technologies are valuable to both caregivers and care receivers, we started a comprehensive project that aims to (a) quantify and visualize Humanitude skills, (b) reveal the brain mechanisms behind Humanitude-based communication and (c) develop a system that helps people learn Humanitude skills (Fig. 1). This paper addresses one topic of (a), skill quantification, but we briefly introduce all of the topics below.
For (a), we developed a system that automatically finds the skill elements by using wearable sensors that capture learners' and care receivers' behaviors. The system uses data mining and recognition algorithms developed in the fields of computer vision and machine learning. We then extracted the essence of the Humanitude skills through multi-modal analysis.
For (b), we tried to reveal why using Humanitude facilitates communication with people with dementia (PwD) and why it reduces BPSD through cognitive neuroscientific approaches. Namely, we conducted functional neuroimaging studies to find the differences between younger and elderly people, and between healthy people and people suffering from Alzheimer's disease, while presenting emotional stimuli such as facial pictures showing eye contact or dynamic facial expressions.
For (c), building on the findings of (a) and (b), we developed a tender-care education platform that presents caregivers' skill levels to learners by using our care-skill evaluation systems. With these systems, learners will be able to evaluate their current tender-care skill levels easily, at low cost and by themselves. It follows that this platform will be suitable for non-professional caregivers such as family caregivers, as well as for professional caregivers who want to periodically refresh their caregiving skills.

Wearable sensing technique for care behavior analysis:
Since Humanitude consists of communication behaviors performed at close distances, such as gaze and touch, we use wearable sensing devices to extract the events in which such behaviors occur. We observed care techniques that do and do not use Humanitude and found behavioral differences and their outcomes through statistical computational-analysis-based approaches, as illustrated in Fig. 2.
This paper describes the first step of this project: a system for extracting face-to-face communication behavioral skills by using a head-mounted camera worn by a caregiver. From the camera images, we obtained mutual facial distances, poses and eye contact states by using facial parts tracking and deep neural network (DNN)-based eye contact detection algorithms. We compared these behavioral elements among care novices, middle-level Humanitude-care learners and Humanitude-care experts. Although many attempts have been made to obtain the aforementioned information by using third-person view videos, those videos had to be analyzed by human annotators, which required considerable cost and time and risked producing biased results caused by annotator subjectivity [21]. To address these problems, the contributions and limitations of this paper are as follows:
1. It describes our development of a prototype system that uses wearable cameras and image analysis for care skill evaluation.
2. It describes our development of DNN-based eye contact detection algorithms that outperform existing approaches, evaluated on eye contact datasets built from YouTube videos and first-person videos (FPVs).
3. It describes how we obtained an FPV dataset during Humanitude training sessions involving novice, middle-level and expert caregivers and found differences among them regarding face-to-face distance and pose (angle) as well as eye contact frequency.
4. It describes how we performed unsupervised principal component analysis (PCA) on the features obtained from the FPVs and found a significant correlation between caregiver levels and PCA scores.
5. The present research represents a preliminary analysis that uses a relatively small number of datasets. It will thus be followed by extensive studies using a much larger number of samples. It will also be important to use chronological behavioral data for the same individuals to observe how the skills are acquired or forgotten.

Related Work
In this section, we review related studies regarding caregivers' burden and the effects of interventions, care skill evaluation, first-person video-based skill evaluation and eye contact detection from video images.

Caregivers' burden and effects of interventions:
The burden on caregivers of dementia patients has been reported in a large body of literature [2,10,12,15,16,23,33,41]. According to a recent meta-review [2], a larger caregiver burden is related to 1) female sex, 2) low education, 3) cohabitation with the care recipient, 4) caregiving time and effort, 5) financial stress, and 6) lack of choice and inability to continue regular employment. As a result, caregivers tend to be at greater risk of mortality, weight loss, poor self-care and sleep deprivation. The effects of interventions for reducing caregivers' burden have been reported as well [1,13,15,19,31,32]. Interventions are categorized into several types: psychoeducational intervention, psychosocial intervention, cognitive behavior therapy, respite, caregiver support groups, anticholinergic and antipsychotic drugs, and skill training. Accordingly, practical interventions to reduce caregivers' burden are 1) encouraging caregivers to function as members of the care team, 2) encouraging caregivers to improve self-care and maintain their health, 3) providing education and information, 4) coordinating assistance with care, 5) encouraging caregivers to access respite care and 6) using the support of technology [2]. Specifically, several reports indicate that skills training such as coping skill training (CST) may reduce strain, depression and fatigue in caregivers of patients with cancer [9,31].

Care skill quantification:
Several approaches to care skill quantification have been proposed. In computer science, Ishikawa et al. developed a method of care skill evaluation based on the knowledge of care experts [21]. They categorized care skills into three layers: intramodality, intermodality and multimodal interaction. Intramodality consists of behavior primitives such as gaze, speech, touch, nodding and knocking on a door. Intermodality captures the relationships among intramodalities, such as comprehensiveness of care, waiting for elderly people's actions and consistency. Multimodal interaction consists of actions that develop a relationship between actors, such as eye contact and verbal/nonverbal dialogue. They also developed a web interface that shows care learners their skills in visual form to confirm the effectiveness of the system.

First-person video analysis for skill evaluation:
There have been a number of studies on action recognition and prediction using FPVs [17,28,37,39,40]; however, few studies have addressed skill evaluation. In recent years, Bertasius et al. presented a method for assessing a basketball player's performance from FPVs. They designed a temporal CNN and long short-term memory (LSTM) architecture to evaluate whether a particular basketball play was good or not from the player's FPV [5].
In the medical field, Hei et al. proposed a method for evaluating skill in robotic surgical operations from video images. Their method tracks the keypoints of surgical robot instruments by using crowdsourcing or hourglass networks and evaluates skill by support vector machine analysis [26].
Video-based eye contact detection: Detecting and making eye contact are important for understanding social communication and designing communication robots; therefore, several studies in this area have been conducted. Smith et al. [38] proposed an algorithm to detect gaze-locking (looking at a camera) faces using eye appearances and PCA plus multiple discriminant analysis. Ye et al. developed a pioneering algorithm that detects mutual eye gaze using wearable glasses [43,45]. In recent years, deep-learning-based approaches have been applied to eye contact detection; Mitsuzumi et al. developed a DNN-based eye contact detection algorithm that uses eye images [48]. In robotics, Petric et al. developed an eye contact detection algorithm that uses facial images taken with a camera embedded in a robot's eyes [34] to develop robot-assisted ASD-diagnosis systems. These eye contact detection algorithms depend on facial landmark detection libraries or gaze estimation algorithms that assume subjects' faces are not occluded. Image-based gaze estimation algorithms have also been studied recently, although they differ in scope from our detection algorithms. The current trend in this area is deep-learning-based approaches, namely, learning and predicting gaze directions from datasets that describe the relation between facial images, facial landmarks and gaze points. For example, Lu et al. developed a head-pose-free gaze estimation method by synthesizing eye images from small samples; however, their method requires person-dependent eye image samples taken under experimental setups [27]. Zhang et al. proposed a DNN algorithm that inputs eye images and 3D head poses obtained from facial landmark points [46]. They also developed a DNN-based algorithm using full facial images without occlusions [47]. Krafka et al. proposed a gaze estimation algorithm that inputs full facial images as well as eye images [24]. In contrast, our detection algorithms output only binary (eye contact/averted) information. However, they do not require person-dependent samples and are robust to the facial occlusions that frequently occur in FPVs of caregiving and communication scenes. We achieved this by designing a CNN that uses only images taken around the eyes.

Fig. 5: a The existing eye contact detection algorithm (DeepEC), in which the eye contact state is obtained by a single CNN that inputs only the eye images. b The proposed temporal eye contact detection algorithms, which use multiple (N) image frames. Facial landmarks are first detected with OpenFace, from which the eye regions in each of the N frames are obtained. The resulting N pairs of eye images are input to CNNs with a structure similar to that of DeepEC, followed by an LSTM network that learns the temporal state of the eyes. The target eye contact state is finally obtained by fully connected layers that use not only the LSTM's outputs but also the CNN's outputs of the target frame (t = T) via skip connections.

Proposed Method
The flow of our skill evaluation is illustrated in Fig. 3. From a first-person camera worn by a caregiver, we obtained mutual facial distances, mutual facial poses and eye contact states. We then estimated tender-care skill scores through an unsupervised analysis. In the following subsections, we first describe the first-person camera hardware and then the algorithms we used for analyzing the FPVs.

Hardware
We used two types of head-mounted first-person camera systems. One was a Pivothead Kudu camera [35], which is equipped with a front-view camera in the middle of a pair of glasses. The camera takes full HD (1920 × 1080 pixels) videos at 30 fps. The other was a Pupil Labs camera system [36], whose frontal camera also takes full HD (1920 × 1080 pixels) videos at 30 fps. The cameras' projection matrices were obtained by using the MATLAB camera calibrator.
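For readers who prefer an open-source route, an equivalent intrinsic calibration can be sketched in Python with OpenCV; the checkerboard geometry, square size and image folder below are hypothetical and not those used in the study, which used the MATLAB camera calibrator.

```python
# Sketch of intrinsic calibration equivalent to the MATLAB camera calibrator,
# assuming checkerboard images captured with the head-mounted camera.
import glob
import cv2
import numpy as np

BOARD = (9, 6)          # inner corners of the checkerboard (assumed)
SQUARE_MM = 25.0        # square size in millimeters (assumed)

# 3D coordinates of the board corners in the board frame
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE_MM

obj_pts, img_pts = [], []
for path in glob.glob("calib/*.png"):   # hypothetical image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, BOARD)
    if found:
        obj_pts.append(objp)
        img_pts.append(corners)

# K holds the focal lengths (fx, fy) and principal point that are later
# needed for 3D facial position estimation.
rms, K, dist, _, _ = cv2.calibrateCamera(
    obj_pts, img_pts, gray.shape[::-1], None, None)
print("RMS reprojection error:", rms)
print("Intrinsic matrix:\n", K)
```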

Face Detection and 3D Pose Estimation
We obtained facial positions, poses and eye locations from the input FPVs by using the OpenFace library [4], which provides 3D facial positions, poses and 68 facial landmark points. We computed the cameras' focal lengths from the camera projection matrices and used them to estimate the 3D facial positions and rotations.
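As a concrete illustration, the per-frame pose features can be pulled from the CSV written by OpenFace's FeatureExtraction tool. This is a minimal sketch assuming OpenFace 2.x column names (pose_Tx ... pose_Rz); the file name is hypothetical, and the paper itself does not specify how OpenFace was invoked.

```python
# Sketch: extract mutual facial distance and pose angles per frame from an
# OpenFace FeatureExtraction CSV (column names follow OpenFace 2.x).
import numpy as np
import pandas as pd

df = pd.read_csv("care_session.csv")        # hypothetical output file
df.columns = df.columns.str.strip()         # OpenFace pads column names

ok = df["success"] == 1                     # keep frames with successful tracking

# Head translation (mm, camera coordinates) -> face-to-face distance
t = df.loc[ok, ["pose_Tx", "pose_Ty", "pose_Tz"]].to_numpy()
distance_mm = np.linalg.norm(t, axis=1)

# Head rotation (radians) -> degrees for the rx/ry/rz histograms
angles_deg = np.degrees(
    df.loc[ok, ["pose_Rx", "pose_Ry", "pose_Rz"]].to_numpy())
```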

Histograms of Facial Distances and Poses
To quantify the face-to-face communication behaviors between caregivers and care receivers, we encoded the mutual facial distances and poses obtained from OpenFace as illustrated in Fig. 4.
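A minimal sketch of this encoding is given below; it continues from the arrays in the previous sketch, and the bin counts and value ranges are illustrative assumptions, since the paper does not state them.

```python
# Sketch: encode per-frame distance and pose angles as normalized histograms.
# Bin counts and ranges are illustrative; the paper does not specify them.
import numpy as np

def normalized_hist(values, bins, value_range):
    h, _ = np.histogram(values, bins=bins, range=value_range)
    return h / max(h.sum(), 1)           # sum-to-one normalization

f_dist = normalized_hist(distance_mm, bins=20, value_range=(0, 1000))    # mm
f_rx   = normalized_hist(angles_deg[:, 0], bins=18, value_range=(-90, 90))
f_ry   = normalized_hist(angles_deg[:, 1], bins=18, value_range=(-90, 90))
f_rz   = normalized_hist(angles_deg[:, 2], bins=18, value_range=(-90, 90))
```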

Visualization
After obtaining the histograms, we normalized them and applied principal component analysis (PCA). While many data analysis and machine learning techniques have been proposed, we used PCA for its simplicity and reliability in exploratory data analysis (EDA). Since we tried to find tender-care technique skills in a bottom-up (data-driven) manner, this nature of PCA fitted our task better than more complicated methods such as nonlinear or supervised-learning-based approaches.
We here denote f_s as a D × 1 column vector that represents a normalized histogram of either f_dist, f_rx, f_ry or f_rz. We plot the PCA scores of all subjects to visualize the distribution of their behaviors, and we analyze the elements of the eigenvectors to find the relation between skill levels and behavioral features.
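With this notation, the analysis reduces to a standard PCA over a subject-by-bin matrix. A minimal sketch with scikit-learn follows; the stacking of per-subject histograms is an assumed preprocessing step.

```python
# Sketch: PCA over subjects' normalized histograms (one row per subject).
import numpy as np
from sklearn.decomposition import PCA

# X: S x D matrix; row s is one subject's normalized histogram f_s
# (e.g., f_dist stacked over all caregivers; hypothetical input list).
histograms_per_subject = [np.random.dirichlet(np.ones(20)) for _ in range(12)]
X = np.stack(histograms_per_subject)

pca = PCA(n_components=2)
scores = pca.fit_transform(X)          # per-subject scores (cf. Fig. 9, right)
loadings = pca.components_             # eigenvector elements to inspect
explained = pca.explained_variance_ratio_
```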

Eye Contact Features
Another feature counts eye contact bids, a concept introduced by Ye et al. [44]. "Eye contact bids" are situations in which a subject wearing an FPV camera is gazed at by another subject. Since eye contact is defined as mutual gaze (two people looking at each other at the same time), eye contact bids are not the same as actual eye contact. To accurately detect eye contact, both people must wear FPV cameras, or the caregiver must use an eye gaze tracker (EGT) device that captures the observer's gaze. From a practical point of view, however, it is difficult to use two FPV cameras or an EGT device because 1) it is difficult for subjects with dementia to wear such devices and 2) even for caregivers, eye trackers are difficult to use in actual care scenes due to their noticeable appearance, calibration requirements and head-mount drift. Therefore, rather than accurately detecting eye contact, we measured and used eye contact bids for evaluating care skills. We used facial poses and eye images for detecting eye contact bids using DNNs. Figure 5 illustrates the existing eye contact detection algorithm (Fig. 5a) and the proposed TempEC and TempEC-HP (Fig. 5b). TempEC uses only eye images, whereas TempEC-HP uses both eye images and 3D facial poses. These algorithms consist of the following components.
Eye region detection: From the landmark points detected by OpenFace, we obtained the right and left eye regions in the target frame, from which we extracted each eye image used as input for the CNN after gray-scaling and normalizing with global contrast normalization (GCN). Using the landmarks as a basis, we obtained the coordinates of the four corner points that determine the eye region, applying a 10% margin to the height and width of the region so that small facial landmark detection errors can be tolerated.
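The following sketch shows one way to implement this cropping and normalization step; the helper name and the use of OpenCV are our assumptions, while the 10% margin, gray-scaling, GCN and the 60 × 36 output size follow the text.

```python
# Sketch: crop an eye region from landmark points with a 10% margin,
# then apply gray-scaling and global contrast normalization (GCN).
import cv2
import numpy as np

def crop_eye(frame, eye_landmarks, margin=0.10, out_size=(60, 36)):
    """eye_landmarks: (N, 2) array of one eye's landmark coordinates."""
    x0, y0 = eye_landmarks.min(axis=0)
    x1, y1 = eye_landmarks.max(axis=0)
    w, h = x1 - x0, y1 - y0
    # Expand the box by the margin to tolerate small landmark errors.
    x0, x1 = int(x0 - margin * w), int(x1 + margin * w)
    y0, y1 = int(y0 - margin * h), int(y1 + margin * h)
    crop = frame[max(y0, 0):y1, max(x0, 0):x1]
    gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY).astype(np.float32)
    gray = cv2.resize(gray, out_size)    # 60 x 36 pixels, as in the text
    # Global contrast normalization: zero mean, unit variance per image.
    return (gray - gray.mean()) / (gray.std() + 1e-8)
```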
Deep temporal eye contact detection: Given the images of both eye regions and the 3D facial pose, we implemented our two deep temporal eye contact detection algorithms, as shown in Fig. 5b. The algorithms use ten continuous video frames (the target frame and nine preceding frames) to make predictions. As shown in Fig. 5b, each eye image pair I_R^t, I_L^t was input to the CNN. This CNN had the same structure as DeepEC with the exception of the last two fully connected layers; namely, it had two streams and six layers (two convolution layers followed by four max-pooling layers). The CNN outputs a pair of 512-dimensional feature vectors, f_R(I_R^t) and f_L(I_L^t), one for each eye image. These feature vectors were input to two separate LSTM networks for the left and right eye images. In the TempEC algorithm, each LSTM accepts 10 vectors corresponding to a series of eye images and outputs one 512-dimensional feature vector. In the TempEC-HP algorithm, a series of 3D vectors representing the 3D head poses is additionally input to the LSTMs.
However, we found that a naïve LSTM could not perform satisfactorily. To solve this problem, we prepared fully connected layers with 2048 (512 × 4) units at the last frame, which accepted the outputs of the left and right DeepEC-style CNNs and the LSTMs' cell state vectors. Because the DeepEC results for the current frame are directly used for eye contact detection, and because the temporal inference is also merged into the fully connected layers, we ultimately obtained better results than with naïve implementations of DeepEC and LSTM.
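A sketch of the TempEC network in Keras is shown below. The 512-dimensional per-eye features, the 10-frame window, the 2048-unit merge and the skip connection follow the text; the convolution kernel sizes and filter counts are assumptions, and for simplicity the sketch concatenates the LSTM outputs rather than their cell states. For TempEC-HP, the per-frame 3D head poses would additionally be concatenated to the LSTM inputs.

```python
# Sketch of TempEC (assumed layer details; sizes follow the text).
import tensorflow as tf
from tensorflow.keras import layers, models

N, H, W = 10, 36, 60                     # frames per sample, eye image size

def eye_cnn():
    """DeepEC-like per-frame encoder producing a 512-d feature."""
    inp = layers.Input((H, W, 1))
    x = inp
    for filters in (32, 64):             # conv blocks (assumed sizes)
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.LeakyReLU(0.01)(x)
        x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512)(x)
    x = layers.LeakyReLU(0.01)(x)
    return models.Model(inp, x)

right_seq = layers.Input((N, H, W, 1), name="right_eye")
left_seq = layers.Input((N, H, W, 1), name="left_eye")

cnn_r, cnn_l = eye_cnn(), eye_cnn()      # separate streams per eye
feat_r = layers.TimeDistributed(cnn_r)(right_seq)   # (N, 512) per sample
feat_l = layers.TimeDistributed(cnn_l)(left_seq)

lstm_r = layers.LSTM(512)(feat_r)        # temporal summary per eye
lstm_l = layers.LSTM(512)(feat_l)

# Skip connection: the target frame's (t = T) CNN features bypass the LSTMs.
last_r = layers.Lambda(lambda z: z[:, -1])(feat_r)
last_l = layers.Lambda(lambda z: z[:, -1])(feat_l)

merged = layers.Concatenate()([lstm_r, lstm_l, last_r, last_l])  # 2048 units
x = layers.Dense(512)(merged)
x = layers.LeakyReLU(0.01)(x)
x = layers.Dropout(0.5)(x)
out = layers.Dense(1, activation="sigmoid", name="eye_contact")(x)

temp_ec = models.Model([right_seq, left_seq], out)
```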

Datasets
We prepared two datasets for learning and evaluating the algorithms. The first was an eye contact video dataset, which we prepared by using publicly available videos from YouTube and our original FPVs of conversational scenarios. The second was obtained in care learning scenes: we recorded FPVs from cameras worn by caregivers during Humanitude care teaching classes.

First-person Eye Contact Video Dataset
The first-person eye contact video dataset was used for evaluating eye contact detection performance. Ground-truth eye contact states (1 or 0) were provided for every video frame in the dataset. We asked three people to annotate the eye contact states and set the eye contact status of a frame to 1 if at least two annotators judged that eye contact was engaged in that frame.
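This majority vote can be written in a couple of lines; the annotation array below is placeholder data.

```python
# Sketch: frame-wise ground truth by majority vote of three annotators.
import numpy as np

# annotations: (3, num_frames) array of 0/1 labels, one row per annotator
annotations = np.random.randint(0, 2, size=(3, 1000))      # placeholder data
ground_truth = (annotations.sum(axis=0) >= 2).astype(int)  # >= 2 of 3 agree
```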

a) First-person eye contact videos from YouTube (YouTube dataset)

We used 13 YouTube videos in which a person talks into the camera. We took a consensus of the annotations and made ground-truth data. A list of the videos and their properties is shown in Table 1, and example frames are shown in Fig. 6.

b) First-person eye contact video dataset during conversation (Conversation dataset)
We additionally prepared first-person-view videos of two individuals conversing. The videos were recorded in a lab environment in which two participants talked, with one of them wearing a Pivothead Kudu first-person camera. A list of the videos and their properties is shown in Table 2 and Fig. 6. We took three video clips from six participants and two test-video clips from two participants.

FPVs of Care-learning Scenes
To verify the applicability of the proposed algorithms, we prepared first-person videos of a) two Humanitude care experts (instructors), b) seven middle-level Humanitude caregivers and c) three novice Humanitude caregivers, as shown in Table 5 and Fig. 7. In all videos, the caregivers wore the Pupil Labs first-person camera and performed the same task: (Step 1) approach the simulated patient while making eye contact, (Step 2) perform the care and (Step 3) leave the care receiver.

Experiments
In the first of the two experiments we performed, we evaluated the performance of the eye contact (bid) detection algorithms, comparing the two proposed approaches with an existing approach (DeepEC [30]). The second experiment was performed on actual Humanitude care training scenes: we obtained data from novice, middle-level and expert caregivers and compared the results through the use of an unsupervised learning algorithm.

Experiment 1: Evaluation of eye contact detection performance
As mentioned, we first conducted an experiment to compare the performance of the proposed algorithms with that of an existing algorithm by using the datasets. One video was chosen for testing and the others were used for training. We iterated this procedure over the 16 videos and obtained the average performance (i.e., leave-one-video-out cross-validation).
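This leave-one-video-out protocol is sketched below; the train-and-evaluate function is a hypothetical placeholder for the actual training loop.

```python
# Sketch: leave-one-video-out evaluation over the 16 videos.
import numpy as np

def train_and_evaluate(train_ids, test_id):
    """Hypothetical placeholder: train on train_ids, return a score on test_id."""
    return 0.0

results = []
for test_id in range(16):
    train_ids = [v for v in range(16) if v != test_id]
    results.append(train_and_evaluate(train_ids, test_id))
print("average performance:", np.mean(results))
```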
The training of the DeepEC, TempEC and TempEC-HP networks was conducted as follows. We first computed the bounding rectangles of the eyes using the facial landmark points obtained by OpenFace. The obtained eye images were then rescaled to 60 × 36 pixels. We used static CNN hyper-parameters for all of the experiments. Specifically, the drop-out rate was 0.5 and the Leaky ReLU activation function's α was set to 0.01. We used a Nadam optimizer [14] with the learning rate set to 0.001, the decay to 0.004 per epoch, and β1 and β2 to 0.9 and 0.999, respectively. Learning was performed on a GPU-based workstation (Intel Core i7-7800X CPU 3.50 GHz, 128 GB RAM, NVIDIA GeForce 1080Ti-11G). The results are given in Table 3, and the t-test results among the four algorithms are given in Table 4.
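Continuing the TempEC sketch above, the reported optimizer settings can be written in Keras as follows; only the Nadam hyper-parameters, the per-epoch decay, the drop-out rate and the LeakyReLU α come from the text, while the loss choice is our assumption for 0/1 labels.

```python
# Sketch: training configuration as reported in the text (Keras).
import tensorflow as tf
from tensorflow.keras.optimizers import Nadam

INITIAL_LR, DECAY = 0.001, 0.004                # reported values

optimizer = Nadam(learning_rate=INITIAL_LR, beta_1=0.9, beta_2=0.999)
temp_ec.compile(optimizer=optimizer,
                loss="binary_crossentropy",     # assumed loss for 0/1 labels
                metrics=["accuracy"])

# Per-epoch learning-rate decay, as reported (0.004 per epoch).
def schedule(epoch, lr):
    return INITIAL_LR / (1.0 + DECAY * epoch)

callbacks = [tf.keras.callbacks.LearningRateScheduler(schedule)]
```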

Experiment 2: Observation of Face-to-Face Communication Behavior During Care Learning Scenes
In the second experiment, we compared the occurrence of Humanitude care skills among novice, middle-level and expert caregivers using the FPVs of care learning scenes. We obtained the number of eye contact frames, the mutual facial distances and the poses from the care scene dataset and compared the results.

Analysis and results:
The occurrences of eye contact frames and the average mutual facial distances and poses (angles) are shown in Table 5, and normalized histograms of each feature are shown in Fig. 9 (left). We applied PCA to the histograms. The resulting PCA scores are shown in Fig. 9 (right), where the x-axis shows the scores of the first PCA component and the y-axis the scores of the second. From the eye contact rates and PCA results, we were able to clearly distinguish the scores of novices and experts in terms of eye contact rate, mutual facial distance and r_z PCA scores. There was a significant difference in eye contact rate between the expert & middle-level group and the novice group (p = 0.0452), and there were clear thresholds at about x = 0.16 for the mutual facial distance scores and at about x = −0.18 for the r_z PCA scores. In the mutual facial distance category, the histograms showed that the expert caregivers and most of the middle-level ones approached the care receiver to a distance of less than 30 [cm]. In the mutual facial pose category, there were clear dissimilarities in the z-rotation, which is the rotation of the care receiver's face in the FPV image plane (the plane perpendicular to the facial frontal direction), as shown in Fig. 4. Namely, the average and peak z-rotation values of the experts and the middle-level caregivers were located around 0 [deg], while those of the novices were much larger.

Discussion and Conclusion
In this section, we discuss the main findings regarding the proposed image-based eye contact detection algorithms and the wearable care-skill evaluation system and draw conclusions from them.

Image-based eye contact detection:
We developed eye contact detection algorithms that use temporal features as well as static image features. Our algorithms showed better performance across various types of datasets. They combine CNNs and an LSTM and successfully learned both static features and temporal dependence. In the experiments, the proposed TempEC and TempEC-HP algorithms outperformed the DeepEC algorithm. In particular, TempEC-HP achieved a 25% improvement in the miss-detection rate over the existing algorithm.
In a preliminary experiment, we found that a simple concatenation of CNNs and an LSTM was not effective; we concluded that such a primitive combination was not suitable for learning both static and temporal features at the same time. Thus, in the final estimation step we introduced a skip connection that jumps over the LSTM networks and directly links the CNN outputs to the final fully connected layers. This structure improved our algorithms' performance, as the Experiment 1 results showed: the skip connection enables the algorithms to successfully learn both static and temporal features at the same time.
Surprisingly, some tests showed that TempEC performed better than TempEC-HP. This was contrary to our expectations, since we believed that the facial pose information in TempEC-HP would help in detecting eye contact for various face directions. However, these results do not indicate that facial pose information is useless. In our TempEC-HP algorithm, 3D facial pose estimation is based on facial landmarks, the detection of which is mostly accurate but has a certain degree of error. This error is not significant when used to obtain eye regions. In facial pose estimation, however, such a small error sometimes causes a large spurious gap between two contiguous frames. The facial poses of two adjacent frames should be close, because a human face cannot move very much in a short time (namely, 0.03 s, because the videos were recorded at 30 fps). Due to this problem, the facial pose estimates are occasionally not sufficiently reliable, which causes TempEC-HP to perform poorly. Hence, the performance of TempEC-HP could be improved by using a more accurate face detection or facial pose estimation algorithm.
Another notable finding was that introducing temporal inference increased the algorithms' recall, which means the temporal information helped reduce 'overlooked' eye contact events. Our experiments showed that adding a conditional random field (CRF) to the DeepEC algorithm could not improve its results. Several examples (Fig. 10) showed that the CRF tends to "smoothen" DeepEC's temporal inference, which may help avoid the 'jittering' effects of single-frame estimation but does not fundamentally solve the temporal inference problem. Thus, we believe our current algorithms, which combine the internal states of single-frame recognition and LSTM, are a better solution.
Our results show that our algorithms achieve excellent eye contact detection performance. They also show the potential of temporal learning of eye behavior, with which we can evaluate the care skills of caregivers.
Evaluation of wearable care-skill estimation system: Unsupervised analysis of the mutual facial distances and facial poses enabled us to find significant differences among novice, middle-level and expert Humanitude caregivers. Specifically, we found clear thresholds in eye contact frequency and in the PCA scores of the facial distance and r_z-rotation histograms, which indicate that the important skills in Humanitude tender-care are related to a) frequent eye contact, b) a minimum mutual facial distance of less than 30 [cm] and c) mutual facial poses being in the same direction. This can also be seen in the 1st PCA components of the distance and r_z histograms (Fig. 11). These observations accord with the Humanitude care methodology, in which all behaviors are considered to carry non-verbal messages: making eye contact straight in front of the care receiver expresses fairness, and the distance between caregivers and care receivers reflects their friendliness. The study results show that the experts expressed fairness and friendliness much more than the novices. This skill is a core skill for establishing the good relationship that leads to high-quality care.
Open issues and future work: Though our analyses can quantify and visualize Humanitude care communication skills, several open issues remain for increasing the analysis quality, as we ascertained from the responses of Humanitude experts. The first point is the face detection stability of OpenFace. Specifically, OpenFace cannot detect care receivers' faces when the x or y rotations are quite large (e.g., when the face is seen from the side). The second point is the temporal analysis. The estimation of facial poses and distances is currently performed frame by frame, and TempEC considers a temporal duration of only 1/3 second (10 frames at 30 fps). However, it has been reported that the duration of eye contact is about three seconds in typical communication scenes [6] and that a longer duration of mutual gaze is often effective in communicating with dementia patients [3]. Thus, temporal inference using a longer duration can be expected to be effective in care-skill evaluation as well.
The tender-care concept involves multi-modal skills including gazing, speaking and touching. As the initial step in computational care communication analysis, we focused on face-to-face communication skills. We are currently developing methods for detecting and analyzing voice signals and for sensing touch behaviors through the use of wearable contact sensors or vision analysis. We believe that our findings on tender-care skills and our systems for measuring care skills will prove to be important and usable, not only for improving the skills of caregivers but also for designing and evaluating care robots' behaviors.