1 Introduction

The increasing number of elderly people living alone has become a social problem. The coronavirus pandemic made the problem more severe, as elderly people had to spend more time at home, which reduced their social activities. As a result, loneliness and depression have increased among elderly people living alone, which directly affects their health and increases the risk of dementia [1]. In this study, we attempted to address this problem by providing social services through a robot.

In existing commercial robot services, robots only understand simple commands and provide fragmented responses, failing to engage in natural, contextual communication with the elderly or to establish emotional connections. Moreover, existing human-care robots cannot understand the diverse characteristics of individual elderly persons and therefore cannot provide services tailored to their specific needs. As a result, the relationship between the elderly and robots fails to develop, diminishing the effectiveness of the care services that robots provide.

We developed core intelligence for human-care robots, including 1) personal profiling that finely recognizes individual characteristics, behaviors, and situations, 2) interpersonal intelligence modeling that acquires, through learning, the communication skills necessary for natural communication with the elderly, and 3) behavior intention understanding that accurately tracks the daily activities of the elderly. We integrated these intelligence modules into a robot platform, applied the robot to elderly-specific services, and verified the intelligence in a real-world environment.

Our main contributions are threefold. First, we developed a service scenario that uses multimodal feature recognition as an integrated module, rather than relying on individual recognition modules based on artificial intelligence (AI), and we evaluated and verified it in the real world. Unit recognition modules exhibit high performance when evaluated on datasets. However, the performance of the integrated recognition module based on multimodal feature recognition fell short of our expectations in the real world, owing to issues in prioritizing recognition results and allocating resources to process them. To solve this problem, we integrated the recognition modules into the robot system and verified its stability through robot services in the real world.

Second, we evaluated the human-care robot service both quantitatively and qualitatively to verify its stability and usefulness. To quantitatively evaluate stability, we measured success rates and frequency of use. Success rates indicate how well the service performs its functions, while frequency of use reflects how often users utilize the service. For the qualitative evaluation of usefulness, we analyzed changes in life satisfaction, perception of the robot, and its reliability. Based on these quantitative and qualitative results, we analyzed their implications and proposed ways to enhance both stability and usefulness.

Finally, we developed a service focusing not only on cognitive support but also on emotional support based on personal profiling and interpersonal intelligence modeling. Fragmented responses shown in existing robot services have made it difficult to establish emotional communication and relationships with the elderly. In this study, we developed and validated a service that understands the various characteristics of each elderly person and provides services tailored to their characteristics in the real world. Through emotional support services, we aimed to enhance the usefulness of the human-care robot service and form a rapport between the robot and the user.

2 Related Works

Many service robots for the elderly take the form of social and companion robots. These robots serve as friendly companions, providing cognitive support, physical care, and emotional care through interaction in daily life. There are two main types of service robots for elderly care: doll-shaped robots and mobile service robots. Based on these platforms, a variety of research has been conducted to provide better care for the elderly.

Several studies have used pet robots such as “Paro” or doll-shaped robots in hospitals, elderly facilities, and local communities for elderly people with dementia or cognitive decline [2,3,4,5,6,7,8]. For example, a cognitive ability improvement program was provided to the elderly with mild cognitive impairment [9], and patients with dementia were reminded of medication and meal times [10]. Recently, studies have also applied robots to elderly people with normal cognitive function. A dog robot named “Jennie” has been used to reduce loneliness among the elderly and to increase stability and intimacy.

The test using the Hobbit mobile platform [11] was conducted with actual users in private households, not in a lab environment. Based on existing field trials, services covering the physical and cognitive support needed by the elderly were identified. Changes at each stage of the experiments were examined through user self-reporting based on interviews or questionnaires.

A robot named Stevie [12] was used to collect data for qualitative evaluation through cognitive games such as musical bingo and interactive quizzes with the elderly. For the qualitative evaluation, issues and episodes were analyzed through semi-structured interviews, questionnaires, and a research diary. In another qualitative evaluation using the Tiago platform, acceptance was analyzed through the Godspeed [13] and Almere [14] surveys after seniors aged 65 or older had used the robot over a long term (10 weeks) [15].

The SERROGA project mainly measured navigation performance using the Max platform [16] in living laboratories and in the private apartments of staff and elderly people. Major services included approaching a goal, autonomous navigation, user following, and user searching. Navigation complexity, illumination conditions, duration, and velocity were used as evaluation indicators for autonomous navigation. The IRMA system [17] used satisfaction and accuracy to evaluate a belongings-finding service in which the robot moved to the object and described its position using reference points.

Fig. 1 Robot platform

Fig. 2 Software architecture

Portugal et al. [18] conducted usability and satisfaction questionnaires with care center users such as seniors, visitors, and caregivers, using face detection or recognition and speech recognition services. Rather than using logs, they evaluated the performance of each service by taking notes, recording videos, taking pictures, and receiving feedback from participants.

These studies were limited to qualitative evaluations through interviews and questionnaires or to quantitative evaluations of services using vision-based recognition. This study aims to define emotional and cognitive support services using the Pepper platform and to evaluate their performance quantitatively and qualitatively. In our previous studies [19] and [20], we focused on activity detection to evaluate the performance and user satisfaction of robot services. In this study, we expand to emotional and cognitive support services using various perception and gesture generation modules, and evaluate the performance of the services in the real world. Our goal is to provide a more comprehensive evaluation than previous studies and to verify the stability and usefulness of the services.

3 System Architecture

3.1 Robot Platform

We used the Pepper robot, manufactured by Softbank Robotics, as the mobile service robot. Pepper facilitates the creation of human-like gestures through head and arm movements. We selected it because, as a humanoid-type platform, it can approach the elderly in a familiar way.

In the social interaction intelligence covered in this study, vision-based recognition modules are the most important. Accordingly, an Xtion camera was mounted to serve the recognition modules. A LiDAR sensor was added for the navigation module, and a separate power supply for it was attached (see Fig. 1).

Table 1 Definition of modules

3.2 Software Architecture

The integrated processing of the human-care robot consists of the perception, interaction, and action stages, as shown in Fig. 2. The perception stage receives binary image data as input and performs object detection and face recognition. In the memory, the data received from all stages are stored and synchronized based on time. In the interaction stage, robot actions suitable for the context are determined or services are selected based on the results of the modules. Robot actions refer to multimodal actions, including utterances, actions, and device control. The action stage controls the actions of the robot by receiving multimodal actions from the interaction stage.

Table 2 Definition of services

Multimodal interaction control is a crucial part of the interaction process in which the response of the human-care robot is determined and executed based on the speech and behavior of the user. The service selection engine (SLA) [21] receives the results through various perception modules and determines the service goals that are appropriate to the situation. When determining service goals, specific user preferences are considered and episodic memory can be referenced. Episodic memory includes a user model acquired through an interaction experience with the user.

The service processor, AIRego, executes the multimodal human interaction process based on the service goals. AIRego decides which service to execute based on the service goals determined by the SLA and initiates the interaction by sending the ID of the selected service to the dialog service. The dialog service is executed using a chatbot embedded in the cloud platform.
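
The sketch below illustrates this goal-selection step in a minimal form: a perception result is mapped to a service goal while consulting a simplified user model. All names (`PerceptionEvent`, `EpisodicMemory`, `select_service_goal`, the service IDs) are hypothetical and are not the actual SLA or AIRego interfaces.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PerceptionEvent:
    """A time-stamped result from a perception module (hypothetical structure)."""
    timestamp: float
    module: str          # e.g., "T05" activity detection, "T11" speech recognition
    label: str           # e.g., "taking_medicine", "find my mobile phone"
    confidence: float

@dataclass
class EpisodicMemory:
    """Simplified user model built from past interactions."""
    preferences: dict = field(default_factory=dict)
    recent_services: list = field(default_factory=list)

def select_service_goal(event: PerceptionEvent,
                        memory: EpisodicMemory,
                        min_confidence: float = 0.6) -> Optional[str]:
    """Map a perception event to a service goal, roughly mirroring the SLA's role."""
    if event.confidence < min_confidence:
        return None                      # ignore low-confidence recognitions
    if event.module == "T11" and "find my" in event.label:
        return "S06_find_belongings"     # user-requested service
    if event.module == "T05" and event.label == "taking_medicine":
        return "S01_daily_living_assistance"   # proactive service, no request needed
    return None

goal = select_service_goal(
    PerceptionEvent(timestamp=1_700_000_000.0, module="T11",
                    label="find my mobile phone", confidence=0.92),
    EpisodicMemory())
print(goal)  # -> S06_find_belongings
```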

We implemented human-care robot services through multimodal interaction control in the perception and interaction phases, using the key modules defined in Table 1. To implement speech recognition (T11), we used the Google speech recognition engine and hotword (‘Hey Jenny’) to improve performance in real-world environments. We limited the scope of our study to the field of vision recognition due to practical issues. Therefore, in this study, we evaluated audio-based environmental sound recognition (T10) and navigation (T12) as services, but excluded them from the module performance evaluation because they do not directly use vision recognition.

4 Service Scenario

4.1 Definition of Service Scenario

The module tests conducted in previous studies had a limitation in that they did not guarantee performance in the real world. In this study, we evaluate each module from the perspective of integrating several modules. To this end, we defined service scenarios based on an integrated robot system and analyzed the results of qualitative evaluations, such as reliability and perception of the robot, as well as quantitative evaluations.

Services were defined with a focus on cognitive and emotional support for older adults. Based on the 12 modules, we defined 10 services that the elderly are likely to need. As shown in Table 2, the services provided by the robot can be broadly categorized into two types: proactive services, in which the robot recognizes the voice and actions of the user or responds to environmental information, and services provided upon request by the user. Each service contains one or more main modules. We defined the services such that users did not need to be aware of the underlying modules or functionalities.

In the next section, we provide a detailed description of the four service scenarios among the ten scenarios defined in Table 2. The four scenarios (daily living assistance, finding belongings, outfit check, and verbal/non-verbal interaction service) consisted of three or more modules each, with a focus on vision-based recognition.

4.2 Daily Living Assistance Service

The purpose of the daily living assistance service is to enable a robot to detect the daily behavior of the user through images and to provide a living assistance service at all times without requests from the user. Figure 3 shows the overall system structure and scenario for the daily living assistance service.

Fig. 3 Flow diagram of daily living assistance service

Fig. 4 Flow diagram of find belongings service

The daily living assistance service is based on daily activity detection (T05). Most datasets for activity detection training and testing focus on adults in general [35]. Therefore, we constructed a dataset on the daily behavior of the elderly. The ETRI-Activity 3D dataset [26] includes RGB videos, depth images, and skeleton information for the tracked human bodies. We selected 12 behaviors that occur frequently in daily life and defined eight daily living assistance sub-scenarios based on them.

4.3 Find Belongings Service

In the find belongings scenario, when a user requests the location of a specific belonging, such as ‘Find my mobile phone,’ the robot navigates along a designated path and detects and recognizes objects from images at the designated locations (targets). After finding the item, the robot returns to the user and reports the location of the requested belonging. This service scenario consists of the object detection and recognition (T04), speech recognition (T11), and navigation (T12) modules (see Fig. 4).

As mentioned, the object recognition used in finding belongings identifies not only the class but also the specific instance, unlike conventional recognition. We pre-designated target locations at points where belongings were likely to be placed, for the robot to navigate to. Figure 5 shows a scene in which the robot navigates and finds the belongings at a designated location. During navigation, the robot memorizes the locations of all registered belongings, not only the specific belonging requested by the user.
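
The bookkeeping described above can be sketched as follows; the class and function names (`BelongingsMemory`, `update`, `report`) are illustrative, not the actual robot API.

```python
from typing import Dict, Tuple

class BelongingsMemory:
    """Remembers the last target location where each registered belonging was seen."""
    def __init__(self) -> None:
        self.last_seen: Dict[str, str] = {}   # instance name -> target location name

    def update(self, target: str, detections: Tuple[str, ...]) -> None:
        # called after detection/recognition at each designated target
        for instance in detections:
            self.last_seen[instance] = target

    def report(self, requested: str) -> str:
        loc = self.last_seen.get(requested)
        return (f"Your {requested} is at the {loc}." if loc
                else f"I could not find your {requested}.")

memory = BelongingsMemory()
# While navigating, every designated target is visited and its detections recorded.
memory.update("living room table", ("remote controller", "glasses"))
memory.update("kitchen counter", ("mobile phone",))
print(memory.report("mobile phone"))   # -> Your mobile phone is at the kitchen counter.
```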

Fig. 5 The scene of the find belongings service in a private home

4.4 Outfit Check Service

The outfit check is a service in which the robot generates an appropriate comment when a user is about to go out, using facial attribute recognition (T01), clothing attribute recognition (T02), style comment generation (T03), and object recognition (T04) on the input video. Figure 6 shows the flow diagram of the outfit check service.

Fig. 6 Flow diagram of outfit check service

Fig. 7 Overview of identity recognition for partially obscured faces

4.4.1 Facial Attribute Recognition

The facial attribute recognition module detects the face area in the image, recognizes the identity, and determines whether a mask is worn in order to generate a related comment. Although there are various face-related attributes, such as gender, age, and race [36], we focused on whether a mask is worn rather than on identification, because the service targets small families or elderly people living alone.

To detect masks, we located the face region and generated normalized facial images using facial landmarks. Subsequently, we eliminated non-candidate areas of the mask within the normalized face region, focusing solely on the area surrounding the mouth. This allowed us to create a classification model based on the mouth area [37]. To evaluate the classification model’s performance, we used images collected from the web along with images from the Multi-PIE database [38] on which a virtual mask was artificially overlaid. We utilized approximately 200,000 training images and 700 validation images [22]. In the initial stage of identity recognition, features were computed so that the presence or absence of a mask did not affect them. During registration, feature vectors for both full-face and partially obscured face images were stored, and the final recognition reliability was improved through selective comparison depending on whether a mask was worn. Figure 7 shows an overview of the identity recognition algorithm that adapts to images partially obscured by a mask.
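
A minimal sketch of the selective-comparison idea follows: two gallery embeddings per user (full-face and mask-occluded) are kept, and a query is matched against the one corresponding to its mask state. The embedding dimensionality, threshold, and the random vectors standing in for real face features are placeholders.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

gallery = {
    # user_id -> (full-face embedding, masked-face embedding), stored at registration
    "participant_A": (np.random.randn(128), np.random.randn(128)),
    "participant_B": (np.random.randn(128), np.random.randn(128)),
}

def identify(query_embedding: np.ndarray, mask_worn: bool, threshold: float = 0.5) -> str:
    """Return the best-matching user, comparing only against the relevant gallery vector."""
    best_id, best_score = "unknown", threshold
    for user_id, (full_vec, masked_vec) in gallery.items():
        reference = masked_vec if mask_worn else full_vec   # selective comparison
        score = cosine(query_embedding, reference)
        if score > best_score:
            best_id, best_score = user_id, score
    return best_id

print(identify(np.random.randn(128), mask_worn=True))
```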

4.4.2 Clothing Attribute Recognition

Clothing items can be classified into more specific categories such as shirts, jumpers, coats, and dresses. Moreover, various attributes, such as color, gender, pattern, clothing style, sleeve length, and season can be defined and extracted to express the unique properties of the object [39]. Multiattribute-based classification is widely used in clothing recognition, including pedestrian character recognition, human recognition, and fine-grained image recognition [40, 41].

The types of clothes were defined as seven types of tops (shirt, jumper, jacket, vest, parka, coat, dress) and two types of bottoms (pants, skirt). Each top or bottom can also have its own color, pattern, gender, season, sleeve length, pants length, and leg posture. Therefore, the two clothing type classes and 11 sub-attribute classes were combined to create 13 multiclass attributes, as listed in Table 3.

Table 3 13 multi-attribute definition of clothing
Fig. 8 Deep class-wise learning model for multi-attributes classification [23]

Figure 8 illustrates the structure of the proposed deep class-wise learning model for classifying multiple attributes of clothing. The model consists of two main steps: a detection part that identifies the region of interest (ROI) of the person in the photo, and a classifier that predicts clothing attributes using features extracted from the ROI. The yolo-v3 model is used to detect human locations [42]; after the input image passes through this model, a human ROI is obtained. The region of the yolo-v3 2D feature map corresponding to the ROI coordinates is then extracted and pooled. These ROI-pooled feature vectors are used as inputs to the attribute classifier [43].
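
A minimal PyTorch sketch of the class-wise attribute head is given below: one linear classifier per attribute on top of an ROI-pooled feature vector. The attribute names, class counts, and feature dimension are illustrative (the paper defines 13 attributes), and the yolo-v3 detector and ROI pooling are omitted.

```python
import torch
import torch.nn as nn

ATTRIBUTES = {            # attribute name -> number of classes (illustrative counts)
    "top_type": 7, "bottom_type": 2, "top_color": 12, "pattern": 5,
    "gender": 2, "season": 4, "sleeve_length": 3,
}

class ClothingAttributeHead(nn.Module):
    def __init__(self, feature_dim: int = 1024):
        super().__init__()
        self.heads = nn.ModuleDict(
            {name: nn.Linear(feature_dim, n_cls) for name, n_cls in ATTRIBUTES.items()})

    def forward(self, roi_features: torch.Tensor) -> dict:
        # roi_features: (batch, feature_dim) pooled from the detected human ROI
        return {name: head(roi_features) for name, head in self.heads.items()}

model = ClothingAttributeHead()
logits = model(torch.randn(1, 1024))
predictions = {name: out.argmax(dim=1).item() for name, out in logits.items()}
print(predictions)
```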

We compiled a dataset of 40,000 human instances from 30,000 images containing more than one person. Each human instance was defined as an area containing a person wearing top and bottom clothing. We allocated 60% of the data for training, 10% for validation, and 30% for testing. The average accuracy of attribute classification was 73.57%, with lower accuracy observed for color and season attributes compared to that of other cases [23].

4.4.3 Style Comment Generation

Image captioning was adopted to generate comments on the attire using natural language sentences that describe the content of the given images [44]. While typical image captioning generates factual descriptions of the entire image, the style comment generator in this study generates subjective opinions on the attire of the person in the image (assuming there is only one person). We collected 6,062 images of elderly people with a variety of attire from the internet. Then we asked annotators to generate friendly comments about the attire of the people in the images, such as “The floral design of the clothes looks unique.” We recruited undergraduate students majoring in fashion as annotators in order to obtain more sensible comments. There were 11 annotators, and each generated one comment per image, resulting in 6,062 times 11 = 66,682 comments (the final number was 66,445 because several were missing) [24]. We used an open-source image caption program, modified to process the Korean language [45]. The network used Inception V3 [46] for image feature extraction, and bidirectional long short-term memory (LSTM) was used to decode the extracted feature vectors into Korean sentences. We allocated 80% of the annotated data for training, with the remaining 20% reserved for testing. To evaluate model performance, we measured the BLEU-1 [47] score and obtained a result of 0.34 [24].
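
For reference, the BLEU-1 metric quoted above is modified unigram precision with a brevity penalty; a minimal sketch of its computation is shown below (the example sentences are made up for illustration).

```python
import math
from collections import Counter

def bleu1(candidate: list, references: list) -> float:
    """BLEU-1: clipped unigram precision times a brevity penalty."""
    cand_counts = Counter(candidate)
    max_ref_counts = Counter()
    for ref in references:                       # clip by max count in any reference
        for token, count in Counter(ref).items():
            max_ref_counts[token] = max(max_ref_counts[token], count)
    clipped = sum(min(count, max_ref_counts[token]) for token, count in cand_counts.items())
    precision = clipped / max(len(candidate), 1)
    # brevity penalty against the reference closest in length
    ref_len = min((len(r) for r in references), key=lambda rl: abs(rl - len(candidate)))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / max(len(candidate), 1))
    return bp * precision

print(bleu1("the floral pattern looks unique".split(),
            ["the floral design of the clothes looks unique".split()]))
```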

4.4.4 Object Detection and Recognition

Object detection and recognition are core technologies for elderly-care robot services: they recognize and memorize the objects around the house and provide related information upon request from the user or other modules. They can provide an independent service, such as a lost-and-found service, and also support other modules by providing information on surrounding objects. Related technologies include object recognition, which recognizes general types of objects in predefined categories, instance recognition, which identifies specific object instances, object detection, which locates object positions in images, and few-shot learning [25], which quickly registers objects from a few images.

Fig. 9 Example images of 15 object categories [48]

As candidate objects for general object detection, 15 object categories were selected based on a feasibility study in which we surveyed the most frequently lost and found objects: glasses, mobile phones, remote controllers, medicine paper bags, medicine cases, cups, newspapers, cigarettes, hats, canes, towels, socks, wallets, writing instruments, and keys. Unlike the objects available in public detection datasets, these objects are small, articulated, or nonrigid. Example images are shown in Fig. 9.

The object detection and recognition module consists of two stages. In the first stage, the 15 predefined objects in the image are detected and their categories recognized based on Faster RCNN [49]. Next, based on the detected region and its features, the object instance is recognized using a single linear layer and a softmax function. The overall architecture is shown in Fig. 10.

Fig. 10 Overall architecture for object detection and recognition

The detection module in the first stage consists of Faster RCNN with feature pyramid networks [50] and a CBAM attention module [51] to boost detection performance. A Resnet101 network pretrained on the COCO dataset was used for the initial weights, and OpenImage [52], VisualGenome [53], and an additional dataset were used for fine-tuning.

In the second stage, a single linear layer and a softmax function were applied on top of the frozen feature extractor of the object detection module. We also considered the result in [54], where a prototype and cosine-similarity-based classification method showed better recognition performance than well-known meta-learning few-shot classification. However, in our case, the single linear layer with softmax showed higher accuracy and lower variance than the prototype and cosine-similarity-based method.
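
The two instance classifiers compared above can be sketched as follows, assuming frozen detector features; the dimensions and random data are placeholders.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def linear_softmax_predict(features: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Single linear layer + softmax. features: (n, d); W: (d, k); b: (k,)."""
    return softmax(features @ W + b).argmax(axis=1)

def prototype_cosine_predict(features: np.ndarray, support: dict) -> np.ndarray:
    """Prototype (class-mean) classifier with cosine similarity over a few-shot gallery."""
    protos = np.stack([np.mean(v, axis=0) for v in support.values()])   # (k, d)
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    return (f @ p.T).argmax(axis=1)

d, k = 256, 4                                    # feature size, number of cup instances
features = np.random.randn(8, d)                 # frozen detector features for 8 crops
support = {i: np.random.randn(3, d) for i in range(k)}   # 3 registered images per instance
print(prototype_cosine_predict(features, support))
print(linear_softmax_predict(features, np.random.randn(d, k), np.zeros(k)))
```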

Fig. 11 Generating comments related to belongings in the outfit check service: a “If you carry a cell phone, you can drop it. Put it in your pocket and carry it.”, b “The hat suits you.”

Based on object detection and recognition, we implemented a module that identifies the user's belongings and generates related comments. This module first detects an object, determines whether it is near the person, and then chooses one of the predefined comment templates. Figure 11 shows examples of comments generated based on the object detection and recognition results.
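
A minimal sketch of this step is shown below: a detected object's bounding box is checked for proximity to the detected person, and a template comment is returned. The boxes, margin, and templates are illustrative, not the deployed comment set.

```python
COMMENT_TEMPLATES = {
    "mobile phone": "If you carry a cell phone, you can drop it. Put it in your pocket.",
    "hat": "The hat suits you.",
}

def center(box):
    # box: (x1, y1, x2, y2) in pixels
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

def is_near(obj_box, person_box, margin=50):
    """Object counts as 'near' if its center lies inside the (slightly expanded) person box."""
    ox, oy = center(obj_box)
    x1, y1, x2, y2 = person_box
    return (x1 - margin) <= ox <= (x2 + margin) and (y1 - margin) <= oy <= (y2 + margin)

def belongings_comment(detections, person_box):
    for label, box in detections:
        if label in COMMENT_TEMPLATES and is_near(box, person_box):
            return COMMENT_TEMPLATES[label]
    return None

print(belongings_comment([("hat", (300, 40, 380, 100))], person_box=(250, 30, 450, 600)))
```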

4.5 Verbal and Non-verbal Interaction Service

Verbal and non-verbal interaction service provides natural conversation similar to human-to-human interaction using gestures and behavior generation. The robot generates its own gestures using co-speech gesture generation (T08) and interaction gesture generation (T09) modules and encourages natural conversation with the user, as shown in Fig. 12.

Fig. 12 Flow diagram of verbal and non-verbal interaction service

4.5.1 Co-speech Gesture Generation

In general, the appropriate use of gestures helps in understanding what the other person is communicating. It is important not only in interactions among humans, but also in human-robot interactions [55, 56]. Non-verbal behaviors for these interactions include facial expressions, hand movements, and gestures. We focus on upper-body gestures that are generated with speech.

We propose an end-to-end gesture-generation model that uses spoken sentences, multimodal speech information, and speaker ID. It generates gestures that match the spoken content as well as the timing of the speech. We suggest an encoder-decoder structure that considers the relationship between speech and gesture and uses the same timeline to synchronize text/speech/pose temporally. The model can generate gestures from text/speech; gesture generation is also possible with only text using TTS.

Fig. 13 Network structure for gesture generation

We defined a single neural network architecture consisting of encoders for each input modality and a decoder for gesture generation. Figure 13 illustrates the overall network structure. Three input modalities (speech text, speech audio, and speaker ID) were encoded in different ways and fed into the generator. To ensure a smooth transition between gestures, the starting pose is explicitly provided during gesture generation. Gestures are represented as sequences of poses, and a GRU is used to generate the pose at each time step. The model was trained on a TED dataset without prior knowledge about co-speech gestures. As a result, the model outputs human-like gestures that match the speech content and rhythm [29].
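
The encoder-decoder idea can be sketched in PyTorch as follows: text, audio, and speaker-ID encodings are combined at each time step and a GRU decodes a pose sequence seeded with the starting pose. Dimensions, vocabulary size, and the training objective are illustrative and do not reproduce the actual model in [29].

```python
import torch
import torch.nn as nn

class GestureGenerator(nn.Module):
    def __init__(self, vocab=2000, n_speakers=100, audio_dim=40, pose_dim=30, hidden=256):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, 64)
        self.speaker_emb = nn.Embedding(n_speakers, 16)
        self.audio_proj = nn.Linear(audio_dim, 64)
        self.gru = nn.GRU(64 + 64 + 16 + pose_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, text_ids, audio_feats, speaker_id, seed_pose):
        # text_ids: (B, T), audio_feats: (B, T, audio_dim), speaker_id: (B,), seed_pose: (B, pose_dim)
        B, T = text_ids.shape
        txt = self.text_emb(text_ids)                                   # (B, T, 64)
        aud = self.audio_proj(audio_feats)                              # (B, T, 64)
        spk = self.speaker_emb(speaker_id).unsqueeze(1).expand(B, T, -1)
        seed = seed_pose.unsqueeze(1).expand(B, T, -1)                  # smooth-transition cue
        h, _ = self.gru(torch.cat([txt, aud, spk, seed], dim=-1))
        return self.out(h)                                              # (B, T, pose_dim) pose sequence

model = GestureGenerator()
poses = model(torch.randint(0, 2000, (1, 20)), torch.randn(1, 20, 40),
              torch.tensor([3]), torch.zeros(1, 30))
print(poses.shape)   # torch.Size([1, 20, 30])
```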

4.5.2 Interaction Gesture Generation

For a social robot to interact with the user, it must understand the user's behavior, infer the intention, and respond appropriately. For example, when users extend their hands to shake hands, the robot should recognize this and extend its hand. Various studies (e.g., [57,58,59]) have implemented such social intelligence, but they are limited to repeating predetermined actions. Instead of explicitly telling the robot what to do, we propose a machine-learning method that learns how to act from the robot's experience [31].

To learn the social behavior of the robot, we used 7,500 human-human interaction samples from the AIR-Act2Act dataset [60]. The dataset contains scenes of interactions between older adults and adults in a home environment. It consists of ten interaction actions, such as greeting, shaking hands, and hugging, and provides the depth map and 3D skeleton of each person captured using a Kinect v2 camera. We used the actions of the elderly as input data and those of the adults as output data for the robot to learn.

To recognize user behavior, we used an LSTM-based model [61], as shown in Fig. 14, which is a popular and effective model for understanding sequential data. The input to the LSTM is a sequence of feature vectors of user poses, and the output is a one-hot vector of behavior class labels. M represents the number of user poses input to the LSTM at one time, and the output dimension equals the number of user behavior classes. Using 65,063 training input-output pairs, training was performed with a gradient descent algorithm with a learning rate of 0.01 and a batch size of 64. In a subsequent experiment with 7,098 test samples, the recognition accuracy over all user behaviors was 99%, with most behaviors recognized correctly [32].
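
A minimal PyTorch sketch of such an LSTM behavior recognizer is given below; the pose feature size and hidden size are illustrative, while the ten classes, learning rate, and batch size follow the description above.

```python
import torch
import torch.nn as nn

class BehaviorClassifier(nn.Module):
    def __init__(self, pose_dim=75, hidden=128, n_behaviors=10):
        super().__init__()
        self.lstm = nn.LSTM(pose_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_behaviors)

    def forward(self, pose_seq):          # pose_seq: (batch, M, pose_dim)
        _, (h_n, _) = self.lstm(pose_seq) # use the last hidden state
        return self.fc(h_n[-1])           # (batch, n_behaviors) class logits

model = BehaviorClassifier()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # gradient descent, lr 0.01
criterion = nn.CrossEntropyLoss()

# one illustrative batch of 64 sequences of M = 30 poses each
poses, labels = torch.randn(64, 30, 75), torch.randint(0, 10, (64,))
loss = criterion(model(poses), labels)
loss.backward()
optimizer.step()
```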

To ensure that the robot responds appropriately to user behavior, we defined behavior selection rules. Each robot behavior was designed by considering the interaction scenarios of AIR-Act2Act. In addition, we identified the key poses for each robot behavior. When a behavior is selected, the robot transitions from its current pose to a key pose associated with the chosen behavior. Repeating the same key pose for a certain behavior may prevent the robot from executing the exact motion the user expects. To solve this problem, we adjusted the key poses of the robot behavior based on the user’s current posture, position, and physical characteristics (such as height).

Fig. 14 Flow of interaction gesture generation

We focused on validating high-five and handshake actions in the real world. Figure 15 shows an example of a social interaction generation scene, where the robot adapts to movements that are expressed differently depending on the height and pose of the user.

Fig. 15 Interaction gesture generation samples in the real world

5 Experiment Settings

5.1 Procedure

This study aimed to expand the usability of human-care robots for the elderly. Therefore, a quasi-experimental research design was used to establish the human-care robot service strategy. A quasi-experimental design is a research methodology used to estimate the causal impact of an intervention on a target population without random assignment. It is a type of empirical interventional study that is useful in situations where true experiments cannot be used for ethical or practical reasons. However, it does not allow random assignment of participants to groups, which may result in non-equivalent groups and confounding variables [62, 63]. In this study, a one-group time series design and a one-group pretest-posttest design were used as quasi-experimental designs to increase the internal validity of the study and reduce the risk of confounding variables, as illustrated in Fig. 16. In the testbed experiment, the one-group pretest-posttest design was used to compare the results of a single group before and after using the robot. In the home environment experiment, the one-group time series design was used to measure the same group’s results over time without a control group. This study was approved in advance by the Institutional Review Board of Suwon Science College (IRB No: IRB2-7008167-AB-N-01-202002-HR-001-02).

Fig. 16 Test procedure

In the testbed, 40 people participated in the experiment, divided equally into 1st and 2nd sessions. In the pre-experiment survey, the general demographic characteristics and technological proficiency of the participants were investigated. We defined 25 test scenarios based on the 10 services and explained how to call the robot and the corresponding reaction from the robot, rather than the detailed functions. Given that most participants were using the robot for the first time, it took them approximately 2 h to complete the test. After the experiment, a post-experiment survey was conducted to examine changes in the perception and reliability of the robot.

In the home environment, two elderly participants living alone conducted the experiment in their own homes. Before the experiment, we conducted a pre-experience survey to investigate the demographic characteristics and technical proficiency of each participant. The follow-up survey tracked changes by measuring the usability of the care service while the participants used it for 3–4 h daily over 16–20 days. We evaluated the effectiveness of human-care robot services through a follow-up survey, and indicators related to loneliness, depression, social relationships, health status and life satisfaction were used as factors that affect the quality of life of the elderly. After the experiment, we conducted a post-experience survey by interviewing the users and asking them to provide feedback on the acceptability, convenience, and usefulness of the technology, as well as their intention to adopt such a service.

5.2 Testbed

An apartment-type testbed was built to simulate a real-life residential environment. Three IP cameras were installed in the testbed, and the recordings were used to examine the success rate of the services through a comprehensive performance analysis covering responses, robot gestures, and navigation while interacting with humans. Figure 17 shows the images from the robot camera and from the three IP cameras.

Fig. 17 Testbed environment. ①: robot camera, ②-④: IP cameras

We enrolled elderly people without cognitive decline, with verbal and non-verbal communication skills to respond to interviews, and without visual and hearing impairments. We explained how to participate in the study and obtained their consent. The participants were recruited with the help of a community elderly facility.

A total of 15 males and 25 females participated over 20 days (July 6–20; November 8–19). By dividing the demonstration period between summer and winter, seasonal factors could be reflected in clothing attribute recognition. As most of the test scenarios utilized vision-based modules, the experiments were divided into morning and afternoon sessions to reflect the changes in natural light; therefore, the experiments were conducted twice a day.

Table 4 Demographic characteristics of 40 participants
Table 5 The demographic characteristics of two participants

Most participants were married and living with their spouses (see Table 4). The average age of the participants was 73.2 years, and the average period of income activity was 23.9 years. The average level of income satisfaction was 3.3 points. Regarding technology-related proficiency, smartphone proficiency was 5.9 points, above the median level, whereas computer proficiency was 3.9 points, below the median level. Life satisfaction and health status were 3.8 and 4.0 points, respectively, indicating satisfaction above the median level.

Educationally, 67.5% of the participants had completed high school or a higher level. Regarding subjective health status, 75% of the respondents answered “I am in good health” or “I am very healthy.” As for the subjective income situation, more than 90% responded that they were above average, so the group was considered to have better income and health than the average elderly population in Korea.

5.3 Private Home Environment

The home environment participants were elderly people with normal cognitive function residing in Suwon. Participants A and B were 79 and 73 years old, respectively, and each lived alone. They were in stable health with a stable income level and had high levels of skill and familiarity with technology such as computers and robots. In particular, Participant B, a poet, actively worked with computers and engaged in social activities using a smartphone. Participant A participated in the experiment from 10:00 a.m. to 3:30 p.m. for 16 days (September 16 to October 26), and Participant B participated for 20 days (November 3 to December 1). Participant B participated for four hours a day, alternating between morning and afternoon sessions (see Table 5).

Fig. 18 Private home environment of Participant A

Before the experiment, we visited the homes of the subjects to observe their daily lives living alone. We collected their behaviors and environmental sounds in advance to improve recognition performance. Considering the structure and arrangement of furniture in each home, we designated the location of Home and Target for navigation. We generated maps and routes for navigation based on Home and Target. Figure 18 shows the interaction with a robot at a designated target considering the life pattern of the participant.

Fig. 19 Comparison of service success rates of S01 based on activity detection in the testbed environment

6 Module Performance Evaluation in Real-World

Human-care robot services are composed of multiple integrated modules rather than a single module. Therefore, for a service to perform stably in real environments, the real-world performance of each module is important. In this section, we analyze and verify the real-world performance of the main modules that make up the human-care robot services.

Fig. 20 Comparison of service success rates and number of uses of S01 in a private home

Fig. 21 Change in service success rate for the S01 service in a private home

6.1 Daily Activity Detection

Several issues arose when the activity detection model trained with the elderly behavior dataset was directly applied to the real world. One of the main issues is that several motions occur in the scene, even when there is no specific action, leading to frequent false alarms. It was difficult to directly apply the dataset-based detection model because most of the datasets did not consider this aspect. To solve this problem, virtual motions were created as extra actions and learned as “no action.” Another problem was the limitation of the pose information, which made it difficult to detect certain actions, such as “reading a book” due to the slight motion involved. Additionally, some actions like “hanging laundry” that could easily be detected in terms of tool use (e.g., interacting with a drying rack) were difficult to detect using motion only. Finally, there were many action classes with extremely similar motions, such as “drinking water” and “making a phone call,” which posed a challenge to accurate detection.

In this study, we addressed the issue of pose information limitations by using an RGB image-based detector that is not overfitted. To avoid the computational burden and slow processing speed associated with inputting an entire set of RGB images from a video, we proposed a method that employs only one RGB image. This method considers spatial features, life tools, and relationships with human body parts to detect activity. To accomplish this, we used MobileNetV2 [64] as the underlying model.
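
A minimal sketch of this single-image detector is shown below: a MobileNetV2 backbone with its classifier head replaced by the daily-activity classes plus "no action". The class count, preprocessing, and random input standing in for a camera frame are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

N_CLASSES = 13                          # 12 daily activities + "no action" (assumed count)
backbone = models.mobilenet_v2()        # pretrained weights can be loaded separately
backbone.classifier[1] = nn.Linear(backbone.last_channel, N_CLASSES)

# preprocessing applied to real camera frames (PIL images) before inference
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = torch.randn(1, 3, 224, 224)     # placeholder for one preprocessed RGB frame
with torch.no_grad():
    probabilities = torch.softmax(backbone(image), dim=1)
predicted_class = probabilities.argmax(dim=1).item()
print(predicted_class)
```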

Fig. 22 Definition of five classes and four instances

Fig. 23 Performance of object detection and recognition at the designated target locations

We conducted two experimental sessions to compare performance. Figure 19 shows the service success rates of S01 based on activity detection in the testbed environment. Adding the “no action” class reduced false alarms in activity detection and increased the service success rate for all behaviors; in particular, the average success rate increased from 48% to 70%. By using an ensemble model with two modalities, that is, pose and RGB images, the service success rates for the “taking medicine (S01-B)” and “hanging laundry (S01-F)” behaviors, which involve life tools, were significantly improved.

In the testbed, each participant performed each of the eight S01 services once. In contrast, we asked home-based participants to use each service at least once per day in their homes. After performing each service once, the participants were asked to freely use the robot according to their natural patterns of daily living and to note their experiences. Although there were slight differences between participants, the service success rates were high for S01-E, S01-F, and S01-G, and both participants used S01-G frequently, as shown in Fig. 20. Both participants initially found it difficult to use the services, but showed a pattern of increasing service success rates as they became more adapted over time (see Fig. 21).

We were able to improve activity recognition performance in the real world by using not only the skeleton but also RGB images. The high success rate for services frequently used by users was encouraging; however, more studies are needed to improve detection performance for services that do not occur often (e.g., eating and taking medicine).

6.2 Object Detection and Recognition

The object detection and recognition used in the finding belongings scenario detect not only the classes but also the instances. Five classes (cup, remote controller, medicine case, mobile phone, and glasses) and four cup instances were targeted in this experiment. As shown in Fig. 22, four types of cups with different colors and patterns were used.

In the testbed environment, the find belongings scenario was performed once by each participant, for a total of 40 trials. For each trial, 5–7 belongings were randomly placed in the designated areas. As a result, 4–6 belongings were recognized in each trial; class recognition accuracy was 99%, and instance recognition accuracy was 98%. More failures were caused by missed detections than by false recognition. Figure 23 shows the class and instance recognition performance at the designated target locations in the testbed environment.

Object detection and recognition are heavily influenced by lighting changes in the input image. In real household environments, lighting changes are severe because of natural light from outside. Figure 24 compares object detection and recognition performance in the private home environments; recognition performance was lower in the home of Participant B (right) than in that of Participant A (left). Object detection often failed, primarily because of backlighting.

Fig. 24 Comparison of object detection and recognition performance in private homes

Fig. 25 Facial attribute recognition results

Fig. 26 Clothing attribute recognition results in (a) testbed environment and (b) private home environment

6.3 Facial Attribute Recognition

We measured the performance of the facial attribute recognition module in recognizing whether a mask was worn. In the testbed environment with many participants, the mask recognition success rate was 69.07%, lower than the performance in the private home environments with a single person. Moreover, even between the single-person environments, the success rate for Participant B (73.17%) was lower than that for Participant A (96.97%). This is presumed to be due to changes in lighting (natural light): Participant B experienced severe lighting changes while performing the test alternately in the morning and afternoon, and as a result, recognition performance deteriorated. Figure 25 shows the performance comparison of the facial attribute recognition module and an example of comments generated from logs related to facial attribute recognition.

6.4 Clothing Attribute Recognition

Although there are various attributes related to clothing, comments on clothing styles were generated using only the recognition results of the top clothing features, such as color, pattern, and type. The experimental results showed no significant difference in performance between the testbed and private home. Both performed best in pattern recognition and lowest in color recognition (see Fig. 26). Most participants wore clothes without patterns, and the recognition rate for other patterns such as stripes and floral patterns was low.

Table 6 Evaluation metric for attire style comment
Fig. 27 Examples of style comments

Fig. 28 Comparison of comments based on clothing attribute recognition with comments about attire style

6.5 Style Comment Generation

The attire style comment generation was evaluated on a 5-point scale based on the evaluation criteria summarized in Table 6. Three evaluators determined the adequacy of the generated comments considering the view image of the robot.

The average score was 2.87 points, the maximum was five points, and the minimum was 1-2 points. Style comments generated for colorful or distinctive outfits received higher scores from the evaluators, while relatively lower scores were given for simple or uncharacteristic clothes, as shown in Fig. 27. The evaluators often complained about overly generic comments and about incorrect attire attributes included in them. However, the users were often satisfied simply because the robot said something interesting, even when the comments were absurd because of a misperception of the clothing style. This satisfaction was clearly exhibited when we asked users about the service in the post-experience survey.

Figure 28 compares comments based on clothing attribute recognition with attire style comments. The attribute-based approach generated only a few fixed types of comments. In contrast, the style comments, although sometimes rough or inappropriate, were varied, resulting in high satisfaction among the participants.

6.6 Motion Qualitative Evaluation

The main modules used in the verbal and non-verbal interaction service were related to robot movement, and we conducted a user satisfaction survey to evaluate the performance of the service according to the subjective opinion of the user. Table 7 presents the results of the survey on robot movement.

Table 7 User satisfaction of robot movement in testbed

Among the questions about gesture generation, ‘Was the gesture movement of the robot as natural as that of a human?’ exhibited the lowest level of user satisfaction; it seems difficult to expect gesture movements as natural as a human's from a robot platform of this type. Among the questions about movement speed and gesture generation, the item on responding within an appropriate time showed lower user satisfaction than the other items and a large deviation. Because the movement and gesture generation speeds are significantly affected by the real-time camera input, the robot did not show the same movement and reaction speeds every time, which likely explains this deviation.

7 Robot Service Evaluation

7.1 Service Performance and User Satisfaction

7.1.1 Evaluation Method

We describe the quantitative and qualitative evaluation methods for robot services in real-world environments from a robot service perspective. For the quantitative evaluation of human-care robot services, the service success rate was measured for each service. The service evaluation tool used is shown in Fig. 29. The evaluator checked correct, incorrect, and missed answers by cross-checking the IP camera video, audio, and logs. In addition to service success rates, the evaluation tool can measure the performance of each module and the step-by-step success rate for each service. A post-service survey was used for the qualitative evaluation of the service; it investigated the satisfaction with, and importance of, each service of the human-care robot.
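
The quantitative metrics above can be derived from the evaluator's per-trial annotations as in the minimal sketch below; the log format is hypothetical, not the actual evaluation tool's schema.

```python
from collections import defaultdict

# each entry is one annotated trial: which service ran and how the evaluator judged it
trial_log = [
    {"service": "S01", "result": "correct"},
    {"service": "S01", "result": "missed"},
    {"service": "S06", "result": "correct"},
    {"service": "S06", "result": "incorrect"},
]

counts = defaultdict(lambda: {"correct": 0, "total": 0})
for trial in trial_log:
    counts[trial["service"]]["total"] += 1
    if trial["result"] == "correct":
        counts[trial["service"]]["correct"] += 1

success_rate = {svc: c["correct"] / c["total"] for svc, c in counts.items()}
frequency_of_use = {svc: c["total"] for svc, c in counts.items()}
print(success_rate, frequency_of_use)
```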

Fig. 29 Service evaluation tool. ①: robot view, ②: service success rate, ③: performance of each component, ④: test case for each service, ⑤: log

7.1.2 Service Performance

The service stability of the human-care robot was evaluated based on its service success rate. The results for the 10 services are shown in Fig. 30. Except for S06, the request-based services showed high success rates of 90% or more. In contrast, the success rates of S01 and S02, which are always-on proactive services, were low. More than 95% of the failures of the S01 and S02 services were due to errors in behavior detection and environmental sound recognition. S06, which showed a low success rate despite being a request-based service, failed mainly because of image recognition errors. Overall, the service success rate was determined by the accuracy of image and sound recognition in the real environment.

Fig. 30 Service success rate in testbed environment (a) and in private home environment (b)

Compared with the testbed environment, the success rate of S02 in the private home environments was significantly higher. This is because we collected and trained on data tailored to the home environment of each user, which improved sound detection performance. For activity detection, we also collected and trained on data reflecting the individual behavior patterns of each user; however, we did not observe any performance improvement. We asked participants to behave naturally according to their personal lifestyle patterns. It appears that pose changes caused by users acting in various locations, rather than designated places, made behavior detection and recognition difficult. In addition, changes in lighting from external natural light, depending on the time of day and weather, may have been a factor.

Fig. 31 Changes in service success rate in private home environment

For Participant A in Fig. 31, the success rates of the services steadily increased, indicating that as usage increased, the user became more familiar with the robot's functionalities and handled it better. In contrast, for Participant B, the service success rate fluctuated widely, which we interpret as being related to the amount of natural light varying with the time of day at which the experiment was performed. Because the experiment was carried out in the participant’s residence, alternately in the morning and afternoon, it was not possible to maintain consistent lighting conditions. This demonstrates that lighting conditions affected the robot’s navigation and recognition performance, which in turn affected the service success rate. The success rates of proactive services were affected more than those of user-requested services.

7.1.3 User Satisfaction

After using the human-care robot, a survey was conducted on the participants' satisfaction with each service. As shown in Fig. 32, except for S01 and S02, most services showed high user satisfaction in the 4-point range, which correlated strongly with the service success rates discussed in the previous section. In other words, participants showed high satisfaction with services that worked reliably.

Fig. 32 User satisfaction in testbed environment

Fig. 33 Satisfaction with service (left) and importance (right)

Fig. 34 Service success rate and user satisfaction for each service in private home environment

S01 and S02 showed large deviations, which may be attributed to service success or failure influencing user satisfaction. “Not recognizing my behavior” accounted for 86% of the reasons for dissatisfaction with the S01 service, and the rest were “recognizing it too late.” Most of the reasons for dissatisfaction with the S02 service were “not recognizing environmental sounds.” Some participants said they could not trust the information provided by the robot.

At the end of the test, each participant was asked to choose the three services with the highest satisfaction and importance. The services that the participants felt satisfied with differed from the services they thought were important. They were most satisfied with the exercise assistant service (18.3%) and the event reminder service (17.5%). In contrast, they selected the service for requesting help in case of an emergency at home as the most needed service (21.7%). The importance of the event reminder and belongings finding services was 19.2% and 13.3%, respectively (see Fig. 33).

Overall, the participants were satisfied with the human-care robot service, which addressed difficulties caused by aging. For some services, they answered that the services were more suitable for lonely elderly people or those living alone, which indicates that the need for the related functions was relatively low given that most of the elderly who participated in the survey lived with their families (average household size of 2.1).

Figure 34 shows the success rate and service satisfaction for each service in the private home environments. The service success rates were similar, except for S01, S04, and S06, and services with high success rates showed high user satisfaction. Overall, Participant B showed higher user satisfaction than Participant A, but with significant variation across services.

7.2 Survey Results

7.2.1 Post-experience Survey Results

After using the human-care robot, a post-experience survey was conducted to examine changes in the perception and reliability of the robot. Table 8 lists the perception and reliability of the human-care robot. Each item was measured on a 5-point Likert scale.

The participants had a very positive perception of the robot, answering “very much” with a score of 4.7 to the statement “Robots are necessary because they can perform tasks that are too difficult or dangerous for humans.” However, for the statement “Robots will take away people’s jobs,” the average score was also 4.7, close to “very much.” This indicates that the participants held both positive and negative perceptions of robots.

The participants were positive about the image and reliability of the robot because they had a high level of familiarity with it. They gave 4.5 points (at the level of “very yes”) for being attracted to talking with robots, 4.5 points for trusting conversations with robots, and 4.3 points for believing they could build lasting friendships with them.

Table 8 Perception and reliability of robots
Fig. 35 A conceptual model and hypotheses

The technology acceptance model explains the factors that influence user adoption of a technology. We analyzed the factors that influence robot purchasing through a post-survey. Regarding the reliability of the survey, we obtained a Cronbach’s alpha of 0.88, which satisfies internal consistency. As shown in Fig. 35, perceived usefulness and ease of use had a positive impact on the attitude toward using and the intention to use the robot. The attitude toward using positively influenced the intention to use, and the intention to use positively influenced purchase intention. The pleasure motive had no effect, whereas the health motive affected both the intention to use and the intention to purchase. Older adults were more concerned with health than with pleasure or enjoyment, which may have positively influenced their intention to use and purchase the robot.
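
For reference, Cronbach's alpha can be computed as in the minimal sketch below; the response matrix here is made up for illustration and does not reproduce the survey data behind the 0.88 value.

```python
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """responses: (n_respondents, n_items) matrix of Likert scores."""
    k = responses.shape[1]                              # number of survey items
    item_variances = responses.var(axis=0, ddof=1)      # variance of each item
    total_variance = responses.sum(axis=1).var(ddof=1)  # variance of the total score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

responses = np.array([[4, 5, 4, 4], [3, 4, 3, 4], [5, 5, 4, 5], [4, 4, 4, 4], [2, 3, 3, 2]])
print(round(cronbach_alpha(responses), 2))
```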

7.2.2 Follow-Up Survey Results

We conducted a follow-up survey using five scales (Loneliness, Depression, Social Relationships, Subjective Health Status, and Life Satisfaction). Figure 36 shows the changes in the quality of life of two participants reported on the follow-up survey.

Fig. 36 Comparison of changes in the quality of life of the two participants reported in the follow-up survey

The DeJong Gierveld Loneliness Scale [65] consists of three negative and three positive questions, scored from zero to six (zero: seldom lonely; six: most lonely). In general, elderly people living alone show a higher level of loneliness than those who live with their families. Loneliness refers to an emotionally unpleasant experience that occurs when an individual feels that their social relationships are insufficient compared with what they want [66]. The loneliness of Participant A was very low (two points and one point) at the beginning but rose to four and five points later. The loneliness of Participant B started at the highest level of six points and remained at that level without significant change, even after using the robot.

We used Hoyl's GDS-5 [67], which reduces the 15-item Geriatric Depression Scale (GDS) to 5 items for the elderly; a GDS-5 score of 2 or higher is considered to indicate depressive symptoms. Depression can lead to dementia in the elderly and can develop into various physical diseases and declines in physical function (loss of appetite, insomnia, etc.) with aging; thus, it cannot be regarded as merely a psychological phenomenon. The depression index of Participant A was three points in the pre-survey, indicating symptoms of depression, but the participant did not exhibit any symptoms of depression after using the robot. In contrast, the depression index of Participant B was as low as one point in the pre-survey but fluctuated considerably during the experiment. As a poet, this participant seems to have responded candidly to emotional changes.

We used the Lubben Social Network Scale [68] to assess social relationships. It measures the extent to which individuals can share personal matters with their relatives or friends and the frequency of their communication through phone calls or meetings; the scale ranges from 0 to 15 points. The social relationships of Participant A showed a repeating pattern of increase and decrease; it seems that the participant refrained from visiting or meeting friends and family during the experiment. In contrast, the social relationships of Participant B showed a gradual rise as she told her family and friends about her experiences of using the human-care robot.

Subjective health status was rated on a five-point scale (one point: very unhealthy; five points: very healthy). It is a holistic evaluation that considers one's own perception of physical and mental well-being and is an important factor in assessing the mortality rate and quality of life of the elderly [69]. While using the human-care robot, there was no significant change in the subjective health status of either Participant A or B. Their ratings appear to reflect fears about future health caused by aging rather than their current conditions, as neither had major health problems.

Life satisfaction was measured on a five-point scale (one point: very dissatisfied; five points: very satisfied). The life satisfaction of Participant A alternated between normal (three points) and slightly satisfied (four points) at the beginning but converged to normal in the second half. Participant B showed larger changes in life satisfaction during the experiment than Participant A. Although the follow-up results on the five scales showed some differences between participants, there was no significant change in life satisfaction before and after using the robot.

7.2.3 User Feedback

The 40 elderly people who participated in the testbed experiment described the strengths and weaknesses of the human-care robot services in a post-experience survey. Most participants said that the human-care robot service would be more useful for elderly people living alone than for those living with their families. The positive points presented by the testbed participants are as follows: 1. The robot relieves the loneliness of the elderly through its conversation function. 2. The robot provides information that is helpful in daily life. 3. Medication guidance and reminder services provide convenience in daily life. Among the proposed human-care robot services, high satisfaction was shown not only for cognitive support but also for emotional support through conversation.

On the other hand, the disadvantages that participants pointed out are as follows: 1. It was difficult to communicate with the robot because of frequent misrecognition and slow responses. 2. The robot's gestures did not feel soft; they felt stiff and emotionless. 3. The participants felt a sense of unfamiliarity and distance because of their preconceptions about robots. Participants expressed dissatisfaction with the recognition performance and response time. Since the recognition performance of the major modules affects service satisfaction, efforts to improve this performance are needed.

In the home environment experiment targeting the elderly living alone, changes in quality of life, such as depression and loneliness, were investigated through follow-up surveys. We also conducted a post-experiment survey, including interviews, to obtain opinions. As the participants became more familiar with the robot and its capabilities, they expressed positive feelings regarding its potential to provide convenience and assistance in daily life. Here are a few comments: “I feel closeness,” “It handles all functions well,” “The future is bright as technology continues to develop,” and “Robots help to interact with people.” However, they were concerned that the robot was too large for a home environment and raised privacy issues regarding the camera. Participant B also noted loud noise, electricity costs, and so forth: “It is inconvenient that equipment and devices for robot systems occupy a lot of space in the house.”, “When charging the robot, there is a concern about the risk of electromagnetic waves and the cost of electricity.”

Most participants thought that both the proposed cognitive and emotional support services were necessary services for the elderly. However, frequent misrecognition and slow responses did not meet the expectations of users who wanted natural communication with the robot. As a result, they had a sense of distance from the robot.

8 Discussion

The main outcome of this study is compelling empirical evidence that human-care robots can be effectively utilized by older adults in real-world settings. Unlike prior studies, we conducted a comprehensive evaluation of the utility and stability of the services provided by human-care robots through both quantitative and qualitative measures.

Most modules in the perception stage were related to recognition and heavily influenced by the input images. The lighting conditions were often uneven due to natural light entering from the outside in the real world, resulting in dark or backlit situations. In such cases, recognition performance deteriorated, leading to lower user satisfaction. There was a significant difference in object recognition performance between the experiments of Participants A and B, and the main cause of this difference was the influence of external natural light. To ensure service stability in the real world, it is necessary to develop recognition modules that are robust to the lighting conditions.

The success rate of the proactive services was generally lower than that of the request-based services. This affected user satisfaction: satisfaction with the proactive services was lower than with the request-based services. Participants in the testbed experiment expressed satisfaction with the exercise assistant service and the event reminder service, whereas they regarded the emergency call service as the most important.

Most participants thought that both cognitive and emotional support services provided through the human-care robot were necessary for the elderly. However, most participants who lived with their families in the apartment testbed thought that the services would be more useful for elderly people living alone. The participants were generally satisfied with the human-care robot service, which showed a relatively stable success rate. However, the frequent misrecognition and slow response times did not meet the expectations of users who preferred to communicate naturally with the robot.

9 Conclusion

In this study, we aimed to develop a human-care robot service for the elderly and apply it in the real world to verify its usefulness and stability. We conducted experiments in an apartment testbed and in private home environments, and collected system performance records, observations, and interviews for quantitative and qualitative analyses. We evaluated the performance and stability of the proposed human-care robot service using success rate metrics. Additionally, we analyzed changes in the perception of the robot through a post-survey and changes in life satisfaction through a follow-up survey to verify the usefulness of the service. These results provide valuable insights for delivering continuous and reliable care services to the elderly and are expected to contribute to future improvements in robotic elderly-care services.