1 Introduction

Future scenarios of social robots envision a personable system that is flexible and adapts itself to the user’s preferences [20]. Typical applications of social robots include, for example, social assistance for physical or cognitive exercising [12, 26]. Different users or target groups can have different preferences or requirements for the system interaction. However, anticipating all user types and pre-programming the system for their needs is an obstacle to deploying robots in domestic settings and engaging users beyond an exploration phase. In these situations, robots need capabilities to adjust to different user profiles (e.g., match a user’s personality [1]). In web-based applications, this requirement is already widely met (e.g., recommender systems on Amazon or Netflix). However, it remains a challenge for social robots, which have no access to an extensive user database and thus cannot utilize techniques like collaborative filtering. Social robots therefore face the cold start problem, which requires the system to gather initial data to personalize the interaction experience. Nevertheless, deploying an adaptive system still comes with some known difficulties:

First, querying the user for information in real-time HRI might be more cost-intensive than in web-based applications. Cakmak et al. [6] showed that a constant stream of questions in a Learning by Demonstration (LbD) task annoys users.

Second, autonomous personalization decisions by robots could have adverse effects on the satisfaction with the HRI experience. When a robot controls the personalization process, this could lead to technology disuse because of incorrectly learned user profiles or because users prefer to be in control [5]. Thus, it is essential to consider whether people prefer an interface to adjust the system behavior or prefer to let the system control the adaptation process.

These different personalization strategies would influence the system’s autonomy and affect the interaction experience. Based on the theory of anthropomorphization by Epley et al. [10], an autonomous adaptive system could create unexpected user experiences. These unanticipated experiences can increase a user’s perceived human-likeness of the robot and enhance the credibility of, and trust in, the system. In contrast, a user-controlled system would increase the match between the user’s expectations and the robot’s behavior and therefore reduce anthropomorphic effects. The investigation of these two aspects is the core of this work. We try to find an answer to the question:

What effects do different types of personalization methods have on the user’s perception of the system, trust in the alliance, and motivation to interact with the system?

To answer this question, we investigate the effects of different personalization behaviors of the system and present a study in the area of robotic exercising companions that compares the impact of an adaptive robot versus an adaptable robot as an exercising partner for physical activities.

Previous work on robots for exercising and coaching has investigated the motivational effects of using such coaching systems [12, 16, 44]. However, most studies used only one type of exercise (e.g., arm or plank exercises). In this work, we present a system that offers a range of activities to the user, which we use to investigate a suitable preference matching framework for providing personalized interaction. In the adaptive condition, the robot proposes different activities to the user and tries to learn the user’s exercise ranking based on comparative preference feedback. In the adaptable condition, the robot is directly controlled by the user, and the user can decide which exercise she/he wants to do with the robot.

Our work contributes to the community by showing that the k-armed Dueling Bandit Problem is a suitable approach for online preference learning in HRI scenarios [55]. Moreover, we provide evidence that the alliance with an adaptive robotic exercising partner is perceived as more trustworthy and that this effect is mediated by the higher competence users ascribe to the system.

The manuscript is organized as follows. The difference between adaptive and adaptable robots will be explained in Sect. 2 along with the concepts of automation and alliance, which might be important variables when looking at the adaptivity of a system. Section 3 introduces the system design and Sect. 4 explains the study design to test the effects of a robot’s different personalization mechanisms. Section 5 presents the results of the study, which are discussed in Sect. 6. Finally, Sect. 7 gives a conclusion of this work.

2 Adaptation, Automation and Alliance

This section gives a brief introduction to the concepts of adaptation, automation and alliance. Discussing these topics is challenging because they are used differently across disciplines (e.g., philosophy, psychology, economics, biology). Therefore, the following explanations cannot be exhaustive and focus mainly on a computer science and psychology perspective.

2.1 Adaptation: Adaptivity Versus Adaptability

In computer science, adaptation refers to the information-based process of adjusting the behavior of an interactive system to meet the needs of individual users [47]. Even though computer software and robots go through many design cycles, it is hard to anticipate the requirements of every possible user. The goal of adaptive processes is to minimize the discrepancy between the user’s needs and the system’s behavior after deployment. This adaptive process (see Fig. 1) can either be initiated automatically by the system, in which case the system is adaptive (e.g., the system chooses exercises for the users by itself), or be carried out by the users themselves, in which case the system is adaptable (e.g., the users choose the exercises by themselves). Adaptation can take into account different user profiles (e.g., personality), different times (e.g., morning/evening, days of the week, summer/winter), or other user characteristics (e.g., mood, experience).

Fig. 1

The adaptable system (dashed lines) comes with a set of behaviors it can offer to the user. The user selects her/his preferred behavior, and the system performs the action. The adaptive system (solid lines) explores which behavior is preferred by the user by querying preference feedback. The system updates its user model and can exploit the obtained information over time. Alternatively, the system can also make stereotypical predictions based on the user’s personality, as well as the valence or experience of the user. Used images in this figure were not modified. https://pngimg.com/download/45265 (Robot) by https://pngimg.com is licensed under CC0 1.0 (https://creativecommons.org/publicdomain/zero/1.0/), no changes made. https://freesvg.org/tai-chiwoman-silhouette Silhouette of a woman practicing Tai Chi by https://FreeSVG.org is licensed under CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/), no changes made

Previous work in HRI has investigated the implementation of adaptive processes to match a user’s personality, generate empathic behavior, or adapt therapy sessions, interaction distance, linguistic style, or puzzle skills [18, 25, 26, 31, 40, 49, 50]. The results of these works show an improvement in task performance based on personalized lessons or user personality matching [18, 26, 49]. Additionally, providing adaptive empathic feedback also improved user engagement [25]. Other works present evidence for the feasibility of certain adaptation algorithms (e.g., [31]). Overall, there is evidence that adaptive personalization is a crucial capability for robots. However, many open issues remain for future work. For example, to which objectives should the robot adapt? How can the system adapt when the objectives are not apparent? Should the robot communicate the adaptation process and thus make it transparent? Finally, who should be in control of the adaptation process?

Although some works have compared adaptive robots with experimental baseline control conditions (e.g., [25, 26]), to the best of our knowledge, no investigation has looked at the effects of robot-initiated personalization versus user-initiated personalization. It is reasonable to argue that users could control and adjust the robot’s behavior to their preferences themselves. Leyzberg et al. [26], for example, investigated the effects of a robot that gives personalized lessons to the user. The robot selected these lessons based on a decision algorithm. However, the user could also have requested a specific lesson.

Both strategies might match a user’s preference and increase interaction satisfaction, but the underlying difference in decision making is fundamental. One can interpret the approaches as either more transparent or more competent. Generally, the question of whether to build an adaptive or adaptable system raises the concern of who is in control and how this affects the interaction experience. The issue of who is in control is, in general, associated with the level of automation (LoA) of the system.

2.2 Level of Automation

An autonomous agent acts based on the information it receives from its sensors, knows which state it is in, and makes decisions accordingly (see [42, ch. 1]). The LoA of a system describes the extent to which an agent can act and react on its own, based on its own information, without any external control instance. Thus, the agent’s LoA varies depending on the task, the agent’s environment, and whether a human can interfere in the agent’s control loop. This distinction becomes essential where robots carry out delicate tasks (e.g., lethal autonomous weapons). Various frameworks can be used to identify the LoA of a system (e.g., [9, 48]). Most recently, Beer et al. [2] proposed a taxonomy to classify the level of robot autonomy for HRI.

Regardless of the exact LoA, systems can be categorized as human-in-the-loop systems, where the human has to approve a control decision by the autonomous agent; human-on-the-loop systems, where the human is informed about the decision but the agent carries it out unless the human operator interferes; and human-off-the-loop systems, where a human cannot interfere with the agent’s decisions.Footnote 1

The relevance of considering the different LoA is apparent in sensitive domains such as military operations or medical applications (e.g., surgery or medicine dispensers), but (as yet) less apparent in socially assistive fields (e.g., rehabilitation or teaching). Nevertheless, social situations will also require an understanding of whether a social robot should act autonomously, semi-autonomously, or under full human control. For the interaction experience, it will be crucial to understand the effects of different LoA. In the course of this work, we are interested in whether a robot exercising companion is in control of choosing the exercises or whether the users can decide which tasks they want to do. The question of which LoA is appropriate and the effects it will have on the interaction experience is related to the alliance and trust between the users and the socially assistive robot (SAR) [2].

2.3 Alliance

The impact of trust in the alliance between a user and a robot has recently been investigated in use cases in which a robot shows faulty behavior, gives explanations for actions, or varies its degree of expressivity and vulnerability [29, 41, 43, 51]. Besides these aspects, humans’ trust in a robot’s capabilities also depends, for example, on a high LoA and on whether the system makes autonomous decisions [13].

Trust is defined in Human–Computer Interaction (HCI) as “the extent to which a user is confident in and willing to act by, the recommendations, actions, and decisions of an artificially intelligent decision aid” [30, p. 25]. As Madsen and Gregor [28] state, this definition “encompasses both the user’s confidence in the system and their willingness to act on the system’s decision and advice” [28, p. 1]. Thus, it already incorporates a notion of user trust regarding the willingness to take a system’s recommendations into account.

To understand how trust influences HRI, Hancock et al. [17] reviewed different applications in which trust is an essential factor when robots and humans work together in a team. They state that it is a crucial aspect in industry, space, or warfare applications. However, due to the rise of SARs for rehabilitative, therapeutic or educational tasks, understanding trust in social tasks is also an essential research topic [12, 22]. Hancock et al. [17] found several factors influencing trust in HRI, which are related to the human, the robot, and the environment. The robot-related factors were the most important ones in their meta-analysis. They found that essential factors influencing the associated trust are the human’s perception of the system’s behavior, adaptability, competence, and performance [17]. Considering how different types of personalization change the LoA and how this might alter the perceived trust, we ask how manipulating the LoA (for example, whether the system adapts itself or can be adapted) influences the perceived competence of, and confidence in, the system.

Rau et al. [37] investigated the influence of a social robot’s LoA on the user’s trust in the HRI based on the robot’s decision making. They manipulated the robot’s LoA by either letting the human make the team decision while the robot could suggest a different one (low autonomy), or letting the robot make the team decision while the human could accept or reject it (high autonomy). They hypothesized that a highly autonomous robot would increase the associated trust. Their results show the influence of an autonomous robot on human decision making, but in contrast to the hypothesis, people reported that they trusted the less autonomous robot more. However, this result was only marginally significant (p = .084), and further investigation is needed.

Other works investigated how perceived anthropomorphization influences perceived trust in autonomous vehicles [52]. Waytz et al. [52] found that the degree of anthropomorphization is associated with higher confidence in the vehicle’s competence. This indicates that the perceived level of skill might also influence the related trust. However, to the best of our knowledge, no other works besides that of Rau et al. [37] have investigated the influence of a social robot’s LoA, based on its decision-making capabilities, on the perceived trust in the HRI and on perceived competence.

2.4 Objectives and Contribution

This paper has two major objectives. The first objective is to test whether a preference learning approach, which is novel for the HRI community, is suitable for online interaction, and to test its feasibility in a realistic use case study in comparison to a user-controlled adjustment method.

The second objective is to investigate how the different personalization methods influence the user’s evaluation of the system in terms of its social attributes, the trust in the alliance with the system, and the motivation to continue interacting with it. We discuss the hypotheses related to the latter objective in the following.

2.5 Hypotheses

Based on the reviewed literature, we found that there is a substantial lack of understanding of the effects of adaptive social robots for future HRI scenarios. It is still uncertain which is the best way to personalize the robot’s behavior (i.e., whether it should be under the control of the user or of the robot). Additionally, it remains unclear how different LoA change the user’s trust in the HRI and the perceived competence of the system in social scenarios, as well as how these variables are related to each other. To find empirical evidence that can help answer these questions, we derived four hypotheses from previous works.

Due to the robot’s initiative and control of the interaction, people will be likely to associate the robot with higher competence [17]. Since users do not have to control the robot on their own, the robot could create the impression of proactively deciding on its own, which creates unexpected experiences for the user and elicits anthropomorphic reasoning about the agent [10, pp. 873–874]. Thus, based on the theory of Epley et al. [10], we hypothesize that:

Hypothesis 1

Users perceive an adaptive robot as more competent than an adaptable robot (H\(_{0}\): adaptive and adaptable robots are perceived as equally competent).

We hypothesize that this different level of perceived competence is associated with the perceived trust or relationship with the agent.

Because the research by Rau et al. [37] did not show a significant effect of the LoA on perceived trust in HRI, we continue to pursue the hypothesis that the LoA will affect the associated trust. Possibly, the previous research did not find an effect on trust because the robot was only a marginal partner that was not important for the task. In our work, in contrast, the robot is not just a team member but an instructor and exercising partner. Therefore, trust in the alliance will be an essential feature of the relationship between the user and the robot.

Hypothesis 2

The trust in the alliance with an adaptive robot is rated higher than with an adaptable robot.

Since we hypothesize that participants in the two conditions will perceive both the competence of and the trust in the robot differently, it is plausible to argue that perceived competence and trust are correlated. Based on the review of trust in HRI, one can argue that users are more likely to trust a system that they perceive as competent [17]. Thus, we hypothesize that:

Hypothesis 3

The associated trust in HRI between the conditions is significantly mediated by the perceived competence of the system.

Fig. 2

A simplified system and interaction overview for the adaptive robot condition. Used images in this figure were not modified. Wizard Hat (https://en.wikipedia.org/wiki/File:Wizard_Hat.jpg) by Rufus van helsing (https://en.wikipedia.org/wiki/User:Rufus_van_helsing) is licensed under CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0/), no changes made

Table 1 Names of the exercises used for the presented study. The exercises were selected to represent a variety of exercises targeting possible preferences

Additionally, low trust is often associated with the misuse or disuse of an autonomous robot [2]. Previous works hypothesized that if people do not trust a robot, they stop using it. This trust in the competence of an interaction partner to achieve the desired goal is also highly critical between a client and a therapist [19]. Perceived higher competence increases the trust in the relationship to achieve a common goal. Thus, if people do not feel that the therapist has the competence to accomplish a common goal, they do not trust the therapist, do not build rapport, and are more likely to stop the therapy or intervention.

Thus, we draw our final hypothesis for this work:

Hypothesis 4

An adaptive robot increases the participant’s motivation to engage in a second interaction compared to an adaptable robot.

To investigate these hypotheses, we present in the following a system and study design that incorporates two different adaptation strategies in an exercising scenario.

3 System Design

Figure 2 shows a high-level view of the system and interaction flow. The system consists of several components that communicate in a distributed system. The composition includes a database of different exercises for Nao; a session controller that monitors the user’s exercises and executes the robot’s behavior; a simple computer vision system using a 3D depth sensor to analyze the skeleton of the user; a position controller for the robot; and a preference learning algorithm. The system and decision components are implemented using the framework presented in [46] and are not further detailed in this manuscript.

3.1 Exercise Database

As previously found, exercising preference is unique to each person [38]. Thus, for the aim of this study, we developed a system that provides a variety of different exercises. We chose 25 exercises in total from 5 different categories: strength, stretch, cardio, taichi, and meditation. This set of activities tackles one of the open issues of SARs for exercising tasks: previous work often looked at a single type of exercise, such as arm movements [11, 12, 16]. The approach of using a spectrum of different physical activities might show that people can perform various exercises together with a robot.

Table 1 presents the list of chosen exercises. They have been selected based on several criteria: (a) the possibility to animate and execute them on Nao (i.e., Nao cannot jump), (b) the difficulty for users to perform them (i.e., exercises should not be too challenging for the participants), and (c) the exercises should target the full embodiment of the robot (i.e., lying down, balancing, standing).

Moreover, we limited the set to five categories and five exercises per category due to two considerations. First, participants in our study are asked to do at least 14 exercises. Thus, we chose five categories to make sure that the user is presented with every combination of exercising categories at least once, which is important for the preference learning approach that relies on comparisons between exercising categories.Footnote 2 Second, we chose five exercises per category so that the users eventually try out other categories once the exercises start repeating.

All of them have been animated on Nao using Choregraphe [15, 34]. Figure 3 shows an example of a user exercising together with the robot.

Fig. 3

Robot and user performing the Taiji drill Parting kick together

3.2 Preference Learning Framework

Preference learning is a subfield of machine learning that aims to learn predictive models from previously observed information (i.e., preference information) [14]. In supervised learning, a data set of labeled items with preference information is used to predict preferences for new items or all the other items from a data set. In general, the task for preference learning is concerned with the problem of learning to rank.

There are many different approaches to preference learning. It can be solved using supervised learning, unsupervised learning, and also reinforcement learning. Since there exists no particular data set we could use for supervised or unsupervised learning, it is challenging to build a model that can predict preferences from previously observed information. Therefore, we are focusing on how the system can learn an initial preference relation for a given item set without any prior knowledge (i.e., the cold start problem). Thus, we are trying to solve the preference learning problem using online methods for the Multi-Armed Bandit problem or, more precisely, Dueling Bandit algorithms [56].

The dueling bandit problem consists of \(K\) (\(K\ge 2\)) arms, where at each time step \(t>0\) a pair of arms (\(\alpha _{t}^{(1)},\alpha _{t}^{(2)}\)) is drawn and presented to a user. A noisy comparison result \(w_{t}\) is obtained, where \(w_{t}=1\) if the user prefers \(\alpha _{t}^{(1)}\) over \(\alpha _{t}^{(2)}\), and \(w_{t}=2\) otherwise. The distribution of the outcomes is represented by a preference matrix \(P=[p_{ij}]_{K\times K}\), where \(p_{ij}\) is the probability that a user prefers arm \(i\) over arm \(j\) (i.e., \(p_{ij} = P\{i\succ j\},\ i,j = 1,2,\ldots ,K\)).

The goal of the preference learning task is, given a set of different actions (e.g., different sport categories), to find the user’s preference order over these categories by presenting the user with two actions \(\alpha _{i}\) and \(\alpha _{j}\) and updating the user preferences based on whether \(\alpha _i \succ \alpha _j\) or \(\alpha _i \prec \alpha _j\).

Thus, the challenge is to find the user’s preferences by running an algorithm that balances exploration (gaining new information) and exploitation (utilizing the obtained information). In this work, we use the Double Thompson Sampling (DTS) algorithm presented in [55]. Since there are several algorithms for solving the dueling bandit problem, we briefly explain why we chose this particular one.

This decision is mainly driven by two considerations. The state-of-the-art algorithms at the time of this study were DTS, RMED, and its successor ECW-RMED [23, 24]. All of them perform reasonably well regarding their asymptotic behavior. However, we are interested in the initial phase rather than in the long-term behavior of these algorithms. Looking at the first steps of these algorithms reveals a significant difference between them that likely influences the HRI experience: RMED and ECW-RMED both have an initial phase in which all possible pairs are repeatedly drawn for some time (see Algorithm 1 in [23, 24]). From an algorithmic perspective this is reasonable, but from the viewpoint of the interaction, it would lead to systematic comparisons that could result in boredom or even annoyance when the interaction partner seemingly interrogates the user about her/his preferences. Thus, we assume that the DTS algorithm is more suitable for HRI (especially for the initial contact between the trainee and the robot coach) because it does not rely on a systematic comparison of all possible pairs.
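To make the interaction loop concrete, the following is a minimal sketch of a Thompson-sampling style duel selection over pairwise win counts. It is not the exact DTS implementation from [55]; in particular, the confidence-bound pruning of candidate arms is omitted, and all names and parameters are illustrative.

```python
import numpy as np

class DuelingBanditLearner:
    """Simplified Thompson-sampling duel selection over K exercise categories.

    Pairwise win counts are kept in `wins`, and plausible preference
    probabilities are sampled from Beta posteriors. The full DTS algorithm
    in [55] additionally prunes candidate arms with confidence bounds."""

    def __init__(self, k, seed=0):
        self.k = k
        self.wins = np.zeros((k, k))   # wins[i, j]: times category i was preferred over j
        self.rng = np.random.default_rng(seed)

    def select_duel(self):
        # Sample a plausible preference matrix theta[i, j] ~ Beta(wins[i, j] + 1, wins[j, i] + 1).
        theta = self.rng.beta(self.wins + 1, self.wins.T + 1)
        np.fill_diagonal(theta, 0.5)
        copeland = (theta > 0.5).sum(axis=1)     # sampled Copeland scores
        first = int(np.argmax(copeland))         # first arm: sampled Copeland winner
        # Second arm: strongest challenger of the first arm under a fresh sample.
        challenger = self.rng.beta(self.wins[:, first] + 1, self.wins[first, :] + 1)
        challenger[first] = -np.inf              # do not duel an arm against itself in this sketch
        return first, int(np.argmax(challenger))

    def update(self, winner, loser):
        self.wins[winner, loser] += 1            # record one preference statement

    def ranking(self):
        # Copeland-style ranking from the posterior mean preference matrix.
        p = (self.wins + 1) / (self.wins + self.wins.T + 2)
        return np.argsort(-(p > 0.5).sum(axis=1))
```

In the study, each call to select_duel would correspond to one round of two consecutively performed exercises, and update would be fed with the user’s verbal preference statement.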

In previous research, we verified the applicability of this preference learning approach in an ad-hoc preference learning scenario and examined the influence of the robot’s embodiment on the satisfaction with the preference learning results, finding no such influence [45]. This ensures that the embodiment does not influence the user’s acceptance of the preference ranking, which is an essential prerequisite for this study. Additionally, we validated that the used algorithm performs significantly better than a randomly selected preference ranking.

4 Study Design

We conducted a study with a between-subject design (adaptive robot vs. adaptable robot) where participants were randomly assigned to one of two conditions.

4.1 Conditions

In both conditions, the system waits for a user to be present in the room. Depending on the distance, it asks the participant to come closer. The system introduces itself to the user, explains its behavior and asks whether the user wants to start the exercising program.

Adaptive The robot in the adaptivity condition used the algorithm described in [55]. During the introduction phase, the system explains to the user that it will do different exercises together with the user and will ask for preference feedback on the various exercises. At each time step, the system selects two exercises based on the preference learning algorithm and executes them consecutively with the user. Afterward, the system queries the user for a preference statement. The robot acknowledges the decision by repeating the chosen exercise. The preference learning algorithm updates the user’s preference database and selects the next exercises based on the current user preferences. This behavior continues for 14 exercises (or seven iterations of the algorithm). As an additional measurement of motivation, the system asks after the 14 exercises whether the user wants to continue with two more exercises or quit the experiment. Only one participant did not want to do two more exercises. After the two additional exercises, the robot finishes the interaction. It states the user’s learned preferences, based on the user’s feedback, in ascending order and thanks the user for participating. We limited the voluntary additional exercises to two due to battery concerns and overheating joints.

Adaptable The robot in the adaptability condition did not use any preference learning algorithm and did not autonomously select the next exercises. In the introduction phase, the system explains that it offers different exercises they can do together. Before each exercise, the robot verbally lists the possible exercise categories in a randomized order, and the user can choose the exercise category she or he wants to experience. Thus, the user was in control of the exercise session and could choose the exercise category she or he prefers. In this condition as well, the human and the robot did 14 exercises together, and the robot asked whether the user wanted to do two additional exercises. Here, too, only one participant did not want to do two more exercises.
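As a rough illustration of how the two conditions differ in their control flow, the sketch below contrasts a user-driven and a learner-driven session loop, assuming a learner with the select_duel/update interface sketched in Sect. 3.2. The helper functions and the way a concrete exercise is picked within a category are placeholders, not the actual implementation.

```python
import random

def run_adaptable_session(categories, exercises, ask_user_category, perform, n_exercises=14):
    """Adaptable condition: the user picks each category, the system only executes."""
    for _ in range(n_exercises):
        order = random.sample(categories, len(categories))  # list categories in random order
        chosen = ask_user_category(order)                   # user states her/his choice verbally
        perform(random.choice(exercises[chosen]))

def run_adaptive_session(learner, categories, exercises, ask_preference, perform, n_duels=7):
    """Adaptive condition: the learner proposes two categories per round, both are
    performed consecutively, and the stated preference updates the user model."""
    for _ in range(n_duels):
        i, j = learner.select_duel()
        perform(random.choice(exercises[categories[i]]))
        perform(random.choice(exercises[categories[j]]))
        if ask_preference(categories[i], categories[j]):    # True if the first category is preferred
            learner.update(i, j)
        else:
            learner.update(j, i)
```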

4.2 System and Interaction Flow

The primary interaction flow for both conditions during the exercises is as follows: based on the current user’s preference database, the algorithm selects two exercises (adaptable condition: the user selects the next exercise category); then the session manager runs these exercises sequentially. During the exercises, the session manager receives user skeleton information from a depth sensor and classifies whether the user is doing the exercises by comparing the joint angles with the joint angle configuration of the specific exercise. We divided the exercises into crucial chunks (e.g., going up and down for squats and tracking the bend of the knees). The robot starts the movement and waits for the user to follow (i.e., for squats, the robot goes into the squat position). We only check whether there is a change in the joint angles, not whether it is the correct exercise. In case the user is not doing the exercise, the system runs into a timeout after one second and continues with the next step of the exercise. At the start of a new task, the user usually first watches how to do the exercise and then joins the robot in doing it. No participant refused to do an exercise or performed the wrong one. For exercises on the ground, the skeleton tracking does not work reliably, and the system simply follows the exercise scripts. However, the robot’s instructions only start when the participant is on the ground. The interested reader can find more details on the used system, classification, and exercise pattern modeling in our previous publication [46].
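The exercise-following check described above can be pictured as a simple polling loop. The sketch below is hypothetical: it assumes a get_joint_angles() callback from the skeleton tracker and an illustrative change threshold; only the one-second timeout is taken from the description above.

```python
import time

def wait_for_user_motion(get_joint_angles, threshold_deg=10.0, timeout_s=1.0, poll_s=0.05):
    """Wait until the tracked joint angles change by more than `threshold_deg`
    (the user has started to follow the robot's movement), or give up after
    `timeout_s` seconds so the session can continue with the next step."""
    reference = get_joint_angles()               # joint angles at the start of the movement chunk
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        current = get_joint_angles()
        if max(abs(c - r) for c, r in zip(current, reference)) > threshold_deg:
            return True                          # the user is moving along with the robot
        time.sleep(poll_s)
    return False                                 # timeout: proceed with the exercise script
```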

4.3 Wizard of Oz Strategy

Initially, we used the internal speech recognition of Nao. However, prototype experiments showed that its speech recognition capabilities were below an acceptable recognition rate; therefore, we manually inserted the user’s speech input in a Wizard of Oz (WoZ) style. The wizard listens to the user’s feedback on the exercise preference or selected exercise category via a microphone installed in the experimental room and forwards the user’s response to the session manager as simulated speech recognition results via the middleware.

Additionally, when Nao performs the exercises, it moves away from its initial position. We implemented a simple marker-based localization strategy; however, the time needed for localization and positioning creates an unsatisfying interaction experience. Since this delay is a significant disturbance for the HRI experience, we also implemented a WoZ position controller to manually move the robot to the correct position after each exercise.

This approach ensures that speech recognition and positioning work reliably and do not influence the user’s trust in the system. The general instructions for the wizard were to type in the user’s speech for the exercise selection and to position the robot to face the human. Additionally, in cases where the robot falls during an exercise, the built-in position sensor detects this, and an automatic stand-up behavior is triggered.

4.4 Participants

Due to the cost of this experiment in terms of time (i.e., 2 h per subject) and wear on the robots, we limited our sample to 20 people per group. However, based on prior testing and experiments, we expected a large effect size for our hypothesis-relevant measurements.

The sampled participants (\(N\) = 40; average age \(M\) = 26.02, \(SD\) = 5.48; 13 female and 7 male in the adaptivity condition; 12 female and 8 male in the adaptability condition) were mostly university students recruited via announcements on campus and on social media. The majority of the participants were naive robot users and had no background in computer engineering or programming.

4.5 Procedure

Participants arrived at the lab individually. First, they gave informed consent. Then, the experimenter led the participants to a room where they could change their clothes. Afterwards, the experimenter told the participant to enter the lab and follow the instructions of the system. Until this point, the participants did not know that they would be interacting with a robotic system. We withheld this information in order not to bias the participants or raise false expectations. The participants then entered the lab without the experimenter. The interaction took approximately 50 to 60 min, and the experimenter monitored the experiment from a control room. After the interaction finished, participants answered a questionnaire and took part in a voluntary post-study interview. Finally, they were debriefed and received 8 Euros for their participation. The ethical committee of our university approved the procedure.

4.6 Measurements

We use subjective responses to questionnaires to test our hypotheses and evaluate the performance of the online learned preference ranking using the position error between the learned ranking and the ground truth of the user. The following subsections explain these different measurements.

4.6.1 Hypotheses Related Measurements

In this study, we investigate whether different personalization methods change a user’s subjective perception of the robot, the alliance with it, and the motivation to interact with the system. The following measures were used to find evidence for our hypotheses. We used Cronbach’s \(\alpha \) as a measure of the internal consistency of the scales and excluded scales with values below .5 from our report [8].
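For reference, Cronbach’s α for a scale can be computed from an n_participants × n_items response matrix as follows; this is the standard formulation, not the authors’ analysis script.

```python
import numpy as np

def cronbach_alpha(responses):
    """Cronbach's alpha for an (n_participants x n_items) response matrix."""
    responses = np.asarray(responses, dtype=float)
    k = responses.shape[1]                                   # number of items in the scale
    item_variances = responses.var(axis=0, ddof=1).sum()     # sum of per-item variances
    total_variance = responses.sum(axis=1).var(ddof=1)       # variance of the participants' sum scores
    return (k / (k - 1)) * (1 - item_variances / total_variance)
```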

Negative Attitudes Towards Robots Attitudes towards robots were measured using the Negative Attitudes Towards Robots Scale (NARS), \(\alpha \) = .8, on a five-point Likert scale [33]. Negative attitudes towards robots could be a confounding factor explaining results obtained on the perception of the robot.

Physical Activity Enjoyment Participants rated their physical training enjoyment using the Physical Activity Enjoyment Scale (PAES), \(\alpha \) = .91 [21]. The overall enjoyment score is calculated as the average over all item responses.

System Usability The system’s usability was measured with the System Usability Scale (SUS), \(\alpha \) = .84, with ten items on a 5-point Likert scale [4].

Team Perception We measured the user’s team perception using scales from [32]. These scales measure the general team perception (\(\alpha \) = .38), the openness to suggestions from the team member (\(\alpha \) = .94), and the perceived cooperation (\(\alpha \) = .39). All scales used 5-point Likert scales.

Perception of the Partner Participants were asked to rate their perception of the robot on the Robotic Social Attributes Scale (RoSAS). This scale includes perceived warmth (\(\alpha \) = .85), competence (\(\alpha \) = .77) and discomfort (\(\alpha \) = .76) on 9-point Likert scales [7].

Motivation To have an additional measure of whether people are interested in exercising a second time with the robot, we let the participants opt in to voluntarily exercise with the robot again without monetary compensation. Participants were asked at the end of the questionnaire to enter their email address if they wanted to exercise again in the following week.

Trust in Alliance Finally, we used the Working Alliance Inventory (WAI), \(\alpha \) = .91, a measure commonly used to assess the trust and the belief in a common goal within helping alliances, for example between a therapist, clinician, or coach and a client [19]. This measure has recently been used in HCI and HRI studies for assessing the alliance and trust between a human and a SAR [3, 22]. We adopted it for our use case (e.g., ‘What I did in today’s session gives me a new view on my exercising preferences’, ‘Nao and I have worked together on our common goals for this session’).

Fig. 4

Box plot showing the user ratings for the NEO-FFI personality test

4.6.2 Preference Ranking Measurements

We use two ranking error functions to evaluate the quality of the preference ranking: the position error distance \(D_{PE}\) and the discounted error \(D_{DR}\). Let \(X = \{x_1, \ldots ,x_c\}\) be the set of items to rank, \(r\) the user’s target preference ranking, and \(\hat{r}\) the learned preference ranking. Both \(r\) and \(\hat{r}\) are functions \(X\rightarrow \mathbb {N}\) that return the rank of an item \(x\). The position error is defined as follows

$$\begin{aligned} D_{PE}(r,\hat{r}) = \hat{r}\left( \mathop {\mathrm {arg\,min}}\limits _{x\in X} r(x)\right) -1 \end{aligned}$$
(1)

The idea of this distance measure is that we want the target item (i.e., the highest-ranked item according to \(r\)) to appear as high as possible in the learned preference ranking \(\hat{r}\). Thus, this distance gives the number of items that are wrongly ranked before the target item. The discounted error is defined as follows

$$\begin{aligned} D_{DR}(r,\hat{r}) = \sum _{i=1}^{c} w_i \cdot d_{x_i}(\hat{r},r) \end{aligned}$$
(2)

where \(w_i=\frac{1}{\log (r(x_i)+1)}\). This distance measure gives higher-ranked items from \(r\) a higher weight for the distance error \(d_{x_i}\) between the rankings, where \(d_{x_i}(\hat{r},r)\) is the position difference between the learned preference \(\hat{r}\) and the true preference \(r\).

In other words, a correct ordering of the highly ranked items of \(r\) is more important than a correct ordering of the lower-ranked items of \(r\).

Since the goal of this study is to learn the user’s most preferred exercise during the exploration phase for a cold start problem, we consider the position error \(D_{PE}\) as the most critical measurement. In other words, the goal is to rank the most preferred item as high as possible on the learned ranking after the exploration phase. Therefore, the exact ranking of the least preferred items is not as crucial as getting the most favorite item correct.
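Both error measures are straightforward to compute. In the sketch below, rankings are represented as dictionaries mapping each item to its 1-based rank; the base of the logarithm in \(w_i\) is not stated in the text and is assumed to be 10 here (the weight of a one-position error on the top item then equals the value of 3.32 mentioned in Sect. 5.2), so adjust it if needed.

```python
import math

def position_error(true_rank, learned_rank):
    """D_PE: number of items the learned ranking places above the user's single
    most preferred item (0 means the favourite item is ranked first)."""
    target = min(true_rank, key=true_rank.get)       # item with rank 1 in the ground truth
    return learned_rank[target] - 1

def discounted_error(true_rank, learned_rank, log=math.log10):
    """D_DR: position differences between the rankings, weighted so that errors
    on highly preferred items count more (w_i = 1 / log(r(x_i) + 1))."""
    return sum(
        (1.0 / log(true_rank[x] + 1)) * abs(learned_rank[x] - true_rank[x])
        for x in true_rank
    )
```

For example, if a learned ranking swaps only the two most preferred categories, position_error returns 1, while the remaining categories contribute nothing to either measure.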

5 Results

In this section, we present the results from our quantitative survey evaluation (see Sect. 5.1) on the participants’ subjective ratings of the interaction experience using the above-described measures. Additionally, we show the results from the preference learning algorithm to verify the applicability of DTS in out-of-the-box personalization scenarios (see Sect. 5.2). Finally, we summarize the qualitative results from semi-structured post-study interviews, which highlight participants’ experience of the motivational capability of robotic exercising companions and their strategies for interacting with the distinct personalization methods (see Sect. 5.3).

The quantitative data were analyzed using the statistical computing language R [36]. We checked the data for normality assumptions and used Welch’s two-sample t test if the data met the criteria and the Wilcoxon rank-sum test otherwise [53, 54]. To support reproducible science, we published the data and analysis scripts on GitHub.Footnote 3
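The analysis itself was carried out in R; purely for illustration, the decision between Welch’s t test and the rank-sum test could be expressed as follows. The Shapiro-Wilk check is our assumption, as the text only states that normality was assessed.

```python
from scipy import stats

def compare_conditions(adaptive, adaptable, alpha=0.05):
    """Use Welch's t test if both groups look normally distributed,
    otherwise fall back to the Wilcoxon rank-sum test."""
    _, p_adaptive = stats.shapiro(adaptive)
    _, p_adaptable = stats.shapiro(adaptable)
    if p_adaptive > alpha and p_adaptable > alpha:
        return stats.ttest_ind(adaptive, adaptable, equal_var=False)  # Welch's two-sample t test
    return stats.ranksums(adaptive, adaptable)                        # Wilcoxon rank-sum test
```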

5.1 Quantitative Results

Manipulation Check The data were checked for differences in possible confounding factors such as the participants’ previous experience with technology, their average weekly exercising activity, personality, physical activity enjoyment, and attitudes towards robots. We found no significant difference for these variables. Previous experience (\(W\) = 146.5, \(p\) = .15, \(r\) = .22), exercising activity \((W = 218, p = .61, r = -0.08)\), PAES \((W = 164, p = .34, r = -.96)\), as well as NARS \((t(37.7) = 1.78, p = .08, d = .56)\) and the personality ratings (all \(p > .5\), see Fig. 4) were not significantly different between the conditions. Thus, the manipulation seems to be successful.

The measurements unrelated to our hypotheses show that participants did not evaluate the usability of the systems significantly differently, \(t(35.56) = .95, p=.35, d=.30\). There was also no significantly different evaluation regarding the openness to follow the system’s suggestions, \(W=204, p=.92, r= -.10\) (see Fig. 5). Likewise, participants did not report significantly different discomfort between the conditions, \(W=210, p = .80, r = -.26\) (see Fig. 6).

However, the systems differed in perceived warmth as measured by the responses on the RoSAS scale, \(t(36.23)=-2.47, p= .02, d=.78\). The adaptive system is perceived as warmer \((M = 4.08, SD=1.62)\) than the adaptable system \((M=2.93, SD=1.29)\).

In the following paragraphs, we present the results for our hypotheses.

Hypothesis 1

We hypothesized that the competence is perceived as higher in the adaptive condition compared to the adaptable condition. A Welch two-sample t test confirms this hypothesis and shows a significant difference between the conditions, \(t(34.55)=-2.49, p = .02, d = .79\). The adaptive system is indeed perceived as more competent (\(M = 6.55, SD = 1.67\)) than the adaptable system (\(M\) = 5.4, \(SD\) = 1.2).

Hypothesis 2

We also hypothesized that the user’s trust in the HRI is higher in the adaptive condition. The results of the WAI are depicted in Fig. 5. A Welch two-sample t test revealed a significant difference between the conditions, \(t(36.05) = -3.17, p = .003, d = 1.00\). The adaptive system was rated significantly higher on the alliance inventory \((M = 2.8, SD= .93)\) than the adaptable system \((M = 1.99, SD = .76)\). This confirms our Hypothesis 2.

Fig. 5

Box plot showing the user ratings for perceived cooperation, system usability, physical activity enjoyment, and working alliance

Fig. 6

Boxplot showing the user ratings for the RoSAS

Hypothesis 3

To assess whether the condition’s effect on overall alliance was statistically mediated by perceived competence, we used a non-parametric bootstrapping method based on [35] and coded the condition as adaptable = 0, adaptive = 1. We tested the assumptions for a mediation analysis using the gvlma package and used the mediation package for the analysis. This analysis confirmed that perceived competence statistically mediated the effect of the condition on the overall trust in the robot (ACME = .48, \(p<.05\), \(95\% \, \mathrm{CI} = .10\) to .91; 10,000 resamples; see Fig. 7), with no direct effect of the system’s autonomy (ADE = .31, \(p = .09\)) and a significant total effect (\(p<.001\)).
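The reported analysis was run in R with the gvlma and mediation packages. As a language-agnostic illustration of the underlying idea, a percentile bootstrap of the indirect effect (the product of the two regression paths, which the ACME reduces to for linear models with a continuous mediator) could be sketched as follows; the inputs are assumed to be NumPy arrays, and everything except the variable roles and the number of resamples is an assumption.

```python
import numpy as np
import statsmodels.api as sm

def bootstrap_indirect_effect(condition, competence, alliance, n_boot=10000, seed=1):
    """Percentile bootstrap of the indirect effect a*b for the path
    condition -> perceived competence -> alliance (condition coded 0/1)."""
    rng = np.random.default_rng(seed)
    n = len(condition)
    indirect = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)                 # resample participants with replacement
        x, m, y = condition[idx], competence[idx], alliance[idx]
        a = sm.OLS(m, sm.add_constant(x)).fit().params[1]                           # path a: condition -> competence
        b_m = sm.OLS(y, sm.add_constant(np.column_stack([x, m]))).fit().params[2]   # path b: competence -> alliance
        indirect[b] = a * b_m
    ci_low, ci_high = np.percentile(indirect, [2.5, 97.5])
    return indirect.mean(), (ci_low, ci_high)
```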

Hypothesis 4

Hypothesis 4 states that participants in the adaptive condition would more often volunteer to exercise a second time with the system than participants in the adaptable condition. The counts of participants wishing to voluntarily repeat the interaction are shown in Table 2. A Fisher’s exact test showed that participants in the adaptive condition did not opt significantly more often to exercise again with the robot (\(p=.06\), FET, 95% CI = 0.01 to 1.08, odds ratio = .17).

Fig. 7

Standardized regression coefficients for the relationship between the condition and the user’s alliance with the robot as mediated by the user’s perceived competence of the robot. The standardized regression coefficient between the condition and the WAI, controlling for perceived competence, is in parentheses

Thus, our Hypothesis 4 is not, or only marginally, supported.

Table 2 Participant counts for voluntarily repeating the interaction
Fig. 8

Individually ranked ground truth and online learned exercising preference for the different exercising categories

5.2 Preference Learning Results

The results for the exercising preferences learned online by the system are depicted in Fig. 8. The plot shows, for each participant in the adaptive condition, the user’s own ranked preference set and the exercising preference learned by the system during the interaction. Figure 9 shows the measured differences between the online learned rankings and the true rankings in a box plot. We report two different measurements to compare item rankings. However, the essential measurement for our evaluation is how well the system could identify the most preferred exercise during the exploration phase. The position distance error \(D_{PE}\) captures this best because we want the user’s most preferred exercising category at the highest position of the online learned ranking. The position distance shows the difference between the ranked position of the user’s most preferred item and the position where the online learner ranked it during our experiment. The median for this error is 0, presenting evidence that for most users the system was able to identify the user’s most preferred exercise after just seven comparisons. If we consider the difference in ranking errors as a sensitivity threshold, we see that in 65% of the cases the system made no errors, in 70% of the cases the system made \(D_{PE} \le 1\) errors, and in 85% of the cases \(D_{PE} \le 2\) errors. The median discounted error is \(D_{DR}=3.45\), which is not as straightforward to interpret as \(D_{PE}\). However, the median is just slightly higher than ranking the most preferred item on the second rank (i.e., \(D_{DR} = 3.32\)).

Fig. 9

Box plot showing the position and discounted distances between the online learned and user ranked exercising preference ground truth

This result is promising, presenting evidence that these kinds of learning algorithms are suitable to decrease the gap between a user’s preferences and the system’s behavior. If we take \(D_{PE} \ge 3\) as a threshold for classifying the learned ranking as failed, we see that for participants 2, 9, and 13 the learned ranking was the worst among all participants. However, the system never ranked the user’s most preferred item in the last position. There are different reasons why the system did not learn the ranking correctly for these users. For participant 13, the system explored the most preferred exercise (i.e., strength) only once; thus, the algorithm could not learn the preference for this user. For participant 9, the algorithm worked fine: the learned preferences are in line with the user’s feedback during the interaction. However, participant 9 gave inconsistent input to the system with regard to the ground truth reported after the experiment. The reason why the system could not learn the preferences correctly for participant 2 is errors in the speech recognition pipeline when the user’s preference was inserted. The other important case is when the system learns the least preferred item in the first position. At first sight of Fig. 8, the ranking for participant 4 shows that this participant’s least favorite exercise is ranked in the first position by the learner. However, \(D_{PE}= 1\) because the most preferred category of participant 4 follows the least favorite one on rank two. This result is due to ties in the ranking, because the user had no clear preference towards an exercise category. In this case, the number of iterations was too small to learn the real user preference.

Table 3 User responses to the question of why they felt motivated by the system

5.3 Qualitative Results

To understand whether participants felt motivated by the robot and how they experienced the personalization mechanism, we conducted semi-structured post-study interviews. After participants finished the questionnaire, we asked them whether they would like to answer some interview questions. Most of the participants gave at least some short responses. We asked the participants the following questions:

1. whether and why they felt motivated by the system (“Did you feel motivated by the system?”, “Can you give a reason why you felt motivated by the system?”)

2. in the adaptable condition, which strategy they used to select the exercises (“On which basis have you selected the exercising categories?”)

3. in the adaptive condition, on which criteria they made their preference selection (“How have you decided which exercise you preferred?”)

4. how much money they would spend on such a system (“How much would you pay for such a system?”)

Additionally, we asked participants whether they aligned their behavior (i.e., exercise execution) to the robot’s behavior and which modality was more important to them (the motion of the robot or its verbal instructions). Together with an in-depth analysis of the interaction videos, this should give insights into whether and how people synchronize their behavior with robots. Because the behavior of the robot was the same between conditions, the analysis of this question is not essential for the research question of this paper. Thus, we restrict the presentation of the results to the main insights regarding exercising motivation and selection strategy, which are the essential aspects of this evaluation. Therefore, the following paragraphs only report the responses regarding motivation and the selection strategies. The number of participants that answered a question varies because, at some points, participants ended the interview due to time constraints. Participants from the adaptable condition have the identifier PAxx and participants from the adaptive condition PBxx.

Exercising Motivation Table 3 summarizes some testimonies of the participants. We highlight distinguishable keywords that provide insights into the personal reasons why users felt motivated by the system. The responses reflect the varied internal heuristics participants used to evaluate whether and why they felt motivated, and they show that the evaluation criteria differ widely between persons. Participants felt motivated by the appearance of the robot, the novelty of being guided through new exercises, the fact that they did not feel evaluated by a robotic exercising partner, the companionship the robot can provide, or the possibility that the robot can quantify their training progress. Participants who stated that they did not feel motivated by the system gave recommendations and use case suggestions. Suggested use cases were a reminder system, a partner for rehabilitative exercises, or a partner for people who have just started exercising. As interaction suggestions, participants proposed that the robot should be faster and emulate emotions.

From the total number of participants that took part in the interview (adaptable: 11, adaptive: 16), six participants in the adaptable condition said that the system motivates them to exercise (e.g., PA14: “Motivating, very nice during static workouts but not so good for cardio”). In contrast, five participants stated that the system was not motivating or that they are intrinsically motivated and would not need it, although they would appreciate assistance when injured (e.g., PA12: “I don’t know. I am intrinsically motivated. For my daily life, I would not use it, perhaps if I am injured as a rehabilitation tool.”). Regarding the adaptive condition, eleven participants stated that the system would motivate them, while five said that they did not feel motivated by it (e.g., PB04: “Felt motivated but not due to the robot, because the robot does not feel any exhaustion”, PB08: “It misses emotion, could be more motivating. I did not take its messages seriously because I know that it is a machine and not a human. I also did not know whether it perceived me.”).

Exercise Selection Strategy We asked the participants in both conditions whether they used any strategy to select the next exercise and on which basis they chose their exercising preferences. Seventeen participants in the adaptable condition answered this question and 15 participants in the adaptive condition.

In the adaptable condition, ten participants said that they tried to select everything once to see what the system has to offer (PA08: “No strategy, I tried to select everything once”). Seven participants selected the exercises based on their actual exercising preferences (PA14: “I selected everything based on my preference. Therefore, I did not select cardio or relaxation exercises”).

In the adaptive condition, ten participants selected the exercises based on their current enjoyment of the task (e.g., PB9: “I thought about what was more fun for me and picked the exercise accordingly”), and three selected the activities based on their actual exercising preference (e.g.,PB18: “I chose the exercises based on my choice”).

It is interesting to note that the interview responses show that the different types of personalization strategies lead to different approaches to selecting the exercises. While in both conditions there were participants who chose activities based on their actual preferences, participants who did not use a preference-based selection approach mainly used two different kinds of selection criteria. Participants in the adaptable condition more often stated that they used their curiosity as a selection criterion (e.g., “I wanted to see what the system has to offer”). In contrast, participants in the adaptive condition did not mention curiosity. Instead, they used enjoyment as the salient criterion for stating their exercising preference.

This presents evidence that people use different kinds of qualitative criteria to maximize their interaction experience. In the adaptable condition, participants concentrated more on novelty as an evaluation criterion, while in the adaptive condition, participants used their enjoyment as an evaluation criterion. Though maximizing novelty can also be considered an interaction enjoyment criterion, it is sufficiently different from the actual enjoyment evaluation: maximizing novelty tries to optimize the expected interaction experience in the future, while the enjoyment-based approach evaluates the currently perceived interaction experience.

Costs We asked participants how much they would pay for such a system. Their responses varied widely, and we found a large discrepancy between how much they are willing to pay for such a system and how much they think it would cost. While many knew that robots are expensive, the willingness to pay this amount was low. Thus, we concentrate only on the amount they are willing to pay. The median amount of money participants in the adaptive condition would pay is 250€, while in the adaptable condition participants would pay 300€ (see Fig. 10). Even though the median is higher for the adaptable condition, the variance is greater for the adaptive condition. Since the results are not significantly different (t(24.53) = .3, p = .72), we assume that the personalization strategy is currently not a salient feature for determining the value of the system. Participants rather stated that they would be willing to pay more if the system were more capable of doing the exercises.

Fig. 10

Box plots showing amount of money participants are willing to pay for the system

6 Discussion

This work investigated how a system’s type of personalization mechanism, based on different LoA, alters the user’s perception of it. It presented a study investigating the effects of interacting with an adaptable or adaptive robot on the perceived alliance with the system and its perceived competence, depending on the different personalization strategies. Thus, it closes a gap in the research literature on the effects of varying personalization methods in HRI. The robot in our study was either indirectly controlled through a user’s preference feedback or directly controlled by the user. In the adaptive condition, we used a preference learning method based on dueling bandits. Thus, this work also presents evidence that these kinds of algorithms are suitable for personalizing the HRI experience. The results from our evaluation show that the system is, most of the time, able to learn a user’s preference during a short exploration phase.

We hypothesized that different LoA alter the perceived competence of the robot and the alliance with it. The results present evidence that users view the adaptive robot as more competent, which is supported by a significant difference between the conditions on the RoSAS competence subscale. This evidence supports Hypothesis 1: an adaptive robot is perceived as more competent than an adaptable robot. Further, it confirms theoretical investigations on the LoA and shows that a system with a higher degree of automation is indeed perceived as more competent. Therefore, we found a hint that unexpected behaviors result in different evaluations of the robot, as postulated in Epley et al.’s [10] theory of anthropomorphism. Expectancy violations regarding the non-human agent’s behavior lead people to rethink their mental model of the agent’s behavior or mental state and increase anthropomorphic reasoning. Whether higher degrees of perceived competence also result in higher perceived degrees of anthropomorphism remains uncertain. Though it is reasonable to argue that higher competence also reflects a higher degree of human-likeness, further evaluation would be required to discern the effects of competence and anthropomorphism.

The results on the perceived alliance with the robot also support Hypothesis 2: Participants had more trust in the alliance with the adaptive robot. This supports the hypothesis that Rau et al. [37] proposed but could not find evidence for. Participants trusted the robot more when it was more autonomous. This result seems counter-intuitive: why would users trust a robot more when they have less control over it? Participants might have felt overwhelmed by the exercising possibilities the system offered and thus felt less burdened to structure the interaction, because the adaptive system made the critical decisions. From a system-theory perspective, one could say that users try to reduce their uncertainty when interacting with the system [27]. Further, the mediation model supports Hypothesis 3 and provides insights into why the different conditions affected the perceived alliance. Different LoA influenced the perceived competence of the system, which in turn increased the alliance with it. Other researchers showed that anthropomorphism alters trust in an autonomous vehicle [52]: higher anthropomorphism leads to higher confidence in the car. However, these authors did not measure perceived competence as an independent mediator. Thus, it remains an open question whether manipulating anthropomorphism alters the perceived competence of the system and therefore changes the associated alliance.
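As a sketch of how such a mediation can be probed, the following bootstrap estimate of the indirect effect (condition → perceived competence → alliance) uses ordinary least squares; the simulated data frame and column names are hypothetical, with condition dummy-coded 0/1, and this is not the study's exact analysis pipeline.

```python
# Sketch of a bootstrap test of the indirect effect condition -> perceived
# competence -> alliance. The data below are simulated stand-ins, not the
# study data; condition is dummy-coded (0 = adaptable, 1 = adaptive).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 40
condition = rng.integers(0, 2, n)
competence = 3 + 0.8 * condition + rng.normal(0, 0.5, n)
alliance = 2 + 0.6 * competence + rng.normal(0, 0.5, n)
df = pd.DataFrame({"condition": condition,
                   "competence": competence,
                   "alliance": alliance})

def indirect_effect(data):
    # a-path: condition -> competence; b-path: competence -> alliance
    a = smf.ols("competence ~ condition", data=data).fit().params["condition"]
    b = smf.ols("alliance ~ competence + condition",
                data=data).fit().params["competence"]
    return a * b

boot = [indirect_effect(df.sample(frac=1.0, replace=True,
                                  random_state=int(rng.integers(1 << 31))))
        for _ in range(2000)]
print("indirect effect:", indirect_effect(df))
print("95% bootstrap CI:", np.percentile(boot, [2.5, 97.5]))
```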

Finally, we found partial evidence for Hypothesis 4. Participants in the adaptive condition opted more often to voluntarily exercise a second time. This result is probably due to interest in a system that tries to personalize the interaction by itself: it might raise curiosity about what other exercises the system can offer or whether it can effectively learn the user's preferences. However, this result is only marginally significant after applying a continuity correction. A larger sample is needed to determine whether this is a substantial effect.
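For illustration, such a comparison of voluntary return rates between conditions with Yates' continuity correction could be computed as follows; the 2×2 counts are hypothetical placeholders, not the frequencies observed in the study.

```python
# Sketch of the voluntary-return comparison with Yates' continuity
# correction; the 2x2 counts are hypothetical placeholders, not the
# frequencies observed in the study.
from scipy.stats import chi2_contingency

#                  returned  did not return
table = [[12, 4],  # adaptive condition (hypothetical)
         [7, 9]]   # adaptable condition (hypothetical)

chi2, p, dof, expected = chi2_contingency(table, correction=True)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.3f}")
```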

Our qualitative results also reveal insights into the rationale for evaluating the different personalization strategies. Using an adaptive robot results in evaluation criteria based on the current state rather than on already established preference beliefs. This invites speculation about whether an adaptive approach leads users to reconsider their opinions on exercising preferences and thus helps them stay open to trying new exercises that may eventually fit them better. While most participants in the adaptable condition used their prior beliefs to select exercises, participants in the adaptive condition focused more on which exercises they enjoyed. Keeping the results on the differences in perceived competence and trust in mind, this suggests that participants are open to new suggestions and trust the competence of the system to find the right exercise for them.

Regarding the preference learning results, we found that the system was able to learn a ranking matching the user's preferences online, after seven comparisons during the interaction, in 65% of the cases. Relaxing the ranking error criterion by one position, the system learns a good ranking in 70% of the cases. These results extend our previous experimental data on preference learning in an ad-hoc HRI scenario, which compared the algorithm to a random ranking, by presenting evidence that these algorithms are suitable for online interactions. Given the short interaction time, we hypothesize that giving the algorithm more rounds of exploration would improve these results. However, we have only argued that the position error of the most preferred item is essential for the adaptation towards the user; we have not investigated how a bad ranking of the less preferred items might affect the interaction and exploration in the long run. Additionally, we only looked at a small subset of exercising categories, and the participants' actual exercising preferences (e.g., ball sports, horse riding) were possibly not included in our experiment.
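The two success criteria reported here (the learned top exercise being exactly right, or at most one position off) can be stated in a few lines; the rankings below are hypothetical placeholders, not learned results from the study.

```python
# Sketch of the two success criteria: the user's most preferred exercise
# ends up first in the learned ranking (strict), or at most one position
# off (relaxed). Both rankings are hypothetical placeholders, ordered from
# most to least preferred item index.
learned_ranking = [2, 0, 3, 1, 4]   # hypothetical learner output
true_ranking = [0, 2, 3, 1, 4]      # hypothetical user ground truth

top_item = true_ranking[0]
position_error = learned_ranking.index(top_item)  # 0 = exact top match

strict_success = position_error == 0
relaxed_success = position_error <= 1
print(position_error, strict_success, relaxed_success)
```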

One limitation concerning the interpretation of the results above is the short interaction time during the study. An alliance is built up over repeated interactions between two partners. Therefore, the results on the effects of the partnership need to be interpreted with caution. Additionally, the scale used in this experiment was primarily designed to measure trust and alliance in client-therapist collaborations. The results might therefore differ if we had used a scale focused more on trust in the technical competence of the system. Still, confidence in the relationship is an essential part of long-term HRI, especially for use cases where the human and the robot partner work towards a long-term goal like increasing physical activity.

Moreover, we have not quantitatively assessed the quality of the preference learning over repeated interactions. Future work will look at long-term satisfaction with the learned preferences. A further limitation of our results is the sampled study population. We mainly tested healthy young adults of university age. Thus, the results are not generalizable to younger or older populations, nor to people who require rehabilitation. Hence, future investigations also have to verify the applicability of the preference learning framework with the targeted user groups.

A confounding factor for our results is the significant difference in perceived warmth between the conditions. Since we did not assess the individuals' perception of warmth towards robots and had no pre-interaction evaluation, the difference could be due to differences across cohorts rather than to the manipulation.

Lastly, one major limitation of our system is the use of a wizard. In our scenario, we used a wizard due to the limited speech recognition capabilities and to speed up the localization and positioning of the robot. As pointed out in previous research, natural language processing is a significant challenge in artificial intelligence and one of the main reasons for using WoZ [39]. While there has been considerable progress in speech recognition, and large companies offer cloud-based services, such services cannot be used due to data privacy concerns. Thus, technically, the true adaptivity of our system is limited by its speech recognition capabilities. In our scenario, we restricted the possible verbal interaction between the human and the system. We simulated the speech recognition input when the system asked for the next exercising category or the preferred exercising type; the participant's verbal response was fed back to the session controller as a speech recognition result. To take the human out of the loop and increase the adaptivity of the system for real-world use cases without requiring a human operator, one could consider using an additional interface (e.g., a tablet or smartphone) that prompts the user for feedback. How this approach would affect the user experience remains an open research question.

7 Conclusion

This work presented a study on different methods to personalize a SAR's behavior towards a user's preferences. The results of this study show that an adaptive robot is perceived as more competent and trustworthy than an adaptable robot. Thus, the associated LoA indeed influences the interaction experience for the user. Further, the study presents evidence that the perceived competence of the system significantly mediates the alliance with it. This mediation effect can be an essential aspect of long-term interaction with robots and needs in-depth investigation in long-term studies. The question remains whether an adaptive system can continuously present new and personalized behaviors so that it remains exciting to interact with over time. Moreover, regarding the question of how a future social robotic system can personalize its behavior for the user, our results show that the system can sufficiently identify the user's preferences in a short amount of time using a DTS approach. Thus, we are among the first to show that a qualitative comparative approach is suitable for online adaptation in HRI scenarios. Nevertheless, future research needs to verify these results in long-term investigations.