1 Introduction

Human–Robot Interaction (HRI) explores how to facilitate communication between humans and robots, improve their usability, and personalize their behavior to each user [9]. Personalized robot behavior drives these machines to establish bonds with their users based on acceptance and trust [6]. In this context, Reinforcement Learning (RL) has gained attention in recent years because it allows robots to learn from user feedback and explore the environment, thus producing adaptive and personalized behavior. RL methods have opened many new opportunities in social robotics, a field where HRI typically undergoes unforeseen changes and requires adaptation [2]. Nonetheless, HRI still faces many challenges, especially when the robot needs to interpret the user’s feedback, preferences, and intentions [32]. It is therefore fundamental that user feedback influence the robot’s actions so that the robot can learn correctly and succeed in the interaction [22].

This paper presents an RL framework for social robots to produce autonomous decision-making and drive entertainment or cognitive stimulation sessions using the user’s preferences. The model, shown in Fig. 1, considers implicit and explicit user feedback to evaluate the system’s performance and role in long-term HRI. The system can be applied to any robot to generate autonomous and personalized decision-making, dynamically producing adapted sessions.

Fig. 1 The Reinforcement Learning system for preference learning is used to personalize activity selection from the user’s implicit and explicit feedback. The Decision-making system selects and controls the activities executed by the robot. During HRI, the robot obtains implicit feedback from how the user interacts and explicit feedback from the activity ratings. The model in the Decision-making system receives this feedback to identify the preferred activities and play them more often

In previous works [16, 18, 19], we developed decision-making architectures for autonomous social robots in cognitive stimulation and entertainment that make decisions by simulating an artificial biological state that drives motivated behavior. Nonetheless, these architectures barely consider the user’s preferences in the decision-making process. More recently, we presented methods based on prediction [17] and conceptual models [20], but they lacked the adaptive behavior and experiments presented in this paper. The social robotics community agrees that social robots must be user-oriented to engage users and facilitate their use [12]. Consequently, this work concentrates on developing robot learning methods to personalize online activity selection during entertainment sessions by autonomously selecting the user’s favorite activities more often.

The user’s features are often unknown to the robot at the beginning of the interaction, so it has no information about how to personalize its behavior. In these situations, user preferences must be obtained from the interaction experiences while adapting the robot’s behavior step-by-step. Social robots have two ways to gain this experience from user feedback. On the one hand, they can ask the user to rate how much they liked the activity after executing it (explicit feedback). On the other hand, they can estimate the rating from interaction metrics (implicit feedback).

The literature has previously employed explicit and implicit user feedback to improve HRI [4, 8, 13, 27, 35, 39]. However, we have failed to find a previous work that explores the impact of the user’s feedback when learning the user’s preferences to personalize the robot’s behavior, and whether explicit and implicit feedback should be combined or used separately. Consequently, this work explores the impact of dynamically learning to personalize robot behavior from online user feedback, the implications of feedback on user engagement, and the influence on the number of activities during the sessions.

We evaluated the system through a long-term study with 24 participants (6 women and 18 men). Initially, the participants indicated their preferences from 0 to 5 points toward 15 multimedia activities using an online survey. Each participant performed at least 5 sessions of 20 min each (minimum interaction time of 100 min). The participants were equally divided into three conditions. These conditions investigated which feedback method produces the best robot adaptive behavior and whether the feedback method influences user engagement and the number of activities executed per session. The three conditions are:

  • Condition 1: Explicit feedback (C1). The user preferences were updated using only explicit feedback. After executing an activity, the robot asked the user how much they liked it to obtain a rating on a 0 to 5-point scale.

  • Condition 2: Implicit feedback (C2). The user preferences were updated using implicit feedback. The robot considered three interaction metrics to estimate a value from 0 to 5 points to indicate how much the user likes a particular activity.

  • Condition 3: Combined feedback (C3). The user preferences were updated by combining explicit and implicit user feedback.

The definition of the experiment conducted to assess the role of user feedback led us to hypothesize:

(H1):

C3 should be the best alternative in terms of preference adaptation, producing lower error, improving user engagement, and balancing activity exploration.

(H2):

Combining the user’s ratings (explicit feedback) with interaction metrics (implicit feedback) should lead the learned preferences to be more similar to the initial ratings of the users.

(H3):

Adapting the user’s preferences using explicit feedback (C1) should produce accurate learning. However, asking the users for their responses after each activity will reduce the number of activities tested.

This manuscript is organized as follows. Section 2 reviews the current state of social robots, focusing on learning preferences from user feedback to state the gap that this paper addresses. Section 3 formalizes the RL problem focusing on Multi-armed Bandits applied to constant learning conditions. Section 4 describes the experiment to test the performance of the learning system. Section 5 presents the experiment results, comparing the learning system outcomes for the three conditions. Section 6 discusses the outcomes of this work and states its limitations. Finally, Sect. 7 provides the main conclusions.

2 Related Work

In the last few years, RL has been successfully applied to dynamic environments to provide social robots with adaptive behavior in applications such as social navigation [8, 15], education [7], and assistance [28]. Nonetheless, the literature contains only a few contributions that address adaptation from explicit and implicit user feedback in social robotics to personalize HRI sessions.

In this line of research, Baraka and Veloso [4] designed one of the first studies to generate personalized entertainment sessions. In particular, the authors studied the role of explicit and implicit user feedback in learning preferences. In their study, which is very close to the work presented in this manuscript, they classify users into different profiles (i.e., conservative, erratic, and consistent but fatigable) and study how to learn user preferences in simulation. However, their work was not tested with human users, and their definition of user feedback has not been put into practice. Similarly, Whitney et al. [39] presented a robot that reduces its errors in an object-fetching task by using explicit and implicit feedback from the user. In this case, the robot only asks the user for information when necessary, which avoids fatigue and increases the fluency of the interaction. The main difference with our work is that their system was tested with only a few robot actions and short-term interactions without learning.

Hemminghaus and Kopp [14] investigated how a social robot can adapt its social behavior while interacting with humans to attain specific goals in unpredictable situations. The adaptive process uses RL and implicit user feedback during a memory game assistive task to improve the robot’s decision-making. In a similar work, Moro et al. [21] proposed an RL-supported system that learns personalized behavior in daily assistive activities by considering the user’s implicit feedback. Meanwhile, Ritschel and André [25] used RL to dynamically adapt their robot’s behavior to the human’s personality profile to make the interaction more engaging. This setting employs implicit feedback because the robot does not explicitly ask the user about their preferences but instead uses social signals to estimate their level of introversion or extroversion. The primary difference between these works and ours is that their robots use predefined short-term scenarios and only consider implicit feedback.

Later, Ritschel et al. [27] presented a robot that can adapt its communicative acts while performing activities such as information retrieval, reminders, communication, and entertainment and giving health-related recommendations. The adaptive process gathers explicit user feedback obtained during the interaction and uses an Upper Confidence Bound action selection method supported by RL to improve the process.

In the last few years, social robots have been used to drive HRI sessions with older adults and children. Wang et al. [38] studied the impact of service robots in interactions with older people by focusing on their interface preferences in multi-modal communication procedures. In education environments, RL has also been used in HRI by Park et al. [24] to help children improve their language skills. In this case, the robot gathers verbal and non-verbal feedback from the children to modulate their engagement and maximize their learning gain. Che et al. [8] presented a mobile robot that can produce efficient social navigation by combining explicit and implicit user feedback. This setting resembles our work because it conceptually uses explicit and implicit feedback to produce appropriate behaviors. However, it differs from our approach in that preference learning is applied to mobile rather than social robotics, and in the lack of real experiments validating the impact of feedback on the robot’s performance. In a similar scenario, Shi et al. [33] have shown the great potential of adaptive socially assistive robots during long-term interventions for children with autism spectrum disorder. However, user feedback plays a minor role in their work because most session parameters are predefined, and only facial gestures are considered.

Focused on personalizing HRI, Tsiakas et al. [35] proposed an Interactive Reinforcement Learning framework combining explicit feedback from task performance and implicit feedback from user engagement. Their results show that combining explicit and implicit feedback drives real-time HRI personalization. The 69 participants interacted with the robot in a single session to evaluate whether factors like exercise level or engagement were appropriately personalized. Olatunji et al. [23] studied the design of effective feedback strategies in person-following robots with older adults. Their results show that users preferred continuous voice feedback over tones to receive constant information about the robot’s actions. Akalin et al. [1] explored how different types of robot feedback (negative, positive, and flattering) influence the users’ perception in cognitive training tasks. The results show that flattering and positive feedback were preferred. However, the study does not consider implicit or explicit feedback from the user, and the evaluation was carried out in a single session. The main differences from our work are that these systems were not evaluated in long-term experiments and did not analyze the role of user feedback in the task.

Boggess et al. [5] developed a system to generate personalized explanations for path planning from user preferences. The robot answers the users’ questions through HRI and, using RL, defines the best strategy for each situation. The main differences with our work are that implicit feedback is not used and that the method is evaluated with an online survey instead of real interactions. Recently, Asprino et al. [3] presented a software architecture that considers adaptive behavior based on user preferences. In this task, the architecture obtains explicit user feedback before the interaction to store the user’s favorite activities, which are later autonomously presented to the user. However, this paper does not provide an evaluation, and neither online adaptation nor user feedback is considered.

As Wirth et al. [40] have recently reviewed, numerous works have employed RL to produce a set of preferences toward a group of actions/activities. For this application, the authors present Multi-armed Bandits as an effective alternative to organize activities as a ranking based on the values of a model-free tabular RL algorithm. Many authors have explored these methods in depth [26, 30, 37] in recent years to personalize the interaction of social robots. These contributions agree on ordering a set of labels (in most cases, activities) so that those preferred by the user are selected more often. Considering the positive results of these studies in HRI, we opted for a Multi-armed Bandit action-based method in a non-stationary scenario with a constant learning rate.

Table 1 Related contributions to our method analyzing the similarities and differences for adaptive robot behavior for HRI sessions

Adaptive systems in HRI have great potential to significantly improve the interaction. The analysis of the literature review provided in Table 1 highlights the lack of adaptive robots for long-lasting interactions that consider and analyze the role of explicit and implicit user feedback. The review shows some works [4, 8, 17, 35, 39] that personalize HRI from explicit and implicit user feedback. Besides, some works [5, 17, 21, 24, 26] include RL to improve further interactions from past experiences. However, none of these studies investigate the role of explicit and implicit feedback on learning preferences and whether user feedback influences the personalization and learning process in long-term experiments with onsite participants. Acknowledging this gap, this paper presents a framework that (i) generates online adaptive behavior during long-term sessions using RL, (ii) considers implicit feedback obtained from the user’s actions during the interaction, (iii) considers explicit feedback by asking the users their preferences, and (iv) explores the influence of feedback on user engagement and activity execution.

3 Methods

This section formalizes the RL methods based on non-stationary Multi-armed Bandits [34, p. 32]. Action refers to the robot selecting and executing an activity.

3.1 Formalization

RL is a learning method that allows an agent to learn how to map situations to actions to maximize a numerical reward signal representing a goal [34, p. 1]. Initially, the agent does not know the effects of its actions but can explore them through continuous interaction (i.e., trial and error) with the environment. When an action finishes, its effects are “perceived” by the agent and then converted to a numerical reward value that measures the action’s quality in the agent’s situation. From this idea, it follows that the reward function has to be predefined by the designer because it is specific to the learning scenario and dependent on the goal that the agent seeks to attain (see Sutton and Barto [34] for a review of reward function shaping).

Formally, RL problems are generally modeled as Markov Decision Processes (MDPs) [36], which consider features of the agent’s situation to develop a probabilistic model based on transition probabilities from one state (situation) to another. A transition occurs when the agent executes an action selected from a list of possibilities. Among RL methods, Monte Carlo, Temporal Difference (TD), and Dynamic Programming algorithms consider a varying agent state and learn which action best suits each situation. Meanwhile, methods such as Multi-armed Bandits consider a constant agent state and focus on learning action values (i.e., optimal action execution) using environmental feedback. Although the two streams have remarkable differences, they update their value estimates similarly: both consider the error between a target value and the agent’s previous estimate. Thus, an error appears when the current observation differs from previous knowledge, which drives its correction by moving the estimate a step toward the target. Equation (1) expresses this idea,

$$\begin{aligned} \text {Value} \leftarrow \text {Oldvalue} + \text {StepSize} \left[ \underbrace{ \text {Target} - \text {Oldvalue}}_{error} \right] \end{aligned}$$
(1)

where StepSize indicates the amplitude of the correction of the old value towards the Target value.

This learning scenario requires the robot to learn a behavior policy (i.e., a sequence of actions) that fulfills the goal defined by the reward function by maximizing the reward obtained after each action in a fixed agent state. Following Sutton and Barto’s [34, p. 33] ideas, we opted for Multi-armed Bandits since they are a simple and efficient solution for learning how good a specific action is in a static agent state. This method allows us to compare a group of possible actions and find the most suitable one for preference selection. Equation (2) shows the original formulation of Multi-armed Bandits, in which the step size varies with the number of updates.

$$\begin{aligned} Q(a) \leftarrow Q(a) + \frac{1}{N(a)} \left[ r - Q(a) \right] \end{aligned}$$
(2)

In this equation, Q(a) is a numerical value that represents how good it is to execute action a, and the step size \(\alpha \) is a function of the number of updates N(a), expressed as \(\frac{1}{N(a)}\). This choice allows convergence because the correction decreases with the number of updates N(a) of the action.
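As a brief illustration of Eq. (2), the following minimal Python sketch (with an illustrative reward sequence, not data from this work) shows that the decreasing step size makes the estimate settle on the running mean of the observed rewards:

# Minimal sketch of the sample-average update in Eq. (2); the reward
# sequence below is illustrative, not data from the study.
def sample_average_update(q: float, reward: float, n_updates: int) -> float:
    # Step size 1/N(a): corrections shrink as the action is updated more often
    return q + (reward - q) / n_updates

rewards = [3, 4, 2, 5, 3]          # hypothetical rewards for a single action
q, n = 5.0, 0                      # initial estimate before any update
for r in rewards:
    n += 1
    q = sample_average_update(q, r, n)
print(q)                           # 3.4, the mean of the observed rewards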

3.2 Proposed Learning System

This paper’s proposed learning system employs Multi-armed Bandits [34, p. 25] considering a constant learning rate. These methods can be applied to our problem due to two main properties:

  • The learning process is continuous since the learning rate does not change, and learning occurs during the whole lifespan of the robot.

  • In this set-up, the robot learns activity preferences instead of state-transition suitability. Thus, the update deals with learning the best actions in a non-stationary scenario with a fixed agent state, which motivates replacing the decreasing step size of Eq. (2) with a constant one.

We opted for using Eq. (3) in our learning model. This equation is based on Eq. (2), setting a constant learning rate \(\alpha \).

$$\begin{aligned} Q(a) \leftarrow Q(a) + \alpha \left[ r - Q(a) \right] \end{aligned}$$
(3)

In RL, the learning rate \(\alpha \) often depends on the number of updates N(a) of the Q-value associated with the action. The original algorithm proposed in [34] and presented in the previous section considers this setting to converge to an optimal solution. However, converging to an optimal solution is not necessary in our application, where the user’s preferences may vary in the long term. Instead, we propose that the learning system continuously adapt to the user’s preferences. To find the best \(\alpha \) value, we conducted a preliminary evaluation to choose between four empirically selected alternatives: 0.1, 0.25, 0.5, and 1. The preliminary evaluation consisted of simulating the learning system’s dynamics using the four rates to update a single action during 20 iterations with different rewards. The results of this evaluation indicated that \(\alpha =0.5\) was the best alternative because the learning rates of 0.1 and 0.25 produced excessively slow adaptation, while a learning rate of 1 could not fit the initial user rating of the activity well.

The Q-values Q(a) representing the user preferences range from 0 to 5 points: 0 indicates that the user does not like the activity, whereas 5 indicates that the user loves it. All action values Q(a) start at 5 points to allow activity exploration at the beginning of the experiment and to select preferred activities more often as the experiment progresses. The reward value is also restricted to between 0 and 5 points to keep the values in this range. Recall that user feedback can be obtained from the activity ratings (explicit feedback) and from interaction metrics (implicit feedback).
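To make these dynamics concrete, the following minimal Python sketch reproduces the kind of preliminary comparison described above, using a hypothetical sequence of noisy ratings around 2 points rather than data from the study:

import numpy as np

Q_INIT = 5.0                        # optimistic initial value for every activity

def update_q(q: float, reward: float, alpha: float = 0.5) -> float:
    # Eq. (3): move the stored preference one step towards the observed reward
    reward = float(np.clip(reward, 0.0, 5.0))   # rewards are kept in [0, 5]
    return q + alpha * (reward - q)

rewards = [2, 3, 2, 1, 3, 2, 2, 3, 1, 2] * 2    # hypothetical ratings around 2
for alpha in (0.1, 0.25, 0.5, 1.0):
    q = Q_INIT
    for r in rewards:
        q = update_q(q, r, alpha)
    print(f"alpha={alpha}: final Q = {q:.3f}")

Low learning rates retain part of the optimistic initial value after many updates, whereas a rate of 1 simply copies the last reward; \(\alpha =0.5\) offers a compromise between both effects, in line with the preliminary evaluation reported above.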

4 Experiment

This section describes the experimental setup of this work. It introduces the Mini social robot used to test the system’s performance. Then, it describes the experimental setup and the session dynamics. Finally, we describe the robot’s actions in the learning scenario.

4.1 Mini Social Robot

Mini [29] is a social desktop robot that assists older adults in cognitive stimulation therapies and entertainment sessions. Mini communicates with the users using an HRI manager [11] that handles the verbal and non-verbal interaction and obtains the user’s feedback through perception to adapt its behavior to the situation it is experiencing. The user executes the robot’s activities using a connected touch screen. The Decision-making system [16], which employs the user’s preferences to personalize the interaction, manages activity selection.

4.2 Experimental Setup

The experiments were conducted to validate our approach by comparing the learning system’s performance under three conditions that define how the reward is obtained from user feedback. We recruited 24 university students with little expertise in robotics (6 women and 18 men) aged 20 to 30 years (\(\mu =24.55\), \(\sigma =2.75\)) who had not previously interacted with the robot. They were randomly and equally assigned to one of the three conditions to execute entertainment sessions consisting of activities related to watching photos, watching videos, and listening to music. This task was selected considering the intended application of the method (learning user preferences towards the robot’s activities to personalize future activity selection) and the versatility and repertoire that multimedia activities offer. Each session had a minimum allotted time of 20 min; the robot autonomously ended the session when the 20 min had elapsed and the ongoing activity finished. Since the participants required four weeks to complete the experiment, the sample size was limited to 8 people per condition.

Fig. 2 Mini executing an activity with a user during the entertainment experiment

Mini is a desktop robot; therefore, in the experiments, it was fixed to an office table where the participants interacted with it individually, without the intervention of other people. The sessions were face-to-face interactions, as shown in Fig. 2. Each participant tested the robot’s activities as described in Sect. 4.3, participating in only one of the conditions (between-subjects study). The participants decided when to interact with the robot and had 20 days (Monday to Friday for four consecutive weeks) to complete a minimum of five sessions (a total time of 100 min). The number of sessions (5) and the time per session (20 min) were set based on a previous user study we conducted with the robot [17]. Nonetheless, all participants voluntarily completed more than the requested sessions, as indicated in Table 7. The conditions under evaluation were:

  • Condition 1: Explicit feedback (C1). Only explicit user feedback was considered when updating the Q-values associated with each activity. Independently of the activity result, the user was requested to rate the activity once finished using a 0 to 5 point scale.

  • Condition 2: Implicit feedback (C2). The users never rated an activity, but instead, implicit feedback was calculated using the interaction metrics to estimate how much the user liked the activity.

  • Condition 3: Combined feedback (C3). This condition combines the two previous approaches. The robot autonomously decides whether to ask the user to rate the activity after finishing it (the probability of asking is \(50\%\)), and the interaction metrics are always used to estimate the reward.

Fig. 3 Flow diagram representing the experiment dynamics. The experiment starts by loading the user profile data, including preferences. Then, the robot decides whether the user or the robot itself selects the next activity. Once the activity finishes, the Q-value is updated using the implicit and explicit (if obtained) user feedback

4.3 Session Dynamics

As shown in Fig. 3, the session dynamics followed the same course with subtle differences in the three conditions presented earlier. Before starting the experiment and testing the robot’s activities, each participant completed an online survey to rate how much they liked different photos, videos, and music activities using a 0 to 5 point scale. These ratings were later used as a baseline to compare the initial and predicted preferences.

At the beginning of the first session, the robot informed the users about the experiment’s dynamics and the need to complete at least five sessions in four weeks, while allowing them to execute more. At the beginning of each session, the participants had to press a Start button and select their name on the touch screen to load their information into the robot’s memory. This profile contained basic personal information about the participant (e.g., age or name), which the designers had previously included. The profile also stored the participant’s preferences towards the robot’s activities, initialized at 5 points and adapted by the learning algorithm during the interaction. Thus, we ensured that all activities had the same probability of being selected at the beginning of the experiment.

The participants or the robot could select the activities in each iteration. The probability of the robot or the user selecting the activity was \(50\%\) in each case. Thus, depending on who made the decision, the activities could be selected in two ways:

  • If the user selects the activity, a menu appears on the touch screen with the activities classified under the categories photos, videos, and music. The user can navigate these menus and select the robot’s next activity.

  • If the robot autonomously selects the activity, it announces the next activity before starting. In this case, the robot employs the Boltzmann distribution [10] with a Temperature value of 5 points to select the user’s preferred activities more often while fostering unexplored activities using the learned Q-values; less visited activities also have their selection probability increased (a minimal sketch of this selection rule is given after the next paragraph).

The likelihood of selecting preferred and less explored activities increased as the experiment progressed, whereas the selection probability of frequently explored activities with low ratings was substantially reduced. It is also important to remark that the user could stop an activity by touching the robot’s right shoulder. At that moment, the activity was paused, and the robot waited for the user’s confirmation on the touch screen. After cancelling an activity, the user or the robot could select a new one if the session time was below 20 min.
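The following minimal Python sketch illustrates this Boltzmann (softmax) selection rule; the activity names and Q-values are illustrative, and the additional boost given to rarely visited activities is omitted since its exact form is implementation-specific:

import numpy as np

TEMPERATURE = 5.0                   # temperature value used in this work

def select_activity(q_values: dict) -> str:
    # Softmax over the learned Q-values: preferred activities are chosen more
    # often, but every activity keeps a non-zero selection probability
    rng = np.random.default_rng()
    names = list(q_values)
    logits = np.array([q_values[n] for n in names]) / TEMPERATURE
    probs = np.exp(logits - logits.max())       # subtract max for stability
    probs /= probs.sum()
    return str(rng.choice(names, p=probs))

prefs = {"animals": 4.6, "landscapes": 3.2, "noise": 0.8}   # hypothetical Q-values
print(select_activity(prefs))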

When an activity finishes, one of three events occurs, depending on the evaluation condition:

  1. If the participants were in Condition 1, then they were requested to rate the activity using a 0 to 5 point scale with \(100\%\) chance.

  2. If the participants were in Condition 2, then the user never rated the activity, and interaction metrics were used to update the activity Q-value.

  3. If the participants were in Condition 3, then they were occasionally requested to rate the activity (\(50\%\) chance). The interaction metrics were also used.

If the robot detected the user’s inactivity (i.e., not answering the questions), the session finished. This happened only once during the experiments, when one user had to leave due to personal problems; that session was removed from the data and not considered in the analysis. The robot then returned to an idle state and waited for a new participant to press the Start button to begin a new session.

4.4 Obtaining User Feedback

Mini has two ways of obtaining user feedback and getting a numerical reward to update the user’s preferences.

  • Explicit feedback is obtained from the user ratings using the touch screen.

  • Implicit feedback is estimated from the interaction using predefined metrics.

Both alternatives yield a numerical reward to update the Q-value associated with the previously executed activity. The reward obtained after each activity ranges from 0 to 5 points, which keeps the Q-value associated with each activity inside this range. On the one hand, the robot obtains explicit feedback by asking the user to rate how much they liked the activity from 0 to 5. When obtaining explicit feedback, the robot asks the user “How much did you like listening to/watching...?”, ending with the name of the activity just performed. Then, using the touch screen, the user can rate the activity from 0 to 5. Equation (4) shows how the numerical value associated with the explicit feedback is obtained.

$$\begin{aligned} r_{explicit} = \text {User rating in } \{0, 1, 2, 3, 4, 5\} \end{aligned}$$
(4)

On the other hand, the robot obtains implicit feedback by estimating the numerical reward using customized parameters that define how good the interaction between both agents was while executing an activity. We defined three interaction metrics that jointly represent the quality of the interaction process. Each metric takes the value 0 or 1, depending on whether its associated condition is false or true. The three metrics used in this work and their related conditions are:

  • User Selection (US): This metric indicates whether the user selected the activity (1) or the robot autonomously proposed it (0).

  • Activity Result (AR): An activity can have two possible outcomes: succeeded, in which case the value of this metric is 1, or aborted, which occurs when the user voluntarily cancels the activity, in which case the value is 0.

  • Execution Time (ET): This metric indicates whether the activity’s execution time is similar to the execution times of other participants. To evaluate this condition, we calculate the mean execution time of the activity across all users, \(\mu \), and its standard deviation, \(\sigma \). If the current execution time is within the interval [\(\mu -\sigma ,\mu +\sigma \)], the value of the metric is 1; otherwise, it is 0.

Equation (5) shows the reward value associated with the implicit feedback using the interaction metrics presented earlier. It is worth noting that the interaction metrics user selection and activity result have double the influence on the reward compared to execution time because we consider them more relevant and reliable in our scenario. In other scenarios, however, the interaction metrics could be different.

$$\begin{aligned} r_{implicit} = 2 \cdot US + 2 \cdot AR + ET \end{aligned}$$
(5)

Finally, if the reward value is calculated by combining the explicit and implicit feedback (C3), it is the average of the explicit (\(r_{explicit}\)) and implicit (\(r_{implicit}\)) values. Otherwise, if the explicit feedback is not obtained because the robot does not ask the question, the combined feedback equals the implicit feedback, as Eq. (6) shows.

$$\begin{aligned} r_{combined} = \left\{ \begin{array}{ll} \frac{(r_{explicit} + r_{implicit})}{2} &{} \quad \text {if question} \\ r_{implicit} &{} \quad \text {if not question} \\ \end{array}\right. \end{aligned}$$
(6)
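The following minimal Python sketch puts Eqs. (4)–(6) together; the metric values and the timing statistics in the example are illustrative placeholders, not data from the study:

from typing import Optional

def implicit_reward(user_selected: bool, succeeded: bool,
                    exec_time: float, mean_time: float, std_time: float) -> float:
    # Eq. (5): US and AR weigh twice as much as ET, so the reward stays in [0, 5]
    us = 1 if user_selected else 0                        # User Selection
    ar = 1 if succeeded else 0                            # Activity Result
    et = 1 if mean_time - std_time <= exec_time <= mean_time + std_time else 0
    return 2 * us + 2 * ar + et

def combined_reward(r_implicit: float, r_explicit: Optional[float]) -> float:
    # Eq. (6): average both sources if a rating was asked, else implicit only
    if r_explicit is None:
        return r_implicit
    return (r_explicit + r_implicit) / 2

# Example: a user-selected, completed activity whose duration lies within one
# standard deviation of the (hypothetical) group mean, rated 4 by the user
r_imp = implicit_reward(True, True, exec_time=175, mean_time=180, std_time=20)
print(r_imp, combined_reward(r_imp, 4))                   # 5 and 4.5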

4.5 Activities

The learning system aims to obtain the user’s preferences regarding the entertainment activities of the Mini robot. As mentioned earlier, the learning process adapts by obtaining explicit and implicit feedback from the user after executing each activity. Thus, learning can only succeed if the user executes each activity many times so that the robot can acknowledge how much the user likes each activity.

In the task we designed to evaluate the learning system and the role of user feedback, we opted for activities that display multimedia content because they are easy to use and offer versatility and diversity in their execution, since each type of activity has different options. The activities were classified into the categories photos, music, and videos.

The photos category includes the activities animals, monuments, landscapes, and sad moments. The music category includes Spanish pop, Spanish rock, English pop, English rock, Latin, and noise. Finally, the videos category covers cooking recipes, funny moments, sports, film trailers, and comedy.

Each photo activity displays eight photos for 5 s each, for a total duration of 40 s. The photos were downloaded from Google to create a database of around 1000 images, with a similar number per photo activity (around 250 each). A video activity displays a single video of around 3 min; we downloaded the videos from YouTube to create a database of around 110 videos equally divided among the video activities. Finally, a music activity plays a song of around 3 min selected from a database of around 90 songs equally divided among the music activities. Within each activity, items were selected randomly, but the last five items presented were remembered to reduce the chance of repetition.

The activities sad moments and noise in the photos and music categories were included as two activities that we expected participants to dislike. Thus, we expected their Q-values to evolve negatively compared to those of activities the participants typically like. Besides, using a large database and dividing the activities into categories provides diversity, so users encounter activities they like and dislike.

4.6 Evaluation

The evaluation of the learning performance exhibited by the robot was carried out as follows.

  1. First, we used the Root Mean Squared Error (RMSE) and the Spearman correlation to statistically report which condition produced the best adaptation to the initial preferences obtained from the online survey. This analysis was carried out for all activities and for each category (i.e., photos, videos, and music). The RMSE measures the absolute difference between observed and predicted values, which strongly indicates the deviation between two samples. Preference values range from 0 to 5 points, and the RMSE is used to compare the three ways of learning user preferences. Given the range of our data, RMSE values below 0.5 units can be considered excellent, values from 0.5 to 1 moderately positive, and values above 1 high. The Spearman correlation is a metric applied to non-normal distributions to obtain the degree of relationship between two variables. It ranges from \(-1\) to 1, distinguishing between direct and inverse relations. Lower RMSE values and Spearman values close to 1 indicate a strong relationship [31] between the observed and predicted values (a minimal sketch of both computations is given after this list). The final correlation scores are the average correlation values of the 8 users participating in each condition. The correlation score of each user is computed considering the initial preferences and learned values for all 15 activities, only the photo activities (4), only the video activities (5), and only the music activities (6).

  2. Second, we statistically analyzed whether using different types of feedback led the users to interact more often with the robot and whether this affected the number of updates of the activities. We employed the one-way ANOVA test to look for significant differences in the users’ engagement between the three conditions, analyzing the impact of the user feedback.

  3. Third, we used the Kruskal-Wallis statistical test to determine whether the number of times each activity was updated was affected by how the robot obtained user feedback. This test is suited to non-normally distributed small samples.
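The following minimal sketch shows how the two per-user metrics can be computed with numpy and scipy; the preference vectors are hypothetical, not data from the study:

import numpy as np
from scipy.stats import spearmanr

def rmse(initial: np.ndarray, learned: np.ndarray) -> float:
    # Root Mean Squared Error between initial ratings and learned Q-values
    return float(np.sqrt(np.mean((initial - learned) ** 2)))

initial = np.array([5, 4, 2, 0, 3, 5])                  # hypothetical survey ratings
learned = np.array([4.6, 3.8, 2.5, 0.9, 3.2, 4.4])      # hypothetical learned Q-values

print(f"RMSE: {rmse(initial, learned):.3f}")
rho, p = spearmanr(initial, learned)
print(f"Spearman rho = {rho:.3f}, p-value = {p:.3f}")    # interpret only if p < 0.05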

5 Results

This section presents the main results of the experiment described in the previous section. We statistically compare the three conditions used to adapt the Q-values, focusing on the error yielded by each approach and the correlation metrics. Additionally, we statistically analyze whether user feedback influences the interaction time, as an indicator of increased engagement, or the number of times the activities were updated. Table 2 summarizes the study participants and the main outcomes obtained in the study.

Table 2 Summary showing the details of the participants in each condition and the results of the experiment
Fig. 4 RMSE when comparing the initial and predicted preference values for all the activities in each condition

Fig. 5 RMSE value for each condition and type of activity when comparing the initial and predicted user preferences

5.1 RMSE and Spearman Correlation

The methodology presented in this manuscript was evaluated by comparing which kind of feedback produces better adaptive results when correlating the initial user preferences with the predicted preferences. The statistical analysis consisted of analyzing the RMSE and the Spearman correlation from data in Tables 4, 5, and 6.

Figure 4 shows the RMSE obtained for the three conditions evaluated in this work. As we initially hypothesized, the condition combining explicit and implicit feedback (C3) yields the lowest RMSE (0.772), which indicates that this alternative produces better predictions of the initial user preferences. The condition using only explicit feedback (C1) reports an RMSE of 1.355 points, outperforming the use of implicit feedback (C2), which reports an RMSE of 1.662 points.

Moving deeper into our analysis, we also explored for which category (photos, videos, or music) the RMSE was lower. As Fig. 5 shows, the system produces the best results for the activities showing photos, followed by videos and music. Considering explicit feedback (C1), the photos category reports an RMSE of 0.86 points, videos 1.32 points, and music 1.63 points. Regarding implicit feedback (C2), photos obtains a score of 1.28, videos 1.64, and music 1.91. Finally, combined feedback (C3) obtains the best RMSE values, with 0.64 points for photos, 0.90 for videos, and 1.03 for music.

From this analysis, two interesting results emerge. First, the same tendency can be observed for all categories: independently of the type of activity, C3 produces a lower RMSE than C1 and C2. This result supports our hypothesis about the benefits of combining explicit and implicit feedback rather than using them separately. Second, the system seems to fit the preferences better for shorter activities, producing better scores for photos (40 s) than for videos and music (around 3 min each), since the data used to update the RL algorithm is obtained more often, improving the learning speed.

Given the previous outcomes, we wanted to analyze whether the participants’ mean activity duration in each condition affected the RMSE value and, therefore, the learned preferences. We conducted a second statistical analysis using the Spearman correlation to look for a relationship between the mean duration of the activities and the RMSE of each user in the three conditions. The correlations obtained were 0.23 points for Condition 1 (explicit feedback), 0.28 for Condition 2 (implicit feedback), and 0.16 for Condition 3 (combined feedback). These values correspond to low Spearman correlations, so the duration of the activities does not appear to affect the RMSE values in this study.

Considering the previous RMSE scores, we can conclude that, when computing the RMSE for all activities, the value obtained when combining explicit and implicit feedback (C3) indicates that the model learns well. However, the results are not positive when using only explicit (C1) or implicit (C2) feedback, as the RMSE values are high. The analysis of the RMSE values per condition and per category (photos, videos, and music) reveals notable differences. For example, the RMSE value is positive for the combined feedback condition (C3) for the photos and videos categories, but less so for music. Considering conditions C1 and C2, only the learning values for the photos category using explicit feedback can be considered positive; the other cases are not, since their RMSE values are above 1 unit. The comparison of the three conditions using the RMSE shows that the learning values produced when combining explicit and implicit feedback are much better than when considering explicit or implicit feedback individually.

After analyzing the RMSE, we used the Spearman correlation for non-normal distributions to determine the correlation between the initial user ratings obtained from the online survey and the predicted Q-values. Table 3 shows the correlation metrics obtained for each condition, considering all the activities and sorting them by category. As mentioned above, stronger correlations are represented by Spearman coefficients close to 1, as stated in [31]. This correlation can only be interpreted if the analysis reports statistical significance (\(p-value < 0.05\)).

The results we obtained regarding the Spearman correlation for all activities separating the three conditions under evaluation show that the observed and predicted values are strongly correlated (0.811) when combining explicit and implicit feedback (C3). Similarly, using only explicit feedback (C1) also reported a moderate correlation (0.567), which suggests that the user’s ratings should be included in the loop. However, as occurred with the RMSE, the implicit feedback alone (C2) did not report a significant correlation, and therefore, the Spearman value is not worth interpreting.

The analysis by categories supports the outcomes produced by computing the RMSE. As shown in Table 3, combining explicit and implicit feedback (C3) produces a strong correlation for all categories, with values of 0.850 for photos, 0.779 for videos, and 0.817 for music. Similarly, the use of explicit feedback (C1) leads to positive outcomes, with a moderate Spearman correlation for videos (0.594) and music (0.542) and a strong correlation for photos (0.826). However, when only implicit feedback is considered (C2), the initial and predicted preferences report a moderate correlation for photos (0.462) but not for videos (0.123) or music (\(-0.028\)).

Table 3 Spearman correlation values when comparing the initial user preferences and the model predictions

5.2 Engagement and Number of Updates

The last statistical analysis we conducted aimed to assess whether how feedback was obtained affected user engagement and the number of times each activity was executed. We used the results in Table 7 for this analysis.

We carried out a one-way ANOVA test for the engagement analysis because the data was normally distributed. In this case, the one-way ANOVA did not report a statistically significant difference between the condition groups, \(F(2, 21)=.119\), \(p-value=.889\).

We then conducted the Kruskal–Wallis non-parametric test to determine whether the number of times each activity was performed was influenced by how the robot obtained user feedback. The Kruskal–Wallis test was used because, in this case, the data was not normally distributed. As in the previous analysis, the results did not reveal any statistically significant difference (\(p-value>0.05\)). More specifically, the Kruskal–Wallis test reported \(H(2)=2.949, p-value=.229\) for the photo activities, \(H(2)=2.573, p-value=.276\) for the video activities, and \(H(2)=3.822, p-value=.148\) for the music activities. From these results, discussed in the following section, we cannot conclude that the type of feedback used to update the user preferences affects the number of updates or user engagement, since the number of times each user executed the activities varied during the experiments and cannot be evaluated accurately.
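For completeness, the following sketch shows how both group comparisons can be run with scipy; the engagement times and update counts are synthetic placeholders for eight users per condition, not the data in Table 7:

import numpy as np
from scipy.stats import f_oneway, kruskal

rng = np.random.default_rng(0)
engagement = [rng.normal(140, 25, 8) for _ in range(3)]   # minutes per user, C1-C3
updates = [rng.poisson(12, 8) for _ in range(3)]          # executions of one activity

F, p_anova = f_oneway(*engagement)                        # one-way ANOVA (normal data)
H, p_kw = kruskal(*updates)                               # Kruskal-Wallis (non-normal)
print(f"ANOVA: F = {F:.3f}, p = {p_anova:.3f}")
print(f"Kruskal-Wallis: H = {H:.3f}, p = {p_kw:.3f}")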

6 Discussion

Generating adaptive behavior in HRI is a process that continuously tracks user preferences. Most RL algorithms seek to optimize a problem to attain a specific goal. However, in this application, the learning system’s goal is not to find a fixed solution but to learn the users’ preferences towards a group of activities. This implies that the Q-values will not converge to a final value unless the same rating is repeatedly used to update them. If the user’s rating is always the same, the learning system converges to the value that the user set as their initial preference. If a different updating value is obtained, however, the adaptive process modifies the previous value to correct the error, cancelling the convergence.

Our experiment required each participant to interact with the robot for at least 100 min over four weeks to obtain accurate learning dynamics. This differs from most HRI studies in the literature, which typically design short-term experiments with only one or two activity sessions. However, we are aware that more participants and experiments are required to precisely define the influence of user feedback on HRI and which factors play a role in this process, especially for implicit feedback, due to the numerous metrics involved. This study serves as an initial step towards evaluating the impact of explicit and implicit user feedback on preference learning and adaptive behavior in long-term scenarios.

The experiment results show that users were free to interact with the robot, so the amount of time spent with Mini differed across users. However, they all carried out more sessions than required, which may indicate their willingness to interact with Mini in this task. Our results also show that combining explicit and implicit feedback yields the learning values most similar to the users’ initial preferences, highlighting this approach’s potential benefit. These results suggest that designing efficient methods to obtain implicit feedback is important for HRI and complements the information explicitly provided by the users.

The results suggest that implicit feedback alone is not a good alternative because the correlation is not significant for the videos and music categories and is weak for photos. A possible reason for the worse results produced by only considering implicit feedback might reside in the interaction metrics we selected to compute it. Unlike explicit feedback, which obtains the real user preferences from their ratings, implicit feedback in our model is computed from the interaction time, activity result, and user selection. However, other metrics could also be considered, for example the difficulty of the activity, the user’s experience or knowledge level, the number of times the activity has been repeated, or the errors that might appear during execution due to the activity’s programming/design.

The statistical analysis conducted on the data indicates that the kind of feedback used to retrieve the user preferences does not affect user engagement, since the interaction time with the robot does not decay with time. This analysis also shows that the number of updates of each activity is not affected by how the robot obtains user feedback, which suggests that more intrusive methods, like continuously asking the user to rate the activity, are not perceived as negative. The statistical analyses conducted in this study ignore per-subject and per-activity relevance since we treat all activities and users equally. We are aware that there may be differences between activities and subjects, but these differences are subjective and cannot be easily analyzed from the current data.

Based on the previous discussion and the three hypotheses enumerated at the beginning of the manuscript, we can state the following. The first hypothesis (H1) is partially accepted: the results show that condition C3 yields better adaptive results than C1 and C2, but no significant differences could be found regarding user engagement and activity exploration. The second hypothesis (H2) is validated, since the results show that combining implicit and explicit feedback in C3 is the best way to learn user preferences to personalize HRI. Finally, the third hypothesis (H3) is partially accepted: C1 provides good learning results since it uses real user feedback, but it does not impact activity execution.

We also statistically analyzed whether gender, age, and the other factors obtained from the demographic survey impacted the results of our experiment, but we did not observe any effect of these factors. For the specific case of gender, we think that the kind of activity might impact the feedback provided; however, the low number of women in the sample makes it difficult to obtain results in line with this hypothesis. A similar issue occurs with age: since the recruited participants were all university students of similar ages, we have no clear indication that this factor influences how the robot obtains user feedback to update the activity preferences.

6.1 Limitations

The learning system described in this contribution is affected by design factors and the chosen task. Next, we enumerate the limitations of this work:

  1. In tabular RL methods, such as the one used in this work, the number of actions greatly impacts learning speed and tractability: the learning process becomes slower as the number of actions increases. Thus, we propose using function approximation methods to speed up the learning process in further tests. In our approach, the method used does not affect the final values because the feedback does not change.

  2. The adaptation rate is controlled by \(\alpha \), a practical value that regulates how fast the error is corrected. High \(\alpha \) values can prevent the system from fitting the Q-value properly, while very low values slow down the adaptive mechanism. This work sets \(\alpha \) to 0.5, which limits the error correction per time step. Although this and other values modify the learning performance, the designer can tune them to make the system work as expected.

  3. Activity exploration balances the update process of all actions equally. In this study, we apply the Boltzmann distribution with a low Temperature value of 5 units to explore all activities at the beginning of the experiment while promoting the selection of preferred activities as the experiment progresses. This method introduces randomness in activity selection since it is based on probabilities generated from the user preferences (Q-values). Consequently, as mentioned earlier, the randomness in action selection and the user preferences may leave some activities unexplored, which might subtly affect the Spearman metric and the RMSE.

  4. The design of the reward function is key to learning correctly. In our method, the interaction metrics define the implicit feedback reward. We know that the choice of interaction metrics may affect the reward value; therefore, we expect to explore their impact in future studies. Besides, in this scenario, we give more weight to two metrics (user selection and activity result) because we believe they are more important than the execution time. However, the designer can change these weights or the interaction metrics depending on the application.

7 Conclusion

The results of this study show that combining the users’ explicit and implicit feedback to learn their preferences toward a group of activities and personalize the interaction improves on using explicit or implicit feedback individually. This indicates that the robot can obtain more information from the interaction to improve HRI. The results also show that the kind of feedback does not affect activity exploration or user engagement, suggesting that including explicit questions does not influence the execution of activities. From these results, only hypothesis 2 (H2) can be entirely accepted; the other hypotheses (H1 and H3) can only be partially accepted since not all their assumptions are confirmed by the data analyses.

Considering the positive results of this study, we would like to investigate these methods to produce adaptive sessions in assistive robotics, extend the robot’s activity repertoire, and explore the role of user feedback in other areas and applications. We would also like to combine our system with a preference predictor developed in previous work [17] so that the robot can start personalizing activities from a user-oriented prediction rather than from scratch, and evaluate which factors influence user feedback in HRI. Finally, it would be interesting to explore Continuous and Active Learning methods, which might be a good alternative for solving preference learning from user feedback.