1 Introduction

Robot learning from demonstration (RLfD) is a technique whereby a robot derives a mapping from states to actions, also known as a policy, from instructors’ demonstrations [7]. This technique has successfully taught robots physical skills by imitating instructors’ body movements, e.g., pole balancing [7], tennis swings [31], air hockey maneuvers [9], etc. A standard RLfD process takes two steps: a demonstration gathering step, which collects demonstrations from the human demonstrators, and a policy deriving step, which infers the underlying state-action mappings [4]. Like a human learner, a robot in RLfD can adopt different strategies for gathering demonstrations depending on its underlying policy derivation algorithm. For example, robots with the DAgger algorithm [50] learn progressively by taking incremental demonstrations from instructors, much like going through a scaffolding process [32, 51]. A robot can also learn more proactively. For example, if equipped with Confidence-Based Autonomy (CBA) [17], an interactive algorithm for RLfD, a robot can request demonstrations at states about which it has little or no knowledge. These learning strategies have been proven to be very effective and are thus widely adopted in RLfD [40].

Unlike human learners, robots in previous RLfD processes rarely show any engagement cues during the learning process. They mostly remain stationary without giving any feedback, especially when instructors are giving demonstrations (i.e., in the demonstration gathering step). In human tutelage, engagement cues play an important role in shaping instructors’ mental models of the learners [58]. For example, learners’ attentional engagement, e.g., gaze, indicates their points of interest in the instructions; attention direction is one of the essential mechanisms that contribute to the learning process. Imitation, a gesture engagement cue, shows learners’ motivation to perform like their instructors [16]. Learner engagement cues have been reported to affect instructor perceptions and behavior [25]. For example, in educational research, instructors are found to provide more support to learners who show high gesture engagement [58]. Gaze and gesture engagement are also reported to be specific to the learning process [16], whereas other cues are common to interactions in general.

The effects of showing such learning-specific engagement cues, however, are less explored in RLfD research, partly because designing engagement cues for robots in the context of RLfD is challenging. First, most existing methods for generating engagement cues in human–robot interaction (HRI) cannot be directly applied to RLfD. For example, it is common practice in HRI to simulate robots’ attentional engagement by directing their gaze towards visually salient elements (e.g., color or lightness [45]), specific objects (e.g., human faces [54]), or predefined events (e.g., pointing gestures [11]). This practice cannot be easily set up in RLfD because the robot’s allocation of attention should follow the development of the instructor’s demonstrations. This is especially true in skill-oriented RLfD, where the robot needs to reproduce body skills from the human demonstrator. In this context, attention should be driven by the demonstrations, i.e., body movements, which are less constrained and highly dynamic compared with a standard HRI process. Methods for generating other engagement cues, e.g., imitation [8, 41, 49], also need further adaptation to accommodate the dynamic nature of RLfD, especially when the human body and the robot structure are not aligned (a.k.a. the correspondence problem [4]). Second, even if an engagement cue can be designed effectively, its deployment in RLfD must run in real time at low computational cost.

To adapt and improve existing attention and imitation methods for use in RLfD, we focus on skill-oriented RLfD and propose two novel methods (Instant attention and Approximate imitation) that enable robots to communicate their learning engagement during an RLfD process. Note that we consider the demonstration gathering step as the interaction scenario since it determines the demonstration quality, which is crucial for policy optimality [4, 61]; we do not focus on designing effective learning algorithms for demonstration learning. The learning engagement cues are generated as follows: the Instant attention method generates robot gaze engagement by tracking instructors’ body movements through particle filters; the Approximate imitation method produces gesture engagement, i.e., imitation, by partially mapping the instructor’s joint movements to those of the robot with approximations. We then use the proposed methods to generate three modes of engagement communication (via attention, via imitation, and a hybrid of the two) for robots in RLfD. To investigate the effects of the three engagement modes on humans, we compare them with a fourth mode (“without engagement”, in which the robot remains stationary, as most robots do in existing RLfD studies [7, 9, 31]) in a within-subject user study in a simulation environment. Results suggest that robots with the proposed cues are perceived to be more engaged in the learning process, and their behaviors are more socially acceptable in RLfD, than robots without them. Having engagement cues also significantly affects humans’ estimation of the robots’ learning capabilities, making their estimation over-optimistic: robots that communicate engagement in RLfD are perceived to be significantly more capable of learning than robots that do not, even though none of them are equipped with imitation learning algorithms. Engagement communication also affects humans’ expectations towards the final learning outcomes. Furthermore, gesture cues influence humans’ perceptions significantly more than gaze engagement does, while the hybrid cues significantly outperform the other two. We also find that showing gesture or combined engagement significantly improves humans’ evaluation of demonstration quality: participants perceived the demonstrations to be significantly more appropriate for the robot to learn from when the robot communicated its engagement via gesture or mixed engagement, even though all demonstrations were of the same quality.

The contributions of this paper are as follows. First, we propose two novel algorithms that allow robots to generate attention and imitation behavior to communicate their learning engagement at low computational cost in RLfD. Second, we develop a simulation platform to evaluate the effect of engagement communication in RLfD. Third, we take a first step towards evaluating the effects of three types of engagement cues (attention, imitation, and hybrid) on humans. Through evaluation in a simulation environment with a humanoid robot learning different skills from a simulated demonstrator, we report findings relevant to the design of robot engagement communication in RLfD. To the best of our knowledge, this paper is the first to systematically investigate how robot engagement communication affects humans’ perceptions and expectations of the robot in RLfD.

2 Related Work

2.1 Robot Learning from Demonstration (RLfD)

Robot Learning from Demonstration (RLfD) is also known as “Programming by Demonstration”, “imitation learning”, or “teaching by showing” [52]. Rather than exhaustively searching the entire policy space, RLfD enables robots to derive an optimal policy from demonstrators’ (also called instructors’) demonstrations [7]. Usually, this technique does not require human instructors to have additional knowledge of programming or machine learning, and thus opens up new possibilities for ordinary users to teach robots [18]. Existing studies on RLfD focus mainly on policy derivation algorithms, e.g., mapping states to actions by supervised learning [17], updating the policy by value iteration in Reinforcement Learning [7], and recovering rewards that explain demonstrations by Inverse Reinforcement Learning [1, 61]. Some studies also design robots’ reciprocal learning feedback to communicate what the robots have learned to human teachers, e.g., demonstrating the robot’s currently learned policy [12], providing verbal and/or nonverbal cues [2, 11, 13, 36, 48, 63], or visualizing where they succeed and fail [53]. These studies, however, largely overlook how the robots’ engagement behavior affects the instructors and their demonstrations, especially during the demonstration gathering step. Hence, in this work, we consider how to generate behavior that allows robots to communicate their learning engagement to instructors, and we investigate its potential effects on RLfD.

2.2 Engagement and Learning Engagement Cues

Engagement is a broad concept in HRI with many different definitions. Some studies consider the whole spectrum of interaction and define engagement as the process of initiating, maintaining, and terminating the interaction between humans and robots [55]. Others narrow the notion of engagement down to the maintenance of interactions, interpreting engagement as humans’ willingness to stay in the interaction [65, 69].

In the context of learning, engagement mainly refers to the state of being connected in the learning interaction, which can be measured from three aspects: cognition, behavior, and emotion [56]. Cognitive engagement is closely related to the allocation of attention, as attention is one of the most important cognitive resources [47]; failure to attend to another person indicates a lack of interest [5]. Thus, we adopt attention as a cue to communicate cognitive engagement in RLfD. Gesture engagement is captured by task-related behavior, e.g., task attempts, efforts, active feedback, etc. Imitation, a common gesture engagement signal, refers to “non-conscious mimicry of the postures, mannerisms, facial expressions, (speech), and other behaviors of one’s interaction partners” [14]. In interpersonal communication and HRI, imitation behavior increases the likelihood of understanding [15], interpersonal coordination [10], and emotional contagion [27]. In the context of learning, imitation behavior also indicates the robot’s internal learning status, e.g., its progress and motivation [16]. Thus, we use imitation to communicate gesture engagement for robots in RLfD. Emotional engagement is associated with the affective states evoked by the interaction, including valence and arousal. Despite its importance, emotional engagement is hard to apply in RLfD since most existing RLfD robot systems lack the ability to express emotions. In the scope of this paper, we define robot learning engagement as the involvement in the learning process, with a focus on cognitive engagement, i.e., attention, and gesture engagement, i.e., imitation. The following subsection presents related work on generating attention and imitation behavior to communicate engagement.

2.3 Robots’ Communication of Engagement

In HRI, a robot can communicate its attention via different physical channels, e.g., gaze [38, 42, 43, 44], head orientation [42, 63], and body postures [66]. Regardless of the channel, robots are usually programmed to pay attention to salient elements, including but not limited to colors [11], objects with high visual intensity [45], and movements [11, 45]. For example, Nagai et al. regarded visually outstanding points in the surroundings, in terms of their colors, intensities, orientations, brightness, and movements, as points of attention [45]. Other work directs robots’ attention to specific objects, e.g., human faces [54] and colorful balls [3], or predefined events, e.g., pointing gestures [42]. For example, Sidner et al. designed a robot that pays attention to participants’ faces most of the time [54]. Lockerd et al. drove the robot attention mechanism with interaction events, such as looking at an object when it is pointed at or looking at a subject when the person takes a conversational turn [42]. To accommodate multiple events, a state transition diagram is usually adopted to control attention shifts [11, 42]. Though these studies provide insightful information about the design of robot attention, their approaches do not easily apply to skill-oriented RLfD because the point of attention in the instructor’s body movements changes dynamically. In an RLfD process, the robot may be required to learn how to manipulate a specific object or to perform a specific action, e.g., walking like a human being. In the first case, engagement could be communicated by paying attention to the object while it is being manipulated. In the second case of learning an action or a skill, however, it is hard to define the salient elements, as the whole demonstrated motion is of interest.

Compared to attention, imitation behavior has been less widely adopted as a robot engagement cue. Having a robot imitate a human participant’s behavior in real time is inherently challenging due to the correspondence problem [4] as well as the robot’s physical constraints [34, 35, 60]. Hence, instead of generating full-body imitation behavior, some HRI researchers proposed partial imitation. For example, Bailenson and Yee built an immersive virtual agent that subtly mimicked people’s head movements in real time [8]. A similar imitation strategy was applied by Riek et al. to a chimpanzee robot [49]. In addition to head imitation, gesture “mirroring” has also been implemented by Li et al. on a robot confederate [41]. Although these studies showed that partial imitation behavior improves participants’ perception of robots’ capabilities [21, 23], they mainly used rule-based methods [8] or predefined behavior [41], which may not be transferable to RLfD scenarios. In this work, we employ the same strategy and allow robot learners to perform partial imitations. Different from existing work, we take an algorithmic approach to automatically generating approximate imitations of instructors’ body movements for robots in real time.

3 Learning Engagement Modeling

This section presents two methods for generating engagement cues. The first subsection briefly introduces human body poses and forms the basis of the proposed methods. The remaining subsections describe the methods in detail.

3.1 Representation of the Body Pose

In RLfD, instructors usually demonstrate skills via their body movements. Our proposed methods thus use human body poses to generate attention and imitation behavior. A body pose is usually depicted by a tree-like skeleton, with nodes as human joints and links as body bones (shown in Fig. 1). Mathematically, this skeletal structure can be represented in two forms: the position form and the transformation form.

Position form The position form describes the body pose in a single frame of reference (usually the sensor frame), as shown in Fig. 1a. In this form, the pose skeleton is denoted as \([J^{(1)}, J^{(2)},\ldots , J^{(n)}]\), where \(J^{(i)} \in {\mathbb {R}}^{3}\) is the position vector of the i-th joint in the skeleton, and n is the number of joints. This form gives each joint its global position, providing the potential attention point for the robot. Hence it is used for the Instant attention algorithm to generate robot attention points.

Transformation form The transformation form describes the body pose in a series of frames of reference [62], as shown in Fig. 1b. In particular, each joint has its frame (a right-handed frame), and the links in the tree-like skeleton define parent–child structures between frames. The pose of a non-root joint is then described by a translation (i.e., the bone length) and a rotation (i.e., joint movement) in its parental frame, with the root joint (often the hip joint) described in the sensor frame. This form decomposes a human body movement into joint rotations (body-independent) and joint translations (body-dependent) in a way that the movement can be easily imitated by robots: just mapping the rotations onto robot joints. We denote this form as \([T_1, T_2, T_3, \ldots , T_n]\), and use it for the Approximate imitation algorithm to obtain approximate imitation behavior.
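To make the two representations concrete, the following is a minimal Python sketch of how such pose data might be stored and converted. The class and field names are our own illustrations (not part of the platform described later), and the conversion assumes joints are listed in root-first order.

```python
# Minimal sketch of the two body-pose representations (illustrative names).
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class PositionPose:
    """Position form: every joint as a 3D position in the sensor frame."""
    joint_positions: List[np.ndarray]      # n vectors, each of shape (3,)

@dataclass
class TransformPose:
    """Transformation form: each joint as a rigid transform in its parent's frame."""
    parents: List[int]                     # parent index per joint; root uses -1
    local_transforms: List[np.ndarray]     # n homogeneous 4x4 matrices

def to_global_positions(pose: TransformPose) -> PositionPose:
    """Recover global joint positions by chaining parent transforms from the root.

    Assumes joints are listed in topological (root-first) order.
    """
    globals_ = [None] * len(pose.parents)
    for i, parent in enumerate(pose.parents):
        local = pose.local_transforms[i]
        globals_[i] = local if parent < 0 else globals_[parent] @ local
    return PositionPose([T[:3, 3].copy() for T in globals_])
```

In this reading, the transformation form is what Approximate imitation consumes, while the recovered global positions correspond to the position form used by Instant attention.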

Fig. 1

a A body pose in the position form: all joints are described in a single frame by their positions; b A body pose in the transformation form: each joint has its own frame, and the skeleton defines parent–child structures and translations between frames; the frame with xyz labels is the root frame and is expressed in the sensor frame

3.2 Instant Attention

The gaze engagement of the robot is generated based on cognitive theories of human attention. Generally speaking, the generation of human visual attention involves two stages [33]: first, attention is distributed uniformly over the visual scene of interest; then, it is concentrated on a specific area (i.e., it is focused) to gain information [20]. In a skill-oriented RLfD process, the instructor demonstrates skills mainly through their body joint movements. The above mechanism thus translates to tracking the human joints of interest uniformly at the initial stage and then picking, as the attention point, the joint that provides the most information for learning. For demonstration learning, the more predictable (track-able) a body joint’s movement is, the less information the robot can gain from that part, and consequently, the less attention the robot should pay to it. In other words, the joint that moves furthest out of expectation among all joints is the one most worth attending to.

To this end, we use the particle filter (PF), as it is robust and effective in prediction [37] and tracking [6]. In short, a PF is a Bayesian filter that uses a group of samples to approximate the true distribution of a state [68]. In particular, given the state observations, the PF employs many samples (called particles) to describe the possible distribution of that state. The particles are denoted as

$$\begin{aligned} X_{t} := x_{t}^{[1]}, x_{t}^{[2]}, \ldots , x_{t}^{[M]} \end{aligned}$$
(1)

here M is the number of particles in the particle set \(X_{t}\). Each particle \(x_{t}^{[m]}\) (with \(1 \le m \le M\)) is a hypothesis as to what the true state might be at time t, and is first produced by a prediction model \(p(x_t | z_{1:t-1})\) based on all past observations \(z_{1:t-1}\), i.e., \(x_{t}^{[m]} \sim p(x_{t} | z_{1:t-1})\). At each update stage, particle \(x_{t}^{[m]}\) is then re-sampled according to its importance weight \(w_{t}^{[m]}\), i.e., the probability that the particle \(x_{t}^{[m]}\) is consistent with the current observation \(z_{t}\): \(w_{t}^{[m]} = p(z_{t} | x_{t}^{[m]})\). In other words, each \(x_t^{[m]}\) survives into the next stage with probability \(w_{t}^{[m]}\). For more details on the particle filter, refer to [68].

We apply one PF to track each relevant joint during the human demonstration. Specifically, state \({\mathbf {x}}_{t}^{[m]} \in {\mathbb {R}}^3\) describes the joint position in the sensor frame. We assume the state transits with additive Gaussian noise:

$$\begin{aligned} {\mathbf {x}}_{t}^{[m]} \sim {\mathbf {x}}_{t-1}^{[m]} + \varDelta _{t-1} + {\mathcal {N}}\big (\mathbf{0} , \sigma _{t} {\mathbf {I}} \big ) \end{aligned}$$
(2)

where \({\mathbf {x}}_{t}^{[m]}\) denotes the predicted joint position vector, \(\varDelta _{t-1}\) is the observed joint shift: \(\varDelta _{t-1} = J_{t-1} - J_{t-2}\) (\(J_{t}\) refers to the observed joint position at time t); and \({\mathcal {N}}\big ( \mathbf{0} , \sigma _{t} {\mathbf {I}} \big )\) is the multivariate normal distribution with zero mean and diagonal covariance matrix \(\sigma _{t} {\mathbf {I}}\). The importance factor for each particle is defined to decay exponentially with the squared Euclidean distance between the predicted and observed joint positions:

$$\begin{aligned} w_{t}^{[m]} = \eta e^{-2 \big ( {\mathbf {x}}_{t}^{[m]} - J_{t} \big )^{T} \big ( {\mathbf {x}}_{t}^{[m]} - J_{t} \big )} \end{aligned}$$
(3)

where \(\eta \) is the normalizer. Each joint in the body pose is tracked by a particle cloud, a group of particles \({\mathbf {X}}_{t}\). In order to dynamically adjust the cloud size in accordance with the joint movement, the variance \(\sigma _{t}\) is set to be proportional to the average squared Euclidean distance between the predicted and observed joint positions:

$$\begin{aligned} \sigma _{t} = \frac{\alpha }{M} \sum _{m} \big [ \big ( {\mathbf {x}}_{t}^{[m]} - J_{t} \big )^{T} \big ( {\mathbf {x}}_{t}^{[m]} - J_{t} \big ) \big ] \end{aligned}$$
(4)

where \(\alpha \) is a hyper-parameter and M is the number of particles. \(\sigma _{t}\) indicates the cloud size: the greater \(\sigma _{t}\) is, the more attention the robot should pay to the associated joint. Thus, the joint with the maximum \(\sigma _{t}\) corresponds to the attention point. In the experiments, \(\alpha \) is set to 0.02, which gave the best tracking of human joints.
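For concreteness, the per-joint update defined by Eqs. (2)–(4) can be sketched as follows. This is a minimal NumPy illustration under our reading of the equations; the function signature and the multinomial resampling scheme are our own assumptions.

```python
import numpy as np

def update_joint_particles(particles, sigma, prev_obs, obs, alpha=0.02, rng=np.random):
    """One Instant-attention update for a single joint (Eqs. 2-4).

    particles: (M, 3) array of hypothesized joint positions
    sigma: current cloud size (noise variance) for this joint
    prev_obs, obs: previous and current observed joint positions J_{t-1}, J_t
    Returns the resampled particles and the new cloud size sigma_t.
    """
    M = len(particles)
    delta = obs - prev_obs                                  # observed joint shift
    # Eq. (2): predict each particle with the observed shift plus Gaussian noise
    predicted = particles + delta + rng.normal(0.0, np.sqrt(sigma), size=(M, 3))
    # Eq. (3): importance weights decay exponentially with squared prediction error
    sq_err = np.sum((predicted - obs) ** 2, axis=1)
    weights = np.exp(-2.0 * sq_err)
    total = weights.sum()
    weights = weights / total if total > 0 else np.full(M, 1.0 / M)
    # Resample: each particle survives in proportion to its weight
    resampled = predicted[rng.choice(M, size=M, p=weights)]
    # Eq. (4): new cloud size, proportional to the mean squared prediction error
    new_sigma = alpha * np.mean(sq_err)
    return resampled, new_sigma
```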

Figure 2 illustrates how the PF works to generate attention. The particle cloud functions as the robot’s prediction of the joint’s future movements and changes based on the current observations. Initially, the robot predicts the movements of all body joints of interest to be the same, i.e., all clouds are of the same size. During a demonstration, when a joint moves out of its cloud region, beyond the robot’s prediction, the cloud grows to catch that movement, and the robot will thereafter be likely to pay attention to that joint. Likewise, if the joint movement is small, within the robot’s prediction, or there is no movement at all, the cloud shrinks, and chances are small that attention will be given to that joint. Overall, the cloud size indicates the predictability of the instructor’s body movements as well as the level of attention the robot needs to pay. At each time step, the joint with the biggest cloud is picked as the attention point. This process loops with every new body pose, as shown in Fig. 3.

Fig. 2

The particle clouds evolve: a all clouds are initialized at the same size; b if the joint movement is small, the cloud shrinks: the picked cloud becomes smaller since the elbow did not move; c if the joint moves out of its cloud region, the cloud grows to catch the movement: the picked cloud becomes larger to adapt to the elbow’s movements

Fig. 3

The flow chart of the Instant attention

We now present a practical algorithm, Instant attention, to generate gaze engagement for robots in real time (Algorithm 1). The algorithm takes TrackedJoints \(JSet_{\text {tracked}}\) and the BodyPose in the position form \([J_{t}^{(1)}, J_{t}^{(2)}, \ldots , J_{t}^{(n)}]\) as input, and outputs one attention point at each time step. Specifically, TrackedJoints contains the joints that need to be tracked. In practice, the joints to be tracked are task-dependent and should be defined according to the possible attention points on the instructor’s body. For example, a cooking robot may only need to track the instructor’s upper-body movements, and the joint correspondence can be configured by the developers based on the robot’s physical structure. The other input, BodyPose, is the human body pose in the position form. The algorithm runs as follows: first, it initializes a particle filter with the same covariance for each tracked joint (lines 2–4). Then it estimates the distribution of the next joint position (lines 9–11), followed by a correction of the estimate given the current position observations (lines 12–13). Finally, the algorithm adjusts the covariance of the noise distribution to capture the joint movement (line 14), and the attention point is found by selecting the joint with the maximum covariance value (line 15).

Algorithm 1: Instant attention
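Since the listing above appears as an image in the original, the following sketch illustrates our reading of the overall flow of Algorithm 1, reusing the per-joint update sketched earlier. The initialization values and the data layout (dicts of joint positions as NumPy arrays) are assumptions.

```python
import numpy as np

def instant_attention(tracked_joints, pose_stream, num_particles=100, init_sigma=0.01):
    """Yield one attention point per incoming body pose (sketch of Algorithm 1).

    tracked_joints: names of the joints to track (task-dependent)
    pose_stream: iterable of dicts mapping joint name -> np.ndarray of shape (3,)
    """
    filters = {}            # joint name -> (particles, sigma)
    prev_pose = None
    for pose in pose_stream:
        if prev_pose is None:
            # Initialization: every tracked joint starts with the same cloud size
            for j in tracked_joints:
                particles = np.tile(pose[j], (num_particles, 1)).astype(float)
                filters[j] = (particles, init_sigma)
            prev_pose = pose
            continue
        # Predict, correct, and adapt the cloud for each tracked joint
        for j in tracked_joints:
            particles, sigma = filters[j]
            filters[j] = update_joint_particles(particles, sigma, prev_pose[j], pose[j])
        prev_pose = pose
        # The joint with the largest cloud (sigma) becomes the attention point
        attention_joint = max(tracked_joints, key=lambda j: filters[j][1])
        yield attention_joint, pose[attention_joint]
```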

Once an attention point \(P_a\) is generated, note that it is expressed in the sensor frame. To obtain the attention point for the robot, a further transformation is required. Figure 4 illustrates how to transform \(P_a\) in the sensor frame \(T_S\) into the robot head frame \(T_R\), given the transformation \(T_{RS}\) from \(T_S\) to \(T_R\).

Fig. 4

The attention point \(P_{a}^{S}\) is located in the sensor frame \(T_{S}\). We need to do the transformation \(P_{a}^{R} = T_{RS}P_{a}^{S}\) to get \(P_{a}^{R}\) in the robot head frame \(T_{R}\), where \(T_{RS}\) is the transformation from \(T_{S}\) to \(T_{R}\)
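The frame change itself is a single homogeneous-transform multiplication; the following minimal sketch (with an illustrative function name) shows the computation \(P_{a}^{R} = T_{RS}P_{a}^{S}\).

```python
import numpy as np

def to_robot_head_frame(p_a_sensor, T_RS):
    """Express an attention point, given in the sensor frame, in the robot head frame.

    p_a_sensor: (3,) attention point P_a^S in the sensor frame
    T_RS: 4x4 homogeneous transform from the sensor frame T_S to the head frame T_R
    """
    p_homogeneous = np.append(p_a_sensor, 1.0)   # switch to homogeneous coordinates
    return (T_RS @ p_homogeneous)[:3]            # P_a^R = T_RS * P_a^S
```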

The Instant attention method has several advantages. First, unlike other mechanisms (salience-based, object-based, or event-based), this method uses particle clouds to track the instructor’s joint movements and automatically produces attention points based on the information gained from the movements. Second, the attention point is generated and shifted with very little abruptness because the spatial size of the cloud evolves smoothly. Specifically, the particle distribution \(p(X_{t})\) is iteratively sampled from the previous distribution \(p(X_{t-1})\) according to the importance weights \(w_t\), i.e., a particle \(x_{t-1}^{[m]}\) in \(X_{t-1}\) survives into \(X_{t}\) with probability \(w_t^{[m]}\), even if the joint moves abruptly (i.e., \({\mathbf {x}}_{t}^{[i]} - J_t\) is large). Third, the joints to be tracked can be changed dynamically, offering a flexible and adjustable attention mechanism based on the RLfD task. Furthermore, if an object is involved in the demonstration and needs to be considered, we can simply treat the object as an additional “joint”, or point of interest (POI) in general terms, and track its movements with a new set of particles. The particle-based engagement generation mechanism also applies to more general cases: the joints considered in the evaluation experiments could be generalized to a wide range of objects, each of which could be a POI tracked by its own set of particles.

3.3 Approximate Imitation

Behavior imitation in robotics is usually formulated as an optimization problem, which requires finding the joint correspondence first [4] and then solving the inverse kinematics for the robot structure [24]. Both processes are difficult, computationally intensive, and robot-configuration-dependent, and hence not suitable for generating imitation behavior for robots with different configurations and hardware. On the other hand, psychological studies report that people mimic behavior to communicate engagement by adopting similar postures or showing similar body configurations according to the context [14]. We thus relax the behavior imitation problem as follows. First, the robot is not required to search blindly for the best joint correspondence, since the joint correspondence is task-dependent; we allow the user to explicitly specify the joint correspondence according to the RLfD context. Second, for robot joints whose degrees of freedom (DoF) do not match the corresponding human joint, we only set the joint angles for the available robot joints to approximate the human movements. Though this approximation may not be optimal in the sense of behavior mimicry, it runs fast enough for real-time generation of gesture engagement, achieving a balance between simplicity and optimality.

To achieve this, we propose the Approximate imitation algorithm, which allows robots to generate motions similar to the demonstrator’s for the specified joints. Given the joint correspondence, the algorithm runs in two steps: frame transformation and rotation approximation, as presented in Fig. 5.

Fig. 5

The flow chart of the Approximate imitation

The frame transformation transforms the instructor’s body pose to match the robot frames. Specifically, we leverage the transformation form of body poses to decompose the frame matching into two steps: rotation alignment and then translation alignment. The rotation alignment rotates the human joint frames so that their axes are aligned with the robot joint frames, as shown in Fig. 6a; the translation alignment translates the human joint frames within their parent frames so that the initial skeletal structure of the demonstrator’s body matches the robot’s initial configuration, as shown in Fig. 6b. To sum up, we represent the rotation alignment as \(T_R\) in the joint frame \(\{H\}\), and the translation alignment as \(T_p\) in the parent frame \(\{H_p\}\) of \(\{H\}\) (both represented as homogeneous transformations). Then, for \(\{H\}\), its frame transformation is \(T_{H}^{H_p}T_p\{H\}T_R\), where \(T_{H}^{H_p}\) is the transformation from \(\{H_p\}\) to \(\{H\}\).

Fig. 6

Frame transformation. a Rotation alignment: aligning the local frame \(\{H\}\) of the human body pose with the corresponding robot joint frame \(\{R\}\) by rotation matrix R. The aligned local frame is \(\{H'\}\). b Translation alignment: translating \(\{H'\}\) in its parent frame by \(T_p\) to match the corresponding robot frame \(\{R\}\) so that the human pose link \(p_H\) is aligned with the robot link \(p_R\)

Since the DoF of a robot joint may not equal the DoF of its corresponding human joint, we cannot always obtain an exact movement mapping. Instead, we use the robot joint to approximate the human joint rotations as follows. First, a human joint rotation is converted into Euler angles, \((\theta _{\text {roll}}, \theta _{\text {pitch}}, \theta _{\text {yaw}})\). Second, if the DoF of the robot joint is 3 (roll, pitch, and yaw) and exactly matches the human DoF, the conversion is straightforward: rotate the robot joint by roll first, then pitch, and finally yaw. If the DoF of the robot joint is 2 (e.g., roll and pitch), the conversion is approximated by rotating with roll first and then pitch. If the DoF of the robot joint is 1 (e.g., roll only), rotate with roll only. For example, in Fig. 7, the robot arm has the same structure as the demonstrator’s but different joint DoF, as shown in Fig. 7a and b. It can approximate the instructor’s left arm movement by first converting the rotation \(T_S\) into Euler angles \((\theta _{roll}, \theta _{pitch}, \theta _{yaw})\), and then setting the shoulder roll joint to \(\theta _{roll}\) and the pitch joint to \(\theta _{pitch}\), ignoring \(\theta _{yaw}\), as shown in Fig. 7c.

Fig. 7

Rotation approximation: a the instructor’s left shoulder has a DoF of 3 and its transformation is \(T_S\); b the robot shoulder joint has a DoF of 2: roll and pitch; c the robot rotates its shoulder roll joint by \(\theta _{roll}\) and then its pitch joint by \(\theta _{pitch}\), without considering \(\theta _{yaw}\)
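The DoF-based truncation described above can be sketched as a small helper. The function name, and the assumption that the available axes are always applied in roll–pitch–yaw order, are ours.

```python
def approximate_rotation(euler_rpy, robot_dof_axes):
    """Keep only the Euler angles that the robot joint can realize.

    euler_rpy: (roll, pitch, yaw) extracted from the aligned human joint rotation
    robot_dof_axes: the axes available on the robot joint, e.g. ("roll", "pitch")
    Returns the joint-angle commands, e.g. {"roll": 0.4, "pitch": -0.1}.
    """
    angles = dict(zip(("roll", "pitch", "yaw"), euler_rpy))
    return {axis: angles[axis]
            for axis in ("roll", "pitch", "yaw") if axis in robot_dof_axes}
```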

We now present the Approximate imitation algorithm (Algorithm 2). The algorithm takes the joint correspondence JointCorrespondence and the instructor’s body pose JointMovement in the transformation form as input, and outputs the joint configurations JointConfigs for the robot. Specifically, JointCorrespondence defines the joint mapping \(\{J^H_i \rightarrow J^R_i\}\) from human joint \(J^H_i\) to robot joint \(J^R_i\) for a subset of joints. JointMovement is represented as a series of transformations along the skeletal structure, \([T_1, T_2, \ldots , T_n]\) (see Sect. 3.1 for more details). The algorithm runs as follows: first, it calculates the frame transformations from \(J^H\) to \(J^R\), and saves the rotation and translation alignments in \(Rotation\_align\) and \(Translation\_align\) (lines 3–5). Then, for each joint movement \(T_i\) in \([T_1, T_2, \ldots , T_n]\), the algorithm transforms it into the corresponding robot frame \(T_i^\prime \) by translation and rotation alignment, followed by a conversion into Euler form (lines 7–8). The algorithm proceeds by selecting the appropriate angles from \(\theta _{roll}\), \(\theta _{pitch}\), and \(\theta _{yaw}\) for the robot joint according to the DoF of that joint (lines 9–16). The joint configurations are saved in \(q_R\) and returned as the final output.

Algorithm 2: Approximate imitation
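As with Algorithm 1, the listing is an image in the original; the sketch below reflects our reading of Algorithm 2, reusing the approximate_rotation helper above. The exact composition order of the alignment transforms and the use of SciPy for the Euler conversion are assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def approximate_imitation(joint_correspondence, joint_movements,
                          rotation_align, translation_align):
    """Map instructor joint transforms to robot joint angles (sketch of Algorithm 2).

    joint_correspondence: dict human_joint -> (robot_joint, available_dof_axes)
    joint_movements: dict human_joint -> 4x4 local transform T_i (transformation form)
    rotation_align, translation_align: per-joint 4x4 alignment transforms
    """
    joint_configs = {}
    for h_joint, (r_joint, dof_axes) in joint_correspondence.items():
        # Frame transformation: express the human joint movement in the robot joint frame
        T = translation_align[h_joint] @ joint_movements[h_joint] @ rotation_align[h_joint]
        # Rotation approximation: convert the aligned rotation to Euler angles ...
        roll, pitch, yaw = Rotation.from_matrix(T[:3, :3]).as_euler("xyz")
        # ... and keep only the angles the robot joint can realize
        joint_configs[r_joint] = approximate_rotation((roll, pitch, yaw), dof_axes)
    return joint_configs
```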

The Approximate imitation method has several advantages for generating imitation behavior for robots in RLfD. First, the algorithm runs in real time, as the imitation is only a partial mapping of the instructor’s body poses. In particular, we take advantage of local transformations of body poses to avoid solving inverse kinematics over all robot joints, which is computationally intensive and may not have closed-form solutions. Also, instead of finding the exact mapping for robot joint angles, we set configurations based on the DoF of each robot joint to achieve a similar motion trend. This conversion may sometimes distort movements, but the directions and trends are still captured (as reflected in Sect. 4). Second, this method is generic and applicable to standard skill-oriented RLfD. Depending on the RLfD scenario, we can also assign different joint correspondences to perform a partial imitation. For other types of RLfD, e.g., object-related demonstrations or goal-oriented learning from demonstrations, we can also apply the proposed method to generate approximate imitation based on the object or the goal. Specifically, we can replace the joint transformations with the poses of the object or the goal, and generate the target \(\theta _{roll}\), \(\theta _{pitch}\), and \(\theta _{yaw}\). Then we can use inverse kinematics solvers to calculate a set of joint configurations that move the robot’s end-effector to the target pose \((\theta _{roll}, \theta _{pitch}, \theta _{yaw})\). Based on the DoF and the space constraints of the robot’s end-effectors, we can make similar approximations so that the end-effector achieves only the roll pose, the roll and pitch pose, or the complete target pose.

4 Evaluation

This section first introduces our RLfD simulation platform, then describes a preliminary study for determining the timing of the imitation behavior, and finally presents the main user study.

4.1 RLfD Simulation Platform

Our RLfD simulation platform is composed of a virtual human instructor and a robot, as shown in Fig. 8a and b. The virtual human instructor performs different yet controlled types of movement skills, while the robot (a Pepper) needs to capture motion and learn skills from the instructor. Both parties stand facing each other in a simulated 3D space, as shown in Fig. 8c.

Fig. 8

RLfD simulation platform: a the simulated human instructor; b the virtual Pepper robot; c the instructor and robot are facing towards each other for teaching and learning; d platform composition

The simulation platform has three major components: the demonstration component, the sensing component, and the engagement component, as shown in Fig. 8d. The demonstration component determines what movements the instructor needs to perform. We exploit motion capture (MoCap) data to simulate real movements; MoCap data are recorded by 3D motion capture systems with high precision and are commonly used for simulations and animations [22]. The sensing component serves as a pose sensor, extracting body poses from the virtual instructor. This component also converts body poses between the two representations (global positions and local transformations). Finally, the engagement component controls the robot’s engagement communication. Based on the proposed algorithms, the robot can choose one of three ways to communicate engagement in RLfD: showing attention (A-mode), showing imitation (I-mode), or showing both (AI-mode). We further add one more mode, no engagement (N-mode), to evaluate the effectiveness of these three modes. In N-mode, the robot just stands near the instructor and remains stationary without any body movements; unlike in A-mode, the robot’s gaze is fixed on the demonstrator’s face and is not affected by the demonstrator’s body movements. One might wonder why, in N-mode, the robot does not focus randomly on one of the demonstrator’s joints. We argue that such a setup is just a variant of gaze engagement, except that the robot acts much less intelligently as it randomly moves its head, which could further degrade human perception. In a human learning scenario, if a teacher sees students randomly moving their heads and/or bodies, the teacher is likely to feel that the students may be listening but are quite lost, not paying attention to the right places, and thus becomes quite confused about the students’ actual learning status.

In this simulated RLfD, the tasks for the robot to learn are sports skills performed by the virtual instructor. We chose sports skills because this type of movement has often been adopted in RLfD [9, 31]. Four types of sports movements, i.e., boxing, rowing, swimming, and frisbee throwing, are selected from the CMU Graphics Lab Motion Capture Database, as these four sports involve movements of various body parts. Regarding the policy deriving algorithms, even a state-of-the-art method may fail to deliver good learning outcomes, which may, in turn, change human participants’ perception of the demonstration gathering. Thus, to minimize any side effects or biases introduced by the performance of the learning algorithms, we do not use any learning algorithms, and the robot has no actual learning ability in the demonstration gathering process. In other words, the robot only communicates its engagement when observing the human demonstrations by showing different cues; it does not learn the sports skills in the following experiments and studies.

Fig. 9

An example of how the platform works: Row 1 shows the human instructor’s real demonstration; Row 2 shows the re-targeted demonstrations on the virtual instructor; Rows 3 and 4 present the running of Instant attention and the robot showing attention (A-mode); and Row 5 presents the corresponding imitation engagement of the robot (I-mode)

Figure 9 presents an example of how the simulation platform works. The first row shows the human instructor’s real demonstration, which is then re-targeted onto the virtual instructor, as shown in the second row. The third and fourth rows present the running of Instant attention and the robot showing attention (A-mode). The last row presents the approximate imitation behavior of the robot (I-mode). We purposely rotate the 3D scene in the last two rows to get a better view of the robot communicating engagement.

We chose an online simulation rather than a field test due to the following constraints and concerns. First, due to the current limitations of RLfD techniques, demonstrators are usually required to wear motion-capture devices, stay confined to a designated space, and repeatedly showcase the target movements, which could affect their interaction with robots and their perception of the robot’s behavior. Even with current state-of-the-art vision-based methods, obtaining full-body motions with precision good enough for task learning is still very challenging: a single Kinect camera is not enough to recover the body motion precisely, and multiple cameras require cross-camera calibration, which itself is hard to set up on the fly in open-world HRI. Furthermore, as an initial attempt, we would like to study whether humans can make sense of the learning engagement cues in a controlled environment without the distractions and complications introduced by noise, jerks, etc.; if we confirm the effectiveness of the proposed methods in a controlled setting, we can then move to a realistic environment and test in the field. It is also common practice to first train and evaluate algorithms on a simulation platform to reduce costs. We thus use simulation in our experiment to avoid these side effects and unexpected outcomes, and we purposely select a viewpoint that allows the participants to have a good view of both the robot’s and the instructor’s behavior, i.e., the staging effect [67]. Second, the robot’s engagement behavior can be evaluated in a more consistent and repeatable manner in a simulation; in a field test, the instructor’s demonstrations are usually non-repeatable and could easily be influenced by the robot’s reactions, whereas the simulation allows different engagement cues to be compared without bias. Third, the simulation provides a controllable and measurable environment in which to monitor and evaluate a system’s performance with various metrics, which is often a necessity before algorithms are deployed in RLfD.

This simulation platform was built upon the Gazebo simulator and the Robot Operating System (ROS). We use the Matlab Robotics System Toolbox to facilitate the algorithm implementation.

4.2 Preliminary Study

In interpersonal communication, a person’s imitation behavior, also called mirroring behavior, often happens after the partner’s target behavior with a certain time delay [14, 30]. In this paper, we generate such mirroring behavior via the approximation mechanism. We need to determine the exact time delay so that users can correctly recognize imitation as a learning engagement cue. We therefore ran a within-subject pilot experiment to determine the appropriate timing of robot imitation relative to the target action.

Fig. 10

Results for the right timing of behavioural engagement: a participants’ ratings on robot learning behaviour, and b distribution of participants’ feedback

Manipulated variable We set time delay as the independent variable in this study and experiment with three intervals: 0.5s, 1.0s, and 2.0s. Technically, we used a buffer to store instructors’ body poses to postpone any imitation behavior. After proper setup, the buffer size was set to 15, 30, and 60 to achieve an appropriate time delay of about 0.5s, 1.0s, and 2.0s, respectively.
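Such a delay can be implemented as a simple FIFO buffer of recent poses. The sketch below is our own illustration and assumes poses arrive at roughly 30 Hz, so buffer sizes of 15, 30, and 60 correspond to delays of about 0.5 s, 1.0 s, and 2.0 s.

```python
from collections import deque

class ImitationDelayBuffer:
    """FIFO buffer that releases poses a fixed number of frames later."""

    def __init__(self, size=30):
        # e.g. size=30 gives a delay of roughly 1.0 s at ~30 poses per second
        self.buffer = deque(maxlen=size)

    def push(self, pose):
        """Store the newest pose; return the pose from `size` frames ago, if available."""
        delayed = self.buffer[0] if len(self.buffer) == self.buffer.maxlen else None
        self.buffer.append(pose)
        return delayed
```

In this setup, the robot would feed each incoming demonstrator pose to push() and imitate whatever pose the buffer returns.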

Subject allocation We recruited 30 participants (mean age: 35.5, female: 12) via Amazon Mechanical Turk (AMT) who had no prior experience with physical or virtual robots. Each participant watched three simulated RLfD videos corresponding to the three delay intervals. In the videos, the instructor was teaching the robot a sports skill, and we staged the 3D scene at a fixed angle for a better view of the robot’s imitations. We counterbalanced the presentation order of the three time delays across participants.

Dependent variables Participants watched videos showing the robot imitating the instructor with the three different time delays. They were informed that the robot was supposed to learn sports skills from the demonstrator. After each video, they were asked to rate their agreement, on a 7-point Likert scale, with the statement that the robot in the video was learning.

Figure 10 presents the average ratings and the overall rating distribution for the different time delays. We ran a repeated measures ANOVA with time delay as the factor and found a significant difference in delay-induced perception of robot learning engagement (\(F(2, 58)=88.37\), \(p<0.01\), \(\eta ^2 = .76\)). Results of the Bonferroni post hoc test suggest that the engagement rating for a delay of 1.0s is significantly higher than that for a delay of 0.5s (\(p<0.01\)) or 2.0s (\(p<0.01\)). Overall, setting the imitation time delay to 1.0s can effectively communicate robots’ learning engagement (about 70% of responses were agree or strongly agree). We apply this configuration to the Approximate imitation algorithm in the main user study.

One might wonder why the rating difference between the 0.5s and 1s delays is noticeably dramatic, even larger than the difference between the 1s and 2s delays. The cause may be the approximation mechanism adopted for generating the mirroring behavior. When the delay time is small (e.g., 0.5s), the approximate imitation algorithm generates the movement in a very responsive manner, almost at the same pace as the demonstrator’s movement, and subjects are likely to feel that the robot is showing, rather than following, the demonstrator’s movement. As the delay time becomes longer (e.g., 1s), the movement-following effect becomes more obvious, and the robot appears to be learning from the demonstrator by mimicking his/her behavior. Consequently, the ratings for communicating learning engagement are higher at 1s than at 0.5s. Such a dramatic rating difference also confirms the necessity and importance of the preliminary study in determining the appropriate delay time for the subsequent studies. Furthermore, the different ratings of the timing of the robot’s mirroring behavior confirm that it is not the complexity of the behavior that leads to different perceptions of the robot’s learning ability; rather, it is when the behavior is performed (i.e., engagement) that matters.

4.3 Main Study

To evaluate the effectiveness of engagement communication and our proposed cues on participants’ perception of the robot and the demonstration, we conducted a within-subject experiment on an RLfD simulation platform, with an additional “without engagement” condition (N-Mode) as the baseline.

4.3.1 Hypothesis

Our proposed methods generate different types of engagement cues for robots to express their engagement. Accordingly, we first hypothesize that:

H1 (1) Regardless of the actual cues used, robots that communicate engagement are perceived to be significantly more engaged in learning (H1a), and their learning behavior is significantly more socially acceptable (H1b), than those in N-mode. Further, (2) the imitation cue will receive a significantly higher engagement rating than the attention cue (H1c), while combined cues will receive the highest engagement ratings (H1d). Similarly, (3) the imitation cue will be rated significantly more acceptable than the attention cue (H1e), while combined cues will be rated the most acceptable (H1f).

According to educational theory postulating that learners’ engagement cues, especially gesture engagement, could have reciprocal effects on instructors [58], we hypothesize that:

H2 Robots communicating engagement via different cues will have significantly different influences on human participants. Specifically, (1) regardless of the cues, communicating engagement will significantly influence humans’ estimation of the robot’s learning capability (H2a) and significantly raise humans’ expectations towards the learning outcomes (H2b) compared with no communication. Further, (2) imitation cues will lead to a significantly higher estimation of the robot’s capabilities than attention cues (H2c), while combined cues will have the greatest influence (H2d). Similarly, (3) imitation cues will result in significantly higher expectations towards the learning outcome than attention cues (H2e), while combined cues will yield the highest expectations (H2f).

We further hypothesize that the robot showing different engagement behavior can affect humans’ assessment of demonstration quality. More specifically:

H3 (1) Regardless of the exact demonstrations shown to the robots, different engagement cues will influence the human participants’ assessment of the demonstration quality. Specifically, demonstrations for robots with attention cues, imitation cues, or hybrid cues will be rated as significantly more appropriate (in terms of the expected robot capabilities) than those for robots without engagement cues, even though the demonstrations are identical (H3a). Further, (2) demonstrations for robots with imitation cues and hybrid cues will receive significantly higher appropriateness ratings than those with attention cues (H3b).

In the study, these different aspects were measured via post-study questions with 7-point Likert scale answers, as shown in Figs. 11 and 12. We derived these questions based on previous research on human–robot interaction and robot learning. Specifically, the questions to measure robot engagement communication are adapted from engagement studies [59, 64]; the questions to measure participants’ expectations towards the robot’s learning capability are derived from studies on human expectations and assessment of human–robot collaboration [39]. In addition to these engagement-related questions, we asked several factual questions, including “what sports skill is the virtual human demonstrating?” and “what skill is the robot learning?”, and open questions, including “why do you have such an estimate of the likelihood of the robot mastering the sports skill?” and “any comments on the robot’s behavior”. These factual and open questions were asked in each round to collect participants’ understanding of the study materials and their opinions on the robot’s learning. We also took two steps to ensure the validity of the answers to all questions. First, the questions could only be answered after participants took the necessary actions to understand the experiment; for example, the questions measuring engagement were only visible after participants finished watching the full learning videos, and the questions measuring participants’ expectations required them to provide both answers and reasons (those who gave no reasons could not proceed to the next questions). Second, all answers were manually checked to reject invalid responses, e.g., responses with the same answer to every question or with vague and inconsistent comments. In total, we removed only 4 responses, whose comments were: “OK” (subject #12), “I do not care” (subject #5), “none of my business” (subject #2), and “I do not like robots” (subject #23).

4.3.2 User Study Design

The study consisted of five sessions: one introductory session and four experimental sessions. The introductory session requested demographic information and presented a background story to engage users: the participant manages a team of four robots for an Olympic game and needs to assess the robots’ performance while they are under a professional coach’s tutelage. This session also presented all the robots at the same time and showcased their learning capability with two video clips: one showing a different set of demonstrations given by a human demonstrator (dribbling a basketball) and the other showing how the robot could learn such a dribbling motion (learned by behavioral cloning). Participants could only proceed to the next sessions after finishing both videos; this introductory session ensured that participants were fully aware of the robot’s learning capabilities. In the experimental sessions, participants watched the human instructor’s movements first and then monitored the robot’s learning process on the RLfD simulation platform. After each session, participants were required to fill in post-study questionnaires. Each session tested one mode, and modes were counterbalanced with learning tasks: we randomized the order of the engagement modes and the four physical skills to ensure that each mode applied evenly across the different skills and each skill occurred evenly across the different modes. We recruited 48 participants from Amazon Mechanical Turk (AMT) (mean age: 30.9, female: 6), with no prior experience with teaching robots and no participation in the preliminary study. (Note that some participants reported prior experience interacting with physical/virtual robots in their answers to the open-ended questions. Our participants are representative in terms of their exposure to HRI and RLfD, and our findings could potentially generalize to young adults with good digital literacy. We acknowledge that our participants lack diversity in age and education level, and we are interested in exploring how different user populations may perceive and react to RLfD in the future.) Each subject received two dollars as compensation for his/her contribution.

During the experiment, we asked the participants to rate whether they perceived the robot to be paying attention or imitating based on its behavior. This served as a manipulation check for validity, ensuring that our designs indeed conveyed the intended type of engagement. Note that what we want to assess is whether, knowing that the robot can learn the target skill if taught properly, participants think the demonstrations given are good enough for that purpose, based on their observation of the robot’s learning engagement. Without emphasizing the robot’s behavior in the pilot study, participants rated the quality of demonstrations based on their prior belief about whether the robot could acquire the demonstrated skill in an ideal situation. We therefore revised the question to specify that the quality of demonstrations should be assessed based on the robot’s reactions (which is also highlighted in the experiments).

4.3.3 Analysis and Results

Manipulation check. The manipulation check for different engagement communications shows, in Fig. 11, that the manipulation is effective (for attention cue: repeated measures ANOVA, \(F(3, 141) = 153.79\), \(p < 0.01\), \(\eta ^2 = .80\); for imitation cue: repeated measures ANOVA, \(F(3, 141) = 197.45, p < 0.01, \eta ^2 = .84\)). Robots in A-mode (\(M=5.53, SD=1.85\)) and AI-mode (\(M=6.17, SD=1.11\)) are indeed perceived to show more attention than robots in N-mode (\(M=2.53, SD=1.83\)); Bonferroni posthoc test \(p<0.05\). Also, more imitation behavior is reported by subjects with robots in I-mode (\(M=4.98, SD=1.33\)) and AI-mode (\(M=6.05, SD=1.22\)) than robots in N-mode (\(M=1.88, SD=1.57\)); Bonferroni posthoc test \(p<0.05\).

Fig. 11

Participants’ ratings on robot engagement communications and their behavior in RLfD

Efficacy of proposed engagement cues We analyze participants’ ratings via a one-way repeated measures ANOVA with the mode as the independent variable. We find that both attention and imitation cues significantly improve the ratings of robots’ engagement levels and their behavior, as shown in Fig. 11. Specifically, the robots in A-mode (\(M=5.53, SD=1.85\)), I-mode (\(M=5.78, SD=1.03\)) and AI-mode (\(M=6.17, SD=1.11\)) are perceived to be significantly more engaged in the learning process than the robot in N-mode (\(M=2.53, SD=1.83\)); repeated measures ANOVA, \(F(3, 141) = 153.79, p < 0.01, \eta ^2 = .80\). Thus, H1a is accepted. Consequently, subjects accept the robots’ behavior in RLfD (A-mode: \(M=4.02, SD=1.78\), I-mode: \(M=4.62, SD=1.44\), and AI-mode: \(M=5.58, SD=1.39\)) significantly more than the robot in N-mode (\(M=2.20, SD=1.60\)); repeated measures ANOVA, \(F(3, 141) = 102.89, p < 0.01, \eta ^2 = .73\). Thus, H1b is accepted. Further, in terms of engagement, combined cues are rated significantly higher than single cues; Bonferroni post hoc test \(p < 0.01\). Thus, H1d is accepted. Similarly, in terms of acceptability, combined cues are rated significantly higher than single cues; Bonferroni post hoc test \(p < 0.01\). Thus, H1f is accepted. However, we do not observe a significant difference between the imitation cue and the attention cue, so H1c and H1e are both rejected. Therefore, H1 is partially accepted.

Based on these analyses, we therefore conclude that:

Overall, our results partially support H1: robots showing attention, imitation, or both are perceived to be significantly more engaged in learning, and their behavior is significantly more acceptable. Also, showing both behaviors is perceived significantly better than showing only one. However, no significant difference is found between showing attention and showing imitation.

Fig. 12

Participants’ ratings on the effects of engagement communication on their perception and their assessment of demonstration quality

Effects of engagement cues on participants’ perception We then compare the effects of different engagement cues on subjects’ perception via a one-way repeated measures ANOVA with the mode as the independent variable. In general, robot engagement communication significantly enhances the participants’ estimation of the robots’ learning capabilities and the participants’ expectations of the learning outcomes, even though none of the robots in the experiment have any learning ability (no learning algorithms are adopted in the user study). Specifically, in terms of estimating the robots’ learning capability, participants rated the robots in A-mode (\(M=4.13\), \(SD=1.70\)), I-mode (\(M=4.88\), \(SD=1.49\)) and AI-mode (\(M=5.63, SD=1.21\)) as significantly more intelligent than the robots in N-mode (\(M=2.10\), \(SD=1.45\)), as shown in Fig. 12; repeated measures ANOVA, \(F(3, 141) = 155.25, p < 0.01, \eta ^2 = .80\). Thus, H2a is accepted. Similarly, participants rated the robots with engagement behavior (A-mode: \(M=3.70\), \(SD=1.94\), I-mode: \(M=4.40\), \(SD=1.63\), and AI-mode: \(M=5.73, SD=1.47\)) as more likely to master the skills than the robots without (N-mode: \(M=2.02, SD=1.59\)); repeated measures ANOVA, \(F(3, 141) = 125.38\), \(p < 0.01, \eta ^2 = .76\). Thus, H2b is accepted.

In addition, showing gesture engagement, i.e., I-mode, has significantly more influence on the participants than showing gaze engagement, i.e., A-mode. In particular, the robots in I-mode (\(M= 4.88, SD=1.49\)) are perceived to be significantly more capable of learning the demonstrated skills than the robots in A-mode (\(M= 4.13, SD=1.70\)); repeated measures ANOVA, \(F(3, 141) = 155.25, p < 0.01, \eta ^2 = .80\). Thus, H2c is accepted. Similarly, the robots in I-mode (\(M= 4.40, SD=1.63\)) receive significantly higher ratings than the robots in A-mode (\(M= 3.70, SD=1.94\)) in terms of participants’ expectation towards the learning outcomes; repeated measures ANOVA, \(F(3, 141) = 125.38, p < 0.01, \eta ^2 = .76\). Thus, H2e is accepted.

Further, we notice significant differences between the robots in AI-mode and the robots in the other modes. Specifically, the robots in AI-mode (\(M=5.63, SD=1.21\)) are perceived to be significantly more intelligent in learning than the robots in N-mode (\(M=2.10, SD=1.45\)), A-mode (\(M=4.13, SD=1.70\)), and I-mode (\(M=4.88, SD=1.49\)); repeated measures ANOVA, \(F(3, 141) = 155.25, p < 0.01, \eta ^2 = .80\). Thus, H2d is accepted. Also, the robots in AI-mode (\(M=5.73, SD=1.47\)) are estimated by the participants to be significantly more likely to master the skill than the robots in the other modes (N-mode: \(M=2.02, SD=1.59\), A-mode: \(M=3.70, SD=1.94\), and I-mode: \(M=4.40, SD=1.63\)). Thus, H2f is accepted. Note that in all engagement modes and all skill settings, the robots are equipped with no learning algorithms and thus have no actual learning abilities.

In general, participants perceived the robot showing gaze engagement quite differently from the one showing gesture engagement. Specifically, for the open-ended question “why do you have such estimate of the likelihood about the robot mastering the sports skill?”, participants stated that, when showing gaze only, the robot seems to be “attentive throughout the demonstration so it knows what to do” (P16), “listening so there is a possibility of mastering this sport skill” (P18), and “learning by watching” (P13). These comments suggest that the participants perceive the robot exhibiting gaze engagement as an attentive listener in a mental state of learning. For the same question, when showing gesture engagement, the robot appears to our participants to be “doing so well at following the instructor” (P4) and “making attempts and ideally would get better over time” (P18). These comments imply that the participants judge the robot’s learning capability based on its motivation to perform like the instructors. The differences between these two types of perceptions also justify our motivation to separate gaze engagement and gesture engagement in this study.

Overall, our results support H2: communicating engagement significantly influences the participants’ estimation of the robots’ learning capabilities and significantly changes their expectation towards the final learning outcomes, even though none of the robots has any learning ability. Moreover, the gesture engagement cue in RLfD, i.e., imitation, has significantly more influence on the participants than the gaze engagement cue. Furthermore, communicating engagement via both cues at the same time has significantly more effect on participants than communicating engagement via a single cue.

Effects on participants’ assessment of demonstration qualities Finally, we analyze the participants’ ratings on the appropriateness of instructors’ demonstrations. As shown in Fig. 12, no significant difference can be found between A-mode (\(M= 4.48, SD=2.10\)) and N-mode (\(M= 3.35, SD=2.08\)); thus, H3a is rejected. However, compared with A-mode, only AI-mode (\(M= 5.93, SD=1.00\)) significantly improves the participants’ assessment of demonstration quality in RLfD; Bonferroni posthoc test, \(p < 0.01\). Thus, H3b is partially accepted. Note that in all engagement modes, the skills to be learned are generated from the same set of MoCap data; thus, all demonstrations are of the same quality.

Overall, our results partially support H3: communicating gesture engagement or combined engagement significantly improves participants’ assessment of demonstration qualities, while showing attention alone does not, even though all the demonstrations are of the same quality.

Further, in the comments collected from the user study, we found that most participants explicitly stated that the robots without gesture engagement might fail to learn and that, accordingly, they were more likely to adjust future demonstrations when the robots communicated no engagement or only gaze engagement.

5 Discussion

5.1 Engagement Communication for Robots in RLfD

The Choice of Engagement Cue Should Consider the Nature of the Learning Task

Our results show that robots’ gesture engagement is preferable to gaze engagement in physical skill-oriented RLfD, which can probably be explained by the correspondence between the practice of RLfD and the cone of learning [19]. The cone of learning, a.k.a. the pyramid of learning or the cone of experience, depicts the hierarchy of learning through involvement in real experiences [19]. It proposes that visual receiving (just watching the demonstration) is a passive form of learning, and learners can only remember half of the knowledge received through this channel two weeks later. In contrast, “doing the real thing” is a type of active learning that leads to deeper involvement and better learning outcomes [19].

In RLfD, the basic task for robots is to derive a policy from demonstrations and then reproduce the instructors’ behavior [4]. On the one hand, a robot’s imitation behavior resembles this “behavior reproducing” process; it is thus deemed actively engaged in the learning process. On the other hand, although showing gaze engagement implies that the robot is involved in the visual receiving of instruction, it is still considered a passive way to learn. Consequently, instructors may conclude that a robot showing gesture engagement will have a deeper understanding and better mastery of the skill than one showing gaze engagement. Moreover, by analyzing the quality gap between a robot’s imitation behavior and the demonstration (the behavior to be reproduced), instructors may form a more accurate assessment of the robot’s learning progress. In short, to design effective engagement cues for robots in RLfD, we need to consider the nature of the learning task.

Engagement Communication Should Reflect the Robot’s Actual Capabilities

In our study, we do not equip the robot with any actual policy derivation algorithm, since we want to avoid the perception bias caused by algorithm selection. In other words, the robot has no learning ability. Still, many subjects are convinced that robots with engagement communication (attention, imitation, or both) will eventually master the skill. They hold this belief even when a task is technically very challenging for robots to learn because of the correspondence problem, e.g., swimming. These findings suggest that engagement communication can affect instructors’ mental model of the robot’s capability and progress, and, as shown in our study, there can be a misalignment between instructors’ expectations and the robot’s actual development. If instructors shape their teaching according to an inaccurate mental model, frustration may occur later in the RLfD process. Hence, it is critical to ensure that a robot’s communication of engagement reflects its actual capabilities (its policy development in the case of RLfD). One possible direction is to drive the engagement communication by the learning progress. For example, we can define the robot’s actual capability as what it has learned so far; the robot could then gradually ramp up its engagement, e.g., show more imitation behaviors, as learning progresses.
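As a rough illustration of this direction (not part of the system evaluated in our study), the sketch below scales the strength of the imitation cue by an estimate of learning progress, so that the communicated engagement stays aligned with the policy’s actual development. The progress measure, function names, and blending scheme are all assumptions made for illustration.

```python
# A hypothetical sketch of progress-driven engagement communication.
# The robot blends between a neutral rest pose and the full imitation pose
# according to an estimate of its learning progress, so that stronger
# engagement cues are shown only once the policy has actually improved.
import numpy as np


def learning_progress(policy_confidence: float) -> float:
    """Map a policy-confidence score in [0, 1] to an engagement weight.

    In a real system, this score could come from, e.g., the prediction
    error of the current policy on held-out demonstration frames.
    """
    return float(np.clip(policy_confidence, 0.0, 1.0))


def engaged_imitation_pose(rest_pose: np.ndarray,
                           imitation_pose: np.ndarray,
                           policy_confidence: float) -> np.ndarray:
    """Interpolate between the rest pose and the full imitation pose.

    Low confidence -> subtle mimicry; high confidence -> full imitation.
    """
    w = learning_progress(policy_confidence)
    return (1.0 - w) * rest_pose + w * imitation_pose
```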

5.2 Limitations

We are motivated by the phenomenon of human tutelage, in which teachers build their mental model of students by observing the students’ learning behavior, especially their engagement cues. In this process, teachers take on different roles: observer (monitoring how students do) and demonstrator (intervening in the learning process by providing demonstrations). Similarly, a teacher in an RLfD process is assigned these roles: they can observe the learning process and intervene at the point where they think the demonstrations need to be changed, which is directly validated through their answers to the questionnaires after each session. We would like to answer: (1) whether human instructors can understand the learning engagement cues simulated on robots, (2) how they make sense of these cues, and (3) how the instructors’ interpretation of these cues may affect their perception and teaching. Whether and to what extent instructors’ consequent change of teaching can improve the robots’ learning is to be investigated in future work. In our current evaluation, we cannot assess whether and how the robot’s behavior leads to improved demonstrations that in turn result in better learning; how the demonstrations in turn affect the learning process could be a promising future study.

In addition, this work has several limitations. First, in our study, engagement communication is decoupled from the robot’s actual learning process. However, in human or animal learning, such communication is usually associated with the learning process; for example, a student making good progress tends to show more gesture engagement [58]. We will investigate how to couple the learning process with engagement communication in the future. Second, in this paper, we only consider two types of learning engagement cues, i.e., attention and imitation; in practice, human learners may employ more diverse cues, e.g., spatially approaching the instructor. Third, the proposed methods, Instant attention and Approximate imitation, are both based on human body poses. They may not apply to learning tasks that do not necessarily involve the demonstrator’s body movements, e.g., object manipulation; for those tasks, designing a good mechanism to communicate robot engagement is still an open question. Fourth, in this work, we only consider skill-oriented RLfD, in which the robot has to master a skill taught by instructors. Other types of RLfD, e.g., goal-oriented RLfD in which the robot learns how to achieve a goal from human examples, are inherently different in task settings; though the proposed methods may work, we still need to evaluate their effects in future work. Fifth, several studies show that there are gender effects on non-verbal communication [26], so people of different genders may understand the robot engagement communication in our studies differently. We leave the investigation of possible gender effects as future work.

Lastly, we conduct the user study in an online simulation environment without a further offline, real-time RLfD test. Though simulation is common practice for evaluating ideas in RLfD, the participants do not have any control over the teaching process; how participants might reshape future demonstrations based on the robot’s engagement feedback needs further investigation.

6 Conclusion

In this work, we propose two methods (Instant attention and Approximate imitation) to generate robots’ learning engagement in RLfD. The Instant attention method automatically generates the point of attention, and the Approximate imitation method produces robot imitation behavior. Based on the two methods, we investigate the effects of three types of engagement communication (showing attention, showing imitation, and showing both) via a within-subject user study. Results suggest that the proposed cues make robots be perceived as significantly more engaged in the learning process and their behavior as significantly more acceptable in RLfD than no engagement communication. These engagement cues also significantly affect the participants’ estimation of the robots’ learning capabilities and their expectation of the learning outcomes, even though none of the robots has any actual learning ability. In particular, the imitation cue influences instructors’ perceptions significantly more than the attention cue, and the combined cues significantly outperform a single cue. We also find that showing gesture or combined engagement significantly improves instructors’ assessments of demonstration qualities. This paper takes a first step toward revealing the potential effects of communicating engagement on humans in RLfD.