Investigating the Effects of Robot Engagement Communication on Learning from Demonstration

Robot learning from demonstration (RLfD) is a technique for robots to derive policies from instructors’ examples. Although the reciprocal effects of student engagement on teacher behavior are widely recognized in the educational community, it is unclear whether the same phenomenon holds for RLfD. To fill this gap, we first design three types of robot engagement behavior (gaze, imitation, and a hybrid of the two) based on the learning literature. We then conduct, in a simulation environment, a within-subject user study to investigate the impact of different robot engagement cues on humans compared to a “without-engagement” condition. Results suggest that engagement communication significantly and positively influences humans’ estimation of the simulated robots’ capabilities and significantly raises their expectations towards the learning outcomes, even though we do not run actual imitation learning algorithms in the experiments. Moreover, imitation behavior affects humans more than gaze does on all metrics, while their combination has the most profound influence. We also find that communicating engagement via imitation or the combined behavior significantly improves humans’ perception of the quality of the simulated demonstrations, even though all demonstrations are of the same quality.

Robot learning from demonstration (RLfD) enables robots to derive a mapping from states to actions, a.k.a. a policy, from instructors' demonstrations [7]. This technique has been shown to be successful in teaching robots physical skills by imitating instructors' body movements, e.g., pole balancing [7], tennis swings [28], air hockey maneuvers [9], etc. A standard RLfD process takes two steps: the demonstration gathering step, which collects demonstrations from the human demonstrators, and the policy deriving step, which infers the underlying state-action mappings [4]. Like a human learner, a robot in RLfD can have different strategies for gathering demonstrations according to its underlying policy derivation algorithms. For example, robots with the DAgger algorithm [46] learn progressively by taking incremental demonstrations from instructors, much like going through a scaffolding process [29,47]. A robot can also learn more proactively. For example, if equipped with Confidence-Based Autonomy (CBA) [17], an interactive algorithm for RLfD, a robot can request demonstrations at the states of which it has little or no knowledge. These learning strategies have proven to be very effective and are thus widely adopted in RLfD [37].
Unlike human learners, robots in previous RLfD processes rarely show any engagement cues during the learning process. They mostly remain stationary without giving any feedback, especially while instructors are giving demonstrations (i.e., in the demonstration gathering step). In human tutelage, engagement cues play an important role in shaping instructors' mental model of the learners [53]. For example, learners' attentional engagement, e.g., gaze, indicates their points of interest in the instructions. Imitation, a behavioral engagement cue, shows learners' motivation to perform like the instructors [16]. It is reported that learner engagement cues can affect instructor perceptions and behavior [25]. For example, in educational research, instructors are found to have a tendency to provide more support to learners of high behavioral engagement [53].
These effects of showing learning engagement, however, are less explored in the RLfD research, partly because designing engagement cues for robots in the context of RLfD is challenging. First, most of the existing methods for generating engagement cues in Human-Robot Interaction (HRI) cannot be directly applied to RLfD. For example, it is common practice in HRI to simulate robots' attentional engagement by directing their gaze towards visually salient elements (e.g., color or lightness [42]), specific objects (e.g., human faces [50]) or predefined events (e.g., pointing gestures [11]). This practice cannot be easily set up in RLfD because the robot's allocation of attention should follow the development of instructors' demonstrations. This is especially true in skill-oriented RLfD, where the robot needs to reproduce the body skills from the human demonstrator. In this context, the attention should be subject to the demonstrations, i.e., body movements, which are less constrained and highly dynamic compared to a standard HRI process. Methods for generating other engagement cues, e.g., imitation [8,45,38], also need further adaptation to accommodate the dynamic nature of RLfD. Second, even if an engagement cue can be designed effectively, its deployment in RLfD should be in real-time with low computational cost.
To this end, we focus on skill-oriented RLfD and propose two novel methods (Instant attention and Approximate imitation) to enable robots to communicate their learning engagement in an RLfD process. Note that we consider the demonstration gathering step as the interaction scenario since it determines the demonstration quality, which is crucial for policy optimality [4,56]. We do not focus on designing effective algorithms for learning from the demonstrations. The learning engagement cues are generated as follows: the Instant attention method generates robot attentional engagement by tracking instructors' body movements through particle filters; the Approximate imitation method produces behavioral engagement, i.e., imitation, by partially mapping the instructor's joint movements to those of the robot with approximations. We then use the proposed methods to generate three modes of engagement communication (via attention, via imitation, and via a hybrid of the two) for robots in RLfD. To investigate the effects of the three engagement modes on humans, we compare them with another mode ("without-engagement", in which the robot remains stationary, as most robots do in existing RLfD studies [7,28,9]) in a within-subject user study in a simulation environment. Results suggest that robots with the proposed cues are perceived to be more engaged in the learning process, and their behaviors are more socially acceptable in RLfD than those of the robots without. Also, having engagement cues significantly affects humans' estimation of the robots' learning capabilities: the robots that communicate engagement in RLfD are perceived to be significantly more capable of learning than the robots without, even though none of them are equipped with learning algorithms. The engagement communication also affects humans' expectations towards the final learning outcomes.
Furthermore, behavioral cues influence humans' perceptions significantly more than attentional engagement does, while the hybrid cues significantly outperform the other two. We also find that showing behavioral or combined engagement significantly improves humans' evaluation of demonstration quality. Specifically, participants perceived the demonstrations to be significantly more appropriate for the robot to learn from when the robot communicated its engagement via the behavioral or mixed cues, even though all demonstrations were actually of the same quality.
The contributions of this paper are as follows. First, we propose two novel algorithms that allow robots to generate attention and imitation behavior to communicate their learning engagement in RLfD at low computational cost. Second, we develop a simulation platform to evaluate the effects of engagement communication in RLfD. Third, we take a first step towards evaluating the effects of three types of engagement cues (attention, imitation, and hybrid) on humans. Through an evaluation in a simulation environment with a humanoid robot learning different skills from a simulated demonstrator, we present findings that inform the design of robot engagement communication in RLfD. To the best of our knowledge, this paper is the first to systematically investigate how robot engagement communication affects humans' perceptions and expectations of the robot in RLfD.

Robot Learning from Demonstration (RLfD)
Robot Learning from Demonstration (RLfD) is also known as "Programming by Demonstration", "imitation learning", or "teaching by showing" [48]. Rather than exhaustively searching the entire policy space, RLfD enables robots to derive an optimal policy from demonstrators' (also called instructors') demonstrations [7]. Usually, this technique does not require additional knowledge about programming and machine learning from human instructors, and thus opens up new possibilities for common users to teach robots [18]. Existing studies on RLfD focus mainly on policy derivation algorithms, e.g., mapping states to actions by supervised learning [17], updating the policy by value iteration in Reinforcement Learning [7], and recovering rewards to explain demonstrations by Inverse Reinforcement Learning [1,56]. Some studies also work on designing robots' reciprocal learning feedback to communicate what the robots have learned to human teachers, e.g., demonstrating the robot's current learned policy [12], providing verbal and/or nonverbal cues [33,2,44,58,11,13], or visualizing where they succeed and fail [49]. These studies, however, largely overlook how the robots' engagement behavior would affect the instructors and their demonstrations, especially during the demonstration gathering step. Hence, in this work, we consider how to generate behaviors that allow robots to communicate their learning engagement to instructors, and investigate their potential effects on RLfD.

Engagement and learning engagement cues
Engagement is a broad concept in HRI with many different definitions. Some studies focus on the whole spectrum of an interaction and define engagement as the process of initiating, maintaining, and terminating the interaction between humans and robots [51]. Others narrow the notion of engagement down to the maintenance of interactions, interpreting engagement as humans' willingness to stay in the interaction [64,60].
In the context of learning, engagement mainly refers to the state of being connected in the learning interaction, which can be measured from three aspects: cognition, behavior, and emotion [52]. Cognitive engagement is closely related to the allocation of attention as it is one of the most important cognitive resources [43]. Failure to attend to another person indicates a lack of interest [5]. Thus, we adopt attention as a cue to communicate cognitive engagement in RLfD. Behavioral engagement is captured by task-related behavior, e.g., task attempts, efforts, active feedback, etc. Imitation, a common behavioral engagement signal, refers to "non-conscious mimicry of the postures, mannerisms, facial expressions, (speech), and other behavior of one's interaction partners" [14]. In interpersonal communications and HRI, the imitation behavior increases the likelihood of understanding [15], interpersonal coordination [10] and emotional contagion [26]. In the context of learning, the imitation behavior also indicates the robot's internal status in learning, e.g., the progress and motivation [16]. Thus, we use imitation as a way to communicate the behavioral engagement for robots in RLfD. Emotional engagement is associated with the affective status evoked by the interaction, including valence and arousal. Despite its importance, emotional engagement is hard to apply in RLfD since most existing RLfD robot systems lack the full ability to express emotions. In the scope of this paper, we define the robot learning engagement as the involvement in the learning process, with a focus on its cognitive engagement, i.e., attention, and behavioral engagement, i.e., imitation. The following subsection presents related work on generating attention and imitation behavior to communicate engagement.

Robots' communication of engagement
In HRI, a robot can communicate its attention via different physical channels, e.g., gaze [39,35,40,41], head orientation [39,58], and body postures [61]. Regardless of which channel they use, robots are usually programmed to pay attention to salient elements, including but not limited to colors [11], objects with visual intensity [42], and movements [11,42]. For example, Nagai et al. regarded visually outstanding points in the surroundings, in terms of their colors, intensities, orientations, brightness, and movements, as points of attention [42]. Other work directs robots' attention to specific objects, e.g., human faces [50] and colorful balls [3] to name a few, or predefined events, e.g., pointing gestures [39]. For example, Sidner et al. designed a robot that pays attention to participants' faces most of the time [50]. Lockerd et al. drove the robot attention mechanism with interaction events, such as looking at an object when it is pointed at or looking at a subject when the person takes a conversational turn [39]. To accommodate multiple events, a state transition diagram is usually adopted to control any attention shifts [11,39]. Though these studies provide insightful information about the design of robot attention, their approaches may not be easily applicable to skill-oriented RLfD, as the point of attention within the instructor's body movements changes dynamically.
Compared to attention, the imitation behavior has been less widely adopted as a robot engagement cue. The robot imitation of a human participant's behavior in real-time is inherently challenging due to the correspondence problem [4] as well as the robot's physical constraints [31,55,32]. Hence, instead of generating full-body imitation behavior, some HRI researchers proposed to do partial imitations. For example, Bailenson and Yee built an immersive virtual agent that subtly mimicked people's head movements in real-time [8]. A similar imitation strategy was applied by Riek et al. to a chimpanzee robot [45]. In addition to head imitation, gesture "mirroring" has also been implemented by Li et al. on a robot confederate [38]. Although these studies showed that partial imitation behavior improves participants' perception of robots' capabilities [23,21], they mainly used rule-based methods [8] or predefined behavior [38], which may not be transferable to RLfD scenarios. In this work, we employ the same strategy and allow robot learners to make partial imitations. Different from existing work, we take an algorithmic approach to automatically generating approximate imitations of instructors' body movements for robots in real-time.

LEARNING ENGAGEMENT MODELING
This section presents two methods for generating engagement cues. The first subsection briefly introduces the representation of human body poses, which forms the basis of the proposed methods. The remaining subsections describe the methods in detail.

Representation of the body pose
In RLfD, instructors usually give demonstrations via their body movements. Our proposed methods thus use human body poses to generate attention and imitation behavior. A body pose is usually depicted by a tree-like skeleton, with nodes as human joints and links as body bones (shown in Figure 1). Mathematically, this skeletal structure can be represented in two forms: the position form and the transformation form.
Position form: The position form describes the body pose in a single frame of reference (usually the sensor frame), as shown in Figure 1(a). In this form, the pose skeleton is denoted as [J^(1), J^(2), ..., J^(n)], where J^(i) ∈ R^3 is the position vector of the i-th joint in the skeleton, and n is the number of joints. This form gives each joint's global position, providing the potential attention points for the robot. Hence it is used by the Instant attention algorithm to generate robot attention points.
Transformation form: The transformation form describes the body pose in a series of frames of reference [57], as shown in Figure 1(b). In particular, each joint has its own right-handed frame, and the links in the tree-like skeleton define parent-child structures between frames. The pose of a non-root joint is then described by a translation (i.e., the bone length) and a rotation (i.e., the joint movement) in its parental frame, while the root joint (often the hip joint) is described in the sensor frame. This form decomposes a human body movement into joint rotations (body-independent) and joint translations (body-dependent) in a way that the movement can be easily imitated by robots: simply map the rotations onto the robot joints. We denote this form as [T_1, T_2, ..., T_n], and use it in the Approximate imitation algorithm to obtain approximate imitation behavior.
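To make the relation between the two forms concrete, the sketch below converts a pose in the transformation form back to the position form by composing each joint's local rotation and translation down the parent-child chain (forward kinematics). This is a minimal illustration only: the joint names, the plain 3x3 rotation representation, and the topologically ordered input are our own assumptions, not the paper's implementation.

```python
def mat_vec(R, v):
    """Apply a 3x3 rotation matrix R to a 3-vector v."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]

def mat_mul(A, B):
    """Multiply two 3x3 rotation matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def to_position_form(chain):
    """chain: list of (joint, parent, R_local, t_local) in topological order,
    with parent=None for the root (whose t_local is given in the sensor frame).
    Returns {joint: global position}, i.e., the position form of the pose."""
    positions, rotations = {}, {}
    for joint, parent, R_local, t_local in chain:
        if parent is None:
            positions[joint] = list(t_local)   # root: already in the sensor frame
            rotations[joint] = R_local
        else:
            # Child position = parent position + parent's global rotation applied
            # to the bone translation; rotations compose down the tree.
            offset = mat_vec(rotations[parent], t_local)
            positions[joint] = [positions[parent][i] + offset[i] for i in range(3)]
            rotations[joint] = mat_mul(rotations[parent], R_local)
    return positions
```

With identity rotations the joint positions are simply the summed bone translations; rotating a parent joint moves every descendant accordingly, which is exactly why the rotation part is body-independent while the translations are body-dependent.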

Instant attention
The attentional engagement for robots is generated based on cognitive theories of human attention. Generally speaking, the generation of human visual attention involves two stages [30]: first, attention is distributed uniformly over the visual scene of interest; then, it is concentrated on a specific area (i.e., it is focused) for gaining information [20]. In a skill-oriented RLfD process, the instructor demonstrates skills mainly through their body joint poses. The above mechanism thus translates to the following: the human joints of interest are tracked uniformly at the initial stage, and then the joint providing the most information for learning is picked as the attention point. When learning from a demonstration, the more predictable/trackable a body joint movement is, the less information the robot can gain from that part, and consequently, the less attention the robot should pay to it. In other words, the body joint that moves most out of expectation among all joints is the one worth paying attention to.
To this end, we use the particle filter (PF), as it is robust and effective in prediction [34] and tracking [6]. In short, a PF is a Bayesian filter which uses a group of samples to approximate the true distribution of a state [63]. Particularly, given the state observations, a PF employs many samples (called particles) to describe the possible distribution of that state. The particles are denoted as X_t = {x_t^[1], x_t^[2], ..., x_t^[M]}, where M is the number of particles in the particle set X_t. Each particle x_t^[m] is a hypothesis as to what the true state might be at time t, and is first produced by a prediction model p(x_t | z_{1:t-1}) based on all history observations z_{1:t-1}. Each particle is then assigned an importance weight proportional to p(z_t | x_t^[m]), i.e., the probability of the current observation given that particle, and the particle set is resampled in proportion to these weights. For more details on the particle filter, refer to [63].
We apply one PF to track each relevant joint during the human demonstration. Specifically, the state x_t^[m] ∈ R^3 describes the joint position in the sensor frame. We assume the state transits with additive Gaussian noise:

    x_t^[m] = x_{t-1}^[m] + ε_t,    ε_t ~ N(0, σ_t I),

where N(0, σ_t I) is the multivariate normal distribution with zero mean and diagonal covariance matrix σ_t I. The importance factor for each particle is defined to be exponential in the Euclidean distance between the predicted and observed joint position:

    w_t^[m] = η exp(-||x_t^[m] - J_t||),

where η is the normalizer and J_t is the observed joint position. Each joint in the body pose is tracked by a particle cloud, i.e., a group of particles X_t. In order to dynamically adjust the cloud size in accordance with the joint movement, the variance σ_t is set to be proportional to the average Euclidean distance between the predicted and observed joint positions:

    σ_t = (α / M) Σ_{m=1}^{M} ||x_t^[m] - J_t||,

where α is a hyper-parameter and M is the number of particles. The σ_t indicates the cloud size: the greater σ_t is, the more attention the robot should pay to the associated joint. Thus, the joint with the maximum σ_t corresponds to the attention point. In the experiments, α is set to 0.02 for the best tracking of human joints. Note that, though the importance factor is calculated from the distance between the predicted and observed joint positions, it is not equivalent to a measure of joint acceleration. In particular, the predicted joint position is just an estimate, and the difference between the predicted and observed joint positions measures how much the estimate deviates from the truth. The importance factor thus reflects unpredictability and can only be computed after the current observation is available. Figure 2 illustrates how the PF works to generate attention. The particle cloud functions as the robot's prediction of the joint's future movements, and is subject to change based on the current observations. Initially, the robot predicts the movements of all body joints of interest to be the same, i.e., all clouds are of the same size.
During a demonstration, when a joint moves out of its cloud region, beyond the robot's prediction, the cloud grows to catch that movement, and the robot will thereafter be likely to pay attention to that joint. Likewise, if the joint movement is small or absent, staying within the robot's prediction, the cloud shrinks, and the chances are small that attention will be given to that joint. Overall, the cloud size indicates the predictability of the instructor's body movements as well as the level of attention the robot needs to pay. At each time step, the joint with the biggest cloud is picked as the attention point. This process loops with every new body pose, as shown in Figure 3.
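The per-joint tracking loop described above can be sketched in Python as follows. This is a minimal sketch only: the particle count, the initial σ, and the joint names are illustrative assumptions; only α = 0.02 comes from the text.

```python
import math
import random

class JointParticleFilter:
    """Tracks one joint's 3-D position with a particle cloud whose
    spread (sigma) reflects how unpredictable the joint's motion is."""

    def __init__(self, init_pos, num_particles=100, alpha=0.02, init_sigma=0.05):
        self.alpha = alpha
        self.sigma = init_sigma          # cloud size (assumed initial value)
        self.particles = [list(init_pos) for _ in range(num_particles)]

    def step(self, observed_pos):
        # Prediction: perturb each particle with zero-mean Gaussian noise.
        predicted = [[c + random.gauss(0.0, self.sigma) for c in p]
                     for p in self.particles]
        # Correction: weight each particle by exp(-distance to observation).
        dists = [math.dist(p, observed_pos) for p in predicted]
        weights = [math.exp(-d) for d in dists]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Resampling: draw new particles in proportion to their weights.
        self.particles = random.choices(predicted, weights=weights,
                                        k=len(predicted))
        # Cloud-size update: sigma tracks the average prediction error.
        self.sigma = self.alpha * sum(dists) / len(dists)
        return self.sigma

def attention_point(filters, body_pose):
    """Step every tracked joint's filter and return the joint whose
    cloud (sigma) is largest, i.e., the current attention point."""
    sigmas = {j: filters[j].step(pos) for j, pos in body_pose.items()}
    return max(sigmas, key=sigmas.get)
```

Running this on a pose stream, a joint that keeps jumping outside its predicted cloud accumulates a large σ and attracts the attention point, while a stationary joint's cloud shrinks towards the sensor-noise level.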
We now present a practical algorithm, Instant attention, to generate attentional engagement for robots in real time (Algorithm 1). The algorithm takes the TrackedJoints set JSet_tracked and the BodyPose in the position form [J^(1), J^(2), ..., J^(n)] as inputs, and outputs the attention point at each time step.

Algorithm 1: Instant attention
Input: TrackedJoints JSet_tracked; BodyPose [J^(1), J^(2), ..., J^(n)]
Output: the attention point
 1  for each joint j in JSet_tracked do
 2      initialize the j-th particle filter for joint j with the same covariance;
 3  for each new BodyPose do
 4      for each joint j in JSet_tracked do
 5          obtain particles x_{t-1} from the j-th particle filter;
 6          for m = 1 to M do
 7              predict x_t^[m] from the motion model;          /* prediction */
 8          for m = 1 to M do
 9              compute the importance factor w_t^[m];          /* correction */
10          draw new particles x_t^[m] in proportion to w_t^[m];
11          update σ_t for joint j;                             /* cloud-size update */
12      return the joint with the maximum σ_t as the attention point;

Specifically, TrackedJoints contains the joints required to be tracked. In practice, the joints to be tracked are task-dependent, and should be defined according to the possible attention points on the instructor's body. For example, a cooking robot may only need to track the instructor's upper-body movements, and the tracked joints can be configured by the developers based on the robot's physical structure. The other input, BodyPose, is the human body pose in the position form. The algorithm runs as follows: first, it initializes a particle filter with the same covariance for each tracked joint (lines 1-2). Then, for every new body pose, it estimates the distribution of the next joint position (lines 6-7), followed by the estimation correction given the current position observations (lines 8-10). Finally, the algorithm adjusts the covariance of the noise distribution to capture the joint movement (line 11), and the attention point is found by selecting the joint with the maximum covariance value (line 12).
Once an attention point P_a is generated, note that P_a is actually located in the sensor frame. In order to obtain the accurate attention point for the robot, a further transformation is required. As Figure 4 illustrates, the attention point P_a^S in the sensor frame T_S is transformed into the robot head frame T_R by P_a^R = T_RS P_a^S, where T_RS is the transformation from T_S to T_R.
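As a concrete illustration of this frame change, the snippet below applies a 4x4 homogeneous transform T_RS to an attention point expressed in the sensor frame. The example transform (a pure translation, with the head frame assumed to sit 1.0 m in front of and 0.2 m above the sensor) is our own assumption for illustration.

```python
def transform_point(T, p):
    """Apply a 4x4 homogeneous transform T to a 3-D point p,
    returning the point expressed in the target frame."""
    x, y, z = p
    ph = (x, y, z, 1.0)  # homogeneous coordinates
    return [sum(T[i][j] * ph[j] for j in range(4)) for i in range(3)]

# T_RS: maps sensor-frame coordinates into the robot head frame.
T_RS = [[1.0, 0.0, 0.0,  0.0],
        [0.0, 1.0, 0.0, -1.0],
        [0.0, 0.0, 1.0, -0.2],
        [0.0, 0.0, 0.0,  1.0]]

p_sensor = (0.1, 1.5, 1.2)                 # attention point P_a^S in the sensor frame
p_robot = transform_point(T_RS, p_sensor)  # P_a^R, ≈ [0.1, 0.5, 1.0]
```

The same function applies unchanged when T_RS also contains a rotation between the two frames.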
The Instant attention method has several advantages. First, unlike other mechanisms (salience-based, object-based, or event-based), this method utilizes the particle cloud to track the instructor's joint movements, and automatically produces attention points based on the information gained from the movements. Second, the attention point is generated and shifted smoothly because the spatial size of the cloud evolves smoothly. Specifically, the particle distribution p(X_t) is iteratively resampled from the previous distribution p(X_{t-1}) according to the importance weights w_t, so the cloud size changes gradually even if the joint moves abruptly (i.e., even when the distance between x_t and J_t is large). Also, the cloud is robust to noise and outliers, e.g., joint vibrations caused by sensors, since small turbulence (no matter the exact speed) will not change the cloud size (σ_t is averaged over all predicted states), while existing speed- or spatial-position-based methods could cause gaze jerks or sudden gaze shifts due to such noise and outliers. Third, the joints to be tracked can be dynamically changed, offering a flexible and adjustable attention mechanism based on the RLfD task.

Approximate imitation
Behavior imitation in robotics is usually formulated as an optimization problem, which needs to find the joint correspondence first [4] and then solve the inverse kinematics for the robot structure [24]. Both processes are difficult, computationally intensive, and robot-configuration-dependent, and hence not applicable for generating imitation behavior for general robots. On the other hand, psychological studies report that people mimic behavior to communicate engagement by adopting similar postures or showing similar body configurations according to the context [14]. We thus relax the behavior imitation problem as follows. First, the robot is not required to search blindly for the best joint correspondence, since the joint correspondence is task-dependent; we allow the user to explicitly specify the joint correspondence according to the RLfD context. Second, for those robot joints whose degrees of freedom (DoF) do not match the human joints', we only set the joint angles for the available robot joints to approximate the human movements. Though this approximation may not be optimal in the sense of behavior mimicry, it runs very fast (in real time) to generate behavioral engagement, achieving a balance between simplicity and optimality. To this end, we propose the Approximate imitation algorithm, which allows robots to generate motions similar to the demonstrator's for the specified joints. Given the joint correspondence, the algorithm runs in two steps: frame transformation and rotation approximation, as presented in Figure 5.
The frame transformation transforms the instructor's body pose to match the robot frames. To be specific, we leverage the transformation form of body poses to decompose the frame matching into two steps: first, rotation alignment, and then translation alignment. The rotation alignment rotates the human joint frames so that their axes are aligned with the robot joint frames, as shown in Figure 6(a); the translation alignment translates the human joint frames in their parent frames so that the initial skeletal structure of the demonstrator's body matches the robot's initial configuration, as shown in Figure 6(b).

Since the DoF of a robot joint may not equal the DoF of its corresponding human joint, we cannot obtain an exact movement mapping. Instead, we use the robot joint to approximate the human joint rotations as follows. First, a human joint rotation is converted into the Euler form, (θ_roll, θ_pitch, θ_yaw). Second, if the DoF of a robot joint is 3 (roll, pitch, and yaw) and exactly matches the human DoF, then the conversion is straightforward: rotate the robot joint with roll first, then pitch, and finally yaw. If the DoF of a robot joint is 2 (e.g., roll and pitch), then the conversion can be approximated as rotating with roll first and then pitch. If the DoF of a robot joint is 1 (e.g., roll only), then rotate with roll only. For example, in Figure 7, the robot arm has the same structure as the demonstrator's but with different joint DoF, as shown in Figure 7(a) and (b). It can approximate the instructor's left arm movement by first converting the rotation T_S into Euler angles (θ_roll, θ_pitch, θ_yaw), and then setting the joint roll to θ_roll and the joint pitch to θ_pitch for the shoulder, ignoring θ_yaw, as shown in Figure 7(c). We now present the algorithm Approximate imitation in Algorithm 2.
The algorithm takes the joint correspondence JointCorrespondence and the instructor's body pose JointMovement in the transformation form as inputs, and outputs the joint configurations, JointConfigs, for the robot.

Algorithm 2: Approximate imitation
Input: JointCorrespondence {J_i^H → J_i^R}; JointMovement [T_1, T_2, ..., T_n]
Output: the joint configurations q^R
 1  for each joint pair (J_i^H, J_i^R) in JointCorrespondence do
 2      Rotate_align.append(rotateAlign(J_i^R, J_i^H));
 3      Translate_align.append(translateAlign(J_i^R, J_i^H));
 4  for i in [1, 2, ..., n] do
 5      T'_i = Translate_align[i] * T_i * Rotate_align[i];
 6      (θ_roll, θ_pitch, θ_yaw) = convertToEuler(T'_i);
 7      if DoF(J_i^R) == 3 then
 8          append [θ_roll, θ_pitch, θ_yaw] to q^R;
 9      else if DoF(J_i^R) == 2 then
10          append [θ_roll, θ_pitch] to q^R;
11      else if DoF(J_i^R) == 1 then
12          append [θ_roll] to q^R;
13  return q^R;

Specifically, JointCorrespondence defines the joint mapping, {J_i^H → J_i^R}, from human joint J_i^H to robot joint J_i^R for part of the joints. The JointMovement is represented as a series of transformations along the skeletal structure, [T_1, T_2, ..., T_n] (see Section 3.1 for more details). The algorithm runs as follows: first, it calculates the frame transformations from J^H to J^R, and saves the rotation and translation alignments in Rotate_align and Translate_align (lines 1-3). Then, for each joint movement T_i in [T_1, T_2, ..., T_n], the algorithm transforms it into the corresponding robot frame T'_i by the translation and rotation alignments, followed by a conversion into the Euler form (lines 5-6). The algorithm proceeds by selecting the appropriate angles from θ_roll, θ_pitch, and θ_yaw for each robot joint according to its DoF (lines 7-12). The joint configurations are saved in q^R and returned as the final output (line 13).
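The Euler conversion and DoF-based angle selection can be sketched as follows. This is an illustrative sketch only: the Euler convention R = Rz(yaw)·Ry(pitch)·Rx(roll), the joint names, and the correspondence format are our own assumptions, since the text does not fix them.

```python
import math

def convert_to_euler(R):
    """Recover (roll, pitch, yaw) from a 3x3 rotation matrix, assuming
    the composition R = Rz(yaw) * Ry(pitch) * Rx(roll)."""
    roll = math.atan2(R[2][1], R[2][2])
    pitch = math.atan2(-R[2][0], math.hypot(R[2][1], R[2][2]))
    yaw = math.atan2(R[1][0], R[0][0])
    return roll, pitch, yaw

def approximate_imitation(correspondence, movements):
    """correspondence: {human_joint: (robot_joint, dof)}; movements:
    {human_joint: 3x3 rotation already aligned to the robot frame}.
    Keeps only as many Euler angles as the robot joint's DoF allows,
    in the order roll, then pitch, then yaw."""
    q = {}
    for human_joint, (robot_joint, dof) in correspondence.items():
        angles = convert_to_euler(movements[human_joint])
        q[robot_joint] = list(angles[:dof])
    return q
```

For instance, a 2-DoF shoulder receives only (θ_roll, θ_pitch) of a full human shoulder rotation, preserving the motion trend while dropping the unreachable yaw component.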
The Approximate imitation method has several advantages for generating imitation behavior for robots in RLfD. First, the algorithm runs in real time, as the imitation takes place only partially on the instructor's body poses. In particular, we take advantage of local transformations of body poses to avoid solving the inverse kinematics for all robot joints, which is computationally intensive and may not have closed-form solutions. Also, instead of finding the exact mapping for robot joint angles, we set configurations based on the DoF of the robot joint to achieve a similar motion trend. This conversion may sometimes distort movements, but the directions and trends are still captured (as reflected in Section 4). Second, this method is generic and applicable to standard skill-oriented RLfD. Depending on the RLfD scenario, we can also assign different joint correspondences to perform a partial imitation. For other types of RLfD, e.g., object-related demonstrations or goal-oriented learning from demonstrations, we can also apply the proposed method to generate the approximate imitation based on the object or the goal. Specifically, we can replace the joint transformations with the poses of the object or the goal, and generate the target θ_roll, θ_pitch, and θ_yaw. Then we can adopt inverse kinematics solvers to calculate a set of joint configurations to move the robot's end-site to the target pose (θ_roll, θ_pitch, θ_yaw). Based on the DoF and the space constraints of the robot end-effectors, we can make similar approximations to have the end-effector achieve only the roll pose, the roll and pitch poses, or the complete target pose.

Evaluation
This section first introduces our RLfD simulation platform, then describes a preliminary study for determining the timing of imitating behavior, and finally presents the main user study.

RLfD simulation platform
Our RLfD simulation platform is composed of a virtual human instructor and a robot, as shown in Figure 8(a) and (b). The virtual human instructor performs different yet controlled types of movement skills, while the robot (a Pepper) needs to capture motion and learn skills from the instructor. Both parties stand facing each other in a simulated 3D space, as shown in Figure 8(c).
The simulation platform has three major components: a demonstration component, a sensing component, and an engagement component. We further add one more mode, i.e., no engagement (N-mode), to evaluate the effectiveness of the three engagement modes (A-mode, I-mode, and AI-mode). In N-mode, the robot simply stands near the instructor and remains stationary, without any body movement. Unlike in the A-mode, the robot's gaze stays fixed on the demonstrator's face and is not affected by the demonstrator's body movements.
In this simulated RLfD, the tasks for the robot to learn are sports skills performed by a virtual instructor. We chose sports skills because this type of movement has often been adopted in RLfD [28,9]. Four types of sports movements, i.e., boxing, rowing, swimming, and frisbeeing, are selected from the CMU Graphics Lab Motion Capture Database, as these four sports involve movements of various body parts. Regarding policy-deriving algorithms, even a state-of-the-art method may fail to deliver good learning outcomes, which may in turn change participants' perception of the demonstration-gathering process. Thus, to minimize any side effects or biases introduced by the performance of the learning algorithms, we do not employ any learning algorithms, and the robot has no actual learning ability during demonstration gathering. In other words, the robot only communicates its engagement when observing the human demonstrations by showing different cues; it does not learn the sports skills in the following experiments and studies. Figure 9 presents an example of how the simulation platform works. The first row shows the human instructor's real demonstration, which is then re-targeted onto the virtual instructor, as shown in the second row. The third and fourth rows present the running of the Instant attention method and the robot showing attention (A-mode). The last row presents the approximate imitation behavior of the robot (I-mode). We purposely rotate the 3D scene in the last two rows to give a better view of the robot communicating engagement.
We chose online simulation rather than a field test for the following reasons. First, due to the current limitations of RLfD techniques, demonstrators are usually required to wear motion-capture devices, stay within a designated space, and repeatedly showcase the target movements, which could affect their interaction with robots and their perception of robot behavior. Also, limited by their physical abilities, robots such as Pepper can barely move without undesirable noises, jerks, and vibrations, which could disturb participants and influence their assessment of robot learning. Simulation avoids these side effects and unexpected outcomes, and it also lets us purposely select a viewpoint that gives participants a better view of both the robot's and the instructor's behavior, i.e., the staging effect [62]. Second, the robot's engagement behavior can be evaluated in a more consistent and repeatable manner in simulation. In a field test, an instructor's demonstrations are usually non-repeatable and easily influenced by the robot's reactions; simulation allows different engagement cues to be compared without such bias. Third, simulation provides a controllable and measurable environment for monitoring and evaluating a system's performance from various perspectives, which is often a necessity before algorithms are deployed in RLfD.
This simulation platform was built upon the Gazebo simulator (http://gazebosim.org/) and the Robot Operating System (ROS). We used the Matlab Robotics System Toolbox (https://www.mathworks.com/hardware-support/robot-operating-system.html) to facilitate the algorithm implementation.

Preliminary study
In interpersonal communication, a person's imitation behavior, also called mirroring, often occurs after the partner's target behavior with a certain time delay [14,27]. In this paper, we generate such mirroring behavior via the approximation mechanism, and we need to determine the exact time delay so that users correctly recognize imitation as a learning engagement cue. We therefore ran a within-subject pilot experiment to determine the appropriate timing of robot imitation relative to the target action.
Manipulated variable. We set the time delay as the independent variable and experimented with three intervals: 0.5 s, 1.0 s, and 2.0 s. Technically, we used a buffer to store the instructor's body poses and thereby postpone the imitation behavior. The buffer size was set to 15, 30, and 60 poses (i.e., at roughly 30 poses per second) to achieve time delays of about 0.5 s, 1.0 s, and 2.0 s, respectively.
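A pose buffer of this kind can be sketched as follows (a minimal illustration assuming a 30-poses-per-second stream, which is what the 15/30/60 buffer sizes imply; class and method names are ours, not the platform's):

```python
from collections import deque


class PoseDelayBuffer:
    """FIFO buffer that releases instructor poses after a fixed delay."""

    def __init__(self, delay_s, fps=30):
        self.capacity = int(delay_s * fps)  # 15, 30, or 60 frames
        self._buf = deque()

    def push(self, pose):
        """Store the newest pose; return the pose to imitate now,
        or None while the buffer is still filling (the robot holds
        its current posture during this warm-up)."""
        self._buf.append(pose)
        if len(self._buf) > self.capacity:
            return self._buf.popleft()
        return None
```

With `delay_s=1.0` and a 30-fps stream, the buffer holds 30 poses, so each pose re-emerges for imitation one second after it was observed.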
Subject allocation. We recruited 30 participants (mean age: 35.5; 12 female) via Amazon Mechanical Turk (AMT), none of whom had prior experience with physical or virtual robots. Each participant watched three simulated RLfD videos corresponding to the three delay intervals. In the videos, the instructor taught the robot a sports skill, and we staged the 3D scene at a fixed angle for a better view of the robot's imitation. The presentation order of the time delays was counterbalanced.
Dependent variables. Participants watched videos showing the robot imitating the instructor with the three different time delays. They were informed that the robot was supposed to learn sports skills from the demonstrator. After each video, they rated their agreement, on a 7-point Likert scale, with the statement that the robot in the video was actually learning.

Figure 10 presents the average ratings and the overall rating distribution for the different time delays. A repeated measures ANOVA with time delay as the factor shows a significant difference in delay-induced perception of robot learning engagement (F(2, 58) = 88.37, p < 0.01, η² = .76). A Bonferroni post-hoc test suggests that the engagement rating for a 1.0 s delay is significantly higher than those for 0.5 s (p < 0.01) and 2.0 s (p < 0.01). Overall, setting the imitation time delay to 1.0 s effectively communicates the robot's learning engagement (70% agree or strongly agree). We apply this configuration to the Approximate imitation algorithm in the main user study.

One might wonder why the rating difference between the 0.5 s and 1 s delays is so dramatic, even larger than that between the 1 s and 2 s delays. A possible cause is the approximation mechanism used to generate the mirroring behavior. When the delay is small (e.g., 0.5 s), the approximate imitation algorithm generates movement very responsively, at almost the same pace as the demonstrator. Subjects are then likely to feel that the robot is moving along with, rather than following, the demonstrator's movement. As the delay grows (e.g., to 1 s), the following effect becomes more obvious, and the robot appears to be learning from the demonstrator by mimicking his or her behavior.
Consequently, the gap between the ratings for the 0.5 s and 1 s delays in terms of communicating learning engagement is large. This dramatic difference also confirms the necessity of the preliminary study for determining an appropriate delay for the subsequent studies.
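The reported effect size can be cross-checked from the F statistic alone; the helper below (our own, using the standard partial eta-squared formula for repeated-measures designs) illustrates the computation:

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared: (F * df_effect) / (F * df_effect + df_error)."""
    return f_value * df_effect / (f_value * df_effect + df_error)


# For the reported F(2, 58) = 88.37 this gives roughly 0.75, in line
# with the reported eta^2 = .76 (small gaps can arise from rounding or
# from the exact effect-size variant used).
effect = partial_eta_squared(88.37, 2, 58)
```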

Main study
To evaluate the effectiveness of engagement communication and our proposed cues on participants' perception of the robot and the demonstration, we conducted a within-subject experiment on the RLfD simulation platform, with the "without engagement" condition (N-mode) as the baseline.

Hypothesis
Our proposed methods generate different types of cues for robots to express their engagement. Accordingly, we first hypothesize that: H1. 1) Regardless of the actual cue used, robots that communicate engagement are perceived as significantly more engaged in learning (H1a), and their learning behavior is significantly more socially acceptable (H1b), than robots in the N-mode. Further, 2) the imitation cue will receive a significantly higher engagement rating than the attention cue (H1c), and the combined cues will receive the highest rating (H1d). Similarly, 3) the imitation cue will be rated significantly more acceptable than the attention cue (H1e), and the combined cues will be rated the most acceptable (H1f).
Following educational theory, which postulates that learners' engagement cues, especially behavioral engagement, can have reciprocal effects on instructors [53], we hypothesize that: H2. Robots communicating engagement via different cues will influence human participants differently. Specifically, 1) regardless of the cue, communicating engagement will significantly influence humans' estimation of the robot's learning capability (H2a) and significantly raise their expectations of the learning outcomes (H2b), compared with no communication. Further, 2) imitation cues will lead to a significantly higher estimation of the robot's capabilities than attention cues (H2c), and combined cues will have the greatest influence (H2d). Similarly, 3) imitation cues will result in significantly higher expectations of the learning outcome than attention cues (H2e), and combined cues will yield the highest expectations (H2f).
We further hypothesize that a robot showing different engagement behavior can affect humans' assessment of demonstration quality. More specifically: H3. 1) Regardless of the exact demonstrations shown to the robots, different engagement cues will influence participants' assessment of demonstration quality. Specifically, demonstrations for robots with attention cues, imitation cues, or the hybrid cues will be rated as significantly more appropriate (in terms of the expected robot capabilities) than those for robots without engagement cues, even though they are actually the same (H3a). Further, 2) demonstrations for robots with imitation cues or the hybrid cues will be rated significantly more appropriate than those for robots with attention cues (H3b).
In the study, these aspects were measured via post-study questions with 7-point Likert-scale answers, as shown in Figures 11 and 12. We derived the questions from previous research on human-robot interaction and robot learning. Specifically, the questions measuring the robot's communication of engagement are adapted from engagement studies [54,59], and the questions measuring participants' expectations of the robot's learning capability are derived from studies on human expectations and assessment in human-robot collaboration [36]. We also took two steps to ensure the validity of the answers. First, the questions could only be answered after participants took the necessary actions to understand the experiment. For example, the engagement questions became visible only after participants finished watching the full learning videos, and the expectation questions required participants to provide both answers and reasons (those who gave no reasons could not proceed to the next question). Second, all answers were manually checked, and invalid responses were rejected, e.g., responses with identical answers to all questions or with vague and inconsistent comments.

User study design
The study consisted of five sessions: one introductory session and four experimental sessions. The introductory session requested demographic information and presented a background story to engage users: the participant owns a team of four robots competing in an Olympic game and needs to assess the robots' performance while they are under a professional coach's tutelage. In the experimental sessions, participants first watched the human instructor's movements and then monitored the robot's learning process on the RLfD simulation platform. After each session, participants filled in post-study questionnaires. Each session tested one mode, and modes were counterbalanced with learning tasks: we randomized the order of the engagement modes and the four physical skills so that each mode was applied evenly across the skills and each skill occurred evenly across the modes. We recruited 48 participants (mean age: 30.9; 6 female; no prior experience with teaching robots and no participation in the preliminary study) from Amazon Mechanical Turk (AMT).
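The mode-by-skill counterbalancing described above can be sketched with a cyclic Latin square (an illustrative construction of our own; the study's actual randomization procedure may differ):

```python
MODES = ["N-mode", "A-mode", "I-mode", "AI-mode"]
SKILLS = ["boxing", "rowing", "swimming", "frisbeeing"]


def latin_square(items):
    """Cyclic Latin square: every item appears exactly once in each
    row (one participant's order) and each column (session slot)."""
    n = len(items)
    return [[items[(i + j) % n] for j in range(n)] for i in range(n)]


def assign(participant_id):
    """Pair a mode order with a skill order so that, across the pool,
    each mode is seen evenly with each skill and in each slot."""
    mode_order = latin_square(MODES)[participant_id % 4]
    skill_order = latin_square(SKILLS)[(participant_id // 4) % 4]
    return list(zip(mode_order, skill_order))
```

Cycling through 16 participant indices covers every mode-skill pairing once, and 48 participants cover each pairing three times.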
During the experiment, we asked the participants to rate whether they perceived the robot as paying attention or imitating, based on its behavior (e.g., "The robot was paying attention to the coach's demonstration"). This served as a manipulation check, ensuring that our designs indeed conveyed the intended type of engagement.

Analysis and results
(The post-study questionnaire items shown in Figures 11 and 12 include: "The robot was paying attention to the coach's demonstration"; "The robot was following the coach's demonstration"; "How much was the robot engaged in learning the sports skill?"; "The robot's behaviors in the learning process are acceptable"; "The robot shows its intelligence in learning"; "What's the likelihood that the robot will master this sports skill?"; and "Based on the robot's reactions, the coach's demonstration is appropriate for robots to learn".)

Robots in the A-, I-, and AI-modes are rated as significantly more engaged in learning, and their behavior as significantly more acceptable, than robots in the N-mode; H1a and H1b accepted. Further, in terms of engagement, the combined cues are rated significantly higher than either single cue (Bonferroni post-hoc test, p < 0.01); H1d accepted. In terms of acceptability, the combined cues are likewise rated significantly higher than either single cue (Bonferroni post-hoc test, p < 0.01); H1f accepted. However, we do not observe a significant difference between the imitation cue and the attention cue, so H1c and H1e are both rejected. H1 is therefore partially accepted.

Based on these analyses, we conclude the following. Overall, our results partially support H1: robots showing attention, imitation, or both are perceived as significantly more engaged in learning, and their behavior is significantly more acceptable. Also, showing both behaviors is perceived significantly better than showing only one. However, no significant difference can be found between showing attention and showing imitation.
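The post-hoc pairwise comparisons reported in this paper use a Bonferroni correction; a minimal sketch of that adjustment (our own helper, not the authors' analysis code):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """A comparison is significant only if its raw p-value falls below
    alpha divided by the number of comparisons performed."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


# With 4 modes there are 6 pairwise comparisons, so the per-test
# threshold becomes 0.05 / 6, i.e. roughly 0.0083.
```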

Effects of engagement cues on participants' perception.
We then compare the effects of the different engagement cues on subjects' perception via a one-way repeated measures ANOVA with the mode as the independent variable. In general, robot engagement communication significantly enhances the participants' estimation of the robots' learning capabilities and their expectation of the learning outcomes, even though none of the robots in the experiment has any learning ability (no learning algorithms are used in the user study). Specifically, participants estimate the learning capability of robots that communicate engagement to be significantly higher than that of robots in the N-mode.

Overall, our results support H2: communicating engagement significantly influences the humans' estimation of the robots' learning capabilities and significantly changes their expectation of the final learning outcomes, even though none of the robots has any learning ability. Moreover, the behavioral engagement cue in RLfD, i.e., imitation, has significantly more influence on the participants than the attentional cue. Furthermore, communicating engagement via both cues at the same time has significantly more effect on participants than via a single cue.
Effects on participants' assessment of demonstration qualities.
Finally, we analyze participants' ratings of the appropriateness of the instructor's demonstrations. As shown in Figure 12, no significant difference is found between the A-mode (M = 4.48, SD = 2.10) and the N-mode (M = 3.35, SD = 2.08); H3a rejected. However, compared with the A-mode, only the AI-mode (M = 5.93, SD = 1.00) significantly improves participants' assessment of demonstration quality in RLfD (Bonferroni post-hoc test, p < 0.01); H3b partially accepted. Note that across the engagement modes, the skills to be learned are generated from the same set of MoCap data, so all demonstrations are in fact of the same quality.
Overall, our results partially support H3: communicating behavioral or combined engagement significantly improves participants' assessment of demonstration quality, while showing attention alone does not, even though all the demonstrations are actually of the same quality.
Further, in the comments collected from the user study, we found that most participants explicitly stated that robots without behavioral engagement might fail to learn; accordingly, they were more likely to adjust future demonstrations when the robots communicated no engagement or only attentional engagement.

Engagement communication for robots in RLfD
The choice of engagement cue should consider the nature of the learning task
Our results show that robots' behavioral engagement is preferable to attentional engagement in physical skill-oriented RLfD, which can probably be explained by the correspondence between the practice of RLfD and the cone of learning [19]. The cone of learning, a.k.a. the pyramid of learning or cone of experience, depicts a hierarchy of learning through involvement in real experiences [19]. It proposes that visual receiving (just watching a demonstration) is a passive form of learning, and that learners remember only half of the knowledge gained through this channel two weeks later. In contrast, "doing the real thing" is a form of active learning that leads to deeper involvement and better learning outcomes [19].
In RLfD, the basic task for a robot is to derive a policy from demonstrations and then reproduce the instructor's behavior [4]. On the one hand, a robot's imitation behavior resembles this "behavior reproducing" process, so the robot is deemed actively engaged in the learning process. On the other hand, although showing attentional engagement implies that the robot is involved in the visual receiving of instruction, it is still considered a passive way to learn. Consequently, instructors may conclude that a robot showing behavioral engagement will gain a deeper understanding and better mastery of the skill than one showing attentional engagement. Moreover, by analyzing the quality gap between the robot's imitation and the demonstration (the behavior to be reproduced), instructors may form a more accurate assessment of the robot's learning progress. In short, to design effective engagement cues for robots in RLfD, we need to take the nature of the learning task into consideration.

Engagement communication should reflect the robot's actual capabilities
In our study, we do not equip the robot with any actual policy-derivation algorithm, since we want to avoid perception biases caused by algorithm selection. In other words, the robot has no learning ability. Still, many subjects were convinced that robots with engagement communication (attention, imitation, or both) would eventually master the skill. They held this belief even for tasks that are technically very challenging for robots to learn because of the correspondence problem, e.g., swimming. These findings suggest that engagement communication can shape instructors' mental model of the robot's capability and progress, and, as our study shows, create a misalignment between instructors' expectations and the robot's actual development. If instructors shape their teaching according to an inaccurate mental model, frustration may occur later in the RLfD process. Hence, it is critical to ensure that a robot's communication of engagement reflects its actual capabilities (policy development, in the case of RLfD).

Limitations
This work has several limitations. First, in our study, engagement communication is decoupled from the robot's actual learning process, whereas in human or animal learning such communication is usually tied to learning progress; for example, a student making good progress tends to show more behavioral engagement [53]. We will investigate how to couple the learning process with engagement communication in the future. Second, we only consider two types of learning engagement cues, i.e., attention and imitation; in practice, human learners employ more diverse cues, e.g., spatially approaching the instructor. Third, the proposed methods, Instant attention and Approximate imitation, are both based on human body poses. They may not be applicable to learning tasks that do not necessarily involve the demonstrator's body movements, e.g., object manipulation; for such tasks, designing a good mechanism for communicating robot engagement remains an open question. Fourth, we only consider skill-oriented RLfD, in which the robot must master a skill taught by instructors. Other types of RLfD, e.g., goal-oriented RLfD, in which the robot learns how to achieve a goal from human examples, are inherently different in their task settings; though the proposed methods may work, their effects still need to be evaluated in future work. Fifth, we conducted the user study in an online simulation environment without a further real-world, real-time RLfD test. Though simulation is a common practice for evaluating ideas in RLfD, the participants had no control over the teaching process; how participants might reshape future demonstrations based on the robot's engagement feedback needs further investigation.

Conclusion and Future work
In this work, we propose two methods, Instant attention and Approximate imitation, to generate robots' learning engagement in RLfD: the Instant attention method automatically generates the point of attention, and the Approximate imitation method produces the robot's imitation behavior. Based on the two methods, we investigate the effects of three types of engagement communication (showing attention, showing imitation, and showing both) via a within-subject user study. Results suggest that the proposed cues make robots be perceived as significantly more engaged in the learning process and as behaving in a significantly more acceptable way in RLfD than with no engagement communication. These engagement cues also significantly affect participants' estimation of the robots' learning capabilities and their expectations of the learning outcomes, even though none of the robots has any actual learning ability. In particular, the imitation cue influences instructors' perceptions significantly more than the attention cue, while the hybrid cues significantly outperform either single cue. We also find that showing behavioral or combined engagement significantly improves instructors' assessment of demonstration quality. This paper takes a first step towards revealing the potential effects of communicating engagement on humans in RLfD.