
1 Introduction

When humans interact with each other, they often make use of deictic gestures [1] such as pointing to help pick out targets of interest to their conversation [2]. In the field of Human-Robot Interaction, many researchers have explored how we might enable robots to generate the arm motions necessary to effect these same types of deictic gestures [3,4,5,6,7,8]. However, a number of challenges remain to be solved if effective robot-generated deictic gestures are to be possible regardless of morphology and context. Consider, for example, the following scenario:

A mission commander in an alpine search and rescue scenario instructs an unmanned aerial vehicle (UAV): “Search for survivors behind that fallen tree.” The UAV can see three fallen trees and needs to determine which one its user means.

This scenario presents at least two challenges. First, there is a problem of morphology. The UAV’s lack of arms means that generating deictic gestures may not be physically possible. Second, there is a problem of context. Even if the UAV had an arm with which to gesture, doing so might not be effective; picking out far-off fallen trees within a forest may be extremely difficult using traditional gestures.

Recent advances in augmented and mixed reality technologies present the opportunity to address these challenges. Specifically, such technologies enable new forms of deictic gesture for robots with previously problematic morphologies and in previously problematic contexts. In the scenario above, if the mission commander were wearing an augmented reality head-mounted display, the UAV might be able to pick out the fallen trees it wished to disambiguate between by circling them in the mission commander’s display while saying “Do you mean this tree, this tree, or this tree?”.

While there has been little previous work on using augmented, mixed, or virtual reality techniques for human-robot interaction, this is beginning to change. In March 2018, the first international workshop on Virtual, Augmented, and Mixed-Reality for Human-Robot Interaction (VAM-HRI) was held at the 2018 International Conference on Human-Robot Interaction (HRI) [9]. The papers and discussion at that workshop make it evident that we should begin to see more and more research emerging at this intersection of fields.

In this paper, we summarize our own recent work on using augmented, mixed and virtual reality techniques to advance the state-of-the-art of robot-generated deixis, some of which was presented at the 2018 VAM-HRI workshop. In Sect. 2, we begin by providing a framework for categorizing robot-generated deixis in augmented and mixed-reality environments. In Sect. 3, we then discuss a novel method for enabling mixed reality deixis for armless robots. Finally, in Sect. 4 we present a novel method for robot teleoperation in Virtual Reality, and discuss how it could be used to trigger mixed-reality deictic gestures.

2 A Framework for Mixed-Reality Deictic Gesture

Augmented and mixed-reality technologies offer new opportunities for robots to communicate about the environments they share with human teammates. In previous work, we have presented a variety of work seeking to enable fluid natural language generation for robots operating in realistic human-robot interaction scenarios [10, 11] (including work on referring expression generation [12, 13], clarification request generation [14], and indirect speech act generation [15,16,17]). By augmenting their natural language references with visualizations that pick out their objects, locations, and people of interest within teammates’ head-mounted displays, robots operating in such scenarios may facilitate conversational grounding [18, 19] and shared mental modeling [20] with those human teammates in ways that were not previously possible.

While there has been some previous work on using visualizations as “gestures” within virtual or augmented environments [21] and video streams [22], as well as previous work on generating visualizations to accompany generated text [23,24,25,26], this metaphor of visualization-as-gesture has not yet been fully explored. This is doubly true for human-robot interaction scenarios, in which the use of augmented reality for human-robot communication is surprisingly underexplored. In fact, in their recent survey of augmented reality, Billinghurst et al. [27] cite intelligent systems, hybrid user interfaces, and collaborative systems as areas that have been under-attended-to in the AR community.

Most relevant to the current paper, Sibirtseva et al. [28] use augmented reality annotations to indicate different candidate referential hypotheses after receiving ambiguous natural language commands, and Green et al. [29] present a system that uses augmented reality to facilitate human-robot discussion of a plan prior to execution. There have also been several recent approaches to using augmented reality to non-verbally communicate robots’ intentions [30,31,32,33,34,35,36]. These approaches, however, have looked at visualization alone, outside the context of traditional robot gesture. We believe that, just as augmented and mixed reality open up new avenues for communication in human-robot interaction, human-robot interaction opens up new avenues for communication in augmented and mixed reality. Only in mixed-reality human-robot interaction may physical and virtual gestures be generated together or chosen between as part of a single process. In order to understand the different types of gestures that can be used in mixed-reality human-robot interaction, we have been developing a framework for analyzing such gestures along dimensions such as embodiment, cost, privacy, and legibility [37]. In this paper, we extend that framework to encompass new gesture categories and dimensions of analysis.

2.1 Conceptual Framework

In this section, we present a conceptual framework for describing mixed-reality deictic gestures. A robot operating within a pure-reality environment has access to but a single interface for generating gestures (its own body) and, accordingly, but a single perspective within which to generate them (its own). A robot operating within a mixed-reality environment, however, may leverage the hardware that enables such an environment, and the additional perspectives that come with those hardware elements. For robots operating within mixed-reality environments, we identify three unique hardware elements that can be used for deixis, each of which comes with its own perspective and, accordingly, its own class of deictic gestures.

First, robots may use their own bodies to perform the typical deictic gestures (such as pointing) available in pure reality. We categorize such gestures as egocentric (as shown in Fig. 1a), because they are generated from the robot’s own perspective. Second, robots operating in mixed-reality environments may be able to make use of head-mounted displays worn by human teammates. We categorize such gestures as allocentric (as shown in Fig. 1b) because they are generated using only the perspective of the display’s wearer. A robot may, for example, “gesture” to an object by circling it within its teammate’s display. Third, robots operating in mixed-reality environments may be able to use projectors to change how the world is perceived for all observers. We categorize such gestures as perspective-free (as shown in Fig. 1c) because they are not generated from the perspective of any one agent.

In addition, robots operating in mixed-reality environments may be able to perform multi-perspective gestures that use the aforementioned mixed-reality hardware in a way that connects back to the robot’s own perspective. A robot may, for example, gesture to an object in its teammate’s display, or using a projector, by drawing an arrow from itself to its target object, or by gesturing towards its target using a virtual appendage that exists only in virtuality. We call the former class ego-sensitive allocentric gestures and the latter class ego-sensitive perspective-free gestures.

Table 1. Analysis of mixed-reality deictic gestures

Fig. 1. Categories of mixed-reality deictic gestures

2.2 Analysis of Mixed-Reality Deictic Gestures

Each of these gestural categories comes with its own unique properties. Here, we specifically examine six: perspective, embodiment, capability, privacy, cost, and legibility. These dimensions are summarized in Table 1.

The most salient dimensions that differentiate these categories of mixed-reality deictic gestures are the perspectives, embodiment, and capabilities they require. The perspectives required are clearly defined: egocentric gestures require access to the robot’s perspective, allocentric gestures require access to the human interlocutor’s perspective, and perspective-free gestures require access only to the greater environment’s perspective. The ego-sensitive gestures connect their initial perspective with that of the robot. Those categories generated from or connected to the perspective of the robot notably require the robot to be embodied and co-present with their interlocutor; but only the egocentric category requires the robot’s embodied form to be capable of movement.

The different hardware needs of these categories result in different levels of privacy. Here, we distinguish between local privacy and global privacy. We describe those categories that use a head-mounted display as affording high local privacy, as gestures are only visible to the human teammate with whom the robot is communicating. This dimension is particularly important for human-robot interaction scenarios involving either sensitive user populations (e.g., elder care or education) or adversarial contexts (e.g., competitive [39], police [40], campus safety [41], or military domains (as in DARPA’s “Silent Talk” program) [42]). On the other hand, we describe egocentric gestures as having high global privacy, as, unlike with the other categories, information about gestural data need not be sent over a network, and thus may not be as vulnerable to hackers.

These categories of mixed-reality deictic gestures also come with different technical challenges, resulting in different computational costs. From the perspective of energy usage, egocentric gestures are expensive due to their physical component (a high generation cost). On the other hand, gestures that make use of a head-mounted display may be expensive to maintain due to registration challenges (a high maintenance cost).

Finally, these gestures differ with respect to legibility. In previous work, Dragan et al. [43] defined the notion of the legibility of an action: the ease with which a human observer is able to determine the goal or purpose of an action as it is being carried out. In later work with Holladay et al. [5], Dragan applied this notion to deictic gestures as well, analyzing the ability of the final gestural position to enable humans to pick out the target object. We believe, however, that this is really a distinct sense of legibility from Dragan’s original formulation. As such, we refine the notion of legibility as applied to deictic gestures into two categories: we use dynamic legibility to refer to the degree to which a deictic gesture enables a human teammate to pick out the target object as the action is unfolding (in line with Dragan’s original formulation), and static legibility to refer to the degree to which the final pose of a deictic gesture enables a human teammate to pick out the target object after the action is completed (in line with Holladay’s formulation).

The gestural categories we describe differ with respect to both dynamic and static legibility. Allocentric and perspective-free gestures have high dynamic legibility (trivially, as they have no dynamic dimension) and high static legibility (as the target is uniquely picked out). Egocentric gestures have low dynamic legibility (relative to allocentric gestures), given that their target may not be clear at all as the action unfolds, and low static legibility, as the target may not be clear after the action is performed either, depending on the distance to the target and the density of distractors. The legibility of multi-perspective gestures depends on how exactly they are displayed. If they extend all the way to a target object, they may have high static legibility, whereas if they only point toward the target they will have low static legibility. Dynamic legibility depends both on this factor and on temporal extent: if a multi-perspective gesture unfolds over time, this may decrease its legibility (although it may better capture the user’s attention).

2.3 Combination of Mixed-Reality Deictic Gestures

Finally, given these classes of mixed-reality deictic gestures, we can also reason about combinations of these gestures. Rather than explicitly discuss all 31 non-empty combinations of these five categories, we will briefly describe how the gestural categories combine. Simultaneous generation of gestures requiring different perspectives results in both perspectives being needed. The embodiment and capability requirements of simultaneous gestures combine disjunctively. The legibilities and costs of simultaneous gestures combine using a max operator, as the legibility of one gesture will excuse the illegibility of another, but the low cost of one gesture will not excuse the high cost of another. And the privacies of simultaneous gestures combine using a min operator, as the high privacy of one gesture does not excuse the low privacy of another.
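To make these combination rules concrete, the following minimal Python sketch combines the properties of two simultaneously generated gestures; the property names and 0–1 scales are illustrative assumptions rather than part of the framework itself.

```python
from dataclasses import dataclass

@dataclass
class GestureProperties:
    """Illustrative per-category properties on an assumed 0-1 scale."""
    perspectives: frozenset   # perspectives required (e.g., {"robot"}, {"human"})
    needs_embodiment: bool    # must the robot be embodied and co-present?
    needs_mobility: bool      # must the embodied form be capable of movement?
    legibility: float         # static legibility; higher is better
    cost: float               # computational/energetic cost; higher is worse
    privacy: float            # local privacy; higher is better

def combine(a: GestureProperties, b: GestureProperties) -> GestureProperties:
    """Combination rules for two simultaneously generated gestures (Sect. 2.3)."""
    return GestureProperties(
        perspectives=a.perspectives | b.perspectives,                # both perspectives are needed
        needs_embodiment=a.needs_embodiment or b.needs_embodiment,   # disjunctive combination
        needs_mobility=a.needs_mobility or b.needs_mobility,         # disjunctive combination
        legibility=max(a.legibility, b.legibility),  # one legible gesture excuses an illegible one
        cost=max(a.cost, b.cost),                    # a cheap gesture does not excuse a costly one
        privacy=min(a.privacy, b.privacy),           # a private gesture does not excuse a public one
    )

# Example: an egocentric point combined with an allocentric circling annotation
# (the numeric values here are arbitrary illustrations).
egocentric = GestureProperties(frozenset({"robot"}), True, True, 0.3, 0.8, 0.9)
allocentric = GestureProperties(frozenset({"human"}), False, False, 0.9, 0.4, 0.8)
print(combine(egocentric, allocentric))
```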

3 Enabling Deictic Capabilities for Armless Robots Using Mixed-Reality Robotic Arms

In the previous section, we presented a framework for analyzing mixed-reality deictic gestures. Within this framework, the gestural categories that have received the least previous attention are the ego-sensitive categories, which connect the gesture-generating robot with the perspective of the human viewer or with the perspective of their shared environment. In this section, we present a novel approach to ego-sensitive allocentric gesture. Specifically, we propose to superimpose mixed-reality visualizations of robot arms onto otherwise armless robots, to allow them to gesture within their environment. This will allow an armless robot like a wheelchair or drone to gesture just as if it had a physical arm, even if mounting such an arm would not be mechanically possible or cost effective. Unlike purely allocentric gestures (e.g., circling an object in one’s field of view), this approach emphasizes the generator’s embodiment; as such, we would expect it to lead to increased perception of the robot’s agency and increased likability of the robot, and to promote positive team dynamics.

In this section we present the preliminary technical work necessary to enable such an approach. Specifically, we present a kinematic approach to perform this kind of mixed-reality deictic gesture. Compared to motion planning, a purely kinematic approach is more computationally efficient, a potential advantage for low-power embedded systems that we may wish to use for AR displays. The trade-off is that the kinematic approach is incomplete, so it may fail to find collision-free motions for some cluttered environments. However, collisions are not an impediment for virtual arms, thus mitigating the potential downside of purely kinematic motions.

Our approach applies dual-quaternion forward kinematics and Jacobian damped-least-squares inverse kinematics.

Fig. 2. Kinematic diagrams for deictic gestures. (a) The local coordinate frames (“frames”) of a serial manipulator. (b) A schematic of a serial manipulator with vectors for the pointing direction and the vector to a target object.

3.1 Kinematics

Forward Kinematics. We adopt the conventional model for serial robot manipulators of kinematic chains and trees [44,45,46,47,48,49]. Each local coordinate frame (“frame”) of the robot has an associated label, and the frames are connected by Euclidean transformations (see Fig. 2a).

We represent Euclidean transformations with dual quaternions. Compared to matrix representations, dual quaternions offer computational advantages in efficiency, compactness, and numerical stability. A dual quaternion is a pair of quaternions: an ordinary part for rotation and a dual part for translation. Notationally, we use a leading superscript to denote the parent’s local coordinate frame p and a trailing subscript to denote the child frame c. Given rotation unit quaternion \({^{p}\!}{h}_{c}\) and translation vector \(\mathbf {v}\) from p to c, the transformation dual quaternion is:

$$\begin{aligned} {^{p}\!}{S}_{c} = {^{p}\!}{h}_{c} + \frac{\varepsilon }{2}\, \mathbf {v} \otimes {^{p}\!}{h}_{c} \end{aligned}$$
(1)

where \(\varvec{\hat{\imath }}\), \(\varvec{\hat{\jmath }}\), \(\varvec{\hat{k}}\) are the imaginary elements, with \(\varvec{\hat{\imath }}^2 = \varvec{\hat{\jmath }}^2 = \varvec{\hat{k}}^2 = \varvec{\hat{\imath }}\varvec{\hat{\jmath }}\varvec{\hat{k}}= -1 \), and \(\varepsilon \) is the dual element, with \(\varepsilon ^2 = 0\) and \(\varepsilon \ne 0\).

Chaining transformations corresponds to multiplication of the transformation matrices or dual quaternions. For a kinematic chain, we must match the child frame of each predecessor to the parent frame of its successor transformation. The result is the transform from the parent of the initial transformation to the child of the final transformation:

$$\begin{aligned} {^{0}\!}{S}_{2} = {^{0}\!}{S}_{1} \otimes {^{1}\!}{S}_{2} \end{aligned}$$
(2)

We illustrate the kinematics computation for the simple serial manipulator in Fig. 2b. Note that the local frames and relative transforms of the robot in Fig. 2b correspond to those drawn in Fig. 2a.

The kinematic position of a robot is fully determined by its configuration \(\varvec{\phi }\), i.e., the vector of joint angles,

$$\begin{aligned} \varvec{\phi } = \begin{bmatrix} \phi _1&\phi _2&\cdots&\phi _n \end{bmatrix}^T \end{aligned}$$
(3)

The relative frame at each joint i is a function of the corresponding configuration: \({^{i-1}\!}{S}_{i} = {^{i-1}\!}{S}_{i}(\phi _i)\). The frame for the end-effector e is the product of all frames in the chain:

$$\begin{aligned} {^{0}\!}{S}_{e}(\varvec{\phi }) = {^{0}\!}{S}_{1}(\phi _1) \otimes {^{1}\!}{S}_{2}(\phi _2) \otimes \cdots \otimes {^{n-1}\!}{S}_{e}(\phi _n) \end{aligned}$$
(4)
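To illustrate (1), (2), and (4), the following minimal Python sketch chains dual-quaternion transforms along a simple serial chain; the joint convention used here (a rotation about a local axis, with the child origin at a fixed offset in the parent frame) is an illustrative assumption rather than a prescribed one.

```python
import numpy as np

def qmul(a, b):
    """Hamilton product of quaternions stored as [w, x, y, z]."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([aw*bw - ax*bx - ay*by - az*bz,
                     aw*bx + ax*bw + ay*bz - az*by,
                     aw*by - ax*bz + ay*bw + az*bx,
                     aw*bz + ax*by - ay*bx + az*bw])

def qconj(q):
    """Quaternion conjugate."""
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def dq_from_rot_trans(h, v):
    """Transformation dual quaternion S = h + (eps/2) v h, as in Eq. (1)."""
    return h, 0.5 * qmul(np.array([0.0, *v]), h)

def dq_mul(A, B):
    """Chain two transforms: (r1 + eps d1)(r2 + eps d2) = r1 r2 + eps (r1 d2 + d1 r2)."""
    (r1, d1), (r2, d2) = A, B
    return qmul(r1, r2), qmul(r1, d2) + qmul(d1, r2)

def dq_translation(S):
    """Recover the translation vector of a transform: v = 2 d h*."""
    r, d = S
    return 2.0 * qmul(d, qconj(r))[1:]

def joint_transform(axis, angle, offset):
    """Revolute joint: rotation by `angle` about a local `axis`, with the child origin
    at `offset` in the parent frame (an assumed convention, for illustration only)."""
    axis = np.asarray(axis, float) / np.linalg.norm(axis)
    h = np.array([np.cos(angle / 2), *(np.sin(angle / 2) * axis)])
    return dq_from_rot_trans(h, np.asarray(offset, float))

def forward_kinematics(axes, offsets, phi):
    """End-effector frame as the product of all joint frames in the chain, as in Eq. (4)."""
    S = (np.array([1.0, 0.0, 0.0, 0.0]), np.zeros(4))  # identity transform
    for a, o, q in zip(axes, offsets, phi):
        S = dq_mul(S, joint_transform(a, q, o))
    return S

# Example: a planar two-link arm (unit-length links) rotating about z.
S = forward_kinematics(axes=[[0, 0, 1], [0, 0, 1]],
                       offsets=[[1, 0, 0], [1, 0, 0]],
                       phi=[np.pi / 2, -np.pi / 2])
print(dq_translation(S))  # approximately [1, 1, 0]
```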

Cartesian Control. We compute the least-squares solution for Cartesian motion using a singularity-robust Jacobian pseudoinverse:

$$\begin{aligned} \dot{\mathbf {x}} = \mathbf {J}\dot{\varvec{\phi }} \quad \leadsto \quad \dot{\varvec{\phi }} = \mathbf {J}^+ \dot{\mathbf {x}} \end{aligned}$$
(5)
$$\begin{aligned} \mathbf {J}^+ = \sum _{i=0}^{\min (m,n)} \frac{s_i}{\max ({s_i}^2,{s_{\min }}^2)} \mathbf {v_i} \mathbf {u_i}^T \end{aligned}$$
(6)

where \(\dot{\mathbf {x}} = [\omega ,\dot{v}]\) is the vector of rotational velocity \(\omega \) and translational velocity \(\dot{v}\), \(\mathbf {J}= \mathbf {U}\mathbf {S}\mathbf {V}^T\) is the singular value decompositionFootnote 2 of Jacobian J, and \(s_{\min }\) is a selected constant for the minimum acceptable singular value.
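As a concrete illustration of (6), the following Python sketch computes the singularity-robust pseudoinverse from a singular value decomposition; the value of \(s_{\min }\) shown is an arbitrary placeholder.

```python
import numpy as np

def damped_pseudoinverse(J, s_min=1e-2):
    """Singularity-robust pseudoinverse of Eq. (6): small singular values are damped
    by clamping the denominator at s_min**2 rather than being discarded."""
    U, s, Vt = np.linalg.svd(J)              # J = U S V^T
    J_pinv = np.zeros((J.shape[1], J.shape[0]))
    for i, s_i in enumerate(s):
        J_pinv += (s_i / max(s_i**2, s_min**2)) * np.outer(Vt[i], U[:, i])  # v_i u_i^T terms
    return J_pinv

# Usage sketch: joint velocities for a desired workspace velocity xdot = [omega, vdot].
# J = ...                                   # 6 x n manipulator Jacobian (not computed here)
# phi_dot = damped_pseudoinverse(J) @ xdot
```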

We determine Cartesian velocity \(\dot{\mathbf {x}}\) with a proportional gain on position error, computed as the velocity to reach the desired target in unit time, decoupling rotational \(\omega \) and translational \(\dot{v}\) parts to achieve straight-line translations:

$$\begin{aligned} \omega&= 2 \ln \left( {^{0}\!}{h}_{r} \otimes {^{0}\!}{h}_{e}^{*}\right) \end{aligned}$$
(7)
$$\begin{aligned} \dot{v}&= {^{0}\!}{\mathbf {v}}_{r} - {^{0}\!}{\mathbf {v}}_{e} \end{aligned}$$
(8)

where \(\ln \) is the quaternion logarithm, so that \(2 \ln \left( {^{0}\!}{h}_{r} \otimes {^{0}\!}{h}_{e}^{*}\right) \) is the rotation vector (angle times axis) from the actual to the reference orientation.

In combination, we compute the reference joint velocity as:

$$\begin{aligned} \dot{\varvec{\phi }} = \mathbf {J}^{+} \begin{bmatrix} \omega \\ \dot{v} \end{bmatrix} \end{aligned}$$
(9)

where e is the actual end-effector frame and r is the desired or reference frame.
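Putting (5)–(9) together, a single kinematic control step might look like the following Python sketch; it uses SciPy’s rotation utilities in place of the dual quaternions above, and the example Jacobian and poses are illustrative stand-ins rather than values from our system.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def cartesian_control_step(J, h_e, v_e, h_r, v_r, s_min=1e-2):
    """One kinematic control step: pose error -> reference joint velocity.
    h_e, h_r are orientations (scipy Rotations); v_e, v_r are 3-vector positions."""
    # Rotational velocity: rotation vector taking the actual orientation to the reference in unit time.
    omega = (h_r * h_e.inv()).as_rotvec()
    # Translational velocity: straight-line motion toward the reference position.
    v_dot = np.asarray(v_r) - np.asarray(v_e)
    xdot = np.concatenate([omega, v_dot])
    # Damped least-squares pseudoinverse, as in Eq. (6).
    U, s, Vt = np.linalg.svd(J)
    J_pinv = sum((si / max(si**2, s_min**2)) * np.outer(Vt[i], U[:, i]) for i, si in enumerate(s))
    return J_pinv @ xdot

# Example with an illustrative random 6x7 Jacobian and a small pose error.
rng = np.random.default_rng(0)
J = rng.standard_normal((6, 7))
h_e, v_e = R.identity(), np.zeros(3)
h_r, v_r = R.from_rotvec([0.0, 0.0, 0.1]), np.array([0.05, 0.0, 0.0])
print(cartesian_control_step(J, h_e, v_e, h_r, v_r))
```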

3.2 Design Patterns

While the method above provides a general approach to enabling mixed-reality deictic gestures, there are a variety of different possible forms of deictic gestures that might be generated using that approach. In this section, we propose three candidate gesture designs enabled by the proposed approach: Fixed Translation, Reaching, and Floating Translation.

Fixed Translation. The first proposed design, Fixed Translation, is the most straightforward manifestation of the proposed approach. In this design, the visualized arm rotates in place to point to the desired target. To enable this design, we must find a target orientation for the end-effector. We find the relative rotation between the current end-effector frame and pointing direction towards the target based on the end-effector’s pointing direction vector and the vector from the end-effector to the target (see Fig. 2b).

First, we find the end-effector’s global pointing vector \(\hat{u}_e\) by rotating the local pointing direction \(\hat{a}_e\).

$$\begin{aligned} \hat{u}_e = {^{0}\!}{h}_{e} \otimes \hat{a}_e \otimes {^{0}\!}{h}_{e}^{*} \end{aligned}$$
(10)

Then, we find the vector from the end-effector to the target by subtracting the end-effector translation \({^{0}\!}{v}_{e}\) from the target translation \({^{0}\!}{v}_{b}\) and normalizing to a unit vector.

$$\begin{aligned} \hat{u}_b = \frac{{^{0}\!}{v}_{b} - {^{0}\!}{v}_{e}}{\left\| {^{0}\!}{v}_{b} - {^{0}\!}{v}_{e}\right\| } \end{aligned}$$
(11)

Next, we compute the relative rotation between the two vectors \(\hat{u}_e\) and \(\hat{u}_b\) using the dot product to find the angle \(\theta \) and cross product to find the axis \(\hat{a}\),

$$\begin{aligned} \theta&= \cos ^{-1}\left( \hat{u}_e\bullet \hat{u}_b\right) \end{aligned}$$
(12)
$$\begin{aligned} \hat{a}&= \frac{\hat{u}_e\times \hat{u}_b}{\sin \theta } \end{aligned}$$
(13)

The axis \(\hat{a}\) and angle \(\theta \) then give us the rotation unit quaternion \(h\):

$$\begin{aligned} h = \cos \frac{\theta }{2} + \hat{a} \sin \frac{\theta }{2} \end{aligned}$$
(14)

Note that a direct conversion of the vectors to the rotation unit quaternion avoids the need for explicit evaluation of transcendental functions.
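The following Python sketch illustrates that shortcut: the quaternion is assembled directly from the dot and cross products of the two unit vectors and then normalized, with the antiparallel case handled separately. It is an illustrative helper, not code from our system.

```python
import numpy as np

def rotation_between(u_e, u_b):
    """Unit quaternion [w, x, y, z] rotating unit vector u_e onto u_b, built directly
    from the dot and cross products (no explicit acos/sin calls), cf. Eqs. (12)-(14)."""
    u_e, u_b = np.asarray(u_e, float), np.asarray(u_b, float)
    q = np.concatenate([[1.0 + u_e @ u_b], np.cross(u_e, u_b)])
    n = np.linalg.norm(q)
    if n < 1e-9:
        # u_e and u_b are (nearly) opposite: any perpendicular axis gives a 180-degree rotation.
        axis = np.cross(u_e, [1.0, 0.0, 0.0])
        if np.linalg.norm(axis) < 1e-9:
            axis = np.cross(u_e, [0.0, 1.0, 0.0])
        return np.concatenate([[0.0], axis / np.linalg.norm(axis)])
    return q / n

# Example: rotate an x-axis pointing direction onto a target direction along y.
print(rotation_between([1, 0, 0], [0, 1, 0]))  # approximately [0.7071, 0, 0, 0.7071] (90 deg about z)
```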

Now, we compute the global reference frame for the end-effector using

$$\begin{aligned} {^{0}\!}{h}_{r} = h \otimes {^{0}\!}{h}_{e}, \qquad {^{0}\!}{\mathbf {v}}_{r} = {^{0}\!}{\mathbf {v}}_{e} \end{aligned}$$
(15)

Combining (15) and (9), we compute the joint velocities \(\dot{\varvec{\phi }}\) for the robot arm.

Reaching. Our second proposed design, Reaching, stretches the arm out towards the target, increasing gesture legibility in a way that would not be feasible with a physical arm. To enable this design, we compute the instantaneous desired orientation as in the Fixed Translation case, but now set the desired translation to the target object’s translation \({^{0}\!}{\mathbf {v}}_{b}\):

$$\begin{aligned} {^{0}\!}{\mathbf {v}}_{r} = {^{0}\!}{\mathbf {v}}_{b} \end{aligned}$$
(16)

Then we combine (16) and (9) to compute the joint velocities \(\dot{\varvec{\phi }}\) for the robot arm.

Floating Translation. Deictic information is conveyed primarily by the orientation of the end-effector rather than its translation. Thus, in our final design, Floating Translation, we consider a case where the translation can freely float, allowing the arm to point with more natural-looking configurations. First, we remove the translational component from the control law. Second, we center all joints within the Jacobian null space, so that centering does not impact end-effector velocity. We update the workspace control law with a weighting matrix and a null space projection term:

$$\begin{aligned} \dot{\varvec{\phi }} = \mathbf {J}^{+} \mathbf {W} \dot{\mathbf {x}} + \mathbf {N} \dot{\varvec{\phi }}_{0} \end{aligned}$$
(17)

The weighting matrix \(\mathbf {W}\) removes the translational component from the Jacobian \(\mathbf {J}\), so only rotational error contributes to the joint velocity \(\dot{\varvec{\phi }}\). Structurally, \(\mathbf {J}^+\) consists of rotational block \(\mathbf {j}^+_\omega \) and translational block \(\mathbf {j}^+_{\dot{v}}\). We construct \(\mathbf {W}\) to remove \(\mathbf {j}^+_{\dot{v}}\).

$$\begin{aligned} \mathbf {W} = \begin{bmatrix} \mathbf {I}_{3\times 3} &{} \mathbf {0}_{3\times 3} \\ \mathbf {0}_{3\times 3} &{} \mathbf {0}_{3\times 3} \end{bmatrix}, \qquad \mathbf {J}^{+}\mathbf {W} = \begin{bmatrix} \mathbf {j}^{+}_{\omega }&\mathbf {0}_{n\times 3} \end{bmatrix} \end{aligned}$$
(18)

where n is the length of \(\varvec{\phi }\), or equivalently the number of rows in \(\mathbf {J}^+\).

We use the null space projection to move all joints towards their center configuration, without impacting the end-effector pose:

$$\begin{aligned} \mathbf {N} \dot{\varvec{\phi }}_{0} = \left( \mathbf {I} - \mathbf {J}^{+}\mathbf {J}\right) \left( \varvec{\phi }_c - \varvec{\phi }_a\right) \end{aligned}$$
(19)

where \(\varvec{\phi }_c\) is the center configuration and \(\varvec{\phi }_a\) is the actual configuration.

The combined workspace control law is

$$\begin{aligned} \dot{\varvec{\phi }} = \mathbf {J}^{+} \mathbf {W} \dot{\mathbf {x}} + \left( \mathbf {I} - \mathbf {J}^{+}\mathbf {J}\right) \left( \varvec{\phi }_c - \varvec{\phi }_a\right) \end{aligned}$$
(20)
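As an illustration of (20), the following Python sketch computes one Floating Translation control step, tracking only the rotational error while the null-space term pulls the joints toward a center configuration; the random Jacobian and configurations are illustrative placeholders.

```python
import numpy as np

def floating_translation_step(J, omega, phi_a, phi_c, s_min=1e-2):
    """Floating Translation control step (Eq. 20): track only rotational error and use the
    Jacobian null space to pull all joints toward a center configuration."""
    m, n = J.shape                           # m workspace dimensions (6), n joints
    # Damped least-squares pseudoinverse, as in Eq. (6).
    U, s, Vt = np.linalg.svd(J)
    J_pinv = sum((si / max(si**2, s_min**2)) * np.outer(Vt[i], U[:, i]) for i, si in enumerate(s))
    # Weighting matrix W zeroes the translational component, as in Eq. (18).
    W = np.zeros((m, m))
    W[:3, :3] = np.eye(3)
    # Null-space projector: joint motion that does not affect the end-effector.
    N = np.eye(n) - J_pinv @ J
    xdot = np.concatenate([omega, np.zeros(3)])
    return J_pinv @ W @ xdot + N @ (np.asarray(phi_c) - np.asarray(phi_a))

# Example with an illustrative random Jacobian for a 7-joint arm.
rng = np.random.default_rng(1)
J = rng.standard_normal((6, 7))
print(floating_translation_step(J, omega=[0.0, 0.0, 0.2],
                                phi_a=np.zeros(7), phi_c=0.1 * np.ones(7)))
```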

In this paper, we have proposed a new form of mixed-reality deictic gesture, and proposed a space of candidate designs for manifesting such gestures. In current and future work, we will implement all three designs using the Microsoft HoloLens, and evaluate their performance with respect both to each other and to the other categories of gesture we have described. In the next section, we turn to methods by which such gestures might be generated by human teleoperators during human-subject experiments.

4 An Interface for Virtual Reality Teleoperation

In the previous sections, we presented a framework for mixed-reality deixis, and a novel form of mixed-reality deictic gesture. But a question remains as to how robots might decide to generate such gestures. While in future work our interests lie in computational approaches for allowing robots to decide for themselves when and how to generate such gestures, in this work we first examine how humans might trigger such gestures, and how novel virtual reality technologies might facilitate this process.

Specifically, we examine how virtual reality and gesture recognition technologies may be used to control the gesture-capable robots used by Human-Robot Interaction (HRI) researchers during human-subject experiments [50]. Manual control of language- and gesture-capable robots is crucial for HRI researchers seeking to evaluate human perceptions of potential autonomous capabilities which either do not yet exist, or are not yet robust enough to work consistently and predictably, as in the Wizard of Oz (WoZ) experimental paradigm [51]. For the purposes of such experiments, manual control of dialogue and gestural capabilities is particularly challenging [52]. Not only is it repetitive and time consuming to design WoZ interfaces for such capabilities, but such interfaces are not always effective, as the time necessary for an experimenter to decide to issue a command, click the appropriate button, and have that command take effect on the robot is typically too long to facilitate natural interaction.

What is more, such interfaces typically require experimenters to switch back and forth between monitoring a camera stream depicting the robot’s environment and consulting their control interface: a pattern that can decrease experimenters’ situational awareness and harm experiment effectiveness [53]. This is particularly true when the camera stream depicts the robot’s environment from a third-person perspective, which can lead to serious performance challenges [54]. While some recent approaches have introduced the use of augmented reality for safely teleoperating co-present robots [55, 62], robots are not typically co-present with teleoperators during tightly controlled WoZ experiments. For such applications, Virtual Reality (VR) teleoperation provides one possible solution. VR is also beneficial as immersion in the robot’s perspective improves depth perception and enhances visual feedback [56]. On the other hand, immersive first-person teleoperation comes with its own concerns. Researchers have recently noted safety concerns, as a sufficiently constrained robot perspective may limit the teleoperator’s situational awareness [57]. What is more, VR teleoperation in particular raises challenges, as the teleoperator may no longer be able to see their teleoperation interface.

4.1 Previous Work

There have been a large number of approaches to robot teleoperation through virtual reality, even within only the past year. First, there has been some work on robot teleoperation using touchscreens displaying first- or third-person views of the robot’s environment [63, 64]. There have been a number of approaches enabling first-person robot teleoperation using virtual reality displays, using a variety of different control modalities, including joysticks [65], VR hand controllers [66,67,68,69,70,71], gloves [72, 73], and full-torso exo-suits [59]. There has been less work enabling hands-free teleoperation, with the closest previous work we are aware of being Miner and Stansfield’s approach, which allowed gesture-based control in simulated, third-person virtual reality. The only approaches we are aware of enabling first-person hands-free control are our own approach (discussed in the next section), and the Kinect-based approach of Sanket et al., which was presented at the same workshop as our own work [70].

Fig. 3. Multiple views of integrated system

4.2 Integrated Approach

In our recent work [50], we have proposed a novel teleoperation interface which provides hands-free WoZ control of a robot while providing the teleoperator with an immersive VR experience from the robot’s point of view. This interface integrates a VR headset, interfaced directly with the robot’s camera to allow the experimenter to see exactly what the robot sees (Fig. 3b), with a Leap Motion Controller. Translating traditional joystick or gamepad control to robotic arm motions can be challenging, but the Leap Motion Controller can simplify this process by allowing the user to replicate the gesture he/she desires of the robot, making it a powerful hands-free teleoperation device [58]. There has been work on using the Leap Motion for teleoperation outside the context of virtual reality [74,75,76], but to the best of our knowledge our approach is the first to pair it with an immersive virtual-reality display. In our approach, we use the Leap Motion sensor to capture the experimenter’s gestures, and then generate analogous gestures on the robot in real time. Specifically, we first extract hand position and orientation data from the raw Leap Motion data. Figure 3c shows the visualization of the tracking data produced by the Leap Motion: each arrow represents a finger, and each trail represents the corresponding movement of that finger. Changes in this position and orientation data are used to trigger changes in the robot’s gestures according to the following equations:

$$\begin{aligned} robotGesturePitch = \left\{ \begin{matrix} low &{} \tau _{p_1}< humanGesturePitch< \tau _{p_2} \\ high &{} \tau _{p_2}< humanGesturePitch < \tau _{p_3} \end{matrix}\right. \end{aligned}$$
$$\begin{aligned} robotGestureRoll = \left\{ \begin{matrix} low &{} \tau _{r_1}< humanGestureRoll< \tau _{r_2} \\ high &{} \tau _{r_2}< humanGestureRoll < \tau _{r_3} \end{matrix}\right. \end{aligned}$$
Fig. 4. Architecture diagram: The user interacts directly with a VR headset (e.g., Google Cardboard) and a Leap Motion gesture sensor. These devices send data to and receive data from a humanoid robot (e.g., the Softbank Pepper) using an instance of the ROS architecture whose Master node is run on a standard Linux laptop.

Here, parameters \(\tau _{p_1}< \tau _{p_2} < \tau _{p_3}\) and \(\tau _{r_1}< \tau _{r_2} < \tau _{r_3}\) are manually defined pitch and roll thresholds. While in this work our initial prototype makes use of these simple inequalities, in future work we aim to examine more sophisticated geometric and approximate methods for precisely mapping human gestures to robot gestures, with the aim of enabling a level of control currently seen in suit-based teleoperation systems [59].
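The following Python sketch illustrates this thresholding; the numeric threshold values are placeholders rather than those used in our prototype.

```python
def discretize(value, tau_1, tau_2, tau_3):
    """Map a continuous hand angle into a discrete robot gesture level using
    manually defined thresholds tau_1 < tau_2 < tau_3."""
    if tau_1 < value < tau_2:
        return "low"
    if tau_2 < value < tau_3:
        return "high"
    return None  # outside the modeled range (an assumption; only the two cases above are defined)

def map_hand_to_robot_gesture(hand_pitch, hand_roll,
                              pitch_taus=(-0.6, 0.0, 0.6),   # placeholder thresholds, in radians
                              roll_taus=(-0.6, 0.0, 0.6)):
    """Convert Leap Motion hand pitch/roll into discrete robot gesture commands."""
    return {"robotGesturePitch": discretize(hand_pitch, *pitch_taus),
            "robotGestureRoll": discretize(hand_roll, *roll_taus)}

# Example: a hand tilted slightly up and rolled slightly to one side.
print(map_hand_to_robot_gesture(hand_pitch=0.3, hand_roll=-0.2))
```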

All components of the proposed interface are integrated using the Robot Operating System (ROS) [60]. As shown in Fig. 4, the Leap Motion publishes raw sensor data, which is converted into motion commands. These motion commands are then sent to the robot. Similarly, camera data is published by the robot to a topic subscribed to by the Android VR app, which displays it in the VR headset.
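The following Python sketch illustrates this ROS glue in rospy form; the topic names, message types, and joint mapping are placeholders for illustration rather than the interfaces exposed by the actual Leap Motion and Pepper drivers.

```python
#!/usr/bin/env python
# Minimal sketch of the ROS integration described above. Topic names, message types,
# and the joint mapping are illustrative placeholders.
import rospy
from geometry_msgs.msg import PoseStamped   # stand-in for the Leap hand-pose message
from sensor_msgs.msg import JointState      # stand-in for the robot gesture command

def hand_pose_callback(msg, pub):
    """Convert an incoming hand pose into a (placeholder) robot gesture command."""
    cmd = JointState()
    cmd.header.stamp = rospy.Time.now()
    cmd.name = ["RShoulderPitch", "RShoulderRoll"]               # assumed joint names
    cmd.position = [msg.pose.position.z, msg.pose.position.x]    # assumed mapping
    pub.publish(cmd)

def main():
    rospy.init_node("leap_vr_teleop")
    pub = rospy.Publisher("/robot/gesture_command", JointState, queue_size=1)
    rospy.Subscriber("/leap_motion/hand_pose", PoseStamped,
                     hand_pose_callback, callback_args=pub)
    rospy.spin()  # camera images flow from the robot to the VR app on a separate topic

if __name__ == "__main__":
    main()
```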

5 Conclusion

Virtual, augmented, and mixed reality stand to enable – and are already enabling – promising new paradigms for human-robot interaction. In this work, we summarized our own recent work in all three of these areas. We see a long, bright avenue for future work in this area for years to come. In our own future work, we plan to focus on exploring the space of different designs for mixed-reality deictic gesture, and integrating these approaches with our existing body of work on natural language generation, thus enabling exciting new ways for robots to express themselves.