We will explore the question of limits and opportunities in mathematizing communicational conduct using the example of a robotic museum guide deployed at the Bielefeld Historical Museum in September 2014 (see also Gehle et al. 2015). A humanoid NAO robot was positioned on a table (1.20 × 2 m, 0.7 m high) and set up to autonomously engage in a focused encounter with visitors and to give explanations about several exhibits by using talk, head and arm gestures, and by walking across the table (see images of the setting below). The system’s functions relied on perceptual input from the robot’s internal VGA camera(s) and an external microphone positioned on the table.
Shaping expectations
When human and robot enter into contact with each other, they establish the conditions for their interaction. Users are faced with the task of discovering what the system can do and what it might be responsive to. This is a privileged moment in which the system can—through its own conduct—pro-actively shape the users’ perception of its capabilities and their expectations about roles, ways of participating and relevant subsequent actions (Pitsch et al. 2012, 2013, in press).
In our case, the robot is designed to greet visitors with “hello; i am nao,” accompanied by a head nod. It then offers to provide information and asks “would you be interested,” which, again, is accompanied by a small head nod at the end of the utterance. Video recordings of such situations show that visitors build hypotheses about relevant subsequent actions and the robot’s interactional capabilities based on the communicational resources used by the robot in the opening phase. This becomes particularly visible when the robot does not provide the subsequent action within the timeframe expected by the visitors, and they begin to explore different ways of making the robot continue. In session 4-004 of our corpus (which will be used here as a case example), the visitors try out different ways—[head nod + “yes”] (V2), repeated head nods (V1), a pronounced and loud “yes” (V2), “yes” (V3)—of answering the robot’s question, taking up the multimodal resources which the robot itself has introduced in its initial utterances.
Careful design of the robot’s conduct thus provides a powerful resource for pro-actively influencing the users’ expectations about relevant subsequent actions. The robot could thereby contribute to establishing the interactional conditions most suitable for its own functioning. We suggest that such an interactional approach could help to reduce part of the contingency and openness of communication without, however, eliminating them. Systematic empirical research will need to explore in which ways these issues might become more manageable in HRI and how far we can go with, e.g., combinations of rule-based and probabilistic modeling (see Lison 2015) supplemented by local building blocks for dealing with misunderstanding.
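To make the last point concrete, the following sketch illustrates what a hybrid of probabilistic state tracking and rule-based action selection could look like for the opening sequence discussed above. It is a minimal illustration in the spirit of Lison (2015), not the architecture of the deployed system; the cue likelihoods, thresholds and function names are assumptions introduced for this example.

```python
# Minimal sketch (not the deployed system): rule-based dialogue structure on
# top of probabilistic state tracking. All names and values are illustrative.

RESPONSE_LIKELIHOODS = {
    # assumed P(observed cue | visitor intends "yes")
    "verbal_yes": 0.9,
    "head_nod": 0.7,
    "silence": 0.2,
}

def update_belief(prior_yes: float, cue: str) -> float:
    """Bayesian update of the belief that the visitor accepted the offer."""
    p_cue_yes = RESPONSE_LIKELIHOODS.get(cue, 0.5)
    p_cue_no = 1.0 - p_cue_yes
    return (p_cue_yes * prior_yes) / (
        p_cue_yes * prior_yes + p_cue_no * (1.0 - prior_yes)
    )

def decide_next_action(belief_yes: float) -> str:
    """Rule-based policy over the probabilistic estimate."""
    if belief_yes > 0.8:
        return "START_EXPLANATION"
    if belief_yes < 0.3:
        return "CLOSE_ENCOUNTER"
    # local building block for dealing with misunderstanding:
    return "ASK_CLARIFICATION"

belief = 0.5                              # uninformed prior after the question
for cue in ["head_nod", "verbal_yes"]:    # cue sequence as in session 4-004
    belief = update_belief(belief, cue)
print(decide_next_action(belief))         # -> START_EXPLANATION
```

The clarification request in the middle band is the point of contact with the interactional approach: residual uncertainty is not resolved in advance but deferred to a locally available repair action.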
Establishing co-orientation: challenges for mathematization and the interactional system ‘human and robot’
To provide information about an exhibit, the robot is faced with the task of orienting visitors to a particular object. This not only constitutes an individual deictic act, but also requires—at least basic—forms of interactional coordination (Pitsch and Wrede 2014). In our case (session 4-004), the robot is set up to invite the visitors to orient to the life-size image of a tomb slab by saying “over there you can see who used to live at the Sparrenburg [i.e., the name of the local medieval castle]” and extending its right arm to perform a pointing gesture, with its head turned to the visitors (Fig. 1, #00.44.05). Of the three visitors in our fragment, who are initially facing the robot (#00.44.05), two (V2, V1) follow the robot’s deictic reference and successively turn their heads in the indicated direction (#00.45.09). Only visitor V3 keeps looking at the robot during the utterance and during the following 1.5 s (#00.48.08). This situation offers insights into a set of issues in mathematizing interactional phenomena.
Uncertainty of the robot’s perception
The robot’s perspective in this situation is based on the input from its internal VGA camera and the calculations performed by modules for detecting/tracking users and categorizing their visual focus of attention (Sheikhi and Odobez 2012). At the beginning of the robot’s utterance (#00.44.05), three visitors (displayed as bounding boxes around their heads, group size = 3) are detected, classified as oriented “to Nao” and correctly located in the robot’s spatial model. When V2 shifts his orientation to the exhibit—from #00.45.09 to #00.45.10—this is directly perceivable by the robot and correctly interpreted—from “to Nao” to “unfocused.” While these results are highly promising at the technical level of perception, the challenges posed by the real-world setting become visible at the same time: V1 and V3 are also oriented to the robot, but they are not classified as such by the system, and a structure in the ceiling is momentarily categorized as a human face. Even with ongoing improvements in detection algorithms and filtering processes, a conceptual challenge remains: interactional modeling needs to take into account different levels of (un)certainty in the system’s perception. While there are mathematical methods for ‘smoothing’ such data streams, it is not clear to what extent they would be compatible with the moment-by-moment contingencies of social interaction or whether (from a human’s perspective) interactionally relevant details might be cancelled out this way.
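As an illustration of this ‘smoothing’ trade-off, consider a simple sliding-window majority filter over per-frame focus-of-attention labels. The window size, label set and frame stream below are assumptions made for illustration, not properties of the deployed classifier.

```python
# Sketch: majority-vote smoothing of a noisy per-frame VFOA label stream.
# Window size and labels are illustrative assumptions.
from collections import Counter, deque

def smooth_vfoa(frames, window=5):
    """Sliding-window majority vote over noisy per-frame VFOA labels."""
    buf = deque(maxlen=window)
    out = []
    for label in frames:
        buf.append(label)
        out.append(Counter(buf).most_common(1)[0][0])
    return out

noisy = ["to Nao", "to Nao", "unfocused",   # spurious flicker (cf. the
         "to Nao", "to Nao", "unfocused",   # ceiling-structure false positive)
         "unfocused", "unfocused"]          # genuine re-orientation
print(smooth_vfoa(noisy))
# The flicker is suppressed, but the genuine shift is also reported a few
# frames late: exactly the trade-off discussed above.
```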
Reducing the complexity of interactional conduct
By the end of the robot’s utterance, two visitors have followed the robot’s invitation to inspect the relevant exhibit, while V3 remains oriented to the robot (#00.48.08). How should the system interpret this situation, and which next action should it undertake with what expected consequences? On the one hand, modeling decisions are required for dealing with multiple visitors at once; on the other hand, formalizations need to account for the visitors’ assumed diverging states of participation. These would need to be based on perceivable interactional cues (such as head orientation) and result in quantifiable measures, probably similar to the ‘speed indicator’ analogy used in the current system to describe a visitor’s ‘Interest Level’ (#00.44.05). How best to reduce the complexity of visitor conduct and interactional history in such a way, as a basis for deciding locally on the robot’s subsequent action, constitutes a central challenge.
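A minimal sketch of such a quantifiable measure is given below: per-frame head-orientation labels are collapsed into a scalar ‘Interest Level’ per visitor via an exponentially weighted moving estimate. The label set, weighting factor and streams are illustrative assumptions, not the measure implemented in the deployed system.

```python
# Sketch: collapsing head-orientation cues into a scalar 'Interest Level'
# (cf. the 'speed indicator' analogy). All values are assumptions.

ATTENTIVE = {"to Nao", "to exhibit"}       # assumed attentive VFOA categories

def update_interest(level: float, vfoa: str, alpha: float = 0.8) -> float:
    """Exponentially weighted moving estimate in [0, 1]."""
    target = 1.0 if vfoa in ATTENTIVE else 0.0
    return alpha * level + (1 - alpha) * target

# Diverging participation states, as at #00.48.08: V1/V2 have turned to the
# exhibit, V3 is still oriented to the robot.
streams = {
    "V1": ["to Nao", "to exhibit", "to exhibit", "to exhibit"],
    "V2": ["to Nao", "to Nao", "to exhibit", "to exhibit"],
    "V3": ["to Nao", "to Nao", "to Nao", "to Nao"],
}
levels = {}
for visitor, labels in streams.items():
    level = 0.5                            # neutral prior
    for vfoa in labels:
        level = update_interest(level, vfoa)
    levels[visitor] = round(level, 2)
print(levels)   # all three come out equally 'interested'
```

Note that all three visitors receive the same high score here: the scalar measure flattens precisely the difference in *what* each visitor attends to, which is the complexity-reduction problem named above.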
Perceptual delay and diverging representations
In our case, the robot is set up to interpret V3’s focus of attention as an indicator of trouble with regard to her following the robot’s reference to the exhibit and thus offers a second reference to it (“over there on the big picture”). However, the exact timing of this decision proves difficult, and a perceptual delay of about (in terms of current autonomous systems: only) 0.5 s leads to diverging representations of the situation between human and robot, best visible at #00.48.12 (Fig. 2). In fact, V3—like V1 and V2—begins to turn to the exhibit after #00.48.08, which is perceivable to the robot only after #00.49.01, i.e., at the moment when it is just starting the deictic gesture of the second orientational hint. Thus, in sequential-structural terms, the robot’s next action comes out ‘misplaced’: it is produced directly after V3, too, has oriented to the exhibit.
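The following sketch shows the mechanics of this mismatch: a decision module that reads the visitor’s state through a fixed perception latency acts on a representation that is already stale. The latency and the turning time are taken from the fragment (timestamps simplified to seconds for illustration); the pipeline itself is an assumption introduced for this example.

```python
# Sketch: a fixed perception latency makes the decision module act on a
# stale representation of V3. Values follow the fragment; the pipeline
# is an illustrative assumption.

PERCEPTION_LATENCY = 0.5   # seconds between conduct and its availability
V3_TURN_TIME = 48.08       # V3 begins to turn to the exhibit (#00.48.08)

def perceived_state(t: float) -> str:
    """State of V3 as available to the robot's decision module at time t."""
    effective = t - PERCEPTION_LATENCY
    return "to exhibit" if effective >= V3_TURN_TIME else "to Nao"

decision_time = 48.3       # the robot decides after V3 has already turned
if perceived_state(decision_time) == "to Nao":
    # The second deictic reference is launched although it is, from the
    # visitor's perspective, already sequentially 'misplaced'.
    print("repeat reference: 'over there on the big picture'")
```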
Confusion with regard to sequential structures
While V1 does not react to the second reference (“over there on the big picture”), V2 looks back to the robot for about 4 s (Fig. 3, #00.50.06) and then re-orients to the exhibit (#00.53.10). In contrast, V3 appears visibly confused, orienting back and forth between robot and exhibit during the robot’s utterance (#00.50.06—#00.51.03—#00.51.13—#00.52.12), turns round to inspect the room (#00.53.10) and finally gazes back at the exhibit, shielding her eyes with a hand and visibly indicating a ‘search activity’ (#00.58.05). Thus, she treats the robot’s second reference as a repair of her own last action (i.e., of her orientation to the exhibit)—an interpretation which adequately follows the sequential structure as it has emerged, but which is—due to the time lag—different from the one intended by the robot.
Robot’s resources between ‘interaction’ and ‘functioning’
The robot’s second reference was designed as an upgrade, i.e., verbally more explicit (“over there on the big picture”) and bodily including a head turn toward the exhibit (in addition to the deictic gesture—#00.51.03). This entails that the robot’s cameras—located at the front of its head—cannot monitor the visitors’ conduct at this point, and as a consequence, the robot is unable to detect V3’s confusion. Formalizing interactional phenomena for HRI thus must also address the challenge of how to manage the robot’s resources in such a way as to produce—at the same time and with the same resources—interactionally relevant conduct and to provide the basis for its own functioning.
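One way to make this resource conflict computationally explicit is to score candidate head poses against both demands at once, as in the sketch below. The scoring function, angles and weighting are assumptions introduced for illustration; they do not describe the deployed controller.

```python
# Sketch: one actuator (the head) serves both interactional display (turning
# toward the exhibit) and the robot's own perception (keeping visitors in
# the head camera's field of view). All values are illustrative assumptions.

def fov_coverage(head_angle, visitor_angles, half_fov=30.0):
    """Fraction of visitors inside the camera's horizontal field of view."""
    inside = [a for a in visitor_angles if abs(a - head_angle) <= half_fov]
    return len(inside) / len(visitor_angles)

def choose_head_angle(exhibit_angle, visitor_angles, w_display=0.5):
    """Trade off deictic display against monitoring over candidate angles."""
    candidates = range(-90, 91, 5)
    def score(a):
        display = 1.0 - abs(a - exhibit_angle) / 180.0   # closeness to exhibit
        monitor = fov_coverage(a, visitor_angles)        # visitors still seen
        return w_display * display + (1 - w_display) * monitor
    return max(candidates, key=score)

# Exhibit at +70 deg, visitors clustered between -10 and +15 deg: the best
# pose is a compromise that neither fully displays nor fully monitors.
print(choose_head_angle(70.0, [-10.0, 0.0, 15.0]))   # -> 20
```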
Human’s competence as a central resource in the interactional system ‘human(s) and robot’
When the robot announces the next action—i.e., to go to the exhibit indicated—V1 and V2 promptly acknowledge this invitation (#00.58.08) and begin to reposition themselves. In contrast, V3—who is still searching for the indicated exhibit—does not engage in the new activity. As the robot needs to turn its head for navigational purposes, it is, again, able neither to recognize nor to provide a solution to this problem. In this case (as in many other instances in our corpora), it is the human’s competence which solves the problem and helps to re-establish functional sequential structures. Here, V1 incites V3 to refocus her attention, invites her to join the next action and makes the next relevant action transparent. In this way, all three visitors come to gather in front of the exhibit by the time the robot arrives, ready to engage in the next explanation.