1 Introduction

Research in Social Robotics strives to endow robotic systems with interactional capabilities that allow users to deal with them intuitively, using means of natural communication and social interaction. This goal is particularly challenging because of the discrepancy between the situatedness, contingency and indexicality of human social conduct and the formalized descriptions required to program technical systems (Suchman 1987). Rule-based approaches to discourse modeling stand in direct conceptual contrast to the openness and unpredictability of social interaction, and it is unclear on what grounds a technical system can select an appropriate and relevant subsequent action. Levinson (2006: 45/56) points out that there is “no such thing as a formal grammar of discourse” because interaction is “governed not by rule but by expectations” (see also Schegloff 1996; Button 1990; Luhmann 1984). This becomes particularly evident at moments that require a high degree of interactional coordination between co-participants, such as the opening of an encounter and attempts to establish co-orientation (e.g., Pitsch et al. 2013, 2014). Thus, Lindemann (this vol.) asks critically whether it would be possible to mathematize joint attention, expectations or indexical expressions, or more generally: “Are there limits to mathematization?”

In what follows, we will discuss this question using the example of an autonomous robotic research prototype set up as a guide in a real-world museum site (e.g., Pitsch and Wrede 2014). We will point to challenges in mathematizing social conduct on different levels, in particular those that become evident when combining the robot’s internal perspective with the participants’ view. Given the conceptual and factual impossibility of equipping technical systems with full human-like social and interactional competences, we suggest taking the idea of “hybrid socio-technical systems” (Rammert and Schulz-Schaeffer 2002) further. Adopting an interactional perspective that understands human and robot as one interactional system (Luhmann 1984), we suggest that an important—yet mostly neglected—resource for the robotic system consists in the human’s interactional competences and adaptability. If we can provide the technical system with systematic resources to make use of them (Pitsch et al. 2013), the limits of formalization might gain an interesting twist.

2 Goal: intuitive human–machine interface or reproducing human communication?

Thinking about the possibilities and limitations of mathematizing (human) communicational conduct is closely tied to the goals of Social Robotics. One strand of research seeks to reproduce natural human communication, as explicated in formulations such as “we consider that establishing models is a path to make such a robot fully behave in a natural way as humans do” (e.g., Kanda and Ishiguro 2012:102). Another strand considers HRI as a particular type of human–machine interface that should allow the user to deal with a technical system in the most intuitive ways by using means of human natural communication (e.g., Breazeal 2003). This way, the design of the interface is—as Suchman (1987:22) suggests for human–machine interaction more generally—“less a project of simulating human communication than of engineering alternatives to interaction’s situated properties.” These two approaches entail different requirements for the formalization of communicative principles and conduct: In the first case, researchers would need to build models able to address the inter-individual variability of multimodal conduct, local-indexical sense-making practices and the unpredictability of emergent interactional processes. This is a goal so ambitious that this author is too humble to strive for it. In contrast, the second approach would enable us to conceptually take into consideration the different (evolving) competences and status of machines and humans and their particular (changing) relationship to each other. This would allow us to open the perspective toward solutions functional for human–robot interaction (HRI) and to include—as an important resource—the human’s competences and adaptability in the modeling.

3 Mathematization: transforming communication for real-world HRI

Formalization and mathematization of real-world phenomena—such as communicational conduct—are based on the assumption of idealized objects (Lenhard and Otte 2005) and thus constitute a transformation that changes the phenomenon itself (Lynch 1988; Schegloff 1996). While it is impossible to escape the challenge of unpredictability and contingency when dealing with real-world phenomena, a particular phenomenon can be modeled in different ways. The limits of mathematization are thus not predefined per se, but depend on the frame we choose (Lenhard and Otte 2005). In this regard, current conceptualizations in Social Robotics/HRI range from highly restricted one-way communication, through laboratory experiments with highly idealized conditions of the physical environment and pre-trained users (e.g., Sugiyama et al. 2012 for a “model of natural deictic interaction”), to approaches dealing with the complexity of real-world settings (Shiomi et al. 2008; Yamazaki et al. 2009; Pitsch and Wrede 2014).

While highly idealized laboratory conditions provide better grounds to model more sophisticated interactional conduct, we believe that it is necessary to take on, early on, the challenge of exploring autonomous systems in real-world settings (see Lindemann and Matsuzaki 2014). Such an approach enables us to gain a better understanding of the full complexity of the phenomenon and the specific conditions of the human–robot interface (as opposed to attempting to reproduce human communication). In doing so, we begin with inspiration from human communication (see also Yamazaki et al. 2007), but have to reduce its multimodal complexity to the most salient features (Pitsch et al. 2014), in addition to making other types of adjustments. Transformation of the phenomenon “human communication” is thus a conditio sine qua non, but does not per se entail discarding the idea of interactivity (see Schegloff 1996:29).

4 Example of real-world human–robot interaction: a robotic museum guide

We will explore the question of limits and opportunities in mathematizing communicational conduct using the example of a robotic museum guide deployed at the Bielefeld Historical Museum in September 2014 (see also Gehle et al. 2015). A humanoid NAO robot was positioned on a table (1.20 × 2 m, 0.7 m high) and set up to autonomously engage in a focused encounter with visitors and to give explanations about several exhibits by using talk, head and arm gestures and walking across the table (see images of the setting below). The system’s functions relied on the perceptual results from the robot’s internal VGA camera(s) and an external microphone positioned on the table.

4.1 Shaping expectations

When human and robot come into contact with each other, they establish the conditions for their interaction. Users are faced with the task of discovering what the system can do and what it might be responsive to. This is a privileged moment in which the system can—through its own conduct—pro-actively shape the users’ perception of its capabilities, their expectations about roles, ways of participating and relevant subsequent actions (Pitsch et al. 2012, 2013, in press).

In our case, the robot is designed to greet with “hello; i am nao” accompanied by a head nod. It then offers to provide information and asks “would you be interested,” which, again, is accompanied by a small head nod at the end of the utterance. Video recordings of such situations show that the visitors build hypotheses about relevant subsequent actions and the robot’s interactional capabilities based on the communicational resources used by the robot in the opening phase. This becomes particularly visible when the robot does not provide the subsequent action in the timeframe expected by the visitors, and they begin to explore different ways of making the robot continue. In session 4-004 of our corpus (which will be used here as a case example), the visitors try out different ways—[head nod + “yes”] (V2), repeated head nods (V1), a pronounced and loud “yes” (V2), “yes” (V3)—to answer the robot’s question, taking up the multimodal resources which the robot has itself introduced in its initial utterances.
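To make the design idea more concrete, the following is a minimal, hypothetical sketch of the opening phase as a scripted sequence of multimodal units, each coupling an utterance with an accompanying head gesture. The ConsoleRobot stand-in, the function names and the data layout are our illustrative assumptions, not the deployed implementation; the offer of information that precedes the question is not quoted in the text and is therefore omitted.

```python
# Hypothetical sketch of the scripted opening phase (not the deployed system).

OPENING_SCRIPT = [
    {"say": "hello; i am nao", "gesture": "head_nod"},
    # the offer to provide information is not quoted in the text and is left out here
    {"say": "would you be interested", "gesture": "head_nod_at_end"},
]


class ConsoleRobot:
    """Stand-in for the actual robot middleware: it only prints what it would do."""

    def say(self, text):
        print(f"SAY: {text}")

    def gesture(self, name):
        print(f"GESTURE: {name}")


def run_opening(robot):
    """Play the scripted opening; each unit couples talk with a head gesture so
    that visitors can later take up exactly these resources in their answers."""
    for unit in OPENING_SCRIPT:
        robot.say(unit["say"])
        robot.gesture(unit["gesture"])


if __name__ == "__main__":
    run_opening(ConsoleRobot())
```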

In this way, careful design of the robot’s conduct provides a powerful resource to pro-actively influence the users’ expectations for relevant subsequent actions. The robot could thus contribute to establishing the interactional conditions which would be most suitable for its own functioning. We suggest that such an interactional approach could help to reduce parts of the contingency and openness of communication without, however, eliminating them. Systematic empirical research will need to explore in which ways these issues might become more manageable in HRI and how far we can go with, e.g., combinations of rule-based and probabilistic modeling (see Lison 2015) together with local building blocks for dealing with misunderstanding.
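One way such a combination could look is sketched below: a rule defines which next actions are sequentially relevant after the robot’s question, while probabilistic evidence from speech and head-nod detection selects among them, falling back to a local repair move when the evidence is weak. All names, thresholds and the noisy-OR fusion are illustrative assumptions, not the modeling used in the deployed system.

```python
# Hypothetical sketch: rule-based sequence structure + probabilistic selection.

from dataclasses import dataclass


@dataclass
class Observation:
    """Perceptual evidence for the visitor's answer to 'would you be interested'."""
    p_yes_speech: float   # confidence of the speech recognizer for "yes"
    p_nod: float          # confidence of the head-nod detector


class OpeningPolicy:
    def __init__(self, accept_threshold: float = 0.7, clarify_threshold: float = 0.4):
        self.accept_threshold = accept_threshold
        self.clarify_threshold = clarify_threshold

    def fuse(self, obs: Observation) -> float:
        # Simple noisy-OR fusion of the two cues (assumed independent).
        return 1.0 - (1.0 - obs.p_yes_speech) * (1.0 - obs.p_nod)

    def next_action(self, obs: Observation) -> str:
        p_accept = self.fuse(obs)
        # The rule fixes the set of sequentially relevant next actions;
        # the probability decides among them and triggers a local repair
        # (a "building block for dealing with misunderstanding") when uncertain.
        if p_accept >= self.accept_threshold:
            return "start_explanation"
        if p_accept >= self.clarify_threshold:
            return "clarify_offer"
        return "repeat_offer_with_head_nod"


if __name__ == "__main__":
    policy = OpeningPolicy()
    # e.g., a quiet "yes" plus a clear head nod, as produced by V2 in session 4-004
    print(policy.next_action(Observation(p_yes_speech=0.35, p_nod=0.8)))
```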

4.2 Establishing co-orientation: challenges for mathematization and the interactional system ‘human and robot’

To provide information about an exhibit, the robot is faced with the task of orienting visitors to a particular object. This not only constitutes an individual deictic act, but also requires—at least basic—forms of interactional coordination (Pitsch and Wrede 2014). In our case (session 4-004), the robot is set up to invite the visitors to orient to the life-size image of a tomb slab by saying “over there you can see who used to live at the Sparrenburg [i.e., the name of a local medieval castle]” and extending its right arm to perform a pointing gesture, with its head turned to the visitors (Fig. 1, #00.44.05). Of the three visitors in our fragment, who are initially facing the robot (#00.44.05), two (V2, V1) follow the robot’s deictic reference and successively turn their heads in the indicated direction (#00.45.09). Only visitor V3 keeps looking at the robot during the utterances and during the following 1.5 s (#00.48.08). This situation offers insights into a set of issues concerning the mathematization of interactional phenomena.

Fig. 1 Session 4-004, Transcript part 01

4.2.1 Uncertainty of the robot’s perception

The robot’s perspective in this situation is based on the input of its internal VGA camera and the calculations resulting from modules for detecting/tracking users and categorizing their visual focus of attention (Sheikhi and Odobez 2012). At the beginning of the robot’s utterance (#00.44.05), three visitors (displayed as bounding boxes around their heads, group size = 3) are detected, classified as oriented “to Nao” and correctly located in the robot’s spatial model. When V2 shifts his orientation to the exhibit—from #00.45.09 to #00.45.10—this is directly perceivable by the robot and correctly interpreted—from “to Nao” to “unfocused.” While these results are highly promising on the technical level of perception, the challenges posed by the real-world setting become visible at the same time: V1 and V3 are also oriented to the robot, but they are not classified as such by the system, and a structure in the ceiling is momentarily categorized as a human face. Even with ongoing improvements in the detection algorithms and filtering processes, a conceptual challenge remains: Interactional modeling needs to take into account different levels of (un)certainty in the system’s perception. While there are mathematical methods for ‘smoothing’ such data streams, it is not clear to what extent they would be compatible with the moment-by-moment contingencies of social interaction or whether (from a human’s perspective) interactionally relevant details might be cancelled out this way.
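The trade-off can be made tangible with a minimal sketch of one common smoothing technique, majority voting over a sliding window of per-frame attention labels. The window length and labels are assumed values, not those of the deployed modules: a larger window suppresses spurious detections (such as a ceiling structure classified as a face) but delays the recognition of genuine, interactionally relevant shifts of orientation.

```python
# Minimal sketch: smoothing a stream of per-frame attention labels
# by majority vote over a sliding window (assumed parameters).

from collections import Counter, deque


class AttentionSmoother:
    def __init__(self, window_size: int = 5):
        self.window = deque(maxlen=window_size)

    def update(self, label: str) -> str:
        """Add the newest per-frame classification and return the smoothed label."""
        self.window.append(label)
        return Counter(self.window).most_common(1)[0][0]


if __name__ == "__main__":
    smoother = AttentionSmoother(window_size=5)
    frames = ["to Nao", "to Nao", "unfocused", "to Nao",
              "unfocused", "unfocused", "unfocused"]
    for frame in frames:
        print(frame, "->", smoother.update(frame))
```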

4.2.2 Reducing the complexity of interactional conduct

By the end of the robot’s utterance, two visitors have followed the robot’s invitation to inspect the relevant exhibit while V3 remains oriented to the robot (#00.48.08). How should the system interpret this situation, and which next action should it undertake with what expected consequences? On the one hand, modeling decisions are required for dealing with the diverging states of participation of multiple visitors. On the other hand, formalizations of each visitor’s assumed state of participation are needed. These would need to be based on perceivable interactional cues (such as head orientation) and result in quantifiable measures, probably similar to the ‘speed indicator’ analogy used in the current system to describe a visitor’s ‘Interest Level’ (#00.44.05). How best to reduce the complexity of visitor conduct and interactional history in such a way that it can serve as a basis for deciding locally on the robot’s subsequent action constitutes a central challenge.
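A hypothetical sketch in the spirit of the ‘speed indicator’ analogy is given below: a scalar ‘interest level’ in [0, 1] per visitor that rises while the visitor is oriented to the robot or the referred-to exhibit and decays otherwise. The gain and decay rates and the label set are assumptions for illustration, not the parameters of the deployed system.

```python
# Hypothetical sketch: reducing a visitor's conduct to a single quantifiable measure.


def update_interest(level: float, focus: str, dt: float,
                    gain: float = 0.5, decay: float = 0.3) -> float:
    """Integrate head-orientation evidence over time into an 'interest level'."""
    if focus in ("to Nao", "to exhibit"):
        level += gain * dt
    else:
        level -= decay * dt
    return max(0.0, min(1.0, level))


if __name__ == "__main__":
    level = 0.5
    # one second oriented to the robot, then two seconds unfocused
    for focus, dt in [("to Nao", 1.0), ("unfocused", 1.0), ("unfocused", 1.0)]:
        level = update_interest(level, focus, dt)
        print(f"{focus:>10}: {level:.2f}")
```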

4.2.3 Perceptual delay and diverging representations

In our case, the robot is set up to interpret V3’s focus of attention as an indicator of trouble with regard to her following the robot’s reference to the exhibit and thus offers a second reference to the exhibit (“over there on the big picture”). However, the exact timing of this decision proves difficult, and a perceptual delay of about 0.5 s (in terms of current autonomous systems: only 0.5 s) leads to diverging representations of the situation between human and robot, best visible in #00.48.12 (Fig. 2). In fact, V3—similarly to V1 and V2—begins to turn to the exhibit after #00.48.08, which is perceivable to the robot only after #00.49.01, i.e., at the moment when it is just starting the deictic gesture of the second orientational hint. Thus, in sequential structural terms, the robot’s next action comes out ‘misplaced’, i.e., it is produced directly after V3, too, has oriented to the exhibit.
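One conceivable way of hedging against such a delay is sketched below: before committing to the repair action, the system checks both how long the trouble-indicating evidence has persisted and how fresh the newest evidence is relative to the estimated latency of the perception pipeline. The latency value, dwell threshold and function name are assumptions, not the decision logic of the deployed system.

```python
# Hypothetical sketch: checking dwell time and evidence freshness before a repair.

import time


def should_offer_second_reference(focus_history, now=None,
                                  min_dwell=1.5, max_staleness=0.5):
    """focus_history: chronological list of (timestamp, label) pairs from the
    attention classifier. Return True only if (a) the newest evidence is no
    older than the assumed pipeline latency and (b) the visitor has stayed
    'to Nao' for at least min_dwell seconds, i.e., has demonstrably not yet
    followed the first reference."""
    now = time.time() if now is None else now
    if not focus_history:
        return False
    newest_ts, newest_label = focus_history[-1]
    if newest_label != "to Nao" or now - newest_ts > max_staleness:
        return False
    # walk back through the current uninterrupted 'to Nao' streak
    streak_start = newest_ts
    for ts, label in reversed(focus_history):
        if label != "to Nao":
            break
        streak_start = ts
    return now - streak_start >= min_dwell


if __name__ == "__main__":
    history = [(0.1 * i, "to Nao") for i in range(20)]       # 2 s of 'to Nao'
    print(should_offer_second_reference(history, now=2.0))   # True: fresh and persistent
    print(should_offer_second_reference(history, now=3.5))   # False: evidence is stale
```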

Fig. 2 Session 4-004, Transcript part 02

4.2.4 Confusion with regard to sequential structures

While V1 does not react to the second reference (“over there on the big picture”), V2 looks back to the robot for about 4 s (Fig. 3, #00.50.06) and then re-orients to the exhibit (#00.53.10). In contrast, V3 appears visibly confused, orienting back and forth between robot and exhibit during the robot’s utterance (#00.50.06—#00.51.03—#00.51.13—#00.52.12), then turns round to inspect the room (#00.53.10) and finally gazes back to the exhibit, shielding her eyes with a hand and visibly indicating a ‘search activity’ (#00.58.05). Thus, she treats the robot’s second reference as a repair of her last action (i.e., of her orientation to the exhibit)—an interpretation which adequately follows the sequential structure as it has emerged, but which is—due to the time lag—different from the one aimed for by the robot.

Fig. 3 Session 4-004, Transcript part 03

4.2.5 Robot’s resources between ‘interaction’ and ‘functioning’

The robot’s second reference was designed as an upgrade, i.e., verbally more explicit (“over there on the big picture”) and bodily more explicit, including a head turn (in addition to the deictic gesture—#00.51.03) toward the exhibit. This entails that the robot’s cameras—located at the front of its head—cannot monitor the visitors’ conduct at this point, and as a consequence, the robot is unable to detect V3’s confusion. Formalizing interactional phenomena for HRI thus must also address the challenge of how to manage the robot’s resources in such a way as to produce—at the same time and with the same resources—interactionally relevant conduct and to provide the basis for its own functioning.
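The kind of resource arbitration this points to could, for instance, take the following form: before executing a multimodal action, check whether it would occupy a resource (here the head-mounted camera) that is still needed for monitoring, and if so fall back to a realization that keeps that resource free. The action definitions, the fallback and the monitoring flag are illustrative assumptions, not the behavior of the deployed system.

```python
# Hypothetical sketch: arbitrating between interactional conduct and perception needs.

MONITORING_NEEDED = True  # e.g., a visitor's uptake of the reference is still uncertain

ACTIONS = {
    "upgrade_reference":  {"uses_head": True,  "utterance": "over there on the big picture"},
    "arm_only_reference": {"uses_head": False, "utterance": "over there on the big picture"},
}


def select_realization(preferred: str) -> str:
    action = ACTIONS[preferred]
    if MONITORING_NEEDED and action["uses_head"]:
        # the head turn would take the camera off the visitors,
        # so realize the reference with the arm only
        return "arm_only_reference"
    return preferred


if __name__ == "__main__":
    print(select_realization("upgrade_reference"))  # -> arm_only_reference
```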

4.2.6 Human’s competence as a central resource in the interactional system ‘human(s) and robot’

When the robot announces the next action—i.e., to go to the exhibit indicated—V1 and V2 promptly acknowledge this invitation (#00.58.08) and begin to reposition themselves. In contrast, V3—who is searching for the indicated exhibit—does not engage in the new activity. As the robot needs to turn its head for navigational purposes, it is, again, unable either to recognize this problem or to provide a solution to it. In this case (as in many other instances in our corpora), it is the human’s competence which solves the problem and helps to re-establish functional sequential structures. Here, V1 prompts V3 to refocus her attention, invites her to join the new activity and makes the next relevant action transparent. In this way, all three visitors happen to gather in front of the exhibit when the robot also arrives, ready to engage in the next explanation.

5 Conclusion

In this text, we have attempted (1) to point out a set of challenges that researchers are faced with once they engage in modeling interactional conduct for autonomous robot systems in real-world situations with untrained users, and (2) to develop a vision and a conceptual basis for how the limitations of technical systems in dealing with the situatedness of human social interaction might be pushed a little further. Considering human and robot as one ‘interactional system’ (Luhmann 1984; Rammert and Schulz-Schaeffer 2002; Pitsch et al. 2013), in which the participants jointly solve the practical tasks, makes it possible to integrate the human’s competence in the development of building blocks for interactional conduct in HRI. Through careful design of the robot’s conduct, a powerful resource exists for pro-actively influencing the users’ expectations about relevant subsequent actions, so that the robot can contribute to establishing the conditions which would be most beneficial to its own functioning.

As a consequence, the question of whether a technical system is able to deal with situatedness, contingency, indexical expressions, etc., could be reformulated to ask in what ways the interactional system ‘human and robot’ can solve these practical tasks. In this way, mathematization would not need to provide self-contained models; rather, we would need to think of ways to include the human’s competences of sense-making and organizing interaction, and to equip robotic systems with strategies to make their own actions and states transparent to the user. As such, the limits of mathematization might take on a different twist.