Towards efficient human–machine collaboration: effects of gaze-driven feedback and engagement on performance

Mitev, Nikolina; Renner, Patrick; Pfeiffer, Thies; Staudte, Maria

doi:10.1186/s41235-018-0148-x

Original article
Open access
Published: 29 December 2018

Volume 3, article number 51, (2018)

Cognitive Research: Principles and Implications

Nikolina Mitev ORCID: orcid.org/0000-0002-5488-3510¹,
Patrick Renner²,
Thies Pfeiffer² &
…
Maria Staudte¹

Abstract

Referential success is crucial for collaborative task-solving in shared environments. In face-to-face interactions, humans, therefore, exploit speech, gesture, and gaze to identify a specific object. We investigate if and how the gaze behavior of a human interaction partner can be used by a gaze-aware assistance system to improve referential success. Specifically, our system describes objects in the real world to a human listener using on-the-fly speech generation. It continuously interprets listener gaze and implements alternative strategies to react to this implicit feedback. We used this system to investigate an optimal strategy for task performance: providing an unambiguous, longer instruction right from the beginning, or starting with a shorter, yet ambiguous instruction. Further, the system provides gaze-driven feedback, which could be either underspecified (“No, not that one!”) or contrastive (“Further left!”). As expected, our results show that ambiguous instructions followed by underspecified feedback are not beneficial for task performance, whereas contrastive feedback results in faster interactions. Interestingly, this approach even outperforms unambiguous instructions (manipulation between subjects). However, when the system alternates between underspecified and contrastive feedback to initially ambiguous descriptions in an interleaved manner (within subjects), task performance is similar for both approaches. This suggests that listeners engage more intensely with the system when they can expect it to be cooperative. This, rather than the actual informativity of the spoken feedback, may determine the efficiency of information uptake and performance.

Significance statement

Can listener gaze facilitate goal-oriented human–machine collaboration? To solve a task jointly, interlocutors often need to establish a reference in a shared environment, e.g., to identify task-relevant objects. In such situated interactions, interlocutors typically use natural language, but other modalities, in particular gaze and gestures, support communicative success. In our work, we address the domain of assembly assistance and in particular object identification tasks. We show that an artificial speaker (a natural language generation (NLG) system) can improve task performance when providing gaze-aware proactive feedback based on a listener’s inspections of an object. In particular, giving information incrementally in subsequent chunks is more efficient than giving the description in one piece. Moreover, the feedback’s informativity not only leads to more efficient interactions but also influences the overall expectation for the capabilities of the NLG system. This expectation determines to what extent the listener wants to cooperate and will engage with the NLG system. The more intensely listeners engage with the system, the more effective is the information uptake and the better the task performance, even when some of the system’s responses are less informative.

Introduction

In situated collaboration, spoken natural language is often used to refer to task-relevant objects in the form of installments. Installments are chunks of information uttered by a speaker to provide partial information to the listener in an incremental manner. Human speakers may produce installments without planning an entire unambiguous utterance. This effect is increased when they are under time pressure (Striegnitz et al. 2012). Using installments, speakers can quickly adapt to changes in the surroundings and in particular to the listeners’ feedback and actions. As shown by Zarrieß and Schlangen (2016), an artificial speaker can use installments to generate referring expressions effectively. This was considered intuitive and enhanced the identification of real objects depicted in static images.

In our research, we integrate listener feedback into the interaction loop by addressing the question of whether a listener’s gaze can successfully be used as a non-verbal feedback cue for adaptive installment generation. In particular, we investigate the interactions of an artificial speaker, i.e., a machine instructor, and a human listener. There is some evidence from studies in virtual environments that feedback from the artificial speaker based on listener gaze can increase interaction efficiency (Koller et al. 2012; Staudte et al. 2012; Garoufi et al. 2016). However, there are two remaining questions that we address in the present paper: (1) Can the successful use of listener gaze be replicated in real environments, which are much more complex to handle technically? (2) Can gaze-aware NLG be used to generate adaptive installments that provide references both incrementally and in the form of contrastive feedback? Specifically, we present an NLG system that monitors the gaze of the human listener and provides installments only if necessary, that is, if the listener’s gaze indicates wrong reference resolution. We further report on two experiments that evaluate the efficiency and the general perception of this behavior in comparison to long and exhaustive instructions. The results suggest that communication efficiency benefits from giving interactive and incremental instructions. However, in our experiments, users preferred this approach less. Thus, there is a trade-off between efficiency and users’ preference in terms of perception.

Our approach draws on previous findings from (i) human–human interactions, which show that listener eye movements are closely tied, and time-aligned, to the current understanding of the comprehender and (ii) human–machine interactions, especially from work with assistive systems, in particular for assembly tasks, which show more generally that systems employing gaze as a communicative signal are socially beneficial—though sometimes less efficient. Below, we briefly review selected literature from those areas.

Human–human interactions

To ensure communicative success in situated collaboration, speakers tend to observe listeners to detect if their communication message was received and understood correctly (Clark 1996). Listeners reliably inspect objects they believe are being referred to by the speaker (Tanenhaus et al. 1995; Eberhard et al. 1995). Consequently, speakers can monitor understanding and the mapping of meaning to the world by considering listener gaze (Clark and Krych 2004; Hanna and Brennan 2007; Brown-Schmidt 2012). Most of these studies, however, focus on the role of listener gaze as an index to the underlying comprehension processes. The benefit of gaze-based feedback cues for the speaker and successful reaction strategies are rarely examined. Human instruction givers might not be prepared to use technical cues based on listener gaze beneficially, as has been shown by Koleva et al. (2015). Coco et al. (2018), who examined the role of feedback and alignment in a “spot the difference” task, further found that interlocutors’ gaze aligned only if they could not exchange verbal feedback. Both results indicate that exploiting a technical augmentation of the listener gaze (e.g., by visualizing a gaze cursor) is not something that human speakers naturally do efficiently. In the studies described, the instructors were faced with the additional perception task of following gaze cursors, which might have increased the cognitive load too much. In contrast to this research, we focus on artificial speakers’ use of gaze feedback.

Human–machine interactions

Gaze-based assistive systems have, along with the advances in mobile eye tracking technologies, moved into real-world environments in the last decade (Pfeiffer 2013). Our work is related to work in attentive assistance systems (Maglio et al. 2000) and human–robot and human–agent interactions, where gaze is relevant for the social aspects of interaction (Sidner et al. 2004) as well as for grounding verbal utterances using mechanisms of joint attention (Imai et al. 2003). Smart eyewear has been identified as a key technology for assistance systems (Pfeiffer et al. 2016a) and recently has been combined with a real-time analysis of eye tracking to support assembly tasks (Renner and Pfeiffer 2017; Blattgerste et al. 2017). Work on such assembly tasks has been done in both virtual and real worlds (Kopp et al. 2003; Kirk et al. 2007). However, projects including and examining the role of listener gaze are considerably less frequent. Fang et al. (2015) proposed a collaborative referring expression generation algorithm for situated human–robot interactions and used listener gaze to provide information incrementally. Their results surprisingly showed a performance drop when using listener gaze. This may, however, be explained by the method they used to interpret the gaze signal. We address this issue and apply the procedure for inspection detection proposed by Garoufi et al. (2016) to trigger verbal feedback that supplements instructions in our real-world assembly scenario.

Contribution of this paper

We investigated the utility of listener gaze in a real-time object identification task. In particular, we designed, implemented, and employed an interactive NLG system using augmented reality technologies (Pfeiffer 2012; Pfeiffer and Renner 2014) to describe co-present objects to a human listener, who needs them for assembly. The system can monitor and react to listener gaze by generating verbal feedback to accept or reject the listener’s intentions as to which object to grasp next. We further compared two levels of instructional ambiguity and tested their effectiveness in two experiments: generating a long, unambiguous instruction vs. generating a short, ambiguous instruction followed by gaze-driven feedback. Furthermore, we examined the impact of feedback specificity by either providing an underspecified “No, not that one!” or contrastive feedback expressing the spatial relation of the target relative to the current gaze position (e.g., “Further left!”). We predicted that the interaction with the generation system would benefit from gaze-based feedback, despite the small-scale setting and the noise typically emerging in real-world and real-time interactions (in terms of movement and motion). We further hypothesized that this benefit might be large enough to compensate even for short, ambiguous instructions by the system and lead to similar if not shorter interaction times than for full exhaustive unambiguous instructions. Lastly, we predicted that contrastive feedback that provides additional referential information (i.e., after ambiguous instructions), incrementally and on demand, would be most efficient and lead to the shortest interaction times overall (cf. Zarrieß and Schlangen (2016) on installments and their efficiency).

Experiment 1

To investigate how listener gaze can be used in a dynamic task-oriented real-world interaction, we designed an assembly-like task and implemented a multimodal interactive system called GazInG (an interactive NLG system in a real environment), which can generate instructions in natural language. This system instructed a naive human listener to select and assemble building blocks. During the assembly process, the repeated identification of a specific object was required. By design, the scene was overloaded with many similar objects so that exhaustive object descriptions required naming two colors, sizes, and object types as well as locations. GazInG is further capable of monitoring and interpreting listener gaze and generating adaptive feedback. It relies on the EyeSee3D module, which models the environment as a 3D situation model using abstract geometry to represent the stimuli (see the turquoise arrow in Fig. 1).

In Experiment 1 we compared long but unambiguous instructions with ambiguous instructions that were followed by gaze-based feedback, which was either simple (underspecified feedback group) or contrastive (contrastive feedback group).

Two groups of participants completed the experiment. The underspecified feedback group received unambiguous instructions paired with no feedback in one block of trials, and ambiguous instructions paired with underspecified feedback in another block of trials. The contrastive feedback group received unambiguous instructions paired with contrastive feedback in one block of trials, and ambiguous instructions paired with contrastive feedback in another block of trials.

Methods

Participants

Altogether, 48 participants, mainly students enrolled at Saarland University, took part in the experiment. Of these, 24 were assigned to each group: the underspecified feedback (19 female) and the contrastive feedback (16 female) groups. The average participant age of the underspecified feedback group was 25 years (19–35 years), and of the contrastive feedback group 24 years (20–31 years).

All participants were German native speakers and reported normal or corrected-to-normal vision and no red-green color blindness. They were compensated with €8 (underspecified feedback group) or €5 (contrastive feedback group); the difference reflects the slightly shorter duration of the second group’s experiment.

Setup and apparatus

Figure 1 depicts our setup. We chose LEGO DUPLO as the target domain because the building blocks are of convenient, graspable size with easy-to-identify colors. At the same time, they offer a multitude of combinations and various ways of assembly. As the number and similarity of available objects in the workspace was high, it was not trivial to automatically generate uniquely identifying instructions (see Appendices A and B). A layout consisted of 20 composed objects with eight targets to be collected. Each composed object comprised two basic building blocks. The instructions did not provide guidelines on how to put together the selected elements but left this to the listener’s creativity. This was made clear in the task description. The participants had to build an individual LEGO model, based on the components the system instructed them to pick.

We used a binocular head-mounted eye tracker (SMI Eye Tracking Glasses) to collect gaze data. The tracker is equipped with a high-resolution scene camera (1280×960) recording at 24 Hz and two eye cameras recording at 30 Hz. The user’s head position and orientation are integrated by GazInG into a situation model in real time. This is realized by instrumenting the environment with low-cost printable fiducial markers (see the tablecloth in Fig. 1). These are located in known positions relative to the stimuli and are tracked by the scene camera of the eye tracker using computer vision. Fusing the head position and orientation derived in this way with eye tracking data from the glasses reconstructs the user’s gaze direction. This allows the system to cast a 3D gaze ray into the situation model (see the yellow arrow in Fig. 1). The intersections of the ray with the geometric models of the stimuli identify gazed-at objects. At this point, GazInG has semantically mapped the listener’s inspections. For further technical details of the approach, see Pfeiffer and Renner (2014) and Pfeiffer et al. (2016b). In this experiment, feedback was triggered by pooled inspections with a dwell time longer than 200 ms.
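The gaze-ray mapping and dwell-time trigger described above can be illustrated with a minimal sketch. It is not the EyeSee3D or GazInG implementation: the axis-aligned `Box` geometry, the slab intersection test, and the way consecutive samples are grouped into an inspection are all assumptions made for illustration. Only the 200 ms threshold is taken from the text.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned box standing in for the abstract geometry of one stimulus (assumption)."""
    name: str
    lo: Vec3
    hi: Vec3


def ray_hits_box(origin: Vec3, direction: Vec3, box: Box) -> Optional[float]:
    """Slab test: distance along the ray to the box, or None if the ray misses it."""
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box.lo, box.hi):
        if abs(d) < 1e-12:
            # Ray is parallel to this pair of faces: it must start between them.
            if o < lo or o > hi:
                return None
            continue
        t1, t2 = (lo - o) / d, (hi - o) / d
        if t1 > t2:
            t1, t2 = t2, t1
        t_near, t_far = max(t_near, t1), min(t_far, t2)
        if t_near > t_far:
            return None
    return t_near


def gazed_object(origin: Vec3, direction: Vec3, boxes: Sequence[Box]) -> Optional[str]:
    """Name of the closest box hit by the gaze ray, or None if nothing is hit."""
    hits = []
    for box in boxes:
        t = ray_hits_box(origin, direction, box)
        if t is not None:
            hits.append((t, box.name))
    return min(hits)[1] if hits else None


def inspections(
    samples: Iterable[Tuple[float, Optional[str]]], min_dwell_ms: float = 200.0
) -> Iterator[Tuple[str, float]]:
    """Yield (object, onset_ms) once per run of samples on one object lasting longer than min_dwell_ms."""
    current: Optional[str] = None
    onset, reported = 0.0, False
    for t_ms, name in samples:
        if name != current:
            current, onset, reported = name, t_ms, False
        elif name is not None and not reported and t_ms - onset > min_dwell_ms:
            reported = True
            yield name, onset
```

Each gaze sample would first be mapped to an object with `gazed_object`, and the resulting stream of `(timestamp, object)` pairs passed to `inspections`; an inspection of a competitor or of the target is what would then trigger feedback.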

Natural language generation

GazInG uses a heuristic approach to generate, on the fly and from the domain knowledge, an instruction containing a referring expression that describes a composed object consisting of two basic building blocks. The syntactic structure of the instructions is predefined. The system is able to distribute the information needed to identify a target over several chunks. The first chunk, thus, realizes an ambiguous instruction, which can then be incrementally extended. Such an ambiguous instruction consists of a main clause that describes the bottom object. Its size and color are used as pre-modifiers and the head noun is randomly chosen from a set of synonyms for the type of object, as shown in Example (1). To output an unambiguous instruction, the algorithm appends two further post-modifiers: (i) a prepositional phrase or a relative clause to describe the top object and (ii) an adverbial phrase containing absolute position information (see Example (2)).

Example (1) Pick the big red building block.

Example (2) Pick the big red building block with the small yellow one on top at the back toward the left.
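The template structure behind Examples (1) and (2) can be sketched as follows. This is an illustrative reconstruction, not the GazInG generator: the class names, the `HEAD_NOUNS` synonym set, and the fixed English template are assumptions; only the ordering of pre-modifiers, head noun, and the two post-modifiers follows the description above.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Block:
    size: str
    color: str


@dataclass(frozen=True)
class ComposedObject:
    bottom: Block
    top: Block
    position: str  # absolute position phrase, e.g. "at the back toward the left"


# Hypothetical synonym set for the head noun; the actual set is not given in the text.
HEAD_NOUNS = ("building block", "brick")


def instruction(obj: ComposedObject, unambiguous: bool, head_noun: Optional[str] = None) -> str:
    """Build an instruction; the ambiguous form describes only the bottom block."""
    noun = head_noun if head_noun is not None else random.choice(HEAD_NOUNS)
    # First chunk: size and color as pre-modifiers of the head noun.
    text = f"Pick the {obj.bottom.size} {obj.bottom.color} {noun}"
    if unambiguous:
        # Post-modifier (i): the top object; post-modifier (ii): the absolute position.
        text += f" with the {obj.top.size} {obj.top.color} one on top {obj.position}"
    return text + "."


target = ComposedObject(Block("big", "red"), Block("small", "yellow"), "at the back toward the left")
short_form = instruction(target, unambiguous=False, head_noun="building block")
long_form = instruction(target, unambiguous=True, head_noun="building block")
```

With the head noun fixed, `short_form` and `long_form` reproduce the wording of Examples (1) and (2).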

Inspections of target objects trigger positive feedback (e.g., “Yes”, “Exactly” etc.), and inspections of competitors trigger negative feedback signaling that the listener is considering a wrong object. This can be underspecified, e.g., “No, not that one!” or contrastive, providing relative position information, e.g., “Further left!” In the former case, the listener can exclude only the inspected competitor, which might be sufficient for simple scenes where fewer competitors are available in the visual context. In the latter case, however, the listener’s attention is directed towards the target from the relative gaze position. The system thereby reduces inspections of other competitors before the target is found and implements the notion of referring in installments, i.e., in chunks of information rather than one long referring expression.
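The choice between positive, underspecified, and contrastive feedback can be sketched as a small decision function. The coordinate convention (x grows toward the listener's right, y grows away from the listener), the rule of reporting only the dominant axis, and the phrases other than "Yes, exactly!", "No, not that one!", and "Further left!" are assumptions for illustration and are not taken from the system.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]  # assumed tabletop coordinates: x to the right, y away from the listener


def contrastive_hint(gaze: Point, target: Point) -> str:
    """Direction from the inspected object to the target along the dominant axis."""
    dx, dy = target[0] - gaze[0], target[1] - gaze[1]
    if abs(dx) >= abs(dy):
        return "Further left!" if dx < 0 else "Further right!"  # "right" phrase is hypothetical
    return "Further back!" if dy > 0 else "Further to the front!"  # both phrases are hypothetical


def feedback(inspected: str, target: str, positions: Dict[str, Point], specificity: str) -> str:
    """Select the verbal feedback for one detected inspection."""
    if inspected == target:
        return "Yes, exactly!"
    if specificity == "underspecified":
        # Only excludes the inspected competitor.
        return "No, not that one!"
    # Contrastive: redirect attention relative to the current gaze position.
    return contrastive_hint(positions[inspected], positions[target])


positions = {"target": (0.0, 1.0), "competitor": (2.0, 1.0)}
```

In this layout, an inspection of `"competitor"` yields "No, not that one!" under underspecified feedback and "Further left!" under contrastive feedback, which is the difference illustrated by Examples (4) and (5).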

Task

GazInG instructed a human listener to take a certain object; the listener performed grasping actions in response and assembled the LEGO objects in their own way. A total of eight objects had to be selected and taken from a single layout. Assembly continued with subsequent layouts. The final constructions were photographed and entered into a competition. The most creative result won a €10 Amazon voucher.

Procedure

We manipulated instructional ambiguity within participants: every participant received both unambiguous and ambiguous instructions. Further, we varied the gaze-driven verbal feedback and feedback specificity between groups. That is, the underspecified feedback group received unambiguous instructions without feedback and ambiguous instructions supplemented with underspecified feedback (extending the design of Garoufi et al. 2016). The contrastive feedback group, on the other hand, received contrastive feedback under both levels of instructional ambiguity. This feedback was more informative and was meant to direct the listener’s attention toward the intended target, particularly after ambiguous instructions.

Participants were seated in front of the workspace and asked to listen carefully to and follow the system’s instructions. They were instructed to act as a team with the system and solve the task together as precisely as possible, i.e., to avoid taking the wrong building blocks. Then participants put on the eye tracking glasses and followed a three-point calibration procedure. Calibration was repeated between layouts and whenever needed. Before performing the actual task, a short practice session was completed: participants had to collect three targets among six objects in total to familiarize themselves with the task and the system’s pace.

The experiment consisted of two parts, one for each type of instructional ambiguity. In each part, the participants worked through one layout (see Appendix A), in which they searched for eight target objects. The order of the parts was balanced across participants. Participants were instructed to select an object as soon as they were sure which one was meant by the system. They heard a confirmation after a correct grasp action.

An example trial is presented in Fig. 2. In the following examples, the labels presented in brackets refer to the feedback specificity. Example (3) illustrates a typical interaction using unambiguous instructions for both groups.

Example (3)

SYSTEM: Pick the big red building block with a small yellow piece on top of it at the back toward the left.

LISTENER: [inspects the target]

SYSTEM: [silence]/ Yes, exactly! (underspecified/contrastive)

LISTENER: [grasps the target]

SYSTEM: Well done!

The underspecified feedback group experienced the ambiguous instructions usually as shown in Example (4).

Example (4)

SYSTEM: Pick the big red building block.

LISTENER: [inspects a competitor]

SYSTEM: No, not that one! (underspecified)

LISTENER: [inspects a competitor]

SYSTEM: No, not that one! (underspecified)

LISTENER: [inspects the target]

SYSTEM: Yes, exactly!

LISTENER: [grasps the target]

SYSTEM: Well done!

The contrastive feedback group may require fewer turns in the ambiguous instructions as shown in Example (5).

Example (5)

SYSTEM: Pick the big red building block.

LISTENER: [inspects a competitor]

SYSTEM: Further toward the left! (contrastive)

LISTENER: [inspects the target]

SYSTEM: Yes, that one!

LISTENER: [grasps the target]

SYSTEM: Well done!

After finishing a layout, participants filled in a questionnaire assessing their perception and impressions of their interaction with the system. Participants answered 13 questions to judge the interaction in each instructional ambiguity approach. Eight questions were followed by a five-point Likert scale (1 indicating a very good and 5 a poor score), e.g., “How good/precise did you find the spoken instructions?” or “How flexible did you find the interaction?” In addition, there were five yes/no questions, such as “Was the system’s feedback confusing?” to assess if the interaction with the system felt natural. The question “Were the instructions exhaustive, i.e., you were able to identify a target upon hearing the instruction?” checked whether the participants paid attention. In a final questionnaire, they were asked five yes/no questions to compare both interaction strategies and assess user preferences. The experiment lasted between 30 and 45 minutes.

Analyses

All measures were collected on a per-item basis. Performance was measured using the total time from instruction onset until a target was grasped, and whether the interaction ended successfully with the correct object selected. The total time was further divided into three phases, which differ depending on the instructional ambiguity (Fig. 3). The first phase is determined by the duration of the spoken instruction, from speech onset to speech offset. Secondly, we assessed identification, the time needed from the offset of the instruction to the listener’s first inspection of the target. Finally, the time from the first target inspection until the grasp of the target determined the duration of the third phase.
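The three-phase decomposition of the total time can be written down directly from four timestamps per trial. The field names below are hypothetical, and the sketch deliberately does not clamp the identification time: it becomes negative if the target is first inspected before the instruction has ended, a case the text does not specify how to handle.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Trial:
    # All timestamps in seconds on a common clock (hypothetical field names).
    instruction_onset: float
    instruction_offset: float
    first_target_inspection: float
    grasp: float


def phase_durations(trial: Trial) -> Dict[str, float]:
    """Split the total time of one trial into the three phases described in the text."""
    return {
        "instruction": trial.instruction_offset - trial.instruction_onset,
        "identification": trial.first_target_inspection - trial.instruction_offset,
        "grasp": trial.grasp - trial.first_target_inspection,
        "total": trial.grasp - trial.instruction_onset,
    }


example = phase_durations(Trial(0.0, 2.5, 6.0, 9.0))
```

For the made-up trial above, the instruction phase lasts 2.5 s, identification 3.5 s, and the grasp phase 3.0 s, which sum to the total of 9.0 s.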

We further counted the number of feedback occurrences per interaction and also assessed the time from instruction offset to the first positive or first negative feedback instance, which marks the end of the (initial) visual search for the target.

Statistical analyses were conducted in the R statistical programming environment (R Core Team 2014). We assessed statistical significance using linear mixed-effects models using the lme4 package in R and model comparison to determine the influence of instructional ambiguity and feedback specificity. As proposed by Bates et al. (2015), we started with the maximal model fitting our assumptions with respect to the random effects structure. If the models failed to converge, we simplified the random structure by first removing the correlations between random slopes and intercepts, followed by the intercept terms, starting with the random effect for items (if present).

Results

The results reported in this section are based on 722 of the 768 trials; the remaining 46 trials were removed as outliers (data points more than 2.5 standard deviations above or below the mean).
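A 2.5 standard deviation criterion of this kind can be sketched as below. Whether the mean and standard deviation were computed over all trials or per condition, and whether the sample or population standard deviation was used, is not stated in the text; this sketch assumes a single pool and the sample standard deviation.

```python
from statistics import mean, stdev
from typing import List, Sequence


def remove_outliers(values: Sequence[float], k: float = 2.5) -> List[float]:
    """Keep values within k sample standard deviations of the mean (pooling is an assumption)."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= k * s]


# Made-up response times in seconds, not data from the study.
times = [10.0] * 20 + [100.0]
kept = remove_outliers(times)
```

Here the single 100 s value lies more than 2.5 standard deviations above the mean and is dropped, while the twenty 10 s values are kept.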

Total time

The time to solve each task, i.e., to find and collect a building block, indicates the degree of efficiency of the communication with the system. All tasks were solved, and there were only a few wrong grasps (8.7%), as well as almost no need for repetition of an instruction, showing that both interaction strategies are effective. Table 1 summarizes the response times for the interaction phases. The underspecified feedback group was faster at solving the task after listening to an unambiguous instruction (M = 14.31 s, standard deviation SD = 8.60 s) than after an ambiguous instruction with underspecified feedback (M = 17.56 s, SD = 10.44 s). For the contrastive feedback group, the direction of the effect reversed: the ambiguous instruction led to a shorter task completion time (M = 11.96 s, SD = 5.61 s) than the unambiguous instruction (M = 12.75 s, SD = 4.75 s). To test these differences, we constructed an individual model for each group with instructional ambiguity as a fixed effect and with random intercepts and slopes for subjects and items. Both model comparisons revealed a main effect of instructional ambiguity: for the underspecified feedback group, χ²(1) = 4.008, p < 0.05, and for the contrastive feedback group, χ²(1) = 4.502, p < 0.05. For the subset of ambiguous instructions in both groups, we fitted a linear mixed-effects model with feedback specificity as a fixed effect and included random intercepts and slopes for subjects and items. Model comparison revealed a main effect of feedback specificity on total time (χ²(1) = 15.907, p < 0.001), that is, contrastive feedback improved task completion time over underspecified feedback.
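The model comparisons reported here are likelihood-ratio tests with one degree of freedom, so the p-value follows from the χ² statistic alone. The sketch below is not the authors' R/lme4 code; it only shows how the reported statistics map onto the stated significance thresholds, using the identity that the upper tail of a χ² distribution with one degree of freedom equals erfc(√(x/2)).

```python
import math
from typing import Tuple


def chi2_sf_df1(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def lrt_df1(loglik_reduced: float, loglik_full: float) -> Tuple[float, float]:
    """Likelihood-ratio statistic and p-value for nested models differing by one parameter."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2_sf_df1(stat)


# Chi-square statistics as reported in the text for total time.
p_underspecified = chi2_sf_df1(4.008)   # instructional ambiguity, underspecified feedback group
p_contrastive = chi2_sf_df1(4.502)      # instructional ambiguity, contrastive feedback group
p_specificity = chi2_sf_df1(15.907)     # feedback specificity, ambiguous instructions only
```

The first two values fall below 0.05 and the third below 0.001, in line with the thresholds reported above.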

Table 1 Mean durations of the interaction phases in Experiment 1

Identification time

Next, we analyzed the time needed to find and inspect the intended target after instruction offset. Unsurprisingly, participants were quicker at identifying a target following an unambiguous instruction, as it contains all the information needed. In addition, they could start searching as soon as they had heard the first part of the instruction. Analogously to the analysis of the total time, we fitted linear mixed-effects models for each data set with the same random structure. Model comparison revealed a main effect of our within-subject manipulation of instructional ambiguity in each group: for the underspecified feedback group, χ²(1) = 60.257, p < 0.001, and for the contrastive feedback group, χ²(1) = 92.868, p < 0.001. Additionally, we analyzed our between-subject manipulation and observed a main effect of feedback specificity for the ambiguous approach (χ²(1) = 4.172, p < 0.05). In other words, in the underspecified feedback group, listeners needed more than three times as long to find the target object after hearing an ambiguous instruction (M = 7.22 s, SD = 8.37 s) as after listening to an unambiguous one (M = 2.17 s, SD = 5.12 s). This time was shortened dramatically when gaze-driven contrastive feedback followed the instructions, though listeners still inspected the intended target sooner after the unambiguous instructions (M = 1.27 s, SD = 2.21 s) than after the ambiguous instruction (M = 4.21 s, SD = 3.80 s).

Feedback occurrences

We analyzed the number of negative feedback instances that occurred after the ambiguous instructions across groups, but surprisingly, there was no significant difference (p = 0.658).

We further examined how much time elapsed until a feedback instance was triggered through listener gaze after the ambiguous instructions, in both groups. Specifically, we contrasted this time for the first negative with the first positive feedback instance in an interaction (see red arrows in Fig. 4), since this indexes the visual search and how actively and intensively participants engaged with the instruction-giving system.

Figure 5 depicts the respective means. For the analysis, we fitted a model with feedback specificity as a fixed effect and with random intercepts and slopes for subjects and items. Importantly, we found a main effect of feedback specificity (χ²(1) = 18.416, p < 0.001). As expected, the pattern observed for the identification time (determined by first target inspection) persists for the time to the first positive feedback instance because it is precisely this inspection that triggers the first positive feedback instance. The underspecified feedback group provoked positive feedback later (M = 10.33 s, SD = 16.91 s) than the contrastive feedback group (M = 5.43 s, SD = 5.97 s). This demonstrates how more specific feedback narrowed down the search for the target object and shortened the time to find it. Furthermore, the investigation of the first occurrence of a negative feedback instance revealed that listeners also inspected a competitor matching the description sooner in the contrastive feedback group (M = 1.97 s, SD = 2.68 s) than in the underspecified feedback group (M = 4.07 s, SD = 5.77 s). This suggests that listeners who expect an informative response use their gaze in a more deliberate and controlled way to engage with the system, because the system constantly responds to their gaze with useful information that restricts the search space. They could thus use their gaze to probe actively for feedback.

Questionnaires

Overall, the interaction with the system was perceived as rather natural and the gaze-driven feedback was rated as helpful and not confusing. Interestingly, there was a clear preference in both groups for listening to and following an unambiguous instruction. All participants (100%) in the underspecified feedback group and most of the participants in the contrastive feedback group (87.5%) stated that they preferred unambiguous instructions and rated them as more pleasant, although the contrastive feedback group was faster when responding to ambiguous instructions. We ran a simple linear regression on the responses of each group to the question “How good did you find the interaction flow?” (Fig. 6) and observed a marginal effect of instructional ambiguity for the underspecified feedback group (β = −0.375, t(46) = −1.98, p = 0.0537). Further, for the subset of ambiguous instructions, simple linear regression revealed an effect of feedback specificity approaching significance (β = −0.333, t(46) = −1.829, p = 0.0739). That is, when contrastive feedback followed an ambiguous instruction, the interaction flow was judged to be better (M = 1.25, SD = 0.44) than when underspecified feedback was provided (M = 1.58, SD = 0.77). The former assessment was similar to the perception of the unambiguous instructions by the contrastive feedback group (M = 1.25, SD = 0.53) and the underspecified feedback group (M = 1.20, SD = 0.51). This, and similar results from the other questions, suggests that the informativity of the verbal feedback compensates for the instructional ambiguity of initially partial instructions, so that listeners experience the interaction as smoother.

Discussion

Our data provide some evidence for the successful use of listener gaze in a real-world task. An instruction strategy that refers to objects incrementally and reacts to listeners’ gaze can be used to identify objects in the shared space. Moreover, the performance results indicate that feedback specificity is essential for efficiency. The results reveal that contrastive feedback benefits task performance because it not only warns the listener against grasping a wrong object, but also includes a relative direction in which to look further for the target. In contrast, underspecified feedback merely prevents the user from making wrong grasps and does not facilitate the search. Notably, the combination of ambiguous instructions with contrastive feedback numerically even outperformed unambiguous instructions.

Interestingly, there was a mismatch between the perception and performance measures with respect to unambiguous and ambiguous instructions. Apparently, listeners felt more confident in their own performance when following unambiguous instructions. One reason for this might be that the unambiguous instructions allowed participants to remain rather passive until the grasp action. After an ambiguous instruction, in contrast, they had to engage actively with the system to make progress in the task. The former is considered more convenient despite being apparently less efficient than the more interactive strategy, i.e., ambiguous instructions with specific contrastive feedback. Whether this behavior emerges as a direct response to the system’s behavior in a given trial or whether it is the result of a more global adaptation to the system was investigated in Experiment 2.

Experiment 2

By giving only ambiguous instructions, this experiment further examined the impact of feedback specificity on task performance. Feedback specificity was manipulated within participants and in an interleaved and randomized order, item by item. Thus, participants did not know in advance which type of feedback they might receive. The aim was to assess whether participants benefited from the contrastive feedback in the first experiment because more information was conveyed, so that this system is inherently more efficient, or whether, more generally, the participants adapted to the system, e.g., by increasing their attentiveness or willingness to collaborate and thus taking up and processing the information provided more efficiently. If the former hypothesis holds, then performance with contrastive feedback would remain high (and higher than with underspecified feedback), even if interleaved. If the latter hypothesis is true, we would expect to see either low performance in both approaches (since engagement decreases altogether) or high performance in both approaches (since engagement is high and leads to more efficient information uptake).