Introduction

Typically, when we talk, we also gesture; that is, we perform manual movements as part of the expressive effort (Kendon 2004; McNeill 1992). Such speech-accompanying gestures typically convey meaning (e.g., size, shape, direction of movement) related to the ongoing talk. The communicative role of these gestures is somewhat controversial: it is debated both whether speakers actually intend gestural information for their addressees (e.g., Holler and Beattie 2003; Melinger and Levelt 2004) and whether addressees attend to and integrate that information. This paper focuses on the latter issue.

There is growing evidence that speech and speech-accompanying gestures are processed and comprehended together, forming an ‘integrated’ system or a ‘composite signal’ (e.g., Clark 1996; Kendon 2004; McNeill 1992). Gestural information is integrated with speech in comprehension and influences the interpretation and memory of speech (e.g., Beattie and Shovelton 1999a, 2005; Kelly et al. 1999; Langton and Bruce 2000; Langton et al. 1996). For instance, information expressed only in gestures resurfaces in retellings, in speech, in gesture, or in both (Cassell et al. 1999; McNeill et al. 1994). Further, neurocognitive studies show that incongruencies between information in speech and gesture yield electrophysiological markers of integration difficulties such as the N400 (e.g., Özyürek et al. 2007; Wu and Coulson 2005). However, surprisingly few studies have examined directly whether attention to gestures and uptake of gestural information are deterministic and unavoidable, or whether such attention is modulated in human interaction, and, if so, by what factors. Furthermore, surprisingly little is known about the role of gaze in this context. This study therefore aims to examine what factors influence overt, direct visual attention to gestures and uptake of gestural information, focusing on one social factor, namely speakers’ gaze at their own gestures, and two physical properties of gestures, namely their place in gesture space and the presence of gestural holds. The study also examines the relationship between addressees’ gaze and uptake.

Visual Attention to Gestures

Gestures are visuo-spatial phenomena, and so the role of vision and gaze for attention is important. However, addressees seem to gaze directly at speakers’ gestures relatively rarely. Addressees mainly look at the speaker’s face during interaction (Argyle and Cook 1976; Argyle and Graham 1976; Bavelas et al. 2002; Fehr and Exline 1987; Kendon 1990; Kleinke 1986). Studies using eye-tracking techniques in face-to-face interaction have further demonstrated that addressees spend as much as 90–95% of the total viewing time fixating the speaker’s face and thus fixate only a minority of gestures (Gullberg and Holmqvist 1999, 2006).

However, the likelihood of an addressee directly fixating a gesture increases under the following three circumstances (Gullberg and Holmqvist 1999, 2006; Nobe et al. 1998, 2000). The first is when speakers first look at their own gestures (speaker-fixation) (Gullberg and Holmqvist 1999, 2006). This tendency is stronger in live face-to-face interaction than when observing speakers on video (Gullberg and Holmqvist 2006). This suggests that the overt shift of visual attention to the target of a speaker’s gaze is essentially social in nature rather than an automatic response. The second circumstance is when a gesture is produced in the periphery of gesture space in front of the speaker’s body (cf. McNeill 1992). The third is when a gestural movement is suspended momentarily in mid-air and goes into a hold before moving on (cf. Kendon 1980; Kita et al. 1998; Seyfeddinipur 2006). Holds are often found between the dynamic movement phase of a gesture, the stroke, and the so-called retraction phase, which marks the end of a gesture. It is currently not clear whether these three factors—speaker-fixation, peripheral articulation, and holds—all contribute independently to the increased likelihood of the addressee’s fixation on gesture. The evidence for the influence of these three factors mostly comes from observational studies of naturalistic conversations, in which the three factors often co-occur (Gullberg and Holmqvist 1999, 2006; see also Nobe et al. 1998, 2000). Therefore, one of the goals of this study is to experimentally manipulate these factors and assess their relative contributions to the likelihood of addressees’ fixations of gesture.

The three factors may draw the addressee’s attention either for bottom-up, stimulus-related reasons or for top-down, social-cognitive reasons. Gestures in peripheral gesture space or with a hold may elicit the addressee’s fixation for bottom-up reasons, namely, because these gestures challenge peripheral vision. Firstly, the acuity of peripheral vision decreases the further away from the fovea the image is projected, and secondly, peripheral vision, which is good at motion detection, cannot process information about a static hand in a hold efficiently. In contrast, gestures with speaker-fixations may elicit the addressee’s fixation for top-down social reasons, namely to manifest social alignment or joint attention. The difference between bottom-up and top-down processes should be reflected in different onset-latencies of fixations to gestures (cf. Gullberg and Holmqvist 2006). Fixation onsets that are bottom-up driven should be short, whereas fixations driven by top-down concerns should have longer onsets (e.g., Yantis 1998, 2000). Thus, another goal of the study is to compare the onset-latency for fixations on gestures triggered by the three factors to further elucidate the reasons for fixation.

Uptake of Gestural Information

Only a few studies have directly examined whether attention to and uptake of information from gestures is unavoidable or whether it is modulated, and, if so, by what factors. Rogers (1978) manipulated noise levels, showing that addressees pick up more information from gestures the less comprehensible the speech signal is. Beattie and Shovelton (1999a, b) demonstrated that addressees decode information about relative position and size better when presented with speech and gesture combined than with either gesture or speech alone. Interestingly, this work also indicated that not all gestural information was equally decodable: addressees reliably picked up location and size information pertaining to objects, but did worse with information such as direction. These studies indicate that the comprehensibility of speech affects addressees’ attention to gestures and that the type of gestural information matters.

Other factors may also modulate addressees’ attention to gestures. Speakers’ gaze to their own gestures, a factor of a social nature, is a likely candidate. It is well-known that humans are extremely sensitive to the gaze direction of others (e.g., Gibson and Pick 1963), and that gaze plays a role in the establishment of joint attention (e.g., Langton et al. 2000; Moore et al. 1995; Tomasello 1999; Tomasello and Todd 1983). It has been suggested that speakers look at their own gestures as a means to draw addressees’ attention to them in face-to-face interaction (e.g., Goodwin 1981; Streeck 1993, 1994). Such behavior could increase the likelihood of addressees’ uptake of gestural information, although this has not been tested with naturalistic, dynamic gestures that are not pointing gestures.

Physical properties of gestures may also affect addressees’ uptake of gestural information. First, the location of the gesture in gesture space may matter (cf. McNeill 1992). Speakers often bring gestures up into central gesture space, that is, to chest height and closer to the face, when they want to highlight the relevance of gestures in interaction (e.g., Goodwin 1981; Gullberg 1998; Streeck 1993, 1994). The information expressed by such a gesture seems more likely to be integrated than that of a gesture articulated, for instance, on the speaker’s lap in lower, peripheral gesture space.

A second potentially important physical property is the gestural hold. The functional role of holds is somewhat debated, but holds have been implicated in turn-taking and floor-holding in interaction. Transitions between speaker turns are more likely once a gesture is terminated or when a tensed hand position is relaxed (e.g., Duncan 1973; Fornel 1992; Goodwin 1981; Heath 1986). If holds are a first indication that speakers are about to give up their turn, it would be communicatively useful for addressees to attend to them. This, in turn, may increase the likelihood of information uptake from a gesture with a hold. A further goal of this study, then, is to examine the impact of these three factors on addressees’ uptake of gesture information.

The Relationship Between Fixations and Information Uptake

As indicated above, most gestures are perceived through peripheral vision. Although peripheral vision is powerful, optimal image quality with detailed texture and color information is achieved only in direct fixations, that is, if the image falls directly on the small central fovea. Outside of the fovea, parafoveal or peripheral vision gives much less detailed information (Bruce and Green 1985; Latham and Whitaker 1996). Consequently, it is generally assumed that an overt fixation indicates attention in the sense of information uptake. If addressees shift their gaze from the speaker’s face to a gesture in interaction, this might indicate that they are attempting to integrate the gestural information (e.g., Goodwin 1981; Streeck 1993, 1994).

However, addressees’ tendency to gaze directly at an information source is modulated in face-to-face interaction by culture-specific norms for maintained or mutual gaze to indicate continued attention (e.g., Rossano et al. 2009; Watson 1970). In cultures where mutual gaze is socially important, face-to-face interaction may increase the reliance on peripheral vision for gesture processing and the dissociation between overt and covert attention. Addressees can fixate a visual target without attending to it (“looking without seeing”) and, conversely, attend to something without directly fixating it (“seeing without looking”). If the speaker’s face is the default location of visual attention in interaction, then most gestures must be attended to covertly. It is therefore not entirely clear what the relationship is in interaction between overt fixation of, and information uptake from, sources like gestures. A final goal of this study is therefore to examine the relationship between overt fixation of gestures and uptake of information from them.

The Current Research

This study aims to examine what factors modulate addressees’ visual attention to and information uptake from gestures in interaction by asking the following questions:

  1. Do social and physical factors influence addressees’ fixations on speakers’ gestures? Furthermore, do different factors trigger qualitatively different fixations, reflecting the difference between top-down and bottom-up processes? We expect top-down driven fixations to have longer onset latencies than bottom-up driven fixations.

  2. Do social and physical factors influence addressees’ uptake of gesture information?

  3. Are addressees’ fixations a good index of information uptake from gestures?

To examine these questions we present participants (‘addressees’) with video recordings of naturally occurring gestures embedded in narratives. We examine the effect of a social factor, namely the presence/absence of speakers’ fixations of their own gestures (Study 1), and the effect of two physical properties of gestures, namely gestures’ location in gesture space (central/peripheral) and the presence/absence of holds (Study 2). In Studies 1 and 2, we manipulate the independent variables by selecting gestures with the relevant properties from a corpus of video-recorded gestures. In a second set of control experiments, we present participants with digitally manipulated versions of the gesture stimuli used in Studies 1 and 2, examining the effect of presence/absence of speakers’ artificial fixations of their own gestures (Study 3) and the presence/absence of artificial holds (Study 4). These control studies are undertaken to rule out other, unknown variables that may have differed between the stimulus gestures used in the conditions of Studies 1 and 2.

In all studies, participants were presented with brief narratives that included a range of gestures, but our analyses focus on one “target gesture” in each narrative. Each target gesture conveyed information about the direction of a movement. This information was only encoded in the target gesture, and not in other gestures or in speech. Overt visual attention to gestures was operationalized as direct fixations of gestures. Participants’ eye movements were recorded during the presentation of the narratives using a head-mounted eye-tracker. Further, information uptake was operationalized as the extent to which participants could reproduce the information conveyed in the target gesture in a drawing task following stimulus presentation. Participants were asked to draw an event in the story that crucially involved the movement depicted by the target gesture. The match between the directionality of the movement in the drawing and in the target gesture was taken as indicative of information uptake.

Study 1: Speaker-fixations

The first study examines the effect of a social factor on addressees’ overt visual attention to and uptake of information from gestures, namely the presence/absence of speakers’ fixations of their own gestures.

Methods

Participants

Thirty Dutch undergraduate students from Radboud University Nijmegen participated in this study (M age = 22, SD = 3), 23 women and 7 men. They were paid 5 euros for their participation.

Materials

The stimuli were taken from a corpus of videotaped face-to-face story retellings in Dutch (Kita 1996). The video clips showed speakers facing an addressee or viewer while retelling short stories. The clips did not show the original live addressee, but only the speaker seated en face. Each video clip contained a whole, unedited story retelling and therefore multiple gestures, only one of which was treated as a target gesture. Consequently, the target gesture appeared within sequences of other gestures so as not to draw attention as a singleton. The stimulus videos were selected from the corpus because they contained one target gesture displaying the appropriate properties. For Study 1, each target gesture displayed either presence or absence of speaker-fixation; that is, the speakers either looked at their own gestures or not. The target gestures were otherwise similar, and performed in central gesture space without holds. All target gestures were representational gestures encoding the movement of a protagonist in the story from an observer viewpoint (McNeill 1992), meaning that the speaker’s hand represented a protagonist as seen from outside. The target gestures, typically expressing a key event in the storyline, encoded the direction of the protagonist’s motion to the left or the right. Although the movement itself was an important part of the storyline, the direction of the movement was not. The directional information was present only in the target gesture and not in co-occurring speech, nor could it be inferred from other surrounding gestures. Care was taken to ensure that the gestural information was not highlighted in any other way. Co-occurring speech did not contain any deictic expressions referring to, and therefore drawing attention to, the gesture (e.g., ‘that way’). Moreover, the target gesture did not co-occur with hesitations in speech, with the story punch line, or with the first mention of a protagonist, as all of these features might have lent extra prominence to a co-occurring gesture. Descriptions of the animated cartoons used to elicit the narratives and the target scenes therein are provided in Appendix 1. Outlines of the spatio-temporal properties of the target gestures across conditions (and all studies) are provided in Appendix 2, and speech co-occurring with target gestures is listed in Appendix 3.

In Study 1, the target gestures consisted of gestures that were either fixated or not by the speaker in the video (speaker-fixation vs. no-speaker-fixation). Location in gesture space and presence/absence of hold were held constant (central space, no hold). There were 4 items in each condition. The mean durations of the target gestures in each condition in Study 1 are summarized in Table 1.

Table 1 Mean duration (ms) of target gestures with and without speaker-fixation

Apparatus

We used a head-mounted SMI iView© eye-tracker, a monocular 50 Hz pupil and corneal-reflex video imaging system. The eye-tracker records the participant’s eye movements with the corneal-reflex camera. It also has a scene-camera on the headband, which records the participant’s field of vision. The output consists of a merged video recording showing the addressee’s field of vision (i.e., the speaker on the video) with the addressee’s gaze position superimposed as a circular fixation marker. Since the scene-camera moves with the head, the eye-in-head signal indicates the gaze point with respect to the world; head movements therefore appear on the video as full-field image motion. The fixation marker represents the foveal fixation and covers a visual angle of 2°. The output video data allow us to analyze both gesture and eye movements with a temporal accuracy of 40 ms.
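
To make these figures concrete, the short Python sketch below (ours, purely illustrative; the 250 cm viewing distance is taken from the Procedure section) works out how large the 2° fixation marker is on the projection wall, and why frame-by-frame coding of the output video yields 40 ms accuracy.

```python
import math

# Assumed setup, from the Procedure section: participants sat 250 cm from
# the wall; the fixation marker covers 2 degrees of visual angle.
VIEWING_DISTANCE_CM = 250
MARKER_ANGLE_DEG = 2.0

# Size on the wall subtended by the marker (basic trigonometry)
marker_cm = 2 * VIEWING_DISTANCE_CM * math.tan(math.radians(MARKER_ANGLE_DEG / 2))
print(f"2 degree marker at 250 cm spans about {marker_cm:.1f} cm")  # ~8.7 cm

# Temporal resolution: the eye is sampled at 50 Hz (20 ms per sample), but
# coding is done frame-by-frame on the merged video, and a 40 ms accuracy
# corresponds to one frame of 25 fps video.
print(f"eye sample: {1000 / 50:.0f} ms, video frame: {1000 / 25:.0f} ms")
```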

Procedure

Participants were randomly assigned to one of two conditions: Speaker-fixation (central space, no hold, speaker-fixation) and No-speaker-fixation (central space, no hold, no speaker-fixation). The participants were seated 250 cm from the wall and fitted with the SMI iView© headset. A projector placed immediately behind the participant projected a nine-point calibration matrix, equal in size to the subsequent stimulus videos, onto the wall. After calibration, four stimulus video clips were projected onto the wall. The speakers appearing in the videos were thus life-sized, and their heads were level with the participants’ heads. Life-sized projections have been shown to yield fixation behavior towards gestures that is similar to behavior in live interaction (Gullberg and Holmqvist 2006). A black screen appeared between each video clip for 10 s. Participants were instructed to watch the videos carefully so as to be able to answer questions about them afterwards. The instructions did not mention gestures or the direction of the movements in the story. Participants’ eye movements were recorded as they watched the video clips. After watching all four videos, participants answered questions about the target events of each video by drawing pictures of the protagonists in the story. An example question is “De muis heeft moeite met roeien. Hoe komt hij toch vooruit?” (“The mouse has trouble rowing. How does it make progress?”) (see Appendix 4 for the complete set of questions).

The participants did not know the contents of the questions until they had finished watching all four videos. A drawing task was chosen because it allows directionality to be probed implicitly: the participant must apply a perspective on the event and the protagonist in order to draw them, a perspective which in turn reveals the direction of the protagonist’s movement (see Fig. 1). The drawing task thus avoids the well-known difficulties involved in overt labeling of left-right directionality (e.g., Maki et al. 1979). A post-test questionnaire ensured that gesture was not identified as the target of study.

Fig. 1 Example of a match between the (gesture) direction seen on the stimulus video (left) and the direction indicated as a response on the subsequent drawing task (right)

Coding

The eye movement data were retrieved from the digitized video output from the eye-tracker. The merged video data of the participants’ gaze positions on the scene image were analyzed frame-by-frame and coded for fixation of the target gesture (Yes or No) and for matched reply (Yes or No). A target gesture was coded as fixated if the fixation marker was immobile on the gesture, i.e., moved no more than 1 degree, for a minimum of 120 ms (equal to 3 video frames) (cf. Melcher and Kowler 2001). Note that fixations on gestures were spatially unambiguous: either a gesture was clearly fixated, or the fixation marker stayed on the speaker’s face (cf. Gullberg and Holmqvist 1999, 2006). A drawing was coded as a matched reply if the direction of the motion in the drawing matched the direction of the target gesture on the video as seen from the addressee’s perspective (see Fig. 1). Only responses that could be coded as matched or non-matched were included in the analysis; when a drawing did not depict a lateral direction of any kind, the data point was discarded. Chance performance therefore equals 50%.
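
The fixation criterion resembles the dispersion-based checks used in automatic fixation detection (I-DT-style algorithms). The sketch below is only our algorithmic paraphrase of the manual coding, with a hypothetical sample format of one (x, y) marker position per video frame, in degrees.

```python
# Paraphrase of the fixation criterion: the marker must move no more than
# 1 degree for at least 120 ms (3 video frames at 40 ms each).

FRAME_MS = 40
MIN_FIXATION_MS = 120
MAX_DISPERSION_DEG = 1.0

def is_fixation(samples):
    """True if the marker stayed within MAX_DISPERSION_DEG for long enough."""
    if len(samples) * FRAME_MS < MIN_FIXATION_MS:
        return False
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    # Dispersion as the larger of the horizontal and vertical spread
    dispersion = max(max(xs) - min(xs), max(ys) - min(ys))
    return dispersion <= MAX_DISPERSION_DEG

print(is_fixation([(0.0, 0.0), (0.2, 0.1), (0.3, 0.2)]))  # True: 3 frames, < 1 deg
print(is_fixation([(0.0, 0.0), (1.5, 0.1), (2.8, 0.2)]))  # False: marker moving
```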

Analysis

The dependent variables were (a) the proportion of trials with fixations on target gestures, and (b) the proportion of matched responses as defined above. We employed non-parametric Mann–Whitney tests to analyze the fixation data because the dependent variable, the proportion of trials with fixation on gesture, had a skewed distribution with clustering at zero. We analyzed the information uptake data using parametric, independent-samples analyses of variance and single-sample t-tests. Throughout, the alpha level for statistical significance was set at .05.
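
For illustration, this battery of tests can be sketched in Python with SciPy; all numbers below are invented per-participant proportions, not the study’s data.

```python
from scipy import stats

fixation_a = [0.25, 0.00, 0.00, 0.25, 0.00]   # proportion of fixated trials
fixation_b = [0.00, 0.00, 0.00, 0.00, 0.00]   # zero-clustered, skewed

# Non-parametric comparison of the fixation proportions
u, p_fix = stats.mannwhitneyu(fixation_a, fixation_b, alternative="two-sided")

uptake_a = [1.00, 0.75, 1.00, 0.75, 0.75]     # proportion of matched drawings
uptake_b = [0.50, 0.75, 0.50, 0.25, 0.75]

# Parametric comparison of uptake, plus a single-sample test against chance (.50)
f, p_uptake = stats.f_oneway(uptake_a, uptake_b)
t, p_chance = stats.ttest_1samp(uptake_a, popmean=0.5)
print(p_fix, p_uptake, p_chance)
```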

Results and Discussion

The proportion of trials in which the addressees fixated gestures was significantly higher in the speaker-fixation condition (M = .08, SD = .12) than in the no-speaker-fixation condition (M = 0, SD = 0), Mann–Whitney, Z = −2.41, p = .016 (see Fig. 2a). The proportion of trials in which the addressees’ drawn direction and the gesture direction matched (an index of information uptake) was higher in the speaker-fixation condition (M = .86, SD = .19) than in the no-speaker-fixation condition (M = .63, SD = .32), F(1, 28) = 5.59, p = .025, η2 = .17 (see Fig. 2b). Furthermore, the proportion of trials in which addressees’ drawings and gestures matched was above chance level (.50) in the speaker-fixation condition, one-sample t-test, t(14) = 7.33, p < .001, but not in the no-speaker-fixation condition, t(14) = 1.61, p = .13.

Fig. 2 a Mean proportion of fixated target gestures in the Speaker-fixation and No-speaker-fixation conditions, b mean proportion of matched responses in the Speaker-fixation and No-speaker-fixation conditions, i.e., responses where the direction in the drawing matched that of the target gesture (chance = .5). Error bars indicate standard deviations

The results show that speakers’ fixation of their own gestures increases the likelihood of addressees fixating the same gestures. Furthermore, speaker-fixations also increase the likelihood of addressees’ uptake of gestural information, even when that information is of little narrative significance and embedded in other directional information. Overall, the combined fixation and uptake findings suggest that speakers’ gaze at their own gestures constitutes a very powerful attention-directing device for addressees, influencing both their overt visual attention and their uptake.

Study 2: Location in Space and Holds

The second study examines the effect of two physical gestural properties on addressees’ overt visual attention to and uptake of information from gestures, namely gestures’ location in space (central vs. peripheral) and the presence vs. absence of holds.

Methods

Participants

Forty-five new Dutch undergraduate students from Radboud University Nijmegen participated in this study (M age = 21, SD = 2), 41 women and 4 men. They were paid 5 euros for their participation.

Materials

Three new sets of stimulus videos were selected from the aforementioned corpus using the same criteria as before, targeting narratives containing different target gestures. For Study 2, the target gestures were performed in central vs. peripheral gesture space with presence vs. absence of a hold, with four items in each condition. We used McNeill’s (1992) schema to code gesture space. McNeill divides the speaker’s gesture space into central and peripheral space, where central space refers to the space in front of the speaker’s body, delimited by the elbows, the shoulders, and the lower abdomen, and peripheral space is everything outside this area. Although McNeill makes more fine-grained distinctions within central and peripheral space, we collapsed all cases of center-center and center space, and all cases of peripheral space, leaving two broad categories: central and peripheral. To code for holds (the momentary cessation of a gestural movement), we considered post-stroke holds, that is, cessations of movement after the hand has reached the endpoint of the stroke’s trajectory (Kita et al. 1998; Seyfeddinipur 2006). The speakers never fixated the target gestures. The mean durations of the target gestures in each condition are summarized in Table 2. As before, descriptions of the animated cartoons used to elicit narratives and the target scenes are provided in Appendix 1, outlines of the spatio-temporal properties of the target gestures across conditions in Appendix 2, and speech co-occurring with target gestures in Appendix 3.
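
The collapsed coding amounts to a simple spatial rule. The sketch below is our hypothetical operationalization of it, not the coding procedure actually used (which was done by hand from the video); all coordinates are invented.

```python
# Hypothetical rule: a hand position counts as 'central' if it falls inside
# the box bounded by the speaker's shoulders, elbows, and lower abdomen,
# and 'peripheral' otherwise.
from dataclasses import dataclass

@dataclass
class CentralSpace:
    left: float
    right: float
    top: float
    bottom: float  # normalized image coordinates, y increasing downwards

def code_location(hand_x: float, hand_y: float, space: CentralSpace) -> str:
    inside = (space.left <= hand_x <= space.right
              and space.top <= hand_y <= space.bottom)
    return "central" if inside else "peripheral"

# A central box estimated for one speaker, and two hand positions:
box = CentralSpace(left=0.35, right=0.65, top=0.40, bottom=0.75)
print(code_location(0.50, 0.55, box))  # central
print(code_location(0.50, 0.90, box))  # peripheral (below the abdomen)
```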

Table 2 Mean duration (ms) of central and peripheral target gestures with and without holds

Apparatus, Procedure, Coding, and Analysis

Participants were randomly assigned to one of three conditions (15 participants in each condition): central hold, peripheral no hold, and peripheral hold. The data from the no-speaker-fixation condition in Study 1 were used as the fourth condition, central no hold, in the analysis. The apparatus, procedure, coding, and analyses were otherwise identical to Study 1.

Results and Discussion

We examined the effect of location and hold on fixations in separate Mann–Whitney tests. The proportion of trials in which the addressees fixated gestures was significantly higher for gestures with a hold (M = .11, SD = .16) than for gestures with no hold (M = 0, SD = 0), Mann–Whitney, Z = −3.63, p < .001 (see Fig. 3a). In contrast, there was no significant difference in fixation rate between central gestures (M = .07, SD = .13) and peripheral gestures (M = .04, SD = .12), Z = −.957, p = .339 (see Fig. 3a). The proportion of trials in which the addressees’ drawn directions matched the gesture directions was significantly higher for central gestures (M = .65, SD = .28) than for peripheral gestures (M = .50, SD = .26), F(3, 56) = 4.32, p = .042, η2 = .072 (see Fig. 3b). There was no significant effect of hold, F < 1, and no significant interaction, F < 1. Moreover, the proportion of trials where the drawings and the gestures matched was above chance only in the central hold condition, one-sample t-test, t(14) = 2.54, p = .023.

Fig. 3 a Mean proportion of fixated target gestures across the four conditions Location (Central vs. Peripheral) and Hold (presence vs. absence), b mean proportion of matched responses across the four conditions Location (Central vs. Peripheral) and Hold (presence vs. absence), i.e., responses where the direction in the drawing matched that of the target gesture (chance = .5). Error bars indicate standard deviations

The results show that, when location in gesture space and holds were teased apart, only holds increased the likelihood of addressees fixating gestures; the location in gesture space where gestures were produced did not influence addressees’ fixations. Moreover, surprisingly, only information conveyed by gestures performed in central, neutral gesture space was taken up and integrated by addressees. However, this result seems to be due to the properties of a single item in the central hold condition, viz. the “trashcan” item (cf. Appendix 2). Eighty percent of the participants (12/15) produced a matched response on this item. Closer inspection of the stimulus showed that the speaker in this item had looked at another gesture immediately preceding the target gesture. The item therefore inadvertently became similar to the items in the speaker-fixation condition. When this item was removed from the analysis, uptake in the central hold condition dropped to chance level (M = .59, SD = .32), t(14) = 1.17, p = .262. We therefore conclude that location in gesture space and holds do not modulate the likelihood of information uptake from gestures.

Post-hoc Analysis of Fixation Onset Latencies from Studies 1 and 2

To examine whether different gestures are fixated for different reasons, we analyzed the fixation onset latencies for those gestures that drew fixations, that is, gestures with speaker-fixations, and gestures with holds (collapsing central and peripheral hold gestures). We measured the time difference between the onset of the relevant cue (speaker-fixation or gestural hold) and the onset of the addressees’ fixations of the gestures. Fixation onset latencies for gestures with speaker-fixations were significantly longer (M = 800 ms, SD = 400 ms) than onset latencies for gestures with holds (M = 102 ms, SD = 88 ms), Mann–Whitney, Z = −3.14, p = .01.
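
Restated as code (with hypothetical frame indices), the latency measure is simply the frame difference between cue onset and fixation onset, converted at 40 ms per video frame.

```python
FRAME_MS = 40  # one frame of the merged 25 fps output video

def onset_latency_ms(cue_onset_frame: int, fixation_onset_frame: int) -> int:
    """Latency from cue onset (speaker-fixation or hold) to the onset of
    the addressee's fixation on the gesture."""
    return (fixation_onset_frame - cue_onset_frame) * FRAME_MS

# e.g., a fixation beginning 20 frames after a speaker-fixation cue:
print(onset_latency_ms(cue_onset_frame=120, fixation_onset_frame=140))  # 800 ms
```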

These differences suggest that addressees’ fixations of gestures are driven by different mechanisms. Onset latencies on the order of 800 ms indicate that top-down concerns involving higher cognitive mechanisms are driving the fixation behavior. Onset latencies around 100 ms instead suggest that fixations of gestural holds may be bottom-up responses driven by the inner workings of the visual system (cf. Yantis 2000).

Study 3: Artificial Speaker-Fixations

The unexpected effect of an individual stimulus item in Study 2 raises a general concern that the independent variables may have been confounded with other unknown variables, given that the stimulus gestures differed across the conditions. For instance, the target gesture in the “plank” item had a more complex trajectory than the other items, and the gesture in the “pit” item was performed closer to the face than other target gestures (cf. Appendix 2). Although it is a strength of these studies that they draw on ecologically valid stimuli where the target gestures are naturally produced, dynamic gestures embedded in discourse and among other gestures, it is important to ascertain that the fixation and uptake findings were not caused by other factors. To test whether speaker-fixations and holds do account for the fixation and uptake data, we therefore created minimal pairs of the most neutral, baseline test items, the centrally produced gestures with no hold or speaker-fixation, by artificially introducing speaker-fixation (Study 3) and holds (Study 4) on these neutral gestures through video editing.

The third study examines the effect of artificially induced speaker-fixations on addressees’ overt visual attention to and uptake of information from gestures.

Methods

Participants

Fifteen new Dutch undergraduate students from Radboud University Nijmegen participated in this study (M age = 21, SD = 3), 11 women and 4 men. They were paid 5 euros for their participation.

Materials

Four stimulus items from Study 1, characterized as central, no hold, no speaker-fixation, were used to create four new test items. Each of these was digitally manipulated in Adobe® After Effects® to create minimal pairs of gestures with or without an artificial speaker-fixation. A section in the video was identified where the speaker’s eyes seemed to be directed towards her hands. This set of eyes was cut out and pasted over the real eyes starting at the onset of the stroke of the target gesture and maintained for a duration of 7 frames or 480 ms to form an artificial speaker-fixation (see Fig. 4). The speech stream and the synchronization between the auditory and the visual parts of the stimulus videos were not manipulated. This procedure allowed for a speaker-fixation to be imposed on a gesture while keeping the gesture, mouth movements, etc., constant. Although the mean duration of the real speaker-fixations in the original speaker-fixation condition in Study 1 was 980 ms, the artificial speaker-fixations had to be shorter (i.e., 480 ms) for the manipulation to align with the shorter gesture strokes of the original central, no hold, no-speaker-fixated gestures. However, the artificial speaker-fixations were still within the range of the naturally occurring speaker-fixations. The four digitally manipulated items constitute the artificial speaker-fixation condition.
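
For readers who want a sense of the manipulation, here is a hypothetical re-creation of the compositing step in Python with OpenCV; the study itself used Adobe® After Effects®, and all file names, coordinates, and frame numbers below are invented.

```python
import cv2

cap = cv2.VideoCapture("stimulus_original.avi")
fps = cap.get(cv2.CAP_PROP_FPS)
size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
        int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
out = cv2.VideoWriter("artificial_fixation.avi",
                      cv2.VideoWriter_fourcc(*"MJPG"), fps, size)

eyes = cv2.imread("eyes_directed_at_hands.png")  # patch cut from another frame
x, y = 310, 95                                   # where the eye region sits
STROKE_ONSET, OVERLAY_END = 250, 261             # frames covered by the overlay

frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if STROKE_ONSET <= frame_idx <= OVERLAY_END:
        # Paste the downward-looking eyes over the speaker's real eyes
        frame[y:y + eyes.shape[0], x:x + eyes.shape[1]] = eyes
    out.write(frame)
    frame_idx += 1

cap.release()
out.release()
# Note: the original audio track, which the study left untouched, would
# have to be remuxed separately (OpenCV handles video only).
```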

Fig. 4 Example of the minimal pair creation (Artificial Speaker-fixation) used in Study 3. The top panel shows example frames of the original target gesture. The bottom panel shows a set of eyes seemingly directed towards the target gesture pasted over the original eyes for a certain number of frames

Apparatus, Procedure, Coding, and Analysis

These were identical to Study 1. The data from the artificial speaker-fixation condition were compared to the data from the original no-speaker-fixation condition reported in Study 1, henceforth referred to as the control condition (Fig. 5a, b).

Fig. 5 a Mean proportion of fixated target gestures in the Control and Artificial speaker-fixation conditions, b mean proportion of matched responses in the Control and Artificial speaker-fixation conditions, i.e., responses where the direction in the drawing matched that of the target gesture (chance = .5). Error bars indicate standard deviations

Results and Discussion

There was no significant difference in the proportion of fixated trials between the artificial speaker-fixation condition (M = .03, SD = .09) and the control condition (M = 0, SD = 0), Mann–Whitney, Z = −1.44, p = .15. Furthermore, there was no significant difference in the proportion of trials with uptake between the artificial speaker-fixation condition (M = .71, SD = .31) and the control condition (M = .63, SD = .32), F(1, 28) < 1, p = .536. However, the proportion of trials with uptake was reliably above chance (.50) in the artificial speaker-fixation condition, one-sample t-test, t(14) = 2.58, p = .022, but not in the control condition, t(14) = 1.61, p = .13.

Both for fixation and uptake, the differences between the artificial speaker-fixation and the control condition went in the direction predicted by the results of Study 1, but neither difference reached statistical significance. The comparison against chance nevertheless indicated above-chance uptake from gestures in the artificial speaker-fixation condition, in line with the effect of natural speaker-fixations on uptake found in Study 1.

There are two possible explanations for the weaker fixation results in this study compared to Study 1. First, for practical reasons, the duration of the artificial speaker-fixations was significantly shorter (480 ms) than the average authentic duration (M = 980 ms, SD = 414 ms), Mann–Whitney, Z = −2.46, p = .014. It is likely that the longer the speaker’s gaze rests on a gesture, the more likely the addressee is to also look at it. A closer inspection of the results from Study 1 revealed a tendency for longer speaker-fixations to yield more addressee-fixations than shorter ones. Second, the duration of the gesture stroke itself may also have played a role. Again, the average duration of the authentic gestures with speaker-fixations was significantly longer (M = 2,410 ms, SD = 437 ms) than that of the strokes of the control gestures on which we imposed the artificial speaker-fixation (M = 1,310 ms, SD = 305 ms), Mann–Whitney, Z = −2.31, p = .021. However, the influence of stroke duration is debatable, because peripheral gestures, which by virtue of their spatial expanse also have longer durations than centrally produced gestures, did not draw fixations. Indirectly, then, these findings suggest that speakers’ fixations of their own gestures increase the likelihood of addressees shifting overt visual attention to gestures, and that this effect is enhanced the longer the speaker’s fixation lasts.

Study 4: Artificial Holds

The fourth study examines the effect of artificially induced gestural holds on addressees’ overt visual attention to and uptake of information from gestures.

Methods

Participants

Fifteen new Dutch undergraduate students from Radboud University Nijmegen participated in this study (M age = 21, SD = 2), 11 women and 4 men. They were paid 5 euros for their participation.

Materials

As in Study 3, the four items characterized as central, no hold, no speaker-fixation from Study 1 were digitally manipulated in Adobe® After Effects® to create minimal pairs of gestures with or without an artificial hold. The hand shape of the last frame of the original target gesture stroke was isolated and then pasted and maintained over the original retraction phase of the gesture for 5 frames or 200 ms, using the same procedure as illustrated in Fig. 4. The pasted hand shape was then moved spatially over a number of transition frames to fit onto the original, underlying location of the hand without creating a jerky movement. As before, speech and the synchronization between the auditory and the visual parts of the stimulus videos were not manipulated, and head and lip movements remained synchronized with speech. Note that the original mean duration of natural holds (central and peripheral) was 575 ms. As in Study 3, a shorter hold duration (i.e., 200 ms), although still within the range of naturally occurring holds, was chosen to avoid too large a spatial discrepancy between the location of the artificially held hand and the underlying retracted gesture; such a discrepancy would have made the manipulation impossible to conceal. The four digitally manipulated items constitute the artificial hold condition.
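
The freeze-and-rejoin logic can likewise be sketched hypothetically. This is our illustration of the idea, not the actual After Effects procedure; the 5-frame hold comes from the text above, while the transition length and positions are invented.

```python
HOLD_FRAMES = 5        # 5 frames x 40 ms = 200 ms artificial hold
TRANSITION_FRAMES = 4  # invented: frames used to rejoin the real hand

def patch_position(i, hold_pos, real_pos):
    """Where to paste the frozen hand patch on manipulated frame i.

    hold_pos: (x, y) of the hand at the end of the stroke.
    real_pos: function mapping frame i to the (x, y) of the underlying,
              retracting hand in the original footage.
    """
    if i < HOLD_FRAMES:
        return hold_pos                          # stage 1: frozen in place
    t = min(1.0, (i - HOLD_FRAMES + 1) / TRANSITION_FRAMES)
    rx, ry = real_pos(i)
    return (hold_pos[0] + t * (rx - hold_pos[0]),  # stage 2: linear slide
            hold_pos[1] + t * (ry - hold_pos[1]))  # back onto the real hand
```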

Apparatus, Procedure, Coding, and Analysis

These were identical to Study 1. The data from the artificial hold condition were compared to the data from the original no-speaker-fixation condition reported in Study 1, henceforth referred to as the control condition (Fig. 6a, b).

Fig. 6 a Mean proportion of fixated target gestures in the Control and Artificial hold conditions, b mean proportion of matched responses in the Control and Artificial hold conditions, i.e., responses where the direction in the drawing matched that of the target gesture (chance = .5). Error bars indicate standard deviations

Results and Discussion

The proportion of fixated trials was significantly higher in the artificial-hold condition (M = .08, SD = .12) than in the control condition (M = 0, SD = 0), Mann–Whitney, Z = −2.41, p = .016. There was no significant difference in uptake between the artificial hold (M = .59, SD = .35) and the control conditions (M = .63, SD = .32), F(1, 28) < 1, p = .75. Moreover, the proportion of matched trials was at chance both in the artificial hold condition, one-sample t-test, t(14) = 1.03, p = .319, and in the control condition, t(14) = 1.61, p = .13.

To summarize, both the fixation and the uptake findings from Study 2 were replicated. Holds made addressees more likely to fixate speakers’ gestures, but they did not seem to contribute to uptake of gestural information.

Post-hoc Analysis of Fixation Onset Latencies from Studies 3 and 4

As in Studies 1 and 2, we measured the time difference between the onset of the relevant cue (artificial speaker-fixation or artificial hold) and the onset of the addressees’ fixations of the gestures. Fixation onset latencies for artificial speaker-fixations were generally longer (M = 100 ms, SD = 85 ms) than onset latencies for gestures with artificial holds (M = 40 ms, SD = 0 ms), although there were too few data points to undertake a statistical analysis. These differences in fixation onset latencies nevertheless display the same trends as for natural speaker-fixations and holds.

Post-hoc Analyses of the Relationship Between Addressees’ Fixations and Uptake

One of the research questions concerned the relationship between fixations and uptake of gestural information. To address this issue, we examined whether information uptake differed between fixated versus non-fixated gestures.

All trials from Studies 1 through 4 were combined for this analysis to compare the likelihood of uptake in a within-subject comparison for those 20 participants who had codable trials with and without addressee-fixation (n = 15 from the hold conditions, n = 5 from the speaker-fixation condition). The proportion of matched responses was not significantly different between trials with addressee-fixation (M = .70, SD = .47) and without addressee-fixation (M = .62, SD = .42), F(1, 19) < 1, p = .576.

When the data were broken down according to the two cue types (speaker-fixation and holds), the proportion of matched responses in the two types of trials was still not significantly different: uptake from speaker-fixated trials with addressee-fixation (M = .60, SD = .55) did not differ from that in speaker-fixated trials without addressee-fixation (M = .40, SD = .55), F(1, 4) < 1, p = .621. Similarly, uptake from hold trials with addressee-fixation (M = .73, SD = .46) did not significantly differ from that in hold trials without addressee-fixation (M = .69, SD = .36), F(1, 14) < 1, p = .783. Thus, there is little evidence that addressees’ fixations of gestures are associated with uptake of the gestural information.

General Discussion

This study investigated what factors influence addressees’ overt visual attention to (direct fixation of) gestures and their uptake of gestural information, focusing on one social factor, namely speakers’ gaze at their own gestures, and two physical properties of gestures, namely their place in gesture space and the presence of gestural holds. We also examined the relationship between addressees’ fixations of gestures and their uptake of gestural information. We explored these issues drawing on examples of natural gestures expressing directional information (left or right) embedded in narratives.

The results concerning fixations of gestures can be summarized in four points. First, in line with previous studies (Gullberg and Holmqvist 1999, 2006; Nobe et al. 1998, 2000), addressees looked directly at very few gestures. Second, they were more likely to fixate gestures that speakers themselves had first fixated (speaker-fixation) than other gestures; this tendency also held for gestures with artificially introduced speaker-fixations, although there it did not reach statistical significance. Moreover, addressees were more likely to fixate gestures with a post-stroke hold than gestures without one, both for natural and for artificial holds. Third, contrary to expectation, the location of gestures in gesture space (central vs. peripheral) did not affect addressees’ tendency to fixate gestures. Fourth, the onset latency of fixations differed across gesture types: fixations of gestures with post-stroke holds had shorter onset latencies than those of speaker-fixated gestures, suggesting that addressees look at different gestures for different reasons, holds being fixated for bottom-up reasons and speaker-fixated gestures for top-down reasons.

There were three main findings concerning uptake of gestural information. First, addressees did not process and retain directional gestural information uniformly across situations. Second, addressees were more likely to retain the directional information in a gesture when speakers themselves had first fixated the gesture than when they had not. Third, once an item with an inadvertent speaker-fixation on a previous gesture was removed, there was no evidence that the presence or absence of post-stroke holds or the location in gesture space affected information uptake.

Finally, regarding the relationship between addressees’ fixations and their information uptake, a post-hoc analysis based on the pooled data from all the studies showed no evidence that addressees’ information uptake from gestures was associated with their fixations of gestures.

In previous studies of fixation behavior towards gestures (Gullberg and Holmqvist 1999, 2006; Nobe et al. 1998, 2000), the three factors investigated here have been conflated. The current study demonstrates the individual contributions of two of these factors: the social factor, speaker-fixation, and one of the physical factors, namely post-stroke holds. It also shows that the other physical property, location in gesture space, does not matter. Moreover, the data suggest that addressees fixate different gestures for different reasons. The effect of speaker-fixations on addressees’ gaze behavior is compatible with suggestions that humans automatically orient to the target of an interlocutor’s gaze (e.g., Driver et al. 1999). Notice, however, that speaker-fixations led to overt gaze-following or addressee-fixations only 8% of the time (Study 1; this rate is similar to that reported in Gullberg and Holmqvist 2006). This suggests that overt gaze-following is not an automatic process but rather a socially mediated one, where the social norm of maintaining mutual gaze is the default and overt gaze-following to a gesture signals social alignment (Gullberg and Holmqvist 2006). The longer onset latencies of addressee-fixations following speaker-fixations support this notion, as longer onset latencies are likely to reflect top-down processes such as social alignment. In contrast, addressees’ tendency to fixate gestures with holds may result from holds constituting a sudden change in the visual field, or from holds challenging peripheral vision, which is best at motion detection: with no motion to detect, an addressee needs to shift gaze and fixate the gesture in order to extract any information at all. Both accounts assume that fixations of holds are driven by low-level, bottom-up processes, and the fixation onset latency data support this view: the very short fixation onset latencies to gestural holds suggest a stimulus-driven response by the visual system.

The uptake results strongly suggest that not all gestural information is uniformly processed and integrated. That is, it is not the case that addressees cannot help but integrate gesture information (e.g., Cassell et al. 1999). The findings indicate that directional gesture information is not well integrated in the absence of any further highlighting, in line with Beattie and Shovelton’s (1999a, b) results showing that directional gesture information is less well retained than information about size and location. However, the social factor (speaker-fixation) modulated uptake of such information: addressees retained gestural information about direction when speakers had looked at the gestures first. The physical properties of gestures played no role in uptake.

The comparison of fixation behavior and uptake showed that uptake from gestures was greatest in the condition where gestures were first fixated by the speaker (86%), although addressees fixated these gestures only 8% of the time (Study 1). Addressees’ attention to gestures was therefore mostly covert. It seems that addressees’ uptake of gestural information may be independent of whether they fixate the target gesture or not, provided that speakers have highlighted the gesture with their gaze first. Although this finding must be consolidated in further studies, it suggests that although overt gaze-following is not automatic, covert attention shifts to the target of a speaker’s gaze may well be, allowing fine-grained information extraction in human interaction.

An important implication of these findings for face-to-face communication is that addressees’ gaze is multifunctional and not necessarily a reliable index of attention locus, information uptake, or comprehension. Addressees clearly look at different things for different reasons, and one cannot assume that overt visual attention to something, like a gesture with a post-stroke hold, necessarily implies that the target is processed for information. This is primarily a caveat for studies of face-to-face interaction, where a mono-functional view of gaze is often in evidence. In interaction, addressees typically maintain their gaze on the speaker’s face as a default. Addressees’ overt gaze shifts may be acts of social alignment, showing speakers that they are attending to their focus of attention (e.g., their gestures), rather than acts of information seeking, which is often possible through peripheral vision. Conversely, the fact that addressees’ attention to gestures is not uniform means that speakers can manipulate it, strategically highlighting gestures as a relevant channel of information in various ways. For instance, speakers can use spoken deictic expressions such as ‘like this’ to draw direct attention to gestures, or use their own gaze (speaker-fixation) to do the same thing visually. Other possibilities include distributing information across the modalities in complementary fashion, such as saying ‘this big’ and indicating size in gesture (also an example of a deictic expression).

This study has raised a number of further issues to explore. An important question is what other factors might affect addressees’ attention to gestures. Other physical properties of gestures are likely candidates, such as gestures’ size and duration, or the difference between simple and complex movement trajectories. A social factor that is likely to play a role is the knowledge shared by participants, also known as common ground (e.g., Clark and Brennan 1991; Clark et al. 1983). The more common ground interlocutors share, the more reduced gestures tend to be in form and the less likely information is to be expressed in gesture at all (e.g., Gerwing and Bavelas 2004; Holler and Stevens 2007; Holler and Wilkin 2009). This opens up the possibility that attention to gesture is modulated by discourse factors, with heightened attention to gesture when information is new and first introduced, and reduced attention as information grows old. Another discourse effect concerns the relevance of information. The information probed in this study was deliberately chosen to be unimportant to the gist of the narratives. It is important to test whether these findings generalize to discursively vital information.

To conclude, this study has taken a first step towards a more fine-grained understanding of how and when addressees take gestural information into account and of the factors that govern attention allocation—both overt and covert—to such gestural information.