Attention to Speech-Accompanying Gestures: Eye Movements and Information Uptake
There is growing evidence that addressees in interaction integrate the semantic information conveyed by speakers’ gestures. Little is known, however, about whether and how addressees’ attention to gestures and the integration of gestural information can be modulated. This study examines the influence of a social factor (speakers’ gaze to their own gestures), and two physical factors (the gesture’s location in gesture space and gestural holds) on addressees’ overt visual attention to gestures (direct fixations of gestures) and their uptake of gestural information. It also examines the relationship between gaze and uptake. The results indicate that addressees’ overt visual attention to gestures is affected both by speakers’ gaze and holds but for different reasons, whereas location in space plays no role. Addressees’ uptake of gesture information is only influenced by speakers’ gaze. There is little evidence of a direct relationship between addressees’ direct fixations of gestures and their uptake.
Keywords: Gesture, Interaction, Eye gaze, Fixation, Multimodal information processing
Typically, when we talk, we also gesture. That is, we perform manual movements as part of the expressive effort (Kendon 2004; McNeill 1992). Such speech-accompanying gestures typically convey meaning (e.g., size, shape, direction of movement), which is related to the ongoing talk. The communicative role of these gestures is somewhat controversial. It is debated both whether speakers actually intend gestural information for their addressees (e.g., Holler and Beattie 2003; Melinger and Levelt 2004), and whether addressees attend to and integrate the gestural information. This paper focuses on the latter issue.
There is growing evidence that speech and speech-accompanying gestures are processed and comprehended together, forming an ‘integrated’ system or a ‘composite signal’ (e.g., Clark 1996; Kendon 2004; McNeill 1992). Gestural information is integrated with speech in comprehension and influences the interpretation and memory of speech (e.g., Beattie and Shovelton 1999a, 2005; Kelly et al. 1999; Langton and Bruce 2000; Langton et al. 1996). For instance, information expressed only in gestures re-surfaces in retellings, either as speech, as gesture, or both (Cassell et al. 1999; McNeill et al. 1994). Further, neurocognitive studies show that incongruencies between information in speech and gesture yield electrophysiological markers of integration difficulties such as the N400 (e.g., Özyürek et al. 2007; Wu and Coulson 2005). However, surprisingly few studies have attempted to examine directly whether attention to gestures and uptake of gestural information is deterministic and unavoidable or whether such attention is modulated in human interaction, and if so by what factors. Furthermore, surprisingly little is known about the role of gaze in this context. This study therefore aims to examine what factors influence overt, direct visual attention to gestures and uptake of gestural information, focusing on one social factor, namely speakers’ gaze at their own gestures, and two physical properties of gestures, namely their place in gesture space and the effect of gestural holds. The study also examines the relationship between addressees’ gaze and uptake.
Visual Attention to Gestures
Gestures are visuo-spatial phenomena, and so the role of vision and gaze for attention is important. However, addressees seem to gaze directly at speakers’ gestures relatively rarely. Addressees mainly look at the speaker’s face during interaction (Argyle and Cook 1976; Argyle and Graham 1976; Bavelas et al. 2002; Fehr and Exline 1987; Kendon 1990; Kleinke 1986). Studies using eye-tracking techniques in face-to-face interaction have further demonstrated that addressees spend as much as 90–95% of the total viewing time fixating the speaker’s face and thus fixate only a minority of gestures (Gullberg and Holmqvist 1999, 2006).
However, the likelihood of an addressee directly fixating a gesture increases under the following three circumstances (Gullberg and Holmqvist 1999, 2006; Nobe et al. 1998, 2000). The first is when speakers first look at their own gestures (speaker-fixation) (Gullberg and Holmqvist 1999, 2006). This tendency is stronger in live face-to-face interaction than when observing speakers on video (Gullberg and Holmqvist 2006). This suggests that the overt shift of visual attention to the target of a speaker’s gaze is essentially social in nature rather than an automatic response. The second circumstance is when a gesture is produced in the periphery of gesture space in front of the speaker’s body (cf. McNeill 1992). The third is when a gestural movement is suspended momentarily in mid-air and goes into a hold before moving on (cf. Kendon 1980; Kita et al. 1998; Seyfeddinipur 2006). Holds are often found between the dynamic movement phase of a gesture, the stroke, and the so-called retraction phase, which marks the end of a gesture. It is currently not clear whether these three factors—speaker-fixation, peripheral articulation, and holds—all contribute independently to the increased likelihood of the addressee’s fixation on gesture. The evidence for the influence of these three factors mostly comes from observational studies of naturalistic conversations, in which the three factors often co-occur (Gullberg and Holmqvist 1999, 2006; see also Nobe et al. 1998, 2000). Therefore, one of the goals of this study is to experimentally manipulate these factors and assess their relative contributions to the likelihood of addressees’ fixations of gesture.
The three factors may draw the addressee’s attention either for bottom-up, stimulus-related reasons or for top-down, social-cognitive reasons. Gestures in peripheral gesture space or with a hold may elicit the addressee’s fixation for bottom-up reasons, namely, because these gestures challenge peripheral vision. Firstly, the acuity of peripheral vision decreases the further away from the fovea the image is projected, and secondly, peripheral vision, which is good at motion detection, cannot process information about a static hand in a hold efficiently. In contrast, gestures with speaker-fixations may elicit the addressee’s fixation for top-down social reasons, namely to manifest social alignment or joint attention. The difference between bottom-up and top-down processes should be reflected in different onset-latencies of fixations to gestures (cf. Gullberg and Holmqvist 2006). Fixation onsets that are bottom-up driven should be short, whereas fixations driven by top-down concerns should have longer onsets (e.g., Yantis 1998, 2000). Thus, another goal of the study is to compare the onset-latency for fixations on gestures triggered by the three factors to further elucidate the reasons for fixation.
Uptake of Gestural Information
Only a few studies have attempted to directly examine whether attention to and uptake of information from gestures is unavoidable or whether it is ever modulated and if so by what factors. Rogers (1978) manipulated noise levels, showing that addressees pick up more information from gestures the less comprehensible the speech signal is. Beattie and Shovelton (1999a, b) demonstrated that addressees decode information about relative position and size better when presented with speech and gesture combined than with either gesture or speech alone. Interestingly, this work also indicated that not all gestural information was equally decodable. Addressees reliably picked up location and size information pertaining to objects, but did worse with information such as direction. These studies indicate that the comprehensibility of speech affects addressees’ attention to gestures and also that the type of gestural information matters.
Other factors may also modulate addressees’ attention to gestures. Speakers’ gaze to their own gestures, a factor of a social nature, is a likely candidate. It is well-known that humans are extremely sensitive to the gaze direction of others (e.g., Gibson and Pick 1963), and that gaze plays a role in the establishment of joint attention (e.g., Langton et al. 2000; Moore et al. 1995; Tomasello 1999; Tomasello and Todd 1983). It has been suggested that speakers look at their own gestures as a means to draw addressees’ attention to them in face-to-face interaction (e.g., Goodwin 1981; Streeck 1993, 1994). Such behavior could increase the likelihood of addressees’ uptake of gestural information, although this has not been tested with naturalistic, dynamic gestures that are not pointing gestures.
Physical properties of gestures may also affect addressees’ uptake of gestural information. First, the location of the gesture in gesture space may matter (cf. McNeill 1992). Speakers often bring gestures up into central gesture space, that is, to chest height and closer to the face, when they want to highlight the relevance of gestures in interaction (e.g., Goodwin 1981; Gullberg 1998; Streeck 1993, 1994). The information expressed by such a gesture seems more likely to be integrated than that of a gesture articulated for instance on the speaker’s lap in lower, peripheral gesture space.
A second potentially important physical property is the gestural hold. The functional role of holds is somewhat debated, but holds have been implicated in turn taking and floor holding in interaction. Transitions between speaker turns in interaction are more likely once a gesture is terminated or when a tensed hand position is relaxed (e.g., Duncan 1973; Fornel 1992; Goodwin 1981; Heath 1986). If holds are a first indication that speakers are about to give up their turn, it would be communicatively useful for addressees to attend to them. This in turn may increase the likelihood of information uptake from a gesture with a hold. A further goal of this study, then, is to examine the impact of these three factors on addressees’ uptake of gesture information.
The Relationship Between Fixations and Information Uptake
As indicated above, most gestures are perceived through peripheral vision. Although peripheral vision is powerful, optimal image quality with detailed texture and color information is achieved only in direct fixations, that is, if the image falls directly on the small central fovea. Outside of the fovea, parafoveal or peripheral vision gives much less detailed information (Bruce and Green 1985; Latham and Whitaker 1996). Consequently, it is generally assumed that an overt fixation indicates attention in the sense of information uptake. If addressees shift their gaze from the speaker’s face to a gesture in interaction, this might indicate that they are attempting to integrate the gestural information (e.g., Goodwin 1981; Streeck 1993, 1994).
However, addressees’ tendency to gaze directly at an information source is modulated in face-to-face interaction by culture-specific norms for maintained or mutual gaze to indicate continued attention (e.g., Rossano et al. 2009; Watson 1970). In cultures where mutual gaze is socially important, face-to-face interaction may emphasize the reliance on peripheral vision for gesture processing and dissociation between overt and covert attention. Addressees can fixate a visual target without attending to it (“looking without seeing”), and conversely, attend to something without directly fixating it (“seeing without looking”). If the speaker’s face is the default location of visual attention in interaction, then most gestures must be attended to covertly. It is therefore not entirely clear what the relationship between overt fixation and information uptake might be in interaction from information sources like gestures. A final goal of this study is therefore to examine the relationship between overt fixation of and uptake of information from gestures.
The Current Research
Do social and physical factors influence addressees’ fixations on speakers’ gestures? Furthermore, do different factors trigger qualitatively different fixations, reflecting the difference between top-down vs. bottom-up processes? We expect top-down driven fixations to have longer onset latencies than bottom-up driven fixations.
Do social and physical factors influence addressees’ uptake of gesture information?
Are addressees’ fixations a good index of information uptake from gestures?
To examine these questions we present participants (‘addressees’) with video recordings of naturally occurring gestures embedded in narratives. We examine the effect of a social factor, namely the presence/absence of speakers’ fixations of their own gestures (Study 1), and the effect of two physical properties of gestures, namely gestures’ location in gesture space (central/peripheral) and the presence/absence of holds (Study 2). In Studies 1 and 2, we manipulate the independent variables by selecting gestures with the relevant properties from a corpus of video recorded gestures. In a second set of control experiments, we present participants with digitally manipulated versions of the gesture stimuli used in Studies 1 and 2, examining the effect of presence/absence of speakers’ artificial fixations of their own gestures (Study 3) and the presence/absence of artificial holds (Study 4). These studies are undertaken to control for any other unknown variables that may have differed between the stimulus gestures used in the conditions in Studies 1 and 2.
In all studies, participants were presented with brief narratives that included a range of gestures, but our analyses focus on one “target gesture” in each narrative. Each target gesture conveyed information about the direction of a movement. This information was only encoded in the target gesture, and not in other gestures or in speech. Overt visual attention to gestures was operationalized as direct fixations of gestures. Participants’ eye movements were recorded during the presentation of the narratives using a head-mounted eye-tracker. Further, information uptake was operationalized as the extent to which participants could reproduce the information conveyed in the target gesture in a drawing task following stimulus presentation. Participants were asked to draw an event in the story that crucially involved the movement depicted by the target gesture. The match between the directionality of the movement in the drawing and in the target gesture was taken as indicative of information uptake.
Study 1: Speaker-fixations
The first study examines the effect of a social factor on addressees’ overt visual attention to and uptake of information from gestures, namely the presence/absence of speakers’ fixations of their own gestures.
Thirty Dutch undergraduate students from Radboud University Nijmegen participated in this study (M age = 22, SD = 3), 23 women and 7 men. They were paid 5 euros for their participation.
The stimuli were taken from a corpus of videotaped face-to-face story retellings in Dutch (Kita 1996). The video clips showed speakers facing an addressee or viewer retelling short stories. The video clips did not show the original live addressee, but only the speaker seated en face. Each video clip contained a whole, unedited story retelling. Each clip therefore contained multiple gestures, only one of which was treated as a target gesture. Consequently, the target gesture appeared within sequences of other gestures so as not to draw attention as a singleton. The stimulus videos were selected from the corpus because they contained one target gesture displaying the appropriate properties. For Study 1, each target gesture displayed either presence or absence of speaker-fixation, that is, the speakers either looked at their own gestures or not. The target gestures were otherwise similar, and performed in central gesture space without holds. All target gestures were representational gestures encoding the movement of a protagonist in the story from an observer viewpoint (McNeill 1992), meaning that the speaker’s hand represented a protagonist in the story as seen from outside. The target gestures, typically expressing a key event in the story lines, encoded the direction of the protagonist’s motion left or right. Although the movement itself was an important part of the storyline, the direction of the movement was not. The directional information was only present in the target gesture and not in co-occurring speech. Further, the directional information could not be inferred from other surrounding gestures. Care was taken to ensure that the gestural information was not highlighted in any other way. Co-occurring speech did not contain any deictic expressions referring to and therefore drawing attention to the gesture (e.g., ‘that way’). 
Moreover, the target gesture did not co-occur with hesitations in speech, with the story punch line or with first mention of a protagonist, as all of these features might have lent extra prominence to a co-occurring gesture. Descriptions of the animated cartoons used to elicit the narratives and the target scenes therein are provided in Appendix 1. Outlines of the spatio-temporal properties of the target gestures across conditions (and all studies) are provided in Appendix 2, and speech co-occurring with target gestures is listed in Appendix 3.
Table: Mean duration in ms (SD) of target gestures with and without speaker-fixation. Rows: speaker-fixated gesture stroke; speaker-fixation on gesture; no-speaker-fixated gesture stroke.
We used a head-mounted SMI iView© eye-tracker, which is a monocular 50 Hz pupil and corneal reflex video imaging system. The eye-tracker records the participant’s eye movements with the corneal reflex camera. The eye-tracker also has a scene-camera on the headband, which records the field of vision. The output data from the eye-tracker consist of a merged video recording showing the addressee’s field of vision (i.e., the speaker on the video), and an overlaid video recording of the addressee’s fixations as a circle overlay. Since the scene-camera moves with the head, the eye-in-head signal indicates the gaze point with respect to the world. Head movements therefore appear on the video as full-field image motion. The fixation marker represents the foveal fixation and covers a visual angle of 2°. The output video data allow us to analyze both gesture and eye movements with a temporal accuracy of 40 ms.
Participants were randomly assigned to one of the two conditions: Speaker-fixation (central space, no hold, speaker-fixation) and No-speaker-fixation (central space, no hold, no speaker-fixation). The participants were seated 250 cm from the wall and fitted with the SMI iView© headset. A projector placed immediately behind the subject projected a nine-point matrix calibration screen on the wall of the same size as the subsequent stimulus videos. After calibration, four stimulus video clips were projected against the wall. The speakers appearing in the videos were thus life-sized, and their heads were level with the participants’ heads. Life-sized projections have been shown to yield fixation behavior towards gestures that is similar to behavior in live interaction (Gullberg and Holmqvist 2006). A black screen appeared between each video clip for a duration of 10 s. Participants were instructed to watch the videos carefully to be able to answer questions about them subsequently. The instructions did not mention gestures or the direction of the movements in the story. Participants’ eye movements were recorded as they watched the video clips. After watching all four videos, participants answered questions about the target events of each video by drawing pictures of the protagonists in the story. An example question is “De muis heeft moeite met roeien. Hoe komt hij toch vooruit?” (“The mouse has trouble rowing. How does it make progress?”) (see Appendix 4 for the complete set of questions).
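To give a sense of scale for the setup, the 2° fixation marker can be converted into a physical extent on the projection wall at the 250 cm viewing distance. The sketch below uses the standard visual-angle relation; the function name is illustrative, and it assumes the wall is perpendicular to the line of sight and the marker is centred on it.

```python
import math

def visual_angle_to_size_cm(angle_deg: float, distance_cm: float) -> float:
    """Linear extent (cm) subtended by a visual angle at a viewing distance.

    Assumes the viewed surface is perpendicular to the line of sight.
    """
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)

# 2 degree fixation marker viewed from 250 cm (values taken from the text)
marker_cm = visual_angle_to_size_cm(2.0, 250.0)
print(f"{marker_cm:.1f} cm")  # prints "8.7 cm"
```

Under these assumptions the marker covers a region slightly under 9 cm across on the life-sized projection.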
The eye movement data were retrieved from the digitized video output from the eye-tracker. The merged video data of the participants’ gaze positions on the scene image were analyzed frame-by-frame and coded for fixation of target gesture (Yes or No) and for matched reply (Yes or No). A target gesture was coded as fixated if the fixation marker was immobile on the gesture, i.e., moved no more than 1 degree, for a minimum of 120 ms (equal to 3 video frames) (cf. Melcher and Kowler 2001). Note that fixations on gestures were spatially unambiguous. Either a gesture was clearly fixated, or the fixation marker stayed on the speaker’s face (cf. Gullberg and Holmqvist 1999, 2006). A drawing was coded as a matched reply if the direction of the motion in the drawing matched the direction of the target gesture on the video as seen from the addressee’s perspective (see Fig. 1). Only responses that could be coded as matched or non-matched were included in the analysis. When drawings did not depict a lateral direction of any kind, the data point was discarded. Chance performance therefore equals 50%.
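The fixation criterion can be expressed as a simple rule over video frames. The following is a minimal sketch, not the coding procedure actually used (which was carried out by inspecting the video frame by frame): the `Frame` record, the degree-based coordinates, and the reading of "moved no more than 1 degree" as the largest pairwise distance within a 3-frame window are all assumptions made for illustration.

```python
import math
from itertools import combinations
from typing import NamedTuple, Sequence

FRAME_MS = 40          # duration of one video frame
MIN_FIXATION_MS = 120  # minimum fixation duration (3 frames)
MAX_DRIFT_DEG = 1.0    # marker may move no more than 1 degree

class Frame(NamedTuple):
    x_deg: float       # marker position in degrees (hypothetical representation)
    y_deg: float
    on_gesture: bool   # coder's judgment: marker lies on the target gesture

def is_fixated(frames: Sequence[Frame]) -> bool:
    """True if any 3 consecutive frames are all on the gesture and nearly immobile."""
    window = MIN_FIXATION_MS // FRAME_MS
    for start in range(len(frames) - window + 1):
        chunk = frames[start:start + window]
        if not all(f.on_gesture for f in chunk):
            continue
        # Assumed reading of "immobile": largest pairwise distance in the window
        drift = max(
            math.dist((a.x_deg, a.y_deg), (b.x_deg, b.y_deg))
            for a, b in combinations(chunk, 2)
        )
        if drift <= MAX_DRIFT_DEG:
            return True
    return False

# Hypothetical trial: the marker leaves the face, rests on the gesture for
# three frames, then returns to the face.
trial = [
    Frame(0.0, 0.0, False),
    Frame(5.0, -8.0, True),
    Frame(5.2, -8.1, True),
    Frame(5.1, -8.0, True),
    Frame(0.0, 0.0, False),
]
print(is_fixated(trial))  # prints "True"
```

Any longer fixation contains a qualifying 3-frame window, so checking windows of exactly the minimum length is sufficient under this reading.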
The dependent variables were (a) the proportion of trials with fixations on target gestures, and (b) the proportion of matched responses as defined above. We employed non-parametric Mann–Whitney tests to analyze the fixation data because the dependent variable, proportions of trials with fixation on gesture, had a skewed distribution with clustering of data at zero. We analyzed the information uptake data using parametric, independent samples analyses of variance and single sample t-tests. Throughout, the alpha level for statistical significance is p = .05.
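Because many participants never fixate a gesture, the per-participant proportions contain many tied zeros, which a rank-based test handles through midranks. The sketch below computes a Mann–Whitney U and a tie-corrected normal-approximation z in plain Python. The data are hypothetical, the function name is illustrative, and the sign of z depends on which group is entered first, so it need not match the sign convention of any particular statistics package.

```python
import math
from typing import Sequence, Tuple

def mann_whitney_u(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (U for sample a, z from the tie-corrected normal approximation)."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    n1, n2 = len(a), len(b)
    n = n1 + n2
    rank_sum_a = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        t = j - i                        # size of this group of tied values
        midrank = (i + 1 + j) / 2.0      # average of ranks i+1 .. j
        rank_sum_a += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += t ** 3 - t
        i = j
    u_a = rank_sum_a - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u_a - mean_u) / math.sqrt(var_u) if var_u > 0 else 0.0
    return u_a, z

# Hypothetical proportions of trials with a fixation on the target gesture
speaker_fixation = [0.0, 0.0, 0.25, 0.5]
no_speaker_fixation = [0.0, 0.0, 0.0, 0.0]
u, z = mann_whitney_u(speaker_fixation, no_speaker_fixation)
print(u, round(z, 3))
```

With real data, the z value would be compared against the standard normal distribution to obtain the p value.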
Results and Discussion
The results show that speakers’ fixation of their own gestures increases the likelihood of addressees fixating the same gestures. Furthermore, speaker-fixations also increase the likelihood of addressees’ uptake of gestural information, even when it is of little narrative significance and embedded in other directional information. Overall, the combined fixation and uptake findings suggest that speakers’ gaze at their own gestures constitutes a very powerful attention-directing device for addressees, influencing both their overt visual attention and their uptake.
Study 2: Location in Space and Holds
The second study examines the effect of two physical gestural properties on addressees’ overt visual attention to and uptake of information from gestures, namely gestures’ location in space (central vs. peripheral) and the presence vs. absence of holds.
Forty-five new Dutch undergraduate students from Radboud University Nijmegen participated in this study (M age = 21, SD = 2), 41 women and 4 men. They were paid 5 euros for their participation.
Table: Mean duration in ms (SD) of central and peripheral target gestures with and without holds.
Central stroke, no hold
Peripheral stroke, no hold
Central stroke + hold: 1,460 + 570 (677 + 428)
Peripheral stroke + hold: 1,490 + 580 (962 + 334)
Apparatus, Procedure, Coding, and Analysis
Participants were randomly assigned to one of the three conditions (15 participants in each condition): central hold, peripheral no hold, peripheral hold. The data from the no-speaker-fixation condition from Study 1 were used as the fourth condition, central no hold, in the analysis. The apparatus, procedure, coding, and analyses were otherwise identical to Study 1.
Results and Discussion
The results show that, when location in gesture space and holds were teased apart, only holds increased the likelihood of addressees fixating gestures, whereas the location in gesture space where gestures were produced did not influence addressees’ fixations. Moreover, surprisingly, only information conveyed by gestures performed in central, neutral gesture space was taken up and integrated by addressees. However, this result seems to be due to properties of a single item in the central hold condition, viz. the “trashcan” item (cf. Appendix 2). Eighty percent of the participants (12/15) had a matched response on this item. Closer inspection of the stimulus showed that the speaker in this stimulus item had looked at another gesture immediately preceding the target gesture. The item therefore inadvertently became similar to the items in the speaker-fixation condition. When this item was removed from the analysis, uptake for the central hold condition dropped to chance level (M = .59, SD = .32), t(14) = 1.17, p = .262. Therefore, we conclude that location in gesture space and holds do not modulate the likelihood of information uptake from gestures.
Post-hoc Analysis of Fixation Onset Latencies from Studies 1 and 2
To examine whether different gestures are fixated for different reasons, we analyzed the fixation onset latencies for those gestures that drew fixations, that is, gestures with speaker-fixations, and gestures with holds (collapsing central and peripheral hold gestures). We measured the time difference between the onset of the relevant cue (speaker-fixation or gestural hold) and the onset of the addressees’ fixations of the gestures. Fixation onset latencies for gestures with speaker-fixations were significantly longer (M = 800 ms, SD = 400 ms) than onset latencies for gestures with holds (M = 102 ms, SD = 88 ms), Mann–Whitney, Z = −3.14, p = .01.
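Since the data come from video with 40 ms frames, each onset latency is in effect a difference between two frame indices. A minimal sketch of the measurement follows; the frame numbers are hypothetical and the helper name is illustrative, not taken from the study.

```python
from statistics import mean, stdev

FRAME_MS = 40  # duration of one video frame

def onset_latency_ms(cue_onset_frame: int, fixation_onset_frame: int) -> int:
    """Time from cue onset to the addressee's fixation onset, in ms."""
    return (fixation_onset_frame - cue_onset_frame) * FRAME_MS

# Hypothetical (cue_onset_frame, fixation_onset_frame) pairs per fixated trial
speaker_fixation_trials = [(100, 118), (240, 262), (75, 95)]
hold_trials = [(50, 52), (310, 313), (128, 131)]

speaker_latencies = [onset_latency_ms(c, f) for c, f in speaker_fixation_trials]
hold_latencies = [onset_latency_ms(c, f) for c, f in hold_trials]

print(mean(speaker_latencies), stdev(speaker_latencies))
print(mean(hold_latencies), stdev(hold_latencies))
```

One consequence of this frame-based measurement is that every individual latency is a multiple of 40 ms, which limits the resolution of the shortest latencies.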
These differences suggest that addressees’ fixations of gestures are driven by different mechanisms. Onset latencies in the region of 800 ms indicate that top-down concerns involving higher cognitive mechanisms are driving the fixation behavior. Onset latencies around 100 ms instead suggest that fixations of gestural holds may be bottom-up responses driven by the inner workings of the visual system (cf. Yantis 2000).
Study 3: Artificial Speaker-Fixations
The unexpected effect of an individual stimulus item in Study 2 raises a general concern that the independent variables may have been confounded with other unknown variables, given that the stimulus gestures differed across the conditions. For instance, the target gesture in the “plank” item had a more complex trajectory than the other items, and the gesture in the “pit” item was performed closer to the face than other target gestures (cf. Appendix 2). Although it is a strength of these studies that they draw on ecologically valid stimuli where the target gestures are naturally produced, dynamic gestures embedded in discourse and among other gestures, it is important to ascertain that the fixation and uptake findings were not caused by other factors. To test whether speaker-fixations and holds do account for the fixation and uptake data, we therefore created minimal pairs of the most neutral, baseline test items, the centrally produced gestures with no hold or speaker-fixation, by artificially introducing speaker-fixation (Study 3) and holds (Study 4) on these neutral gestures through video editing.
The third study examines the effect of artificially induced speaker-fixations on addressees’ overt visual attention to and uptake of information from gestures.
Fifteen new Dutch undergraduate students from Radboud University Nijmegen participated in this study (M age = 21, SD = 3), 11 women and 4 men. They were paid 5 euros for their participation.
Apparatus, Procedure, Coding, and Analysis
Results and Discussion
There was no significant difference between the proportion of fixated trials in the artificial speaker-fixation condition (M = .03, SD = .09) and the control condition (M = 0, SD = 0), Mann–Whitney, Z = −1.44, p = .15. Furthermore, there was no significant difference in the proportion of trials with uptake in the artificial speaker-fixation condition (M = .71, SD = .31) and the control condition (M = .63, SD = .32), F(1, 28) < 1, p = .536. However, the proportion of trials with uptake was reliably above chance (.50) in the artificial speaker-fixation condition, one-sample t-test, t(14) = 2.58, p = .022, but not in the control condition, t(14) = 1.61, p = .13.
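The comparisons against chance can be approximately reproduced from the reported summary statistics. The sketch below computes a one-sample t statistic from a mean, a standard deviation, and a sample size of 15 per condition. Because the published means and SDs are rounded to two decimals, the recomputed values (roughly 2.6 and 1.6) are close to, but not identical with, the reported t(14) = 2.58 and 1.61. The function name is illustrative, and the critical value is taken from a standard t table.

```python
import math

def one_sample_t(mean: float, sd: float, n: int, mu0: float = 0.5) -> float:
    """t statistic for testing a sample mean against mu0 (chance level = .50)."""
    return (mean - mu0) / (sd / math.sqrt(n))

# Two-tailed critical value for df = 14 at alpha = .05, from a standard t table
T_CRIT_DF14 = 2.145

t_artificial = one_sample_t(0.71, 0.31, 15)  # artificial speaker-fixation condition
t_control = one_sample_t(0.63, 0.32, 15)     # control condition

for label, t in [("artificial", t_artificial), ("control", t_control)]:
    verdict = "above chance" if t > T_CRIT_DF14 else "not reliably above chance"
    print(f"{label}: t(14) = {t:.2f}, {verdict}")
```

The recomputed statistics lead to the same conclusions as the reported ones: uptake is reliably above chance only in the artificial speaker-fixation condition.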
Both for fixation and uptake, the differences between the artificial speaker-fixation and control condition went in the same direction as predicted by the results from Study 1, but neither difference reached statistical significance. The comparison against chance nevertheless indicated uptake above chance from gestures in the artificial speaker-fixation condition, in line with the effect of natural speaker-fixations on uptake found in Study 1.
There are two possible explanations for the weaker fixation results in this study than in Study 1. First, for practical reasons the duration of the artificial speaker-fixations was significantly shorter (480 ms) than the average authentic duration (M = 980 ms, SD = 414 ms), Mann–Whitney, Z = −2.46, p = .014. It is likely that the longer the speaker’s gaze on a gesture, the more likely the addressee is to also look at it. A closer inspection of the results from Study 1 revealed a tendency for longer speaker-fixations to yield more addressee-fixations than shorter ones. Second, the duration of the gesture stroke itself may also have played a role. Again, the average stroke duration of the authentic gestures with speaker-fixations was significantly longer (M = 2,410 ms, SD = 437 ms) than that of the control gestures on which we imposed the artificial speaker-fixation (M = 1,310 ms, SD = 305 ms), Mann–Whitney, Z = −2.31, p = .021. However, the influence of the stroke duration is debatable because peripheral gestures, which by virtue of their spatial expanse also have longer duration than centrally produced gestures, did not draw fixations. Indirectly, then, these findings suggest that speakers’ fixations of their own gestures increase the likelihood of addressees’ shifting overt visual attention to gestures, and this effect is enhanced the longer the speakers’ fixation.
Study 4: Artificial Holds
The fourth study examines the effect of artificially induced gestural holds on addressees’ overt visual attention to and uptake of information from gestures.
Fifteen new Dutch undergraduate students from Radboud University Nijmegen participated in this study (M age = 21, SD = 2), 11 women and 4 men. They were paid 5 euros for their participation.
As in Study 3, the four items characterized as central, no hold, no speaker-fixation from Study 1 were digitally manipulated in Adobe® After Effects® to create minimal pairs of gestures with or without an artificial hold. The hand shape of the last frame of the original target gesture stroke was isolated and then pasted and maintained over the original retraction phase of the gesture for 5 frames or 200 ms, using the same procedure as illustrated in Fig. 4. The pasted hand shape was then moved spatially for a number of transition frames to fit onto the original, underlying location of the hand without creating a jerky movement. As before, speech and the synchronization between the auditory and the visual parts of the stimulus videos were not manipulated. The procedure allowed head and lip movements to remain synchronized with speech. Note that the original mean duration of natural holds (central and peripheral) was 575 ms. As in Study 3, a shorter hold duration (i.e., 200 ms), although still within the range of naturally occurring holds, was chosen to avoid too large a spatial discrepancy between the location of the artificially held hand, and the underlying retracted gesture. Such a discrepancy would have made the manipulation impossible to conceal. The four digitally manipulated items constitute the artificial hold condition.
Apparatus, Procedure, Coding, and Analysis
Results and Discussion
The proportion of fixated trials was significantly higher in the artificial-hold condition (M = .08, SD = .12) than in the control condition (M = 0, SD = 0), Mann–Whitney, Z = −2.41, p = .016. There was no significant difference in uptake between the artificial hold (M = .59, SD = .35) and the control conditions (M = .63, SD = .32), F(1, 28) < 1, p = .75. Moreover, the proportion of matched trials was at chance both in the artificial hold condition, one-sample t-test, t(14) = 1.03, p = .319, and in the control condition, t(14) = 1.61, p = .13.
To summarize, both the fixation and the uptake findings from Study 2 were replicated. Holds made addressees more likely to fixate speakers’ gestures, but they did not seem to contribute to uptake of gestural information.
Post-hoc Analysis of Fixation Onset Latencies from Studies 3 and 4
As in Studies 1 and 2, we measured the time difference between the onset of the relevant cue (artificial speaker-fixation or artificial hold) and the onset of the addressees’ fixations of the gestures. Fixation onset latencies for artificial speaker-fixations were generally longer (M = 100 ms, SD = 85 ms) than onset latencies for gestures with artificial holds (M = 40 ms, SD = 0 ms), although there were too few data points to undertake a statistical analysis. These differences in fixation onset latencies nevertheless display the same trends as for natural speaker-fixations and holds.
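The latency measure described above is a simple difference between two time stamps per trial. A minimal sketch, assuming each trial is stored as a (cue onset, fixation onset) pair in milliseconds; the data layout and the numbers are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev

def onset_latencies(trials):
    """Return the fixation onset latency (ms) per trial: fixation onset minus cue onset."""
    return [fixation_ms - cue_ms for cue_ms, fixation_ms in trials]

# Hypothetical (cue onset, fixation onset) time stamps in ms; not data from the study.
hold_trials = [(1000, 1040), (2500, 2540)]

latencies = onset_latencies(hold_trials)
summary = (mean(latencies), stdev(latencies))  # (mean latency, standard deviation)
print(latencies, summary)
```

With so few fixated trials, identical latencies yield a standard deviation of zero, which illustrates why descriptive values based on a handful of data points do not support a statistical test.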
Post-hoc Analyses of the Relationship Between Addressees’ Fixations and Uptake
One of the research questions concerned the relationship between fixations and uptake of gestural information. To address this issue, we examined whether information uptake differed between fixated versus non-fixated gestures.
All trials from Studies 1 through 4 were combined for this analysis to compare the likelihood of uptake in a within-subject comparison for those 20 participants who had codable trials with and without addressee-fixation (n = 15 from the hold conditions, n = 5 from the speaker-fixation condition). The proportion of matched responses was not significantly different between trials with addressee-fixation (M = .70, SD = .47) and without addressee-fixation (M = .62, SD = .42), F(1, 19) < 1, p = .576.
When the data were broken down according to the two cue types (speaker-fixation and holds), the proportion of matched responses in the two types of trials were still not significantly different from each other: uptake from speaker-fixated trials with addressee-fixation (M = .60, SD = .55) did not differ from speaker-fixated trials without addressee-fixations (M = .40, SD = .55), F(1, 4) < 1, p = .621. Similarly, uptake from hold-trials with addressee-fixation (M = .73, SD = .46) did not significantly differ from hold-trials without addressee-fixations (M = .69, SD = .36), F(1, 14) < 1, p = .783. Thus, there is little evidence that addressees’ fixations of gestures are associated with uptake of the gestural information.
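For a within-subject comparison with only two levels (trials with vs. without addressee-fixation), the F statistic with (1, n − 1) degrees of freedom equals the square of the paired t statistic computed on each participant's difference score. The sketch below illustrates that relationship, assuming a repeated-measures design with a single two-level factor. The per-participant proportions are hypothetical, and the function name is illustrative rather than the authors' procedure.

```python
import math
from statistics import mean, variance

def paired_f(with_fixation, without_fixation):
    """F statistic and (df1, df2) for a two-level within-subject comparison.

    Computed as the squared paired t statistic on per-participant differences.
    """
    diffs = [a - b for a, b in zip(with_fixation, without_fixation)]
    n = len(diffs)
    t = mean(diffs) / math.sqrt(variance(diffs) / n)
    return t * t, (1, n - 1)

# Hypothetical per-participant proportions of matched responses (not study data).
with_fixation = [1.0, 0.5, 1.0, 0.0, 1.0]
without_fixation = [0.5, 0.5, 0.75, 0.25, 0.5]

f_value, dfs = paired_f(with_fixation, without_fixation)
print(f"F{dfs} = {f_value:.2f}")
```

An F value below 1, as reported for all three comparisons above, means that the mean difference between the two trial types is small relative to its standard error.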
This study investigated what factors influence addressees’ overt visual attention to (direct fixation of) gestures and their uptake of gestural information, focusing on one social factor, namely speakers’ gaze at their own gestures, and two physical properties of gestures, namely their place in gesture space and the effect of gestural holds. We also examined the relationship between addressees’ fixations of gesture and their uptake of gestural information. We explored these issues drawing on examples of natural gestures expressing directional information left or right, embedded in narratives.
The results concerning fixations of gestures can be summarized in four points. First, in line with previous studies (Gullberg and Holmqvist 1999, 2006; Nobe et al. 1998, 2000), addressees looked directly at very few gestures. Second, they were more likely to fixate gestures which speakers themselves had first fixated (speaker-fixation) than others. This tendency held also for gestures with artificially introduced speaker-fixations, although it did not reach statistical significance. Moreover, addressees were also more likely to fixate gestures with a post-stroke hold than gestures without. This held both for natural and artificial holds. Third, contrary to expectation, the locations of gestures in gesture space (central vs. peripheral) did not affect addressees’ tendency to fixate gestures. Fourth, the onset latency of fixations differed across gesture types. Fixations of gestures with post-stroke holds had shorter onset latencies than those of speaker-fixated gestures, suggesting that addressees look at different gestures for different reasons. Holds are fixated for bottom-up reasons and speaker-fixated gestures for top-down reasons.
There were three main findings concerning uptake of gestural information. First, addressees did not generally process and retain directional gestural information uniformly in all situations. Second, addressees were more likely to retain the directional information in gesture when speakers themselves had first fixated the gesture than when they had not. Third, there was no evidence that the presence or absence of post-stroke holds or the location in gesture space affected information uptake when an item with inadvertent speaker-fixation on a previous gesture was removed.
Finally, regarding the relationship between addressees’ fixations and their information uptake, a post-hoc analysis based on the pooled data from all the studies showed no evidence that addressees’ information uptake from gestures was associated with their fixations of gestures.
In previous studies of fixation behavior towards gestures (Gullberg and Holmqvist 1999, 2006; Nobe et al. 1998, 2000), the three factors investigated here have been conflated. The current study demonstrates the individual contributions of two of these factors: the social factor speaker-fixation, and one of the physical factors, namely post-stroke holds. It also shows that the other physical property, location in gesture space, does not matter. Moreover, the data suggest that addressees fixate different gestures for different reasons. The effect of speaker-fixations on addressees’ gaze behavior is compatible with suggestions that humans automatically orient to the target of an interlocutor’s gaze (e.g., Driver et al. 1999). Notice, however, that speaker-fixations only lead to overt gaze-following or addressee-fixations 8% of the time (Study 1; this rate is similar to that reported in Gullberg and Holmqvist 2006). This suggests that overt gaze-following is not an automatic process but rather a socially mediated process, where the social norm for maintaining mutual gaze is the default, and overt gaze-following to a gesture signals social alignment (Gullberg and Holmqvist 2006). The longer onset latencies of addressee-fixations following speaker-fixations support this notion, as longer onset latencies are likely to reflect top-down processes such as social alignment. In contrast, addressees’ tendency to fixate gestures with holds may result from holds constituting a sudden change in the visual field, or from holds challenging peripheral vision, which is best at motion detection. With no motion to detect, an addressee needs to shift gaze and fixate the gesture in order to extract any information at all. Both accounts assume that fixations to holds should be driven by low-level, bottom-up processes. The fixation onset latency data support this account. The very short fixation onset latencies to gestural holds suggest a stimulus-driven response by the visual system.
The uptake results strongly suggest that not all gestural information is uniformly processed and integrated. That is, it is not the case that addressees cannot help but integrate gesture information (e.g., Cassell et al. 1999). The findings indicate that directional gesture information is not well integrated in the absence of any further highlighting, which is in line with Beattie and Shovelton’s (1999a, b) results showing that directional gesture information is less well retained than information about size and location. However, the social factor (speaker-fixation) modulated uptake of such information such that addressees retained gestural information about direction when speakers had looked at gestures first. The physical properties of gestures played no role in uptake.
The comparison of fixation behavior and uptake showed that uptake from gestures was greatest in a condition where gestures were first fixated upon by the speaker (86%), although the addressees only fixated these gestures 8% of the time (Study 1). Addressees’ attention to gestures was therefore mostly covert. It seems that addressees’ uptake of gestural information may be independent of whether they fixate the target gesture or not, provided that speakers have highlighted the gesture with their gaze first. Although this finding must be consolidated in further studies, it suggests that although overt gaze-following is not automatic, a covert attention shift to the target of a speaker’s gaze may well be, allowing fine-grained information extraction in human interaction.
An important implication of these findings for face-to-face communication is that addressees’ gaze is multifunctional and not necessarily a reliable index of attention locus, information uptake or comprehension. Addressees clearly look at different things for different reasons and one cannot assume that overt visual attention to something—like a gesture with a post-stroke hold—necessarily implies that the target is processed for information. This is primarily a caveat to studies on face-to-face interaction where a mono-functional view of gaze is often in evidence. In interaction addressees will typically maintain their gaze on the speaker’s face as a default. Addressees’ overt gaze shift may be an act of social alignment to show speakers that they are attending to their focus of attention (e.g., their gestures), rather than an act of information seeking which is often possible through peripheral vision. Conversely, the fact that addressees’ attention to gestures is not uniform means that speakers can manipulate it, highlighting gestures strategically as a relevant channel of information in various ways. For instance, speakers can use spoken deictic expressions such as ‘like this’ to draw direct attention to gestures, or use their own gaze (speaker-fixation) to do the same thing visually. Other possibilities include distributing information across the modalities in complementary fashion, such as saying ‘this big’ and indicating size in gesture (also an example of a deictic expression).
This study has raised a number of further issues to explore. An important question is what other factors might affect addressees’ attention to gestures. Other physical properties of gestures are likely candidates, such as gestures’ size and duration, the difference between simple and complex movement trajectories, etc. A social factor that is likely to play a role concerns the knowledge shared by participants, also known as common ground (e.g., Clark and Brennan 1991; Clark et al. 1983). The more common ground is shared between interlocutors, the more reduced the gestures tend to be in form and the less likely information is to be expressed in gesture at all (e.g., Gerwing and Bavelas 2004; Holler and Stevens 2007; Holler and Wilkin 2009). This opens up the possibility that attention to gesture is modulated by discourse factors, with heightened attention to gesture when information is new and first introduced, and mitigated attention as information grows old. Another discourse effect concerns the relevance of information. The information probed in this study was deliberately chosen to be unimportant to the gist of the narratives. It is important to test whether these findings generalize to discursively vital information.
To conclude, this study has taken a first step towards a more fine-grained understanding of how and when addressees take gestural information into account and of the factors that govern attention allocation—both overt and covert—to such gestural information.
There is no evidence that addressees reversed the directions in the drawings in order to represent the direction as expressed from the speaker’s viewpoint. Had addressees been reversing the viewpoints, we would have expected within-subject consistency of such reversals. There is no such consistency in the data, however.
We gratefully acknowledge financial and technical support from the Max Planck Institute for Psycholinguistics. We also thank Wilma Jongejan for help with the video manipulations, and Martha Alibali, Kenneth Holmqvist, and members of the Max Planck Institute’s Gesture project for useful discussions.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
- Argyle, M., & Cook, M. (1976). Gaze and mutual gaze. Cambridge: Cambridge University Press.
- Bruce, V., & Green, P. (1985). Visual perception. Physiology, psychology and ecology (2nd ed.). Hillsdale, NJ: Erlbaum.
- Cassell, J., McNeill, D., & McCullough, K.-E. (1999). Speech-gesture mismatches: Evidence for one underlying representation of linguistic and nonlinguistic information. Pragmatics & Cognition, 7, 1–33.
- Clark, H. H. (1996). Using language. Cambridge: Cambridge University Press.
- Clark, H. H., & Brennan, S. A. (1991). Grounding in communication. In L. B. Resnick, J. M. Levine, & S. D. Teasley (Eds.), Perspectives on socially shared cognition. Washington: APA Books.
- Fehr, B. J., & Exline, R. V. (1987). Social visual interaction: A conceptual and literature review. In A. W. Siegman & S. Feldstein (Eds.), Nonverbal behavior and communication (pp. 225–326). Hillsdale, NJ: Erlbaum.
- Fornel, M. (1992). The return gesture: Some remarks on context, inference, and iconic gesture. In P. Auer & A. di Luzio (Eds.), The contextualization of language (pp. 159–176). Amsterdam: Benjamins.
- Goodwin, C. (1981). Conversational organization: Interaction between speakers and hearers. New York: Academic Press.
- Gullberg, M. (1998). Gesture as a communication strategy in second language discourse. A study of learners of French and Swedish. Lund: Lund University Press.
- Gullberg, M., & Holmqvist, K. (1999). Keeping an eye on gestures: Visual perception of gestures in face-to-face communication. Pragmatics & Cognition, 7, 35–63.
- Heath, C. (1986). Body movement and speech in medical interaction. Cambridge: Cambridge University Press.
- Kendon, A. (1980). Gesticulation and speech: Two aspects of the process of utterance. In M. R. Key (Ed.), The relationship of verbal and nonverbal communication (pp. 207–227). The Hague: Mouton.
- Kendon, A. (1990). Conducting interaction. Cambridge: Cambridge University Press.
- Kendon, A. (2004). Gesture. Visible action as utterance. Cambridge: Cambridge University Press.
- Kita, S. (1996). Listeners’ up-take of gestural information. MPI Annual Report, 1996, 78.
- Latham, K., & Whitaker, D. (1996). A comparison of word recognition and reading performance in foveal and peripheral vision. Vision Research, 37, 2665–2674.
- McNeill, D. (1992). Hand and mind. What the hands reveal about thought. Chicago: University of Chicago Press.
- Moore, C., & Dunham, P. J. (Eds.). (1995). Joint attention. Hillsdale, NJ: Erlbaum.
- Nobe, S., Hayamizu, S., Hasegawa, O., & Takahashi, H. (1998). Are listeners paying attention to the hand gestures of an anthropomorphic agent? An evaluation using a gaze tracking method. In I. Wachsmuth & M. Fröhlich (Eds.), Gesture and sign language in human-computer interaction (pp. 49–59). Berlin: Springer.
- Nobe, S., Hayamizu, S., Hasegawa, O., & Takahashi, H. (2000). Hand gestures of an anthropomorphic agent: Listeners’ eye fixation and comprehension. Cognitive Studies. Bulletin of the Japanese Cognitive Science Society, 7, 86–92.
- Rossano, F., Brown, P., & Levinson, S. C. (2009). Gaze, questioning and culture. In J. Sidnell (Ed.), Conversation analysis: Comparative perspectives (pp. 187–249). Cambridge: Cambridge University Press.
- Seyfeddinipur, M. (2006). Disfluency: Interrupting speech and gesture. Unpublished doctoral dissertation, Radboud University, Nijmegen.
- Tomasello, M. (1999). The cultural origins of human cognition. Cambridge, MA: Harvard University Press.
- Watson, O. M. (1970). Proxemic behavior: A cross-cultural study. The Hague: Mouton.
- Yantis, S. (1998). Control of visual attention. In H. Pashler (Ed.), Attention (pp. 223–256). Hove: Psychology Press Ltd.
- Yantis, S. (2000). Goal-directed and stimulus-driven determinants of attentional control. In S. Monsell & J. Driver (Eds.), Attention and performance XVIII (pp. 73–103). Cambridge, MA: MIT Press.