Processing language in face-to-face conversation: Questions with gestures get faster responses
The home of human language use is face-to-face interaction, a context in which communicative exchanges are characterised not only by bodily signals accompanying what is being said but also by a pattern of alternating turns at talk. This transition between turns is astonishingly fast—typically a mere 200-ms elapse between a current and a next speaker’s contribution—meaning that comprehending, producing, and coordinating conversational contributions in time is a significant challenge. This begs the question of whether the additional information carried by bodily signals facilitates or hinders language processing in this time-pressured environment. We present analyses of multimodal conversations revealing that bodily signals appear to profoundly influence language processing in interaction: Questions accompanied by gestures lead to shorter turn transition times—that is, to faster responses—than questions without gestures, and responses come earlier when gestures end before compared to after the question turn has ended. These findings hold even after taking into account prosodic patterns and other visual signals, such as gaze. The empirical findings presented here provide a first glimpse of the role of the body in the psycholinguistic processes underpinning human communication.
KeywordsMultimodal communication Gesture Language processing Turn-taking
The human language faculty sets us apart from other species. Its cognitive workings and social uses have intrigued scholars at least since the ancient Greeks. And yet, in many respects, we are still only beginning to discover how this unrivalled cognitive machinery functions and allows us to communicate with others. Conversation is one of the most fundamental human activities, yet the cognitive processes that underpin it are surprisingly poorly understood due to a long-standing focus on the processing of utterances in isolation.
Language processing and turn-taking in conversation
A striking feature of conversation is its temporal structure. The surface pattern is one of alternating bursts of vocalisation, with gaps between them averaging around just 200 ms (Stivers et al., 2009). This poses a psycholinguistic puzzle (Levinson, 2016): Judging by language production experiments, it takes a minimum of 600 ms to produce a simple one-word utterance (Indefrey & Levelt, 2004), implying that the turn-taking system shaping human conversation must be built on a considerable amount of parallel predictive processing. While listening to an ongoing turn, next speakers must predict its unfolding content and its end point to be able to begin planning their response early and to launch it on time. Experimental evidence suggests that processing spoken turns in conversation indeed rests quite considerably on prediction (e.g., Magyari & de Ruiter, 2012). In addition, quantitative analyses of conversational corpora have opened up a new domain of empirical enquiry, overcoming some of the well-known limitations of traditional psycholinguistic paradigms by investigating human communication in interactive situ. Levinson and Torreira (2015) and Stivers et al. (2009) have provided quantitative confirmation of the turn-timing principle of minimal gaps and overlaps first observed by conversation analysts (Sacks, Schegloff, & Jefferson, 1974). Further, the fast timing of turns in conversation appears to be influenced not only by prediction but also by the ease with which turn content is cognitively processed (Roberts, Torreira, & Levinson, 2015) and the availability of turn-final ‘go signals’ (Levinson & Torreira, 2015).
Language and the body
The primary site of human language use is face-to-face conversation, suggesting that human language should be conceptualized as a fundamentally multimodal phenomenon (e.g., Bavelas & Chovil, 2000; Clark, 1996; Goldin-Meadow, 2003; Kendon, 2004, 2014; Levinson & Holler, 2014; Mondada, 2016; McNeill, 1992). Bodily signals, in particular, manual gestures, add a significant amount of meaning to what is being said. In certain contexts, the manual modality carries about 50% to 70% of the information constituting the overall message a speaker is encoding (Gerwing & Allison, 2009; Holler & Beattie, 2003; Holler & Wilkin, 2009). This information is taken up by recipients (e.g., Holler, Shovelton, & Beattie, 2009; S. D. Kelly, Barr, Church, & Lynch, 1999) and readily integrated with the information from the spoken channel (e.g., Kelly, Healey, Özyürek, & Holler, 2015; Kelly, Kravitz, & Hopkins, 2004; Willems, Özyürek, & Hagoort, 2007). Importantly, receiving gestural information in addition to speech appears to facilitate information processing in experimental settings as evidenced by faster reaction times to speech-plus-gesture compared to speech-only stimuli (Holle, Gunter, Rüschemeyer, Hennenlotter, & Iacoboni, 2008; Kelly, Özyürek, & Maris, 2010, Note 2; Nagels, Kircher, Steines, & Straube, 2015; Wu & Coulson, 2015).
Language processing situated in multimodal, face-to-face conversation
These diverse effects of gesture make it plausible that gestures also play a pivotal role in facilitating the remarkably fast system of turn-taking in conversation. Despite face-to-face conversation being the prime site for language use, there have been relatively few investigations in this domain. Duncan (1972) suggested a turn-end-signalling-model, in which the presence of bodily cues influences the likelihood of another speaker taking the turn. Further, conversation analysts have shown how gestures may give interlocutors cues to the prolongation, upcoming ending or beginning of a speaker’s turn (Mondada, 2006; Schegloff, 1984; Streeck & Hartge, 1992). If so, we might expect gestures to affect the timing of turns in conversation, but to date this remains a gap in our understanding of human communicative behaviour.
Further, if gestures indeed influence the cognitive processing underlying turn-taking, then we should be able to show, first, that quite a substantial number of turns have a gestural component. Second, this gestural presence should have a direct effect on turn timing. As shown above, the gap between turns is shorter when turn content is easier to process. If the information provided by gesture indeed facilitates the processing of communicative messages, as suggested by experiments (Holle et al., 2008; Kelly et al., 2010; Nagels et al., 2015; Wu & Coulson, 2015), then next turns should follow faster, at least when they respond to preceding turns. Third, the timing of a specific aspect of gesture may also be of relevance, namely the temporal relationship between the termination of gesture and the termination of the spoken turn. In line with Duncan (1972), Levinson and Torreira (2015) argued that gestures may contribute to signalling turn ends, acting as an additional ‘go signal’. Thus, if the gestural movement terminates prior to the spoken turn they accompany, this may affect the prediction of upcoming turn completion, resulting in faster turn transitions.
The present study
Here, we report quantitative analyses based on a rich corpus of multimodal conversation data (Holler & Kendrick, 2015), allowing us to move beyond traditional psycholinguistic methods based on studying comprehension and production independently and typically out of conversational and multimodal context. The study focuses on a type of communicative action that pervades conversation across languages (Stivers et al., 2009): question–response sequences. A question makes a response mandatory, making it an ideal prerequisite for comparing the speed with which next turns are issued. Further, the type of action a turn accomplishes has been shown to matter for the timing of turns (Roberts et al., 2015)—thus, mixing questions with other types of actions may blur the picture. As a first enquiry into whether bodily movements influence linguistic responses in conversation, the present analysis focuses on questions accompanied by two kinds of gestures that frequently accompany spontaneous speech: head and hand gestures. The results from this study inform models of human language use and the cognitive processes that underpin it.
Corpus and participants
In addition to eye-tracking glasses (SMI) filming the participants’ behaviour from a first-person perspective, the interactions were filmed with three high-definition video cameras. Participants also wore a head-mounted directional microphone providing precise recordings of their vocal behaviour (for further details on laboratory setup and equipment, see Holler & Kendrick, 2015).
After the equipment was set up for recording, experimenters left the recording suite for 20 min. During this time participants conversed freely. The study was approved by the Social Sciences Faculty Ethics Committee, Radboud University Nijmegen.
The present analysis focussed on question–response sequences, due to their prevalence in conversation, and because questions clearly make a specific next social action relevant (i.e., a response, typically an answer). This allows us to measure participants’ speed of responding in comparable sequential environments, reducing the noise that may be induced by inclusion of a wide variety of different communicative actions. In this study only triadic conversations were used, as the presence of more participants opens up extra coordination problems—who is being addressed, and who will respond.
Two hundred and eighty-one question–response sequences were identified (see Holler & Kendrick, 2015). The acoustic signal constituting the questions and responses was measured in Praat (Version 5.3.82; Boersma & Weenink, 2014).
For each question, the occurrence of gestures was annotated (using the software ELAN 4.61; Wittenburg, Brugman, Russel, Klassmann, & Sloetjes, 2006). Gestures were defined as communicative movements of head or hand (and torso) that speakers produced as part of conveying the question (or the response); this included (a) iconic and metaphoric gestures (McNeill, 1992); (b) deictic gestures (McNeill, 1992), and (c) pragmatic/interactive gestures (Bavelas, Chovil, Lawrie, & Wade, 1992; Bavelas, Chovil, Coates, & Roe, 1995; Kendon, 2004). An independent coder, blind to hypotheses, identified all gestures meeting the above criteria. Reliability was established for 22.8% of the question–response sequences (n = 64). This yielded a reliability of 76.7% for gesture identification indicating a high degree of agreement.
Measuring the timing of verbal and visual behaviours
Three timing measurements were established (using ELAN). The first focused on gestures only. Often, gestures consist of three phases: preparation, stroke, and retraction (Kita, van Gijn, & van der Hulst, 1998). The focus here was on the onset of the gesture retraction, which was determined through frame-by-frame inspection of the movement based on an established method (Seyfeddinipur, 2006). Identification of the retraction onset for head gestures is not as objectively possible, presumably due to the capital articulator being considerably more constrained than the hands in the size, range, complexity, and velocity of its movements. For this reason, we restricted our retraction annotations to manual gestures (however, our third timing measure described below—used for testing the effect of gesture on response speed—is based on all manual and all head gestures).
The second timing measure concerned the relation of gesture retraction and end of the verbal component of the question. We determined whether the gesture retraction began prior to or following the termination of the spoken question component (manual gestures only, and excluding those manual gestures without retractions due to another gesture following).
The third timing measure concerned the gap (in milliseconds) between the end of the spoken question and the onset of the spoken response. Positive values indicate silence between question and response, and negative values indicate overlap.
Measuring the associations between questions, gestures, prosody, and gaze
To be able to draw conclusions about the unique contribution of gestures to the timing of responses to questions we also coded the questions in our data for a range of prosodic variables as well as for questioners’ gaze direction. With regard to prosodic patterns, questions were coded for F0 (Hz) (fundamental frequency), minimum pitch (Hz) (i.e., the lowest pitch level of a question), maximum pitch (Hz) (i.e. the pitch peak of a question), average intensity (dB) (i.e. amplitude), and maximum intensity (dB) (i.e. the amplitude peak of a question). Because previous research has suggested a link between gesture apex and pitch peak of the verbal utterance a gesture accompanies (e.g., Leonard & Cummins, 2011), and because gestural retractions tend to directly follow the point of maximum excursion (i.e., the apex) of a gesture, we were also interested in the relationship between gesture retraction onsets and pitch peaks. For this purpose, we also measured the time stamps of gesture retraction onset and pitch peaks. Finally, we also coded the questioner’s gaze direction during the question, since previous research has argued for a role of gaze direction in foreshadowing upcoming turn completion (Duncan, 1972; Kendon, 1967). Gaze direction was categorized into ‘always on responder’, ‘averts and returns to responder’, ‘averts but does not return to responder’, or ‘unclear’ (for those cases where the eye-tracking data was not reliable [e.g., gaze fixations were missing for a number of frames during the question, or the calibration was off] or where the two independent coders did not agree). Note that the latter category included almost half of our data points (n = 144), meaning we should remain cautious regarding the interpretation of the gaze data in our sample.
We fitted linear mixed effects models to our data using lme4 (Version 1.1-12) package (Bates, Maechler, Bolker, & Walker, 2015) in R (Version 3.2.3; R Core Team, 2015). The main analyses are based on models with gap duration as the dependent variable; presence of gesture, or presence of gesture retraction, as a fixed effect; and questioner, respondent, and conversation as random intercepts. The analyses that check for the influence of prosodic patterns include F0, minimum pitch, maximum pitch, average intensity, or maximum intensity as additional fixed effects. The analyses that check for the influence of gaze include gaze direction (four levels, ‘gaze always on responder’ = reference level) as an additional fixed effect to the fixed effect of gesture. All model comparisons were made using the ‘anova’ function.
Frequency of gestures with question–response sequences
Proportion of questions and responses accompanied by manual and/or head gesture(s) (left-hand column), one or more manual gestures (centre column), and one or more head gestures (right-hand column)
Gesture (one or multiple)
Manual gesture (one or multiple)
Head gesture (one or multiple)
61% (N = 171)
30% (N = 83)
46% (N = 130)
67% (N = 189)
16% (N = 45)
60% (N = 168)
Use of gesture types by speakers of questions and responses
The effect of question-associated gestures on the timing of responses
We then tested whether this pattern held, based on several subsets of the data, to gain insight into the generalizability of the effect. First of all, the pattern held when the model was run on just hand gestures (β = −304.81, SE = 73.61, t = −4.14, p < .0001) and on just head gestures (β = −195.70, SE = 68.64, t = −2.85, p = .005).
Secondly, we generated a subset consisting of polar (= yes/no) questions only (i.e., polar questions [n = 228] as opposed to wh-questions [n = 48]; n = 5 classified as ‘other’). Responses to these two types of questions tend to differ in complexity, which can influence response times. For polar questions, those with gestures were still responded to faster than questions without gesture (β = −325.06, SE = 75.05, t = −4.33, p < .0001). The effect was not significant for the set of wh-questions, but bear in mind that these statistics were based on a small dataset.
Finally, we split our questions into two sets, one with increments (Schegloff, 2016)—that is, linguistic add-ons to syntactically and prosodically complete units (e.g., “How are you finding it [by the way]” [= increment]; n = 146)—and one without. In the case of increments nonfinal possible completion points may provide cues that elicit early responses, and this was indeed the case in our data (gaps for questions with increments: mode = −350 ms, median = −164 ms, mean = −154 ms; gaps for questions without increments: mode = 75 ms, median = 123 ms, mean = 163 ms). Because these responses occur fast in relation to the actual end of the question, we were interested whether the gesture effect would hold on the subset of questions without increments. It did indeed (β = −213.57, SE = 77.82, t = −2.75, p = .007).
In addition, we tested whether the gesture effect might in fact be due to other confounding factors; questions accompanied by gestures may be associated with specific prosodic patterns that are different to those for questions not accompanied by gestures. Our analyses show that neither fundamental frequency (F0) (β = −0.20, SE = 0.88, t = −0.23, p > .05), nor minimum pitch (β = 0.69, SE = 0.91, t = 0.76, p > .05), nor maximum pitch (β = −0.49, SE = 0.28, t = −1.78, p > .05) were significant predictors of response speed when they were added as fixed effects to the fixed effect of gesture; neither was average amplitude (β = 0.26, SE = 0.54, t = 0.48, p > .05). However, maximum amplitude did emerge as a significant predictor when added to the fixed effect of gesture (β = −20.17, SE = 5.87, t = −3.44, p < .0001), as well as when used as the only predictor (β = −24.29, SE = 5.96, t = −4.08, p < .0001). A model comparison showed that a model including both gesture and maximum amplitude provided a better fit for our data than a model including only the presence of gesture, χ2(1) = 11.56, p < .001, or only maximum amplitude, χ2(1) = 14.76, p < .0001 (when comparing the two models with just gesture or just maximum amplitude, the models were deemed to be equally good, but the gesture model had a lower Akaike information criterion [AIC] and Bayesian information criterion [BIC]). In the combined model, both the presence of gesture and maximum amplitude remained significant predictors, thus suggesting these variables lead to additive but independent effects and that gestures do make a unique contribution to gap duration.
Furthermore, we tested for the possible confounding influence of the questioner’s gaze behaviour during the questions. We added gaze direction as a predictor to the predictor of gesture which resulted in a model where gesture remained a significant predictor (β = −302.82, SE = 66.09, t = −4.52, p < .0001). The effect of gaze was not significant for any of the levels of the gaze variable except for ‘gaze never returns to responder’ (β = −267.36, SE = 134.68, t = −1.99, p < .05). Fitting a model that includes gesture and gaze compared to one including only gesture did not result in a better model, χ2(3) = 6.93, p = .074. On the whole, there was not a large amount of variation in gaze direction—speakers were constantly looking at the responder in the majority (58%) of cases, and they did so equally often for questions with gesture (42 questions) and without (38 questions).
The effect of gesture retractions on the timing of responses
Values of central tendency (in ms) for turn transitions measured in the present study
with hand gesture
without hand gesture
with head gesture
without head gesture
with early gesture retraction
with late gesture retraction
In conversation, time is of the essence. Across languages, interactants take turns at speaking, precisely timed with gaps between turns in the ballpark of milliseconds. While it has been acknowledged that natural human language usage is multimodal in nature, the study here makes plausible that gestures are interdigitated with speech in language processing in interaction: First, the data show that most question turns in conversation have visual communicative components. This clear prevalence of bodily signals in the direct environment of speaking turns is the prerequisite for postulating a significant role of gesture in face-to-face language processing. Second, we showed that question turns are responded to faster when the question has a gestural component than when it does not. Gestures sped up turn transitions by around 200 ms if we consider modal response times, a significant difference in the context of conversational turn-taking, where response latencies are themselves typically just 200 ms long (Stivers et al., 2009). We have ruled out the potentially confounding influence of some obvious variables (question type, linguistic turn structure, prosodic patterns, gaze direction), none of which lead to significant effects overall, with the exception of maximum amplitude. This means that questions with gestures were often uttered with higher amplitude peaks than questions without gestures—presumably, this is due to intercostal muscles being moved when making gestures, thus affecting a greater expulsion of air. While this is an interesting finding in itself, it does not mitigate the role of gesture we have postulated here—the effect of gesture being present is independent of the effect of amplitude peak. However, it helps us to further refine the picture since the best model fit is one that includes both gesture presence and amplitude peak as predictors, suggesting that the effect of these two variables on gap duration may be strongest when they occur together. Furthermore, while gaze direction did not have a significant effect overall, it is worth mentioning that the effect approached significance, and that one of the individual levels of this variable had a significant effect—interestingly, this related to cases where questioners did not return their gaze to the responder after averting it, which were often questions in which more than one person was addressed, suggesting that competition for the floor may have elicited early responses in those cases. The previously postulated role (Duncan, 1972; Kendon, 1967) of the questioner’s gaze being on the responder or gaze returning to the responder towards the end of the turn, however, did not seem to play a significant role in our data. Thus, while gesture makes a unique contribution to the effect of early responses in our data even when taking gaze direction into account, further studies into the role of gaze during turn-taking and its interplay with gesture seem warranted, especially in the context of dyadic (rather than triadic) interactions (i.e., the context in which a potential function of gaze for turn-taking was first observed).
Of course, corpus data always carry the possibility that uncontrolled factors may play a role, too. However, the fact that our finding is in line with experimental evidence that gestures facilitate language processing (Holle et al., 2008; Kelly et al., 2010; Nagels et al., 2015; Wu & Coulson, 2015) where such factors are strictly controlled for mitigates this issue. Our results suggest that such findings generalise to multimodal language use in interactive situ. Third, considering those question-accompanying gestures that terminated with a retraction phase (a subset, but still accounting for a quarter of all data points), next speakers responded significantly faster when the retraction began before compared to after turn end. We also found that retraction onset strongly correlates with the pitch peak of a question, which further adds to the literature on the link between prosody and gesture (e.g., Leonard & Cummins, 2011), but it does not change our interpretation of gesture causing the early retraction effect (since questions without gestures have pitch peaks, too, of course).
As to the exact mechanisms underlying these findings, we must remain speculative at this stage. The effect of responses being faster for questions with gestures than for those without (irrespective of whether or when the gestures retracted) may be explained in terms of a gesture-induced processing advantage—shorter response times in conversation tend to be taken as evidence for reduced comprehension processing in the face of production (Bögels, Magyari, & Levinson, 2015; Roberts et al., 2015). According to such an interpretation, the additional semantic and/or pragmatic information conveyed by hand and head gestures facilitates message processing (next to many other functions co-speech gestures can fulfil). It is also possible that gestures, which often precede corresponding information in speech (e.g., Bergmann, Aksu, & Kopp, 2011; Pine, Lufkin, Kirk, & Messer, 2007; Schegloff, 1984), facilitate the prediction of upcoming information, thus facilitating language processing. Of course, a third possibility is that bodily movements just draw more attention to what is being said, thus enhancing processing of the linguistic message itself, without any visual–verbal integration taking place. As to the second effect we found, addresses may perceive the onset of a visual retraction as an early cue of upcoming turn completion, thus acting as a ‘go signal’. Experimental research will allow us to tease apart these different potential mechanisms.
The findings have a number of theoretical implications. Traditional language processing models tend to focus on verbal language alone, and on isolated utterances. Language produced in face-to-face conversation, the primordial site of language use, poses significant processing challenges that necessitate an interactional perspective. Here, we have shown that the multimodal environment of verbal language may crucially influence its processing in an interactive context. Our findings fit with gesture comprehension models (Kelly et al., 2010), at least if we assume that integration processes explain the gesture-induced facilitation observed here. Turn-taking models with a clear focus on the verbal modality (Sacks et al., 1974) should take note of the role gesture appears to play in turn coordination. The turn-taking model proposed by Duncan (1972) is multimodal in nature, but it reduces the role of gestural signals to a matter of presence or absence, whereas the current findings suggest that their temporal relationship with speech may play a crucial role. Future research may explore the role of gesture for turn-taking, and especially their temporal relation with speech, beyond question turns, with the aim to more fully refine existing turn-taking models.
Open access funding provided by Max Planck Society. We would like to thank the Max Planck Society and the European Research Council for their financial support (INTERACT Grant #269484, awarded to S.C. Levinson), Nick Wood and Ludy Cilissen for their help with video processing, and Linda Drijvers and Marloes van der Groot for their assistance in annotating the corpus.
- Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). lme4: Linear mixed-effects models using Eigen and S4. Retrieved from the Institute for Statistics and Mathematics of WU website: http://CRAN.R-project.org/package=lme4
- Bergmann, K., Aksu, V., & Kopp, S. (2011). The relation of speech and gestures: Temporal synchrony follows semantic synchrony. Paper presented at the Proceedings of the 2nd Workshop on Gesture and Speech in Interaction, Bielefeld, Germany.Google Scholar
- Boersma, P., & Weenink, D. (2014). Praat: Doing phonetics by computer (Version 5.3.82) [Computer program]. Retrieved from http://www.praat.org/
- Goldin-Meadow, S. (2003). Hearing gesture. Cambridge: Harvard University Press.Google Scholar
- Holler, J. & Beattie, G. (2003). How iconic gestures and speech interact in the representation of meaning: Are both aspects really integral to the process? Semiotica, 146, 81–116.Google Scholar
- McNeill, D. (1992). Hand and mind: What gestures reveal about thought. Chicago: Chicago University Press.Google Scholar
- R Core Team. (2015). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from http://www.R-project.org/
- Schegloff, E. A. (1984). On some gesture’s relation to talk. In J. M. Atkinson & J. Heritage (Eds.), Structures of sound action: Studies in conversation analysis (pp. 266–296). Cambridge: Cambridge University Press.Google Scholar
- Schegloff, E. A. (2016). Increments. In J. D. Robinson (Ed.), Accountability in social interaction (pp. 239–263). Oxford: Oxford University Press.Google Scholar
- Seyfeddinipur, M. (2006). Disfluency: Interrupting speech and gesture (Unpublished doctoral dissertation), Radboud University Nijmegen, Nijmegen.Google Scholar
- Stivers, T., Enfield, N. J., Brown, P., Englert, C., Hayashi, M., Heinemann, T., … Levinson, S. C. (2009). Universals and cultural variation in turn-taking in conversation. Proceedings of the National Academy of Sciences of the United States of America, 106, 10587–10592.Google Scholar
- Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., & Sloetjes, H. (2006). ELAN: A professional framework for multimodality research. Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), 1556–1559. http://pubman.mpdl.mpg.de/pubman/faces/viewItemOverviewPage.jsp?itemId=escidoc:60436
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.