The role of iconic gestures and mouth movements in face-to-face communication

Krason, Anna; Fenton, Rebecca; Varley, Rosemary; Vigliocco, Gabriella

doi:10.3758/s13423-021-02009-5

Brief Report
Open access
Published: 20 October 2021

Volume 29, pages 600–612, (2022)

Psychonomic Bulletin & Review

Anna Krason ORCID: orcid.org/0000-0002-4162-3584¹,
Rebecca Fenton¹,
Rosemary Varley¹ &
…
Gabriella Vigliocco¹

Abstract

Human face-to-face communication is multimodal: it comprises speech as well as visual cues, such as articulatory and limb gestures. In the current study, we assess how iconic gestures and mouth movements influence audiovisual word recognition. We presented video clips of an actress uttering single words accompanied, or not, by more or less informative iconic gestures. For each word we also measured the informativeness of the mouth movements from a separate lipreading task. We manipulated whether gestures were congruent or incongruent with the speech, and whether the words were audible or noise vocoded. The task was to decide whether the speech from the video matched a previously seen picture. We found that congruent iconic gestures aided word recognition, especially in the noise-vocoded condition, and the effect was larger (in terms of reaction times) for more informative gestures. Moreover, more informative mouth movements facilitated performance in challenging listening conditions when the speech was accompanied by gestures (either congruent or incongruent), suggesting an enhancement when both cues are present relative to just one. We also observed a trend for more informative mouth movements to speed up word recognition across clarity conditions, but only when the gestures were absent. We conclude that listeners use and dynamically weight the informativeness of gestures and mouth movements available during face-to-face communication.

Introduction

Face-to-face communication is a well-orchestrated process of exchanging multimodal information under various, sometimes challenging, conditions (e.g., a chat between friends in a noisy restaurant). Here, we investigate how iconic gestures and facial movements affect spoken word recognition under clear and distorted listening conditions by asking how listeners’ use of these cues depends upon their informativeness.

Iconic gestures

Iconic gestures that imagistically evoke features and properties of concepts (e.g., clenching one’s fist and moving the arm up and down to express a hammering action) are common in face-to-face communication. For example, 20% of the utterances in dyadic interactions, in which adults spontaneously talk about a set of known and unknown objects (Vigliocco et al., 2021), contain iconic gestures, whereas only 10% of the produced utterances contain beat gestures. Iconic gestures are processed automatically, as clearly demonstrated by the fact that listeners attend to gestures even when they are misleading (Green et al., 2009; Habets et al., 2010; Kelly et al., 2014; Kelly et al., 2010; McNeill et al., 1994; Willems et al., 2009; Wu & Coulson, 2007). For instance, McNeill et al. (1994) showed participants video clips of a speaker telling a cartoon story accompanied by either matching or mismatching iconic gestures, finding that participants considered the information from both types of gestures when asked to recall the story. In another study, Kelly et al. (2010) presented participants with action primes followed by either congruent, weakly incongruent, or strongly incongruent speech–gesture video presentations. The participants’ task was to decide whether the speech or gesture from a video was related to an action prime seen earlier. The authors found that individuals made fewer errors for the presentations including weakly incongruent gestures (e.g., saying ‘chop,’ and gesturing ‘cut’), compared to strongly incongruent gestures (e.g., saying ‘chop,’ and gesturing ‘twist’), further suggesting that people make use of all the information available even when the meaning the gestures evoke mismatches the speech.
Recent studies have extended these findings by showing that incongruence between speech and a visual cue can be especially detrimental for people with aphasia (Vigliocco et al., 2020) and by demonstrating similar interactions between different channels (hand and mouth) in users of British Sign Language (Perniss et al., 2020).

Integration of auditory and gestural information has been assessed using, for example, gestures containing information not present in speech (Beattie & Shovelton, 1999; Cocks et al., 2009; Cocks et al., 2018; Kelly et al., 1999) or degraded speech to increase difficulty (Holle et al., 2010; Obermeier et al., 2012). For example, Holle et al. (2010) tested comprehension of audiovisual sentences (with or without gestures) with different signal-to-noise ratios (SNR) and asked participants to type all the information they understood. Participants were able to recall more information when the gestures were present, indicating that gestures can aid speech comprehension, especially in adverse listening conditions. Obermeier et al. (2012) further found that this gestural enhancement occurs under difficult listening conditions regardless of whether the challenge is due to external noise or hearing impairment.

Gesture presence can support speech comprehension by virtue of enhancing semantic activation (McNeill, 1992, 2000; Morrel-Samuels & Krauss, 1992). If this is the case, then the degree of informativeness of the gesture (i.e., the extent to which one can recognize the gesture) will matter.

Mouth movements

Facial (especially mouth) movements are among the visual cues that are almost always available in face-to-face communication, and it is well-known that they affect speech perception (McGurk & MacDonald, 1976). Seeing mouth movements makes speech recognition easier (Peelle & Sommers, 2015) by reducing lexical competition (Jesse & Massaro, 2010; Lachs & Pisoni, 2004; Tye-Murray et al., 2007), especially in noisy listening conditions (Drijvers & Özyürek, 2017; Ma et al., 2009; Reisberg et al., 1987; Ross et al., 2007; Schwartz et al., 2004; Sumby & Pollack, 1954). For example, Tye-Murray et al. (2007) employed a repetition task with stimuli distorted by speech babble presented in auditory-only, visual-only, or audiovisual combinations. They found that performance was enhanced for audiovisual presentations. Moreover, people benefit from visible speech in clearly audible conditions, in particular when the complexity of a message increases. For instance, Arnold and Hill (2001) measured participants’ comprehension of connected speech by presenting short stories that varied in their difficulty (e.g., a passage uttered in a non-native accent) and modality (either auditory-only or audiovisual). Participants performed better when mouth movements were present, replicating Reisberg et al. (1987), and suggesting that the information from mouth movements is automatically processed with speech.

In contrast to iconic gestures, mouth movements are primarily useful in decoding the phonological information, and listeners benefit from audiovisual speech because facial gestures can support predictions for upcoming words (Solberg Økland et al., 2019). This has been captured by the notion of visemes, that is, the shape(s) of the lips that correspond to a particular phoneme or group of phonemes (Fisher, 1968; Massaro & Cohen, 1995). For example, sounds that are produced more anteriorly in the mouth, such as /f/, are visually more distinct than phonemes with a more posterior place of articulation, such as /k/, and hence inform the listener to a larger extent (Massaro et al., 1993). However, visemes provide only limited information about voicing and manner of articulation and lack a one-to-one correspondence with phonemes (e.g., /f/ and /v/ are indistinguishable visually). Moreover, visemes can be different when isolated sounds are produced and when they are co-articulated (e.g., /b/ in ‘bean’ and ‘bow’). Here, we develop quantitative measures of mouth informativeness for English words, rather than employing a priori categories, to operationalize the amount of information available in speakers’ mouth movements.

Weighting the multimodal cues

The majority of previous studies have only looked at the impact of one visual cue: iconic gestures or mouth movements, while the other cue was eliminated to achieve control. Thus, the face is cropped or covered in studies of gestures (e.g., Drijvers & Özyürek, 2017; Habets et al., 2010; Hirata & Kelly, 2010; Holle & Gunter, 2007; Holle et al., 2010), and the hands are not visible in studies of audiovisual speech (e.g., Ross et al., 2007; Solberg Økland et al., 2019; Tye-Murray et al., 2007). Only a handful of studies have investigated both gestures and mouth movements (Drijvers & Özyürek, 2017, 2020; Drijvers et al., 2019; Hirata & Kelly, 2010; Skipper et al., 2009; Zhang, Ding, et al., 2021a; Zhang, Frassinelli, et al., 2021b).

For example, Drijvers and Özyürek (2017) presented participants with video clips of a speaker uttering words and producing gestures. Mouth movements were visible or blurred, and the speech was clear or degraded. Participants had to report the produced words. The researchers found that subjects benefited most from a double enhancement (i.e., when both cues were present), especially when the speech was moderately degraded, replicating previous studies (Ma et al., 2009; Ross et al., 2007). Crucially, they also found that iconic gestures affected word comprehension to a larger extent than mouth movements. However, we do not know how informative the mouth movements were in the study.

In another study, Zhang, Frassinelli, et al. (2021b) looked at brain activity during audiovisual connected speech processing. The researchers measured changes in the N400 amplitude—a negative event-related potential associated with semantic processing (Kutas & Federmeier, 2011)—by looking at word predictability, prosodic stress, iconic gestures, beat gestures, and mouth informativeness. They found that all the multimodal cues modulated the N400 amplitude along with word predictability, although the degree to which they did so depended on the presence of other cues. For mouth informativeness, they found that it enhances speech perception when iconic gestures are also present, similarly to the double enhancement effect found in Drijvers and Özyürek (2017).

Current study

The goal of the present study was to address how iconic gestures and mouth movements modulate word recognition in a picture–word matching task (i.e., we presented pictures of objects or actions followed by video clips of a speaker saying and gesturing a word). We manipulated the presence of gestures and their congruency with the spoken words, as well as the clarity of the speech (clear or moderately degraded), but we kept mouth movements always visible, as they are in face-to-face contexts. In contrast to previous studies, we used measures of informativeness of both the gesture and the mouth movement obtained in norming experiments as predictors. Employing these measures is a more ecologically valid and novel (for mouth movements) way of assessing the impact of multimodal cues on speech processing and can further inform our understanding of the underlying mechanisms without eliminating (e.g., blurring or covering) one of the visual cues and thereby removing information about their possible interactions.

On the basis of prior research, we predicted the following:

(i)
Congruent gestures versus no gestures. Performance should be enhanced when iconic gestures are presented alongside speech. This should be the case in particular in the degraded speech conditions, when meaning is harder to decode from auditory information alone. More informative mouth movements should also be useful in the degraded speech condition, especially in the absence of gestures (provided that gestures and mouth movements influence word recognition to a different extent; Drijvers & Özyürek, 2017) or in addition to gestures (provided that the presence of both cues enhances comprehension to a larger degree than the presence of a single cue; Drijvers & Özyürek, 2017).
(ii)
Incongruent gestures versus no gestures. Performance should be hindered when incongruent iconic gestures are present provided that they are processed automatically alongside speech (Kelly et al., 2010; McNeill et al., 1994). This will be the case particularly for the degraded speech. The effect (if any) of mouth movements will be difficult to document because of the large interference effect from the gestures.
(iii)
Congruent versus incongruent gestures. Performance should be significantly better for congruent relative to incongruent gestures, particularly when congruent gestures are more informative. Performance will be most disrupted when incongruent, highly informative gestures are present, and speech is degraded. Iconic gestures accompanied by more informative mouth movements should have a greater effect on word recognition than a single cue alone, especially when the speech is degraded (Drijvers & Özyürek, 2017).

Methods

Norming studies

Gesture informativeness norms

Forty-five native English speakers (28 females; M = 27 years, SD = 6.2) were recruited using Prolific (http://www.prolific.co/). Participants had normal or corrected-to-normal vision and hearing and did not report any known neurological or psychiatric conditions. All participants consented to participate in the study and received payment on completion according to Prolific policy. The study received ethical approval (Research Ethics Committee [0143/003]) from UCL.

The materials for this study were collected simultaneously with the materials for the main experiment. We video-recorded a female native-English speaker uttering and gesturing 187 concrete, gesturable words in isolation (mean length of a clip was 2 seconds) in a professional recording studio at UCL. Each word was recorded twice: with and without gestures. For the former, the model was asked to produce gestures as naturally as possible and place her hands on her lap when finished; for the recording without gestures, the model was prompted to keep her hands in her lap. The model wore neutral-colored clothes, wore no makeup, and was sitting on a chair against a unicolored background. For this norming study, only the videos where the gesture was present were used and further edited using iMovie (Version 10.1.12) such that only hand actions remained visible, and audio was muted (see Fig. 1a).

Participants took part in an online experiment designed using Gorilla (https://gorilla.sc/). The task, previously used in other gesture studies (e.g., Drijvers & Özyürek, 2017), was to rate on a scale from 1 to 7 how well the gestures represented the written words displayed on the screen (with 1 being very difficult; 7 being very easy to recognize). Each participant responded to 93–94 items randomly selected from the whole corpus. There were also two practice trials at the beginning of the experiment. In addition, 20 filler items (M = 1.49, SD = 1.00) were randomly presented during the experiment to ensure that participants used all the available ratings. The fillers consisted of gestures that did not match the written words on the screen (e.g., the gesture represented a ‘hammer,’ and the written word was ‘vaccine’) and were not included in the analysis. Participants were allowed to take three breaks during the experiment but were asked to complete the study within 40 minutes. The trials were self-paced, and there was a fixation cross of 250 ms prior to each trial.

The participants’ responses had a grand mean of 5.24 (SD = 1.27), suggesting that most of the selected iconic gestures matched well with the written form. The gesture informativeness of an item is its mean rating score, with bigger values (i.e., closer to 7) signifying that the gesture is more informative.
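The scoring rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the tuple layout, and the example ratings are hypothetical and are not taken from the study's materials.

```python
from collections import defaultdict
from statistics import mean


def gesture_informativeness(ratings):
    """Return {word: mean rating} from (word, rating, is_filler) tuples."""
    by_word = defaultdict(list)
    for word, rating, is_filler in ratings:
        if is_filler:
            continue  # fillers were not included in the analysis
        if not 1 <= rating <= 7:
            raise ValueError(f"rating outside the 1-7 scale: {rating}")
        by_word[word].append(rating)
    return {word: mean(values) for word, values in by_word.items()}


# Hypothetical ratings, invented for illustration only.
example_ratings = [
    ("hammer", 7, False),
    ("hammer", 6, False),
    ("ball", 4, False),
    ("ball", 5, False),
    ("vaccine", 1, True),  # mismatching filler item
]
scores = gesture_informativeness(example_ratings)
```

A higher value in `scores` corresponds to a more informative gesture, matching the direction of the scale described above.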

Mouth informativeness norms

We recruited 145 monolingual native English speakers using Prolific (http://www.prolific.co/). Eight participants were removed from the analysis: three participants experienced technical difficulties during the experiment, three timed out, and the last two did not respond correctly to the ‘catch trials’ (see paragraph below). The remaining 137 participants (71 females, 64 males, and two nonbinary; M = 29 years, SD = 6.24) reported normal or corrected-to-normal hearing and vision and had no known neurological or psychological disorders. All participants consented to participate in this online study and were paid for their time according to the Prolific policy. Ethical approval was obtained from the UCL Research Ethics Committee (0143/003).

We recorded 745 muted video clips in which an English-speaking actress produced single words. The selected words were concrete (ratings between 3.93 and 5 on a 5-point scale; Brysbaert et al., 2014) and referred to either everyday objects (e.g., ‘ball’), living beings (e.g., ‘fish’), actions (e.g., ‘watching’), or attributes (e.g., ‘hot’). The videos were recorded in a soundproof recording studio at UCL and contained only the face of the speaker presented on a dark unicolor background (see Fig. 1b). The video stimuli were randomly divided into seven lists, and each participant was randomly assigned to complete one of the lists. Additionally, we selected 12 pictures from various open-source platforms that served as catch trials presented on random occasions. In the catch trials, participants saw briefly presented images followed by a question about the picture (e.g., Was that a candle?). This was to ensure participants paid attention throughout the task.

Participants were invited to take part in an online lipreading task created with Gorilla (https://gorilla.sc/). They were instructed to watch silent video clips (mean length of 1 second) presented on a screen and then type their guess of what was uttered by the speaker. The same video was successively played twice to ensure subjects did not miss any trials and were able to extract the available information from the lips. After the second presentation, a blank answer box appeared below the video (see Fig. 1b). The task was self-paced. Before each trial, a fixation cross was presented on the screen for 250 ms. Participants were prompted to type a single word response using lower case letters and to avoid spaces. They were encouraged to make their best guess if they were unsure. Prior to the experimental trials, subjects performed seven practice trials followed by feedback.

We operationalized informativeness using the phonological distance between the typed responses and the target in the following manner. First, we converted written responses to phonetic transcription (International Phonetic Alphabet [IPA]) using available online software (https://tophonetics.com/). We then corrected accidental spaces and arbitrarily assigned the lowest value of informativeness to any missing answers (e.g., blank or ‘I don’t know’) to reflect the level of difficulty these words posed (62 trials out of 14,514). Second, we used the PanPhon package (Mortensen et al., 2016) in PyCharm 2018.2.4, which consists of a large database of phonemes and their phonological features, to calculate feature edit distance. This is a string edit distance with weighted phonological features^{Footnote 1} divided by the maximum length of a given word. The calculated distance was normalized and ranged from 0 to 1 (M = 0.49, SD = 0.16). The measure of mouth informativeness for a given word is, therefore, the mean distance value, with a smaller distance (i.e., closer to zero) corresponding to a larger informativeness score.
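To make the shape of this measure concrete, the following Python sketch computes a length-normalized edit distance and averages it over responses. It is a simplified stand-in, not the PanPhon computation: it uses unit costs for every insertion, deletion, and substitution, whereas the measure described above weights substitutions by phonological features. Dividing by the longer of the two sequences and scoring missing answers as 1.0 are assumptions made here for illustration, and all function names and example transcriptions are hypothetical.

```python
from statistics import mean


def edit_distance(source, target):
    """Levenshtein distance between two phoneme sequences (unit costs)."""
    previous = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        current = [i]
        for j, t in enumerate(target, start=1):
            substitution = previous[j - 1] + (s != t)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def normalized_distance(response, target):
    """Distance divided by the longer sequence length, so it lies in [0, 1]."""
    longest = max(len(response), len(target))
    if longest == 0:
        return 0.0
    return edit_distance(response, target) / longest


def mean_distance(responses, target):
    """Mean normalized distance; a missing answer (None) is scored as 1.0."""
    return mean(
        1.0 if response is None else normalized_distance(response, target)
        for response in responses
    )


# Hypothetical lipreading responses to the target 'ball'.
target = ["b", "ɔ", "l"]
responses = [["b", "ɔ", "l"], ["m", "ɔ", "l"], None]
score = mean_distance(responses, target)
```

As in the text, a smaller mean distance indicates that the mouth movements for that word were more informative.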

Main study

Participants

A total of 104 native English speakers (M = 29 years, SD = 6.95, 65 females) were recruited via Prolific (http://www.prolific.co/). All were right-handed monolinguals, who reported no language impairments and had normal or corrected-to-normal hearing and vision. As in the norming studies, all participants provided their consent for participation and were paid for their time under UCL ethical approval (0143/003). We were unable to conduct sample size calculations a priori based on effect sizes because of a lack of studies from which relevant information could be derived. Note, however, that according to Trafimow (2018), a study of 104 participants (>50 in each between-subject group) has between a ‘good’ and an ‘excellent’ probability of replication and ‘moderate’ precision.

Materials

Materials consisted of 120 gesturable target words referring to either actions (e.g., ‘watching’) or objects (e.g., ‘ball’) that varied in their mouth (M = 0.52, SD = 0.13, range: 0.17–0.87) and gesture (M = 5.30, SD = 1.25, range: 1.67–6.92) informativeness based on the results from the norming studies; 120 video clips recorded as a part of the gesture informativeness norming experiment with visible face, body, and hands of the actress (see Fig. 2); and 240 monochromatic pictures: one matching and one mismatching the target word. For the mismatching pairs, we avoided words that shared phoneme onsets as well as words for which the corresponding limb gestures resembled each other. The pictures were taken from various sources, including Druks and Masterson (2000), Snodgrass and Vanderwart (1980), and other online platforms.

We manipulated the clarity of the auditory signal to create conditions more similar to those in everyday interactions, in which comprehenders may rely more on visual cues such as gestures and mouth movements. We included a ‘clear’ (unedited) condition, as well as a six-band pass-filter vocoded condition with maintained rhythmic structure but reduced pitch-related information (Shannon et al., 1995). Six-band filtering was chosen because it has been shown to moderately hinder speech comprehension (Drijvers & Özyürek, 2017). To manipulate the sound files, we used the same technique as described in Drijvers and Özyürek (2017), following a custom Praat script (Boersma & Weenink, 2021).

We also manipulated the presence of gestures to assess whether mouth movements enhance comprehension in addition to the iconic gestures and whether the effect of gestures is larger than that of mouth movements (Drijvers & Özyürek, 2017). Congruency between gestures and speech was additionally manipulated in separate blocks presented to different participants in order to avoid the possibility that mixing congruent and incongruent gestures would lead to the use of strategies (e.g., ignoring the gestures altogether). The condition in which stimuli had congruent gestures or no gestures is more ecologically valid, given that in real-world communication, gestures are not always present, but when they are, they are congruent with the speech. The condition in which stimuli have incongruent gestures or have no gestures provides a less ecologically valid scenario. However, this manipulation establishes to what extent participants automatically process gestures even when they should strategically ignore them because of interference (Kelly et al., 2010).

To create the incongruent speech–gesture pairs, we used the procedure introduced by Perniss et al. (2020) and Vigliocco et al. (2020), in which the head from one video was cropped (together with the auditory signal) and combined with the body from another (muted) video. We additionally edited the congruent video pairs in a similar way, i.e., we cropped the head from a speech-only video and pasted it on the corresponding speech–gesture video with an aligned audio file to ensure consistency across congruent and incongruent stimuli. All video manipulations were done in iMovie (Version 10.1.12). Furthermore, we constrained the selection of the incongruent speech–gesture pairs in the following way: (i) paired items had the same syllable length but differed at least in phoneme onsets (e.g., ‘walking–bowling’), (ii) associated gestures of the paired items did not resemble each other (e.g., excluding pairings such as ‘bowling–throwing’), and finally, (iii) action and object items could not be paired together (e.g., excluding ‘throwing–airplane’).
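The three pairing constraints can be expressed as a simple check. The sketch below is illustrative only: the `Item` fields, the onset transcriptions, and the set of resembling gestures are hypothetical, and gesture resemblance is passed in as a hand-made list because it cannot be computed from the word itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    word: str
    syllables: int
    onset: str      # first phoneme (illustrative transcription)
    category: str   # "action" or "object"


def is_valid_incongruent_pair(speech, gesture, similar_gestures=frozenset()):
    """Apply constraints (i)-(iii) to a candidate speech-gesture pairing."""
    same_length = speech.syllables == gesture.syllables               # (i)
    different_onset = speech.onset != gesture.onset                   # (i)
    pair = frozenset((speech.word, gesture.word))
    gestures_distinct = pair not in similar_gestures                  # (ii)
    same_category = speech.category == gesture.category               # (iii)
    return same_length and different_onset and gestures_distinct and same_category


walking = Item("walking", 2, "w", "action")
bowling = Item("bowling", 2, "b", "action")
throwing = Item("throwing", 2, "θ", "action")
airplane = Item("airplane", 2, "ɛ", "object")

# Pairs judged to have resembling gestures (assumed here for illustration).
resembling = frozenset({frozenset(("bowling", "throwing"))})
```

With these inputs, ‘walking–bowling’ passes all three constraints, whereas ‘bowling–throwing’ fails on gesture resemblance and ‘throwing–airplane’ fails on semantic category.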

Overall, participants saw congruent or incongruent speech–gesture videos under four possible manipulations: (i) clear, gesture absent (where speech is clear and not accompanied by gestures), (ii) degraded, gesture absent (where speech is noise-vocoded and not accompanied by gestures), (iii) clear, gesture present (where speech is clear and accompanied by gestures), and finally, (iv) degraded, gesture present (where speech is noise-vocoded and accompanied by gestures; see Fig. 2). In all the conditions, mouth movements were present as in naturalistic face-to-face communication settings. For informativeness scores, picture materials, and Praat script, please see https://osf.io/gudj6/. For audio/video materials, please contact the corresponding author.

Procedure

After consenting to take part in an online computer-based experiment developed using the Gorilla platform (https://gorilla.sc/), participants were randomly allocated to one of the two experimental groups: congruent (53 participants) or incongruent (51 participants). Each trial started with a fixation cross (250 ms) followed by an interval (300 ms) that preceded the onset of the picture. An image was then presented for 1,000 ms, and a video clip would play automatically on the next screen with the simultaneous presentation of the ‘YES’ and ‘NO’ answer boxes below. Participants’ task was to decide whether the spoken words uttered by the speaker in the videos matched previously seen pictures of an object/action by selecting (as accurately and as quickly as possible) one of the answer boxes using the mouse. Participants could respond during the presentation of the videos to ensure that the reaction times (RT) measured in this study captured the moment of meaning recognition. Participants were presented with the same video stimulus twice: once with a matching target image (YES trials) and once with a mismatching image (NO trials), completing 240 trials in addition to eight practice trials (not seen elsewhere) prior to the experiment. The main trials were randomly divided into four blocks of 60, between which participants could take a self-paced break. The experimental blocks were also randomized across participants. Additionally, we introduced eight ‘catch trials’ (two per block, randomly presented) to ensure participants paid attention to the videos. The catch trials consisted of pictures (different from those used for the target items) briefly presented on the screen, followed by a picture-verification question (e.g., Was that a dog?).

Data analysis

Generalized logistic and linear mixed-effects regression analyses, with Holm’s corrected pairwise comparisons where necessary, were performed in RStudio (RStudio Team, 2015) using the lme4 package (Bates et al., 2015). Mixed-effect regression was used to handle categorical and continuous variables without loss in power, as well as non-independence in the data (Dixon, 2008; Jaeger, 2008; Meteyard & Davies, 2020). It is also more suitable for unbalanced designs, can easily accommodate missing data, and can account for both by-subject and by-item variance (Gelman & Hill, 2006; Meteyard & Davies, 2020). We carried out two separate analyses (Analysis 1 and Analysis 2), both assessing participants’ accuracy (binomial dependent variable) and RT (continuous dependent variable). In both sets of analyses, we focused only on the trials where the spoken word and the picture matched (YES trials) to ensure reliability (Stadthagen-Gonzalez et al., 2009), following Vigliocco et al. (2020). Prior to the analyses, outliers were identified as (i) any participant with an accuracy more than three standard deviations below the mean or an RT more than three standard deviations above the mean; (ii) any item with an accuracy below chance level (50%) or an RT more than three standard deviations above the mean; (iii) any trial with an RT more than three standard deviations above the mean of all trials, to ensure normal distribution; (iv) any trial with video loading issues signaled by Gorilla. Outliers (~10% of the data) were removed from the analyses (see the Supplementary Materials for a full description of the outliers).
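As an illustration of the trial-level rule only (criterion iii), the Python sketch below drops RTs that lie more than three standard deviations above the mean of all trials. The function name and the example values are hypothetical, and the participant-level, item-level, and video-loading criteria are not covered. The use of the sample standard deviation is also an assumption, since the text does not specify it.

```python
from statistics import mean, stdev


def trim_slow_trials(rts, n_sd=3.0):
    """Keep RTs no more than n_sd standard deviations above the mean."""
    cutoff = mean(rts) + n_sd * stdev(rts)
    return [rt for rt in rts if rt <= cutoff]


# Hypothetical RTs in ms: twenty typical trials plus one very slow trial.
example_rts = [900, 950, 1000, 1050, 1100] * 4 + [6000]
kept = trim_slow_trials(example_rts)
```

Note that with very few observations no single value can exceed a three-standard-deviation cutoff, so a rule of this kind is only meaningful when applied across a large pool of trials, as in the study.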

In Analysis 1, we ran separate models for congruent and incongruent gestures, entering the following fixed effects: gesture presence, speech clarity, and mouth informativeness, as well as all possible interactions between them (up to a three-way interaction). In Analysis 2, we selected all the trials in which the gesture was present across the congruent and incongruent conditions and included the following fixed effects in a new set of models: speech clarity, mouth informativeness, gesture informativeness, congruency, and up to three-way interactions between them. Taking a design-driven approach (Barr et al., 2013), we entered intercepts for subjects and items as random effects; we also entered by-subject and by-item random slopes for the effects of gesture presence and speech clarity in Analysis 1 and the effect of speech clarity in Analysis 2. The interaction terms as well as the mouth informativeness term were not included in the random structure due to model convergence issues. Furthermore, due to singular fits, models were simplified based on the variance of the random slopes (i.e., the terms that explained the least variance were removed first and then a simplified model was tested). Specifically, we removed the random slopes of gesture presence from Analysis 1 with incongruent gestures (by participant and by item for the accuracy model, as well as by participant for the RT model) and the random slope of speech clarity by participant from Analysis 2 (accuracy model). By keeping the maximal random structure possible, we minimized the possibility of Type I errors and ensured a conservative interpretation of the results. To allow convergence, the bobyqa optimizer was used, with the maximum number of iterations increased.
We also entered word age of acquisition (AoA; Kuperman et al., 2012), log frequency (Brysbaert & New, 2009), number of syllables, and semantic category (i.e., whether the item referred to an action or an object) as control variables.^{Footnote 2} All continuous predictors were centered on the mean, and all categorical variables were sum-coded (i.e., we compared the deviations from the grand mean [intercept] for a given predictor). We used log transformation of the RT to minimize skewness of the data and then checked for linear regression assumptions: visual inspection of the RT data suggested that the residuals were normally distributed, and the assumption of homoscedasticity was met. There was no multicollinearity (Variance Inflation Factors [VIF] below 1.7). Significance values for the models were obtained using the lmerTest package (Kuznetsova et al., 2017) following Luke (2017), with Satterthwaite’s approximation for the RT models and Laplace approximation for the accuracy models. For each model, we additionally calculated conditional R² that represents the variance explained by both fixed and random effects following Nakagawa and Schielzeth (2013), as well as Johnson (2014), and using the MuMIn package (Bartoń, 2019). Finally, the graphs were created with the sjPlot (Lüdecke, 2021) and ggplot2 (Wickham, 2016) packages. The R code and the datasets analyzed in the study are available in the Open Science Framework repository (https://osf.io/gudj6/).
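The predictor preparation described above (mean-centering, sum coding, and log-transforming RT) can be sketched in Python as follows. This is not the authors' R code, which is available in the OSF repository; the function names and example values are hypothetical, and the choice of which level receives +1 is an assumption made for illustration.

```python
import math
from statistics import mean


def center(values):
    """Subtract the mean so that zero corresponds to the average value."""
    average = mean(values)
    return [value - average for value in values]


def sum_code(labels, positive_level):
    """Code a two-level factor as +1 / -1 (sum coding)."""
    levels = set(labels)
    if len(levels) != 2 or positive_level not in levels:
        raise ValueError("expected exactly two levels including positive_level")
    return [1 if label == positive_level else -1 for label in labels]


def log_transform(rts):
    """Natural-log transform of RTs to reduce positive skew."""
    return [math.log(rt) for rt in rts]


# Hypothetical values, invented for illustration only.
mouth_centered = center([0.2, 0.5, 0.8])
clarity_coded = sum_code(["clear", "degraded", "clear"], positive_level="clear")
rt_logged = log_transform([850, 1200])
```

With predictors prepared this way, the intercept is estimated at the average of the continuous predictors and at the grand mean across factor levels, so each main effect can be read as a deviation from that grand mean.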

Results

Here, we report only significant effects and interactions (for the full set of results, see the Supplementary Materials).

Analysis 1

Congruent gestures

The accuracy model revealed a significant main effect of gesture presence (β = −0.389, SE = 0.190, z = −2.048, p = .040) and of speech clarity (β = 0.644, SE = 0.204, z = 3.156, p = .001): Participants made more errors when there were no gestures and when speech was degraded.
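
For readers who want to check how such statistics relate to one another, the sketch below recomputes z and a two-sided p value from a reported coefficient and standard error. It assumes that the reported z values are Wald statistics evaluated against a standard normal distribution; the function name is illustrative, and small differences from the reported values are expected because β and SE are rounded.

```python
import math


def wald_z_test(beta, se):
    """Return (z, two-sided p) for a Wald test of beta = 0 against N(0, 1)."""
    z = beta / se
    # Two-sided tail probability of the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Gesture presence in the congruent-gesture accuracy model (values from the text).
z, p = wald_z_test(-0.389, 0.190)
print(round(z, 2), round(p, 2))
```

The result is close to the reported z = −2.048 and p = .040.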

In the RT analysis, we found a significant main effect of gesture presence (β = 0.031, SE = 0.006, t = 4.737, p < .001), and of speech clarity (β = −0.035, SE = 0.003, t = −11.391, p < .001): RTs were faster when gestures were present, and when the speech was clear. There was a significant interaction between these variables (β = −0.008, SE = 0.002, t = −3.645, p < .001). Follow-up pairwise comparisons showed that participants were slower in the noise-vocoded condition, especially when gestures were absent (ps < .004; see Fig. 3a). The interaction between gesture presence and mouth informativeness was marginal (β = 0.082, SE = 0.044, t = 1.870, p = .064): In the absence of gestures, participants were faster when mouth movements were more informative than when they were less informative (while in the presence of gestures mouth informativeness had no effect); maximal mouth informativeness with no gestures had a similar effect to the presence of gestures (see Fig. 3b).
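
Because the RT models are fitted on a log scale, the coefficients are easier to read once converted into proportional changes. The sketch below does this under three assumptions that are not stated in the text: natural-log RTs, +1/−1 sum coding (so the two levels of a factor differ by twice the coefficient), and an effect averaged over the other predictors. The function name is illustrative.

```python
import math


def rt_ratio_from_sum_coded_beta(beta, contrast_span=2.0):
    """Ratio of RTs between the two levels of a +1/-1 coded factor.

    Assumes a natural-log RT model, so a log-scale difference d
    corresponds to a multiplicative change of exp(d).
    """
    return math.exp(abs(beta) * contrast_span)


# Speech clarity in the congruent-gesture RT model (beta from the text).
ratio = rt_ratio_from_sum_coded_beta(-0.035)
percent_slower = (ratio - 1.0) * 100.0
print(f"{percent_slower:.1f}% slower in the slower condition")
```

Under these assumptions, the speech clarity coefficient corresponds to responses that are roughly 7% slower for noise-vocoded than for clear speech.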

Incongruent gestures

In the accuracy data, we found a significant main effect of gesture presence (β = 0.297, SE = 0.069, z = 4.322, p < .001), and of speech clarity (β = 0.911, SE = 0.133, z = 6.832, p < .001): Participants made more errors with incongruent gestures and when the speech was degraded, respectively.

In the RT analysis, there was a significant main effect of speech clarity (β = −0.041, SE = 0.005, t = −8.832, p < .001) with slower RTs for noise-vocoded speech. The effect of mouth informativeness was marginal (β = 0.153, SE = 0.081, t = 1.871, p = .064): Responses were faster for more informative mouth movements.

Analysis 2

In the accuracy analysis, we found a significant main effect of speech clarity (β = 0.765, SE = 0.153, z = 4.994, p < .001) and of congruency (β = −0.412, SE = 0.097, z = −4.234, p < .001), with more errors for degraded speech and incongruent pairings, respectively. Their interaction was significant (β = 0.484, SE = 0.076, z = 6.369, p < .001): Participants were especially hindered by vocoding when the gestures were incongruent. There was no difference between clear and vocoded speech when the gestures were congruent, p = .217 (see Fig. 4a). There was also a significant interaction between congruency, speech clarity, and gesture informativeness (β = 0.145, SE = 0.072, z = 2.025, p = .043): Participants performed equally well in both speech clarity conditions when the gesture was congruent, but significantly worse when the noise-vocoded speech was accompanied by highly informative incongruent gestures (see Fig. 4b).

The RT model revealed a significant main effect of speech clarity (β = −0.032, SE = 0.004, t = −8.232, p < .001), with slower RTs for the noise-vocoded speech. We found a significant interaction between congruency and speech clarity (β = −0.010, SE = 0.004, t = −2.800, p = .006). Pairwise comparisons showed that participants were generally slower for vocoded than for clear speech, for both congruent and incongruent gestures (ps < .001). There was no difference between the congruent and incongruent gesture conditions when the speech was clear (p = .441); however, there was a marginal difference in the noise-vocoded condition (p = .057), with slower RTs for the incongruent gestures (see Fig. 5a). There was also a significant interaction between congruency and gesture informativeness (β = 0.008, SE = 0.003, t = 3.147, p = .002): Participants responded faster when congruent gestures were more informative (see Fig. 5b). Finally, the interaction between speech clarity and mouth informativeness was also significant (β = −0.066, SE = 0.025, t = −2.621, p = .010): When the speech was degraded, participants were slower for less informative mouth movements, whereas when the speech was clear, they were equally fast regardless of mouth informativeness (see Fig. 5c).

Discussion

We investigated audiovisual word recognition under clear and distorted listening conditions using stimuli for which we have measures of informativeness. Unsurprisingly, subjects were less accurate and slower when the speech was noise vocoded. Replicating previous studies, they were also overall less accurate and slower when gestures were incongruent (e.g., Kelly et al., 2010; McNeill et al., 1994; Vigliocco et al., 2020).

Furthermore, the presence of congruent gestures enhanced word recognition with responses being more accurate and faster particularly when the speech was degraded. Also, faster response times were observed for more informative congruent gestures (relative to incongruent ones) across speech clarity conditions. Conversely, incongruent gestures, especially when they were more informative and accompanied by noise-vocoded speech, led to the least accurate responses.

Informativeness of mouth movements did not have a significant effect across conditions. However, we observed a trend for words with more informative mouth movements to have faster RTs when the accompanying gestures were incongruent with the speech. Moreover, mouth informativeness interacted with speech clarity, such that RTs were faster for noise-vocoded words with more informative mouth movements across gesture congruency conditions.

Finally, we found that the two visual cues interact, such that more informative mouth movements speeded up recognition in the absence of gestures and the effect of maximal mouth informativeness was similar to the condition when the gestures were present.

Iconic gestures and mouth movements in spoken word recognition

In line with previous research, our findings indicate that iconic gestures have a pivotal role in face-to-face communication: They are automatically processed alongside speech and facilitate word recognition, especially in adverse listening conditions (Drijvers & Özyürek, 2017; Holle et al., 2010; Obermeier et al., 2012). Our results also contribute to the growing body of literature on multimodal communication by showing that gestures are particularly useful when highly informative: People derive meaning faster, plausibly because the more information gestures convey, the less ambiguous they are, and thus the easier their conceptual mapping is.

Furthermore, when looking at the semantically mismatching trials, people extract information from iconic gestures, even when irrelevant (McNeill et al., 1994). It has been argued that this interference effect reflects automatic and obligatory integration between the two information channels (Kelly et al., 2010). Here, we show that this depends upon gesture informativeness: The clearer the semantic information conveyed in the gestures, the larger the interference. While this result is compatible with an integration account, it may also come about because participants use the information provided by gestures in order to carry out the picture-matching task, rather than speech, as suggested by Vigliocco et al. (2020), to account for the performance of aphasic patients.

Regarding mouth movements, it has been demonstrated that they are part and parcel of both spoken (Sumby & Pollack, 1954) and signed languages (Bank et al., 2016; van de Sande & Crasborn, 2009). Mouth movements are particularly useful in adverse listening conditions (Ma et al., 2009; Reisberg et al., 1987; Ross et al., 2007; Schwartz et al., 2004; Sumby & Pollack, 1954). We extend this result to show that this is crucially the case for more informative mouth movements. More generally, we show that our novel manner of quantifying the amount of information provided by the mouth (mouth informativeness) is useful; it goes beyond manipulating the presence/absence of mouth movements and overcomes the difficulties of existing quantifications based on visemes.

Dynamic interplay between speech, gesture, and mouth movements

We identified interactions between gesture, mouth informativeness, and clarity, supporting proposals in which auditory and visual cues are dynamically and flexibly weighted during communication (Skipper et al., 2009; Zhang, Frassinelli, et al., 2021b). In one of the first experimental studies looking at audiovisual (including gestures) speech, Drijvers and Özyürek (2017) demonstrated that moderately noise-vocoded speech comprehension was enhanced when the two cues were present, with iconic gestures having a larger effect than mouth movements. Using a completely different task and looking at RTs (as well as accuracy), we replicated and, importantly, extended their results by clarifying how and when gestures and mouth movements affect word recognition. Specifically, the presence of gestures significantly improved word recognition, irrespective of whether mouth movements were informative or not. This finding can be accounted for in two ways. It could be that participants carried out the task by making a decision as soon as semantic information was accessed either via the speech or via the gesture (whichever came first). In degraded speech conditions, such decisions could be based predominantly on the gesture. This account is in line with our previous findings from aphasic speakers, where we found clear evidence for a complementary use of speech and gestures (Vigliocco et al., 2020). Alternatively, the advantage might have come about because speech and gestures together enhanced activation in the semantic system in comparison to speech or gesture alone. Compatible with this latter possibility, we found that when the speech was degraded and accompanied by gestures (either congruent or incongruent), more informative mouth movements helped. This effect may come about because mouth movements facilitate phonological activation of the target word, leading to enhanced (when congruent) or reduced (when incongruent) activation at the semantic level. This is in line with Drijvers and Özyürek’s (2017) finding of an enhanced impact on accuracy of both cues together (rather than a single cue) in degraded conditions. The finding, albeit only marginal, of a larger mouth informativeness effect (across clarity conditions) without gestures goes beyond the work of Drijvers and Özyürek (2017), suggesting that the system weights the cues differently, using the most useful one at any given time.

Zhang, Frassinelli, et al. (2021b) showed that the N400 response evoked by words in context is modulated by the presence of different multimodal cues, such as (both iconic and beat) gestures, prosodic stress, and mouth informativeness. The researchers found that comprehension is enhanced when iconic gestures and more informative mouth movements accompany speech. They explained this finding in terms of the eye gaze literature, suggesting that listeners often focus on a speaker’s face during speech–gesture processing (Beattie et al., 2010; Gullberg & Kita, 2009). In parallel, here we found that mouth informativeness had an impact on word recognition across gesture congruency conditions when the speech was degraded, similarly showing that the more information is available, the easier comprehension is.

Overall, our results support the view that both cues contribute to human communication, with iconic gestures playing a more substantial role than mouth movements (Drijvers & Özyürek, 2017). This is because iconic gestures facilitate encoding of the meaning by directly activating semantic features (McNeill, 1992, 2000; Morrel-Samuels & Krauss, 1992). Mouth movements, in contrast, tap into phonological features of words, which can then facilitate access to the semantic representations by prediction and constraint (Peelle & Sommers, 2015). Importantly, we also demonstrate that the use of cues depends on their informativeness, suggesting that iconic gestures and mouth movements are dynamically weighted during speech processing.

Data availability

The data and picture materials are available at https://osf.io/gudj6/

Code availability (software application or custom code)

The code is available at https://osf.io/gudj6/

Notes

1. Features are weighted according to their phonological class and their subjective variability. For more information, see: https://github.com/dmort27/panphon
2. Note that word familiarity, phonological neighborhood density, and viewport size (i.e., the size of each participant’s browser window, minus the webpage interface, such as the URL bar) were initially considered as control factors, but we had to drop them because of missing values and model complexity. The average viewport size was 1,536 × 750.

References

Arnold, P., & Hill, F. (2001). Bisensory augmentation: A speechreading advantage when speech is clearly audible and intact. British Journal of Psychology (London, England: 1953), 92(Pt. 2), 339–355.
Bank, R., Crasborn, O., & van Hoet, R. (2016). The prominence of spoken language elements in a sign language. Linguistics, 54(6), 1281–1305.
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278. https://doi.org/10.1016/j.jml.2012.11.001
Bartoń, K. (2019). MuMIn: Multi-model inference (R Package Version 1.43.15) [Computer software]. https://CRAN.R-project.org/package=MuMIn
Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01
Beattie, G., & Shovelton, H. (1999). Do iconic hand gestures really contribute anything to the semantic information conveyed by speech? An experimental investigation. Semiotica, 123(1/2), 1–30. https://doi.org/10.1515/semi.1999.123.1-2.1
Beattie, G., Webster, K., & Ross, J. (2010). The fixation and processing of the iconic gestures that accompany talk. Journal of Language and Social Psychology, 29(2), 194–213. https://doi.org/10.1177/0261927X09359589
Boersma, P., & Weenink, D. (2021). Praat: doing phonetics by computer. Version 6.1.53, retrieved 8 September 2021 from http://www.praat.org/
Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990. https://doi.org/10.3758/BRM.41.4.977
Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3), 904–911. https://doi.org/10.3758/s13428-013-0403-5
Cocks, N., Sautin, L., Kita, S., Morgan, G., & Zlotowitz, S. (2009). Gesture and speech integration: An exploratory study of a man with aphasia. International Journal of Language & Communication Disorders, 44(5), 795–804. https://doi.org/10.1080/13682820802256965
Cocks, N., Byrne, S., Pritchard, M., Morgan, G., & Dipper, L. (2018). Integration of speech and gesture in aphasia. International Journal of Language & Communication Disorders, 53(3), 584–591. https://doi.org/10.1111/1460-6984.12372
Dixon, P. (2008). Models of accuracy in repeated-measures designs. Journal of Memory and Language, 59(4), 447–456. https://doi.org/10.1016/j.jml.2007.11.004
Drijvers, L., & Özyürek, A. (2017). Visual context enhanced: The joint contribution of iconic gestures and visible speech to degraded speech comprehension. Journal of Speech, Language, and Hearing Research, 60(1), 212–222. https://doi.org/10.1044/2016_JSLHR-H-16-0101
Drijvers, L., & Özyürek, A. (2020). Non-native Listeners benefit less from gestures and visible speech than native listeners during degraded speech comprehension. Language and Speech, 63(2), 209–220. https://doi.org/10.1177/0023830919831311
Drijvers, L., Vaitonytė, J., & Özyürek, A. (2019). Degree of language experience modulates visual attention to visible speech and iconic gestures during clear and degraded speech comprehension. Cognitive Science, 43(10), e12789. https://doi.org/10.1111/cogs.12789
Druks, J., & Masterson, J. (2000). An object and action naming battery. Psychology Press.
Fisher, C. G. (1968). Confusions among visually perceived consonants. Journal of Speech and Hearing Research, 11(4), 796–804. https://doi.org/10.1044/jshr.1104.796
Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press. https://doi.org/10.1017/CBO9780511790942
Green, A., Straube, B., Weis, S., Jansen, A., Willmes, K., Konrad, K., & Kircher, T. (2009). Neural integration of iconic and unrelated coverbal gestures: A functional MRI study. Human Brain Mapping, 30(10), 3309–3324. https://doi.org/10.1002/hbm.20753
Gullberg, M., & Kita, S. (2009). Attention to speech-accompanying gestures: Eye movements and information uptake. Journal of Nonverbal Behavior, 33(4), 251–277. https://doi.org/10.1007/s10919-009-0073-2
Habets, B., Kita, S., Shao, Z., Özyurek, A., & Hagoort, P. (2010). The Role of Synchrony and Ambiguity in speech–gesture integration during comprehension. Journal of Cognitive Neuroscience, 23(8), 1845–1854. https://doi.org/10.1162/jocn.2010.21462
Hirata, Y., & Kelly, S. D. (2010). Effects of lips and hands on auditory learning of second-language speech sounds. Journal of Speech, Language, and Hearing Research, 53(2), 298–310. https://doi.org/10.1044/1092-4388(2009/08-0243)
Holle, H., & Gunter, T. C. (2007). The role of iconic gestures in speech disambiguation: ERP evidence. Journal of Cognitive Neuroscience, 19(7), 1175–1192. https://doi.org/10.1162/jocn.2007.19.7.1175
Holle, H., Obleser, J., Rueschemeyer, S.-A., & Gunter, T. C. (2010). Integration of iconic gestures and speech in left superior temporal areas boosts speech comprehension under adverse listening conditions. NeuroImage, 49(1), 875–884. https://doi.org/10.1016/j.neuroimage.2009.08.058
Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs (transformation or not) and towards logit mixed models. Journal of Memory and Language, 59(4), 434–446. https://doi.org/10.1016/j.jml.2007.11.007
Jesse, A., & Massaro, D. W. (2010). The temporal distribution of information in audiovisual spoken-word identification. Attention, Perception, & Psychophysics, 72(1), 209–225. https://doi.org/10.3758/APP.72.1.209
Johnson, P. C. D. (2014). Extension of Nakagawa & Schielzeth’s R2GLMM to random slopes models. Methods in Ecology and Evolution, 5(9), 944–946.
Kelly, S. D., Barr, D. J., Church, R. B., & Lynch, K. (1999). Offering a hand to pragmatic understanding: The role of speech and gesture in comprehension and memory. Journal of Memory and Language, 40(4), 577–592. https://doi.org/10.1006/jmla.1999.2634
Kelly, S. D., Hirata, Y., Manansala, M., & Huang, J. (2014). Exploring the role of hand gestures in learning novel phoneme contrasts and vocabulary in a second language. Frontiers in Psychology, 5. https://doi.org/10.3389/fpsyg.2014.00673
Kelly, S. D., Özyürek, A., & Maris, E. (2010). Two Sides of the Same Coin: Speech and Gesture Mutually Interact to Enhance Comprehension. Psychological Science, 21(2), 260–267. https://doi.org/10.1177/0956797609357327
Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4), 978–990. https://doi.org/10.3758/s13428-012-0210-4
Kutas, M., & Federmeier, K. D. (2011). Thirty years and counting: Finding meaning in the N400 component of the event-related brain potential (ERP). Annual Review of Psychology, 62, 621–647. https://doi.org/10.1146/annurev.psych.093008.131123
Kuznetsova, A., Brockhoff, P. B., & Christensen, R. H. B. (2017). lmerTest Package: Tests in Linear Mixed Effects Models. Journal of Statistical Software, 82(13). https://doi.org/10.18637/jss.v082.i13
Lachs, L., & Pisoni, D. B. (2004). Cross-modal source information and spoken word recognition. Journal of Experimental Psychology: Human Perception and Performance, 30(2), 378–396. https://doi.org/10.1037/0096-1523.30.2.378
Lüdecke, D. (2021). sjPlot: Data visualization for statistics in social science (R Package Version 2.8.9) [Computer software]. https://CRAN.R-project.org/package=sjPlot
Luke, S. G. (2017). Evaluating significance in linear mixed-effects models in R. Behavior Research Methods, 49(4), 1494–1502. https://doi.org/10.3758/s13428-016-0809-y
Ma, W. J., Zhou, X., Ross, L. A., Foxe, J. J., & Parra, L. C. (2009). Lip-reading aids word recognition most in moderate noise: A Bayesian explanation using high-dimensional feature space. PLOS ONE, 4(3), Article e4638. https://doi.org/10.1371/journal.pone.0004638
Massaro, D. W., & Cohen, M. M. (1995). Perceiving talking faces. Current Directions in Psychological Science, 4(4), 104–109.
Massaro, D. W., Cohen, M. M., & Gesi, A. T. (1993). Long-term training, transfer, and retention in learning to lipread. Perception & Psychophysics, 53(5), 549–562. https://doi.org/10.3758/BF03205203
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748. https://doi.org/10.1038/264746a0
McNeill, D. (1992). Hand and mind: What gestures reveal about thought. University of Chicago Press.
McNeill, D. (Ed.). (2000). Language and Gesture. Cambridge University Press. https://doi.org/10.1017/CBO9780511620850
McNeill, D., Cassell, J., & McCullough, K.-E. (1994). Communicative effects of speech-mismatched gestures. Research on Language & Social Interaction, 27(3), 223–237. https://doi.org/10.1207/s15327973rlsi2703_4
Meteyard, L., & Davies, R. A. I. (2020). Best practice guidance for linear mixed-effects models in psychological science. Journal of Memory and Language, 112, Article 104092. https://doi.org/10.1016/j.jml.2020.104092
Morrel-Samuels, P., & Krauss, R. M. (1992). Word familiarity predicts temporal asynchrony of hand gestures and speech. Journal of Experimental Psychology: Learning, Memory, and Cognition, 18(3), 615–622. https://doi.org/10.1037/0278-7393.18.3.615
Mortensen, D. R., Littell, P., Bharadwaj, A., Goyal, K., Dyer, C., & Levin, L. (2016). PanPhon: A resource for mapping IPA segments to articulatory feature vectors. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical papers (pp. 3475–3484). https://www.aclweb.org/anthology/C16-1328
Nakagawa, S., & Schielzeth, H. (2013). A general and simple method for obtaining R2 from generalized linear mixed-effects models. Methods in Ecology and Evolution, 4(2), 133–142. https://doi.org/10.1111/j.2041-210x.2012.00261.x
Obermeier, C., Dolk, T., & Gunter, T. C. (2012). The benefit of gestures during communication: Evidence from hearing and hearing-impaired individuals. Cortex, 48(7), 857–870. https://doi.org/10.1016/j.cortex.2011.02.007
Peelle, J. E., & Sommers, M. S. (2015). Prediction and constraint in audiovisual speech perception. Cortex, 68, 169–181. https://doi.org/10.1016/j.cortex.2015.03.006
Perniss, P., Vinson, D., & Vigliocco, G. (2020). Making sense of the hands and mouth: The role of “secondary” cues to meaning in British Sign Language and English. Cognitive Science, 44(7), Article e12868. https://doi.org/10.1111/cogs.12868
Reisberg, D., McLean, J., & Goldfield, A. (1987). Easy to hear but hard to understand: A lip-reading advantage with intact auditory stimuli. In B. Dodd & R. Campbell (Eds.), Hearing by eye: The psychology of lip-reading (pp. 97–113). Erlbaum.
Ross, L. A., Saint-Amour, D., Leavitt, V. M., Javitt, D. C., & Foxe, J. J. (2007). Do you see what I am saying? Exploring visual enhancement of speech comprehension in noisy environments. Cerebral Cortex (New York, N.Y.: 1991), 17(5), 1147–1153. https://doi.org/10.1093/cercor/bhl024
RStudio Team (2015). RStudio: Integrated Development for R. RStudio, PBC, Boston, MA URL http://www.rstudio.com/.
Schwartz, J.-L., Berthommier, F., & Savariaux, C. (2004). Seeing to hear better: Evidence for early audiovisual interactions in speech identification. Cognition, 93(2), B69–B78. https://doi.org/10.1016/j.cognition.2004.01.006
Shannon, R. V., Zeng, F. G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science (New York, N.Y.), 270(5234), 303–304. https://doi.org/10.1126/science.270.5234.303
Skipper, J. I., Goldin-Meadow, S., Nusbaum, H. C., & Small, S. L. (2009). Gestures Orchestrate Brain Networks for Language Understanding. Current Biology, 19(8), 661–667. https://doi.org/10.1016/j.cub.2009.02.051
Snodgrass, J. G., & Vanderwart, M. (1980). A standardized set of 260 pictures: Norms for name agreement, image agreement, familiarity, and visual complexity. Journal of Experimental Psychology: Human Learning and Memory, 6(2), 174–215. https://doi.org/10.1037/0278-7393.6.2.174
Solberg Økland, H., Todorović, A., Lüttke, C. S., McQueen, J. M., & de Lange, F. P. (2019). Combined predictive effects of sentential and visual constraints in early audiovisual speech processing. Scientific Reports, 9(1). https://doi.org/10.1038/s41598-019-44311-2
Stadthagen-Gonzalez, H., Damian, M. F., Pérez, M. A., Bowers, J. S., & Marín, J. (2009). Name–picture verification as a control measure for object naming: A task analysis and norms for a large set of pictures. Quarterly Journal of Experimental Psychology, 62(8), 1581–1597. https://doi.org/10.1080/17470210802511139
Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. The Journal of the Acoustical Society of America, 26(2), 212–215. https://doi.org/10.1121/1.1907309
Trafimow, D. (2018). An a priori solution to the replication crisis. Philosophical Psychology, 31(8), 1188–1214. https://doi.org/10.1080/09515089.2018.1490707
Tye-Murray, N., Sommers, M., & Spehar, B. (2007). Auditory and Visual Lexical Neighborhoods in Audiovisual Speech Perception. Trends in Amplification, 11(4), 233–241. https://doi.org/10.1177/1084713807307409
van de Sande, I., & Crasborn, O. (2009). Lexically bound mouth actions in Sign Language of the Netherlands: A comparison between different registers and age groups. Linguistics in the Netherlands, 26(1), 78–90.
Vigliocco, G., Gu, Y., Grzyb, B., Motamedi, Y., Murgiano, M., Brekelmans, G., Brieke, R., & Perniss, P. (2021). A multimodal annotated corpus of dyadic communication (Manuscript in preparation).
Vigliocco, G., Krason, A., Stoll, H., Monti, A., & Buxbaum, L. J. (2020). Multimodal comprehension in left hemisphere stroke patients. Cortex, 133, 309–327. https://doi.org/10.1016/j.cortex.2020.09.025
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag. https://ggplot2.tidyverse.org
Willems, R. M., Özyürek, A., & Hagoort, P. (2009). Differential roles for left inferior frontal and superior temporal cortex in multimodal integration of action and language. NeuroImage, 47(4), 1992–2004. https://doi.org/10.1016/j.neuroimage.2009.05.066
Wu, Y. C., & Coulson, S. (2007). Iconic gestures prime related concepts: An ERP study. Psychonomic Bulletin & Review, 14(1), 57–63. https://doi.org/10.3758/BF03194028
Zhang, Y., Ding, R., Frassinelli, D., Tuomainen, J., Klavinskis-Whiting, S., & Vigliocco, G. (2021a). Electrophysiological signatures of multimodal comprehension in second language (Manuscript in preparation).
Zhang, Y., Frassinelli, D., Tuomainen, J., Skipper, J. I., & Vigliocco, G. (2021b). More than words: Word predictability, prosody, gesture and mouth movements in natural language comprehension. Proceedings of the Royal Society B: Biological Sciences, 288(1955), 20210500. https://doi.org/10.1098/rspb.2021.0500

Acknowledgements

The research was supported by the European Research Council grant 743035 awarded to G.V. G.V. was also supported by a Royal Society Wolfson Research Merit Award (WRM\R3\170016). We would like to thank Dr. Linda Drijvers for sharing her Praat script that was used to vocode the speech stimuli.

Author note

Portions of these findings were presented as a poster at the 2020 Virtual Psychonomics.

Funding

The study was funded by the European Research Council grant (743035) awarded to G.V. G.V. was also supported by a Royal Society Wolfson Research Merit Award (WRM\R3\170016).

Author information

Authors and Affiliations

Division of Psychology and Language Science, University College London, 26 Bedford Way, London, WC1H 0AP, UK
Anna Krason, Rebecca Fenton, Rosemary Varley & Gabriella Vigliocco

Corresponding author

Correspondence to Anna Krason.

Ethics declarations

Conflicts of interest/Competing interests

The authors declare no conflict of interest.

Ethics approval

The study was ethically approved by the UCL Research Ethics Committee (0143/003). The procedures used in this study adhere to the tenets of the Declaration of Helsinki.

Consent to participate

Informed consent was obtained from all individual participants included in the study.

Consent for publication

The actresses who helped with stimuli recordings signed informed consent regarding publishing their photographs.

Additional information

Open practice statement

The data, the Praat script, picture materials, and the R code are available at https://osf.io/gudj6/. None of the experiments was preregistered.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

ESM 1

(PDF 104 kb)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Krason, A., Fenton, R., Varley, R. et al. The role of iconic gestures and mouth movements in face-to-face communication. Psychon Bull Rev 29, 600–612 (2022). https://doi.org/10.3758/s13423-021-02009-5

Accepted: 06 September 2021
Published: 20 October 2021
Issue Date: April 2022
DOI: https://doi.org/10.3758/s13423-021-02009-5
