The communicative advantage: how kinematic signaling supports semantic comprehension

Trujillo, James P.; Simanova, Irina; Bekkering, Harold; Özyürek, Asli

doi:10.1007/s00426-019-01198-y

Original Article
Open access
Published: 11 May 2019

Volume 84, pages 1897–1911, (2020)

Psychological Research

James P. Trujillo^1,2 (ORCID: orcid.org/0000-0003-4713-376X), Irina Simanova^1, Harold Bekkering^1 & Asli Özyürek^2,3

Abstract

Humans are unique in their ability to communicate information through representational gestures which visually simulate an action (e.g., moving hands as if opening a jar). Previous research indicates that the intention to communicate modulates the kinematics (e.g., velocity, size) of such gestures. If and how this modulation influences addressees’ comprehension of gestures has not been investigated. Here we ask whether communicative kinematic modulation enhances semantic comprehension (i.e., identification) of gestures. We additionally investigate whether any comprehension advantage is due to enhanced early identification or late identification. Participants (n = 20) watched videos of representational gestures produced in a more- (n = 60) or less-communicative (n = 60) context and performed a forced-choice recognition task. We tested the isolated role of kinematics by removing visibility of actors’ faces in Experiment I, and by reducing the stimuli to stick-light figures in Experiment II. Three video lengths were used to disentangle early identification from late identification. Accuracy and response time quantified main effects. Kinematic modulation was tested for correlations with task performance. We found higher gesture identification performance in more- compared to less-communicative gestures. However, early identification was only enhanced within a full visual context, while late identification occurred even when viewing isolated kinematics. Additionally, temporally segmented acts with more post-stroke holds were associated with higher accuracy. Our results demonstrate that communicative signaling, interacting with other visual cues, generally supports gesture identification, while kinematic modulation specifically enhances late identification in the absence of other cues. Results provide insights into mutual understanding processes as well as creating artificial communicative agents.

Introduction

Human communication is multimodal, utilizing various signals to convey meaning and interact with others. Indeed, humans may be uniquely adapted for knowledge transfer, with the ability to signal the intention to interact as well as to manifest the knowledge that s/he wishes to communicate (Csibra & Gergely, 2006). This communicative signaling system is powerful in that the signals are dynamically adapted for the context in which they are used. For example, representational gestures (Kendon, 2004; McNeill, 1994) show systematic modulations dependent upon the communicative or social context in which they occur (Campisi & Özyürek, 2013; Galati & Galati, 2015; Gerwing & Bavelas, 2004; Holler & Beattie, 2005). Although these gestures are an important aspect of human communication, it is currently unclear how the addressee benefits from this communicative modulation. The current study aims to investigate for the first time whether and how kinematic signaling enhances identification of representational gestures.

There is growing evidence that adults modulate their action and gesture kinematics when communicating with other adults, depending on the communicative context. For example, adults adapt to addressees’ knowledge by producing gestures that are larger (Bavelas, Gerwing, Sutton, & Prevost, 2008; Campisi & Özyürek, 2013), more complex (Gerwing & Bavelas, 2004; Holler & Beattie, 2005), and higher in space (Hilliard & Cook, 2016) when conveying novel information. Instrumental actions intended to teach show similar kinematic modulation, including spatial (McEllin, Knoblich, & Sebanz, 2018; Vesper & Richardson, 2014) and temporal (McEllin et al., 2018) exaggeration. Evidence from our own lab corroborates these findings of spatial and temporal modulation in the production of both actions and gestures. In our recent work, we quantified the spatial and temporal modulation of actions and pantomime gestures (used without speech) in a more- relative to a less-communicative context (Trujillo, Simanova, Bekkering, & Özyürek, 2018). We showed that spatial and temporal features of actions and pantomime gestures are adapted to the communicative context in which they are produced.

A computational account by Pezzulo, Donnarumma, and Dindo (2013) suggests that modulation makes meaningful acts communicative by disambiguating the relevant information, effectively making the intended movement goal clear to the observer. This framework focuses on actions, but could be extended to gestures. One recent experimental study directly assessed how kinematic modulation affects gesture comprehension. By combining computationally based robotic production of gestures with validation through human comprehension experiments, Holladay, Dragan, and Srinivasa (2014) showed that spatial exaggeration of kinematics allows observers to more easily recognize the target of pointing gestures. Similarly, Gielniak and Thomaz (2012) showed that when robot co-speech gestures are kinematically exaggerated, the content of an interaction with that robot is better remembered. Another study used an action-based leader–follower task to show that task leaders not only systematically modulate task-relevant kinematic parameters, but these modulations are linked to better performance of the followers (Vesper, Schmitz, & Knoblich, 2017).

These previous studies suggest that the kinematic modulation of communicative movements (e.g., actions and gestures) serves to clarify relevant information for the addressee. However, it remains unclear whether this also holds for more complex human movements, such as pantomime gestures. This question is important for our understanding of human communication given that complex representations form an important part of the communicative message (Kelly, Ozyurek, & Maris, 2010; Özyürek, 2014).

The mechanism by which kinematic modulation might support semantic comprehension, or identification, of complex movements remains unclear. Several studies suggest disambiguation of the ongoing act, either through temporal segmentation of relevant parts (Blokpoel et al., 2012; Brand, Baldwin, & Ashburn, 2002), or spatial exaggeration of relevant features (Brand et al., 2002) as the mechanism. In the case of disambiguation, the “semantic core” (Kendon, 1986), or meaningful part of the movement, is made easier to understand as it unfolds. However, there is also evidence suggesting that early kinematic cues provide sufficient information to inform accurate prediction of whole actions before they are seen in their entirety (Cavallo, Koul, Ansuini, Capozzi, & Becchio, 2016; Manera, Becchio, Cavallo, Sartori, & Castiello, 2011). One study, for example, used videos of a person walking, and at a pause in the video participants were asked whether the actress in the video would continue to walk, or start to crawl. The authors showed that whole-body kinematics could support predictions about the outcome of an ongoing action (Stapel, Hunnius, & Bekkering, 2012). However, another study showed videos of a person reaching out and grasping a bottle, and asked the participants to predict the next sequence in the action (e.g., to drink, to move, to offer) and found that they were unable to use such early cues for accurate identification in this more complex, open-ended situation (Naish, Reader, Houston-Price, Bremner, & Holmes, 2013). Furthermore, identification of pantomime gestures has previously been reported to be quite low when no contextual (i.e., object) information is provided (Osiurak, Jarry, Baltenneck, Boudin, & Le Gall, 2012). 
Given these inconsistencies in the literature, an open question remains: are early kinematic cues sufficient to inform early representational gesture identification, or does kinematic modulation primarily aid gesture identification as the movements unfold (i.e., late identification)?

Finally, to understand how kinematic modulation might support gesture identification, it is important to consider other factors that might influence the semantic comprehension of an observer. In a natural environment, movements such as gestures are accompanied by additional communicative signals, such as facial expression and eye-gaze, and/or finger kinematics relevant in the execution of the gestures. Humans are particularly sensitive to the presence of human faces, which naturally draw attention (Cerf, Harel, Einhäuser, & Koch, 2007; Hershler & Hochstein, 2005; Theeuwes & Van der Stigchel, 2006). This effect is most prominent in the presence of mutual gaze (Farroni, Csibra, Simion, & Johnson, 2002; Holler et al., 2015), but also occurs in averted gaze compared to non-face objects (Hershler & Hochstein, 2005). Hand-shape information can also provide clues as to the object one is manipulating (Ansuini et al., 2016), and more generally the kinematics of the hand and fingers together provide early cues to upcoming actions (Becchio, Koul, Ansuini, Bertone, & Cavallo, 2018; Cavallo et al., 2016), which together may allow the act to be more easily identified. To understand the role of kinematic modulation in communication, the complexity of the visual scene must also be taken into account.

In sum, previous studies show kinematic modulation occurring as a communicative cue in actions and gestures. While research suggests that this modulation serves to enhance comprehension, this has not been assessed directly in terms of semantic comprehension of complex movements, such as representational gestures. Furthermore, it is currently unclear if improved comprehension would be driven by early action identification or by late identification of semantics, and which kinematic features provide this advantage.

The current study addresses these questions. In two experiments, naïve participants perform a recognition task of naturalistic pantomime gestures recorded in our previous study (Trujillo, Simanova et al., 2018). In the first experiment, they see the original videos with the face of the actor either visible or blurred, to control for eye-gaze effects. In the second experiment, the same videos are reduced to stick-light figures, reconstructed from Kinect motion tracking data. The stick figure videos allow us to test the contribution of specific kinematic features, because only the movements are visible, but not the face or hand shape. In both experiments, we additionally manipulate video length to test whether any communicative benefit is driven more by early identification (resulting in differences only in the initial fragment), or late identification (resulting in differences in the medium and full fragments). Experiment II provides an additional exploratory test of the contribution of specific kinematic features to gesture identification.

We hypothesize that kinematic modulation serves to enhance semantic legibility. As early kinematic information is less reliable for open-ended action prediction (Naish et al., 2013) and pantomime gestures may generally be difficult to identify without context (Osiurak et al., 2012), we expect better recognition scores for the communicative gestures in the medium fragments and full fragments compared to initial fragments. We furthermore predict that performance will correlate with stronger kinematic modulation. Additionally, we expect performance to be lower overall with stick-light figures, compared to the full videos due to decreased visual information, but with a similar pattern (i.e., better performance in medium and full fragments compared to initial). For our exploratory test, we expect that exaggeration of both spatial and temporal kinematic features will contribute to better gesture identification.

Experiment I: Full visual context

Our first experiment, with actual videos of the gestures, was designed to test (1) whether kinematic modulations lead to improved semantic comprehension in an addressee, (2) whether the advantage is better explained by early identification or late identification of the gestures, and (3) whether the effect is altered by removing a salient part of the visual context, the actor’s face.

Methods

Participants

Twenty participants were included in this study (mean age = 28; 16 female), recruited from Radboud University. Participants were selected on the criteria of being aged 18–35, right-handed and fluent in the Dutch language, with no history of psychiatric disorders or communication impairments. The procedure was approved by a local ethics committee and informed consent was obtained from all individual participants in this study.

Materials

Each participant performed the recognition task with 60 videos of pantomimes that differed in their context (more or less communicative), video duration (short, medium and full), and face visibility (face visible vs. blurred). A detailed description of the video recordings, selection and manipulation follows below.

Video recording procedure

Stimuli were derived from a previous experiment (Trujillo, Simanova et al., 2018). In this previous experiment, participants (henceforth, actors) were filmed while seated at a table, with a camera hanging in front of the table. Motion-tracking data were acquired using a Microsoft Kinect system hanging slightly to the left of the camera. Each actor performed a set of 31 gestures, either in a more-communicative or a less-communicative setting (described below). Gestures consisted of simple object-directed acts, such as cutting paper with scissors or pouring water into a cup. Target objects were placed on the table (e.g., scissors and a sheet of paper for the item ‘cut the paper with the scissors’) but actors were instructed to perform as if they were acting on the objects, without actually touching them. For each item, actors began with their hands placed on designated starting points on the table (marked with tape). After placing the target object(s) on the table, the experimenter moved out of view from the participant and the camera, and recorded instructions were played. Immediately following the instructions, a bell sound was played, which indicated that the participant could begin with the pantomime. Once the act was completed, actors returned their hands to the indicated starting points, which elicited another bell sound, and waited for the next item. For this study, videos began at the first bell sound and ended at the second bell sound. In the more-communicative context we introduced a confederate who sat in an adjacent room and was said to be watching through the video camera and learning the gestures from the participant. In this way, an implied communicative context was created. In the less-communicative context, the same confederate was said to be learning the experimental setup. The less-communicative context was, therefore, exactly matched, including the presence of an observer, but only differed in that there was no implied interaction. Despite the subtle task manipulation, our previous study (Trujillo, Simanova et al., 2018) showed robust differences in kinematics between the gestures produced in the more-communicative vs. the less-communicative context.

Kinematic feature quantification

For the current study, we used the same kinematic features that were quantified in our earlier study (Trujillo, Simanova et al., 2018). We used a toolkit for markerless automatic analysis of kinematic features, developed earlier in our group (Trujillo, Vaitonyte, Simanova, & Özyürek, 2018). The following briefly describes the feature quantification procedure: all features were measured within the time frame between the beginning and the ending bell sound. Motion-tracking data from the Kinect provided measures for our kinematic features, and all raw motion-tracking data were smoothed using the Savitzky–Golay filter with a span of 15 and degree of 5. As described in our previous work (Trujillo, Simanova et al., 2018), this smoothing protocol was used as it brought the Kinect data closely in line with simultaneously recorded optical motion-tracking data in a separate pilot session. The following features were calculated from the smoothed data: Distance was calculated as the total distance traveled by both hands in 3D space over the course of the item. Vertical amplitude was calculated on the basis of the highest space used by either hand in relation to the body. Peak velocity was calculated as the greatest velocity achieved with the right (dominant) hand. Hold time was calculated as the total time, in seconds, counting as a hold. Holds were defined as an event in which both hands and arms are still for at least 0.3 s. Submovements were calculated as the number of individual ballistic movements made, per hand, throughout the item. To account for the inherent differences in the kinematics of the various items performed, z scores were calculated for each feature/item combination across all actors including both conditions. This standardized score represents the modulation of that feature, as it quantifies how much greater or smaller the feature was when compared to the average of that feature across all of the actors. 
(Addressee-directed) Eye-gaze was coded in ELAN as the proportion of the total duration of the video in which the participant is looking directly into the camera. For a more detailed description of these quantifications, see Trujillo, Simanova et al. (2018). Also note that the kinematic features calculated using this protocol are in line with the same features manually annotated from the video recordings (Trujillo, Vaitonyte et al., 2018). This supports our assumption that the features calculated from the motion-tracking data represent qualities that are visible in the videos.
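
As an illustration of the feature definitions above, the following is a minimal sketch of how distance, peak velocity, hold time, and the per-item z scores could be computed from already-smoothed hand positions. It is not the toolkit of Trujillo, Vaitonyte et al. (2018). The function names, the `still_speed` threshold used to decide when a hand counts as still, and the use of the population standard deviation are all assumptions made for this example. The Savitzky–Golay smoothing step is not reproduced here; a library routine such as SciPy's `savgol_filter` could plausibly be applied beforehand. Vertical amplitude and submovements are omitted.

```python
import math
from statistics import mean, pstdev


def path_length(points):
    """Total 3D distance travelled along a list of (x, y, z) samples."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def peak_velocity(points, fps):
    """Largest frame-to-frame speed, in position units per second."""
    return max(math.dist(a, b) * fps for a, b in zip(points, points[1:]))


def hold_time(left, right, fps, still_speed, min_hold_s=0.3):
    """Seconds spent in holds: runs of at least min_hold_s in which both
    hands move slower than still_speed (an assumed stillness threshold)."""
    def speeds(points):
        return [math.dist(a, b) * fps for a, b in zip(points, points[1:])]

    still = [l < still_speed and r < still_speed
             for l, r in zip(speeds(left), speeds(right))]
    total, run = 0.0, 0
    for flag in still + [False]:  # the sentinel closes a trailing run
        if flag:
            run += 1
        else:
            if run / fps >= min_hold_s:
                total += run / fps
            run = 0
    return total


def z_scores(values):
    """Standardize one feature for one item across all actors and both
    contexts (population standard deviation is an assumption)."""
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]
```

In the study, distance sums over both hands, so `path_length` would be called once per hand and the results added, while `peak_velocity` would be applied to the right (dominant) hand only.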

Inclusion and randomization

Our stimulus set included 120 videos (of the 2480) recorded in our previous study (Trujillo, Simanova et al., 2018). Our selection procedure (see Appendix 1) ensured that our stimulus set in the present experiment included an equal number of more- and less-communicative videos. Each of the 31 gesture items from the original set was included a minimum of three times and maximum of four times across the entire selection, performed by different actors, while ensuring that each item also appeared at least once in the more-communicative context and once in the less-communicative context. Three videos from each actor in the previous study were included. Appendix 2 provides the full list of gesture items. Supplementary Figure 1 illustrates the range of kinematics, gaze, and video durations included across the two groups in the current study with respect to the original dataset from Trujillo, Simanova et al. (2018). We ensured that the stimulus set for the present study matched the original dataset in terms of context-specific differences in the kinematics and eye-gaze, ensuring that the current stimulus set is a representative sample of the data shown in Trujillo, Simanova et al. (2018). These results are provided in Appendix 1.

Video segmentation

To test whether kinematic modulation primarily influences early or late identification (question 2), we divided the videos into segments of different length. Based on the previous literature (Kendon, 1986; Kita, van Gijn, & van der Hulst, 1998), we defined segments as follows: Wait covered the approximate 500 ms after the bell was played, but before the participant started to move. Reach to grasp covered the time during which the participant reached towards, and subsequently grasped, the target object. In the case of multiple objects, this segment ended after both objects were grasped. Prepare captured any movements unrelated to the initial reach to grasp that were not part of the main semantic aspect of the pantomime. Main movement covered any movements directly related to the semantic core of the item. Auxiliary captured any additional movements not directly related to the semantic core. Return object captured the movement of the hands back to the object’s starting position, depicting the object being replaced to its original location. Retract covered the movement of the hands back to the indicated starting position of the hands, until the end of the video. Note that the “prepare” and “auxiliary” segments were optional, and only coded when such movements were present. All other segments were present in all videos. Phases were delineated based on this segmentation. Phase 0 covered the “wait” segment. Phase 1 covered “reach to grasp” and “prepare”. Phase 2 covered the “main movement” and “auxiliary”. Phase 3 covered “return object” and “retract”. See Table 1 and Fig. 1 for examples of how these phases map onto specific parts of the movement.

Table 1 Movement phase examples

After defining the segments for each video, we divided the videos into three lengths, referred to as initial fragments (M = 3.27 ± 1.52 s), medium fragments (M = 4.62 ± 2.19 s), and full videos (M = 5.59 ± 2.53 s). Initial fragments consisted of only phase 0 and phase 1, medium fragments consisted of phases 0–2, and full videos contained all of the phases. An overview of these segments and phases can be seen in Fig. 1. We performed ANOVAs on each of the fragment lengths to ensure video durations of the same fragment length did not differ significantly across cells (see Supplementary Table 1 for statistics). This resulted in initial fragments only providing initial hand-shape and arm/hand/finger configuration information, medium fragments providing all relevant semantic information, and full videos providing additional eye-gaze (when present) and additional time for processing the information.
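
The mapping from annotated segments to phases, and from phases to the three fragment lengths, can be sketched as follows. This is only an illustration under stated assumptions: the annotation format (an ordered list of segment names with end times in seconds) and the function name `fragment_end_times` are hypothetical and are not taken from the original coding procedure.

```python
# Segment-to-phase mapping as described in the text.
SEGMENT_TO_PHASE = {
    "wait": 0,
    "reach to grasp": 1,
    "prepare": 1,          # optional segment
    "main movement": 2,
    "auxiliary": 2,        # optional segment
    "return object": 3,
    "retract": 3,
}

# Last phase contained in each fragment length.
FRAGMENT_LAST_PHASE = {"initial": 1, "medium": 2, "full": 3}


def fragment_end_times(segments):
    """Return the cut point (in seconds) of each fragment for one video.

    segments: ordered list of (segment_name, end_time_s) pairs
    (a hypothetical annotation format)."""
    phase_end = {}
    for name, end in segments:
        phase = SEGMENT_TO_PHASE[name]
        # A phase ends where its latest segment ends.
        phase_end[phase] = max(end, phase_end.get(phase, 0.0))
    return {frag: phase_end[last]
            for frag, last in FRAGMENT_LAST_PHASE.items()}
```

For a video annotated with made-up end times of 0.5 s (wait), 1.8 s (reach to grasp), 4.0 s (main movement), 4.9 s (return object) and 5.6 s (retract), the initial fragment would be cut at 1.8 s, the medium fragment at 4.0 s, and the full video would run to 5.6 s.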

Blurring

In all videos, a Gaussian blur was applied to the object, which was otherwise visible in the video. This ensured that the object could not be used to infer the action. To determine whether the face in general, in particular the gaze direction, has an effect on pantomime recognition, we also applied a Gaussian blur to the face in half of the videos. Blurring the faces in this way allowed us to manipulate the amount of available visual information, providing a first test for how kinematic modulation affects gesture identification in a less complete visual context (question 3). This was balanced so that each actor had at least one video with a visible face and one with a blurred face.

Task

Before beginning the experiment, participants received a brief description of the task to inform them of the nature of the stimuli. This ensured that the participants knew to expect incomplete videos in some trials. Participants were seated in front of a 24″ Benq XL2420Z monitor with a standard keyboard for responses. Stimuli were presented at a frame rate of 29 frames per second, with a display size of 1280 × 720. During the experiment, participants would first see a fixation cross for a period of 1000 ms with a jitter of 250 ms. One of the item videos was then displayed on the screen, after which the question appeared: “What was the action being depicted?” Two possible answers were presented on the screen, one on the left, and one on the right. Answers consisted of one verb and one noun that captured the action (e.g., the correct answer to the item “pour the water into the cup” was “pour water”). Correct answers were randomly assigned to one of the two sides. The second option was always one of the possible answers from the total set. Therefore, all options were presented equally often as the correct answer and as the wrong (distractor) option. Participants could respond with the 0 (left option) or 1 (right option) keys on the keyboard. Accuracy and response time (RT) were recorded for each video.
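
The structure of a single two-alternative trial can be sketched as below. This is a simplified illustration, not the presentation script used in the study. The function name `make_trial`, the interpretation of the jitter as a uniform offset of up to 250 ms in either direction, and the answer label "open jar" are assumptions. The sketch also draws the distractor at random and does not enforce the balancing described above, in which every option appears equally often as target and as distractor.

```python
import random


def make_trial(correct, all_answers, rng):
    """Build one forced-choice trial with a randomly placed correct answer."""
    # Distractor: any other answer from the total set (unbalanced random draw).
    distractor = rng.choice([a for a in all_answers if a != correct])
    correct_side = rng.choice(["left", "right"])
    if correct_side == "left":
        left, right = correct, distractor
    else:
        left, right = distractor, correct
    return {
        # 1000 ms fixation, jittered by up to 250 ms (assumed uniform).
        "fixation_ms": 1000 + rng.uniform(-250, 250),
        "left": left,
        "right": right,
        # Key 0 selects the left option, key 1 the right option.
        "correct_key": "0" if correct_side == "left" else "1",
    }
```

A response is then scored as accurate when the pressed key equals `correct_key`, and RT is measured from the onset of the question.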

Analysis

Main effects analyses: communicative context, fragment length, and visual context

Both RT and accuracy of identification judgments were calculated for each of 12 cells (Table 2): fragment length (initial fragment vs. medium fragment vs. full video) × face (blurred vs. visible) × context (more-communicative vs. less-communicative) in order to test (1) whether more-communicative gestures were identified faster or with higher accuracy (main effect of context), (2) whether performance was higher in only initial fragments (providing evidence for early identification theory) or only in medium fragments (providing evidence for late identification), as well as (3) whether face visibility impacted performance, which informs us whether there is an effect of visual information availability on the identification performance. Separate repeated-measures analyses of variance (RM-ANOVA) were run for accuracy and RT to test for the presence of main and interactional effects. We used Mauchly’s test of sphericity on each factor and interaction in our model and applied the Greenhouse–Geisser correction where appropriate.
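
To make the 3 × 2 × 2 design concrete, the per-cell accuracy and mean RT for one participant could be aggregated along the following lines before being entered into the RM-ANOVA. The trial record format and the function name `cell_summaries` are hypothetical, and averaging RT over all trials (rather than, for example, correct trials only) is an assumption, since the text does not specify this.

```python
from collections import defaultdict
from itertools import product
from statistics import mean

FRAGMENTS = ("initial", "medium", "full")
FACES = ("blurred", "visible")
CONTEXTS = ("more", "less")


def cell_summaries(trials):
    """Accuracy and mean RT per design cell for one participant.

    trials: iterable of dicts with keys 'fragment', 'face', 'context',
    'correct' (bool) and 'rt' (seconds); a hypothetical record format."""
    grouped = defaultdict(list)
    for t in trials:
        grouped[(t["fragment"], t["face"], t["context"])].append(t)

    summaries = {}
    for cell in product(FRAGMENTS, FACES, CONTEXTS):  # the 12 cells
        rows = grouped.get(cell, [])
        if rows:  # cells without trials are left out
            summaries[cell] = {
                "accuracy": mean(1.0 if r["correct"] else 0.0 for r in rows),
                "mean_rt": mean(r["rt"] for r in rows),
            }
    return summaries
```

Each participant then contributes one accuracy value and one RT value per cell, which is the shape of data a repeated-measures analysis expects.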

Table 2 Overview of analysis cells for Experiment I

Results: Experiment I

We used RM-ANOVA to test for a significant main effect of communicative context, fragment length, or face visibility on performance. In terms of accuracy, results of the fragment length × face visibility × communicative context RM-ANOVA showed a significant main effect of communicative context, F(1,19) = 2.912, p = 0.029, as well as a main effect of fragment length, F(2,38) = 53.583, p < 0.001, but no main effect of face visibility, F(1,19) = 0.050, p = 0.825. Planned comparisons revealed higher accuracy in the more-communicative context for initial fragments (more-communicative mean = 87.13%, less-communicative mean = 81.17%; t(18) = 3.025, p = 0.007), but there was no difference between contexts in the medium fragments (more-communicative context mean = 97.37%, less-communicative mean = 96.49%; t(18) = 0.785, p = 0.443) or full videos (more-communicative mean = 97.37%, less-communicative mean = 97.22%; t(18) = 0.128, p = 0.899). In sum, performance was higher overall on more-communicative compared to less-communicative videos, with specifically more-communicative initial fragments showing higher performance than less-communicative initial fragments. Accuracy, regardless of communicative context, was additionally higher in medium and full fragments compared to initial. See Fig. 2a for an overview of these results.

In terms of RT, results of the fragment length × face × context RM-ANOVA revealed a significant main effect of communicative context, F(1,19) = 5.699, p = 0.028, and of fragment length, F(2,38) = 192.489, p < 0.001, but not of face visibility, F(1,19) = 3.725, p = 0.069. Planned contrasts revealed faster RT in more-communicative compared to less-communicative initial fragments (more-communicative mean = 1.446 s; less-communicative mean = 1.583 s), t(19) = 3.824, p = 0.001, but faster RT for less- compared to more-communicative medium fragments (more-communicative mean = 1.094 s; less-communicative mean = 1.029 s), t(19) = 3.479, p = 0.003, and no difference between more- and less-communicative full videos (more-communicative mean = 1.094 s; less-communicative mean = 1.129 s), t(19) = 1.237, p = 0.231. We also found faster RT for medium fragments (M = 1.093 s) compared to initial fragments (M = 1.630 s), t(19) = 12.538, p < 0.001, as well as for medium fragments compared to full videos (M = 1.142 s), t(19) = 2.326, p = 0.031. In sum, RT was similar in both the more- and less-communicative contexts, but faster responses were seen in medium fragments compared to initial and full fragments. See Fig. 2b for an overview of these results.

Discussion: Experiment I

In our first experiment, we sought to determine how communicative modulation affects identification of pantomime gesture semantics. We found that pantomime gestures produced in a more-communicative context were better recognized when compared to those produced in a less-communicative context. Specifically, more-communicative initial fragments were recognized more accurately and faster than less-communicative initial fragments.

The higher accuracy in recognizing more- compared to less-communicative initial fragments suggests that at least some of the relevant information is available even in the earliest stages of the act, and that communicative modulation enhances this information. Since the face visibility did not contribute significantly to better performance, we suggest that improved comprehension may come from fine-grained kinematic cues, such as hand-shape and finger kinematics. As objects are known to have specific action and hand-shape affordances (Grèzes & Decety, 2002; Tucker & Ellis, 2001), hand shape can also provide clues as to the object being grasped, and thus also the upcoming action (Ansuini et al., 2016; van Elk, van Schie, & Bekkering, 2014). These results are therefore in line with the early prediction results described for action chains (Becchio, Manera, Sartori, Cavallo, & Castiello, 2012; Cavallo et al., 2016). Our results may also be explained by immediate comprehension. In other words, the visual information provided by the shape and configuration of the hands may be sufficiently clear to activate the semantic representation of the action without any prediction of the upcoming movements. Although we cannot determine the exact cognitive mechanism, we can conclude that communicative modulation supports comprehension through early action identification.

We found no evidence for higher accuracy in more- compared to less-communicative medium fragments, nor for full videos. It seems that the overall accuracy in medium and full fragments does not allow a difference to be found between the contexts. In both more- and less-communicative medium fragments, accuracy was above 96%, suggesting that ceiling level performance may have already been reached. This indicates that even if communicative modulation supports late identification, general task difficulty was not high enough in our task to allow us to find any difference. Surprisingly, faster RT was found for less- compared to more-communicative medium fragments. This unexpected result may reflect a trade-off between kinematic modulation, which is thought to be informative, and direct eye-gaze, which serves a communicative function but may not lead to faster responses. Along this line, Holler and colleagues (2012) argue that direct eye-gaze leads to a feeling of being addressed, which in turn forces the addressee to split their attention between the eyes and hands of the speaker. If this interpretation is correct, we would expect that although responses are faster for the less-communicative videos, accuracy should still be higher in the more-communicative videos. To draw any conclusions about how communicative modulation affects late identification, we suggest that it is necessary to increase task difficulty.

In sum, our results show that communicatively produced gestures are more easily recognized than less communicative gestures, and that this effect is explained by early action identification. This result is in line with the research on child-directed actions (Brand et al., 2002), as well as the more recent developments regarding early action identification based on kinematic cues (Ansuini, Cavallo, Bertone, & Becchio, 2014; Cavallo et al., 2016).

Experiment II: Isolated kinematic context

Although this first experiment shows evidence for a supporting role of kinematic modulation in semantic comprehension of gestures, it remains unclear whether the effect persists when only gross kinematics are observed, and facial information (including attentional cueing to the hands) and finger kinematics (including hand shape) are completely removed. Removing additional visual contextual information would therefore help to disentangle the effects of gross (i.e., posture and hands) kinematic modulation from other (potentially communicative) visual information. For example, while extensive research has looked at the early phase of action identification from hand and finger kinematics (Ansuini et al., 2016; Becchio et al., 2018; Cavallo et al., 2016), the higher level dynamics of the hands and arms, which we call gross kinematics, have not been well studied. This is particularly relevant as these high level kinematic features are similar to the qualities described in gesture research. Thus, in Experiment II we replicate Experiment I, but reduce the stimuli to present a visually simplistic scene consisting of only lines representing the limbs of the actor’s body. If kinematic modulation is driving the communicative advantage seen in our first experiment, we can expect the same effect pattern as seen in Experiment I. If other features of the visible scene, such as finger kinematics, provided the necessary cues for semantic comprehension, then the effect on early identification should no longer be present. Due to the visual information being highly restricted, we expect task difficulty to be increased.

In this way, we are able to determine if kinematic modulation supports early action identification in the absence of other early cues such as hand shape, and whether it supports ongoing semantic disambiguation when gesture recognition is more difficult. Overall, this experiment will build on our findings from Experiment I by providing a specific test of how kinematic modulation affects semantic comprehension when isolated from other contextual information. Additionally, it will test which specific kinematic features contribute to supporting semantic comprehension.