Categorizing coordination from the perception of joint actions
The ability to perceive others’ actions and coordinate our own body movements accordingly is essential for humans to interact with the social world. However, it is still unclear how the visual system achieves the remarkable feat of identifying temporally coordinated joint actions between individuals. Specifically, do humans rely on certain visual features of coordinated movements to facilitate the detection of meaningful interactivity? To address this question, participants viewed short video sequences of two actors performing different joint actions, such as handshakes and high fives. Temporal misalignments were introduced to shift one actor’s movements forward or backward in time relative to the partner actor. Participants rated the degree of interactivity for the temporally shifted joint actions. The impact of temporal offsets on human interactivity ratings varied for different types of joint actions. Based on human rating distributions, we used a probabilistic cluster model to infer latent categories, each revealing shared characteristics of coordinated movements among sets of joint actions. Further analysis of the clustered structure suggested that global motion synchrony, spatial proximity between actors, and highly salient moments of interpersonal coordination are critical features that impact judgments of interactivity.
Keywords: Biological motion · Joint action · Perception and action · Perceptual categorization and identification
Our ability to perceive the actions of others and coordinate them with our own body movements is essential for interacting with the social world (Sebanz & Knoblich, 2009; Sebanz, Bekkering, & Knoblich, 2006). Human social interaction makes it possible to adapt and update our own actions in response to changes in other individuals’ movements, their emotional state, and intentions (Miles, Nind, & Macrae, 2009; Richardson, Marsh, & Baron, 2007a). Hence, joint actions play an important role in strengthening interpersonal connections to achieve a common goal (Lakin, Jefferis, Cheng, & Chartrand, 2003; Michael, Sebanz, & Knoblich, 2016). In many social situations, joint actions often rely on executing bodily movements in a timely manner. Time-critical execution requires varying degrees of temporal coordination and precise, moment-to-moment control over our own limbs and body (Pezzulo, Donnarumma, & Dindo, 2013; Vesper, Schmitz, Safra, Sebanz, & Knoblich, 2016).
During passive viewing, however, inferring that two people form a coordinated unit requires that the human visual system detect a set of features indicating social cooperation (Bernieri, 1988; Richardson, Marsh, Isenhower, Goodman, & Schmidt, 2007b). Researchers have found evidence that both bottom-up and top-down processes facilitate joint action identification. de la Rosa et al. (2013) showed a connection between social interaction recognition performance and the velocity of specific body joints (e.g., the arms, feet, and hips), and Thurman and Lu (2014) found that participants can identify spatially scrambled displays of human dancers as long as the movement of individual joints was congruent with the motion of the global bodies. Point-light actions that involve another person have also been shown to be recognized more readily, even when embedded in a noisy background (Manera et al., 2011; Neri, Luu, & Levi, 2006; Su, van Boxtel, & Lu, 2016).
The present study investigated a common set of visual features for inferring cooperation while observing point-light displays of joint actions. If inferences regarding cooperation depend on the perception of critical visual features, do groups of features emerge to form categories that help differentiate between types of joint actions? We manipulated the temporal alignment between the actors’ movements to measure sensitivity to motion synchrony and cooperation. To gauge a common set of characteristics among ten different joint actions, we measured how human judgments of interactivity change as a function of temporal misalignment. Previous research has shown that recognition sensitivity varies across categories of individual actions (Dittrich, 1993; van Boxtel & Lu, 2011) and that social interactions can be categorized at multiple levels (de la Rosa et al., 2014). We therefore used a statistical clustering model (latent Dirichlet allocation) to categorize joint actions based on shifts in “interactivity” ratings under temporal misalignment of the actors’ body movements.
A total of 55 undergraduate students (M age = 21.4, SD age = 4.3, 42 females) at the University of California, Los Angeles (UCLA) with normal or corrected-to-normal vision enrolled in the study. Participants gave informed consent as approved by the UCLA Institutional Review Board and were provided with course credit in exchange for their participation.
Stimuli and materials
Actors were presented as skeletal figures by connecting 17 body markers on each actor. Body marker coordinates were scaled with the BioMotion Toolbox (van Boxtel & Lu, 2013). The skeletal outlines were white lines with a thickness of 0.17° of visual angle, superimposed with white dots (0.2° in diameter) as body joint markers, on a black background. The viewing distance was 34.5 cm, and stimuli were displayed on a 60-Hz CRT monitor. All joint actions and additional visual angle information are shown in Figure 2.
After viewing two actors engaged in a joint action, either in its synchronized form or with a temporal offset, participants were asked to rate “the degree to which the two actors appear to be interacting” on a 7-point scale, with 1 indicating “certainly not interactive” and 7 indicating “certainly interactive.” Participants were permitted up to 20 seconds to respond, allowing ample time to rate the interaction. They were not provided with a definition of “interacting,” to avoid biasing them toward specific visual indicators of interpersonal actions, nor were they explicitly informed of the temporal misalignment between actors. The experiment included a total of 280 randomized trials (7 offsets × 10 joint actions × 4 repetitions). After every 50 trials, participants could take a short break of up to 20 seconds before continuing with the experiment.
We first conducted an analysis to assess changes in ratings due to joint action type, temporal offset, and the interaction between the two factors. We fit a linear mixed-effects model with participant as a random effect and a by-participant random slope for temporal offset. The offset factor consisted of four levels: the synchronized action with no offset (level 0) and three levels of absolute offset magnitude (levels 1–3, omitting indicators of leading or lagging offsets). We found a significant interaction between joint action and temporal offset using either an ANOVA with Satterthwaite approximations of the degrees of freedom (F(27, 15290) = 8.5, p < 0.001) or a likelihood-ratio test between models with and without the interaction term (χ²(27) = 227.5, p < 0.001). The significant interaction indicates that participant ratings vary with temporal misalignment for some subset of actions. We further examined the effect of temporal misalignment on each action by analyzing the shapes of the discrete interactivity rating distributions.
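The likelihood-ratio comparison between the nested models reduces to a simple formula. The sketch below illustrates it; the log-likelihood values are hypothetical placeholders chosen for illustration, not the fitted values from the study.

```python
def likelihood_ratio_stat(llf_reduced, llf_full):
    """Likelihood-ratio test statistic for two nested models.

    llf_reduced: maximized log-likelihood of the model without the
        joint action x temporal offset interaction.
    llf_full: maximized log-likelihood of the model including it.
    The statistic 2 * (llf_full - llf_reduced) is asymptotically
    chi-square distributed with df equal to the difference in the
    number of parameters (27 extra interaction terms here).
    """
    return 2.0 * (llf_full - llf_reduced)

# Hypothetical log-likelihoods, for illustration only.
stat = likelihood_ratio_stat(llf_reduced=-25000.0, llf_full=-24886.25)
print(stat)  # 227.5
```

The resulting statistic would then be compared against the chi-square(27) distribution to obtain the p-value.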
Shape of rating distributions
Some actions in Figure 2 resulted in flat rating distributions across offsets, indicating that temporal misalignment had a negligible effect on interactivity ratings (e.g., arguing & gesturing and tug-of-war), whereas other actions showed a peak in the distribution at offset zero, indicating sensitivity to temporally aligned actions. We developed a “peakedness” index for these discrete offset distributions that captures the change in sensitivity between truly coupled joint actions and temporally misaligned actions. The peakedness index was calculated by subtracting the mean rating across all non-zero offsets from the rating at offset zero (the unaltered, synchronous action). Because the zero-offset condition corresponds to the truly coupled, synchronized, and coordinated action, a large rating difference between the zero-offset condition and the non-zero offset conditions reflects higher sensitivity to temporal coordination as a signal of interactivity between actors. As the peakedness index approaches zero, the rating distribution becomes flatter, indicating that interactivity ratings do not discriminate between truly coupled joint actions and temporally misaligned actions.

A second characteristic of the offset distribution is asymmetry, which reflects participants’ sensitivity to the direction of temporal offsets. For example, the joint action approach & high-five was rated as less interactive when the second actor lagged behind the reference actor, but more interactive when shifted forward in time. We quantified asymmetry by subtracting the mean rating for the negative offset conditions from the mean rating for the positive offset conditions (omitting the zero-offset condition). Note that these indices differ from the statistical moments of kurtosis and skewness, which also measure the peakedness and symmetry of a distribution: our indices are calculated using the zero-offset condition as a reference point, rather than the mean and standard deviation of a random variable as in the statistical definitions.
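Both indices can be computed directly from the mean rating at each offset. A minimal sketch follows; the offset labels and rating values are illustrative, not the study’s data, and the sign convention assumed here makes a peaked distribution yield a positive peakedness value.

```python
def peakedness(ratings_by_offset):
    """Peakedness index: rating at offset zero minus the mean rating
    across all non-zero offsets. Values near zero indicate a flat
    distribution (ratings insensitive to temporal misalignment)."""
    zero_rating = ratings_by_offset[0]
    others = [r for off, r in ratings_by_offset.items() if off != 0]
    return zero_rating - sum(others) / len(others)

def asymmetry(ratings_by_offset):
    """Asymmetry index: mean rating for positive offsets minus mean
    rating for negative offsets (zero-offset condition omitted)."""
    pos = [r for off, r in ratings_by_offset.items() if off > 0]
    neg = [r for off, r in ratings_by_offset.items() if off < 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)

# Illustrative mean ratings for seven offset levels (-3..3) of one action.
ratings = {-3: 3.0, -2: 3.5, -1: 4.0, 0: 6.0, 1: 4.5, 2: 4.0, 3: 3.5}
print(peakedness(ratings))  # 6.0 - mean(3.0, 3.5, 4.0, 4.5, 4.0, 3.5) = 2.25
print(asymmetry(ratings))   # mean(4.5, 4.0, 3.5) - mean(3.0, 3.5, 4.0) = 0.5
```

A flat dictionary of identical ratings would give a peakedness of zero, matching the interpretation of flat distributions above.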
Categorization of joint actions
To visualize the similarity between joint action categories, further analyses examined the matrix containing the clustering probability of each joint action for each class. We first computed the Hellinger distance, which is well suited to comparing probability distributions, and then applied multidimensional scaling (MDS) to the distance matrix to visualize the categories of joint actions (Figure 4, bottom right). We found low interactivity ratings but high sensitivity to synchronized coordination for joint actions such as passing an object; medium interactivity ratings and greater tolerance to temporal misalignment for joint actions such as tug-of-war; and high interactivity ratings together with high sensitivity to temporal coordination for joint actions such as salsa dancing. Thus, interactivity ratings can be used to categorize joint actions according to two critical features: tolerance to temporal offset and the degree of interactivity involved in a joint action. Joint actions with a high probability for a single class (indicated by node size) lie farther from the center (e.g., salsa dancing), whereas actions more likely to share multiple classes lie near the center, with smaller nodes (e.g., shake hands).
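The Hellinger distance between two discrete distributions p and q is H(p, q) = (1/√2)·√Σᵢ(√pᵢ − √qᵢ)². A minimal sketch of building the distance matrix from per-action class-membership probabilities is shown below; the membership values are made up for illustration (in the study they come from the clustering model), and the resulting matrix would then be passed to an MDS routine.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions
    (sequences that each sum to 1). Ranges from 0 (identical) to 1
    (disjoint support)."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s) / math.sqrt(2.0)

# Hypothetical class-membership probabilities for three joint actions.
memberships = {
    "salsa dancing": [0.90, 0.05, 0.05],
    "shake hands":   [0.40, 0.35, 0.25],
    "tug-of-war":    [0.10, 0.80, 0.10],
}
actions = list(memberships)
dist = [[hellinger(memberships[a], memberships[b]) for b in actions]
        for a in actions]
# dist is symmetric with a zero diagonal, suitable as a precomputed
# dissimilarity matrix for any MDS implementation.
print(round(dist[0][2], 3))  # 0.655
```

Actions concentrated on a single class (like the hypothetical salsa dancing row) end up far from actions that spread probability across classes, which is what produces the center-versus-periphery layout described above.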
In the current study, we tested the visual perception of social cooperation under the hypothesis that temporal misalignment has a graded influence on interactivity ratings, which may reflect the use of graded features to guide these judgments. Although the clustering analysis makes no assumptions about the underlying nature of joint action categories, we can examine common properties among categories by comparing characteristics of joint actions given their category membership. To gauge which visual features may play an important role in judging social cooperation from observed joint actions, we conjectured that manipulating the temporal misalignment of a joint action affects the reliability of detecting critical features, which results in systematic changes in the rating distributions of actions in the same category. We specifically examined the potential role of three visual features in identifying the interactivity of joint actions: global motion synchrony, spatial proximity, and salient moments of interpersonal coordination.
Average interactivity ratings and average distance between actors (in horizontal visual angle units), by action category. As the distance between actors increases, the average rating within the corresponding category decreases.

| Action category | Avg. rating score (SD) | Avg. horz. angle (SD) |
| --- | --- | --- |
| Greet & shake hands; approach & high-five; playing catch; passing an object | | |
| Tug-of-war; chicken dancing | | |
| Arguing & gesturing | | |
| Salsa dancing; circular skipping; threaten | | |
In addition to visual features signaling joint action activity, other high-level cognitive processes may also be involved. For example, we observed that the temporal manipulations yielded asymmetrical rating averages between leading and lagging conditions. This may result from the distinct roles each actor plays in some of the actions, consistent with the literature on real-time action prediction (Graf et al., 2007) and causal relations in actions (Peng, Thurman, & Lu, 2017). For example, the threaten stimulus displayed the attacker as the reference actor while shifting the defender. When the defender’s body movements were shifted ahead of the attacker’s in time, ratings were higher than for lagging offsets, suggesting that people may use top-down cues from the defender’s movement to predict the attacker’s reaction. For other joint actions, the asymmetry effect may be due to bottom-up processing of unnatural body displacements when movements are shifted in a negative or positive temporal direction. Circular skipping, for example, shows changes in skipping speed from beginning to end; in this case, temporal misalignment can yield unnatural events, such as actors colliding with one another or showing noticeably different velocities of body displacement for negative offsets.
Joint action recognition may manifest itself at a later stage of processing after deciding that two individuals are engaged in social cooperation, in which previous experience, motor repertoires, and goal-oriented inference may play a role in recognition and further interpretation of the nature of joint action (Casile & Giese, 2006). Joint actions that generated high ratings of interactivity, and were sensitive to temporal shifts in movement, likely had multiple visual features that observers could use when interpreting the degree of interpersonal coordination. Further work is necessary to examine how individuals judging interactivity weigh multiple competing visual features among a wide range of joint actions. This line of research can provide insight into how people evaluate other humans in terms of their potential to cooperate and help execute specific action goals.
- de la Rosa, S., Mieskes, S., Bülthoff, H. H., & Curio, C. (2013). View dependencies in the visual recognition of social interactions. Frontiers in Psychology, 4. doi: https://doi.org/10.3389/fpsyg.2013.00752
- Richardson, M. J., Marsh, K. L., Isenhower, R. W., Goodman, J. R. L., & Schmidt, R. C. (2007b). Rocking together: Dynamics of intentional and unintentional interpersonal coordination. Human Movement Science, 26(6), 867–891. doi: https://doi.org/10.1016/j.humov.2007.07.002
- Shu, T., Thurman, S. M., Chen, D., Zhu, S.-C., & Lu, H. (2016). Critical features of joint actions that signal human interaction. In A. Papafragou, D. Grodner, D. Mirman, & J. C. Trueswell (Eds.), Proceedings of the 38th Annual Conference of the Cognitive Science Society (pp. 574–579). Austin, TX: Cognitive Science Society.
- van Boxtel, J. J. A., & Lu, H. (2013). A biological motion toolbox for reading, displaying, and manipulating motion capture data in research settings. Journal of Vision, 13(12). doi: https://doi.org/10.1167/13.12.7