Does optic flow provide information about actions?

Masoner, Hannah L.; Hajnal, Alen

doi:10.3758/s13414-023-02674-9


Published: 14 March 2023

Volume 85, pages 1287–1303, (2023)

Attention, Perception, & Psychophysics


Abstract

Optic flow, the pattern of light generated in the visual field by motion of objects and the observer’s body, serves as information that underwrites perception of events, actions, and affordances. This visual pattern informs the observer about their own actions in relation to their surroundings, as well as those of others. This study explored the limits of action detection for others as well as the role of optic flow. First-person videos were created using camera recordings of the actor’s perspective as they performed various movements (jumping jacks, jumping, squatting, sitting, etc.). In three experiments participants attempted to detect the action from first-person video footage using open ended responses (Experiment 1), forced-choice responses (Experiment 2), and a match-to-sample paradigm (Experiment 3). It was discovered that some actions are more difficult to detect than others. When the task was challenging (Experiment 1) athletes were more accurate, but this was not the case in Experiments 2 and 3. All actions were identified above chance level across viewpoints, suggesting that invariant information was detected and used to perform the task.


Introduction

Have you ever sat in a driving simulator that you were not controlling, yet you knew which direction the virtual car was moving? Many studies have demonstrated that the dynamically changing pattern of scattered, reflected, and refracted light, i.e., optic flow, serves as information for guiding ambulatory activities such as steering and avoiding obstacles (Matthis et al., 2022; Warren et al., 2001). A related but understudied role of optic flow is to specify the activity of the agent in non-translatory activities. For example, the egocentric optic flow generated when a person does jumping jacks contains information that is unique to jumping jacks. By the same token, the egocentric optic flow pattern is different when a person is simply jumping. Can observers identify the activity and differentiate it from other similar activities by attending to the egocentric optic flow pattern? If you have ever played a first-person video game and were required to understand your avatar’s actions while your own body was stationary, you were successfully attending to the optic flow patterns that allowed you to recognize the actions undertaken by the agent. Another example of action detection from egocentric viewpoints is the use of body-worn cameras (BWCs) in law enforcement. The footage from these cameras not only reveals information about the visual field ahead but also reveals the movement of the officer wearing the camera. There is an ongoing debate about the ethics and efficiency of these devices, yet their footage is commonly used as evidence in court (Laming, 2019; Lum et al., 2020) to infer what the person wearing the camera was doing during law enforcement activity.

Furthermore, the detection of actions based on optic flow is not exclusive to humans. Artificial agents such as computers use algorithms that detect and analyze optic flow patterns in order to control self-driving vehicles such as cars and airplanes (Fan et al., 2019; Ruffier & Franceschini, 2005). This line of research traces its lineage to Gibson (1947), who first posed landing of airplanes as a perceptual problem for which optic flow can serve as information. Danafar and Gheissari (2007) explored the application of optic flow algorithms in computer vision when assessing surveillance footage from security cameras. Actions such as walking, jogging, clapping, and boxing were evaluated. The success rate in determining the actions performed was around 85% even though the videos were taken from different viewpoints and under different levels of illumination, suggesting that the pattern of motion of an action must be invariant across vantage points (Holte et al., 2010). Optic flow is also used in creating models that allow robots to more effectively interact with humans. Vignolo et al. (2017) created a computational model to allow robots to distinguish biomotion from non-biomotion so that they may advance the robots’ social behaviors. Noceti et al. (2019) have utilized optic flow patterns created by the performance of an actor to enhance the coupling of robot and human interaction when performing an action-timing task (e.g., walking in synchrony), much like humans learn to work together.

What is the nature of the information contained in optic flow that specifies actions? Relatedly, what aspects of optic flow do perceivers attend to in order to detect actions? The theory of kinematic specification of dynamics (KSD; Runeson & Frykholm, 1981) suggests that observers attend to and are able to detect the pattern of kinematic variables such as displacement and speed of visual elements of the optic array, and that these optic patterns specify actions and events. Runeson and Frykholm showed that the patterning of optic flow is detectable even in situations in which only a few visual elements are visible, such as in point-light displays (PLDs) of the joints of the body in motion (Johansson, 1973). Importantly, the information in egocentric optic flow patterns should be the same as the information in point-light displays of a moving body. Even though the body is not visible in the egocentric view, the consequences of the action are present in the optic flow pattern of the changing light intensities of visual elements of the ambient optic array as the egocentric viewpoint continuously changes location during body movements. Moreover, we can recognize actions without acting ourselves and without seeing the body of the actor (or an avatar of the body). The goal of the current study was to demonstrate that (1) perception of human activity is possible based on viewing egocentric optic flow alone, and that (2) this perceptual skill is a function of experience. Successful demonstration of the perceptual skill will serve as a preliminary step to future investigations of the nature of the invariant pattern regardless of whether the optic flow is experienced from a first-person (egocentric) viewpoint or from a third-person (allocentric) viewpoint.

Athletes and action detection

The ability to recognize current and future action possibilities (i.e., affordances) for others is especially relevant in sports. It is a key component of skillful timely decisions during a game to better the play or overthrow the competition. Competitive athletes must read the play scenario, considering information from their own movements as well as those of the opponent and their teammates. They identify the action capabilities of all parties and then attune their own actions to the information (Hacques et al., 2021; Vickers, 2007). For instance, in volleyball, the typical pattern of play on one side of the net is: pass, set, attack. A defensive player (on the opposing side of the net) must recognize an attacker’s affordances based on the location of the ball during the second contact, the attacker’s location in relation to the ball, their physical capabilities (e.g., jumping height), and their hand and shoulder positions (Klostermann et al., 2015). In beach volleyball, players must be very skilled in identifying the action possibilities of their partner because they must make the appropriate subsequent move based on their partner’s play. At a high level the speed of the game is so fast that a player does not have time to react after their teammate’s contact but must be able to anticipate the path of the ball to some degree beforehand so that they can act ahead of the play.

Weast et al. (2011) discovered that basketball players were significantly better at judging a person’s ability to jump and reach when compared to non-basketball players. However, there were no differences in judging ability to sit or reach without jumping. It seems that athletes are more sensitive to affordances directly influenced by kinematic information as opposed to static measurements alone.

In cases where biological motion is the only information (i.e., physical details about shape are not available) athletes have demonstrated impressive skills in perceiving actions from PLDs, including whether the actor was a teammate or stranger (Steel et al., 2015). Weast et al. (2014) found that body motion alone provided enough information for athletes to detect affordances for another person when related kinematic information was observed (e.g., watching the motion of an actor squat and then estimate their reaching height while jumping). Athletes are more attuned to these tasks than non-athletes (Fajen et al., 2009).

In summary, athletes have a keen ability to judge action possibilities by observing another player’s body movements. Likewise, competitive athletes are more accurate in action detection based on the amount of time they spend intentionally studying actions and making visual observations while performing. For these reasons the goal of the present contribution is to compare perception of athletes and non-athletes in action detection tasks.

Purpose and hypotheses

We sought to determine if it is possible to detect another person’s actions from a video sample of their first-person (egocentric) perspective view during the activity. To wit, can perceivers detect actions when the body of the actor is not visible, and the video footage only contains what the actor sees in front of her during the activity? Furthermore, does extensive physical training provide athletes with a superior ability to perceive actions?

First, we predicted that observers could perceive actions from a video sample of a first-person view recording that only shows the consequences of the motion of the body, but not the body itself. Second, athletes should have an advantage in determining these actions compared to non-athletes. Athletes are expected to be more accurate and faster than non-athletes. We tested these hypotheses using three different empirical methods: open-ended responses (Experiment 1), forced-choice responses (Experiment 2), and a match-to-sample paradigm (Experiment 3).

Experiment 1

The goal of the experiment was to determine if human observers can perceive an activity based on video footage recorded from the point of view of the actor who was engaged in the action. We created first-person videos of an actor performing six separate actions. These included jumping jacks, jumping, sitting, squatting, skipping, and jogging. The videos showed the actor’s perspective during movement, but not their body. The key component of this manipulation is to demonstrate whether observers can recognize the activity based on the head-mounted camera’s movements without seeing the body of the actor. We hypothesized that the optic flow pattern generated by the camera movement contains information that specifies the action, and that this information can be detected by observers.

Method

Participants

This experiment utilized an online platform and was available to several groups of participants. The first group consisted of participants recruited via the Psychology Department’s SONA participant pool who received course credit in their psychology classes for their contribution. The second group was made up of students who competed for one of the varsity sports teams at the university. Participants were categorized in two groups: Non-Athletes (n = 50) and Athletes (n = 19).

Materials

For all experiments we created a set of video stimuli using a GoPro (Hero8) sports camera. The videos for Experiment 1 provided a first-person world view and did not give any information about the actor’s physicality such as body shape and size. The backdrop for the videos was a set of black retractable bleachers that were withdrawn so that they created a vertical wall-like structure (Fig. 1). The intention for using this background was to provide enough disparity and texture to give rich visual information, but not so much detail that the task became too easy. Videos were recorded for six actions. The actions were grouped as three action pairs:

1. Jumping–Jumping Jacks
2. Squatting–Sitting
3. Skipping–Jogging

These actions were chosen because they should be somewhat familiar to most people and are commonly incorporated in exercise, sports, and everyday behavior. The actions were paired with the intent of being similar, so that the task was not too easy, yet different enough to be distinguishable. Specifically, the movement patterns of each action pair had similar cycles, directions and ranges of motion to make the perceptual discrimination hard, but not impossible.

Experimental design

In Experiment 1 we employed a 2 Athletic Status (athlete versus non-athlete) × 6 Action mixed-design ANOVA to observe the differences in athlete status and all six actions. Additionally, a 2 Athletic Status (athlete versus non-athlete) × 2 Action pair mixed design was performed so that athlete status was a between-subject variable, and Action pair was a within-subjects variable. Three mixed 2 × 2 ANOVAs were conducted for the following Action pairs, respectively: jumping and jumping jacks, squatting and sitting, skipping and jogging. All participants underwent the same experimental procedures with stimuli being presented in a randomized order.

Procedure

Online experiments were programmed using the Collector data collection software (Garcia et al., 2015) to randomize stimuli for each participant. An online link for the experiment was distributed to both target populations simultaneously so that data for both groups was collected over the same window of time. Participants accessed the online link by using their laptop or desktop computer. A demographic questionnaire was initially presented that inquired about the person’s athletic status. This allowed us to determine if they met the qualifications for being included in the athlete group. Any participant who was currently rostered on a university sports team or had been rostered within the past year was included as an athlete.

For the experiment each video was presented randomly four times for a total of 24 trials. Each video was presented one time per trial and lasted about 5 s. For actions such as jumping and squatting the movement was repeated for the 5-s time frame until the participant responded. For actions that require covering ground such as jogging and skipping, a consistent distance was set, and the movement was recorded for the duration of the distance.
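The trial structure described above (six action videos, each shown four times in random order) can be sketched as follows. This is only an illustration under stated assumptions: the study used the Collector software for randomization, and the function name `build_trial_list` and the `seed` parameter are hypothetical, not part of that software.

```python
import random

# The six actions named in the Method section.
ACTIONS = ["jumping jacks", "jumping", "sitting", "squatting", "skipping", "jogging"]
REPETITIONS = 4  # each video is presented four times


def build_trial_list(actions=ACTIONS, repetitions=REPETITIONS, seed=None):
    """Return a shuffled list with every action repeated `repetitions` times.

    Hypothetical helper: a simple full shuffle with no ordering constraints.
    """
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    trials = [action for action in actions for _ in range(repetitions)]
    rng.shuffle(trials)
    return trials


if __name__ == "__main__":
    trials = build_trial_list(seed=1)
    print(len(trials), "trials")
```

A separate seed per participant would give each participant an independent order while keeping any one order reproducible.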

To assess people’s ability to detect the action we began by asking the general question: “What is the person doing in this video?” Instructions read: “Be as specific as possible but describe the action in no more than two words.” In the case that the video did not play appropriately due to technological issues like internet connection, the participant was instructed to enter the word “ERROR” into the text box. Response time for each trial was measured from the moment the response text box appeared and ended when the participant submitted their response. Figure 2 depicts the trial sequence.

Analyses

A coding scheme was created to categorize participant responses. Categories were determined based on the data collected. For instance, one-word responses such as “jump,” “hop,” and “bounce” were coded as a jump. After categorizing responses, we determined the accuracy for each trial and labeled them based on correctness (1 for correct, 0 for incorrect). Trials that resulted in an error response due to malfunction or glitch were removed, as well as trials where the participant clearly did not follow the instructions. This resulted in the removal of 11.8% of trials.
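A minimal sketch of how such a coding scheme could be applied is given below. Only the three synonyms for jumping are taken from the text. The dictionary layout, the function names, and the exact-match rule are assumptions made for illustration, and the full set of categories used in the study is not reproduced here.

```python
# Only the "jump" synonyms below are reported in the text; entries for the
# other five actions would have to be added by the analyst.
CODING_SCHEME = {
    "jump": "jumping",
    "hop": "jumping",
    "bounce": "jumping",
}


def score_trial(response, shown_action, scheme=CODING_SCHEME):
    """Return 1 (correct), 0 (incorrect), or None for an excluded trial.

    Assumption: responses are matched exactly after trimming whitespace
    and lowercasing; "ERROR" marks a video that failed to play.
    """
    cleaned = response.strip().lower()
    if cleaned == "error":
        return None  # trial removed from the analysis
    return 1 if scheme.get(cleaned) == shown_action else 0


def mean_accuracy(scores):
    """Average the 0/1 scores, ignoring excluded (None) trials."""
    valid = [s for s in scores if s is not None]
    return sum(valid) / len(valid) if valid else None
```

Under this rule a general description such as "up and down" maps to no category and is scored as incorrect.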

A repeated-measures analysis of variance (ANOVA) was performed to observe both dependent variables: accuracy and response time. It was expected that all participants could decipher the type of action in the videos to some degree. It was also anticipated that athletes would perform more accurately and take less time in responding.

Results

Accuracy

A 2 Athletic status × 6 Action repeated-measures ANOVA on accuracy revealed a main effect of Action, F(5,315) = 40.72, p < .001, η_p² = 0.39. Jumps were perceived most accurately (M = 0.63, SD = 0.37), whereas sitting was perceived least accurately (M = 0.01, SD = 0.05). There was also a main effect of Athletic status, F(1,63) = 11.29, p = .001, η_p² = 0.15. Athletes were more accurate (M = 0.49, SD = 0.44) than non-athletes (M = 0.35, SD = 0.44). There was no significant interaction.

In order to get a more detailed look at the data we followed up the omnibus analysis with separate 2 Athletic status × 2 Action ANOVAs for each action pair: jog versus skip, jump versus jumping jacks, and sit versus squat. The 2 Athletic Status × 2 Action pair (jog vs. skip) ANOVA on accuracy revealed a significant effect of Athletic Status, F(1,66) = 5.99, p = .017, η_p² = 0.08. Specifically, athletes (M = 0.65, SD = 0.35) were more accurate than non-athletes (M = 0.44, SD = 0.49). No other effects were significant. The same ANOVA comparing jumps and jumping jacks revealed a significant difference between actions, F(1,66) = 113.7, p < .001, η_p² = 0.63. Specifically, jumps (M = 0.64, SD = 0.38) were detected more accurately than jumping jacks (M = 0.07, SD = 0.18). The Athletic Status × Action pair interaction was also significant, F(1,66) = 5.01, p = .03, η_p² = 0.07. Athletic Status was not significant. The ANOVA comparing accuracy of perceiving sitting and squatting returned a significant difference between actions, F(1,64) = 121.57, p < .001, η_p² = 0.66. Specifically, squats (M = 0.60, SD = 0.44) were detected more accurately than sitting down (M = 0.01, SD = 0.05). The Athletic Status × Action pair interaction was also significant, F(1,64) = 5.2, p = .03, η_p² = 0.08. Athletic Status was also significant, F(1,64) = 5.37, p = .03, η_p² = 0.08. Specifically, athletes (M = 0.4, SD = 0.46) were more accurate than non-athletes (M = 0.27, SD = 0.41). The average accuracy rates for each action pair and group are shown in Fig. 3.
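For readers who wish to check the reported effect sizes, partial eta squared can be recovered from an F ratio and its degrees of freedom through the standard identity η_p² = (F × df_effect) / (F × df_effect + df_error). The short sketch below applies it to F values reported in this section. The function name is our own, and the check assumes the reported values were rounded to two decimals.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared computed from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


if __name__ == "__main__":
    # (label, F, df_effect, df_error) as reported in the accuracy results
    reported = [
        ("Action, omnibus", 40.72, 5, 315),
        ("Athletic status, omnibus", 11.29, 1, 63),
        ("Jump vs. jumping jacks", 113.7, 1, 66),
    ]
    for label, f_value, df1, df2 in reported:
        print(label, round(partial_eta_squared(f_value, df1, df2), 2))
```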

Degrees of freedom varied across the 2 × 2 ANOVAs because data for some actions were not recorded for four participants: some videos failed to load because of a poor internet connection. In the jog-skip and jump-jumping jack analyses we had to drop one participant per analysis; in the sit-squat analysis we had to drop three participants.

Response time

To reduce the skewness of the response time distribution, responses that were more than 3 standard deviations above the mean were removed. This resulted in the removal of 1.6% of trials.
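The trimming rule can be sketched as below. The text does not state whether the mean and standard deviation were computed over all trials or separately per participant or action, so this sketch simply assumes a single pooled list of response times and uses the sample standard deviation. The function name is hypothetical.

```python
from statistics import mean, stdev


def trim_slow_responses(rts_ms, n_sd=3.0):
    """Drop response times more than `n_sd` SDs above the mean.

    Assumption: one pooled list of response times in milliseconds;
    only the upper tail is trimmed, as described in the text.
    Requires at least two values (sample standard deviation).
    """
    cutoff = mean(rts_ms) + n_sd * stdev(rts_ms)
    return [rt for rt in rts_ms if rt <= cutoff]


if __name__ == "__main__":
    # Hypothetical data: 19 typical responses and one very slow response.
    rts = [6000] * 19 + [60000]
    kept = trim_slow_responses(rts)
    print(len(rts) - len(kept), "trial(s) removed")
```

Note that with very few trials a single slow response cannot exceed a 3 SD cutoff, so the rule is only meaningful for reasonably large samples.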

The initial omnibus ANOVA showed a main effect of Action, F(5,315) = 5.38, p < .001, η_p² = 0.08. Responses to jogging actions were the fastest (M = 5,946 ms, SD = 2,075 ms), whereas responses to sitting were the slowest (M = 7,527 ms, SD = 2,516 ms). There was no main effect of Athletic status and no interaction.

The 2 Athletic Status × 2 Action pair (jog vs. skip) ANOVA on response time revealed no significant effects. The same ANOVA comparing response times for jumps and jumping jacks revealed no significant effects or interactions. The ANOVA comparing response times of perceiving sitting and squatting returned a significant difference between actions, F(1,64) = 10.65, p = .002, η_p² = 0.14. Specifically, average response time for squats (M = 6,180 ms, SD = 2,121 ms) was shorter than for sitting down (M = 7,507 ms, SD = 2,502 ms). No other effects were significant. The average response times for each action pair and group are shown in Fig. 4.

Discussion

Some actions were more difficult to detect than others. Participants struggled to recognize jumping jacks and sitting. This could be because the natural optic flow patterns for these actions are not as distinctive as those of other actions and are therefore easily confused with them. Another possibility is that jumping jacks and sitting may generate optic flow patterns that are more complex than those of other actions, rendering them hard to detect. Sitting might have proven difficult because it is typically not a repetitive movement; however, our video sample captured it as such (with the actor sitting down and standing up several times). Jumps were detected more accurately than jumping jacks, perhaps due to the relative simplicity of jumping motions. Athletes were more accurate than non-athletes, consistent with our predictions. This is most likely due to their trained eye and their extensive experience with physical activity, which involves sustained focus on and awareness of body movements. It is also possible that in the open-ended response design, athletes were better equipped than non-athletes to report answers within the constraints of the task because of their familiarity with exercise names and types of movement.

Response times for the sitting activity were the longest of all actions. This is consistent with the difficulty in detecting sitting action and shows that perhaps it was not the optimal choice for this task due to it not being a cyclical action. There were no differences in the speed of responding between athletes and non-athletes, contrary to our prediction. This may have been because participants were not prompted in any way to respond as quickly as possible.

We must also consider the limits of the open-ended response method, which was utilized to increase the external validity of the task. At the same time, the open-ended nature of the task invited a variety of responses, which decreased experimental control and resulted in low internal validity. The absence of clear differences between groups and activities may have been the result of passive responses (lack of inherent motivation to answer accurately), variations in participants’ typing speeds, and uncertainty about the exact labels for the various categories of activities. In some cases, participants were able to report the general movement but did not give a precise enough response to be considered correct (e.g., “up and down,” “moving forward”). In the second experiment we chose to use a forced-choice response paradigm to reduce variability due to the open-ended responses. We predicted that the forced-choice paradigm would make the task easier and result in less variable responses.

Experiment 2

The second experiment was conducted to refine and verify the results of Experiment 1. Participants were asked to determine the action presented in the first-person videos by means of a forced-choice task.