A Comprehensive Review of Group Activity Recognition in Videos

Human group activity recognition (GAR) has attracted significant attention from computer vision researchers due to its wide practical applications in security surveillance, social role understanding and sports video analysis. In this paper, we give a comprehensive overview of the advances in group activity recognition in videos during the past 20 years. First, we provide a summary and comparison of 11 GAR video datasets in this field. Second, we survey the group activity recognition methods, including those based on handcrafted features and those based on deep learning networks. For better understanding of the pros and cons of these methods, we compare various models from the past to the present. Finally, we outline several challenging issues and possible directions for future research. From this comprehensive literature review, readers can obtain an overview of progress in group activity recognition for future studies.


Introduction
In recent years, the widespread deployment of surveillance equipment has rapidly increased the amount of video data, and analyzing and understanding complicated video content has become an urgent demand. Human activity analysis, a challenging research topic in video content analysis, has attracted intensive research interest in the computer vision community and has made remarkable progress over the previous decades.
Human activity is a complicated concept that spans several levels. To present our work clearly, we categorize human activities into three levels based on their complexity: individual action, group activity and crowd behavior. Fig. 1 shows instances of these three levels. Individual action covers single-person action, where the human pose and the motion of the human body are the discriminative information. Crowd behavior occurs in environments with highly dense crowds, where it is infeasible to obtain precise tracks and detailed information about an individual person. Research on human crowds therefore aims to identify abnormal activity or emergency situations based on the motion patterns of the crowd.
In this paper, we focus on group activity, which is composed of one or more sub-groups involving visually countable persons who interact in the scene. A distinctive property of group activity recognition is the interaction between different groups and individuals. As illustrated in Figs. 2(a) and 2(b), the two highlighted people share a similar appearance and the same atomic action "standing", so the group activity cannot be determined from the action of a single individual alone. The interaction among persons in the group should be considered to infer the group activity. For instance, in Fig. 2(a), two or three people standing face to face indicates that they are talking, while in Fig. 2(b), many people standing and facing the same direction suggests that they are queuing. Fig. 3 demonstrates that only a few key individuals play important roles in the group activity and that other people might contribute irrelevant information. Therefore, it is reasonable to predict group activity on the basis of contextual information in the entire image rather than isolated information from a single individual. Compared to crowd behavior, group activity allows us to capture detailed information about individuals as well as their interactions, which is easier to interpret and more meaningful in practice. Interactions among a group of persons occur much more often in practical scenarios, and the study of group activity recognition has tremendous potential for many applications such as sports video analysis and smart video surveillance.
To sum up, group activity recognition is of theoretical and practical significance. However, most previous reviews of human activity recognition focus on individual action recognition [1−4], and reviews of group activity recognition are scarce. Fauzi and Sulistyo [5] mainly survey the connection between group activity recognition and the advancement of internet of things (IoT) technology in smart buildings. Aggarwal and Ryoo [6] study different levels of human activity, but they only introduce traditional methods. To the best of our knowledge, the most recent survey related to group activity recognition was published in 2017 [7]. It focuses mainly on handcrafted-feature based methods, while deep learning based methods are not discussed in depth. Moreover, notable progress has been made in this field in recent years because of powerful deep learning techniques. Therefore, an overview of group activity recognition methods that includes the recent state of the art is required. Compared with previous surveys, our survey covers a substantial number of recent works and discusses recent research trends in group activity recognition.
This paper provides a comprehensive survey of current group activity recognition methods. We distinguish between the traditional approaches based on handcrafted features and those based on deep learning. We further divide the traditional methods into two categories. The first is the top-down approach, which relies on analyzing group-level information to recognize activity. The second is the bottom-up approach, which recognizes activity based on each individual in the group context. We group the deep learning approaches into four classes according to the key problem they address. We also give a summary of publicly available datasets and comparisons between state-of-the-art approaches. This paper is organized as follows. A dataset summary is provided in Section 2. In Section 3, traditional approaches are divided into two categories and each category is reviewed with a specific description. In Section 4, deep learning based approaches proposed in recent years are introduced in detail. Finally, Sections 5 and 6 describe the research challenges and conclusions respectively.

Fig. 2 Role of contextual information. The group activity in (a) is talking; the group activity in (b) is queuing. The two highlighted people belong to different group activities yet share similar appearance features. These two pictures demonstrate that the interaction between individuals is a crucial cue for recognizing group activity.

Fig. 3 The group activity is usually determined by a few key individuals: (a) The main group activity is queuing, while a person on the right of the image is walking and some people in the queue are talking. (b) The group activity is left spiking. The spiking player and the blocking players lead the group activity. These two pictures indicate that different groups might perform different activities and that it is essential to consider the contextual information of the whole scene to infer group activity.

Datasets
Public datasets provide a unified measurement and a direct comparison for proposed methods, which leads to a better understanding of the pros and cons of each algorithm. Therefore, constructing datasets plays an essential role in promoting the development of group activity recognition. Compared to the benchmarks available for understanding individual actions, there are few resources for complex human group activities. All the datasets for group activity recognition consist of surveillance videos or sports videos, motivated by the practical requirements of constructing safety systems or sports analysis systems. In this section, we provide an overview of available datasets. Example video frames appear in Fig. 4. A summary of datasets appears in Table 1.

Surveillance datasets
All the surveillance datasets are collected in practical environments such as a campus or a street. Most of the videos are recorded with a stationary surveillance camera, so the backgrounds are static without camera motion. Background clutter and occlusions between multiple people occur frequently.

Table 1 Summary of datasets (rows for the earlier datasets are missing from this copy)
Dataset | Group activities | Individual actions | Year | Type | Accuracy | Reference
… [16]
UCLA Courtyard | 6 | 10 | 2012 | Surveillance video | 83.7% | Amer et al. [17]
Nursing Home | 2 | 6 | 2012 | Surveillance video | 85.5% | Deng et al. [18]
Broadcast Field Hockey | 3 | 11 | 2012 | Sports video | 62.9% | Lan et al. [19]
NCAA Basketball | 11 | N/A | 2016 | Sports video | 58.1% | Wu et al. [20]
Volleyball | 8 | 8 | 2016 | Sports video | 94.4% | Gavrilyuk et al. [21]
C-Sports | 5 | N/A | 2020 | Sports video | 81.3% | Zalluhoglu and Ikizler-Cinbis [22]
NBA | 9 | N/A | 2020 | Sports video | 47.5% | Yan et al. [23]

NUS-HGA Dataset [24] is collected by a surveillance camera at a university car park. This dataset consists of six different group activities: Walk in Group, Ignore, Gather, Stand and Talk, Fight and Run in Group. Each activity clip lasts 8−15 seconds with 4−8 actors. The dataset has 476 labeled video samples in total.
BEHAVE Dataset [8] consists of 10 types of group activity classes: InGroup, Approach, WalkTogether, Meet, Split, Ignore, Chase, Fight, RunTogether and Following. There are usually 2−5 people as a group or two groups interacting in each video. This dataset contains 174 samples of different group activities and in total 76 800 individual frames.
Collective Activity Dataset (CAD1) [9] is one of the most widely used benchmarks for group activity recognition. It contains 44 short video clips from 5 activity categories (Crossing, Waiting, Queueing, Walking and Talking) and 6 individual action categories (NA, Crossing, Waiting, Queueing, Walking and Talking). The group activity of a clip is labeled as the activity in which most people participate. The benchmark also provides 8 pose orientation labels, 8 pairwise interaction labels and the trajectory of each person in the video clip. These annotations are manually labeled every ten frames.
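The clip-level labeling rule described above (the group activity is the one in which most people participate) can be sketched as a simple majority vote. This is only an illustrative sketch: the function name `clip_activity_label`, the decision to ignore the "NA" action, and the tie-breaking behavior are assumptions, not details taken from the dataset documentation.

```python
from collections import Counter


def clip_activity_label(person_actions, ignore=("NA",)):
    """Return the action shared by the most people in a clip.

    person_actions: list of per-person action labels (strings).
    ignore: labels excluded from the vote (assumption: "NA" carries no activity).
    Ties resolve to the label encountered first (an arbitrary choice).
    """
    counts = Counter(a for a in person_actions if a not in ignore)
    if not counts:
        return None  # nobody in the clip carries a usable label
    return counts.most_common(1)[0][0]


label = clip_activity_label(["Waiting", "Crossing", "Waiting", "NA"])
```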
Collective Activity Extended Dataset (CAD2) [25] augments the Collective Activity Dataset [9] by adding two new categories, dancing and jogging, and removing the Walking activity, since Walking is an individual action rather than a group activity. The Collective Activity Extended Dataset contains 75 video sequences in total.
New Collective Activity Dataset (CAD3) [26] is comprised of 32 video clips with 6 group activities: gathering, talking, dismissal, walking together, chasing and queueing. Three atomic actions are labeled as walking, standing still and running, and 9 interaction labels are defined.
UCLA Courtyard Dataset [27] contains 106 minutes of high-resolution video from a bird's-eye viewpoint of a courtyard on the UCLA campus. The annotations provide 6 group activities (Walking-together, Standing-in-line, Discussing-in-group, Sitting-together, Waiting-in-group and Guided-tour) and 10 individual actions.
Nursing Home Dataset [28] consists of videos captured in the dining room of a nursing home by a fixed low-resolution surveillance camera. Individual actions include walking, standing, sitting, bending and falling. Based on the individual actions, each frame is assigned one of two activity categories: fall and non-fall. If any person falls, the frame is labeled "fall"; otherwise it is labeled "non-fall". In total, there are 22 short video clips and 2 990 annotated frames in this dataset.

Sports datasets
Sports datasets are usually collected from broadcast video. In most cases, the camera moves to follow specific events. Compared with surveillance datasets, sports datasets have more complicated person-person interactions and heavier occlusions. Moreover, sports activities are usually determined by a few players; for example, the spike event in a volleyball game is mainly determined by the spiking player and the blocking players.
Broadcast Field Hockey Dataset [19] has 58 video sequences with 11 atomic actions: pass, dribble, shot, receive, tackle, prepare, stand, jog, run, walk and save, and 3 scene-level events: attack play, free hit and penalty corner. Besides, to explore the effect of social roles on group activity, five social roles are defined.
NCAA Basketball Dataset [11] collects 257 NCAA basketball games available from YouTube, and each untrimmed video is 1.5 hours long. Eleven key events are defined, comprising 5 types of shots, each of which could be successful or failed, plus a steal event. This dataset is challenging due to heavy mutual occlusion, low resolution and the complicated interactions in sports video.
Volleyball Dataset [10] is a more challenging dataset due to its large scale, complicated interactions and the rapid motion of players. This dataset is collected from volleyball game videos available on YouTube. It consists of 4 830 video clips gathered from 55 games. Each clip is annotated only in its middle frame, in which each player is labeled with a bounding box and an individual action, and a group activity category is provided for each clip. There are a total of 8 group activity categories (Left/Right set, Left/Right spike, Left/Right pass and Left/Right winpoint) and 8 individual atomic actions (Waiting, Setting, Digging, Falling, Spiking, Blocking, Jumping, Moving and Standing).
C-Sports [22] is a benchmark for multi-task recognition of both group activity and sports categories. In this dataset, there are 11 types of sports and 5 group activity categories. Sports categories include American football, basketball, dodgeball, football, handball, hurling, ice hockey, lacrosse, rugby, volleyball and water polo. Group activities are gathering, dismissal, passing, attack and wandering. To estimate the generalization ability of an algorithm, C-Sports introduces a challenging evaluation protocol in which training and testing are performed on different sport classes.
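The cross-sport evaluation protocol mentioned above amounts to holding out whole sports so that no sport appears in both training and testing. A minimal sketch follows; the record layout `(clip_id, sport, activity)`, the function name `split_by_sport` and the particular held-out sports are hypothetical, since the text does not state which sports the benchmark actually holds out.

```python
def split_by_sport(clips, test_sports):
    """Split clips so that training and test sets share no sport.

    clips: iterable of (clip_id, sport, activity) tuples (hypothetical layout).
    test_sports: sports reserved for testing only.
    """
    test_sports = set(test_sports)
    train = [c for c in clips if c[1] not in test_sports]
    test = [c for c in clips if c[1] in test_sports]
    return train, test


# Toy records; the held-out sports here are chosen arbitrarily for illustration.
clips = [
    ("c1", "basketball", "passing"),
    ("c2", "handball", "attack"),
    ("c3", "rugby", "gathering"),
    ("c4", "basketball", "dismissal"),
]
train, test = split_by_sport(clips, test_sports=["rugby", "handball"])
```

Because the activity labels are shared across sports while the sports themselves are disjoint, a model evaluated this way has to recognize the activity without relying on sport-specific appearance.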
NBA Dataset [23] is currently the largest and the most challenging benchmark for group activity analysis. Unlike conventional GAR tasks, this dataset presents a new task namely weakly-supervised group activity recognition in which person-level information is not provided even in the training data and only video-level labels are available. It collects 181 NBA games from the web and there are 9 172 video clips, each of which belongs to one of the 9 activities.

Approaches based on handcrafted features
Traditional approaches to group activity recognition can be categorized into two classes: top-down approaches and bottom-up approaches. The top-down approaches analyze activities in terms of group-level motion and interaction. Their drawback is that they describe the activity in little detail, because they cannot fully exploit features at the individual level. The bottom-up approaches focus on recognizing each individual and describing the activity based on a collection of individual features and their statistics. Therefore, they are sensitive to failures of individual feature extraction caused by occlusion or missed detection. This section reviews both types of approaches; top-down and bottom-up approaches are compared in Tables 2 and 3 respectively.

Top-down approach
Top-down approaches focus on analyzing the global motion patterns of an entire group or of each sub-group, and investigate the trajectories and interactions of groups, while the individual action of a specific actor in the scene is less important. As a result, they are more robust to occlusion and low resolution.

Trajectory based method
Trajectory based methods analyze group activities in terms of interactions between individual trajectories. Vaswani et al. [37] modeled moving objects as point objects in the two-dimensional plane. Instead of tracking each point and recognizing the interactions, they proposed to represent a group activity as the change over time of the polygonal shape formed by the configuration of these points, following Kendall's shape theory. At each time step, they extract object points in the image and construct a polygon based on these points. A tangent coordinate system is defined at the mean shape, which is learned from the object configurations observed in a training sequence of frames. Normal and abnormal activities are distinguished by comparing the shape extracted from input frames with the learned model in the tangent space. Similarly, Khan and Shah [38] proposed a method to detect group activities that can be characterized by rigidity information, such as parading or marching. They represent each entity as a corner of a three-dimensional polygon, and the tracklets of each entity on the three-dimensional polygon plane are treated as trajectory features. The final classification results are inferred from the structure composed of the trajectories and the interactions between participating entities.
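To make the idea of comparing point configurations by shape concrete, the sketch below computes the full Procrustes distance between two planar configurations, which is unchanged by translation, scaling and rotation. It is a simplified stand-in rather than the tangent-space model of [37]: the function names are hypothetical, and the points are assumed to be given in corresponding order in both configurations.

```python
import math


def preshape(points):
    """Center a 2-D point configuration and scale it to unit size.

    Each (x, y) point is stored as a complex number x + iy.
    """
    z = [complex(x, y) for x, y in points]
    centroid = sum(z) / len(z)
    z = [p - centroid for p in z]
    size = math.sqrt(sum(abs(p) ** 2 for p in z))
    if size == 0:
        raise ValueError("all points coincide; shape is undefined")
    return [p / size for p in z]


def procrustes_distance(a, b):
    """Full Procrustes distance between two planar configurations.

    Returns a value in [0, 1]; 0 means the same shape up to
    translation, scale and rotation.
    """
    za, zb = preshape(a), preshape(b)
    inner = sum(p.conjugate() * q for p, q in zip(za, zb))
    return math.sqrt(max(0.0, 1.0 - abs(inner) ** 2))


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
moved_square = [(5, -2), (5, 1), (2, 1), (2, -2)]  # rotated, scaled, shifted
rectangle = [(0, 0), (3, 0), (3, 1), (0, 1)]
same = procrustes_distance(square, moved_square)
different = procrustes_distance(square, rectangle)
```

Under these assumptions, a frame could be flagged as abnormal when its distance to a reference shape exceeds a chosen threshold; how that threshold is set is outside what the text specifies.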
Zhou et al. [39] designed one set of features that measures the strength of causality between two trajectories and another set that describes the type of causality. The two sets of features, along with conventional velocity and position features of a trajectory pair, are fused to explore the relationship between two object entities. However, the method can only deal with pair-activity recognition. Ni et al. [24] proposed to analyze group activity with self-causality, pair-causality and group-causality based on local trajectory information. These three categories of causality extract the dynamic interaction relations of different individuals and describe the spatial and temporal characteristics of the behaviors of the human group. Cheng et al. [12] introduced Gaussian processes to describe the motion trajectories of individuals and provide a probabilistic perspective on explaining the variation of individuals in a group. Three descriptors, namely the Individual, Dual and Unitized Group Activity Patterns, are designed to capture the relationships of individuals in group activities. Zhang et al. [40] proposed to obtain group-level context from extracted individual trajectories. They constructed a weighted graph to represent the probabilistic group membership of the individuals. The features extracted from this graph can capture the motion and action context for group event recognition.

Table 3 (recovered fragment; values are listed per method in the order printed, and the column assignment is not recoverable except where a dataset is named)
Choi et al. [25]: 70.9, 82.0
Choi and Savarese [26]: 79.0, 83.0
Amer et al. [27]: 83.6, UCLA: 72.7
Lan et al. [31]: 68.2
Kaneko et al. [32]: 73.2
Nabi et al. [33]: 72.9, 72.3
Lan et al. [34]: 77.5
Chang et al. [35]: 83.3, 80.3
Amer et al. [17]: 88.9, 84.2, UCLA: 83.7
Amer et al. [16]: 92.0, 87.2
Hossein et al. [36]: 83.4
Khamis et al. [15]: 72.0, 85.8

Sub-group interaction
To cope with complicated situations where multiple groups perform different activities in a scene, some methods first detect sub-groups and then analyze the interactions of different groups and the activity of each group. Yin et al. [41] first clustered the individuals into several sub-groups with the minimum spanning tree algorithm and then used a feature description based on social network analysis to extract structural features, which contain the global pattern of each sub-group as well as the local motion information of the individuals in each group. Finally, a Gaussian process dynamical model is trained for each group behavior. Zhang et al. [13] proposed to represent group behavior as a combination of sub-groups and introduced multi-group causalities (individual, pair, behavior and inter-group causality) to describe the interaction between groups. Furthermore, they employed an improved locality-constrained linear coding method to encode the proposed multi-group causalities. Azorin-Lopez et al. [42] proposed a descriptor vector which describes not only the trajectory of individuals in a group, but also the trajectory followed by sub-groups and the movement relationship between different sub-groups in the scene. The trajectory analysis provides a path to understanding complex high-level group activities.
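Clustering individuals into sub-groups with a minimum spanning tree can be sketched as follows: build the tree over people's positions, remove the long edges, and read the remaining connected components as sub-groups. This is a generic illustration and not the procedure of [41]; the cut threshold `max_edge`, the use of plain Euclidean distance on image positions, and the function names are assumptions.

```python
import math


def mst_edges(points):
    """Prim's algorithm: return the MST as a list of (u, v, length) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # shortest known link from the tree to each node
    parent = [-1] * n
    edges = []
    if n:
        best[0] = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def subgroups(points, max_edge):
    """Cut MST edges longer than max_edge and return the resulting groups."""
    root = list(range(len(points)))

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]  # path compression
            i = root[i]
        return i

    for u, v, d in mst_edges(points):
        if d <= max_edge:
            root[find(u)] = find(v)
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


people = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
groups = subgroups(people, max_edge=2.0)
```

With the toy positions above, the single long tree edge between the two clusters is cut, leaving one sub-group of three people and one of two.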
Sub-group information is a helpful cue for recognizing group activity in complicated scenes; however, identifying meaningful sub-groups remains a challenging problem. Kim et al. [43] proposed to detect the group interaction zone and update it over time so that noisy information can be suppressed and the active zone of the activity can be enhanced. To represent interactions within group interaction zones, they further proposed two kinds of features: the group interaction energy feature and the attraction and repulsion features. Tran et al. [29] measured the degree of interaction between individuals using social signal cues. They then used a graph clustering algorithm to discover interacting sub-groups in the scene and discarded non-dominant groups. To better understand group activity, they proposed a descriptor which encodes the social interaction cues and motion information of individuals within the active sub-groups. Sun et al. [44] proposed a latent graph model to solve two tasks simultaneously: group discovery and activity recognition. They constructed a relation graph which encodes the context relations between tracklets, intra-group interaction and inter-group interaction. The model can propagate messages between the layers of the latent graph.

Multi-camera context
Nowadays, multi-camera surveillance systems, which provide a larger field of view, are set up in almost all public places such as campuses and airports. Therefore, there is high demand for group activity recognition in multi-camera scenarios, and some researchers have studied this topic. In [45], multiple tracks from multiple cameras are used to extract spatio-temporal features of individuals. The authors considered two hierarchical clustering approaches for grouping individuals, agglomerative clustering and divisive clustering, using a dissimilarity measure between tracked targets. Zha et al. [46] proposed a graphical model with hidden variables from which intra-camera and inter-camera contexts are extracted. By optimizing the structure of the graphical model, the contexts are explored automatically. Moreover, they presented a spatio-temporal feature, namely the vigilant area, to encode the motion information in an area, which proved effective for group activity representation.

Discussions
Trajectory-based methods rest on the observation that tracking individual positions and the overall movement of the group is sufficient for recognizing group activity. Therefore, trajectory-based methods are suitable for recognizing group activities that are characterized by the overall motion of an entire group. However, most trajectory-based methods focus on activities with only one group, without considering that group behavior in real scenarios usually involves multiple groups and is mainly characterized by the dynamic interaction among groups of individuals. Sub-group interaction based methods address this problem by detecting sub-groups and utilizing interaction information among groups to better understand group activity. The major advantage of such approaches is their ability to analyze the interactions of groups. Unlike the aforementioned methods, multi-camera context based methods predict group activity from multiple cameras. In multi-camera scenes, intra-camera and inter-camera contexts are important information. In general, top-down approaches analyze activities in terms of group-level motion and interaction and do not rely heavily on individual features, which makes them robust to occlusion and low resolution. A comparison between top-down approaches is shown in Table 2.

Bottom-up approach
Bottom-up approaches can be applied to recognize group activity when the scene contains a limited number of people with nonuniform behaviors. For example, in an indoor environment such as a coffee shop, some people are talking face to face while other people are queuing to order coffee or just standing. These approaches usually recognize each individual person and then analyze the hierarchical structure at the individual level and the group level.

HMM based model
In earlier studies, the hidden Markov model (HMM) was applied to handle structured data in video. Zhang et al. [47] recognized group activities in meetings, including monologues, discussion, presentation and note-taking. They proposed a two-layer hidden Markov model in which the first layer models basic individual actions from low-level audio-visual features, and the second layer models the interaction between meeting participants. Similarly, Dai et al. [48] recognized break, presentation and discussion in meeting scenarios using an event-driven multilevel dynamic Bayesian network (EDM-DBN), which models group interactions as a group of Markov chains.

Descriptor based method
Later, some researchers incorporated context information by designing new descriptors extracted from individuals or surrounding scenes to model the evolution of group activity. Choi et al. [9] introduced a spatio-temporal local (STL) descriptor which calculates the spatio-temporal distribution of the position, pose and motion information of individuals. The STL descriptor is centered on an anchor person and captures histograms of the surrounding persons, with their poses and motion information, in different bins. Choi et al. [25] extended the STL descriptor and proposed the randomized spatio-temporal volume (RSTV) representation. The framework is built upon a random forest structure which randomly samples portions of the spatio-temporal volume and selects the discriminative regions for classification. This method can automatically discover the optimal configuration of spatio-temporal bins so as to increase the discriminating ability of the algorithm.
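The anchor-centered binning behind such descriptors can be illustrated with a single-frame sketch that counts surrounding people by distance ring, direction relative to the anchor's facing, and pose class. It is a simplification and not the STL descriptor of [9]: temporal stacking and motion features are omitted, and the bin edges, bin counts and the name `context_histogram` are assumptions made for the example.

```python
import math


def context_histogram(anchor, others, radii=(1.0, 3.0), n_angle=4, n_pose=8):
    """Histogram of people around an anchor person in one frame.

    anchor: (x, y, facing) with facing in radians.
    others: list of (x, y, pose) where pose is an integer class index.
    radii: outer edges of the distance rings (assumed values).
    Returns a flat list of length len(radii) * n_angle * n_pose.
    """
    ax, ay, facing = anchor
    hist = [0] * (len(radii) * n_angle * n_pose)
    sector = 2 * math.pi / n_angle
    for x, y, pose in others:
        dx, dy = x - ax, y - ay
        dist = math.hypot(dx, dy)
        ring = next((i for i, edge in enumerate(radii) if dist <= edge), None)
        if ring is None:
            continue  # person lies outside the outermost ring
        # Direction of the neighbor measured relative to the anchor's facing.
        angle = (math.atan2(dy, dx) - facing) % (2 * math.pi)
        a_bin = min(int(angle / sector), n_angle - 1)
        hist[(ring * n_angle + a_bin) * n_pose + pose % n_pose] += 1
    return hist


anchor = (0.0, 0.0, 0.0)
others = [(0.5, 0.1, 2), (1.0, 1.0, 3), (10.0, 10.0, 0)]
hist = context_histogram(anchor, others)
```

Concatenating such histograms over consecutive frames would be one way to add the temporal dimension, again as an assumption rather than the published construction.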
Motivated by the fact that what the surrounding people are doing is a constructive cue for analyzing the action of each individual, Lan et al. [31] proposed the action context (AC) descriptor, which captures the actions of the anchor person as well as those of other people nearby. Experimental results demonstrate that this method can deal with complex activities in a surveillance scene. However, the AC descriptor is sensitive to viewpoint change. To solve this problem, Kaneko et al. [32] proposed the relative action context (RAC) descriptor, which encodes relative relations and is invariant under viewpoint change.
To make the low-level feature extractors provide more discriminative information for high-level inference models, Amer and Todorovic [49] introduced a mid-level feature descriptor, bags-of-right-detections (BORD), which seeks to discover the individuals who participate in the group activity and remove irrelevant individuals. Specifically, the BORD descriptor is a histogram of human poses computed over the people who participate in the activity. The chains of BORDs are fed into a two-step maximum a posteriori (MAP) inference to construct the activity representation.
Existing methods depend heavily on the accuracy of detectors, which might fail in crowded scenarios due to occlusion. Nabi et al. [33] presented a semantic-based spatio-temporal descriptor based on Poselet activation patterns over time. This descriptor is designed for modeling human motion interactions in crowded cases. Experimental results revealed that this descriptor can effectively tackle complex real scenarios in group activity recognition and activity localization.

Interaction context
In addition, one of the essential properties of group activity is the relationships and interactions between individuals, including person-person, person-group and group-group interactions, which are useful cues for reasoning about group activity. Lan et al. [34] introduced a hierarchical interaction model and an adaptive interaction structure mechanism that automatically searches for a suitable structure to infer the activity. As a result, person-person interactions are built only among the subset of relevant people. Kaneko et al. [50] proposed to utilize fully connected CRFs to integrate multiple types of individual features such as position, size and motion, so that groups of different shapes and types can be handled. Chang et al. [35] focused on modeling the person-person interaction. They utilized the features of individuals in pairs and modeled the relations peer-to-peer. The interaction pattern is obtained via an interaction matrix, which is learned by maximizing the interaction responses.
Graphical models and their variants are commonly used tools for group activity recognition. Amer et al. [27] proposed a graph based interaction method. An AND-OR graph is presented to simultaneously model the objects occurring in the scene, individual actions and group activity. They proposed a principled formulation for efficient graph inference using an explore-exploit strategy. In [17], they further proposed a hierarchical, spatio-temporal AND-OR graph (ST-AOG) which models individual actions, group activities and the relations of individual actions within a group activity. Moreover, Monte Carlo tree search is used to address the expensive computation cost of AOG inference. Later, Amer et al. [16] advanced the existing graph model with a hierarchical random field (HiRF). HiRF is designed for extracting spatio-temporal features in video and capturing long-range dependencies. HiRF aggregates multi-scale input features and discovers the foreground features of groups, while removing features that belong to background clutter.
Lan et al. [19] utilized social roles to complement the representation of low-level individuals and high-level events within a graph framework. In the proposed graphical model, individual actions are modeled from individual feature vectors at the lowest level, and the contextual interaction information between individuals is modeled from their social roles at the intermediate level. Group-level events are inferred at the top level of the model. Zhao et al. [51] observed that most existing approaches assume all individuals share the same activity label and ignore the fact that multiple activities co-exist in some scenarios, which can serve as a context cue in many cases. They presented a unified discriminative learning framework of multiple context models which takes both the intra-class and inter-class behavior interactions among persons into consideration. Activities often exhibit serious intra-class variation caused by changes in individual appearance or temporal evolution, which confuses recognition algorithms. To solve this, Lan et al. [52] presented a method which additionally models action primitives and considers their interactions. Sometimes, the activity of a group of people can be classified by counting the actions of the individuals in the scene. Hajimirsadeghi et al. [36] developed a probabilistic structured kernel method based on the multi-instance model to infer cardinality relations, which can reduce the confusion caused by irrelevant individuals. The results show that encoding cardinality relations can significantly improve the performance of group activity classification. Zhou et al. [53] addressed the problem of recognizing mixed group activities contained in one still image. They proposed a four-level structure which captures group-to-group and person-to-person interactions. Experimental results demonstrate that the model is robust to scenes with high crowd density and can handle mixed group activities well.

Tracklets based method
For bottom-up approaches, an integral step is identifying coherent trajectories of each individual. However, tracking multiple individuals at the same time is challenging because of self-occlusion, background clutter or camera shake. These factors lead to inaccurate tracklets which are not stable enough to support the recognition algorithm. Most approaches treat tracking and activity recognition as isolated tasks. Choi and Savarese [26] presented a framework for simultaneously tracking multiple individuals and estimating group activity. The intuition behind treating the two problems jointly is that people's motion and their activity are strongly correlated. Performing the two tasks in a coherent fashion means that the two components can promote each other. They exploited interactions between individuals to guide the target association process and designed a hierarchical graphical model to encode the correlation between activities. Khamis et al. [15] were motivated by the ambiguity of an action within a scene. Sometimes, a person performing different actions may exhibit similar appearance in frame-level features but different motion information in track-level features. They proposed a model which captures the relevance between an individual's action and the motion flow in the video sequence. Finally, group activities are inferred by combining per-frame and per-track cues.

Discussions
Bottom-up approaches are suitable for recognizing group activities with a limited number of members, each of whom has a distinct role. An example is a presentation in a meeting room: the presenter is talking while the other members are listening or taking notes. This type of group activity requires methods that can recognize the actions of each individual and the structure among them. The HMM-based model is suited to such hierarchical structure. At the bottom layer, atomic actions of individuals are recognized from sequences, while the second layer models the activities of the group. Context information in the scene helps to differentiate ambiguous activities such as standing and queuing. Descriptor based methods propose various feature descriptors extracted from a focal individual and its surrounding area to integrate contextual information. Unlike the descriptor based methods, which provide context information between the focal individual and all people within the group, the interaction context model provides person-to-person, person-to-group and group-to-group interaction information, which makes it possible to tackle complicated interaction scenarios. For bottom-up approaches, identifying coherent trajectories of each individual is a preprocessing step for group activity recognition. Previous methods isolate the tasks of tracking and recognition; however, a person′s motion and their activity are often correlated. The goal of the tracklets based method is to perform the two tasks jointly so that they promote each other. A comparison between bottom-up approaches is shown in Table 3.

Deep learning based methods
Recently, deep convolutional neural networks (CNNs) have demonstrated impressive performance on a variety of computer vision tasks including image classification [54], semantic segmentation [55], image super-resolution [56] and video recognition [57]. Several deep learning approaches have been proposed for group activity recognition and have achieved results superior to those of handcrafted approaches. This section reviews deep learning based methods for group activity recognition. We summarize four key problems for group activity recognition: hierarchical temporal modeling, relationship modeling, attention modeling and a unified modeling framework. We categorize methods according to the key problem they focus on. The comparison results for deep learning based methods are shown in Table 4.

Hierarchical temporal modeling
Group activity recognition requires reasoning about a collection of persons simultaneously. A challenge for this task is how to design networks that allow the learning algorithm to focus on differentiating higher-level classes of activities, which concern the spatial and temporal evolution of the group activity. The long short-term memory (LSTM) network [70], a particular type of recurrent neural network, has achieved great success in sequential tasks including speech recognition [71] and image caption generation [72]. For group activity recognition, some researchers have applied LSTMs to construct a hierarchical structure representation to infer individual actions and group activities [10, 20, 58−60, 73−76].
Ibrahim et al. [10] proposed a two-stage hierarchical deep temporal model (HDTM). The first stage applies a person-level LSTM to the tracklets of each individual to model individual activities. In the second stage, a group-level LSTM is adopted to combine individual-level information and form group-level features for group activities. This method is the first work that incorporates a deep LSTM framework to address group activity recognition. Besides person-person and person-group interactions, group activity is often associated with interactions between sub-groups. Wang et al. [58] proposed a multi-level interaction context encoding network on the basis of the hierarchical LSTM framework [10]. The network models three levels of interaction: individual dynamics, intra-group individual interactions and inter-group interactions. To enrich person-level features, they deployed a contextual binary encoder which encodes the sub-actions in the framework.
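The two-stage structure described above can be illustrated with a minimal sketch. To stay short and dependency-free, it swaps the LSTMs for a plain tanh recurrent cell with fixed random weights, and it pools person states with a maximum over persons; these choices, along with the names `rnn_step` and `hierarchical_forward` and all dimensions, are illustrative assumptions and are not taken from [10].

```python
import math
import random

def make_matrix(rows, cols, rng):
    """Random weight matrix; stands in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def rnn_step(w_in, w_rec, x, h):
    """One step of a plain tanh recurrent cell (a stand-in for an LSTM cell)."""
    out = []
    for i in range(len(h)):
        s = sum(w_in[i][j] * x[j] for j in range(len(x)))
        s += sum(w_rec[i][j] * h[j] for j in range(len(h)))
        out.append(math.tanh(s))
    return out

def hierarchical_forward(tracklets, params):
    """tracklets[p][t] is the feature vector of person p at frame t."""
    p_in, p_rec, g_in, g_rec = params
    num_frames = len(tracklets[0])
    person_h = [[0.0] * len(p_rec) for _ in tracklets]
    group_h = [0.0] * len(g_rec)
    for t in range(num_frames):
        # Stage 1: person-level recurrence along each tracklet.
        person_h = [rnn_step(p_in, p_rec, track[t], h)
                    for track, h in zip(tracklets, person_h)]
        # Pool person states into one frame-level vector (max over persons).
        pooled = [max(h[i] for h in person_h) for i in range(len(p_rec))]
        # Stage 2: group-level recurrence over the pooled frame vectors.
        group_h = rnn_step(g_in, g_rec, pooled, group_h)
    return group_h

rng = random.Random(0)
feat_dim, person_dim, group_dim = 4, 3, 2
params = (make_matrix(person_dim, feat_dim, rng),
          make_matrix(person_dim, person_dim, rng),
          make_matrix(group_dim, person_dim, rng),
          make_matrix(group_dim, group_dim, rng))
# Three persons, five frames; random features stand in for CNN descriptors.
tracklets = [[[rng.random() for _ in range(feat_dim)] for _ in range(5)]
             for _ in range(3)]
group_feature = hierarchical_forward(tracklets, params)
```

Because the pooling is a maximum over persons, the group feature in this sketch does not depend on the order in which the persons are listed; a classifier over `group_feature` would complete the pipeline.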
Shu et al. [59] argued that existing group activity recognition benchmark datasets (the collective activity dataset [9] and the volleyball dataset [10]) are too small to train a robust LSTM framework. To solve this problem, they proposed the confidence-energy recurrent network (CERN), which extends the two-level hierarchy of LSTMs by incorporating a confidence measure and an energy-based model.
Inspired by the fact that people can easily infer an activity from a sequence of sentences, Li and Chuah [60] presented a semantics-based scheme, namely SBGAR. In the first stage, an LSTM model generates a caption for each video frame. In the second stage, another LSTM model predicts group activities based on the generated captions of a sequence of frames. This is the first cross-modal method for group activity recognition, and it achieved state-of-the-art results at that time.
Sometimes, different group activities share the same local motion, which may cause misclassifications. To reduce the influence of such confusing motions, Kim et al. [73] proposed a discriminative group context feature (DGCF) that takes prominent sub-events into consideration. Two types of features, individual activity features and sub-event features, are extracted to construct group activity representations. The model is based on gated recurrent units (GRU), a variant of the LSTM, to capture the temporal evolution in a video.
Gammulle et al. [74] presented a multi-level sequential generative adversarial network (MLS-GAN) based on an LSTM architecture. This method is the first attempt to introduce GANs to the group activity recognition task. Instead of utilizing manually annotated individual actions, this approach automatically learns sub-actions that are pertinent to the final group activity by means of generative adversarial networks. The generator, trained with sequences of person-level and scene-level features, learns an action representation, and the discriminator performs group activity classification.
Wu et al. [75] proposed the global motion pattern (GMP) to represent complex multi-person motions in sports videos. Global motion patterns extracted by an optical flow algorithm are fed into convolutional neural networks and LSTM networks to extract spatial and temporal features for event classification. They further extended the GMP framework in [20], where a two-stage scheme for event classification in basketball videos is proposed. In the first stage, event occurrence segments and post-event segments are utilized for event classification and for classifying the success or failure of an offense, respectively. The final results are obtained by integrating the event classification results with the success/failure classification results. Previous two-stage LSTM based methods neglect the fact that person-level actions and the group-level activity evolve concurrently over time. To this end, Shu et al. [76] proposed a graph LSTM-in-LSTM (GLIL) network which jointly models the person-level actions and the group-level activity. Multiple P-LSTMs model the person-level actions based on the interactions among individuals. Meanwhile, a G-LSTM models the group-level activity, and the person-level information in the P-LSTMs is selectively integrated into the G-LSTM.

Table 4 (fragment recovered from the extracted text; column headers are not recoverable)
Li and Chuah [60] | Hierarchical temporal modeling | 86.1/89.0 | 66.9
Deng et al. [18] | Deep relationship modeling | 81.2/N | CAD2: 90.23/N | Nursing home: 85.50
Qi et al. [61] | Deep relationship modeling | 89.1/N | 89.3
Ibrahim and Mori [62] | Deep relationship modeling | 89.5
Azar et al. [63] | Deep relationship modeling | 85.75/94.2 | 93.04
Wu et al. [64] | Deep relationship modeling | 91.0/N | 92.6
Hu et al. [65] | Deep relationship modeling | N/93.8 | 91.4
Gavrilyuk et al. [21] | Deep relationship modeling | 92.8/N | 94.4
Yan et al. [66] | Attention modeling | N/92.2 | 87.7
Tang et al. [14] | Attention modeling | N/95.7 | 90.7
Lu et al. [ [68] | Unified modeling framework | 87.1 (entry garbled in the source)
Zhang et al. [69] | Unified modeling framework | 83.8/N | 86.0

Deep relationship modeling
Building relationships between persons and performing relational reasoning are essential for the recognition of higher-level activities. However, modeling relevant relations between people is challenging in group activity recognition because only individual action labels and group activity labels are accessible, without additional knowledge of interaction information. Many studies [18, 21, 61−65, 77−83] explore how to capture contextual information about the persons in the scene and their relations.
Deng et al. [77] focused on modeling the interactions between individuals and their relationships in the scene. This is achieved by a multi-layer perceptron that captures the dependencies among individual actions, group activity and scene labels. They further proposed a structure inference machine [18] which consists of a deep convolutional network combined with a graphical model. They utilized a recurrent neural network to propagate messages between individual people in a scene. Moreover, a trainable gating function is designed to suppress the influence of irrelevant people in the scene.
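The idea of propagating messages between people while gating out irrelevant ones can be sketched as follows. This is only an illustration under stated assumptions: the message is taken to be the neighbour's state itself, the gate is a sigmoid of a linear function of the concatenated pair, and `gated_message_passing` and `gate_w` are hypothetical names; none of these details are taken from [18].

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_message_passing(states, gate_w, steps=2):
    """Each person's state is refined with gated messages from the others."""
    for _ in range(steps):
        new_states = []
        for i, h_i in enumerate(states):
            update = list(h_i)
            for j, h_j in enumerate(states):
                if i == j:
                    continue
                # Gate in (0, 1) computed from the pair; near 0 blocks the edge.
                gate = sigmoid(sum(w * v for w, v in zip(gate_w, h_i + h_j)))
                update = [u + gate * m for u, m in zip(update, h_j)]
            new_states.append([math.tanh(u) for u in update])
        states = new_states
    return states

states = [[1.0, 0.5], [0.2, 0.8], [0.9, 0.1]]
# Strongly negative gate weights close every edge: no information is exchanged.
blocked = gated_message_passing(states, gate_w=[-50.0] * 4, steps=1)
# Strongly positive gate weights open every edge: all messages pass through.
opened = gated_message_passing(states, gate_w=[50.0] * 4, steps=1)
```

In a trained model the gate weights would be learned, so that edges to people who do not contribute to the group activity are driven towards the closed case.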
Qi et al. [61] proposed an attentive semantic recurrent neural network, namely stagNet. A semantic graph is built from word labels and visual data. Individual actions and temporal contextual information are integrated by a structural-RNN model. The spatial relationships between individual people are inferred in the semantic graph via a message passing mechanism. Beyond that, person-level spatial attention and frame-level temporal attention are designed to automatically discover the key persons and the key frames.
To acquire a compact relational representation of each individual person, Ibrahim and Mori [62] developed the relational layer, which refines relationship representations based on a relation graph. In the relational layer, each pair of individual features is aggregated by a shared neural network into a new feature that represents their relationship. By stacking multiple relational layers, a compact group representation encoding hierarchical relationships of interaction is obtained. Existing methods have not thoroughly explored the spatial relationships between persons. To address this issue, Azar et al. [63] proposed a novel spatial representation, dubbed an activity map, based on individual and group activities. Motivated by [84], the activity map is refined in multiple stages to reduce incorrect representations. An aggregation method ensures that the refined activity map produces reliable group activity labels.
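A relational layer of the kind described above can be sketched in a few lines. The sketch assumes a single-layer shared network with a tanh non-linearity, summation over graph neighbours and max pooling over persons at the end; these choices and the names `shared_pair_fn` and `relational_layer` are assumptions for illustration, not the exact design of [62].

```python
import math
import random

def shared_pair_fn(weights, a, b):
    """Shared single-layer network applied to the concatenated pair (a, b)."""
    pair = a + b
    return [math.tanh(sum(w * v for w, v in zip(row, pair))) for row in weights]

def relational_layer(features, graph, weights):
    """New feature of person i: sum of shared_pair_fn over its graph neighbours."""
    out_dim = len(weights)
    new_features = []
    for i, feat_i in enumerate(features):
        acc = [0.0] * out_dim
        for j in graph[i]:
            rel = shared_pair_fn(weights, feat_i, features[j])
            acc = [a + r for a, r in zip(acc, rel)]
        new_features.append(acc)
    return new_features

rng = random.Random(1)

def rand_weights(out_dim, in_dim):
    """Random weights for a pair network over two in_dim-dimensional inputs."""
    return [[rng.uniform(-1, 1) for _ in range(2 * in_dim)] for _ in range(out_dim)]

features = [[rng.random() for _ in range(4)] for _ in range(3)]
clique = {0: [1, 2], 1: [0, 2], 2: [0, 1]}  # fully connected relation graph
layer1 = rand_weights(3, 4)  # 4-d person features -> 3-d relational features
layer2 = rand_weights(2, 3)  # stacked layer compresses further to 2-d
hidden = relational_layer(features, clique, layer1)
compact = relational_layer(hidden, clique, layer2)
# Pool the per-person outputs into a single group representation.
group_repr = [max(p[k] for p in compact) for k in range(2)]
```

Stacking layers with decreasing output size, as in `layer1` and `layer2`, is what yields the compact representation; a different relation graph can be passed to each layer.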
Graph convolutional networks (GCNs) [85] have become an emerging topic in deep learning. GCNs have been applied to many fields of computer vision such as visual tracking [86] and single-person action recognition [87, 88]. Graph convolutional networks are a suitable model for group activity recognition, in which each person can be regarded as a node. Wu et al. [64] introduced GCNs into group activity recognition. Person-level features are extracted by convolutional neural networks, and an actor relation graph is built based on the visual similarity and the spatial distance between individual persons. Graph convolutional networks are then adopted to perform relational reasoning on the actor relation graph to acquire the relational features of each person.
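A minimal sketch of this pipeline is given below: an actor relation graph is built from appearance similarity and spatial distance, and one graph convolution is applied to it. The dot-product similarity, the hard distance threshold `max_dist`, the softmax normalisation and the function names are assumptions made for illustration; the text above only states that the graph depends on visual similarity and spatial distance.

```python
import math

def relation_graph(appearance, positions, max_dist):
    """Row-normalised relation matrix from appearance similarity and distance."""
    n = len(appearance)
    graph = []
    for i in range(n):
        scores = []
        for j in range(n):
            dist = math.dist(positions[i], positions[j])
            if dist <= max_dist:
                sim = sum(a * b for a, b in zip(appearance[i], appearance[j]))
                scores.append(sim)
            else:
                scores.append(None)  # too far apart: no edge
        top = max(s for s in scores if s is not None)
        exps = [0.0 if s is None else math.exp(s - top) for s in scores]
        total = sum(exps)
        graph.append([e / total for e in exps])
    return graph

def graph_conv(graph, features, weight):
    """One graph convolution: ReLU(G X W)."""
    mixed = [[sum(g * f[k] for g, f in zip(row, features))
              for k in range(len(features[0]))] for row in graph]
    return [[max(0.0, sum(m[k] * weight[k][c] for k in range(len(m))))
             for c in range(len(weight[0]))] for m in mixed]

appearance = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]    # toy person-level features
positions = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]    # toy box centres
graph = relation_graph(appearance, positions, max_dist=2.0)
weight = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]          # stands in for learned W
relational = graph_conv(graph, appearance, weight)
```

In this toy scene the third person is far from the other two, so its row of the graph contains only the self-connection and its relational feature depends on its own appearance alone.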
Hu et al. [65] were the first to apply deep reinforcement learning to relation learning in group activity recognition. A semantic relation graph is built to model the relations of persons in the scene. Then, two agents based on Markov decision processes are applied to refine the graph. The relation-gating agent is responsible for enforcing the learning of relevant relations and discarding irrelevant ones. The feature-distilling agent distills the key frames of features, which is similar to a temporal attention mechanism.
Xu et al. [79] proposed a multi-modal relation representation with temporal-spatial attention which infers relations from appearance features and motion information. Two types of inference modules, opt-GRU and relation-GRU, which are used to encode the object relationship and motion representation effectively, are introduced to form the discriminative frame-level feature representation.
Inspired by a transformer network [80] which relies on self-attention mechanisms to allow the network to adaptively extract the most relevant information and relationships, Gavrilyuk et al. [21] proposed an actor-transformers network which learns interactions between the actors and adaptively extracts the important information for activity recognition.
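The self-attention operation at the core of such a model can be sketched as below. For brevity the sketch uses identity query, key and value projections and a single head, which is a simplification and not the architecture of [21]; `self_attention` and the toy actor features are hypothetical.

```python
import math

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(actors):
    """Scaled dot-product self-attention with identity projections."""
    scale = math.sqrt(len(actors[0]))
    outputs, weights = [], []
    for query in actors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in actors]
        attn = softmax(scores)
        weights.append(attn)
        # Each output is a weighted mixture of all actor features.
        outputs.append([sum(a * value[d] for a, value in zip(attn, actors))
                        for d in range(len(query))])
    return outputs, weights

actors = [[2.0, 0.0], [2.0, 0.1], [0.0, 2.0]]  # toy actor features
refined, attention = self_attention(actors)
```

Each row of `attention` sums to one, and the first actor attends more strongly to the second actor, whose features are similar, than to the third.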
In real scenes, individuals may perform their own actions or may belong to a social group, and several groups of people may have different social connections. Ehsanpour et al. [81] proposed a new task, social activity recognition, which simultaneously performs individual action prediction, social group division and sub-group activity prediction.

Attention modeling
In group activity recognition, there are usually several persons active in the scene, but only a few key persons contribute to the group activity; the others are irrelevant and may bring confusing information for inferring group activities. Due to the lack of key person annotations in group activity recognition datasets, this problem can be defined as weakly supervised important people detection. To address this issue, several methods [11, 14, 66, 67, 89−92] have designed attention mechanisms.
Ramanathan et al. [11] worked on basketball event detection, where events depend on a subset of players. They formulated a spatial and temporal attention model to attend to the players relevant to events in the scene and applied a weighted summation mechanism to the person-level features, which leads to a better representation for event detection.
Yan et al. [66] observed that actors who move steadily during the whole process or move remarkably at a certain moment contribute more to the group activity. To measure the mean motion intensity, which represents the long motion of an actor, they stack the optical flow images of the video clip and calculate their mean intensity. The intensity of the flash motion of an actor is captured by learning an attention factor to weight the sum of his/her hidden states from the LSTM at every time step. In [86], they further proposed a coherence constrained graph LSTM with a temporal confidence gate and a spatial confidence gate to control the memory updating. Meanwhile, an attention mechanism is constructed to measure the contribution of a motion at each time step.
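The two motion cues can be illustrated with a small sketch. It assumes that the optical flow of an actor is available as a list of per-pixel (dx, dy) vectors for each frame and that the attention factors are given as raw scores normalised by a softmax; the data layout and the names `mean_motion_intensity` and `flash_motion_feature` are illustrative assumptions rather than the implementation of [66].

```python
import math

def mean_motion_intensity(flow_clip):
    """Mean optical-flow magnitude over all pixels of all stacked frames."""
    total, count = 0.0, 0
    for frame in flow_clip:
        for dx, dy in frame:
            total += math.hypot(dx, dy)
            count += 1
    return total / count

def flash_motion_feature(hidden_states, scores):
    """Softmax-weighted sum of per-time-step hidden states."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states))
            for d in range(dim)]

# Two frames with two pixels each; the steady actor moves in every frame,
# the other actor only moves in one pixel of the last frame.
steady_actor = [[(3.0, 4.0), (3.0, 4.0)], [(3.0, 4.0), (3.0, 4.0)]]
sudden_actor = [[(0.0, 0.0), (0.0, 0.0)], [(0.0, 0.0), (6.0, 8.0)]]
steady_score = mean_motion_intensity(steady_actor)
sudden_score = mean_motion_intensity(sudden_actor)
# Equal scores give equal weight to both time steps.
flash = flash_motion_feature([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, 0.0])
```

The mean intensity favours the steadily moving actor, whereas a sudden motion is better captured by the attention-weighted hidden states, where a high score at one time step dominates the sum.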
Previous methods address key actor detection with self-attention mechanisms, which are unreliable and lack interpretability. Tang et al. [14] provided a new insight into designing attention networks for group activity recognition. A teacher network in the semantic domain is designed to recognize group activities based on the words of individual action labels. Then a student network in the visual domain is trained to infer group activities from video clips. In the training process, the teacher network distills attention knowledge into the student network, which is effective for mining the key people and suppressing the irrelevant people without requiring extra labels. In [90], they extended the teacher network and the student network with two types of graph neural networks. As the graph convolutional modules pass messages between different nodes, the relationships among different people in the scene can be explored.
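One way to picture the distillation step is as an extra term in the student's training loss that penalises disagreement with the teacher's attention over people. The sketch below assumes a cross-entropy classification loss plus a mean squared error between the two attention vectors; this particular form, the `weight` parameter and the function names are assumptions for illustration and may differ from the objective used in [14].

```python
import math

def cross_entropy(logits, label):
    """Negative log-likelihood of `label` under softmax(logits)."""
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[label]

def attention_distillation_loss(student_logits, label,
                                student_attn, teacher_attn, weight=1.0):
    """Classification loss plus a penalty for deviating from teacher attention."""
    mse = sum((s - t) ** 2 for s, t in zip(student_attn, teacher_attn))
    mse /= len(teacher_attn)
    return cross_entropy(student_logits, label) + weight * mse

teacher_attn = [0.7, 0.2, 0.1]  # teacher attention over three people
aligned = attention_distillation_loss([2.0, 0.5], 0, [0.7, 0.2, 0.1], teacher_attn)
misaligned = attention_distillation_loss([2.0, 0.5], 0, [0.1, 0.2, 0.7], teacher_attn)
```

With identical predictions, the student that attends to the same key person as the teacher receives the lower loss, so no key person labels are needed beyond what the teacher already encodes.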
Lu et al. [91] proposed a two-level attention mechanism for group activity recognition. The first, individual-level attention is guided by pose features to control the hidden state at each time step. The second, scene-level attention assigns different weights to individuals to construct a discriminative scene representation. This method depends on pose estimation. Lu et al. [67] improved it and proposed a graph attention interaction model with graph attention blocks to capture unbalanced interaction relations at the individual and group levels.

Unified modeling framework
Group activity recognition in videos usually involves multi-person detection, multi-person tracking and activity recognition. Most existing methods separate the modeling of human detection/tracking from group activity recognition. They usually adopt an off-the-shelf human detection and tracking algorithm to preprocess the input video sequences, and focus on designing a high-performance structured model to classify activities. However, such practice has several drawbacks. First, decoupling the human detector from the group activity classifier ignores the inner correlation between the two modules and leads to suboptimal results for both parts. Second, the features extracted by detectors for individual people are also useful for inferring group behaviors, whereas separate learning needs to extract features through separate backbone networks, which leads to extra computation.
Bagautdinov et al. [68] presented a unified framework to solve the aforementioned issues. They utilized the multiscale feature maps output by a fully convolutional network to address three tasks: multi-people detection, individual action recognition and group activity recognition. A matching mechanism is designed for associating the same person in consecutive frames and features are fused by standard GRU in the temporal domain.
Zhang et al. [69] focused on reducing the inference time of group activity recognition. They proposed to perform human detection and activity reasoning simultaneously in an end-to-end framework, within which a shared backbone network is exploited to extract features. Experiments demonstrate that people who are outliers for the activity can be filtered out effectively and that the two tasks, human detection and group activity recognition, can reinforce each other. On top of that, they proposed a latent embedding scheme for building person-person and person-group interaction relations.
Zhuang et al. [93] explored a new representation for group activity recognition to avoid a heavy dependency on the accuracy of human detection and tracking. They proposed a differential recurrent convolutional neural network (DRCNN) which does not need each person′s bounding box as input and requires no complicated preprocessing steps. Unlike existing methods where feature extraction and parameter learning are separate, DRCNN jointly optimizes both in a unified deep learning framework.

Discussions
Recent deep learning based methods for group activity recognition demonstrate promising improvements in performance over traditional methods. Unlike learned features, handcrafted descriptors are not learned automatically for discrimination, and their discriminative power is usually not guaranteed. Hierarchical temporal modeling based methods use a two-stage LSTM model to learn a temporal representation of individual-level actions and apply pooling functions to individual features to generate a group-level representation. This two-stage LSTM framework inspired a lot of follow-up work. Its limitation is that it treats all individuals with equal importance. However, the group activity is usually defined by a few key persons in some scenarios, such as sports videos. Attention modeling based methods attempt to solve this issue, and many modifications have been proposed. From Table 4, we can see that this kind of method achieves higher performance than hierarchical temporal modeling based methods. However, due to the lack of key person annotations, how to learn a stable model which can accurately find key individuals is still a difficult problem. Currently, relationships among entities have been widely leveraged in various computer vision tasks. Various relation reasoning methods, such as GCNs and transformers, have been introduced into group activity recognition. The advantage of deep relationship modeling based methods is that they can capture potential interactions and relationships between persons, which are effective for discriminating individual actions and group activities. This category of methods achieves the best results on the CAD and volleyball datasets. Unified modeling framework methods attempt to perform person detection and group activity recognition jointly in a single neural network, which can speed up the algorithm and bring it closer to practical applications. However, they cannot yet achieve state-of-the-art recognition accuracy.
Most of the existing methods directly adopt bounding boxes from annotations, which are inaccessible in practical applications. Research on this topic is limited. Weakly-supervised group activity recognition, where only the video-level group activity label is accessible, could be another direction for group activity recognition.

Challenges and trends
1) Reliable relation representation. Relation representation matters for group activity recognition. Group activity recognition involves multiple people performing different actions and having varied interactions in a scene. Therefore, inferring group activity requires contextual reasoning about the appearance and relations of people rather than simply a combination of individual actions. Under some circumstances, such as sports videos, the contributions of actors to the group activity are unbalanced, which makes relation representation more difficult. There are some attempts to adopt self-attention mechanisms or graph neural networks for relation modeling in group activity recognition. However, previous works rely on explicit spatial priors to build the model and are limited in modeling temporal relations. Reliable and efficient relation representation among actors in the spatial and temporal domains still needs to be further explored.

2) Powerful spatio-temporal representation. While 2D CNNs have achieved enormous success in image recognition, they are suboptimal for video related tasks, because video is naturally a 3D spatio-temporal signal and temporal information is vital in videos. Most of the existing works on group activity recognition apply 2D CNNs to single frames to extract person-level features and model temporal information with recurrent neural networks on dense frames to extract group-level features. It is worthwhile to investigate whether spatio-temporal representations extracted by 3D CNNs are beneficial for group activity representation. Optical flow, a motion information representation, can complement appearance information for CNN-based methods in individual action recognition, but it is seldom utilized in group activity recognition methods. Introducing efficient motion-related information into group activity representation should be investigated.
3) Robust human detection, tracking and recognition. Accurate detection and tracking results are the foundation of feature extraction in high-level group activity recognition tasks. Although general detection and tracking are well-studied fields, it is challenging to detect and track multiple individuals accurately because of frequently occurring inter-object occlusions, target-similar distractors, etc. Most of the existing methods focus on designing a structured model to classify group activity. They directly adopt bounding boxes either from annotations, which are inaccessible in practical applications, or from the results of a third-party algorithm trained for general detection or tracking purposes, which is not optimal for handling multiple objects. Another approach is to integrate human detection and group activity recognition in a unified framework, which speeds up the algorithm by performing multiple tasks in a single feed-forward pass through a neural network. There are some methods working on that, but they cannot yet achieve state-of-the-art classification accuracy. How to better integrate mid-level detection tasks and high-level recognition tasks is another direction for future research.

4) Bigger and more challenging datasets. A brief comparison of existing collective activity recognition datasets is presented in Table 1. As can be seen, most of the datasets were proposed before the deep learning era and are too limited to support the training of complex and representative deep learning models. The most commonly used volleyball dataset was proposed in 2016 and is limited to the domain of volleyball activity. Most algorithms achieve high accuracy on this dataset, on which the best accuracy is currently 94.4% [21]. It is worth studying whether the improvements obtained by current methods can scale up or are just the results of parameter regularization. Ultimately, datasets characterized by challenging real-world scenarios are important for promoting research progress. Detailed annotation of various attributes, such as dense actor bounding boxes or human poses, may provide researchers with a different perspective for solving the problem.

[Figure residue: Traditional methods: trajectory based methods (Vaswani et al. [37]-03); interaction context: AND-OR graph (Amer et al. [27]-12), HiRF (Amer et al. [16]-14).]

Conclusions
In this paper, we present a comprehensive review of state-of-the-art techniques for group activity recognition. These techniques have become particularly attractive in recent years because of their promising prospects in video surveillance and sports video analysis. We survey several aspects of the existing attempts, including handcrafted feature design and models that benefit from deep architectures. We highlight the contributions of each method and analyze their advantages. We also present publicly available datasets and comparisons between different methods. Future research directions are also discussed. For beginners or researchers in this field, this survey can serve as a helpful guide for further research.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article′s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article′s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.