1 Introduction

Consider a simple action such as “opening a book” as shown in Fig. 1. When we humans see the sequence of images, we can easily recognize the action category; yet our current vision systems (with hundreds of layers of 3D convolutions) struggle with this simple task. Why is that? What is missing in current video recognition frameworks?

Fig. 1. How do you recognize simple actions such as “opening a book”? We argue that action understanding requires not only appearance modeling but also capturing temporal dynamics (how the shape of the book changes) and functional relationships. We propose to represent videos as space-time region graphs followed by graph convolutions for inference.

Let us first take a closer look at the sequence shown in Fig. 1. How do humans recognize that the action in the video is “opening a book”? We argue that there are two key ingredients to solving this problem. First, the shape of the book and how it changes over time (i.e., the object state changes from closed to open) is a crucial cue. Exploiting this cue requires temporally linking book regions across frames and modeling actions as transformations. But modeling the temporal dynamics of objects alone is not sufficient: the state of an object changes after interactions with humans or other objects. Thus we also need to model human-object and object-object interactions for action recognition.

However, our current deep learning approaches fail to capture these two key ingredients. For example, the state-of-the-art approaches based on two-stream ConvNets [1, 2] still classify actions based on individual video frames or local motion vectors. Local motion clearly fails to model the dynamics of shape changes. To tackle this limitation, recent work has also focused on modeling long-term temporal information with Recurrent Neural Networks [3,4,5,6] and 3D convolutions [7,8,9,10]. However, all these frameworks extract features from whole scenes and fail to capture long-range temporal dependencies (transformations) or region-based relationships. In fact, most actions are classified based on background information rather than the key objects (e.g., the book in “opening a book”), as observed in [11].

On the other hand, there have been several efforts to explicitly model human-object or object-object interactions [12, 13]. This direction has recently been revisited with ConvNets in efforts to improve object detection [14,15,16], visual relationship detection [17], and action recognition [18]. However, the relationship reasoning is still performed on static images and fails to capture the temporal dynamics of these interactions. It is therefore very hard for these approaches to capture how object states change over time, as well as the causes and effects of these changes.

In this paper, we propose to perform long-range temporal modeling of human-object and object-object relationships via a graph-based reasoning framework. Unlike existing approaches which focus on local motion vectors, our model takes in a long-range video sequence (e.g., more than 100 frames, or 5 s). We represent the input video as a space-time region graph, where each node represents a region of interest in the video. Region nodes are connected by two types of edges: appearance similarity and spatio-temporal proximity. Specifically, (i) Similarity Relations: regions which have similar appearance or are semantically related are connected. With similarity relations, we can model how the states of the same object change, as well as long-range dependencies between any two objects in any frames. (ii) Spatial-Temporal Relations: objects which overlap in space and are close in time are connected via these edges. With spatial-temporal relations, we can capture the interactions between nearby objects as well as the temporal ordering of object state changes.

Given this graph representation, we perform reasoning on the graph and infer the action by applying Graph Convolutional Networks (GCNs) [19]. We conduct our experiments on the challenging Charades [20] and 20BN-Something-Something [21] datasets. Both datasets are extremely challenging as the actions cannot be easily inferred from the background of the scene or the 2D appearance of the objects and humans. Our model shows significant improvements over state-of-the-art action recognition results; in particular, on the Charades dataset, we obtain a \(4.4\%\) boost.

Our contributions include: (a) a novel graph representation with multiple types of relationships between objects in a long-range video; (b) a Graph Convolutional Network for reasoning over multiple relation edges; and (c) state-of-the-art performance with a significant gain in action recognition in complex environments.

2 Related Work

Video Understanding Models. Spatio-temporal reasoning is one of the core research areas in the field of video understanding and action recognition. However, most of the early work focused on spatio-temporal appearance features. For example, a large effort was spent on manually designing video features [22,23,24,25,26,27,28,29,30,31]. Some of these hand-designed features, such as the Improved Dense Trajectories (IDT) [23], are still widely applied and show very competitive results on different video-related tasks. Instead of designing hand-crafted features, recent research has focused on learning deep representations from video data [1, 2, 32,33,34,35,36,37]. One of the most popular models is the two-stream ConvNet [1], where temporal information is modeled by a network that takes 10 optical flow frames as input (<1 s). To better model longer-term information, much work has focused on Recurrent Neural Networks (RNNs) [3,4,5, 38,39,40,41,42,43] and 3D ConvNets [8, 9, 44,45,46,47,48]. However, these frameworks extract features from whole scenes and can hardly model the relationships between different object instances in space and time.

Visual Relationships. Reasoning about pairwise relationships has proven very helpful in a variety of computer vision tasks [12, 13, 49,50,51]. For example, object detection in cluttered scenes can be significantly improved by modeling human-object interactions [13]. Recently, visual relationships have been widely applied together with deep networks in visual question answering [52], object recognition [14,15,16], and intuitive physics [53, 54]. In action recognition, much effort has been devoted to modeling pairwise human-object and object-object relationships [55,56,57]. However, the interaction reasoning in these efforts focuses on static images, and the temporal information is usually modeled by an RNN on image-level features. Thus, these approaches still cannot capture how a certain object state changes over time.

An attempt at modeling pairwise relations in space and time has recently been made with Non-local Neural Networks [58]. However, the non-local operator is applied to every pixel in the feature space (from lower to higher layers), while our reasoning operates on a graph with object-level features. Moreover, the non-local operator does not encode any temporal ordering information, while this is explicitly modeled in our spatial-temporal relations.

Graphical Models. Long-range relationships in images and videos are usually captured by graphical models. One popular direction is the Conditional Random Field (CRF) [59, 60]. In the context of deep learning, especially for semantic segmentation, the CRF is often applied on the outputs of ConvNets by performing mean-field inference [61,62,63,64,65,66]. Instead of using mean-field inference, various simpler feedforward graph-based neural networks have recently been proposed [19, 67,68,69,70,71]. In this paper, we apply Graph Convolutional Networks (GCNs) [19], originally proposed for applications in natural language processing. Our GCN is built by stacking multiple layers of graph convolutions with similarity relations and spatial-temporal relations. The outputs of the GCNs are updated features for each object node, which can be used to perform classification.

Our work is also related to video recognition with object cues [72,73,74] and object graph models [75,76,77,78]. For example, Structural-RNN [77] was proposed to model the spatial-temporal relations between objects (adjacent in time) for video recognition tasks. Different from these works, our space-time graph representation encodes not only local relations but also long-range dependencies between any pair of objects across space and time. Using graph convolutions with long-range relations enables efficient message passing between the starting and ending states of objects. This global graph reasoning framework provides a significant boost over the state of the art.

3 Overview

Our goal is to represent the video as a graph of objects and perform reasoning on the graph for action recognition. An overview of our model is visualized in Fig. 2. Our model takes as input a long clip of video frames (more than 5 s) and forwards it through a 3D Convolutional Neural Network [8, 58]. The output of this 3D ConvNet is a feature map with dimensions \(\mathrm {T} \times \mathrm {H} \times \mathrm {W} \times d\), where \(\mathrm {T}\) is the temporal dimension, \(\mathrm {H} \times \mathrm {W}\) are the spatial dimensions, and d is the number of channels.

Fig. 2. Model Overview. Our model uses 3D convolutions to extract visual features, followed by RoIAlign which extracts a d-dimensional feature for each object proposal. These features are provided as inputs to the Graph Convolutional Network, which performs information propagation based on spatio-temporal edges. Finally, a d-dimensional feature is extracted and concatenated with another d-dimensional video feature to perform classification.

Besides extracting the video features, we also apply a Region Proposal Network (RPN) [79] to extract object bounding boxes (the RPN is omitted from Fig. 2 for simplicity). Given the bounding boxes for each of the \(\mathrm {T}\) feature frames, we apply RoIAlign [80, 81] to extract the features for each bounding box. Note that RoIAlign is applied to each feature frame independently. The feature vector for each object has d dimensions (first aligned to \(7 \times 7 \times d\) and then max-pooled to \(1 \times 1 \times d\)). We denote the number of objects as \(\mathrm {N}\); thus the feature dimensions after RoIAlign are \(\mathrm {N} \times d\).

We now construct a graph which contains \(\mathrm {N}\) nodes corresponding to \(\mathrm {N}\) object proposals aggregated over \(\mathrm {T}\) frames. There are mainly two types of relations in the graph: similarity relations and spatial-temporal relations. For simplicity, we decompose this big graph into two sub-graphs with the same nodes but two different relations: the similarity graph and the spatial-temporal graph.

With the graph representations, we apply Graph Convolutional Networks (GCNs) to perform reasoning. The output of the GCNs has the same dimensions as the input features, i.e., \(\mathrm {N} \times d\). We perform average pooling over all the object nodes to obtain a d-dimensional feature. Besides the GCN features, we also perform average pooling on the whole video representation (\(\mathrm {T} \times \mathrm {H} \times \mathrm {W} \times d\)) to obtain a global feature of the same d dimensions. These two features are then concatenated for video-level classification.

4 Graph Representations in Videos

In this section, we first introduce the feature extraction process with 3D ConvNets and then describe the similarity graph as well as the spatial-temporal graph.

Table 1. Our baseline ResNet-50 I3D model. We use T \(\times \) H \(\times \) W to denote the dimensions of filter kernels and of 3D output feature maps; for filter kernels, the number of channels follows T \(\times \) H \(\times \) W. The input has dimensions \(32\times 224\times 224\) and the residual blocks are shown in brackets.

4.1 Video Representation

Video Backbone Model. Given a long video clip (around 5 s), we sample 32 video frames from it with the same temporal duration between every two frames. We extract features on these frames via a 3D ConvNet. Table 1 shows our backbone model based on the ResNet-50 architecture (motivated by [58]). The model takes as input 32 video frames of \(224 \times 224\), and the output of the last convolutional layer is a \(16 \times 14 \times 14\) feature map (i.e., 16 frames in the temporal dimension and \(14 \times 14\) in the spatial dimensions). The baseline in this paper adopts the same architecture, where classification is performed by global average pooling on the final convolutional features followed by a fully connected layer.

This backbone model is called Inflated 3D ConvNet (I3D) [8, 48, 58] as one can turn a 2D ConvNet into a 3D ConvNet by inflating the kernels during initialization. That is, a 3D kernel with \(t \times k \times k\) dimensions can be inflated from a 2D \(k \times k\) kernel by copying the weights t times and rescaling by 1/t. Please refer to [8, 48, 58] for more initialization details.
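As a rough illustration, the sketch below (PyTorch; the function name and example shapes are ours, not from the paper) inflates a 2D convolution kernel into a 3D one by copying it t times along a new temporal axis and rescaling by 1/t, as described above.

```python
import torch

def inflate_2d_kernel(weight_2d: torch.Tensor, t: int) -> torch.Tensor:
    """Inflate a 2D conv kernel (out_c, in_c, k, k) into a 3D kernel
    (out_c, in_c, t, k, k) by copying the weights t times and rescaling by 1/t,
    so the 3D filter initially responds to a static video like the 2D one."""
    return weight_2d.unsqueeze(2).repeat(1, 1, t, 1, 1) / t

# Hypothetical example: inflate a 7x7 conv from a 2D ResNet into a 5x7x7 conv.
w2d = torch.randn(64, 3, 7, 7)
w3d = inflate_2d_kernel(w2d, t=5)
print(w3d.shape)  # torch.Size([64, 3, 5, 7, 7])
```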

Region Proposal Network. We apply the Region Proposal Network (RPN) in [79, 82] to generate the object bounding boxes of interest on each video frame. More specifically, we use the RPN with ResNet-50 backbone and FPN [83]. The RPN is pre-trained with the MSCOCO object detection dataset [84] and there is no weight sharing between the RPN and our I3D video backbone model. Note that the bounding boxes extracted by the RPN are class-agnostic.

To extract object features on top of the last convolutional layer, we project the bounding boxes from the 16 input RGB frames (sampled from the 32 input frames for I3D at a rate of 1 out of every 2 frames) to the 16 output feature frames. Taking the video features and the projected bounding boxes, we apply RoIAlign [81] to extract the feature for each object proposal. In RoIAlign, each output frame is processed independently. RoIAlign generates \(7\times 7 \times d\) output features for each object, which are then max-pooled to \(1 \times 1 \times d\).
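The sketch below shows how such per-frame pooling could look with torchvision's roi_align; the function name, example shapes, and box construction are hypothetical, and only the \(7 \times 7\) alignment followed by max-pooling to \(1 \times 1 \times d\) comes from the text.

```python
import torch
from torchvision.ops import roi_align

def extract_object_features(feat, boxes_per_frame, spatial_scale):
    """feat: (T, d, H, W) feature frames from the 3D backbone.
    boxes_per_frame: list of T tensors, each (n_t, 4) in input-image coordinates.
    Returns an (N, d) tensor of object features, N = sum of n_t."""
    # Passing a list of per-frame box tensors makes torchvision match boxes to
    # batch entries, so each feature frame is processed independently.
    pooled = roi_align(feat, boxes_per_frame, output_size=(7, 7),
                       spatial_scale=spatial_scale, aligned=True)  # (N, d, 7, 7)
    # Max-pool each 7x7xd region down to a single d-dimensional vector.
    return pooled.flatten(2).max(dim=2).values                      # (N, d)

# Hypothetical usage: 16 feature frames, d=512, 14x14 spatial map,
# boxes projected from 224x224 inputs (scale 14/224).
feat = torch.randn(16, 512, 14, 14)
boxes = []
for _ in range(16):
    xy1 = torch.rand(10, 2) * 100
    boxes.append(torch.cat([xy1, xy1 + 100.0], dim=1))  # (x1, y1, x2, y2)
print(extract_object_features(feat, boxes, spatial_scale=14 / 224).shape)  # (160, 512)
```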

4.2 Similarity Graph

We measure the similarity between objects in the feature space to construct the similarity graph. In this graph, we connect pairs of semantically related objects. More specifically, we have a high-confidence edge between two instances which are (i) the same object in different states in different video frames, or (ii) highly correlated for recognizing the action. Note that the similarity edges are computed between all pairs of objects.

Fig. 3. Similarity Graph \(\mathbf {G}^{sim}\). The figure shows that our similarity graph captures not only similarity in visual space but also correlations (similarity in functional space). The query box is shown in orange and its nearest neighbors in blue; the transparent green boxes are the other, unselected object proposals. (Color figure online)

Formally, let the features of all the object proposals in the video be \(\mathbf {X}=\{\mathbf {x}_1, \mathbf {x}_2, \ldots , \mathbf {x}_\mathrm {N}\}\), where \(\mathrm {N}\) is the number of object proposals and each object proposal feature \(\mathbf {x}_i\) is a \(d\)-dimensional vector. The pairwise similarity, or affinity, between every two proposals can be represented as

$$\begin{aligned} F(\mathbf {x}_i, \mathbf {x}_j) = \phi (\mathbf {x}_i)^T\phi ^{\prime }(\mathbf {x}_j), \end{aligned}$$
(1)

where \(\phi \) and \(\phi ^{\prime }\) represent two different transformations of the original features. More specifically, we have \(\phi (\mathbf {x}) =\mathbf {w}\mathbf {x}\) and \(\phi ^{\prime }(\mathbf {x}) = {\mathbf {w}^{\prime }\mathbf {x}}\). The parameters \(\mathbf {w}\) and \(\mathbf {w}^{\prime }\) are both \(d \times d\) weight matrices which can be learned via back-propagation. Adding the transformation weights \(\mathbf {w}\) and \(\mathbf {w}^{\prime }\) allows us to learn not only the correlations between different states of the same object instance across frames, but also the relations between different objects. We visualize the top nearest neighbors for object proposals in Fig. 3. In the first example, we can see that the nearest neighbors of the laptop include not only the other laptop instances in other frames, but also the human who is operating it.

After computing the affinity matrix with Eq. 1, we normalize each row of the matrix so that the values of all edges connected to a proposal i sum to 1. Motivated by recent works [58, 85], we adopt the softmax function for normalization to obtain the similarity graph,

$$\begin{aligned} \mathbf {G}^{sim}_{ij} = \frac{\exp {F(\mathbf {x}_i, \mathbf {x}_j)}}{\sum _{j=1}^{\mathrm {N}} \exp { F(\mathbf {x}_i, \mathbf {x}_j)} }. \end{aligned}$$
(2)
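A minimal sketch of Eqs. 1 and 2 in PyTorch is given below, assuming the \(\mathrm {N} \times d\) object features are already extracted; the module and variable names are ours.

```python
import torch
import torch.nn as nn

class SimilarityGraph(nn.Module):
    """Builds G^sim (Eqs. 1-2): affinity between linearly transformed object
    features, normalized with a row-wise softmax. A sketch, not the authors' code."""
    def __init__(self, d):
        super().__init__()
        self.phi = nn.Linear(d, d, bias=False)        # w  in Eq. 1
        self.phi_prime = nn.Linear(d, d, bias=False)  # w' in Eq. 1

    def forward(self, x):                             # x: (N, d) object features
        affinity = self.phi(x) @ self.phi_prime(x).t()   # F(x_i, x_j), shape (N, N)
        return torch.softmax(affinity, dim=1)            # each row sums to 1

# Hypothetical usage with N = 800 proposals and d = 512.
x = torch.randn(800, 512)
g_sim = SimilarityGraph(512)(x)
print(g_sim.shape, g_sim.sum(dim=1)[:3])  # (800, 800), rows sum to 1
```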

4.3 Spatial-Temporal Graph

Although the similarity graph captures even long-term dependencies between any two object proposals, it does not capture the relative spatial relations between objects or the ordering of state changes. To encode these spatial and temporal relations, we propose the spatial-temporal graph, where objects at nearby locations in space and time are connected.

Fig. 4. Spatial-Temporal Graph \(\mathbf {G}^{front}\). Highly overlapping object proposals in neighboring frames are linked by directed edges. We plot some example trajectories with blue boxes; the directions of the edges follow the arrow of time. (Color figure online)

Given an object proposal in frame t, we calculate the Intersection over Union (IoU) between this object bounding box and all object bounding boxes in frame \(t+1\). We denote the IoU between object i in frame t and object j in frame \(t+1\) as \(\sigma _{ij}\). If \(\sigma _{ij}\) is larger than 0, we link object i to object j using a directed edge \(i \rightarrow j\) with value \(\sigma _{ij}\). After assigning the edge values, we normalize the values of the edges connected to each proposal i so that they sum to 1 by

$$\begin{aligned} \mathbf {G}^{front}_{ij} = \frac{\sigma _{ij}}{\sum _{j=1}^{\mathrm {N}} \sigma _{ij}}, \end{aligned}$$
(3)

where \(\mathbf {G}^{front}\) is taken as the adjacency matrix for a spatial-temporal graph. We visualize some of the object proposals and the trajectories in Fig. 4.

Besides building the forward graph which connects objects from frame t to frame \(t+1\), we also construct a backward graph in a similar way, connecting objects from frame \(t+1\) to frame t. We denote the adjacency matrix of this backward graph as \(\mathbf {G}^{back}\). Specifically, for overlapping object i in frame t and object j in frame \(t+1\), we construct an edge \(i \leftarrow j\) and assign the value \(\mathbf {G}^{back}_{ji}\) according to the IoU value. By building the spatial-temporal graphs in this bidirectional manner, we obtain richer structural information and enlarge the propagation neighborhood during graph convolutions.
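Below is a minimal sketch of how \(\mathbf {G}^{front}\) and \(\mathbf {G}^{back}\) could be assembled from per-frame proposal boxes using torchvision's box_iou; the function name and the normalization epsilon are our assumptions.

```python
import torch
from torchvision.ops import box_iou

def build_spatiotemporal_graphs(boxes_per_frame):
    """boxes_per_frame: list of T tensors, each (n_t, 4) of proposal boxes.
    Returns (G_front, G_back), each (N, N) with N = sum of n_t.
    Only proposals in consecutive frames with IoU > 0 are connected (Eq. 3)."""
    counts = [b.shape[0] for b in boxes_per_frame]
    offsets = [0]
    for c in counts:
        offsets.append(offsets[-1] + c)
    n = offsets[-1]

    g_front = torch.zeros(n, n)
    for t in range(len(boxes_per_frame) - 1):
        iou = box_iou(boxes_per_frame[t], boxes_per_frame[t + 1])  # (n_t, n_{t+1})
        g_front[offsets[t]:offsets[t + 1], offsets[t + 1]:offsets[t + 2]] = iou
    g_back = g_front.t().clone()  # backward edges j -> i get the same IoU values

    def row_normalize(g):
        # Normalize each non-empty row so its outgoing edge values sum to 1.
        return g / g.sum(dim=1, keepdim=True).clamp(min=1e-6)

    return row_normalize(g_front), row_normalize(g_back)
```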

5 Convolutions on Graphs

To perform reasoning on the graph, we apply the Graph Convolutional Networks (GCNs) proposed in [19]. Different from standard convolutions which operate on a local regular grid, graph convolutions allow us to compute the response of a node based on its neighbors defined by the graph relations. Thus performing graph convolutions is equivalent to performing message passing inside the graph. The outputs of the GCNs are updated features of each object node, which can be aggregated for video classification. We can represent one layer of graph convolutions as

$$\begin{aligned} \mathbf {Z} = \mathbf {G}\mathbf {X}\mathbf {W}, \end{aligned}$$
(4)

where \(\mathbf {G}\) represents one of the adjacency matrices introduced above (\(\mathbf {G}^{sim}\), \(\mathbf {G}^{front}\) or \(\mathbf {G}^{back}\)) with \(\mathrm {N} \times \mathrm {N}\) dimensions, \(\mathbf {X}\) is the input feature matrix of the object nodes with \(\mathrm {N} \times d\) dimensions, and \(\mathbf {W}\) is the weight matrix of the layer, with dimensions \(d \times d\) in our case. Thus the output \(\mathbf {Z}\) of one graph convolutional layer is still \(\mathrm {N} \times d\). Graph convolution operations can be stacked into multiple layers. After each layer of graph convolutions, we apply two non-linear functions, Layer Normalization [86] and then ReLU, before the feature \(\mathbf {Z}\) is forwarded to the next layer.
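A single layer of Eq. 4 with the Layer Normalization and ReLU described above can be sketched as follows (PyTorch; the class name is ours):

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One layer of Eq. 4: Z = G X W, followed by LayerNorm and ReLU. A sketch."""
    def __init__(self, d):
        super().__init__()
        self.weight = nn.Linear(d, d, bias=False)  # W, d x d
        self.norm = nn.LayerNorm(d)

    def forward(self, x, g):
        # g: (N, N) adjacency matrix, x: (N, d) node features.
        z = g @ self.weight(x)
        return torch.relu(self.norm(z))

# Hypothetical usage: N = 800 object nodes with d = 512 features.
layer = GraphConvLayer(512)
z = layer(torch.randn(800, 512), torch.softmax(torch.randn(800, 800), dim=1))
print(z.shape)  # torch.Size([800, 512])
```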

To combine multiple graphs in GCNs, we can simply extend Eq. 4 as,

$$\begin{aligned} \mathbf {Z} = \sum _{i} \mathbf {G}_i\mathbf {X}\mathbf {W}_i, \end{aligned}$$
(5)

where \(\mathbf {G}_i\) indicates different types of graphs, and the weights \(\mathbf {W}_i\) are not shared across graphs. In this way, each hidden layer of the GCN is updated through the relationships from different graphs. However, we find that directly combining the 3 graphs (\(\mathbf {G}^{sim}\), \(\mathbf {G}^{front}\) and \(\mathbf {G}^{back}\)) with Eq. 5 actually hurts the performance compared to using a single similarity graph.

The reason is that our similarity graph \(\mathbf {G}^{sim}\) contains learnable parameters (Eq. 1) and requires back-propagation for updating, while the other two graphs do not require learning. Fusing these graphs in every GCN layer increases the optimization difficulty. Thus we create two branches of graph convolutional networks and only fuse the results of the two GCNs at the end: one GCN adopts Eq. 4 with \(\mathbf {G}^{sim}\) and the other adopts Eq. 5 with \(\mathbf {G}^{front}\) and \(\mathbf {G}^{back}\). These two branches of GCNs perform convolutions separately for L layers, and the final-layer features are summed, giving an \(\mathrm {N} \times d\) output.
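The two-branch design could be sketched as below, reusing the GraphConvLayer from the previous sketch for the similarity branch and summing the \(\mathbf {G}^{front}\) and \(\mathbf {G}^{back}\) messages before the non-linearity as in Eq. 5; the class names are ours and this is not the authors' code.

```python
import torch
import torch.nn as nn

class STGraphConvLayer(nn.Module):
    """One layer of Eq. 5 for the spatial-temporal branch: messages over G^front
    and G^back (unshared weights) are summed before the non-linearity."""
    def __init__(self, d):
        super().__init__()
        self.w_front = nn.Linear(d, d, bias=False)
        self.w_back = nn.Linear(d, d, bias=False)
        self.norm = nn.LayerNorm(d)

    def forward(self, x, g_front, g_back):
        z = g_front @ self.w_front(x) + g_back @ self.w_back(x)
        return torch.relu(self.norm(z))

class TwoBranchGCN(nn.Module):
    """Two separate GCN branches whose final-layer features are summed:
    one over G^sim (Eq. 4) and one over G^front / G^back (Eq. 5)."""
    def __init__(self, d, num_layers=3):
        super().__init__()
        self.sim_layers = nn.ModuleList([GraphConvLayer(d) for _ in range(num_layers)])
        self.st_layers = nn.ModuleList([STGraphConvLayer(d) for _ in range(num_layers)])

    def forward(self, x, g_sim, g_front, g_back):
        x_sim, x_st = x, x
        for sim_layer, st_layer in zip(self.sim_layers, self.st_layers):
            x_sim = sim_layer(x_sim, g_sim)
            x_st = st_layer(x_st, g_front, g_back)
        return x_sim + x_st  # (N, d)
```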

Video Classification. As illustrated in Fig. 2, the updated features after graph convolutions are forwarded to an average pooling layer, which calculates the mean of all the proposal features and yields a \(1 \times d\) representation. Besides the GCN features, we also perform average pooling on the whole video representation to obtain another \(1 \times d\) global feature. These two features are then concatenated for video classification.
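A sketch of this classification head, assuming the GCN outputs and the global feature map are already computed (the class name and the placement of the final linear layer are ours):

```python
import torch
import torch.nn as nn

class VideoClassifierHead(nn.Module):
    """Classification head sketched from Fig. 2: average-pool the GCN object
    features and the global video feature map, concatenate, then classify."""
    def __init__(self, d, num_classes):
        super().__init__()
        self.fc = nn.Linear(2 * d, num_classes)

    def forward(self, gcn_feats, video_feats):
        # gcn_feats: (N, d) updated object features; video_feats: (T, H, W, d).
        pooled_nodes = gcn_feats.mean(dim=0)                  # (d,)
        pooled_video = video_feats.flatten(0, 2).mean(dim=0)  # (d,)
        return self.fc(torch.cat([pooled_nodes, pooled_video], dim=0))

# Hypothetical usage: Charades has 157 classes, trained with a sigmoid loss.
head = VideoClassifierHead(d=512, num_classes=157)
logits = head(torch.randn(800, 512), torch.randn(16, 14, 14, 512))
print(logits.shape)  # torch.Size([157])
```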

6 Experiments

We perform the experiments on two recent challenging datasets: Charades [20] and Something-Something [21]. We first introduce the implementation details.

Training. The training of our backbone models involves pre-training on 2 different datasets following [8, 58]. The model is first pre-trained as a 2D ConvNet on the ImageNet dataset [87] and then inflated into a 3D ConvNet (i.e., I3D) as in [8]. We then fine-tune the 3D ConvNet on the Kinetics action recognition dataset [88], following the training scheme for longer sequences (around 5 s of video) in [58]. Given this initialization, we now describe how to further fine-tune the network on our target datasets.

As specified in Table 1, our network takes 32 video frames as input. These 32 video frames are sampled at 6 fps, so the temporal length of the video clip is around 5 s. The spatial input dimensions are \(224 \times 224\). Following [89], the input frames are randomly cropped from a randomly scaled video whose shorter side is sampled in [256, 320]. To reduce the number of GCN parameters, we add one more \(1 \times 1 \times 1\) convolutional layer on top of the I3D baseline model, which reduces the output channel number from 2048 to \(d=512\). As the Charades and Something-Something datasets have similar numbers of video frames, we adopt the same learning rate schedule for both.

Our baseline I3D model is trained on a 4-GPU machine with a total batch size of 8 clips. Note that we freeze the parameters of all Batch Normalization layers during training. Our model is trained for 100K iterations in total, with a learning rate of 0.00125 for the first 90K iterations, reduced by a factor of 10 for the last 10K iterations. Dropout [90] is applied on the last global pooling layer with a ratio of 0.3.

We set the number of layers in our Graph Convolutional Network to 3. The first two layers are randomly initialized and the last layer is initialized to zero, inspired by [91]. To train the GCN together with the I3D backbone, we apply stage-wise training. We first fine-tune the I3D model as described above, then apply RoIAlign and the GCN on top of the final convolutional features as shown in Fig. 2. We fix the I3D features and train the GCN with the same learning rate schedule as for the backbone. Finally, we train the I3D and GCN together end-to-end for 30K more iterations with the reduced learning rate.
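The stage-wise schedule could look roughly like the sketch below; only the learning rates and iteration counts come from the text, while the use of SGD with momentum 0.9 and the run_iteration callback are our assumptions.

```python
import torch

def train_stagewise(i3d, gcn, run_iteration):
    """run_iteration is a hypothetical callback performing one forward/backward/step."""
    # Stage 1: freeze the I3D backbone and train only the GCN.
    for p in i3d.parameters():
        p.requires_grad = False
    opt = torch.optim.SGD(gcn.parameters(), lr=0.00125, momentum=0.9)
    for it in range(100_000):
        if it == 90_000:  # drop the learning rate by 10x for the last 10K iterations
            for group in opt.param_groups:
                group["lr"] *= 0.1
        run_iteration(opt)

    # Stage 2: unfreeze I3D and train I3D + GCN end-to-end at the reduced rate.
    for p in i3d.parameters():
        p.requires_grad = True
    opt = torch.optim.SGD(list(i3d.parameters()) + list(gcn.parameters()),
                          lr=0.000125, momentum=0.9)
    for _ in range(30_000):
        run_iteration(opt)
```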

Table 2. Ablations on Charades. We show the mean Average Precision (mAP%).

Task-Specific Settings. We apply different loss functions when training on the Charades and Something-Something datasets. For Something-Something, we apply the softmax loss; for Charades, we apply a binary sigmoid loss to handle the multi-label property. We also extract different numbers of object bounding boxes with the RPN on the two datasets. For Charades, the scenes are more cluttered and we extract 50 object proposals per frame. For Something-Something, there are usually only one or two objects at the center of the video frame, and we find that 10 object proposals per frame are enough.

Inference. We perform fully-convolutional inference in space as in [58, 89]. The shorter side of each video frame is scaled to 256 while maintaining the aspect ratio. To test a whole video, we sample 10 clips for Charades and 2 clips for Something-Something, according to the average video length in the two datasets. Scores from multiple clips are aggregated by max-pooling.
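For clarity, clip-score aggregation by max-pooling amounts to the following one-liner (a hypothetical helper, not the authors' code):

```python
import torch

def aggregate_clip_scores(clip_scores: torch.Tensor) -> torch.Tensor:
    """clip_scores: (num_clips, num_classes) per-clip predictions.
    Video-level scores are obtained by max-pooling over clips."""
    return clip_scores.max(dim=0).values

# Hypothetical usage: 10 clips for a Charades video, 157 classes.
print(aggregate_clip_scores(torch.randn(10, 157)).shape)  # torch.Size([157])
```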

6.1 Experiments on Charades

In the Charades experiments, we follow the official split with 8K training videos and 1.8K validation videos. The average video duration is 30 s. There are 157 action classes and multiple actions can happen at the same time.

How Much Does Each Graph Help? We first analyze each component of our framework with the ResNet-50 I3D backbone, as shown in Table 2a. The I3D baseline, without any proposal extraction or graph convolutions, achieves \(31.8\%\) mAP on the validation set.

One simple extension of this baseline is to obtain region proposals with the RPN, extract the features for each proposal, and average-pool them as an extra feature. We concatenate the video-level feature and this proposal feature for classification. However, this yields only a \(0.3\%\) boost; a naive aggregation of proposal features does not help much.

We then evaluate GCNs with the similarity graph and the spatial-temporal graph individually. Our GCN with only the spatial-temporal graph obtains a \(2.4\%\) boost over the baseline, achieving \(34.2\%\). With the similarity graph, we achieve a better performance of \(35.0\%\). By combining the two graphs and training GCNs with multiple relations, our method achieves \(36.2\%\) mAP, a significant boost of \(4.4\%\) over the baseline.

Robustness to Proposal Numbers. We also analyze how the number of object proposals generated by the RPN affects our method. Our method achieves \(36.2\%\) with 50 object proposals per video frame; with 25 (100) proposals per frame, it achieves \(35.9\%\) (\(36.1\%\)) mAP. Thus our approach is very stable with respect to the RPN output.

Model Complexity. Given this large improvement in performance, the extra computation cost of the GCN over the baseline is very small. On the Charades dataset, our graph is defined over 800 object nodes per video (16 output frames and 50 object proposals per frame). The FLOPs of the baseline I3D model are \(153 \times 10^9\) and the total FLOPs of our model (I3D + Joint GCN) are \(158 \times 10^9\), only around a \(3\%\) increase.

Comparison to the Non-local Net. One closely related work is the recently proposed Non-local Neural Networks [58], which perform spatial-temporal reasoning on different layers of feature maps. As shown in Table 2b, the non-local operations give a \(1.7\%\) improvement over the baseline, and our approach performs \(2.7\%\) better than the Non-local Net. The two approaches are actually complementary: by replacing the I3D backbone with the Non-local Net, we obtain another \(1.3\%\) boost, reaching \(37.5\%\).

Error Analysis. Given these significant improvements, we would also like to find out in which cases our method improves most over the baseline. Following the attributes defined in [11], we show in Fig. 5 three situations where our approach achieves the largest gains. Specifically, each video in Charades is labeled not only with action classes but also with different attributes (e.g., whether the actions happen in a sequence).

Fig. 5. Error analysis. We compare our approach against the baseline I3D across three different attributes. Our approach improves most when the action is part of a sequence, involves interaction with objects, or has high pose variance.

Part of a Sequence? This attribute specifies whether an action category is part of a sequence of actions. For example, “holding a cup” and then “sitting down” usually occur in a sequence, while “running” often happens in isolation. As shown in the left plot in Fig. 5, the baseline I3D method fails dramatically when an action is part of a sequence of actions, while our approach is more stable. When an action does not happen in isolation, we have more than a \(5\%\) gain over the baseline.

Pose Variances. This attribute is computed by averaging the Procrustes distance [92] between any two poses in an action category. If the average distance is large, the poses change a lot within an action. As visualized in the middle plot in Fig. 5, our approach performs similarly to the baseline when the pose variance is small. However, the performance of the baseline again drops dramatically as the pose variance becomes larger (from 0.68 to 0.73), while the slope of our curve is much smaller. The performance of both approaches improves as the pose variability reaches 0.83, where our approach has around an 8%–9% boost over the baseline.

Involves Objects? This attribute specifies whether an object is involved in the action. For example, “drinking from a cup” involves the object cup, while “running” does not require interaction with objects. As shown in the right plot in Fig. 5, the baseline performs worse when the actions require interaction with objects. Interestingly, our approach actually performs slightly better when objects are involved.

Thus our approach is better at modeling long sequences of actions and actions that require object interactions, and it is robust to pose changes.

Table 3. Classification mAP (%) in the Charades dataset [20]. NL indicates Non-Local.

Training with a Larger Backbone. Besides the ResNet-50 backbone, we also verify our method on the much larger backbone used in [58]. This backbone is larger than our baseline in 3 aspects: (i) it is based on the ResNet-101 architecture instead of ResNet-50; (ii) it takes \(288 \times 288\) spatial inputs instead of \(224 \times 224\); (iii) it samples frames more densely, using 128 frames at 24 fps as input instead of 32 frames at 6 fps. Note that the temporal output dimension of both our baseline model and this ResNet-101 backbone is still the same (16 frames). With all these modifications, the FLOPs are 3 times those of our ResNet-50 baseline.

We show the results together with the state-of-the-art methods in Table 3. The Non-local Net [58] with a ResNet-101 backbone achieves \(37.5\%\) mAP. We obtain the same performance with our method using the much smaller ResNet-50 backbone (around 1/3 of the FLOPs). When applied with the ResNet-101 backbone, our method (I3D + GCN) still gives a \(3.6\%\) improvement, reaching \(39.1\%\). This is further evidence that our method models something very different from simply increasing the spatial input size or the depth of the ConvNet. By combining the non-local operations with our approach, we obtain a final performance of \(39.7\%\).

6.2 Experiments on Something-Something

The Something-Something dataset has 86K training videos, around 12K validation videos, and 11K testing videos, with 174 classes. The data in Something-Something is very different from Charades: in Charades, most actions are performed by agents in cluttered indoor scenes, whereas in Something-Something all videos are object-centric and usually only one or two hands are in action. The background in Something-Something is also very clean in most cases.

Table 4. Classification accuracy (%) in the Something-Something dataset [21].

We report our results in Table 4. The baseline I3D approach achieves \(41.6\%\) top-1 accuracy and \(72.2\%\) top-5 accuracy. By applying our method with the I3D backbone (I3D + Joint GCN), we achieve a \(1.7\%\) improvement in top-1 accuracy. The improvement here is not as large as on Charades, mainly because the videos are already well calibrated, with objects at the center of the frames.

We have also combined our method with the Non-local Net. As shown in Table 4, the Non-local I3D method achieves \(44.4\%\) top-1 accuracy. By combining our approach with the Non-local Net, we achieve another \(1.7\%\) gain in top-1 accuracy, which leads to the state-of-the-art result of \(46.1\%\). We also evaluate our final model on the test set by submitting to the official website. Using a single RGB model, we achieve the best result of \(45.0\%\) on the leaderboard.