ICE-GCN: An interactional channel excitation-enhanced graph convolutional network for skeleton-based action recognition

Thanks to the development of depth sensors and pose estimation algorithms, skeleton-based action recognition has become prevalent in the computer vision community. Most existing works are based on spatio-temporal graph convolutional network frameworks, which learn and treat all spatial or temporal features equally, ignoring the interaction with the channel dimension that reveals the different contributions of different spatio-temporal patterns along the channel direction, and thus losing the ability to distinguish confusing actions with subtle differences. In this paper, an interactional channel excitation (ICE) module is proposed to explore discriminative spatio-temporal features of actions by adaptively recalibrating channel-wise pattern maps. More specifically, a channel-wise spatial excitation (CSE) is incorporated to capture the crucial global body structure patterns and excite the spatial-sensitive channels. A channel-wise temporal excitation (CTE) is designed to learn temporal inter-frame dynamics and excite the temporal-sensitive channels. ICE enhances different backbones as a plug-and-play module. Furthermore, we systematically investigate strategies of graph topology and argue that complementary information is necessary for sophisticated action description. Finally, equipped with ICE, an interactional channel excited graph convolutional network with complementary topology (ICE-GCN) is proposed and evaluated on three large-scale datasets, NTU RGB+D 60, NTU RGB+D 120, and Kinetics-Skeleton. Extensive experimental results and ablation studies demonstrate that our method outperforms other state-of-the-art methods and prove the effectiveness of the individual sub-modules. The code will be published at https://github.com/shuxiwang/ICE-GCN.


Introduction
Human action recognition has attracted increasing attention in the area of computer vision and finds various applications in human-machine interaction, video surveillance, virtual reality, and so on [1][2][3][4]. Recently, with the emergence of high-precision depth sensors such as Microsoft Kinect [5] and advanced human pose estimation algorithms [6][7][8], skeleton coordinates can be obtained accurately and economically. With its robustness to variations in body size, viewpoint, and complicated backgrounds, as well as its efficiency in storage and computational cost, skeleton data have become the mainstream input compared with other modalities, such as traditional RGB videos.
The early-stage deep learning-based approaches directly treated human joint coordinates as sequences of coordinate vectors [9][10][11][12] or pseudo-images [13][14][15] and fed them into convolutional neural networks (CNNs) or recurrent neural networks (RNNs). Such representations overlook the intrinsic graph structure relationship among joints, which is crucial for recognizing human action. To solve this issue, Yan et al. [16] recently proposed a spatio-temporal graph convolutional network (ST-GCN) to model the skeleton data as a graph structure, representing the joints as graph nodes and the joint connections as graph edges. In the spatial dimension, the joint topology is defined by a sequence of adjacency matrices, and a graph convolutional network (GCN) is utilized to capture the spatial relationship of the joints in each frame. In the temporal dimension, a temporal convolution (TCN) is applied to capture the inter-frame relationship of each node. ST-GCN is the first and classical network that introduced GCNs to the task of skeleton-based action recognition, and it was followed by many improvements and variants [17][18][19][20][21][22][23][24].
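As a rough illustration of this "spatial graph convolution followed by temporal convolution" design, the sketch below shows one ST-GCN-style layer over a (N, C, T, V) skeleton tensor in PyTorch. It is a minimal hypothetical example, not the authors' implementation: the identity matrix stands in for the normalized physical adjacency, and the layer sizes are illustrative.

```python
# Minimal sketch of one ST-GCN-style layer (hypothetical shapes and layer
# sizes): a spatial graph convolution followed by a temporal convolution.
import torch
import torch.nn as nn

class STGCNLayer(nn.Module):
    def __init__(self, in_ch, out_ch, num_joints):
        super().__init__()
        # placeholder for the fixed physical adjacency A_p (identity here)
        self.register_buffer("A", torch.eye(num_joints))
        self.gcn = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # per-joint feature transform
        # temporal convolution with kernel 9 over the frame axis, "same" padding
        self.tcn = nn.Conv2d(out_ch, out_ch, kernel_size=(9, 1), padding=(4, 0))

    def forward(self, x):                                  # x: (N, C, T, V)
        x = self.gcn(x)                                    # high-level features
        x = torch.einsum("nctv,vw->nctw", x, self.A)       # spatial graph convolution
        return self.tcn(x)                                 # inter-frame modeling per joint

x = torch.randn(2, 3, 64, 25)                              # a batch of skeleton sequences
y = STGCNLayer(3, 16, 25)(x)
assert y.shape == (2, 16, 64, 25)
```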
To enable networks to capture various ranges of dependencies and to enhance the most discriminative joints in intra-frame space, spatial attention mechanisms [18,21,22,25] are applied to generate spatial attention maps for each joint. Based on similar considerations, temporal attention mechanisms [18,22,23,24] are applied to generate temporal attention maps for each frame. However, these joint and frame attention methods treat feature patterns in different channels equally without considering how to select the more informative channel-wise features, which limits the representation capability and is sub-optimal for obtaining discriminative features.
Since different channels encode different motion features [20], the importance of joints varies with the motion features. Exploring the varying importance of the motion features in different channels can therefore emphasize the informative spatio-temporal feature patterns and help the network distinguish confusing actions. Inspired by SENet [26], which first introduced a simple but effective channel attention module for image classification, works [27,28] applied channel attention to calculate channel-wise modulation weights. However, these attention schemes only consider inter-channel information without introducing information from other dimensions. To address this problem, CBAM [29] was proposed to combine channel attention and spatial attention sequentially. Based on this idea, works [18,22,25,27] took spatial and temporal information into account, but these methods treat each single dimension independently and then combine them in a sequential manner, so the other dimensions are globally averaged into a single scalar. Intuitively, the channel and spatio-temporal information are highly related to each other, i.e., the feature patterns in each channel are explored from the spatio-temporal space. Thus, considering the channel and spatio-temporal aspects separately is sub-optimal for exploring finer levels of discriminative joints among intra- and inter-frames.
To address this issue, inspired by works [29][30][31][32][33][34], an interactional channel excitation (ICE) module is proposed to incorporate both spatial and temporal information into channel attention with cross-dimensional interactions. ICE is composed of channel-wise spatial excitation (CSE) and channel-wise temporal excitation (CTE) sub-modules. CSE is applied to capture the crucial global body structure patterns to excite the spatial-sensitive channels. CTE is applied to capture vital temporal dynamics information to excite the temporal-sensitive channels.
Moreover, we systematically investigate strategies of graph topology, which is also essential in determining the representation ability of joint relationships in GCNs. The topology is represented by the adjacency matrix, and various adjacency matrix schemes have been employed by previous works to construct the graph topology. They can mainly be summarized into three categories: A_p (physical) is the fixed predefined matrix, which reflects the natural physical structure of the body [16,35,36]; A_l (learnable) is the learnable matrix, which is parameterized and optimized throughout training [17,37,38]; A_s (similarity) is the Gaussian similarity matrix, which measures the similarity of pairs of vertexes [17,20]. Based on experimental observation, we argue that a complementary topology is necessary, which achieves a good balance between adaptability and the size of the search space.
Finally, equipped with ICE, an interactional channel excited graph convolutional network with complementary topology (ICE-GCN) is proposed. Extensive experiments and ablation studies demonstrate the necessity of the ICE module and the complementary topology scheme. Compared with previous works, the main contributions of our work can be summarized as follows:
• Compared with existing attention mechanisms, which ignore cross-dimensional interaction, our interactional channel excitation (ICE) module embeds spatio-temporal information into channel attention, which allows exploring discriminative spatio-temporal features of actions at a finer channel level, adaptively recalibrating spatial-temporal-aware attention maps along the channel dimension. ICE, composed of a channel-wise temporal excitation (CTE) and a channel-wise spatial excitation (CSE), can be inserted into any existing graph convolutional network as a plug-and-play module to notably enhance performance at only a light computational cost.
• We systematically investigate graph topology strategies and argue that a complementary topology is necessary. Three adjacency sub-matrices A_p, A_l, and A_s are combined to construct the graph topology. This simple but efficient scheme notably improves performance and resolves the dilemma between adaptability and an overly large search space.
• Finally, equipped with ICE, an interactional channel excited graph convolutional network with complementary topology (ICE-GCN) is proposed. Extensive experiments conducted on three large-scale datasets, NTU RGB+D 60, NTU RGB+D 120, and Kinetics-Skeleton, demonstrate that our ICE-GCN achieves state-of-the-art performance. The follow-up ablation experiments and visualizations also show the effectiveness of the individual modules of the graph convolutional network.

Traditional attention mechanisms for skeleton-based action recognition
To model various scales of dependencies and help the network focus on the most informative features, attention modules have been integrated into graph convolutional networks. Work [18] designed a channel attention module based on SENet [26] and generated attentive maps to reweight the channel dimension, averaged over all features of the spatial joints and temporal frames. Similarly, work [21] proposed spatial joint attention to measure the importance of each joint, and works [23,24] designed temporal frame attention to enhance the modeling capability of temporal dependencies. These attention mechanisms consider each dimension independently, with all other dimensions globally averaged, although the spatial, temporal, and channel dimensions contain complementary and correlated information for action recognition. Other works [18,22,25,27,39,40,41,42], inspired by CBAM [29], fused single-dimensional attention modules sequentially; e.g., work [18] fused spatial, temporal, and channel attention modules into an STC-attention module in a sequential manner. Both spatial and temporal single-dimensional attention methods ignore the different contributions of different spatio-temporal patterns along different channels. The channel attention methods based on SENet [26] squeeze global spatio-temporal information into a single unit without considering spatial or temporal joint correlations. The methods based on CBAM [29] simply fuse channel, spatial, and temporal attention in a sequential manner without the cross-dimensional interaction that is essential to generate channel-wise spatio-temporal selective attention maps.
Some works in the computer vision field have adopted cross-dimensional schemes. Coordinate attention [34] embeds positional as well as spatial information into channel attention along the horizontal and vertical directions, which is critical for detecting object structures. Works [31][32][33][43] introduced temporal information into channel attention for video-based action recognition tasks with spatio-temporal data. In more detail, TEA [32] proposed a motion excitation that embeds temporal dynamic motion patterns, which describe the temporal difference between two adjacent frames, into channel attention and then excites these motion-sensitive channels. ACTION-Net [33] inserts one more convolutional layer between two fully connected layers to obtain channel-wise features with temporal information.
Inspired by the considerations mentioned above, we propose our interactional channel excitation (ICE) module. Its distinction is that it is channel-wise and introduces both global body structure patterns and temporal inter-frame dynamics information into channel attention through cross-dimensional interaction. ICE is applied to the task of skeleton-based action recognition and focuses on capturing the joint correlations of graphs, which differs from image- and video-based tasks.

Strategies of graph topology for GCN
Graph topology construction plays a key role in determining the representation ability of joint relationships in GCNs. ST-GCN [16] proposed the predefined adjacency matrix A_p, which is based on the physical body structure and manually builds three topologies using three partitioning strategies. Shi et al. [17] proposed a data-driven model called AGCN, which introduced a learnable adjacency matrix A_l that adaptively learns the topology of the graph, together with a Gaussian similarity matrix A_s that measures the similarity of pairs of vertexes in an embedding space by dot product. Based on AGCN, Chen et al. [20] considered that different channels reflect different types of features, so using a single shared topology for all channels is not desirable. They therefore proposed a channel-specific A_s (denoted as CA_s) for each channel, calculating pairwise vertex distances in an embedding space using pairwise subtraction.
Although A_l, A_s, and CA_s are more adaptive in capturing global graph information, they face a search space that is too large and learn too many "noisy" edges [36]. In this work, to address the dilemma between adaptability and an overly large search space, we systematically investigate graph strategies and argue that a complementary topology is necessary.

Interactional channel dimension excitation (ICE)
To address the problem that joint correlation modeling ignores the interaction between the spatial-temporal dimensions and the channel dimension, inspired by previous excitation works [29,31,32,33,34], an interactional channel excitation (ICE) module is proposed to capture channel-wise patterns and embed spatio-temporal information into channel attention through cross-dimensional interactions. A schematic diagram of ICE processing a skeleton sequence of the action "kicking something" is shown in Fig. 1. The ICE module consists of two sub-modules, channel-wise spatial excitation (CSE) and channel-wise temporal excitation (CTE), which are described in detail in Sects. 3.1.1 and 3.1.2, respectively.
In addition, to more clearly illustrate the innovations of ICE, the four schematic diagrams in Fig. 2 compare our proposed CSE and CTE with the classical inter-channel attention mechanism SENet [26] and the sequential multi-dimensional attention CBAM [29].

Channel-wise spatial excitation (CSE)
CSE is applied to capture the crucial global body structure patterns to excite the spatial-sensitive channels, adaptively recalibrating the importance of joints along different channels. The architecture of the CSE module is illustrated in Fig. 2c. Given an input feature X ∈ R^{C×T×V}, average pooling is first applied to summarize the temporal information, so that CSE focuses on the interaction between the channel and spatial dimensions; it also helps reduce the computational cost:
$$X_{tpool} = \frac{1}{T}\sum_{t=1}^{T} X_{[:,t,:]}, \quad X_{tpool} \in \mathbb{R}^{C \times 1 \times V},$$

where $X_{tpool}$ denotes the feature after temporal pooling and T is the number of frames.
Second, a 1D convolution layer conv_spa with kernel size K set to V is applied so that CSE has a global receptive field covering all joints in a frame, facilitating the extraction of global structural features, which is ignored in previous spatial attention works since they consider joints independently. conv_spa also reduces the number of channels to alleviate computational and model complexity:
$$X_{spa} = \mathrm{conv}_{spa} * X_{tpool}, \quad X_{spa} \in \mathbb{R}^{C/r \times 1 \times V},$$

where $X_{spa}$ denotes the channel-reduced global structure feature in the spatial dimension, r is the scale ratio (set to 16 in this work), and * indicates the convolution operation. Third, after feeding $X_{spa}$ through ReLU for nonlinearity, another 1D convolution layer conv_exp with kernel size 1 is applied to expand the channel dimension back to the original one. The resulting tensor is then reshaped as [C, 1, V] and fed into a Sigmoid activation to obtain the spatial-attentive mask $M_{cse}$:

$$M_{cse} = \sigma\left(\mathrm{conv}_{exp} * \delta(X_{spa})\right),$$

where $\sigma$ denotes the Sigmoid function and $\delta$ the ReLU function.
Finally, the spatial-sensitive channels and crucial joints are excited by multiplication between the input feature X and $M_{cse}$ along the channel dimension, and a residual connection is applied to preserve the original information and ensure network stability:

$$F_{cse} = X \odot M_{cse} + X.$$
Therefore, by interacting the spatial dimension with the channel dimension, CSE excites the beneficial spatial-sensitive channels and adaptively recalibrates the importance of joints simultaneously. Finally, we obtain the excited output feature $F_{cse}$, which is input to the following channel-wise temporal excitation (CTE) sub-module. (In Fig. 2, C, T, and V denote the numbers of channels, frames, and joints, respectively; FC denotes a fully connected layer; r is the reduction ratio; K is the kernel size; w is the size of the sliding temporal window; ⊙ indicates element-wise multiplication and ⊕ element-wise summation.)
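The CSE computation described above can be sketched in PyTorch as follows. This is a minimal sketch based on the description only, not the authors' released code: the layer names `conv_spa`/`conv_exp`, the "same" padding that keeps the joint axis at length V, and the exact residual form are assumptions.

```python
# Hypothetical sketch of channel-wise spatial excitation (CSE):
# temporal average pooling -> joint-wide 1D conv (channel reduction) ->
# ReLU -> 1x1 conv (channel expansion) -> Sigmoid mask -> excitation + residual.
import torch
import torch.nn as nn

class CSE(nn.Module):
    def __init__(self, channels, num_joints, r=16):
        super().__init__()
        mid = max(channels // r, 1)
        # kernel size V with "same" padding -> global receptive field over joints
        self.conv_spa = nn.Conv1d(channels, mid, kernel_size=num_joints,
                                  padding=num_joints // 2)
        self.conv_exp = nn.Conv1d(mid, channels, kernel_size=1)

    def forward(self, x):                    # x: (N, C, T, V)
        s = x.mean(dim=2)                    # temporal average pooling -> (N, C, V)
        s = torch.relu(self.conv_spa(s))     # channel-reduced global structure feature
        m = torch.sigmoid(self.conv_exp(s))  # spatial-attentive mask M_cse, (N, C, V)
        m = m.unsqueeze(2)                   # broadcast over frames: (N, C, 1, V)
        return x * m + x                     # excitation plus residual connection

x = torch.randn(2, 64, 32, 25)               # NTU-style input: 25 joints
assert CSE(64, 25)(x).shape == x.shape       # output keeps the input shape
```

Note that `num_joints` must be odd for this padding choice to preserve V exactly (true for the 25-joint NTU skeleton and, with an adjusted padding, the 18-joint Kinetics skeleton).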

Channel-wise temporal excitation (CTE)
Like CSE, CTE aims to utilize temporal dynamics information to discriminate and excite the vital temporal-sensitive channels and frames. The architecture of the CTE module is illustrated in Fig. 2d.
Given an input feature X ∈ R^{C×T×V}, average pooling is applied to summarize the spatial information:

$$X_{spool} = \frac{1}{V}\sum_{v=1}^{V} X_{[:,:,v]}, \quad X_{spool} \in \mathbb{R}^{C \times T \times 1},$$

where $X_{spool}$ denotes the feature after spatial pooling and V is the number of joints. Different from CSE, a 1D convolution layer conv_tmp with kernel size K set to w is applied to interact with the temporal dimension through a sliding temporal window, capturing the inter-frame temporal relationship of w frames. We set w as a hyperparameter (3, 5, etc.) according to the number of frames of different datasets to obtain the most appropriate temporal receptive field:
$$X_{tmp} = \mathrm{conv}_{tmp} * X_{spool}, \quad X_{tmp} \in \mathbb{R}^{C/r \times T \times 1},$$

where $X_{tmp}$ denotes the channel-reduced contextual feature among w frames, r is the scale ratio (set to 16 in this work), and * indicates the convolution operation.
Like CSE, after channel expansion and a Sigmoid activation produce the temporal-attentive mask $M_{cte}$, CTE adaptively recalibrates the importance of frames and excites the temporal-sensitive channels simultaneously through the interaction between the temporal and channel dimensions. Finally, the excited output feature is obtained as

$$F_{cte} = X \odot M_{cte} + X.$$

Complementary topology scheme
By rethinking the various adjacency matrix schemes of previous works, primarily ST-GCN [16] and its two variants 2s-AGCN [17] and CTR-GCN [20], we summarize the adjacency matrices into three types: A_p (physical), A_l (learnable), and A_s (similarity).
A_p denotes the predefined matrix reflecting the physical structure of the human body, which is fixed during training. ST-GCN [16] applies a predefined A_p with spatial configuration partitioning, dividing the neighbor set into three subsets according to their distances to the skeleton gravity center.
A_l denotes the learnable matrix covering the global graph, indicating whether connections exist between each pair of joints and how strong they are. ST-GCN utilizes an attention matrix M_k to learn edge importance weighting and multiplies it element-wise with A_p. 2s-AGCN builds an adjacency matrix with the same shape as A_p and parameterizes it without any constraints, so that it can be optimized during training.
A_s denotes the Gaussian similarity matrix between two vertexes; unlike A_l, A_s is data-dependent. 2s-AGCN measures the similarity of pairs of vertexes in an embedding space by dot product. CTR-GCN uses pairwise subtraction to calculate the distances along the channel dimension. Most importantly, CTR-GCN makes A_s channel-wise and learns a channel-specific A_s for each channel, leading to stronger representation capability than a channel-shared A_s. We mark the channel-specific A_s as CA_s.
In this study, we found that although A_l, A_s, and CA_s are more adaptive than A_p in capturing global graph information for different input samples, it is not appropriate to neglect A_p, especially on large-scale datasets. While A_l and A_s can automatically capture global graph information, they face a search space that is too large to find the most appropriate topology, and the optimization process is confused when a topology has too many "noisy" edges [36]. To address this issue, we take all of them into account and combine the three sub-matrices by simple summation as A_p + A_l + CA_s. This simple but efficient scheme achieves better performance.
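The complementary summation above can be sketched in PyTorch as follows. This is a simplified, hypothetical rendering: the identity matrix stands in for A_p, and the channel-specific similarity uses a single embedding with pairwise subtraction, which only approximates CTR-GCN's scheme.

```python
# Sketch of the complementary topology A = A_p + A_l + CA_s.
# A_p: fixed (V, V); A_l: learnable (V, V); CA_s: data-dependent (N, C, V, V).
import torch
import torch.nn as nn

class ComplementaryTopology(nn.Module):
    def __init__(self, channels, num_joints, A_p):
        super().__init__()
        self.register_buffer("A_p", A_p)      # fixed physical adjacency, (V, V)
        self.A_l = nn.Parameter(torch.zeros(num_joints, num_joints))  # learnable
        self.emb = nn.Conv2d(channels, channels, kernel_size=1)       # embedding

    def forward(self, x):                     # x: (N, C, T, V)
        f = self.emb(x).mean(dim=2)           # temporal-pooled embedding, (N, C, V)
        # channel-specific similarity via pairwise subtraction: (N, C, V, V)
        CA_s = torch.tanh(f.unsqueeze(-1) - f.unsqueeze(-2))
        # broadcast the shared V x V matrices over batch and channels
        return self.A_p + self.A_l + CA_s     # refined A, (N, C, V, V)

topo = ComplementaryTopology(16, 25, torch.eye(25))
A = topo(torch.randn(2, 16, 8, 25))
assert A.shape == (2, 16, 25, 25)
```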

ICE-GC block and ICE-GCN
An efficient interactional channel excited graph convolution (ICE-GC) block, equipped with the ICE module and the complementary topology scheme elaborated above, is proposed. The structure of ICE-GC is depicted in Fig. 3a. The graph convolution in ICE-GC is formulated as

$$F_{gc} = \sum_{k=1}^{K_v} W_k F_{in} \left(A_p^k + A_l^k + CA_s^k\right),$$

where the input feature map $F_{in}$ is a 3D tensor of size C × T × V, a 1 × 1 2D convolutional layer with weight $W_k$ of size C' × C × 1 × 1 transforms the input features into high-level representations, and $K_v$ denotes the three subsets according to the three partition strategies proposed by ST-GCN [16].
A_p and A_l are both V × V adjacency matrices, shared across channels. CA_s is a C × V × V adjacency matrix containing C channel-specific V × V adjacency matrices, and the final refined A, of size C × V × V, is obtained by element-wise summation. After matrix multiplication with the high-level representations, the graph convolution is accomplished; three graph convolution blocks are utilized in parallel to extract the latent representations. The excited output feature map F_out of size C × T × V is then obtained by CTE and CSE, a significant complement after the GCN, since the adjacency matrices only define the existence of connections between joints and cannot adaptively reflect the importance between joints along the channel dimension. Based on the ICE-GC block, an interactional channel excitation-enhanced graph convolutional network (ICE-GCN) is constructed. The basic block of ICE-GCN is shown in Fig. 3b. A multi-scale temporal convolutional module (MS-TCN) is applied following the design of [20,35] for multiple receptive fields and temporal pooling, which is different from CTE. For the residual connection, a 1 × 1 convolution is used when C is not equal to C'. Therefore, our proposed ICE-GCN has powerful characterization capabilities in the spatial, temporal, and channel dimensions.
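The core graph convolution step with a channel-wise refined adjacency can be sketched as below. It is a hypothetical single-subset example (K_v = 1); shapes follow the description, but the function is illustrative, not the authors' implementation.

```python
# Sketch of one ICE-GC graph convolution step: a 1x1 feature transform
# followed by matrix multiplication with a channel-wise adjacency.
import torch
import torch.nn as nn

def graph_conv(x, W, A):
    """x: (N, C, T, V); W: 1x1 Conv2d C->C'; A: (N, C', V, V) refined adjacency."""
    h = W(x)                                        # high-level representations
    # per-channel graph convolution: contract the joint axis against A
    return torch.einsum("nctv,ncvw->nctw", h, A)

W = nn.Conv2d(3, 16, kernel_size=1)                 # 1x1 transform, 3 -> 16 channels
x = torch.randn(2, 3, 64, 25)
A = torch.randn(2, 16, 25, 25)                      # stand-in for A_p + A_l + CA_s
assert graph_conv(x, W, A).shape == (2, 16, 64, 25)
```

In the full block, this would be summed over the three partition subsets K_v and followed by the CTE and CSE excitations.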
As shown in Fig. 3c, the architecture of our ICE-GCN is similar to most improved ST-GCN frameworks. First, a batch normalization (BN) layer is added to normalize the input data. Then, ten basic ICE-GCN blocks constitute the main body of the network; from block 1 to block 10, the number of channels increases progressively from the input dimension of 3, and the number of frames T is halved after block 5 and block 8. After the ten basic ICE-GCN blocks, a global average pooling (GAP) layer pools the output feature maps. Finally, a fully connected (FC) layer receives the pooled output and generates prediction scores for the action classes.
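The overall pipeline (BN, ten blocks, GAP, FC) can be sketched as follows. The block internals and channel widths here are placeholders (plain strided convolutions standing in for the ICE-GC blocks), since the paper's exact channel configuration is not reproduced here; only the BN/GAP/FC scaffolding and the two temporal halvings follow the description.

```python
# Skeleton of the ICE-GCN pipeline: BN -> 10 blocks -> GAP -> FC.
# Block internals and channel widths are illustrative placeholders.
import torch
import torch.nn as nn

class ICEGCNSkeleton(nn.Module):
    def __init__(self, num_class, num_joints=25, in_ch=3):
        super().__init__()
        self.bn = nn.BatchNorm1d(in_ch * num_joints)   # normalize input data
        widths = [64] * 4 + [128] * 3 + [256] * 3      # illustrative widths only
        blocks, c = [], in_ch
        for i, w in enumerate(widths):
            stride = 2 if i in (4, 7) else 1           # halve T at blocks 5 and 8
            blocks.append(nn.Conv2d(c, w, kernel_size=(3, 1),
                                    stride=(stride, 1), padding=(1, 0)))
            c = w
        self.blocks = nn.Sequential(*blocks)
        self.fc = nn.Linear(c, num_class)              # class scores

    def forward(self, x):                              # x: (N, C, T, V)
        n, ch, t, v = x.shape
        x = x.permute(0, 1, 3, 2).reshape(n, ch * v, t)
        x = self.bn(x).reshape(n, ch, v, t).permute(0, 1, 3, 2)
        x = self.blocks(x)
        x = x.mean(dim=(2, 3))                         # global average pooling
        return self.fc(x)

x = torch.randn(2, 3, 64, 25)
assert ICEGCNSkeleton(60)(x).shape == (2, 60)          # 60 NTU action classes
```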

Datasets
Kinetics-Skeleton The Kinetics-Skeleton [44] dataset includes approximately 300,000 video clips in 400 human action classes, collected from YouTube. It only offers raw video clips and does not provide skeleton data; thanks to ST-GCN [16] and the OpenPose [6] toolbox, the locations of 18 joints are estimated on every frame of the clips. There are 240,000 clips for training and 20,000 clips for testing. Following the conventional evaluation method of ST-GCN, we report the top-1 and top-5 accuracies to evaluate our model.
NTU RGB+D 60 NTU RGB+D 60 [45] is a large and widely used human action recognition dataset with 56,880 3D skeleton action sequences in 60 classes, performed by 40 volunteers and collected by three Kinect v2 [5] cameras with different views. Each frame contains one or two actors, and each skeleton has 25 joints. There are two recommended benchmarks: cross-subject (X-sub), with 20 subjects for training and 20 subjects for testing, and cross-view (X-view), with camera views 2 and 3 for training and camera view 1 for testing.
NTU RGB+D 120 With 57,367 additional samples and 60 additional action classes on top of NTU RGB+D 60, NTU RGB+D 120 [46] is currently the largest available dataset of 3D skeleton action sequences for human action recognition. It contains 114,480 action samples and 120 action classes in total, recorded from 106 volunteers using three different camera views. Cross-subject (X-sub) and cross-setup (X-set) are the two recommended benchmarks. X-sub: the 106 subjects are split into two groups of 53 for training and 53 for testing. X-set: the samples are divided into training and testing groups half and half based on the camera setup IDs.

Implementation details
All our experiments are conducted on the PyTorch deep learning framework, trained on one RTX 3090 GPU. The optimization strategy is SGD with a momentum of 0.9. For Kinetics-Skeleton, we follow the same data processing method as [17], which has 150 frames with two bodies in each frame. We set the batch size to 114 and the temporal receptive field of CTE to 3 frames. The training phase is completed at the 65th epoch. For NTU RGB+D 60 and NTU RGB+D 120, we follow the same data processing method as [20], which resized each sample to 64 frames. We set the batch size to 64 and the temporal receptive field of CTE to 1 frame. The training phase is completed at the 80th epoch.
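The optimizer settings reported above (SGD with momentum 0.9) can be expressed in PyTorch as in the minimal sketch below; the stand-in model, learning rate, and loss are placeholders, since the paper does not state them here.

```python
# Hypothetical training-step sketch matching the reported optimizer settings
# (SGD, momentum 0.9). The model, lr, and data are placeholders.
import torch

model = torch.nn.Linear(75, 60)                 # stand-in for ICE-GCN (25 joints x 3 coords)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

x = torch.randn(64, 75)                         # batch size 64 as used for NTU RGB+D
y = torch.randint(0, 60, (64,))
loss = criterion(model(x), y)                   # one optimization step
optimizer.zero_grad()
loss.backward()
optimizer.step()
assert loss.item() >= 0.0                       # cross-entropy is non-negative
```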

Effectiveness of three excitation modules
For a fair comparison, we choose the widespread framework AGCN [17] on the cross-view benchmark of NTU RGB+D 60 using joint coordinates as the only input stream. We separately test the contributions of the two sub-modules CTE and CSE and find that they improve the accuracy by 0.4% and 0.6%, respectively. The full ICE module, connecting the two sub-modules, improves AGCN by 1.1%, more than either of them alone, as shown in Table 1. This illustrates that both spatial and temporal channel-wise features are necessary for distinguishing different action categories and that they are complementary to each other.

Comparison with other excitation modules
To validate the superiority of our ICE, we compare its performance with other channel dimension excitations on the cross-view benchmark of NTU RGB+D 60 using the joint stream, as shown in Table 2. SENet [26], a classic and popular channel attention mechanism proposed for embedding into 2D CNNs for image classification, is adapted and applied to our skeleton-based task. STC-attention [18] applies channel attention to skeleton-based action recognition, concatenating spatial attention and temporal attention in a sequential manner without cross-dimensional interactions. Note that the enhancements brought by SENet and STC-attention to AGCN (+0.6% and +0.7%, respectively) are both smaller than that of our ICE (+1.1%). This validates the rationality and superiority of ICE, which introduces both spatial and temporal information into the channel dimension to capture cross-dimensional interactions.

Transferring to other backbones
We verify the generality, adaptability, and complexity of our proposed ICE module on both the cross-view and cross-subject benchmarks of NTU RGB+D 60 using the joint stream. We choose the well-known and widespread backbone AGCN [17], the lightweight model Shift-GCN [19], and the recently proposed CTR-GCN [20]. Our ICE module is simply equipped onto those models in a plug-and-play way. As shown in Table 3, the backbones equipped with ICE notably outperform their original versions, while the computational cost (measured by floating point operations (FLOPs) and the number of parameters) barely changes (an increase of only about 0.03 GFLOPs and 0.42 M parameters).

Effectiveness of adjacency matrix schemes
We then verify the necessity of the three adjacency matrices by removing A_p, A_l, and CA_s from ICE-GCN on the cross-view benchmark of NTU RGB+D 60 using the joint stream. As described in Sect. 3.4, A_p denotes the physical adjacency matrix, A_l the learnable adjacency matrix, and CA_s the channel-wise similarity adjacency matrix. As shown in Table 4, the performance of our ICE-GCN reaches 95.3%; when removing A_p, A_l, and CA_s, the performance drops by 0.3%, 0.2%, and 0.5%, respectively. This verifies that all three adjacency matrices are effective and complementary to each other and confirms the rationality of our refined complementary topology scheme A_p + A_l + CA_s.

Comparison with the state-of-the-arts
Finally, we compare our ICE-GCN model with the state-of-the-art methods for skeleton-based action recognition on three large-scale datasets, Kinetics-Skeleton [44], NTU RGB+D 120 [46], and NTU RGB+D 60 [45], in Tables 5, 6, and 7, respectively. 2s-AGCN [17] proposed the bone stream (the lengths and directions of bones) of the skeleton data, which shows notable improvement in recognition accuracy, and most state-of-the-art methods adopt multi-stream fusion strategies. For a fair and comprehensive comparison, we use both single- and multi-stream fusion strategies. J_s denotes only the "joint stream," using the original skeleton coordinates as input; B_s denotes only the "bone stream," using the differential spatial coordinates as input; 2s denotes using both the joint and bone streams. On Kinetics-Skeleton, ICE-GCN notably outperforms the existing methods by about 2% on top-1 and top-5 for all fusion strategies. On NTU RGB+D 60 and NTU RGB+D 120, ICE-GCN also outperforms the existing methods in most cases on both benchmarks. The state-of-the-art and competitive results on all three large-scale datasets verify the superiority of our ICE-GCN model and demonstrate its stronger modeling capability on the larger Kinetics-Skeleton dataset.
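The bone stream mentioned above is derived from the joint stream by differencing each joint against its parent along the kinematic tree. A minimal sketch (with a hypothetical toy parent list, not the NTU skeleton definition) could look like:

```python
# Sketch of deriving the "bone stream" from joint coordinates:
# each bone is the coordinate difference between a joint and its parent.
import torch

def joints_to_bones(joints, parent):
    """joints: (N, C, T, V); parent[v] is the parent index of joint v."""
    return joints - joints[..., parent]

parent = [0, 0, 1, 2]                 # toy 4-joint chain (hypothetical)
j = torch.randn(2, 3, 10, 4)          # (batch, xyz, frames, joints)
b = joints_to_bones(j, parent)
assert b.shape == j.shape
# the root joint is its own parent, so its "bone" is the zero vector
assert torch.allclose(b[..., 0], torch.zeros_like(b[..., 0]))
```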

Visualization
To illustrate how ICE affects the final performance and highlight the difference made by cross-dimensional interactions, the attention maps are visualized. One real evaluation sample of "kicking something" is randomly selected from NTU RGB+D 60. As shown in Fig. 4, CSE and SENet are compared on the channel and spatial dimensions, and CTE and STC-attention are compared on the channel and temporal dimensions. As shown in Fig. 4a, CSE can not only reweight crucial joints along each channel but also excite the spatial-sensitive channels (such as channels 1, 10, and 11). CSE focuses on the joints of the legs, such as joint 15 "left knee" and joint 16 "left foot," which are relevant to the action "kicking something." The importance of joint 15 "left knee" is consistently strong in channels 0, 1, 6, etc., indicating that the spatial information of these related joints is generally important for the current action in the excited channels. By contrast, SENet reweights each joint uniformly for each channel, without channel-wise differences or interaction with the spatial dimension, as shown in Fig. 4b.
As shown in Fig. 4c, CTE can not only reweight vital frames but also excite the temporal-sensitive channels (such as channels 3, 5, and 16). CTE focuses on the frames (such as frames 21 to 30) that are most informative for the action "kicking something." It is worth noting that the importance of these frames is consistently strong, indicating that the temporal relationship of these frames is generally important for the current action in the excited channels. As shown in Fig. 4d, STC-attention only reweights frames uniformly for each channel, without channel-wise discriminative consideration.

Conclusion
In this paper, we propose an interactional channel excited graph convolutional network with complementary topology for skeleton-based action recognition. The interactional channel excitation (ICE) module consists of the CSE and CTE sub-modules. CSE captures the crucial global body structure patterns along different channels, adaptively recalibrating the importance of joints and exciting the spatial-sensitive channels. CTE captures vital temporal inter-frame dynamics information along different channels, adaptively recalibrating the importance of frames and exciting the temporal-sensitive channels. In addition, to resolve the dilemma between adaptability and an overly large search space, and to avoid too many "noisy" graph edges, a complementary topology scheme is refined as A_p + A_l + CA_s. By coupling the ICE module and the topology strategy, we propose an interactional channel excitation-enhanced graph convolutional network with complementary topology (ICE-GCN), a powerful network that extracts optimal features covering the spatial, temporal, and channel dimensions. Extensive experiments conducted on three large datasets, NTU RGB+D 60, NTU RGB+D 120, and Kinetics-Skeleton, demonstrate that ICE-GCN outperforms state-of-the-art methods and prove the effectiveness of each sub-module. In the future, the efficiency of ICE-GCN still needs further consideration.