MulStepNET: stronger multi-step graph convolutional networks via multi-power adjacency matrix combination

Graph convolutional networks (GCNs) have become the de facto approach and achieved state-of-the-art results on many real-world problems involving graph-structured data. However, these networks are usually shallow because GCNs with many layers suffer from over-smoothing, which limits the expressive power of the learned graph representations. Current methods for overcoming this limitation suffer from high complexity and large parameter counts. Although Simple Graph Convolution (SGC) reduces complexity and parameters, it fails to distinguish the feature information of neighboring nodes at different distances. To tackle these limits, we propose MulStepNET, a stronger multi-step graph convolutional network architecture that captures more global information by simultaneously combining multi-step neighborhood information. Compared to existing methods such as GCN and MixHop, MulStepNET aggregates neighborhood information at greater distances via a multi-power adjacency matrix while fitting fewer parameters and being computationally more efficient. Experiments on the citation networks Pubmed, Cora, and Citeseer demonstrate that the proposed MulStepNET model improves over SGC by 2.8, 3.3, and 2.1% respectively while keeping similar stability, and achieves better accuracy and stability than other baselines.


Introduction
Graph convolutional networks (GCNs) (Zhang et al. 2018b; Kipf and Welling 2017; Li et al. 2018; Yao et al. 2019) show natural advantages for many real-world problems that can be modeled as graphs (Bruna et al. 2014; Hamilton et al. 2017; Monti et al. 2017; Defferrard et al. 2016). Each convolution in these GCNs is based on a one-step neighborhood aggregation scheme, and GCNs stack multiple graph convolution layers to obtain multi-step neighborhood information and learn graph representations. Because GCNs with more layers suffer from over-smoothing, they cannot easily exploit the hierarchical depth of convolutional neural networks (CNNs) such as AlexNet (Krizhevsky et al. 2012) and MACNN (Lai et al. 2020). Therefore, GCNs are usually shallow and generally do not exceed four layers (Zhou et al. 2018); for instance, GCN (Kipf and Welling 2017) has only two layers. This shallowness limits the expressive power of learning graph representations and label propagation (Sun et al. 2020). To address these limitations, researchers have proposed approaches along two main directions.
1. To improve the expressive power, Li et al. (2019b) propose DeepGCNs, very deep networks that borrow deep-CNN concepts such as residual connections (He et al. 2016) and dense connections (Huang et al. 2017). Nevertheless, deep networks with many parameters are extremely hard to train on large graphs. 2. High-order graph convolution models (Abu-El-Haija et al. 2019a, b; Luan et al. 2019) such as MixHop (Abu-El-Haija et al. 2019b) directly capture the interactions of neighboring nodes at different distances to improve performance. As the order increases, however, the parameters of these models grow and the models become more complex, making them hard to train and prone to overfitting.
In order to reduce excess complexity and parameters, a recent surge of interest has focused on Simple Graph Convolution (SGC) (Wu et al. 2019). Wu et al. (2019) have shown that by repeatedly removing the nonlinearities between graph convolution layers of GCN and collapsing normalized adjacency matrices between consecutive layers, the complexity of GCN is reduced. In addition, they significantly reduce parameters through reparameterizing weight matrices into a single weight matrix. Nevertheless, SGC has difficulty in distinguishing the feature information of neighboring nodes at various distances, which limits the expressive power.
To address these limits, we propose a novel architecture, the stronger multi-step graph convolutional network (MulStepNET). Figure 1 shows the background and motivation for MulStepNET. As illustrated in Fig. 2, MulStepNET leverages multi-step neighborhood information by constructing a multi-power adjacency matrix with a simple grouping and attention mechanism, and applies the attention mechanism to flexibly adjust the contributions of neighboring nodes at various distances. Based on the multi-power adjacency matrix, we design a stronger multi-step graph convolution that aggregates more node features and learns the global graph structure. Further, we build a one-layer network model with the fewest parameters to avoid overfitting and reduce complexity. Meanwhile, the model with a large k-step (zero-step to k-step) graph convolution widens the receptive field and improves learning ability. Our contributions are as follows: • To enlarge the receptive field, we construct a multi-power adjacency matrix with a simple grouping and attention mechanism by combining the adjacency matrices of different powers. Based on the multi-power adjacency matrix, we develop a new multi-step graph convolution that can flexibly adjust the weights of neighboring nodes at different distances, which may improve learning ability. • To the best of our knowledge, this is the first work to propose a one-layer architecture with a larger-step graph convolution. In terms of complexity and parameters, our architecture performs as well as SGC and outperforms other methods while capturing more nodes and global information. • We conduct extensive experiments on node classification tasks. Experimental results show that the proposed method compares favorably against state-of-the-art approaches in terms of classification performance and stability.

Preliminaries and related works
Graph convolutional networks have been successfully applied to non-Euclidean and Euclidean applications and are evolving quickly (Wang and Ye 2018; Yao et al. 2019; Kampffmeyer et al. 2019; Chen et al. 2018; Guo et al. 2019; Yu and Qin 2020). We mainly review the work most relevant to our approach. An undirected graph G with n vertices and e edges is denoted as G = (V, E, A), where V and E are the sets of vertices and edges respectively. The edge relationships of G are described by the adjacency matrix A ∈ ℝ^{n×n}. We introduce X ∈ ℝ^{n×c_0} to denote the node feature matrix with c_0 features per node. Similar to CNNs, the convolution of GCN (Kipf and Welling 2017) learns feature representations of nodes over multiple graph convolution layers. At layer j, with adjacency matrix A_j and node hidden representation H_j as input, the output node representation H_{j+1} can be written as

H_{j+1} = ReLU( D_j^{-1/2} Ã_j D_j^{-1/2} H_j W_j ),    (1)

where Ã_j = A_j + I_j denotes the adjacency matrix with self-loops, and I_j and D_j are the identity matrix and the degree matrix of Ã_j respectively. We take H_1 (j = 1) to be the original input feature matrix X and use gradient descent to train the weight matrix W_j. Stacking the layer twice, the two-layer GCN can be described as

Y = softmax( Â ReLU( Â X W_1 ) W_2 ),    (2)

where W_1 and W_2 are different weight matrices and softmax is a normalization classifier. Abu-El-Haija et al. (2019b) propose a high-order graph convolution (HGC) model that improves expressive power by mixing multi-hop neighborhood information. Normalizing the adjacency matrix Ã_j into Â_j, the model is

H_{j+1} = ReLU( Â_j^0 H_j W_j^0 | Â_j^1 H_j W_j^1 | ⋯ | Â_j^k H_j W_j^k ),    (3)

where Â_j^k is the k-th power of Â_j and | denotes column concatenation. Lei et al. (2020) develop the HGC model further and reduce the parameters of Abu-El-Haija et al. (2019b) via a weight sharing mechanism.
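As a concrete illustration of Eqs. (1) and (2), the following is a minimal dense NumPy sketch of the two-layer GCN forward pass. The function names are ours, and real implementations use sparse matrices for efficiency; this is only meant to make the propagation rule explicit.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    # Scale rows then columns by d^{-1/2} -- equivalent to the two diagonal products.
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_two_layer(A, X, W1, W2):
    """Two-layer GCN forward pass: softmax(A_hat ReLU(A_hat X W1) W2)  (Eq. 2)."""
    A_hat = normalize_adjacency(A)
    H = np.maximum(A_hat @ X @ W1, 0.0)            # first layer with ReLU (Eq. 1)
    logits = A_hat @ H @ W2                        # second layer
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)        # row-wise softmax
```

Each row of the output is a probability distribution over classes for one node.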
We write a K-layer GCN in the general form

H_{j+1} = ReLU( Â H_j W_j ),  j = 1, …, K.

Following the hypothesis of Wu et al. (2019), for the K-layer GCN we remove all ReLU functions and reparameterize the weight matrices (W_1, W_2, …, W_K) into a single weight matrix W via W = W_1 W_2 ⋯ W_K. The K-layer GCN then becomes

Y = softmax( Â^K X W ),    (4)

where Â^K denotes the K-th power of Â. This model is called Simple Graph Convolution (SGC) (Wu et al. 2019). Although the model has fewer computations and parameters, it cannot distinguish the feature information of neighboring nodes at different distances, which restricts its ability to learn graph representations. There are many other simple or linear models (Thekumparampil et al. 2018; Cai and Wang 2018; Eliav and Edith 2018); applied to the tasks of Li et al. (2019a) and Al-Sharif et al. (2020), these models become more powerful.
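The collapse in Eq. (4) means the propagated features Â^K X can be precomputed once, after which training reduces to fitting a single linear-softmax layer. A sketch under the same dense-matrix assumption (function names are ours):

```python
import numpy as np

def sgc_precompute(A_hat, X, K):
    """Return A_hat^K X by K repeated propagations.

    Multiplying (A_hat @ H) K times avoids ever forming the dense matrix
    power A_hat^K; with a sparse A_hat this costs O(K * e * c0).
    """
    H = X
    for _ in range(K):
        H = A_hat @ H
    return H

def sgc_predict(H_K, W):
    """softmax(A_hat^K X W): a single linear map plus row-wise softmax (Eq. 4)."""
    logits = H_K @ W
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Because `sgc_precompute` has no trainable weights, it runs once as preprocessing; only `W` is learned.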
Recently, many graph attention models (Veličković et al. 2018; Thekumparampil et al. 2018; Zhang et al. 2018a) assign suitable weights based on node features and achieve better performance on graph learning tasks. Nevertheless, these attention mechanisms raise concerns of high complexity and many extra parameters. See Zhou et al. (2018) and Wu et al. (2020) for comprehensive reviews.

The proposed method
We are committed to developing a method that can simultaneously capture information from neighboring nodes at different distances and the global graph structure, while fitting few parameters and being computationally efficient. In this section, we propose the novel one-layer MulStepNET architecture. We then introduce our multi-step graph convolution, which captures neighboring nodes at greater distances, and analyze the computational complexity and parameters.

The overall architecture
In CNNs, we increase the expressivity of feature extraction via many, deeper convolutional layers, which enlarges the receptive field. However, a GCN with multiple layers (more than two) can suffer from over-smoothing, which hurts classification performance on graph learning tasks. One effective remedy is to use high-order graph convolutions with different weight matrices to gather more neighborhood information at various distances (Abu-El-Haija et al. 2019b). As the order increases, however, the parameters grow significantly and the model performs redundant computation, which makes it hard to train and raises the risk of overfitting. For SGC (Wu et al. 2019), although the simplified mechanism speeds up training and avoids excess complexity, SGC fails to adjust and distinguish the contributions of neighboring nodes at different distances, which limits learning ability. Motivated by these analyses, we propose the MulStepNET architecture (Fig. 2). A key innovation of MulStepNET is a novel multi-step graph convolution layer: instead of aggregating only one-step neighboring nodes, our multi-step graph convolution combines neighborhood information at different distances and captures high-order interactions between nodes. We summarize the main differences between MulStepNET and the most relevant models as follows.
SGC applies the K-th power of the normalized adjacency matrix Â to reach neighboring nodes that are K hops away. In our MulStepNET, we instead design a multi-power adjacency matrix (the powers of Â from 0 to k, with k > K; see Sect. 3.2) to reach neighboring nodes up to k hops away. In addition, we use a new attention mechanism to distinguish the contributions that may matter for classification. MixHop (1) constructs a high-order graph convolution with different weight matrices to capture the feature information of neighboring nodes at different distances; (2) combines this information by column concatenation; and (3) has a two-layer structure. In our MulStepNET, (1) we propose a multi-step graph convolution based on a single weight matrix to capture the feature information of neighboring nodes at greater distances; (2) we combine this information through the multi-power adjacency matrix rather than column concatenation; (3) we construct a one-layer structure with fewer computations and parameters; and (4) we can adjust the contributions via the attention mechanism.

Multi-step graph convolution layer
Multi-step graph convolution has three stages: multi-power adjacency matrix construction, multi-step feature propagation, and multi-step linear transformation. We introduce each stage in detail below.
Multi-power adjacency matrix We use the adjacency matrix A to describe the edge weights between nodes that are one step apart in the graph; A cannot denote the edge weights of k-step (k > 1) neighboring nodes. Since information propagates along the edges of the graph, we introduce A^k (the k-th power of A) to describe the k-step edge weights, which capture the relationships of k-step neighboring nodes. To better capture more nodes and global graph information, we combine different powers of Â into a single multi-power adjacency matrix Â_ok:
Â_ok = α_0 Â^0 + α_13( Â^1 + Â^2 + Â^3 ) + α_24( Â^2 + Â^3 + Â^4 ) + ⋯ + α_{(k−2)k}( Â^{k−2} + Â^{k−1} + Â^k ),    (5)

where Â_ok denotes the multi-power adjacency matrix and Â^1 = Â is the normalized adjacency matrix, with

Â = D^{-1/2} Ã D^{-1/2},  Ã = A + I.

Here I and D denote the identity matrix and the degree matrix of Ã, with D_ii = ∑_j Ã_ij. We give a higher weight to each node's own features via Â^0 = I because a node's own features may be the most important. Furthermore, we utilize Â^2, Â^3, …, Â^k to obtain two-step and larger-step neighbor information. To exploit the similarity of adjacent powers of Â while distinguishing more distant powers, we divide the powers of Â into simple groups {Â^0}, {Â^1, Â^2, Â^3}, {Â^2, Â^3, Â^4}, …, {Â^{k−2}, Â^{k−1}, Â^k} and flexibly adjust the weights of these groups via a series of attention multipliers α_0, α_13, α_24, …, α_{(k−2)k} ∈ ℝ. By adjusting these weights, we adjust the contributions of neighboring nodes at different distances. Each group except {Â^0} leverages and shares the information of overlapping groups, which can be regarded as an attention mechanism with simple grouping. We combine all groups with their attention multipliers to gather rich information and learn the global graph topology.
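The combination in Eq. (5) can be assembled directly from the powers of Â. Below is a dense NumPy sketch; the function name and the `alphas` layout (α_0 first, then one multiplier per sliding three-power group) are our own conventions.

```python
import numpy as np

def multi_power_adjacency(A_hat, k, alphas):
    """Build A_ok = a0*A^0 + a13*(A^1+A^2+A^3) + ... + a(k-2)k*(A^(k-2)+A^(k-1)+A^k).

    alphas has length 1 + (k - 2): one multiplier for the {A^0} group and one
    per overlapping three-power group {A^i, A^(i+1), A^(i+2)}, i = 1..k-2.
    """
    assert k >= 3 and len(alphas) == k - 1
    n = A_hat.shape[0]
    powers = [np.eye(n)]                      # A^0 = I
    for _ in range(k):
        powers.append(A_hat @ powers[-1])     # A^1 .. A^k
    A_ok = alphas[0] * powers[0]
    for g, i in enumerate(range(1, k - 1), start=1):
        A_ok = A_ok + alphas[g] * (powers[i] + powers[i + 1] + powers[i + 2])
    return A_ok
```

Because every term is a scaled sum of powers of Â, the result is an element-wise combination whose nonzero pattern is determined by the graph, consistent with Theorem 1 below.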
Theorem 1 Multi-power adjacency matrix is an operator of preserving graph topology.

Proof Let Â^0 ∈ ℝ^{n×n} (n nodes). Then Â^0 = I ∈ ℝ^{n×n}, Â^1, Â^2, …, Â^k ∈ ℝ^{n×n}, and

Â_ok = α_0 Â^0 + α_13( Â^1 + Â^2 + Â^3 ) + ⋯ + α_{(k−2)k}( Â^{k−2} + Â^{k−1} + Â^k ) ∈ ℝ^{n×n}.    (6)

Equation (6) shows that the multi-power adjacency matrix is formed by element-wise operations. Clearly, the spatial locations of Â_ok coincide with those of Â, thus preserving the topology of the graph. □
Equation (6) and Theorem 1 show that our multi-power adjacency matrix can capture more nodes and global graph information while preserving the graph topology.
Multi-step feature propagation Given the node feature matrix X, the feature propagation of graph convolution in GCN is

H = Â X.    (7)

Equation (7) propagates node information to one-step neighboring nodes as well as to the node itself via Â. However, this feature propagation cannot reach two-step and larger-step neighboring nodes. To circumvent this limit, we design a novel multi-step feature propagation scheme:

H_ok = Â_ok X.    (8)
Multi-step linear transformation After the multi-step feature propagation, a multi-step linear transformation is applied to H_ok:

Y_ok = H_ok W,    (9)
where W denotes the weight matrix, which is shared among all nodes.
Based on Eqs. (8) and (9), our multi-step graph convolution takes the form

Y_ok = Â_ok X W.    (10)

Our algorithm is summarized in Algorithm 1. Existing convolutions generally aggregate neighborhood information at up to four-step distances, whereas our convolution with large k steps can simultaneously capture neighborhood information at greater distances. Specifically, compared with GCN's convolution, which only captures one-step neighboring node information, our multi-step graph convolution captures high-order interaction information between neighboring nodes. Compared with SGC's convolution, which cannot distinguish the feature information of neighboring nodes at various distances, our multi-step graph convolution can adjust the contributions of these features, obtain more node information, and better learn the global graph topology.

Output layer
Similar to GCN, we predict node labels using a softmax classifier. The output prediction Y_MulStepNET can be expressed as

Y_MulStepNET = softmax( Â_ok X W ).    (11)

We follow the loss function of Kipf and Welling (2017).
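Putting Eqs. (8), (9), and (11) together, the whole forward pass is one precomputed propagation followed by a linear-softmax classifier. A minimal sketch (the function name is ours; A_ok is assumed to be a precomputed multi-power adjacency matrix):

```python
import numpy as np

def mulstepnet_forward(A_ok, X, W):
    """Y = softmax(A_ok X W): the one-layer MulStepNET prediction (Eq. 11).

    A_ok is precomputed, so H_ok = A_ok @ X acts as a fixed feature
    extractor and W is the model's only trainable parameter.
    """
    H_ok = A_ok @ X                   # multi-step feature propagation (Eq. 8)
    logits = H_ok @ W                 # multi-step linear transformation (Eq. 9)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```

Training then amounts to fitting W with cross-entropy loss on the labeled nodes, as in Kipf and Welling (2017).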

Analysis of complexity and parameters
Since the calculation of H_ok requires no weights, we compute H_ok = Â_ok X in a feature preprocessing step. That is, we can regard H_ok as a fixed feature extractor, which makes the remaining calculation very efficient. As described in Sect. 3.2, if Â ∈ ℝ^{n×n}, X ∈ ℝ^{n×c_0}, and W ∈ ℝ^{c_0×c_1} (c_1 filters), then Â_ok ∈ ℝ^{n×n}, H_ok ∈ ℝ^{n×c_0}, and Y_ok = H_ok W ∈ ℝ^{n×c_1}. In our model, c_1 is the number of classes. The proposed MulStepNET architecture takes O(n × c_0 × c_1) computational time and O(c_0 × c_1) parameters, the same as SGC. MulStepNET requires fewer computations and parameters than GCN because it has no hidden layer. To the best of our knowledge, MulStepNET and SGC achieve the best performance in terms of computational time and parameters.
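To make the parameter comparison concrete, a back-of-the-envelope count with Cora-like sizes (n = 2708, c_0 = 1433, c_1 = 7 are illustrative; the 16-unit hidden layer is GCN's common default):

```python
# Parameter counts for a Cora-like dataset (sizes are illustrative).
n, c0, c1 = 2708, 1433, 7
hidden = 16

params_mulstepnet = c0 * c1                 # one weight matrix, no hidden layer
params_gcn = c0 * hidden + hidden * c1      # two weight matrices

print(params_mulstepnet)   # 10031
print(params_gcn)          # 23040
```

Under these sizes the one-layer model has less than half the parameters of a two-layer GCN, and the gap widens with larger hidden dimensions.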

Experiments
In this section, we perform experiments on citation network datasets to evaluate the performance of MulStepNET in terms of prediction accuracy and stability. We compare MulStepNET against recent state-of-the-art approaches, including graph networks and high-order graph convolutions, in terms of classification accuracy, complexity, and parameters on benchmark datasets. Further, we conduct experiments with different k to investigate the relationship between model width and prediction accuracy. In addition, we show that nodes' own features are important for prediction. Lastly, we investigate the influence of the attention multipliers and show the overall trends of the proposed method.

Datasets
To demonstrate the ability of the proposed MulStepNET to learn global graph topology, we evaluate on citation network datasets commonly used as benchmarks for semi-supervised node classification: Pubmed, Cora, and Citeseer (Kipf and Welling 2017). Following Kipf and Welling (2017) and Yang et al. (2016), we summarize the statistics of the datasets in Table 1.

Results analysis
Based on Eqs. (12) and (13), we obtain the average test accuracy acc̄ and standard deviation S of the proposed models:

acc̄ = (1/m) ∑_{i=1}^{m} acc_i,    (12)

S = √( (1/m) ∑_{i=1}^{m} ( acc_i − acc̄ )² ),    (13)

where acc_i is the test accuracy of the i-th run and m is the number of runs.
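Equations (12) and (13) are the ordinary mean and population standard deviation over m runs; a sketch with hypothetical per-run accuracies:

```python
import math

def mean_and_std(accs):
    """Average test accuracy (Eq. 12) and population std-dev (Eq. 13) over m runs."""
    m = len(accs)
    mean = sum(accs) / m
    std = math.sqrt(sum((a - mean) ** 2 for a in accs) / m)
    return mean, std

# Hypothetical accuracies (percent) from m = 4 runs.
mean, std = mean_and_std([83.6, 83.7, 83.7, 83.8])
```

A small standard deviation across runs is what we report as stability.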
We compare our MulStepNET against graph networks and high-order graph convolution methods on citation networks, and the results are summarized in Table 2. Based on the results, we observe that our MulStepNET achieves the best performance including classification accuracy and stability among the state-of-the-art approaches (except SGC) on all datasets. Our MulStepNET obtains the highest prediction accuracy of 81.1, 83.7, 73.4% and very low standard deviation of 0.0, 0.1, 0.0% on Pubmed, Cora, and Citeseer respectively. In terms of accuracy comparison on all datasets, our MulStepNET improves over GCN by 2.7, 2.7, and 4.4%, improves over GAT by 2.7, 0.8, and 1.2%, and improves over SGC by 2.8, 3.3, and 2.1% respectively. Compared with SGC, our MulStepNET achieves competitive performance in terms of stability while significantly improving accuracy. Due to the smaller standard deviation in the proposed model, our MulStepNET outperforms other baselines by a large margin in terms of stability. These results demonstrate the effectiveness of MulStepNET for capturing nodes information and global graph topology. Similar to He and Sun (2015), we use theoretical time complexity to describe the complexity, rather than the actual running time, since the actual running time is sensitive to hardware and implementations. Table 3 shows the results of the proposed MulStepNET and competing methods in terms of complexity and parameters (see Sect. 3.4 for analysis). In terms of complexity and parameters, our MulStepNET achieves as good performance as SGC, and consistently outperforms other methods. These results show the superiority for designing one-layer model.
To investigate the influence of model width and Â^0, we conduct experiments with different k for our MulStepNET-a and MulStepNET-b on the Pubmed, Cora, and Citeseer datasets. The results are summarized in Fig. 3, where we observe that MulStepNET-a can improve accuracy as the model width varies.

(Table 2: Comparison with graph networks and high-order graph convolution methods on citation networks, in percent.)
To investigate the influence of the attention multipliers, we remove all attention multipliers from MulStepNET while keeping the other settings. Table 5 lists the comparison between MulStepNET with and without attention multipliers. The results show that MulStepNET with attention performs better, demonstrating the benefit of the attention multipliers. Figure 4 shows the classification accuracy of MulStepNET, MulStepNET without Â^0, and MulStepNET without attention (α) on all three datasets. It is clear that MulStepNET outperforms MulStepNET without Â^0 and MulStepNET without attention in terms of average accuracy, and the overall trend of MulStepNET is better than that of both ablations. On Citeseer, the average accuracy of MulStepNET is 1.9% and 1.0% higher than that of MulStepNET without Â^0 and MulStepNET without attention, respectively. This further verifies the contribution of the nodes' own features and the attention multipliers to the performance improvement.

(Fig. 3 caption: Influence of model width (the highest power of Â, namely k) and own features (Â^0) on node classification accuracy. MulStepNET-a denotes MulStepNET without simple grouping, attention mechanism, and Â^0, i.e., Â_ok = Â^1 + Â^2 + ⋯ + Â^k. MulStepNET-b denotes MulStepNET without simple grouping and attention mechanism, i.e., Â_ok = Â^0 + Â^1 + Â^2 + ⋯ + Â^k. α is the set of attention multipliers. In MulStepNET without attention multipliers, the Pubmed dataset is slightly sensitive to initializations; we run the model 10 times and report the results of the top 8 runs, sorted by result.)

Conclusion
In this paper, we propose a stronger multi-step graph convolutional network architecture, MulStepNET, for graph-structured data. Notably, MulStepNET obtains more node feature information and enables adequate information propagation via the multi-power adjacency matrix. Further, by precomputing the fixed feature extractor H_ok, our computations are more efficient than GCN's. Experiments on several node classification benchmarks show natural advantages in capturing node features and whole-graph structural information. We observe that MulStepNET, with the fewest parameters, achieves better performance than the baselines. In the future, we plan to apply the proposed model to more application areas such as social networks.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.