AdjMix: simplifying and attending graph convolutional networks

Simple graph convolution (SGC) achieves classification accuracy competitive with graph convolutional networks (GCNs) in various tasks while being computationally more efficient and fitting fewer parameters. However, SGC is narrow: taking higher powers of the adjacency matrix causes over-smoothing, which limits its ability to learn graph representations. Here, we propose AdjMix, a simple and attentional graph convolutional model that scales to wider structures and captures more node feature information by simultaneously mixing adjacency matrices of different powers. We point out that the key factor of over-smoothing is the mismatched weights of these adjacency matrices, and we design AdjMix to address the over-smoothing of SGC and GCNs by adjusting the weights to matching values. Experiments on the citation networks Pubmed, Citeseer, and Cora show that AdjMix improves over SGC by 2.4%, 2.2%, and 3.2%, respectively, with the same number of parameters and the same complexity, and that it compares favorably with other baselines in terms of classification accuracy, parameters, and complexity.


Introduction
Motivated by the increasing proliferation of graph-structured data in real-world applications and the limited expressive power of convolutional neural networks (CNNs) for learning such data, graph convolutional networks (GCNs) [21] and subsequent variants have attracted great attention and become popular research topics. These models show promising performance in various fields, including citation networks [21,38], social networks [7], traffic networks [13,25], applied chemistry [29], and recommender systems [8,44].
Each layer in GCNs essentially acts as an operation of Laplacian smoothing [26]. After the operation is applied twice, the features of nodes in the same connected component become similar, which eases the classification task. As the number of applications increases, the node features become more similar. Eventually, the node features tend to a fixed value [26], which makes them indistinguishable. This is called the over-smoothing problem. A key factor of over-smoothing is the mismatched weights of the adjacency matrices in the various powers. Therefore, we can effectively tackle over-smoothing by adjusting the weights to more appropriate values.
In addition to over-smoothing, GCNs may inherit excess complexity and superfluous computation [40]. To reduce the complexity, a recent surge of research interest has concentrated on simple graph convolution (SGC) [40]. By removing the nonlinearities between GCN layers and collapsing the weight matrices between consecutive layers, SGC achieves similar or even superior classification accuracy to GCNs in various application areas while being computationally more efficient and fitting fewer parameters. Nevertheless, SGC does not adjust the weights of the adjacency matrices in the various powers to matching values, which raises concerns of over-smoothing when taking higher powers. Limited by over-smoothing, the width (the power of the adjacency matrix) of SGC is narrow. This narrowness naturally limits the scale of the receptive field. Because of the restricted receptive field, SGC has difficulty obtaining more global information and propagating adequate label information.
Here, we propose AdjMix, a novel model that learns to adjust the weights of the adjacency matrices in the different powers to matching values and captures more node and global information in an end-to-end fashion (Fig. 1). AdjMix can be viewed as a graph analogue of Inception [36], which allows a convolutional model to increase the receptive field by mixing different kernel sizes. The main challenge in combining the adjacency matrices of different powers into a single new adjacency matrix is that the weights of the adjacency matrices in the various powers are incompatible. To address this challenge, we require a model that learns how to adjust the weights to matching values.
Our contributions are three-fold. (1) We provide new insights and point out that the key factor of over-smoothing is the mismatched weights of the adjacency matrices in the various powers. (2) We propose AdjMix to tackle over-smoothing by adjusting the weights to appropriate values. Because AdjMix is not limited by over-smoothing, it can obtain more global information by taking higher powers instead of increasing depth, and it scales to wider structures. (3) Through an empirical assessment on node classification tasks, we show that AdjMix matches the complexity and parameters of SGC while significantly outperforming SGC and other state-of-the-art methods in terms of classification accuracy.

Related works
Spectral CNN [6] is the first successful attempt at extending CNNs to graphs for dealing with graph-based tasks. Defferrard et al. [10] propose a localized filter using Chebyshev polynomials to avoid the computationally expensive Laplacian eigendecomposition of [6]. Later, Kipf and Welling [21] propose graph convolutional networks (GCNs) by simplifying the Chebyshev polynomials based on a redefined propagation matrix. Li et al. [26] prove that each convolution of GCNs is actually a special form of Laplacian smoothing, which suffers from potential over-smoothing when many graph convolution layers are stacked. GraphSAGE [14] leverages node information with sampling and aggregation to generate node embeddings, tackling the limitations of transductive learning. FastGCN [7] regards the convolution as integral transforms of embedding functions to significantly reduce time and memory requirements. Xu et al. [42] analyze the ability of graph neural networks (GNNs) to capture different graph structures and propose the graph isomorphism network (GIN), which is as powerful as the Weisfeiler-Lehman test for graph isomorphism. Klicpera et al. [22] propose an improved propagation scheme with personalized PageRank to leverage the information from a large neighborhood. Li et al. [27] propose low-pass graph convolutional filters that inject graph relations into data features in semi-supervised learning. Veličković et al. [39] propose deep graph infomax (DGI), which is based on mutual information, to learn graph embeddings on both transductive and inductive learning tasks. Combining DGI with the works of [9,17,18,32,33,35] makes it more powerful. Several studies obtain multi-scale information via a high-order propagation matrix [2,3,24,29,31].
A recent surge of research interest has focused on simpler, linear models to achieve efficient computation for node and graph learning tasks. Thekumparampil et al. [37] propose a linear GCN model that removes all intermediate non-linear activation functions to simplify computations and achieves performance comparable to GCNs and other state-of-the-art models. SGC [40], a simpler and more efficient variant of GCNs, removes the nonlinearities between GCN layers and collapses the weight matrices between consecutive layers, and it has achieved performance comparable to GCNs on a variety of tasks.
Many studies demonstrate that graph attentional models, which assign edge weights based on node features, improve performance on graph learning tasks [19,23,37,38,45]. Nevertheless, these attention-based models add significant overhead in complexity and parameters. For a more comprehensive review, see [5,41,46].

Preliminaries
We follow [3,21] to introduce graph notations and the problem definition. We represent an undirected graph $G$ with $n$ vertices and $e$ edges as $(\mathcal{E}, \mathcal{V}, A, X)$, where $\mathcal{E}$ and $\mathcal{V}$ are the edge set and vertex set, respectively, $X \in \mathbb{R}^{n \times c_0}$ is the node feature matrix assuming each node has $c_0$ features, and $A \in \{0, 1\}^{n \times n}$ is the adjacency matrix with each entry $a_{ij}$ describing the edge weight between vertices $i$ and $j$. We use edge weights to indicate connection strengths between vertices (nodes).
Graph convolutional networks. We follow [21] to introduce the convolution of GCNs, which is defined as:

$$H^{(k+1)} = \mathrm{ReLU}\left(S H^{(k)} \theta^{(k)}\right), \tag{1}$$

where $S$ is the normalized adjacency matrix with added self-loops, $S = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$. Here $\tilde{A} = A + I$, $I \in \mathbb{R}^{n \times n}$ is the identity matrix, and $\tilde{D}$ is the degree matrix of $\tilde{A}$, with $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. $H^{(k)}$ and $\theta^{(k)}$ denote the input node representations and the learned weight matrix at layer $k$. In particular, $H^{(1)} = X$, which serves as input to the first GCN layer. We use the nonlinear activation function ReLU for the pointwise nonlinear transformation, with $\mathrm{ReLU}(x) = \max(0, x)$.
Similar to CNNs, GCNs learn graph representations by stacking multiple layers. Stacking the convolutional layer twice, the popular two-layer GCN can be written as:

$$Y_{\mathrm{GCN}} = \mathrm{softmax}\left(S\,\mathrm{ReLU}\left(S X \theta^{(1)}\right)\theta^{(2)}\right), \tag{2}$$

where $\theta^{(1)}$ and $\theta^{(2)}$ are different weight matrices. We apply a softmax classifier to predict the labels for node classification, with $\mathrm{softmax}(x_i) = \exp(x_i) / \sum_j \exp(x_j)$.

Simple graph convolution. By repeatedly stacking GCN layers, we describe $k$-layer GCNs in general form as:

$$Y_{\mathrm{GCN}} = \mathrm{softmax}\left(S\,\mathrm{ReLU}\left(\cdots \mathrm{ReLU}\left(S X \theta^{(1)}\right) \cdots\right)\theta^{(k)}\right). \tag{3}$$

Wu et al. [40] show that the nonlinearity between GCN layers is unimportant for classification tasks. Therefore, for $k$-layer GCNs, we remove the ReLU functions between the layers. The $k$-layer GCNs become:

$$Y_{\mathrm{GCN}} = \mathrm{softmax}\left(S \cdots S S X \theta^{(1)} \theta^{(2)} \cdots \theta^{(k)}\right). \tag{4}$$

To simplify notation and computations, we collapse the repeated multiplication by $S$ into a single matrix $S^k$ by raising $S$ to the $k$-th power, and reparameterize the weights into a single matrix $\theta = \theta^{(1)} \theta^{(2)} \cdots \theta^{(k)}$. We write the $k$-layer GCNs in simple form:

$$Y_{\mathrm{SGC}} = \mathrm{softmax}\left(S^k X \theta\right). \tag{5}$$

We refer to the simplified model as simple graph convolution (SGC) [40]. SGC removes the excess complexity of GCNs and yields a simplified linear model while showing competitive performance compared to GCNs.
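The collapse from Eq. (4) to Eq. (5) can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation; the function names, the 4-node path graph, and the random features and weights are all assumptions made for illustration.

```python
import numpy as np

def normalized_adjacency_with_self_loops(A):
    """S = D~^{-1/2} (A + I) D~^{-1/2}, the propagation matrix of GCNs and SGC."""
    A_tilde = np.asarray(A, dtype=float) + np.eye(len(A))
    d_inv_sqrt = A_tilde.sum(axis=1) ** -0.5  # degrees are >= 1 thanks to self-loops
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def softmax(Z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def linear_gcn(S, X, thetas):
    """k-layer GCN with the ReLUs removed, evaluated layer by layer."""
    H = X
    for theta in thetas:
        H = S @ H @ theta
    return softmax(H)

def sgc(S, X, theta, k):
    """SGC: softmax(S^k X theta), with S^k X built from k successive products."""
    H = X
    for _ in range(k):
        H = S @ H
    return softmax(H @ theta)

# Hypothetical toy data: a 4-node path graph, 3 input features, 2 classes.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
theta1 = rng.normal(size=(3, 5))
theta2 = rng.normal(size=(5, 2))

S = normalized_adjacency_with_self_loops(A)
Y_linear = linear_gcn(S, X, [theta1, theta2])
Y_sgc = sgc(S, X, theta1 @ theta2, k=2)
print(np.allclose(Y_linear, Y_sgc))
```

Because matrix multiplication is associative, the two predictions agree up to floating-point error, which is why the product $S^k X$ can be computed once before training.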

SGC encounters bottleneck
In traditional CNNs, the depth (stacking layers) increases the receptive field of internal features [15]. In SGC, we increase the receptive field by increasing the k value in Eq. (5). Equation (5) shows that SGC captures the features information of neighbor nodes that are k-hops away in the graph. Consider- To analyze Eqs. (5) and (6) intuitively and clearly, we take . Then:

Fig. 2 Undirected graph
We can observe the impact of different k values on the weights between nodes at various distances (the weights of the various powers of S). As k increases, the weights of the high powers of S grow from 0 while the weights of the low powers gradually decrease. Furthermore, all weights of S tend to a fixed value when k ≥ 20, which makes node features indistinguishable. In addition, these weights are already relatively close when k ≥ 3. This implies that SGC may suffer from over-smoothing when k ≥ 3. Limited by over-smoothing, SGC has difficulty capturing the features of many nodes and exploring the global graph structure.
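The convergence of the weights can be reproduced on a small example. The sketch below uses a hypothetical 5-node path graph (not the graph of Fig. 2) and relies on a standard fact about the symmetric normalized matrix: for a connected graph with self-loops, $S^k$ approaches the rank-one matrix with entries $\sqrt{d_i d_j} / \sum_l d_l$, where $d$ holds the degrees of $A + I$. Once $S^k$ is close to this limit, the entries no longer encode distance between nodes.

```python
import numpy as np

def normalized_adjacency_with_self_loops(A):
    """S = D~^{-1/2} (A + I) D~^{-1/2}."""
    A_tilde = np.asarray(A, dtype=float) + np.eye(len(A))
    d_inv_sqrt = A_tilde.sum(axis=1) ** -0.5
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

# Hypothetical 5-node path graph 0-1-2-3-4.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0

S = normalized_adjacency_with_self_loops(A)

# Limit of S^k for a connected graph: sqrt(d_i * d_j) / sum(d).
d = A.sum(axis=1) + 1.0
limit = np.sqrt(np.outer(d, d)) / d.sum()

gaps = {}
for k in (1, 2, 3, 20, 50):
    S_k = np.linalg.matrix_power(S, k)
    gaps[k] = np.abs(S_k - limit).max()  # largest entry-wise distance to the limit
    print(f"k={k:2d}  max |S^k - limit| = {gaps[k]:.2e}")
```

The printed gap shrinks as k grows, so for large k every row of $S^k X$ is, up to a degree-dependent scale, the same vector.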

The overall architecture
In SGC, we improve the learning ability by increasing the scale of the receptive field. The analysis in Sect. 3.2 shows that SGC fails to further enlarge the receptive field because of over-smoothing at larger k values (k ≥ 3). This hurts classification performance on graph learning tasks. By expanding Eq. (6), we find that S^k includes a series of weights of adjacency matrices in the various powers. We hypothesize that these weights do not match and that the mismatched weights are the key factor of over-smoothing. Therefore, we can effectively tackle over-smoothing by adjusting the weights to matching values. Motivated by the above analysis, we propose to adjust the weights by constructing a new adjacency matrix. Based on this adjacency matrix, we develop our AdjMix architecture (Fig. 1). In AdjMix, we first construct an adjacency matrix with attention mechanism to adjust the weights. We then design a simple and attentional graph convolution model to capture the feature information of neighbor nodes that are k hops away. Finally, we apply a softmax classifier to classify each node.

Simple and attentional graph convolution layer
In our simple and attentional graph convolution layer, node representations are updated in four steps, i.e. multi-scale adjacency matrix with attention mechanism, multi-scale feature propagation, multi-scale linear transformation, and nonlinear activation. We now introduce each step in detail.

Multi-scale adjacency matrix with attention mechanism.
In GCNs, the adjacency matrix $A$ denotes the edge weights between nodes that are one hop away. However, information propagation on the graph is inadequate because $A$ cannot denote the edge weights between nodes that are $k$ hops ($k > 1$) away. To address this limit, we mix various powers of the normalized $A$ into a single multi-scale adjacency matrix with attention mechanism:

$$\check{S}_{adj} = \alpha_0 S_{adj}^0 + \alpha_1 S_{adj}^1 + \cdots + \alpha_k S_{adj}^k,$$

where $\check{S}_{adj}$ is the multi-scale adjacency matrix with attention mechanism, and $S_{adj}^1$ is a normalized adjacency matrix without added self-loops, with $S_{adj}^1 = S_{adj} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$. Here $S_{adj}^k$ is the $k$-th power of $S_{adj}$, and $D$ is the degree matrix of $A$, with $D_{ii} = \sum_j A_{ij}$. We account for each node's own feature via $S_{adj}^0$, with $S_{adj}^0 = I$. In addition, we use the various powers of $S_{adj}^1$ (such as $S_{adj}^k$) to obtain feature information from all nodes that are $k$ hops away in the graph. By applying a series of attention scores $\alpha_0, \alpha_1, \ldots, \alpha_k$, we can flexibly adjust the weights of these powers of $S_{adj}$ to avoid over-smoothing.
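The construction of the multi-scale adjacency matrix can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names and the 4-node star graph are hypothetical, and the attention scores are the five values the paper reports for Fig. 3.

```python
import numpy as np

def normalized_adjacency(A):
    """S_adj = D^{-1/2} A D^{-1/2}, without added self-loops."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5  # isolated nodes keep a zero row
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def multiscale_adjacency(A, alphas):
    """Return sum_i alphas[i] * S_adj^i for i = 0..k, where k = len(alphas) - 1."""
    S = normalized_adjacency(A)
    power = np.eye(S.shape[0])   # S_adj^0 = I carries the node's own feature
    mixed = np.zeros_like(S)
    for alpha in alphas:
        mixed += alpha * power
        power = power @ S        # advance to the next power of S_adj
    return mixed

# Hypothetical 4-node star graph with node 0 at the center.
A = np.array([[0, 1, 1, 1],
              [1, 0, 0, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 0]])
alphas = [1.6, 1.3, 1.12, 0.85, 0.8]  # alpha_0 .. alpha_4
S_mix = multiscale_adjacency(A, alphas)
print(np.round(S_mix, 3))
```

Setting `alphas = [0.0, 1.0]` recovers the plain one-hop matrix $S_{adj}$, so the scores act as per-hop knobs on top of the usual propagation matrix.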

Theorem 1 Multi-scale adjacency matrix with attention mechanism is an operation of permutation invariance.
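Proof Let $S_{adj} \in \mathbb{R}^{n \times n}$ ($n$ nodes); then $S_{adj}^0 \in \mathbb{R}^{n \times n}$, $S_{adj}^1, S_{adj}^2, \ldots, S_{adj}^k \in \mathbb{R}^{n \times n}$, and $\check{S}_{adj} = \alpha_0 S_{adj}^0 + \alpha_1 S_{adj}^1 + \cdots + \alpha_k S_{adj}^k \in \mathbb{R}^{n \times n}$. Because our multi-scale adjacency matrix is an element-wise operation, the spatial location of $\check{S}_{adj}$ is permutation invariant.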
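One concrete way to read Theorem 1 is that mixing commutes with node relabeling: for any permutation matrix $P$, $(P S_{adj} P^\top)^i = P S_{adj}^i P^\top$, so relabeling the nodes and then mixing gives the same matrix as mixing and then relabeling. The check below is an illustrative sketch of that reading; the graph, the scores, and the function name are assumptions.

```python
import numpy as np

def multiscale_adjacency(A, alphas):
    """sum_i alphas[i] * S_adj^i with S_adj = D^{-1/2} A D^{-1/2}."""
    A = np.asarray(A, dtype=float)
    d_inv_sqrt = A.sum(axis=1) ** -0.5   # assumes the graph has no isolated nodes
    S = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return sum(a * np.linalg.matrix_power(S, i) for i, a in enumerate(alphas))

# Hypothetical 5-node graph and attention scores.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]])
alphas = [2.0, 1.0, 0.5]

rng = np.random.default_rng(1)
perm = rng.permutation(5)
P = np.eye(5)[perm]                      # row i of P selects original node perm[i]

relabel_then_mix = multiscale_adjacency(P @ A @ P.T, alphas)
mix_then_relabel = P @ multiscale_adjacency(A, alphas) @ P.T
print(np.allclose(relabel_then_mix, mix_then_relabel))
```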
Multi-scale feature propagation.
To aggregate various neighboring features, we design a multi-scale feature propagation, which is what distinguishes our convolution from those of GCNs and SGC. The node feature matrix $X$ is updated along the neighboring nodes at different scales in Eq. (15):

$$\bar{H} = \check{S}_{adj} X. \tag{15}$$

Intuitively, this step captures multi-scale feature information while keeping a larger receptive field. Furthermore, we adjust the contributions of node features at different scales by applying a series of attention scores to improve performance.

Multi-scale linear transformation and nonlinear activation.
After the multi-scale feature propagation, our convolutional layer is identical to a multi-layer perceptron (MLP). We use a learned weight matrix $W$ to conduct the multi-scale linear transformation. Finally, a nonlinear activation function such as ReLU is applied before outputting the feature representation $H$. The feature representation is updated as:

$$H = \mathrm{ReLU}\left(\bar{H} W\right). \tag{16}$$

Based on Eqs. (15) and (16), our simple and attentional graph convolution is defined as:

$$H = \mathrm{ReLU}\left(\check{S}_{adj} X W\right). \tag{17}$$

Our algorithm is summarized in Algorithm 1. Our convolution model learns the features of nodes up to k hops away and the global graph structure by aggregating neighborhood information as described in Eq. (17). In GCNs, the convolution cannot capture node features from different neighborhoods, whereas our convolution can explore the interaction of neighboring nodes that are k hops away. In SGC, the convolution fails to adjust the contributions of node features from different neighborhoods, whereas our convolution model can consider and adjust these contributions through a series of attention scores. In addition, because of the over-smoothing of GCNs and SGC, the receptive field of these models is narrow, which limits their expressive power. We adjust the weights to address over-smoothing via the multi-scale adjacency matrix with attention mechanism, so our model can improve the expressive power and learn the global graph topology. However, unlike baselines such as GCNs and SGC, AdjMix needs a different k value (the highest power of $S_{adj}$) for each dataset to achieve the best performance.

Algorithm 1 Simple and attentional graph convolution
Inputs: a normalized adjacency matrix without added self-loops S adj , input node feature matrix X , a learned weight matrix W .
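A minimal NumPy sketch of the convolution in Eq. (17), given these inputs, could look as follows. It is an illustration rather than the authors' implementation: the function names, the 4-node cycle graph, and the random matrices are assumptions, and accumulating the propagated features term by term (instead of forming the mixed matrix explicitly) is an implementation choice made here, not one stated in the paper.

```python
import numpy as np

def normalized_adjacency(A):
    """S_adj = D^{-1/2} A D^{-1/2}, without added self-loops."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def propagate(S, X, alphas):
    """Multi-scale propagation: (sum_i alphas[i] * S^i) X, without forming S^i."""
    term = np.asarray(X, dtype=float)    # S^0 X = X
    H_bar = np.zeros_like(term)
    for alpha in alphas:
        H_bar += alpha * term
        term = S @ term                  # S^(i+1) X from S^i X
    return H_bar

def adjmix_layer(H_bar, W):
    """Linear transformation followed by ReLU."""
    return np.maximum(H_bar @ W, 0.0)

# Hypothetical toy inputs: 4-node cycle, 3 input features, 2 output channels.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
alphas = [2.7, 1.0, 1.0]  # alpha_0, alpha_1 as reported in the paper; alpha_2 is made up

S = normalized_adjacency(A)
H_bar = propagate(S, X, alphas)   # needs no learned weights, so it can be cached
H = adjmix_layer(H_bar, W)
print(H.shape)
```

Since `propagate` involves no learned weights, its result can be computed once and reused across training epochs; only `adjmix_layer` depends on `W`.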

Output layer
In the output layer, following Kipf and Welling [21], we predict the labels of nodes using a softmax classifier and adopt the loss function from GCNs. The class prediction $Y_{AdjMix}$ of the AdjMix model takes the following form:

$$Y_{AdjMix} = \mathrm{softmax}\left(\mathrm{ReLU}\left(\check{S}_{adj} X W\right)\right).$$
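The output layer can be sketched as below, assuming that the loss adopted from GCNs is the cross-entropy averaged over the labeled nodes, as in Kipf and Welling [21]. The function names, the toy activations, the labels, and the training mask are hypothetical.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def masked_cross_entropy(probs, labels, train_mask):
    """Mean negative log-likelihood over the labeled (training) nodes only."""
    idx = np.flatnonzero(train_mask)
    picked = probs[idx, labels[idx]]     # probability assigned to the true class
    return -np.log(picked).mean()

# Hypothetical layer output for 4 nodes and 3 classes (non-negative, as after ReLU).
H = np.array([[2.0, 0.0, 0.1],
              [0.0, 1.5, 0.0],
              [0.3, 0.0, 2.2],
              [0.0, 0.0, 0.0]])
labels = np.array([0, 1, 2, 1])
train_mask = np.array([True, True, False, False])  # only two nodes are labeled

Y = softmax(H)
pred = Y.argmax(axis=1)
loss = masked_cross_entropy(Y, labels, train_mask)
print(pred[:3], round(float(loss), 4))
```

Unlabeled nodes still receive predictions; they simply do not contribute to the loss, which is the usual semi-supervised setup.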

Complexity analysis and parameters comparison
Similar to SGC, we can regard $\bar{H}$ as a fixed feature extraction since the computation of $\bar{H} = \check{S}_{adj} X$ requires no weights. By precomputing the fixed feature extraction $\bar{H}$, the computation of the proposed model is very efficient. Let $A \in \mathbb{R}^{n \times n}$, $X \in \mathbb{R}^{n \times c_0}$, and $W \in \mathbb{R}^{c_0 \times c_1}$ ($c_1$ filters). As introduced in Sect. 3.4, $\check{S}_{adj} \in \mathbb{R}^{n \times n}$, $\bar{H} \in \mathbb{R}^{n \times c_0}$, and $Y_{AdjMix} = \mathrm{softmax}(\mathrm{ReLU}(\bar{H} W)) \in \mathbb{R}^{n \times c_1}$, where $c_1$ denotes the number of classes. Because $\check{S}_{adj}$ is usually a sparse matrix with $m$ non-zero entries, the time complexity and number of parameters of AdjMix are $O(m \times c_0 \times c_1)$ and $O(c_0 \times c_1)$, respectively. Compared to GCNs, AdjMix performs better in terms of complexity and parameters.
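To make the parameter comparison concrete, the arithmetic can be sketched for one setting. The numbers below are assumptions brought in for illustration and are not taken from the paper's tables: the commonly used Cora dimensions (1,433 input features, 7 classes) and a two-layer GCN with 16 hidden units as in Kipf and Welling [21], with bias terms ignored.

```python
def adjmix_parameter_count(c0, c1):
    """One weight matrix W of shape (c0, c1); the attention scores are tuned by hand."""
    return c0 * c1

def two_layer_gcn_parameter_count(c0, hidden, c1):
    """Two weight matrices of shapes (c0, hidden) and (hidden, c1), biases ignored."""
    return c0 * hidden + hidden * c1

# Assumed dimensions (not from the paper's tables): Cora-like features and classes.
c0, c1, hidden = 1433, 7, 16

adjmix_params = adjmix_parameter_count(c0, c1)
gcn_params = two_layer_gcn_parameter_count(c0, hidden, c1)
print(adjmix_params, gcn_params)
```

Under these assumptions the single-layer model has fewer than half the weights of the two-layer GCN, and its count matches that of SGC, which also learns one $c_0 \times c_1$ matrix.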

Experiments
We evaluate the benefits of AdjMix against a number of state-of-the-art models, with the goal of answering the following research questions: Q1 Can AdjMix address over-smoothing when capturing feature information at long distances? Q2 How does AdjMix compare to the state-of-the-art models on semi-supervised node classification tasks? Q3 How does AdjMix compare to GCNs and other models in terms of complexity and parameters on all datasets?

Datasets.
To evaluate the effectiveness of AdjMix, we conduct experiments on citation network datasets chosen from benchmarks commonly used on semi-supervised node learning tasks. We use the datasets on public splits [21,43] of Pubmed, Citeseer, and Cora [34], which are summarized in Table 1.
Model configurations.
We implement our model in TensorFlow [1] using the Adam [20] optimizer. We now report the tuned hyperparameters that achieve the best performance on each dataset. On Cora, we set the learning rate to 0.0003 and dropout to 0.95 to improve stability and test accuracy. On the other datasets, we set the learning rate to 0.01 and dropout to 0 because the results are already very stable. In addition, we apply L2 regularization factors of 0.005, 0.000825, and 0.0004 and train for 240, 1,500, and 48,000 epochs on Pubmed, Citeseer, and Cora, respectively. We choose the highest power $k$ of the normalized adjacency matrix $S_{adj}$ according to prediction performance, setting $k$ to 21, 4, and 8 on Pubmed, Citeseer, and Cora, respectively. We tune the attention scores $\alpha_0, \alpha_1, \ldots, \alpha_k$ to suitable values, with $\alpha_0 = 2.7$, $\alpha_1 = 1$.
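For reference, the reported settings can be collected in one place. The dictionary below is only a convenience sketch: the key names and the helper function are hypothetical, and the values are copied from the paragraph above. The text does not state whether the dropout value is a drop rate or a keep probability, so it is recorded as given.

```python
# Per-dataset hyperparameters as reported in the text (key names are hypothetical).
ADJMIX_CONFIG = {
    "Pubmed":   {"learning_rate": 0.01,   "dropout": 0.0,  "l2": 0.005,    "epochs": 240,   "k": 21},
    "Citeseer": {"learning_rate": 0.01,   "dropout": 0.0,  "l2": 0.000825, "epochs": 1500,  "k": 4},
    "Cora":     {"learning_rate": 0.0003, "dropout": 0.95, "l2": 0.0004,   "epochs": 48000, "k": 8},
}

def num_attention_scores(dataset):
    """alpha_0 .. alpha_k gives k + 1 scores for the chosen highest power k."""
    return ADJMIX_CONFIG[dataset]["k"] + 1

for name in ADJMIX_CONFIG:
    print(name, ADJMIX_CONFIG[name], num_attention_scores(name))
```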

Analysis of over-smoothing
To address Q1, we conduct a variety of numerical experiments in Fig. 3. In Fig. 3c-e, as the power of $S_{adj}$ increases, the weights remain distinguishable while more node features are obtained. Fig. 3f, g shows that further increasing the power of $S_{adj}$ brings the weights closer together and makes the node features harder to distinguish. This naturally leads to over-smoothing. As described in Sect. 3.3, a main factor of over-smoothing is the mismatched weights. Therefore, we propose the multi-scale adjacency matrix to adjust these weights to appropriate values. We observe that the weights of the proposed multi-scale adjacency matrix are distinguishable; that is, the features of different neighboring nodes remain distinguishable. This implies that our model can effectively avoid over-smoothing when capturing the feature information of neighboring nodes at long distances.

Therefore, by increasing the powers of the adjacency matrix, our model can enlarge the receptive field and improve the expressive power of learning graph representations.

Fig. 3 Our multi-scale adjacency matrix with four powers for the graph in Fig. 2. We summarize these results based on $\check{S}_{adj} = \alpha_0 S_{adj}^0 + \alpha_1 S_{adj}^1 + \alpha_2 S_{adj}^2 + \alpha_3 S_{adj}^3 + \alpha_4 S_{adj}^4$, with $\alpha_0 = 1.6$, $\alpha_1 = 1.3$, $\alpha_2 = 1.12$, $\alpha_3 = 0.85$, and $\alpha_4 = 0.8$

Results for node classification
Table 2 compares the performance of AdjMix to state-of-the-art node classification baselines in terms of classification accuracy and stability on citation networks. These results provide positive answers to question Q2. We observe that AdjMix obtains the best classification accuracy and stability among all state-of-the-art approaches on Pubmed, Citeseer, and Cora. It improves over GCNs [21] by 2.3%, 4.6%, and 2.6%, over AdaLNet [29] by 3.4%, 7.0%, and 4.0%, and over SGC [40] by 2.4%, 2.2%, and 3.2%, respectively. AdjMix achieves similar stability to SGC [40] while significantly improving classification results, and it outperforms the other baselines by a large margin in terms of stability.
Interestingly, our simplified model variant, AdjMix-2, outperforms SGC [40] by a large margin on all datasets. This is because AdjMix-2 can adjust the weights of the adjacency matrices in the different powers to matching values. These results demonstrate the effectiveness of the proposed models for adjusting the weights.
Figure 4 shows the influence of model width and the node's own feature on all datasets. We observe that the overall accuracy of both the AdjMix without AS model and the AdjMix without AS and $S_{adj}^0$ model improves as the model goes wider (k increases) until a width (k) of 21, 4, and 8 on Pubmed, Citeseer, and Cora, respectively. When the model width is increased further, the accuracy of our models begins to decrease. This is because our models with wider structure may mix the features from various clusters. Compared to the AdjMix without AS and $S_{adj}^0$ model, the AdjMix without AS model improves performance at different model widths on all datasets. This demonstrates the importance of the node's own feature in model design. These comparison results show the best model width and confirm the benefit of the node's own features for performance.

Fig. 4 Influence of model width (the highest power of $S_{adj}$, i.e. k) and node's own feature. AdjMix without AS denotes our AdjMix model without the attention scores $\alpha_0, \alpha_1, \ldots, \alpha_k$, so that $\check{S}_{adj} = S_{adj}^0 + S_{adj}^1 + \cdots + S_{adj}^k$. AdjMix without AS and $S_{adj}^0$ denotes our AdjMix model without the attention scores and without $S_{adj}^0$, so that $\check{S}_{adj} = S_{adj}^1 + S_{adj}^2 + \cdots + S_{adj}^k$

Table 3 Comparison of parameters and complexity (columns: Method, Parameters, Complexity)

Analysis of parameters and complexity
As explained in He et al. [16], the actual running time is sensitive to hardware and implementations. Following Liu et al. [30], we therefore report the theoretical time complexity rather than the actual running time. To answer Q3, we compare our methods with state-of-the-art node classification baselines in terms of parameters and complexity on citation networks. As described in Sect. 3.6, we list the parameters and complexity in Table 3. Interestingly, in terms of parameters and complexity, the proposed models match SGC [40] and perform better than the other baselines. These results demonstrate the benefit of designing a one-layer model.
Based on Tables 2 and 3, we conclude that the proposed models outperform SGC [40] by a large margin while maintaining the same parameters and complexity, and that they achieve state-of-the-art performance in terms of accuracy, parameters, and complexity compared to other baselines. These results show the significant advantages of the proposed models.

Conclusion
In this paper, we propose a simple and attentional graph convolutional network architecture, AdjMix, for semi-supervised node learning tasks. Notably, AdjMix can adjust the weights of the adjacency matrices in the various powers to address the over-smoothing of SGC and GCNs, and thus captures more node feature information with a wider model. By precomputing the fixed feature extraction $\bar{H}$, AdjMix matches SGC in terms of parameters and complexity. Experiments on node classification benchmarks show the benefit of capturing the node features and the entire graph structure. In the future, we plan to design a learned network that automatically adjusts these weights to matching values and to apply the proposed model to more graph learning tasks. In particular, we will study how to apply the proposed model to directed graphs.