Graph Convolutional Networks: analysis, improvements and results

In the current era of neural networks and big data, higher dimensional data is processed for automation of different application areas. Graphs represent a complex data organization in which dependencies between more than one object or activity occur. Due to the high dimensionality, this data creates challenges for machine learning algorithms. Graph convolutional networks were introduced to utilize the convolutional models concepts that shows good results. In this context, we enhanced two of the existing Graph convolutional network models by proposing four enhancements. These changes includes: hyper parameters optimization, convex combination of activation functions, topological information enrichment through clustering coefficients measure, and structural redesign of the network through addition of dense layers. We present extensive results on four state-of-art benchmark datasets. The performance is notable not only in terms of lesser computational cost compared to competitors, but also achieved competitive results for three of the datasets and state-of-the-art for the fourth dataset.


Introduction
In the last few years, data is usually represented as points in a vector space. Today, structured data is omnipresent, able to include structural information between points and can be particularly important to better represent the models learned on them. For this purpose, the graphs are widely used to represent this type of information through nodes/vertices and edges, including local and spatial information derived from the data.
Very often the interest concerns a prediction about the node properties in such graphs. For example, given a network that represents a human phenomenon, like a common exchange of messages in a social network, the goal is to predict the area of belonging i.e. users with common interests. Performing a forecasting process, especially in a semi-supervised environment, has been at the center of graph-based semi-supervised learning (SSL) Rozza et al. [2014]. In graph-based SSL, a small set of nodes is ini-Notationally, the structure described by the graph is normally incorporated as an explicit regularizer which applies a sliding constraint on the labels of the nodes to estimate. Recently, Graph Convolutional Networks (GCN) Defferrard et al. [2016]; Kipf and Welling [2017] have been proposed with a purpose to work on deep neural networks and graph-structured data. In this paper, our attention is on the task of graph-based SSL using GCNs. GCN progressively estimate a transformation from graph to vector space, also called embedding, and an aggregation of neighborhood nodes, whereas a target loss function for backpropagation errors is adopted. Indeed, node embedding result represents an estimation for label scores on the nodes. Moreover, it would be appropriate to obtain an association of confidence estimation to the label scores. This confidence scores can be adopted to understand the reliability of the estimated labels on a generic node. This improvement is introduced in the model called Confidence based Graph Convolutional Networks (Con-fGCN) Vashishth et al. [2019].
In this context, the aim of our paper is fourfold: 1. The models provide a set of parameters to be optimized. In this phase the goal is to find the optimal combination (activation function, loss function, hidden layers, number of nodes, etc) in order to obtain the best performance. It is assumed that this phase is very long and expensive. 2. In past few years, many researchers have been worked designing novel activation functions in order to help deep neural networks to converge and obtain better performance. Standard neural networks employ logistic sigmoid activation functions which is affected by saturation problem and and consequently the effectiveness and efficiency of the classification is reduced. In this paper, we introduce an efficient approach to learn, during training, combinations of base activation functions (such as Relu6); our goal is to check a search space for the activation functions through a convex combination of the base functions. 3. The models adopt only the information relating to the degree of the single nodes (matrixD in equation 1) to process the graphs. In this regard, we introduce a measure that provides additional topological information called clustering coefficients. 4. In deep learning the goal is to improve performance by focus attention on adding new layers, modifying the activation functions or changing the regularization methods. Furthermore, current structure, layers, can be redesigned to obtain optimal results compared to existing models. To this aim, we combined GCN and Dense layers. This new model provides a mixture of both GCN and Dense layers and compared to individual GCN, resulted in better performance.
Additionally, we analyze the two baseline models, GCN and ConfCGN, in order to show the impact of proposed changes during training and testing phases. The paper is organized as follows: section 2 gives an overview of related work. Section 3 provides an overview of the two models GCN and ConfCGN. Section 4 explains the enhancement we proposed in this paper. Whereas, section 5 discusses the results achieved with proposed approach on GCN and confGCN. Finally, section 6 gives some future directions and concludes our paper.

Related Work
Recent literature provides some interesting insights about application of neural networks and data organized as graphs. In Kipf and Welling [2017] a variant of convolutional neural networks, called Graph Convolutional Networks (GCNs), which operate directly on graphs is presented. Main motivation of convolutional architecture is related to localized first-order approximation of spectral graph convolutions. The model works by scaling linearly nodes connections and adopts hidden layer representations which encode both structure and features of graphs. ---s on graphs, is presented. The proposed formulation does not alter the computational complexity of standard CNNs, despite being found to be processing graph structures.
In Marcheggiani and Titov [2017], an enhanced version of Kipf and Welling [2017] is introduced. It is able to work with syntactic dependency graphs in form of sentence encoders and extracts words latent feature representations arranged in a sentence. Moreover, the authors showed that layers are complementary to LSTM layers.
In Veličković et al. [2018], a neural network architecture for inductive and transductive problems, based on masked self-attentional layers, called graph attention networks (GATs), for graph-structured data is presented. In this model, nodes are able to contribute about neighboring features extraction and different weights to different nodes in a neighborhood are enabled, eliminating expensive matrix operations. In this way, several key challenges of spectral-based graph neural networks are addressed at the same time.
In Vashishth et al. [2019], a modified version called Confidence-based Graph Convolutional Networks (Con-fGCN) of Kipf and Welling [2017] is introduced. The improvement concerns a confidence estimation about label scores that has not been explored in GCNs. ConfGCN adopts label scores estimation to identify the influence of a node on the neighborhood during aggregation, thus acquiring anisotropic abilities.
In Liao et al. [2018] a graph partition neural networks (GPNN), an extension of graph neural networks (GNNs), useful to work with large graphs are described. GPNNs combine local information between nodes in small subgraphs and global information between the subgraphs. Graphs are partitioned in efficient way through several al-gorithms and, additionally, a novel variant for fast processing of large scale graphs is introduced.
In Yadav et al.
[2019] a modified version of Kipf and Welling [2017] named Lovász Convolutional Networks (LCNs) is introduced. The model is able to capture global graph properties through Lovász orthonormal embeddings of the nodes.
In Atwood and Towsley [2016], a Diffusion-Convolutional Neural Networks (DCNNs) are described. Diffusion-convolution operation is useful to learn representations as an effective basis for node classification. The model includes different qualities such as latent representation for graphical data, invariance under isomorphism, polynomial-time prediction and learning.
In Bruna et al. [2014], possible generalizations of Convolutional Neural Networks (CNNs) to signals is defined for more general domains. In particular, two models, one based upon a hierarchical clustering of the domain and another based on the spectrum of the graph Laplacian are described. The model is able to learn convolutional layers with a number of parameters independent of the input size, resulting in efficient deep architectures. Further, a deep architecture with small learning complexity on general non-Euclidean domains is introduced in Henaff et al. [2015]. The model is an extension of Spectral Networks which includes a graph estimation procedure.
Finally, in Li et al. [2016] authors describe Gated graph sequence neural networks (GGNN), an extended versions of Graph Neural Networks (GNN) Scarselli et al. [2008], which uses modified gated recurrent units modern optimization techniques and extends output sequences. In the following section, we will explain the two baseline models i.e. GCN and ConfGCN.

Baseline Models
In this section, we briefly introduce Graph Convolutional Networks (GCNs) Kipf and Welling [2017] and its enhancement, Confidence-based Graph Convolutional Networks Vashishth et al. [2019]. The two models are compared and analyzed in detail in terms of their limitations and differences. Subsequently, starting from these points a set of improvements are proposed and demonstrated experimentally.

Notations
In this section we define the important elements for this research. Given G = (V, E, X) an undirected graph, where V = V l ∪ V u the set containing labeled (V l ) and unlabeled (V u ) nodes in the graph of dimension n l and n u , E is the set of edges and X ∈ R (n l +n u )×d is the input node features. The label of a node v is represented by a vector Y v ∈ R m , belonging to m classes. In this context, the goal is to predict the labels, Y ∈ R n l ×m , of the unlabeled nodes of G. To consider the confidence, label distribution µ v ∈ R m and a diagonal co-variance matrix

Graph Convolutional Networks
Graph Convolutional Networks (GCNs) Kipf and Welling [2017] works on undirected graphs. Given a graph G = (V, E, X), the node representation after a single layer of GCN can be defined as: W ∈ R d×d includes the model parameters, A represents nodes adjacency andD ii = j (A + I) i j . f is any activation function such as ReLU, f (x) = max(0, x). Equation 1 can be reformulated as: The goal is to acquire multi-hop dependencies between nodes, different GCN layers can be superimposed over one another. The representation of the node v after k layers can be written as where, W k and b k represent the weight and bias parameters of GCN layer.

Confidence based Graph Convolutional Networks
In Vashishth et al. [2019] Confidence-based Graph Convolutional Networks (ConfGCN) is described. The authors define the influence score of node u considering its near node v during GCN process as follows: u, v) represents Mahalanobis distance between two nodes Orbach and Crammer [2012]. Specifically, considering nodes u and v, with label distributions µ u and µ v and co-variance matrices Σ u and Σ v , r uv gives more importance to spatially close nodes belonging to same class, otherwise reduces importance of nodes with low confidence scores. This results leads to inclusion of anisotropic capability during neighborhood exploration. For a node v, equation 3 can be rewritten as: The final label prediction is obtained by equation 7 with K number of layers.

GCN versus ConfGCN Models
The differences between the two models are shown below: 1. The major difference in both models is that GCN implements the nodes embedding, projection from graph space to vector space, to describe the neighborhood while ConfGCN implements the confidence based prediction scheme where the neighboring nodes having higher confidence would be impor- 3. GCN doesnt have constraints on the number of nodes that influences the representation of a given target node and each node is influenced by all the nodes in its k-hop neighborhood. Whereas, in ConfGCN the label confidences are used to ignore less confident nodes and nodes having higher confidence would be considered important. Furthermore, number of nodes influencing do not sway the prediction of the wrong labels. 4. For more number of nodes in graphs such as Cora and CoraML datasets, ConfGCN has significantly better performances than Kipf GCN as the previous model implements the Nodes entropy of neighborhood calculation.
The limitations of the two models are shown below: 1. In GCN, memory requirement grows linearly in the size of the dataset. Whereas, ConfGCN requires higher memory requirement. 2. GCN is not applicable to directed graphs. It does not support edge features and is limited to undirected graphs (weighted or unweighted). 3. In GCN, locality for the nodes are assumed. 4. ConfGCN require more computational cost compared to the basic model. Cost increases as a confidence value, (equation 4), for the exploration of the neighborhood node. 5. ConfGCN require more time for execution. 6. In ConfGCN, increasing layers reduces the accuracy.
This behavior is connected to the increase of influencing nodes with increasing layers, which results in average information during aggregation.
In the following section, we will explain the proposed enhancement we did for selecting an optimal deep model that may result in fewer executing time and enhanced performance.

Proposed Models: Enhanced GCN and ConfGCN
We proposed four major enhancement for both the models. The first enhancement is changing the hyperparameters and training algorithm. The second and third are major enhancement i.e. adding more structural information to adjacency matrix and canonical optimization technique (also referred as convex). Finally, the fourth concerns a combination of two base models with introduction of additional dense layers. All these enhancement are applied on both the baseline models. Following section will explain how these models are designed and implemented.

Optimizing Hyper-parameters
First we optimized the baseline models by fine-tuning the hyper-parameter that include activation function (AF), loss function (LF), and the number of nodes in each hidden layer. For AF we have explored it with ReLu, ReLu6, Elu, and S elu. In case of LF, we utilized simple cross entropy and cross entropy so f tmax V2. Whereas, to explore the best number of nodes, we have taken nodes in each layer as 16, 32, 48, 64, 80, 96, 100, 112 and 200. We explored to find the best combination of these parameters to provide optimal result in minimum amount of time. From now we will call the two enhanced versions Optimized Graph Convolutional Networks (OpGCN) and Optimized Confidence based Graph Convolutional Networks (OpConfGCN).

Convex combination of activation functions
A standard neural network N d can be composed of a set of hidden layers d and a set functions L i that lead to a final mapping L related to a problem to address: N d = L• L d •· · ·• L 1 . Specifically, each hidden layer function L i is composed of two functions, g i and σ i , which include parameters within the spaces H gi and H σi . A remapping of the layer input neurons in form of activation function can be seen as: L i = σ i • g i . The learning process of L i consists in a procedure of optimization in the space H i = H σi × H gi . Commonly, σ i does not provide for a learning phase and H σi is a singleton. Then, H i = {σ i } × H gi . If we consider a fully-connected layer from R n i to R m i which adopts Relu activation function, H gi represents the set of all affine transformations from where Lin(A, B) and K(B) are respectively the sets of linear maps between A and B, and the set of translations of B.
In this paper, we adopt a technique to define learnable activation functions Manessi and Rozza [2018] that can be adopted in all hidden layers of a GCN architecture. The approach consists of a hypothesis space H σi and is based on the following idea: • select a set of activation functions F = { f 1 , . . . , f N }, in which elements can be adopted as base elements; • fix the activation function σ i to combine in linear way as elements belonging to F set; • look for an optimal hypothesis space; • look for GCN optimization respect to conv (A) is not vector subspace of V and is a generic convex subset in V reducing to a (|A| − 1)-dimensional simplex when the elements of A are linearly independent. If we consider F := { f 0 , f 1 , . . . , f N } the set of activation functions f i , the vector space F is defined from F considering all linear combinations i c i f i with c i ≥ 0, Σ i c i = 1. Note that, despite F is a spanning set of F, it is not generally a basis; indeed |F| ≥ dim F. Based on previous definitions, we can now define the technique to build learnable activation functions as follows: • fix a finite set F = { f 1 , . . . , f N }, where each f i is a learneable activation function; • create an additional activation function f as a linear combination of all the f i ∈ F; • select as hypothesis space H f the conv(F) set; The results are obtained for the following combination for F: where Relu6 = min(max(0, x), 6) In this work we have implemented two methods: 1. Taking two input layers of the model, use the different activation for them and then applying any mathematical operations on the inputs, i.e. Summation, Subtraction, Maximum, minimum and Average values of both Input layers output.
2. Looking at those results we got to know that summation operations are having the best results so we applied the canonical form on the outputs. In this case the convex combination becomes conv(A) := c 1 Relu6 + c 2 Relu6. The Structure of Base-Line model with optimized results is shown in Table 1: Whereas its enhanced model structure is given in Table  2:

Clustering coefficients
In equation 1 the adjacency matrix A, which describes the topology of the network, is very significant part of both models. Furthermore, the identity matrix I is added to A in order to remove zero values on the main diagonal. Our idea is to add more information about nodes by introducing a particular property called Clustering Coefficients. In graph theory, the clustering coefficient describes the degree of aggregation of nodes in a graph. The measure is based on triplets of nodes. A triplet is defined as three connected nodes. A triangle can include three closed triplets, each one centered on one of the nodes. Two possible versions can be defined: the Global Clustering Coefficients (GCCs) and the local Clustering Coefficients (CCs) Opsahl [2013]. We adopt the second defined as: k i is the degree of node i and δ i is the number of edges between the k i neighbors of node i. The measure is in the range {0, . . . , 1}, 0 if none of the neighbors of a node is connected and 1 if all of the neighbors are connected. Topological information is provided through CCs, which is connected to other structural properties Strang et al. [2018], such as transitivity, density, characteristic path length, and efficiency, useful for representation in the vector space. In this work we are suggesting that there is another possibility of the matrix I which is to replace the main diagonal of the matrix I with CCs values. For a graph of n × n nodes the identity matrix becomes: From now we will call the two enhanced versions Clustering Coefficients Graph Convolutional Networks (CCGCN) and Clustering Coefficients Confidence based Graph Convolutional Networks (CCConfGCN). The Structure of Base-Line model with optimized results is similar to Table 1. Matrix was added to the Adjacency matrix while pre-processing of the input and the combined matrix was considered as input to the neural network. The new matrix I n , having the same size as Identity matrix, is added to the adjacency matrix instead of plain identity matrix.

GCN and Dense Layer combination
Deep learning models have shown that beside creating a new layer, activation function, regularization method etc., if one can redesign existing layers etc. in a proper way. It can result in optimal performance as compare to the previous models. We adopted the same GCN and Dense layers and created a model that gave the optimal results. A dense layer is commonly known as fully connected layer and it is represented as: Here, y l n u represents the neuron at layer n, w l n i,v represents the weight (i, v) for that neuron multiplied with input neuron y l n−1 i , and b l n v represents that bias that is added to the weighted sum. The resultant weighted sum value is passed through an activation function f l n . Table 3 shows the structure of the model. We used this model on all four datasets. After extensive experiments, the best results are shown in Table 5. This combination provides a mixture of both GCN and Dense layers and result in better performance compared to individual GCN or Dense layer.
The training phase adopts the same Adam optimizer (similar to all other models). In each layer, we used Relu6 activation function. From now we will call the two enhanced versions Dense Graph Convolutional Networks (DGCN) and Dense Confidence based Graph Convolutional Networks (DConfGCN). In Table 3, 'In-Nodes' represents the input nodes to a layer, 'Out-Nodes' represent the output nodes of a layer, 'AF' represents the activation function, whereas drop out rate is represented by 'DO'.

Results
This section describes the results obtained on public datasets with the proposed improvements. In addition, the achieved results will be compared to the state-of-the-art models in literature.

Datasets
For performance evaluation we adopt several semisupervised classification datasets that are commonly used by other researchers. The set of dataset comprise of Cora, Citeseer, Pubmed Sen et al. [2008], and Cora-ML Bojchevski and Günnemann [2018]. The setup is the same as being followed in Vashishth et al. [2019]. Our aim concerns to classify documents into one of the predefined classes. Datasets represent citation networks, in which each document is encoded using bag-of-words features with undirected edges between nodes. The dataset statistics is summarized in table 4. Label mismatch concerns the fraction of edges between nodes with different labels in the training data. The datasets have substantially low label mismatch rate except Cora-ML.

Competitors
Our method is compared with approaches of different nature. Competitors can be divided into four groups. First group includes approaches based on extensions of the GCN model. G- GCN Marcheggiani and Titov [2017] provides an extension adopting edge-wise gating to remove noisy edges during aggregation. GAT Veličković et al. [2018] provides a method based on attention which gives different weights to different nodes by allowing nodes to attend to their neighborhood. Dual- GCN Monti et al. [2018] allows to learn both vertex and edge features and generalizes the GAT model Veličković et al. [2018]. LGCN Gao et al. [2018] works based on a learnable graph convolutional layer (LGCL). LGCL automatically selects a fixed number of neighboring nodes for each feature based on value ranking in order to transform graph data into grid-like structures in 1-D format. Fast- GCN Liang et al. [2015] is an accelerated and optimized tool for constructing gene co-expression networks that can fully harness the parallel nature of GPU (Graphic Processing Unit) architectures. Second group includes approaches based on extensions of the GNN model Scarselli et al. [2008]. GGNN Li et al. [2016] generalizes RNN framework for graph-structured data application. GPNN Liao et al. [2018] adopts partition approach to spread the information after the subdivision of large graphs into subgraphs. Third group includes approaches based on embedding. SemiEmb Weston et al. [2012] is a framework which provides semi-supervised regularization to improve training. DeepWalk Perozzi et al. [2014] adopts random walks to learns node features. Planetoid Yang et al. [2016] adopts a transductive and inductive approach for class label prediction using neighborhood information. Fourth group includes baseline approaches. LP Zhu et al. [2003] is a label propagation algorithm which spreads labels information to neighborhood following the proximity. ManiReg Belkin et al. [2006] provides geometric regularization on data. Feat Yang et al. [2016] works based on node features ignoring the structure infor-mation.

Comparison
We have summarized our results by showing the best results of all the enhancements for all the datasets. Table  5 shows the accuracy of all the models mentioned in Section 4. We have been successful in getting state-of-the-art result on one dataset as well as very close to the stateof-the-art work done till now on the other three datasets as highlighted in green in Table 5. On Cora ML dataset we achieved the current best accuracy of 86.9 ± 0.4 using DConfGCN model. This is the current state-of-theart based on our knowledge as the most recent papers i.e. Dual-GCN, LGCN, and Fast-GCN did not reported their results on Cora ML dataset. In case of Citeseer dataset, the best result which we achieved is 73.26%, this is more than Dual GCN and 0.3% less from LGCN. This makes our accuracy with ConvConfGCN the second best till date. However, just to highlight that LGCN Gao et al. [2018] report only the best result whereas our result are based on 100 run which are more stronger compare to reporting one highest performance. We have got the 3 rd best accuracy for Pubmed dataset i.e. 79.8 ± 0.4. Finally, on Cora dataset, we achieved 82.1±1.2 accuracy with DGCN that is better than baseline GCN and ConfGCn by slight margin, but at 4 th position overall in the list.
One of the reason for not having the best result for Citeseer, Cora, and Pubmed could be that the best reported results in the papers Gao et al. [2018]; Monti et al. [2018]; Liang et al. [2015] are not having the mean performance over multiple runs. Another reason is that, our model can not be directly compared with model like LGCN as it uses regular convolutional kernel in their model. Rather designing new kernels to work on graph data, in LGCN the authors organized the graph data in a way that normal convolutional kernel can operate over it and learn feature from them. These enhancement and results are reported to provide baseline for future works to be done in the field of SSL for the Graphs.
In table 6 execution time for PubMed dataset is shown. As the size of the features in each dataset varies, that is why the time (in seconds) per epoch varies for each dataset. GCN and its enhancements are faster compared to confGCN and its enhancements. While optimizing based on hyper-parameters, we found that the major reduction in computational cost was due to usage of the cross-entropy softmax V2 function rather than simple cross-entropy. Therefore, in all our later experiments we used this loss function. ConfGCN based models took more time compare to GCN based models. The optimal models interms of execution time is OPGCN.

Conclusions
We present enhanced models of GCN and ConfGCN for the Graph Convolution with Semi-supervised learning. The main focus among all the enhancements was on four changes: parametric configuration, adding more structural information to adjacency matrix for graph representation, convex optimization related to activation functions and combination of base models and dense layers. As this work is related to the Graph convolutions and with these enhanced models we have been able to show that the addition of the layers can be helpful for the increment of the accuracy, so this process has opened a path where addition of layer means accuracy reduction limitation of SSL has been removed. Also currently all the Graph Convolutional Layers are using 1D convolutions to operate the model, there can be 2D or 3D dimensional weighing schemes can be implemented on the concurrent models. The GCN was a new approach for SSL and in that the layer-wise propagation rule was implemented while Con-fGCN is a model which estimates label scores with labels confidence. We have prepared six different models with different configurations and we have validated our models with four benchmark datasets. In majority of the enhancements, we have been successful in increasing the accuracy as well as the execution time for all the best possible configurations in all four data-sets.