Star topology convolution for graph representation learning

We present a novel graph convolutional method called star topology convolution (STC). This method makes graph convolution more similar to conventional convolutional neural networks (CNNs) in Euclidean feature spaces. STC learns subgraphs which have a star topology rather than learning a fixed graph as most spectral methods do. Due to the properties of a star topology, STC is graph-scale free (it has no fixed graph size constraint). It has fewer parameters in its convolutional filter and is inductive, so it is more flexible and can be applied to large and evolving graphs. The convolutional filter is learnable and localized, similar to CNNs in Euclidean feature spaces, and can share weights across graphs. To test the method, STC was compared with state-of-the-art graph convolutional methods in a supervised learning setting on nine node property prediction benchmark datasets: Cora, Citeseer, Pubmed, PPI, Arxiv, MAG, ACM, DBLP, and IMDB. The experimental results showed that STC achieved state-of-the-art performance on all these datasets and maintained good robustness. In an essential protein identification task, STC outperformed the state-of-the-art essential protein identification methods. An application using pretrained STC as an embedding for feature extraction in downstream classification tasks is also introduced. The experimental results showed that STC can share weights across different graphs and be used as an embedding to improve the performance of downstream tasks.


Introduction
Convolutional neural networks (CNNs) have been used to solve problems which have a Euclidean feature space [1], such as image classification [2] and machine translation [3]. However, many problems, such as 3D meshes, social networks, telecommunication networks, and biological networks, have a non-Euclidean nature [4], and data in the form of a graph is a typical non-Euclidean case. This makes using CNNs to solve these problems a challenge [1]. There are three major problems in generalizing CNNs to graphs: (1) the numbers of directly connected neighbors usually differ between nodes [5]; (2) the feature dimensions may differ between nodes; (3) the edges may have features whose dimensions differ.
The convolution operation based on the graph Fourier transform, called the spectral convolutional neural network (Spectral CNN), was first introduced to the graph domain by [6]. Simplified versions of Spectral CNN based on polynomial expansion have been proposed, such as the Chebyshev network (ChebyNet) [7] and the graph convolutional network (GCN) [8]. ChebyNet [7] restricts the kernel of Spectral CNN to a polynomial expansion. GCN [8], a simplification of ChebyNet that bypasses the spectral transform, reduces the computational cost of the eigen decomposition of the graph Laplacian matrix and obtains spatially localized filters [1]. Due to its good performance and fast convolution computation, GCN is widely used to solve graph problems [9]. However, it has limitations when scaled to large graphs and is difficult to train in a minibatch setting [10].
To address the problem of scaling GCN to large graphs, layer sampling methods [11][12][13][14][15] and subgraph sampling methods [10,16,17] have been proposed; these are designed for efficient minibatch training of GCN [10]. Layer sampling methods run a GCN on the full training graph, with nodes or edges sampled from the whole graph to form minibatches in each layer, and forward and backward propagation performed on the sampled graph. Subgraph sampling methods perform subgraph sampling before training the GCN to reduce the large computational cost of node/edge sampling in each layer [10,16,17].
However, the filter size is based on the size of the full graph or the size of the sampled subgraph, which is not flexible enough for different sizes of input. If the size changes, the model trained using the previous filter will need to be reconstructed and retrained. Although some flexible spatial methods [18] have been proposed, their aggregators are neither learnable nor convolutional. The problem of designing a flexible filter, like spatial methods, with a learnable and convolutional property, like spectral methods, still remains to be solved.
To address this problem, we propose star topology convolution (STC). All graphs are composed of subgraphs with a star topology, as Figs. 1 and 2 show. A star topology has several advantages in its eigen decomposition, as its eigenvalues are all equal except for the first and last elements. This means that the Laplacian matrices of star topology graphs, even of different sizes, will have some common eigenvectors, although they may be rotated. These properties allow the design of a graph-scale free (without a fixed graph size constraint), learnable, and convolutional filter which can maintain weight sharing for subgraphs of different sizes and structures.
With such a flexible filter, the spectral convolution over a subgraph with a star topology can be defined. A star topology convolution (STC) can then be introduced for graph representation learning: it performs the spectral convolution on these subgraphs to obtain spectral node embeddings, which are aggregated to the central node of each subgraph. Hence, this method can create localized filters in both the spectral and spatial domains. The computational complexity and memory requirements are largely reduced compared to conventional spectral methods. To further improve the performance of STC, an edge attention mechanism is defined on the star topology subgraphs. As STC is inductive and has a low memory requirement, it can be applied to large or evolving graphs. STC can share weights across different graphs like a CNN, which makes it more flexible than most existing spectral methods.
The contributions of this paper can be summarized as follows:
1. This is the first attempt to use star topology subgraphs for convolution in graph representation learning.
2. It is a graph convolutional method with a graph-scale free, learnable, and convolutional filter.
3. It has a comparatively low computational complexity and high memory efficiency, based on the properties of a star topology, even though it uses a self-attention mechanism.
4. It is a graph convolutional method that is more similar to conventional CNNs than most existing spectral methods.

Related work
Following the success of CNNs in areas such as computer vision, natural language processing, and speech processing, researchers have sought to generalize CNNs to the graph domain.
The key to generalizing CNNs to graphs is to define a convolution operator on graphs [1]. Current graph convolutional neural networks fall into two broad categories: spectral methods and spatial methods. Spectral methods use the graph Fourier transform [19] to transfer signals from the spatial domain to the spectral domain and perform convolution in the spectral domain. Several spectral methods have been applied to node classification. The spectral convolutional neural network (Spectral CNN) [6] introduces the graph Fourier transform directly and uses spectral convolution for graph signals. However, the number of learnable parameters in its filter is large, potentially causing severe computational costs [4]. The Chebyshev network (ChebyNet) [7] restricts the kernel of Spectral CNN to a polynomial expansion. The graph convolutional network (GCN) [8] further simplifies ChebyNet to avoid the Fourier transform, reducing the computational cost of the eigen decomposition of the graph Laplacian matrix [1]. The graph wavelet neural network (GWNN) [1] introduces a graph wavelet transform to replace the Fourier basis of Spectral CNN, achieving sparse and localized filters while maintaining good computational performance.
All these methods are difficult to scale to large graphs and to train in a minibatch setting [10]. To scale spectral methods to large graphs, versions of GCN [11][12][13][14][15] have been designed for efficient minibatch training, but they are based on the whole training graph [10] and iteratively sample nodes or edges from the whole graph to form the minibatches in each layer [10]. This may cause "neighbor explosion" and lead to a large computational complexity. Some heuristic-based methods, which perform subgraph sampling as a preprocessing step [16,17], attempt to solve this but may introduce non-identical node sampling probabilities and bias. To cancel the bias, a graph sampling-based inductive learning method (GraphSAINT) [10] developed a normalization technique so that feature learning does not give larger weights to nodes which are sampled more frequently. These spectral methods still have a problem with generalization: they have a fixed graph size constraint (they are not graph-scale free). The filter size of spectral methods is determined by the size of the full graph or the sampled subgraph, and larger sizes result in large computational and memory costs. If the size changes, the model needs to be reconstructed and retrained.
Spatial methods define the convolution on graph geometry. Mixture model CNN (MoNet) uses a weighted average of functions defined over the neighborhood of a node as the spatial convolution operator, providing a general framework for the design of spatial methods [20]. Graphs with generative adversarial nets (GraphSGAN) [21] generalizes generative adversarial nets (GANs) to the graph domain by generating fake samples in low-density areas between subgraphs to improve the performance of semi-supervised learning on graphs. Some spatial methods focus on improving model capacity by introducing an attention mechanism to the graph domain, such as the graph attention network (GAT), which adopts a self-attention mechanism to learn the weighting function [4]. GAT has been extended in several directions: the dual-primal graph convolutional network (DPGCN) [22] generalizes GAT using convolutions on both nodes and edges, giving better performance; the temporal graph attention network (TempGAN) [23] learns node representations from continuous-time temporal graphs; and the hyperbolic graph attention network [24] learns robust node representations of graphs in hyperbolic spaces. The graph sample and aggregate method (GraphSage) [18], a node-based spatial method, learns node rather than graph embeddings, so it is graph-scale free and can be applied to large or evolving graphs. It performs uniform node sampling, with a predefined sampling size, over the neighbors in each layer iteratively; the sampling size gives an upper bound on the minibatch computational complexity. Unlike STC, proposed here, its aggregator is not learnable. A long short-term memory (LSTM) [25] version has been proposed, but it is not inherently symmetric (it depends on the ordering of the neighbors).

Preliminary
Given an undirected graph G = {V, E, A}, where V = (V_1, V_2, ..., V_n) is a node set (|V| = n), E is an edge set, and A is an adjacency matrix (A_{[i,j]} = A_{[j,i]}), its graph Laplacian matrix L can be defined as L = D − A, where D = diag(Σ_j A_{[i,j]}) is the degree matrix; L is a symmetric positive-semidefinite matrix. The eigen decomposition of L is

L = U Λ U^T,    (1)

where U = (u_1, u_2, ..., u_n) are the eigenvectors, which are orthonormal, and Λ = diag(λ_1, λ_2, ..., λ_n) is the diagonal matrix of the corresponding eigenvalues, which are real and non-negative and can be interpreted as the frequencies of the graph. These eigenvectors compose the basis of the feature space in the spectral domain.
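As a minimal numerical illustration of Eq. (1) (the small example graph and the numpy usage are ours, not from the paper), the sketch below builds L = D − A for an undirected graph and checks that L = U Λ U^T with an orthonormal U:

```python
import numpy as np

# Adjacency matrix of a small, hypothetical undirected graph with 4 nodes.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))        # degree matrix
L = D - A                         # graph Laplacian L = D - A

# Eigen decomposition L = U diag(lam) U^T (Eq. (1)); eigh is used since L is symmetric.
lam, U = np.linalg.eigh(L)

print("eigenvalues (graph frequencies):", np.round(lam, 4))
print("U orthonormal:", np.allclose(U.T @ U, np.eye(4)))
print("L reconstructed:", np.allclose(U @ np.diag(lam) @ U.T, L))
```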

Spectral convolution
For a signal F = (F_1, F_2, ..., F_n) on the nodes of graph G, its graph Fourier transform is defined as

F̂ = U^T F.    (2)

Given another signal g, the convolution of g and F can be defined as

g * F = U g_θ U^T F,    (3)

where g_θ is the graph Fourier transform of g. Equation (3) is the spectral convolution, which is similar to the convolution theorem defined in a Euclidean feature space, and g_θ can be regarded as the convolution filter which provides a set of kernel functions.
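A short numerical sketch of Eqs. (2) and (3) under illustrative assumptions (a hypothetical 4-node path graph and randomly chosen spectral coefficients θ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Laplacian and eigenbasis of a hypothetical 4-node path graph.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
lam, U = np.linalg.eigh(L)

F = rng.normal(size=(4, 3))       # signal on the nodes, 3 feature channels
theta = rng.normal(size=4)        # illustrative spectral filter coefficients
g_theta = np.diag(theta)          # filter in the spectral domain

F_hat = U.T @ F                   # graph Fourier transform, Eq. (2)
g_conv_F = U @ g_theta @ F_hat    # spectral convolution g * F = U g_theta U^T F, Eq. (3)
print(g_conv_F.shape)             # (4, 3)
```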

Inductive spectral convolution
To apply spectral convolution to large or evolving graphs, we need to introduce an inductive version which can learn the local and global structural properties of each node. Inductive methods focus on obtaining inductive node embeddings, rather than whole-graph embeddings, which provides good flexibility. Hence, the inductive spectral convolution can be defined as a node-based subgraph spectral convolution:

g * F_{[N(V_i), :]} = W F_{[N(V_i), :]},    (4)

where W is a flexible filter for subgraphs with different topology, and N(V_i) is the set of directly connected neighbors of V_i.
The key to this work is to find a universal W to make Eq. (4) hold.

Fig. 1 The structural similarity of graphs with a star topology. The red node is the central node

Properties of a star topology graph
W is related to the topology of the subgraphs. A good W should maintain the weight sharing property and, at the same time, provide different kernels for different structures [26]. We need to find a common structure, lying in different subgraphs, which should be identical, or at least symmetric, to help us design filters as in a conventional CNN. Star topology graphs with different sizes n are symmetric in their structure and Laplacian matrix, as Figs. 1 and 2 show, and all graphs can be regarded as a composition of subgraphs with a star topology (Fig. 2). The eigen decomposition of an n-dimensional star topology graph (n neighboring nodes and one central node, n ≥ 1) can use a universal formulation:

L = U Λ U^T,    (5)
Λ = diag(λ_1, 1, ..., 1, λ_{n+1}),    (6)

where all elements on the diagonal of Λ are 1, except the first and last elements; that is, all the corresponding eigenvectors share the same eigenvalue of 1. The eigenvalues λ and eigenvectors u of L satisfy

L u = λ u,    (7)

which can be rewritten as

(λ I − L) u = 0,    (8)

where I is an identity matrix. According to the theory of linear systems of equations, Eq. (8) having non-zero solutions is equivalent to

det(λ I − L) = 0.    (9)

Using Gaussian elimination on det(λ I − L) = 0, the solutions of Eq. (9) are

λ_1 = n + 1,  λ_2 = ... = λ_n = 1,  λ_{n+1} = 0.    (10)

The eigenvectors U = (u_1, u_2, ..., u_{n+1}) are orthonormal. Star topology graphs of different sizes have common parts in their eigenvectors, namely those whose corresponding eigenvalue is 1. For an m-dim star topology graph G_m ∈ R^{(m+1)×(m+1)} and an n-dim star topology graph G_n ∈ R^{(n+1)×(n+1)}, where m ≤ n, the eigenvectors corresponding to eigenvalues of 1 in G_m can be expressed by any m − 1 eigenvectors with corresponding eigenvalues of 1 in G_n using the vector rotation formula

U_{G_m}^{(λ=1)} = R U_{G_n}^{(λ=1)}[:, ind],    (11)

where R is a rotation matrix and ind is the mask of m − 1 randomly selected indices. Hence, these eigenvectors can be regarded as equivalent and are suitable to be the universal basis for different scales of subgraphs.
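The eigenvalue pattern described above is easy to check numerically. The sketch below (sizes chosen arbitrarily) verifies that a star with n neighboring nodes has Laplacian eigenvalues n + 1, 1 (with multiplicity n − 1), and 0:

```python
import numpy as np

def star_laplacian(n):
    """Laplacian of a star topology graph: one central node (index 0) and n neighbors."""
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = A[1:, 0] = 1.0
    return np.diag(A.sum(axis=1)) - A

for n in (3, 5, 8):
    lam = np.linalg.eigvalsh(star_laplacian(n))   # ascending order: 0, 1 (n-1 times), n+1
    print(n, np.round(lam, 6))
    assert np.isclose(lam[0], 0.0)
    assert np.allclose(lam[1:-1], 1.0)
    assert np.isclose(lam[-1], n + 1)
```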
Referring to Eq. (3), the first row of g_θ U^T corresponds to the first row of the new feature matrix g_θ U^T F and also to the first row of F. The first row of F corresponds to the central node, as in L. Since we want to obtain information on the neighbors of the central node, the first row of F can be padded to zero. The last eigenvalue is zero, which means that the last eigenvector is the least important. By manually adding an artificial node to the original subgraph, we get a new L ∈ R^{(n+2)×(n+2)}. Adding a zero row under the last row of F will not affect the non-zero rows, so any number of zero rows can be added to F if needed. Then, we can design the flexible filter for {1, 2, ..., k}-dimensional (k can be any non-zero positive integer) star topology subgraphs as

W = U Θ U^T,    (12)

where Θ is a diagonal matrix which has k learnable parameters. The filter can provide different kernels on different star topology subgraphs, and its weight sharing property is shown in Fig. 3.

Fig. 3 A toy example showing the weight sharing property of the filter on subgraphs with different star topology. The red node is the central node and the yellow nodes are neighboring nodes to be aggregated. The kernels provided by the filter differ by subgraph. It is a general filter which can deal with different subgraphs; that is, the filter is graph-scale free

Decomposition of a graph into star topology subgraphs will lose some link information, as Fig. 4 shows. In the first convolution layer, the edges (V_1, V_2) and (V_3, V_4) are lost. This can be compensated by using more layers to aggregate higher-order neighboring nodes: as Fig. 4 shows, introducing a second convolution layer recovers the lost link information ((V_1, V_2) and (V_3, V_4)). Hence, using more layers reduces the link information loss. However, as Fig. 4 also shows, some neighboring nodes will be sampled repeatedly as more layers are introduced, so there is a trade-off between link information loss and repeated sampling. How to select a proper layer depth that reduces the link information loss while keeping the effect of repeated sampling small is discussed in the "Effect of layer depth" section.
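To make the weight sharing idea concrete, the following sketch applies one set of learnable spectral weights θ to star subgraphs with different numbers of neighbors by zero-padding the signal. It simplifies the padding details above (no explicit artificial node), and the helper names and fixed basis size are our own assumptions, not the reference implementation:

```python
import numpy as np

def star_eigenbasis(num_nodes):
    """Orthonormal Laplacian eigenbasis of a star graph (node 0 is the central node)."""
    A = np.zeros((num_nodes, num_nodes))
    A[0, 1:] = A[1:, 0] = 1.0
    _, U = np.linalg.eigh(np.diag(A.sum(axis=1)) - A)
    return U

def shared_star_filter(theta, F_neigh):
    """Apply one shared spectral filter W = U diag(theta) U^T to a star subgraph
    with any number of neighbors m <= len(theta) - 1 (features zero-padded)."""
    size = len(theta)                  # fixed basis size, independent of the subgraph
    m, p = F_neigh.shape
    F = np.zeros((size, p))            # row 0 (central node) is padded to zero
    F[1:m + 1] = F_neigh               # unused rows stay zero
    U = star_eigenbasis(size)
    W = U @ np.diag(theta) @ U.T       # graph-scale free filter
    return W @ F

# The same theta is reused for subgraphs with 2 and 5 neighbors (weight sharing).
rng = np.random.default_rng(0)
theta = rng.normal(size=8)
print(shared_star_filter(theta, rng.normal(size=(2, 4))).shape)   # (8, 4)
print(shared_star_filter(theta, rng.normal(size=(5, 4))).shape)   # (8, 4)
```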

Star topology convolution
The spectral convolution defined on a star topology can be obtained using the properties of the topology. The update function for the (l+1)th layer's neighboring information is defined as

h^{l+1}_{[N(V_i), z]} = Σ_{x=1}^{p} W^{l+1}_{x,z} y^l_{[N(V_i), x]},  z = 1, ..., q,    (13)

where W^{l+1}_{x,z} denotes a spectral filter for the dimension pair (x, z), y^l_{[N(V_i), x]} is the output of the lth layer for the neighboring node set N(V_i), x and z are the dimension indices, p and q are the dimensions of the embedding, and N(V_i) is the set of directly connected neighbors of the central node V_i. Computing a separate filter for every dimension pair is expensive; to reduce the computational complexity of h^{l+1}, we first perform the spectral convolution with a single shared filter,

h̃^{l+1}_{[N(V_i), :]} = W^{l+1} y^l_{[N(V_i), :]}.    (14)

Information on the neighbors obtained from the spectral convolution is then aggregated to the central node, with the feature transformation applied in Eq. (16):

h^{l+1}_{[V_i, :]} = mean( h̃^{l+1}_{[N(V_i), :]} ),    (15)

where h^{l+1}_{[V_i, :]} is the convolution result of the neighboring information of the central node V_i. Then h^{l+1}_{[V_i, :]} is concatenated with the last layer's output feature vector y^l_{[V_i, :]} of the central node V_i. After a scaling transformation, the new feature vector is activated using a nonlinear function to give the current layer output of node V_i:

y^{l+1}_{[V_i, :]} = σ( cat( y^l_{[V_i, :]}, h^{l+1}_{[V_i, :]} ) S^{l+1} ),    (16)

where y^l_{[V_i, :]} is the feature vector of the central node V_i obtained in the lth layer, cat() denotes a concatenation function, σ denotes a nonlinear activation function, and S^{l+1} ∈ R^{2p×q} is a scaling parameter matrix for feature transformation. Figure 5 shows the comparison of a two-layer conventional CNN and a two-layer STC. The two-layer STC is formulated as

y^1_{[V_i, :]} = σ( cat( F_{[V_i, :]}, h^1_{[V_i, :]} ) S^1 ),
y^2_{[V_i, :]} = σ( cat( y^1_{[V_i, :]}, h^2_{[V_i, :]} ) S^2 ),    (17)

where S^1 ∈ R^{2p×q} and S^2 ∈ R^{2q×d} are two scaling parameter matrices, and d is the embedding dimension. STC divides the convolution process of a conventional CNN into two cascading steps: (1) spatial search; (2) spectral convolution. Spatial search is used to find spatially localized areas, so our method can directly create spatially localized filters. As with a conventional CNN, STC obtains global information through the introduction of more layers. After the spatial search, a spectral convolution on the subgraphs is performed. In a conventional CNN, the shape of the different areas selected by a window function/kernel is usually the same: with a 3×3 kernel, for example, the selected areas are all squares of the same size (9 pixels). In a graph, star topology subgraphs have symmetry in their spatial structures, which allows STC to have a convolution process that can share weights across different graphs, similar to a conventional CNN. If a new filter is used to replace W, a new inductive spectral convolutional method can be designed. Hence, STC provides a framework for designing inductive spectral convolutional methods.
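A compact PyTorch-style sketch of one STC layer following Eqs. (12) and (14)-(16): spectral convolution on each sampled star subgraph, mean aggregation to the central node, concatenation with the central node's previous features, scaling by S, and a nonlinearity. The module layout, the mean aggregation, and the tensor shapes are our reading of the text, not the authors' released code:

```python
import torch
import torch.nn as nn

class STCLayer(nn.Module):
    """One star topology convolution layer (simplified sketch of Eqs. (12), (14)-(16))."""
    def __init__(self, in_dim, out_dim, filter_size, U):
        super().__init__()
        self.register_buffer("U", U)                          # fixed star eigenbasis (k x k)
        self.theta = nn.Parameter(torch.ones(filter_size))    # k learnable spectral weights
        self.S = nn.Linear(2 * in_dim, out_dim, bias=False)   # scaling matrix S^{l+1}

    def forward(self, y_center, y_neigh):
        # y_center: (B, p) central-node features; y_neigh: (B, k, p) zero-padded neighbors.
        W = self.U @ torch.diag(self.theta) @ self.U.T        # Eq. (12): W = U Theta U^T
        h_neigh = torch.einsum("ij,bjp->bip", W, y_neigh)     # Eq. (14): spectral convolution
        h_center = h_neigh.mean(dim=1)                        # Eq. (15): aggregate to center
        return torch.relu(self.S(torch.cat([y_center, h_center], dim=-1)))   # Eq. (16)

# Hypothetical usage: eigenbasis of an 8-node star (node 0 = center), p = 16, q = 32.
A = torch.zeros(8, 8); A[0, 1:] = 1.0; A[1:, 0] = 1.0
_, U = torch.linalg.eigh(torch.diag(A.sum(1)) - A)
layer = STCLayer(in_dim=16, out_dim=32, filter_size=8, U=U)
print(layer(torch.randn(4, 16), torch.randn(4, 8, 16)).shape)    # torch.Size([4, 32])
```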

Node sampling
Since the filter size k is unknown a priori, it must be specified in an implementation. This can be done simply by setting it to K, where K is the maximum number of connections of a node in the graph. The node sampling strategy of GraphSage [18], which acts like a dropout function [27], is used to improve the robustness of STC. For a node V_i with |N(V_i)| directly connected neighbors, when |N(V_i)| > k, k neighbors are randomly selected in each training iteration:

N_s(V_i) = N(V_i)[ind],    (18)

where ind is the mask, with its size determined by the size of the filter, k. For nodes with |N(V_i)| ≤ k directly connected neighbors, all neighbors are selected; these nodes can be considered to not have enough information, so it is necessary to take advantage of all of it. Through controlling the filter size, STC can achieve good robustness. The effect of the filter size and how to select it are discussed in the ablation study in the "Effect of filter size" section.
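A simple sketch of the sampling rule in Eq. (18) (the helper function is hypothetical): when a node has more than k neighbors, k are drawn at random in each iteration; otherwise all neighbors are kept:

```python
import numpy as np

def sample_neighbors(neighbors, k, rng):
    """Sample at most k directly connected neighbors for one node (cf. Eq. (18))."""
    if len(neighbors) > k:
        ind = rng.choice(len(neighbors), size=k, replace=False)   # random mask of size k
        return [neighbors[i] for i in ind]
    return list(neighbors)    # nodes with few neighbors keep all their information

rng = np.random.default_rng(0)
print(sample_neighbors([3, 7, 9, 12, 15], k=3, rng=rng))   # 3 of 5 neighbors kept
print(sample_neighbors([3, 7], k=3, rng=rng))              # both neighbors kept
```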

Edge attention
We introduce a self-attention mechanism (edge attention) [4] to STC to further improve its performance and robustness. As mentioned in the "Star topology convolution" section, STC provides a framework for designing different inductive spectral convolution methods by designing different W. Here we use edge attention to obtain the weights θ in W (W = U Θ U^T), as Fig. 6 shows. For a node V_i, the weights θ of the (l+1)th layer in the filter for each of its neighboring nodes V_j ∈ N(V_i) are calculated as

e^{l+1}_{[V_i, V_j]} = LeakyReLU( cat( y^l_{[V_i, :]} C^l, y^l_{[V_j, :]} C^l ) a^{l+1} ),    (19)

θ^{l+1}_{[V_i, V_j]} = softmax_{V_j ∈ N(V_i)}( e^{l+1}_{[V_i, V_j]} ),    (20)

where C^l ∈ R^{p×q} is a scaling parameter matrix and a^{l+1} ∈ R^{2q×1} is a scaling parameter vector.
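A sketch of how the attention weights θ of Eqs. (19) and (20) could be computed per star subgraph; the exact parameterization (bias-free linear maps for C^l and a^{l+1}, LeakyReLU, softmax over the sampled neighbors) follows the GAT-style form and is our reading of the definitions above:

```python
import torch
import torch.nn as nn

class EdgeAttention(nn.Module):
    """Attention weights theta for the STC filter (sketch of Eqs. (19)-(20))."""
    def __init__(self, in_dim, attn_dim):
        super().__init__()
        self.C = nn.Linear(in_dim, attn_dim, bias=False)   # scaling parameter matrix C^l
        self.a = nn.Linear(2 * attn_dim, 1, bias=False)    # scaling parameter vector a^{l+1}

    def forward(self, y_center, y_neigh):
        # y_center: (B, p) central nodes; y_neigh: (B, k, p) their sampled neighbors.
        c = self.C(y_center).unsqueeze(1).expand(-1, y_neigh.size(1), -1)   # (B, k, q)
        n = self.C(y_neigh)                                                 # (B, k, q)
        e = torch.nn.functional.leaky_relu(
            self.a(torch.cat([c, n], dim=-1)).squeeze(-1))                  # Eq. (19)
        return torch.softmax(e, dim=-1)    # Eq. (20): theta, normalized over the neighbors

attn = EdgeAttention(in_dim=16, attn_dim=32)
print(attn(torch.randn(4, 16), torch.randn(4, 8, 16)).shape)   # torch.Size([4, 8])
```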

Comparison with the state-of-the-art methods
In vanilla GCN, the (l+1)th layer can be defined as

Y^{l+1} = σ( A Y^l S^{l+1} ),    (21)

where Y^{l+1} is the hidden unit output of the (l+1)th layer and A is the adjacency matrix, which introduces the edge information into the learning process; the scaling parameter matrix S^{l+1} acts as the convolution filter in the (l+1)th layer. Compared to GCN, STC uses a spectral filter W^{l+1} in place of A. Unlike A, W^{l+1} implicitly includes the edge information, and STC has an edge attention mechanism. For the commonly used GraphSage (mean aggregator), the (l+1)th aggregator can be defined as

h^{l+1}_{[V_i, :]} = mean( { y^l_{[V_j, :]} : V_j ∈ N(V_i) } ).    (22)

The embedding of the central node V_i in the (l+1)th layer of GraphSage can then be obtained using Eq. (16). GraphSage sets all weights to 1 in place of the spectral convolution filter in STC, which loses some information in the learning process: GraphSage uses a simple mean to aggregate neighboring information to the central node V_i, while STC uses a learnable weighted average function. GAT performs edge attention on the whole graph by

Y^{l+1} = σ( α^{l+1} Y^l S^{l+1} ),    (23)

where α^{l+1} is the attention weight matrix. STC performs edge attention on the star topology subgraphs, making STC more suitable than GAT for large graphs.

Star topology convolution aggregator for neighbor averaging over relation subgraphs
For large heterogeneous graph representation learning, neighbor averaging over relation subgraphs (NARS) [28] and its variants have achieved leading performance. Vanilla NARS uses relational graph embeddings [29] as the features for featureless node types and then combines different relation types randomly to construct k (k can be seen as the filter size in STC) relation subgraphs. In STC-NARS, the star topology convolution is used as the aggregator over these relation subgraphs: the neighbor-averaged features of the k relation subgraphs are treated as the neighboring information of node V_i, filtered, and combined as

h^{l+1}_{[V_i, :]} = mean( W^{l+1} y^l_{[Rel_{[V_i, :]}, :]} ),    (24)

y^{l+1}_{[V_i, :]} = σ( cat( y^l_{[V_i, :]}, h^{l+1}_{[V_i, :]} ) S^{l+1} ),    (25)

where y^l_{[Rel_{[V_i, :]}, :]} denotes the lth hop neighbor averaging features of all relation subgraphs in Rel_{[V_i, :]}.

Variants
We propose three variants of STC, named STC, STC (conv only), and STC-NARS. Their differences are summarized as follows: STC is based on Eq. (16) and uses Eq. (20) to replace the Θ of Eq. (12); STC (conv only), or vanilla STC, is based on Eq. (16) and performs the convolution only; STC-NARS is based on Eqs. (24) and (25) and performs graph representation learning on large heterogeneous graphs.

Complexity analysis
The computational complexity of STC can be considered in two parts: (1) the convolution process and (2) the edge attention process. The convolution process of STC depends on the filter size, the size of the scaling parameter matrix, and the size of the learnable weight b, so its computational complexity in the (l+1)th layer is 2p*q + 2k, where 2p*q is the size of the scaling parameter matrix and k is the filter size; k is much smaller than p*q in most cases. Compared to other spectral methods, such as Spectral CNN [6] (with a complexity of p*q*|V|, where |V| is the total number of nodes in the full graph) and GWNN [1] (with a complexity of p*q + |V|), the complexity of STC is much smaller, allowing it to be used for large or evolving graphs. STC performs edge attention on star topology subgraphs. If the filter size is k, then the maximum number of sampled directly connected neighbors of a star topology subgraph is k, so its computational complexity is q*(p + 2) + k. The computational complexity of GAT is q*(p + 2) + (|E| + |V|), where |E| is the total number of unique edges in the full graph; k is much smaller than |E| + |V|.
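For concreteness, a back-of-the-envelope comparison of the per-layer costs listed above, with purely hypothetical sizes (p = 128, q = 64, k = 18, and a graph with 100,000 nodes and 1,000,000 edges):

```python
# Back-of-the-envelope per-layer cost comparison with hypothetical sizes.
p, q, k = 128, 64, 18                    # embedding dims and STC filter size
num_nodes, num_edges = 100_000, 1_000_000

stc_conv      = 2 * p * q + 2 * k                        # STC convolution
spectral_cnn  = p * q * num_nodes                        # Spectral CNN: grows with |V|
gwnn          = p * q + num_nodes                        # GWNN
stc_attention = q * (p + 2) + k                          # STC edge attention (per subgraph)
gat_attention = q * (p + 2) + (num_edges + num_nodes)    # GAT attention over the full graph

print(stc_conv, spectral_cnn, gwnn)      # 16420 vs 819200000 vs 108192
print(stc_attention, gat_attention)      # 8338 vs 1108320
```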

Experiment settings
Benchmarks: Six homogeneous graph datasets: Cora [31], Citeseer [31], Pubmed [31], PPI [18,32], Arxiv [33], and Essential Proteins (EP) were selected to validate the ability of STC to predict the properties of nodes on homogeneous graphs. The statistics of these datasets can be seen in Table 1.
The optimizer for STC was Adam [65], and the learning rate was set to 0.001 for all datasets except EP (0.0001). STC adopted a fixed filter size of 18 for all datasets except Arxiv and Citeseer (which were used for the ablation study of filter size) and PPI (48), and a fixed dropout rate of 0.5 for all datasets except Citeseer (0.6), Cora (0.6), PPI (0.1), MAG (0.2), and EP (0). The settings of the two variants are the same as those of STC. We also conducted three ablation studies to clarify the effect of layer depth, the effect of the scaling parameter matrix in the attention process, and the effect of filter size. In addition, given the similarity between STC and CNNs, a study applying transfer learning [66,67] to STC was conducted to investigate potential applications of CNN techniques in STC. For node classification, a single fully connected layer was used to classify the results from two convolution layers of STC or STC (conv only).
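For reference, the reported hyperparameters collected into a single configuration sketch (the dictionary structure and key names are ours, not from a released codebase):

```python
# The reported training settings, gathered into one illustrative configuration.
stc_config = {
    "optimizer": "Adam",
    "learning_rate": {"default": 1e-3, "EP": 1e-4},
    "filter_size": {"default": 18, "PPI": 48},   # Arxiv and Citeseer are swept in the ablation
    "dropout": {"default": 0.5, "Citeseer": 0.6, "Cora": 0.6,
                "PPI": 0.1, "MAG": 0.2, "EP": 0.0},
    "num_stc_layers": 2,
    "classifier": "single fully connected layer",
}
```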

Performance analysis
Small datasets: Tables 3, 4, and 5 show the test accuracy (micro-F1) obtained by STC and the state-of-the-art graph convolutional methods on small citation datasets. STC achieved the highest test accuracy (micro-F1) on Citeseer and also outperformed the other methods on Pubmed. GWNN achieved the highest test accuracy (micro-F1) on Cora, with the performance of STC only slightly inferior. Overall, STC achieved state-of-the-art performance on small datasets. Tables 3, 4, and 5 also show the number of trainable parameters for each method on these small citation datasets. The full-batch methods GAT (sparse), GCN, and GWNN have fewer parameters than the minibatch methods STC, STC (conv only), GraphSage (mean aggr), MoNet, GraphSAINT (mean aggr), and GraphSAINT (concat aggr), but full-batch methods are difficult to scale to large graphs. Moreover, STC has fewer parameters and better performance than the other minibatch methods. STC also outperformed STC (conv only) significantly on these small citation datasets while using fewer parameters, which shows that the techniques introduced to improve the filter of vanilla STC are effective.
Essential protein identification: Table 6 shows the performance of STC and the state-of-the-art methods on an essential protein identification task (best values in bold, second best in italic). STC and STC (conv only) outperformed the other graph convolutional methods and essential protein identification methods. Graph convolutional methods outperformed traditional essential protein identification methods, even DNN, showing that graph convolutional methods have an advantage in solving traditional graph problems. On this dataset, STC outperformed the other methods significantly; it has more parameters than the other methods, but a comparable number. Compared to STC (conv only), STC has more parameters on this dataset, but it outperformed STC (conv only) significantly.

Medium dataset: Table 7 shows the test accuracy (micro-F1) obtained by STC and the state-of-the-art graph convolutional methods on a medium-sized dataset, where STC has more parameters than the other methods.

Large datasets: Table 8 compares STC and STC (conv only) with some leading methods on Arxiv. STC outperformed the other methods, and STC (conv only) was close to STC. Compared to some methods like GraphZoom (Node2vec) and Node2vec, STC and STC (conv only) have fewer parameters. STC has more parameters than GCN at the same level, but a comparable number. All in all, STC achieved state-of-the-art performance on large homogeneous datasets.

Small heterogeneous datasets:
To further show the superiority of STC, we applied STC to three small heterogeneous graphs: ACM, IMDB, and DBLP. Table 9 shows the performance comparison of STC and leading methods on ACM, IMDB, and DBLP. STC outperformed GCN and GAT significantly. Even compared with some leading heterogeneous methods on ACM, IMDB, and DBLP, STC still achieved the best overall performance, which shows that STC reduces the performance gap between heterogeneous and homogeneous methods on small heterogeneous graphs.

Large heterogeneous datasets:
To verify whether STC can reduce the performance gap between homogeneous and heterogeneous methods on large heterogeneous graphs, we applied STC to MAG. Table 10 compares STC and some leading methods on MAG. STC outperformed the other homogeneous graph methods significantly and also significantly outperformed some basic heterogeneous methods such as SIGN, R-GCN, and MetaPath2vec. These methods are the backbone of some leading methods on large heterogeneous graphs, such as NARS (based on SIGN), GraphSAINT+MetaPath2vec, GraphSAINT (R-GCN aggr), R-GCN+FLAG, and NeighborSampling (R-GCN aggr). To achieve state-of-the-art performance on large heterogeneous graphs, we also proposed an improved version of NARS, called STC-NARS, which uses STC as the aggregator. STC-NARS outperformed vanilla NARS while using only half the parameters, and its parameter count is the smallest among all heterogeneous methods in Table 10. With the STC structure, the improved NARS framework can be used as a new backbone to design more powerful methods for large heterogeneous graphs. In summary, STC achieved state-of-the-art performance on small, medium, and large homogeneous graphs, as well as on heterogeneous graphs, demonstrating its robustness. Table 11 compares the ability of STC and STC (conv only) with the state-of-the-art methods to run on large datasets (Arxiv and MAG) using PC2 (one NVIDIA GeForce GTX 1660 Super GPU (6 GB)). Compared with other minibatch methods: GraphSage (mean aggr), GraphSAINT (mean aggr), and GraphSAINT (concat aggr), our methods and GraphSage

Effect of layer depth
To validate the effect of the layer depth of STC on its performance, we trained STC with 1-4 layers on six homogeneous graphs and compared their test accuracy (micro-F1). As Fig. 7 shows, as the number of layers increases, the performance of STC first improves and then declines, while the number of parameters increases steadily. This is because introducing more layers reduces the link information loss but, at the same time, introduces the repeated sampling problem. Hence, there is a trade-off, and for most problems a layer depth of 2 achieves the best performance, as Fig. 7 and Table 12 show. That is why we selected a layer depth of 2 for all experiments.

Effect of the scaling parameter matrix in the attention process
We found that on some small datasets, STC achieves better performance with an untrainable constant scaling parameter matrix in the attention process. To validate this, we trained STC with an untrainable constant scaling parameter matrix and with a trainable scaling parameter matrix, respectively, on six homogeneous graphs and compared their test accuracy (micro-F1). As Table 13 shows, on some small datasets, such as Citeseer, Cora, and Pubmed, an untrainable constant scaling parameter matrix improves the performance significantly and at the same time reduces the number of parameters. However, for larger graphs such as PPI and Arxiv, an untrainable constant scaling parameter matrix causes performance degradation.

Effect of filter size
The effect of different filter sizes in each STC layer was examined on Citeseer and Arxiv. Validation and test accuracy (Fig. 8) improved with increasing filter size up to a peak value and then steadily declined. A small filter size may give insufficient information to train a good model, while a larger filter size may introduce redundant information, leading to overfitting and a degradation in performance. Figure 8 also shows that the average batch time of STC on Citeseer and Arxiv was linear with respect to the filter size. The filter size in STC is unrelated to the number of parameters, since the parameter count is determined by the scaling parameter vector a and the scaling parameter matrix C, whose dimensions depend only on the input feature dimension. The filter size can therefore be tuned to achieve the best performance.

Pretrained model
STC is more similar to conventional CNNs than most existing spectral methods because its filter can share weights across different graphs, as Fig. 3 shows, which makes it more flexible than spectral methods such as Spectral CNN and GWNN. Their filters are defined on the graph Fourier transform or graph wavelet transform of the full graph, which makes it difficult to reuse the weights on different graphs. Hence, a potential application of STC is to use transfer learning [66,67] to employ a pretrained STC as an embedding for feature extraction in downstream tasks. Since the filter of STC is related to the feature dimension (through the attention parameters), while the filter of STC (conv only) is related only to the filter size and thus has fewer constraints, we selected STC (conv only) for the transfer learning study. We first trained an STC (conv only) model on the large homogeneous citation graph Arxiv. We then extracted the two STC layers of the pretrained model and froze their parameters to serve as an embedding for downstream classification tasks on three other homogeneous citation datasets: Citeseer, Cora, and Pubmed. Table 14 shows the performance comparison of STC (conv only) and pretrained STC (conv only) on Citeseer, Cora, and Pubmed (the pretrained STC (conv only) is pretrained on Arxiv; best values in bold). By introducing the pretrained STC layers as the embedding, the classification performance was improved significantly.
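A sketch of the transfer-learning recipe described above: the two pretrained STC (conv only) layers are frozen and only a new fully connected classifier head is trained on the downstream dataset. Class and argument names are illustrative, and the way neighbor features are fed to the second layer is a simplification:

```python
import torch
import torch.nn as nn

class PretrainedSTCClassifier(nn.Module):
    """Frozen pretrained STC (conv only) layers used as an embedding, plus a new head."""
    def __init__(self, pretrained_layer1, pretrained_layer2, embed_dim, num_classes):
        super().__init__()
        self.layer1, self.layer2 = pretrained_layer1, pretrained_layer2
        for param in list(self.layer1.parameters()) + list(self.layer2.parameters()):
            param.requires_grad = False                 # keep the pretrained filters fixed
        self.head = nn.Linear(embed_dim, num_classes)   # only this classifier is trained

    def forward(self, y_center, y_neigh1, y_neigh2):
        h1 = self.layer1(y_center, y_neigh1)     # frozen STC layer 1
        h2 = self.layer2(h1, y_neigh2)           # frozen STC layer 2 (neighbor features at
        return self.head(h2)                     # the layer-1 level, assumed precomputed)
```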

Conclusions
In this paper, we presented a novel graph convolutional method, called star topology convolution (STC), which is a graph-scale free inductive spectral convolutional method. It learns node embeddings on subgraphs with a star topology. In experiments, our method outperformed the state-of-the-art graph convolutional methods on both homogeneous and heterogeneous graph benchmarks, and showed better robustness and generalizability. STC also outperformed the state-of-the-art methods in identifying essential proteins. STC can share weights across different graphs, which makes it possible to pretrain it as an embedding for feature extraction to improve the performance of downstream tasks.