GIST: Distributed Training for Large-Scale Graph Convolutional Networks

The graph convolutional network (GCN) is a go-to solution for machine learning on graphs, but its training is notoriously difficult to scale both in terms of graph size and the number of model parameters. Although some work has explored training on large-scale graphs (e.g., GraphSAGE, ClusterGCN, etc.), we pioneer efficient training of large-scale GCN models (i.e., ultra-wide, overparameterized models) with the proposal of a novel, distributed training framework. Our proposed training methodology, called GIST, disjointly partitions the parameters of a GCN model into several smaller sub-GCNs that are trained independently and in parallel. In addition to being compatible with all GCN architectures and existing sampling techniques for efficient GCN training, GIST i) improves model performance, ii) scales to training on arbitrarily large graphs, iii) decreases wall-clock training time, and iv) enables the training of markedly overparameterized GCN models. Remarkably, with GIST, we train an astonishingly wide 32,768-dimensional GraphSAGE model, which exceeds the capacity of a single GPU by a factor of 8×, to SOTA performance on the Amazon2M dataset.


Introduction
Since not all data can be represented in Euclidean space (Bronstein et al., 2017), many applications rely on graph-structured data. For example, social networks can be modeled as graphs by regarding each user as a node and friendship relations as edges (Lusher et al., 2013; Newman et al., 2002). Alternatively, in chemistry, molecules can be modeled as graphs, with nodes representing atoms and edges encoding chemical bonds (Balaban, 1985; Benkö et al., 2003).
To better understand graph-structured data, several (deep) learning techniques have been extended to the graph domain (Defferrard et al., 2016; Gori et al., 2005; Masci et al., 2015). Currently, the most popular one is the graph convolutional network (GCN) (Kipf & Welling, 2016), a multi-layer architecture that implements a generalization of the convolution operation to graphs. Although the GCN handles node- and graph-level classification, it is notoriously inefficient and unable to handle large-scale graphs (Chen et al., 2018b;a; Gao et al., 2018; Huang et al., 2018; You et al., 2020; Zeng et al., 2019).
To deal with these issues, node partitioning methodologies have been developed. These schemes can be roughly categorized into neighborhood sampling (Chen et al., 2018a; Hamilton et al., 2017; Zou et al., 2019) and graph partitioning (Chiang et al., 2019; Zeng et al., 2019) approaches. The goal is to partition a large graph into multiple smaller graphs that can be used as mini-batches for training the GCN. In this way, GCNs can handle larger graphs during training, expanding their potential into the realm of big data.
Although some papers perform large-scale experiments (Chiang et al., 2019; Zeng et al., 2019), the models (and data) used in GCN research remain small in the context of deep learning (Kipf & Welling, 2016; Veličković et al., 2017), where the current trend is towards incredibly large models and datasets (Brown et al., 2020; Conneau et al., 2019). Despite the widespread moral questioning of this trend (Hao, 2019; Peng & Sarazen, 2019; Sharir et al., 2020), the deep learning community continues to push the limits of scale, as overparameterized models are known to discover generalizable solutions (Nakkiran et al., 2019). Although deep GCN models suffer from oversmoothing (Kipf & Welling, 2016; Li et al., 2018), overparameterized GCN models can still be explored through larger hidden layers. As such, this work aims to provide a training framework that enables GCN experiments with wider models and larger datasets.
This paper. We propose a novel, distributed training methodology that can be used for any GCN architecture and is compatible with existing node sampling techniques. This methodology randomly partitions the hidden feature space in each layer, decomposing the global GCN model into multiple, narrow sub-GCNs of equal depth. Sub-GCNs are trained independently for several iterations in parallel prior to having their updates synchronized; see Figure 1. This process of randomly partitioning, independently training, and synchronizing sub-GCNs is repeated until convergence. We call this method graph independent subnetwork training (GIST). GIST can easily scale to arbitrarily large graphs and significantly reduces the wall-clock time of training large-scale GCNs, allowing larger models and datasets to be explored. We focus specifically on enabling the training of "ultra-wide" GCNs (i.e., GCN models with very large hidden layers), as deeper GCNs are prone to oversmoothing (Li et al., 2018). The contributions of this work are summarized below:

• We develop a novel, distributed training methodology for arbitrary GCN architectures, based on decomposing the model into independently-trained sub-GCNs. This methodology is compatible with existing techniques for neighborhood sampling and graph partitioning.
• We show that GIST can be used to train several GCN architectures to state-of-the-art performance with reduced training time in comparison to standard methodologies.
• We propose a novel Graph Independent Subnetwork Training Kernel (GIST-K) that allows a convergence rate to be derived for two-layer GCNs trained with GIST in the infinite-width regime. Based on GIST-K, we provide theory showing that GIST converges linearly, up to an error neighborhood, using distributed gradient descent with local iterations. We show that the radius of the error neighborhood is controlled by the overparameterization parameter, as well as the number of workers in the distributed setting. Such findings reflect practical observations that are made in the experimental section.
• We use GIST to enable the training of markedly overparameterized GCN models. In particular, GIST is used to train a two-layer GraphSAGE model with a hidden dimension of 32,768 on the Amazon2M dataset. Such a model exceeds the capacity of a single GPU by 8×.

What is the GIST of this work?
GCN Architecture. The GCN (Kipf & Welling, 2016) is arguably the most widely-used neural network architecture on graphs. Consider a graph G comprised of n nodes with d-dimensional features X ∈ R^{n×d}. The output Y ∈ R^{n×d_L} of a GCN can be expressed as Y = Ψ_G(X; Θ), where Ψ_G is an L-layered architecture with trainable parameters Θ. If we define H_0 = X, we then have that Y = Ψ_G(X; Θ) = H_L, where the intermediate ℓ-th layer of the GCN is given by

H_ℓ = σ(Ā H_{ℓ−1} Θ_ℓ),   (1)

where σ is an elementwise activation function (e.g., ReLU), Ā is the degree-normalized adjacency matrix of G with added self-loops, and Θ_ℓ are the trainable parameters of the ℓ-th layer. In Figure 2 (top), we illustrate nested GCN layers for L = 3, but our methodology extends to arbitrary L. The activation function of the last layer is typically the identity or softmax transformation; we omit this in Figure 2 for simplicity.
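As a concrete illustration of (1), the layer update can be sketched in a few lines of NumPy. This is a toy sketch with helper names of our own choosing, not the paper's implementation; it uses the self-loop normalization Ā = D̃^{−1/2}(A + I)D̃^{−1/2} described above.

```python
import numpy as np

def normalize_adjacency(A):
    """Build the degree-normalized adjacency matrix with self-loops,
    A_hat = D^{-1/2} (A + I) D^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(H, A_hat, Theta):
    """One GCN layer as in Eq. (1): H_out = ReLU(A_hat @ H @ Theta)."""
    return np.maximum(A_hat @ H @ Theta, 0.0)
```

Stacking L such layers, with the final activation replaced by the identity or a softmax, yields the full map Ψ_G.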
GIST overview. We overview GIST in Algorithm 1 and present a schematic depiction in Figure 1. We partition our (randomly initialized) global GCN into m smaller, disjoint sub-GCNs with the subGCNs function (m = 2 in Figures 1 and 2) by sampling the feature space at each layer of the GCN; see Section 2.1. Each sub-GCN is assigned to a different worker (i.e., a different GPU) for ζ rounds of distributed, independent training through subTrain. Then, newly-learned sub-GCN parameters are aggregated (subAgg) into the global GCN model. This process repeats for T iterations. Our graph domain is partitioned into c sub-graphs through the Cluster function (c = 2 in Figure 1). This operation is only relevant for large graphs (n > 50,000), and we omit it (c = 1) for smaller graphs that don't require partitioning.
Sub-GCN partitioning is illustrated in Figure 2-(a), where m = 2: each sub-GCN i keeps a disjoint subset of the hidden features at every layer, so that its ℓ-th layer is given by H^(i)_ℓ = σ(Ā H^(i)_{ℓ−1} Θ^(i)_ℓ), where Θ^(i)_ℓ contains the rows and columns of Θ_ℓ that fall within sub-GCN i's feature partition. Partitioning the input features is optional (i.e., (a) vs. (b) in Figure 2). We do not partition the input features within GIST so that sub-GCNs have identical input information (i.e., X^(i) = X for all i); see Section 5.1. Similarly, we do not partition the output feature space to ensure that the sub-GCN output dimension coincides with that of the global model, thus avoiding any need to modify the loss function. This decomposition procedure (subGCNs in Algorithm 1) extends to arbitrary L.

¹ Though any clustering method can be used, we advocate the use of METIS (Karypis & Kumar, 1998a;b) due to its proven efficiency in large-scale graphs.
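The subGCNs decomposition described above can be sketched as follows. This is our own simplified illustration (hypothetical helper names, NumPy in place of a deep learning framework): hidden feature indices are randomly split into m disjoint groups at each layer boundary, while the input dimension d_0 and output dimension d_L are kept whole.

```python
import numpy as np

def partition_indices(width, m, rng):
    """Randomly split the hidden-feature indices of one layer
    boundary into m disjoint groups (one group per sub-GCN)."""
    return np.array_split(rng.permutation(width), m)

def sub_gcn_weights(Thetas, m, rng):
    """Decompose global GCN weights {Theta_l} into m disjoint sub-GCNs.

    Only hidden dimensions are partitioned; the input (d_0) and
    output (d_L) dimensions are kept whole, as in GIST."""
    L = len(Thetas)
    # one index partition per hidden layer boundary (there are L - 1)
    groups = [partition_indices(Thetas[l].shape[1], m, rng) for l in range(L - 1)]
    subs = []
    for i in range(m):
        sub = []
        for l in range(L):
            rows = groups[l - 1][i] if l > 0 else np.arange(Thetas[0].shape[0])
            cols = groups[l][i] if l < L - 1 else np.arange(Thetas[-1].shape[1])
            sub.append(Thetas[l][np.ix_(rows, cols)])
        subs.append(sub)
    return subs, groups
```

Because the index groups are disjoint and cover every hidden feature, each global parameter belongs to exactly one sub-GCN (or to none, for cross-partition blocks) in a given cycle.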

subTrain: Independently Training Sub-GCNs
Assume c = 1, so that the Cluster operation in Algorithm 1 is moot. Because each sub-GCN's output and Y share the same dimension, sub-GCNs can be trained to minimize the same global loss function. One application of subTrain in Algorithm 1 corresponds to a single step of stochastic gradient descent (SGD). Inspired by local SGD (Lin et al., 2018), multiple, independent applications of subTrain are performed in parallel (i.e., on separate GPUs) for each sub-GCN prior to aggregating weight updates. The number of independent training iterations between synchronization rounds, referred to as local iterations, is denoted by ζ, and the total amount of training is split across sub-GCNs. Ideally, the number of sub-GCNs and local iterations should be increased as much as possible to minimize communication and training costs. In practice, however, such benefits may come at the cost of statistical inefficiency; see Section 5.1.
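A minimal sketch of the local-iteration scheme, on a generic differentiable objective rather than an actual GCN loss (the function name is ours):

```python
import numpy as np

def sub_train(theta, grad_fn, lr=0.1, zeta=5):
    """One subTrain-style call: zeta independent SGD steps on a
    worker's own sub-parameters, with no communication until the
    next synchronization round."""
    for _ in range(zeta):
        theta = theta - lr * grad_fn(theta)
    return theta
```

With m workers, m such calls run concurrently on separate GPUs; only after all ζ local steps finish are the resulting sub-parameters aggregated back into the global model.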
When c > 1, subTrain samples one of the sub-graphs {G^(j)}_{j=1}^c to use as a mini-batch for SGD. Alternatively, the union of several sub-graphs in {G^(j)}_{j=1}^c can be used as a mini-batch for training. Aside from using mini-batches for each SGD update instead of the full graph, the use of graph partitioning does not modify the training approach outlined above. Some form of node sampling must be adopted to make training tractable when the full graph is too large to fit into memory. However, both graph partitioning and layer sampling are compatible with GIST (see Sections 5.2 and 5.4). We adopt graph sampling in the main experiments due to its ease of implementation. The novelty of our work lies in the feature partitioning strategy of GIST for distributed training, which is an orthogonal technique to node sampling; see Section 2.3.
After each sub-GCN completes ζ training iterations, its updates are aggregated into the global model (i.e., the subAgg function in Algorithm 1). Within subAgg, each worker replaces global parameter entries of Θ with its own parameters Θ^(i), where no collisions occur due to the disjointness of sub-GCN partitions. Interestingly, not every parameter in the global GCN model is updated by subAgg. For example, focusing on Θ_1 in Figure 2-(a), one worker will be assigned Θ^(1)_1 (i.e., overlapping orange blocks), while the other worker will be assigned Θ^(2)_1 (i.e., overlapping blue blocks). The rest of Θ_1 is not considered within subAgg. Nonetheless, since sub-GCN partitions are randomly drawn in each cycle t, one expects all of Θ to be updated multiple times if T is sufficiently large.
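The disjoint write-back performed by subAgg can be sketched as follows, reusing the same partition bookkeeping as the decomposition step (again an illustrative simplification with our own naming, not the paper's code; `groups[l][i]` holds the hidden indices assigned to worker i at layer boundary l + 1):

```python
import numpy as np

def sub_agg(Thetas, sub_Thetas, groups):
    """Copy each worker's trained sub-GCN parameters back into the
    global weights. Partitions are disjoint, so writes never collide;
    entries outside every partition are left untouched until a later
    cycle re-samples the partitions."""
    L, m = len(Thetas), len(sub_Thetas)
    for i in range(m):
        for l in range(L):
            rows = groups[l - 1][i] if l > 0 else np.arange(Thetas[0].shape[0])
            cols = groups[l][i] if l < L - 1 else np.arange(Thetas[-1].shape[1])
            Thetas[l][np.ix_(rows, cols)] = sub_Thetas[i][l]
    return Thetas
```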

What is the value of GIST?
Architecture-Agnostic Distributed Training. GIST is a generic, distributed training methodology that can be used for any GCN architecture. We implement GIST for vanilla GCN, GraphSAGE, and GAT architectures, but GIST is not limited to these models; see Section 5.
Compatibility with Sampling Methods. GIST is NOT a replacement for graph or layer sampling. Rather, it is an efficient, distributed training technique that can be used in tandem with node partitioning. As depicted in Figure 3, GIST partitions node feature representations and model parameters between sub-GCNs, while graph partitioning and layer sampling sub-sample nodes within the graph.
Interestingly, we find that GIST's feature and parameter partitioning strategy is compatible with node partitioning: the two approaches can be combined to yield further efficiency benefits. For example, GIST is combined with graph partitioning strategies in Section 5.2 and with layer sampling methodologies in Section 5.4.
Enabling Ultra-Wide GCN Training. GIST indirectly updates the global GCN through the training of smaller sub-GCNs, enabling models with hidden dimensions that exceed the capacity of a single GPU by a factor of 8× to be trained. In this way, GIST allows markedly overparameterized ("ultra-wide") GCN models to be trained on existing hardware. In Section 5.2, we leverage this capability to train a two-layer GCN model with a hidden dimension of 32,768 on Amazon2M.
We argue that overparameterization through width is more valuable than overparameterization through depth because deeper GCNs could suffer from oversmoothing (Li et al., 2018). As such, we do not explore depth-wise partitions that assign different GCN layers to each worker, but rather focus solely upon partitioning the hidden neurons within each layer. Such a partitioning strategy is suited to training wider networks.
Improved Model Complexity. Consider a single GCN layer, trained over M machines with input and output dimensions d_{i−1} and d_i, respectively. For one synchronization round, GIST reduces communication by only communicating sub-GCN parameters: each interior sub-GCN layer contains roughly (d_{i−1}/m) × (d_i/m) parameters, a factor of m² fewer than the d_{i−1} × d_i parameters of the corresponding global layer. Existing node partitioning techniques cannot similarly reduce communication complexity because model parameters are never partitioned. Furthermore, the computational complexity of the forward pass, which depends both on the number of nodes N in the partition being processed and on the layer dimensions, is likewise reduced for each sub-GCN, since its weight matrices are a factor of m² smaller. Node partitioning can reduce N by a constant factor and is compatible with GIST.
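A quick back-of-the-envelope check of the m² saving for an interior layer, whose input and output features are both split m ways (our own arithmetic; the 4096-dimensional layer size is borrowed from the experiments as an example):

```python
def full_layer_params(d_in, d_out):
    """Parameters in one global GCN layer."""
    return d_in * d_out

def sub_layer_params(d_in, d_out, m):
    """Parameters in one interior sub-GCN layer, with both the input
    and output feature dimensions split m ways."""
    return (d_in // m) * (d_out // m)

full = full_layer_params(4096, 4096)        # 16,777,216 parameters
sub = sub_layer_params(4096, 4096, 8)       # 262,144 parameters per worker
assert full // sub == 64                    # communication shrinks by m**2
```

Note that the first and last layers are only partitioned on one side (the input and output dimensions are kept whole), so their saving is a factor of m rather than m².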

Related Work
GCN training. In spite of their widespread success in several graph-related tasks, GCNs often suffer from training inefficiencies (Gao et al., 2018; Huang et al., 2018). Consequently, the research community has focused on developing efficient and scalable algorithms for training GCNs (Chen et al., 2018b;a; Chiang et al., 2019; Hamilton et al., 2017; Zeng et al., 2019; Zou et al., 2019). The resulting approaches can be divided roughly into two areas: neighborhood sampling and graph partitioning. However, it is important to note that these two broad classes of solutions are not mutually exclusive, and reasonable combinations of the two approaches may be beneficial.
Neighborhood sampling methodologies aim to sub-select neighboring nodes at each layer of the GCN, thus limiting the number of node representations in the forward pass and mitigating the exponential expansion of the GCN's receptive field. VRGCN (Chen et al., 2018b) implements a variance reduction technique to reduce the sample size in each layer, which achieves good performance with smaller graphs. However, it requires storing all the intermediate node embeddings during training, leading to a memory complexity close to that of full-batch training. GraphSAGE (Hamilton et al., 2017) learns a set of aggregator functions to gather information from a node's local neighborhood. It then concatenates the outputs of these aggregation functions with each node's own representation at each step of the forward pass. FastGCN (Chen et al., 2018a) adopts a Monte Carlo approach to evaluate the GCN's forward pass in practice, which computes each node's hidden representation using a fixed-size, randomly-sampled set of nodes. LADIES (Zou et al., 2019) introduces a layer-conditional approach for node sampling, which encourages node connectivity between layers, in contrast to FastGCN (Chen et al., 2018a).
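A GraphSAGE-style fixed-size neighborhood sample, for instance, can be sketched as follows (an illustrative toy under our own naming, not any library's API):

```python
import numpy as np

def sample_neighbors(adj_list, node, k, rng):
    """Draw a fixed-size set of k neighbors for one node, sampling
    with replacement when the neighborhood has fewer than k nodes,
    so every node contributes exactly k representations per layer."""
    nbrs = np.asarray(adj_list[node])
    return rng.choice(nbrs, size=k, replace=len(nbrs) < k)
```

Applying this recursively over L layers caps the receptive field at k^L nodes instead of letting it grow with the full multi-hop neighborhood.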
Graph partitioning schemes aim to select densely-connected sub-graphs within the training graph, which can be used to form mini-batches during GCN training. Such sub-graph sampling reduces the memory footprint of GCN training, thus allowing larger models to be trained over graphs with many nodes. ClusterGCN (Chiang et al., 2019) produces a very large number of clusters from the global graph, then randomly samples a subset of these clusters and computes their union to form each sub-graph or mini-batch. Similarly, GraphSAINT (Zeng et al., 2019) randomly samples a sub-graph during each GCN forward pass. However, GraphSAINT also considers the bias created by unequal node sampling probabilities during sub-graph construction, and proposes normalization techniques to eliminate this bias.
As explained in Section 2, GIST also relies on graph partitioning techniques (Cluster) to handle large graphs. However, the feature sampling scheme at each layer (subGCNs) that leads to parallel and narrower sub-GCNs is a hitherto unexplored framework for efficient GCN training.

Theoretical Results
We draw upon analysis related to neural tangent kernels (NTKs) (Jacot et al., 2018) to derive a convergence rate for two-layer GCNs, as formulated in (1) and further outlined in Appendix C.1, trained with GIST using gradient descent. Given the scaled Gram matrix H∞ of an infinite-dimensional NTK, we define the Graph Independent Subnetwork Training Kernel (GIST-K); its precise definition is deferred to Appendix C.4. Given the GIST-K, we adopt the following set of assumptions related to the underlying graph; see Appendix C.2 for more details.

Assumption 1. Assume λ_min(Ā) ≠ 0 and that there exist ε ∈ (0, 1) and p ∈ Z₊ such that (1 − ε)2^p ≤ D_ii ≤ (1 + ε)2^p for all i ∈ [n] = {1, 2, . . . , n}, where D is the degree matrix. Additionally, assume that i) input node representations are bounded in norm and not parallel to any other node representation, ii) output node representations are upper bounded, and iii) sub-GCN feature partitions are generated at each iteration from a categorical distribution with uniform mean 1/m.

Given this set of assumptions, we derive the following result.

Theorem 1 (informal). Under Assumption 1, if the number of hidden neurons within the two-layer GCN is sufficiently large, then GIST, run with distributed gradient descent and local iterations, converges linearly to approximately zero training error, up to an error neighborhood whose radius is controlled by the level of overparameterization and the number of workers m; see Theorem 2 in Appendix C.3 for the full statement.

A full proof of this result is deferred to Appendix C, but a sketch of the techniques used is as follows:
1. We define the GIST-K and show that it remains positive definite throughout training given our assumptions and sufficient overparameterization.
2. We show that local sub-GCN training converges linearly, given a positive definite GIST-K.
3. We analyze the change in training error when sub-GCNs are sampled (subGCNs), locally trained (subTrain), and aggregated (subAgg).
4. We establish a connection between local and aggregated weight perturbation, showing that network parameters remain within a small region centered around the initialization given sufficient overparameterization.
Discussion. Stated intuitively, the result in Theorem 1 shows that, given sufficient width, two-layer GCNs trained using GIST converge to approximately zero training error. The convergence rate is linear and, up to an error neighborhood, on par with training the full, two-layer GCN model (i.e., without the feature partition utilized in GIST). Such theory shows that the feature partitioning strategy of GIST does not cause the model to diverge in training. Additionally, the theory suggests that wider GCN models and a larger number of sub-GCNs should be used to maximize the convergence rate of GIST and minimize the impact of the additive term within Theorem 1, though the effect of m on the radius is less significant compared to that of d_1. Such findings reflect practical observations that are made within Section 5 and reveal that GIST is particularly suited towards training extremely wide models that cannot be trained using a traditional, centralized approach on a single GPU.

Experiments
We use GIST to train different GCN architectures on six public node classification datasets; see Appendix A for details. In most cases, we compare the performance of models trained with GIST to that of models trained with standard methods (i.e., a single GPU with node partitioning). Comparisons to models trained with other distributed methodologies are also provided in Appendix B. Experiments are divided into small- and large-scale regimes based upon graph size. The goal of GIST is to i) train GCN models to state-of-the-art performance, ii) minimize wall-clock training time, and iii) enable training of very wide GCN models.

Small-Scale Experiments
In this section, we perform experiments over the Cora, Citeseer, Pubmed, and OGBN-Arxiv datasets (Sen et al., 2008; Hu et al., 2020). For these small-scale datasets, we train a three-layer, 256-dimensional GCN model; see Appendix A.3 for details.

Table 2. Performance of models trained with GIST on Reddit and Amazon2M. Parentheses are placed around speedups achieved at the cost of a >1 point deterioration in F1 score, and m = "-" refers to the baseline. Models trained with GIST train more quickly and achieve comparable F1 scores to those trained with standard methodology. The performance benefits of GIST become more pronounced for wider models.
Large-Scale Experiments

Reddit Dataset. We train GraphSAGE (Hamilton et al., 2017) and GAT (Veličković et al., 2017) models with two to four layers on Reddit; see Appendix A.4 for more details. As shown in Table 2, utilizing GIST significantly accelerates GCN training (i.e., a 1.32× to 7.90× speedup). GIST performs best in terms of F1 score with m = 2 sub-GCNs (i.e., m = 4 yields further speedups, but the F1 score decreases). Interestingly, the speedup provided by GIST is more significant for models and datasets with larger compute requirements. For example, experiments with the GAT architecture, which is more computationally expensive than GraphSAGE, achieve a near-linear speedup with respect to m.
Amazon2M Dataset. Experiments are performed with two, three, and four-layer GraphSAGE models (Hamilton et al., 2017) with hidden dimensions of 400 and 4096 (we refer to these models as "narrow" and "wide", respectively). We compare the performance (i.e., F1 score and wall-clock training time) of GCN models trained with standard, single-GPU methodology to that of models trained with GIST; see Table 2. Narrow models trained with GIST have a lower F1 score in comparison to the baseline, but training time is significantly reduced. For wider models, GIST provides a more significant speedup (i.e., up to 7.12×) and tends to achieve a comparable F1 score relative to the baseline, revealing that GIST works best with wider models.
Within Table 2, models trained with GIST tend to achieve a wall-clock speedup at the cost of a lower F1 score (i.e., observe the speedups marked with parentheses in Table 2).
When training time is analyzed with respect to a fixed F1 score, we observe that the baseline takes significantly longer than GIST to achieve a fixed F1 score. For example, when L = 2, a wide GCN trained with GIST (m = 8) reaches an F1 score of 88.86 in ∼4,000 seconds, while models trained with standard methodology take ∼10,000 seconds to achieve a comparable F1 score. As such, GIST significantly accelerates training relative to model performance.

Table 3. Performance of GraphSAGE models of different widths trained with GIST on Amazon2M. m = "-" refers to the baseline and "OOM" marks experiments that cause out-of-memory errors. GIST enables training of higher-performing, ultra-wide models.

Training Ultra-Wide GCNs
We use GIST to train GraphSAGE models with widths as high as 32K (i.e., 8× beyond the capacity of a single GPU); see Table 3 for results and Appendix A.5 for more details. Considering L = 2, the best-performing, single-GPU GraphSAGE model (d_i = 4096) achieves an F1 score of 90.58 in 5.2 hours. With GIST (m = 2), we achieve a higher F1 score of 90.87 in 2.8 hours (i.e., a 1.86× speedup) using d_i = 8192, which is beyond single-GPU capacity. Similar patterns are observed for deeper models. Furthermore, we find that utilizing larger hidden dimensions yields further performance improvements, revealing the utility of wide, overparameterized GCN models. GIST, due to its feature partitioning strategy, is unique in its ability to train models of such scale to state-of-the-art performance.

GIST with Layer Sampling
As previously mentioned, some node partitioning approach must be adopted to avoid memory overflow when the underlying graph is too large to fit into memory. In this section, we combine GIST with the LADIES layer-sampling methodology (Zou et al., 2019) in place of graph partitioning. GIST continues to achieve improvements in wall-clock training time (e.g., the speedup increases from 1.83× to 2.90× when moving from m = 2 to m = 4 for L = 2) without significant deterioration in model performance. Thus, although node partitioning is needed to enable training on large-scale graphs, the feature partitioning strategy of GIST is compatible with numerous sampling strategies (i.e., not just graph sampling).

Conclusion
We present GIST, a distributed training approach for GCNs that enables the exploration of larger models and datasets.
GIST is compatible with existing sampling approaches and leverages a feature-wise partition of model parameters to construct smaller sub-GCNs that are trained independently and in parallel. We have shown that GIST achieves remarkable speed-ups over large graph datasets and even enables the training of GCN models of unprecedented size. We hope GIST can empower the exploration of larger, more powerful GCN architectures within the graph community.

A.2. Implementation Details
We provide an implementation of GIST in PyTorch (Paszke et al., 2019) using the NCCL distributed communication package for training GCN (Kipf & Welling, 2016), GraphSAGE (Hamilton et al., 2017), and GAT (Veličković et al., 2017) architectures. Our implementation is centralized, meaning that a single process serves as a central parameter server. From this central process, the weights of the global model are maintained and partitioned to different worker processes (including itself) for independent training. Experiments are conducted with 8 NVIDIA Tesla V100-PCIE-32G GPUs, a 56-core Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, and 256 GB of RAM.

A.3. Small-Scale Experiments
Small-scale experiments in Section 5.1 are performed using the Cora, Citeseer, Pubmed, and OGBN-Arxiv datasets (Sen et al., 2008; Hu et al., 2020). GIST experiments are performed with two, four, and eight sub-GCNs in all cases. We find that the performance of models trained with GIST is relatively robust to the number of local iterations ζ, but test accuracy decreases slightly as ζ increases.

Figure 4. Test accuracy for different sizes (i.e., varying depth and width) of GCN models trained with standard, single-GPU methodology on small-scale datasets. We adopt three-layer, 256-dimensional GCN models as our baseline architecture.
Experiments are run for 400 epochs with a step learning rate schedule (i.e., 10× decay at 50% and 75% of total epochs). A vanilla GCN model, as described in (Kipf & Welling, 2016), is used. The model is trained in a full-batch manner using the Adam optimizer (Kingma & Ba, 2014). No node sampling techniques are employed because the graph is small enough to fit into memory. All reported results are averaged across five trials with different random seeds. For all models, d_0 and d_L are respectively given by the number of features and output classes in the dataset. The size of all hidden layers is the same, but may vary across experiments.
We first train baseline GCN models of different depths and hidden dimensions using a single GPU to determine the best model depth and hidden dimension to be used in small-scale experiments.The results are shown in Figure 4. Deeper models do not yield performance improvements for small-scale datasets, but test accuracy improves as the model becomes wider.
Based upon the results in Figure 4, we adopt a three-layer GCN with a hidden dimension of d_1 = d_2 = 256 as the underlying model used in small-scale experiments. Though two-layer models seem to perform best, we use a three-layer model within Section 5.1 to enable more flexibility in examining the partitioning strategy of GIST.

A.4. Large-Scale Experiments
Reddit Dataset. For experiments on Reddit, we train 256-dimensional GraphSAGE and GAT models using both GIST and standard, single-GPU methodology. During training, the graph is partitioned into 15,000 sub-graphs. Training would be impossible without such partitioning because the graph is too large to fit into memory. The setting for the number of sub-graphs is the optimal setting proposed in previous work (Chiang et al., 2019). Models trained using GIST and standard, single-GPU methodologies are compared in terms of F1 score and training time.
All tests are run for 80 epochs with no weight decay, using the Adam optimizer (Kingma & Ba, 2014). We find that ζ = 500 achieves consistently high performance for models trained with GIST on Reddit. We adopt a batch size of 10 sub-graphs throughout the training process, which is the optimal setting proposed in previous work (Chiang et al., 2019).
Amazon2M Dataset. For experiments on Amazon2M, we train two- to four-layer GraphSAGE models with hidden dimensions of 400 and 4096 using both GIST and standard, single-GPU methodology. We follow the experimental settings of (Chiang et al., 2019). The training graph is partitioned into 15,000 sub-graphs and a batch size of 10 sub-graphs is used. We find that using ζ = 5000 performs consistently well. Models are trained for 400 total epochs with the Adam optimizer (Kingma & Ba, 2014) and no weight decay.

A.5. Training Ultra-Wide GCNs
All settings for ultra-wide GCN experiments in Section 5.3 are adopted from the experimental settings of Section 5.2. For the layer-sampling experiments in Section 5.4, LADIES (Zou et al., 2019), a layer-sampling approach for efficient GCN training, is used instead of graph partitioning. Any node sampling approach can be adopted; some sampling approach is just needed to avoid memory overflow.
We train 256-dimensional GCN models with either two or three layers. We utilize a vanilla GCN model within this section (as opposed to GraphSAGE or GAT) to simplify the implementation of GIST with LADIES, which creates a disparity in F1 score between the results in Section 5.4 and Section 5.2. Experiments in Section 5.4 compare the performance of the same models trained either with GIST or using standard, single-GPU methodology. In this case, the single-GPU model is just a GCN trained with LADIES.

B. Comparisons to Other Distributed Training Methodologies
Although GIST has been shown to provide benefits in terms of GCN performance and training efficiency in comparison to standard, single-GPU training, other choices for the distributed training of GCNs exist. Within this section, we compare GIST to other natural choices for distributed training, revealing that GCN models trained with GIST achieve favorable performance in comparison to those trained with other common distributed training techniques.

B.1. Local SGD
A simple version of local SGD (Lin et al., 2018) provides a natural distributed baseline: the full GCN model is replicated on every worker, trained locally, and periodically averaged across workers. We compare GIST to this baseline in this section.

Table 7. Performance of GraphSAGE models trained both with GIST and as ensembles of shallow sub-GCNs on Reddit. Models trained with GIST perform better and do not suffer from increased inference time as the number of sub-GCNs is increased.

B.2. Sub-GCN Ensembles
As previously mentioned, increasing the number of local iterations (i.e., ζ in Algorithm 1) decreases communication requirements given a fixed amount of training. When taken to the extreme (i.e., ζ → ∞), one could minimize communication requirements by never aggregating sub-GCN parameters, thus forming an ensemble of independently-trained sub-GCNs.
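The prediction rule of such an ensemble can be sketched as follows (our own illustrative code, not the paper's implementation); note that, unlike a single global model assembled by GIST, every ensemble member must run its own forward pass at inference time:

```python
import numpy as np

def ensemble_predict(sub_models, X):
    """Average the outputs of independently trained sub-GCNs.

    Inference cost grows linearly with the number of ensemble
    members, since each member computes a full forward pass."""
    return np.mean([f(X) for f in sub_models], axis=0)
```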
Within this formulation, ζ represents the total number of local iterations performed for each sub-GCN, while L(Θ^(j)_{t,k}) is the loss of the jth sub-GCN at the tth global and kth local iteration.

Properties of the Transformed Input
The GCN (Kipf & Welling, 2016) uses a first-degree Chebyshev polynomial to approximate a spectral convolution on the graph, which results in an aggregation matrix of the form

Ā = I + D^{−1/2} A D^{−1/2},   (4)

where A is the adjacency matrix and D is the degree matrix with D_ii = Σ_j A_ij. In practice, the re-normalization trick is applied to control the magnitude of the largest eigenvalue of Ā. Here, however, we keep the original formulation of (4) to facilitate our analysis, and our assumption on the depth of the GCN does not lead to numerical instability even if λ_max(Ā) > 1. It is a well-known result that 2 ≥ λ_max(Ā) ≥ λ_min(Ā) ≥ 0. In particular, the lower bound on the minimum eigenvalue is obtained by noting that Ā = D^{−1/2}(D + A)D^{−1/2} and that D + A is positive semi-definite, since x⊤(D + A)x = Σ_{(i,j)∈E} (x_i + x_j)² ≥ 0 for any x. In our analysis, we require the aggregation matrix Ā to be positive definite. Thus, the following assumption can be made about λ_min(Ā).

Assumption 2. λ_min(Ā) ≠ 0.
Going further, we must make a few more assumptions about the aggregation matrix and the graph itself to satisfy certain properties relevant to the analysis. First, the following property must hold. Property 1. For all i ∈ [n], we have ‖x̄_i‖_2 ≤ 1,
which can be guaranteed by the following assumption. Assumption 3. There exist ε ∈ (0, 1) and p ∈ Z_+ such that … holds. Additionally, we make the following assumption regarding the graph itself. Assumption 4. For all i ∈ [n], we have ‖x_i‖_2 ≤ (1 − ε)/2 and |y_i| ≤ C for some constant C. Moreover, for all j ∈ [n] with j ≠ i, we have x_i ∦ x_j. This, in turn, yields the following property. Property 2. For all i, j ∈ [n] such that i ≠ j, we have x̄_i ∦ x̄_j.

C.3. Full Statement of Theorem 1
We now state the full version of Theorem 1 from Section 4, which characterizes the convergence properties of one-hidden-layer GCN models trained with GIST. The full proof of this theorem is provided in Appendix C.5.
Theorem 2. Suppose Assumptions 2–4 and Property 2 hold. Moreover, suppose that in each global iteration the masks are generated from a categorical distribution with uniform mean 1/m. Fix the number of global iterations to T and the number of local iterations to ζ. If the number of hidden neurons d_1 is sufficiently large (the precise overparameterization requirement is derived in the proof in Appendix C.5), then the stated convergence guarantee holds with probability at least 1 − δ.

C.4. GIST and Local Training Progress
For a one-hidden-layer MLP, the analysis often depends on the (scaled) Gram matrix of the infinite-width neural tangent kernel (NTK). We extend this definition of the Gram matrix to an infinite-width, one-hidden-layer GCN. With Property 2, prior work (Du et al., 2019) shows that λ_min(H) > 0. Denoting λ_0 = λ_min(G^∞), since Ā is also positive definite, we have λ_0 ≥ λ_min(H) λ_min(Ā) > 0.

In our analysis, we define the Graph Independent Subnetwork Tangent Kernel (GIST-K), where H(t, t′, k) ⪰ 0 is defined for masks M_t and weights Θ_{t′,k}. Following previous work on subnetwork theory (Liao & Kyrillidis, 2021), the following lemma can be obtained.

Lemma 1. Suppose the number of hidden neurons satisfies d_1 = Ω(λ_0^{-1} n² log(Tmn/δ)). If, for all t and k, it holds that ‖θ_{t,k,r} − θ_{0,r}‖_2 ≤ R := λ_0/(48n), then with probability at least 1 − δ the stated bound holds for all t, t′ ∈ [T].

After showing that every GIST-K is positive definite, we can then show that the local training of each sub-GCN enjoys a linear convergence rate.

Lemma 2. Suppose the number of hidden neurons is sufficiently large; then local training converges linearly, with a corresponding bound on ‖θ_{t,k,r} − θ_{0,r}‖_2 holding for all r.

As in previous work (Liao & Kyrillidis, 2021), we add the scaling factor 1/m to ensure that E_{M_t}[ŷ^(j)(t, 0)] = ŷ(t). Moreover, by properties of the masks M^(j)_t, we can invoke Lemmas 13 and 14 from (Liao & Kyrillidis, 2021). We restate these two key lemmas here in accordance with our own notation.
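The displayed kernel definitions are lost in the extraction above. For reference, the standard infinite-width NTK Gram matrix for a one-hidden-layer ReLU network, as defined by Du et al. (2019), is shown below; the GCN extension G^∞ referenced above composes this kernel with the aggregation matrix Ā, though the exact form used in the paper is omitted in the text.

```latex
H^{\infty}_{ij} \;=\; \mathbb{E}_{w \sim \mathcal{N}(0, I)}
\left[\, x_i^{\top} x_j \,\mathbb{1}\!\left\{\, w^{\top} x_i \ge 0,\; w^{\top} x_j \ge 0 \,\right\} \right].
```

Under the condition that no two inputs are parallel (Property 2), Du et al. (2019) show that this Gram matrix is strictly positive definite, which is the fact invoked above to conclude λ_min(H) > 0.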
Lemma 3. The tth global step produces squared error ‖y − ŷ(t + 1)‖_2² satisfying the stated recursion.

Lemma 4. In the tth global iteration, the sampled subnetwork's deviation from the whole network is given by the stated expression.

Moreover, Lemmas 22 and 23 from (Liao & Kyrillidis, 2021) show that, with probability at least 1 − 2n exp(−m/32), the required initialization bounds hold for all R ≤ 1/2. For convenience, we assume that such an initialization property holds. Then, we can use Lemma 24 from (Liao & Kyrillidis, 2021): as long as ‖θ_{t,r} − θ_{0,r}‖_2 ≤ R for all t and r, the corresponding bound holds. Applying Markov's inequality then gives the bound with high probability. We point out that, within the proof, we use R = λ_0/(96n), which satisfies the condition above.

Using Lemma 3 to expand the loss ‖y − ŷ(t + 1)‖_2² at the (t + 1)th iteration and invoking Lemma 2, then using the fact that E_{M_t}[ŷ^(j)(t, 0)] = ŷ(t), and finally using Lemma 4 to rewrite the last term, we arrive at a recursion whose last term we denote by ι_t. The following lemma bounds ι_t.

Lemma 5. As long as ‖θ^(j)_{t,k,r} − θ_{0,r}‖_2 ≤ R for all t, k, j, and the initialization satisfies the required condition, the stated bound on ι_t holds, with E_{Θ_0,a_r}[ι_t] = 0, for all γ ∈ (0, 1).
Therefore, we can derive the global convergence result using Lemma 5. From here on, we use α to denote the global convergence rate. Lastly, we provide a bound on the weight perturbation using overparameterization. In particular, we show that Hypothesis 5 holds for iteration t + 1 under the assumption that it holds at iteration t, given the global convergence result. Using Jensen's inequality, this reduces to a simpler sufficient condition. Let ĵ be the index of the sub-GCN in which neuron r is active. What remains is to prove Hypothesis 5 for t = 0, for which we need the following lemma bounding E_{Θ_0,a}‖y − ŷ(0)‖_2².

Lemma 6. The stated bound on E_{Θ_0,a}‖y − ŷ(0)‖_2² holds. Plugging in the value of B and using α ≥ (ηλ_0/2)(1 − γ) to solve for d_1 gives the required overparameterization.

C.6. Proof of Lemmas
We now provide all proofs for the major properties and lemmas utilized in deriving the convergence results for GIST.
Proof of Property 1. Under Assumption 3, we have for all i ∈ [n] that ‖x̄_i‖_2 ≤ 1, where the first inequality in the derivation follows from Assumption 4.
Proof of Lemma 1. Fix some R > 0. Following Theorem 2 of (Liao & Kyrillidis, 2021), the first bound holds with probability at least 1 − 2n² e^{−2 d_1 t²}, and the second bound holds with probability at least 1 − n² e^{−d_1 R / (10m)}. Taking a union bound over all values of t and j, then plugging in the requirement d_1 = Ω(λ_0^{-1} n² log(Tmn/δ)), gives the desired result.
Next, we bound the weight perturbation.First, using Markov's inequality, we have that with probability at least 1 − δ

Figure 2. GCN partition into m = 2 sub-GCNs. Orange and blue colors depict different feature partitions. Both hidden dimensions (d_1 and d_2) are partitioned. The output dimension (d_3) is not partitioned. Partitioning the input dimension (d_0) is optional; in this work, we do not partition d_0 in GIST.
2.1. subGCNs: Constructing Sub-GCNs
GIST partitions a global GCN model into several narrower sub-GCNs of equal depth. Formally, consider an arbitrary layer ℓ and a random, disjoint partition of the feature set [d_ℓ] = {1, 2, ..., d_ℓ} into m equally-sized blocks {D_ℓ^(i)}_{i=1}^{m}. Accordingly, we denote by Θ_ℓ^(i) = [Θ_ℓ]_{D_ℓ^(i) × D_{ℓ+1}^(i)} the matrix obtained by selecting from Θ_ℓ the rows and columns given by the ith blocks in the partitions of [d_ℓ] and [d_{ℓ+1}], respectively. With this notation in place, we can define m different sub-GCNs.
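The construction above can be sketched as follows: randomly partition each hidden dimension into m disjoint blocks and slice each weight matrix accordingly. This is a minimal pure-Python sketch for a single hidden layer; the function names are illustrative, and the real GIST applies this to every hidden dimension of a multi-layer GCN before dispatching the sub-GCNs to separate workers.

```python
import random

def partition_features(d, m, rng):
    """Randomly split the feature indices {0, ..., d-1} into m disjoint,
    equally-sized blocks (assumes m divides d)."""
    idx = list(range(d))
    rng.shuffle(idx)
    size = d // m
    return [sorted(idx[i * size:(i + 1) * size]) for i in range(m)]

def sub_weights(theta, rows, cols):
    """Select the sub-matrix of a layer's weight matrix given by the chosen
    row block (input features) and column block (output features)."""
    return [[theta[r][c] for c in cols] for r in rows]

rng = random.Random(0)
d1, d2, m = 4, 4, 2
theta = [[rng.random() for _ in range(d2)] for _ in range(d1)]  # one layer

blocks_in = partition_features(d1, m, rng)
blocks_out = partition_features(d2, m, rng)
# Sub-GCN i keeps only the rows in blocks_in[i] and columns in blocks_out[i].
subs = [sub_weights(theta, blocks_in[i], blocks_out[i]) for i in range(m)]
```

Because the blocks are disjoint, the sub-GCNs share no parameters and can be trained independently; aggregating them simply writes each slice back into its original position in the global weight matrix.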

Figure 3. Illustration of the difference between GIST and node sampling techniques within the forward pass of a single GCN layer (excluding the non-linear activation). While graph partitioning and layer sampling remove nodes from the forward pass (either completely or on a per-layer basis), GIST partitions node feature representations (and, in turn, model parameters) instead of the nodes themselves.

Based on the results in Figure 5, we adopt ζ = 20 for Cora, Citeseer, and Pubmed, as well as ζ = 100 for OGBN-Arxiv.

Figure 4. Test accuracy for different sizes (i.e., varying depth and width) of GCN models trained with standard, single-GPU methodology on small-scale datasets. We adopt three-layer, 256-dimensional GCN models as our baseline architecture.

Figure 5. Test accuracy of GCN models trained on small-scale datasets with GIST using different numbers of local iterations and sub-GCNs. Models trained with GIST are surprisingly robust to the number of local iterations used during training, no matter the number of sub-GCNs.

See Appendix A.3 for further experimental settings. All reported metrics are averaged across five separate trials. Because these experiments run quickly, we use them to analyze the impact of different design and hyperparameter choices rather than attempting to improve runtime (i.e., speeding up such short experiments is futile). We investigate whether models trained with GIST are sensitive to the partitioning of features within certain layers. Although the output dimension d_3 is never partitioned, we selectively partition dimensions d_0, d_1, and d_2 to observe the impact on model performance; see Table 1. Partitioning input features (d_0) significantly degrades test accuracy because sub-GCNs observe only a portion of each node's input features (i.e., this becomes more noticeable with larger m). However, other feature dimensions cause no performance deterioration when partitioned between sub-GCNs, leading us to partition all feature dimensions other than d_0 and d_3.

Table 1. Test accuracy of GCN models trained on small-scale datasets with GIST. We selectively partition each feature dimension within the GCN model, indicated by a check mark. Partitioning on all hidden layers except the input layer leads to optimal performance.
How many Sub-GCNs to use? Using more sub-GCNs during GIST training typically improves runtime because sub-GCNs i) become smaller, ii) are each trained for fewer epochs, and iii) are trained in parallel. We find that all models trained with GIST perform similarly for practical settings of m; see Table 1. One may continue increasing the number of sub-GCNs used within GIST until all GPUs are occupied or model performance begins to decrease.

GIST Performance. Models trained with GIST often exceed the performance of models trained with standard, single-GPU methodology; see Table 1. Intuitively, we hypothesize that the random feature partitioning within GIST, which loosely resembles dropout (Srivastava et al., 2014), provides regularization benefits during training, but we leave an in-depth analysis of this property as future work.

training graph is large. Although graph partitioning is used within most experiments (see Section 5.2), GIST is also compatible with other node partitioning strategies. To demonstrate this, we perform training on Reddit using GIST combined with a recent layer sampling approach (Zou et al., 2019) (i.e., instead of graph partitioning); see Appendix A.6 for more details. As shown in Table 4, combining GIST with layer sampling enables training on large-scale graphs, and the observed speedup actually exceeds that of GIST with graph partitioning. For example, GIST with layer sampling yields a 1.83× speedup when L = 2 and m = 2, in comparison to a 1.50× speedup when graph partitioning is used within GIST (see Table 2). As the number of sub-GCNs is increased beyond m = 2, GIST with layer sampling continues to improve wall-clock training time.

Table 4. Performance of GCN models trained with a combination of GIST and LADIES (Zou et al., 2019) on Reddit. Here, the baseline represents models trained with LADIES in a standard, single-GPU manner. Combining GIST with layer sampling leads to further improvements in wall-clock training time without deteriorating the F1 score.

The details of the datasets utilized within the GIST experiments in Section 5 are provided in Table 5. Cora, Citeseer, Pubmed, and OGBN-Arxiv are considered "small-scale" datasets and are utilized within experiments in Section 5.1. Reddit and Amazon2M are considered "large-scale" datasets and are utilized within experiments in Section 5.2.

Table 5. Details of relevant datasets.

2; see Appendix A.4 for further details. For d_i > 4096, evaluation must be performed on graph partitions (not the full graph) to avoid memory overflow. As such, the graph is partitioned into 5,000 sub-graphs during testing, and the F1 score is measured over each partition and averaged. All experiments are performed using a GraphSAGE model, and the hidden dimension of the underlying model is changed between different experiments.

Table 6. Performance of GraphSAGE models trained using local SGD and GIST on Reddit. We adopt the settings described in Section 5.2, but use 100 local iterations for both GIST and local SGD. Models trained with GIST outperform those trained with local SGD in terms of test F1 score and wall-clock training time in all cases.
A.6. GIST with Layer Sampling
Experiments in Section 5.4 adopt the same experimental settings as Section 5.2 for the Reddit dataset; see Appendix A.4 for further details. Within these experiments, we combine GIST with LADIES (Zou et al., 2019), a recent layer sampling approach.
We now prove the convergence result for GIST outlined in Appendix C.3. In showing the convergence of GIST, we track the regression loss ‖y − ŷ(t)‖_2², with ŷ(t) denoting the network output at global iteration t.