1 Introduction

Since not all data can be represented in Euclidean space (Bronstein et al. 2017), many applications rely on graph-structured data. For example, social networks can be modeled as graphs by regarding each user as a node and friendship relations as edges (Lusher et al. 2013; Newman et al. 2002). Alternatively, in chemistry, molecules can be modeled as graphs, with nodes representing atoms and edges encoding chemical bonds (Balaban et al. 1985; Benkö et al. 2003).

To better understand graph-structured data, several (deep) learning techniques have been extended to the graph domain (Defferrard et al. 2016; Gori et al. 2005; Masci et al. 2015). Currently, the most popular one is the graph convolutional network (GCN) (Kipf and Welling 2016), a multi-layer architecture that implements a generalization of the convolution operation to graphs. Although the GCN handles node- and graph-level classification, it is notoriously inefficient and unable to support large graphs (Chen et al. 2018a, b; Gao et al. 2018; Huang et al. 2018; You et al. 2020; Zeng et al. 2019), making practical, large-scale applications difficult to handle.

To deal with these issues, node partitioning methodologies have been developed. These schemes can be roughly categorized into neighborhood sampling (Chen et al. 2018b; Hamilton et al. 2017; Zou et al. 2019) and graph partitioning (Chiang et al. 2019; Zeng et al. 2019) approaches. The goal is to partition a large graph into multiple smaller graphs that can be used as mini-batches for training the GCN. In this way, GCNs can handle larger graphs during training, expanding their potential into the realm of big data. However, the size of the underlying model is still limited by available memory capacity, thus placing further constraints on the scale of GCN experimentation.

Although some papers perform large-scale experiments (Chiang et al. 2019; Zeng et al. 2019), the models (and data) used in GCN research remain small in the context of deep learning (Kipf and Welling 2016; Veličković et al. 2017), where the current trend is towards extremely large models and datasets (Brown et al. 2020; Conneau et al. 2019). Despite widespread ethical questioning of this trend (Hao 2019; Peng and Sarazen 2019; Sharir et al. 2020), the deep learning community continues to push the limits of scale. Overparameterized models yield improvements in tasks like zero/few-shot learning (Brown et al. 2020; Radford et al. 2021), are capable of discovering generalizable solutions (Nakkiran et al. 2019), and even have desirable theoretical properties (Oymak and Soltanolkotabi 2020).

Although deeper GCNs may perform poorly due to oversmoothing (Kipf and Welling 2016; Li et al. 2018), GCNs should similarly benefit from overparameterization through larger hidden layers. Furthermore, recent work indicates that overparameterization is most impactful on larger datasets (Hoffmann et al. 2022), making overparameterized models essential as GCNs are applied to practical problems at scale. Moving in this direction, our work provides an efficient framework, compatible with existing training techniques, for training wide, overparameterized GCN models of any architecture, even beyond the memory capacity of a single GPU.

This paper. Inspired by independent subnetwork training (IST) (Yuan et al. 2019), our methodology randomly partitions the hidden feature space in each layer, decomposing the global GCN model into multiple, narrow sub-GCNs of equal depth. Sub-GCNs are trained independently for several iterations in parallel prior to having their updates synchronized; see Fig. 1. This process of randomly partitioning, independently training, and synchronizing sub-GCNs is repeated until convergence. We call this method Graph Independent Subnetwork Training (GIST), as it extends the IST framework to the training of GCNs.

Though IST was previously unexplored in this domain, we find that GIST pairs well with any GCN architecture, is compatible with node sampling techniques, can scale to arbitrarily large graphs, and significantly reduces wall-clock training time, allowing larger models and datasets to be explored. In particular, we focus on training “ultra-wide” GCNs (i.e., GCN models with very large hidden layers), as deeper GCNs are prone to oversmoothing (Li et al. 2018) and GIST’s model partitioning strategy can mitigate the memory overhead of training these wider GCNs.

The contributions of this work are as follows:

  • We develop a novel extension of IST for training GCNs, show that it works well for training GCNs with a variety of architectures, and demonstrate its compatibility with commonly-used GCN training techniques like neighborhood sampling and graph partitioning.

  • We show that GIST can be used to reach state-of-the-art performance with reduced training time relative to standard training methodologies. GIST is a compatible addition to GCN training that improves efficiency.

  • We propose a novel Graph Independent Subnetwork Training Kernel (GIST-K) that allows a convergence rate to be derived for two-layer GCNs trained with GIST in the infinite width regime. Based on GIST-K, we provide theory showing that GIST converges linearly, up to an error neighborhood, using distributed gradient descent with local iterations. We show that the radius of the error neighborhood is controlled by the overparameterization parameter, as well as the number of workers in the distributed setting. These findings reflect practical observations made in the experimental section.

  • We use GIST to enable the training of markedly overparameterized GCN models. In particular, GIST is used to train a two-layer GraphSAGE model with a hidden dimension of \(32\,768\) on the Amazon2M dataset. Such a model exceeds the capacity of a single GPU by \(8\times \).

Fig. 1 GIST pipeline: subGCNs divides the global GCN into sub-GCNs. Every sub-GCN is trained by subTrain using mini-batches (smaller sub-graphs) generated by Cluster. Sub-GCN parameters are intermittently aggregated through subAgg.

Fig. 2 GCN partition into \(m=2\) sub-GCNs. Orange and blue colors depict different feature partitions. Both hidden dimensions (\(d_1\) and \(d_2\)) are partitioned. The output dimension (\(d_3\)) is not partitioned. Partitioning the input dimension (\(d_0\)) is optional, but we do not partition \(d_0\) in GIST.

2 What is the GIST of this work?

Algorithm 1 GIST

GCN Architecture. The GCN (Kipf and Welling 2016) is arguably the most widely-used neural network architecture on graphs. Consider a graph \(\mathcal {G}\) comprised of n nodes with d-dimensional features \(\textbf{X} \in \mathbb {R}^{n \times d}\). The output \(\textbf{Y} \in \mathbb {R}^{n \times d'}\) of a GCN can be expressed as \(\textbf{Y} = \Psi _{\mathcal {G}}(\textbf{X}; \varvec{\Theta })\), where \(\Psi _{\mathcal {G}}\) is an L-layered architecture with trainable parameters \(\varvec{\Theta }\). If we define \(\textbf{H}_0 = \textbf{X}\), we then have that \(\textbf{Y} = \Psi _{\mathcal {G}}(\textbf{X}; \varvec{\Theta }) = \textbf{H}_L\), where an intermediate \(\ell \)-th layer of the GCN is given by

$$\begin{aligned} \textbf{H}_{\ell +1} = \sigma (\bar{\textbf{A}} \, \textbf{H}_\ell \, \varvec{\Theta }_\ell ). \end{aligned}$$
(1)

In (1), \(\sigma (\cdot )\) is an elementwise activation function (e.g., ReLU), \(\bar{\textbf{A}}\) is the degree-normalized adjacency matrix of \(\mathcal {G}\) with added self-loops, and the trainable parameters \(\varvec{\Theta } = \{\varvec{\Theta }_\ell \}_{\ell =0}^{L-1}\) have dimensions \(\varvec{\Theta }_\ell \in \mathbb {R}^{d_{\ell } \times d_{\ell +1}}\) with \(d_0 = d\) and \(d_L = d'\). In Fig. 2 (top), we illustrate nested GCN layers for \(L=3\), but our methodology extends to arbitrary L. The activation function of the last layer is typically the identity or softmax transformation – we omit this in Fig. 2 for simplicity.
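For concreteness, (1) amounts to a product with \(\bar{\textbf{A}}\), a product with \(\varvec{\Theta }_\ell \), and an elementwise nonlinearity. The NumPy sketch below is our own illustration (not code from the paper); it assumes the symmetric normalization \(\textbf{D}^{-1/2}(\textbf{A} + \textbf{I})\textbf{D}^{-1/2}\) as the degree normalization and applies a single layer to a toy four-node graph.

```python
# Minimal NumPy sketch of the GCN layer in Eq. (1); names and the toy graph are illustrative.
import numpy as np

def normalized_adjacency(A):
    """Add self-loops and symmetrically normalize: D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_bar, H, Theta, activation=lambda z: np.maximum(z, 0.0)):
    """One GCN layer: H_{l+1} = sigma(A_bar @ H_l @ Theta_l)."""
    return activation(A_bar @ H @ Theta)

# Toy example: n = 4 nodes, d_0 = 3 input features, d_1 = 8 hidden units.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
X = np.random.randn(4, 3)
Theta_0 = np.random.randn(3, 8)
H_1 = gcn_layer(normalized_adjacency(A), X, Theta_0)   # shape (4, 8)
```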

GIST overview. We overview GIST in Algorithm 1 and present a schematic depiction in Fig. 1. We partition our (randomly initialized) global GCN into m smaller, disjoint sub-GCNs with the subGCNs function (\(m=2\) in Figs. 1 and 2) by sampling the feature space at each layer of the GCN; see Sect. 2.1. Each sub-GCN is assigned to a different worker (i.e., a different GPU) for \(\zeta \) rounds of distributed, independent training through subTrain. Then, newly-learned sub-GCN parameters are aggregated (subAgg) into the global GCN model. This process repeats for T iterations. Our graph domain is partitioned into c sub-graphs through the Cluster function (\(c=2\) in Fig. 1). This operation is only relevant for large graphs (\(n>50~000\)), and we omit it (\(c=1\)) for smaller graphs that do not require partitioning.Footnote 1

2.1 subGCNs: constructing sub-GCNs

GIST partitions a global GCN model into several narrower sub-GCNs of equal depth. Formally, consider an arbitrary layer \(\ell \) and a random, disjoint partition of the feature set \([d_\ell ] = \{1, 2, \ldots , d_\ell \}\) into m equally-sized blocks \(\{\mathcal {D}^{(i)}_\ell \}_{i=1}^m\).Footnote 2 Accordingly, we denote by \(\varvec{\Theta }^{(i)}_{\ell } = [\varvec{\Theta }_{\ell }]_{\mathcal {D}^{(i)}_\ell \times \mathcal {D}^{(i)}_{\ell +1}}\) the matrix obtained by selecting from \(\varvec{\Theta }_{\ell }\) the rows and columns given by the ith blocks in the partitions of \([d_\ell ]\) and \([d_{\ell +1}]\), respectively. With this notation in place, we can define m different sub-GCNs \(\textbf{Y}^{(i)} = \Psi _{\mathcal {G}}(\textbf{X}^{(i)}; \varvec{\Theta }^{(i)}) = \textbf{H}^{(i)}_{L}\) where \(\textbf{H}^{(i)}_{0} = \textbf{X}_{[n] \times \mathcal {D}^{(i)}_0}\) and each layer is given by:

$$\begin{aligned} \textbf{H}^{(i)}_{\ell +1} = \sigma (\bar{\textbf{A}} \, \textbf{H}^{(i)}_{\ell } \, \varvec{\Theta }^{(i)}_{\ell }). \end{aligned}$$
(2)

Notably, not all parameters within the global GCN model are partitioned to a sub-GCN. However, by randomly re-constructing new groups of sub-GCNs according to a uniform distribution throughout the training process, all parameters have a high likelihood of being updated. In Sect. 4, we provide theoretical guarantees that the partitioning of model parameters to sub-GCNs does not harm training performance.

Sub-GCN partitioning is illustrated in Fig. 2a, where \(m=2\). Partitioning the input features is optional (i.e., (a) vs. (b) in Fig. 2). We do not partition the input features within GIST so that sub-GCNs have identical input information (i.e., \(\textbf{X}^{(i)} = \textbf{X}\) for all i); see Sect. 5.1. Similarly, we do not partition the output feature space to ensure that the sub-GCN output dimension coincides with that of the global model, thus avoiding any need to modify the loss function. This decomposition procedure (subGCNs function in Algorithm 1) extends to arbitrary L.
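The sketch below is a hypothetical NumPy illustration of this construction (it is not the authors’ code): each hidden dimension is split into m random, equally-sized index blocks, the input and output dimensions are kept whole, and \(\varvec{\Theta }^{(i)}_{\ell }\) is obtained by slicing rows and columns of \(\varvec{\Theta }_{\ell }\) as in (2).

```python
# Hypothetical NumPy sketch of the subGCNs operation: random feature blocks per layer and the
# corresponding weight slices Theta_l^(i). Helper names are illustrative, not the authors' code.
import numpy as np

def sub_gcn_partitions(layer_dims, m, rng, partition_input=False, partition_output=False):
    """For each dimension d_l, return m disjoint index blocks D_l^(i) (d_0 and d_L kept whole)."""
    blocks = []
    for l, d in enumerate(layer_dims):
        first, last = (l == 0), (l == len(layer_dims) - 1)
        if (first and not partition_input) or (last and not partition_output):
            blocks.append([np.arange(d)] * m)           # every sub-GCN sees the full dimension
        else:
            perm = rng.permutation(d)
            blocks.append(np.array_split(perm, m))      # random, equally-sized blocks
    return blocks

def extract_sub_weights(thetas, blocks, i):
    """Theta_l^(i): rows D_l^(i) and columns D_{l+1}^(i) of Theta_l, as in Eq. (2)."""
    return [theta[np.ix_(blocks[l][i], blocks[l + 1][i])] for l, theta in enumerate(thetas)]

rng = np.random.default_rng(0)
dims = [3, 8, 8, 2]                                     # d_0, d_1, d_2, d_3 for L = 3
thetas = [rng.standard_normal((dims[l], dims[l + 1])) for l in range(3)]
blocks = sub_gcn_partitions(dims, m=2, rng=rng)
sub_0 = extract_sub_weights(thetas, blocks, i=0)        # shapes: (3, 4), (4, 4), (4, 2)
```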

2.2 subTrain: independently training sub-GCNs

Assume \(c=1\) so that the Cluster operation in Algorithm 1 is moot and \(\{\mathcal {G}_{(j)}\}_{j=1}^c = \mathcal {G}\). Because \(\textbf{Y}^{(i)}\) and \(\textbf{Y}\) share the same dimension, sub-GCNs can be trained to minimize the same global loss function. One application of subTrain in Algorithm 1 corresponds to a single step of stochastic gradient descent (SGD). Inspired by local SGD (Lin et al. 2018), multiple, independent applications of subTrain are performed in parallel (i.e., on separate GPUs) for each sub-GCN prior to aggregating weight updates. The number of independent training iterations between synchronization rounds, referred to as local iterations, is denoted by \(\zeta \), and the total amount of training is split across sub-GCNs.Footnote 3 Ideally, the number of sub-GCNs and local iterations should be increased as much as possible to minimize communication and training costs. In practice, however, such benefits may come at the cost of statistical inefficiency; see Sect. 5.1.

If \(c > 1\), subTrain first selects one of the c subgraphs in \(\{\mathcal {G}_{(j)}\}_{j=1}^c\) to use as a mini-batch for SGD. Alternatively, the union of several sub-graphs in \(\{\mathcal {G}_{(j)}\}_{j=1}^c\) can be used as a mini-batch for training. Aside from using mini-batches for each SGD update instead of the full graph, the use of graph partitioning does not modify the training approach outlined above. Some form of node sampling must be adopted to make training tractable when the full graph is too large to fit into memory. However, both graph partitioning and layer sampling are compatible with GIST (see Sects. 5.2 and 5.4). We adopt graph partitioning in the main experiments due to the ease of implementation. The novelty of our work lies in the feature partitioning strategy of GIST for distributed training, which is an orthogonal technique to node sampling; see Fig. 3 and Sect. 2.4.
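As a small illustration of this mini-batch selection (our own sketch; the cluster contents are toy values), subTrain can draw one cluster, or the union of several, per SGD step:

```python
# Toy sketch of mini-batch selection when c > 1 (illustrative only): each SGD step trains on
# one randomly chosen sub-graph, or on the union of several sub-graphs.
import random

def sample_minibatch(clusters, k=1):
    """Return the sorted node ids of a mini-batch formed from k randomly chosen sub-graphs."""
    chosen = random.sample(range(len(clusters)), k)
    return sorted(set().union(*(clusters[j] for j in chosen)))

clusters = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]   # node id partitions produced by Cluster
print(sample_minibatch(clusters, k=2))          # e.g., [0, 1, 2, 5, 6, 7, 8]
```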

2.3 subAgg: aggregating sub-GCN parameters

After each sub-GCN completes \(\zeta \) training iterations, their updates are aggregated into the global model (subAgg function in Algorithm 1). Within subAgg, each worker replaces global parameter entries within \(\varvec{\Theta }\) with its own, independently-trained sub-GCN parameters \(\varvec{\Theta }^{(i)}\), where no collisions occur due to the disjointness of sub-GCN partitions. Thus, subAgg is a basic copy operation that transfers sub-GCN parameters into the global model.

Not every parameter in the global GCN model is updated by subAgg because, as previously mentioned, parameters exist that are not partitioned to any sub-GCN by the \(\texttt {subGCNs}\) operation. For example, focusing on \(\varvec{\Theta }_{1}\) in Fig. 2a, one worker will be assigned \(\varvec{\Theta }^{(1)}_{1}\) (i.e., overlapping orange blocks), while the other worker will be assigned \(\varvec{\Theta }^{(2)}_{1}\) (i.e., overlapping blue blocks). The rest of \(\varvec{\Theta }_{1}\) is not considered within subAgg. Nonetheless, since sub-GCN partitions are randomly drawn in each cycle t, one expects all of \(\varvec{\Theta }\) to be updated multiple times if T is sufficiently large.
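To make the full cycle concrete, the following self-contained PyTorch sketch simulates subGCNs, subTrain, and subAgg for a two-layer GCN on a toy random graph with \(c=1\). It illustrates the structure of Algorithm 1 under assumed toy sizes and hyperparameters, and is not the authors’ implementation; workers are simulated serially, which is equivalent here because the feature blocks are disjoint.

```python
# Self-contained PyTorch toy simulating the GIST cycle (subGCNs -> subTrain -> subAgg) for a
# two-layer GCN. All sizes, hyperparameters, and names are illustrative assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d0, d1, d2 = 12, 5, 16, 3                 # nodes, input / hidden / output dimensions
m, zeta, T, lr = 2, 5, 20, 0.1               # sub-GCNs, local iterations, cycles, step size

# Toy graph: random symmetric adjacency with self-loops, symmetrically degree-normalized.
A = (torch.rand(n, n) < 0.3).float()
A = ((A + A.t()) > 0).float()
A.fill_diagonal_(1.0)
d_inv_sqrt = A.sum(1).rsqrt()
A_bar = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

X = torch.randn(n, d0)                       # node features
y = torch.randint(0, d2, (n,))               # node labels

# Global GCN parameters (Theta_0, Theta_1); only the hidden dimension d1 is partitioned.
Theta0 = 0.1 * torch.randn(d0, d1)
Theta1 = 0.1 * torch.randn(d1, d2)

def forward(W0, W1):
    """Two-layer (sub-)GCN: A_bar ReLU(A_bar X W0) W1."""
    return A_bar @ torch.relu(A_bar @ X @ W0) @ W1

for t in range(T):
    # subGCNs: random, disjoint blocks of the hidden feature dimension.
    blocks = torch.randperm(d1).chunk(m)
    for idx in blocks:                        # one (simulated) worker per sub-GCN
        W0 = Theta0[:, idx].clone().requires_grad_(True)   # Theta_0^(i)
        W1 = Theta1[idx, :].clone().requires_grad_(True)   # Theta_1^(i)
        # subTrain: zeta local SGD steps on the (single) training graph.
        for _ in range(zeta):
            loss = F.cross_entropy(forward(W0, W1), y)
            g0, g1 = torch.autograd.grad(loss, (W0, W1))
            with torch.no_grad():
                W0 -= lr * g0
                W1 -= lr * g1
        # subAgg: copy the independently trained block back into the global model.
        # Blocks are disjoint, so this serial loop matches parallel execution.
        with torch.no_grad():
            Theta0[:, idx] = W0
            Theta1[idx, :] = W1

print(forward(Theta0, Theta1).argmax(dim=1))  # predictions of the aggregated global model
```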

2.4 What is the value of GIST?

Architecture-Agnostic Distributed Training. GIST is a generic, distributed training methodology that can be used for any GCN architecture. We implement GIST for vanilla GCN, GraphSAGE, and GAT architectures, but GIST is not limited to these models; see Sect. 5.

Fig. 3 Illustration of the difference between GIST and node sampling techniques within the forward pass of a single GCN layer (excluding the non-linear activation). While graph partitioning and layer sampling remove nodes from the forward pass (i.e., either completely or on a per-layer basis), GIST partitions node feature representations (and, in turn, model parameters) instead of the nodes themselves.

Compatibility with Sampling Methods. GIST is NOT a replacement for graph or layer sampling. Rather, it is an efficient, distributed training technique that can be used in tandem with node partitioning. As depicted in Fig. 3, GIST partitions node feature representations and model parameters between sub-GCNs, while graph partitioning and layer sampling sub-sample nodes within the graph.

Interestingly, we find that GIST’s feature and parameter partitioning strategy is compatible with node partitioning—the two approaches can be combined to yield further efficiency benefits. For example, GIST is combined with graph partitioning strategies in Sect. 5.2 and with layer sampling methodologies in Sect. 5.4. As such, we argue that GIST offers an easy add-on to GCN training that makes larger scale experiments more feasible.

Enabling Ultra-Wide GCN Training. GIST indirectly updates the global GCN through the training of smaller sub-GCNs, enabling models with hidden dimensions that exceed the capacity of a single GPU to be trained; in our experiments, GIST allows training of models beyond the capacity of a single GPU by a factor of \(8\times \). In this way, GIST allows markedly overparameterized (“ultra-wide”) GCN models to be trained on existing hardware. In Sect. 5.2, we leverage this capability to train a two-layer GCN model with a hidden dimension of 32 768 on Amazon2M.

Overparameterization through width is especially relevant to GCNs because deeper models suffer from oversmoothing (Li et al. 2018). Additionally, the theoretical results in Sect. 4 show that GIST performs best as the number of neurons within each hidden layer increases, further underscoring the benefit of wide, overparameterized layers. We do not explore depth-wise partitions of different GCN layers to each worker, but rather focus solely on partitioning the hidden neurons within each layer.

Improved Model Complexity. Consider a single GCN layer, trained over M machines with input and output dimensions of \(d_{i-1}\) and \(d_{i}\), respectively. For one synchronization round, the communication complexity of GIST and standard distributed training is \(\mathcal {O}(\frac{1}{M}d_i d_{i-1})\) and \(\mathcal {O}(M d_{i} d_{i-1})\), respectively. GIST reduces communication by only communicating sub-GCN parameters. Existing node partitioning techniques cannot similarly reduce communication complexity because model parameters are never partitioned. Furthermore, the computational complexity of the forward pass for a GCN model trained with GIST and with standard methodology is \(\mathcal {O}(\frac{1}{M} N^2 d_i + \frac{1}{M^2} N d_i d_{i-1})\) and \(\mathcal {O}(N^2 d_i + N d_i d_{i-1})\), respectively, where N is the number of nodes in the partition being processed.Footnote 4 Node partitioning can reduce N by a constant factor and remains compatible with GIST.
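As a back-of-the-envelope illustration (with an assumed layer of size \(d_{i-1} = d_i = 4096\) and \(M = 8\) workers; these numbers are not from the paper), the per-round communication gap looks as follows:

```python
# Back-of-the-envelope communication comparison for one synchronization round of a single layer,
# assuming d_{i-1} = d_i = 4096 and M = 8 workers (illustrative numbers, not from the paper).
d_prev, d_cur, M = 4096, 4096, 8

standard = M * d_cur * d_prev               # every worker exchanges the full dense layer
gist = M * (d_cur // M) * (d_prev // M)     # each worker sends only its (d_cur/M) x (d_prev/M) block

print(f"standard: {standard:,} parameters, GIST: {gist:,} parameters "
      f"({standard // gist}x less communication)")
```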

Relation to IST. Our work extends the IST distributed training framework—originally proposed for fully-connected network architectures (Yuan et al. 2019)—to GCNs. Due to the unique aspects of GCN training (e.g., non-Euclidean data and aggregation of node features), it was previously unclear whether IST would work well in this domain. Though IST is applicable to a variety of architectures, we find that it is especially useful for efficiently training GCNs to high accuracy. GIST (i) provides speedups and performance benefits, (ii) is compatible with other efficient GCN training methods, and (iii) enables training of uncharacteristically wide GCN models, allowing overparameterized GCNs to be explored via greater width. The practical utility of GIST and the interplay of the approach with unique aspects of GCN training differentiate our work from the original IST proposal.

3 Related work

GCN training. In spite of their widespread success in several graph-related tasks, GCNs often suffer from training inefficiencies (Gao et al. 2018; Huang et al. 2018). Consequently, the research community has focused on developing efficient and scalable algorithms for training GCNs (Chen et al. 2018a, b; Chiang et al. 2019; Hamilton et al. 2017; Zeng et al. 2019; Zou et al. 2019). The resulting approaches can be divided roughly into two areas: neighborhood sampling and graph partitioning. However, these two broad classes of solutions are not mutually exclusive, and reasonable combinations of the two approaches may be beneficial.

Neighborhood sampling methodologies aim to sub-select neighboring nodes at each layer of the GCN, thus limiting the number of node representations in the forward pass and mitigating the exponential expansion of the GCN’s receptive field. VRGCN (Chen et al. 2018a) implements a variance reduction technique to reduce the sample size in each layer, which achieves good performance with smaller graphs. However, it requires storing all intermediate node embeddings during training, leading to a memory complexity close to that of full-batch training. GraphSAGE (Hamilton et al. 2017) learns a set of aggregator functions to gather information from a node’s local neighborhood. It then concatenates the outputs of these aggregation functions with each node’s own representation at each step of the forward pass. FastGCN (Chen et al. 2018b) adopts a Monte Carlo approach to evaluate the GCN’s forward pass, computing each node’s hidden representation using a fixed-size, randomly-sampled set of nodes. LADIES (Zou et al. 2019) introduces a layer-conditional approach for node sampling, which encourages node connectivity between layers, in contrast to FastGCN (Chen et al. 2018b).

Graph partitioning schemes aim to select densely-connected sub-graphs within the training graph, which can be used to form mini-batches during GCN training. Such sub-graph sampling reduces the memory footprint of GCN training, thus allowing larger models to be trained over graphs with many nodes. ClusterGCN (Chiang et al. 2019) produces a very large number of clusters from the global graph, then randomly samples a subset of these clusters and computes their union to form each sub-graph or mini-batch. Similarly, GraphSAINT (Zeng et al. 2019) randomly samples a sub-graph during each GCN forward pass. However, GraphSAINT also considers the bias created by unequal node sampling probabilities during sub-graph construction, and proposes normalization techniques to eliminate this bias.

As explained in Sect. 2, GIST also relies on graph partitioning techniques (Cluster) to handle large graphs. However, the feature sampling scheme at each layer (subGCNs) that leads to parallel and narrower sub-GCNs is a hitherto unexplored framework for efficient GCN training.

Distributed training. Distributed training is a heavily studied topic (Shi et al. 2020; Zhang et al. 2018). Our work focuses on synchronous, distributed training techniques (Lian et al. 2017; Yu et al. 2019; Zhang et al. 2015). Some examples of synchronous, distributed training approaches include data-parallel training, parallel SGD (Agarwal and Duchi 2011; Zinkevich et al. 2020), and local SGD (Lin et al. 2018; Stich 2019). Our methodology bears similarities to model-parallel training techniques, which have been heavily explored (Ben-Nun and Hoefler 2019; Gholami et al. 2017; Günther et al. 2018; Kirby et al. 2020; Pauloski et al. 2020; Tavarageri et al. 2019; Zhu et al. 2020). Most closely, our approach is inspired by IST, which was explored for feed-forward networks in Yuan et al. (2019). Later work analyzed IST theoretically (Liao and Kyrillidis 2021) and extended its use to more complex ResNet architectures (Dun et al. 2022). We explore the extension of IST to the GCN architecture both theoretically and empirically, finding that IST-based methods are well suited to GCN training. Nonetheless, the IST framework is applicable to network architectures beyond the GCN.

4 Theoretical results

We draw upon analysis related to neural tangent kernels (NTKs) (Jacot et al. 2018) to derive a convergence rate for two-layer GCNs trained with GIST using gradient descent, as formulated in (1) and further outlined in “Appendix C.1”. Given the scaled Gram matrix of an infinite-dimensional NTK, \(\textbf{H}^\infty \), we define the Graph Independent Subnetwork Training Kernel (GIST-K) as follows:

$$\begin{aligned} \textbf{G}^{\infty } = \bar{\textbf{A}}\textbf{H}^\infty \bar{\textbf{A}}. \end{aligned}$$

Given the GIST-K, we adopt the following set of assumptions related to the underlying graph; see “Appendix C.3” for more details.

Notation. Let n denote the number of nodes (training samples) in the graph of interest, \(d = d_0\) the dimension of each node’s feature vector, and m the number of sub-GCNs in procedure (4). Let \(\lambda _0 = \lambda _{\min }\left( \textbf{G}^\infty \right) \) and \(\lambda ^* = \lambda _{\max }\left( \textbf{G}^\infty \right) \) denote the minimum and maximum eigenvalues of \(\textbf{G}^\infty \), respectively. Lastly, we write \(\mathbb {E}_{[\mathcal {M}_t]}[\cdot ] = \mathbb {E}_{\mathcal {M}_0,\dots ,\mathcal {M}_{t}}[\cdot ]\) for the total expectation with respect to \(\mathcal {M}_0,\dots ,\mathcal {M}_t\).

Assumption 1

Assume \(\lambda _{\min }(\bar{\textbf{A}}) \ne 0\) and that there exist \(\epsilon \in (0,1)\) and \(p\in \mathbb {Z}_+\) such that \((1-\epsilon )^2p\le \textbf{D}_{ii}\le (1+\epsilon )^2p\) for all \(i \in [n] = \{1, 2, \dots , n\}\), where \(\textbf{D}\) is the degree matrix. Additionally, assume that (i) input node representations are bounded in norm and not parallel to any other node representation, (ii) output node representations are upper bounded, and (iii) sub-GCN feature partitions are generated at each iteration from a categorical distribution with uniform mean \(\frac{1}{m}\).

Given this set of assumptions, we can guarantee that \(\lambda _0 > 0\) (a detailed discussion is deferred to Sect. C.5). Under such conditions, we derive the following result for GCN models trained with GIST.
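The toy NumPy snippet below shows how these quantities fit together. Because the true \(\textbf{H}^\infty \) is the Gram matrix of an infinite-width NTK, a random positive semi-definite matrix stands in for it here purely to illustrate the computation of \(\textbf{G}^{\infty }\), \(\lambda _0\), and \(\lambda ^*\); the numbers themselves carry no meaning.

```python
# Toy NumPy illustration of the GIST-K and the eigenvalues used in Theorem 1. A random PSD
# matrix stands in for the (infinite-width) NTK Gram matrix H_inf; values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n = 6

A = (rng.random((n, n)) < 0.5).astype(float)
A = np.minimum(A + A.T + np.eye(n), 1.0)              # symmetric adjacency with self-loops
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
A_bar = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :] # degree-normalized adjacency

B = rng.standard_normal((n, n))
H_inf = B @ B.T / n                                   # random PSD stand-in for H^infinity

G_inf = A_bar @ H_inf @ A_bar                         # GIST-K: G = A_bar H A_bar
eigs = np.linalg.eigvalsh(G_inf)                      # G_inf is symmetric, so eigvalsh applies
lambda_0, lambda_star = eigs[0], eigs[-1]             # minimum and maximum eigenvalues
print(lambda_0, lambda_star)
```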

Theorem 1

Suppose Assumptions 2–4 hold. Moreover, suppose that in each global iteration the masks are generated from a categorical distribution with uniform mean \(\frac{1}{m}\). Fix the number of global iterations to T and local iterations to \(\zeta \). Consider a two-layer GCN with parameters \(\varvec{\Theta }\). If each entry of \(\varvec{\Theta }\) is initialized i.i.d. from \(\mathcal {N}(0,\kappa ^2\textbf{I})\), and the number of hidden neurons satisfies \(d_1\ge \Omega \left( \frac{T^2\zeta ^2 n}{\lambda _0^4(1-\gamma )^2}\max \left\{ \frac{n^3}{\delta ^2\kappa ^2}, \frac{n^2d}{\delta ^2}\left\| \bar{\textbf{A}}^2\right\| _{1,1}, T^2\lambda ^{*2}d\right\} \right) \), then procedure (4) with constant step size \(\eta = \mathcal {O}\left( \tfrac{\lambda _0}{n\left\| \bar{\textbf{A}}^2\right\| _{1,1}}\right) \) converges according to

$$\begin{aligned} \mathbb {E}_{[\mathcal {M}_{t-1}]}\left[ \left\| \textbf{y}- \hat{\textbf{y}}(t)\right\| _2^2\right]&\le \left( \gamma + (1-\gamma )\left( 1 - \frac{\eta \lambda _0}{2}\right) ^\zeta \right) ^t\left\| \textbf{y}- \hat{\textbf{y}}(0)\right\| _2^2 +\\&\quad \quad \quad \mathcal {O}\left( \frac{\gamma ^2d\kappa ^2\lambda ^{*2}}{m^2(1-\gamma )\lambda _0^2}\left\| \bar{\textbf{A}}^2\right\| _{1,1}\right) \end{aligned}$$

with probability at least \(1 - \delta \), where \(\gamma = \left( 1 - m^{-1}\right) ^{\frac{1}{3}}\).

A full proof of this result is deferred to “Appendix C”, but a sketch of the techniques used is as follows:

  1. We define the GIST-K and show that it remains positive definite throughout training, given our assumptions and sufficient overparameterization.

  2. We show that local sub-GCN training converges linearly, given a positive definite GIST-K.

  3. We analyze the change in training error when sub-GCNs are sampled (subGCNs), locally trained (subTrain), and aggregated (subAgg).

  4. We establish a connection between local and aggregated weight perturbation, showing that the network parameters remain within a small region centered around the initialization, given sufficient overparameterization.

Discussion. Stated intuitively, the result in Theorem 1 shows that, given sufficient width, two-layer GCNs trained using \(\texttt {GIST}\) converge to approximately zero training error. The convergence rate is linear and on par with training the full, two-layer GCN model (i.e., without the feature partition utilized in GIST), up to an error neighborhood. Notice that choosing a smaller initialization scale \(\kappa \) results in a smaller error neighborhood but also a larger overparameterization requirement. This theory shows that the feature partitioning strategy of GIST does not cause the model to diverge in training. Additionally, the theory suggests that wider GCN models should be used to maximize the convergence rate of GIST and minimize the impact of the additive term within Theorem 1. These findings reflect practical observations made within Sect. 5 and reveal that GIST is particularly suited to training extremely wide models that cannot be trained using a traditional, centralized approach on a single GPU due to limited memory capacity.

5 Experiments

Table 1 Test accuracy of GCN models trained on small-scale datasets with GIST

We use GIST to train different GCN architectures on six public node classification datasets; see “Appendix A” for details. In most cases, we compare the performance of models trained with GIST to that of models trained with standard methods (i.e., single GPU with node partitioning). Comparisons to models trained with other distributed methodologies are also provided in “Appendix B”. Experiments are divided into small- and large-scale regimes based upon graph size. The goal of GIST is to (i) train GCN models to state-of-the-art performance, (ii) minimize wall-clock training time, and (iii) enable training of very wide GCN models.

5.1 Small-scale experiments

In this section, we perform experiments over Cora, Citeseer, Pubmed, and OGBN-Arxiv datasets (Sen et al. 2008; Hu et al. 2020). For these small-scale datasets, we train a three-layer, 256-dimensional GCN model (Kipf and Welling 2016) with GIST; see “Appendix A.3” for further experimental settings. All reported metrics are averaged across five separate trials. Because these experiments run quickly, we use them to analyze the impact of different design and hyperparameter choices rather than attempting to improve runtime (i.e., speeding up such short experiments is futile).

Which layers should be partitioned? We investigate whether models trained with GIST are sensitive to the partitioning of features within certain layers. Although the output dimension \(d_3\) is never partitioned, we selectively partition dimensions \(d_0\), \(d_1\), and \(d_2\) to observe the impact on model performance; see Table 1. Partitioning input features (\(d_0\)) significantly degrades test accuracy because sub-GCNs observe only a portion of each node’s input features (i.e., this becomes more noticeable with larger m). However, other feature dimensions cause no performance deterioration when partitioned between sub-GCNs, leading us to partition all feature dimensions other than \(d_0\) and \(d_L\) within the final GIST methodology; see Fig. 2b.

Table 2 Performance of models trained with GIST on Reddit and Amazon2M

How many Sub-GCNs to use? Using more sub-GCNs during GIST training typically improves runtime because sub-GCNs (i) become smaller, (ii) are each trained for fewer epochs, and (iii) are trained in parallel. We find that all models trained with GIST perform similarly for practical settings of m; see Table 1. One may continue increasing the number of sub-GCNs used within GIST until all GPUs are occupied or model performance begins to decrease.

GIST Performance. Models trained with GIST often exceed the performance of models trained with standard, single-GPU methodology; see Table 1. Intuitively, we hypothesize that the random feature partitioning within GIST, which loosely resembles dropout (Srivastava et al. 2014), provides regularization benefits during training, but some insight into the favorable performance of GIST is also provided by the theoretical guarantees outlined in Sect. 4.

5.2 Large-scale experiments

For large-scale experiments on Reddit and Amazon2M, the baseline model is trained on a single GPU and compared to models trained with GIST in terms of F1 score and training time. All large-scale graphs are partitioned into 15 000 sub-graphs during training.Footnote 5 Graph partitioning is mandatory because the training graphs are too large to fit into memory. One could instead use layer sampling to make training tractable (see Sect. 5.4), but we adopt graph partitioning in most experiments because the implementation is simple and performs well.

Reddit Dataset. We perform tests with 256-dimensional GraphSAGE (Hamilton et al. 2017) and GAT (Veličković et al. 2017) models with two to four layers on Reddit; see “Appendix A.4” for more details. As shown in Table 2, utilizing GIST significantly accelerates GCN training (i.e., a \(1.32\times \) to \(7.90\times \) speedup). GIST performs best in terms of F1 score with \(m=2\) sub-GCNs (i.e., \(m=4\) yields further speedups but F1 score decreases). Interestingly, the speedup provided by GIST is more significant for models and datasets with larger compute requirements. For example, experiments with the GAT architecture, which is more computationally expensive than GraphSAGE, achieve a near-linear speedup with respect to m.

Amazon2M Dataset. Experiments are performed with two, three, and four-layer GraphSAGE models (Hamilton et al. 2017) with hidden dimensions of 400 and 4 096 (we refer to these models as “narrow” and “wide”, respectively). We compare the performance (i.e., F1 score and wall-clock training time) of GCN models trained with standard, single-GPU methodology to that of models trained with GIST; see Table 2. Narrow models trained with GIST have a lower F1 score in comparison to the baseline, but training time is significantly reduced. For wider models, GIST provides a more significant speedup (i.e., up to \(7.12\times \)) and tends to achieve comparable F1 score in comparison to the baseline, revealing that GIST works best with wider models.

Within Table 2, models trained with GIST tend to achieve a wall-clock speedup at the cost of a lower F1 score (i.e., observe the speedups marked with parentheses in Table 2). When training time is analyzed with respect to a fixed F1 score, we observe that the baseline takes significantly longer than GIST to reach a given F1 score. For example, when \(L=2\), a wide GCN trained with GIST (\(m=8\)) reaches an F1 score of 88.86 in \(\sim 4~000\) seconds, while models trained with standard methodology take \(\sim 10~000\) seconds to achieve a comparable F1 score. As such, GIST significantly accelerates training relative to model performance.

Table 3 Performance of GraphSAGE models of different widths trained with GIST on Amazon2M

5.3 Training ultra-wide GCNs

We use GIST to train GraphSAGE models with widths as high as 32 768 (i.e., \(\varvec{8\times }\) beyond the capacity of a single GPU); see Table 3 for results and “Appendix A.5” for more details. Considering \(L=2\), the best-performing, single-GPU GraphSAGE model (\(d_i=4~096\)) achieves an F1 score of 90.58 in 5.2 hours. With GIST (\(m=2\)), we achieve a higher F1 score of 90.87 in 2.8 hours (i.e., a \(1.86\times \) speedup) using \(d_i=8~192\), which is beyond single-GPU capacity. Similar patterns are observed for deeper models. Furthermore, we find that utilizing larger hidden dimensions yields further performance improvements, revealing the utility of wide, overparameterized GCN models. GIST, due to its feature partitioning strategy, is unique in its ability to train models of such scale to state-of-the-art performance.

5.4 GIST with layer sampling

As previously mentioned, some node partitioning approach must be adopted to avoid memory overflow when the underlying training graph is large. Although graph partitioning is used within most experiments (see Sect. 5.2), GIST is also compatible with other node partitioning strategies. To demonstrate this, we perform training on Reddit using GIST combined with a recent layer sampling approach (Zou et al. 2019) (i.e., instead of graph partitioning); see “Appendix A.6” for more details.

As shown in Table 4, combining GIST with layer sampling enables training on large-scale graphs, and the observed speedup actually exceeds that of GIST with graph partitioning. For example, GIST with layer sampling yields a \(1.83\times \) speedup when \(L=2\) and \(m=2\), in comparison to a \(1.50\times \) speedup when graph partitioning is used within GIST (see Table 2). As the number of sub-GCNs is increased beyond \(m=2\), GIST with layer sampling continues to achieve improvements in wall-clock training time (e.g., speedup increases from \(1.83\times \) to \(2.90\times \) from \(m=2\) to \(m=4\) for \(L=2\)) without significant deterioration to model performance. Thus, although node partitioning is needed to enable training on large-scale graphs, the feature partitioning strategy of GIST is compatible with numerous sampling strategies (i.e., not just graph sampling).

Table 4 Performance of GCN models trained with a combination of GIST and LADIES (Zou et al. 2019) on Reddit

6 Future work

There are a few notable extensions of GIST that may be especially useful to the research community. We leave these extensions as future work, as they go beyond the core focus of our proposal: formulating an easy-to-use, efficient training framework for large-scale experiments with GCNs.

GCNs with Edge Features. Recent work has explored using edge features within the GCN architecture (Gong and Cheng 2019; Jiang et al. 2020; Bergen et al. 2021). Given that GIST can be applied to any GCN architecture, we argue that (i) GIST is similarly compatible with architectural variants that exploit edge features and (ii) using edge features within the graph could yield further performance benefits.

To understand why such techniques would be compatible, we emphasize that—similar to node partitioning—edge features operate orthogonally to the model partitioning performed by GIST. For example, the EGNN model (Gong and Cheng 2019) injects edge information into the GCN model via the adjacency matrix at each layer, which modifies node representations and their relationships within the graph. As shown in Fig. 3, GIST simply partitions the feature space of each individual node within the hidden layers of the GCN, which has no impact or dependence on node or edge information within the underlying graph.

Deeper GCNs. Our analysis focuses on the exploration of wide, rather than deep, GCNs due to the presence of oversmoothing in deep GCNs (Li et al. 2018). However, GIST is applicable to GCN architectures of any depth—the feature partitioning strategy is simply applied separately to each layer. To further reduce the memory overhead of deeper GCN models, one could explore extensions of GIST that combine layer and feature partitioning strategies. Such a variant would independently train narrow sub-GCNs that contain only a small fraction of the global model’s total layers. Layer partitioning strategies—without feature partitioning—have already been shown to be effective for IST-based training of convolutional neural networks with residual connections (Dun et al. 2022).

More Settings. The analysis of GIST could be extended to alternative tasks (e.g., link prediction) and larger-scale datasets. However, performing experiments over datasets larger than Amazon2M is difficult due to the lack of moderately large-scale graphs that are publicly available. For example, the only graph larger than Amazon2M provided via the Open Graph Benchmark (Hu et al. 2020) is Papers100M, which requires 256 GB of CPU RAM to load.

7 Conclusions

We present GIST, a distributed training approach for GCNs that enables the exploration of larger models and datasets. GIST is compatible with existing sampling approaches and leverages a feature-wise partition of model parameters to construct smaller sub-GCNs that are trained independently and in parallel. We have shown that GIST achieves remarkable speed-ups over large graph datasets and even enables the training of GCN models of unprecedented size. We hope GIST can empower the exploration of larger, more powerful GCN architectures within the graph community.

Supplementary information

All code for this project is publicly available on GitHub at the following link: https://github.com/wolfecameron/GIST.