Almost universally optimal distributed Laplacian solvers via low-congestion shortcuts

In this paper, we refine the (almost) existentially optimal distributed Laplacian solver of Forster, Goranci, Liu, Peng, Sun, and Ye (FOCS '21) into an (almost) universally optimal distributed Laplacian solver. Specifically, when the topology is known (i.e., the Supported-CONGEST model), we show that any Laplacian system on an n-node graph with shortcut quality SQ(G) can be solved after n^{o(1)} SQ(G) log(1/ε) rounds, where ε > 0 is the required accuracy. This almost matches our lower bound that guarantees that any correct algorithm on G requires Ω̃(SQ(G)) rounds, even for a crude solution with ε ≤ 1/2.
Several important implications hold in the unknown-topology (i.e., standard CONGEST) case: for excluded-minor graphs we get an almost universally optimal algorithm that terminates in D · n^{o(1)} log(1/ε) rounds, where D is the hop-diameter of the network; as well as n^{o(1)} log(1/ε)-round algorithms for the case of SQ(G) ≤ n^{o(1)}, which holds for most networks of interest. Moreover, following a recent line of work in distributed algorithms, we consider a hybrid communication model which enhances CONGEST with limited global power in the form of the node-capacitated clique model. In this model, we show the existence of a Laplacian solver with round complexity n^{o(1)} log(1/ε).
The unifying thread of these results, and our main technical contribution, is the development of near-optimal algorithms for a novel ρ-congested generalization of the standard part-wise aggregation problem, which could be of independent interest.


Introduction
The Laplacian paradigm has emerged as one of the cornerstones of modern algorithmic graph theory. Integrating techniques from combinatorial optimization with powerful machinery from numerical linear algebra, it was originally pioneered in [ST14], who established the first nearly-linear time solvers for a (linear) Laplacian system. Thereafter, there has been a considerable amount of interest in providing simpler and more efficient solvers [KMP14; Kel+13; KS16]. Indeed, this framework has led to some state-of-the-art algorithms for a wide range of fundamental graph-theoretic problems; e.g., see [AMV21; Mad16; Coh+17; Bra+20; Kel+14; Pen16; AMV20], and references therein. In the distributed setting, a major breakthrough was very recently made in [For+20]. In particular, the authors developed a distributed algorithm that solves any Laplacian system on an n-node graph after n^{o(1)} (√n + D) log(1/ε) rounds of the standard CONGEST model, where D represents the hop-diameter of the underlying network and ε > 0 is the error of the solver. Moreover, they showed that their algorithm is existentially optimal, up to the n^{o(1)} factor, establishing a lower bound of Ω(√n + D) rounds via a reduction from the s-t connectivity problem [Das+11].
This existential lower bound in the CONGEST model of distributed computing should hardly come as any surprise. Indeed, it is well-known by now that a remarkably wide range of global optimization problems, including minimum spanning tree (MST), minimum cut (Min-Cut), maximum flow, and single-source shortest paths (SSSP), require Ω(√n + D) rounds [PR99; Elk04; Das+11]. The same limitation generally applies to any non-trivial approximation and even under randomization. Nonetheless, these lower bounds are constructed on some pathological graph instances which arguably do not occur in practice. This raises the question: Can we obtain more refined performance guarantees based on the underlying topology of the communication network? The framework of low-congestion shortcuts, introduced by [GH16], demonstrated that bypassing the notorious Ω(√n) lower bound is possible: MST and Min-Cut on planar graphs can be solved in Õ(D) rounds. This is crucial, given that in many graphs of practical significance the diameter is remarkably small; e.g., D = polylog(n) (as is folklore, this holds for most social networks), implying exponential improvements over generic algorithms used for general graphs. In the context of the distributed Laplacian paradigm, we raise the following question: Is there a faster distributed Laplacian solver under "non-worst-case" families of graphs in the CONGEST model?
The only known technique in distributed computing for designing algorithms that go below the √n-bound is the low-congestion shortcut framework of Ghaffari and Haeupler [GH16], together with the large ecosystem of tools built around it [HIZ16a; HIZ16b; HWZ21; GH20; Zuz+22; GZ22; HRG22]. However, the "ρ-congested minor" primitive introduced and extensively used in the novel distributed Laplacian solver [For+20] is out of reach of the current set of tools available in the low-congestion shortcut framework. We address this issue by introducing an analogous primitive called ρ-congested part-wise aggregation, which greatly simplifies the interface used by [For+20]. We then extend the low-congestion shortcut framework with new techniques that enable it to near-optimally solve this primitive: we provide both an algorithm that utilizes the very recent hop-constrained expander decompositions for shortcut construction [HRG22] to solve the primitive in general graphs with a linear dependence on ρ, as well as a very simple algorithm with a quadratic ρ-dependence for bounded-treewidth graphs. Finally, we settle our original question in the positive by establishing that our new primitive can be readily used to accelerate the distributed Laplacian solver for non-worst-case topologies. Specifically, we show our new techniques are sufficient to lift the existentially optimal algorithm [For+20] to a universally optimal algorithm (modulo the n^{o(1)} factor inherent in the prior approach) for distributedly solving a Laplacian system, meaning that, for any topology, our algorithm is essentially as fast as possible. In other words, for any graph, our algorithm almost matches the best possible (correct) algorithm for that graph. This result is unconditional in essentially all settings of interest (see Theorem 1.2 for details), but, like all other results in the area, relies on conjectured improvements of current state-of-the-art constructions of low-congestion shortcuts to achieve unqualified universal optimality.
Furthermore, another concrete way of bypassing the Ω(√n + D) lower bound, besides investigating non-worst-case families of graphs, is by enhancing the local communication network with a limited amount of global power. Indeed, research concerning hybrid networks was recently initiated in the realm of distributed algorithms [Aug+20], although networks combining different communication modes have already found numerous applications in real-life computing systems; as such, hybrid networks have been intensely studied in other areas of distributed computing (see [CGC16; Wan+10; KS18], and references therein). In this paper, we will enhance the standard CONGEST model with the recently introduced node-capacitated clique (henceforth NCC) [Aug+19]. The latter model enables all-to-all communication, but with severe capacity restrictions for every node. The integration of these models will be referred to as the HYBRID model for the rest of this work. This leads to the following central question: Is there a faster distributed Laplacian solver in the HYBRID model?
Our paper essentially settles this question by showing that the same ρ-congested part-wise aggregation primitive can be efficiently solved in Õ(ρ) rounds of NCC, implying an almost optimal n^{o(1)}-round distributed algorithm for solving Laplacian systems in the HYBRID model. A conceptual contribution of our approach is that we treat CONGEST, Supported-CONGEST, and HYBRID in a unified way through the lens of the low-congestion shortcut framework, by designing our algorithm using high-level primitives and leaving the model-specific translations to the framework itself. We note that a similar unified view of PRAM (i.e., parallel) and CONGEST (i.e., distributed) graph algorithms through the same lens has led to very recent breakthroughs on long-standing open problems for both of these settings [Li22].

Overview of our Contributions and Techniques
The unifying thread and the main technical ingredient of our (almost) universally optimal distributed Laplacian solvers is a new fundamental communication primitive which we refer to as the congested part-wise aggregation problem. Specifically, we develop near-optimal algorithms for solving this problem in the (Supported-)CONGEST and the NCC model (Section 3), and then we utilize this primitive to develop almost universally optimal Laplacian solvers in Section 4.

The Congested Part-Wise Aggregation Problem
To introduce the congested part-wise aggregation problem, let us first give some basic background. The aforementioned Ghaffari-Haeupler framework of low-congestion shortcuts revolves around the so-called part-wise aggregation problem posed as follows: "The graph is partitioned into disjoint and individually-connected parts, and we need to compute some simple aggregate function for each part, e.g., the minimum of the values held by the nodes in a given part" [GH16] (see Definition 2.1 for a formal definition). Importantly, it has been shown that this primitive can be solved efficiently in structured topologies, and that many problems (including the MST, shortest path, min-cut, etc.) reduce to a small number of calls to a part-wise aggregation oracle, leading to universally optimal algorithms. Unfortunately, it is not clear how to reduce solving a Laplacian system to (a small number of) part-wise aggregation calls, and in this paper we primarily address this issue.
Our first technical contribution is to extend the framework of low-congestion shortcuts by studying a more general primitive: one that incorporates congestion (of the input parts) into the underlying part-wise aggregation instance. More precisely, unlike the standard part-wise aggregation problem, we allow each node to participate in up to ρ ∈ ℤ_{≥1} aggregation parts (see Definition 3.1). We later show that efficient solutions to this primitive lead to efficient distributed Laplacian solvers.
We first remark that a natural strategy for solving congested part-wise aggregation instances does not work: congested instances cannot, in general, be directly reduced to a "small" collection of 1-congested instances, thereby necessitating a more refined approach. To this end, our approach is based on "lifting" the underlying communication network G into its ρ-layered version G_{O(ρ)}: every edge is replaced with a matching and every node with a ρ-clique. The importance of this transformation is that, as we show in Lemma 3.3, the ρ-congested part-wise aggregation problem can be reduced to a 1-congested instance on the ρ-layered graph (Section 3.1.1). This is first established under the assumption that individual parts correspond to simple paths, and then we extend our results to general parts by following Haeupler, Wajc, and Zuzic [HWZ21]. In light of this reduction, we next focus on solving the 1-congested part-wise aggregation instance on the layered graph.
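As a rough illustration of the layering step, the following Python sketch builds a ρ-layered graph from a node list and an edge list. It assumes one plausible reading of the construction: each node v becomes copies (v, 0), …, (v, ρ − 1) joined into a clique, and each edge {u, v} becomes the matching that joins (u, i) with (v, i) in every layer i. The function name `layered_graph` and the data representation are hypothetical and not taken from the paper.

```python
from itertools import combinations


def layered_graph(nodes, edges, rho):
    """Return (layered_nodes, layered_edges) for a rho-layered copy of a graph.

    Assumption for illustration: copy i of u is matched with copy i of v
    for every original edge {u, v}.
    """
    layered_nodes = [(v, i) for v in nodes for i in range(rho)]
    layered_edges = set()
    # Every original node becomes a rho-clique on its copies.
    for v in nodes:
        for i, j in combinations(range(rho), 2):
            layered_edges.add(frozenset({(v, i), (v, j)}))
    # Every original edge becomes a matching between same-layer copies.
    for u, v in edges:
        for i in range(rho):
            layered_edges.add(frozenset({(u, i), (v, i)}))
    return layered_nodes, layered_edges


# Example: the path a - b - c with rho = 2.
layer_nodes, layer_edges = layered_graph(
    ["a", "b", "c"], [("a", "b"), ("b", "c")], rho=2
)
```

In this example the layered graph has 6 nodes and 7 edges: one clique edge per original node and two matching edges per original edge.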
As a warm-up, we treat graphs with bounded treewidth tw(G) (Definition 2.8). It is known from [HIZ16b] that on a graph G with treewidth tw(G), a 1-congested part-wise aggregation instance can be solved in Õ(tw(G) D) rounds of CONGEST. Keeping this in mind, we first show that the treewidth of the ρ-layered graph G_ρ can only increase by a factor of ρ compared to the original graph (Lemma 3.8). Hence, we can solve 1-congested instances in G_{O(ρ)} in Õ(ρ tw(G) D) rounds (when the underlying network is G_{O(ρ)}), which in turn allows us to solve ρ-congested instances on G in Õ(ρ^2 tw(G) D) time in G (another ρ factor is necessary to simulate G_{O(ρ)} in G). This positive result poses a natural question: can we achieve similar results on graphs with bounded minor density δ(G) (Definition 2.6)? However, the answer to this question is negative: minor density can blow up even for a 2-layered planar graph (see Observation 3.10), making such a result impossible.
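The round count for bounded-treewidth graphs can be spelled out as a short chain of bounds. The block below only restates the quantities quoted in the paragraph above (the treewidth bound of Lemma 3.8 and the factor-ρ simulation overhead); the symbols T_1 and T_ρ are shorthand introduced here, logarithmic factors are suppressed, and the hop-diameter of the layered graph is assumed to stay within a constant factor of D.

```latex
\begin{align*}
\mathrm{tw}\bigl(G_{O(\rho)}\bigr)
  &= O\bigl(\rho \cdot \mathrm{tw}(G)\bigr)
  && \text{(Lemma 3.8)} \\
T_{1}\bigl(G_{O(\rho)}\bigr)
  &= O\bigl(\mathrm{tw}(G_{O(\rho)}) \cdot D\bigr)
   = O\bigl(\rho \cdot \mathrm{tw}(G) \cdot D\bigr)
  && \text{(1-congested instance, rounds of } G_{O(\rho)}\text{)} \\
T_{\rho}(G)
  &= O(\rho) \cdot T_{1}\bigl(G_{O(\rho)}\bigr)
   = O\bigl(\rho^{2} \cdot \mathrm{tw}(G) \cdot D\bigr)
  && \text{(simulating } G_{O(\rho)} \text{ in } G\text{)}
\end{align*}
```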
Then, we look at arbitrary graphs G: it is known that 1-congested part-wise aggregation instances can be solved in a number of rounds that is controlled by SQ(G), where SQ(G) is the shortcut quality of G (a certain graph parameter we formalize in Definition 2.4). Specifically, it can be solved in Õ(SQ(G)) rounds when the topology is known in advance [HWZ21] and in poly(SQ(G)) · n^{o(1)} rounds in general CONGEST [HRG22]. The shortcut quality parameter is significant because it was shown that many distributed problems (including the MST, shortest path, min-cut, and, as we show later, Laplacian solving) require Ω̃(SQ(G)) rounds in CONGEST to be solved on G [HWZ21]. Therefore, algorithms that have an upper bound close to SQ(G) are universally optimal.
With the end goal of solving the 1-congested part-wise aggregations on layered graphs G_ρ in time controlled by SQ(G), our main result establishes that the shortcut quality of the ρ-layered graph does not increase (modulo polylogarithmic factors) as compared to the original graph (Theorem 3.11). This has a plethora of important consequences: (1) when SQ(G) ≤ n^{o(1)}, we can unconditionally solve ρ-congested part-wise aggregation instances in ρ · n^{o(1)} CONGEST rounds and (2) when the topology of G is known, there exists a distributed algorithm which solves any ρ-congested part-wise aggregation problem in ρ · Õ(SQ(G)) rounds. As a consequence of our general result, the shortcut quality of any 2-layered planar graph is Õ(D), since it is known that the shortcut quality of a planar graph is Õ(D) [GH16]. This constitutes perhaps the most natural example of a graph whose minor density is very far from the shortcut quality; the only other example documented in the literature so far is that of expander graphs.
Our proof proceeds by employing alternative characterizations of the shortcut quality in terms of certain communication tasks. Specifically, shortcut quality can be shown to be equal (modulo polylogarithmic factors) to the following two-player max-min game: the first (max) player chooses k sources and k sinks in the graph such that we can find k node-disjoint paths matching the sources with the sinks; then the second (min) player finds the smallest so-called quality Q such that there exist k paths matching the sources with the sinks with the path lengths being at most Q and each edge of the underlying graph supporting at most Q of the second player's paths. This characterization allows us to compare the shortcut quality of G_ρ with that of G as follows: take the worst-case (first player's) set of sources and sinks in G_ρ. Project them to G and note they have node congestion ρ (due to the construction of G_ρ). Then, we show we can decompose (i.e., partition) this set of sources and sinks into O(ρ) pairs of sub-sources and sub-sinks that are node-disjointly connectable in G. However, each such set enjoys paths of quality SQ(G), hence embedding each such pair in a separate layer of G_ρ shows that the shortcut quality SQ(G_ρ) is at most Õ(SQ(G)). Although this general approach improves over our result for treewidth-bounded graphs we previously described, our approach for the latter class of graphs is substantially simpler and more suited for potential practical applications.
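To make the second player's objective concrete, the sketch below evaluates the quality of one given family of paths as the larger of the maximum path length (in hops) and the maximum number of paths sharing an edge. It does not search for the best family, and it ignores the polylogarithmic factors in the actual characterization. The function name `path_family_quality` and the representation of paths as node lists are hypothetical choices made for illustration.

```python
from collections import Counter


def path_family_quality(paths):
    """Smallest Q such that every path has at most Q hops and
    every undirected edge is used by at most Q of the paths."""
    edge_load = Counter()
    longest = 0
    for path in paths:
        hops = list(zip(path, path[1:]))
        longest = max(longest, len(hops))
        # Count an edge once per path, even if the path revisits it.
        for edge in {frozenset(hop) for hop in hops}:
            edge_load[edge] += 1
    congestion = max(edge_load.values(), default=0)
    return max(longest, congestion)


# Four source-sink paths that all cross the same bottleneck edge x - y.
paths = [
    ["s1", "x", "y", "t1"],
    ["s2", "x", "y", "t2"],
    ["s3", "x", "y", "t3"],
    ["s4", "x", "y", "t4"],
]
quality = path_family_quality(paths)
```

Here each path has 3 hops but the edge {x, y} carries all 4 paths, so the quality of this family is 4, dominated by congestion rather than length.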

Almost Universally Optimal Laplacian Solvers
First, we note that any distributed Laplacian solver that always correctly outputs an answer on a fixed graph G must take at least Ω̃(SQ(G)) rounds, giving us a lower bound to compare ourselves with. Our refined lower bound uses the hardness result recently shown by [HWZ21] for the spanning connected subgraph problem, applicable for any (i.e., non-worst-case) graph G. Specifically, we show that a Laplacian solver can be leveraged to solve the spanning connected subgraph problem, thereby substantially strengthening the lower bound in [For+20].
Proposition 1.1. Consider a graph G with shortcut quality SQ(G). Then, solving a Laplacian system on G with ε ≤ 1/2 requires Ω̃(SQ(G)) rounds in both CONGEST and Supported-CONGEST models.
On the upper-bound side, we utilize the congested part-wise aggregation primitive to improve and refine the Laplacian solver of [For+20], leading to a substantial improvement in the round complexity under structured network topologies.

Theorem 1.2. Consider any n-node graph G with shortcut quality SQ(G) and hop-diameter D.
There exists a distributed Laplacian solver with error ε > 0 with the following guarantees:
• In the Supported-CONGEST model, it requires n^{o(1)} SQ(G) log(1/ε) rounds.
We note that the above algorithm is almost (up to inherent n^{o(1)} factors) universally optimal for most settings of interest. Since it is (almost) matching the SQ(G)-lower-bound, it is unconditionally universally optimal when the topology is known in advance (i.e., Supported-CONGEST). Furthermore, in standard CONGEST, we give almost universally optimal D n^{o(1)} log(1/ε)-round algorithms for topologies that include planar graphs, n^{o(1)}-genus graphs, n^{o(1)}-treewidth graphs, and excluded-minor graphs, since all of them are graphs with minor density δ(G) = n^{o(1)}. Furthermore, for the realistic case of D ≤ n^{o(1)}, it holds for most networks of interest that SQ(G) ≤ n^{o(1)} (e.g., expanders, hop-constrained expanders, as well as all classes mentioned earlier), for which we get n^{o(1)} log(1/ε)-round solvers. Finally, the conjectured improvements of the state-of-the-art almost-optimal low-congestion shortcut constructions would immediately lift our results to be unconditionally universally optimal in CONGEST. However, this issue is orthogonal to and out of scope for this paper.
Furthermore, in HYBRID we obtain an almost optimal complexity in general graphs:
Theorem 1.3. Consider any n-node graph. There exists a distributed Laplacian solver in the HYBRID model with round complexity n^{o(1)} log(1/ε), where ε > 0 is the error of the solver.
This implies a remarkably fast subroutine for solving a Laplacian system in HYBRID under arbitrary topologies. As a result, we corroborate the observation that a very limited amount of global power can lead to substantially faster algorithms for certain optimization problems, supplementing a recent line of work [CLP21b; Aug+20; KS20; FHS20; CLP21a; Göt+21; KS22; Coy+22]. Furthermore, our framework based on the congested part-wise aggregation problem allows for a unifying treatment of both (Supported-)CONGEST and HYBRID, and we consider this to be an important conceptual contribution of our work. Indeed, as we previously explained, both of our accelerated Laplacian solvers rely on faster algorithms for solving the congested part-wise aggregation problem. In particular, for (Supported-)CONGEST we have already described our approach in detail, while in the HYBRID model we employ certain communication primitives developed in [Aug+19] for dealing with congestion in part-wise aggregations. A byproduct of our results is that the framework of low-congestion shortcuts interacts particularly well with the HYBRID model, as was also observed in [AG21].

Further Related Work
Our main reference point is the recent Laplacian solver of Forster, Goranci, Liu, Peng, Sun, and Ye [For+20] with existentially almost-optimal complexity of n^{o(1)} (√n + D) log(1/ε) rounds, where ε > 0 represents the error of the solver. Specifically, they devised several new ideas and techniques to circumvent certain issues which mostly relate to the bandwidth restrictions of the CONGEST model; these building blocks, as well as the resulting Laplacian solver, are revisited in our work to refine the performance of the solver. We are not aware of any previous research addressing this problem in the distributed context. On the other hand, the Laplacian paradigm has attracted a considerable amount of interest in the community of parallel algorithms. Most notably, we refer to [PS14; Ble+14]. These approaches in the PRAM model of parallel computing fail, at least without non-trivial modifications, to lead to an almost-optimal solver in the distributed context [For+20].
In addition to being a problem of independent interest, solving Laplacian systems often leads to a plethora of very fast algorithms (albeit typically polynomially away from being optimal) for other problems such as (exact) maximum flow [Mad16], min-cost flow [AMV21], shortest paths with negative weights [Coh+17], etc. The recent distributed Laplacian solver [For+20] also contributed fast analogues of these algorithms in the distributed model. A natural question to ask is whether we can also use our techniques to make these algorithms work for more structured graphs. However, these algorithms rely on directed or exact shortest path computations, which currently represent a major barrier for shortcut-based approaches. Moreover, the same set of problems represents a barrier even for existentially-optimal approaches, as the current state-of-the-art is a factor of D^{1/4} away from achieving unqualified existential optimality [CM21].
Research concerning hybrid communication networks in distributed algorithms was recently initiated by [Aug+20]. Specifically, they investigated the power of a model which integrates the standard LOCAL model [Lin92] with the recently introduced node-capacitated clique (NCC) [Aug+19], focusing mostly on distance computation tasks. Several of their results were improved and strengthened in subsequent works [KS20; CLP21a] under the same model of computation. In our work we consider a substantially weaker model, imposing a severe limitation on the communication over the "local edges". This particular variant has been already studied in some recent works for a variety of fundamental problems [FHS20; Göt+21].
The NCC model, which captures the global network in all hybrid models studied thus far, was introduced in [Aug+19] partly to address the unrealistic power of the congested clique (CLIQUE) [Lot+03]. In the latter model each node can communicate concurrently and independently with all other nodes by O(log n)-bit messages. In contrast, the NCC model allows communication with O(log n) (arbitrary) nodes per round. As a result, in the HYBRID model and under a sparse local network, only Θ̃(n) bits can be exchanged overall per round, whereas CLIQUE allows for the exchange of up to Θ̃(n^2) (distinct) bits. As evidence for the power of CLIQUE we note that even slightly super-constant lower bounds would give new lower bounds in circuit complexity, as implied by a simulation argument in [DKO14].

Preliminaries
General notation. We denote with [k] := {1, 2, …, k}. Graphs throughout this paper are undirected. The nodes and the edges of a given graph G are denoted as V(G) and E(G), respectively. We also use n := |V(G)| for brevity. The graphs are often weighted, in which case we assume (as is standard) that for all e ∈ E(G), w(e) ∈ {1, 2, …, poly(n)}. We will denote the hop-diameter of a graph G with D(G) (the hop-diameter ignores weights). Moreover, we use A ⊔ B to denote the multiset union, i.e., each element is repeated according to its multiplicity; this operation corresponds to disjoint unions when A ∩ B = ∅.

Communication models
The communication network consists of a set of n entities with [n] := {1, 2, …, n} being the set of their IDs, and a local communication topology given by a graph G. We define D := D(G) to be the (hop-)diameter of the underlying network. At the beginning, each node knows its own unique O(log n)-bit identifier as well as the weights of the incident edges. Communication occurs in synchronous rounds, and in every round nodes have unlimited computational power to process the information they possess. We will consider models with both local and global communication modes.
The local communication mode will be modeled with the CONGEST model [Pel00] and Supported-CONGEST model [SS13], for which in each round every node can exchange an O(log n)-bit message with each of its neighbors in G via the local edges. In the (standard) CONGEST model, each node v ∈ V(G) initially only knows the identifiers of each node in v's own neighborhood, but has no further knowledge about the topology of the graph. On the other hand, in the Supported-CONGEST model, all nodes know the entire topology of G upfront, but not the input.
The global communication mode will be modeled using NCC [Aug+19], for which in each round every node can exchange O(log n)-bit messages with O(log n) arbitrary nodes via global edges. If the capacity of some channel is exceeded, i.e., too many messages are sent to the same node, it will only receive an arbitrary (potentially adversarially selected) subset of the information based on the capacity of the network; the rest of the messages are dropped. In this context, we will let HYBRID be the integration of CONGEST and NCC (i.e., nodes have both a local and a global communication mode at their disposal).
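The receive-side capacity rule can be illustrated with a small single-round simulation. This is only a sketch: the function `ncc_round`, the explicit `capacity` parameter standing in for the O(log n) bound, the send-side check, and the choice to keep the earliest messages in iteration order (the model permits any, possibly adversarial, subset) are all assumptions made for illustration.

```python
def ncc_round(outgoing, capacity):
    """Simulate one round of global communication.

    outgoing maps sender -> list of (receiver, message) pairs.
    Returns (delivered, dropped): delivered maps receiver -> list of
    (sender, message) pairs of length at most `capacity`; dropped lists
    the (sender, receiver, message) triples that exceeded the capacity.
    """
    delivered, dropped = {}, []
    for sender, sends in outgoing.items():
        # Assumed here: a node may also address at most `capacity` nodes.
        if len(sends) > capacity:
            raise ValueError(f"node {sender} exceeds its send capacity")
        for receiver, message in sends:
            inbox = delivered.setdefault(receiver, [])
            if len(inbox) < capacity:
                inbox.append((sender, message))
            else:
                # Overflow: this message is lost (one arbitrary choice).
                dropped.append((sender, receiver, message))
    return delivered, dropped


# Three nodes all write to node 4, which can receive only 2 messages.
outgoing = {1: [(4, "a")], 2: [(4, "b")], 3: [(4, "c"), (1, "d")]}
delivered, dropped = ncc_round(outgoing, capacity=2)
```

Node 4 ends up with two of the three messages addressed to it, which is why algorithms in this model must schedule their global messages so that no receiver is overloaded.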
The performance of a distributed algorithm will be measured in terms of its round complexity, i.e., the number of rounds required so that every node knows its part of the output. For randomized algorithms it will suffice to reach the desired state with high probability. We will assume throughout this work that nodes have access to a common source of randomness; this comes without any essential loss of generality in our setting [Gha15]. When talking about a distributed algorithm for a specific problem (e.g., Laplacian solving, part-wise aggregation, etc.) we assume the input is appropriately distributedly stored (i.e., each node will know its own part) and, upon termination, it will be required that the output is appropriately distributedly stored. The appropriate way to distributedly store the input and output will be explained in the problem definition.

Low-Congestion Shortcuts
A recurring scenario in distributed algorithms for global problems (e.g., MST) boils down to solving the following part-wise aggregation problem: In the part-wise aggregation problem, each node v ∈ V is given its part-ID (if any) and an O(log n)-bit value x(v) as input. The goal is that, for every part P_i, all nodes in P_i learn the part-wise aggregate ⨁_{w∈P_i} x(w), where ⨁ is an arbitrary pre-defined aggregation function.
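For intuition, the input/output behavior of part-wise aggregation can be written as a short centralized reference computation. The sketch below is not a distributed algorithm and says nothing about round complexity; it also does not check that parts are connected. The names `partwise_aggregate`, `part_of`, and `values` are hypothetical.

```python
from functools import reduce


def partwise_aggregate(part_of, values, op):
    """Centralized reference for part-wise aggregation.

    part_of: node -> part-ID, or None if the node is in no part.
    values:  node -> input value x(v).
    op:      binary aggregation function (assumed commutative, associative).
    Returns node -> aggregate of its part, for every node that has a part.
    """
    members = {}
    for v, pid in part_of.items():
        if pid is not None:
            members.setdefault(pid, []).append(v)
    aggregate = {
        pid: reduce(op, (values[v] for v in vs)) for pid, vs in members.items()
    }
    return {v: aggregate[pid] for v, pid in part_of.items() if pid is not None}


part_of = {"a": 1, "b": 1, "c": 2, "d": None}
values = {"a": 5, "b": 3, "c": 7, "d": 9}
result = partwise_aggregate(part_of, values, min)
```

With `op = min`, nodes a and b both learn 3 and node c learns 7, while node d, which belongs to no part, learns nothing.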
Throughout this paper, we will assume that the aggregation function ⨁ is commutative and associative (e.g., min, sum, logical-AND), although this is not strictly needed (e.g., see [GZ22]). To give a concrete example, in the context of Borůvka's algorithm for the MST problem, determining the minimum-weight outgoing edge for each part is an instance of a part-wise aggregation problem with ⨁ := min. To solve such problems, Ghaffari and Haeupler [GH16] introduced a natural combinatorial graph structure which they refer to as low-congestion shortcuts.
Shortcut Quality and Construction of Shortcuts. Shortcut quality, introduced below, is a fundamental graph parameter that has been proven to characterize the complexity of many important problems in distributed computing.
Definition 2.4. Given a graph G = (V, E), we define the shortcut quality SQ(G) of G as the optimal (smallest) shortcut quality of the worst-case partition of V into disjoint and connected parts P_1 ⊔ P_2 ⊔ … ⊔ P_k ⊆ V.
For fundamental problems such as MST, SSSP, and Min-Cut, any correct algorithm requires Ω̃(SQ(G)) rounds on any network G, even if we allow randomized solutions and (non-trivial) approximation factors. In fact, this limitation holds even when the network topology G is known to all nodes in advance [HWZ21]. We remark that Ω(D) ≤ SQ(G) ≤ O(D + √n), and the upper bound is known to be tight in certain (pathological) worst-case graph instances. This explains the notorious (existential) Ω̃(D + √n) lower bound pervasive in distributed computing [Das+11].
Moreover, assuming fast distributed algorithms for constructing shortcuts of quality competitive with SQ(G), all of the aforementioned problems can be solved in Õ(SQ(G)) rounds [GH16; Zuz+22; GZ22]. However, the key issue here is the algorithmic construction of the shortcuts upon which the above papers rely. While there has been a lot of recent progress in this regard, current algorithms are quite complicated and have sub-optimal guarantees. We recall below these state-of-the-art SQ(G)-competitive construction results.

Theorem 2.5. There exists a distributed algorithm that, given any part-wise aggregation instance on any n-node graph G, computes with high probability a shortcut with the following guarantees:
• In CONGEST, the shortcut has quality poly(SQ(G)) · n^{o(1)} and the algorithm terminates in poly(SQ(G)) · n^{o(1)} rounds [HRG22].

• In Supported-CONGEST, the shortcut has quality Õ(SQ(G)) and the algorithm terminates in Õ(SQ(G)) rounds [HWZ21].
Universal Optimality. A distributed algorithm is said to be α-universally optimal if, on every network graph G, it is α-competitive with the fastest correct algorithm on G [HWZ21]. Even the existence of such algorithms is not at all clear, as it would seem possible that vastly different algorithms are required to leverage the structure of different networks. Nevertheless, a remarkable consequence of Theorem 2.5 is that in Supported-CONGEST we can design Õ(1)-universally optimal algorithms for many fundamental optimization problems. Moreover, efficient shortcut construction is the only obstacle towards achieving these results in the full generality of CONGEST, which is an issue orthogonal to and out of scope for this paper. Still, the aforementioned results are sufficient to design n^{o(1)}-universally optimal algorithms on graphs that have shortcut quality SQ(G) = n^{o(1)}, as is arguably the case in most networks of practical interest.

Graphs Excluding Dense Minors
It turns out that the crucial issue of efficient shortcut construction can be resolved with a near-optimal, simple, and even deterministic algorithm for the rich class of graphs with bounded minor density. Formally, let us first recall the following definition.

Definition 2.6 (Minor Density). The minor density δ(G) of a graph G is defined as
$$\delta(G) := \max\left\{ \frac{|E'|}{|V'|} : H = (V', E') \text{ is a minor of } G \right\}.$$

The (linear) dependency on the minor density is existentially optimal [GH20, Lemma 3.2]. It should be noted that, in the context of Theorem 2.7, there is also a deterministic distributed algorithm with a slightly worse guarantee [GH20]. Some of our results apply for communication networks with bounded treewidth, so let us recall the following definition.

Definition 2.8 (Tree Decomposition and Treewidth).
A tree decomposition of a graph G is a tree T with tree-nodes $X_1, \dots, X_k$, where each $X_i$ is a subset of V(G), satisfying the following properties:
1. $V(G) = \bigcup_{i=1}^{k} X_i$;
2. For any node u ∈ V(G), the tree-nodes containing u form a connected subtree of T;
3. For every edge {u, v} ∈ E(G), there exists a tree-node $X_i$ which contains both u and v.
The width w of the tree decomposition is defined as $w := \max_{i \in [k]} |X_i| - 1$. Moreover, the treewidth tw(G) of G is defined as the minimum width among all possible tree decompositions of G.
Bounded-treewidth graphs inherit all of the nice properties guaranteed by Theorem 2.7, as implied by the following well-known fact.

The Congested Part-Wise Aggregation Problem
This section is concerned with a congested generalization of the standard part-wise aggregation problem (Definition 2.1), formally introduced below.
In the ρ-congested part-wise aggregation problem, each node v is given the following as input: for each part $P_i \ni v$, node v knows the part-ID i and an O(log n)-bit part-specific value $x_i(v)$. The goal is that, for each part $P_i$, all nodes in $P_i$ learn the part-wise aggregate $\bigoplus_{w \in P_i} x_i(w)$, where $\bigoplus$ is an arbitrary pre-defined aggregation function.
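To make the input/output format of the problem concrete, the following is a minimal centralized reference sketch in Python. It is not a distributed algorithm; it only computes what every node is supposed to learn. The names `congestion` and `partwise_aggregate`, the dictionary-based encoding of parts and values, and the choice of `min` as the aggregation function are assumptions made for illustration. Connectivity of the parts is not checked here.

```python
from functools import reduce


def congestion(parts):
    """Largest number of parts that share a single node (the parameter rho)."""
    count = {}
    for members in parts.values():
        for v in members:
            count[v] = count.get(v, 0) + 1
    return max(count.values(), default=0)


def partwise_aggregate(parts, values, combine):
    """parts[i] is the node set of part i; values[(i, v)] is the value x_i(v).

    Returns learned[v][i], the aggregate of part i, for every node v in part i.
    `combine` is assumed to be associative and commutative.
    """
    learned = {}
    for i, members in parts.items():
        total = reduce(combine, (values[(i, v)] for v in members))
        for v in members:
            learned.setdefault(v, {})[i] = total
    return learned


# Two parts that overlap on nodes "b" and "c": a 2-congested instance.
parts = {1: {"a", "b", "c"}, 2: {"b", "c", "d"}}
values = {(1, "a"): 5, (1, "b"): 2, (1, "c"): 9,
          (2, "b"): 7, (2, "c"): 1, (2, "d"): 4}
rho = congestion(parts)
learned = partwise_aggregate(parts, values, min)
```

Node "b" belongs to both parts, so it holds one value per part and learns one aggregate per part.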
This congested generalization of the standard part-wise aggregation problem turns out to be a central ingredient in our refined Laplacian solver; this is further explained in Section 4. The remainder of this section is organized as follows. In Section 3.1 we establish near-optimal algorithms for solving congested part-wise aggregations in CONGEST, which is also the main focus of this section. We conclude by pointing out the construction for NCC in Section 3.2.

Solving Congested Instances in the CONGEST Model
The first natural strategy for solving the ρ-congested part-wise aggregation problem of Definition 3.1 is through a reduction to poly(ρ) 1-congested instances.However, this approach immediately fails even if we allow ρ = 2. Indeed, there exist congested part-wise aggregation instances for which every two (distinct) parts share a common node, even when ρ = 2, leading to the following observation.
Observation 3.2. For an infinite family of values n, there exists an n-node planar graph G and a 2-congested part-wise aggregation instance in which every two distinct parts share a common node.

Such a pattern is illustrated in Figure 1. As a result, directly employing a 1-congested part-wise aggregation oracle is of little use since it would introduce an overhead depending on the number of parts. In light of this, we develop a more refined approach that leverages what we refer to as the layered graph. This concept is introduced in Section 3.1.1, where we show that the congested part-wise aggregation problem can be reduced to the 1-congested part-wise aggregation problem in the layered graph. Then, we give an algorithm for the ρ-congested part-wise aggregation problem in treewidth-bounded graphs through a simple approach in Section 3.1.2, yielding an $O(\rho^2\, \mathrm{tw}(G)\, D)$-round algorithm. Finally, we show that the shortcut quality SQ of the ρ-layered graph does not increase (modulo polylogarithmic factors) as compared to the original graph (Theorem 3.11). This implies a solution for ρ-congested part-wise aggregations in general graphs whose runtime has the optimal, linear dependence on ρ, albeit at the cost of a more involved argument (Section 3.1.3, specifically Corollary 3.12).

Here we introduce the layered graph $\hat{G}_\rho$ associated with the underlying graph G. Then, we reduce the problem of ρ-congested part-wise aggregation on G to a 1-congested instance on $\hat{G}_{O(\rho)}$.

The Layered Graph
Consider an underlying network G and some $\rho \in \mathbb{Z}_{\ge 1}$, corresponding to the congestion parameter in Definition 3.1. The layered graph $\hat{G}_\rho$ is constructed in the following way. First, we let $\hat{G}_\rho$ be a disjoint union of ρ copies of G. We also add an edge between every two copies that originate from the same node (i.e., we add a clique to $\hat{G}_\rho$ on the set of copies associated with the same node v ∈ V(G)); this construction is illustrated in Figure 2. The layered graph induces a natural projection operation $\pi : V(\hat{G}_\rho) \to V(G)$ which maps a copy $v_i$ to its original node $v = \pi(v_i)$. Furthermore, we often talk about simulating $\hat{G}_\rho$ in G, by which we mean that each node v simulates its copies $v_1, \dots, v_\rho$, that is, it learns all of their inputs and can generate all of their outputs. Throughout this paper, we will assume that ρ = poly(n) so that any O(log n)-bit message on $\hat{G}_\rho$ can be sent within O(1) rounds in G; this also keeps the O-notation well-defined. The main goal of this section is to establish that the ρ-congested part-wise aggregation problem on G can be reduced to a 1-congested instance on $\hat{G}_{O(\rho)}$, as formalized below.
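The construction of the layered graph can be sketched in a few lines of Python. This is an illustrative, centralized sketch: the function name `layered_graph` and the encoding of the copy $v_i$ as the pair `(v, i)` are assumptions introduced here, not notation from the paper.

```python
from itertools import combinations


def layered_graph(nodes, edges, rho):
    """Build the rho-layered graph of an undirected graph.

    Returns (copies, layered_edges, projection), where the copy of node v in
    layer i is represented by the pair (v, i), with i ranging over 1..rho.
    """
    layers = range(1, rho + 1)
    copies = [(v, i) for v in nodes for i in layers]
    layered_edges = set()
    # One copy of every original edge inside each layer.
    for i in layers:
        for u, v in edges:
            layered_edges.add(frozenset({(u, i), (v, i)}))
    # A clique on all copies of the same original node.
    for v in nodes:
        for i, j in combinations(layers, 2):
            layered_edges.add(frozenset({(v, i), (v, j)}))
    projection = {copy: copy[0] for copy in copies}
    return copies, layered_edges, projection


# A path a - b - c with three layers.
copies, layered_edges, projection = layered_graph(
    ["a", "b", "c"], [("a", "b"), ("b", "c")], rho=3
)
```

For this example there are 3 · 3 = 9 copies, 3 · 2 = 6 within-layer edges, and 3 · 3 = 9 clique edges between copies of the same node.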

Lemma 3.3 (Unrestricted Congested Part-Wise Aggregation). Let G be an n-node graph and let $\rho \in \mathbb{Z}_{\ge 1}$ with ρ ≤ poly(n). Suppose that any (1-congested) part-wise aggregation on $\hat{G}_{O(\rho)}$ can be solved with a τ-round CONGEST algorithm on $\hat{G}_{O(\rho)}$. Then, there exists an O(ρ · τ)-round CONGEST algorithm on G that solves any ρ-congested part-wise aggregation instance on G.
The remainder of this section is dedicated to the proof of this result. We first point out that any CONGEST algorithm on $\hat{G}_\rho$ can be simulated with only a ρ multiplicative overhead in the round complexity (see Appendix B.2).

Lemma 3.4 (Simulating $\hat{G}_\rho$ in G). For any G and any $\rho \in \mathbb{Z}_{\ge 1}$ with ρ ≤ poly(n), we can simulate any τ-round CONGEST algorithm on $\hat{G}_\rho$ with a (ρ · τ)-round CONGEST algorithm on G.
Furthermore, we will use a folklore result showing how to color the edges of a (multi)graph of maximum degree Δ with O(Δ) colors in O(log n) rounds of CONGEST. By multigraph here we simply mean that there can be multiple parallel edges between the same pair of nodes, and every such edge can carry an independent message per round. To keep the paper self-contained we provide a short sketch of the proof in Appendix B.2.

Fact 3.5 (Folklore, [Joh99]). Given a (multi)graph G with n nodes and maximum degree Δ ≤ poly(n), there exists a randomized CONGEST algorithm that colors the edges of G with O(Δ) colors and completes in O(log n) rounds, with high probability. The coloring is proper, i.e., two edges that share an endpoint are assigned a different color.

Now we are ready to prove a version of our main reduction (Lemma 3.3), but with the slight twist that we restrict each part of the ρ-congested part-wise aggregation problem to be a simple path. This restriction will be removed later.

Let G be an n-node graph and let $\rho \in \mathbb{Z}_{\ge 1}$ with ρ ≤ poly(n). Suppose that there exists a τ-round CONGEST algorithm solving the (1-congested) part-wise aggregation on $\hat{G}_{O(\rho)}$. Then, there exists an O(ρ · τ)-round CONGEST algorithm on G that solves any ρ-congested part-wise aggregation instance on G when each part is restricted to be a simple path (nodes are not repeated in simple paths).
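As an illustration of the kind of coloring that Fact 3.5 provides, here is a minimal sequential sketch in Python. It is not the randomized CONGEST algorithm of the folklore result: it is the standard centralized greedy procedure, swapped in only to show that O(Δ) colors always suffice (greedy uses at most 2Δ − 1). The function names and the edge-list encoding are assumptions for this example, and self-loops are assumed to be absent.

```python
def greedy_edge_coloring(edges):
    """Properly color the edges of a multigraph given as a list of (u, v) pairs.

    Parallel edges are allowed (repeat the pair). Returns a list `colors`
    where colors[j] is the color (1, 2, ...) of edges[j].
    """
    used = {}  # node -> set of colors already present on incident edges
    colors = []
    for u, v in edges:
        taken = used.setdefault(u, set()) | used.setdefault(v, set())
        c = 1
        while c in taken:  # smallest color free at both endpoints
            c += 1
        colors.append(c)
        used[u].add(c)
        used[v].add(c)
    return colors


def is_proper(edges, colors):
    """Check that no two edges sharing an endpoint have the same color."""
    seen = set()
    for (u, v), c in zip(edges, colors):
        if (u, c) in seen or (v, c) in seen:
            return False
        seen.add((u, c))
        seen.add((v, c))
    return True


def max_degree(edges):
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return max(deg.values(), default=0)


# A triangle with one doubled edge; maximum degree is 3.
edges = [("a", "b"), ("a", "b"), ("b", "c"), ("c", "a")]
colors = greedy_edge_coloring(edges)
```

Each edge conflicts with at most Δ − 1 other edges at each endpoint, which is why the smallest free color never exceeds 2Δ − 1.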
Proof. Let $\mathcal{P} = \{P_1, P_2, \dots, P_k\}$ be subsets of nodes in G comprising the parts of some ρ-congested part-wise aggregation on G. We will construct paths $\hat{\mathcal{P}} = \{\hat{P}_1, \hat{P}_2, \dots, \hat{P}_k\}$ in $\hat{G}_{O(\rho)}$ in a way that solving a part-wise aggregation on $\hat{\mathcal{P}}$ corresponds to solving a ρ-congested part-wise aggregation on $\mathcal{P}$.
Let $E_i$ be the set of edges of G comprising the simple path traversing all the nodes in $P_i$, and consider the (multi)graph $G' := (V(G), \bigsqcup_{i=1}^{k} E_i)$. First, we observe that the degree of any node $v \in V(G') = V(G)$ is at most 2ρ, since at most ρ parts contain v and each part contributes at most 2 to the degree (since $P_i$ is a simple path). Furthermore, we can simulate any ψ-round CONGEST algorithm on G' with a (ψ · ρ)-round CONGEST algorithm on G, as each edge e ∈ E(G) appears at most ρ times in E(G') due to the part-wise aggregation instance being at most ρ-congested. Therefore, using Fact 3.5 we can distributedly color the edges of G' with at most O(ρ) colors in O(log n) CONGEST rounds on G', which translates to O(ρ log n) CONGEST rounds on G. Suppose that the algorithm assigns a color $c(e) \in \{1, \dots, O(\rho)\}$ to each edge $e \in \bigsqcup_i E_i$.
We now construct $\hat{P}_i \subseteq V(\hat{G}_{O(\rho)})$ as follows: consider each edge $\{u, v\} \in E_i$ and add both $u_{c(\{u,v\})}, v_{c(\{u,v\})} \in V(\hat{G}_{O(\rho)})$ to $\hat{P}_i$ (i.e., the c({u, v})-th copy of both u and v). By construction, $\hat{P}_i$ induces a connected subgraph and the projection of $\hat{P}_i$ to G is exactly $P_i$. Next, we invoke the τ-round (1-congested) part-wise aggregation algorithm for $\{\hat{P}_1, \dots, \hat{P}_k\}$ on $\hat{G}_{O(\rho)}$, which can be converted to an O(τ · ρ)-round algorithm on G (Lemma 3.4). Thus, we obtain an O(τ · ρ)-round CONGEST algorithm on G which solves any path-restricted ρ-congested part-wise aggregation problem.
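The lifting step of the proof can be sketched as follows. This is a centralized illustration under stated assumptions: the helper `lift_paths`, the encoding of a copy as `(node, layer)`, and the hand-picked proper coloring in the example are all introduced here for illustration. The sketch assumes every path has at least one edge and that the supplied coloring is proper on the multigraph formed by all path edges.

```python
def lift_paths(paths, color):
    """Lift overlapping simple paths to parts of the layered graph.

    paths[i] is the node sequence of the simple path P_i, and color[(i, j)]
    is the color of its j-th edge. For an edge {u, v} of color c, the copies
    (u, c) and (v, c) are added to the lifted part.
    """
    lifted = {}
    for i, path in paths.items():
        part = set()
        for j, (u, v) in enumerate(zip(path, path[1:])):
            c = color[(i, j)]
            part.add((u, c))
            part.add((v, c))
        lifted[i] = part
    return lifted


# Two paths sharing the edge b - c (a 2-congested instance).
paths = {1: ["a", "b", "c"], 2: ["b", "c", "d"]}
# A proper coloring of the multigraph a-b, b-c, b-c, c-d.
color = {(1, 0): 1, (1, 1): 2, (2, 0): 3, (2, 1): 1}
lifted = lift_paths(paths, color)
projected = {i: {v for v, _ in part} for i, part in lifted.items()}
```

Because edges of different parts that meet at a node receive different colors, the lifted parts use different copies of that node and are therefore pairwise disjoint, while each one still projects back to the original part.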
Finally, our reduction in Lemma 3.3 follows by reformulating [HWZ21, Lemma 7.2], as we argue in Appendix B.2.

Treewidth-Bounded Graphs
Here we leverage the reduction we established in Lemma 3.3 to obtain a simple algorithm for solving the congested part-wise aggregation problem in treewidth-bounded graphs. The crucial observation is that the treewidth of the layered graph can only grow by a factor of ρ compared to the treewidth of the underlying graph, as we show in Lemma 3.8.
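The bag-lifting idea behind this observation can be checked on small examples with the following Python sketch. It replaces every node in a bag by all of its ρ copies and verifies the three properties of Definition 2.8 on the layered graph. The function names and data encoding are assumptions made for this illustration, and the checker assumes that the supplied tree edges do form a tree.

```python
from itertools import combinations


def is_tree_decomposition(nodes, edges, bags, tree_edges):
    """Check the three properties of a tree decomposition (T assumed a tree)."""
    if set().union(*bags.values()) != set(nodes):
        return False  # some node is not covered by any bag
    if not all(any(u in B and v in B for B in bags.values()) for u, v in edges):
        return False  # some edge is not contained in any bag
    adj = {b: set() for b in bags}
    for a, b in tree_edges:
        adj[a].add(b)
        adj[b].add(a)
    for v in nodes:
        holding = {b for b, B in bags.items() if v in B}
        start = next(iter(holding))
        seen, stack = {start}, [start]
        while stack:  # search restricted to bags containing v
            x = stack.pop()
            for y in (adj[x] & holding) - seen:
                seen.add(y)
                stack.append(y)
        if seen != holding:
            return False  # bags containing v are not connected in T
    return True


def lift_decomposition(nodes, edges, bags, rho):
    """Layered graph plus lifted bags: each node is replaced by all its copies."""
    layers = range(1, rho + 1)
    l_nodes = [(v, i) for v in nodes for i in layers]
    l_edges = [((u, i), (v, i)) for u, v in edges for i in layers]
    l_edges += [((v, i), (v, j)) for v in nodes for i, j in combinations(layers, 2)]
    l_bags = {b: {(v, i) for v in B for i in layers} for b, B in bags.items()}
    return l_nodes, l_edges, l_bags


nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
bags = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}}
tree_edges = [(0, 1), (1, 2)]
rho = 3
l_nodes, l_edges, l_bags = lift_decomposition(nodes, edges, bags, rho)
width = max(len(B) for B in bags.values()) - 1
lifted_width = max(len(B) for B in l_bags.values()) - 1
```

On this path example the original width is 1 and the lifted width is ρ(1 + 1) − 1 = 5, with the same tree reused for the lifted bags.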
Proof. Consider a tree decomposition (in the sense of Definition 2.8) of G into tree-nodes $\{X_j\}_{j=1}^{k}$ such that the width of the decomposition satisfies w = tw(G). We will show that there exists a tree decomposition of the graph $\hat{G}_\rho$ with width at most ρ(w + 1) − 1, which in turn will imply that $\mathrm{tw}(\hat{G}_\rho) \le \rho(w + 1) - 1 = \rho(\mathrm{tw}(G) + 1) - 1$. Indeed, consider the following sets:
$$\hat{X}_j := \{ u_i : u \in X_j,\ i \in [\rho] \}, \qquad j \in [k].$$
In words, each node $u \in X_j$ is replaced by all of its copies $u_i$ in $\hat{X}_j$. Observe that, by construction, $|\hat{X}_j| = \rho |X_j|$. Thus, it suffices to show that the collection of sets $\{\hat{X}_j\}_{j=1}^{k}$ forms a legitimate tree decomposition. First, since $V(G) \subseteq \bigcup_j X_j$, it follows that $V(\hat{G}_\rho) \subseteq \bigcup_j \hat{X}_j$. Moreover, consider any two sets $\hat{X}_j, \hat{X}_\ell$, both containing a node $u_i \in V(\hat{G}_\rho)$ for some i ∈ [ρ]. Then, we know that all the tree-nodes on the (unique) path between $X_j$ and $X_\ell$ in the original tree decomposition include u, since $X_j$ and $X_\ell$ both include u and $\{X_j\}$ is a tree decomposition of G. In turn, this implies that all the tree-nodes on the path between $\hat{X}_j$ and $\hat{X}_\ell$ also contain $u_i$. Thus, the tree-nodes containing $u_i$ form a connected subtree. Finally, we know that for every edge {u, v} ∈ E(G) there exists a subset $X_j$ such that $u, v \in X_j$. Hence, we can infer that for every edge in $E(\hat{G}_\rho)$ there is a tree-node $\hat{X}_j$ which includes both incident endpoints. As a result, we have constructed a tree decomposition of $\hat{G}_\rho$ with width $\max_{j \in [k]} |\hat{X}_j| - 1 = \rho(w + 1) - 1$.

Minor Density in the Layered Graph. In light of Lemma 3.8, a natural question is whether an analogous bound holds with respect to the minor density of the underlying graph; i.e., whether $\delta(\hat{G}_\rho) = \mathrm{poly}(\rho)\, \delta(G)$. Such a result would be strictly stronger as it would apply to the broader class of graphs with bounded minor density, and would essentially lift all the results in [GH20], such as Theorem 2.7, to the node-congestion setting in a black-box manner. Unfortunately, this is not possible.
Indeed, consider a √n × √n grid G (where √n is assumed to be an integer) such that every node in the graph is 2-congested. Then, it is clear that δ(G) = O(1) (since planar graphs have excluded minors). On the other hand, we claim that $\delta(\hat{G}_\rho) = \Omega(\sqrt{n})$. To see this, denote by (i, j) the node positioned in the i-th row and j-th column with respect to the original graph, and by (i', j') the node positioned in the i-th row and j-th column of the "duplicate" layer, for $i, j \in [\sqrt{n}]$. Let $C_j = \{(i, j) : i \in [\sqrt{n}]\}$ be the nodes comprising the j-th column of the original graph and $R_i = \{(i', j') : j \in [\sqrt{n}]\}$ be the nodes comprising the i-th row of the duplicate layer. Then, it follows that the minor graph induced by the connected components $R_1, \dots, R_{\sqrt{n}}, C_1, \dots, C_{\sqrt{n}}$ contains the complete bipartite graph $K_{\sqrt{n}, \sqrt{n}}$ as a subgraph (Figure 3). Since $K_{\sqrt{n}, \sqrt{n}}$ has n edges on 2√n nodes, this implies that the minor density of $\hat{G}_\rho$ is Ω(√n).
Observation 3.10. There exists an n-node graph G with minor density δ(G) = O(1) such that $\delta(\hat{G}_2) = \Omega(\sqrt{n})$.

General Graphs
We conclude with our main result of Section 3.1: a near-optimal distributed algorithm for solving the ρ-congested part-wise aggregation problem in general graphs. In light of our reduction in Lemma 3.3, the technical crux is to control the degradation in the shortcut quality incurred by the transformation into the layered graph. Surprisingly, we show that the shortcut quality of $\hat{G}_\rho$ does not increase by more than a polylogarithmic factor even when the number of layers is polynomial:

Theorem 3.11. For any n-node graph G and any $\rho \in \mathbb{Z}_{\ge 1}$ with ρ ≤ poly(n), we have $\mathrm{SQ}(\hat{G}_\rho) = \tilde{O}(\mathrm{SQ}(G))$.

This theorem improves over our previous result for treewidth-bounded graphs (Lemma 3.8), since the latter guarantee inevitably induces a linear factor of ρ in the shortcut quality of $\hat{G}_\rho$. While this will not affect the asymptotic performance of the Laplacian solver, this improvement might prove to be important for future applications. Assuming that we have shown Theorem 3.11, we can then utilize the efficient shortcut constructions given in Theorem 2.5 to solve ρ-congested part-wise aggregations on any graph.

Corollary 3.12. There exists a randomized distributed algorithm that, for any n-node graph G and $\rho \in \mathbb{Z}_{\ge 1}$ with ρ ≤ poly(n), solves with high probability any ρ-congested part-wise aggregation instance on G with the following guarantees:
• In CONGEST, the algorithm terminates in at most $\rho \cdot \mathrm{poly}(\mathrm{SQ}(G)) \cdot n^{o(1)}$ rounds.

The rest of this subsection is dedicated to the proof of Theorem 3.11. To argue about the shortcut quality of the layered graph, we need to develop several generalized notions of node connectivity.
Pair node and any-to-any connectivity are essentially the multi- and single-commodity versions of node connectivity, respectively.
Pair Node Connectivity. Given a (multi)set of source-sink pairs $P = \{(s_i, t_i)\}_{i=1}^{k}$ in G, we say that P has pair node connectivity ρ if there exist paths $P_1, \dots, P_k$, with $s_i$ and $t_i$ being the endpoints of each $P_i$, such that every node v ∈ V(G) is contained in at most ρ many paths, i.e., for all v we have $|\{i : V(P_i) \ni v\}| \le \rho$. If P has pair node connectivity 1 we say that the pairs are pair node-disjointly connectable.
Any-to-Any Node Connectivity. Suppose that we are given multisets of k sources $S = \{s_1, \dots, s_k\}$ and k sinks $T = \{t_1, \dots, t_k\}$. We say that (S, T) have any-to-any node connectivity ρ if there is a permutation π : {1, ..., k} → {1, ..., k} such that the pairs $\{(s_i, t_{\pi(i)})\}_{i=1}^{k}$ have pair node connectivity ρ. If (S, T) have any-to-any node connectivity 1 we say they are any-to-any node-disjointly connectable.
The following decomposition lemma states that two sets with any-to-any node connectivity ρ can be decomposed into O(ρ log k) many pairs of subsets that are any-to-any node-disjointly connectable.

Proof. Suppose that each edge in G has infinite capacity while each node in G has unit capacity. Then, let us connect a super-source s to each node x ∈ S with a unit-capacity edge, and a super-sink t to each node x ∈ T with a unit-capacity edge. By assumption, we know that there exists a flow f over E(G) which sends k units of flow from s to t with edge congestion 1 and node congestion at most ρ. Therefore, the flow f/ρ, sending k/ρ units of flow from s to t, is a feasible solution of the maximum flow linear program with node constraints (i.e., it satisfies both edge and node capacity constraints). Since that linear program is integral (i.e., has an integrality gap of 1), there exists an integral flow f' which sends at least k/ρ units of flow and satisfies both node and edge capacity restrictions. In other words, there exist at least k/ρ node-disjoint paths (with the exception of the endpoints) between s and t. Let $S_1 \subseteq S$ ($T_1 \subseteq T$) be the set of nodes on these paths immediately following the super-source (just before the super-sink, respectively). Clearly, by construction, $(S_1, T_1)$ are any-to-any node-disjointly connectable. Finally, we define $S := S \setminus S_1$, $T := T \setminus T_1$ and proceed iteratively as above (producing $S_2, T_2$ instead of $S_1, T_1$). In each step, S and T shrink to at most a (1 − 1/ρ) fraction of their previous size. Hence, O(ρ log k) steps suffice so that S = T = ∅.
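The step count at the end of this argument follows from a short calculation, spelled out here as a sketch. The symbol $k_t$, denoting the number of sources (equivalently, sinks) still unassigned after t iterations, is notation introduced only for this illustration.

```latex
\[
  k_{t+1} \le \Bigl(1 - \tfrac{1}{\rho}\Bigr) k_t
  \quad\Longrightarrow\quad
  k_t \le \Bigl(1 - \tfrac{1}{\rho}\Bigr)^{t} k \le e^{-t/\rho}\, k ,
\]
\[
  \text{so } k_t < 1 \text{ whenever } t > \rho \ln k .
\]
```

Since $k_t$ is a non-negative integer, $k_t < 1$ means that no sources or sinks remain, which gives the O(ρ log k) bound on the number of pairs.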
Next, we introduce two communication tasks that will be useful for characterizing the shortcut quality.
Multiple-Unicast Problem. Suppose that we are given k source-sink pairs $P = \{(s_i, t_i)\}_{i=1}^{k}$. The goal is to find the smallest possible completion time τ such that there are k paths $P_1, \dots, P_k$ for which (1) the endpoints of each $P_i$ are exactly $s_i$ and $t_i$; (2) the dilation is τ, i.e., each path $P_i$ has at most τ hops; and (3) the congestion is τ, i.e., each edge e ∈ E(G) is contained in at most τ many paths.
Any-to-Any-Cast Problem. Suppose we are given k sources $S = \{s_1, \dots, s_k\}$ and k sinks $T = \{t_1, \dots, t_k\}$. The goal is to find the smallest completion time τ such that there exists a permutation π : {1, ..., k} → {1, ..., k} for which the multiple-unicast problem on $\{(s_i, t_{\pi(i)})\}_{i=1}^{k}$ has a completion time of at most τ.
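For a fixed choice of paths, the two quantities in the definition of completion time are easy to evaluate. The sketch below is an illustrative helper (the name `path_quality` and the list-of-nodes path encoding are assumptions) that returns the dilation and congestion of a given path collection. It only evaluates a candidate solution and does not search for the optimal paths; the paths are assumed to be simple.

```python
def path_quality(paths):
    """Return (dilation, congestion) of a collection of simple paths.

    Each path is a list of nodes. Dilation is the largest hop count, and
    congestion is the largest number of paths using the same undirected edge.
    """
    dilation = max(len(p) - 1 for p in paths)
    load = {}
    for p in paths:
        for u, v in zip(p, p[1:]):
            e = frozenset((u, v))
            load[e] = load.get(e, 0) + 1
    congestion = max(load.values(), default=0)
    return dilation, congestion


# Three source-sink paths that all cross the edge b - c.
paths = [["a", "b", "c", "d"], ["e", "b", "c", "f"], ["g", "b", "c"]]
dilation, congestion = path_quality(paths)
tau = max(dilation, congestion)
```

Here `tau` is the smallest value for which these particular paths satisfy conditions (2) and (3); the completion time of the instance is the minimum of this quantity over all admissible path collections.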
Finally, we now recall (a reinterpretation of) a result characterizing shortcut quality from [HWZ20; HWZ21]. Shortcut quality was originally defined as the smallest completion time of the worst-case generalized (with respect to parts) multiple-unicast (i.e., multi-commodity) problem over a pair node-disjointly connectable instance (Definition 2.4). Using recent network coding gap results, we can equivalently express shortcut quality as the smallest completion time of the worst-case any-to-any-cast (i.e., single-commodity) problem over sources and sinks that are any-to-any node-disjointly connectable. The formal statement follows.

Theorem 3.14 ([HWZ20; HWZ21]). Consider any graph G and let τ be the worst-case completion time of any-to-any-cast problems taken over all any-to-any node-disjointly connectable sets (S ⊆ V (G), T ⊆ V (G)). Then, τ = Θ(SQ(G)).
Proof. It was proven in [HWZ21, Lemma 2.8 in the Full Version] that SQ(G) is, up to Θ(1) factors, equal to the completion time C of some multiple-unicast instance with respect to some source-sink pairs $P := \{(s_i, t_i)\}_{i=1}^{k}$ that are pair node-disjointly connectable. We note that, since sources and sinks are disjoint, it follows that k = poly(n) and O(log n) = O(log k). Furthermore, [HWZ20] proved that there exists a sub-instance $P' = \{(s'_i, t'_i)\}_{i=1}^{k'} \subseteq P$ such that SQ(G) is (up to Θ(1) factors) equal to the completion time τ of the any-to-any-cast problem with respect to $(\{s'_i\}_{i=1}^{k'}, \{t'_i\}_{i=1}^{k'})$. One side of the claim is clear: for any sub-instance P' ⊆ P we have that τ ≤ C. The other direction is harder and we sketch its proof here using the terminology in [HWZ20]. By definition and strong duality, $\mathrm{Cut}_P(2C) = \mathrm{ConcurrentFlow}_P(2C) \le 1$. Furthermore, $\mathrm{Cut}_P(C/10) = \mathrm{Cut}_P(2C)/20 \le 1/10$. Hence, by [HWZ21, Lemma 2.6] there is a sub-instance P' ⊆ P with a moving cut of distance τ := Ω(C) and capacity less than |P'|. Therefore, this proves that the completion time of the any-to-any-cast problem on P' is at least τ. With this in mind, we have that Ω(SQ(G)) = Ω(C) = τ ≤ C = Θ(SQ(G)).
Finally, since $P = \{(s_i, t_i)\}_{i=1}^{k}$ was pair node-disjointly connectable, it follows from the definition that the sub-instance $(\{s'_i\}_{i=1}^{k'}, \{t'_i\}_{i=1}^{k'})$ is any-to-any node-disjointly connectable. Therefore, $(\{s'_i\}_{i=1}^{k'}, \{t'_i\}_{i=1}^{k'})$ satisfies the constraints of this result and has completion time τ = Θ(SQ(G)), as required. It is also clear that, by the definition of shortcut quality, any any-to-any node-disjointly connectable instance has completion time at most SQ(G), using the node-disjoint paths that witness the any-to-any node-disjointness as parts of the shortcut, making $(\{s'_i\}_{i=1}^{k'}, \{t'_i\}_{i=1}^{k'})$ the worst-case such instance (modulo polylogarithmic factors).
We now combine all of the previous ingredients to prove the main result of this section.
Proof of Theorem 3.11. Let $\hat{S} \subseteq V(\hat{G}_\rho)$ and $\hat{T} \subseteq V(\hat{G}_\rho)$ be any-to-any node-disjointly connectable sets such that the completion time of any-to-any-cast between $\hat{S}$ and $\hat{T}$ is $\Theta(\mathrm{SQ}(\hat{G}_\rho))$ (Theorem 3.14). Let $k := |\hat{S}| = |\hat{T}|$, and suppose that $S := \bigsqcup_{\hat{s} \in \hat{S}} \{\pi(\hat{s})\} \subseteq V(G)$ and $T := \bigsqcup_{\hat{t} \in \hat{T}} \{\pi(\hat{t})\} \subseteq V(G)$ are the multisets induced by projecting $\hat{S}$ and $\hat{T}$ to G, respectively. By construction of $\hat{G}_\rho$, S and T have any-to-any node connectivity ρ; to see this, consider the witness paths disjointly connecting $\hat{S}$ and $\hat{T}$ in $\hat{G}_\rho$ and project them to G. Therefore, by the decomposition lemma above, we can partition $S = S_1 \sqcup \dots \sqcup S_{O(\rho \log k)}$ and $T = T_1 \sqcup \dots \sqcup T_{O(\rho \log k)}$ so that each pair $(S_i, T_i)$ is any-to-any node-disjointly connectable. By definition of shortcut quality, for each i ∈ {1, ..., O(ρ log k)} there exists a set of paths $(P^i_j)_j$ in G between $S_i$ and $T_i$ of quality (i.e., both congestion and dilation) at most SQ(G). Then, we inject the first O(log k) collections of paths $(P^1_j)_j, (P^2_j)_j, \dots, (P^{O(\log k)}_j)_j$ into the first layer $G_1$ of $\hat{G}_\rho$; the second O(log k) collections into the second layer $G_2$, and so on, until we finally inject the last O(log k) collections into the last layer $G_\rho$. Note that only the paths on the same layer interact, so both the congestion and dilation after injecting all paths into $\hat{G}_\rho$ are O(SQ(G) log k). Hence, the same applies for the shortcut quality. Finally, to solve the any-to-any-cast problem on $\hat{S}$ and $\hat{T}$ one might need to add a between-layer edge at the beginning and at the end, since each injected path is restricted to some adversarially chosen layer. However, this only increases the congestion and dilation by O(1). Hence, the completion time of any-to-any-cast between $\hat{S}$ and $\hat{T}$ is $\tilde{O}(\mathrm{SQ}(G))$, implying that $\mathrm{SQ}(\hat{G}_\rho) = \tilde{O}(\mathrm{SQ}(G))$.

The NCC Model
We next turn our attention to the NCC model. We observe that the ρ-congested part-wise aggregation problem admits a solution in poly(ρ, log n) rounds of NCC. This is established after appropriately translating the communication primitives established for NCC in [Aug+19]; the details are provided in Appendix C.
Lemma 3.15. Let G be an n-node communication network. Then, we can solve with high probability any ρ-congested part-wise aggregation problem on G after O(ρ + log n) rounds of NCC.

Almost Universally Optimal Laplacian Solvers
In this section we relate the congested part-wise aggregation problem we studied in the previous section with the Laplacian solver of Forster, Goranci, Liu, Peng, Sun, and Ye [For+20]. To present a unifying analysis for both CONGEST and HYBRID, as well as for future applications and extensions, we analyze the distributed Laplacian solver under the following hypothesis.
Assumption 4.1.Consider a model of computation which incorporates CONGEST.We assume that we can solve with high probability any ρ-congested part-wise aggregation problem in ) rounds, for some universal constant c ≥ 1.
One of our crucial observations is that the performance of the Laplacian solver of Forster, Goranci, Liu, Peng, Sun, and Ye [For+20] can be parameterized in terms of the complexity of the congested part-wise aggregation problem.Indeed, we revisit and refine the main building blocks of their solver in Appendix A, leading to the following result.
Theorem 4.2 (Full Version in Theorem A.9). Consider a weighted n-node graph G for which Assumption 4.1 holds for some where c is a universal constant and Q = Q(G) is some parameter.Then, we can solve any Laplacian system after n o(1) Q log(1/ε) rounds.
Combining this theorem with Corollary 3.12 and Lemma 3.15 yields the following immediate consequences.

Theorem 1.2. Consider any n-node graph G with shortcut quality SQ(G) and hop-diameter D.
There exists a distributed Laplacian solver with error ε > 0 with the following guarantees:
• In the Supported-CONGEST model, it requires $n^{o(1)}\, \mathrm{SQ}(G) \log(1/\varepsilon)$ rounds.
Theorem 1.3. Consider any n-node graph. There exists a distributed Laplacian solver in the HYBRID model with round complexity $n^{o(1)} \log(1/\varepsilon)$, where ε > 0 is the error of the solver.
Lower Bound in Supported-CONGEST. Finally, we complement our positive results with an almost-matching lower bound on any graph G, applicable even under the Supported-CONGEST model, thereby establishing universal optimality up to an $n^{o(1)}$ factor. Our reduction leverages the refined hardness result established in [HWZ21] for the spanning connected subgraph problem [Das+11].
In this problem a subgraph H of G is specified with nodes knowing all of the incident edges belonging to H.The goal is to let every node learn whether H is connected and spans the entire network.
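As a point of reference for what the nodes must decide, the following centralized Python sketch checks the spanning connected subgraph property with a union-find structure. It is only a specification of the correct answer, not a distributed algorithm; the function name and input encoding are assumptions made for this illustration.

```python
def is_spanning_connected(nodes, h_edges):
    """Return True iff the subgraph H = (nodes, h_edges) is connected,
    i.e., H connects every node of the network."""
    parent = {v: v for v in nodes}

    def find(v):
        # Find the representative of v, with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    components = len(parent)
    for u, v in h_edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
    return components == 1


nodes = ["a", "b", "c", "d"]
spanning = is_spanning_connected(nodes, [("a", "b"), ("b", "c"), ("c", "d")])
disconnected = is_spanning_connected(nodes, [("a", "b"), ("c", "d")])
not_spanning = is_spanning_connected(nodes, [("a", "b"), ("b", "c")])
```

A node of the network that has no incident edge in H forms its own component, so a connected H that misses a node is correctly rejected.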

Theorem 4.3 ([HWZ21]). Let A be any algorithm which is always correct with probability at least 2

In this context, we show that a Laplacian solver can be leveraged to solve the spanning connected subgraph problem, leading to the following lower bound.
Proposition 1.1. Consider a graph G with shortcut quality SQ(G). Then, solving a Laplacian system on G with ε ≤ 1/2 requires $\tilde{\Omega}(\mathrm{SQ}(G))$ rounds in both the CONGEST and Supported-CONGEST models.
This substantially strengthens the existential lower bound in [For+20], and deviates from their argument which is based on a reduction from the s − t connectivity problem.The proof is deferred to Appendix B.8.

Conclusions
We established almost universally optimal Laplacian solvers for both the (Supported-)CONGEST and the HYBRID model. One of our main technical contributions was to introduce and study a congested generalization of the standard part-wise aggregation problem, which we believe may find further applications beyond the Laplacian paradigm in the future. For example, one candidate problem would be to refine the distributed algorithm for max-flow due to Ghaffari, Karrenbauer, Kuhn, Lenzen, and Patt-Shamir [Gha+15]. We also hope that our accelerated Laplacian solvers will be used as a basic primitive for obtaining improved distributed algorithms for other fundamental optimization problems as well. Indeed, Forster, Goranci, Liu, Peng, Sun, and Ye [For+20] showed that the Laplacian paradigm can offer sublinear and exact distributed algorithms for problems such as max-flow, an objective which previously appeared elusive.
2. There exists a mapping of the edges of G onto edges of $\overline{G}$, or self-loops, such that for any $\{u^{G}, v^{G}\} \in E(G)$, the mapped edge {u, v} satisfies $u \in S^{G \to \overline{G}}(u^{G})$ and $v \in S^{G \to \overline{G}}(v^{G})$.
Moreover, we say that this minor G has congestion ρ, or G is a ρ-minor, if: 1. Every node $u \in \overline{G}$ is contained in at most ρ super-nodes $S^{G \to \overline{G}}(u^{G})$, for some $u^{G} \in V(G)$; 2. Every edge of $\overline{G}$ appears as the image of an edge of G or in one of the trees connecting super-nodes (i.e., $T^{G \to \overline{G}}(u^{G})$ for some $u^{G}$) at most ρ times.
Finally, we say that G is ρ-minor distributed over $\overline{G}$ if every $u \in V(\overline{G})$ stores: 2. For every edge e incident to u, (i) all the nodes $u^{G}$ for which $e \in T^{G \to \overline{G}}(u^{G})$, and (ii) all edges $e^{G}$ that map to it.
We remark that the basis of Definition A.2 was the earlier concept of a distributed cluster graph of Ghaffari, Karrenbauer, Kuhn, Lenzen, and Patt-Shamir [Gha+15]. The important connection is that the congested part-wise aggregation problem we introduced is the central ingredient that allows performing certain "local" operations on a graph ρ-minor distributed into the underlying communication network. The following lemma is a direct consequence of Definition A.2.

A.2 The Laplacian Building Blocks
To keep the exposition reasonably self-contained, here we review the basic ingredients of the distributed Laplacian solver developed in [For+20]. Our main goal is to extend the guarantees established in [For+20] under Assumption 4.1. Then, we will combine these pieces in Appendix A.3 to complete the construction.

A.2.1 Ultra-Sparsification
As is standard in the Laplacian paradigm, we will require a preconditioner in the form of an ultra-sparsifier. In particular, the following lemma is established in Appendix B.4, and it is a refinement of [For+20, Lemma 4.9]:

Let us briefly review the pieces required for this lemma. First, we need the distributed implementation of the low-stretch spanning tree algorithm of Alon, Karp, Peleg, and West [Alo+95] which is due to Ghaffari, Karrenbauer, Kuhn, Lenzen, and Patt-Shamir [Gha+15]. Then, this spanning tree is augmented with off-tree edges based on the sampling procedure of Koutis, Miller, and Peng [KMP10], leading to a graph with a spectral approximation guarantee with respect to the original graph. Finally, the parallel elimination procedure of Blelloch, Gupta, Koutis, Miller, Peng, and Tangwongsan [Ble+14] is used to perform a series of contractions, leading to a subset with size analogous to the number of off-tree edges. We revisit these steps in detail in Appendix B.4.

A.2.2 Sparsified Cholesky
The next building block is the sparsified Cholesky algorithm of Kyng, Lee, Peng, Sachdeva, and Spielman [Kyn+16], which manages to effectively eliminate in every iteration a non-negligible fraction of the nodes.In the distributed context, we state the following lemma which is a refinement of [For+20, Lemma 4.10].

Lemma A.5 (Sparsified Cholesky).
Let G be an n-node graph ρ-minor distributed into a communication network $\overline{G}$ for which Assumption 4.1 holds for some Q = Q(ρ). Then, for a given parameter d and error ε, the algorithm Eliminate(G, d, ε) where c represents some universal constant, and returns a subset T ⊂ V(G) and access to operators

This lemma is established based on a distributed implementation of the sparsified Cholesky algorithm of Kyng, Lee, Peng, Sachdeva, and Spielman [Kyn+16]. In particular, the Cholesky decomposition essentially reduces solving a Laplacian to inverting (i) any sub-matrix of the Laplacian induced on a set S, and (ii) the Schur complement on V \ S. Thus, Kyng, Lee, Peng, Sachdeva, and Spielman [Kyn+16] initially develop a procedure for identifying an "almost independent" subset of nodes F (more precisely, a strongly diagonally dominant subset) for which inverting the Laplacian restricted on F can be done efficiently through preconditioning (e.g., via the Jacobi method), while F also contains at least a constant fraction of the nodes. Next, a combinatorial view of the Schur complement based on a certain family of random walks (see [Dur+19]) is employed to construct a spectral sparsifier of the Schur complement on T = V \ F. This process is then repeated for d iterations, leading to Lemma A.5. Several technical challenges that arise are discussed in Appendix B.5.

Next, the main idea is to recurse on the set of terminals T. However, in our context this requires maintaining the invariant that the underlying subgraph is cast as a minor (with a reasonable congestion) of the communication network. This is ensured in the following subsection.
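To illustrate the object that the elimination step works with, the following Python sketch computes an exact Schur complement of a small graph Laplacian by eliminating vertices one at a time, using exact rational arithmetic. This is plain Gaussian elimination, not the sparsified or distributed procedure of [Kyn+16] or [For+20]; the function names and the edge-list encoding are assumptions for this example, and each eliminated vertex is assumed to have positive weighted degree at the time it is removed.

```python
from fractions import Fraction


def laplacian(n, weighted_edges):
    """Laplacian of an undirected graph on vertices 0..n-1 with edges (u, v, w)."""
    L = [[Fraction(0)] * n for _ in range(n)]
    for u, v, w in weighted_edges:
        w = Fraction(w)
        L[u][u] += w
        L[v][v] += w
        L[u][v] -= w
        L[v][u] -= w
    return L


def schur_complement(L, eliminate):
    """Eliminate the listed vertices one by one.

    Returns (kept, S), where `kept` lists the surviving vertex indices and S
    is the Schur complement of L onto those vertices.
    """
    kept = list(range(len(L)))
    M = [row[:] for row in L]
    for v in eliminate:
        p = kept.index(v)
        pivot = M[p][p]
        rest = [i for i in range(len(kept)) if i != p]
        # Standard elimination update: S[a][b] = M[a][b] - M[a][p] M[p][b] / M[p][p].
        M = [[M[a][b] - M[a][p] * M[p][b] / pivot for b in rest] for a in rest]
        kept.pop(p)
    return kept, M


# Unit-weight path 0 - 1 - 2; eliminating the middle vertex.
L = laplacian(3, [(0, 1, 1), (1, 2, 1)])
kept, S = schur_complement(L, [1])
```

The result is again a Laplacian: a single edge between vertices 0 and 2 of weight 1/2, matching the rule for two unit resistors in series.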

A.2.3 Minor Schur Complement
This subsection introduces a subroutine that will be invoked after the Eliminate algorithm to return a low-congestion minor based on the set of terminals T returned by Eliminate; while doing so, the algorithm will incur a small overhead in the spectral guarantee, and a limited growth in the number of nodes with respect to T .This increase will be eventually negligible due to the selection of parameter d in Eliminate.In this context, the following lemma is a refinement of [For+20, Theorem 3].
Lemma A.6.Let G be an n-node graph ρ-minor distributed into an n-node communication network for which Assumption 4.1 holds for some Q = Q(ρ).Then, for an error parameter 0 < ε < 0.1 and a subset T of nodes, the algorithm ApproxSC returns with high probability a graph H as a ρ-minor This algorithm requires O(log 10 n/ε 3 ) calls to a distributed Laplacian solver to accuracy 1/ poly(n) on graphs that 2ρ-minor distribute into G, and an overhead of O(Q(ρ) log 10 n/ε 3 ) rounds.
This result builds upon the work of Li and Schild [LS18], who (roughly speaking) established that randomly contracting an edge with probability equal to its leverage score (and otherwise deleting it) would suffice. In the distributed context, Forster, Goranci, Liu, Peng, Sun, and Ye [For+20] devise a parallelized implementation of this scheme based on the localization of electrical flows [SRS18]. More precisely, they manage to identify a non-negligible subset of edges (which they refer to as steady edges) with small mutual (electrical) "correlation", allowing for independent (and hence highly parallelized) contractions/deletions within this set. This approach employs the recursive and sketching-based method of random projections due to Spielman and Srivastava [SS08], similarly to [LS18], to estimate quantities such as leverage scores and electrical correlation. These steps are carefully reviewed in Appendix B.6.

A.2.4 Schur Complement Chain
Finally, let us introduce the concept of a Schur complement chain, and explain how it can be employed to produce a Laplacian solver.
i=1 is a (γ, ε)-Schur complement chain if the following conditions hold: is some parameter. Then, for any vector $b \in \mathbb{R}^n$ stored on its nodes and a sufficiently small error parameter ε > 0, Solver(G, ε) returns after $n^{o(1)} Q \log(1/\varepsilon)$ rounds a vector x distributed on its nodes such that
The proof of this theorem is included in Appendix B.7. We note that a guarantee with respect to the $L(G)^{\dagger}$-norm (as in Lemma A.8) can be translated to a guarantee in the L(G)-norm. This incurs only a logarithmic multiplicative overhead since it is assumed that the weights are polynomially bounded and the dependence on 1/ε is logarithmic [Vis12]. Thus, the overhead is subsumed by the factor $n^{o(1)}$.

B Omitted Proofs
In this section we include all of the proofs deferred from the main body and Appendix A. We commence from Section 2.

B.1 Proofs from Section 2
Proposition 2.3. Suppose that $P_1, \ldots, P_k$ is any part-wise aggregation instance in a communication network G. Given a shortcut of quality Q, we can solve with high probability the part-wise aggregation problem in O(Q) CONGEST rounds.
Proof sketch. Consider only one part $P_i$ in isolation over the network $G[P_i] + H_i$. First, we claim that there exists a simple deterministic algorithm that computes the AND-aggregate (where each node $v \in P_i$ has an input bit x(v)) in O(d) rounds, where each edge is used to send at most O(1) messages. Concretely, any node whose input is 0 forwards its input to all neighbors and deactivates itself. Any node that hears about the existence of an input 0 forwards this to all of its neighbors and deactivates itself. After O(d) rounds, either all nodes have heard about the existence of a 0, or they can conclude that all inputs are 1.
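The flooding rule above can be illustrated with a small synchronous simulation. This is only a sketch under simplifying assumptions: the function name `and_aggregate_by_flooding`, the adjacency-dictionary input format, and the explicit `rounds` parameter are hypothetical choices made for illustration, and the simulation is centralized rather than an actual CONGEST implementation.

```python
def and_aggregate_by_flooding(adj, bits, rounds):
    """Simulate the flooding AND-aggregate on an undirected graph.

    adj    : dict mapping each node to a list of its neighbors
    bits   : dict mapping each node to its input bit (0 or 1)
    rounds : number of synchronous rounds (should be at least the diameter)
    Returns (outputs, max_messages_on_any_edge).
    """
    knows_zero = {v: bits[v] == 0 for v in adj}
    # Nodes that send in the next round; each node sends exactly once.
    senders = {v for v in adj if bits[v] == 0}
    load = {}
    for _ in range(rounds):
        newly_informed = set()
        for v in senders:
            for w in adj[v]:
                edge = frozenset((v, w))
                load[edge] = load.get(edge, 0) + 1
                if not knows_zero[w]:
                    newly_informed.add(w)
        for w in newly_informed:
            knows_zero[w] = True
        # Previous senders deactivate; only freshly informed nodes forward.
        senders = newly_informed
    outputs = {v: 0 if knows_zero[v] else 1 for v in adj}
    return outputs, max(load.values(), default=0)
```

Since every node forwards the news of a 0 only once, each edge carries at most one message per direction, which is the O(1) per-edge bound used in the proof sketch.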
We continue considering only one part $P_i$ in isolation. The next step is to elect a leader of $P_i$ by finding the node with the smallest ID in $P_i$: (1) iterate from the most significant bit of the ID to the least significant bit; (2) compute the AND-aggregate of the current bit over the IDs of all nodes that have not yet dropped out; (3) if the AND-aggregate is 0, all nodes whose current bit is 1 drop out. Since IDs are distinct, the unique node that remains after the last bit is the one with the smallest ID.
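As a sanity check of the bit-by-bit election, the following sketch replaces each distributed AND-aggregate with a local `all(...)` over the remaining candidates. The function name `elect_min_id` and the `id_bits` parameter are hypothetical, and distinct non-negative integer IDs are assumed.

```python
def elect_min_id(ids, id_bits):
    """Return the smallest ID using only AND-aggregates of single bits.

    ids     : iterable of distinct non-negative integer IDs
    id_bits : number of bits needed to represent every ID
    """
    candidates = set(ids)
    # Scan from the most significant bit down to the least significant bit.
    for pos in reversed(range(id_bits)):
        # Stand-in for one AND-aggregate over the remaining candidates.
        and_bit = all((i >> pos) & 1 for i in candidates)
        if not and_bit:
            # Some candidate has a 0 here, so candidates with a 1 drop out.
            candidates = {i for i in candidates if not (i >> pos) & 1}
    (leader,) = candidates  # exactly one candidate survives
    return leader
```

The number of AND-aggregates equals the ID length, so with O(log n)-bit IDs the election costs O(log n) aggregate computations.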
Putting these together we have a way of computing the aggregate of a part $P_i$ in isolation in O(d) rounds with each edge carrying O(1) messages: First, we elect a leader of $P_i$. Then, the leader initiates the computation of a spanning BFS tree of $G[P_i] + H_i$ by broadcasting from itself to all other nodes, and each node forwards the message to all neighbors; the neighbor from which it hears the message first is the parent in the tree. Finally, by performing a convergecast over the BFS tree, one can easily compute the aggregate in O(d) rounds for a single part $P_i$.
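The broadcast-then-convergecast pattern can be sketched as follows. This is a centralized illustration, not the distributed protocol itself; the names `bfs_tree` and `convergecast`, and the representation of the network $G[P_i] + H_i$ as an adjacency dictionary, are assumptions made for the example.

```python
from collections import deque


def bfs_tree(adj, leader):
    """Return (parent, order): BFS parents and the order nodes were reached."""
    parent = {leader: None}
    order = [leader]
    queue = deque([leader])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in parent:
                # The first neighbor a node hears from becomes its parent.
                parent[w] = v
                order.append(w)
                queue.append(w)
    return parent, order


def convergecast(adj, leader, values, combine):
    """Aggregate `values` up the BFS tree rooted at `leader`."""
    parent, order = bfs_tree(adj, leader)
    partial = dict(values)
    # Reverse BFS order visits every child before its parent.
    for v in reversed(order):
        p = parent[v]
        if p is not None:
            partial[p] = combine(partial[p], partial[v])
    return partial[leader]
```

Any associative and commutative `combine` (sum, min, max, AND) works, since each tree edge is used once in each direction.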
Finally, we have to run the algorithms on all the parts $\{P_i\}_i$ simultaneously. However, this might incur congestion issues on some edges: algorithms associated with multiple parts may want to send a message through the same edge in the same round. To prevent this, we randomly delay the start of each algorithm by selecting the delay uniformly at random between 0 and O(c). The congestion bound guarantees that the total number of messages (across all parts and all rounds) that want to cross a given edge is O(c). Hence, after randomly delaying all algorithms, the expected number of messages crossing a given edge in a fixed round is Θ(1). By Chernoff bounds, this number is bounded by O(1) with high probability. Therefore, by simulating each round of the algorithm using O(1) rounds of communication (where each round of communication carries at most a single message across an edge), we can schedule the algorithms on all parts simultaneously [Gha15]. In turn, this allows us to complete all of the aggregates in

B.2 Proofs from Section 3
Lemma 3.3 (Unrestricted Congested Part-Wise Aggregation). Let G be an n-node graph and let $\mathbb{Z}_{\ge 1} \ni \rho \le \mathrm{poly}(n)$. Suppose that any (1-congested) part-wise aggregation on $G_{O(\rho)}$ can be solved with a τ-round CONGEST algorithm on $G_{O(\rho)}$. Then, there exists an $O(\rho \cdot \tau)$-round CONGEST algorithm on G that solves any ρ-congested part-wise aggregation instance on G.
Proof. Armed with Lemma 3.6, the claim essentially follows by leveraging Haeupler, Wajc, and Zuzic [HWZ21, Lemma 7.2 in the Full Version]. More precisely, we will have to slightly reformulate their result. Haeupler, Wajc, and Zuzic [HWZ21] show how, for a given part $P_i$, one can solve the part-wise aggregation problem on $P_i$ by reducing it to a sequence of O(1)-many (1-congested) part-wise aggregations between disjoint parts that are restricted to be simple paths (where nodes know the paths' edges they participate in). This is sufficient to prove our result: suppose we run that reduction on all parts $P_i$ simultaneously. A single call to the part-wise aggregation on all of the parts $P_i$ combined asks to solve a ρ-congested part-wise aggregation in which the parts are all simple paths. This is ρ-congested since at most ρ parts $P_i$ use any node v, and within each such $P_i$, every oracle call uses the node v at most once (since the paths are disjoint).
Let us briefly comment on the validity of our interpretation of [HWZ21, Lemma 7.2]. Their statement has a few easily reconciled differences compared to our previous usage. Most notably, they compute shortcuts for a set of parts, assuming an oracle for doing so, which is a harder problem than simply computing part-wise aggregates. However, it can be easily verified that the shortcuts are used only to facilitate solving part-wise aggregations. Hence, the proof can easily be translated to require an oracle computing only part-wise aggregations.
Lemma 3.4 (Simulating $G_\rho$ in G). For any G and any $\mathbb{Z}_{\ge 1} \ni \rho \le \mathrm{poly}(n)$, we can simulate any τ-round CONGEST algorithm on $G_\rho$ with a $(\rho \cdot \tau)$-round CONGEST algorithm on G.
Proof. Let us consider one round of communication in $G_\rho$. Each node v will simulate (learn all messages coming into) its copies $v_1, \ldots, v_\rho \in V(G_\rho)$. Therefore, in each round node v ∈ V(G) needs to learn all messages sent to v's copies $v_1, \ldots, v_\rho \in V(G_\rho)$ from their neighbors in $G_\rho$. Note that, by definition, v already knows the messages sent between any two copies $v_i$ and $v_j$. Hence, in a single round v can learn all messages sent to any fixed $v_i$. As a result, ρ rounds of communication in G suffice to simulate a single round in $G_\rho$.
Fact 3.5 (Folklore, [Joh99]). Given a (multi)graph G with n nodes and maximum degree ∆ ≤ poly(n), there exists a randomized CONGEST algorithm that colors the edges of G with O(∆) colors and completes in O(log n) rounds, with high probability. The coloring is proper, i.e., two edges that share an endpoint are assigned a different color.
Proof sketch. A simple edge-coloring algorithm presented in [Joh99] works by choosing a color uniformly at random from the set {1, . . ., O(∆)} for each edge. Each edge will, with constant probability, choose a color not used by its neighbors. Then, this color stays fixed and the edge drops out. Hence, after O(log n) iterations the edges will be properly colored with high probability. Implementation-wise, we can assume there is an additional node in the middle of each edge which represents that edge (this only makes the problem harder). Each edge randomly chooses and sends its color to its endpoints which, in turn, inform it on whether there is a conflict. Then, each edge informs its endpoints whether it dropped out. This iteration is then repeated until we reach a proper coloring.
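The random coloring procedure of the proof sketch can be simulated centrally as follows. This is only an illustrative sketch: the function name `random_edge_coloring`, the palette size passed by the caller (4∆ in the usage below, an assumed instantiation of the O(∆) bound), and the `max_iters` safety cap are hypothetical choices and are not taken from [Joh99].

```python
import random


def random_edge_coloring(edges, palette_size, rng, max_iters=1000):
    """Properly color the edges of a (multi)graph by repeated random trials.

    edges        : list of (u, v) pairs; repeated pairs are parallel edges
    palette_size : number of available colors, assumed to be O(max degree)
    rng          : a random.Random instance
    Returns a dict mapping edge index to its final color.
    """
    color = {}
    uncolored = set(range(len(edges)))
    for _ in range(max_iters):
        if not uncolored:
            break
        # Every uncolored edge proposes a uniformly random color.
        proposal = {e: rng.randrange(palette_size) for e in uncolored}
        fixed_now = []
        for e in uncolored:
            endpoints = set(edges[e])
            conflict = False
            for f, other in enumerate(edges):
                if f == e or not (endpoints & set(other)):
                    continue
                # Adjacent edge f is either already fixed or proposing now.
                if color.get(f, proposal.get(f)) == proposal[e]:
                    conflict = True
                    break
            if not conflict:
                fixed_now.append(e)
        for e in fixed_now:
            color[e] = proposal[e]
            uncolored.discard(e)
    return color
```

With a palette of size 4∆, an edge has fewer than 2∆ adjacent edges, so each proposal succeeds with probability at least 1/2, matching the constant success probability used in the proof sketch.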
By construction of the layered graph $G_i$, there exists a path of length at most D(G) in the i-th layer of $G_\rho$ between $u_i$ and $v_i$. Thus, it follows that the (hop) distance between $u_i$ and $v_j$ is at most D(G) + 1, given that $v_j$ and $v_i$, with i ≠ j, are adjacent (the copies form a clique in the layered graph). This also implies that the distance between any two nodes $u_i$ and $u_j$, with $\pi(u_i) = \pi(u_j)$, is 1, concluding the proof.

B.3 Useful Routines
Before diving into the proofs of the Laplacian building blocks it will be useful to present several operations that can be performed efficiently under Assumption 4.1. We stress that the proofs related to the Laplacian solver closely follow the approach in [For+20]. Our goal here is to translate them into our more general setting.
Corollary B.1 (Matrix-Vector Products).Consider a matrix A with non-zeroes supported on the edges of an n-node graph G which is ρ-minor distributed over a communication network G for which Assumption 4.1 holds for some Q = Q(ρ), with values stored in the endpoints of the corresponding edges, and a vector x ∈ R n stored on the nodes (u G ) for u G ∈ V (G).Then, we can compute the vector Ax ∈ R n stored on the leader nodes (u G ) for all u G ∈ V (G) after O(Q(ρ)) rounds with high probability.
The proof of this corollary follows the one by Forster, Goranci, Liu, Peng, Sun, and Ye [For+20, Corollary 4.4], but nonetheless we state it here for completeness.
Proof of Corollary B.1. The first step is to use Assumption 4.1 to disseminate the coordinates of vector x to the corresponding super-nodes after Q(ρ) rounds; that is, for every u G ∈ V (G) the leader (u G ) passes to S G→G (u G ) the corresponding coordinate. Then, every node performs locally all the multiplications for its corresponding indices, and after ρ rounds the node can deliver this information to the corresponding super-node. Observe that this is possible because A is supported on edges of G, and Definition A.2 imposes an edge-congestion bound. Finally, we invoke Assumption 4.1 again to sum all of the values of each super-node at the leader node, which yields the desired output.
Another important corollary of Assumption 4.1 is that we can simulate the spectral sparsification algorithm of Koutis (henceforth SpectralSparsify) on G [Kou14]:
3. L is the set of leaders such that every cluster S i has exactly one leader i ∈ L. The ID of the leader node will also serve as the ID of the cluster, while it is assumed that nodes know the ID of their leader, as well as the size of their cluster;
4. T = {T 1 , . . ., T N } is a set of cluster trees such that each cluster tree T i = (S i , E i ) is a (rooted) spanning tree of the induced subgraph G[S i ] of G, with root the leader of the cluster i ∈ S i (observe that this implies that the subgraph induced by each cluster S i is connected);
5. ψ : E → E is a bijective function that maps every edge {S i , S j } ∈ E to some edge {u i , u j } ∈ E connecting the corresponding clusters; i.e., it holds that u i ∈ S i and u j ∈ S j . It is assumed that the two nodes u i and u j know that the edge {u i , u j } is used to connect their respective clusters, as well as its weight.
Having introduced the concept of a distributed cluster graph, we state the following lemma, which is a direct corollary of the communication primitives we previously described.
Lemma B.6.Let G = (V, E) be an n-node graph ρ-minor distributed into an n-node communication network G = (V , E) for which Assumption 4.1 holds for some  For+20]).Let G be a graph ρ-minor distributed into an n-node communication network G for which Assumption 4.1 holds for some Q = Q(ρ).Moreover, let L be the Laplacian matrix associated with G, and F be a subset of V (G) such that L [F,F ] is α-DD for some α ≥ 4.Then, for any vector b stored on the leaders of the super-nodes, there is an algorithm which returns in O(Q(ρ) log(1/ε)) rounds the vector Zb stored on the same nodes, where Z is a linear operator such that for any sufficiently small ε > 0.
Again, this lemma follows from the guarantee in [Kyn+16] regarding the Jacobi procedure, as well as by directly adapting the distributed implementation in [For+20] using Corollary B.1.
Approximating the Schur Complement. Moreover, α-DD sets will be useful in the approximation of the Schur complement induced by the complementary subset of nodes. First, let us recall a combinatorial view of the Schur complement as a Laplacian matrix with weights estimated by certain random walks:
Lemma B.13 ([Dur+19]). Let G be an n-node weighted graph and T a subset of its nodes. Moreover, consider parameters 0 < ε < 1 and $\mu = O(\log n/\varepsilon^2)$. Starting from an initially empty graph H, repeat the following procedure for every edge {u, v} ∈ E(G) and for μ iterations:
1. Simulate a random walk starting from u until it first hits T at some node $t_1$;
2. Simulate a random walk starting from v until it first hits T at some node $t_2$;
3. Combine these two walks to get a walk $t_1 = u_0, \ldots, u_\ell = t_2$, where ℓ is the length of the combined walk.
4. Add the edge {t 1 , t 2 } to H with weight .
Then, the resulting graph H satisfies L(H) ≈ ε SC(G, T ) with high probability.
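The sampling procedure of the lemma can be sketched as below. The weight assigned to each sampled edge is an assumption here: the sketch uses $1/(\mu \sum_{i=0}^{\ell-1} 1/w(u_i, u_{i+1}))$, i.e., the inverse of μ times the total resistance of the combined walk, as in the random-walk view of [Dur+19]. The function name `sample_schur_complement` and the dictionary-based graph format are hypothetical, and the graph is assumed to be simple with positive weights and with every node able to reach T.

```python
import random


def sample_schur_complement(weights, terminals, mu, rng):
    """Sample a graph H on `terminals` approximating the Schur complement.

    weights   : dict mapping an edge (u, v) to its positive weight
    terminals : set of terminal nodes T
    mu        : number of repetitions per edge
    rng       : a random.Random instance
    Returns a dict mapping (t1, t2) with t1 < t2 to the accumulated weight.
    """
    adj, w_of = {}, {}
    for (u, v), w in weights.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
        w_of[(u, v)] = w_of[(v, u)] = w

    def walk_to_terminal(start):
        path = [start]
        while path[-1] not in terminals:
            nbrs, ws = zip(*adj[path[-1]])
            # Step with probability proportional to the incident weights.
            path.append(rng.choices(nbrs, weights=ws)[0])
        return path

    H = {}
    for (u, v) in weights:
        for _ in range(mu):
            # Walk t1 <- ... <- u, then the edge (u, v), then v -> ... -> t2.
            combined = walk_to_terminal(u)[::-1] + walk_to_terminal(v)
            t1, t2 = combined[0], combined[-1]
            if t1 == t2:
                continue  # self-loops do not affect the Laplacian
            resistance = sum(1.0 / w_of[(a, b)]
                             for a, b in zip(combined, combined[1:]))
            key = (min(t1, t2), max(t1, t2))
            # Assumed weight rule: 1 / (mu * total resistance of the walk).
            H[key] = H.get(key, 0.0) + 1.0 / (mu * resistance)
    return H
```

On a path of two unit-weight edges with the endpoints as terminals, the exact Schur complement is a single edge of weight 1/2 (two unit resistors in series), and the sampled weight concentrates around that value as μ grows.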
It should be noted that the random walks in the lemma are implied in the usual sense, wherein a step from a node is taken with probability proportional to the edge-weights of the incident edges. In the sequel, we will compute an α-DD set F via Lemma B.11, and then the goal will be to approximate the Schur complement on the set T = V \ F. Importantly, given that F is α-DD, we can guarantee that the random walks required in Lemma B.13 will be short in expectation. Nonetheless, a challenge that arises in the distributed context (and in particular under the CONGEST model) is that the expected congestion of an edge may be prohibitively large. This issue will be resolved by incorporating new nodes to the terminals whenever they exceed some threshold of congestion. At the same time, however, we also have to limit the node-congestion since G is minor distributed into G, and we can only deal with limited congestion. This will be addressed by invoking the spectral sparsification algorithm, ensuring that the average degree, and subsequently the congestion, remains limited.
Before we proceed with the algorithm that approximates the Schur complement, we note that we can implement the random walks of Lemma B.13 in O(Q(ρ)) rounds under Assumption 4.1, as implied by the approach in [For+20].

Lemma B.14 ([For+20]). Let G be an n-node graph ρ-minor distributed into an n-node communication network G for which Assumption 4.1 holds for some Q = Q(ρ). Moreover, let F be an α-DD set, T = V \ F the set of terminals, ε ∈ (0, 1) some error parameter, and γ ≥ 1 the congestion parameter. Then, the algorithm RandomWalkSchur runs in $O(\alpha^{-1} \gamma Q(\rho) \log^2 n/\varepsilon^2)$ rounds, and returns a graph H along with its $(\alpha^{-1} \gamma \log n \cdot \rho)$-minor distribution into G such that with high probability, where $\widehat{T} \supseteq T$ has size at most n
Proof. Let us briefly describe the RandomWalkSchur algorithm. First, we compute the expected congestion of the family of random walks W predicted by Lemma B.13 with respect to the set of terminals T. This is done by propagating the congestion to neighbors for $O(\alpha^{-1} \log n)$ steps. Then, we create a new set $\widehat{T}$ which includes T as a subset, as well as all the nodes which exceeded the congestion threshold of γ based on the estimation procedure of the previous step. Note that the congestion of a node with respect to W is simply the number of times this particular node participates in some random walk of W. By construction, it follows that $\widehat{T}$ consists of the n − |F| nodes of T along with all the nodes that exceeded the congestion threshold of γ. However, since F is an α-DD set it follows that the length of a random walk is $O(\alpha^{-1} \log n)$ with high probability, while for every edge we simulate $\mu = O(\log n/\varepsilon^2)$ random walks (this is related to the concentration of the corresponding random variables, as implied by Lemma B.13), in turn implying that the total congestion generated by these random walks is $O(\alpha^{-1} m \varepsilon^{-2} \log^2 n)$. As a result, only $O(\alpha^{-1} m \varepsilon^{-2} \log^2 n/\gamma)$ nodes can have congestion more than γ, verifying the assertion regarding the size of $\widehat{T}$. Next, the algorithm implements the random walks of Lemma B.13, but with respect to the augmented set of terminals $\widehat{T}$. A Chernoff bound argument assures us that all nodes in $V \setminus \widehat{T}$ will have congestion O(γ) with high probability.
In terms of the distributed implementation, estimating the congestion can be implemented in $O(\alpha^{-1} Q(\rho) \log^2 n/\varepsilon^2)$ rounds; this follows since every walk has length $O(\alpha^{-1} \log n)$ with high probability, and we execute $\mu = O(\log n/\varepsilon^2)$ iterations for every edge. Also note that a single step in the procedure estimating the congestion can be implemented in O(Q(ρ)) rounds. Next, the generation of the random walks with respect to the augmented set of terminals can be performed in $O(\alpha^{-1} \gamma Q(\rho) \log^2 n/\varepsilon^2)$ rounds with high probability; this uses the aforementioned guarantee for the congestion. The final step is to minor-distribute the graph H with weights as dictated by Lemma B.13. This is done by assigning to the terminals the leaders of all intermediate (non-terminal) nodes. The congestion guarantee ensures that the resulting mapping is an $O(\alpha^{-1} \gamma \log n \cdot \rho)$-minor distribution into G.
Proof of Lemma A.5. The Eliminate algorithm proceeds in d rounds, initializing $M^{(0)}$ to be an ε-spectral sparsifier of L(G) (recall Corollary B.2). In every round i ≥ 1, (i) we compute an α-DD set $F_i$ with α := 4; (ii) we employ Lemma B.12 to have access to an operator that approximates $M^{(i-1)}_{[F,F]}$; and (iii) we compute an ε-spectral sparsifier $M^{(i)}$ of the Schur complement $SC(M^{(i-1)}, T_i)$ approximated via Lemma B.14; here, $T_i = T_{i-1} - F_i + U_i$, where $U_i$ represents the set of extra nodes added to ensure low congestion. In particular, Lemma B.14 is invoked with congestion parameter $\gamma := 1000 C \alpha^{-1} \log^8 n/\varepsilon^4$, where C is a sufficiently large constant. The sparsification algorithm of Koutis (Corollary B.2) tells us that the number of edges will be $m = O(n \log^6 n/\varepsilon^2)$, in turn implying that the number of nodes drops by at least a multiplicative factor of 49/50.
In terms of the distributed implementation, notice that due to the selection of the parameters the approximation of the Schur complement (Lemma B.14) can be performed in $O(Q(\rho) \log^{10} n/\varepsilon^6)$ rounds. Next, the spectral sparsification step can be implemented in $O(Q(\rho') \log^7 n/\varepsilon^2)$ rounds, where $\rho' = \alpha^{-1} \gamma \log n \cdot \rho = O(\log^9 n/\varepsilon^4) \rho$. Thus, by virtue of Assumption 4.1 we can infer that $Q(\rho') = O(\log^c n/\varepsilon^c) Q(\rho)$, where c is some universal constant. Consequently, after d iterations the cost of these operations is bounded by $O(Q(\rho) (\log^c n/\varepsilon^c)^d)$. Finally, the error guarantee follows directly from Lemmas B.12 to B.14, after a direct argument bounding the accumulation of the error.

B.6 Minor Schur Complement: Proof of Lemma A.6
We commence this subsection by introducing the notion of steady edges, which are in a sense edges which are mutually "uncorrelated": In words, the first constraint ensures that no edge will be selected in the steady set with too high of a probability; the second corresponds to the localization constraint, circumscribing the (mutual) correlation of edges within the set; and the final constraint imposes a bound on the variance, and will be used in the martingale analysis (to apply Freedman's inequality).It should be stressed that the existence of such objects is highly non-trivial, and follows from the localization of electrical flows recently shown by Schild, Rao, and Srivastava [SRS18].In the distributed setting, the following result will be established: ]).Let G be an n-node m-edge graph ρ-minor distributed into G for which Assumption 4.1 holds for some Q = Q(ρ).For a constant δ ∈ (0, 1) and a subset of terminals T ⊆ V (G), there exists an algorithm which has access to a distributed Laplacian solver, and returns with high probability a set of at least δm/(2000C log 2 m) edges in expectation which is (δ/(1000C log 2 m), δ)-steady, where C is a sufficiently large constant.This algorithm requires O(log 2 n) calls to a distributed Laplacian solver to 1/ poly(n) accuracy on graphs that 2ρ-minor distribute into G, and O(Q(ρ) log 2 n) communication rounds.
The first step towards establishing this lemma is to approximate the correlation of edges within some arbitrary set:
Lemma B.17 ([For+20]). Let G be an n-node graph with resistances r, ρ-minor distributed into a communication network G for which Assumption 4.1 holds for some Q = Q(ρ). Then, there is an algorithm, with access to a distributed Laplacian solver, which for any subset W ⊆ E(G) and any edge e ∈ W returns with high probability the quantity
The proof of this lemma follows directly from [For+20, Lemma 5.13], and leverages the $\ell_1$-sketch of Indyk [Ind06]. Similarly, a sketch can be employed to estimate the effect of each edge on the Schur complement:
As a result, Lemma B.16 is established based on the algorithm FindSteady in [For+20], with the round complexity guarantee following directly from Lemma B.17 and Lemma B.18.
The next ingredient is a pre-processing step which ensures that all the edges have leverage scores bounded away from 0 and 1.

Lemma B.19 ([LS18]). Let G be an n-node graph ρ-minor distributed into a communication network G for which Assumption 4.1 holds for some Q = Q(ρ). Then, there exists a procedure which takes as input G and returns in O(Q(ρ)) rounds a graph resulting from collapsing paths and parallel edges, and removing non-terminal leaves, along with a ρ-minor distribution into G.
The distributed implementation of this lemma is fairly simple, and relies on Lemma B.3. We will also use the following lemma, which is based on the random projection scheme of Spielman and Srivastava [SS08]:
As a result, BuildChain returns a $(2^{\Theta((\log \log n)^2)}, \varepsilon)$-Schur complement chain, which in turn implies that this chain has length $O(\log n/(\log \log n)^2)$. Thus, Lemma A.8 implies that we can use this chain to produce a solution in $\rho\, n^{o(1)} Q(\rho)$ rounds, where ρ represents the maximum congestion of a graph along the chain; it will be established that $\rho = n^{o(1)}$.

B.8 Proof of Proposition 1.1
Proposition 1.1. Consider a graph G with shortcut quality SQ(G). Then, solving a Laplacian system on G with ε ≤ 1/2 requires $\widetilde{\Omega}(\mathrm{SQ}(G))$ rounds in both CONGEST and Supported-CONGEST models.
Proof. First of all, as pointed out in [For+20, Theorem 2], it suffices to establish the lower bound for a high-precision solver, i.e. for a sufficiently small ε = 1/poly(n). Indeed, a low-accuracy solver (ε ≤ 1/2) can always be "boosted" with only an O(log n) overhead in the overall complexity. In this context, let H be the input to the spanning connected subgraph problem. We construct a resistor network H so that r(e) = 1 if e ∈ E(H), and $r(e) = n^4$ for every edge e ∉ E(H). Moreover, let us select arbitrarily a node v ∈ V(G). The key idea of the proof is to consider as input to the Laplacian solver a vector $b \in \mathbb{R}^n$ such that b(u) = −1 for all u ∈ V(G) \ {v}, while b(v) = n − 1.
To analyze the output of that Laplacian system, we first analyze the simpler Laplacian system with input a vector $\chi_{v,u} \in \mathbb{R}^n$ for which the coordinate corresponding to node v is 1; the coordinate corresponding to node u is −1; and any other coordinate is set to 0. We recall the following well-known facts.
As argued in [For+20], the output of the Laplacian with input $\chi_{v,u}$ and a sufficiently small error ε = 1/poly(n) can be used to determine whether v and u are connected. Indeed, the following arguments have been extracted from their lower bound.
Claim B.24. If u and v are connected in H it follows that $\mathrm{res}_H(v, u) \le n - 1$.
Proof. It is well-known that the effective resistances satisfy the triangle inequality. Moreover, given that v and u are connected in H, it follows that there exists a path of length at most n − 1 in H so that every edge has resistance 1 (by construction of the resistor network H). As a result, the triangle inequality implies that $\mathrm{res}_H(v, u) \le n - 1$.
The next step of the proof is to incorporate in the analysis the error of the solver. To this end, let φ be an ε-approximate solution to the linear system $L(H)\phi = \chi_{v,u}$ in the sense that
Moreover, since the Laplacian matrix has integer resistances up to range poly(n), it follows that for any x, $\|x\|_\infty \le \mathrm{poly}(n)\, \|x\|_L$. Thus, by setting ε = 1/poly(n) to be sufficiently small, we have that $\mathrm{res}_H(v, u) - \frac{1}{n} \le \phi(v) - \phi(u) \le \mathrm{res}_H(v, u) + \frac{1}{n}$.
Now we will use these bounds to argue about the initial Laplacian system with input vector b. By linearity, a solution of the Laplacian system with input b can be expressed as the sum of solutions of Laplacians with input $\chi_{v,u}$ over all u ∈ V(G) \ {v}. Next, we let $\phi = L(H)^{\dagger} b$, and φ be the output of the Laplacian solver for a sufficiently small ε = 1/poly(n). Our analysis distinguishes between the following cases.
Case I. Suppose that H is connected. In turn, this implies that v is connected with any node u ∈ V(G). As a result, it follows from Fact B.22, Fact B.23 and Claim B.24 that for any node u, $\phi(v) - \phi(u) \le (n - 1)^2 + 1$.
(1) associated aggregation parts.Next, we can essentially reverse in time the previous communication pattern, but this time using the aggregate values as determined by the target nodes.As a result, every node will know with high probability the aggregate value for each of its aggregation parts after O(ρ + log n) rounds of NCC.
with congestion c and dilation d if the following properties hold: (i) the (hop) diameter of each subgraph $G[P_i] \cup H_i$ is at most d, and (ii) every edge is included in at most c many of the subgraphs $H_i$. The quantity Q = c + d will be referred to as the quality of the shortcut. Importantly, a shortcut of quality Q allows us to solve the part-wise aggregation problem in O(Q) rounds of CONGEST, as formalized below. For self-sufficiency, we include the proof in Appendix B.1.
Proposition 2.3. Suppose that $P_1, \ldots, P_k$ is any part-wise aggregation instance in a communication network G. Given a shortcut of quality Q, we can solve with high probability the part-wise aggregation problem in O(Q) CONGEST rounds.
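To make the definition of congestion, dilation, and quality concrete, the following sketch evaluates them for an explicitly given shortcut. The function names `hop_diameter` and `shortcut_quality` and the edge-list input format are hypothetical choices for illustration; the code simply checks properties (i) and (ii) by brute force.

```python
from collections import deque


def hop_diameter(adj):
    """Hop diameter of an undirected graph; infinity if it is disconnected."""
    worst = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        if len(dist) < len(adj):
            return float("inf")
        worst = max(worst, max(dist.values()))
    return worst


def shortcut_quality(graph_edges, parts, shortcuts):
    """Return (congestion c, dilation d, quality c + d) of a shortcut.

    graph_edges : list of edges (u, v) of the communication network G
    parts       : list of disjoint node sets P_i
    shortcuts   : list of edge lists H_i, one per part
    """
    # Congestion: the maximum number of subgraphs H_i sharing an edge.
    usage = {}
    for H in shortcuts:
        for e in H:
            usage[frozenset(e)] = usage.get(frozenset(e), 0) + 1
    congestion = max(usage.values(), default=0)
    # Dilation: the maximum hop diameter of G[P_i] together with H_i.
    dilation = 0
    for P, H in zip(parts, shortcuts):
        edges = {frozenset(e) for e in graph_edges if e[0] in P and e[1] in P}
        edges |= {frozenset(e) for e in H}
        adj = {v: set() for v in P}
        for e in edges:
            a, b = tuple(e)
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
        dilation = max(dilation, hop_diameter(adj))
    return congestion, dilation, congestion + dilation
```

For instance, on a 7-cycle where one part is a path of six consecutive nodes, adding the two remaining cycle edges as a shortcut reduces the dilation from 5 to 3 at the price of congestion 1.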
where r(G) is the complete-graph minor size, i.e., $r(G) = \max\{r : K_r \text{ is a minor of } G\}$ [Tho84; Tho01]. Furthermore, any family of graphs closed under taking minors (such as planar graphs) has a constant minor density. For such graphs, Ghaffari and Haeupler [GH20] established efficient shortcut construction:
Theorem 2.7 ([GH20]). Any graph G with hop-diameter D and minor density δ(G) admits shortcuts of quality O(δD), which can be constructed with high probability in O(δD) rounds of CONGEST.

Figure 1: A 2-congested part-wise aggregation problem on a 6×6 grid (the instance immediately extends to a $\sqrt{n} \times \sqrt{n}$ topology). Different colors highlight different parts of the instance.

Figure 2: An example of a transformation from G to the layered graph $G_\rho$ with ρ = 3. We have highlighted with different colors different layers of the graph.
Lemma 3.6 (Path-Restricted Congested Part-Wise Aggregation). Let G be an n-node graph and let Z ≥1

Corollary 3.9. Let G be an n-node communication network of diameter at most D and treewidth tw(G). Then, we can solve with high probability any ρ-congested part-wise aggregation problem in G within $O(\rho^2 \cdot \mathrm{tw}(G) \cdot D)$ rounds of CONGEST.
Proof. First, we know from Lemma 3.8 that $\mathrm{tw}(G_\rho) = O(\rho\, \mathrm{tw}(G))$, in turn implying that the minor density of $G_\rho$ can be bounded as $\delta(G_\rho) \le \mathrm{tw}(G_\rho) = O(\rho\, \mathrm{tw}(G))$ (Fact 2.9). Thus, Theorem 2.7 implies that $G_\rho$ admits shortcuts of quality $O(\rho\, \mathrm{tw}(G) D(G))$, which can additionally be constructed in $O(\rho\, \mathrm{tw}(G) D(G))$ rounds of communication on $G_\rho$. Finally, we have shown in Lemma 3.3 that this is sufficient to solve any ρ-congested part-wise aggregation problem on G in $O(\rho^2 \cdot \mathrm{tw}(G) \cdot D(G))$ rounds of CONGEST, concluding the proof.

Lemma 3.13. Given a graph G, suppose we are given any two multisets of nodes S ⊆ V(G) and T ⊆ V(G) of size k := |S| = |T| that have any-to-any node connectivity ρ. Then, we can partition $S = S_1 \sqcup S_2 \sqcup \ldots \sqcup S_{O(\rho \log k)}$ and $T = T_1 \sqcup T_2 \sqcup \ldots \sqcup T_{O(\rho \log k)}$ such that $|S_i| = |T_i|$ and $(S_i, T_i)$ are any-to-any node-disjointly connectable.

Lemma A.3. Let G = (V, E) be an n-node graph ρ-minor distributed into an n-node communication network G = (V , E) for which Assumption 4.1 holds for some Q = Q(ρ). Then, we can perform with high probability the following operations in the NCC model, simultaneously for all u G ∈ V (G), within O(Q(ρ)) rounds:
1. Every leader (u G ) sends an O(log n)-bit message to all the nodes in S G→G (u G );
2. All the nodes in S G→G (u G ) compute an aggregation function on O(log n)-bit inputs.
Definition B.15 ([For+20]).A stochastic subset of edges Z ⊆ E is called (α, δ)-steady with respect to an m-edge graph H if 1. E Z e∈Z r(e) −1 b(e)b(e) T αL(H); 2. For all e ∈ Z we have e =f ∈Z |b(e) T L(H) † b(f )| all e ∈ Z it holds that r(e) −1 b(e) T L(H) † SC(H, e =f ∈W |b(e) T L(G) † b(e)| r(e) r(f ) to within a factor of 2. This algorithm requires O(log 2 n) calls to a distributed Laplacian solver on graphs that ρ-minor distribute into G to accuracy 1/ poly(n), and an additional O(Q(ρ) log 2 n) communication rounds.
Lemma B.18 ([For+20]). Let G be an n-node graph with resistances $r_e$, ρ-minor distributed into a communication network G for which Assumption 4.1 holds for some Q = Q(ρ). Then, for a subset T ⊆ V(G), there exists an algorithm which returns with high probability an estimate of $r(e)^{-1}\, b(e)^{T} L(G)^{\dagger} \begin{pmatrix} SC(G, T) & 0 \\ 0 & 0 \end{pmatrix} L(G)^{\dagger} b(e)$ to within a factor of 2. This algorithm requires O(log n) calls to a distributed Laplacian solver to accuracy 1/poly(n) on graphs that 2ρ-minor distribute into G, and O(Q(ρ) log n) communication rounds.

Claim B.25. If v and u are not connected in H it follows that $\mathrm{res}_H(v, u) \ge n^2$.
Proof. Suppose that $e_1, \ldots, e_k$ are the edges leaving the connected component of v in H, for some $k \le n^2$. Then, the Nash-Williams inequality implies that $\mathrm{res}_H(v, u) \ge \frac{1}{\sum_{i=1}^{k} \frac{1}{r(e_i)}} \ge n^2$, by construction of the resistor network.
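The gap between Claims B.24 and B.25 can be checked numerically on a small instance. The sketch below is only an illustration under stated assumptions: the helper names `resistor_network` and `effective_resistance` are hypothetical, the communication graph is taken to be a 4-cycle, and the effective resistance is computed exactly by grounding one node and running Gauss-Jordan elimination over rationals rather than by a distributed solver.

```python
from fractions import Fraction


def resistor_network(g_edges, h_edges, n):
    """Resistance 1 on edges of H and n^4 on the remaining edges of G."""
    h = {frozenset(e) for e in h_edges}
    return {e: Fraction(1) if frozenset(e) in h else Fraction(n ** 4)
            for e in g_edges}


def effective_resistance(n, resistances, s, t):
    """Exact effective resistance between s and t on nodes 0..n-1."""
    L = [[Fraction(0)] * n for _ in range(n)]
    for (a, b), res in resistances.items():
        c = 1 / res  # conductance of the edge
        L[a][a] += c
        L[b][b] += c
        L[a][b] -= c
        L[b][a] -= c
    # Ground t (potential 0) and inject one unit of current at s.
    idx = [i for i in range(n) if i != t]
    A = [[L[i][j] for j in idx] + [Fraction(int(i == s))] for i in idx]
    m = len(idx)
    for col in range(m):
        piv = max(range(col, m), key=lambda row: abs(A[row][col]))
        A[col], A[piv] = A[piv], A[col]
        for row in range(m):
            if row != col:
                f = A[row][col] / A[col][col]
                for k in range(col, m + 1):
                    A[row][k] -= f * A[col][k]
    potentials = {idx[i]: A[i][m] / A[i][i] for i in range(m)}
    # With t grounded, the potential at s equals the effective resistance.
    return potentials[s]
```

On the 4-cycle with n = 4, choosing H to be a spanning path keeps the effective resistance between its endpoints below n − 1 = 3, whereas splitting H into two components pushes it above n² = 16, which is the separation the solver's output is used to detect.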
ψ) is a distributed cluster graph for G, the following operations can be performed in O(Q(ρ)) rounds:
1. The leader i of each cluster S i broadcasts an O(log n)-bit message to every node in S i ;
2. Computing aggregation functions on O(log n)-bit inputs simultaneously for all clusters, assuming the tree T i is known.
Proof. The definition of a distributed N-node cluster graph (Definition B.5) implies that G is 1-minor distributed over G, and in turn ρ-minor distributed into G.