1 Introduction

1.1 Fastest mixing Markov chain set-up and motivation

Sampling objects from a finite set is a basic primitive with a myriad of applications. Sampling directly from such a set, however, may be computationally too expensive or even impossible, for example, if the objects are nodes of a distributed network. A common approach in these scenarios is to design a random walk (RW), or, more generally, a Markov chain whose state space corresponds to the set from which we wish to sample and which has the appropriate equilibrium distribution. Furthermore, to ensure our sampling procedure is computationally efficient, we desire our Markov chain to converge to equilibrium in a small number of steps, i.e., to have a fast mixing time.

This has wide-ranging applications: from shuffling cards [6, 15], to approximating statistical physics models [16, 25] and analysing load-balancing protocols in distributed computing [36, 44]. Furthermore, approximately sampling from the uniform distribution of a set can be used to estimate the size of the set itself [26]. This has been applied to approximating the permanent of a matrix [24, 27] and counting the number of independent sets [17], perfect matchings [18] and forests [4] in graphs.

Fundamental to these applications is a fast mixing time. Understanding in which instances fast mixing is achievable and what the intrinsic obstacles to fast mixing are is the focus of this paper. More precisely, we consider the following scenario.

  • We are given a finite, undirected graph \(G = (V, E)\): the vertex set represents the underlying state space, while the edge set E defines the transitions allowed.

  • Our goal is to study the fastest mixing Markov chain satisfying these constraints.

We assume throughout that graphs are finite, undirected and connected.

This problem was originally introduced by Boyd, Diaconis and Xiao [12] as the Fastest Mixing Markov Chain (FMMC) problem. Specifically, by considering only reversible chains and optimising the spectral gap as a proxy for the mixing time, they recast the problem of finding the fastest mixing Markov chain on a graph as a convex optimisation problem. Analogously to Boyd, Diaconis and Xiao [12], we dedicate most of our attention to reversible, time-homogeneous chains in discrete-time. We do, however, dedicate one section to questions in the continuous-time setting, first studied in [43], and one short final section to time-inhomogeneous chains. Compared with discrete-time chains, continuous-time and time-inhomogeneous chains are considerably more powerful, but perhaps less natural from an application viewpoint.

To be explicit and precise, a transition matrix \(P\) is reversible with respect to (w.r.t.) \(\pi \) if \( \pi (u) P(u,v) = \pi (v) P(v,u) \) for all \( u,v \in V. \) Results on the spectral gap below—both our own and those referenced—are always in the reversible set-up. We also restrict to lazy chains, i.e., chains with \(P(v,v) \ge \tfrac{1}{2}\) for all \(v \in V\). This is without loss of generality, since we are interested in maximising the spectral gap and this restriction costs a factor of at most \(\tfrac{1}{2}\) in the optimal spectral gap.
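For concreteness, reversibility and laziness are mechanical to check numerically; the sketch below (our own illustration, not from the paper) verifies the detailed-balance equations and the laziness condition for the lazy simple RW on a 4-cycle, which is reversible w.r.t. the degree-biased distribution.

```python
def lazy_rw_matrix(adj):
    """Lazy simple RW: stay put w.p. 1/2, else move to a uniform neighbour."""
    n = len(adj)
    P = [[0.0] * n for _ in range(n)]
    for u, nbrs in enumerate(adj):
        P[u][u] += 0.5
        for v in nbrs:
            P[u][v] += 0.5 / len(nbrs)
    return P

def is_reversible(P, pi, tol=1e-12):
    """Check the detailed-balance equations pi(u) P(u,v) = pi(v) P(v,u)."""
    n = len(P)
    return all(abs(pi[u] * P[u][v] - pi[v] * P[v][u]) <= tol
               for u in range(n) for v in range(n))

adj = [[1, 3], [0, 2], [1, 3], [0, 2]]                   # the 4-cycle
P = lazy_rw_matrix(adj)
pi = [len(nbrs) / sum(map(len, adj)) for nbrs in adj]    # degree-biased equilibrium
assert is_reversible(P, pi)
assert all(P[v][v] >= 0.5 for v in range(len(adj)))      # laziness
```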

There are a variety of choices for how “convergence to equilibrium” is measured. It is typically measured in the total variation (TV), or equivalently \(\ell _1\), distance. Other popular measures, particularly in the statistics literature, include \(\ell _2\), or \(\chi ^2\), distance and relative entropy, or Kullback–Leibler divergence. We also recall that \(\ell _2\) is equivalent to \(\ell _\infty \), or uniform, distance for reversible chains.

Nevertheless, no matter which one of these measures we choose, the long-term convergence to equilibrium of a lazy, reversible Markov chain is governed by its spectral gap. More precisely, given a transition matrix \(P\) of a lazy, reversible Markov chain, let \(d_P(t, x)\) denote the distance between \(P^t(x,\cdot )\) and its equilibrium distribution according to any of the aforementioned measures and let \(d_P(t) :=\max _{x \in V} d_P(t, x)\). Then,

$$\begin{aligned} \lim _{t \rightarrow \infty } d_P(t)^{1/t} = \lambda _P, \end{aligned}$$

where \(\lambda _P\) is the largest non-unitary eigenvalue of \(P\).

See [31, Theorems 12.4 and 12.5] for details. The spectral gap thus determines the asymptotic convergence to equilibrium without having to select a specific measure.

We now define formally the class of reversible Markov chains on a graph and then the spectral gap and the relaxation and mixing times.

Definition

(Markov Chains on a Graph) Let \(G = (V, E)\) be a graph and \(\pi \) a probability measure on V. We say that a Markov chain with transition matrix \(P\) is on G if \(P\in [0,1]^{V \times V}\) and \(P(u,v) > 0\) implies either \( \{ u,v \} \in E\) or \(u = v\). We denote by \(\mathcal M(G, \pi )\) the set of lazy transition matrices on G which are reversible w.r.t. \(\pi \).

Definition

(Spectral Gap, Relaxation Time and Mixing Time) Let \(G = (V, E)\) be a graph and \(\pi \) a probability measure on V. Let \(P\in \mathcal M(G, \pi )\). The spectral gap is \(\gamma _P :=1 - \lambda _P\), where \(\lambda _P\) is the largest non-unitary eigenvalue of \(P\). The relaxation time is \(t_{\textsf{rel}}(P) :=1/\gamma _P\). The (uniform) mixing time is \(t_{\textsf{mix}}(P) :=\min \{ t \ge 0 \mid d^\infty _P(t) \le \tfrac{1}{4} \} \), where \(d^\infty \) is the \(\ell _\infty \)-distance. Write \(\pi _\textsf{min} :=\min _{x \in V} \pi (x)\).

There is a standard relation between the relaxation and mixing times:

$$\begin{aligned} t_{\textsf{rel}}(P) \lesssim t_{\textsf{mix}}(P) \lesssim t_{\textsf{rel}}(P) \log ( \pi _\textsf{min}^{-1} ); \end{aligned}$$

see [31, Theorems 12.4 and 12.5] for details. The “\(\lesssim \)” symbol hides a multiplicative universal constant; we use the symbols “\(\gtrsim \)” and “\(\asymp \)” similarly. Typically, \(\log \pi _\textsf{min}^{-1} \asymp \log | V |\). So, the relaxation time is a proxy for the mixing time, as well as characterising long-term convergence to equilibrium.
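These quantities are easy to explore numerically. The sketch below (our own illustration, not from the paper) computes the spectral gap of the lazy RW on the n-cycle from its known eigenvalues and estimates the TV mixing time by iterating the distribution; the two are then checked against the standard relation between relaxation and mixing times.

```python
import math

def lazy_cycle_P(n):
    """Transition matrix of the lazy simple RW on the n-cycle."""
    P = [[0.0] * n for _ in range(n)]
    for u in range(n):
        P[u][u] = 0.5
        P[u][(u - 1) % n] += 0.25
        P[u][(u + 1) % n] += 0.25
    return P

def tv_mixing_time(P, eps=0.25, t_max=10_000):
    """First t with TV distance from uniform at most eps (start 0; any start, by symmetry)."""
    n = len(P)
    mu = [1.0] + [0.0] * (n - 1)
    for t in range(1, t_max + 1):
        mu = [sum(mu[u] * P[u][v] for u in range(n)) for v in range(n)]
        if 0.5 * sum(abs(m - 1 / n) for m in mu) <= eps:
            return t
    raise RuntimeError("did not mix within t_max steps")

n = 16
# eigenvalues of the lazy RW on the n-cycle are (1 + cos(2*pi*k/n))/2
gap = 0.5 * (1 - math.cos(2 * math.pi / n))
t_rel = 1 / gap
t_mix = tv_mixing_time(lazy_cycle_P(n))
pi_min = 1 / n
# standard sandwich: t_rel <~ t_mix <~ t_rel * log(1/pi_min), with explicit constants
assert (t_rel - 1) * math.log(2) <= t_mix <= t_rel * math.log(1 / (0.25 * pi_min))
```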

We are now finally ready to formally introduce the FMMC problem.

Definition

(Fastest Mixing Markov Chain) Let \(G = (V, E)\) be a graph and let \(\pi \) be a probability measure on V. The optimal spectral gap is defined as

$$\begin{aligned} \gamma ^\star (G, \pi ) :=\max \bigl \{ \gamma _P \bigm | P\in \mathcal M(G, \pi ) \bigr \}. \end{aligned}$$

The optimal relaxation time is \(t^\star _{\textsf{rel}}(G, \pi ) :=1/\gamma ^\star (G, \pi )\). We write \(\gamma ^\star (G)\), omitting the \(\pi \), when \(\pi \) is uniform.

A transition matrix \(P\) is fast mixing if its relaxation time is polylogarithmic in \(| V |\), asymptotically. Analogously, a graph G admits a fast mixing chain if its optimal relaxation time is polylogarithmic in \(| V |\), asymptotically.

Previous work has been mainly focussed on finding useful formulations of the problem or on solving particular cases; see Sect. 1.4 for further details. The primary aim of our work, instead, is twofold:

  1. (i)

    To control the optimal spectral gap in terms of geometric barriers in the graph;

  2. (ii)

    To find ways to overcome these geometric barriers by slightly relaxing the FMMC problem.

Finally, we centre our attention on the case where \(\pi \) is the uniform distribution on V. This case was also the main focus of the original series of papers studying the FMMC problem.

1.2 Main results

This article includes multiple avenues of study, all on the theme of finding fast mixing Markov chains. We introduce these and the main theorems that we prove in the following subsections.

1.2.1 Characterisation of fast mixing on graphs

We are looking for some natural statistic of the graph G which characterises fast mixing: we desire necessary and sufficient conditions for the optimal relaxation time to be ‘small’, namely polylogarithmic in |V|.

How well-connected a graph is should, intuitively, influence how fast a chain on the graph can mix. Thus, we would like to understand what kind of connectivity measure best characterises fast mixing. A natural candidate is the edge conductance of a graph, which is defined as follows.

Definition A.1

(Edge Conductance) The edge conductance of a graph \(G = (V, E)\) is

$$\begin{aligned} \Phi (G) :=\min _{S \subseteq V :\, 0 < {{\,\textrm{vol}\,}}(S) \le {{\,\textrm{vol}\,}}(V)/2} \frac{| E(S,S^c) |}{{{\,\textrm{vol}\,}}(S)}, \end{aligned}$$

where \(E(S,S^c)\) is the edge boundary of \(S \subseteq V\) and \({{\,\textrm{vol}\,}}(S)\) is the volume of \(S \subseteq V\):

$$\begin{aligned} E(S,S^c) :=\bigl \{ \{ x,y \} \in E \mid x \in S, \, y \notin S \bigr \} \quad \text {and}\quad {{\,\textrm{vol}\,}}(S) :=\sum \nolimits _{x \in S} \deg (x). \end{aligned}$$
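For small graphs the edge conductance can be computed by brute-force enumeration of subsets; the following sketch (our own illustration, exponential in |V|) checks the 4-cycle, whose edge conductance is \(\tfrac{1}{2}\), attained by cutting two opposite edges.

```python
from itertools import combinations

def edge_conductance(V, E):
    """Brute-force edge conductance: min over S with 0 < vol(S) <= vol(V)/2
    of |E(S, S^c)| / vol(S).  Exponential in |V|; illustration only."""
    deg = {v: 0 for v in V}
    for x, y in E:
        deg[x] += 1
        deg[y] += 1
    vol_V = sum(deg.values())
    best = float('inf')
    for k in range(1, len(V)):
        for T in combinations(V, k):
            S = set(T)
            vol_S = sum(deg[v] for v in S)
            if 0 < vol_S <= vol_V / 2:
                cut = sum(1 for x, y in E if (x in S) != (y in S))
                best = min(best, cut / vol_S)
    return best

# 4-cycle: every balanced cut severs 2 of the 8 half-edges of one side
V = [0, 1, 2, 3]
E = [(0, 1), (1, 2), (2, 3), (3, 0)]
assert abs(edge_conductance(V, E) - 0.5) < 1e-12
```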

It is well known that the edge conductance characterises the spectral gap of the lazy random walk (abbreviated RW) \(P^\textsf {RW} \) on G via the discrete Cheeger inequality, discovered in [23, 30]:

$$\begin{aligned} \tfrac{1}{2} \Phi (G)^2 \le \gamma _{P^\textsf {RW}} \le 2 \, \Phi (G). \end{aligned}$$

The lazy RW on a graph, however, does not have uniform equilibrium distribution, unless the graph is regular. For this reason, it is natural to consider the uniform, or maximum degree, RW \(P^\textsf {U} \) which is defined by adding the appropriate number of self-loops to each vertex so that the graph becomes regular. A simple calculation with the Dirichlet characterisation gives

see [12, §7.2] for details. Applying this along with the discrete Cheeger inequality gives

Fast mixing for low-degree graphs is thus characterised by the edge conductance:

  1. (i)

    G admits a fast mixing chain if and only if \(P^\textsf {U} \) is fast mixing;

  2. (ii)

\(P^\textsf {U} \) is fast mixing if and only if \(1/\Phi (G)\) is polylogarithmic in \(| V |\).

Such a simple characterisation does not hold if \(d_\textsf{max}\) is large. This may be slightly counter-intuitive at first: adding edges can only increase the optimal spectral gap, but the lower bound above gets worse as \(d_\textsf{max}\) increases. A striking example is given by taking two cliques on n vertices and connecting them by a perfect matching; see Fig. 3 in §1.3. It is a regular graph with edge conductance of order 1/n but, as we will see later, it admits a fast mixing chain. Informally, this is because we can replace the two cliques with two bounded-degree expander graphs without overly damaging its connectivity properties. This shows that edge conductance is not the correct conductance measure for the FMMC problem.

This prompts us to consider an alternative notion of connectivity: the vertex conductance. It measures how well connected a set is by comparing the number of vertices in the boundary with its size. Contrastingly, edge conductance compares the number of edges in the boundary with the total number of edges inside the set.

Definition A.2

(Vertex Conductance) The vertex conductance of a finite graph \(G = (V, E)\) is

$$\begin{aligned} \Psi (G) :=\min _{S \subseteq V :\, 0 < | S | \le | V |/2} \frac{| \partial S |}{| S |}, \end{aligned}$$

where \(\partial S\) is the vertex boundary of \(S \subseteq V\):

$$\begin{aligned} \partial S:=\bigl \{ y \notin S \mid \exists \, x \in S {\text { s.t. }} \{ x,y \} \in E \bigr \}. \end{aligned}$$

The example above in which two equisized cliques are connected by a perfect matching has vertex conductance of order 1. This agrees with our claim that it admits a fast mixing chain.
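The two conductance measures can be compared by brute force on small instances of this matching graph; in the sketch below (labels ours; exponential-time enumeration, illustration only), the edge conductance is \(1/n\) while the vertex conductance is 1.

```python
from itertools import combinations

def edge_conductance(V, E):
    """min |E(S,S^c)|/vol(S) over S with 0 < vol(S) <= vol(V)/2 (brute force)."""
    deg = {v: 0 for v in V}
    for x, y in E:
        deg[x] += 1
        deg[y] += 1
    vol_V = sum(deg.values())
    best = float('inf')
    for k in range(1, len(V)):
        for T in combinations(V, k):
            S = set(T)
            vol_S = sum(deg[v] for v in S)
            if 0 < vol_S <= vol_V / 2:
                cut = sum(1 for x, y in E if (x in S) != (y in S))
                best = min(best, cut / vol_S)
    return best

def vertex_conductance(V, E):
    """min |boundary(S)|/|S| over S with 0 < |S| <= |V|/2 (brute force)."""
    nbrs = {v: set() for v in V}
    for x, y in E:
        nbrs[x].add(y)
        nbrs[y].add(x)
    best = float('inf')
    for k in range(1, len(V) // 2 + 1):
        for T in combinations(V, k):
            S = set(T)
            boundary = set().union(*(nbrs[v] for v in S)) - S
            best = min(best, len(boundary) / len(S))
    return best

# two cliques K_n joined by a perfect matching
n = 4
V = list(range(2 * n))
E = ([(i, j) for i in range(n) for j in range(i + 1, n)]
     + [(n + i, n + j) for i in range(n) for j in range(i + 1, n)]
     + [(i, n + i) for i in range(n)])
assert abs(edge_conductance(V, E) - 1 / n) < 1e-12    # vanishes as n grows
assert abs(vertex_conductance(V, E) - 1.0) < 1e-12    # stays of order 1
```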

Vertex conductance has been used to provide upper bounds on the time to spread a rumour in a graph and on the hitting times of RWs, by Giakkoupis [20] and Chandra et al. [13], respectively, amongst others. Roch [38, Proposition 2] showed that vertex conductance represents a fundamental barrier to fast mixing: \(\gamma ^\star (G) \lesssim \Psi (G)\). This can be seen directly via a simple calculation comparing the edge conductance of any reweighting of G, for which the RW on this weighted graph has uniform equilibrium distribution, with the vertex conductance.

The edge and vertex conductances are comparable for low-degree graphs:

$$\begin{aligned} \Psi (G) / d_\textsf{max} \lesssim \Phi (G) \lesssim d_\textsf{max} \, \Psi (G). \end{aligned}$$

Thus, the fact that the edge conductance characterises fast mixing for low-degree graphs means that the same holds for the vertex conductance:

We remove this \(d_\textsf{max}\) factor, at the cost of a \(\log | V |\) factor, thus showing that vertex conductance characterises the existence of a fast mixing chain for any graph. The graph of Fig. 3 in §1.3 shows this does not hold for the edge conductance.

Theorem A

(Characterisation of Fast Mixing) Let \(G = (V, E)\) be a finite graph. Then the optimal spectral gap \(\gamma ^\star (G)\) satisfies

$$\begin{aligned} \Psi (G)^2 / \log | V | \lesssim \gamma ^\star (G) \lesssim \Psi (G). \end{aligned}$$

Thus, vertex conductance characterises fast mixing for any graph.

The quadratic dependence on the vertex conductance in the lower bound is needed for graphs such as the cycle. This has optimal spectral gap of order \(1 / | V |^2\) and vertex conductance of order \(1 / | V |\); see [11]. We are not aware of a graph for which the \(\log | V |\) factor is needed, but we have reasons to believe that such a factor, or at least a factor \(\log d_\textsf{max}\), is necessary. We elaborate.

Louis, Raghavendra and Vempala [32] essentially showed that, under the so-called Small-Set Expansion Conjecture of Raghavendra and Steurer [37], for any \(\varepsilon > 0\), there is no polynomial-time algorithm that can distinguish between and for any graph \(G = (V,E)\). The optimal spectral gap can be computed in polynomial time, so removing the logarithmic factor in Theorem A altogether would violate the Small-Set Expansion Conjecture.

One of the most interesting aspects of the proof of Theorem A, given in Sect. 2, is that it does not directly relate the vertex conductance to the spectral gap. Rather, it relates a variational characterisation of the optimal spectral gap, due to Roch [38, Proposition 1], to a new connectivity measure for graphs which we introduce, termed the matching conductance. It is defined similarly to vertex conductance, but it replaces the size of the vertex boundary of a set S in the numerator with the size of a maximum matching between S and \(S^c\) in E. A formal definition is given in Definition 2.1. It can be viewed as a measure of fault tolerance of a graph: a graph has small matching conductance if and only if we can remove a few vertices of the graph and split the graph into two large, disconnected subsets.

It turns out the matching conductance of a graph is essentially equivalent to its vertex conductance: the two agree up to constant factors, uniformly over all graphs G; see Proposition 2.2. A specific set S of vertices, however, can have matching conductance much smaller than its vertex conductance. This fact makes using matching, rather than vertex, conductance essential in our proof.
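The quantity in the numerator is a maximum matching across a cut, which can be computed with standard augmenting-path methods. The sketch below (our own illustration, using Kuhn's algorithm; the formal definition is Definition 2.1) exhibits a set whose cut matching is far smaller than its vertex boundary: for \(S = \{ v_\star \} \) in the star graph, the vertex boundary has size n, yet at most one cut edge can be matched.

```python
def max_cut_matching(S, V, edges):
    """Size of a maximum matching between S and V \\ S using only cut edges
    (Kuhn's augmenting-path algorithm)."""
    S = set(S)
    cut = {}                                  # S-vertex -> neighbours across the cut
    for x, y in edges:
        if x in S and y not in S:
            cut.setdefault(x, []).append(y)
        elif y in S and x not in S:
            cut.setdefault(y, []).append(x)
    match = {}                                # S^c-vertex -> matched S-vertex

    def augment(u, seen):
        for v in cut.get(u, []):
            if v not in seen:
                seen.add(v)
                if v not in match or augment(match[v], seen):
                    match[v] = u
                    return True
        return False

    return sum(augment(u, set()) for u in cut)

# Star graph: centre 0, leaves 1..n.  For S = {centre}, the vertex boundary
# has size n, yet the maximum matching across the cut has size 1.
n = 6
V = list(range(n + 1))
E = [(0, i) for i in range(1, n + 1)]
S = {0}
boundary = {v for x, y in E for v in (x, y) if (x in S) != (y in S) and v not in S}
assert len(boundary) == n
assert max_cut_matching(S, V, E) == 1
```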

1.2.2 Almost mixing

We introduced the FMMC problem to formalise our desire to construct a fast mixing Markov chain. Theorem A, however, implies there are certain graphs, namely those with small vertex conductance, for which this desire cannot be attained. It is then natural to ask if we can slightly relax the constraints we imposed to overcome this fundamental obstacle.

We answer this question affirmatively: we show that if the Markov chain is not required to have equilibrium distribution exactly uniform, but only sufficiently close to uniform, then all graphs with small diameter admit a fast-mixing Markov chain. Before formalising this claim, we gain some intuition by considering the following simple example, known as the dumbbell graph.

  • Take two complete graphs \(H_\pm = (V_\pm , E_\pm )\), each on n vertices. Choose \(v_\pm \in V_\pm \), respectively.

  • Form \(D_\star = (V, E)\) by connecting both \(v_\pm \) to a single ‘external’ vertex \(v_\star \notin V_+ \cup V_-\):

    $$\begin{aligned} V :=V_+ \cup V_- \cup \{ v_\star \} \hbox { and }E :=E_+ \cup E_- \cup \{ \{ v_+, v_\star \} , \{ v_-, v_\star \} \} . \end{aligned}$$

Since \(D_\star \) has vertex conductance equal to 1/n, Theorem A implies that no chain with uniform equilibrium distribution can have relaxation time of smaller order than n.

In light of the above, we propose the following RW, described by a weighting on the edges of \(D_\star \):

  • give all edges which do not include any of \( \{ v_+, v_-, v_\star \} \) unit weight;

  • give the remaining edges weight \(\varepsilon n\).

The RW takes steps with distribution proportional to the edge weights. It is straightforward to check that the equilibrium distribution induced is at most \(\varepsilon \) far from uniform in TV.

The fundamental barrier to fast mixing in \(D_\star \) is that any chain with uniform equilibrium gets stuck in one side of the graph for a time at least order n in expectation. Up-weighting the edges through the bottleneck means that the new RW transitions between the two sides with expected time order \(1/\varepsilon \). This leads to a relaxation time order \(1/\varepsilon \). This all comes at a cost of having invariant distribution \(\varepsilon \) far from uniform. Further details are given in Sect. 1.3.
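This construction is simple to verify numerically. The sketch below (labels ours; bells taken to be cliques \(K_n\), with \(v_+ = 0\), \(v_- = n\) and \(v_\star = 2n\)) builds the weighting described above and checks that the induced equilibrium distribution is within \(\varepsilon \) of uniform in TV.

```python
n, eps = 10, 0.1
A = list(range(n)); B = list(range(n, 2 * n)); star = 2 * n   # v+ = 0, v- = n (labels ours)
V = A + B + [star]
special = {0, n, star}
w = {}
for C in (A, B):                           # clique edges
    for i in range(n):
        for j in range(i + 1, n):
            x, y = C[i], C[j]
            # unit weight unless the edge touches v+, v- or v*
            w[(x, y)] = eps * n if (x in special or y in special) else 1.0
w[(0, star)] = w[(n, star)] = eps * n      # the two bottleneck edges

deg = {v: 0.0 for v in V}                  # weighted degrees
for (x, y), q in w.items():
    deg[x] += q
    deg[y] += q
Z = sum(deg.values())
pi = {v: deg[v] / Z for v in V}            # equilibrium of the weighted RW

tv = 0.5 * sum(abs(pi[v] - 1 / len(V)) for v in V)
assert 0 < tv <= eps                       # within eps of uniform in TV
```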

We are able to generalise this construction to general graphs and general equilibrium distributions. The fast-mixing Markov chain we design is a RW on a carefully weighted breadth-first search (BFS) spanning tree, supplemented with self-loops. We establish an upper bound of \(12 ({{\,\textrm{diam}\,}}G)^2 / \varepsilon \) on the relaxation time when we allow the RW to have equilibrium distribution \(\varepsilon \)-far from uniform.

We now define precisely our set-up and formally state our result.

Definition B

(Almost-Mixed Distributions) Let V be a finite set. Let \(\mathcal D(V)\) denote the set of positive probability distributions on V. For \(\varepsilon \in [0, 1]\) and \(\pi \in \mathcal D(V)\), define

$$\begin{aligned} \mathcal D(\pi , \varepsilon ) :=\bigl \{ \pi ' \in \mathcal D(V) \bigm | \pi '(x) \ge (1 - \varepsilon ) \, \pi (x) {\text { for all }} x \in V \bigr \}. \end{aligned}$$

In particular, if \(\pi ' \in \mathcal D(\pi , \varepsilon )\), then \(\Vert \pi ' - \pi \Vert _\textsf{TV} \le \varepsilon \).

We actually establish a stronger result than the one described above. The above description says that there exists some reversible chain which is fast mixing: there exist \(\pi ' \in \mathcal D(\pi , \varepsilon )\) and \(P\in \mathcal M(G, \pi ')\) such that the relaxation time of \(P\) is at most \(12 ({{\,\textrm{diam}\,}}G)^2 / \varepsilon \). We prove that any reversible chain can be perturbed into a fast mixing chain: for all \(\pi \in \mathcal D(V)\) and all \(P \in \mathcal M(G, \pi )\), there exist \(\pi ' \in \mathcal D(\pi , \varepsilon )\) and \(Q\in \mathcal M(G, \pi ')\) such that the same relaxation-time bound holds for \(Q\) and \(Q(e) \ge (1 - \varepsilon ) P(e)\) for all \(e \in E\).

Theorem B

(Almost Mixing) Let \(G = (V, E)\) be a finite, connected graph and \(\pi \in \mathcal D(V)\). Let \(\varepsilon \in (0, 1)\) and \(P\in \mathcal M(G, \pi )\). There exist \(\pi ' \in \mathcal D(\pi , \varepsilon )\) and \(Q\in \mathcal M(G, \pi ')\) such that

$$\begin{aligned} \gamma _{Q} \ge \varepsilon \big / \bigl ( 12 \, ({{\,\textrm{diam}\,}}G)^2 \bigr ) \quad \text {and}\quad Q(e) \ge (1 - \varepsilon ) \, P(e) \quad \text {for all}\quad e \in E. \end{aligned}$$

A consequence of this spectral gap estimate is that

The matrix \(Q\) is obtained as a perturbation of \(P\). Moreover, this perturbation is actually independent of \(P\): we construct a weighted BFS tree and mix it with the weights from \(P\). A more refined statement, making this independence of the perturbation explicit, is given in Theorem 3.1.

Many algorithms on graphs work well when applied to graphs with very good connectivity properties, such as expanders. A way to utilise these algorithms for non-expanders is to perform an expander decomposition: the graph is partitioned into disjoint expanders and few edges connecting the expanders [42]. An alternative approach, recently proposed by Bernstein, Gutenberg and Saranurak [7], re-weights the vertices of the graph in order to obtain a type of vertex expander.

Our approach is in a very similar vein: we re-weight the edges of the graph to obtain a very good edge expander, whilst increasing the total weight as little as possible. We have not checked all the details, and their definition of vertex expansion is slightly unusual, but it seems highly plausible that our algorithm, or a minor adjustment of it, could be applied in their set-up.

This diameter bound is a substantial improvement over the vertex conductance lower bound on the optimal spectral gap from Theorem A. It comes at the cost of having invariant distribution only ‘almost’ uniform—hence the name “almost mixing”. We show in the next section that passing to the continuous-time setting allows this diameter-squared bound to be maintained while having exactly uniform invariant distribution. We use fundamentally the same chain: it is a RW on the same weighted BFS tree, where the weights now represent the rate at which an edge is crossed.

1.2.3 Continuous-time Markov chains

The discussion and results above all concern discrete-time Markov chains. It is also natural to study the question of the fastest mixing Markov chain in continuous-time. We restrict to the case where the target distribution \(\pi \) is the uniform distribution on V. The question of the FMMC in continuous-time was originally raised by Sun, Boyd, Xiao and Diaconis [43] and has been studied subsequently by Sammer [39] and Montenegro and Tetali [34]. We review their work in Sect. 1.4.

A continuous-time Markov chain on a graph \(G = (V, E)\) with uniform equilibrium distribution can be represented by the RW on a weighted graph \((G, q)\), where \(q: E \rightarrow \mathbb {R}_+\), as follows.

Definition C.1

(RW on Weighted Graph) Let \(G = (V, E)\) be a graph and \(q: E \rightarrow \mathbb {R}_+\) a collection of non-negative weights. The RW on \((G, q)\) jumps from x to y at rate \(q( \{ x,y \} )\) for \(x,y \in V\) with \( \{ x,y \} \in E\). The Laplacian \(L^q\in \mathbb {R}^{V \times V}\) of the weighting \(q\) is defined by

$$\begin{aligned} L^q_{x,y} :=\varvec{1}\{ \{ x,y \} \in E \} \cdot q ( \{ x,y \} ) \! -\! \varvec{1}\{ x \!=\! y \} \sum \nolimits _{z \in V : \{ x,z \} \in E} q ( \{ x,z \} ) \quad \text {for}\quad x,y \in V. \end{aligned}$$

The spectral gap of the RW on \((G, q)\) is given by the second smallest eigenvalue of \(-L^q\).
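The Laplacian and its spectral gap can be computed directly. The sketch below (a pure-Python illustration of ours, with power iteration standing in for a proper eigensolver) builds \(L^q\) from a weighting and recovers the second smallest eigenvalue of \(-L^q\) for the three-vertex path, whose nonzero eigenvalues are 1 and 3.

```python
def laplacian(V, weights):
    """The Laplacian L^q of a weighting q, as in the definition above."""
    n = len(V)
    L = [[0.0] * n for _ in range(n)]
    for (x, y), q in weights.items():
        L[x][y] += q; L[y][x] += q
        L[x][x] -= q; L[y][y] -= q
    return L

def spectral_gap(L, iters=300):
    """Second smallest eigenvalue of -L, by power iteration on the
    mean-zero subspace (sketch; fine for tiny graphs only)."""
    n = len(L)
    c = 2 * max(-L[i][i] for i in range(n)) + 1.0   # shift: eigenvalues of I + L/c lie in (0, 1)
    x = [float(i) for i in range(n)]
    for _ in range(iters):
        m = sum(x) / n
        x = [xi - m for xi in x]                    # project out the constant eigenvector
        x = [x[i] + sum(L[i][j] * x[j] for j in range(n)) / c for i in range(n)]
        norm = sum(xi * xi for xi in x) ** 0.5
        x = [xi / norm for xi in x]
    Lx = [sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
    return -sum(x[i] * Lx[i] for i in range(n))     # Rayleigh quotient of -L

L = laplacian([0, 1, 2], {(0, 1): 1.0, (1, 2): 1.0})   # path on three vertices, unit rates
assert abs(spectral_gap(L) - 1.0) < 1e-6               # -L has eigenvalues 0, 1, 3
```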

The spectral gap is intrinsically related to the mixing time, as in discrete-time:

see [1, Lemma 4.23]. Again, fast mixing means relaxation time polylogarithmic in \(| V |\). The above relations thus imply that a polylogarithmic relaxation time characterises fast mixing.

It is immediate to see that if all the rates are multiplied by some factor \(c > 0\), then the spectral gap is multiplied by \(c\) too. We must therefore impose some normalisation.

Definition C.2

(Normalisation) The rate at which the walk leaves the vertex x is given by

$$\begin{aligned} q(x) :=\sum \nolimits _{y \in V} \varvec{1}\{ \{ x,y \} \in E \} q ( \{ x,y \} ) \quad \text {for}\quad x \in V. \end{aligned}$$

We call \(\max _{x \in V} q(x)\) the maximal leave-rate and \(| V |^{-1} \sum \nolimits _{x \in V} q(x)\) the average leave-rate.

A natural normalisation is to require a maximal leave-rate of 1. It can be seen, however, that this reduces to the discrete-time case via exponential-1 waiting times. We impose instead an average leave-rate of 1, or, equivalently, \(q(E) \le \tfrac{1}{2} | V |\). This allows a few vertices to have abnormally large leave-rate, but rarely enough that the average is not significantly affected. This will allow the RW to exit small ‘bottlenecks’ quickly, where the discrete-time walk would remain stuck for significant time. This average leave-rate normalisation was considered in [34, 39, 43]. Montenegro and Tetali [34, §7.1] describe this normalisation as “rather powerful [compared with discrete-time]” due to the fact that the maximal leave-rate may be very large.
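To make the normalisation concrete, the following sketch (our own toy weighting) computes maximal and average leave-rates and rescales to average leave-rate exactly 1; note that the maximal leave-rate can remain above 1 afterwards, which is precisely the extra power discussed above.

```python
def leave_rates(V, weights):
    """q(x): total rate at which the walk leaves x."""
    q = {x: 0.0 for x in V}
    for (x, y), wt in weights.items():
        q[x] += wt
        q[y] += wt
    return q

def normalise_average(V, weights):
    """Rescale the rates so that the average leave-rate is exactly 1."""
    q = leave_rates(V, weights)
    avg = sum(q.values()) / len(V)
    return {e: wt / avg for e, wt in weights.items()}

V = [0, 1, 2, 3]
weights = {(0, 1): 1.0, (1, 2): 5.0, (2, 3): 1.0}   # one heavily up-weighted edge
q = leave_rates(V, weights)
assert max(q.values()) == 6.0                        # maximal leave-rate
assert sum(q.values()) / len(V) == 3.5               # average leave-rate
q1 = leave_rates(V, normalise_average(V, weights))
assert abs(sum(q1.values()) / len(V) - 1.0) < 1e-12  # average is now 1 ...
assert max(q1.values()) > 1.0                        # ... yet some vertex leaves at rate > 1
```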

The main result of this section states that, for any graph, it is possible to construct a weighting with average leave-rate of 1 such that its spectral gap depends only on the diameter of the graph.

Theorem C

(Continuous-Time) Let \(G = (V, E)\) be a graph. There exists a weighting \(w: E \rightarrow \mathbb {R}_+\) with average leave-rate at most 1 such that the spectral gap of the RW on \((G, w)\) is at least of order \(({{\,\textrm{diam}\,}}G)^{-2}\).

An upper bound on the spectral gap of order \(({{\,\textrm{diam}\,}}G)^{-2}\) is required for graphs with diffusive behaviour, such as the cycle or the path. A lower bound on the relaxation time of order \({{\,\textrm{diam}\,}}G\), however, is not necessary, in general. This is in stark contrast to the discrete-time case. Indeed, in continuous-time, a few edges can be up-weighted significantly with little effect on the average. So if the ‘typical’ distance is much less than the maximal, a relaxation time of smaller order than the diameter may well be achievable.

A closely related theorem is given in Sammer’s PhD thesis [39, §3.3]; see also [34, §7.1]. They bound the optimal spectral gap in terms of the spread constant c(G), introduced in [3], which is the maximal variance of a function that is Lipschitz on the edges of G:

The spread constant c(G) can be upper bounded by \(\tfrac{1}{4} ({{\,\textrm{diam}\,}}G)^2\), but there are examples for which this is far from tight. Still, if a very general, easy-to-calculate bound is desired, then we do not know of a better bound than \(c(G) \lesssim ({{\,\textrm{diam}\,}}G)^2\), which approximately recovers our bound. The spread constant c(G) can also be lower bounded by a type of ‘typical’ distance; see [34, Corollary 7.2].

In contrast with our result, however, theirs is non-constructive: it relies on the famous Johnson–Lindenstrauss lemma [28]. Montenegro and Tetali [34, Remark 7.3] comment on the difficulty of explicitly constructing such a process: “It might be challenging and in general impractical a task to actually find such a process explicitly.” Our construction is explicit and can be computed in time linear in the size of the graph.

Montenegro and Tetali [34, Remark 7.3] also comment on the existence of such a fast mixing Markov chain in continuous-time: “The key [to the existence of such a chain]... might be that we were implicitly providing the continuous-time chain with more power... by not requiring the rates in each row to sum to 1, but only the [average rate to be 1].” This significant additional power allows bottlenecks to be traversed quickly while maintaining an average leave-rate of 1. Indeed, the weighting \(w\) that we construct has \(\max _{x \in V} w(x) \asymp n / {{\,\textrm{diam}\,}}G\), which may be far larger than 1.

This really emphasises the strength of our ‘almost mixing’ result, Theorem B: the chain there is in discrete-time —or, equivalently, has \(\max _{x \in V} q(x) \le 1\)— but still attains a spectral gap only order \(\varepsilon \) smaller than that attained in the continuous-time case of Theorem C. Of course, the cost is that the equilibrium distribution \(\pi '\) only satisfies \(\min _{x \in V} \pi '(x) / \pi (x) \ge 1 - \varepsilon \), not \(\pi ' = \pi \).

We expect that our continuous-time analysis can be adjusted to handle general equilibrium distributions \(\pi \) with relatively few changes. We have not checked the details, however. We focussed on the uniform case because it is, arguably, the most important and the cleanest to present.

1.2.4 Time-inhomogeneous Markov chains

Our attention has been so far restricted to time-homogeneous Markov chains, in which the transition probabilities do not change over time and are described by a single transition matrix P. A time-inhomogeneous Markov chain, instead, is described by a sequence \((P_t)_{t \in \mathbb {N}}\) of transition matrices and an initial law \(\mu _0\): the time-t law \( \mu _t :=\mathbb {P}\bigl (X_t \in \cdot \bigr ) \) is given by \( \mu _t = \mu _0 P_1 P_2 \cdots P_t \quad \text {for}\quad t \in \mathbb {N}. \) A time-homogeneous chain has \(P_t = P\) for all \(t \in \mathbb {N}\), for some P. We close our presentation of results by showing that time-inhomogeneous chains can lead to significant improvements over time-homogeneous ones.

Theorem D

Let \(G = (V, E)\) be a connected graph and let \(\pi \in \mathcal D(V)\). There exists a time-inhomogeneous Markov chain on G that perfectly mixes to \(\pi \) after \(2 {{\,\textrm{diam}\,}}G\) steps: \( \mu _{2 {{\,\textrm{diam}\,}}G} = \pi . \)

It is easy to see that \({{\,\textrm{diam}\,}}G\) is a lower bound on the fastest ‘perfectly mixing’ chain. If one only requires the TV distance to equilibrium to be small, then \(\tfrac{1}{2} {{\,\textrm{diam}\,}}G\) is a lower bound. Thus the bound of \(2 {{\,\textrm{diam}\,}}G\) above is tight up to a factor of at most 4.
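The general construction behind Theorem D is given later; the following toy instance (ours, on the star graph, not the paper's general scheme) illustrates the mechanism: a first matrix funnels all mass to the centre and a second redistributes it exactly as \(\pi \), so the chain mixes perfectly in 2 steps, well within \(2 {{\,\textrm{diam}\,}}G = 4\).

```python
from fractions import Fraction

n = 5                      # leaves 1..n, centre 0; diam = 2, so 2*diam = 4 steps allowed
V = list(range(n + 1))
pi = {v: Fraction(1, n + 1) for v in V}     # target: uniform

# Step 1: everything moves to the centre (uses only star edges / self-loops).
P1 = {u: {0: Fraction(1)} for u in V}
# Step 2: the centre redistributes exactly according to pi; leaves just hold
# (they carry no mass at this point anyway).
P2 = {0: dict(pi)}
for u in V[1:]:
    P2[u] = {u: Fraction(1)}

def step(mu, P):
    out = {v: Fraction(0) for v in V}
    for u, row in P.items():
        for v, p in row.items():
            out[v] += mu[u] * p
    return out

mu0 = {v: Fraction(0) for v in V}
mu0[3] = Fraction(1)                         # arbitrary starting vertex
mu2 = step(step(mu0, P1), P2)
assert mu2 == pi                             # perfect mixing after 2 <= 2*diam steps
```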

1.3 Notable examples

We discuss briefly a few examples which are of particular interest. We always consider the uniform distribution, i.e. uniform \(\pi \), unless specified to the contrary.

1.3.1 Dumbbell graph

Let \(D_\star \) be the dumbbell graph with bells \(H_\pm \) of size n. The bells \(H_\pm \) need not be cliques \(K_n\); they can be arbitrary connected graphs on n vertices. See Fig. 1 for an illustration when \(H_\pm = K_n\).

Fig. 1

Dumbbell graph \(D_\star \) with \(n = 7\): two cliques connected to a single external vertex

Fig. 2

Star graph \(G_\star \) with \(n = 7\): a central vertex connected to leaves

Fig. 3

Matching graph \(\mathcal M\) with \(n = 7\): two cliques connected via a matching

Fig. 4

Source graph \(\Sigma \) with \(n = 7\) and \(k = 3\): two cliques connected via a ‘source’

Conductance Measures. It is straightforward to see that the set with the worst vertex conductance is given by one side of the dumbbell graph: \(S = H_-\) or \(S = H_+\). This shows that

$$\begin{aligned} \Psi (D_\star ) = 1/n. \end{aligned}$$

This implies that the optimal relaxation time is at least of order n.

It is easy to find a chain attaining the correct order of \(n^{-1}\) when \(H_\pm = K_n\), i.e. each bell is a complete graph on n vertices. Define a weighting as follows. Each vertex gets the same total weight.

  • Place unit weights on all edges which do not include the centre \(v_\star \).

  • Give the edges \( \{ v_-, v_\star \} \) and \( \{ v_+, v_\star \} \) weight \(n-1\).

  • Add self-loops of weight \(2(n-1)\) to each of the vertices in \(V \setminus \{ v_-, v_+, v_\star \} \).

The probability of stepping to \(v_\star \) from either of \(v_\pm \) is \(\tfrac{1}{2}\). This gives an order-n hitting time of one clique from the other. This implies that our chain has relaxation time order n. Contrast this with the suboptimal order \(n^2\) hitting and relaxation time for the uniform RW.
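This weighting is easy to verify numerically. In the sketch below (labels ours), the self-loop weights are chosen exactly so that every vertex receives the same total weight, which makes the invariant distribution uniform; the step probability from \(v_\pm \) to \(v_\star \) then comes out to \(\tfrac{1}{2}\), as claimed.

```python
n = 6
A = list(range(n)); B = list(range(n, 2 * n)); star = 2 * n   # v+ = 0, v- = n (labels ours)
V = A + B + [star]
w = {}
for C in (A, B):                                # unit weights inside each clique
    for i in range(n):
        for j in range(i + 1, n):
            w[(C[i], C[j])] = 1.0
w[(0, star)] = w[(n, star)] = float(n - 1)      # up-weighted bottleneck edges

total = {v: 0.0 for v in V}                     # total edge weight at each vertex
for (x, y), q in w.items():
    total[x] += q
    total[y] += q
# self-loops chosen to equalise the totals, so the invariant law is uniform
target = max(total.values())
loop = {v: target - total[v] for v in V}
grand = {v: total[v] + loop[v] for v in V}
pi = {v: grand[v] / sum(grand.values()) for v in V}

assert all(abs(p - 1 / len(V)) < 1e-12 for p in pi.values())   # uniform
assert abs(w[(0, star)] / grand[0] - 0.5) < 1e-12   # v+ steps to the centre w.p. 1/2
```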

Almost Mixing. If the graphs \(H_\pm \) have polylogarithmic diameter then Theorem B provides a chain with polylogarithmic relaxation time. This is a substantial improvement from linear. This is true regardless of the particular structure of \(H_\pm \): it just needs \(\log {{\,\textrm{diam}\,}}H_\pm \lesssim \log \log n\). If \({{\,\textrm{diam}\,}}H_\pm \asymp 1\), then we obtain a relaxation time of order 1: the chain is an expander.

We now explain very roughly how to construct this chain for the dumbbell example \(D_\star \). The general idea is to up-weight edges towards the central vertex \(v_\star \), which is the bottleneck. We do this in such a way as to make the distance to \(v_\star \) behave somewhat like an unbiased RW on \(\mathbb {Z}\). This way it should take time of order \(({{\,\textrm{diam}\,}}H_\pm )^2\) to move from \(H_\pm \) to \(H_\mp \).

It is natural to try to achieve this bias by rooting a spanning tree T at \(v_\star \) and then up-weighting the edges towards the root. This leads to a worst-case hitting time for the root \(v_\star \) of order \(({{\,\textrm{diam}\,}}T)^2\). We choose T to be a breadth-first search (BFS) tree since this has \({{\,\textrm{diam}\,}}T \le 2 {{\,\textrm{diam}\,}}G\).

We give a more detailed overview in Sect. 3.2. We specifically chose the bottleneck vertex \(v_\star \) to be the root of T above. It turns out that actually any choice of root suffices. The reader may find this surprising at first; we did. More generally, suppose that \(o \in V\) is any vertex and a BFS is rooted at o; we up-weight the edges towards o. Paths from \(v \ne o\) to o naturally go through bottlenecks. This automatically up-weights edges in bottlenecks.
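The construction just sketched can be implemented in a few lines. The following is our own minimal version (a BFS tree from an arbitrary root, with the edge from x to its parent weighted by the size of the subtree below x, as in the binary-tree example of §1.3.2), checked on a path rooted at an endpoint.

```python
from collections import deque

def bfs_tree(adj, root):
    """Parent pointers of a breadth-first search tree of the graph."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

def subtree_weights(parent):
    """Weight of the edge from x to its parent = #vertices in x's subtree."""
    depth = {}
    def d(v):
        if v not in depth:
            depth[v] = 0 if parent[v] is None else d(parent[v]) + 1
        return depth[v]
    size = {v: 1 for v in parent}
    for v in sorted(parent, key=d, reverse=True):   # deepest vertices first
        if parent[v] is not None:
            size[parent[v]] += size[v]
    return {(v, parent[v]): size[v] for v in parent if parent[v] is not None}

# path 0 - 1 - 2 - 3 rooted at 0: the BFS tree is the path itself,
# and the parent-edge weights are the subtree sizes 3, 2, 1
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
w = subtree_weights(bfs_tree(adj, 0))
assert w == {(1, 0): 3, (2, 1): 2, (3, 2): 1}
```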

1.3.2 Binary tree

Let \(\mathbb T= (V, E)\) be the complete binary tree on \(n = 2^N - 1\) vertices with depth \(N \approx \log _2 n\).

Conductance Measures. It is straightforward to see that the set with the worst vertex conductance is given by one side of the tree: the root, a child and all its descendants. This gives

$$\begin{aligned} \Psi (\mathbb T) \asymp 1/n. \end{aligned}$$

This implies that the optimal relaxation time is at least of order n.

\(\mathbb T\) has bounded degree, so the maximum degree chain attains a relaxation time of the correct order. This chain is just the simple RW, but with extra laziness at the leaves to make the invariant distribution uniform. The correct relaxation time is order n; see [41] for details.

Almost Mixing. It is very natural to root the BFS tree at the root o of \(\mathbb T\). The up-weighting will help pull the walk up the tree towards the root, allowing it to spread across the width of the tree more easily. The weight given to the edge from \(x \ne o\) to its parent is the number of vertices in the subtree rooted at x. Precisely, if x is at distance \(d \ge 1\) from the root, then the weight is \(2^{N-d}-1\). The up-weighting means that the distance from the root behaves roughly as an unbiased RW. The hitting time of the root is then order \(N^2 \asymp (\log n)^2\). Once the RW hits the root, the branch it takes next is uniformly distributed. Thus, once it hits the leaves again, it is uniform over the leaves and so approximately mixed. The total time for this is order \(N^2 \asymp (\log n)^2\).
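The weight formula can be checked mechanically. The sketch below (heap-indexed labelling, ours) accumulates subtree sizes bottom-up and verifies that an edge at depth d receives weight \(2^{N-d}-1\).

```python
N = 4                                    # depth; n = 2^N - 1 = 15 vertices
n = 2 ** N - 1
# heap indexing: root is 1, children of v are 2v and 2v+1
parent = {v: v // 2 for v in range(2, n + 1)}
size = {v: 1 for v in range(1, n + 1)}
for v in range(n, 1, -1):                # accumulate subtree sizes bottom-up
    size[parent[v]] += size[v]
for v in range(2, n + 1):
    d = v.bit_length() - 1               # distance of v from the root
    assert size[v] == 2 ** (N - d) - 1   # matches the weight formula in the text
```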

Our method does not know, however, that there is anything special about the root of the binary tree. Any vertex can be picked as the root of the BFS tree. Viewed from this vertex, \(\mathbb T\) is like a complete binary tree of some depth, but with some of the branches pruned. The depth is at most 2N. The same ideas give relaxation and mixing time order \(N^2 \asymp (\log n)^2\).

1.3.3 Star graph

Let \(G_\star = (V, E)\) be the star graph with centre \(v_\star \) and n leaves. See Fig. 2 for an illustration.

Conductance Measures. There is a simple dichotomy for the vertex boundary of a set \(S \ne \emptyset \) in the star graph \(G_\star \): if \(v_\star \in S\), then \(\partial S = S^c\); if \(v_\star \notin S\), then \(\partial S = \{ v_\star \} \). Thus any \(S \subseteq V\) with \(| S | = \lfloor \tfrac{1}{2} n \rfloor \) and \(v_\star \notin S\) satisfies . It is straightforward to see that this gives the correct order: place unit weights on all the edges and weight-\((n-1)\) self-loops on all the non-central vertices; it is easy to see that this chain has mixing time order n.
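The weighting just described can be checked concretely. The sketch below (ours) builds the transition matrix of the induced chain on a small star, with unit weight on each edge and a weight-\((n-1)\) self-loop at each leaf, and verifies that the uniform distribution is invariant.

```python
n = 6                     # number of leaves (small, for illustration)
V = n + 1                 # vertex 0 is the centre
P = [[0.0] * V for _ in range(V)]
for leaf in range(1, V):
    P[0][leaf] = 1.0 / n            # centre -> leaf (unit edge weight)
    P[leaf][0] = 1.0 / n            # leaf -> centre (weight 1 out of n)
    P[leaf][leaf] = (n - 1.0) / n   # self-loop of weight n - 1

pi = [1.0 / V] * V                  # uniform distribution on the vertices
piP = [sum(pi[u] * P[u][v] for u in range(V)) for v in range(V)]
assert all(abs(a - b) < 1e-12 for a, b in zip(pi, piP))   # pi P = pi
```

Every vertex has total weight n under this weighting, which is why the invariant distribution is uniform.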

Another measure of vertex conductance replaces the \(\partial S\) with the symmetric union \(\partial _\textrm{sym} S :=\partial S \cup \partial S^c\); denote the vertex conductance with \(\partial _\textrm{sym}\) by \(\Psi _\textrm{sym}\). Again, there is a simple dichotomy for \(S \ne \emptyset \): if \(v_\star \in S\), then \(\partial _\textrm{sym} S = S^c \cup \{ v_\star \} \); if \(v_\star \notin S\), then \(\partial _\textrm{sym} S = S \cup \{ v_\star \} \). Thus \(\Psi _\textrm{sym}^\star \asymp 1\).
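The boundary dichotomy can be verified mechanically; the following sketch (ours) computes both boundaries for a half-set of leaves on a small star.

```python
n = 10
edges = [(0, leaf) for leaf in range(1, n + 1)]   # centre 0, leaves 1..n

def vboundary(A):
    """Vertices outside A with at least one neighbour in A."""
    return ({v for (u, v) in edges if u in A and v not in A}
            | {u for (u, v) in edges if v in A and u not in A})

S = set(range(1, n // 2 + 1))          # half the leaves; centre not in S
dS = vboundary(S)
dSc = vboundary(set(range(n + 1)) - S)
assert dS == {0}                       # one-sided boundary: just the centre
assert dS | dSc == S | {0}             # symmetric boundary: S plus the centre
```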

The two measures differ markedly when the boundary of S is small but that of \(S^c\) is large, as happens for the star. The use of \(\partial S\), as opposed to \(\partial _\textrm{sym} S\), is thus important in Theorem A.

It is well-known that the spectral gap is characterised by a variational form. The relationship between spectral gap and the edge conductance of the simple RW is given by the well-known Cheeger inequalities. A related variational form was introduced by Bobkov, Houdré and Tetali [8], which they denote \(\lambda _\infty \). They establish various Cheeger-type relationships between \(\lambda _\infty \) and the vertex conductance \(\Psi _\textrm{sym}^\star \). In particular, they show, for any graph, that

$$\begin{aligned} (\Psi _\textrm{sym}^\star )^2 \lesssim \lambda _\infty \lesssim \Psi _\textrm{sym}^\star ; \end{aligned}$$

see [8, Theorems 1 and 2]. It is immediate from the definitions that

In light of [8] and our Theorem A, it is natural to wonder whether \(\lambda _\infty \) can be directly related to the optimal spectral gap , without a \(d_\textsf{max}\) factor. The example of the star graph \(G_\star \) shows that this is not possible: \(\Psi _\textrm{sym}^\star \asymp 1\) and thus \(\lambda _\infty \asymp 1\), but and so . This shows that \(\lambda _\infty \) is really not the correct parameter for the FMMC problem.

Almost Mixing. Obtaining an ‘almost mixing’ chain with order \(1/\varepsilon \) mixing time is simple: place weight \(\varepsilon \) on all the edges and weight-1 self-loops on all non-central vertices. The total weight of the central vertex is \(\varepsilon (n-1)\). The remainder of the weight is spread uniformly. Thus the distribution \(\pi \) induced by this weighting is in \(\mathcal D(G_\star , \varepsilon )\). It is easy to see that the mixing time is order \(1/\varepsilon \).
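A quick numerical sketch (ours, with illustrative parameters; here we write n for the number of leaves and give each of the n edges weight \(\varepsilon \)) of the induced distribution \(\pi \) and its TV distance from uniform:

```python
eps, n = 0.1, 100                 # illustrative parameters
w_centre = eps * n                # centre meets n edges of weight eps
w_leaf = eps + 1.0                # one edge plus the weight-1 self-loop
total = w_centre + n * w_leaf
pi = [w_centre / total] + [w_leaf / total] * n   # pi proportional to weights

uniform = 1.0 / (n + 1)
tv = 0.5 * sum(abs(p - uniform) for p in pi)     # total-variation distance
assert tv < eps                   # pi is eps-close to uniform in TV
```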

1.3.4 Complete graphs connected via a matching

Let \(\mathcal M= (V, E)\) be two n-cliques \(H_\pm = (V_\pm , E_\pm )\), connected via a matching: enumerate \(V_\pm \) as \( \{ v_\pm ^1,..., v_\pm ^n \} \) and connect \(v_+^i\) with \(v_-^i\) for all \(i \in \{ 1,..., n \} \). See Fig. 3 for an illustration.

Conductance Measures. If \(v_+^i \in S\) but \(v_-^i \notin S\), then \(v_-^i \in \partial S\). From this, follows easily. It is not too hard to show that the optimal spectral gap is of the same order: just replace the two cliques by two 3-regular expanders and leave the perfect matching in place. The lazy simple RW on this edge-induced subgraph of \(\mathcal M\) has order-1 spectral gap.

Almost Mixing. The optimal spectral gap is order 1, so there is no need for ‘almost mixing’.

1.3.5 Complete graphs connected via a ‘source’

Let \(\Sigma = (V, E)\) be two n-cliques \(H_\pm \) as ‘bells’, connected via a ‘source’: choose \(v_0 \in H_-\) and \(v_1,..., v_k \in H_+\); connect \(v_0\) with each of \( \{ v_i \} _{i=1}^k\). See Fig. 4 for an illustration.

Conductance Measures. One may think at first that this ‘source’ of k edges, rather than just a single edge, gives rise to faster mixing; indeed, . However, removing the source from the set gives . So in fact the vertex conductance of the source graph \(\Sigma \) is almost the same as that of the dumbbell graph \(D_\star \).

The edge conductance of the graph does improve with k: . But this is always at most 1/n. So the improvement from k is not enough to outweigh the fact that the uniform RW has spectral gap far from the optimal—unless \(k \asymp n\).

An optimal spectral gap can be achieved by choosing an arbitrary 3-regular expander as a subgraph of each of the cliques and connecting these via a single edge. The uniform RW on this sparse subgraph then has spectral gap order 1/n.

Almost Mixing. We can use exactly the same construction as in the dumbbell graph \(D_\star \), picking an arbitrary edge amongst the k connecting edges.

1.4 Review of previous work

We now review previous related work. The FMMC question was originally introduced by Boyd, Diaconis and Xiao [12], which was the first in a series of articles [9,10,11,12, 43] by those authors along with Parrilo and Sun. It has subsequently been studied by [2, 14, 19, 21, 34, 38, 39]. We roughly collect these by theme.

1.4.1 Finding useful formulations

Boyd, Diaconis and Xiao [12]. This original work introduces the FMMC question and then primarily studies equivalent formulations. In our view, the most important contribution of that paper, beyond the introduction of the very interesting FMMC question, is their formulation of the FMMC optimisation problem as a semi-definite program (SDP). This allows the computation of an optimal solution in polynomial time via standard convex optimisation techniques. The SDP leads naturally to a dual formulation, which found use in subsequent work [34, 38, 39].

Roch [38]. Roch takes the dual formulation of [11] much further, writing the optimal spectral gap as a minimisation of the variance of a certain constrained graph embedding. To quote him, “Informally, to obtain [a lower bound on the optimal spectral gap] we seek to embed the graph into \(\mathbb {R}^{| V |}\) so as to ‘spread’ the nodes as much as possible under constraints over the distances separating nodes connected by edges.” He re-derives the upper bound using this formulation. This shows vertex conductance is a fundamental barrier to fast mixing. Our result shows that vertex conductance is essentially the fundamental barrier to fast mixing.

Sun, Boyd, Xiao and Diaconis [43]. The paper [43] is of a similar flavour to [12] but in the continuous-time set-up. We discuss it in detail in Sect. 1.4.3 below.

1.4.2 Special cases and particular examples

Boyd, Diaconis, Sun and Xiao [11]. The special case of the path with uniform distribution is studied in the short note [11], as a follow-on from [12]. They show that the ‘uniform chain’, i.e., the unbiased RW with \(\tfrac{1}{2}\)-holding at the ends, has the largest spectral gap.

Boyd, Diaconis, Parrilo and Xiao [9, 10]. The FMMC problem on graphs with rich symmetry properties is studied in [10]. They are able to solve various cases analytically: edge-transitive graphs, such as the cycle; Cartesian products of graphs, such as the two-dimensional torus and the hypercube; distance-transitive graphs, such as Petersen, Hamming and Johnson graphs. They then use algebraic methods to study FMMC on orbit graphs. This uses powerful representation theory arguments developed in [9].

Cihan and Akar [14]. Many similar scenarios, such as edge-transitive graphs, are studied in [14]. The focus is on two SDP methods. They study the degree-biased and uniform equilibria.

Jafarizadeh and Jamalipour [21]. Symmetric K-partite graphs and connections to sensor networks are considered in [21]. They compare numerically with a Metropolis–Hastings algorithm.

Allison and Shader [2]. Graphs which are overlapping unions of two cliques are studied in [2]. Here there are two cliques, say of sizes \(r+s\) and \(r+t\), respectively, and there are s overlapping vertices. The FMMC problem is solved analytically for such graphs.

Fill and Kahn [19]. A rather different approach is taken in [19]. Their paper is focussed on comparison inequalities and majorisation of measures. They use these to analyse the FMMC problem. This use of majorisation allows them to study a distance, such as TV, separation or \(\ell _2\), rather than the spectral gap, which is only a proxy for the mixing time.

1.4.3 Continuous-time set-up

Sun, Boyd, Xiao and Diaconis [43]. The study of the FMMC question in continuous-time was initiated in [43]. The structure and goals of this paper are similar to [12]. The primary contribution is a convex SDP formulation as well as some dual formulations.

Recall that a normalisation on the weights was required. Indeed, doubling all the weights doubles the spectral gap. We imposed an “average leave-rate of 1”. A slightly more general ‘weighted average’ is considered in [43]. A number of physical interpretations of this normalisation are given.

Sammer [39] and Montenegro and Tetali [34]. The FMMC problem is considered by Sammer [39, §3.3]. It is referenced and discussed by Montenegro and Tetali [34, §7]. We discussed their work in detail immediately after Theorem C. We add a small caveat to that discussion.

Montenegro and Tetali [34, §7.1] claim to impose a scaling of \(q(V) \le 1\) on their edge weightings \(q: E \rightarrow \mathbb {R}_+\); contrast this with our imposition of \(q(V) \le n\). Their scaling immediately implies that the relaxation time is at least order n; this contradicts their theorem. There are a couple of other points where there seem to be issues with the scalings, in particular in application of results from [43]. It may be possible to rectify these issues, but we have not checked carefully.

1.5 Subsequent work

Following the release of our paper on arXiv in late 2021, two papers [22, 29] extending our work on the FMMC question were released, almost simultaneously, in early 2022.

Kwok, Lau and Tung [29]. We posed the open question of

in Theorem A, and also of proving an analogous result for general equilibrium distributions, rather than just the uniform distribution. This was the inspiration for [29] (private communication). They solve both open problems, and also generalise our framework to optimal higher-order eigenvalues. The extension to non-uniform equilibrium distributions was the main challenge and, whilst their solution follows the framework of our proof, it required a significant number of new ideas.

Jain, Pham and Vuong [22]. The above improvement from \(\log | V |\) to \(\log d_\textsf{max}\) was also established in [22] independently, but only in the set-up of a uniform equilibrium distribution.

2 Vertex conductance and the optimal spectral gap

This section is devoted to a proof of Theorem A. In Sect. 2.1, we define the matching conductance of a graph, which plays a central role in the proof of Theorem A. We also show in Proposition 2.2 that matching and vertex conductance of a graph differ by at most a universal constant factor. Sect. 2.2 contains the notation and preliminaries needed in the proof of Theorem A. In Sect. 2.3, we relate the optimal spectral gap of a graph to its matching conductance. This relation is formalised in Theorem 2.10. Notice that Theorem 2.10 together with Proposition 2.2 directly imply Theorem A.

2.1 The matching conductance of a graph

A matching is a set of edges such that any pair of edges in the set do not share an endpoint. Given a set of (undirected) edges E together with a weight function \(w :E \rightarrow \mathbb {R}_{\ge 0}\), a maximum matching for E is a matching with maximum total weight (if E is the edge set of an unweighted graph, we assume w is equal to one on E). We denote with \(\nu (E)\) the weight of a maximum matching for E:

$$\begin{aligned} \nu (E) \triangleq \max _{\text {matching} \; F \subseteq E} \sum _{e \in F} w(e). \end{aligned}$$
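For small edge sets, \(\nu (E)\) can be computed by brute force. The following reference implementation (ours, purely illustrative) enumerates all matchings contained in E.

```python
from itertools import combinations

def is_matching(F):
    """No two edges of F share an endpoint."""
    seen = set()
    for u, v in F:
        if u in seen or v in seen:
            return False
        seen.add(u); seen.add(v)
    return True

def nu(E, w=None):
    """Maximum total weight over all matchings F contained in E."""
    w = w or {e: 1 for e in E}          # unweighted: all weights equal one
    return max(sum(w[e] for e in F)
               for k in range(len(E) + 1)
               for F in combinations(E, k) if is_matching(F))

assert nu([(0, 1), (1, 2), (0, 2)]) == 1          # triangle
assert nu([(0, 1), (1, 2), (2, 3)]) == 2          # path on 4 vertices
assert nu([(0, 1), (1, 2)], {(0, 1): 1, (1, 2): 5}) == 5
```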

We can now define the matching conductance of a graph.

Definition 2.1

Let \(G = (V, E)\) be a graph and \(\emptyset \ne S \subset V\). The matching conductance of S is defined as

The matching conductance of G is defined as

The next proposition relates matching and vertex conductance.

Proposition 2.2

Let \(G = (V, E)\) be a graph. Then, it holds that

Proof

The inequality is obvious: for any \(S \subseteq V\), \(\nu (E(S,S^c))\) cannot exceed the size of the vertex boundary of S, since distinct edges of a matching in \(E(S,S^c)\) use distinct vertices of \(\partial S\). Therefore, for any \(S \subseteq V\), which yields the inequality.

The proof of is slightly more involved. In particular, it is not true that for any \(S \subseteq V\). Figure 4 provides an example of a graph with a set with small matching conductance, but large vertex conductance. Nevertheless, the worst vertex conductance of a set in a graph is related to the matching conductance of the graph. To prove this, consider \(|S| \le |V|/2\) with . We can assume , otherwise . Let M be a maximum matching for \(E(S,S^c)\), that is \(|M| = \nu (E(S,S^c))\), and \(V(M) \subset V\) be the set of vertices adjacent to edges in M. Now consider the set \(T = S \setminus V(M)\). We claim T has small vertex conductance. To this end, consider \(\partial T\). It holds that \(|\partial T| \le |V(M)|\). Indeed, any \(u \in S^c {\setminus } V(M)\) cannot be in \(\partial T\): otherwise there would exist an edge between a vertex in \(S \setminus V(M)\) and a vertex in \(S^c \setminus V(M)\), and this would contradict the maximality of M. Therefore, since \(|V(M)| = 2 |M|\) and , we have that
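The key step of the proof, namely that \(\partial T \subseteq V(M)\) by maximality of M, can be checked on random instances (our sketch, with matchings computed by brute force):

```python
import random
from itertools import combinations

def is_matching(F):
    seen = set()
    for u, v in F:
        if u in seen or v in seen:
            return False
        seen.add(u); seen.add(v)
    return True

def max_cut_matching(cut_edges):
    # brute-force maximum matching of the (unweighted) cut
    for k in range(len(cut_edges), 0, -1):
        for F in combinations(cut_edges, k):
            if is_matching(F):
                return F
    return ()

random.seed(2)
for _ in range(20):
    V = 7
    edges = [(u, v) for u in range(V) for v in range(u + 1, V)
             if random.random() < 0.4]
    S = set(random.sample(range(V), V // 2))
    cut = [e for e in edges if (e[0] in S) != (e[1] in S)]
    M = max_cut_matching(cut)
    VM = {x for e in M for x in e}
    T = S - VM
    # vertex boundary of T: vertices outside T with a neighbour in T
    dT = {y for (u, v) in edges for x, y in ((u, v), (v, u))
          if x in T and y not in T}
    assert dT <= VM      # hence |dT| <= |V(M)| = 2|M|
```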

2.2 Definitions and preliminaries

Given a set of vertices V, together with a set of edges E on V, a fractional matching is a function \(f :E \rightarrow [0,1]\) such that, for any \(v \in V\), \(\sum _{e \ni v} f(e) \le 1\). Moreover, the fractional matching number of E, denoted by \(\nu ^*(E)\), is the maximum total weight of a fractional matching for E:

$$\begin{aligned} \nu ^*(E) = \max _{f :E \rightarrow [0,1]} \sum _{e \in E} w(e) f(e), \end{aligned}$$

where the maximisation is over valid fractional matchings.

Notice that \(\nu ^*(E)\) is the solution of a linear program which is a convex relaxation for \(\nu (E)\). As such, \(\nu (E) \le \nu ^*(E)\). A useful characterisation of \(\nu ^*(E)\) is the following.

Proposition 2.3

The fractional matching number of E, \(\nu ^*(E)\), is equal to the minimum of the following linear program.

Proof

This simply follows from linear programming duality. \(\square \)

With a slight abuse of notation we will also use \(\nu (G)\) and \(\nu ^*(G)\) to denote the maximum (fractional) matching weight on the edge set of G.

Up until now we have considered only matchings in undirected graphs. For technical reasons, however, we will also need to consider matchings in directed graphs. Given a set \(\overrightarrow{E} \subseteq V \times V\) of directed edges, a directed matching \(\overrightarrow{M} \subseteq \overrightarrow{E}\) is a set of edges such that, if \((u,v),(w,z) \in \overrightarrow{M}\) and \((u,v) \ne (w,z)\), then \(u \ne w\) and \(v \ne z\). Alternatively, a directed matching can be seen as a subgraph where each vertex has indegree and outdegree at most one, whereas an undirected matching is a subgraph where each vertex has degree at most one. Analogously to the undirected case, we denote with \(\nu (\overrightarrow{E})\) the weight of a maximum matching in \(\overrightarrow{E}\).

Definition 2.4

Given an undirected (weighted) graph \(G=(V,E,w)\), an orientation \(\overrightarrow{G} = (V, \overrightarrow{E}, w)\) is a directed graph constructed by replacing each undirected edge \(\{u,v\} \in E\) with a directed edge \((u,v)\) (with arbitrary orientation) having weight \(w(\{u,v\})\).

The next lemma relates the maximum matching weight in an undirected graph with the maximum matching weight of its orientation.

Proposition 2.5

For any graph \(G=(V,E,w)\) and an orientation \(\overrightarrow{G}\), it holds that \(\nu (\overrightarrow{G}) \le 4 \nu (G)\).

Proof

Let \(M \subseteq E\) be a matching returned by the greedy algorithm for finding a maximal matching on G, which works as follows: let \(e_1, \dots , e_m\) be an ordering of the edges of G such that \(w(e_1) \ge \cdots \ge w(e_m)\). Then, greedy incrementally constructs M by adding \(e_i\) to it, for \(i=1,\dots ,m\), as long as this operation maintains the property that M is a matching. Denote with \(e_{i_1},\dots , e_{i_{|M|}}\) the edges of M ordered non-increasingly according to their weight.

Let \(\overrightarrow{M^*}\) be a maximum matching in \(\overrightarrow{G}\). We upper bound its total weight as follows. Let \(\overrightarrow{M}_0 = \overrightarrow{M^*}\) and, for \(j=1,\dots ,|M|\), let \(\overrightarrow{M}_j\) be the directed graph obtained from \(\overrightarrow{M}_{j-1}\) by removing all edges incident to one of the endpoints of \(e_{i_j}\). Since M is maximal by construction, \(\overrightarrow{M}_{|M|}\) is empty. Moreover, at each iteration j we remove at most four edges, since there are at most four edges in \(\overrightarrow{M^*}\) that share an endpoint with \(e_{i_j}\). Notice these edges cannot share an endpoint with \(\{e_{i_1},\dots ,e_{i_{j-1}}\}\), otherwise they would have been removed in a previous iteration. Therefore, they must all have weight less than or equal to \(w(e_{i_j})\). This is because the matching \(\{e_{i_1},\dots ,e_{i_{j-1}}\}\) can be augmented by adding any one of these edges (or rather, their undirected equivalents) without breaking the property of it being a matching. But then, their weight must be less than or equal to \(w(e_{i_j})\), since otherwise greedy would have chosen one of those instead of \(e_{i_j}\). Hence, we have proved that \(4\nu (G) \ge w(\overrightarrow{M^*}) = \nu (\overrightarrow{G})\), from which the proposition follows. \(\square \)
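Proposition 2.5 can be stress-tested numerically. The sketch below (ours) compares brute-force maximum-matching weights of small random graphs and random orientations of them.

```python
import random
from itertools import combinations

def ok_undirected(F):
    seen = set()
    for u, v in F:
        if u in seen or v in seen:
            return False
        seen.add(u); seen.add(v)
    return True

def ok_directed(F):
    # no two edges share a tail, no two edges share a head
    tails, heads = set(), set()
    for u, v in F:
        if u in tails or v in heads:
            return False
        tails.add(u); heads.add(v)
    return True

def best_weight(E, w, ok):
    # brute-force maximum-weight matching (undirected or directed)
    return max(sum(w[e] for e in F)
               for k in range(len(E) + 1)
               for F in combinations(E, k) if ok(F))

random.seed(0)
for _ in range(20):
    V = 5
    edges = [(u, v) for u in range(V) for v in range(u + 1, V)
             if random.random() < 0.5]
    w = {e: random.random() for e in edges}
    oriented = [e if random.random() < 0.5 else (e[1], e[0]) for e in edges]
    w_dir = {o: w[e] for e, o in zip(edges, oriented)}
    assert (best_weight(oriented, w_dir, ok_directed)
            <= 4 * best_weight(edges, w, ok_undirected) + 1e-9)
```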

2.3 Matching conductance and the fastest mixing problem

The following result is due to Roch [38] and gives a variational characterisation of \(\gamma ^\star (G)\). It follows from the fact that \(\gamma ^\star (G)\) can be expressed as the solution to a semidefinite program for which strong duality holds.

Proposition 2.6

Let \(G = (V, E)\) be a graph of n vertices. Then, \(\gamma ^\star (G)\) is equal to the minimum of the following optimisation problem.

Remark 2.7

The variational characterisation actually given by Roch [38] does not include a non-negativity constraint for the function g. The function g, however, needs to be non-negative whenever, as in our case, Markov chains on G are allowed positive holding probabilities. More precisely, for any \(u \in V\) and P transition matrix of a Markov chain on G, if we allow \(P(u,u) > 0\), then we need to require \(g(u) \ge 0\).

The variational characterisation of \(\gamma ^\star (G)\) given by Proposition 2.6 requires minimising over n-dimensional embeddings of the vertices in the graph. It is often more convenient to work with one-dimensional embeddings. For this reason, we introduce the following parameter.

Definition 2.8

Let \(G = (V, E)\) be a graph of n vertices. We denote with \(\gamma ^{(1)}(G)\) the minimum of the following optimisation problem.

The following proposition shows that \(\gamma ^{(1)}(G)\) is an \(O(\log {n})\)-approximation of \(\gamma ^\star (G)\).

Proposition 2.9

Let \(G = (V, E)\) be a graph. It holds that

$$\begin{aligned} \gamma ^\star (G) \le \gamma ^{(1)}(G) \lesssim \log {n} \cdot \gamma ^\star (G). \end{aligned}$$

The proof of this proposition uses a standard trick (see, e.g., Montenegro and Tetali [34]):

  (i) we apply the Johnson–Lindenstrauss lemma [28] to show that considering only \(O(\log {n})\)-dimensional embeddings suffices to obtain a constant approximation for \(\gamma ^\star (G)\);

  (ii) we transform this \(O(\log {n})\)-dimensional embedding into a one-dimensional embedding, losing an \(O(\log {n})\) factor in the process.
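Step (ii), selecting the coordinate with the largest total spread and centring it, can be sketched as follows (our illustration; `F` stands in for the \(O(\log n)\)-dimensional embedding \(\widetilde{f}\)):

```python
import random

random.seed(1)
n, d = 8, 3
F = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

def spread(vals):
    """Sum of squared pairwise differences of a list of reals."""
    return sum((a - b) ** 2 for a in vals for b in vals)

# pick the coordinate maximising the total spread, then centre it
i = max(range(d), key=lambda j: spread([F[u][j] for u in range(n)]))
mean_i = sum(F[u][i] for u in range(n)) / n
h = [F[u][i] - mean_i for u in range(n)]

total = sum(sum((F[u][j] - F[v][j]) ** 2 for j in range(d))
            for u in range(n) for v in range(n))
assert abs(sum(h)) < 1e-9            # h is centred
assert spread(h) >= total / d - 1e-9 # best coordinate keeps a 1/d fraction
```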

Proof

The relation \(\gamma ^\star (G) \le \gamma ^{(1)}(G)\) follows trivially since computing \( \gamma ^{(1)}(G)\) can be seen as minimising over the same set of n-dimensional embeddings as for \(\gamma ^\star (G)\), with the additional constraint that only the first coordinate can be non-zero.

To prove the upper bound, let \(f :V \rightarrow \mathbb {R}^n, g :V \rightarrow \mathbb {R}_{\ge 0}\) be the minimiser achieving \(\gamma ^\star (G)\) in Proposition 2.6. Then, the Johnson–Lindenstrauss lemma ensures there exists an embedding \(\widetilde{f} :V \rightarrow \mathbb {R}^d\) such that \(d = O(\log {n})\) and, for any \(u,v \in V\),

$$\begin{aligned} \frac{1}{2} \Vert f(u) - f(v)\Vert ^2 \le \Vert \widetilde{f}(u) - \widetilde{f}(v)\Vert ^2 \le \Vert f(u) - f(v)\Vert ^2, \end{aligned}$$

and

$$\begin{aligned} \frac{1}{2} \Vert f(u)\Vert ^2 \le \Vert \widetilde{f}(u)\Vert ^2 \le \Vert f(u)\Vert ^2. \end{aligned}$$

Now let

$$\begin{aligned} i \triangleq \underset{j \in \{1,\dots ,d\}}{{\text {arg}}{\text {max}}}\, \sum _{u,v \in V} \left( \widetilde{f}(u)_j - \widetilde{f}(v)_j\right) ^2, \end{aligned}$$

and define \(h :V \rightarrow \mathbb {R}\) as

$$\begin{aligned} h(u) = \widetilde{f}(u)_i - \frac{1}{n} \sum _{v \in V} \widetilde{f}(v)_i \end{aligned}$$

for any \(u \in V\).

By construction,

$$\begin{aligned} \sum _{u \in V} h(u) = \sum _{u \in V} \widetilde{f}(u)_i - \sum _{u \in V} \frac{1}{n} \sum _{v \in V} \widetilde{f}(v)_i = 0. \end{aligned}$$

Moreover, for any \(u,v \in V\),

$$\begin{aligned} (h(u)-h(v))^2 = (\widetilde{f}(u)_i - \widetilde{f}(v)_i)^2 \le \Vert \widetilde{f}(u) - \widetilde{f}(v)\Vert ^2 \le \Vert f(u) - f(v)\Vert ^2. \end{aligned}$$

Therefore, \((h, g)\) is a feasible solution to the optimisation problem of Definition 2.8.

Finally,

$$\begin{aligned} \sum _{u \in V} h(u)^2&= \frac{1}{2n} \sum _{u,v \in V} (h(u)-h(v))^2 \\&= \frac{1}{2n} \sum _{u,v \in V} (\widetilde{f}(u)_i - \widetilde{f}(v)_i)^2 \\&\ge \frac{1}{2nd} \sum _{u,v \in V} \left\| \widetilde{f}(u) - \widetilde{f}(v)\right\| ^2 \\&\ge \frac{1}{4nd} \sum _{u,v \in V} \left\| {f}(u) - {f}(v)\right\| ^2 \\&= \frac{1}{2d} \sum _{u\in V} \left\| f(u)\right\| ^2, \end{aligned}$$

where we used the fact that both h and f are centred at zero. Therefore,

$$\begin{aligned} \gamma ^{(1)}(G) \le \frac{\sum _{u \in V} g(u)}{\sum _{u \in V} h(u)^2} \le 2d \cdot \frac{\sum _{u \in V} g(u)}{\sum _{u \in V} \left\| f(u)\right\| ^2} \le 2d \, \gamma ^\star (G) \lesssim \log {n} \cdot \gamma ^\star (G). \end{aligned}$$

\(\square \)

Suppose we fix \(f :V \rightarrow \mathbb {R}\) that minimises the optimisation problem above. Then, by Proposition 2.3, \(\gamma ^{(1)}(G)\) can be seen as the fractional matching value of a graph \(G_f\) which is constructed from G by reweighing each edge \(\{u,v\}\) by \((f(u) - f(v))^2\). Together with Proposition 2.9, this hints towards a connection between the matching conductance of G and \(\gamma ^\star (G)\). This connection is formalised in Theorem 2.10, which is the main result of this section.

Theorem 2.10

Let \(G = (V, E)\) be a graph. It holds that

Moreover, this implies that

The proof of Theorem 2.10 follows the standard template of the proof of the discrete Cheeger inequality. To upper bound \(\gamma ^{(1)}(G)\) it suffices to construct test functions f, g from a set S minimising the matching conductance of G. The other direction is more complicated and, similarly to the “hard direction” of the discrete Cheeger inequality, it requires using the function f that minimises \(\gamma ^{(1)}(G)\) to construct sweep sets and analyse the matching conductance of such sets. Analysing the matching conductance of these sweep sets, however, is not as straightforward as analysing their edge conductance as in the proof of the standard discrete Cheeger inequality.

We split the proof of Theorem 2.10 into several lemmata. The first one, Lemma 2.11, relates the maximum matching of cuts in the graph to the maximum matching of a suitably constructed weighted directed graph.

Lemma 2.11

Let \(G = (V, E)\) be an unweighted undirected graph and let \(f :V \rightarrow \mathbb {R}_{\ge 0}\). Let \(\overrightarrow{G}_f = (V, \overrightarrow{E}_f,w_f)\) be a directed weighted graph constructed as follows:

  (i) for any \(u,v \in V\), \((u,v) \in \overrightarrow{E}_f\) if and only if \(\{u,v\} \in E\) and \(f(u) < f(v)\);

  (ii) for any \((u,v) \in \overrightarrow{E}_f\), \(w_f(u,v) = f(v)^2 - f(u)^2\).

For any \(t > 0\), define \(S_t = \{u \in V :f(u)^2 > t\}\). Then, it holds that,

$$\begin{aligned} \int _0^{\infty }\nu (E(S_t, S_t^c)) \, dt \le 2 \nu (\overrightarrow{G}_f). \end{aligned}$$

Proof

For any \(t \in [0,\infty )\), let \(M_t \subseteq E(S_t, S_t^c)\) be a matching achieving value \(\nu (E(S_t, S_t^c))\). Notice that, for any t, there might be several distinct maximum matchings for \(E(S_t, S_t^c)\): we just pick one of them arbitrarily. We have that

$$\begin{aligned} \int _0^{\infty } \nu (E(S_t, S_t^c)) \, dt = \int _0^{\infty } \sum _{e \in E} \varvec{1}\{ e \in M_t \} \, dt. \end{aligned}$$

Notice that, for any edge \(\{u,v\} \in E\) with \(f(u) < f(v)\),

$$\begin{aligned} \{ t \in [0, \infty ) :\{u,v\} \in M_t \} \subseteq \left[ f(u)^2,f(v)^2\right) . \end{aligned}$$

Let \(\overrightarrow{M} \subseteq \overrightarrow{E}_f \) be the matching output by an execution of greedy on \(\overrightarrow{G}_f\), which works as follows. We first order the edges \(\overrightarrow{E}_f = \{e_1,\dots ,e_m\}\) of \(\overrightarrow{G}_f\) such that \(w_f(e_1) \ge w_f(e_2) \ge \cdots \ge w_f(e_m)\). For \(i=1,\dots ,m\), we incrementally construct \(\overrightarrow{M}\) by including \(e_i\) as long as adding this edge does not break the property of \(\overrightarrow{M}\) being a directed matching. This is the same algorithm as the one described in the proof of Proposition 2.5, with the difference that we are now constructing a directed instead of an undirected matching.

Consider now \(\{u,v\} \in E\) such that \(f(u) < f(v)\) and \(\{u,v\} \in M_t\) for some \(t \ge 0\). Then, there must exist an edge \((u',v') \in \overrightarrow{M}\) such that \(u = u'\) or \(v = v'\) (since greedy outputs a maximal matching) and

$$\begin{aligned} w_f(u',v') = f(v')^2 - f(u')^2 \ge f(v)^2 - f(u)^2. \end{aligned}$$

The inequality above holds because otherwise greedy would have picked \((u,v)\) instead of \((u',v')\). Since \((u,v)\) and \((u',v')\) share an endpoint, this implies that \([f(u)^2,f(v)^2) \subseteq [f(u')^2,f(v')^2)\) and, in particular, \(t \in [f(u')^2,f(v')^2)\).

For any \(t \ge 0\), let \(h_t :E \rightarrow \overrightarrow{E}_f\) be the function that maps any \(\{u,v\} \in E\) such that \(f(u) < f(v)\) and \(\{u,v\} \in M_t\) to an edge \((u',v')\) as above. Notice that for any edge \((u',v') \in \overrightarrow{M}\) and any \(t\ge 0\), there can be at most two edges in \(M_t\) that share an endpoint with \((u',v')\). Hence, \(|h_t^{-1}(u',v')| \le 2\).

Therefore, since \(\{ t \in [0, \infty ) :\{u,v\} \in M_t \} \subseteq \left[ f(u)^2,f(v)^2\right) \) as noted above, we have

$$\begin{aligned} \sum _{\{u,v\} \in E} \int _0^{\infty } \varvec{1}\{ \{u,v\} \in M_t \} \, dt&\le 2 \sum _{(u',v') \in \overrightarrow{M}} \int _0^{\infty } \varvec{1}\{ h_t^{-1}(u',v') \cap M_t \ne \emptyset \} dt \\&\le 2 \sum _{(u',v') \in \overrightarrow{M}} \left( f(v')^2 - f(u')^2\right) \\&\le 2 \nu (\overrightarrow{G}_f). \end{aligned}$$

The lemma follows by observing that

$$\begin{aligned} \int _0^{\infty } \nu (E(S_t, S_t^c)) \, dt = \int _0^{\infty } \sum _{e \in E} \varvec{1}\{ e \in M_t \} \, dt = \sum _{e \in E} \int _0^{\infty } \varvec{1}\{ e \in M_t \} \, dt \le 2 \nu (\overrightarrow{G}_f), \end{aligned}$$

where we can interchange the integral and the sum since the matchings \(\{M_t: t\ge 0\}\) can be chosen so that we need to consider only at most \(n-1\) different matchings (since there are at most \(n-1\) different sets \(S_t\)), which implies that the integral can actually be computed as a finite sum. \(\square \)
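The lemma can be checked numerically on a small random instance (our sketch; all matchings computed by brute force, and the integral evaluated as a finite sum over the distinct thresholds \(f(u)^2\), between which \(S_t\) is constant):

```python
import random
from itertools import combinations

def is_matching(F):
    seen = set()
    for u, v in F:
        if u in seen or v in seen:
            return False
        seen.add(u); seen.add(v)
    return True

def nu_cut(cut_edges):
    # brute-force maximum (unweighted) matching of a cut
    for k in range(len(cut_edges), 0, -1):
        if any(is_matching(F) for F in combinations(cut_edges, k)):
            return k
    return 0

random.seed(3)
V = 6
edges = [(u, v) for u in range(V) for v in range(u + 1, V)
         if random.random() < 0.6]
f = [random.random() for _ in range(V)]

# directed graph G_f: orient each edge towards the larger f-value;
# the weight of (u, v) is f(v)^2 - f(u)^2
dir_edges = [(u, v) if f[u] < f[v] else (v, u) for (u, v) in edges]

def nu_dir():
    # brute-force maximum-weight directed matching of G_f
    best = 0.0
    for k in range(len(dir_edges) + 1):
        for F in combinations(dir_edges, k):
            tails, heads, ok = set(), set(), True
            for u, v in F:
                if u in tails or v in heads:
                    ok = False
                    break
                tails.add(u); heads.add(v)
            if ok:
                best = max(best, sum(f[v] ** 2 - f[u] ** 2 for u, v in F))
    return best

levels = sorted(set(x ** 2 for x in f))
integral = 0.0
for lo, hi in zip(levels, levels[1:]):
    t = (lo + hi) / 2
    S = {u for u in range(V) if f[u] ** 2 > t}
    cut = [e for e in edges if (e[0] in S) != (e[1] in S)]
    integral += nu_cut(cut) * (hi - lo)

assert integral <= 2 * nu_dir() + 1e-9
```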

The next lemma shows how to construct a set of small matching conductance given a “good” non-negative function \(f :V \rightarrow \mathbb {R}_{\ge 0}\).

Lemma 2.12

Let \(G = (V, E)\) be a graph and \(f :V \rightarrow \mathbb {R}_{\ge 0}\) such that \(0< \left| \{u \in V :f(u) \ne 0 \} \right| < |V|\). Let \(\lambda \) be the minimum of the following optimisation problem.

Then, there exists a set \(S \subseteq \{u \in V :f(u) > 0 \}\) such that .

Proof

Let \(S_t = \{u \in V :f(u)^2 > t\}\) for \(t \ge 0\). We have that

$$\begin{aligned} \min _{t :0< |S_t| < |V|} \Upsilon (S_t) \le \frac{\int _0^{\infty } \nu (E(S_t, S_t^c)) \, dt }{\int _0^{\infty } |S_t| \, dt}. \end{aligned}$$

First notice the denominator is equal to

$$\begin{aligned} \int _0^{\infty } |S_t| \, dt = \sum _{u\in V} f(u)^2. \end{aligned}$$

We now upper-bound the numerator. Let \(\overrightarrow{G}_f = (V, \overrightarrow{E}_f,w_f)\) be the directed weighted graph defined in Lemma 2.11:

  (i) for any \(u,v \in V\), \((u,v) \in \overrightarrow{E}_f\) if and only if \(\{u,v\} \in E\) and \(f(u) < f(v)\);

  (ii) for any \((u,v) \in \overrightarrow{E}_f\), \(w_f(u,v) = f(v)^2 - f(u)^2\).

Notice that \(\overrightarrow{G}_f\) is an orientation of the undirected graph \(G_f = (V, E_f,w_f)\) obtained by replacing each directed edge \((u,v) \in \overrightarrow{E}_f\) with the undirected edge \(\{u,v\} \in E\) (keeping its weight). Therefore, by Lemma 2.11 and Proposition 2.5, we have that

$$\begin{aligned} \int _0^{\infty } \nu (E(S_t, S_t^c)) \, dt \le 2 \nu (\overrightarrow{G}_f) \le 8 \nu (G_f). \end{aligned}$$

We now want to relate \(\nu (G_f)\) to \(\lambda \). Let M be a maximum matching in \(G_f\). By applying the Cauchy–Schwarz inequality,

$$\begin{aligned} \nu (G_f)&= \sum _{\{u,v\} \in M} w_f(u,v) \\&= \sum _{\{u,v\} \in M} \left| f(u)^2 - f(v)^2\right| \\&= \sum _{\{u,v\} \in M} \left| f(u) - f(v)\right| \cdot \left| f(u) + f(v)\right| \\&\le \sqrt{\sum _{\{u,v\} \in M} \left( f(u) - f(v)\right) ^2} \sqrt{\sum _{\{u,v\} \in M} \left( f(u) + f(v)\right) ^2} \\&\le \sqrt{\sum _{\{u,v\} \in M} (f(u) - f(v))^2} \sqrt{\sum _{\{u,v\} \in M} 2\left( f(u)^2 + f(v)^2\right) } \\&\le \sqrt{\sum _{\{u,v\} \in M} (f(u) - f(v))^2} \sqrt{2 \sum _{u \in V} f(u)^2}, \end{aligned}$$

where the first inequality follows from Cauchy–Schwarz, while the second from the inequality \((a+b)^2 \le 2a^2 + 2b^2\) for any \(a,b \in \mathbb {R}\). Notice that \(\sum _{\{u,v\} \in M} (f(u) - f(v))^2\) can be interpreted as the weight of a matching in an undirected graph obtained from G by reweighing each edge \(\{u,v\}\) with weight \((f(u) - f(v))^2\). Therefore, we can apply Proposition 2.3 and the definition of \(\lambda \) to show that

Putting everything together, we obtain

We are now finally ready to prove Theorem 2.10.

Proof of Theorem 2.10

We start by proving the “easy side” of the Cheeger-type inequality, i.e., . Let \(S \subset V\) be such that \(|S| \le |V|/2\) and . Define \(f :V \rightarrow \mathbb {R}\) as

$$\begin{aligned} f(u) = {\left\{ \begin{array}{ll} \frac{1}{\sqrt{2|S|}} &{} \text { if } u \in S \\ - \frac{1}{\sqrt{2|V \setminus S|}} &{} \text { if } u \not \in S. \end{array}\right. } \end{aligned}$$

Let M be a maximum matching for \(E(S,S^c)\), i.e., \(|M| = \nu (E(S,S^c))\). Denote with V(M) the set of vertices incident to M. Define \(g :V \rightarrow \mathbb {R}\) as

$$\begin{aligned} g(u) = {\left\{ \begin{array}{ll} 2/|S| &{} \text { if } u \in V(M) \\ 0 &{} \text { otherwise}. \end{array}\right. } \end{aligned}$$

By construction, f and g satisfy the constraints of the optimisation problem of Definition 2.8. Moreover, \(\sum _{u \in V} g(u) = 2 |V(M)| / |S| = 4 \nu (E(S,S^c)) / |S|\), while \(\sum _{u \in V} f(u)^2 = 1\). Therefore,

We now turn the attention to the “harder side”, i.e., . Let \(f,g :V \rightarrow \mathbb {R}\) minimise \(\gamma ^{(1)}(G)\). We cannot directly apply Lemma 2.12 with f since f need not be non-negative and may be supported on the whole of V. For this reason, we define two non-negative functions \(h^-,h^+ :V \rightarrow \mathbb {R}_{\ge 0}\) as follows. Let c be the median of f, i.e., order the vertices in V such that \(f(u_1) \le f(u_2) \le \cdots \le f(u_n)\) and set \(c \triangleq f(u_{\lceil n/2 \rceil })\). For any \(u \in V\), define \(h^-(u) \triangleq \max \{0, - (f(u)-c)\}\) and \(h^+(u) \triangleq \max \{0, f(u)-c\}\). If \(\sum _{u \in V} h^-(u)^2 \ge \sum _{u \in V} h^+(u)^2\), we define \(h \triangleq h^-\), otherwise \(h \triangleq h^+\). We now apply Lemma 2.12 with h.

First notice that

$$\begin{aligned} \sum _{u \in V} h(u)^2 \ge \frac{1}{2} \sum _{u \in V} (f(u)-c)^2 \ge \frac{1}{2} \sum _{u \in V} f(u)^2, \end{aligned}$$

since \(\sum _{u \in V} f(u) = 0\) because f is a feasible solution to the optimisation problem of Definition 2.8. Moreover, for any \(\{u,v\} \in E\),

$$\begin{aligned} (h(u) - h(v))^2 \le (f(u)-f(v))^2 \le g(u) + g(v). \end{aligned}$$

We can then apply Lemma 2.12 with

$$\begin{aligned} \lambda \triangleq \frac{\sum _{u \in V} g(u)}{\sum _{u \in V} h(u)^2} \le \frac{2\sum _{u \in V} g(u)}{ \sum _{u \in V} f(u)^2} = 2\gamma ^{(1)}(G). \end{aligned}$$

Therefore, there exists \(S \subseteq \{u \in V :h(u) > 0 \}\) such that . Moreover, by construction the support of h has size at most |V|/2. Hence,

\(\square \)

3 Almost mixing

3.1 Set-up and main result

The previous section was devoted to estimating mixing-type statistics for the FMMC problem: we controlled the maximal spectral gap amongst all transition matrices \(P\) on a given graph \(G = (V, E)\) which are reversible w.r.t. the uniform distribution in terms of the vertex conductance of G. The purpose of the current section is to relax the condition that the invariant distribution of \(P\), which we denote \(\pi _P\), is exactly uniform: we allow \(\pi _P\) to be \(\varepsilon \)-far from uniform in TV. We show that this can allow a significant speed-up in the mixing time versus requiring the invariant distribution to be exactly \(\pi \): we explicitly construct a Markov chain with spectral gap at least of order \(\varepsilon / ({{\,\textrm{diam}\,}}G)^2\).

Recall that we write \(\mathcal D(V)\) for the set of positive probability distributions on a set V.

Let \(u: E \rightarrow \mathbb {R}_+\), let \(A, B, S \subseteq V\), let \(x \in V\) and let \(E' \subseteq E\). We use the following notation.

$$\begin{aligned}{} & {} u(x) :=\sum \nolimits _{y \in V} u ( \{ x,y \} ) \quad \text {and}\quad u(S) :=\sum \nolimits _{x \in S} u(x); \\{} & {} E(A, B) :=\bigl \{ \{ x, y \} \in E \mid x \in A, \, y \in B \bigr \} \quad \text {and}\quad u(E') :=\sum \nolimits _{e \in E'} u(e). \end{aligned}$$

Define the transition matrix \(P_u\in [0,1]^{V \times V}\) by \(P_u(x,y) :=u ( \{ x,y \} ) / u(x)\) for \(x, y \in V\).

Abbreviate the spectral gap as . Define the probability measure \(\pi _u: V \rightarrow [0,1]\) by

$$\begin{aligned} \pi _u(x) :=u(x) / u(V) \quad \text {for}\quad x \in V. \end{aligned}$$

\(P_u\) is the transition matrix of the RW on the weighted graph and \(\pi _u\) is its invariant distribution. It is the unique invariant distribution if the graph \((V, \{ e \in E \mid u(e) \ne 0 \} )\) is connected.
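To make this concrete, here is a minimal sketch (with toy weights, not taken from the paper) that builds \(P_u\) and \(\pi _u\) for a small weighted graph and checks stationarity and reversibility exactly:

```python
# Toy sketch (not from the paper): build P_u(x,y) = u({x,y})/u(x) and
# pi_u(x) = u(x)/u(V) for a small weighted 4-cycle, then verify that
# pi_u is invariant and that P_u is reversible w.r.t. pi_u.
from fractions import Fraction

# u maps edges {x,y} (as frozensets) to weights; no self-loops in this example.
u = {frozenset({0, 1}): Fraction(2), frozenset({1, 2}): Fraction(1),
     frozenset({2, 3}): Fraction(3), frozenset({0, 3}): Fraction(1)}
V = [0, 1, 2, 3]

def weight(x):                       # u(x) = sum of weights of edges at x
    return sum(w for e, w in u.items() if x in e)

uV = sum(weight(x) for x in V)       # u(V); each edge is counted twice
pi = {x: weight(x) / uV for x in V}  # pi_u(x) = u(x) / u(V)

P = {(x, y): u.get(frozenset({x, y}), Fraction(0)) / weight(x)
     for x in V for y in V if x != y}

# Invariance: sum_x pi(x) P(x,y) = pi(y); reversibility: pi(x)P(x,y) = pi(y)P(y,x).
for y in V:
    assert sum(pi[x] * P[(x, y)] for x in V if x != y) == pi[y]
for x in V:
    for y in V:
        if x != y:
            assert pi[x] * P[(x, y)] == pi[y] * P[(y, x)]
print("pi_u invariant and reversible:", dict(pi))
```

Exact rational arithmetic makes the stationarity check an identity rather than a floating-point approximation.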

The following theorem is a refinement of Theorem B.

Theorem 3.1

(‘Almost Mixing’) Let \(G = (V, E)\) be a graph and let \(\pi \in \mathcal D(V)\). There exists an edge weighting \(w_1: E \rightarrow \mathbb {R}_+\), depending only on G and \(\pi \), with unit total weight, i.e. \(w_1(V) = 1\), and the following property. Let \(\varepsilon \in (0,1)\). Let \(P\) be a transition matrix on G which is reversible w.r.t. \(\pi \); it need not be irreducible. Let \(w_0: E \rightarrow \mathbb {R}_+\) be the unique edge weighting of G with \(P_{w_0} = P\) and \(w_0(V) = 1\). Define the transition matrix \(Q\) via the superposition weighting \(w:=w_0 + \varepsilon w_1\):

$$\begin{aligned} Q:=P_w\quad \text {where}\quad w(e) :=w_0(e) + \varepsilon w_1(e) \quad \text {for}\quad e \in E. \end{aligned}$$

Let \(Q' :=\tfrac{1}{2} (I + Q)\) denote its lazification. Then

Immediate consequences of these properties are

We remark briefly on the ‘independence’ of the perturbation by \(\varepsilon w_1\).

Remark 3.2

(Independence of Perturbation) The weighting \(w= w_0 + \varepsilon w_1\) can be seen as a perturbation of \(w_0\) by \(\varepsilon w_1\), since we are most interested in the case where \(\varepsilon \) is very small—indeed, we want the new equilibrium distribution to be very close to \(\pi \). We emphasise that the perturbation weighting \(w_1\) does not depend on the base weighting \(w_0\); rather, \(w_1\) is a function only of G and \(\pi \).

We fix a graph \(G = (V, E)\) and a probability measure \(\pi \in \mathcal D(V)\) throughout this section. We do not always repeat these in statements below. Also, we write \(n :=| V |\).

We start by proving a slightly weaker statement. Assume that \(w_0\) consists only of \(\pi \)-weighted self-loops:

$$\begin{aligned} w_0 ( \{ x,y \} )&= 0 \quad \text {for all}\quad x, y \in V \quad \text {with}\quad x \ne y \quad \text {and}\quad w_0 ( \{ v \} ) \\&= \pi (v) \quad \text {for all}\quad v \in V. \end{aligned}$$

The corresponding transition matrix \(P_{w_0}\) is diagonal and thus reversible w.r.t. any measure. We then extend the argument to handle arbitrary initial weightings \(w_0\) in Sect. 3.5.

3.2 Outline and proof given later results

We start by giving a very brief outline with cross-references to the results proved in the following subsections. We then flesh out this outline, giving a more detailed description.

Outline of Proof: Very Brief

The proof has four key steps.

  1. (i)

    We construct a weighted spanning tree; see Definition 3.11.

  2. (ii)

    We control the difference between the invariant distribution of the RW on this weighted tree and the target distribution \(\pi \); see Lemma 3.12.

  3. (iii)

    We estimate the conductance of this weighted tree; see Proposition 3.13.

  4. (iv)

    We relate its spectral gap and conductance using canonical paths; see Corollary 3.14. \(\square \)

Outline of Proof: More Detailed

We now flesh out the above details somewhat for the case of uniform \(\pi \).

  • Let \(T = (V, F)\) be a BFS spanning tree of G, rooted at \(v_\star \). We choose a weighting \(w_\star : F \rightarrow (0, \infty )\) such that the weights increase towards the root \(v_\star \) in such a way that \(w_\star \) has edge conductance order 1. We then rescale \(w_\star \) to get \(\widetilde{w}_\star \) with total weight \(\widetilde{w}_\star (V) = \varepsilon n\).

  • Define \(w: E \rightarrow \mathbb {R}_+\) by adding unit-weight self-loops to \(\widetilde{w}_\star \). The total weight \(w(V) = (1 + \varepsilon ) n\) and \(w(x) \ge 1\) for all x. Thus the invariant distribution \(\pi '\) satisfies \( \min _{x \in V} \pi '(x) \ge (1 - \varepsilon ) / n. \)

  • It remains to analyse the edge conductance of \(w\), which is intimately related to the original total weight \(w_\star (V)\). We can choose the weights such that \(w_\star (V) \asymp n {{\,\textrm{diam}\,}}G\). We then apply a Cheeger-type inequality to deduce a spectral gap lower bound of order \(\varepsilon / ({{\,\textrm{diam}\,}}G)^2\).

We now describe how to choose the weighting \(w_\star \). Let \(T_x \subseteq T\) denote the subtree rooted at x and consisting of all descendants of x. We choose the weight \(w_\star ( \{ x,\textsf{prt}(x) \} ) :=| T_x |, \) where \(\textsf{prt}(x)\) is the (unique) parent of x, for \(x \ne v_\star \). This way, the conductance of a subtree \(T_x\) in the weighted tree \((T, w_\star )\) is precisely 1. We emphasise that this is in the weighted tree \((T, w_\star )\). We need to rescale \(w_\star \) and combine it with the unit-weight self-loops to get an approximately uniform weighting.

It turns out that \( w_\star (F) \asymp n {{\,\textrm{diam}\,}}T \asymp n {{\,\textrm{diam}\,}}G. \) This then gives rise to a final conductance of order \(\varepsilon / {{\,\textrm{diam}\,}}G\). The standard Cheeger inequality then gives a spectral gap of order \((\varepsilon / {{\,\textrm{diam}\,}}G)^2\). We improve this to order \(\varepsilon / ({{\,\textrm{diam}\,}}G)^2\) by applying the canonical paths method, using the fact that T is a tree.

The proof for general \(\pi \) is very similar. One gives the self-loop at x weight \(\pi (x)\) and defines \( w_\star ( \{ x, \textsf{prt}(x) \} ) :=\pi (T_x). \) This is the natural extension. The same arguments go through. \(\square \)
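The tree construction in the outline can be sketched numerically. The grid graph, BFS tie-breaking and the crude diameter bound \({{\,\textrm{diam}\,}}T \le 2\,\textrm{height}(T)\) below are illustrative assumptions, not part of the proof:

```python
# Illustrative sketch (uniform case): build a BFS tree of a 3x3 grid,
# set w_star({x, prt(x)}) = |T_x|, and check the total-weight bound
# w_star(F) <= n * diam(T) claimed in the outline.
from collections import deque

n = 9
adj = {v: [] for v in range(n)}      # 3x3 grid, vertices 0..8
for r in range(3):
    for c in range(3):
        v = 3 * r + c
        if c < 2: adj[v].append(v + 1); adj[v + 1].append(v)
        if r < 2: adj[v].append(v + 3); adj[v + 3].append(v)

root = 0
prt, depth = {root: None}, {root: 0}
order, queue = [root], deque([root])
while queue:                          # BFS from the root
    x = queue.popleft()
    for y in adj[x]:
        if y not in prt:
            prt[y], depth[y] = x, depth[x] + 1
            order.append(y); queue.append(y)

subtree = {v: 1 for v in range(n)}    # |T_x|: accumulate sizes leaf-to-root
for x in reversed(order):
    if prt[x] is not None:
        subtree[prt[x]] += subtree[x]

w_star = {x: subtree[x] for x in range(n) if prt[x] is not None}
diam_T = 2 * max(depth.values())      # upper bound: diam(T) <= 2 * height(T)
assert sum(w_star.values()) <= n * diam_T
print("w_star(F) =", sum(w_star.values()), "<= n * diam(T) bound =", n * diam_T)
```

Since the bound is checked against an upper bound on \({{\,\textrm{diam}\,}}T\), it is implied by the (stronger) inequality in the text.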

3.3 Preliminaries

We introduce some preliminary material which is used throughout the proof, as well as in Sect. 4. The majority of it will be familiar to a reader well-versed in RWs and mixing time analysis.

First, we generalise the notion of edge conductance of the graph, herein abbreviated conductance. We introduced this in Definition A.1 for RWs on unweighted graphs, i.e. unit weights on all edges. Reversible RWs correspond to a general weighting \(u: E \rightarrow \mathbb {R}_+\), as in the above notation. The (edge) conductance of a reversible RW is the (edge) conductance of that weighted graph.

Definition 3.3

(Edge Conductance) Let \(G = (V, F)\) be a graph and \(u: F \rightarrow \mathbb {R}_+\) be a weighting. The conductance of a set \(S \subseteq V\) with \(\pi _u(S) > 0\) w.r.t. \(u\) is defined to be

The conductance of \(u\) is defined to be

The adjusted conductance of a set \(S \subseteq V\) with \(0< \pi _u(S) < 1\) is defined similarly:

Remark 3.4a

(Conductance: Original and Adjusted Relations) The definition of the adjusted conductance does not need the restriction \(\pi _u(S) \le \tfrac{1}{2}\), as it is invariant under complementation:

The following inequalities between and are immediate:

Remark 3.4b

(Conductance: Connectivity Assumption) We may assume that S induces a connected subgraph T[S] when analysing the conductance. Indeed, if \(S = A \mathrel {\dot{\cup }} B\) with \(F(A, B) = \emptyset \), then

using the fact that \( F(S, S^c) = F(A, A^c) \mathrel {\dot{\cup }} F(B, B^c) \) and that

$$\begin{aligned} \frac{a + b}{a' + b'} \ge \min \biggl \{ \frac{a}{a'}, \, \frac{b}{b'} \biggr \} \quad \text {for all}\quad a,a',b,b' > 0. \quad \quad \quad \quad \quad \quad \quad \triangle \end{aligned}$$

Next, we introduce the canonical paths method and use it to relate the spectral gap to the conductance in trees. A proof of Proposition 3.6 can be found in [40, Theorem 5].

Definition 3.5

(Paths) Let \(G = (V, F)\) be a graph. \(\Gamma : \{ 0,..., L \} \rightarrow V\) is an F-path from x to y if

$$\begin{aligned} \Gamma (0) = x, \quad \Gamma (L) = y \quad \text {and}\quad \{ \Gamma (\ell -1), \, \Gamma (\ell ) \} \in F \quad \text {for all}\quad \ell \in [1, L]. \end{aligned}$$

The length of a path \(\Gamma : \{ 0,..., L \} \rightarrow V\) is defined to be \( | \Gamma |:=L. \)

Proposition 3.6

(Canonical Paths: General) Let \(G = (V, E)\) be a connected graph and \(u: E \rightarrow \mathbb {R}_+\) be a weighting. Let \(\Gamma _{x,y}\) be an arbitrary E-path from x to y for \(x,y \in V\). The spectral gap of the RW on \((G, u)\) satisfies

Corollary 3.7

(Canonical Paths: Trees) Let \(T = (V, F)\) be a connected tree and \(u: F \rightarrow (0, \infty )\) be a weighting. The spectral gap of the RW on \((T, u)\) satisfies

Proof

For all \(x,y \in V\), let \(\Gamma _{x,y}\) be a shortest path between x and y; then \(| \Gamma _{x,y} | \le {{\,\textrm{diam}\,}}T\). Removing the edge \(e = \{ e_-, e_+ \} \in F\) disconnects the graph, leaving two components, with \(e_- \in V\) in one component and \(e_+ \in V\) in the other. Denote the component containing \(e_\pm \) by \(T^e_\pm \). The canonical paths method (Proposition 3.6) then implies that

This uses the fact that \( T^e_+ \mathrel {\dot{\cup }} T^e_- = V \) and \( F(T_+^e, T_-^e) = \{ e \} \) for all \(e \in F\). \(\square \)

Remark 3.8

A more general statement is proved by Miclo [33, Theorem 1] (article in French). He does not require the graph T to be a tree, at the cost of replacing \({{\,\textrm{diam}\,}}T\) in the denominator of the bound by the length of the longest path in T; e.g., if T has a Hamiltonian path, then the denominator becomes \(n-1\). He gives two proofs, one of which uses a canonical paths style argument.

Finally, we introduce some notation for trees and prove a counting lemma. This lemma, innocuous as it may appear, is fundamental to multiple calculations. The notation and definitions above were for any graph \(T = (V, F)\). Assume that T is a tree for the rest of this preliminary section.

Definition 3.9

(Tree Notation) Let \(T = (V, F)\) be a tree rooted at o.

  • Let \(\textsf{anc}(z)\) denote the unique shortest path from z to the root o, including both z and o.

  • Let \(V_y := \{ z \in V \mid y \in \textsf{anc}(z) \} \) and \(T_y :=T[V_y]\) denote the subtree rooted at y.

  • Let \(\textsf{prt}(x)\) denote the parent of \(x \ne o\), ie the unique neighbour y of x satisfying \(y \in \textsf{anc}(x)\).

Lemma 3.10

(Counting Weighted Subtrees) For all measures \(\mu \) on V and all \(x \in V\), we have

$$\begin{aligned} \sum \nolimits _{y \in T_x \setminus \{ x \} } \mu (T_y) \le \mu (T_x) {{\,\textrm{diam}\,}}T. \end{aligned}$$

Proof

\(z \in T_y\) if and only if \(y \in \textsf{anc}(z)\); if \(y \in T_x \cap \textsf{anc}(z)\), then \(z \in T_x\). Thus

$$\begin{aligned} \sum \nolimits _{y \in T_x \setminus \{ x \} } \mu (T_y)&= \sum \nolimits _{y \in T_x \setminus \{ x \} } \sum \nolimits _{z \in T_y} \mu (z) \\ {}&= \sum \nolimits _{y \in V} \sum \nolimits _{z \in V} \mu (z) \varvec{1}\{ y \in T_x \setminus \{ x \} \} \varvec{1}\{ z \in T_y \} \\ {}&= \sum \nolimits _{z \in V} \mu (z) \sum \nolimits _{y \in V} \varvec{1}\{ y \in \textsf{anc}(z) \cap T_x \setminus \{ x \} \} \varvec{1}\{ z \in T_x \} \\ {}&\le \sum \nolimits _{z \in T_x} \mu (z) {{\,\textrm{depth}\,}}T_x = \mu (T_x) {{\,\textrm{depth}\,}}T_x \le \mu (T_x) {{\,\textrm{diam}\,}}T. \end{aligned}$$

\(\square \)
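Lemma 3.10 is easy to check numerically. The random recursive tree, the random measure \(\mu \) and the tolerance below are arbitrary choices for illustration only:

```python
# Sanity check of Lemma 3.10 (a toy check, not part of the proof): on a
# random recursive tree with a random measure mu, verify for every x that
#   sum_{y in T_x \ {x}} mu(T_y) <= mu(T_x) * diam(T).
import random

random.seed(0)
n = 30
prt = {0: None}
for v in range(1, n):                 # random recursive tree rooted at 0
    prt[v] = random.randrange(v)
children = {v: [] for v in range(n)}
for v in range(1, n):
    children[prt[v]].append(v)

mu = {v: random.random() for v in range(n)}

def depth(v):                         # distance from v to the root
    return 0 if prt[v] is None else 1 + depth(prt[v])

def anc(z):                           # ancestors of z, including z and the root
    out = [z]
    while prt[z] is not None:
        z = prt[z]
        out.append(z)
    return out

def mu_subtree(x):                    # mu(T_x)
    return mu[x] + sum(mu_subtree(c) for c in children[x])

diam_T = 2 * max(depth(v) for v in range(n))   # upper bound: diam(T) <= 2 * height
for x in range(n):
    lhs = sum(mu_subtree(y) for y in range(n) if y != x and x in anc(y))
    assert lhs <= mu_subtree(x) * diam_T + 1e-9
print("Lemma 3.10 holds on this example; mu(V) =", round(mu_subtree(0), 4))
```

The check uses an upper bound on \({{\,\textrm{diam}\,}}T\), so it is weaker than, and implied by, the lemma itself.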

3.4 Construction via a weighted spanning tree

First, we define a weighted spanning tree \((T, w)\).

Definition 3.11

(Weighted Spanning Tree) Let \(o \in V\) and let \(T = (V, F)\) be a BFS tree rooted at o. Supplement F with a self-loop at each vertex of V. Define the following weightings \(w_{0/1}: F \rightarrow \mathbb {R}_+\):

$$\begin{aligned} w_0 ( \{ x,y \} )&:=\pi (x)\varvec{1}\{ x = y \}{} & {} \quad \text {for}\quad x,y \in V; \\ w_1 ( \{ x, \textsf{prt}(x) \} )&:=\pi (T_x){} & {} \quad \text {for}\quad x \in V \setminus \{ o \} . \end{aligned}$$

Define the weighting \(w: F \rightarrow \mathbb {R}_+\) via a linear combination:

$$\begin{aligned} w:=w_0 + \eta w_1 \quad \text {where}\quad \eta :=\tfrac{1}{2} \varepsilon / {{\,\textrm{diam}\,}}T. \end{aligned}$$

T is a BFS tree, so \(\Delta :={{\,\textrm{diam}\,}}T\) satisfies \(\Delta \le 2 {{\,\textrm{diam}\,}}G\).

The distribution \(\pi _w\) induced by \(w\) is close to \(\pi \) in the following sense.

Lemma 3.12

(\(\pi _w\) Close to \(\pi \)) The weighted tree \((T, w)\) and its induced distribution \(\pi _w\) satisfy

Proof

First, the subtree counting lemma (Lemma 3.10) implies that

$$\begin{aligned} w_1(V) = 2 w_1(F) = 2 \sum \nolimits _{e \in F} w_1(e)&= 2 \sum \nolimits _{x \in V \setminus \{ o \} } w_1 ( \{ x, \textsf{prt}(x) \} ) \\ {}&= 2 \sum \nolimits _{x \in T_o \setminus \{ o \} } \pi (T_x) \le 2 \pi (T_o) \Delta = 2 \Delta . \end{aligned}$$

Trivially, \( w_0(V) = \sum \nolimits _{x \in V} \pi (x) = 1. \) Thus

$$\begin{aligned} w(V) = w_0(V) + \eta w_1(V) \le 1 + (\tfrac{1}{2} \varepsilon / \Delta ) \cdot (2 \Delta ) = 1 + \varepsilon . \end{aligned}$$

Second, \( w(x) \ge w_0(x) = \pi (x) \) and thus \( \pi _w(x) = w(x) / w(V) \ge \pi (x) / (1 + \varepsilon ) \) for all \(x \in V\). \(\square \)
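Lemma 3.12 can also be checked exactly on a toy example. The path, the target \(\pi \) and the value of \(\varepsilon \) below are illustrative assumptions:

```python
# Exact toy check of Lemma 3.12 on the path 0-1-2-3-4 rooted at 0:
# with w = w_0 + eta * w_1 and eta = eps / (2 diam T), verify that
# w(V) <= 1 + eps and pi_w(x) >= pi(x) / (1 + eps) for all x.
from fractions import Fraction

V = [0, 1, 2, 3, 4]
prt = {0: None, 1: 0, 2: 1, 3: 2, 4: 3}
pi = {0: Fraction(1, 5), 1: Fraction(1, 10), 2: Fraction(3, 10),
      3: Fraction(1, 10), 4: Fraction(3, 10)}   # toy target distribution

def pi_subtree(x):                    # pi(T_x); on this path, T_x = {x, ..., 4}
    return sum(pi[y] for y in V if y >= x)

eps = Fraction(1, 10)
diam_T = 4
eta = eps / (2 * diam_T)              # eta = eps / (2 diam T), as in Definition 3.11

w1 = {x: pi_subtree(x) for x in V if x != 0}    # w_1({x, prt(x)}) = pi(T_x)

def w_vertex(x):                      # w(x): self-loop pi(x) plus eta * incident tree edges
    inc = sum(wt for y, wt in w1.items() if y == x or prt[y] == x)
    return pi[x] + eta * inc

wV = sum(w_vertex(x) for x in V)      # w(V) = w_0(V) + 2 * eta * w_1(F)
assert wV <= 1 + eps                  # first claim of Lemma 3.12
for x in V:
    assert w_vertex(x) / wV >= pi[x] / (1 + eps)   # second claim
print("w(V) =", wV, "<= 1 + eps =", 1 + eps)
```

The exact rational arithmetic mirrors the two inequalities in the proof term by term.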

Next, we control the conductance of this weighted spanning tree.

Proposition 3.13

(Conductance) Let \((T, w)\) be the weighted spanning tree from Definition 3.11. The conductance of the RW on \((T, w)\) satisfies

Proof

First, suppose that \(o \notin S\). We may assume that T[S] is connected, by Remark 3.4b. Choose \(x \in S\) with \(\text {dist}(x, o)\) minimal. These together imply that \(S \subseteq V_x\) and

$$\begin{aligned} \bigl \{ x, \textsf{prt}(x) \bigr \} = F(T_x, T_x^c) \subseteq F(S, S^c). \end{aligned}$$

This implies that

The definition of \(w\) gives

The tree counting lemma (Lemma 3.10), then implies that

recalling that \(\eta = \tfrac{1}{2} \varepsilon / \Delta \). We have thus shown that

Importantly, this inequality does not require \(\pi _w(S) \le \tfrac{1}{2}\).

Next, suppose that \(o \in S\) and \(\pi _w(S) \le \tfrac{1}{2}\). The relations of Remark 3.4a imply that

But now \(S' :=S^c\) satisfies \(o \notin S'\) and \(S' \ne \emptyset \). Thus the previous case implies that

Importantly, this case did not require \(\pi _w(S') \le \tfrac{1}{2}\). Combining gives

Finally, we apply the canonical paths method for trees (Corollary 3.7) to deduce a bound on the spectral gap for \((T, w)\).

Corollary 3.14

(Spectral Gap) Let \((T, w)\) be the weighted spanning tree from Definition 3.11. The spectral gap of the RW on \((T, w)\) satisfies

Proof

This is an immediate consequence of the canonical paths method for trees (Corollary 3.7) and Proposition 3.13, along with the relations of Remark 3.4a. Also, \(\Delta \le 2 {{\,\textrm{diam}\,}}G\). \(\square \)

We have now almost proved the main result. We just need to make sure the chain is lazy and convert the spectral gap result into a mixing time result.

Proof of Theorem 3.1 when \(w_0 = \pi \)

The Markov chain constructed is a RW on a weighted BFS tree. It is defined in Definition 3.11. Denote the invariant distribution of the RW on this weighted tree by \(\pi '\). Lemma 3.12 establishes the claim on the invariant distribution.

The spectral gap bound is proved via Proposition 3.13 and Corollary 3.14. Precisely, Corollary 3.14 defines a reversible chain \(Q\) satisfying .

The mixing time bound will follow from the spectral gap bound via a standard mixing time–spectral gap relation. To apply this relation, we first pass from \(Q\) to its lazy version \(Q' :=\tfrac{1}{2}(I + Q)\). This ensures that the spectral gap and absolute spectral gap agree. \(Q\) and \(Q'\) have the same invariant distribution. A simple calculation establishes the mixing time claim using the spectral–mixing relation; see [1, Lemma 4.23] for details of this relation. \(\square \)
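The lazification step can be illustrated on a toy reversible chain; the weighted path below is an arbitrary example, not the chain constructed above:

```python
# Toy sketch of lazification: given a reversible Q, pass to Q' = (I + Q)/2.
# We check that Q' keeps the invariant distribution and holds in place with
# probability at least 1/2 (which is what forces non-negative eigenvalues).
from fractions import Fraction

V = [0, 1, 2]
w = {(0, 1): Fraction(1), (1, 2): Fraction(2)}   # weighted path 0-1-2

def wt(x, y):
    return w.get((x, y), w.get((y, x), Fraction(0)))

deg = {x: sum(wt(x, y) for y in V) for x in V}
pi = {x: deg[x] / sum(deg.values()) for x in V}
Q = {(x, y): wt(x, y) / deg[x] for x in V for y in V}

Qp = {(x, y): (Fraction(x == y) + Q[(x, y)]) / 2 for x in V for y in V}
for y in V:                                       # pi Q' = pi
    assert sum(pi[x] * Qp[(x, y)] for x in V) == pi[y]
for x in V:
    assert Qp[(x, x)] >= Fraction(1, 2)           # laziness
print("lazified chain keeps pi =", dict(pi))
```

Laziness shifts the spectrum of \(Q\) into \([0,1]\), so the spectral gap and absolute spectral gap of \(Q'\) coincide, as used in the proof.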

It remains to handle the case of general \(w_0\), i.e. where \(w_0\) is any unit edge weighting with \(\pi \) as its induced invariant distribution. This is done in the next subsection.

3.5 Perturbation to arbitrary base chain

The analysis up to this point has shown the existence of a fast ‘almost mixing’ chain. Precisely, we defined a weighted graph by constructing an appropriately weighted BFS tree and supplementing it with \(\pi \)-weighted self-loops. We can think of the self-loops as a ‘base’ weighting which is reversible w.r.t. \(\pi \). We denoted the ‘base’ weighting \(w_0\) and the ‘tree’ weighting \(w_1\); recall Definition 3.11.

We now explain how to extend this to an arbitrary ‘base’ weighting \(w_0\). The analysis is extremely similar to that of the self-loops case above: we simply take an arbitrary base weighting \(w_0\) and superimpose on it the same weighted BFS tree. Some small adjustments are needed, but not many.

Let \(w_0: E \rightarrow \mathbb {R}_+\) be an arbitrary unit edge weighting of E. Define \( E' := \{ e \in E \mid w_0(e) \ne 0 \} , \) the edge set of the graph induced by \(w_0\). \(\pi \) is the invariant distribution of the \(w_0\)-weighted RW.

Construction

We define the weighted tree \((T, w_1)\) exactly as before in Definition 3.11: \(T = (V, F)\) is an arbitrary BFS tree and \(w_1 ( \{ x, \textsf{prt}(x) \} ) = \pi (T_x)\) for \(x \in V {\setminus } \{ o \} \); set \( w:=w_0 + \varepsilon w_1. \) The proof of Lemma 3.12 is unchanged, showing that the induced distribution \(\pi _w\) is close to \(\pi \). \(\square \)

Canonical Paths and Adjusted Conductance

We can no longer use the canonical paths method for trees (Corollary 3.7) since the weighting does not necessarily give rise to a tree. The ‘extra edges’—i.e., those corresponding to non-self-loops in \(w_0\)—can only increase the conductance. Intuitively, these cannot harm mixing, but we must establish this carefully.

First, we adjust the proof of the canonical paths method for trees, i.e. the deduction of Corollary 3.7 from Proposition 3.6. We use the same canonical paths \(\Gamma \), defined by paths in the BFS tree T. The bound on the spectral gap does not require a lower bound on the conductance of arbitrary sets S; rather, it only needs a lower bound on the conductance of the subtrees \(T_x\) for all \(x \in V\).

The fact that \(E' = \emptyset \) when \(w_0\) consists only of self-loops meant that the set of edges emanating from the set \(T_x\) was given by \( F(T_x, T_x^c) = \bigl \{ \{ x, \textsf{prt}(x) \} \bigr \} . \) More generally, it is given by \( \bigl \{ \{ x, \textsf{prt}(x) \} \bigr \} \cup E'(T_x, T_x^c). \) But this is always a superset of \( \bigl \{ \{ x, \textsf{prt}(x) \} \bigr \} \), so the edge conductance is at least as large as if the edges from \(E'\) were ignored. Motivated by this, define the following adjustment of edge conductance:

This is the edge conductance where only the boundary edges in the tree \(T = (V, F)\) are considered. The same proof as for canonical paths for trees then implies that

Conductance Analysis

The analysis of the conductance in Proposition 3.13 needs to be adjusted. We need only analyse the conductance of complete subtrees \(T_x\) and must regard the boundary as only \(F(T_x, T_x^c) = \{ x, \textsf{prt}(x) \} \), not the full boundary \((E' \cup F)(T_x, T_x^c)\). The proof of Proposition 3.13 applies almost unchanged to control \(\widehat{\Phi }_w^\star \): we obtain

$$\begin{aligned} \widehat{\Phi }_w^\star \ge \tfrac{1}{6} \varepsilon / \Delta . \end{aligned}$$

The only point to note is establishing the equality \(w_0(T_x) = \pi (T_x)\). Previously, this was obvious from the self-loop weightings. It still holds here, since the invariant distribution induced by \(w_0\) is \(\pi \) and \(w_0\) has unit total weight, by assumption. Thus, in fact, \(w_0(x) = \pi (x)\) for all \(x \in V\). \(\square \)

Conclusion

We combine the two results above, exactly as before, to obtain

The conversion of this into a lazy chain and then into a mixing estimate is unchanged. \(\square \)

4 Continuous-time Markov chains

4.1 Set-up, main result and outline

We have been studying discrete-time Markov chains throughout this paper. It is natural to ask the same question for continuous-time chains. Our attention is devoted to continuous-time Markov chains which are reversible w.r.t. the uniform distribution. Such chains can always be represented as a RW on a weighted graph \((G, w)\), where \(w: E \rightarrow \mathbb {R}_+\) is a weighting on the edges of \(G = (V, E)\).

Our main result for continuous-time chains is simple to state: we impose the normalisation \( | V |^{-1} \sum \nolimits _{x \in V} w(x) \le 1 \), i.e. the average rate at which the RW leaves a vertex is at most 1; we define a weighting \(w\) and show that the spectral gap of the corresponding RW is at least of order \(({{\,\textrm{diam}\,}}G)^{-2}\).

Theorem 4.1

(Fast Mixing Continuous-Time Markov Chain) Let \(G = (V, E)\) be a graph. There exists a weighting \(w: E \rightarrow \mathbb {R}_+\) with average rate \( | V |^{-1} \sum \nolimits _{x \in V} w(x) \le 1 \) and such that the Markov chain induced by this weighting has spectral gap and mixing time satisfying

Proof of Theorem 4.1: Outline

The outline is the same as in discrete-time.

  1. (i)

    We construct a weighted spanning tree; see Definition 4.2.

  2. (ii)

    We control the total weight of the spanning tree; see Lemma 4.3.

  3. (iii)

    We estimate the conductance of this weighted tree; see Proposition 4.5.

  4. (iv)

    We relate its spectral gap and conductance using canonical paths; see Corollary 4.6. \(\square \)

We fix a graph \(G = (V, E)\) and always take \(\pi \) to be the uniform distribution on V. We do not always repeat these in statements below. Also, we write \(n :=| V |\).

4.2 Proof via adjustments to discrete-time case

The proof in continuous-time is surprisingly similar to that used in discrete-time.

  • We construct the same weighted tree \((T, w)\), except that we do not include the self-loops; contrast Definitions 3.11 and 4.2.

  • The invariant distribution of a continuous-time RW on a graph with weights on the edges is always uniform. Thus we do not need an analogue of Lemma 3.12. We require the total weight to be at most n, instead of requiring the invariant distribution to be close to a given measure.

  • We use the same argument to control the conductance; cf Proposition 3.13. The only difference is that we no longer include the self-loop weight in the calculation.

  • The canonical paths argument applies in continuous-time; cf Corollaries 3.7 and 3.14.

First, we define the weighted spanning tree \((T, w)\); cf Definition 3.11.

Definition 4.2

(Weighted Spanning Tree) Let \(o \in V\) and let \(T = (V, F)\) be a BFS tree rooted at o. Define the weighting \(w: F \rightarrow \mathbb {R}_+\) by

$$\begin{aligned} w ( \{ x, \textsf{prt}(x) \} ) :=\tfrac{1}{2} | T_x | / {{\,\textrm{diam}\,}}T \quad \text {for}\quad x \in V \setminus \{ o \} . \end{aligned}$$

T is a BFS tree, so \(\Delta :={{\,\textrm{diam}\,}}T\) satisfies \(\Delta \le 2 {{\,\textrm{diam}\,}}G\).

This is equivalent to the tree-weight in the discrete-time case; see \(w_1\) in Definition 3.11. The particular scaling is chosen so that the total weight of \(w\) is at most n, as the next lemma shows.

Lemma 4.3

(Total Weight of \(w\)) We have \( w(V) \le n. \)

Proof

This is an immediate consequence of the subtree counting lemma (Lemma 3.10), applied with \(\mu \) the counting measure and \(x = o\):

$$\begin{aligned} w(V) = 2 w(F) = \Delta ^{-1} \sum \nolimits _{x \in V \setminus \{ o \} } | T_x | \le \Delta ^{-1} \cdot | T_o | \, \Delta = n. \end{aligned}$$

\(\square \)

Next, we control the conductance of this weighted spanning tree; cf Proposition 3.13. To do this, we must first give the precise definition of conductance in continuous-time.

Definition 4.4

(Conductance) Let \(T = (V, F)\) be a graph and let \(u: F \rightarrow \mathbb {R}_+\) be a weighting. The conductance of a set \(S \subseteq V\) with \(\pi _u(S) > 0\) w.r.t. \(u\) is defined to be

The conductance of \(u\) is defined to be

Proposition 4.5

(Conductance) Let \((T, w)\) be the weighted spanning tree from Definition 4.2. The conductance of the RW on \((T, w)\) satisfies

Proof

The same reductions as used in Proposition 3.13 show that it suffices to show that

$$\begin{aligned} \Phi _w(T_x) \ge \tfrac{1}{2} \Delta ^{-1} \quad \text {for all}\quad x \in V. \end{aligned}$$

But this is immediate from the definition of \(w\):

Finally, we apply the canonical paths method for trees (Corollary 3.7) to deduce a bound on the spectral gap for \((T, w)\); cf Corollary 3.14. We must adjust this to apply in continuous-time; see Proposition 4.7 and Corollary 4.8.

Corollary 4.6

(Spectral Gap) Let \((T, w)\) be the weighted spanning tree from Definition 4.2. The spectral gap of the RW on \((T, w)\) satisfies

Proof

This is an immediate consequence of the canonical paths method for trees in continuous-time (Corollary 4.8) and Proposition 4.5. \(\square \)

It remains to adjust the method of canonical paths to continuous-time.

Proposition 4.7

(Canonical Paths in Continuous-Time: General) Let \(G = (V, E)\) be a graph and \(u: E \rightarrow \mathbb {R}_+\) be a weighting. Let \(\Gamma _{x,y}\) be an E-path from x to y for all \(x,y \in V\). The spectral gap of the RW on \((G, u)\) satisfies

Proof

The discrete-time case is proved in [40, Theorem 5]. It involves the variational characterisation of the spectral gap in terms of the Dirichlet form. This characterisation holds both in discrete- and continuous-time; see [1, §3.6]. The proof in [40] then passes almost unchanged to the continuous-time set-up, recalling that now the invariant distribution is uniform.

Concretely, one can rescale the weights, setting \(\widetilde{w}(\cdot ) :=c w(\cdot )\) for some value c such that \(\max _{x \in V} \widetilde{w}(x) = 1\). This can then be realised by placing mean-1 exponential wait times between jumps of a discrete-time chain P. One then applies the canonical paths method to P. The Dirichlet form is linear in this scaling, meaning that the scaling can be ‘undone’ at the end. \(\square \)
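The rescaling trick can be sketched as follows; the rates are toy values, and the names `c` and `rate` are illustrative, not from the paper:

```python
# Toy sketch of the rescaling/uniformisation trick: scale the rates so that
# max_x w~(x) = 1, then realise the continuous-time walk as a discrete-time
# chain P with mean-1 exponential waits: P(x,y) = w~({x,y}), P(x,x) = 1 - w~(x).
from fractions import Fraction

V = [0, 1, 2, 3]
w = {(0, 1): Fraction(3), (1, 2): Fraction(1), (2, 3): Fraction(2)}

def rate(x):                      # total jump rate w(x) out of x
    return sum(v for e, v in w.items() if x in e)

c = 1 / max(rate(x) for x in V)   # scaling so that max_x c * w(x) = 1
P = {(x, y): c * w.get((x, y), w.get((y, x), Fraction(0)))
     for x in V for y in V if x != y}
for x in V:
    P[(x, x)] = 1 - c * rate(x)   # holding probability of the discrete chain

assert all(sum(P[(x, y)] for y in V) == 1 for x in V)   # stochastic rows
assert all(P[(x, y)] >= 0 for x in V for y in V)
print("uniformised chain, scaling c =", c)
```

Since the Dirichlet form is linear in the rates, results for P transfer back to the original continuous-time chain after undoing the scaling by c.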

Remark

(Cheeger Inequality in Continuous-Time) We remark that, while a scaling argument as used above does apply for the usual discrete-time Cheeger inequality, the bound is quadratic in the scaling. Thus a factor of c is lost. See [1, Theorem 4.40]. In our set-up, \( \max _{x \in V} w(x) \) may be as large as \( (n - 1) / \Delta . \) This would lead to a lower bound of \(1/(n \Delta )\) on the spectral gap, rather than \(1/\Delta ^2\) as we were able to achieve using canonical paths.

Analogous arguments to those used in the special case that G is a tree, ie deducing Corollary 3.7 from Proposition 3.6, apply in the continuous-time set-up too.

Corollary 4.8

(Canonical Paths in Continuous-Time: Trees) Let \(T = (V, F)\) be a tree and \(u: F \rightarrow (0, \infty )\) be a weighting. The spectral gap of the RW on \((T, u)\) satisfies

We now have all the ingredients required to deduce the main result.

Proof of Theorem 4.1

The Markov chain constructed is a RW on a weighted tree. It is defined in Definition 4.2. Lemma 4.3 bounds the total weight of this tree by n, as required.

The spectral gap bound is proved via Proposition 4.5 and Corollary 4.6. The mixing time bound is then deduced from the spectral gap bound via the (continuous-time) spectral gap–mixing time relation; see [1, Lemma 4.23] for details of this relation. \(\square \)

4.3 Hitting time of the root

The following result is not needed for the proof, but it is a nice little result and its proof is extremely simple, given a reference regarding hitting times in trees. Write \(\tau _x\) for the hitting time of vertex x.

Lemma 4.9

(Hit Root in Diameter Squared) Let \(G = (V, E)\) be a graph. For all \(o \in V\), there exists a weighting \(w: E \rightarrow \mathbb {R}_+\) with average rate \( \tfrac{1}{n} \sum \nolimits _{x \in V} w(x) \le 1 \) and such that the Markov chain induced by this weighting has worst-case expected hitting time of o satisfying

We use the weighted spanning tree \((T, w)\) constructed above, i.e. in Definition 4.2. Precisely, the tree \(T = (V, F)\) is rooted at some vertex \(o \in V\) and

$$\begin{aligned} w ( \{ x, \textsf{prt}(x) \} ) :=\tfrac{1}{2} | T_x | / {{\,\textrm{diam}\,}}T \quad \text {for}\quad x \in V \setminus \{ o \} . \end{aligned}$$

Moments of hitting times of the root in reversible Markov chains on trees were investigated by Zhang [45]. The following result is a special case of [45, Theorem 1.1].

Theorem 4.10

(Hitting Times in Trees; cf [45, Theorem 1.1]) Let \(T = (V, F)\) be a finite tree and \(q: F \rightarrow (0, \infty )\) a weighting. Let \(o \in V\) and root the tree at o; use the notation of Definition 3.9. Let \(\tau _o\) denote the hitting time of the root. For all \(x \in V\), we have

$$\begin{aligned} \mathbb {E}_{x}\bigl (\tau _o\bigr ) = \sum \nolimits _{y \in \textsf{anc}(x) \setminus \{ o \} } | T_y | / q ( \{ y, \textsf{prt}(y) \} ) . \end{aligned}$$

The hitting time result of Lemma 4.9 follows easily from this.

Proof of Lemma 4.9

Let \((T, w)\) be the weighted spanning tree from Definition 4.2. We have \( | T_y | / w ( \{ y, \textsf{prt}(y) \} ) = 2 {{\,\textrm{diam}\,}}T \) for all y. Thus applying Theorem 4.10 to this weighted tree gives

$$\begin{aligned} \mathbb {E}_{x}\bigl (\tau _o\bigr ) = | \textsf{anc}(x) \setminus \{ o \} | \cdot 2 {{\,\textrm{diam}\,}}T \le 2 ({{\,\textrm{diam}\,}}T)^2 \le 8 ({{\,\textrm{diam}\,}}G)^2, \end{aligned}$$

since \({{\,\textrm{diam}\,}}T \le 2 {{\,\textrm{diam}\,}}G\). The weighting satisfies \( w(V) \le n \) by Lemma 4.3. Lemma 4.9 follows. \(\square \)
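Theorem 4.10 can be verified exactly on a small example. The path below, rooted at 0 and given the rates of Definition 4.2, is a toy check, not the general proof:

```python
# Exact check of Theorem 4.10 on the path 0-1-2-3 rooted at 0 with the rates
# of Definition 4.2: q({x, prt(x)}) = |T_x| / (2 diam T). Each summand
# |T_y| / q({y, prt(y)}) equals 2 diam T, so E_x(tau_0) = 6 * dist(x, 0) here.
from fractions import Fraction

Delta = 3                              # diam T for the path 0-1-2-3
size = {1: 3, 2: 2, 3: 1}              # |T_x| for x = 1, 2, 3
q = {x: Fraction(size[x], 2 * Delta) for x in [1, 2, 3]}

# Zhang-style formula: E_x(tau_0) = sum over ancestors y != 0 of |T_y| / q({y, prt(y)})
h = {0: Fraction(0)}
for x in [1, 2, 3]:
    h[x] = h[x - 1] + size[x] / q[x]

# Cross-check by first-step analysis: w(x) h(x) = 1 + sum_y q({x,y}) h(y)
w = {1: q[1] + q[2], 2: q[2] + q[3], 3: q[3]}
assert w[1] * h[1] == 1 + q[1] * h[0] + q[2] * h[2]
assert w[2] * h[2] == 1 + q[2] * h[1] + q[3] * h[3]
assert w[3] * h[3] == 1 + q[3] * h[2]
print("E_x(tau_0) for x = 0..3:", [h[x] for x in range(4)])   # 0, 6, 12, 18
```

The first-step equations give an independent derivation of the same hitting times, confirming the formula on this instance.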

Remark 4.11

(Inspiration) This result of Zhang [45] was the inspiration behind our choice of weighted spanning tree in both the discrete- and continuous-time set-ups. Particularly, the simplicity of the formula when one takes \( q ( \{ x, \textsf{prt}(x) \} ) :=| T_x | \) encouraged us to try this weighting.

We had originally tried to make the distance to the root behave roughly like an unbiased RW on the integers—this gives the right “diameter-squared” bound. This means balancing the weights on either ‘side’ of a vertex. This works for some trees, e.g. the path, rooted at one end, and the binary tree. But it does not combine well: attaching a path and binary tree, each of the same depth, at their root gives rise to a “diameter-cubed” hitting time of the root.

Although we have phrased the continuous-time proof as an adjustment to the discrete-time one, we actually developed the continuous-time argument first. Indeed, this is natural because any edge weighting gives rise to the uniform distribution in continuous-time, so there is no need to do any superposition with a ‘base’ weighting, such as self-loops.

5 Time-inhomogeneous Markov chains

The content of this section is somewhat different from the previous ones. Our desire, as always, is to sample from a distribution \(\pi \) on a set V via some Markov process which only uses transitions permitted by the graph G. The difference here is that we use a time-inhomogeneous Markov chain.

Markov chains are typically time-homogeneous. A discrete-time chain is then described by a transition matrix P and an initial law \(\mu _0\). The time-t law \(\mu _t\) is given by applying P t times to \(\mu _0\): \( \mu _t = \mu _0 P^t. \) Continuous-time chains are described in a similar manner. Time-inhomogeneous chains are allowed to use a different transition matrix at each step: \(P_1\) is used for the first step, \(P_2\) for the second and so on. The time-t law \(\mu _t\) is then given by \( \mu _t = \mu _0 P_1 \cdots P_t. \) The special case where \(P_t = P\) for all t, for some P, reduces to the time-homogeneous case.
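As a concrete toy illustration of ours (the two-state space, the matrices and the target law below are hypothetical, not from the text), two different transition matrices can drive a point mass exactly onto a target law in two steps:

```python
from fractions import Fraction as F

def step(mu, P):
    # one step of the chain: (mu P)(y) = sum_x mu(x) P(x, y)
    return [sum(mu[x] * P[x][y] for x in range(len(P))) for y in range(len(P))]

# Start at state 0; target law (3/10, 7/10); a different matrix at each step.
mu = [F(1), F(0)]
P1 = [[F(1, 2), F(1, 2)], [F(1, 2), F(1, 2)]]  # step 1: spread the mass out
P2 = [[F(3, 5), F(2, 5)], [F(0), F(1)]]        # step 2: steer onto the target
for P in (P1, P2):
    mu = step(mu, P)
# mu is now exactly (3/10, 7/10): mu_2 = mu_0 P_1 P_2
```

Exact rational arithmetic makes the equality \(\mu _2 = \pi \) literal rather than approximate, which is the phenomenon Theorem 5.1 establishes in general.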

Our main result for time-inhomogeneous chains is simple to state: given a graph \(G = (V, E)\) and \(\pi \in \mathcal D(V)\), we exhibit a time-inhomogeneous Markov chain which satisfies \( \mu _{2 {{\,\textrm{diam}\,}}G} = \pi . \)

Theorem 5.1

Let \(G = (V, E)\) be a connected graph and let \(\pi \in \mathcal D(V)\). There exists a time-inhomogeneous Markov chain on G obtaining perfect mixing after \(2 {{\,\textrm{diam}\,}}G\) steps: \( \mu _{2 {{\,\textrm{diam}\,}}G} = \pi . \)

We fix a graph \(G = (V, E)\) and a probability measure \(\pi \in \mathcal D(V)\) throughout this section. We do not always repeat these in statements below.

Proof of Theorem 5.1

Our argument proceeds by induction on the depth of the tree.

Choose \(o \in V\) and a breadth-first search (BFS) tree \(T = (V, F)\) rooted at o arbitrarily. Given \(x \in V\), let \(V_x\) denote the set of vertices y for which x lies on the unique path in T from y to the root o; let \(T_x :=T[V_x]\). Write \(\Delta ( x ) :={{\,\textrm{depth}\,}}T_x\) for the depth of \(T_x\), i.e. the maximal distance \(\max _{y \in T_x} \text {dist}(y,x)\) to the root x of \(T_x\). We construct the time-inhomogeneous chain inductively.

Suppose that \(X_0 = o\); we cover the general case later. We claim that there exists a chain \(X^o\) on \(T_o\) such that \(X^o_{\Delta ( o ) } \sim \pi \). This is trivially true if \(T_o\) is a singleton, which has depth 0. Assume now that \(| T_o | \ge 2\). Define the first transition matrix \(P_1\) to keep X at o with probability \(\pi (o)\) and otherwise move to a neighbour x of o in T with probability \(\pi ( T_x ) \):

$$\begin{aligned} P_1(o, o) :=\pi (o) \quad \text {and}\quad P_1(o, x) :=\pi (T_x) \quad \text {for}\quad x \in V \text { with } \{ o, x \} \in F. \end{aligned}$$

Let \(P_k(o,o) = 1\) for all \(k \ge 2\). Thus if \(X_1 = o\) then \(X_k = o\) for all \(k \ge 1\).

By the inductive hypothesis, for each \(x \in V\) with \( \{ o, x \} \in F\), there exists a chain \(X^x\) on \(T_x\) such that \(X^x_{\Delta ( x ) } \sim \pi ( \cdot \mid V_x )\). Importantly, \(T_x \cap T_y = \emptyset \) if \( \{ o,x \} , \{ o,y \} \in F\) and \(x \ne y\). We can thus define, for each \(x \in V\) with \( \{ o, x \} \in F\), a sequence \(P^x :=(P^x_k)_{k=1}^{\Delta ( x ) }\) of transition matrices, each defined on a disjoint set, by induction. Define \(P^x_k :=I\) for \(k > \Delta ( x ) \). These can then be combined into a single sequence \((P_k)_{k=2}^{\Delta ( o ) }\): if \(X_1 = x\), then we use sequence \(P^x\). Then

$$\begin{aligned} \mathbb {P}\bigl ( X_{\Delta ( o ) } = y \bigm | X_1 = x \bigr ) = \pi ( y \mid V_x ) \quad \text {for}\quad y \in V_x, \end{aligned}$$

noting that \(\Delta ( x ) \le \Delta ( o ) - 1\), so the padding by I aligns the time horizons. Thus, for \(y \ne o\),

$$\begin{aligned} \mathbb {P}\bigl ( X_{\Delta ( o ) } = y \bigr ) = \pi ( T_x ) \cdot \pi ( y \mid V_x ) = \pi (y), \end{aligned}$$

where x is the unique neighbour of o in T with \(y \in V_x\); also \(\mathbb {P}( X_{\Delta ( o ) } = o ) = \pi (o)\), by the choice of \(P_1\). Hence \(X_{\Delta ( o ) } \sim \pi \).

This argument is no more than a formalisation of the following informal description.

  • If the walk is at x, then stay at x with probability \(\pi (x) / \pi (V_x)\).

  • Otherwise, move to the children of x with probability proportional to their \(\pi \)-measure.

  • If at some point the walk stays put, then keep it at that state indefinitely.

This completes the argument when \(X_0 = o\). It remains to consider the case that \(X_0 \ne o\). Direct all edges towards o and run for \(\Delta ( o ) \) steps. Precisely, set

$$\begin{aligned} P_0(x,y) :=\varvec{1}\{ x \ne o, \, y = \textsf{prt}(x) \} + \varvec{1}\{ x = y = o \}, \end{aligned}$$

where \(\textsf{prt}(x)\) is the unique neighbour of \(x \ne o\) on the unique shortest path from x to o. If \(\Delta ( o ) \) steps are made according to this matrix, then \(X_{\Delta ( o ) } = o\), regardless of \(X_0\). We then apply the construction from the case \(X_0 = o\), all shifted by \(\Delta ( o ) \).

In summary, we obtain a time-inhomogeneous Markov chain X with \(X_{2 \Delta ( o ) } \sim \pi \). Finally, \(\Delta ( o ) = {{\,\textrm{depth}\,}}T \le {{\,\textrm{diam}\,}}G\). Keeping X fixed during \(( 2 \Delta ( o ) , \, 2 {{\,\textrm{diam}\,}}G ]\) gives \(X_{2 {{\,\textrm{diam}\,}}G} \sim \pi \). \(\square \)
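The whole construction can be condensed into a short computation on laws. The sketch below is our own illustration (the function name and the choice of starting law are ours): it evolves the law \(\mu \) through the two phases of the proof on an arbitrary connected graph, using exact rational arithmetic, and recovers \(\pi \) exactly after \(2 \, {{\,\textrm{depth}\,}}T \le 2 {{\,\textrm{diam}\,}}G\) steps.

```python
from collections import deque
from fractions import Fraction as F

def perfectly_mix(adj, pi, root):
    """Evolve the law mu of the time-inhomogeneous chain from the proof of
    Theorem 5.1: first direct every edge towards the root for depth(T) steps,
    then redistribute mass down a BFS tree, settling pi(x)/pi(V_x) at x."""
    # BFS tree rooted at `root`
    parent, order, dq = {root: None}, [root], deque([root])
    while dq:
        u = dq.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                order.append(v)
                dq.append(v)
    depth, children = {root: 0}, {u: [] for u in adj}
    for v in order[1:]:
        depth[v] = depth[parent[v]] + 1
        children[parent[v]].append(v)
    D = max(depth.values())  # depth(T) <= diam(G)
    # pi(V_x): pi-mass of the subtree at x, accumulated leaves-first
    mass = dict(pi)
    for v in reversed(order[1:]):
        mass[parent[v]] += mass[v]
    # Phase 1 (the matrix P_0): D steps with every edge directed to the root
    mu = {u: F(1, len(adj)) for u in adj}  # any starting law works
    for _ in range(D):
        nxt = {u: F(0) for u in adj}
        for u, m in mu.items():
            nxt[u if parent[u] is None else parent[u]] += m
        mu = nxt
    # Phase 2: at step j the descending front sits at depth j; a front vertex
    # x keeps pi(x)/pi(V_x) of its mass and passes pi(V_y)/pi(V_x) to child y
    for j in range(D):
        nxt = dict(mu)
        for u in adj:
            if depth[u] == j and mu[u] > 0:
                m = mu[u]
                nxt[u] += m * pi[u] / mass[u] - m
                for c in children[u]:
                    nxt[c] += m * mass[c] / mass[u]
        mu = nxt
    return mu  # equals pi exactly after 2 * D <= 2 * diam(G) steps
```

Running this on, say, a 5-cycle with an arbitrary strictly positive rational \(\pi \) returns \(\pi \) exactly, with no total-variation error.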

6 Open problems and concluding remarks

We have studied fundamental barriers to fast mixing on graphs. We have essentially shown that three geometric quantities characterise the mixing properties of a graph: edge conductance characterises fast mixing for the simple random walk and vertex conductance characterises the fastest mixing time, whilst the diameter characterises the fastest almost-mixing time.

There are a few questions left open by our work. First, Theorem A implies that the fastest mixing time on a graph is at most order . It is natural to ask if this \(\log ^2 | V |\) factor can be improved to \(\log | V |\). This improvement would require departing from the framework of our proof, at least if \(d_\textsf{max}\gg 1\) is permitted, since we believe the optimal relaxation time can be of order ; see the discussion after Theorem A.

Another open problem prompted by our work is to construct a graph sparsifier, i.e., an edge-induced sparse subgraph, that approximately preserves the vertex conductance of the original graph. Our work together with a result by Batson, Spielman and Srivastava [5] implies that it is possible to construct an order n-size sparsifier of G with vertex conductance at least order . It is then natural to ask if we can obtain a better approximation.

Can our results spur new algorithmic applications? We mention two. First, we would like to design a distributed algorithm to compute a fast mixing Markov chain on G, where G also represents the topology of the distributed network. Second, is it possible to design a local algorithm, in the spirit of Spielman and Teng [42], that outputs a subset of nodes with small vertex conductance?

We discussed the application of our almost-mixing result—or, more precisely, the weighted-tree construction—to the objective of turning a non-expander into an expander; see the discussion after Theorem B. We would also like to find a more sampling-based application, but have not found a particularly satisfying one yet. In many scenarios, one has a target distribution \(\pi \) to sample from—e.g., the uniform distribution. One approximately samples from \(\pi \) by running a Markov chain X with equilibrium distribution \(\pi \) for its \(\varepsilon \)-mixing time t. The output \(X_t\) satisfies \( \Vert \mathbb {P}( X_t \in \cdot ) - \pi \Vert _{\textrm{TV}} \le \varepsilon \). If one instead runs a Markov chain \(X'\) with equilibrium distribution \(\pi '\) for its \(\varepsilon \)-mixing time \(t'\), then the output \(X'_{t'}\) satisfies \( \Vert \mathbb {P}( X'_{t'} \in \cdot ) - \pi \Vert _{\textrm{TV}} \le 2 \varepsilon \) if \( \Vert \pi ' - \pi \Vert _{\textrm{TV}} \le \varepsilon \). Thus, there is no real disadvantage in adjusting the equilibrium distribution slightly.
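To spell out the comparison between the two outputs (a one-line check): if \( \Vert \pi ' - \pi \Vert _{\textrm{TV}} \le \varepsilon \), then by the triangle inequality for total variation,

$$\begin{aligned} \bigl \Vert \mathbb {P}( X'_{t'} \in \cdot ) - \pi \bigr \Vert _{\textrm{TV}} \le \bigl \Vert \mathbb {P}( X'_{t'} \in \cdot ) - \pi ' \bigr \Vert _{\textrm{TV}} + \Vert \pi ' - \pi \Vert _{\textrm{TV}} \le \varepsilon + \varepsilon = 2 \varepsilon . \end{aligned}$$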

Many natural graph structures have small diameter and thus give rise to a fast almost-mixing Markov chain. For example, the graph structure of the Ising model on an n-vertex graph has diameter at most 2n. Thus, one gets fast—that is, polynomial-in-n—almost mixing for any temperature; this can be used to approximate the partition function. Similar statements hold for sampling independent sets via the hardcore model or for counting the number of proper colourings of a graph.

Unfortunately, in all these scenarios, actually calculating the weights on the superimposed tree—which need not be a BFS tree; any tree T can be used, giving a bound of \(({{\,\textrm{diam}\,}}T)^2\)—is computationally infeasible since it often entails knowing the partition function already. We hope that a useful application of this result can be found, but we are yet to find one ourselves.