1 Introduction

Let G be a finite group, and let \([n] := \{1, 2, \ldots , n\}\). We consider the set \(G^n\) of all functions \(\sigma : [n] \rightarrow G\) (or “configurations”). We define a Markov chain \((\sigma _t)_{t\ge 0}\) on \(G^{n}\) as follows: given the current state \(\sigma \), choose uniformly at random an ordered pair \((i, j)\) of distinct integers in [n], and change the value of \(\sigma (i)\) to \(\sigma (i) \sigma (j)^{\pm 1}\), where the sign is chosen with equal probability.
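As an illustration of the dynamics just described, here is a minimal sketch of a single step of the chain. The function name `product_replacement_step`, the callables `mul` and `inv`, and the choice of \({{\mathbb {Z}}}/6\) (written additively) as the example group are assumptions made only for this sketch.

```python
import random

def product_replacement_step(sigma, mul, inv, rng):
    """Apply one step of the chain in place and return the pair (i, j) used."""
    n = len(sigma)
    # rng.sample returns two distinct indices in selection order,
    # i.e. a uniformly random ordered pair (i, j) with i != j.
    i, j = rng.sample(range(n), 2)
    # Use sigma(j) or its inverse, each with probability 1/2.
    g = sigma[j] if rng.random() < 0.5 else inv(sigma[j])
    sigma[i] = mul(sigma[i], g)
    return i, j

# Hypothetical example group: Z/6 with addition mod 6.
Q = 6
mul = lambda a, b: (a + b) % Q
inv = lambda a: (-a) % Q

rng = random.Random(0)
sigma = [1, 0, 0, 0, 0, 0, 0, 0]   # generates Z/6 since it contains 1
for _ in range(200):
    product_replacement_step(sigma, mul, inv, rng)
```

Each step is invertible and replaces one entry by a product of entries, so the subgroup generated by the entries of `sigma` never changes; in particular a generating tuple stays generating.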

We will restrict the chain \((\sigma _t)_{t\ge 0}\) to the space of generating n-tuples, i.e. the set of \(\sigma \) whose values generate G as a group:

$$\begin{aligned} {{\mathcal {S}}}:= \left\{ \sigma \in G^n \ : \ \langle \sigma (1), \ldots , \sigma (n) \rangle =G \right\} . \end{aligned}$$

It is not hard to see that for fixed G and large enough n, the chain on \({{\mathcal {S}}}\) is irreducible (see [8, Lemma 3.2]). We will always assume n is large enough so that this irreducibility holds. Note that the chain is also symmetric, and it is aperiodic because some states have positive holding probability. Thus, the chain has a uniform stationary distribution \(\pi \) with \(\pi (\sigma )=1/|{{\mathcal {S}}}|\).

This Markov chain was first considered in the context of computational group theory—it models the product replacement algorithm for generating random elements of a finite group introduced in [6]. By running the chain for a long enough time t and choosing a uniformly random index \(k \in [n]\), the element \(\sigma _t(k)\) is a (nearly) uniformly random element of G. The product replacement algorithm has been found to perform well in practice [6, 10], but the question arises: how large does t need to be in order to ensure near uniformity?

One way of answering the question is to estimate the mixing time of the Markov chain. It was shown by Diaconis and Saloff-Coste that for any fixed finite group G, there exists a constant \(C_G\) such that the \(\ell ^2\)-mixing time is at most \(C_G n^2 \log n\) [8, 9] (see also Chung and Graham [3] for a simpler proof of this fact with a different value for \(C_G\)).

In another line of work, Lubotzky and Pak [12] analyzed the mixing of the product replacement chain in terms of Kazhdan constants (see also subsequent quantitative estimates for Kazhdan constants by Kassabov [11]). We also mention a result of Pak [14] which shows mixing in \(\text {polylog}(|G|)\) steps when \(n = \Theta (\log |G| \log \log |G|)\). The reader may consult the survey [15] for further background on the product replacement algorithm.

Diaconis and Saloff-Coste conjectured that the mixing time bound can be improved to \(C_G n \log n\) [9, Remark 2, Section 7, p. 290], based on the observation that at least \(n \log n\) steps are needed by the classical coupon-collector’s problem. This was confirmed in the case \(G = {{\mathbb {Z}}}/2\) by Chung and Graham [4] and recently refined by Ben-Hamou and Peres, who show that when \(G={{\mathbb {Z}}}/2\), the chain in fact exhibits a cutoff at time \(\frac{3}{2}n \log n\) in total-variation with window of order n [2].

In this paper, we extend the result of Ben-Hamou and Peres to all finite groups. Note that this also verifies the conjecture of Diaconis and Saloff-Coste for a fixed finite group. To state the result, let us denote the total variation distance between \({{\mathbb {P}}}_\sigma (\sigma _t \in \cdot \ )\) and \(\pi \) by

$$\begin{aligned} d_\sigma (t) := \max _{A \subseteq {{\mathcal {S}}}}|{{\mathbb {P}}}_\sigma (\sigma _t \in A)- \pi (A)|. \end{aligned}$$
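For finite state spaces, the maximum over events in this definition equals half the \(\ell ^1\) distance between the two distributions, and it is attained at the event where the first distribution exceeds the second. The following sketch computes both expressions on a toy example; the function names and the two distributions `p` and `q` are hypothetical.

```python
def total_variation(p, q):
    """TV distance between distributions given as dicts state -> probability."""
    states = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in states)

def total_variation_by_max(p, q):
    """Same quantity, evaluated at the maximising event A = {s : p(s) > q(s)}."""
    states = set(p) | set(q)
    return sum(p.get(s, 0.0) - q.get(s, 0.0)
               for s in states if p.get(s, 0.0) > q.get(s, 0.0))

# Toy distributions on three states (illustrative only).
p = {"x": 0.5, "y": 0.5}
q = {"x": 0.25, "y": 0.25, "z": 0.5}
tv_half_l1 = total_variation(p, q)
tv_max_event = total_variation_by_max(p, q)
```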

Theorem 1.1

Let G be a finite group. Then, the Markov chain \((\sigma _t)_{t \ge 0}\) on the set of generating n-tuples of G has a total-variation cutoff at time \(\frac{3}{2}n\log n\) with window of order n. More precisely, we have

$$\begin{aligned} \lim _{\beta \rightarrow \infty } \limsup _{n\rightarrow \infty } \max _{\sigma \in {{\mathcal {S}}}} d_\sigma \left( \frac{3}{2}n\log n + \beta n\right) = 0 \end{aligned}$$
(1)

and

$$\begin{aligned} \lim _{\beta \rightarrow \infty } \liminf _{n\rightarrow \infty } \max _{\sigma \in {{\mathcal {S}}}} d_\sigma \left( \frac{3}{2}n\log n - \beta n\right) = 1. \end{aligned}$$
(2)

1.1 A connection to cryptography

We mention another motivation for studying the product replacement chain in the case \(G=({{\mathbb {Z}}}/q)^m\) for a prime q and an integer \(m \ge 1\). It comes from a public-key authentication protocol proposed by Sotiraki [16], which we now briefly describe. In the protocol, a verifier wants to check the identity of a prover based on the time needed to answer a challenge.

First, the prover runs the Markov chain with \(G = ({{\mathbb {Z}}}/q)^m\) and \(n = m\), which can be interpreted as performing a random walk on \(SL_n({{\mathbb {Z}}}/q)\), where \(\sigma (k)\) is viewed as the k-th row of an \(n \times n\) matrix. (In each step, a random row is either added to or subtracted from another random row.)

After t steps, the prover records the resulting matrix \(A \in SL_n({{\mathbb {Z}}}/q)\) and makes it public. To authenticate, the verifier gives the prover a vector \(x \in ({{\mathbb {Z}}}/q)^n\) and challenges her to compute \(y := Ax\). The prover can perform this calculation in O(t) operations by retracing the trajectory of the random walk.
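The O(t) computation can be made concrete as follows. This is only a sketch under the assumption that the walk starts from the identity matrix; the helper names `run_walk`, `multiply_via_trajectory` and `multiply_directly`, and the small parameters, are hypothetical and not part of the protocol in [16]. Each row operation corresponds to left multiplication by an elementary matrix, so applying the recorded steps in chronological order to x yields Ax.

```python
import random

def run_walk(n, q, t, rng):
    """Start from the identity and apply t random row operations mod q."""
    A = [[int(r == c) for c in range(n)] for r in range(n)]
    steps = []
    for _ in range(t):
        i, j = rng.sample(range(n), 2)      # distinct rows i != j
        s = rng.choice((1, -1))             # add or subtract row j
        A[i] = [(x + s * y) % q for x, y in zip(A[i], A[j])]
        steps.append((i, j, s))
    return A, steps

def multiply_via_trajectory(steps, x, q):
    """Compute A x with one scalar update per recorded step, i.e. O(t) work."""
    y = list(x)
    for i, j, s in steps:
        y[i] = (y[i] + s * y[j]) % q
    return y

def multiply_directly(A, x, q):
    """Usual matrix-vector product, using n^2 multiplications."""
    return [sum(a * b for a, b in zip(row, x)) % q for row in A]

rng = random.Random(1)
n, q, t = 6, 5, 40
A, steps = run_walk(n, q, t, rng)
x = [rng.randrange(q) for _ in range(n)]
y_fast = multiply_via_trajectory(steps, x, q)
y_slow = multiply_directly(A, x, q)
```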

Without knowing the trajectory, if t is large enough, an adversary will not be able to distinguish A from a random matrix and will be forced to perform the usual matrix-vector multiplication (using \(n^2\) operations) to complete the challenge. Thus, the question is whether \(t \ll n^2\) is large enough for the matrix A to become sufficiently random, so that the prover can answer the challenge much faster than an adversary.

Note that when \(n > m\), the product replacement chain on \(G = ({{\mathbb {Z}}}/q)^m\) amounts to the projection of the random walk on \(SL_n({{\mathbb {Z}}}/q)\) onto the first m columns. Thus, Theorem 1.1 shows that when m is fixed and \(n \rightarrow \infty \), the mixing time for the first m columns is around \(\frac{3}{2} n \log n\). One then hopes that the mixing of several columns is enough to make it computationally intractable to distinguish A from a random matrix; this would justify the authentication protocol, as \(n \log n \ll n^2\).

We remark that when t is much larger than the mixing time of the random walk on \(SL_n({{\mathbb {Z}}}/q)\) generated by row additions and subtractions, it is information-theoretically impossible for an adversary to distinguish A from a random matrix. However, the diameter of the corresponding Cayley graph on \(SL_n({{\mathbb {Z}}}/q)\) is known to be of order \(\Theta \left( \frac{n^2}{\log _q n} \right) \) [1, 5], so a lower bound of the same order necessarily holds for the mixing time. Diaconis and Saloff-Coste [8, Section 4, p. 420] give an upper bound of \(O(n^4)\), which was subsequently improved to \(O(n^3)\) by Kassabov [11]. Closing the gap between \(n^3\) and \(\frac{n^2}{\log n}\) remains an open problem.

1.2 Outline of proof

The proof of Theorem 1.1 analyzes the mixing behavior in several stages:

  • an initial “burn-in” period lasting around \(n \log n\) steps, after which the group elements appearing in the configuration are not mostly confined to any proper subgroup of G;

  • an averaging period lasting around \(\frac{1}{2} n \log n\) steps, after which the counts of group elements become close to their average value under the stationary distribution; and

  • a coupling period lasting O(n) steps, after which our chain becomes exactly coupled to the stationary distribution with high probability.

The argument is in the spirit of [2], but a more elaborate analysis is required in the second and third stages. To analyze the first stage, for a fixed proper subgroup H, the number of group elements in H appearing in the configuration is a birth-and-death process whose transition probabilities are easy to estimate. The analysis of the resulting chain is the same as in [2], and we can then union bound over all proper subgroups H.

In the second stage, for a given starting configuration \(\sigma _0 \in {{\mathcal {S}}}\), we consider quantities \(n_{a,b}(\sigma )\) counting the number of sites k where \(\sigma _0(k) = a\) and \(\sigma (k) = b\). A key observation (which also appears in [2]) is that by symmetry, projecting the Markov chain onto the values \((n_{a,b}(\sigma _t))_{a, b \in G}\) does not affect the mixing behavior. Thus, it is enough to understand the mixing behavior of the counts \(n_{a,b}\).

One expects these counts to evolve towards their expected value \({{\mathbb {E}}}_{\sigma \sim \pi } n_{a,b}(\sigma )\) as the chain mixes. To carry out the analysis rigorously, we write down a stochastic difference equation for the \(n_{a,b}\) and analyze it via the Fourier transform. Intuitively, as \(n \rightarrow \infty \), the process approaches a “hydrodynamic limit” so that it becomes approximately deterministic. It turns out that after about \(\frac{1}{2} n \log n\) steps, the \(n_{a,b}\) are likely to be within \(O(\sqrt{n})\) of their expected value. Our analysis requires a sufficiently “generic” initial configuration, which is why the first stage is necessary.

Finally, in the last stage, we show that if the \((n_{a,b}(\sigma ))_{a,b\in G}\) and \((n_{a,b}(\sigma '))_{a,b\in G}\) for two configurations are within \(O(\sqrt{n})\) in \(\ell ^1\) distance, they can be coupled to be exactly the same with high probability after O(n) steps of the Markov chain. A standard argument involving coupling to the stationary distribution then implies a bound on the mixing time.

The main idea to prove the coupling bound is that even if the \(\ell ^1\) distance evolves like an unbiased random walk, there is a good chance that it will hit 0 due to random fluctuations. A similar argument is used to prove cutoff for lazy random walk on the hypercube [13, Chapter 18]. However, some careful accounting is necessary in our setting to ensure that in fact the \(\ell ^1\) distance does not increase in expectation and to ensure sufficient fluctuations.

1.3 Organization of the paper

The rest of the paper is organized as follows. In Sect. 2, we state (without proof) the key lemmas describing the behavior in each of the three stages and use these to prove the upper bound (1) in Theorem 1.1. Sections 3 and 4 contain the proofs of these lemmas. Finally, in Sect. 5, we prove the lower bound (2) in Theorem 1.1; this is mostly a matter of verifying that the estimates used in the upper bound were tight.

1.4 Notation

Throughout this paper, we use \(c, C, C', \ldots \) to denote absolute constants whose exact values may change from line to line. We also use them with subscripts, for instance \(C_G\), to indicate that the constant depends only on G. Similarly, we use subscripts with big-O notation, e.g. we write \(O_G(\,\cdot \,)\) when the implied constant depends only on G.

2 Proof of Theorem 1.1 (1)

Let us fix a finite group G and denote its cardinality by \({{\mathcal {Q}}}:= |G|\). For a configuration \(\sigma \in {{\mathcal {S}}}\), let \(n_a(\sigma )\) denote the number of sites having group element a, i.e.,

$$\begin{aligned} n_a(\sigma ) := |\{i \in [n] \ : \ \sigma (i)=a\}|. \end{aligned}$$

2.1 The burn-in period

For a proper subgroup \(H \subseteq G\), let

$$\begin{aligned} n_{non}^H(\sigma ) := \sum _{a \in G {\setminus } H} n_a(\sigma ) \end{aligned}$$

denote the number of sites whose group element lies outside H, and define for \(c \in (0, 1)\) the set

$$\begin{aligned} {{\mathcal {S}}}_{non}\left( c \right) := \{\sigma \in {{\mathcal {S}}}\ : \ n_{non}^H(\sigma ) \ge cn \text { for all proper subgroups} \, H \subseteq G\}. \end{aligned}$$

Thus, \({{\mathcal {S}}}_{non}\left( c \right) \) is the set of states \(\sigma \) where the group elements appearing in \(\sigma \) are not mostly confined to any particular proper subgroup of G. The next lemma shows that we reach \({{\mathcal {S}}}_{non}\left( 1/3 \right) \) in about \(n \log n\) steps, and once we reach \({{\mathcal {S}}}_{non}\left( 1/3 \right) \), we remain in \({{\mathcal {S}}}_{non}\left( 1/6 \right) \) for \(n^2\) steps with high probability. Note that \(n^2\) is much larger than the overall mixing time, so we may essentially assume that we are in \({{\mathcal {S}}}_{non}\left( 1/6 \right) \) for all of the later stages.
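To make the definition concrete, the following sketch evaluates \(n_{non}^H\) and the defining condition of \({{\mathcal {S}}}_{non}(c)\) for the hypothetical example \(G = {{\mathbb {Z}}}/6\), whose proper subgroups are listed by hand. The function names are assumptions for illustration, and the check does not verify that the configuration generates G.

```python
def n_non(sigma, H):
    """Number of sites whose element lies outside the subgroup H."""
    return sum(1 for a in sigma if a not in H)

def in_S_non(sigma, proper_subgroups, c):
    """Check n_non^H(sigma) >= c * n for every listed proper subgroup H."""
    n = len(sigma)
    return all(n_non(sigma, H) >= c * n for H in proper_subgroups)

# Proper subgroups of Z/6 (additive): trivial, order 2, order 3.
proper_subgroups = [{0}, {0, 3}, {0, 2, 4}]

sigma_a = [1, 0, 0, 0, 0, 0]   # generating, but almost confined to {0}
sigma_b = [1, 2, 3, 4, 5, 0]   # spread over all of Z/6

a_is_spread = in_S_non(sigma_a, proper_subgroups, 1 / 3)
b_is_spread = in_S_non(sigma_b, proper_subgroups, 1 / 3)
```

Here `sigma_a` is a generating tuple that fails the condition for the trivial subgroup, which is the kind of starting state the burn-in period has to handle.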

Lemma 2.1

Let \(\tau _{1/3} := \min \{t \ge 0 : \sigma _t \in {{\mathcal {S}}}_{non}\left( 1/3 \right) \}\) be the first time to hit \({{\mathcal {S}}}_{non}\left( 1/3 \right) \). Then for all large enough n and for all large enough \(\beta > 0\),

$$\begin{aligned} \max _{\sigma \in {{\mathcal {S}}}} {{\mathbb {P}}}_\sigma (\tau _{1/3} > n \log n + \beta n) \le \frac{120 {{\mathcal {Q}}}}{\beta ^2}. \end{aligned}$$

Moreover, there exists a constant \(C_G\) depending only on G such that

$$\begin{aligned} \max _{\sigma \in {{\mathcal {S}}}_{non}\left( 1/3 \right) }{{\mathbb {P}}}_\sigma \left( \sigma _t \notin {{\mathcal {S}}}_{non}\left( 1/6 \right) \ \text {for some} \, t \le n^2 \right) \le C_G n^2e^{-n/10}. \end{aligned}$$

Proof

Fix a proper subgroup \(H \subset G\), and consider what happens to \(n_{non}^H(\sigma _t)\) at time t. Suppose our next step is to replace \(\sigma (i)\) with \(\sigma (i)\sigma (j)\).

If \(\sigma (j) \in H\), then \(n_{non}^H(\sigma _{t+1}) = n_{non}^H(\sigma _t)\). If \(\sigma (j) \not \in H\) and \(\sigma (i) \in H\), then \(n_{non}^H(\sigma _{t+1}) = n_{non}^H(\sigma _t) +1\). Finally, if \(\sigma (j), \sigma (i) \not \in H\), then \(\sigma (i)\sigma (j)\) may or may not be in H, so \(n_{non}^H(\sigma _{t+1}) \ge n_{non}^H(\sigma _t) - 1\).

Let \((N_t)_{t \ge 0}\) be the birth-and-death chain with the following transition probabilities for \(1 \le k \le n\):

$$\begin{aligned} {{\mathbb {P}}}(N_{t+1} = k+1 \mid N_t = k)&= \frac{k(n-k)}{n(n-1)} \\ {{\mathbb {P}}}(N_{t+1} = k-1 \mid N_t = k)&= \frac{k(k-1)}{n(n-1)} \\ {{\mathbb {P}}}(N_{t+1} = k \mid N_t = k)&= \frac{n-k}{n}. \end{aligned}$$

We start this chain at \(N_0 = n^H_{non}(\sigma _0)\); note that because the elements appearing in \(\sigma _0\) generate G, we are guaranteed to have \(n^H_{non}(\sigma _0) > 0\).

The above birth-and-death chain corresponds to the behavior of \((n^H_{non}(\sigma _t))\) if whenever \(\sigma (j), \sigma (i) \not \in H\), it always happened that \(\sigma (i)\sigma (j) \in H\). Thus, \((n^H_{non}(\sigma _t))\) stochastically dominates \((N_t)\).
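The transition probabilities of \((N_t)\) can be read off as follows: the site j lies outside H with probability \(k/n\), and then the site i lies in H with probability \((n-k)/(n-1)\) or outside H with probability \((k-1)/(n-1)\). The sketch below, with the hypothetical helper `bd_transition`, checks in exact arithmetic that the three probabilities sum to 1 and that the chain cannot leave state 1 downwards.

```python
from fractions import Fraction

def bd_transition(k, n):
    """Return (up, down, hold) probabilities of the chain N_t at state k."""
    up = Fraction(k * (n - k), n * (n - 1))      # j outside H, i inside H
    down = Fraction(k * (k - 1), n * (n - 1))    # both outside H (worst case)
    hold = Fraction(n - k, n)                    # j inside H
    return up, down, hold

n = 12
rows = {k: bd_transition(k, n) for k in range(1, n + 1)}
```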

The chain \((N_t)\) is precisely what is analyzed in [2] for the case \(G = {{\mathbb {Z}}}/2\). Let

$$\begin{aligned} T_k := \min \{t \ge 0 : N_t=k\}. \end{aligned}$$

Then, we have \({{\mathbb {E}}}_{k-1}T_k \le \frac{n^2}{k(n-2k)}\) [2, (2) in the proof of Lemma 1] and thus \({{\mathbb {E}}}_1 (T_{n/3}) =\sum _{k=2}^{n/3}{{\mathbb {E}}}_{k-1}T_k \le n \log n + n\). On the other hand, setting \(v_k=\mathrm{Var}_{k-1}(T_k)\), we have \(v_2 \le n^2\),

$$\begin{aligned} v_{k+1}\le \frac{k}{n-k}v_k + \frac{54 n^2}{k^2}, \end{aligned}$$

and \(\mathrm{Var}_1 (T_{n/3}) = \sum _{k=2}^{n/3}v_k \le 110 n^2\) [2, proof of Lemma 1]. Hence, by Chebyshev’s inequality, for all large enough \(\beta > 0\),

$$\begin{aligned} {{\mathbb {P}}}_1(T_{n/3} > n \log n + \beta n) \le \frac{120}{\beta ^2}. \end{aligned}$$
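To spell out the Chebyshev step (the explicit threshold \(\beta \ge 24\) below is our own illustrative arithmetic and is not taken from [2]): since \({{\mathbb {E}}}_1 T_{n/3} \le n \log n + n\), for \(\beta > 1\) we have

$$\begin{aligned} {{\mathbb {P}}}_1(T_{n/3} > n \log n + \beta n) \le {{\mathbb {P}}}_1\left( T_{n/3} - {{\mathbb {E}}}_1 T_{n/3} > (\beta - 1) n \right) \le \frac{\mathrm{Var}_1(T_{n/3})}{(\beta -1)^2 n^2} \le \frac{110}{(\beta -1)^2}, \end{aligned}$$

and \(110/(\beta -1)^2 \le 120/\beta ^2\) as soon as \(\beta /(\beta -1) \le \sqrt{12/11}\), which holds for instance for all \(\beta \ge 24\).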

Moreover, we have \({{\mathbb {P}}}_{n/3} \left( T_{n/6} \le n^2 \right) \le n^2e^{-n/10}\). Indeed, this follows from the fact that for \(m<k\), we have

$$\begin{aligned} {{\mathbb {P}}}_k(T_m \le n^2) \le n^2\frac{\pi _{\mathrm{BD}}(m)}{\pi _{\mathrm{BD}}(k)}, \end{aligned}$$

where \(\pi _{\mathrm{BD}}(k)={n \atopwithdelims ()k}/(2^n-1)\) [2, (5) and the following in the proof of Proposition 2].
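As a numerical sanity check of the first claim (not a proof), one can evaluate the ratio of binomial coefficients directly; the normalisation \(2^n-1\) cancels. The sketch below assumes n is a multiple of 6 so that \(n/6\) and \(n/3\) are integers, and the helper name `escape_bound` is hypothetical.

```python
from math import comb, exp

def escape_bound(n):
    """n^2 * pi_BD(n/6) / pi_BD(n/3), with the factor 2^n - 1 cancelled."""
    return n * n * comb(n, n // 6) / comb(n, n // 3)

# Compare against n^2 * exp(-n/10) for n = 6, 12, ..., 300.
checks = {n: (escape_bound(n), n * n * exp(-n / 10)) for n in range(6, 301, 6)}
all_hold = all(lhs <= rhs for lhs, rhs in checks.values())
```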

We now take a union bound over all the proper subgroups H. \(\square \)

2.2 The averaging period

In the next stage, the counts \(n_a(\sigma _t)\) go toward their average value. We actually analyze this stage in two substages, looking at a “proportion vector” and “proportion matrix”, as described below.

2.2.1 Proportion vector chain

For a configuration \(\sigma \in {{\mathcal {S}}}\), we consider the \({{\mathcal {Q}}}\)-dimensional vector \((n_a(\sigma )/n)_{a \in G}\), which we call the proportion vector of \(\sigma \). One may check that for a typical \(\sigma \in {{\mathcal {S}}}\), each \(n_a(\sigma )/n\) is about \(1/{{\mathcal {Q}}}\). For each \(\delta > 0\), we define the \(\delta \)-typical set

$$\begin{aligned} {{\mathcal {S}}}_*(\delta ) := \left\{ \sigma \in {{\mathcal {S}}}\ : \ \left\| \left( \frac{n_a(\sigma )}{n}\right) _{a \in G} - \left( \frac{1}{{{\mathcal {Q}}}}\right) _{a \in G}\right\| \le \delta \right\} , \end{aligned}$$

where \(\Vert \cdot \Vert \) denotes the \(\ell ^2\)-norm in \({{\mathbb {R}}}^G\).

The following lemma implies that starting from \(\sigma \in {{\mathcal {S}}}_{non}\left( 1/3 \right) \), we reach \({{\mathcal {S}}}_*\left( \delta \right) \) in \(O_\delta (n)\) steps with high probability. The proof is given in Sect. 3.4.

Lemma 2.2

Consider any \(\sigma \in {{\mathcal {S}}}_{non}\left( 1/3 \right) \) and any constant \(\delta >0\). There exists a constant \(C_{G, \delta }\) depending only on G and \(\delta \) such that for any \(T \ge C_{G, \delta } n\), we have

$$\begin{aligned} {{\mathbb {P}}}_\sigma \left( \sigma _T \notin {{\mathcal {S}}}_*\left( \delta \right) \right) \le \frac{1}{n} \end{aligned}$$

for all large enough n.

2.2.2 Proportion matrix chain

We actually need a more precise averaging than what is provided by Lemma 2.2. Fix a configuration \(\sigma _0 \in {{\mathcal {S}}}\). For any \(\sigma \in {{\mathcal {S}}}\) and for any \(a, b \in G\), define

$$\begin{aligned} n_{a,b}^{\sigma _0}(\sigma ) := |\{i \in [n] \ : \ \sigma _0(i)=a, \sigma (i)=b \}|. \end{aligned}$$

If we run the Markov chain \((\sigma _t)_{t\ge 0}\) with initial state \(\sigma _0\), then \(n_{a,b}^{\sigma _0}(\sigma _t)\) is the number of sites that originally contained the element a (at time 0) but now contain b (at time t). Note that

$$\begin{aligned} \sum _{b \in G} n_{a,b}^{\sigma _0}(\sigma ) = n_a(\sigma _0) \quad \text {and} \quad \sum _{a \in G} n_{a,b}^{\sigma _0}(\sigma ) = n_b(\sigma ). \end{aligned}$$

We can then associate with \((\sigma _t)_{t \ge 0}\) another Markov chain \(\left( n_{a,b}^{\sigma _0}(\sigma _t)\right) _{a, b \in G}\) for \(t \ge 0\), which we call the proportion matrix chain (with respect to \(\sigma _0\)). The state space for the proportion matrix chain is \(\{0, 1, \ldots , n\}^{G \times G}\), and the transition probabilities depend on \(\sigma _0\).
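The counts \(n_{a,b}^{\sigma _0}(\sigma )\) and the two marginal identities above can be illustrated with a short sketch. The configurations and the function name `proportion_matrix` are hypothetical, and group elements are simply labelled 0, 1, 2.

```python
from collections import Counter

def proportion_matrix(sigma0, sigma):
    """Map (a, b) -> number of sites i with sigma0(i) = a and sigma(i) = b."""
    return Counter(zip(sigma0, sigma))

sigma0 = [0, 0, 1, 1, 2, 2]
sigma = [1, 0, 1, 2, 2, 2]
G = range(3)

N = proportion_matrix(sigma0, sigma)
# Row sums recover the counts of sigma0; column sums recover those of sigma.
row_sums = {a: sum(N[(a, b)] for b in G) for a in G}
col_sums = {b: sum(N[(a, b)] for a in G) for b in G}
```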

The proportion matrix acts like a “sufficient statistic” for analyzing our Markov chain started at \(\sigma _*\), because of the permutation invariance of our dynamics. In fact, as the following lemma shows, the distance to stationarity of the proportion matrix chain is equal to the distance to stationarity of the original chain.

Lemma 2.3

Let \(\sigma _*\in {{\mathcal {S}}}\) be a configuration. For the Markov chain \((\sigma _t)_{t \ge 0}\) with initial state \(\sigma _*\), we consider \(\left( n_{a, b}^{\sigma _*}(\sigma _t)\right) _{a, b \in G}\). Let \({\overline{\pi }}^{\sigma _*}\) be the stationary measure for the Markov chain \(\{(n_{a, b}^{\sigma _*}(\sigma _t))_{a, b \in G}\}_{t \ge 0}\) on \(\left\{ 0, 1, \ldots , n\right\} ^{G \times G}\). Then, for every \(t \ge 0\), we have

$$\begin{aligned} \left\| {{\mathbb {P}}}_{\sigma _*}(\sigma _t \in \cdot \ ) - \pi \right\| _{\mathrm{TV}} = \left\| {{\mathbb {P}}}_{\sigma _*}\left( (n_{a, b}^{\sigma _*}(\sigma _t))_{a, b \in G} \in \cdot \ \right) - {\overline{\pi }}^{\sigma _*} \right\| _{\mathrm{TV}}. \end{aligned}$$

Proof

For any matrix \(N = (N_{a,b})_{a,b \in G} \in \{0, 1, \ldots , n\}^{G \times G}\), write

$$\begin{aligned} {{\mathcal {X}}}_{(N)} := \left\{ \sigma \in {{\mathcal {S}}}\ : \ n_{a, b}^{\sigma _*}(\sigma )=N_{a, b} \ \text {for all} \, a, b \in G \right\} \end{aligned}$$

for the set of configurations with N as their proportion matrix.

Since the distribution of \(\sigma _t\) is invariant under permutations on sites \(i \in [n]\) preserving the set \(\{ i : \sigma _*(i) = a\}\) for every \(a \in G\), the conditional probability measures \({{\mathbb {P}}}_{\sigma _*}\left( \sigma _t \in \cdot \mid \sigma _t \in {{\mathcal {X}}}_{(N)} \right) \) and \(\pi ( \ \cdot \mid {{\mathcal {X}}}_{(N)})\) are both uniform on \({{\mathcal {X}}}_{(N)}\). This implies that for each \(\sigma \in {{\mathcal {X}}}_{(N)}\),

$$\begin{aligned} |{{\mathbb {P}}}_{\sigma _*}(\sigma _t =\sigma ) - \pi (\sigma )| = \frac{1}{\left| {{\mathcal {X}}}_{(N)}\right| } \left| {{\mathbb {P}}}_{\sigma _*}\left( (n^{\sigma _*}_{a, b}(\sigma _t))_{a, b \in G}=N\right) - {\overline{\pi }}^{\sigma _*}(N) \right| , \end{aligned}$$

and summing over all \(\sigma \in {{\mathcal {X}}}_{(N)}\) and all N, we obtain the claim. \(\square \)

For \(\sigma _0 \in {{\mathcal {S}}}\) and \(r > 0\), define the set of configurations

$$\begin{aligned} {{\mathcal {S}}}_*\left( \sigma _0, r \right) := \left\{ \sigma \in {{\mathcal {S}}}\ : \ \left\| \left( \frac{n^{\sigma _0}_{a, b}(\sigma )}{n_a(\sigma _0)}\right) _{b \in G} - \left( \frac{1}{{{\mathcal {Q}}}}\right) _{b \in G}\right\| \le r \text { for all} \, a \in G \right\} . \end{aligned}$$

Roughly speaking, the following lemma shows that starting from a typical configuration \(\sigma _*\in {{\mathcal {S}}}_*\left( \frac{1}{4{{\mathcal {Q}}}} \right) \), we need about \(\frac{1}{2}n \log n\) steps to reach \({{\mathcal {S}}}_*\left( \sigma _*, \frac{R}{\sqrt{n}} \right) \), where R is a constant. We show this fact in a slightly more general form where the initial state need not be \(\sigma _*\); the proof is given in Sect. 3.5.

Lemma 2.4

Consider any \(\sigma _*, \sigma '_*\in {{\mathcal {S}}}_*\left( \frac{1}{4{{\mathcal {Q}}}} \right) \), and let \(T := \left\lceil \frac{1}{2} n \log n \right\rceil \). There exists a constant \(C_G > 0\) depending only on G such that for any given \(R > 0\), we have

$$\begin{aligned} {{\mathbb {P}}}_{\sigma '_*}\left( \sigma _T \notin {{\mathcal {S}}}_*\left( \sigma _*, \frac{R}{\sqrt{n}} \right) \right) \le C_G e^{-R} + \frac{1}{n} \end{aligned}$$

for all large enough n.

2.3 The coupling period

After reaching \({{\mathcal {S}}}_*\left( \sigma _*, \frac{R}{\sqrt{n}} \right) \), we show that only O(n) additional steps are needed to mix in total variation distance. The main ingredient in the proof is a coupling of proportion matrix chains so that they coalesce in O(n) steps when they both start from configurations \(\sigma , {{\tilde{\sigma }}} \in {{\mathcal {S}}}_*\left( \sigma _*, \frac{R}{\sqrt{n}} \right) \). We construct such a coupling and prove the following lemma in Sect. 4.

Lemma 2.5

Consider any \(\sigma _*\in {{\mathcal {S}}}_*\left( \frac{1}{5{{\mathcal {Q}}}^3} \right) \), and let \(R > 0\). Suppose \(\sigma , {{\tilde{\sigma }}} \in {{\mathcal {S}}}_*\left( \sigma _*, \frac{R}{\sqrt{n}} \right) \). Then, there exists a coupling \((\sigma _t, {{\tilde{\sigma }}}_t)\) of the Markov chains with initial states \((\sigma , {{\tilde{\sigma }}})\) such that for a given \(\beta > 0\) and all large enough n,

$$\begin{aligned} {{\mathbb {P}}}_{\sigma , {{\tilde{\sigma }}}}(\tau > \beta n) \le \frac{32{{\mathcal {Q}}}^2 R}{\sqrt{\beta }}, \end{aligned}$$

where \(\tau :=\min \{t \ge 0 : n^{\sigma _*}_{a, b}(\sigma _t) = n^{\sigma _*}_{a, b}({{\tilde{\sigma }}}_t) \ \text {for all} \, a, b \in G\}\).

To translate this coupling time into a bound on total variation distance, we also need the simple observation that the stationary measure \(\pi \) concentrates on \({{\mathcal {S}}}_*\left( \sigma _*, \frac{R}{\sqrt{n}} \right) \) up to an exceptional set of probability \(O(1/R^2)\), as given in the next lemma.

Lemma 2.6

For the stationary distribution \(\pi \) of the chain \((\sigma _t)_{t \ge 0}\), for every \(R>0\) and for all \(n > m\),

$$\begin{aligned} \pi \left( \sigma \notin {{\mathcal {S}}}_*\left( \frac{R}{\sqrt{n}} \right) \right) \le \frac{C_G}{R^2}. \end{aligned}$$

Moreover for every \(\delta <1/(2{{\mathcal {Q}}})\), for every \(R>0\) and for all \(n>m\),

$$\begin{aligned} \max _{\sigma _*\in {{\mathcal {S}}}_*\left( \delta \right) }\pi \left( \sigma \notin {{\mathcal {S}}}_*\left( \sigma _*, \frac{R}{\sqrt{n}} \right) \right) \le \frac{2 C_G {{\mathcal {Q}}}}{R^2}, \end{aligned}$$

where \(C_G\) and m are constants depending only on G.

Proof

Observe that since the stationary distribution \(\pi \) is uniform on \({{\mathcal {S}}}\), it is given by the uniform distribution \(\mathrm{Unif}\) on \(G^n\) conditioned on \({{\mathcal {S}}}\). Note that we can always generate G using each of its |G| elements, so we have an easy lower bound of \(|{{\mathcal {S}}}| \ge |G|^{n - |G|}\). Consequently, we have

$$\begin{aligned} \pi \left( \sigma \notin {{\mathcal {S}}}_*\left( \frac{R}{\sqrt{n}} \right) \right)&\le |G|^{|G|} \mathrm{Unif}\left( \sigma \notin {{\mathcal {S}}}_*\left( \frac{R}{\sqrt{n}} \right) \right) \\&\le |G|^{|G|} \sum _{a \in G}\mathrm{Unif}\left( \left| \frac{n_a(\sigma )}{n}-\frac{1}{{{\mathcal {Q}}}}\right| \ge \frac{R}{\sqrt{n}}\right) \\&\le \frac{|G|^{|G|}}{R^2}\left( 1-\frac{1}{{{\mathcal {Q}}}}\right) . \end{aligned}$$

Concerning the second assertion, we note that \(n_a(\sigma _*) \ge (1/{{\mathcal {Q}}}-\delta )n\) for each \(a \in G\); the rest follows similarly, so we omit the details. \(\square \)

Remark 2.7

In Lemma 2.6 above, we have given a very loose bound on \(C_G\) for the sake of simplicity. Actually, it is not hard to see that, holding G fixed, we have \(\lim _{n\rightarrow \infty } |{{\mathcal {S}}}|/|G|^n = 1\). See also [9, Section 6.B.] for more explicit bounds for various families of groups.

Together, Lemmas 2.4, 2.5, and 2.6 imply the following bound for total variation distance.

Lemma 2.8

Let \(\beta > 0\) be given, and let \(T := \left\lceil \frac{1}{2} n \log n \right\rceil + \left\lceil \beta n \right\rceil \). Then, for any \(\sigma _*\in {{\mathcal {S}}}_*\left( \frac{1}{5{{\mathcal {Q}}}^3} \right) \), we have

$$\begin{aligned} \left\| {{\mathbb {P}}}_{\sigma _*}(\sigma _T \in \cdot \ ) - \pi \right\| _{\mathrm{TV}} \le \frac{C_G}{\beta ^{1/4}}, \end{aligned}$$

where \(C_G\) is a constant depending only on G.

Proof

Let \({{\tilde{\sigma }}}\) be drawn from the stationary distribution \(\pi \). Define

$$\begin{aligned} \tau = \min \left\{ t \ge 0 : n^{\sigma _*}_{a,b}(\sigma _t) = n^{\sigma _*}_{a,b}({{\tilde{\sigma }}}_t) \text { for all} \, a, b \in G \right\} , \end{aligned}$$

where \(({{\tilde{\sigma }}}_t)\) is a Markov chain started at \({{\tilde{\sigma }}}\). Let \({\overline{\pi }}^{\sigma _*}\) denote the stationary distribution for the proportion matrix with respect to \(\sigma _*\). Since \({{\tilde{\sigma }}}\) was drawn from \(\pi \), the proportion matrix of \({{\tilde{\sigma }}}_t\) remains distributed as \({\overline{\pi }}^{\sigma _*}\) for all t.

We first run \(\sigma \) and \({{\tilde{\sigma }}}\) independently up until time \(T_1 := \left\lceil \frac{1}{2} n \log n \right\rceil \). For a parameter R to be specified later, consider the events

$$\begin{aligned} {{\mathcal {G}}}:= \left\{ \sigma _{T_1} \in {{\mathcal {S}}}_*\left( \sigma _*, \frac{R}{\sqrt{n}} \right) \right\} , \qquad {\tilde{{{\mathcal {G}}}}} := \left\{ {{\tilde{\sigma }}}_{T_1} \in {{\mathcal {S}}}_*\left( \sigma _*, \frac{R}{\sqrt{n}} \right) \right\} . \end{aligned}$$

Lemma 2.4 implies that \({{\mathbb {P}}}({{\mathcal {G}}}^{{\textsf {c}}}) \le C_G e^{-R} + \frac{1}{n}\), and Lemma 2.6 implies that \({{\mathbb {P}}}({{\tilde{{{\mathcal {G}}}}}}^{{\textsf {c}}}) \le \frac{2 C_G {{\mathcal {Q}}}}{R^2}\).

Let \(T_2 := \left\lceil \beta n \right\rceil \). Starting from time \(T_1\), as long as both \({{\mathcal {G}}}\) and \({{\tilde{{{\mathcal {G}}}}}}\) hold, we may use Lemma 2.5 to form a coupling \((\sigma _t, {{\tilde{\sigma }}}_t)\) so that

$$\begin{aligned} {{\mathbb {P}}}_{\sigma _*, \sigma _*} \Big ( n^{\sigma _*}_{a, b}(\sigma _{T_1+T_2}) \ne n^{\sigma _*}_{a, b}({{\tilde{\sigma }}}_{T_1+T_2}) \hbox { for some}\ a, b \in G \,\Big |\, {{\mathcal {G}}}\cap {{\tilde{{{\mathcal {G}}}}}} \Big ) \le \frac{C{{\mathcal {Q}}}^2 R}{\sqrt{\beta }}. \end{aligned}$$

Setting \(R = \beta ^{1/4}\), we conclude that

$$\begin{aligned}&{{\mathbb {P}}}_{\sigma _*, \sigma _*} \Big ( n^{\sigma _*}_{a, b}(\sigma _{T_1+T_2}) \ne n^{\sigma _*}_{a, b}({{\tilde{\sigma }}}_{T_1+T_2}) \hbox { for some}\ a, b \in G \Big ) \\&\quad \le \frac{C{{\mathcal {Q}}}^2 R}{\sqrt{\beta }} + {{\mathbb {P}}}({{\mathcal {G}}}^{{\textsf {c}}}) + {{\mathbb {P}}}({{\tilde{{{\mathcal {G}}}}}}^{{\textsf {c}}}) \\&\quad \le \frac{C{{\mathcal {Q}}}^2 R}{\sqrt{\beta }} + \left( C_G e^{-R} + \frac{1}{n}\right) + \frac{2C_G {{\mathcal {Q}}}}{R^2} = O_G\left( \frac{1}{\beta ^{1/4}}\right) . \end{aligned}$$
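For concreteness, here is how each term behaves under the choice \(R = \beta ^{1/4}\); this is only a sketch of the bookkeeping, assuming \(\beta \ge 1\) and \(n \ge \beta ^{1/4}\):

$$\begin{aligned} \frac{C{{\mathcal {Q}}}^2 R}{\sqrt{\beta }} = \frac{C{{\mathcal {Q}}}^2}{\beta ^{1/4}}, \qquad C_G e^{-R} \le \frac{C_G}{R} = \frac{C_G}{\beta ^{1/4}}, \qquad \frac{1}{n} \le \frac{1}{\beta ^{1/4}}, \qquad \frac{2C_G{{\mathcal {Q}}}}{R^2} = \frac{2C_G{{\mathcal {Q}}}}{\beta ^{1/2}} \le \frac{2C_G{{\mathcal {Q}}}}{\beta ^{1/4}}, \end{aligned}$$

where the second bound uses \(e^{-x} \le 1/x\) for \(x > 0\). With this choice of R, the first term is the one that dictates the overall rate \(\beta ^{-1/4}\).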

We have \(T = T_1 + T_2\), and recall that the proportion matrix for \({{\tilde{\sigma }}}\) is stationary for all time. This yields

$$\begin{aligned} \left\| {{\mathbb {P}}}_{\sigma _*}\left( (n^{\sigma _*}_{a, b}(\sigma _T))_{a, b \in G} \in \cdot \ \right) - {\overline{\pi }}^{\sigma _*} \right\| _{\mathrm{TV}} = O_G\left( \frac{1}{\beta ^{1/4}}\right) . \end{aligned}$$

The result then follows by Lemma 2.3. \(\square \)

2.4 Proof of the main theorem

We now combine the lemmas from the burn-in, averaging, and coupling periods to complete the proof of the upper bound in Theorem 1.1.

Proof of Theorem 1.1 (1)

Define \(T_1 := \left\lceil n \log n + \beta n \right\rceil \), \(T_2 := \left\lceil \beta n \right\rceil \), and \(T_3 := \left\lceil \frac{1}{2} n \log n \right\rceil + \left\lceil \beta n \right\rceil \).

Let \(\tau _{1/3}\) be the first time to hit \({{\mathcal {S}}}_{non}\left( 1/3 \right) \) as in Lemma 2.1. Then, Lemma 2.1 implies that for any \(\sigma _1 \in {{\mathcal {S}}}\) and any \(t \ge 0\), we have

$$\begin{aligned} d_{\sigma _1}(T_1 + t)&\le {{\mathbb {P}}}_{\sigma _1}\left( \tau _{1/3} > T_1 \right) + \max _{\sigma \in {{\mathcal {S}}}_{non}\left( 1/3 \right) } d_\sigma (t) \nonumber \\&\le \frac{120 {{\mathcal {Q}}}}{\beta ^2} + \max _{\sigma \in {{\mathcal {S}}}_{non}\left( 1/3 \right) } d_\sigma (t). \end{aligned}$$
(3)

Next, by Lemma 2.2, for any \(\sigma _2 \in {{\mathcal {S}}}_{non}\left( 1/3 \right) \) and when \(\beta \) and n are sufficiently large, we have that \({{\mathbb {P}}}_{\sigma _2} \left( \sigma _{T_2} \not \in {{\mathcal {S}}}_*\left( \frac{1}{5{{\mathcal {Q}}}^3} \right) \right) \le \frac{1}{n}\). Consequently, for \(\sigma _2 \in {{\mathcal {S}}}_{non}\left( 1/3 \right) \), we have

$$\begin{aligned} d_{\sigma _2}(T_2 + t) \le \frac{1}{n} + \max _{\sigma _*\in {{\mathcal {S}}}_*\left( \frac{1}{5{{\mathcal {Q}}}^3} \right) } d_{\sigma _*}(t). \end{aligned}$$
(4)

Finally, Lemma 2.8 states that

$$\begin{aligned} \max _{\sigma _*\in {{\mathcal {S}}}_*\left( \frac{1}{5{{\mathcal {Q}}}^3} \right) } d_{\sigma _*}(T_3) \le \frac{C_G}{\beta ^{1/4}}. \end{aligned}$$
(5)

Thus, combining (3), (4), and (5), we obtain for any \(\sigma \in {{\mathcal {S}}}\) that

$$\begin{aligned} d_\sigma \left( \frac{3}{2} n \log n + 4 \beta n \right)&\le d_\sigma \left( T_1 + T_2 + T_3 \right) \nonumber \\&\le \frac{120 {{\mathcal {Q}}}}{\beta ^2} + \frac{1}{n} + \frac{C_G}{\beta ^{1/4}} \end{aligned}$$

Sending \(n \rightarrow \infty \) and then \(\beta \rightarrow \infty \) yields (1). \(\square \)
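The time bookkeeping in the last display can be checked directly; the following is a sketch, assuming \(\beta n \ge 4\) and using the standard fact that \(\max _{\sigma \in {{\mathcal {S}}}} d_\sigma (t)\) is non-increasing in t. Since \(\lceil x \rceil \le x + 1\),

$$\begin{aligned} T_1 + T_2 + T_3&\le (n \log n + \beta n + 1) + (\beta n + 1) + \left( \frac{1}{2} n \log n + 1 + \beta n + 1 \right) \\&= \frac{3}{2} n \log n + 3 \beta n + 4 \le \frac{3}{2} n \log n + 4 \beta n. \end{aligned}$$

The factor 4 in front of \(\beta n\) is harmless, since (1) only concerns the limit \(\beta \rightarrow \infty \).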

3 Proofs for the averaging period

In this section, we prove Lemmas 2.2 and 2.4. The proofs are based on analyzing stochastic difference equations satisfied by the Fourier transform of the proportion vector or matrix.

3.1 The Fourier transform for G

We first establish some notation and preliminaries for the Fourier transform. Let \(G^*\) be a complete set of non-trivial irreducible representations of G. In other words, for each \(\rho \in G^*\), we have a finite dimensional complex vector space \(V_\rho \) such that \(\rho : G \rightarrow GL(V_\rho )\) is a non-trivial irreducible representation, and any non-trivial irreducible representation of G is isomorphic to some unique \(\rho \in G^*\). Moreover, we may equip each \(V_\rho \) with an inner product for which \(\rho \in G^*\) is unitary.

For a configuration \(\sigma \in {{\mathcal {S}}}\) and for each \(\rho \in G^*\), we consider the matrix acting on \(V_\rho \) given by

$$\begin{aligned} x_\rho (\sigma ) := \sum _{a \in G} \frac{n_a(\sigma )}{n}\rho (a), \end{aligned}$$

so that \(x_\rho (\sigma )\) is the Fourier transform of the proportion vector at the representation \(\rho \). We write \(x(\sigma ):=(x_\rho (\sigma ))_{\rho \in G^*}\).

Let \(\widetilde{V} := \bigoplus _{\rho \in G^*}\mathrm{End}_{{\mathbb {C}}}(V_\rho )\), and write \(d_\rho := \dim _{{\mathbb {C}}}V_\rho \). For an element \(x = (x_\rho )_{\rho \in G^*} \in \widetilde{V}\), we define a norm \(\Vert \cdot \Vert _{\widetilde{V}}\) given by

$$\begin{aligned} \Vert x\Vert _{\widetilde{V}}^2 := \frac{1}{{{\mathcal {Q}}}}\sum _{\rho \in G^*}d_\rho \Vert x_\rho \Vert _{\mathrm{HS}}^2, \end{aligned}$$

where \(\langle A, B \rangle _{\mathrm{HS}} =\mathrm{Tr}\,(A^*B)\) denotes the Hilbert–Schmidt inner product in \(\mathrm{End}_{{\mathbb {C}}}(V_\rho )\) and \(\Vert \cdot \Vert _{\mathrm{HS}}\) denotes the corresponding norm. (Note that \(\langle \cdot , \cdot \rangle _{\mathrm{HS}}\) and \(\Vert \cdot \Vert _{\mathrm{HS}}\) depend on \(\rho \), but for the sake of brevity, we omit the \(\rho \) when there is no danger of confusion.)

The Peter–Weyl theorem [7, Chapter 2] says that

$$\begin{aligned} L^2(G) \cong {{\mathbb {C}}}\oplus \widetilde{V}, \end{aligned}$$

where the isomorphism is given by the Fourier transform. The Plancherel formula then reads

$$\begin{aligned} \Vert x(\sigma )\Vert _{\widetilde{V}}^2 = \left\| \left( \frac{n_a(\sigma )}{n}\right) _{a \in G} - \left( \frac{1}{{{\mathcal {Q}}}}\right) _{a \in G}\right\| ^2. \end{aligned}$$
(6)

Thus, in order to show that \(\sigma \in {{\mathcal {S}}}_*\left( \delta \right) \), it suffices to show that \(\Vert x(\sigma )\Vert _{\widetilde{V}}\) is small. A similar argument may be applied to the proportion matrix instead of the proportion vector.

Finally, for an element \(A \in \mathrm{End}_{{\mathbb {C}}}(V_\rho )\), we will at times also consider the operator norm \(\Vert A\Vert _{op} := \sup _{v \in V_\rho , v \ne 0} \Vert Av\Vert / \Vert v\Vert \). We will also sometimes use the following (equivalent) variational characterization of the operator norm:

$$\begin{aligned} \sup _{\begin{array}{c} X \in \mathrm{End}_{{\mathbb {C}}}(V_\rho ) \\ \Vert X\Vert _{\mathrm{HS}} = 1 \end{array}} \Vert XA\Vert ^2_{\mathrm{HS}}&= \sup _{\begin{array}{c} X \in \mathrm{End}_{{\mathbb {C}}}(V_\rho ) \\ \Vert X\Vert _{\mathrm{HS}} = 1 \end{array}} \mathrm{Tr}\,(XAA^*X^*) = \sup _{\begin{array}{c} X \in \mathrm{End}_{{\mathbb {C}}}(V_\rho ) \\ \Vert X\Vert _{\mathrm{HS}} = 1 \end{array}} \mathrm{Tr}\,(X^*XAA^*) \\&= \sup _{\begin{array}{c} Y \in \mathrm{End}_{{\mathbb {C}}}(V_\rho ) \\ Y \succeq 0, \;\; \mathrm{Tr}\,Y = 1 \end{array}} \langle Y , AA^* \rangle _{\mathrm{HS}} = \Vert AA^*\Vert _{op} = \Vert A\Vert _{op}^2. \end{aligned}$$

3.1.1 The special case of \(G = {{\mathbb {Z}}}/q\)

On a first reading of this section, the reader may wish to consider everything for the special case of \(G = {{\mathbb {Z}}}/q\) for some integer \(q \ge 2\). In that case, each representation is one-dimensional, and the representations can be indexed by \(\ell = 0, 1, 2, \ldots , q - 1\). The Fourier transform is then particularly simple: the coefficients are scalar values

$$\begin{aligned} x_\ell (\sigma ) = \sum _{a = 0}^{q - 1} \frac{n_a(\sigma )}{n} \omega ^{a \ell }, \end{aligned}$$

where \(\omega := e^{\frac{2\pi i}{q}}\) is a primitive q-th root of unity.

This special case already illustrates most of the main ideas while simplifying the estimates in some places (e.g. matrix inequalities we use will often be immediately obvious for scalars).
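As a concrete illustration of this special case, the following Python sketch computes the coefficients \(x_\ell (\sigma )\) from a vector of counts and compares the two sides of the Plancherel formula (6) numerically. It is only a sanity check under stated assumptions: the function names and the example counts are hypothetical, and the norm on the right-hand side of (6) is taken to be the Euclidean norm on \({{\mathbb {C}}}^G\).

```python
import cmath


def fourier_coefficients(counts):
    """Return [x_0, ..., x_{q-1}] for G = Z/q, where counts[a] = n_a(sigma)."""
    q = len(counts)
    n = sum(counts)
    omega = cmath.exp(2j * cmath.pi / q)  # primitive q-th root of unity
    return [
        sum(counts[a] / n * omega ** (a * ell) for a in range(q))
        for ell in range(q)
    ]


def plancherel_sides(counts):
    """Return (lhs, rhs) of the Plancherel formula (6) for G = Z/q."""
    q = len(counts)
    n = sum(counts)
    x = fourier_coefficients(counts)
    # Left side: (1/Q) * sum over non-trivial characters (d_rho = 1 for Z/q).
    lhs = sum(abs(x[ell]) ** 2 for ell in range(1, q)) / q
    # Right side: squared Euclidean distance to the uniform proportion vector.
    rhs = sum((counts[a] / n - 1 / q) ** 2 for a in range(q))
    return lhs, rhs


# Hypothetical configuration with q = 5 and n = 10.
lhs, rhs = plancherel_sides([5, 1, 0, 3, 1])
print(abs(lhs - rhs) < 1e-12)
```

The trivial character \(\ell = 0\) is excluded on the left because \(G^*\) contains only non-trivial representations; its coefficient is always 1.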

3.2 A stochastic difference equation for the \(n_a\)

For \(a \in G\), we next analyze the behavior of \(n_a(\sigma _t)\) over time. For convenience, we write \(n_a(t) = n_a(\sigma _t)\). Let \({{\mathcal {F}}}_t\) denote the \(\sigma \)-field generated by the Markov chain \((\sigma _t)_{t \ge 0}\) up to time t. Then, our dynamics satisfy the equation

$$\begin{aligned} {{\mathbb {E}}}[n_a(t+1)-n_a(t) \mid {{\mathcal {F}}}_t] = \sum _{b \in G} \frac{n_{ab^{-1}}(t) n_b(t) }{2n(n-1)}+\sum _{b \in G} \frac{n_{ab}(t) n_b(t)}{2n(n-1)} - \frac{n_a(t)}{n}. \end{aligned}$$
(7)

Note that \(|n_a(t + 1) - n_a(t)| \le 1\) almost surely. Thus, for each \(a \in G\), we can write the above as a stochastic difference equation

$$\begin{aligned} n_a(t+1) - n_a(t)= & {} \sum _{b \in G} \frac{n_{ab^{-1}}(t) n_b(t)}{2n(n-1)}+\sum _{b \in G} \frac{n_{ab}(t) n_b(t)}{2n(n-1)} \nonumber \\&- \frac{n_a(t)}{n} + M_a(t+1), \end{aligned}$$
(8)

where \({{\mathbb {E}}}[M_a(t+1) \mid {{\mathcal {F}}}_t] = 0\) and \(|M_a(t)| \le 2\) almost surely.

It is easiest to analyze this equation through the Fourier transform. Writing \(x_\rho (t) = x_\rho (\sigma _t)\), we calculate from (8) that

$$\begin{aligned} x_\rho (t+1) - x_\rho (t) = \frac{1}{n-1}x_\rho (t) \left( \frac{x_\rho (t) + x_\rho (t)^*}{2} -\frac{n-1}{n}\right) + \widehat{M}_\rho (t+1), \end{aligned}$$

where \(\widehat{M}_\rho (t) := \frac{1}{n}\sum _{a \in G}M_a(t) \rho (a)\). For convenience, write

$$\begin{aligned} X_\rho (t) = \frac{1}{n - 1}\left( \frac{x_\rho (t) + x_\rho (t)^*}{2} - \frac{n-1}{n}\right) , \end{aligned}$$

so that our equation becomes

$$\begin{aligned} x_\rho (t+1) - x_\rho (t) = x_\rho (t) X_\rho (t) + \widehat{M}_\rho (t+1). \end{aligned}$$
(9)
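For the reader who wishes to verify the passage from (8) to (9), here is a sketch of the computation for the two quadratic terms; it uses only that \(\rho \) is a unitary homomorphism, and the substitutions \(c = ab^{-1}\) and \(c = ab\) are introduced here for illustration. Multiplying by \(\rho (a)/n\) and summing over \(a \in G\) gives

$$\begin{aligned} \frac{1}{n}\sum _{a \in G}\rho (a)\sum _{b \in G} \frac{n_{ab^{-1}}(t) n_b(t)}{2n(n-1)}&= \frac{1}{2(n-1)}\sum _{c, b \in G} \frac{n_c(t)}{n}\frac{n_b(t)}{n}\rho (c)\rho (b) = \frac{x_\rho (t)^2}{2(n-1)}, \\ \frac{1}{n}\sum _{a \in G}\rho (a)\sum _{b \in G} \frac{n_{ab}(t) n_b(t)}{2n(n-1)}&= \frac{1}{2(n-1)}\sum _{c, b \in G} \frac{n_c(t)}{n}\frac{n_b(t)}{n}\rho (c)\rho (b)^* = \frac{x_\rho (t)x_\rho (t)^*}{2(n-1)}, \end{aligned}$$

where in the first line \(\rho (a) = \rho (c)\rho (b)\) and in the second line \(\rho (a) = \rho (c)\rho (b)^{-1} = \rho (c)\rho (b)^*\). The linear term in (8) contributes \(-x_\rho (t)/n\), and adding the three contributions and factoring out \(x_\rho (t)\) on the left yields \(x_\rho (t) X_\rho (t)\).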

Note that we have

$$\begin{aligned} \Vert x_\rho (t)\Vert _{\mathrm{HS}} \le \sqrt{d_\rho }, \qquad {{\mathbb {E}}}[\widehat{M}_\rho (t+1) \mid {{\mathcal {F}}}_t] = 0, \qquad \text {and}\qquad \Vert \widehat{M}_\rho (t)\Vert _{\mathrm{HS}} \le \frac{2 {{\mathcal {Q}}}\sqrt{d_\rho }}{n}, \end{aligned}$$

and thus,

$$\begin{aligned} \Vert x(t)\Vert _{\widetilde{V}} \le 1 \qquad \text {and}\qquad \Vert \widehat{M}(t)\Vert _{\widetilde{V}}\le \frac{2{{\mathcal {Q}}}}{n}, \end{aligned}$$

where \(\widehat{M}=(\widehat{M}_\rho )_{\rho \in G^*}\).

3.3 A general estimate for stochastic difference equations

Before proving Lemma 2.2, we also need a technical lemma for controlling the behavior of stochastic difference equations, which will be used to analyze (9) as well as other similar equations.

Lemma 3.1

Let \((z(t))_{t \ge 0}\) be a sequence of [0, 1]-valued random variables adapted to a filtration \(({{\mathcal {F}}}_t)_{t \ge 0}\). Let \(\varepsilon \in (0, 1)\) be a small constant, and let \(\varphi : {{\mathbb {R}}}^+ \rightarrow (0,1]\) be a non-decreasing function.

Suppose that there are \({{\mathcal {F}}}_t\)-measurable random variables M(t) for which

$$\begin{aligned} z(t+1) - z(t) \le -\varepsilon \varphi (t+1) z(t) + M(t+1) \end{aligned}$$
(10)

and which, for some constant D, satisfy the bounds

$$\begin{aligned} {{\mathbb {E}}}[ M(t+1) \mid {{\mathcal {F}}}_t ] \le D\varepsilon \sqrt{\varepsilon }, \qquad |M(t)| \le D\varepsilon . \end{aligned}$$

Then, for each t and each \(\lambda > 0\), we have

$$\begin{aligned} {{\mathbb {P}}}\left( z(t) \ge \lambda \sqrt{\varepsilon } + e^{- \varepsilon \int _0^t \varphi (s)\,ds} \cdot z(0) \right) \le C_{D,\varphi } e^{-c_{D,\varphi } \lambda ^2} \end{aligned}$$

for constants \(c_{D,\varphi }, C_{D,\varphi }\) depending only on D and \(\varphi \).

Proof

Let us define for integers \(t \ge 1\),

$$\begin{aligned} \Phi (t) := \varepsilon ^{-1} \sum _{k = 1}^t \log \frac{1}{1-\varepsilon \varphi (k)}, \qquad \text {and} \qquad \Phi (0) := 0. \end{aligned}$$

Taking conditional expectations in the inequality relating \(z(t+1)\) to z(t), we have

$$\begin{aligned} {{\mathbb {E}}}[ z(t+1) \mid {{\mathcal {F}}}_t ] \le (1 - \varepsilon \varphi (t+1)) z(t) + D \varepsilon \sqrt{\varepsilon }. \end{aligned}$$

Rearranging and using the fact that \(\varphi (t)\) is non-decreasing, we have

$$\begin{aligned} {{\mathbb {E}}}[ z(t+1) \mid {{\mathcal {F}}}_t ] - \frac{D\sqrt{\varepsilon }}{\varphi (0)}&\le (1 - \varepsilon \varphi (t+1)) z(t) - \frac{D\sqrt{\varepsilon }(1 - \varepsilon \varphi (t+1))}{\varphi (0)} \\&\le (1 - \varepsilon \varphi (t+1)) \left( z(t) - \frac{D\sqrt{\varepsilon }}{\varphi (0)} \right) . \end{aligned}$$

Consequently,

$$\begin{aligned} Z_t := e^{\varepsilon \Phi (t)} \left( z(t) - \frac{D\sqrt{\varepsilon }}{\varphi (0)} \right) \end{aligned}$$

is a supermartingale, and its increments are bounded by

$$\begin{aligned} |Z_{t+1}-Z_t| \le e^{\varepsilon \Phi (t+1)}\left( |M(t+1)|+D \varepsilon \right) \le 2D\varepsilon e^{\varepsilon \Phi (t+1)}. \end{aligned}$$
(11)

Recall that \(\varphi \) is non-decreasing, so that for all \(t \ge s \ge 0\), we have

$$\begin{aligned} \Phi (t) = \Phi (s) + \varepsilon ^{-1} \sum _{k = s + 1}^t \log \frac{1}{1 - \varepsilon \varphi (k)} \ge \Phi (s) + (t - s) \varphi (0). \end{aligned}$$

Using this with (11), we see that the sum of the squares of the first t increments is at most

$$\begin{aligned} \sum _{s = 1}^{t} 4D^2 \varepsilon ^2 e^{2\varepsilon \Phi (s)}&\le 4D^2\varepsilon ^2 \sum _{s = 1}^t e^{2\varepsilon \Phi (t) - 2\varepsilon \varphi (0)(t - s)} \le 4D^2\varepsilon ^2 e^{2\varepsilon \Phi (t)} \cdot \frac{1}{1 - e^{-2\varepsilon \varphi (0)}} \\&\le 4D^2\varepsilon ^2 e^{2\varepsilon \Phi (t)} \cdot \frac{1}{1 - (1 - \frac{1}{2}\varepsilon \varphi (0))} = \frac{8D^2 \varepsilon }{\varphi (0)} \cdot e^{2\varepsilon \Phi (t)}. \end{aligned}$$

By the Azuma–Hoeffding inequality, this yields

$$\begin{aligned} {{\mathbb {P}}}\left( Z_t \ge \lambda \sqrt{\varepsilon } e^{\varepsilon \Phi (t)} + Z_0 \right) \le \exp \left( - \frac{\varphi (0) \lambda ^2 \varepsilon \cdot e^{2\varepsilon \Phi (t)}}{16D^2 \varepsilon \cdot e^{2\varepsilon \Phi (t)}} \right) = \exp \left( -\frac{\varphi (0) \lambda ^2}{16D^2} \right) , \end{aligned}$$

which in turn implies

$$\begin{aligned} {{\mathbb {P}}}\left( z(t) \ge \frac{D\sqrt{\varepsilon }}{\varphi (0)} + e^{-\varepsilon \Phi (t)} z(0) + \lambda \sqrt{\varepsilon } \right) \le \exp \left( -\frac{\varphi (0)\lambda ^2}{16D^2} \right) . \end{aligned}$$

Finally, observe that \(\Phi (t) \ge \sum _{k = 1}^t \varphi (k) \ge \int _0^t \varphi (s)\, ds\). The result then follows upon shifting and rescaling of \(\lambda \). \(\square \)
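To get a feel for Lemma 3.1, the following Python sketch simulates a toy process that satisfies its hypotheses and compares it with the threshold \(\lambda \sqrt{\varepsilon } + e^{-\varepsilon \int _0^t \varphi (s)\,ds} z(0)\). The process is an assumption made for illustration only (it does not come from the product replacement chain): we take \(w(t+1) = (1 - \varepsilon c) w(t) + \varepsilon c\, \xi _{t+1}\) with independent signs \(\xi _t = \pm 1\), set \(z(t) = w(t)^2\), and use the constant function \(\varphi \equiv c\). One checks that (10) holds with \(M(t+1) = \varepsilon ^2 c^2 + 2(1 - \varepsilon c)\varepsilon c\, w(t) \xi _{t+1}\), which has conditional mean \(\varepsilon ^2 c^2\) and satisfies \(|M(t+1)| \le 3\varepsilon \), so that one may take \(D = 3\).

```python
import math
import random


def simulate_z(eps, c, steps, w0=1.0, seed=0):
    """Simulate w(t+1) = (1 - eps*c) * w(t) + eps*c*xi and return the path of z = w^2.

    With |w0| <= 1 the process w stays in [-1, 1], so z stays in [0, 1]
    (up to floating-point rounding).
    """
    rng = random.Random(seed)
    w = w0
    path = [w * w]
    for _ in range(steps):
        xi = rng.choice((-1.0, 1.0))  # fair random sign
        w = (1 - eps * c) * w + eps * c * xi
        path.append(w * w)
    return path


eps, c, steps, lam = 1e-3, 0.5, 5000, 1.0
path = simulate_z(eps, c, steps)
# For constant phi = c, the integral of phi over [0, t] equals c * t.
threshold = lam * math.sqrt(eps) + math.exp(-eps * c * steps) * path[0]
print(path[-1], threshold)
```

In this toy example the fluctuations of \(z(t)\) are of order \(\varepsilon \) rather than \(\sqrt{\varepsilon }\), so the bound of the lemma is far from tight here; the sketch is meant only to make the roles of \(\varepsilon \), \(\varphi \), D and \(\lambda \) concrete.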

3.4 Proportion vector chain: Proof of Lemma 2.2

We first prove a bound for the Fourier coefficients \(x_\rho (t)\).

Lemma 3.2

Consider any \(\sigma \in {{\mathcal {S}}}_{non}\left( 1/3 \right) \) and any \(\rho \in G^*\). There exists a constant \(c_G > 0\) depending only on G such that

$$\begin{aligned} {{\mathbb {P}}}_\sigma \left( \bigcup _{t = 1}^{n^2} \left\{ \Vert x_\rho (t)\Vert _{\mathrm{HS}} \ge \frac{1}{n^{1/8}} + e^{-c_G t/n}\cdot \Vert x_\rho (0)\Vert _{\mathrm{HS}} \right\} \right) \le \frac{1}{n^3} \end{aligned}$$

for all large enough n.

This immediately implies Lemma 2.2.

Proof of Lemma 2.2

With \(c_G\) defined as in Lemma 3.2, take \(C_{G, \delta }\) large enough so that for any \(T \ge C_{G, \delta } n\),

$$\begin{aligned} \frac{1}{n^{1/8}} + e^{-c_G T/n}\sqrt{d_\rho } \le \delta . \end{aligned}$$

Then, Lemma 3.2 and Plancherel’s formula yield

$$\begin{aligned} {{\mathbb {P}}}_\sigma \left( \sigma _T \notin {{\mathcal {S}}}_*\left( \delta \right) \right)&\le {{\mathbb {P}}}_\sigma \left( \Vert x_\rho (T)\Vert _{\mathrm{HS}} \ge \delta \hbox { for some}\ \rho \in G^*\right) \\&\le \frac{{{\mathcal {Q}}}}{n^3} \le \frac{1}{n}, \end{aligned}$$

for large enough n, as desired. \(\square \)

We are now left with proving Lemma 3.2, which relies on the following bound on the operator norm.

Lemma 3.3

There exists a positive constant \(\gamma _G\) depending on G such that for any \(\rho \in G^*\) and any \(\sigma \in {{\mathcal {S}}}_{non}\left( 1/6 \right) \),

$$\begin{aligned} \Vert I_{d_\rho }+X_\rho (\sigma )\Vert _{op} \le 1-\frac{\gamma _G}{n}. \end{aligned}$$

Proof

Let \(\Delta _G\) denote the set of all probability distributions on G, and for \(c \in (0, 1)\), let \(\Delta _G(c) \subset \Delta _G\) denote the set of all probability distributions \(\mu \) such that \(\mu (H) \le 1 - c\) for all proper subgroups \(H \subset G\).

Consider a representation \(\rho \in G^*\), and consider the function \(h : \Delta _G(1/6) \rightarrow \mathrm{End}_{{\mathbb {C}}}(V_\rho )\) given by

$$\begin{aligned} h(\mu ) = \sum _{a \in G} \mu (a) \frac{\rho (a)+\rho (a)^*}{2}. \end{aligned}$$

Then, \(h(\mu )\) is hermitian, and since \(\rho \) is unitary, we clearly have

$$\begin{aligned} \lambda (\mu ):=\max _{v \in V_\rho , \Vert v\Vert =1}\langle h(\mu )v, v\rangle \le 1. \end{aligned}$$

We claim that \(\lambda (\mu ) < 1\) for each \(\mu \in \Delta _G(c)\). Indeed, suppose the contrary. Then, there exists a unit vector \(v \in V_\rho \) such that \(\mathrm{Re}\langle \rho (a)v, v \rangle =1\) for all \(a \in G\) with \(\mu (a)>0\). Since \(\rho (a)\) is unitary, the equality case of the Cauchy–Schwarz inequality forces \(\rho (a)v = v\) for every such a, so the support of \(\mu \) is included in the subgroup

$$\begin{aligned} H=\{a \in G \ : \ \rho (a)v=v\}. \end{aligned}$$

Since \(\rho \) is a (non-trivial) irreducible representation, H is a proper subgroup of G, and \(\mu (H)=1\), contradicting the assumption that \(\mu \in \Delta _G(c)\).

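Note that \(\mu \mapsto \lambda (\mu )\) is continuous and that \(\Delta _G(1/6)\) is compact, so the maximum of \(\lambda \) over \(\Delta _G(1/6)\) is attained and is strictly less than 1. Since \(G^*\) is finite, we may define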

$$\begin{aligned} \gamma _\rho :=\max _{\mu \in \Delta _G(1/6)}\lambda (\mu )< 1 \qquad \text {and}\qquad {{\tilde{\gamma }}}_G:=\max _{\rho \in G^*}\gamma _\rho <1. \end{aligned}$$

Then, we have for any \(\sigma \in {{\mathcal {S}}}_{non}\left( 1/6 \right) \),

$$\begin{aligned} \frac{x_\rho (\sigma ) + x_\rho (\sigma )^*}{2} = \sum _{a \in G} \frac{n_a(\sigma )}{n}\frac{\rho (a)+\rho (a)^*}{2} \preceq {{\tilde{\gamma }}}_G I_{d_\rho }. \end{aligned}$$

Taking \(0<\gamma _G < 1-{{\tilde{\gamma }}}_G\), and plugging this into the definition of \(X_\rho \) gives \(X_\rho (\sigma ) \preceq -\frac{\gamma _G}{n}I_{d_\rho }\). Note that \(X_\rho (\sigma ) \succeq -\frac{2}{n-1}I_{d_\rho }\). Combining these together gives the result. \(\square \)

Remark 3.4

A much more direct approach is possible in the case \(G = {{\mathbb {Z}}}/q\). The condition \(\sigma \in {{\mathcal {S}}}_{non}\left( 1/6 \right) \) implies that \(\frac{n_0(\sigma )}{n} \le \frac{5}{6}\). Then, we have

$$\begin{aligned} \mathrm{Re}x_\ell (\sigma ) := \mathrm{Re}\sum _{a = 0}^{q - 1} \frac{n_a(\sigma )}{n} \omega ^{a \ell } \le \frac{5}{6} + \frac{1}{6} \max _{1 \le a \le q - 1} \mathrm{Re}\omega ^{a \ell } = \frac{5}{6} + \frac{1}{6} \cos \frac{2\pi }{q} < 1 - \gamma _G \end{aligned}$$

for some positive \(\gamma _G\). Some rearranging of equations then yields the desired result.
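The rearrangement can be checked numerically. The Python sketch below evaluates the scalar \(X_\ell (\sigma ) = \frac{1}{n-1}\left( \mathrm{Re}\,x_\ell (\sigma ) - \frac{n-1}{n}\right) \) for a hypothetical configuration in \({{\mathbb {Z}}}/5\) that puts the extreme allowed mass \(5/6\) at the identity, and reports the quantities \(-n X_\ell (\sigma )\), each of which is a candidate for the gap \(\gamma _G\) in Lemma 3.3. The function name and the example counts are assumptions made for illustration, and q is chosen prime so that \(\{0\}\) is the only proper subgroup.

```python
import math


def x_scalar(counts, ell):
    """Scalar X_ell(sigma) for G = Z/q: (Re x_ell - (n-1)/n) / (n-1)."""
    q = len(counts)
    n = sum(counts)
    re_x = sum(
        counts[a] / n * math.cos(2 * math.pi * a * ell / q) for a in range(q)
    )
    return (re_x - (n - 1) / n) / (n - 1)


# Hypothetical example: q = 5, n = 60, and n_0 / n = 5/6.
counts = [50, 10, 0, 0, 0]
n = sum(counts)
gaps = [-n * x_scalar(counts, ell) for ell in range(1, len(counts))]
print(min(gaps))
```

A positive minimum means that \(|1 + X_\ell (\sigma )| \le 1 - \gamma /n\) holds for this configuration with \(\gamma \) equal to that minimum, which is the scalar form of the conclusion of Lemma 3.3.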

Proof of Lemma 3.2

Fix \(\rho \in G^*\). Let \({{\mathcal {G}}}_t\) denote the event where for all \(0 \le s \le t\), we have \(\Vert I_{d_\rho }+X_\rho (s)\Vert _{op} \le 1 - \frac{\gamma _G}{n}\), where \(\gamma _G\) is taken as in Lemma 3.3. Since our chain starts at \(\sigma \in {{\mathcal {S}}}_{non}\left( 1/3 \right) \), Lemmas 2.1 and 3.3 together imply that

$$\begin{aligned} {{\mathbb {P}}}_\sigma ({{\mathcal {G}}}_{n^2}^{{\textsf {c}}}) \le C_G n^2 e^{-n/10}. \end{aligned}$$

Next, we turn to (9). Rearranging (9) and squaring, we have

$$\begin{aligned} \Vert x_\rho (t+1)\Vert _{\mathrm{HS}}^2&= \Vert x_\rho (t)(I_{d_\rho } + X_\rho (t))\Vert _{\mathrm{HS}}^2 + \Vert \widehat{M}_\rho (t+1)\Vert _{\mathrm{HS}}^2 \nonumber \\&\quad + 2\mathrm{Re}\langle x_\rho (t)(I_{d_\rho } + X_\rho (t)), \widehat{M}_\rho (t+1) \rangle _{\mathrm{HS}} \end{aligned}$$
(12)

Let \(z_t := \mathbf{1}_{{{\mathcal {G}}}_t} \Vert x_\rho (t)\Vert _{\mathrm{HS}}^2\) and

$$\begin{aligned} M'(t+1) := \Vert \widehat{M}_\rho (t+1)\Vert _{\mathrm{HS}}^2 + 2\mathrm{Re}\langle x_\rho (t)(I_{d_\rho } + X_\rho (t)), \widehat{M}_\rho (t+1) \rangle _{\mathrm{HS}}. \end{aligned}$$

Substituting into (12), we obtain

$$\begin{aligned} z_{t+1} \le \Vert I_{d_\rho } + X_\rho (t)\Vert _{op}^2 \cdot z_t + \mathbf{1}_{{{\mathcal {G}}}_t} M'(t+1) \le \left( 1 - \frac{\gamma _G}{n}\right) ^2 z_t + \mathbf{1}_{{{\mathcal {G}}}_t} M'(t+1). \end{aligned}$$

Note that we have the bounds

$$\begin{aligned} {{\mathbb {E}}}[ M'(t+1) \mid {{\mathcal {F}}}_t ]= & {} {{\mathbb {E}}}[ \Vert \widehat{M}_\rho (t+1)\Vert _{\mathrm{HS}}^2 \mid {{\mathcal {F}}}_t ] \le \frac{4{{\mathcal {Q}}}^2d_\rho }{n^2} \\ |M'(t+1)|\le & {} \Vert \widehat{M}_\rho (t+1)\Vert _{\mathrm{HS}}^2 + 2\sqrt{d_\rho }\left( 1+\frac{1}{n(n-1)}\right) \Vert \widehat{M}_\rho (t+1)\Vert _{\mathrm{HS}}\\\le & {} \frac{6{{\mathcal {Q}}}^2 d_\rho }{n}. \end{aligned}$$

We now apply Lemma 3.1 with \(\varepsilon = \frac{1}{n}\), \(\varphi (t) = \gamma _G\), \(D = 6{{\mathcal {Q}}}^2 d_\rho \), and \(\lambda = n^{1/4}\). This yields

$$\begin{aligned} {{\mathbb {P}}}\left( z_t \ge n^{-1/4} + e^{-\gamma _G t/n}\cdot z_0 \right) \le C'_G e^{-c'_G \sqrt{n}}. \end{aligned}$$

Consequently,

$$\begin{aligned} {{\mathbb {P}}}_\sigma \left( \Vert x_\rho (t)\Vert _{\mathrm{HS}} \ge n^{-1/8} + e^{-\gamma _G t/(2n)} \cdot \Vert x_\rho (0)\Vert _{\mathrm{HS}} \right) \le C'_G e^{-c'_G \sqrt{n}} + C_G n^2 e^{-n/10}. \end{aligned}$$

The lemma with \(c_G = \gamma _G/2\) then follows from union bounding over all \(1 \le t \le n^2\) and taking n sufficiently large. \(\square \)
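The decay described by Lemma 3.2 is easy to observe in simulation. The following Python sketch runs the product replacement chain on \(({{\mathbb {Z}}}/q)^n\), written additively, and reports the largest non-trivial Fourier coefficient \(\max _{\ell \ne 0} |x_\ell (\sigma _t)|\) before and after a number of steps proportional to n. The starting configuration (half zeros, half ones), the parameters, and the function names are assumptions chosen for illustration; the sketch is an experiment, not part of the proof.

```python
import cmath
import random


def run_chain(q, n, steps, seed=0):
    """Run product replacement on (Z/q)^n from half 0s, half 1s; return final counts."""
    rng = random.Random(seed)
    sigma = [0] * (n // 2) + [1] * (n - n // 2)
    for _ in range(steps):
        i, j = rng.sample(range(n), 2)  # uniformly random ordered pair, i != j
        sign = rng.choice((1, -1))
        # Additive form of sigma(i) -> sigma(i) * sigma(j)^{+-1}.
        sigma[i] = (sigma[i] + sign * sigma[j]) % q
    counts = [0] * q
    for value in sigma:
        counts[value] += 1
    return counts


def max_nontrivial_coefficient(counts):
    """Return the maximum of |x_ell| over ell = 1, ..., q - 1."""
    q, n = len(counts), sum(counts)
    omega = cmath.exp(2j * cmath.pi / q)
    return max(
        abs(sum(counts[a] / n * omega ** (a * ell) for a in range(q)))
        for ell in range(1, q)
    )


q, n = 5, 200
before = max_nontrivial_coefficient([n // 2, n - n // 2, 0, 0, 0])
after = max_nontrivial_coefficient(run_chain(q, n, steps=20 * n))
print(before, after)
```

After order n steps the coefficients are expected to be of the size of the random fluctuations, which is the regime in which the additive term \(n^{-1/8}\) in the lemma takes over from the exponentially decaying term.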

3.5 Proportion matrix chain: Proof of Lemma 2.4

We carry out a similar albeit more refined strategy to analyze the proportion matrix. Throughout this section, we assume our Markov chain \((\sigma _t)_{t \ge 0}\) starts at an initial state \(\sigma _*\in {{\mathcal {S}}}_*\left( \frac{1}{4{{\mathcal {Q}}}} \right) \). We again write \(n_a(t)=n_a(\sigma _t)\) and \(n_{a, b}(t)=n^{\sigma _*}_{a, b}(\sigma _t)\), and similar to before, the \(n_{a, b}(t)\) satisfy the difference equation

$$\begin{aligned} n_{a, b}(t+1)-n_{a, b}(t)= & {} \sum _{c \in G}\frac{n_{a, bc^{-1}}(t)n_c(t)}{2n(n-1)}+\sum _{c \in G}\frac{n_{a, bc}(t)n_c(t)}{2n(n-1)} \nonumber \\&- \frac{n_{a, b}(t)}{n} + M_{a, b}(t+1), \end{aligned}$$
(13)

where \({{\mathbb {E}}}[M_{a, b}(t+1) \mid {{\mathcal {F}}}_t]=0\) and \(|M_{a, b}(t)| \le 2\) for all \(t \ge 0\).

We can again analyze this equation via the Fourier transform. In this case, for each \(a \in G\), we take the Fourier transform of \(\left( n_{a, b}(t)/n_a(\sigma _*)\right) _{b \in G}\). For \(\rho \in G^*\), let

$$\begin{aligned} y_{a,\rho }(t) = y_{a,\rho }^{\sigma _*}(t) := \sum _{b \in G}\frac{n_{a, b}(t)}{n_a(\sigma _*)}\rho (b) \end{aligned}$$

denote the Fourier coefficient at \(\rho \). Let \(\widehat{M}_{a, \rho }(t) := \frac{1}{n_a(\sigma _*)}\sum _{b \in G}M_{a, b}(t)\rho (b)\). Then, (13) becomes

$$\begin{aligned} y_{a, \rho }(t+1) - y_{a, \rho }(t) = y_{a, \rho }(t) X_\rho (t) + \widehat{M}_{a, \rho }(t+1). \end{aligned}$$
(14)

Note that \({{\mathbb {E}}}_\sigma [\widehat{M}_{a, \rho }(t+1) \mid {{\mathcal {F}}}_t]=0\). Also, since we assumed \(\sigma _*\in {{\mathcal {S}}}_*\left( \frac{1}{4{{\mathcal {Q}}}} \right) \), it follows that \(\frac{n_a(\sigma _*)}{n} \ge \frac{1}{2{{\mathcal {Q}}}}\). Thus, we also know \(\Vert \widehat{M}_{a, \rho }(t+1)\Vert _{\mathrm{HS}} \le \frac{4{{\mathcal {Q}}}^2\sqrt{d_\rho }}{n}\).

Again, our main step is a bound on the Fourier coefficients \(y_{a, \rho }(t)\), which will also be useful later in proving Lemma 2.5.

Lemma 3.5

Consider any \(\sigma _*, \sigma '_*\in {{\mathcal {S}}}_*\left( \frac{1}{4{{\mathcal {Q}}}} \right) \). There exist constants \(c_G, C_G > 0\) depending only on G such that for all large enough n, we have

$$\begin{aligned} {{\mathbb {P}}}_{\sigma '_*}\left( \Vert y^{\sigma _*}_{a, \rho }(t)\Vert _{\mathrm{HS}} \ge R \left( \frac{1}{\sqrt{n}} + e^{-t/n} \Vert y^{\sigma _*}_{a, \rho }(0)\Vert _{\mathrm{HS}} \right) \right) \le e^{-\Omega _G(R^2) + O_G(1)} + \frac{2}{n^3} \end{aligned}$$

for all t and \(R > 0\).

The above lemma directly implies Lemma 2.4.

Proof of Lemma 2.4

We apply Lemma 3.5 to each \(a \in G\) and \(\rho \in G^*\). Recall that \(T = \left\lceil \frac{1}{2} n \log n \right\rceil \), so that

$$\begin{aligned} \frac{1}{\sqrt{n}} + e^{-T/n} \Vert y^{\sigma _*}_{a, \rho }(0)\Vert _{\mathrm{HS}} \le \frac{2\sqrt{d_\rho }}{\sqrt{n}}. \end{aligned}$$

Then, Lemma 3.5 implies

$$\begin{aligned} {{\mathbb {P}}}_{\sigma '_*}\left( \Vert y^{\sigma _*}_{a, \rho }(T)\Vert _{\mathrm{HS}} \ge \frac{R}{\sqrt{n}} \right) \le e^{-\Omega _G(R^2) + O_G(1)} + \frac{2}{n^3}. \end{aligned}$$

Union bounding over all \(a \in G\) and \(\rho \in G^*\) and using the Plancherel formula, this yields

$$\begin{aligned} {{\mathbb {P}}}_{\sigma '_*}\left( \sigma _T \not \in {{\mathcal {S}}}_*\left( \sigma _*, \frac{R}{\sqrt{n}} \right) \right)&\le {{\mathbb {P}}}_{\sigma '_*}\left( \max _{a, \rho } \Vert y^{\sigma _*}_{a, \rho }(T)\Vert _{\mathrm{HS}} \ge \frac{R}{\sqrt{n}} \right) \\&\le e^{-\Omega _G(R^2) + O_G(1)} + \frac{2 {{\mathcal {Q}}}^2}{n^3} \le C_G e^{-R} + \frac{1}{n} \end{aligned}$$

for sufficiently large \(C_G\) and n. \(\square \)

We now prove Lemma 3.5. Before proceeding with the main proof, we need the following routine estimate as a preliminary lemma.

Lemma 3.6

Let \(\theta _n : {{\mathbb {R}}}^d \rightarrow {{\mathbb {R}}}^+\) be the function given by \(\theta _n(x) = \Vert x\Vert + \frac{1}{\sqrt{n}}e^{-\sqrt{n}\Vert x\Vert } - \frac{1}{\sqrt{n}}\). Then, we have the inequalities

$$\begin{aligned} \Vert \nabla \theta _n(x)\Vert \le 1, \qquad \theta _n(x + h) \le \theta _n(x) + \langle h, \nabla \theta _n(x) \rangle + \frac{\sqrt{n}}{2} \Vert h\Vert ^2. \end{aligned}$$

Proof

We can write \(\theta _n(x) = f(\Vert x\Vert )\), where \(f(r) = r + \frac{1}{\sqrt{n}} e^{-\sqrt{n} r} - \frac{1}{\sqrt{n}}\). By spherical symmetry, we have

$$\begin{aligned} \Vert \nabla \theta _n(x)\Vert = f'(\Vert x\Vert ) = 1 - e^{-\sqrt{n}\Vert x\Vert } \le 1, \end{aligned}$$

which is the first inequality. Again by spherical symmetry, the eigenvalues of the Hessian \(\nabla ^2 \theta _n(x)\) can be directly computed to be \(f''(\Vert x\Vert )\) and \(f'(\Vert x\Vert ) / \Vert x\Vert \). But these are bounded by

$$\begin{aligned} f''(r) \le \sqrt{n} e^{-\sqrt{n}r} \le \sqrt{n}, \qquad f'(r)/r \le \frac{1 - e^{-\sqrt{n}r}}{r} \le \sqrt{n}. \end{aligned}$$

Thus, \(\nabla ^2 \theta _n(x) \preceq \sqrt{n} I\), and the second inequality follows from Taylor expansion. \(\square \)
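Both inequalities of Lemma 3.6 can also be checked numerically. The Python sketch below evaluates \(\theta _n\) and its gradient in \({{\mathbb {R}}}^2\) and tests the two bounds on a grid of points x and increments h. The grid, the value \(n = 100\), the floating-point tolerance, and the function names are assumptions made for illustration.

```python
import math


def theta(n, x):
    """theta_n(x) = |x| + exp(-sqrt(n)|x|)/sqrt(n) - 1/sqrt(n) for x in R^d."""
    r = math.sqrt(sum(c * c for c in x))
    s = math.sqrt(n)
    return r + math.exp(-s * r) / s - 1 / s


def grad_theta(n, x):
    """Gradient f'(|x|) * x / |x|, which equals 0 at x = 0."""
    r = math.sqrt(sum(c * c for c in x))
    if r == 0.0:
        return [0.0 for _ in x]
    fprime = 1 - math.exp(-math.sqrt(n) * r)
    return [fprime * c / r for c in x]


def check_lemma(n, grid, tol=1e-12):
    """Return True if both inequalities hold for all x, h taken from grid."""
    for x in grid:
        g = grad_theta(n, x)
        if math.sqrt(sum(c * c for c in g)) > 1 + tol:
            return False
        for h in grid:
            shifted = [a + b for a, b in zip(x, h)]
            inner = sum(a * b for a, b in zip(h, g))
            quad = math.sqrt(n) / 2 * sum(b * b for b in h)
            if theta(n, shifted) > theta(n, x) + inner + quad + tol:
                return False
    return True


grid = [(0.1 * i, 0.1 * j) for i in range(-5, 6) for j in range(-5, 6)]
print(check_lemma(100, grid))
```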

Proof of Lemma 3.5

Let \(\gamma _G\) and \(c_G\) be the constants from Lemmas 3.3 and 3.2, respectively. Define the events

$$\begin{aligned} \begin{aligned} {{\mathcal {G}}}_t&:= \bigcap _{s = 0}^t \left\{ X_\rho (\sigma _s) \preceq -\frac{\gamma _G}{n} \right\} , \\ {{\mathcal {G}}}'_t&:= \bigcap _{s = 0}^t \left\{ X_\rho (\sigma _s) \preceq -\frac{1 - \sqrt{d_\rho } e^{-c_G s/n} - 2n^{-1/8}}{n} \right\} . \end{aligned} \end{aligned}$$

Note that \(\sigma '_*\in {{\mathcal {S}}}_*\left( \frac{1}{4{{\mathcal {Q}}}} \right) \subseteq {{\mathcal {S}}}_{non}\left( 1/3 \right) \). Hence, by Lemmas 2.1 and 3.3, we have \({{\mathbb {P}}}({{\mathcal {G}}}^{{\textsf {c}}}_{n^2}) \le C_G n^2 e^{-n/10}\). We also have

$$\begin{aligned} X_\rho (s)&= \frac{1}{n - 1} \left( \frac{x_\rho (s)+x_\rho (s)^*}{2} - \frac{n - 1}{n}I_{d_\rho } \right) \preceq -\frac{1}{n}\left( 1 - \frac{n \Vert x_\rho (s)\Vert _{\mathrm{HS}}}{n - 1} \right) I_{d_\rho } \\&\preceq -\frac{1}{n}\left( 1 - \Vert x_\rho (s)\Vert _{\mathrm{HS}} - \frac{\sqrt{d_\rho }}{n-1} \right) I_{d_\rho }, \end{aligned}$$

where we have used the fact that \(\left\| \frac{x_\rho (s)+x_\rho (s)^*}{2}\right\| _{op} \le \Vert x_\rho (s)\Vert _{op} \le \Vert x_\rho (s)\Vert _{\mathrm{HS}}\).

Lemma 3.2 then implies that \({{\mathbb {P}}}({{\mathcal {G}}}'^{{\textsf {c}}}_{n^2}) \le \frac{1}{n^3}\). Thus, setting

$$\begin{aligned} \begin{aligned} \varphi (t)&:= \max (\gamma _G, 1 - \sqrt{d_\rho } e^{-c_G t/n} - 2 n^{-1/8}),\\ {{\mathcal {H}}}_t&:= {{\mathcal {G}}}_t \cap {{\mathcal {G}}}'_t = \bigcap _{s = 0}^t \left\{ X_\rho (\sigma _s) \preceq -\frac{\varphi (s)}{n} \right\} , \end{aligned} \end{aligned}$$

we conclude that

$$\begin{aligned} {{\mathbb {P}}}({{\mathcal {H}}}^{{\textsf {c}}}_{n^2}) \le {{\mathbb {P}}}({{\mathcal {G}}}^{{\textsf {c}}}_{n^2}) + {{\mathbb {P}}}({{\mathcal {G}}}'^{{\textsf {c}}}_{n^2}) \le \frac{2}{n^3} \end{aligned}$$

for all large enough n.

Next, we turn to (14) and apply \(\theta _n\) to both sides, where we identify \({{\mathbb {C}}}^{d_\rho ^2}\) with \({{\mathbb {R}}}^{2d_\rho ^2}\). Using Lemma 3.6 and taking the conditional expectation, we obtain

$$\begin{aligned} {{\mathbb {E}}}\left[ \theta _n\left( y_{a, \rho }(t+1) \right) \, \Big |\, {{\mathcal {F}}}_t \right]&\le \theta _n\left( y_{a, \rho }(t) (I_{d_\rho } + X_\rho (t))\right) + \frac{8 {{\mathcal {Q}}}^4 d_\rho }{n \sqrt{n}} \\&\le \theta _n(\Vert I_{d_\rho } + X_\rho (t)\Vert _{op} \cdot y_{a, \rho }(t)) + \frac{8 {{\mathcal {Q}}}^4 d_\rho }{n \sqrt{n}} \\&\le \Vert I_{d_\rho } + X_\rho (t)\Vert _{op} \cdot \theta _n(y_{a, \rho }(t)) + \frac{8 {{\mathcal {Q}}}^4 d_\rho }{n \sqrt{n}}, \end{aligned}$$

where the second inequality follows from the variational formula for operator norm (i.e. that \(\Vert BA\Vert _{\mathrm{HS}} \le \Vert A\Vert _{op} \Vert B\Vert _{\mathrm{HS}}\)), and the third inequality follows from the fact that \(\theta _n\) is convex with \(\theta _n(0) = 0\). Thus, we may write

$$\begin{aligned} \theta _n(y_{a, \rho }(t+1)) \le \Vert I_{d_\rho } + X_\rho (t)\Vert _{op} \cdot \theta _n(y_{a, \rho }(t)) + M'(t+1) \end{aligned}$$

where

$$\begin{aligned} {{\mathbb {E}}}[ M'(t+1) \mid {{\mathcal {F}}}_t ] \le \frac{8 {{\mathcal {Q}}}^4 d_\rho }{n\sqrt{n}}, \qquad |M'(t+1)| \le \frac{8{{\mathcal {Q}}}^2 \sqrt{d_\rho }}{n}. \end{aligned}$$

Now, let \(z_t := \mathbf{1}_{{{\mathcal {H}}}_t} \theta _n(y_{a, \rho }(t))\), and note that since \(X_\rho (\sigma ) \succeq -\frac{2}{n-1} I_{d_\rho }\), we have \(\Vert I_{d_\rho } + X_\rho (t)\Vert _{op} \le 1-\frac{\varphi (t)}{n}\) whenever \({{\mathcal {H}}}_t\) holds. Thus,

$$\begin{aligned} z_{t+1} \le \Vert I_{d_\rho } + X_\rho (t)\Vert _{op} \cdot z_t + \mathbf{1}_{{{\mathcal {H}}}_t}M'(t+1) \le \left( 1 - \frac{1}{n} \varphi (t) \right) z_t + \mathbf{1}_{{{\mathcal {H}}}_t}M'(t+1). \end{aligned}$$

We may then apply Lemma 3.1 with \(\varepsilon = \frac{1}{n}\) and \(D = 8{{\mathcal {Q}}}^4 d_\rho \). Note that

$$\begin{aligned} \int _0^t \varphi (s) \,ds&\ge \left( 1 - 2 n^{-\frac{1}{8}}\right) t - \sqrt{d_\rho } \int _0^\infty e^{-\frac{c_G s}{n}} \,ds \ge t - O_G(n) \end{aligned}$$

for all large enough n. Thus, Lemma 3.1 implies that

$$\begin{aligned} {{\mathbb {P}}}\left( z_t \ge \frac{\lambda }{\sqrt{n}} + C_G e^{-t/n} \cdot z_0 \right) \le C'_G e^{-c'_G \lambda ^2}. \end{aligned}$$
(15)

Consequently,

$$\begin{aligned}&{{\mathbb {P}}}\left( \Vert y_{a, \rho }(t)\Vert _{\mathrm{HS}} \ge R\left( \frac{1}{\sqrt{n}} + e^{-\frac{t}{n}}\Vert y_{a, \rho }(0)\Vert _{\mathrm{HS}} \right) \right) \\&\quad \le {{\mathbb {P}}}\left( \theta _n(y_{a, \rho }(t)) \ge \frac{R - 1}{\sqrt{n}} + Re^{-\frac{t}{n}}\Vert y_{a, \rho }(0)\Vert _{\mathrm{HS}} \right) \\&\quad \le {{\mathbb {P}}}\left( \theta _n(y_{a, \rho }(t)) \ge \frac{R - 1}{\sqrt{n}} + Re^{-\frac{t}{n}}\theta _n(y_{a, \rho }(0)) \right) \\&\quad \le {{\mathbb {P}}}\left( z_t \ge \frac{R - 1}{\sqrt{n}} + Re^{-\frac{t}{n}}z_0 \right) + {{\mathbb {P}}}({{\mathcal {H}}}^{{\textsf {c}}}_{n^2}) \\&\quad \le e^{-\Omega _G(R^2) + O_G(1)} + \frac{2}{n^3}, \end{aligned}$$

as desired. \(\square \)

4 Construction of the coupling: Proof of Lemma 2.5

For each \(\delta >0\), we define a subset of \(\{0, 1, \ldots , n\}^{G \times G}\) by

$$\begin{aligned} {{\mathcal {M}}}_\delta :=\left\{ (n_{a, b})_{a, b \in G} \ : \ n_{a, b} \ge \frac{(1-\delta ) n}{{{\mathcal {Q}}}^2} \ \text {for every} \, a, b \in G \, \text {and}\ \sum _{a, b \in G}n_{a, b}=n\right\} . \end{aligned}$$

Lemma 4.1

Consider a configuration \(\sigma _*\in {{\mathcal {S}}}\) and a constant \(0<\delta \le \frac{1}{2{{\mathcal {Q}}}^2}\), and assume that \((1 - \delta )n/{{\mathcal {Q}}}^2\) is an integer. Let \((\sigma _t)_{t \ge 0}\) and \(({{\tilde{\sigma }}}_t)_{t \ge 0}\) be two product replacement chains started at \(\sigma \) and \({{\tilde{\sigma }}}\), respectively. Then, there exists a coupling \((\sigma _t, {{\tilde{\sigma }}}_t)\) of the Markov chains satisfying the following:

Let

$$\begin{aligned} D_t:=\frac{1}{2}\sum _{a, b \in G}|n^{\sigma _*}_{a, b}(\sigma _t) - n^{\sigma _*}_{a, b}({{\tilde{\sigma }}}_t)|. \end{aligned}$$

Then, on the event \(\{(n^{\sigma _*}_{a, b}(\sigma _t))_{a, b \in G}, (n^{\sigma _*}_{a, b}({{\tilde{\sigma }}}_t))_{a, b \in G} \in {{\mathcal {M}}}_\delta \}\) and \(\{D_t > 0\}\), one has

$$\begin{aligned} {{\mathbb {E}}}_{\sigma , {{\tilde{\sigma }}}}[D_{t+1}-D_t \mid \sigma _t, {{\tilde{\sigma }}}_t]&\le 0, \end{aligned}$$
(16)
$$\begin{aligned} {{\mathbb {P}}}_{\sigma , {{\tilde{\sigma }}}}\left( D_{t+1} - D_t \ne 0 \mid \sigma _t, {{\tilde{\sigma }}}_t \right)&\ge \frac{(1-\delta )^2}{4{{\mathcal {Q}}}^3}. \end{aligned}$$
(17)

Proof

Let us abbreviate \(n_{a, b}(t) = n^{\sigma _*}_{a, b}(\sigma _t)\) and \({\tilde{n}}_{a, b}(t) = n^{\sigma _*}_{a, b}({{\tilde{\sigma }}}_t)\). Let \(m_{a, b}(t):=\min (n_{a, b}(t), {\tilde{n}}_{a, b}(t))\). For each \(a \in G\), we define the quantity

$$\begin{aligned} d_a(t) := \frac{1}{2}\sum _{b \in G} |n_{a, b}(t) - \tilde{n}_{a, b}(t)| = \sum _{b \in G} n_{a, b}(t) - \sum _{b \in G} m_{a, b}(t), \end{aligned}$$

so that \(D_t = \sum _{a \in G} d_a(t)\).

For accounting purposes, it is helpful to introduce two sequences

$$\begin{aligned} (x_1, x_2, \ldots , x_n) \text { and } ({\tilde{x}}_1, {\tilde{x}}_2, \ldots , {\tilde{x}}_n) \end{aligned}$$

of elements of \(G \times G\). These sequences are chosen so that the number of \(x_k\) equal to \((a, b)\) is exactly \(n_{a, b}\), and similarly the number of \({\tilde{x}}_k\) equal to \((a, b)\) is \({\tilde{n}}_{a, b}\). Moreover, we arrange their indices in a coordinated fashion, as described below.

We define three families of disjoint sets: \(P_{a, b}\), \(Q_a\), and \(R_a \subset [n]\).

  • For each \(a, b \in G\), let \(P_{a, b}\) be a set of size \((1 - \delta )n/{{\mathcal {Q}}}^2\) such that for any \(k \in P_{a, b}\), we have \(x_k ={\tilde{x}}_k = (a, b)\). (This is possible provided that \((n_{a, b}(t)), ({{\tilde{n}}}_{a, b}(t)) \in {{\mathcal {M}}}_\delta \) holds.)

  • For each \(a \in G\), let \(Q_a\) be a set of size \(\sum _{b \in G}(m_{a, b} - |P_{a, b}|)\) such that for any \(k \in Q_a\), \(x_k ={\tilde{x}}_k= (a, b)\) for some b. (Note that \(Q_a\) may be empty.)

  • For each \(a \in G\), let \(R_a\) be a set of size \(d_a\) such that for any \(k \in R_a\), \(x_k\) and \({\tilde{x}}_k\) both have a as their first coordinate. (This \(R_a\) is well-defined since \(\sum _b n_{a, b} = \sum _b {{\tilde{n}}}_{a, b}\) for each a; it may also be empty.)

Define

$$\begin{aligned} P := \bigsqcup _{a, b \in G} P_{a, b}, \qquad Q := \bigsqcup _{a \in G} Q_a, \qquad R := \bigsqcup _{a \in G} R_a. \end{aligned}$$
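The sizes of these sets are determined by the two count matrices, and the following Python sketch computes them and confirms that P, Q and R together account for all n indices. The function name and the example matrices (for a group of order 2 with \(n = 32\) and \(\delta = 1/8\)) are hypothetical; exact rational arithmetic is used so that the integrality assumption on \((1 - \delta )n/{{\mathcal {Q}}}^2\) can be checked.

```python
from fractions import Fraction


def bookkeeping_sizes(n_counts, nt_counts, delta):
    """Return (|P|, |Q|, |R|) for count matrices n_{a,b} and tilde n_{a,b}.

    Both matrices are lists of rows indexed by a, with equal row sums, and
    delta is a Fraction such that (1 - delta) * n / Q^2 is an integer.
    """
    group_order = len(n_counts)
    n = sum(map(sum, n_counts))
    assert all(sum(r) == sum(rt) for r, rt in zip(n_counts, nt_counts))
    p_each = (1 - delta) * n / group_order**2  # common size of each P_{a,b}
    assert p_each.denominator == 1
    # m_{a,b} = min(n_{a,b}, tilde n_{a,b}).
    m = [[min(u, v) for u, v in zip(r, rt)] for r, rt in zip(n_counts, nt_counts)]
    # Both matrices must lie in M_delta for the sets P_{a,b} to exist.
    assert all(v >= p_each for row in m for v in row)
    size_p = int(p_each) * group_order**2
    size_q = sum(map(sum, m)) - size_p
    # |R_a| = d_a = sum_b n_{a,b} - sum_b m_{a,b}, so |R| = D_t.
    size_r = sum(sum(r) - sum(rm) for r, rm in zip(n_counts, m))
    return size_p, size_q, size_r


sizes = bookkeeping_sizes([[9, 7], [8, 8]], [[7, 9], [8, 8]], Fraction(1, 8))
print(sizes, sum(sizes))
```

In particular \(|P| = (1 - \delta )n\) and \(|R| = D_t\), so \(|Q| + |R| = \delta n\).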

Suppose that \(D_t > 0\), so that for some \(a_*, b_*, b_*' \in G\) we have \(n_{a_*, b_*} > {\tilde{n}}_{a_*, b_*}\) and \(n_{a_*, b'_*} < {\tilde{n}}_{a_*, b'_*}\). Let us consider all possible ways to sample a pair of indices and a sign \((k, l, s) \in \{1, 2, \ldots , n\}^2 \times \{\pm 1\}\) with \(k \ne l\).

Suppose \(x_k = (a_k, b_k)\) and \(x_l = (a_l, b_l)\). We think of \((k, l, +1)\) as corresponding to a move on \((n_{a, b}(t))\) where \(n_{a_k, b_k}\) is decremented and \(n_{a_k, (b_k \cdot b_l)}\) is incremented. Similarly, \((k, l, -1)\) corresponds to a move where \(n_{a_k, b_k}\) is decremented and \(n_{a_k, (b_k \cdot b_l^{-1})}\) is incremented. We may also think of \((k, l, \pm 1)\) as corresponding to moves on \(({{\tilde{n}}}_{a, b}(t))\) in an analogous way.

Fig. 1 Illustration of cases (i)–(iv)

We now analyze four cases, as illustrated in Fig. 1.

(i) Case \((k, l) \in (P \sqcup Q) \times (P \sqcup Q)\). For all but an exceptional situation described below, we apply the move corresponding to \((k, l, s)\) to both states \((n_{a, b}(t))\) and \(({{\tilde{n}}}_{a, b}(t))\). In these cases, \(D_{t+1}=D_t\).

We now describe the exceptional situation. Define

$$\begin{aligned} S = P_{a_*, b_*} \times \left( \bigsqcup _{c \in G} P_{c, (b_*^{-1} \cdot b'_*)}\right) \qquad \text {and}\qquad S' = P_{a_*, b'_*} \times \left( \bigsqcup _{c \in G} P_{c, \mathrm{id}}\right) . \end{aligned}$$

Then, the exceptional situation occurs when \(s = +1\) and \((k, l) \in S \sqcup S'\).

Take any bijection \(\tau \) from S to \(S'\). If \((k, l) \in S\), then we apply \((k, l, +1)\) to \((n_{a, b}(t))\) while applying \((\tau (k, l), +1)\) to \(({{\tilde{n}}}_{a, b}(t))\). This increments \(n_{a_*, b'_*}\), decrements \(n_{a_*, b_*}\), and has no effect on the \(({\tilde{n}}_{a, b}(t))\). The overall effect is that \(D_{t+1} = D_t - 1\).

If instead \((k, l) \in S'\), then we apply \((k, l, +1)\) to \((n_{a, b}(t))\) and \((\tau ^{-1}(k, l), +1)\) to \(({{\tilde{n}}}_{a, b}(t))\). A similar analysis shows that in this case \(D_{t+1} = D_t + 1\).

The exceptional event occurs with probability \(\frac{(1 - \delta )^2}{2{{\mathcal {Q}}}^3}\), and when it occurs, \(D_t\) increases or decreases by 1 with equal probability. Thus, the exceptional situation plays the role of introducing some unbiased fluctuation in \(D_t\) and gives us (17).

(ii) Case \((k, l) \in (Q \sqcup R) \times (Q \sqcup R)\) but \((k, l) \not \in Q \times Q\). This occurs with probability

$$\begin{aligned} \frac{1}{n(n-1)}\left( (|Q|+|R|)(|Q|+|R|-1)-|Q|(|Q|-1)\right) \end{aligned}$$

which is at most

$$\begin{aligned} \frac{2}{n(n - 1)}(|Q| + |R|) |R| = \frac{2 \delta }{n - 1}D_t. \end{aligned}$$

Apply the move corresponding to \((k, l, s)\) to both states. This increases \(D_t\) by at most 1. We will see later that the effect of this case is small compared to the other cases.
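The probability bound in case (ii) can be checked by expanding the product. The computation below is a sketch that uses \(|P| = (1-\delta )n\), which follows from the sizes of the sets \(P_{a, b}\), and assumes \(|Q| + |R| = n - |P|\) and \(|R| = D_t\), as suggested by the equality in the display above:

$$\begin{aligned} (|Q|+|R|)(|Q|+|R|-1)-|Q|(|Q|-1) = 2|Q||R| + |R|^2 - |R| \le 2(|Q| + |R|)|R| = 2 \delta n D_t. \end{aligned}$$

Dividing by \(n(n-1)\) gives the stated bound \(\frac{2\delta }{n-1}D_t\).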

(iii) Case \((k, l) \in P \times R\). This occurs with probability

$$\begin{aligned} \frac{1}{n(n - 1)}|P| |R| =\frac{1 - \delta }{n - 1}D_t. \end{aligned}$$

Apply the move corresponding to \((k, l, s)\) to both states. Again, this increases \(D_t\) by at most 1, but with positive probability \(D_t\) does not increase.

Suppose that \(x_l = (a_1, b_1)\) and \({\tilde{x}}_l = (a_1, {\tilde{b}}_1)\), and suppose that \(k \in P_{a_2, b_2}\). Then the move has the effect of decreasing \(n_{a_2, b_2}\) and \({\tilde{n}}_{a_2, b_2}\) while increasing \(n_{a_2, (b_2 \cdot b_1^s)}\) and \({\tilde{n}}_{a_2, (b_2\cdot {\tilde{b}}_1^s)}\). Note that conditioned on this case happening, \((a_2, b_2)\) is distributed uniformly over \(G \times G\). When \((a_2, (b_2\cdot {\tilde{b}}_1^s)) = (a_*, b_*)\) or \((a_2, (b_2\cdot b_1^s)) = (a_*, b'_*)\), the move does not increase \(D_t\). Therefore there is at least a \(2/{{\mathcal {Q}}}^2\) chance that \(D_t\) is actually not increased. Hence, the probability that \(D_t\) is increased by 1 is at most

$$\begin{aligned} \left( 1 - \frac{2}{{{\mathcal {Q}}}^2}\right) \frac{1 - \delta }{n-1}D_t. \end{aligned}$$

(iv) Case \((k, l) \in R \times P\). This occurs with probability

$$\begin{aligned} \frac{1}{n(n - 1)} |R| |P| = \frac{1 - \delta }{n - 1} D_t. \end{aligned}$$

Suppose that \(x_k = (a, b)\) and \({\tilde{x}}_k = (a, {\tilde{b}})\). Let \(\tau \) be a permutation of P such that for \(l \in P_{a, c}\), one has \(\tau (l) \in P_{a, {\tilde{b}}^{-1}\cdot b \cdot c^s}\). Then apply \((k, l, s)\) to \((n_{a, b}(t))\) and apply \((k, \tau (l), s)\) to \(({{\tilde{n}}}_{a, b}(t))\). This always decreases \(D_t\) by 1.

Let us now summarize what we know when \((n_{a, b}(t)), ({{\tilde{n}}}_{a, b}(t)) \in {{\mathcal {M}}}_\delta \) and \(D_t>0\). From Cases (i), (ii), and (iii), we have

$$\begin{aligned} {{\mathbb {P}}}_{\sigma , {{\tilde{\sigma }}}}(D_{t+1} = D_t + 1 \mid \sigma _t, {{\tilde{\sigma }}}_t) \le \left( 1-\frac{2(1-\delta )}{{{\mathcal {Q}}}^2}+\delta \right) \frac{D_t}{n-1} + \frac{(1 - \delta )^2}{4{{\mathcal {Q}}}^3}. \end{aligned}$$

From Cases (i) and (iv), we have

$$\begin{aligned} {{\mathbb {P}}}_{\sigma , {{\tilde{\sigma }}}}(D_{t+1} = D_t - 1 \mid \sigma _t, {{\tilde{\sigma }}}_t) \ge (1-\delta )\frac{D_t}{n-1} + \frac{(1 - \delta )^2}{4{{\mathcal {Q}}}^3}. \end{aligned}$$

Therefore, if \(0<\delta \le \frac{1}{2{{\mathcal {Q}}}^2}\), then

$$\begin{aligned} {{\mathbb {E}}}_{\sigma , {{\tilde{\sigma }}}}[D_{t+1}-D_t \mid \sigma _t, {{\tilde{\sigma }}}_t] \le 0, \end{aligned}$$

verifying (16).
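The threshold on \(\delta \) can be made explicit by subtracting the two displayed bounds. The following is a sketch of that arithmetic; it uses only that \(D_t\) never increases by more than 1 in a single step, so that the conditional expectation is at most the probability of an increase minus the probability of a decrease by 1:

$$\begin{aligned} {{\mathbb {E}}}_{\sigma , {{\tilde{\sigma }}}}[D_{t+1}-D_t \mid \sigma _t, {{\tilde{\sigma }}}_t] \le \left( 1-\frac{2(1-\delta )}{{{\mathcal {Q}}}^2}+\delta - (1-\delta )\right) \frac{D_t}{n-1} = 2\left( \delta - \frac{1-\delta }{{{\mathcal {Q}}}^2}\right) \frac{D_t}{n-1}. \end{aligned}$$

The right-hand side is nonpositive exactly when \(\delta ({{\mathcal {Q}}}^2 + 1) \le 1\), and \(\delta \le \frac{1}{2{{\mathcal {Q}}}^2}\) is sufficient because \(2{{\mathcal {Q}}}^2 \ge {{\mathcal {Q}}}^2 + 1\).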

To fully define the coupling, when \(D_t = 0\), we can couple \(\sigma _t\) and \({{\tilde{\sigma }}}_t\) to be identical, and if either \((n_{a, b}(t)) \notin {{\mathcal {M}}}_\delta \) or \(({{\tilde{n}}}_{a, b}(t)) \notin {{\mathcal {M}}}_\delta \), we may run the two chains independently. \(\square \)

Proof of Lemma 2.5

Since \(\sigma \in {{\mathcal {S}}}_*\left( \sigma _*, \frac{R}{\sqrt{n}} \right) \), we must have for each \(a \in G\) and \(\rho \in G^*\) that \(\Vert y^{\sigma _*}_{a, \rho }(\sigma )\Vert _{\mathrm{HS}} \le \frac{R}{\sqrt{n}}\). Note that for large enough n, we have \({{\mathcal {S}}}_*\left( \sigma _*, \frac{R}{\sqrt{n}} \right) \subseteq {{\mathcal {S}}}_*\left( \frac{1}{5{{\mathcal {Q}}}^3} \right) \). Thus, we may apply Lemma 3.5 to obtain

$$\begin{aligned} {{\mathbb {P}}}\left( \bigcup _{t = 0}^{n^2} \left\{ \Vert y^{\sigma _*}_{a, \rho }(\sigma _t)\Vert _{\mathrm{HS}} \ge \frac{1}{5{{\mathcal {Q}}}^3} \right\} \right) \le n^2 \left( e^{-\Omega _{G}(n) + O_{G}(1)} + \frac{2}{n^3} \right) \le \frac{3}{n} \end{aligned}$$
(18)

for large enough n. Define the event

$$\begin{aligned} {{\mathcal {G}}}_t := \left\{ \sigma _s \in {{\mathcal {S}}}_*\left( \sigma _*, \frac{1}{5{{\mathcal {Q}}}^3} \right) \text { for all} \, 1 \le s \le t \right\} . \end{aligned}$$

The Plancherel formula applied to (18) implies that \({{\mathbb {P}}}({{\mathcal {G}}}^{{\textsf {c}}}_{n^2}) \le \frac{3{{\mathcal {Q}}}^2}{n}\). We may analogously define an event \({{\tilde{{{\mathcal {G}}}}}}_t\) for \({{\tilde{\sigma }}}\) and let \({{\mathcal {A}}}_t := {{\mathcal {G}}}_t \cap {{\tilde{{{\mathcal {G}}}}}}_t\). Thus, \({{\mathbb {P}}}({{\mathcal {A}}}_{n^2}^{{\textsf {c}}}) \le \frac{6{{\mathcal {Q}}}^2}{n}\).

Pick \(\delta ' \in \left( \frac{2}{5{{\mathcal {Q}}}^2}, \frac{3}{7{{\mathcal {Q}}}^2}\right) \) so that \((1 - \delta ')n/{{\mathcal {Q}}}^2\) is an integer. Note that when \({{\mathcal {A}}}_t\) holds, we have

$$\begin{aligned} \sigma _t \in {{\mathcal {S}}}_*\left( \sigma _*, \frac{1}{5{{\mathcal {Q}}}^3} \right) \quad \text {and} \quad \sigma _*\in {{\mathcal {S}}}_*\left( \frac{1}{5{{\mathcal {Q}}}^3} \right) \implies (n_{a, b}(t)) \in {{\mathcal {M}}}_{\frac{2}{5{{\mathcal {Q}}}^2}} \subseteq {{\mathcal {M}}}_{\delta '}, \end{aligned}$$

and similarly \(({{\tilde{n}}}_{a, b}(t)) \in {{\mathcal {M}}}_{\delta '}\).

Thus, we may invoke Lemma 4.1 to give a coupling between \(\sigma \) and \({{\tilde{\sigma }}}\) where on the event \({{\mathcal {A}}}_t\), the quantity \(D_t\) is more likely to decrease than increase. Letting \(\mathbf{D}_t:=\mathbf{1}_{{{\mathcal {A}}}_t}D_t\), we see that \((\mathbf{D}_t)\) is a supermartingale with respect to \(({{\mathcal {F}}}_t)\).

Define

$$\begin{aligned} \tau := \min \{ t \ge 0 : D_t=0 \}, \qquad {{{\tilde{\tau }}}} := \min \{ t \ge 0 : \mathbf{D}_t=0 \}. \end{aligned}$$

Then, Lemma 4.1 ensures that on the event \(\{{{\tilde{\tau }}}> t\}\), we have \(\mathrm{Var}(\mathbf{D}_{t+1}\mid {{\mathcal {F}}}_t) \ge \alpha ^2\), where \(\alpha ^2:=\left( 1- \frac{1}{{{\mathcal {Q}}}^2}\right) \frac{(1-\delta ')^2}{4{{\mathcal {Q}}}^3}\). By [13, Proposition 17.20], for every \(u > 12/\alpha ^2\),

$$\begin{aligned} {{\mathbb {P}}}({{\tilde{\tau }}} > u) \le \frac{4 \mathbf{D}_0}{\alpha \sqrt{u}}. \end{aligned}$$
(19)

Recall that \(T = \left\lceil \beta n \right\rceil \) and \(D_0 \le \sqrt{{{\mathcal {Q}}}} R\sqrt{n}\). As long as \(\beta \) is large enough, we may apply (19) with \(u = T\) to get

$$\begin{aligned} {{\mathbb {P}}}_{\sigma , {{\tilde{\sigma }}}}(\tau > T) \le \frac{16 {{\mathcal {Q}}}^2 R}{(1-\delta ')\sqrt{\beta }} + {{\mathbb {P}}}({{\mathcal {A}}}_T^{{\textsf {c}}}) \le \frac{32 {{\mathcal {Q}}}^2 R}{\sqrt{\beta }} \end{aligned}$$

for all large enough n, as desired. \(\square \)
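The constants in the last display can be traced as follows. This is a sketch assuming \({{\mathcal {Q}}}\ge 2\), so that \(\sqrt{1 - 1/{{\mathcal {Q}}}^2} \ge 1/2\), and using \(\mathbf{D}_0 \le D_0\) and \(T \ge \beta n\). On the event \({{\mathcal {A}}}_T\) we have \(\mathbf{D}_t = D_t\) for all \(t \le T\), so \(\{\tau> T\} \subseteq \{{{\tilde{\tau }}} > T\} \cup {{\mathcal {A}}}_T^{{\textsf {c}}}\), and (19) gives

$$\begin{aligned} {{\mathbb {P}}}({{\tilde{\tau }}} > T) \le \frac{4 D_0}{\alpha \sqrt{\beta n}} \le \frac{4\sqrt{{{\mathcal {Q}}}}R}{\sqrt{\beta }} \cdot \frac{2{{\mathcal {Q}}}^{3/2}}{(1-\delta ')\sqrt{1 - 1/{{\mathcal {Q}}}^2}} \le \frac{16 {{\mathcal {Q}}}^2 R}{(1-\delta ')\sqrt{\beta }}. \end{aligned}$$

Since \(\delta ' < \frac{1}{2}\), this term is strictly below \(\frac{32 {{\mathcal {Q}}}^2 R}{\sqrt{\beta }}\), which leaves room to absorb \({{\mathbb {P}}}({{\mathcal {A}}}_T^{{\textsf {c}}}) \le \frac{6{{\mathcal {Q}}}^2}{n}\) once n is large.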

5 Proof of Theorem 1.1 (2)

The lower bound is proved essentially by showing that the estimates of Lemmas 2.1 and 2.4 cannot be improved. Let \(a_1, a_2, \ldots , a_k\) be a set of generators for G. Let \(\sigma _\star \in {{\mathcal {S}}}\) be the configuration given by

$$\begin{aligned} \sigma _\star (i) = {\left\{ \begin{array}{ll} a_i &{}\quad \text {if} \, i \le k, \\ \mathrm{id} &{} \quad \text {otherwise}. \end{array}\right. } \end{aligned}$$

We will analyze the Markov chain started at \(\sigma _\star \) and show that it does not mix too fast.
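As an illustration only (it plays no role in the proof), the following Python sketch builds \(\sigma _\star \) and runs the chain while tracking the number of non-identity sites. It assumes the cyclic group \({\mathbb {Z}}/m\) written additively, so that \(\mathrm{id} = 0\) and the single generator \(a_1 = 1\) suffices. The group, the function names, and the parameters `n`, `m`, `steps`, and `seed` are hypothetical choices made for this example.

```python
import random
from functools import reduce
from math import gcd


def initial_configuration(n, generators, identity=0):
    """sigma_star: generators in the first k sites, identity elsewhere."""
    return list(generators) + [identity] * (n - len(generators))


def step(sigma, m, rng):
    """One move of the chain on (Z/m)^n: pick an ordered pair (i, j) of
    distinct indices and a sign s, then set sigma[i] += s * sigma[j] mod m."""
    i, j = rng.sample(range(len(sigma)), 2)  # uniform ordered pair, i != j
    s = rng.choice((1, -1))
    sigma[i] = (sigma[i] + s * sigma[j]) % m


def non_identity_count(sigma, identity=0):
    """Number of sites that do not contain the identity."""
    return sum(1 for g in sigma if g != identity)


def generates(sigma, m):
    """In Z/m the entries generate the group iff gcd(entries, m) == 1."""
    return reduce(gcd, sigma, m) == 1


def run(n=60, m=12, steps=2000, seed=0):
    """Run the chain from sigma_star and record the non-identity count."""
    rng = random.Random(seed)
    sigma = initial_configuration(n, generators=[1])
    counts = [non_identity_count(sigma)]
    for _ in range(steps):
        step(sigma, m, rng)
        counts.append(non_identity_count(sigma))
    return sigma, counts
```

Each move changes a single site, so the recorded count changes by at most 1 per step, and the moves preserve the generating property, so no rejection step is needed in this sketch.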

Recall from Sect. 2 the notation

$$\begin{aligned} n^{\{\mathrm{id}\}}_{non}(\sigma ) = |\{ i \in [n] : \sigma (i) \ne \mathrm{id}\}| \end{aligned}$$

for the number of sites in \(\sigma \) that do not contain the identity. We first show that if we run the chain for slightly less than \(n \log n\) steps, most of the sites will still contain the identity.

Lemma 5.1

Let \(T := \left\lfloor n \log n - Rn \right\rfloor \). Then,

$$\begin{aligned} {{\mathbb {P}}}_{\sigma _\star }\left( n^{\{\mathrm{id}\}}_{non}(\sigma _T) \ge \frac{n}{3} \right) \le \frac{4{{\mathcal {Q}}}^2}{R^2}. \end{aligned}$$

Proof

Recall that in one step of our Markov chain, we pick two distinct indices \(i, j \in [n]\) and replace \(\sigma (i)\) with \(\sigma (i) \cdot \sigma (j)\) or \(\sigma (i) \cdot \sigma (j)^{-1}\). The only way for \(n^{\{\mathrm{id}\}}_{non}(\sigma _t)\) to increase after this step is if \(\sigma (i) = \mathrm{id}\) and \(\sigma (j) \ne \mathrm{id}\); in particular, site j must hold a non-identity element. Thus,

$$\begin{aligned} {{\mathbb {P}}}( n^{\{\mathrm{id}\}}_{non}(\sigma _{t+1}) = n^{\{\mathrm{id}\}}_{non}(\sigma _t) + 1 \mid n^{\{\mathrm{id}\}}_{non}(\sigma _t)) \le \frac{n^{\{\mathrm{id}\}}_{non}(\sigma _t)}{n}. \end{aligned}$$
(20)

Let \(\tau := \min \{ t \ge 0 : n^{\{\mathrm{id}\}}_{non}(\sigma _t) \ge \frac{n}{3} \}\) be the first time that \(n^{\{\mathrm{id}\}}_{non}(\sigma _t)\) is at least \(\frac{n}{3}\). We have that \(n^{\{\mathrm{id}\}}_{non}(\sigma _\star ) = k\), so it follows from (20) that \(\tau \) stochastically dominates the sum

$$\begin{aligned} G := \sum _{s = k}^{\left\lfloor n/3 \right\rfloor } G_s, \end{aligned}$$

where the \(G_s\) are independent geometric variables with success probability \(\frac{s}{n}\). Note that we have the bounds

$$\begin{aligned} {{\mathbb {E}}}G = \sum _{s = k}^{\left\lfloor n/3 \right\rfloor } \frac{n}{s} \ge n \left( \log \left\lfloor \frac{n}{3} \right\rfloor - \log k \right) , \qquad \mathrm{Var}(G) = \sum _{s = k}^{\left\lfloor n/3 \right\rfloor } \frac{n(n-s)}{s^2} \le n^2. \end{aligned}$$

Hence,

$$\begin{aligned} {{\mathbb {P}}}(\tau< T)&\;\le \; {{\mathbb {P}}}(G< T) \;\le \; {{\mathbb {P}}}(G < {{\mathbb {E}}}G + n\log (3k) - Rn ) \\&\;\le \; \frac{n^2}{n^2(R - \log (3k))^2} \le \frac{4}{R^2} \end{aligned}$$

for \(R \ge 2 {{\mathcal {Q}}}\ge 2 \log (3k)\). On the other hand, the bound claimed in the lemma is trivial for \(R \le 2 {{\mathcal {Q}}}\), so we have completed the proof. \(\square \)
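The moment bounds and the Chebyshev step in this proof can be sanity-checked numerically. The sketch below evaluates \({{\mathbb {E}}}G\) and \(\mathrm{Var}(G)\) exactly from the formulas for geometric variables and compares them with the bounds used above. The values `n = 3000`, `k = 2`, and `R = 8.0` are arbitrary choices for illustration (with \(R \ge 2\log (3k)\)), and the function names are hypothetical.

```python
from math import log


def geometric_sum_moments(n, k):
    """Mean and variance of G = sum_{s=k}^{floor(n/3)} G_s, where G_s is
    geometric with success probability s/n, so that G_s has mean n/s and
    variance n(n-s)/s^2."""
    upper = n // 3
    mean = sum(n / s for s in range(k, upper + 1))
    var = sum(n * (n - s) / s**2 for s in range(k, upper + 1))
    return mean, var


def chebyshev_bound(var, n, R, k):
    """Chebyshev bound on P(G < E G - (R - log(3k)) n)."""
    gap = (R - log(3 * k)) * n
    return var / gap**2


# Illustrative parameters (assumptions, not taken from the text).
n, k, R = 3000, 2, 8.0
mean, var = geometric_sum_moments(n, k)
mean_lower = n * (log(n // 3) - log(k))  # lower bound on E G used above
bound = chebyshev_bound(var, n, R, k)
```

For these parameters the three comparisons `mean >= mean_lower`, `var <= n**2`, and `bound <= 4 / R**2` can be checked directly.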

Next, we show that it really takes about \(\frac{1}{2}n \log n\) steps for the Fourier coefficients \(x_\rho \) to decay to \(O\left( \frac{1}{\sqrt{n}} \right) \), as suggested by Lemma 2.4. Note that it suffices here to analyze the \(x_\rho \) instead of the \(y_{a, \rho }\), which simplifies our analysis. In fact, it suffices to consider (the real part of) the trace of \(x_\rho \). Here the orthogonality of characters reads \(\frac{1}{{{\mathcal {Q}}}}\sum _{a \in G} \mathrm{Tr}\,\rho (a)=0\), and it takes about \(\frac{1}{2}n \log n\) steps for \(\mathrm{Re}\mathrm{Tr}\,x_\rho (t)\) to decay to \(O\left( \frac{1}{\sqrt{n}} \right) \).

Lemma 5.2

Consider any \(\rho \in G^*\) and any \(R > 5\). Let \(T := \left\lfloor \frac{1}{2}n \log n - Rn \right\rfloor \), and suppose that \(\sigma \in {{\mathcal {S}}}\) satisfies \(n^{\{\mathrm{id}\}}_{non}(\sigma ) \le \frac{n}{3}\). Then,

$$\begin{aligned} {{\mathbb {P}}}_{\sigma }\left( \Vert x_\rho (\sigma _T)\Vert _{\mathrm{HS}} \le \frac{R}{\sqrt{n}} \right) \le \frac{4{{\mathcal {Q}}}^2}{R^2}. \end{aligned}$$

Proof

Let \(z(t) := (1/d_\rho ) \mathrm{Tr}\,(x_\rho (t)+x_\rho (t)^*)/2\). Then, noting that (9) also holds for \(x_\rho (t)^*\) since \(x_{\rho ^*}(t)=x_\rho (t)^*\), we have

$$\begin{aligned} z(t+1)-z(t) = \frac{1}{n-1}\frac{1}{d_\rho } \mathrm{Tr}\,\left( \frac{x_\rho (t)+x_\rho (t)^*}{2}\right) ^2-\frac{1}{n}z(t) + M(t+1), \end{aligned}$$

where

$$\begin{aligned} {{\mathbb {E}}}[M(t+1) \mid {{\mathcal {F}}}_t]=0 \qquad \text {and} \qquad |M(t)| \le \frac{2{{\mathcal {Q}}}}{n}. \end{aligned}$$

Here we have

$$\begin{aligned} \frac{1}{d_\rho } \mathrm{Tr}\,\left( \frac{x_\rho (t)+x_\rho (t)^*}{2}\right) ^2 \ge z(t)^2. \end{aligned}$$

We compare z(t) to another process \((w(t))_{t \ge 0}\) defined by \(w(0) := \frac{1}{3}\) and

$$\begin{aligned} w(t+1) := \left( 1-\frac{1}{n}\right) w(t) + M(t+1). \end{aligned}$$
(21)

We will show by induction that \(z(t) \ge w(t)\) for all t. For the base case, note that since \(n^{\{\mathrm{id}\}}_{non}(\sigma ) \le \frac{n}{3}\), we have

$$\begin{aligned} z(0) = \frac{1}{d_\rho }\mathrm{Re}\mathrm{Tr}\,\sum _{a \in G} \frac{n_a(0)}{n} \cdot \rho (a) \ge \frac{2}{3} - \frac{1}{3} = \frac{1}{3}. \end{aligned}$$

Suppose now that \(z(t) \ge w(t)\). Then,

$$\begin{aligned} z(t+1)&\ge z(t) + \frac{1}{n-1}z(t)^2-\frac{1}{n}z(t) + M(t+1) \\&\ge \left( 1 - \frac{1}{n}\right) w(t) + M(t+1) = w(t+1), \end{aligned}$$

completing the induction.

It now suffices to lower bound w(T). To this end, applying (21) repeatedly and taking expectations, we obtain

$$\begin{aligned} {{\mathbb {E}}}w(T) = \left( 1 - \frac{1}{n}\right) ^T \cdot \frac{1}{3} \ge \frac{e^R}{6\sqrt{n}} \ge \frac{2R}{\sqrt{n}}. \end{aligned}$$

In order to calculate the variance of w(T), we can also square (21) and take the expectation, which gives us

$$\begin{aligned} \mathrm{Var}(w(T))&= {{\mathbb {E}}}w(T)^2 - ({{\mathbb {E}}}w(T))^2 \\&\le \left( 1 - \frac{1}{n}\right) ^{2T} \cdot \frac{1}{9} + n \cdot \left( \frac{2{{\mathcal {Q}}}}{n} \right) ^2 - \left( \left( 1 - \frac{1}{n}\right) ^T \cdot \frac{1}{3} \right) ^2 \\&= \frac{4{{\mathcal {Q}}}^2}{n}. \end{aligned}$$

Then, by Chebyshev’s inequality, we have

$$\begin{aligned} {{\mathbb {P}}}_{\sigma }\left( \Vert x_\rho (\sigma _T)\Vert _{\mathrm{HS}} \le \frac{R}{\sqrt{n}} \right)&\le {{\mathbb {P}}}\left( z(T) \le \frac{R}{\sqrt{n}} \right) \le {{\mathbb {P}}}\left( w(T) \le \frac{R}{\sqrt{n}} \right) \\&\le \frac{4{{\mathcal {Q}}}^2/n}{(R/\sqrt{n})^2} = \frac{4{{\mathcal {Q}}}^2}{R^2}, \end{aligned}$$

as desired. \(\square \)
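The chain of inequalities for \({{\mathbb {E}}}w(T)\) in this proof can be checked numerically for concrete parameters. The sketch below does so for `n = 10**6` and `R = 6.0`, which are arbitrary illustrative choices satisfying \(R > 5\) and \(\frac{1}{2}\log n > R\) (so that T is positive); the function name is hypothetical.

```python
from math import exp, floor, log, sqrt


def expected_w(n, R, w0=1 / 3):
    """Return T = floor(n log(n) / 2 - R n) and E w(T) = (1 - 1/n)^T * w0,
    obtained by iterating the recursion (21) and taking expectations."""
    T = floor(0.5 * n * log(n) - R * n)
    return T, (1 - 1 / n) ** T * w0


# Illustrative parameters (assumptions, not taken from the text).
n, R = 10**6, 6.0
T, mean_w = expected_w(n, R)
middle = exp(R) / (6 * sqrt(n))  # intermediate bound e^R / (6 sqrt(n))
target = 2 * R / sqrt(n)         # final bound 2R / sqrt(n)
```

With these values one can verify that `mean_w >= middle >= target`, matching the two inequalities in the display for \({{\mathbb {E}}}w(T)\).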

Proof of Theorem 1.1 (2)

Let \(T = T_1 + T_2\), where \(T_1 := \left\lfloor n \log n - \beta n \right\rfloor \) and \(T_2 := \left\lfloor \frac{1}{2} n \log n - \beta n \right\rfloor \). Fix any \(\rho \in G^*\). By Lemma 5.1 followed by Lemma 5.2, we have for large enough \(\beta \) that

$$\begin{aligned} {{\mathbb {P}}}_{\sigma _\star } \left( \sigma _T \in {{\mathcal {S}}}_*\left( \frac{\beta }{\sqrt{n}} \right) \right) \le {{\mathbb {P}}}_{\sigma _\star }\left( \Vert x_\rho (T)\Vert _{\mathrm{HS}} \le \sqrt{\frac{{{\mathcal {Q}}}}{d_\rho }}\frac{\beta }{\sqrt{n}} \right) \le \frac{8{{\mathcal {Q}}}^2}{\beta ^2}. \end{aligned}$$

On the other hand, Lemma 2.6 tells us that

$$\begin{aligned} \pi \left( {{\mathcal {S}}}_*\left( \frac{\beta }{\sqrt{n}}\right) \right) \ge 1- \frac{c_G}{\beta ^2}. \end{aligned}$$

Consequently,

$$\begin{aligned} d_{\sigma _\star }(T) \ge 1 - \frac{c_G}{\beta ^2} - \frac{8{{\mathcal {Q}}}^2}{\beta ^2}, \end{aligned}$$

which tends to 1 as \(\beta \rightarrow \infty \), establishing (2). \(\square \)