Cutoff for product replacement on finite groups

We analyze a Markov chain, known as the product replacement chain, on the set of generating n-tuples of a fixed finite group $G$. We show that as $n \rightarrow \infty$, the total-variation mixing time of the chain has a cutoff at time $\frac{3}{2} n \log n$ with window of order $n$. This generalizes a result of Ben-Hamou and Peres (who established the result for $G = \mathbb{Z}/2$) and confirms a conjecture of Diaconis and Saloff-Coste that for an arbitrary but fixed finite group, the mixing time of the product replacement chain is $O(n \log n)$.


Introduction
Let $G$ be a finite group, and let $[n] := \{1, 2, \ldots, n\}$. We consider the set $G^n$ of all functions $\sigma : [n] \to G$ (or "configurations"). We may define a Markov chain $(\sigma_t)_{t \ge 0}$ on $G^n$ as follows: if the current state is $\sigma$, then choose uniformly at random an ordered pair $(i, j)$ of distinct integers in $[n]$, and change the value of $\sigma(i)$ to $\sigma(i)\sigma(j)^{\pm 1}$, where the signs are chosen with equal probability. We will restrict the chain $(\sigma_t)_{t \ge 0}$ to the space of generating n-tuples, i.e. the set of $\sigma$ whose values generate $G$ as a group: $$S := \{\sigma \in G^n : \langle \sigma(1), \ldots, \sigma(n) \rangle = G\}.$$
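As an illustration of the dynamics just described, the following minimal sketch performs single steps of the chain. The function name `product_replacement_step`, the callbacks `mul` and `inv`, and the choice $G = \mathbb{Z}/5$ (written additively) are assumptions made for this example only.

```python
import random

def product_replacement_step(sigma, mul, inv, rng=random):
    """Apply one step in place: sigma[i] <- sigma[i] * sigma[j]^(+1 or -1)."""
    n = len(sigma)
    # sample() returns a uniformly random ordered pair of distinct indices.
    i, j = rng.sample(range(n), 2)
    # Multiply by sigma[j] or its inverse with equal probability.
    g = sigma[j] if rng.random() < 0.5 else inv(sigma[j])
    sigma[i] = mul(sigma[i], g)
    return sigma

# Illustrative group: Z/q written additively (q = 5 is an arbitrary choice).
q = 5
mul = lambda a, b: (a + b) % q
inv = lambda a: (-a) % q

sigma = [1] + [0] * 9  # the entries generate Z/5, so this tuple lies in S
for _ in range(1000):
    product_replacement_step(sigma, mul, inv)
```

Each move is invertible, so a generating tuple stays generating; this is why the chain can be restricted to $S$.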
It is not hard to see that for fixed $G$ and large enough $n$, the chain on $S$ is irreducible (see [8, Lemma 3.2]). We will always assume $n$ is large enough so that this irreducibility holds. Note that the chain is also symmetric, and it is aperiodic because some states have positive holding probability. Thus, the chain has a uniform stationary distribution $\pi$ with $\pi(\sigma) = 1/|S|$. This Markov chain was first considered in the context of computational group theory: it models the product replacement algorithm for generating random elements of a finite group introduced in [6]. By running the chain for a long enough time $t$ and choosing a uniformly random index $k \in [n]$, the element $\sigma_t(k)$ is a (nearly) uniformly random element of $G$. The product replacement algorithm has been found to perform well in practice [6, 10], but the question arises: how large does $t$ need to be in order to ensure near uniformity?
One way of answering the question is to estimate the mixing time of the Markov chain. It was shown by Diaconis and Saloff-Coste that for any fixed finite group $G$, there exists a constant $C_G$ such that the $\ell^2$-mixing time is at most $C_G n^2 \log n$ [8, 9] (see also Chung and Graham [3] for a simpler proof of this fact with a different value for $C_G$).
In another line of work, Lubotzky and Pak [12] analyzed the mixing of the product replacement chain in terms of Kazhdan constants (see also subsequent quantitative estimates for Kazhdan constants by Kassabov [11]). We also mention a result of Pak [14] which shows mixing in $\mathrm{polylog}(|G|)$ steps when $n = \Theta(\log |G| \log \log |G|)$. The reader may consult the survey [15] for further background on the product replacement algorithm.
Diaconis and Saloff-Coste conjectured that the mixing time bound can be improved to $C_G n \log n$ [9, Remark 2, Section 7, p. 290], based on the observation that at least $n \log n$ steps are needed by the classical coupon-collector's problem. This was confirmed in the case $G = \mathbb{Z}/2$ by Chung and Graham [4] and recently refined by Ben-Hamou and Peres, who show that when $G = \mathbb{Z}/2$, the chain in fact exhibits a cutoff at time $\frac{3}{2} n \log n$ in total variation with window of order $n$ [2]. In this paper, we extend the result of Ben-Hamou and Peres to all finite groups. Note that this also verifies the conjecture of Diaconis and Saloff-Coste for a fixed finite group. To state the result, let us denote the total variation distance between $P_\sigma(\sigma_t \in \cdot)$ and $\pi$ by $$d_\sigma(t) := \max_{A \subseteq S} |P_\sigma(\sigma_t \in A) - \pi(A)|.$$
Theorem 1.1 Let $G$ be a finite group. Then, the Markov chain $(\sigma_t)_{t \ge 0}$ on the set of generating n-tuples of $G$ has a total-variation cutoff at time $\frac{3}{2} n \log n$ with window of order $n$. More precisely, we have $$\lim_{\beta \to \infty} \limsup_{n \to \infty} \max_{\sigma \in S} d_\sigma\left(\tfrac{3}{2} n \log n + \beta n\right) = 0 \qquad (1)$$ and $$\lim_{\beta \to \infty} \liminf_{n \to \infty} \max_{\sigma \in S} d_\sigma\left(\tfrac{3}{2} n \log n - \beta n\right) = 1. \qquad (2)$$
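For readers who want to experiment numerically, the total variation distance between two distributions on a finite set can be computed either from its definition as a maximum over events or as half the $\ell^1$ distance. The sketch below checks that the two agree on a toy example; the function names and the toy distributions are illustrative assumptions, not part of the paper.

```python
from itertools import chain, combinations

def total_variation(p, q):
    """Half the l^1 distance between distributions given as dicts state -> prob."""
    states = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in states)

def total_variation_by_events(p, q):
    """Brute-force max over all events A of |p(A) - q(A)| (tiny state spaces only)."""
    states = sorted(set(p) | set(q))
    events = chain.from_iterable(
        combinations(states, r) for r in range(len(states) + 1)
    )
    return max(
        abs(sum(p.get(s, 0.0) for s in A) - sum(q.get(s, 0.0) for s in A))
        for A in events
    )

# Toy example: a biased distribution against the uniform one on three states.
p = {0: 0.5, 1: 0.3, 2: 0.2}
u = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
```

The brute-force version enumerates all $2^{|S|}$ events, so it is only useful as a sanity check on very small state spaces.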

A connection to cryptography
We mention another motivation for studying the product replacement chain in the case $G = (\mathbb{Z}/q)^m$ for a prime $q \ge 2$ and an integer $m \ge 1$. It comes from a public-key authentication protocol proposed by Sotiraki [16], which we now briefly describe. In the protocol, a verifier wants to check the identity of a prover based on the time needed to answer a challenge. First, the prover runs the Markov chain with $G = (\mathbb{Z}/q)^m$ and $n = m$, which can be interpreted as performing a random walk on $SL_n(\mathbb{Z}/q)$, where $\sigma(k)$ is viewed as the $k$-th row of an $n \times n$ matrix. (In each step, a random row is either added to or subtracted from another random row.) After $t$ steps, the prover records the resulting matrix $A \in SL_n(\mathbb{Z}/q)$ and makes it public. To authenticate, the verifier gives the prover a vector $x \in (\mathbb{Z}/q)^n$ and challenges her to compute $y := Ax$. The prover can perform this calculation in $O(t)$ operations by retracing the trajectory of the random walk.
Without knowing the trajectory, if $t$ is large enough, an adversary will not be able to distinguish $A$ from a random matrix and will be forced to perform the usual matrix-vector multiplication (using $n^2$ operations) to complete the challenge. Thus, the question is whether $t \ll n^2$ is large enough for the matrix $A$ to become sufficiently random, so that the prover can answer the challenge much faster than an adversary.
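The prover's shortcut can be sketched as follows: since $A$ is obtained from the identity by $t$ row operations, $Ax$ is obtained from $x$ by replaying the same operations on its coordinates. This is only a toy illustration, not the protocol of [16]; the function names and the parameters used in the usage lines are assumptions.

```python
import random

def random_row_walk(n, q, t, rng):
    """Apply t random row operations to the n x n identity over Z/q.

    Returns the resulting matrix A and the recorded trajectory of moves."""
    A = [[int(r == c) for c in range(n)] for r in range(n)]
    trajectory = []
    for _ in range(t):
        i, j = rng.sample(range(n), 2)   # distinct rows
        s = rng.choice((1, -1))          # add or subtract
        A[i] = [(a + s * b) % q for a, b in zip(A[i], A[j])]
        trajectory.append((i, j, s))
    return A, trajectory

def apply_by_retracing(trajectory, x, q):
    """Compute A x with one arithmetic operation per recorded move, i.e. O(t)."""
    y = list(x)
    for i, j, s in trajectory:
        y[i] = (y[i] + s * y[j]) % q
    return y

def apply_directly(A, x, q):
    """Ordinary matrix-vector product, using about n^2 operations."""
    return [sum(a * b for a, b in zip(row, x)) % q for row in A]

rng = random.Random(0)
A, trajectory = random_row_walk(n=6, q=7, t=50, rng=rng)
x = [rng.randrange(7) for _ in range(6)]
```

Replaying works because each row operation is left multiplication by an elementary matrix $E_s$, so $A = E_t \cdots E_1$ and $Ax = E_t(\cdots(E_1 x))$.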
Note that when $n > m$, the product replacement chain on $G = (\mathbb{Z}/q)^m$ amounts to the projection of the random walk on $SL_n(\mathbb{Z}/q)$ onto the first $m$ columns. Thus, Theorem 1.1 shows that when $m$ is fixed and $n \to \infty$, the mixing time for the first $m$ columns is around $\frac{3}{2} n \log n$. One then hopes that the mixing of several columns is enough to make it computationally intractable to distinguish $A$ from a random matrix; this would justify the authentication protocol, as $n \log n \ll n^2$. We remark that when $t$ is much larger than the mixing time of the random walk on $SL_n(\mathbb{Z}/q)$ generated by row additions and subtractions, it is information-theoretically impossible for an adversary to distinguish $A$ from a random matrix. However, the diameter of the corresponding Cayley graph on $SL_n(\mathbb{Z}/q)$ is known to be of order $\frac{n^2}{\log_q n}$ [1, 5], so a lower bound of the same order necessarily holds for the mixing time. Diaconis and Saloff-Coste [8, Section 4, p. 420] give an upper bound of $O(n^4)$, which was subsequently improved to $O(n^3)$ by Kassabov [11]. Closing the gap between $n^3$ and $\frac{n^2}{\log n}$ remains an open problem.

Outline of proof
The proof of Theorem 1.1 analyzes the mixing behavior in several stages:
• an initial "burn-in" period lasting around $n \log n$ steps, after which the group elements appearing in the configuration are not mostly confined to any proper subgroup of $G$;
• an averaging period lasting around $\frac{1}{2} n \log n$ steps, after which the counts of group elements become close to their average value under the stationary distribution; and
• a coupling period lasting $O(n)$ steps, after which our chain becomes exactly coupled to the stationary distribution with high probability.
The argument is in the spirit of [2], but a more elaborate analysis is required in the second and third stages. To analyze the first stage, for a fixed proper subgroup $H$, the number of group elements in $H$ appearing in the configuration is a birth-and-death process whose transition probabilities are easy to estimate. The analysis of the resulting chain is the same as in [2], and we can then union bound over all proper subgroups $H$.
In the second stage, for a given starting configuration $\sigma_0 \in S$, we consider quantities $n_{a,b}(\sigma)$ counting the number of sites $k$ where $\sigma_0(k) = a$ and $\sigma(k) = b$. A key observation (which also appears in [2]) is that by symmetry, projecting the Markov chain onto the values $(n_{a,b}(\sigma_t))_{a,b \in G}$ does not affect the mixing behavior. Thus, it is enough to understand the mixing behavior of the counts $n_{a,b}$.
One expects these counts to evolve towards their expected value $E_{\sigma \sim \pi}\, n_{a,b}(\sigma)$ as the chain mixes. To carry out the analysis rigorously, we write down a stochastic difference equation for the $n_{a,b}$ and analyze it via the Fourier transform. Intuitively, as $n \to \infty$, the process approaches a "hydrodynamic limit" so that it becomes approximately deterministic. It turns out that after about $\frac{1}{2} n \log n$ steps, the $n_{a,b}$ are likely to be within $O(\sqrt{n})$ of their expected value. Our analysis requires a sufficiently "generic" initial configuration, which is why the first stage is necessary.
Finally, in the last stage, we show that if $(n_{a,b}(\sigma))_{a,b \in G}$ and $(n_{a,b}(\tilde\sigma))_{a,b \in G}$ for two configurations $\sigma$ and $\tilde\sigma$ are within $O(\sqrt{n})$ in $\ell^1$ distance, they can be coupled to be exactly the same with high probability after $O(n)$ steps of the Markov chain. A standard argument involving coupling to the stationary distribution then implies a bound on the mixing time.
The main idea to prove the coupling bound is that even if the $\ell^1$ distance evolves like an unbiased random walk, there is a good chance that it will hit 0 due to random fluctuations. A similar argument is used to prove cutoff for lazy random walk on the hypercube [13, Chapter 18]. However, some careful accounting is necessary in our setting to ensure that in fact the $\ell^1$ distance does not increase in expectation and to ensure sufficient fluctuations.
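The heuristic behind the coupling stage can be checked on a toy model: a simple $\pm 1$ random walk started at distance about $R\sqrt{n}$ hits 0 within $\beta n$ steps with probability that grows with $\beta$. The sketch below is a crude stand-in for the $\ell^1$ distance, not the coupling constructed in the paper; the function names and all parameter values are assumptions chosen for illustration.

```python
import random

def hits_zero(start, steps, rng):
    """True if an unbiased +/-1 walk started at `start` reaches 0 within `steps` steps."""
    d = start
    for _ in range(steps):
        if d == 0:
            return True
        d += rng.choice((-1, 1))
    return d == 0

def hit_probability(n, R, beta, trials, rng):
    """Monte Carlo estimate of P(walk from R*sqrt(n) hits 0 within beta*n steps)."""
    start = int(R * n ** 0.5)
    steps = int(beta * n)
    return sum(hits_zero(start, steps, rng) for _ in range(trials)) / trials

rng = random.Random(0)
p_short = hit_probability(n=100, R=1.0, beta=0.25, trials=300, rng=rng)
p_long = hit_probability(n=100, R=1.0, beta=16.0, trials=300, rng=rng)
```

With these parameters one should see `p_long` well above `p_short`, in line with the picture that $O(n)$ steps suffice once the distance is of order $\sqrt{n}$.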

Organization of the paper
The rest of the paper is organized as follows. In Sect. 2, we state (without proof) the key lemmas describing the behavior in each of the three stages and use these to prove the upper bound (1) in Theorem 1.1. Sections 3 and 4 contain the proofs of these lemmas. Finally, in Sect. 5, we prove the lower bound (2) in Theorem 1.1; this is mostly a matter of verifying that the estimates used in the upper bound were tight.

Notation
Throughout this paper, we use $c, C, C', \ldots$ to denote absolute constants whose exact values may change from line to line, and also use them with subscripts, for instance $C_G$, to indicate dependence only on $G$. We also use subscripts with big-O notation, e.g. we write $O_G(\cdot)$ when the implied constant depends only on $G$.

Proof of Theorem 1.1 (1)
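Let us fix a finite group $G$ and denote its cardinality by $Q := |G|$. For a configuration $\sigma \in S$, let $n_a(\sigma)$ denote the number of sites having group element $a$, i.e., $n_a(\sigma) := |\{i \in [n] : \sigma(i) = a\}|$. For a proper subgroup $H$ of $G$, let $n^H_{\mathrm{non}}(\sigma) := |\{i \in [n] : \sigma(i) \notin H\}|$ denote the number of sites whose element lies outside $H$, and for $c \in (0, 1)$ let $$S_{\mathrm{non}}(c) := \{\sigma \in S : n^H_{\mathrm{non}}(\sigma) \ge cn \text{ for every proper subgroup } H \text{ of } G\}.$$ Thus, $S_{\mathrm{non}}(c)$ is the set of states $\sigma$ where the group elements appearing in $\sigma$ are not mostly confined to any particular proper subgroup of $G$. The next lemma shows that we reach $S_{\mathrm{non}}(1/3)$ in about $n \log n$ steps, and once we reach $S_{\mathrm{non}}(1/3)$, we remain in $S_{\mathrm{non}}(1/6)$ for $n^2$ steps with high probability. Note that $n^2$ is much larger than the overall mixing time, so we may essentially assume that we are in $S_{\mathrm{non}}(1/6)$ for all of the later stages.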
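To make the non-confinement condition concrete, here is a small sketch for the cyclic group $\mathbb{Z}/q$, whose proper subgroups are the sets of multiples of $d$ for divisors $d > 1$ of $q$. It assumes that membership in $S_{\mathrm{non}}(c)$ means having at least $cn$ sites outside every proper subgroup; the function names are hypothetical.

```python
from collections import Counter

def site_counts(sigma):
    """n_a(sigma): number of sites holding each group element a."""
    return Counter(sigma)

def proper_subgroups_cyclic(q):
    """Proper subgroups of Z/q: multiples of d, for each divisor d > 1 of q."""
    return [set(range(0, q, d)) for d in range(2, q + 1) if q % d == 0]

def in_S_non(sigma, q, c):
    """Check that at least c*n sites lie outside every proper subgroup of Z/q.

    Assumption: this threshold form is how S_non(c) is defined."""
    n = len(sigma)
    return all(
        sum(1 for g in sigma if g not in H) >= c * n
        for H in proper_subgroups_cyclic(q)
    )
```

For instance, with $q = 6$ the configuration `[1, 2, 3, 4, 5, 0]` passes the check with $c = 1/3$, whereas `[0, 0, 0, 0, 0, 1]` fails because five of its six sites lie in the trivial subgroup.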

The burn-in period
Moreover, there exists a constant $C_G$ depending only on $G$ such that Let $(N_t)_{t \ge 0}$ be the birth-and-death chain with the following transition probabilities for $1 \le k \le n$: We start this chain at $N_0 = n^H_{\mathrm{non}}(\sigma_0)$; note that because the elements appearing in $\sigma_0$ generate $G$, we are guaranteed to have $n^H_{\mathrm{non}}(\sigma_0) > 0$. The above birth-and-death chain corresponds to the behavior of $(n^H_{\mathrm{non}}$ The chain $(N_t)$ is precisely what is analyzed in [2] for the case $G = \mathbb{Z}/2$. Let [2, (2) in the proof of Lemma 1] and thus and The proof of Lemma 1]. Hence by Chebyshev's inequality for all large enough $\beta > 0$, Moreover, we have $$P_{n/3}\left(T_{n/6} \le n^2\right) \le n^2 e^{-n/10}.$$ Indeed, this follows from the fact that for $m < k$, we have where $\pi_{BD}(k) = \binom{n}{k} / (2^n - 1)$ [2, (5) and the following in the proof of Proposition 2]. We now take a union bound over all the proper subgroups $H$.

The averaging period
In the next stage, the counts $n_a(\sigma_t)$ move toward their average value. We analyze this stage in two substages, looking first at a "proportion vector" and then at a "proportion matrix", as described below.

Proportion vector chain
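For a configuration $\sigma \in S$, we consider the $Q$-dimensional vector $(n_a(\sigma)/n)_{a \in G}$, which we call the proportion vector of $\sigma$. One may check that for a typical $\sigma \in S$, each $n_a(\sigma)/n$ is about $1/Q$. For each $\delta > 0$, we define the $\delta$-typical set $$S_\star(\delta) := \left\{\sigma \in S : \left\| \left(\frac{n_a(\sigma)}{n} - \frac{1}{Q}\right)_{a \in G} \right\| \le \delta \right\},$$ where $\|\cdot\|$ denotes the $\ell^2$-norm in $\mathbb{R}^G$.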
The following lemma implies that starting from $\sigma \in S_{\mathrm{non}}(1/3)$, we reach $S_\star(\delta)$ in $O_\delta(n)$ steps with high probability. The proof is given in Sect. 3.4.

Lemma 2.2 Consider any $\sigma \in S_{\mathrm{non}}(1/3)$ and any constant $\delta > 0$. There exists a constant $C_{G,\delta}$ depending only on $G$ and $\delta$ such that for any $T \ge C_{G,\delta}\, n$, we have
for all large enough $n$.

Proportion matrix chain
We actually need a more precise averaging than what is provided by Lemma 2.2. Fix a configuration $\sigma_0 \in S$. For any $\sigma \in S$ and for any $a, b \in G$, define $$n^{\sigma_0}_{a,b}(\sigma) := |\{i \in [n] : \sigma_0(i) = a,\ \sigma(i) = b\}|.$$ If we run the Markov chain $(\sigma_t)_{t \ge 0}$ with initial state $\sigma_0$, then $n^{\sigma_0}_{a,b}(\sigma_t)$ is the number of sites that originally contained the element $a$ (at time 0) and now contain $b$ (at time $t$). Note that $\sum_{b \in G} n^{\sigma_0}_{a,b}(\sigma) = n_a(\sigma_0)$ for each $a \in G$. We can then associate with $(\sigma_t)_{t \ge 0}$ another Markov chain $\left(n^{\sigma_0}_{a,b}(\sigma_t)\right)_{a,b \in G}$ for $t \ge 0$, which we call the proportion matrix chain (with respect to $\sigma_0$). The state space for the proportion matrix chain is $\{0, 1, \ldots, n\}^{G \times G}$, and the transition probabilities depend on $\sigma_0$.
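The proportion matrix is just a table of joint counts, which the following sketch computes for two configurations of equal length. The function name `proportion_matrix` and the sample configurations are illustrative assumptions.

```python
from collections import Counter

def proportion_matrix(sigma0, sigma):
    """Counts n_{a,b}: number of sites k with sigma0[k] == a and sigma[k] == b."""
    if len(sigma0) != len(sigma):
        raise ValueError("configurations must have the same number of sites")
    return Counter(zip(sigma0, sigma))

sigma0 = [0, 0, 1, 1, 2]
sigma = [1, 0, 1, 2, 2]
N = proportion_matrix(sigma0, sigma)

# Row sums recover the counts of the reference configuration sigma0.
row_sum_0 = sum(count for (a, _), count in N.items() if a == 0)
```

Permuting the sites within each class $\{i : \sigma_0(i) = a\}$ leaves this table unchanged, which is the symmetry that makes the projection lossless for mixing purposes.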
The proportion matrix acts like a "sufficient statistic" for analyzing our Markov chain started at $\sigma_\star$, because of the permutation invariance of our dynamics. In fact, as the following lemma shows, the distance to stationarity of the proportion matrix chain is equal to the distance to stationarity of the original chain.
Let $\pi^{\sigma_\star}$ be the stationary measure for the proportion matrix chain with respect to $\sigma_\star$, and for a proportion matrix $N$, let $X(N)$ denote the set of configurations with $N$ as their proportion matrix.
Since the distribution of $\sigma_t$ is invariant under permutations of the sites $i \in [n]$ preserving the set $\{i : \sigma_\star(i) = a\}$ for every $a \in G$, the conditional probability measures $P_{\sigma_\star}\left(\sigma_t \in \cdot \mid \sigma_t \in X(N)\right)$ and $\pi(\cdot \mid X(N))$ are both uniform on $X(N)$. This implies that for each $\sigma \in X(N)$, and summing over all $\sigma \in X(N)$ and all $N$, we obtain the claim.
For $\sigma_0 \in S$ and $r > 0$, define the set of configurations Roughly speaking, the following lemma shows that starting from a typical configuration $\sigma_\star \in S_\star\left(\frac{1}{4Q}\right)$, we need about $\frac{1}{2} n \log n$ steps to reach $S_\star\left(\sigma_\star, \frac{R}{\sqrt{n}}\right)$, where $R$ is a constant. We show this fact in a slightly more general form where the initial state need not be $\sigma_\star$; the proof is given in Sect. 3.5.

Lemma 2.4
Consider any $\sigma_\star, \sigma'_\star \in S_\star\left(\frac{1}{4Q}\right)$, and let $T := \frac{1}{2} n \log n$. There exists a constant $C_G > 0$ depending only on $G$ such that for any given $R > 0$, we have for all large enough $n$.

The coupling period
After reaching $S_\star\left(\sigma_\star, \frac{R}{\sqrt{n}}\right)$, we show that only $O(n)$ additional steps are needed to mix in total variation distance. The main ingredient in the proof is a coupling of proportion matrix chains so that they coalesce in $O(n)$ steps when they both start from configurations $\sigma, \tilde\sigma \in S_\star\left(\sigma_\star, \frac{R}{\sqrt{n}}\right)$. We construct such a coupling and prove the following lemma in Sect. 4.

Lemma 2.5 Consider any
Then, there exists a coupling $(\sigma_t, \tilde\sigma_t)$ of the Markov chains with initial states $(\sigma, \tilde\sigma)$ such that for a given $\beta > 0$ and all large enough $n$,
To translate this coupling time into a bound on total variation distance, we also need the simple observation that the stationary measure $\pi$ concentrates on $S_\star\left(\sigma_\star, \frac{R}{\sqrt{n}}\right)$ except for probability $O(1/R^2)$, as given in the next lemma.

Lemma 2.6
For the stationary distribution $\pi$ of the chain $(\sigma_t)_{t \ge 0}$, for every $R > 0$ and for all $n > m$, Moreover, for every $\delta < 1/(2Q)$, for every $R > 0$ and for all $n > m$, where $C_G$ and $m$ are constants depending only on $G$.
Proof Observe that since the stationary distribution $\pi$ is uniform on $S$, it is given by the uniform distribution $\mathrm{Unif}$ on $G^n$ conditioned on $S$. Note that we can always generate $G$ using each of its $|G|$ elements, so we have an easy lower bound of $|S| \ge |G|^{n - |G|}$. Consequently, we have Concerning the second assertion, we note that $n_a(\sigma_\star) \ge (1/Q - \delta) n$ for each $a \in G$; the rest follows similarly, so we omit the details.

Remark 2.7
In Lemma 2.6 above, we have given a very loose bound on $C_G$ for the sake of simplicity. In fact, it is not hard to see that holding $G$ fixed, we have $\lim_{n \to \infty} |S| / |G|^n = 1$. See also [9, Section 6.B] for more explicit bounds for various families of groups.
Together, Lemmas 2.4, 2.5, and 2.6 imply the following bound for total variation distance.

where C G is a constant depending only on G.
Proof Let $\tilde\sigma$ be drawn from the stationary distribution $\pi$. Define where $(\tilde\sigma_t)$ is a Markov chain started at $\tilde\sigma$. Let $\pi^{\sigma_\star}$ denote the stationary distribution for the proportion matrix with respect to $\sigma_\star$. Since $\tilde\sigma$ was drawn from $\pi$, the proportion matrix of $\tilde\sigma_t$ remains distributed as $\pi^{\sigma_\star}$ for all $t$. We first run $\sigma$ and $\tilde\sigma$ independently up until time $T_1 := \frac{1}{2} n \log n$. For a parameter $R$ to be specified later, consider the events Lemma 2.4 implies that $P(\mathcal{G}^c) \le C_G e^{-R} + \frac{1}{n}$, and Lemma 2.6 implies that $P(\tilde{\mathcal{G}}^c) \le$ Let $T_2 := \beta n$. Starting from time $T_1$, as long as both $\mathcal{G}$ and $\tilde{\mathcal{G}}$ hold, we may use Lemma 2.5 to form a coupling $(\sigma_t, \tilde\sigma_t)$ so that Setting $R = \beta^{1/4}$, we conclude that We have $T = T_1 + T_2$, and recall that the proportion matrix for $\tilde\sigma$ is stationary for all time. This yields The result then follows by Lemma 2.3.

Proof of the main theorem
We now combine the lemmas from the burn-in, averaging, and coupling periods to complete the proof of the upper bound in Theorem 1.1.
Let τ 1/3 be the first time to hit S non (1/3) as in Lemma 2.1. Then, Lemma 2.1 implies that for any σ 1 ∈ S and any t ≥ 0, we have Next, by Lemma 2.2, for any σ 2 ∈ S non (1/3) and when β and n are sufficiently large, we have that P Finally, Lemma 2.8 states that Thus, combining (3), (4), and (5), we obtain for any σ ∈ S that sending n → ∞ and then β → ∞ yields (1).

Proofs for the averaging period
In this section, we prove Lemmas 2.2 and 2.4. The proofs are based on analyzing stochastic difference equations satisfied by the Fourier transform of the proportion vector or matrix.

The Fourier transform for G
We first establish some notation and preliminaries for the Fourier transform. Let $G^*$ be a complete set of non-trivial irreducible representations of $G$. In other words, for each $\rho \in G^*$, we have a finite-dimensional complex vector space $V_\rho$ such that $\rho : G \to GL(V_\rho)$ is a non-trivial irreducible representation, and any non-trivial irreducible representation of $G$ is isomorphic to some unique $\rho \in G^*$. Moreover, we may equip each $V_\rho$ with an inner product for which $\rho \in G^*$ is unitary. For a configuration $\sigma \in S$ and for each $\rho \in G^*$, we consider the matrix acting on $V_\rho$ given by so that $x_\rho(\sigma)$ is the Fourier transform of the proportion vector at the representation $\rho$.
We write $x(\sigma) := (x_\rho(\sigma))_{\rho \in G^*}$. Let $V := \bigoplus_{\rho \in G^*} \mathrm{End}_{\mathbb{C}}(V_\rho)$, and write $d_\rho := \dim_{\mathbb{C}} V_\rho$. For an element $x = (x_\rho)_{\rho \in G^*} \in V$, we define a norm $\|\cdot\|_V$ given by where $\langle A, B \rangle_{HS} = \mathrm{Tr}(A^* B)$ denotes the Hilbert-Schmidt inner product in $\mathrm{End}_{\mathbb{C}}(V_\rho)$ and $\|\cdot\|_{HS}$ denotes the corresponding norm. (Note that $\langle \cdot, \cdot \rangle_{HS}$ and $\|\cdot\|_{HS}$ depend on $\rho$, but for the sake of brevity, we omit the $\rho$ when there is no danger of confusion.) The Peter-Weyl theorem [7, Chapter 2] says that where the isomorphism is given by the Fourier transform. The Plancherel formula then reads Thus, in order to show that $\sigma \in S_\star(\delta)$, it suffices to show that $\|x(\sigma)\|_V$ is small. A similar argument may be applied to the proportion matrix instead of the proportion vector.
Finally, for an element $A \in \mathrm{End}_{\mathbb{C}}(V_\rho)$, we will at times also consider the operator norm $\|A\|_{op} := \sup_{v \in V_\rho,\, v \ne 0} \|Av\| / \|v\|$. We will also sometimes use the following (equivalent) variational characterization of the operator norm:

The special case of G = Z/q
On a first reading of this section, the reader may wish to consider everything for the special case of $G = \mathbb{Z}/q$ for some integer $q \ge 2$. In that case, each representation is one-dimensional, and the representations can be indexed by $\ell = 0, 1, 2, \ldots, q - 1$. The Fourier transform is then particularly simple: the coefficients are scalar values expressed through powers of $\omega := e^{2\pi i / q}$, a primitive $q$-th root of unity. This special case already illustrates most of the main ideas while simplifying the estimates in some places (e.g. matrix inequalities we use will often be immediately obvious for scalars).
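For $G = \mathbb{Z}/q$ the Fourier coefficients of the proportion vector can be computed directly. The sketch below assumes the convention $x_\ell(\sigma) = \frac{1}{n} \sum_a n_a(\sigma)\, \omega^{\ell a}$ (the sign and normalization are assumptions made for this example) and checks the resulting Plancherel identity $\sum_{\ell \ne 0} |x_\ell|^2 = q \sum_a (n_a/n - 1/q)^2$ numerically.

```python
import cmath
from collections import Counter

def fourier_coefficients(sigma, q):
    """Return [x_0, ..., x_{q-1}] with x_l = (1/n) * sum_a n_a * omega^(l*a).

    The sign and normalization convention is an assumption of this sketch."""
    n = len(sigma)
    n_a = Counter(sigma)
    omega = cmath.exp(2j * cmath.pi / q)
    return [
        sum(n_a[a] * omega ** (l * a) for a in range(q)) / n
        for l in range(q)
    ]

def squared_distance_to_uniform(sigma, q):
    """Squared l^2 distance between the proportion vector and (1/q, ..., 1/q)."""
    n = len(sigma)
    n_a = Counter(sigma)
    return sum((n_a[a] / n - 1 / q) ** 2 for a in range(q))

sigma = [0, 0, 1, 2, 2, 2, 4]
q = 5
x = fourier_coefficients(sigma, q)
nontrivial_energy = sum(abs(c) ** 2 for c in x[1:])
```

The coefficient at $\ell = 0$ is always 1, so only the non-trivial coefficients carry information about the distance from the uniform proportion vector.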

A stochastic difference equation for the n a
For $a \in G$, we next analyze the behavior of $n_a(\sigma_t)$ over time. For convenience, we write $n_a(t) = n_a(\sigma_t)$. Let $\mathcal{F}_t$ denote the $\sigma$-field generated by the Markov chain $(\sigma_t)_{t \ge 0}$ up to time $t$. Then, our dynamics satisfy the equation Note that $|n_a(t + 1) - n_a(t)| \le 1$ almost surely. Thus, for each $a \in G$, we can write the above as a stochastic difference equation where $E[M_a(t + 1) \mid \mathcal{F}_t] = 0$ and $|M_a(t)| \le 2$ almost surely. It is easiest to analyze this equation through the Fourier transform. Writing $x_\rho(t) = x_\rho(\sigma_t)$, we calculate from (8) that so that our equation becomes Note that we have and thus,

A general estimate for stochastic difference equations
Before proving Lemma 2.2, we also need a technical lemma for controlling the behavior of stochastic difference equations, which will be used to analyze (9) as well as other similar equations.

Lemma 3.1 Let $(z(t))_{t \ge 0}$ be a sequence of $[0, 1]$-valued random variables adapted to a filtration $(\mathcal{F}_t)_{t \ge 0}$. Let $\varepsilon \in (0, 1)$ be a small constant, and let $\varphi : \mathbb{R}_+ \to (0, 1]$ be a non-decreasing function. Suppose that there are $\mathcal{F}_t$-measurable random variables $M(t)$ for which
and which, for some constant $D$, satisfy the bounds Then, for each $t$ and each $\lambda > 0$, we have for constants $c_{D,\varphi}$, $C_{D,\varphi}$ depending only on $D$ and $\varphi$.
Taking conditional expectations in the inequality relating z(t + 1) to z(t), we have Rearranging and using the fact that ϕ(t) is non-decreasing, we have Consequently, is a supermartingale, and its increments are bounded by Recall that ϕ is non-decreasing, so that for all t ≥ s ≥ 0, we have Using this with (11), we see that the sum of the squares of the first t increments is at most By the Azuma-Hoeffding inequality, this yields which in turn implies The result then follows upon shifting and rescaling of λ.

Proportion vector chain: Proof of Lemma 2.2
We first prove a bound for the Fourier coefficients $x_\rho(t)$.

Lemma 3.2 Consider any $\sigma \in S_{\mathrm{non}}(1/3)$ and any $\rho \in G^*$. We have a constant $c_G$ depending only on $G$ for which
for all large enough $n$.
This immediately implies Lemma 2.2.

Proof of Lemma 2.2
With $c_G$ defined as in Lemma 3.2, take $C_{G,\delta}$ large enough so that for any $T \ge C_{G,\delta}\, n$, Then, Lemma 3.2 and Plancherel's formula yield for large enough $n$, as desired.
We are now left with proving Lemma 3.2, which relies on the following bound on the operator norm.

Lemma 3.3
There exists a positive constant γ G depending on G such that for any ρ ∈ G * and any σ ∈ S non (1/6), Proof Let G denote the set of all probability distributions on G, and for c ∈ (0, 1), let Consider a representation ρ ∈ G * , and consider the function h : Then, h(μ) is hermitian, and since ρ is unitary, we clearly have We claim that λ(μ) < 1 for each μ ∈ G (c). Indeed, suppose the contrary. Then, there exists a non-zero vector v ∈ V ρ such that Re ρ(a)v, v = 1 for all a ∈ G with μ(a) > 0. This implies that the support of μ is included in the subgroup Since ρ is a (non-trivial) irreducible representation, H is a proper subgroup of G, and μ(H ) = 1, contradicting the assumption that μ ∈ G (c). Note that μ → λ(μ) is continuous. We may define Then, we have for any σ ∈ S non (1/6), Taking 0 < γ G < 1 −γ G , and plugging this into the definition of X ρ gives X ρ (σ ) − γ G n I d ρ . Note that X ρ (σ ) − 2 n−1 I d ρ . Combining these together gives the result.

Remark 3.4
A much more direct approach is possible in the case $G = \mathbb{Z}/q$. The condition $\sigma \in S_{\mathrm{non}}(1/6)$ implies that $n_0(\sigma) \le \frac{5}{6} n$. Then, we have for some positive $\gamma_G$. Some rearranging of equations then yields the desired result.

Proof of Lemma 3.2 Fix
where $\gamma_G$ is taken as in Lemma 3.3. Since our chain starts at $\sigma \in S_{\mathrm{non}}(1/3)$, Lemmas 2.1 and 3.3 together imply that $P_\sigma\left(\mathcal{G}^c_{n^2}\right) \le C_G n^2 e^{-n/10}$.
Next, we turn to (9). Rearranging (9) and squaring, we have Substituting into (12), we obtain Note that we have the bounds We now apply Lemma 3.1 with $\varepsilon = \frac{1}{n}$, $\varphi(t) = \gamma_G$, $D = 6 Q^2 d_\rho$, and $\lambda = n^{1/4}$. This yields Consequently, The lemma with $c_G = \gamma_G / 2$ then follows from union bounding over all $1 \le t \le n^2$ and taking $n$ sufficiently large.

Proportion matrix chain: Proof of Lemma 2.4
We carry out a similar, albeit more refined, strategy to analyze the proportion matrix. Throughout this section, we assume our Markov chain $(\sigma_t)_{t \ge 0}$ starts at an initial state $\sigma'_\star \in S_\star\left(\frac{1}{4Q}\right)$. We again write $n_a(t) = n_a(\sigma_t)$ and $n_{a,b}(t) = n^{\sigma_\star}_{a,b}(\sigma_t)$, and similar to before, the $n_{a,b}(t)$ satisfy the difference equation where We can again analyze this equation via the Fourier transform. In this case, for each $a \in G$, we take the Fourier transform of $\left(n_{a,b}(t) / n_a(\sigma_\star)\right)_{b \in G}$. For $\rho \in G^*$, let becomes $$y_{a,\rho}(t + 1) - y_{a,\rho}(t) = y_{a,\rho}(t) X_\rho(t) + M_{a,\rho}(t + 1). \qquad (14)$$ Note that $E_\sigma[M_{a,\rho}(t + 1) \mid \mathcal{F}_t] = 0$. Also, since we assumed $\sigma_\star \in S_\star\left(\frac{1}{4Q}\right)$, it follows that $\frac{n_a(\sigma_\star)}{n} \ge \frac{1}{2Q}$. Thus, we also know $\|M_{a,\rho}(t + 1)\|_{HS} \le$ Again, our main step is a bound on the Fourier coefficients $y_{a,\rho}(t)$, which will also be useful later in proving Lemma 2.5.

Lemma 3.5
Consider any $\sigma_\star, \sigma'_\star \in S_\star\left(\frac{1}{4Q}\right)$. There exist constants $c_G, C_G > 0$ depending only on $G$ such that for all large enough $n$, we have for all $t$ and $R > 0$.
The above lemma directly implies Lemma 2.4.

Proof of Lemma 2.4
We apply Lemma 3.5 to each $a \in G$ and $\rho \in G^*$. Recall that $T = \frac{1}{2} n \log n$, so that Then, Lemma 3.5 implies Union bounding over all $a \in G$ and $\rho \in G^*$ and using the Plancherel formula, this yields for sufficiently large $C_G$ and $n$.
We now prove Lemma 3.5. Before proceeding with the main proof, we need the following routine estimate as a preliminary lemma.
By spherical symmetry, we have which is the first inequality. Again by spherical symmetry, the eigenvalues of the Hessian ∇ 2 θ n (x) can be directly computed to be f ( x ) and f ( x )/ x . But these are bounded by Thus, ∇ 2 θ n (x) √ n I , and the second inequality follows from Taylor expansion.
where the second inequality follows from the variational formula for operator norm (i.e. that B A HS ≤ A op B HS ), and the third inequality follows from the fact that θ n is convex with θ n (0) = 0. Thus, we may write Now, let z t := 1 H t θ n (y a,ρ (t)), and note that since X ρ (σ ) n whenever H t holds. Thus, We may then apply Lemma 3.1 with ε = 1 n and D = 8Q 4 d ρ . Note that for all large enough n. Thus, Lemma 3.1 implies that Consequently, as desired.

Construction of the coupling: Proof of Lemma 2.5
For each δ > 0, we define a subset of {0, 1, . . . , n} G×G by for every a, b ∈ G and a,b∈G Lemma 4.1 Consider a configuration σ * ∈ S and a constant 0 < δ ≤ 1 2Q 2 , and assume that (1 − δ)n/Q 2 is an integer. Let (σ t ) t≥0 and (σ t ) t≥0 be two product replacement chains started at σ andσ , respectively. Then, there exists a coupling (σ t ,σ t ) of the Markov chains satisfying the following: Let Proof Let us abbreviate n a,b (t) = n σ * a,b (σ t ) andñ a,b (t) = n σ * a,b (σ t ). Let m a,b (t) := min(n a,b (t),ñ a,b (t)). For each a ∈ G, we define the quantity so that D t = a∈G d a (t).
For accounting purposes, it is helpful to introduce two sequences $(x_k)_{k \in [n]}$ and $(\tilde{x}_k)_{k \in [n]}$ of elements of $G \times G$. These sequences are chosen so that the number of $x_k$ equal to $(a, b)$ is exactly $n_{a,b}$, and similarly the number of $\tilde{x}_k$ equal to $(a, b)$ is $\tilde{n}_{a,b}$. Moreover, we arrange their indices in a coordinated fashion, as described below.
We define three families of disjoint sets: $P_{a,b}$, $Q_a$, and $R_a \subset [n]$.
• For each $a, b \in G$, let $P_{a,b}$ be a set of size $(1 - \delta) n / Q^2$ such that for any $k \in P_{a,b}$, we have $x_k = \tilde{x}_k = (a, b)$. (This is possible provided that $(n_{a,b}(t)), (\tilde{n}_{a,b}(t)) \in M_\delta$ holds.)
• For each $a \in G$, let $Q_a$ be a set of size $\sum_{b \in G} (m_{a,b} - |P_{a,b}|)$ such that for any $k \in Q_a$, $x_k = \tilde{x}_k = (a, b)$ for some $b$. (Note that $Q_a$ may be empty.)
• For each $a \in G$, let $R_a$ be a set of size $d_a$ such that for any $k \in R_a$, $x_k$ and $\tilde{x}_k$ both have $a$ as their first coordinate. (This $R_a$ is well-defined since $\sum_b n_{a,b} = \sum_b \tilde{n}_{a,b}$ for each $a$; it may also be empty.)

Fig. 1 Illustration of cases (i)-(iv)

Suppose that $D_t > 0$, so that for some $a_\star, b_\star, b'_\star \in G$ we have $n_{a_\star, b_\star} > \tilde{n}_{a_\star, b_\star}$ and $n_{a_\star, b'_\star} < \tilde{n}_{a_\star, b'_\star}$. Let us consider all possible ways to sample a pair of indices and a sign $(k, l, s) \in \{1, 2, \ldots, n\}^2 \times \{\pm 1\}$ with $k \ne l$. Suppose $x_k = (a_k, b_k)$ and $x_l = (a_l, b_l)$. We think of $(k, l, +1)$ as corresponding to a move on $(n_{a,b}(t))$ where $n_{a_k, b_k}$ is decremented and $n_{a_k, (b_k \cdot b_l)}$ is incremented. Similarly, $(k, l, -1)$ corresponds to a move where $n_{a_k, b_k}$ is decremented and $n_{a_k, (b_k \cdot b_l^{-1})}$ is incremented. We may also think of $(k, l, \pm 1)$ as corresponding to moves on $(\tilde{n}_{a,b}(t))$ in an analogous way.
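The bookkeeping of a single move on the counts can be sketched as follows. The function `apply_move`, the list-of-pairs representation of $(x_k)$, and the choice of $\mathbb{Z}/5$ written additively are assumptions made for illustration.

```python
from collections import Counter

def apply_move(x, counts, k, l, s, mul, inv):
    """Apply the move (k, l, s): site k changes from (a_k, b_k) to (a_k, b_k * b_l^s).

    `x` is the list of pairs (a, b); `counts` is the table n_{a,b}, updated in place."""
    a_k, b_k = x[k]
    _, b_l = x[l]
    b_new = mul(b_k, b_l if s == +1 else inv(b_l))
    counts[(a_k, b_k)] -= 1      # n_{a_k, b_k} is decremented
    counts[(a_k, b_new)] += 1    # n_{a_k, b_k * b_l^s} is incremented
    x[k] = (a_k, b_new)

# Illustrative group: Z/5 written additively.
q = 5
mul = lambda a, b: (a + b) % q
inv = lambda a: (-a) % q

x = [(0, 1), (0, 2), (1, 3)]
counts = Counter(x)
apply_move(x, counts, k=0, l=2, s=-1, mul=mul, inv=inv)
```

The first coordinate of each pair never changes, which is why the row sums $\sum_b n_{a,b}$ are conserved by every move.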
We now analyze four cases, as illustrated in Fig. 1. (i) Case (k, l) ∈ (P Q)×(P Q). For all but an exceptional situation described below, we apply the move corresponding to (k, l, s) to both states (n a,b (t)) and (ñ a,b (t)). In these cases, D t+1 = D t .
We now describe the exceptional situation. Define Then, the exceptional situation occurs when s = +1 and (k, l) ∈ S S . Take any bijection τ from S to S . If (k, l) ∈ S, then we apply (k, l, +1) to (n a,b (t)) while applying (τ (k, l), +1) to (ñ a,b (t)). This increments n a * ,b * , decrements n a * ,b * , and has no effect on the (ñ a,b (t)). The overall effect is that D t+1 = D t − 1.
The exceptional event occurs with probability (1−δ) 2 2Q 3 , and when it occurs, D t increases or decreases by 1 with equal probability. Thus, the exceptional situation plays the role of introducing some unbiased fluctuation in D t and gives us (17).
This occurs with probability Apply the move corresponding to (k, l, s) to both states. This increases D t by at most 1. We will see later that the effect of this case is small compared to the other cases.
(iii) Case (k, l) ∈ P × R. This occurs with probability Apply the move corresponding to (k, l, s) to both states. Again, this increases D t by at most 1, but there is also a chance not to increase.
Suppose that x l = (a 1 , b 1 ) andx l = (a 1 ,b 1 ), and suppose that k ∈ P a 2 ,b 2 . Then the move has the effect of decreasing n a 2 ,b 2 andñ a 2 ,b 2 while increasing n a 2 ,(b 2 ·b s 1 ) andñ a 2 ,(b 2 ·b s 1 ) . Note that conditioned on this case happening, (a 2 , b 2 ) is distributed uniformly over G × G. When (a 2 , (b 2 ·b s 1 )) = (a * , b * ) or (a 2 , (b 2 · b s 1 )) = (a * , b * ), the move does not increase D t . Therefore there is at least a 2/Q 2 chance that D t is actually not increased. Hence, the probability that D t is increased by 1 is at most (iv) Case (k, l) ∈ R × P. This occurs with probability Suppose that x k = (a, b) andx k = (a,b). Let τ be a permutation of P such that for l ∈ P a,c , one has τ (l) ∈ P a,b −1 ·b·c s . Then apply (k, l, s) to (n a,b (t)) and apply (k, τ (l), s) to (ñ a,b (t)). This always decreases D t by 1.
Let us now summarize what we know when (n a,b (t)), (ñ a,b (t)) ∈ M δ and D t > 0. From Cases (i), (ii), and (iii), we have From Cases (i) and (iv), we have verifying (16).
To fully define the coupling, when $D_t = 0$, we can couple $\sigma_t$ and $\tilde\sigma_t$ to be identical, and if either $(n_{a,b}(t)) \notin M_\delta$ or $(\tilde{n}_{a,b}(t)) \notin M_\delta$, we may run the two chains independently.

Proof of Lemma 2.5
Since σ ∈ S * σ * , R √ n , we must have for each a ∈ G and ρ ∈ G * that y σ * a,ρ (σ ) HS ≤ R √ n . Note that for large enough n, we have S * σ * , R √ n ⊆ S * 1 5Q 3 . Thus, we may apply Lemma 3.5 to obtain for large enough n. Define the event G t := σ s ∈ S * σ * , 1 5Q 3 for all 1 ≤ s ≤ t .
The Plancherel formula applied to (18) implies that P(G c n 2 ) ≤ 3Q 2 n . We may analogously define an eventG t forσ and let A t := G t ∩G t . Thus, P(A c n 2 ) ≤ 6Q 2 n . Pick δ ∈ 2 5Q 2 , 3 7Q 2 so that (1 − δ )n/Q 2 is an integer. Note that when A t holds, we have σ t ∈ S * σ * , 1 5Q 3 and σ * ∈ S * 1 5Q 3 ⇒ (n a,b (t)) ∈ M 2 5Q 2 ⊆ M δ , and similarlyσ t ∈ M δ . Thus, we may invoke Lemma 4.1 to give a coupling between σ andσ where on the event A t , the quantity D t is more likely to decrease than increase. Letting D t := 1 A t D t , we see that (D t ) is a supermartingale with respect to (F t ).
Recall that T = βn and D 0 ≤ √ QR √ n. As long as β is large enough, we may apply (19) with u = T to get for all large enough n, as desired.

Proof of Theorem 1.1 (2)
The lower bound is proved essentially by showing that the estimates of Lemmas 2.1 and 2.4 cannot be improved. Let $a_1, a_2, \ldots, a_k$ be a set of generators for $G$. Let $\sigma \in S$ be the configuration given by $\sigma(i) := a_i$ if $1 \le i \le k$, and $\sigma(i) := \mathrm{id}$ otherwise.
We will analyze the Markov chain started at $\sigma$ and show that it does not mix too fast. Recall from Sect. 2 the notation $n^{\{\mathrm{id}\}}_{\mathrm{non}}(\sigma) = |\{i \in [n] : \sigma(i) \ne \mathrm{id}\}|$ for the number of sites in $\sigma$ that do not contain the identity. We first show that if we run the chain for slightly less than $n \log n$ steps, most of the sites will still contain the identity.
Next, we show that it really takes about $\frac{1}{2} n \log n$ steps for the Fourier coefficients $x_\rho$ to decay to $O\left(\frac{1}{\sqrt{n}}\right)$, as suggested by Lemma 2.4. Note that it suffices here to analyze the $x_\rho$ instead of the $y_{a,\rho}$, which simplifies our analysis. Actually, it suffices to consider (the real part of) the trace of $x_\rho$. Here the orthogonality of characters reads $\frac{1}{Q} \sum_{a \in G} \mathrm{Tr}\, \rho(a) = 0$, and it takes about $\frac{1}{2} n \log n$ steps for $\mathrm{Re}\, \mathrm{Tr}\, x_\rho(t)$ to decay to $O\left(\frac{1}{\sqrt{n}}\right)$.
Lemma 5.2 Consider any $\rho \in G^*$ and any $R > 5$. Let $T := \frac{1}{2} n \log n - Rn$, and suppose that $\sigma \in S$ satisfies $n^{\{\mathrm{id}\}}_{\mathrm{non}}(\sigma) \le \frac{n}{3}$. Then,
Proof Let $z(t) := (1/d_\rho)\, \mathrm{Tr}\, (x_\rho(t) + x_\rho(t)^*)/2$. Then, noting that (9) also holds for $x_\rho(t)^*$ since $x_{\rho^*}(t) = x_\rho(t)^*$, we have where $E[M(t + 1) \mid \mathcal{F}_t] = 0$ and $|M(t)| \le \frac{2Q}{n}$.