A comparison of updating algorithms for large N reduced models

We investigate Monte Carlo updating algorithms for simulating $SU(N)$ Yang-Mills fields on a single-site lattice, such as for the Twisted Eguchi-Kawai model (TEK). We show that performing only over-relaxation (OR) updates of the gauge links is a valid simulation algorithm for the Fabricius and Haan formulation of this model, and that it decorrelates observables faster than using heat-bath updates. We consider two different methods of implementing the OR update: either updating the whole $SU(N)$ matrix at once, or iterating through $SU(2)$ subgroups of the $SU(N)$ matrix. We find the same critical exponent in both cases, and only a slight difference between the two.


Introduction
As originally proposed by 't Hooft [1], the large N limit of SU(N) Yang-Mills (YM) theories at fixed 't Hooft coupling is an approximation to the theory of strong interactions. Being simpler from an analytic point of view, it was hoped that it could lead to an understanding of the properties of hadrons at low energies. Despite the fact that only a subset of Feynman diagrams contributes in this limit, these initial hopes turned out to be too optimistic. Large N YM theories are full of rich phenomena, but complicated enough to resist a complete understanding. In parallel, interest in them has grown and now extends well beyond their role as an approximation to strong interactions. The 't Hooft limit appears as an important ingredient in many recent developments, such as the connections of field theory with string theory and/or gravitational interactions, as in the AdS/CFT construction.
Given its interest, several authors have used lattice gauge theory techniques to study the non-perturbative behaviour of YM theories in the large N limit (for a recent review see [2]). This program is challenging, since the number of degrees of freedom that have to be simulated in a computer increases with the rank of the group, making the simulations more difficult with growing N . Alternatively, one can exploit the idea of volume reduction.
The study of the Schwinger-Dyson equations of U (N ) YM theories led to the conjecture that gauge theories become volume independent in the large N limit [3]. Taking the idea of volume reduction to an extreme, one can simulate a lattice with one single point. This allows one to reach much larger values of N and can provide a more precise method of determining the large N observables.
Attempts to produce a successful implementation of the volume independence idea have led to several proposals [4][5][6][7][8][9][10][11][12] after the original one was shown not to work [4] (see also Ref. [2] for a recent account). Our study here will focus on the twisted reduction idea originally proposed by some of the present authors [5], but its scope applies to other reduced models as well. In particular, the one-site twisted Eguchi-Kawai model (TEK) [6] with fluxes chosen in a suitable range [7] has been tested recently in several works. There is now strong numerical evidence that its results for various lattice observables coincide with those of SU (N ) gauge theories extrapolated to N → ∞ [13]. Furthermore, the tests have also been extended to physical quantities in the continuum limit, such as the string tension [14] or the renormalized running coupling [15].
In all the previous works a connection was established between the finite N corrections of the twisted model and the finite volume effects of the ordinary gauge theory. Typically N² plays the role of the physical volume. Hence, in order to extract physical quantities with small systematic errors, one should work at rather large values of N (O(1000)). These types of simulations present their own challenges, which are sometimes very different from the ones faced in usual lattice simulations. The literature on the subject is somewhat scarce and old (it will be reviewed in the next section). This serves as a motivation for the present work, devoted to the analysis of the computational aspects of this one-site model. We will present the different algorithms that can be used in the simulations, together with numerical tests of their correctness and efficiency.
We conclude by mentioning that the interest of the reduced matrix models extends beyond the purely computational one. First of all, the methodology developed here is useful for extensions of pure Yang-Mills theory, such as the model with fermionic fields in the adjoint representation with various types of boundary conditions [9,16]. This allows the study of a possible candidate for walking technicolor and the determination of its anomalous dimensions [17,18]. There are also other intriguing phenomena, such as a stronger form of N-L scaling [19,20], the emergence of new symmetries [10], or the connections with non-commutative space-time [21][22][23][24]. In summary, we are quite confident that the methodology studied here will be useful for future researchers in the field.

Update algorithms for the one-site model
The action of the TEK model in d space-time dimensions is defined as a function of the variables U_µ (µ = 0, . . . , d − 1) that are elements of the group SU(N) [6]

    S_{TEK}[U] = bN \sum_{\mu \neq \nu} \mathrm{Tr}\left[ 1 - z_{\mu\nu}\, U_\mu U_\nu U_\mu^\dagger U_\nu^\dagger \right] ,    (2.1)

where b can be interpreted as the inverse of the 't Hooft coupling λ = g²N, and z_µν is given by

    z_{\mu\nu} = \exp(2\pi i\, n_{\mu\nu}/N) ,    (2.2)

where n_µν (called the twist tensor) are integers defined modulo N. The action Eq. (2.1) is just the Wilson action of an SU(N) lattice gauge theory defined on a one-site lattice with twisted boundary conditions. The partition function of this model is given by

    Z = \int \prod_\mu dU_\mu\, e^{-S_{TEK}[U]} ,    (2.3)

and in principle it can be simulated with the usual Monte Carlo techniques like, for example, the Metropolis algorithm [25]. Nevertheless, the most effective algorithms used to simulate pure gauge theories [26][27][28][29][30][31], a combination of heatbath (HB) and overrelaxation (OR) sweeps, cannot be applied directly to this model. The reason is that the action Eq. (2.1) is not a linear function of the links U_µ. The solution to this problem was discovered long ago by Fabricius and Haan [32] and consists in introducing auxiliary normally distributed complex N × N matrices \bar{Q}_µν with µ > ν. The original partition function Eq. (2.3) can be written as

    Z = \mathcal{N} \int \prod_\mu dU_\mu \prod_{\mu>\nu} d\bar{Q}_{\mu\nu}\, \exp\Big( -S_{TEK}[U] - \sum_{\mu>\nu} \mathrm{Tr}\big[ \bar{Q}_{\mu\nu}^\dagger \bar{Q}_{\mu\nu} \big] \Big) ,    (2.4)

where \mathcal{N} is a constant. If we perform the change of variables

    Q_{\mu\nu} = \bar{Q}_{\mu\nu} + \sqrt{bN}\, \big( z_{\mu\nu} U_\mu U_\nu + U_\nu U_\mu \big) ,    (2.5)

the modified action S_{TEKQ} given by

    S_{TEKQ}[U, Q] = \sum_{\mu>\nu} \mathrm{Tr}\big[ Q_{\mu\nu}^\dagger Q_{\mu\nu} \big] - \sqrt{bN} \sum_{\mu>\nu} \mathrm{Tr}\big[ Q_{\mu\nu}^\dagger \big( z_{\mu\nu} U_\mu U_\nu + U_\nu U_\mu \big) + \mathrm{h.c.} \big]    (2.6)

is linear in the link variables U_µ, and satisfies

    S_{TEKQ}[U, Q] = S_{TEK}[U] + \sum_{\mu>\nu} \mathrm{Tr}\big[ \bar{Q}_{\mu\nu}^\dagger \bar{Q}_{\mu\nu} \big] + \mathrm{const} .    (2.7)

At this stage standard algorithms like heatbath and overrelaxation can be applied to update the links. The Fabricius-Haan algorithm thus involves the updating of the auxiliary variables followed by the link updates. We will describe below several updating possibilities and show that the optimal algorithm only requires over-relaxation updates of the links combined with the updating of the auxiliary variables.
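The linearization can be checked numerically. The following numpy sketch assumes the conventions just described, namely the shift Q_µν = Q̄_µν + √(bN)(z_µν U_µU_ν + U_νU_µ) for µ > ν and an illustrative twist tensor n_µν = 1 for µ > ν; it is a self-consistency check of these (reconstructed) conventions, not of the paper's exact normalizations. It verifies that S_TEKQ differs from S_TEK plus the Gaussian term only by a U-independent constant, here −4bN² per pair (µ, ν):

```python
import numpy as np

rng = np.random.default_rng(7)

def random_su(n):
    # Haar-distributed unitary via QR, with the determinant phase removed
    m = (rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))) / np.sqrt(2)
    q, r = np.linalg.qr(m)
    q = q @ np.diag(np.diag(r) / np.abs(np.diag(r)))
    return q / np.linalg.det(q) ** (1.0 / n)

N, d, b = 5, 4, 0.36
# illustrative antisymmetric twist tensor: n_munu = 1 for mu > nu
n_t = np.zeros((d, d), dtype=int)
for mu in range(d):
    for nu in range(mu):
        n_t[mu, nu], n_t[nu, mu] = 1, -1
z = np.exp(2j * np.pi * n_t / N)

U = [random_su(N) for _ in range(d)]
M = {(mu, nu): z[mu, nu] * U[mu] @ U[nu] + U[nu] @ U[mu]
     for mu in range(d) for nu in range(mu)}
Qbar = {k: (rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))) / np.sqrt(2)
        for k in M}
Q = {k: Qbar[k] + np.sqrt(b * N) * M[k] for k in M}      # change of variables

S_TEK = b * N * sum((N - z[mu, nu] * np.trace(
    U[mu] @ U[nu] @ U[mu].conj().T @ U[nu].conj().T)).real
    for mu in range(d) for nu in range(d) if mu != nu)

S_TEKQ = sum(np.trace(Q[k].conj().T @ Q[k]).real
             - 2 * np.sqrt(b * N) * np.trace(Q[k].conj().T @ M[k]).real
             for k in M)

gauss = sum(np.trace(Qbar[k].conj().T @ Qbar[k]).real for k in M)
const = -4 * b * N**2 * d * (d - 1) / 2     # U-independent offset
assert abs(S_TEKQ - (S_TEK + gauss + const)) < 1e-7
```

The assertion holding for arbitrary random U and Q̄ reflects that the two actions define the same Boltzmann weight up to an overall normalization.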

Update of the link variables
The terms of the action that involve some link U_α are

    S_\alpha[U] = -\sqrt{bN}\, \mathrm{Tr}\big[ U_\alpha H_\alpha + H_\alpha^\dagger U_\alpha^\dagger \big] ,    (2.8)

where H_α is the sum of staples given by

    H_\alpha = \sum_{\nu \neq \alpha} \big( z_{\alpha\nu}\, U_\nu Q_{\alpha\nu}^\dagger + Q_{\alpha\nu}^\dagger U_\nu \big) ,    (2.9)

and we have defined Q_αν ≡ Q_να for α < ν.

SU (2) projections
We introduce SU(2) subgroups H^{(i,j)} ⊂ SU(N), labelled by two integers (i, j) such that i ≠ j. An element A of the subgroup H^{(i,j)} can be written as

    A_{kl} = \delta_{kl} \ \text{for}\ k, l \notin \{i, j\} , \qquad
    \begin{pmatrix} A_{ii} & A_{ij} \\ A_{ji} & A_{jj} \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} ,    (2.10)

with the 2 × 2 submatrix a_{kl} being an SU(2) element. The update of an SU(N) matrix proceeds by choosing a set S of these SU(2) subgroups of SU(N) [25,27]

    S = \{ H_1, \ldots, H_k \} ,    (2.11)

such that no subset of SU(N) remains invariant under left multiplication by S. This requirement can be satisfied with k = N − 1 as follows:

    S = \{ H^{(1,2)}, H^{(2,3)}, \ldots, H^{(N-1,N)} \} .    (2.12)

However, covering the full group requires order N steps of this kind. Hence, as experience shows, it is more effective to choose S to be composed of all the N(N − 1)/2 subgroups of this type:

    S = \{ H^{(i,j)} , \ 1 \le i < j \le N \} .    (2.13)

If A ∈ H^{(i,j)} is a matrix of the type Eq. (2.10), it is easy to evaluate the one-link action Eq. (2.8):

    S_\alpha[A U_\alpha] = -\mathrm{Tr}(a w) + \text{terms independent of } a .    (2.14)

Here a and w/|w| are SU(2) matrices, with w = Re(\tilde{w}_0) + i\, Re(\tilde{w}_i) \tau_i obtained from the 2 × 2 complex matrix \tilde{w}, the (i, j) block of \sqrt{bN}\, U_\alpha H_\alpha, decomposed as

    \tilde{w} = \tilde{w}_0 \mathbb{1} + i\, \tilde{w}_i \tau_i .    (2.15)

A heatbath update [32] generates the matrix a with probability proportional to e^{-S_\alpha}. An overrelaxation update generates a by reflecting around the matrix w:

    a = (w^\dagger)^2 / |w|^2 .    (2.16)

In both cases the matrix A is then used to update the link, U_α → A U_α. It is important to note that an overrelaxation update does not change the value of Eq. (2.14), and hence the value of the action S_{TEKQ}[U, Q] remains invariant.
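A small numpy sketch of this machinery (variable names and the illustrative 2×2 input are ours): it performs the quaternionic projection of an arbitrary 2×2 block, builds the over-relaxation reflection a = (w†)²/|w|² of Eq. (2.16), and checks that a ∈ SU(2), that Tr(aw) = Tr(w) so the one-link action Eq. (2.14) is unchanged, and that the embedding into SU(N) is again special unitary:

```python
import numpy as np

rng = np.random.default_rng(3)
tau = [np.array([[0, 1], [1, 0]], dtype=complex),      # Pauli matrices
       np.array([[0, -1j], [1j, 0]]),
       np.array([[1, 0], [0, -1]], dtype=complex)]

wt = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))  # projected staple block
# decompose wt = w0*1 + i*sum_i wi*tau_i (w0, wi complex), then keep real parts
w0 = 0.5 * np.trace(wt)
wi = [0.5 * np.trace(wt @ t) / 1j for t in tau]
w = w0.real * np.eye(2) + 1j * sum(c.real * t for c, t in zip(wi, tau))

norm2 = np.linalg.det(w).real          # |w|^2 = Re(w0)^2 + sum_i Re(wi)^2
a = (w.conj().T @ w.conj().T) / norm2  # over-relaxation reflection, Eq. (2.16)

assert np.allclose(a.conj().T @ a, np.eye(2))      # a is unitary
assert np.isclose(np.linalg.det(a), 1.0)           # det a = 1: a is in SU(2)
assert np.isclose(np.trace(a @ w), np.trace(w))    # Tr(aw) = Tr(w): action unchanged

# embed a into SU(N) at rows/columns (i, j), giving a matrix of the form Eq. (2.10)
Nc, i, j = 5, 1, 3
A = np.eye(Nc, dtype=complex)
A[np.ix_([i, j], [i, j])] = a
assert np.allclose(A.conj().T @ A, np.eye(Nc)) and np.isclose(np.linalg.det(A), 1.0)
```

The invariance of Tr(aw) follows from w†w = |w|²·1, so that aw = w†, whose trace equals the (real) trace of w.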

U(N) overrelaxation
The authors of [8] proposed to perform an overrelaxation update of the whole SU(N) matrix at once. The starting point is again the link U_α, which we want to update according to the probability density

    P(U_\alpha) \propto \exp\big( \sqrt{bN}\, \mathrm{Tr}\big[ U_\alpha H_\alpha + H_\alpha^\dagger U_\alpha^\dagger \big] \big) ,    (2.17)

where H_α is the sum of staples defined in Eq. (2.9). The SU(N) overrelaxation step consists of a proposed update of the form

    U_\alpha^{\rm new} = W_\alpha U_\alpha^\dagger W_\alpha ,    (2.18)

where W_α is an SU(N) matrix that does not depend on U_α or U_α^new. The change is accepted or rejected with probability min(1, P(U_α^new)/P(U_α)). This is a valid update algorithm, but its efficiency depends on the acceptance rate. To maximize it one should minimize the change in the action caused by the proposed update. In fact, as we will see, if the link variables were elements of the U(N) group, the procedure described in [8,33] results in an exact microcanonical update (i.e. the action S_{TEKQ}[U, Q] is unchanged by the update). The construction is as follows. Given the singular value decomposition of the staple,

    H_\alpha = X \sigma V^\dagger ,

where σ is diagonal and real, and X, V are U(N) matrices, we construct the matrix

    W_\alpha = V X^\dagger .    (2.19)

This matrix is an element of U(N) by construction, and it is a straightforward exercise to check that the action S_{TEKQ}[U, Q] is left unchanged by the transformation of Eq. (2.18) with W_α given by Eq. (2.19).
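The microcanonical nature of this update is easy to verify numerically. The sketch below (our variable names; a random complex matrix stands in for the staple) builds W from the SVD, applies U → W U† W, and checks that Re Tr[UH], and hence the linearized action, is exactly conserved, and that the update is an involution:

```python
import numpy as np

rng = np.random.default_rng(11)
N = 6

def random_unitary(n):
    q, r = np.linalg.qr(rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n)))
    return q @ np.diag(np.diag(r) / np.abs(np.diag(r)))

U = random_unitary(N)                                        # the link to update
H = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))   # stand-in staple

X, s, Vh = np.linalg.svd(H)        # H = X @ diag(s) @ Vh, with s real and positive
W = Vh.conj().T @ X.conj().T       # W = V X^dagger, an element of U(N)
U_new = W @ U.conj().T @ W         # the overrelaxation proposal

# the action ~ -Re Tr[U H] is exactly conserved: acceptance probability 1
assert np.isclose(np.trace(U_new @ H).real, np.trace(U @ H).real)
# applying the update twice returns the original link (reversibility)
assert np.allclose(W @ U_new.conj().T @ W, U)
# the proposal is still unitary
assert np.allclose(U_new.conj().T @ U_new, np.eye(N))
```

The reversibility check is the property that, together with the conservation of the action, makes the accept/reject step trivially satisfied for U(N) links.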
All updates referred to as ORN in the rest of this paper use this method. The algorithm is applicable to one-site models directly, because for them there is no difference between the U(N) and SU(N) gauge models. This is obvious since any possible phase factor e^{iα_µ} of the links U_µ cancels in the evaluation of the action Eq. (2.1).

Algorithm 1 Update algorithm for one-site models
    1: for α = 0, . . . , d − 1 do
    2:     Generate the auxiliary matrices Q_αν for all ν ≠ α (Eq. (2.5))
    3:     Compute the staple H_α (Eq. (2.9))
    4:     Update U_α using HB/OR2/ORN
    5: end for

Moreover, if the observables of interest are center-invariant (i.e. any Wilson-loop-like quantity), the phases also do not play any role in the evaluation of these observables. This statement can be made explicit in the partition function by introducing another set of real variables α_µ, uniformly distributed in (0, 2π), and writing the probability measure as

    d\mu[U, \alpha] \propto \prod_\mu d\alpha_\mu \prod_\mu dU_\mu\, e^{-S_{TEK}[U]} .    (2.20)

After the change of variables U_µ → e^{iα_µ} U_µ, the action of the SU(N) TEK model transforms into that of the U(N) TEK model. We stress that this is a particularity of the one-site model, and that in general SU(N) and U(N) gauge theories on the lattice differ by an O(1/N²) contribution.
Given that here we are concentrating on one-site lattice models, we will stick to the previously defined microcanonical U(N) overrelaxation update for the rest of this paper. However, for models in which not all directions are fully reduced, the equivalence of SU(N) and U(N) models does not apply. For the interested reader we discuss the SU(N) case in appendix A.

A new update algorithm for the one-site model
Monte Carlo algorithms for this type of model consist of two elements: the update of the auxiliary Q_µν matrices, followed by any combination of the possible updating procedures described in the previous section: heat-bath (HB), SU(2) overrelaxation steps (OR2) and U(N) overrelaxation steps (ORN).
The new update algorithm that we propose and analyze numerically in this paper can be written in a few lines (algorithm 1 with only over-relaxation updates). It requires generating d(d − 1) auxiliary matrices per sweep, combined with over-relaxation-only updates of the links. Despite the absence of any HB update, this is a valid algorithm, since it satisfies detailed balance and ergodicity. The latter is proven in the next subsection.
Since in the one-site model the over-relaxation updates are applied to the action S TEKQ [U, Q], it is the value of this action that is unchanged by over-relaxation updates, but the original action of the TEK model S TEK [U ] does change by the combination of introducing the auxiliary variables and performing an over-relaxation sweep. The situation has some similarities with the HMC algorithm, where one also introduces auxiliary variables (the random momenta), and performs an update that leaves the Hamiltonian unchanged.

Ergodicity of over-relaxation updates
Here we will prove that the generation of the auxiliary variables followed by an overrelaxation update allows one to reach the full space of unitary matrices with non-zero probability.
The key ingredient in the proof is that the mapping from the auxiliary matrices to the staples H_α, Eq. (2.9), is a surjective map from the space of complex N × N matrices onto itself, except in some exceptional situations. If we consider one particular direction α, we have

    H_\alpha(Q) = \sum_{\nu \neq \alpha} \big( z_{\alpha\nu}\, U_\nu Q_{\alpha\nu}^\dagger + Q_{\alpha\nu}^\dagger U_\nu \big) ,    (2.21)

which is a linear map of (real) vector spaces. To show surjectivity it is enough to consider only one term ν in the previous sum. Dropping indices for that case, the transformation becomes

    Q \mapsto t^* U Q^\dagger + t\, Q^\dagger U ,    (2.22)

where t^*/t = z = exp(2πi n/N). Surjectivity amounts to invertibility of this transformation, which implies that the kernel should vanish. By means of invertible transformations (left multiplication by U^† and the substitution Q^† → Q), this problem maps onto the invertibility of the following transformation:

    Q \mapsto t^* Q + t\, U^\dagger Q U .    (2.23)

This can be easily analyzed using the basis in which U is diagonal with eigenvalues e^{iδ_a}, where the previous expression reads

    Q_{ab} \mapsto \big( t^* + t\, e^{i(\delta_b - \delta_a)} \big)\, Q_{ab} .    (2.24)

Thus, the map has a well-defined inverse whenever the expression in parentheses does not vanish for any values of the indices a, b. Vanishing can only occur if z = −1 (since in this case the image of the diagonal elements Q_aa vanishes) or in a set of zero measure (i.e. when two different eigenvalues of U obey some special relation). So, except for the z = −1 case, which will be commented on later, the transformation can reach an arbitrary N × N complex staple matrix.

Let us summarize. We first introduced the auxiliary variables \bar{Q}_µν with Gaussian probability. Then we performed a shift to obtain the Q_µν variables. Thus, all sets of non-zero measure will have a non-vanishing probability. By the previous proof we showed that, except for the exceptional cases mentioned earlier, this induces a probability distribution in staple space given by a Borel and strictly positive measure (i.e. every non-empty open subset of the space of N × N matrices has positive measure). The next step, explained below, will be to show that both OR2 and ORN updates can then produce an arbitrary U(N) matrix with non-zero probability.
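The invertibility argument can be checked directly by writing the map Q ↦ t*Q + t U†QU as an N²×N² matrix via the vectorization identity vec(AQB) = (Bᵀ ⊗ A)vec(Q); its eigenvalues are then exactly the coefficients t* + t e^{i(δ_b − δ_a)} that appear in the proof. A numpy sketch (the value of N and the fluxes are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 4

def random_unitary(n):
    q, r = np.linalg.qr(rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n)))
    return q @ np.diag(np.diag(r) / np.abs(np.diag(r)))

U = random_unitary(N)

def map_matrix(t):
    # matrix of the linear map Q -> t* Q + t U^dag Q U acting on vec(Q)
    return np.conj(t) * np.eye(N * N) + t * np.kron(U.T, U.conj().T)

# generic flux: z = t*/t = exp(2*pi*i/N) != -1  ->  the map is invertible
t = np.exp(-1j * np.pi / N)
assert np.linalg.svd(map_matrix(t), compute_uv=False).min() > 1e-6

# singular flux z = -1 (t purely imaginary): the diagonal modes are annihilated
t_sing = np.exp(-1j * np.pi / 2)
assert np.linalg.svd(map_matrix(t_sing), compute_uv=False).min() < 1e-10
```

For z = −1 the coefficient t* + t vanishes identically, so the map is singular for every U, while for z ≠ −1 a zero requires a special relation between two eigenvalues of U, which for a random U happens with probability zero.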

ORN updates
Let us first study the case of an ORN update given by Eq. (2.19). In this case we perform the SVD decomposition of the staple matrix, H = XσV^†, and update U according to

    U \to U^{\rm new} = W U^\dagger W , \qquad W = V X^\dagger .    (2.25)

We only need to show that we can generate any W with non-zero probability. This is easily seen recalling that we can write

    W = H^\dagger \big( H H^\dagger \big)^{-1/2} ,    (2.26)

and that, except in a set of zero measure, the matrix H is invertible. Now the map H → W of Eq. (2.26) is continuous on the set of invertible matrices and surjective onto U(N) (for instance, any W ∈ U(N) is the image of the staple H = W^†). Since the distribution of staples is given by a strictly positive measure, the induced distribution assigns non-zero probability to any open subset of U(N), and hence any W, and with it any U^new, can be reached.
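The closed-form expression for W can be checked against the SVD construction; below the inverse square root is computed through the eigendecomposition of the Hermitian matrix HH† (a sketch with a random stand-in staple; where scipy is available, `scipy.linalg.fractional_matrix_power` would do the same job):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5
H = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))  # generic (invertible) staple

# construction via the SVD, H = X diag(s) V^dag
X, s, Vh = np.linalg.svd(H)
W_svd = Vh.conj().T @ X.conj().T

# closed form W = H^dag (H H^dag)^(-1/2), inverse sqrt via eigendecomposition
evals, E = np.linalg.eigh(H @ H.conj().T)
inv_sqrt = E @ np.diag(evals ** -0.5) @ E.conj().T
W_closed = H.conj().T @ inv_sqrt

assert np.allclose(W_svd, W_closed)
assert np.allclose(W_svd.conj().T @ W_svd, np.eye(N))   # W is unitary
```

Note that the product V X† is uniquely determined by H even though X and V individually are not, which is why the two constructions agree.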

OR2 updates
An OR2 update consists in successive left-multiplications of the link matrix U by matrices A^{(i,j)} ∈ H^{(i,j)}. First let us show that the matrix A^{(i,j)} is an arbitrary matrix of H^{(i,j)}. If we call P^{(i,j)} the projector onto the subgroup H^{(i,j)} (via the quaternionic projection of the (i, j) block described in the previous section), we have

    A^{(i,j)} = P^{(i,j)}(U H) ,    (2.28)

and since H is arbitrary, so is U H, and therefore A^{(i,j)} is an arbitrary element of H^{(i,j)}. The full update is given by successive multiplications

    U \to U^{\rm new} = A_k A_{k-1} \cdots A_1 U .    (2.29)

Since both the multiplication and the projection to SU(2) are continuous, the application

    H \mapsto U^{\rm new}(H)    (2.30)

is continuous, and the induced probability distribution assigns non-zero probability to any open subset of SU(N).

The singular case and partially reduced lattices
In a typical simulation of the TEK model, the singular case (z = −1) can only happen in a very particular situation, since in TEK simulations the rank of the group matrices is usually taken to be a perfect square, N = L², and the twist tensor satisfies |n_µν| = kL with k and L co-prime. It is easy to see that these conditions imply that the singular case can only occur for N = 4, k = 1. Nevertheless, the conditions used in the proof are sufficient for the algorithm to be ergodic, but not necessary. In fact the previous proof shows something much stronger than ergodicity: that with a single sweep we can go from any configuration to any other with non-zero probability.
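The counting behind this claim can be reproduced in a few lines (the scan range and the restriction of k to values below L, i.e. fluxes taken mod L, are our illustrative choices): for N = L² and flux k coprime with L, z = exp(2πi·kL/N) = exp(2πi·k/L), which equals −1 only for k = 1, L = 2:

```python
from math import gcd
import cmath

# scan N = L^2 with |n| = k*L, gcd(k, L) = 1, looking for z = exp(2*pi*i*k*L/N) = -1
singular = [(L * L, k)
            for L in range(2, 40)
            for k in range(1, L)
            if gcd(k, L) == 1
            and abs(cmath.exp(2j * cmath.pi * k * L / (L * L)) + 1) < 1e-9]
assert singular == [(4, 1)]   # the only singular case: N = 4, k = 1
```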
We have performed extensive simulations of the worst case (N = 4, k = 1) with O(10^6) measurements, and found that both heatbath and over-relaxation thermalize to the same values, starting from either a cold or a hot configuration, and that expectation values are consistent within errors. Moreover, we have not observed any significant dependence of the autocorrelation time on the value of z_µν that could indicate a loss of ergodicity.
In any case, if the reader is interested in simulating one of the exceptional situations, one can simply perform a heatbath sweep from time to time, which mathematically guarantees the correctness of the algorithm even in the singular case.
We also want to point out that the above proof of ergodicity also applies to lattices in which at least two directions are reduced [34]. In this case the auxiliary variables are introduced only for the reduced directions. Since at least one term in the computation of the staples will have a contribution coming from the auxiliary variables, the staples will also be arbitrary in this case, and our proof applies. The numerical study of this case will not be covered in this work.

Frequency of the update for the auxiliary variables
In this subsection we consider possible alternatives to algorithm 1, based on varying the frequency at which the auxiliary variables are generated relative to the number of link updates per sweep. For the latter one can use either heatbath (HB) or over-relaxation (OR2 or ORN) steps. In this alternative approach we also alter the order in which generation and updates are performed. Thus, in this version (see algorithm 2) one generates the full set of d(d − 1)/2 matrices Q_µν, and then performs n link updates using any of the alternatives. The number of Q generations per link-update sweep now becomes d(d − 1)/(2n), instead of the d(d − 1) of algorithm 1.
It should be mentioned that our proof of ergodicity for overrelaxation updates does not directly apply to these new algorithms. Here one cannot separate the problem into independent directions, and one has to consider the full linear map from the vector space of d(d − 1)/2 auxiliary Q matrices to the vector space of d staples. Notice, however, that for the algorithm not to be ergodic, this map must be singular (i.e. its Jacobian must vanish almost everywhere). Taking into account that these maps are perfectly regular (cf. Eq. (2.30)), and that for d > 2 there are more auxiliary matrices than links, we think it is quite plausible that algorithm 2 with only OR updates is ergodic as well. This is also supported by the numerical evidence that we have. A formal proof might not be hard to find, but given our preference for algorithm 1, based on the results given below, we did not make the effort to include it in the paper.
In conclusion, our comparison of the two alternative algorithms will be based on the performance analysis that will follow.
To make a comparison, the first thing to examine is the time needed to update the auxiliary variables. Generating each auxiliary variable Q_µν requires the generation of the random matrix \bar{Q}_µν and two matrix multiplications (see Eq. (2.5)). Since generating the random numbers requires O(N²) operations, while matrix multiplication requires O(N³), the cost of refreshing the auxiliary variables is dominated by the matrix multiplications at large N.

Algorithm 2 Alternative update algorithm for one-site models
    1: for µ > ν do
    2:     Generate a new auxiliary matrix Q_µν (Eq. (2.5))
    3: end for
    4: for i = 1, . . . , n do
    5:     for α = 0, . . . , d − 1 do
    6:         Update U_α using HB/OR2/ORN
    7:     end for
    8: end for
(As is discussed in the text, keeping the same auxiliary variables for many updates results in a worse performance.)

Whether from a practical point of view this approach results in a better algorithm is a more complicated question, where autocorrelation times have to be taken into account. Clearly the update of the auxiliary variables is crucial to achieve ergodicity, and therefore if one does not update these variables frequently enough one might end up exploring a small region of the space of configurations. This is nicely illustrated by looking at the thermalization process (Figure 1). As the reader can see, the expectation value seems to "plateau" very fast in between the updates of the auxiliary variables. Note that if one looks at the combination of introducing the auxiliary variables and n updates, the algorithm is actually thermalizing to the correct value, and there is no loss of ergodicity. But the figure indicates that performing n > 1 updates in between the introduction of the new auxiliary fields is redundant and does not improve the thermalization at all.
Since the Markov operator is the same during thermalization as during equilibrium updates, this suggests that autocorrelations might be significantly reduced only by the first update after introducing the auxiliary fields. This is in fact the case, as the data of Table 1 show. We can see that the most efficient algorithm is our original proposal (algorithm 1), and that keeping the same set of auxiliary variables does not help in decorrelating the measurements.

Moreover, the over-relaxation updates based on SU(2) projections do not need to generate any random numbers, something that might be convenient for parallelization.
U(N) over-relaxation: The vectorization of the U(N) over-relaxation updates basically requires writing an SVD routine that makes use of the vector/multi-core structure of the CPU. Some LAPACK implementations have support for multi-threading, and represent an alternative way to profit from the multi-core structure of the CPU.

Numerical comparison of algorithms
Since the purpose of any simulation is to compute the expectation values of some observables with the highest possible accuracy, the merit of an algorithm has to be evaluated by comparing the uncertainties of some observables per unit of CPU time. There are only two factors to take into account in our particular case: the CPU time per update sweep, and the autocorrelations of the measurements. After having introduced the auxiliary variables and transformed the action into a linear function of the links, one can use different updating methods: heatbath (HB), over-relaxation based on the projection onto SU(2) subgroups (OR2), and the U(N) overrelaxation (see Algorithm 1). Since all three sweep methods have to compute the same staples and generate the same auxiliary variables, Amdahl's law puts a limit on the theoretical improvement that one sweep method can have over the others.
In principle OR2 sweeps are very simple and do not require the evaluation of any mathematical functions (cos, log, . . . ), but we have observed that the CPU time for a HB sweep is essentially the same as the CPU time required for an OR2 sweep.
The case of ORN is more difficult to evaluate, and depends crucially on the implementation of the SVD decomposition. With the standard BLAS/LAPACK implementation, the time per sweep is roughly the same as the time required for an HB/OR2 sweep (faster for N > 300, slightly slower for small values of N). But we have observed that some LAPACK implementations of the SVD decomposition (such as Intel MKL) run up to 10× faster for large N. This is mainly due to the fact that the data prefetching and block decomposition used by many efficient BLAS/LAPACK implementations sustain the nominal O(N³) scaling of the algorithm up to very large values of N (when the matrices are actually several hundreds of MB in size).
In any case, all these update methods scale like O(N³), and hereafter we will assume that the three updating methods (HB/OR2/ORN) have the same cost per sweep. This reduces the question of the most efficient algorithm to that of the one that decorrelates measurements fastest. Nevertheless, the reader should keep in mind that if an efficient implementation of BLAS/LAPACK is available for the architecture they are running on, there might be a saving in CPU time.
To make the comparisons we have chosen different observables. Firstly, Wilson loops of different sizes: W(1,1), W(3,3) and W(5,5). These are the classic observables that have been used in the past for several studies.
Secondly, smeared quantities that are known to have large autocorrelation times. In particular, quantities derived from the gradient flow have been used to test how the slow modes of algorithms decorrelate [35]. We will consider the renormalized coupling λ_TGF [36], as defined for the TEK model in Ref. [15]. We will tune b for various values of N such that λ_TGF ≈ 23, and hence the physical size (proportional to a√N) is kept approximately constant. This will allow us to study how the autocorrelation time of a physical quantity scales as we approach the continuum limit.
Measuring flow quantities is very expensive in the case of the TEK model. Compared with a typical update, the integration of the flow equations can consume O(10^5) times more computer time. This is easy to understand if one takes into account that the measurement involves the computation of O(10^4) matrix exponentials. Usually it is not a serious problem to measure these observables with high accuracy, since they have a very small variance and one can easily run many parallel replicas. But in order to measure autocorrelation times we actually need to collect at least O(100 τ_int) measurements on the same Monte Carlo chain. Since τ_int can easily go up to 200, this task is actually very difficult. We address this difficulty in two ways. First, we measure the flow quantities frequently enough to actually measure the correlation between the data, but not too frequently: we aim at measuring integrated autocorrelation times of the order of 10. The values that we will present for τ_int are then computed by scaling with the number of updates in between measurements. Second, we use a large step size to integrate the flow equations. We use the third-order Runge-Kutta integrator described in the appendix of [37], with a step size of 0.05. We explicitly checked the size of the systematic error in the measured coupling due to the numerical integration on several configurations for each value of N. In all cases the size of this relative error on the measured coupling was below 0.005%; moreover, this error decreased with increasing N, so the systematic error due to the large integration step size is negligible compared to our O(0.5%) statistical errors.

Comparison of link update algorithms and scaling
Our Monte Carlo simulations produce correlated estimates of the observables of interest. The correlation between measurements can be quantified by τ_int. We follow the notation and conventions of [38] and define the autocorrelation function of the observable O as

    \Gamma_O(t) = \big\langle \big( O(s+t) - \langle O \rangle \big)\big( O(s) - \langle O \rangle \big) \big\rangle ,    (3.1)

where the simulation time t labels the measurements and ⟨O⟩ is the average value of our observable of interest. For the integrated autocorrelation time we have

    \tau_{\rm int}(O) = \frac{1}{2} + \sum_{t=1}^{\infty} \frac{\Gamma_O(t)}{\Gamma_O(0)} .    (3.2)

In practical situations the sum Eq. (3.2) has to be truncated after a finite number of terms W that defines our summation window. Since the autocorrelation function Γ_O(t) decays exponentially fast for large t, estimates of τ_int will be accurate as long as W is large compared with the slowest mode of the Markov operator. In this work we find it sufficient to simply truncate the sum when the estimate of τ_int shows a plateau in W (see figures below), but other situations may require more sophisticated techniques (see for example [39]).

We now move on to comparing the three different update algorithms: HB, OR2 and ORN. In Table 2 we show the results from the three update methods for Wilson loop measurements at N = 49 with 2 × 10^6 updates each. Similarly, Table 3 and Figure 2 show Wilson loop measurements at N = 289 with 6 × 10^5 updates each. All the update methods give observables in agreement within errors, and the OR updates result in observables with about half the integrated autocorrelation time of the HB update. However, despite the increased computational complexity of the ORN method compared to the OR2 method, it does not result in a significantly smaller integrated autocorrelation time.

Figure 3 shows the integrated autocorrelation time of λ_TGF for N = 121 for the three update algorithms. We see the same pattern as for the Wilson loop observables: OR updates have around half the τ_int of HB updates, with little difference between OR2 and ORN.
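A minimal estimator following Eqs. (3.1) and (3.2) can be sketched in a few lines (function and variable names are ours; the window is the simple fixed-W truncation described above). On a synthetic AR(1) chain, whose exact normalized autocorrelation is ρ^t, the estimate can be compared with the exact value τ_int = 1/2 + ρ/(1 − ρ):

```python
import numpy as np

def gamma(obs, t):
    # autocorrelation function Gamma_O(t) of Eq. (3.1)
    o = obs - obs.mean()
    return np.mean(o[: len(o) - t] * o[t:]) if t else np.mean(o * o)

def tau_int(obs, W):
    # integrated autocorrelation time, Eq. (3.2) truncated at window W
    g0 = gamma(obs, 0)
    return 0.5 + sum(gamma(obs, t) for t in range(1, W + 1)) / g0

# synthetic AR(1) chain with known tau_int = 0.5 + rho/(1 - rho)
rng = np.random.default_rng(1)
rho, n = 0.7, 200_000
x = np.empty(n)
x[0] = rng.normal()
for i in range(1, n):
    x[i] = rho * x[i - 1] + np.sqrt(1 - rho * rho) * rng.normal()

est, exact = tau_int(x, W=40), 0.5 + rho / (1 - rho)
assert abs(est - exact) < 0.3                               # ~2.83 up to noise
assert abs(tau_int(rng.normal(size=n), W=40) - 0.5) < 0.1   # iid data: tau = 1/2
```

In a real analysis one would also attach an error to the estimate and vary W until a plateau is reached, as done in the text.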
Since we have tuned the physical volume to be constant, we can repeat these measurements at different values of N to determine how the integrated autocorrelation time scales as a function of the lattice spacing a, using the relation √N = L/a. The results are listed in Table 4 for the three update methods, and τ_int as a function of a^−2 is shown in Figure 4.
The generic expectation in normal lattice simulations is for a local update to behave like a random walk, and so for τ_int to scale like a^−2. A priori it is not clear that this should also apply to the TEK model, since in this case a "local" update of a single gauge link actually involves all the degrees of freedom of the lattice. It turns out, however, that all three update methods exhibit the same critical scaling as one would expect for a local algorithm, namely τ_int ∼ a^−2, as can be seen in Figure 4.

Conclusions
We have studied several simulation algorithms for one-site lattice models. In particular we have focused on the TEK model, which is relevant in the context of the large N expansion of gauge theories and has recently been used to compute many interesting properties of SU (∞) gauge theories [14,15,17].
Following Fabricius and Haan [32], we introduce auxiliary variables to make the action a linear function of the links, and study several link-update algorithms. Up to now all authors have included a combination of heat-bath steps and overrelaxation steps in their algorithms. However, we show that this is not necessary, since once the auxiliary variables are updated, overrelaxation alone suffices to make the algorithm ergodic. This is in contrast with usual lattice gauge theory, where over-relaxation sweeps produce microcanonical movements and are therefore not ergodic. Regarding the over-relaxation updates, we study two kinds. First, the one based on the projection onto SU(2) subgroups (OR2). Second, the OR method over the whole group, as proposed in [8,33] (ORN). Indeed, we realize that for one-site models the latter does not change the value of the action, making a Metropolis accept/reject step unnecessary. This is due to the equivalence of the U(N) and SU(N) groups for these models.
Finally, we compare the performance of the different alternatives. We show that, at different values of N and for different observables, overrelaxation sweeps decorrelate faster than heatbath sweeps (τ^OR_int/τ^HB_int ≈ 1/2). We see no big differences in terms of autocorrelation times between the two possible overrelaxation methods (OR2 and ORN). Each algorithm has its own benefits: OR2 is simple and easy to vectorize, while ORN might profit from a highly optimized routine for matrix operations, in particular the SVD decomposition. We conclude by studying the performance for computing a renormalized quantity in the continuum limit, the twisted gradient flow renormalized coupling. We see that the three tested link-update algorithms scale like a^−2; hence, none of them has a better critical exponent.