Distance between configurations in Markov chain Monte Carlo simulations

For a given Markov chain Monte Carlo algorithm we introduce a distance between two configurations that quantifies the difficulty of transition from one configuration to the other. We argue that the distance takes a universal form for the class of algorithms that generate local moves in the configuration space. We explicitly calculate the distance for the Langevin algorithm, and show that it indeed has the properties desired and expected of a distance. We further show that the distance for a multimodal distribution, which is large without enhancements, is dramatically reduced by the introduction of a tempering method. We also argue that, when the original distribution is highly multimodal with a large number of degenerate vacua, an anti-de Sitter-like geometry naturally emerges in the extended configuration space.


Introduction
In Markov chain Monte Carlo (MCMC) simulations, one often encounters a situation where the equilibrium distribution is multimodal and the computation requires an extraordinarily long time for the system to reach global equilibrium. In order to speed up the relaxation to global equilibrium, one usually implements an additional method such as overrelaxation [1] or simulated/parallel tempering [2,3,4,5]. However, it is not always possible to find a good overrelaxation method. Also, the tempering methods require an adjustment of the tempering parameter such that each local move has a reasonable acceptance rate, which is usually done in a manual or adaptive way (and sometimes in an empirical way). Thus, it would be useful to have a numerical measure that quantifies how far apart two configurations are for a given MCMC algorithm and how fast the system approaches global equilibrium. Such a measure could then be used to evaluate numerically how much the situation is improved by the introduction of additional methods, which would make the above adjustment easier.
Various methods have been proposed for this purpose, but they either quantify the separation between two different probability distributions (such as the $L^1$ distance) or measure the similarity between two data points for a given observable (such as the autocorrelation); there has been no useful tool that directly measures an effective distance between two different configurations. In this paper, for a given MCMC algorithm we introduce a distance between two configurations that quantifies the difficulty of transition from one configuration to the other. We argue that the distance takes a universal form for the class of algorithms that generate local moves in the configuration space. We calculate the distance explicitly for the Langevin algorithm, and show that it has the properties desired and expected of a distance quantifying the separation between two configurations. We further show that the distance for a multimodal distribution is dramatically reduced from a large value by the introduction of a tempering method.
The introduction of such a distance opens a way to investigate relaxation processes in an MCMC algorithm in terms of the geometry of the configuration space itself (which should not be confused with the geometry of the set of probability distributions). As an example, we argue that, when the original distribution is highly multimodal with a large number of degenerate vacua, our distance can be regarded as a geodesic distance with respect to an anti-de Sitter-like (AdS-like) metric. Such a geometrical viewpoint enables us to adjust the tempering parameter in a purely geometrical way: the adjustment can be carried out by requiring that the resulting geodesic distance be minimized. This paper is organized as follows. In section 2, we first introduce a positive, symmetric operator (to be called the transfer matrix) from the transition matrix of a given MCMC algorithm, and then define the distance between two configurations, which can be written solely in terms of the transfer matrix. We argue that the distance takes a universal form for the class of MCMC algorithms that generate local moves of configuration. The statement is confirmed explicitly in the Appendix for the Langevin and Metropolis algorithms using a simple model. In section 3, in order to demonstrate that the distance actually has the desired and expected properties, we study the distance for the Langevin algorithm in a one-dimensional configuration space with both unimodal and multimodal distributions. In section 4, we study the distance for a multimodal distribution when the simulated tempering method is implemented, and show that the distance is indeed reduced by a huge amount. We also investigate the local geometry of the extended configuration space of the simulated tempering, and show that it has an AdS-like geometry. Section 5 is devoted to conclusions and an outlook on future work.

Definition of distance
In this section, we define the distance between two configurations for a given MCMC algorithm. The distance will be written solely in terms of the kernel of a positive, symmetric matrix (the transfer matrix). We argue that the distance should take a universal form for the class of MCMC algorithms that generate local moves in the configuration space.

Preparation
Let $M = \{x\}$ be a configuration space, and suppose that we are given an MCMC algorithm which generates a new configuration $x$ from a configuration $y$ with the conditional probability $P(x|y)$ at each step. We assume that this yields a stochastic process with suitable ergodic properties, such that $P_n(x|x_0) \equiv (P^n)(x|x_0)$ converges to a unique equilibrium distribution $p_{\rm eq}(x)$ in the limit $n \to \infty$, irrespective of the initial value $x_0$. We further make two assumptions:

1. The algorithm satisfies the detailed balance condition for a given real-valued action $S(x)$, which ensures that the equilibrium distribution is given by $p_{\rm eq}(x) = e^{-S(x)}/Z$ with $Z = \int dx\, e^{-S(x)}$.

2. The eigenvalues of $P$ are all positive. (Footnote 1)

By using the bra-ket notation $P(x|y) = \langle x|\hat P|y\rangle$ and the configuration operator $\hat x \equiv \int dx\, x\, |x\rangle\langle x|$, the above two conditions can be rephrased as the single statement that the operator $\hat P\, e^{-S(\hat x)}$ is positive and symmetric. This also means that the "transfer matrix"
$$ \hat T \equiv e^{(1/2)S(\hat x)}\, \hat P\, e^{-(1/2)S(\hat x)} $$
is positive and symmetric. $\hat T$ shares the same set of eigenvalues as $\hat P$, and according to our assumptions, all the eigenvalues are positive and the largest eigenvalue is unity with no degeneracy.

Footnote 1: Note that the second condition is not too restrictive. In fact, when a transition matrix does not satisfy this condition (i.e., when some of the eigenvalues are negative), one can consider a new transition matrix $P_{\rm new} \equiv (P_{\rm old})^2$, for which the eigenvalues are all positive and the same equilibrium distribution is reached. Note also that this condition is always satisfied for the Langevin algorithm [see, e.g., (3.2)-(3.6) below].

Note that $P_n(x_1|x_2) = \langle x_1|\hat P^n|x_2\rangle$ can be expressed in the form
$$ P_n(x_1|x_2) = e^{-(1/2)S(x_1)}\, K_n(x_1,x_2)\, e^{(1/2)S(x_2)} $$
with the kernel of $\hat T$,
$$ K_n(x_1,x_2) \equiv \langle x_1|\hat T^n|x_2\rangle . $$
If we introduce the spectral decomposition of $\hat T$ as (Footnote 2)
$$ \hat T = \sum_{k \geq 0} \lambda_k\, |k\rangle\langle k| $$
with $\lambda_0 = 1 > \lambda_1 \geq \lambda_2 \geq \cdots > 0$, then the kernel is expressed as
$$ K_n(x_1,x_2) = \sum_{k \geq 0} \lambda_k^n\, \langle x_1|k\rangle\langle k|x_2\rangle . $$
The relaxation of the system to global equilibrium in the stochastic process corresponds to the decoupling of higher modes from $\hat T^n = |0\rangle\langle 0| + \sum_{k \geq 1} \lambda_k^n\, |k\rangle\langle k|$ as $n$ increases. Thus, the large scale behavior of relaxation (Footnote 3) is determined by the wave functions $\langle x|k\rangle$ with small $k$, whose eigenvalues $\lambda_k$ are close to unity. Note that the decoupling occurs earlier for modes with larger $k$.
If we further introduce the Hamiltonian $\hat H$ by writing $\hat T = e^{-\epsilon \hat H}$ with a small time increment $\epsilon$, then $\hat T^n$ is expressed as $e^{-t\hat H}$ with $t \equiv n\epsilon$. For generic MCMC algorithms, $\hat H$ is a nonlocal operator acting on functions over $M$. However, for the class of MCMC algorithms that generate only local moves of configuration, $\hat H$ becomes local in the limit $\epsilon \to 0$ (see subsection 2.4 for further arguments on locality).
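The construction above can be made concrete with a small numerical sketch (ours, not the paper's): we discretize $M$, build a lazy nearest-neighbour Metropolis chain for an assumed Gaussian action $S(x) = x^2/2$, form the transfer matrix $\hat T = e^{S/2}\,P\,e^{-S/2}$, and check that it is symmetric with top eigenvalue $1$ and ground state $\propto e^{-S/2}$.

```python
import numpy as np

# Minimal sketch (our own toy discretization, not from the paper): build the
# transition matrix P of a lazy nearest-neighbour Metropolis chain for an
# assumed Gaussian action S(x) = x^2/2, form T = e^{S/2} P e^{-S/2},
# and inspect its spectrum.
N = 201
x = np.linspace(-4.0, 4.0, N)
S = 0.5 * x**2

P = np.zeros((N, N))
for j in range(N):                       # column j holds transitions out of x_j
    for i in (j - 1, j + 1):             # propose a nearest neighbour ...
        if 0 <= i < N:
            P[i, j] = 0.5 * min(1.0, np.exp(S[j] - S[i]))   # ... Metropolis-accept
    P[j, j] = 1.0 - P[:, j].sum()        # rejected proposals stay put
P = 0.5 * (np.eye(N) + P)                # laziness makes all eigenvalues positive
                                         # (cf. the P_new = P_old^2 trick in footnote 1)

# Detailed balance P(i|j) e^{-S_j} = P(j|i) e^{-S_i} makes T symmetric
T = np.exp(S / 2)[:, None] * P * np.exp(-S / 2)[None, :]
lam, U = np.linalg.eigh(T)               # ascending: lam[-1] is the top eigenvalue
psi0 = np.abs(U[:, -1])                  # ground state, expected to be ~ e^{-S/2}
```

The laziness step mirrors the positivity assumption on the eigenvalues of $P$; without it, a nearest-neighbour chain can have negative eigenvalues.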
We would like to introduce a distance $d_n(x_1,x_2)$ for a pair of configurations $x_1, x_2 \in M$ such that it enjoys the following properties in addition to the usual axioms of a distance: • (P1) The distance $d_n(x_1,x_2)$ vanishes in the limit $n \to \infty$ for any pair $x_1$ and $x_2$, reflecting the fact that any configuration can be reached from any other configuration in a finite number of steps.
• (P2) If $x_1$ can easily be reached from $x_2$, then the distance is small even for finite $n$.
• (P3) If the distribution $e^{-S(x)}/Z$ is multimodal, and if $x_1$ and $x_2$ belong to different modes, then the distance is very large for finite $n$.
As we see below, such a distance can be introduced naturally if we look at transitions in a system that is already in global equilibrium.

Footnote 2: The ground state wave function is given by $\langle x|0\rangle = e^{-(1/2)S(x)}/\sqrt{Z}$.

Footnote 3: By "large scale behavior (or structure)" we mean the behavior (or structure) both at large step numbers and at scales larger than the increment of configuration at each Monte Carlo step.

Half-time overlap
Let the system be in global equilibrium with probability distribution $p_{\rm eq}(x) = e^{-S(x)}/Z$. We denote by $X_n$ the set of sequences of $n$ processes in $M$, and by $X_n(x_1,x_2)$ the subset of $X_n$ consisting of sequences which start from $x_2$ and end at $x_1$. We then define $f_n(x_1,x_2)$ to be the ratio of the sizes of the two sets:
$$ f_n(x_1,x_2) \equiv \frac{|X_n(x_1,x_2)|}{|X_n|} = P_n(x_1|x_2)\, p_{\rm eq}(x_2), $$
which can be expressed as
$$ f_n(x_1,x_2) = \frac{1}{Z}\, e^{-(1/2)S(x_1)}\, K_n(x_1,x_2)\, e^{-(1/2)S(x_2)} . $$
The latter expression shows that $f_n(x_1,x_2)$ is a symmetric function of $x_1$ and $x_2$. It gives the probability to find a sequence of $n$ processes from $x_2$ to $x_1$ (or from $x_1$ to $x_2$) among all possible $n$-step processes, and thus expresses the "mobility" between the two configurations $x_1$ and $x_2$. Note that $f_n(x_1,x_2)$ should take a very small value if the equilibrium distribution is multimodal and $x_1$ and $x_2$ belong to different modes.
Since our interest is in the mobility between two different configurations, we normalize the mobility such that it takes the maximal value ($= 1$) for closed loops which start from and end at the same configuration:
$$ F_n(x_1,x_2) \equiv \frac{f_n(x_1,x_2)}{\sqrt{f_n(x_1,x_1)\, f_n(x_2,x_2)}} = \frac{K_n(x_1,x_2)}{\sqrt{K_n(x_1,x_1)\, K_n(x_2,x_2)}} . \quad (2.10) $$
We call the normalized mobility $F_n(x_1,x_2)$ the half-time overlap of the configurations $x_1$ and $x_2$ for $n$ steps. In fact, by using (2.9) and introducing the "half-time elapsed state" $|x, n/2\rangle \equiv \hat T^{n/2}|x\rangle$, the function $F_n(x_1,x_2)$ can be expressed as the overlap of two normalized half-time elapsed states:
$$ F_n(x_1,x_2) = \frac{\langle x_1, n/2 \,|\, x_2, n/2\rangle}{\bigl\| |x_1, n/2\rangle \bigr\|\, \bigl\| |x_2, n/2\rangle \bigr\|} . $$
One can easily prove that $F_n(x_1,x_2)$ enjoys the following properties:
$$ 0 < F_n(x_1,x_2) \leq 1, \quad (2.13) \qquad F_n(x,x) = 1, \quad (2.14) \qquad \lim_{n\to\infty} F_n(x_1,x_2) = 1 . \quad (2.15) $$
Furthermore, from the properties of $f_n(x_1,x_2)$, we expect that • $F_n(x_1,x_2) \simeq 1$ when $x_1$ can be easily reached from $x_2$ in $n$ steps. (2.16) • $F_n(x_1,x_2) \ll 1$ when $x_1$ and $x_2$ are separated by high potential barriers. (2.17)

Definition of distance
Based on the half-time overlap given above, we define a distance between $x_1, x_2 \in M$ as
$$ \theta_n(x_1,x_2) \equiv \arccos F_n(x_1,x_2), $$
which is the angle between the two normalized half-time elapsed states. This satisfies the properties (P1)-(P3) due to (2.15)-(2.17), as well as the axioms of a distance (positivity, symmetry, and the triangle inequality, the last because $\theta_n$ is the angle between two unit vectors). It is often convenient to introduce other distances from the half-time overlap, such as
$$ d_n(x_1,x_2) \equiv \sqrt{-2 \ln F_n(x_1,x_2)} . $$
Although such distances do not satisfy the triangle inequality, they agree with $\theta_n(x_1,x_2)$ up to higher order corrections when $\theta_n(x_1,x_2)$ is small enough. For the rest of this paper we mainly use $d_n(x_1,x_2)$ as the definition of distance, because it gives the simplest expressions for the following examples.
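As a sketch of how these quantities can be evaluated in practice (our own finite-dimensional discretization; the lattice, the lazy Metropolis chain, and the chosen points are assumptions, not the paper's setup), one can diagonalize the transfer matrix once and reuse the spectral decomposition for every $n$:

```python
import numpy as np

# Hypothetical sketch: half-time overlap F_n and the distances
# theta_n = arccos F_n, d_n = sqrt(-2 ln F_n) for a lazy nearest-neighbour
# Metropolis chain with the unimodal action S(x) = x^2/2 on a lattice.
N = 121
x = np.linspace(-3.0, 3.0, N)
S = 0.5 * x**2
P = np.zeros((N, N))
for j in range(N):
    for i in (j - 1, j + 1):
        if 0 <= i < N:
            P[i, j] = 0.5 * min(1.0, np.exp(S[j] - S[i]))
    P[j, j] = 1.0 - P[:, j].sum()
P = 0.5 * (np.eye(N) + P)                               # lazy: positive spectrum
T = np.exp(S / 2)[:, None] * P * np.exp(-S / 2)[None, :]
lam, U = np.linalg.eigh(T)

def overlap(n, i, j):
    """F_n(x_i, x_j) = K_n(i,j) / sqrt(K_n(i,i) K_n(j,j)), clipped at 1."""
    K = (U * lam**n) @ U.T                              # K_n = T^n, spectrally
    return min(1.0, K[i, j] / np.sqrt(K[i, i] * K[j, j]))

i, j = 50, 70                                           # x = -0.5 and x = +0.5
F_small, F_large = overlap(100, i, j), overlap(100000, i, j)
theta_small = np.arccos(F_small)
d_small = np.sqrt(-2.0 * np.log(F_small))
d_large = np.sqrt(-2.0 * np.log(F_large))
```

The behaviour illustrates (P1) and (P2): for this unimodal action the overlap approaches $1$ and the distance collapses to zero as $n$ grows.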

Universality of distance
We close this section with a comment on the universality of our distance. As long as the chosen algorithm generates only local moves of configuration, we expect that the large scale structure of the distance $d_n(x_1,x_2)$ takes a universal form, in the sense that differences of distance between two such algorithms can always be absorbed into a rescaling of $n$. In fact, our distance is expressed solely in terms of the kernel of the transfer matrix $\hat T$, and, when the transition is sufficiently local, the corresponding Hamiltonian $\hat H$ is a local operator acting on functions over $M$ in almost the same way. We thus expect that the eigenvalues $\lambda_k$ and the wave functions $\langle x|k\rangle$ are almost the same for small $k$, which ensures the same large scale structure of the kernel $K_n(x_1,x_2)$ and thus of the distance $d_n(x_1,x_2)$. This statement is checked explicitly in the Appendix for the Langevin and Metropolis algorithms using a simple one-dimensional model. Note that the argument for universality becomes more reliable as the number of degrees of freedom of the system becomes larger.
This universality will not hold for algorithms that generate nonlocal moves of configuration (such as the overrelaxation algorithm) (Footnote 6). The Hybrid Monte Carlo (HMC) algorithm [6] is marginal in this sense, and we leave it for future study to investigate whether the distance for the HMC algorithm exhibits the same behavior as that for local algorithms.

Examples
In view of the universality of the distance discussed in the previous section, we take the Langevin algorithm as our MCMC algorithm and write down the explicit form of the distance for the configuration space $M = \mathbb{R}$. We show that the resulting distance indeed satisfies the properties (P1)-(P3).
We consider the Langevin equation (Footnote 7)
$$ \dot x(t) = -D\, S'(x(t)) + \eta(t), \qquad \langle \eta(t)\, \eta(t')\rangle = 2D\, \delta(t - t') . $$
Then, the probability distribution $P(x,t)$ obeys the Fokker-Planck equation
$$ \partial_t P(x,t) = D\, \partial_x \bigl[ \partial_x + S'(x) \bigr] P(x,t), $$
which, in terms of $\psi(x,t) \equiv e^{(1/2)S(x)}\, P(x,t)$, becomes the Euclidean Schrödinger equation $-\partial_t \psi = \hat H \psi$ with the Fokker-Planck Hamiltonian (we set $D = 1$ below)
$$ \hat H = D \Bigl[ -\partial_x^2 + V(x) \Bigr], \qquad V(x) = \frac{1}{4}\, [S'(x)]^2 - \frac{1}{2}\, S''(x), $$
which is positive and symmetric. The transfer matrix is expressed as $\hat T = e^{-\epsilon \hat H}$, so that the corresponding kernel is given by
$$ K_t(x_1,x_2) = \langle x_1| e^{-t\hat H} |x_2\rangle, $$
with which the half-time overlap is expressed as
$$ F_t(x_1,x_2) = \frac{K_t(x_1,x_2)}{\sqrt{K_t(x_1,x_1)\, K_t(x_2,x_2)}} . $$

Footnote 6: The tempering algorithms are nonlocal when viewed from the original configuration space $M = \{x\}$, but they actually generate local moves in the extended configuration space, so that the universality of the distance is expected to hold for these algorithms as well. See the arguments below (4.5) in subsection 4.2 for details.

Footnote 7: In this section we exclusively use a continuous notation; namely, the distribution $P_n(x|x_0)$ will be written as $P(x,t)$ with $t = n\epsilon$.

Example 1: Gaussian
We first consider the Gaussian action
$$ S(x) = \frac{\omega}{2}\, x^2 \qquad (\omega > 0), \quad (3.9) $$
for which the Hamiltonian $\hat H$ takes the form
$$ \hat H = -\partial_x^2 + \frac{\omega^2}{4}\, x^2 - \frac{\omega}{2} . \quad (3.10) $$
Note that the last term removes the zero-point energy of this harmonic oscillator. By using the (Mehler) kernel
$$ K_t(x_1,x_2) = \sqrt{\frac{\omega}{4\pi \sinh \omega t}}\; e^{\omega t/2}\, \exp\!\left[ -\frac{\omega}{4 \sinh \omega t} \Bigl\{ (x_1^2 + x_2^2) \cosh \omega t - 2 x_1 x_2 \Bigr\} \right], $$
the distance $d_t(x_1,x_2)$ is easily obtained to be
$$ d_t^2(x_1,x_2) = \frac{\omega\, (x_1 - x_2)^2}{2 \sinh \omega t} . $$
This gives a flat, translation invariant metric on the entire configuration space $M = \mathbb{R}$ (Footnote 8). Note that the distance decreases exponentially as $d_t^2 \sim e^{-\omega t}$, from which we find that the relaxation time of the system is given by $\sim 1/\omega$. Since the manner of relaxation is almost the same for any unimodal distribution, we see that the distance rapidly decreases to zero whenever the action gives a unimodal distribution. We thus confirm that the properties (P1) and (P2) hold in this example.

Footnote 8: Our discussion is easily generalized to the multivariate Gaussian case $S(x) = \frac{1}{2} \sum_i \omega_i x_i^2$, for which the distance is given by $d_t^2(x_1,x_2) = \sum_i \omega_i (x_{1,i} - x_{2,i})^2 / (2 \sinh \omega_i t)$.
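The closed form above can be checked numerically (our own finite-difference discretization; $\omega = 1$, $t = 1$, and the lattice are assumed parameters): diagonalize the discretized Fokker-Planck Hamiltonian, build the heat kernel $K_t = e^{-t\hat H}$, and compare $d_t^2 = -2 \ln F_t$ with $\omega (x_1 - x_2)^2 / (2 \sinh \omega t)$.

```python
import numpy as np

# Numerical check of the Gaussian result: discretize
# H = -d^2/dx^2 + (omega^2/4) x^2 - omega/2 and compare d_t^2 with the
# closed form omega (x1 - x2)^2 / (2 sinh omega t).
omega, t = 1.0, 1.0
N, L = 401, 8.0
x = np.linspace(-L, L, N)
h = x[1] - x[0]
# second-order finite-difference Laplacian (Dirichlet boundaries)
H = (np.diag(np.full(N, 2.0)) - np.diag(np.ones(N - 1), 1)
     - np.diag(np.ones(N - 1), -1)) / h**2
H += np.diag(0.25 * omega**2 * x**2 - 0.5 * omega)
E, U = np.linalg.eigh(H)

K = (U * np.exp(-t * E)) @ U.T            # heat kernel K_t = e^{-tH}
i, j = np.argmin(np.abs(x - 1.0)), np.argmin(np.abs(x + 1.0))
F = K[i, j] / np.sqrt(K[i, i] * K[j, j])  # half-time overlap F_t(+1, -1)
d2_num = -2.0 * np.log(F)
d2_exact = omega * (x[i] - x[j])**2 / (2.0 * np.sinh(omega * t))
```

Since $F_t$ is a ratio of kernels, lattice normalization factors and constant shifts of the spectrum cancel, so the comparison is clean.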

Example 2: perturbation around the Gaussian
We now consider the case where the Gaussian action (3.9) is perturbed by a quartic term, which we write as
$$ S(x) = \frac{\omega}{2}\, x^2 + \frac{\lambda}{4}\, x^4 . $$
The perturbative calculation of the distance is easily carried out, and at first order in $\lambda$ the correction to $d_t^2(x_1,x_2)$ depends on the combinations $s \equiv \sinh(\omega t)$ and $c \equiv \cosh(\omega t)$ in a way that is neither flat nor translation invariant. This shows that the distance generically does not take a flat, translation invariant form when the unimodal distribution is not Gaussian.

Example 3: double-well potential
Finally, in order to confirm the property (P3), we consider a case where a high potential barrier exists between local minima:
$$ S(x) = \frac{\beta}{2}\, (x^2 - 1)^2 . $$
We assume that $\beta$ takes a large value, so that the equilibrium distribution $e^{-S(x)}/Z$ is multimodal. The potential $V(x) = \frac{1}{4}[S'(x)]^2 - \frac{1}{2}S''(x)$ becomes sextic (see Fig. 1), and has local minima at $x = 0$ and $x = \pm x_+$ with
$$ x_+ \equiv \left[ \frac{2 + \sqrt{1 + 9/\beta}}{3} \right]^{1/2} \simeq 1 . $$
One can easily see that $x = \pm x_+$ are the global minima by looking at the values of $S''(x)$ there (note that $S'(x)$ vanishes at these points). The eigenvalues $E_k$ ($E_0 = 0 < E_1 < E_2 < \cdots$) of $\hat H$ can be estimated roughly as follows. We first note that the Gaussian approximation around the global minima should be effective when $\beta \gg 1$, and thus the first two eigenstates of $\hat H$ can be approximated as superpositions of the ground states $|0_+\rangle$ and $|0_-\rangle$ of the Gaussian potentials approximating the wells around $x = \pm x_+$:
$$ |k=0\rangle \simeq \frac{1}{\sqrt 2}\, \bigl( |0_+\rangle + |0_-\rangle \bigr), \qquad |k=1\rangle \simeq \frac{1}{\sqrt 2}\, \bigl( |0_+\rangle - |0_-\rangle \bigr) . $$
The energy difference between the two states is exponentially small, $E_1 = O(e^{-\beta/2})$, as can be estimated by an instanton calculation.
As for the second excited state of $\hat H$, we expect the hierarchy $E_0 = 0 < E_1 = O(e^{-\beta/2}) \ll E_2 = O(\beta)$. In fact, there are two candidate approximations for the second excited state: one represents it as a superposition of the first excited states of the approximate Gaussian potentials around the global minima $x = \pm x_+$, while the other represents it as the ground state of the approximate Gaussian potential around the local minimum $x = 0$. In our case the latter is the correct choice, because the former gives $E_2 \simeq 4\beta$ while the latter gives $E_2 \simeq 2\beta$ (see Fig. 3 in the Appendix).
By using the expansion of the kernel,
$$ K_t(x_1,x_2) = \langle x_1|0\rangle\langle 0|x_2\rangle + e^{-E_1 t}\, \langle x_1|1\rangle\langle 1|x_2\rangle + e^{-E_2 t}\, \langle x_1|2\rangle\langle 2|x_2\rangle + \cdots, $$
the distance $d_t(x_1,x_2)$ is expressed as an expansion in the factors $e^{-E_k t}$ (3.20). We now consider the following two cases:

• Case 1: $x_1$ and $x_2$ belong to the same mode (e.g. $x_1 \simeq x_2 \simeq +1$).

• Case 2: $x_1$ and $x_2$ belong to different modes (e.g. $x_1 \simeq +1$ and $x_2 \simeq -1$).

We first discuss Case 1, where $x_1$ and $x_2$ belong to the same mode. If we expand $d_t^2(x_1, x_2 = x_1 + \delta x)$ to second order in $\delta x$, the coefficient of $e^{-E_1 t}$ turns out to be vanishingly small, as can be understood by considering the supports of the wave functions $\langle x|0_\pm\rangle$. Thus, the dominant contribution to the distance comes from the $e^{-E_2 t}$ term in (3.20), and the two configurations $x_1$ and $x_2$ can reach each other within a rather short time $\sim 1/E_2 = O(1/\beta)$. This confirms that two configurations are close to each other, even for a multimodal distribution, as long as they belong to the same mode. In contrast, in Case 2, where $x_1$ and $x_2$ belong to different modes, the coefficient of $e^{-E_1 t}$ is not small, so that the dominant contribution to the distance comes from the first excited state, and the transition between $x_1$ and $x_2$ requires an exponentially long time $\sim 1/E_1 = O(e^{\beta/2})$. Accordingly, the distance between $x_1 \simeq +1$ and $x_2 \simeq -1$ takes a large value, $d_t^2(x_1,x_2) \propto \beta$, which decreases only very slowly as $t$ elapses. We thus see that the property (P3) certainly holds in this example.
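The claimed hierarchy of eigenvalues can be probed numerically (our own finite-difference discretization, with $\beta = 10$ chosen as an assumption to keep $E_1$ numerically resolvable):

```python
import numpy as np

# Rough numerical check of E_0 = 0 << E_1 = O(e^{-beta/2}) << E_2 = O(beta)
# for the double-well action S(x) = (beta/2)(x^2 - 1)^2 with
# V = (1/4) S'^2 - (1/2) S''.
beta = 10.0
N, L = 601, 2.5
x = np.linspace(-L, L, N)
h = x[1] - x[0]
Sp = 2.0 * beta * x * (x**2 - 1.0)          # S'(x)
Spp = 2.0 * beta * (3.0 * x**2 - 1.0)       # S''(x)
V = 0.25 * Sp**2 - 0.5 * Spp                # sextic potential, minima at 0 and ±x_+
H = (np.diag(np.full(N, 2.0)) - np.diag(np.ones(N - 1), 1)
     - np.diag(np.ones(N - 1), -1)) / h**2 + np.diag(V)
E = np.linalg.eigvalsh(H)
E0, E1, E2 = E[0], E[1], E[2]               # expect ~0, exponentially small, ~2*beta
```

Even at the moderate value $\beta = 10$, the splitting $E_1$ sits orders of magnitude below $E_2 \simeq 2\beta$, matching the instanton picture.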

Distance for the simulated tempering
Suppose that we are given a multimodal distribution, and that we implement the simulated tempering method [2]. We take as the tempering parameter the overall coefficient $\beta$ of the action, and introduce the parameter set $\mathcal{A} \equiv \{\beta_a\}_{a=0,1,\ldots,A}$ with $\beta_0 > \beta_1 > \cdots > \beta_A$, where $\beta_0$ is the overall coefficient of the original action. We denote by $S(x;\beta_a)$ the action with $\beta_0$ replaced by $\beta_a$.
In the simulated tempering algorithm, we extend the original configuration space $M$ to $M \times \mathcal{A} = \{X = (x, \beta_a)\}$, and introduce a stochastic process that converges to global equilibrium with the probability distribution
$$ P_{\rm eq}(X) = P_{\rm eq}(x, \beta_a) = w_a\, e^{-S(x;\beta_a)} . \quad (4.1) $$
Ideally, the weights $w_a$ ($a = 0, 1, \ldots, A$) are chosen such that the appearance ratio of each value $\beta_a$ is the same for all $a$ [i.e. $\int dx\, P_{\rm eq}(x, \beta_a) = 1/(A+1)$]. In this paper, we assume that the $w_a$ are already chosen in this way. A possible stochastic process that converges to $P_{\rm eq}(X)$ is given by a Markov chain consisting of the following two steps:

• Step 1. Generate a transition in the $x$ direction, $X = (x, \beta_a) \to X' = (x', \beta_a)$, with some proper algorithm (such as the Langevin or Metropolis algorithm).

• Step 2. Generate a transition in the $\beta$ direction, $X = (x, \beta_a) \to X' = (x, \beta_{a'})$ with $a'$ adjacent to $a$, accepting the move with the Metropolis probability $\min\bigl(1,\, P_{\rm eq}(x, \beta_{a'})/P_{\rm eq}(x, \beta_a)\bigr)$. (4.2)
It is easy to see that each step satisfies the detailed balance condition with respect to $P_{\rm eq}(X)$. After one obtains a sample from the extended configuration space with probability distribution $P_{\rm eq}(X)$, one estimates expectation values with respect to the original action $S(x;\beta_0)$ by using only the subsample at $\beta_{a=0}$.
Since the distribution with smaller $\beta_a$ is less multimodal, two configurations belonging to different modes at $\beta_0$ can now easily reach each other by moving around in the extended configuration space, as long as moves in the $\beta$ direction occur frequently (as we assume here). This improves the convergence to global equilibrium, and accordingly the distance between two configurations $X_1 = (x_1, \beta_0)$ and $X_2 = (x_2, \beta_0)$ will be reduced. To demonstrate that this actually happens, we calculated the distance between $X_1 = (x_1 = +1, \beta_0)$ and $X_2 = (x_2 = -1, \beta_0)$ for the action $S(x;\beta_0) = (\beta_0/2)(x^2 - 1)^2$ with $\beta_0 = 20$, using the simplest setting $A = 1$ and $\beta_1 = 1$. For Step 1 above, we adopted the Metropolis algorithm with a Gaussian proposal distribution of variance $\sigma^2 = 0.01$, where the configuration space is restricted to the interval $[-3, 3]$ and is latticized with cutoff $a = 0.001$. Furthermore, denoting the transition matrix for Step $s$ by $\hat P^{(s)}$ ($s = 1, 2$), we set the transition matrix of the simulated tempering algorithm to $\hat P = \hat P^{(1)} \hat P^{(2)} \hat P^{(1)}$, so that the combined transition matrix satisfies the detailed balance condition. The result we obtained clearly shows that the introduction of the simulated tempering method dramatically reduces the distance for such a multimodal distribution.

AdS-like geometry of the extended configuration space
When a system has a very large number of degrees of freedom (as in the large-volume or low-temperature limit), the issues related to the multimodality of the probability distribution become serious. However, it then becomes a good approximation to coarse-grain the elements of the configuration space $M$ by regarding all configurations in the same mode as a single configuration, and to introduce a new configuration space $\tilde M$ consisting of coarse-grained configurations that are well separated from each other. We assume that the separation between two neighboring configurations is uniform over $\tilde M$, keeping in mind a situation where the original distribution has highly degenerate vacua.
Suppose that we apply the simulated tempering algorithm to such a system. The original configuration space $M$ is then extended to $M \times \mathcal{A}$, as discussed in the previous subsection. When the number of degrees of freedom is very large, the spacing in the parameter set $\mathcal{A} = \{\beta_a\}$ must be sufficiently small that two adjacent configurations $(x, \beta_a)$ and $(x, \beta_{a+1})$ have an acceptance rate of $O(1)$. We may then (and will) regard $\mathcal{A}$ as a continuous set, $\mathcal{A} = \{\beta \,|\, \beta_A \leq \beta \leq \beta_0\}$. Now let us look for a metric $ds^2$ on the extended, coarse-grained configuration space $\tilde M \times \mathcal{A}$ such that the geodesic distance between two arbitrarily chosen points $X_1 = (x_1, \beta_1)$ and $X_2 = (x_2, \beta_2)$ reproduces our distance $d_n(X_1, X_2)$. Such a metric will take the following form due

Figure 2: The geodesic line for the geometry (4.3) with two ends at $X_1 = (x_1, \beta_1)$ and $X_2 = (x_2, \beta_2)$.
to the uniformity over $\tilde M$ (Footnote 13):
$$ ds^2 = f(\beta)\, d\beta^2 + g(\beta)\, dx^2 . \quad (4.3) $$
The functions $f(\beta)$ and $g(\beta)$ will be studied in detail in the subsequent paper [8]; here we make a simple analysis of their properties. Since two different configurations $(x_1, \beta)$ and $(x_2, \beta)$ in $\tilde M \times \mathcal{A}$ belong to two different modes of the distribution $e^{-S(x;\beta)}/Z(\beta)$, the transition between them becomes more difficult as $\beta$ increases. Thus, $g(\beta)$ should be an increasing function at least for large $\beta$, and we assume that it takes the asymptotic form $g(\beta) \propto \beta^q$ for large $\beta$ with some positive power $q$ (Footnote 14). On the other hand, since transitions in the $\beta$ direction encounter no obstacle, we expect $f(\beta)$ to have a simple behavior. Note that, when the measure in the $\beta$ direction is scale invariant for large $\beta$ [i.e., when $f(\beta) \propto 1/\beta^2$], the above metric has the asymptotic form
$$ ds^2 \simeq {\rm const.}\, \frac{d\beta^2}{\beta^2} + {\rm const.}\, \beta^q\, dx^2 \qquad (\beta\ {\rm large}), \quad (4.4) $$
which is nothing but the Euclidean AdS metric, as is easily seen (after a suitable rescaling of $x$) by the change of variable $z \equiv {\rm const.}\, \beta^{-q/2}$:
$$ ds^2 \simeq {\rm const.}\, \frac{dz^2 + dx^2}{z^2} . \quad (4.5) $$

Footnote 13: We have assumed that the coefficient of $d\beta\, dx$ is relatively small and can be neglected.

Footnote 14: Without a tempering method, the distance between two configurations belonging to different modes at fixed large $\beta$ can be roughly estimated by an instanton analysis to be proportional to $\beta$ [see the discussion below (3.21)]. This implies that the function $g(\beta)$ increases more slowly than $\beta$ when the simulated tempering is implemented, because configurations at larger $\beta$ benefit more from the tempering method. We thus expect that $q$ satisfies the inequality $0 < q \leq 1$. Our ongoing numerical work [8] shows that $q$ indeed lies in this range, excluding a logarithmic growth of $g(\beta)$.
We here argue that the appearance of the AdS-like geometry [eq. (4.3) with $g(\beta) \sim \beta^q$] should be universal. In fact, as discussed above, when implementing the tempering method in an MCMC simulation of a large system, the spacing in the parameter set $\mathcal{A} = \{\beta_a\}$ must be sufficiently small, which allows us to regard $\mathcal{A}$ as a continuous set. Then the stochastic processes in the $x$ and $\beta$ directions given in subsection 4.1 can be regarded as local moves in the extended configuration space $M \times \mathcal{A}$, which in turn define local moves in the extended, coarse-grained configuration space $\tilde M \times \mathcal{A}$. Thus, combining this with the arguments in subsection 2.4, we expect that the distance on $\tilde M \times \mathcal{A}$ takes a universal form, in the sense that it does not depend on the particular MCMC algorithms generating the local moves along the $x$ and $\beta$ directions. This is why we expect the emergence of the AdS-like geometry to be universal.
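One intuitive consequence can be illustrated with the closed-form geodesic distance of the hyperbolic upper half-plane (standing in for (4.5); the AdS radius $R$ and the chosen points are our assumptions): the distance between two points at equal depth grows only logarithmically with their coordinate separation, which is the geometric reason why modes far apart at $\beta = \beta_0$ become close once the tempering direction is added.

```python
import math

# Geodesic distance in the hyperbolic upper half-plane
# ds^2 = R^2 (dz^2 + dx^2)/z^2, as a stand-in for the metric (4.5).
def ads_distance(dx, z1, z2, R=1.0):
    """Closed-form geodesic distance between (x1, z1) and (x2, z2), dx = x2 - x1."""
    u = 1.0 + (dx**2 + (z1 - z2)**2) / (2.0 * z1 * z2)
    return R * math.acosh(u)

d10 = ads_distance(10.0, 1.0, 1.0)     # coordinate separation 10
d100 = ads_distance(100.0, 1.0, 1.0)   # separation 100: only ~ 2 R ln(10) farther
```

Increasing the separation by a factor of ten adds only about $2R\ln 10$ to the geodesic distance, because the geodesic dives into the bulk (small $\beta$, large $z$) where transverse motion is cheap.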

Conclusion and outlook
In this paper, we have defined the distance $d_n(x_1,x_2)$ between two configurations $x_1$ and $x_2$ for a given MCMC algorithm. The distance is written solely in terms of the kernel of the transfer matrix $\hat T$, and various properties of the distance (including its universality for local MCMC algorithms) are easily understood as reflections of the properties of the transfer matrix.
We made a detailed study of the distance for multimodal distributions, and showed that the distance between configurations in two different modes is dramatically reduced by the introduction of the simulated tempering method. We also considered the effective distance in the extended configuration space, regarding all configurations in the same mode as a single configuration, and argued that the distance between two configurations may then be regarded as the geodesic distance with respect to an AdS-like metric on the extended configuration space. It will be demonstrated in the subsequent paper [8] that this is actually the case.
The introduction of a distance between configurations opens a way to investigate relaxation processes of an MCMC algorithm in terms of the geometry of the configuration space itself. Among the possible applications of the present formalism, one interesting direction is to determine the parameter set $\{\beta_a\}$ in the simulated tempering method by requiring that the geodesic distance in the bulk, with its ends fixed on the boundary at $\beta = \beta_0$, be minimized. Furthermore, since the coefficient function $f(\beta)$ in (4.3) depends on $\{\beta_a\}$, it should be interesting to investigate whether the optimized parameter set gives an AdS geometry. This point will be argued positively in [8].
As discussed in subsection 2.4, we expect that our distance takes a universal form for local MCMC algorithms. It should be interesting to study to what extent this universality holds. In particular, it is important to investigate whether the HMC algorithm yields a distance similar to that of local algorithms.
In this paper, we have discussed only the case where the action $S(x)$ is real. When the action takes complex values, as in QCD at finite density (see [7] for a review), we cannot directly use the present definition of distance, because there is no real-valued transfer matrix for such a complex Boltzmann weight. The reweighting method does give a real-valued transfer matrix, but it is generically inapplicable because of the sign problem. There are various approaches to the sign problem. One approach currently under intensive study is the complex Langevin method [9] (see also [10,11,12,13,14,15]), and another is the Lefschetz thimble method [16] (see also [17,18,19,20,21,22,23,24]). It would be important to investigate whether sensible distances can be introduced for these algorithms. As for the complex Langevin method in particular, it would be very interesting if the condition [14] for being free from the wrong convergence problem [12,13,14,15] could be rephrased in terms of the distance. We expect that the formalism of [26] developed for complex Hamiltonians will be useful here. As for the Lefschetz thimble method, it should be interesting to investigate the geometry of the tempering algorithm that was recently introduced for integration over Lefschetz thimbles, where the tempering parameter is taken to be the flow time of the antiholomorphic gradient flow [22,23].
A study along these lines is now in progress and will be reported elsewhere.


Appendix

We numerically diagonalize the transfer matrices of the Langevin and Metropolis algorithms for the action $S(x) = (\beta/2)(x^2 - 1)^2$ with $\beta = 20$ by latticizing the configuration space. Restricting the configuration space to the interval $[-2, 2]$, we set the spatial cutoff $a$, the time increment $\epsilon$ of the Langevin algorithm, and the variance $\sigma^2$ of the proposal distribution in the Metropolis algorithm to $a = 0.005$, $\epsilon = a^2$ and $\sigma^2 = 2a^2$, respectively. We find that the eigenvalues obtained for the two algorithms can be made to coincide by a rescaling of the unit of step number. The wave functions also take the same forms, as depicted in Fig. 3.
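The comparison can be sketched in code (a smaller variant with our own parameters: $\beta = 5$, lattice spacing $a = 0.005$, and nearest-neighbour Metropolis proposals instead of the Gaussian proposal; eigenvalue ratios are compared because a rescaling of the unit of $n$ drops out of ratios):

```python
import numpy as np

# Hedged re-run of the Appendix comparison in a smaller setting: the low-lying
# spectra of the Langevin (Fokker-Planck) and Metropolis transfer matrices for
# S(x) = (beta/2)(x^2 - 1)^2 should agree after rescaling the unit of n.
beta, a = 5.0, 0.005
x = np.arange(-2.0, 2.0 + a / 2, a)
N = len(x)
S = 0.5 * beta * (x**2 - 1.0)**2

# Langevin side: discretized Fokker-Planck Hamiltonian H = -d^2/dx^2 + V
Sp = 2.0 * beta * x * (x**2 - 1.0)
Spp = 2.0 * beta * (3.0 * x**2 - 1.0)
V = 0.25 * Sp**2 - 0.5 * Spp
H = (np.diag(np.full(N, 2.0)) - np.diag(np.ones(N - 1), 1)
     - np.diag(np.ones(N - 1), -1)) / a**2 + np.diag(V)
E_lang = np.linalg.eigvalsh(H)                 # T_Langevin = e^{-eps H}

# Metropolis side: transfer matrix of a nearest-neighbour Metropolis chain
P = np.zeros((N, N))
for j in range(N):
    for i in (j - 1, j + 1):
        if 0 <= i < N:
            P[i, j] = 0.5 * min(1.0, np.exp(S[j] - S[i]))
    P[j, j] = 1.0 - P[:, j].sum()
T = np.exp(S / 2)[:, None] * P * np.exp(-S / 2)[None, :]
lam = np.linalg.eigvalsh(T)[::-1]              # descending order
E_metro = -np.log(lam[:4])                     # effective energies per step

# Universality check: eigenvalue *ratios* are unit-independent
r_lang = E_lang[2] / E_lang[1]
r_metro = E_metro[2] / E_metro[1]
```

Agreement of the ratios (here within a generous tolerance) is the numerical content of the statement that the two local algorithms differ only by a rescaling of $n$.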