Abstract
The recent introduction of machine learning techniques, especially normalizing flows, for the sampling of lattice gauge theories has raised hopes of improving the sampling efficiency of the traditional hybrid Monte Carlo (HMC) algorithm. In this work we study a modified HMC algorithm that draws on the seminal work on trivializing flows by Lüscher. Autocorrelations are reduced by sampling from a simpler action that is related to the original action by an invertible mapping, realised through normalizing-flow models with a minimal set of training parameters. We test the algorithm on a \(\phi ^{4}\) theory in 2D, where we observe reduced autocorrelation times compared with HMC, and demonstrate that the training can be done at small unphysical volumes and used in physical conditions. We also study the scaling of the algorithm towards the continuum limit under various assumptions on the network architecture.
1 Introduction
Lattice field theory admits a numerical approach to the study of non-perturbative properties of many field theories by using Markov Chain Monte Carlo (MCMC) techniques to generate representative samples of field configurations and computing expectation values. However, standard MCMC algorithms suffer from a phenomenon known as critical slowing down, whereby the autocorrelation time of the simulation increases dramatically as the continuum limit is approached. In many theories of interest, including Quantum Chromodynamics (QCD), this problem is exacerbated by the effect of topology freezing [1,2,3,4,5,6,7,8,9,10]. Autocorrelation times of topological observables have been shown to scale exponentially with the inverse lattice spacing, \(a^{-1}\), for the CP\(^{N-1}\) model [3], and at least polynomially with \(a^{-6}\) for lattice QCD [9].
Trivializing maps are invertible field transformations that map a complicated theory to a trivial one, i.e. to a limit in which the field variables decouple and sampling is extremely efficient. Lüscher [11] originally proposed an augmentation of the Hybrid Monte Carlo (HMC) algorithm in which an approximate trivializing map is used to reduce autocorrelation times. However, when tested against CP\(^{N-1}\) models, it was reported that the quality of this approximation, which involved computing the first few terms of a power series, was not sufficient to improve the scaling of the computational cost towards the continuum limit with respect to standard HMC [4].
The recent introduction of machine learning techniques for the sampling of lattice field theories has opened a new avenue to address critical slowing down in lattice field theories [12,13,14,15,16,17,18]. A class of machine learning models known as normalizing flows are also invertible transformations that are parametrised by neural networks (NNs) and can hence be ‘trained’ to approximate a desired mapping [19,20,21]. Albergo, Kanwar and Shanahan [12] first demonstrated that direct sampling from a well-trained Normalizing Flow, combined with some form of reweighting such as a Metropolis test, produces unbiased samples of field configurations while completely avoiding critical slowing down. However, experiments with simple architectures have suggested that the overhead cost of training models to achieve a fixed autocorrelation time scales extremely poorly towards the continuum limit [22].
In this work we investigate an algorithm inspired by the original idea of Lüscher, but where a normalizing flow is used to approximate the trivializing map. Given the high training costs associated with the direct sampling strategy, we pose the question: is it possible to improve the scaling of autocorrelation times in HMC using minimal models that are cheap to train? To answer this we benchmark our method against standard HMC on a two-dimensional \(\phi ^{4}\) model.
The paper is organised as follows: in Sect. 2 we briefly review trivializing flows before describing the algorithm that is the focus of this work; in Sect. 3 we describe the experimental setup, which includes details about the \(\phi ^4\) theory, the normalizing flow architectures, and the HMC component of the algorithm; in Sect. 4 we provide the results of our experiments and compare the computational cost scaling against standard HMC. This work is based on results previously reported in Reference [23].
2 Learning trivializing flows
2.1 Trivializing flows
Consider the expectation value of an observable in the path integral formalism of a quantum field theory in Euclidean spacetime,
where \({\mathcal {O}}(\phi )\) is an observable defined for the field configuration \(\phi \), \(S(\phi )\) is the action of the theory, \({\mathcal {Z}}\) is its partition function,
and \({\mathcal {D}}\phi \) is the integration measure,
The probability of a field configuration \(\phi \) is given by the Boltzmann factor
We will refer to p as the target distribution. A change of variables \({\tilde{\phi }}={\mathcal {F}}^{-1}(\phi )\) in Eq. (1) yields
where \(J[ {\mathcal {F}}({\tilde{\phi }}) ]\) is the Jacobian coming from the change in the integral measure, \({\mathcal {D}}\phi ={\mathcal {D}}{\tilde{\phi }}\,\det J[ {\mathcal {F}} ]\). If \({\mathcal {F}}\) is chosen such that the effective action for the transformed field,
describes a non-interacting theory, then \({\mathcal {F}}\) is known as a trivializing map.
In Reference [11] trivializing maps for gauge theories were constructed as flows
with boundary condition
The trivializing map is defined as
Though not known in closed form, the kernel \(T\) of the trivializing flow can be expressed as a power series in the flow time \(t\). In practice, this power series was truncated at leading order and the flow integrated numerically, resulting in an approximate trivializing map where the effective action in Eq. (6) is still interacting in general. Nevertheless, \({\tilde{S}}\) ought to be easier to sample than S.
The algorithm introduced in Reference [11] is essentially the HMC algorithm applied to the flowed field variables \({\tilde{\phi }}\). This algorithm was tested for the CP\(^{N-1}\) model, which suffers from topology freezing [3], by Engel and Schaefer [4]. The conclusion of the study was that, although there was a small improvement in the proportionality factor, the overall scaling of the computational cost towards the continuum did not change with respect to standard HMC.
2.2 Flow HMC (FHMC)
Normalizing Flows are a machine learning sampling technique first applied to lattice field theories in Reference [12]. The idea is similar to that of the trivializing map. Starting from an initial set of configurations \(\{z_{i}\}_{i=1}^{N}\) drawn from a probability distribution where sampling is easy,Footnote 1
a transformation \(\phi = f^{-1}(z)\) is applied so that the transformed configurations \(\{\phi _{i}\}_{i=1}^{N} \equiv \{f^{-1}(z_{i})\}_{i=1}^{N}\) follow the new probability distribution
The probability density \(p_f\) is called the model distribution. The transformation \(f\) is implemented via NNs with a set of trainable parameters \(\{\theta _{i}\}\) which have been optimised so that \(p_{f}(\phi )\) is as close as possible to the target distribution \(p(\phi ) = e^{-S(\phi )} / {\mathcal {Z}}\), i.e. the distribution of the theory we are interested in. The determinant of this transformation can be easily computed if the network architecture consists of coupling layers [24,25,26] with a checkerboard pattern, as explained in Reference [12]. Normalizing flows are therefore NN parametrisations of trivializing maps.
Ideally, the NNs would be trained such that the Kullback–Leibler (KL) divergence [27] between the model and the target distribution,
is minimised. The KL divergence is a statistical estimator satisfying \(D_{\text {KL}}(p_{f} || p) \ge 0\) and
However, the partition function \({\mathcal {Z}}\) appearing in \(p(\phi )\) is generally not known, so in practice one minimises a shifted KL divergence,
This loss function can be estimated stochastically by drawing samples \(\phi _{i} \sim p_{f}(\phi )\) from the model; no pre-existing set of training data is required. Since f is differentiable, the loss can be minimised using standard gradient-based optimisation algorithms such as stochastic gradient descent and ADAM [28]; we used the latter in this study. The absolute minimum of the loss function is \(L(\theta ) = -\log {\mathcal {Z}}\), attained when \(p_{f} = p\). In practice this minimum is unlikely to be reached, owing both to the limited expressivity of the model and to the finite amount of training, but one expects to end the training with an approximate trivializing map, i.e. \(p_{f} \approx p\).
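To make the training loop concrete, the following is a minimal illustration of ours (not the paper's code; the paper uses PyTorch CNNs and ADAM): a one-parameter "flow" \(\phi = s\,z\) with a unit-normal prior is trained by stochastic gradient descent on the shifted KL loss against a free Gaussian target action \(S(\phi ) = \phi ^2 / (2\sigma ^2)\).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0      # width of the free Gaussian target, S(phi) = phi^2 / (2 sigma^2)
N_sites = 100    # "lattice" degrees of freedom per configuration

def shifted_kl_loss(s, z):
    """Stochastic estimate of the shifted KL divergence for the toy flow
    phi = s * z with a unit-normal prior r(z).
    log p_f(phi) = log r(z) - log|det J|, with log|det J| = N log s;
    the unknown -log Z shift is simply dropped."""
    phi = s * z
    log_pf = -0.5 * np.sum(z**2, axis=1) - N_sites * np.log(abs(s))
    S = np.sum(phi**2, axis=1) / (2 * sigma**2)
    return np.mean(log_pf + S)

def grad_s(s, z):
    """d/ds of the per-sample loss, averaged over the batch."""
    return np.mean(-N_sites / s + s * np.sum(z**2, axis=1) / sigma**2)

s = 1.0
for _ in range(2000):                        # plain SGD stands in for ADAM
    z = rng.standard_normal((64, N_sites))   # fresh prior samples each step:
    s -= 1e-3 * grad_s(s, z)                 # no training data set is needed

final_loss = shifted_kl_loss(s, rng.standard_normal((4096, N_sites)))
```

For this toy the minimum is known analytically: the expected loss per site is \(-1/2 - \log s + s^{2}/(2\sigma ^{2})\), minimised at \(s = \sigma \), where the loss equals \(-\log {\mathcal {Z}}\) up to the Gaussian constant dropped above.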
The normalizing flow model generates configurations distributed according to \(p_{f}\), not p. To achieve unbiased sampling from p, the work in Reference [12] embeds the model in a Metropolis–Hastings (MH) algorithm [29, 30], with \(p_{f}\) serving as the proposal distribution. Since the proposals are statistically independent, the only source of autocorrelation is rejected proposals. However, training models to achieve a fixed, low MH rejection rate can become prohibitively expensive for large systems and long correlation lengths [22].
In contrast, in this work we propose to use the trained model as an approximation to the trivializing map in the implementation of the trivializing flow algorithm described in Sect. 2.1. Thus we identify
so that the expectation value in Eq. (1) becomes,
where we have defined the new action \({\tilde{S}}\) in the transformed coordinates to be
If the new probability distribution \(e^{-{\tilde{S}}({\tilde{\phi }})}\) is easier to sample than \(e^{-S(\phi )}\), then performing HMC in the new variables \({\tilde{\phi }}\) results in a Markov chain \(\{ {\tilde{\phi }}_{i}\}_{i=1}^{N}\) with lower autocorrelation times for the observable \({\mathcal {O}}\).
We will refer to this algorithm as Flow HMC (FHMC), and its workflow is as follows:

1. Train the network \(f\) by minimising the KL divergence in Eq. (13).

2. Run the HMC algorithm to build a Markov chain of configurations using the action \({\tilde{S}}\),
$$\begin{aligned} \{ {\tilde{\phi }}_{1},\; {\tilde{\phi }}_{2},\; {\tilde{\phi }}_{3},\; \ldots ,\; {\tilde{\phi }}_{N} \} \sim e^{-{\tilde{S}}({\tilde{\phi }})}. \end{aligned}$$

3. Apply the inverse transformation \(f^{-1}\) to every configuration in the Markov chain to undo the variable transformation. This way we obtain a Markov chain of configurations following the target probability distribution \(p(\phi ) \propto e^{-S(\phi )}\),
$$\begin{aligned} \left\{ f^{-1}({\tilde{\phi }}_{1}),\; f^{-1}({\tilde{\phi }}_{2}),\; f^{-1}({\tilde{\phi }}_{3}),\; \ldots ,\; f^{-1}({\tilde{\phi }}_{N}) \right\} = \{ \phi _{1},\; \phi _{2},\; \phi _{3},\; \ldots ,\; \phi _{N} \} \sim e^{-S(\phi )}. \end{aligned}$$
Note that the HMC acceptance in step 2 can be made arbitrarily high by increasing the number of integration steps in the molecular dynamics evolution of HMC. Contrary to what happens in the algorithm suggested in Reference [12], this acceptance does not measure how well \(f^{-1}\) approximates a trivializing map; the relevant question is whether this algorithm improves the autocorrelation of HMC.
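The three FHMC steps can be sketched end-to-end on a toy example. The following is a minimal numpy illustration of ours (not the paper's implementation): the target is a free Gaussian action, and the "trained" map is taken to be the exact trivializing rescaling, so that \({\tilde{S}}\) is trivial and the gradient of \({\tilde{S}}\) is known analytically (in general it would be obtained by automatic differentiation).

```python
import numpy as np

rng = np.random.default_rng(1)
L, sigma = 8, 2.0                       # illustrative lattice size and target width

def S(phi):                             # free target action (illustration only)
    return np.sum(phi**2) / (2 * sigma**2)

def f(phi):      return phi / sigma     # "trained" trivializing map, phi_tilde = f(phi)
def f_inv(ptil): return ptil * sigma    # its inverse

def S_tilde(ptil):                      # S(f^{-1}(ptil)) - log|det J[f^{-1}]|
    return S(f_inv(ptil)) - ptil.size * np.log(sigma)

def grad_S_tilde(ptil):                 # analytic for this toy; in general one
    return ptil                         # would use automatic differentiation

def hmc_step(ptil, eps=0.2, nsteps=10):
    """One HMC update in the flowed variables (step 2 of the workflow)."""
    pi = rng.standard_normal(ptil.shape)
    H0 = 0.5 * np.sum(pi**2) + S_tilde(ptil)
    q, p = ptil.copy(), pi.copy()       # leapfrog integration
    p -= 0.5 * eps * grad_S_tilde(q)
    for _ in range(nsteps - 1):
        q += eps * p
        p -= eps * grad_S_tilde(q)
    q += eps * p
    p -= 0.5 * eps * grad_S_tilde(q)
    H1 = 0.5 * np.sum(p**2) + S_tilde(q)
    return (q, True) if rng.random() < np.exp(H0 - H1) else (ptil, False)

# step 2: Markov chain in the flowed variables; step 3: map every
# configuration back with f^{-1} so that phi ~ e^{-S(phi)}
ptil, chain = f(rng.standard_normal((L, L))), []
for i in range(2200):
    ptil, _ = hmc_step(ptil)
    if i >= 200:                        # discard a short burn-in
        chain.append(f_inv(ptil))
var = np.var(np.stack(chain))           # should approach sigma^2 = 4
```

Because the map is exact here, the flowed action decouples completely and the chain in \({\tilde{\phi }}\) is essentially autocorrelation-free; the per-site variance of the mapped samples reproduces \(\sigma ^2\).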
The motivation behind this work is that a Normalizing Flow parametrised by NNs ought to be better able to approximate a trivializing map than the leading-order approximation of the flow equation introduced and tested in References [4, 11]. Similar ideas have been explored in References [31,32,33]. In particular, in Reference [32] a Normalizing Flow is optimised to approximate the target distribution from an input distribution corresponding to the action at a coarser lattice spacing, while training is done by minimising the force difference between two theories instead of the KL divergence. In contrast to these previous works we focus on minimal models and cheap training setups in an attempt to avoid the exploding training costs reported in Reference [22].
3 Model and setup
For our study we focus on a \(\phi ^{4}\) scalar field theory in \(D=2\) dimensions with bare mass \(m_{0}\) and bare coupling \(g_{0}\). Its standard continuum action is
On the lattice we will work with the \(\beta \)–\(\lambda \) parametrisation,
where \(e_{\mu }\) represents a unit vector in the \(\mu \)th direction and the sum runs over all lattice points \(x\equiv (x_{1},x_{2})\). The relationship between these two actions is explained in Appendix A.
3.1 Observables
We will focus only on a handful of observables, the simplest one being the magnetization
with V the lattice volume. The building block for the rest of the observables is the connected two-point correlation function
where we have used translational invariance to define \(\langle \phi \rangle = \langle \phi _x \rangle \). The correlation length, \(\xi \), corresponding to the inverse mass of the lightest mode in the spectrum, can be extracted from the spatially-averaged two-point function,
at sufficiently large Euclidean time separations \(y_{2}\).
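On a periodic lattice the spatially-averaged correlator behaves as a cosh around \(y_{2} = L/2\), so \(\xi \) can be read off from the ratio of neighbouring time slices. A small sketch of ours (the function name and the synthetic data are our own illustration):

```python
import numpy as np

def effective_xi(G, y):
    """Effective correlation length from a periodic correlator, using the
    ratio (G(y+1) + G(y-1)) / (2 G(y)) = cosh(1/xi), which is exact for
    pure cosh data and plateaus at large separations in practice."""
    r = (G[y + 1] + G[y - 1]) / (2.0 * G[y])
    return 1.0 / np.arccosh(r)

# synthetic check: a pure cosh correlator with xi = 4.5 on an L = 32 lattice
L, xi0 = 32, 4.5
y = np.arange(L)
G = np.cosh((y - L / 2) / xi0)
xi_est = effective_xi(G, 10)    # plateau value; equals xi0 for exact data
```

The identity \(\cosh (a(y+1)) + \cosh (a(y-1)) = 2\cosh (a y)\cosh (a)\) makes the estimator exact on noiseless cosh data; on Monte Carlo data one looks for a plateau in \(y_{2}\).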
We can also measure the one-point susceptibility,Footnote 2
Since both the magnetization and the one-point susceptibility are local observables, we additionally studied the one-point susceptibility measured in smeared field configurations,
The smeared field configurations \(\phi _{t}\) are obtained by solving the heat equation
up to flow time t, which we choose so that the smearing radius of the flow is equal to the correlation length of the system, i.e. \(\sqrt{4t} = \xi \). For more details, see Appendix B.
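The heat-equation smearing can be carried out exactly in Fourier space, where each momentum mode is damped by \(e^{-{\hat{p}}^{2} t}\) with the lattice momentum \({\hat{p}}^{2} = \sum _{\mu } 4\sin ^{2}(\pi n_{\mu } / L)\). A numpy sketch of ours (how the paper integrates the flow in practice is not specified here):

```python
import numpy as np

def smear(phi, t):
    """Solve the lattice heat equation d phi / dt = Laplacian(phi) exactly
    in Fourier space: each mode is damped by exp(-p_hat^2 t), with
    p_hat^2 = sum_mu 4 sin^2(pi n_mu / L) on an L x L periodic lattice.
    Choosing t = xi**2 / 4 matches the smearing radius sqrt(4 t) to xi."""
    L = phi.shape[0]
    n = np.arange(L)
    phat2 = 4 * np.sin(np.pi * n / L) ** 2
    p2 = phat2[:, None] + phat2[None, :]
    return np.real(np.fft.ifft2(np.fft.fft2(phi) * np.exp(-p2 * t)))
```

A single plane wave is an eigenmode of the flow, which gives a simple exactness check: \(\cos (2\pi x_{1}/L)\) is damped by exactly \(e^{-4\sin ^{2}(\pi /L)\, t}\).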
As usual, the estimator of the expectation values of these observables is the statistical average over the generated Markov chain of configurations,
where \({\mathcal {O}}\) is the observable studied. The error associated with this estimation is given by the statistical variance,
where \(\tau _{\text {int},{\mathcal {O}}}\) is the so-called integrated autocorrelation time. It is defined as
with
being the autocorrelation function of the observable for field configurations separated by \(m\) steps in the Markov chain. We estimate \(\tau _{\text {int}}\) using the automatic windowing procedure of the \(\Gamma \) method [10, 34]. In particular, we perform the autocorrelation analysis with ADerrors.jl [35], which combines the \(\Gamma \) method with automatic differentiation techniques [36, 37].
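A stripped-down version of the windowed estimator can be written in a few lines. The sketch below is our simplification of the \(\Gamma \)-method windowing (the stopping rule is cruder than that of Ref. [34], and ADerrors.jl implements the full procedure); it is checked on an AR(1) chain, for which \(\tau _{\text {int}} = (1+a)/(2(1-a))\) is known exactly.

```python
import numpy as np

def tau_int(x, S=3.0):
    """Integrated autocorrelation time with a simplified automatic window:
    accumulate normalised autocorrelations rho(t) and stop at the first
    window W >= S * tau(W)."""
    x = np.asarray(x, float) - np.mean(x)
    N = len(x)
    gamma0 = np.dot(x, x) / N
    tau = 0.5
    for W in range(1, N // 2):
        rho = np.dot(x[:N - W], x[W:]) / (N - W) / gamma0
        tau += rho
        if W >= S * tau:
            break
    return tau

# check on an AR(1) chain, where tau_int = (1 + a) / (2 (1 - a)) exactly
rng = np.random.default_rng(2)
a, N = 0.8, 200_000
eta = rng.standard_normal(N)
x = np.empty(N)
x[0] = eta[0]
for i in range(1, N):
    x[i] = a * x[i - 1] + np.sqrt(1 - a**2) * eta[i]
tau_est = tau_int(x)            # exact value: 4.5
```

The finite window introduces a small truncation bias (here of order \(a^{W}/(1-a)\)), which is the price paid for keeping the variance of the estimator under control.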
3.2 Network architecture and training
As mentioned in Sect. 1, we aimed to keep the training costs negligible with respect to the cost of producing configurations with FHMC. The main choices we made towards this goal are:

1. Use convolutional neural networks (CNNs) instead of fully connected networks. The action of Eq. (18) has translational symmetry, so the network f should apply the same transformation to \(\phi _{x}\) for all x. CNNs respect this translational symmetry, and also require fewer parameters than fully connected networks.

2. Tune the number of layers and kernel sizes of the CNNs so that the footprint of the network f is not much bigger than the correlation length \(\xi \). Two-point correlation functions generally decay as \(\sim e^{-\left| x - y \right| / \xi }\), so the transformation of \(\phi _{x}\) should not depend on \(\phi _{y}\) if \(\left| x - y \right| \gg \xi \). Limiting the number of layers also reduces the number of trainable parameters.

3. Enforce f to satisfy \(f(-\phi ) = -f(\phi )\). The action in Eq. (18) is invariant under \(\phi \rightarrow -\phi \), so making f equivariant under this symmetry should reduce training costs.
Following [12], we partition the lattice using a checkerboard pattern, so that each field configuration can be split as \(\phi =\{\phi ^A,\phi ^B\}\), where \(\phi ^A\) and \(\phi ^B\) collectively denote the field variables belonging to one or the other partition. We then construct the transformation f as a composition of n layers,
where each layer \(g^{(i)}\) applies an affine transformation to one set of the field variables, \(\{\phi ^A,\phi ^B\}\), organised in a checkerboard pattern, such as
where \(x=(x_{1},x_{2})\). In the affine transformation
the partition \(\phi ^A\) remains unchanged and only the field variables \(\phi ^B\) are updated. \(s(\phi )\) and \(t(\phi )\) are CNNs with kernel size k. To make this transformation equivariant under \(\phi \rightarrow -\phi \) we enforce \(f(-\phi )=-f(\phi )\) by using a tanh activation function and no bias for the CNNs (see Sec. III.F of [22]). The checkerboard pattern ensures that the Jacobian matrix has a triangular form so its determinant can be easily computed, and reads
where the product runs over the lattice points where the partition \(\phi _{x}^{B} = \phi _{x}\). An example of the action of a CNN with only 1 layer and a tanh activation function over a lattice field would be
where \(w^{(i)}(y) \equiv w^{(i)}_{y + (k+1) / 2}\) is the weight matrix of size \(k \times k\) of the CNN s of kernel size k. Choosing the same functional form for \(t_{x}^{(i)}(\phi ^{A})\), it is easy to check that the transformation in Eq. (30) is equivariant under \(\{\phi ^{A}, \phi ^{B}\} \rightarrow \{- \phi ^{A}, -\phi ^{B}\}\).
Two different affine layers with alternate checkerboard patterns are necessary to transform the whole set of lattice points, and we will denote such a pair of layers as a coupling layer.Footnote 3
In this work we studied network architectures with \(N_{l} = 1\) affine coupling layers, while the CNNs, s and t, have kernel size k and no hidden layers. The output configuration is rescaled with an additional trainable parameter. Finally, independent normal distributions are used as the prior distributions r in Eq. (10).
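To make the construction concrete, here is a small numpy sketch of ours of one coupling layer, i.e. a pair of affine layers on alternating checkerboard partitions. The equivariance recipe is partly our assumption: t is odd (a single tanh layer with no bias, as in the text), while we make s even by letting it act on \(\phi _{x}^{2}\); the paper's exact construction for s may differ. The log-determinant follows from the triangular Jacobian structure described above.

```python
import numpy as np

def conv_periodic(phi, w):
    """k x k convolution with periodic boundaries, built from np.roll."""
    k = w.shape[0]
    r = k // 2
    out = np.zeros_like(phi)
    for dy in range(k):
        for dx in range(k):
            out += w[dy, dx] * np.roll(np.roll(phi, r - dy, axis=0), r - dx, axis=1)
    return out

class CouplingLayer:
    """A pair of affine layers on alternating checkerboard partitions.
    t is odd (tanh, no bias); s is made even by acting on phi**2 -- one way
    to enforce f(-phi) = -f(phi), not necessarily the paper's choice."""
    def __init__(self, k, rng):
        self.w = [rng.normal(scale=0.1, size=(k, k)) for _ in range(4)]

    @staticmethod
    def mask(shape, parity):
        yy, xx = np.indices(shape)
        return ((yy + xx) % 2 == parity).astype(float)

    def forward(self, phi):
        logdet = 0.0
        for parity, ws, wt in ((0, self.w[0], self.w[1]), (1, self.w[2], self.w[3])):
            mA = self.mask(phi.shape, parity)                  # frozen partition
            mB = 1.0 - mA                                      # updated partition
            s = np.tanh(conv_periodic((phi * mA) ** 2, ws))    # even in phi
            t = np.tanh(conv_periodic(phi * mA, wt))           # odd in phi
            phi = mA * phi + mB * (phi * np.exp(s) + t)
            logdet += np.sum(mB * s)        # triangular Jacobian: sum of s on B
        return phi, logdet

    def inverse(self, phi):
        for parity, ws, wt in ((1, self.w[2], self.w[3]), (0, self.w[0], self.w[1])):
            mA = self.mask(phi.shape, parity)
            mB = 1.0 - mA
            s = np.tanh(conv_periodic((phi * mA) ** 2, ws))    # A sites untouched,
            t = np.tanh(conv_periodic(phi * mA, wt))           # so s, t recompute
            phi = mA * phi + mB * (phi - t) * np.exp(-s)
        return phi
```

Since the frozen partition is unchanged by each affine step, s and t can be recomputed in the inverse pass, which is what makes the map cheaply invertible.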
3.3 FHMC implementation
The main focus of this work is the scaling of the autocorrelation times of the magnetization, \(\tau _{M}\), and one-point susceptibilities, \(\tau _{\chi _{0}}\) and \(\tau _{\chi _{0,t}}\). Using local update algorithms such as HMC, these autocorrelation times are expected to scale as [39]
We will benchmark the scaling of the autocorrelation times in the FHMC algorithm against those in standard HMC.
For a scalar field theory, the HMC equations of motion read
where the force for the momenta \(\pi \) follows from the derivative of the action in Eq. (18),
In our simulations we used a leapfrog integration scheme with a single time scale, and the step size of the integration was tuned to obtain acceptances of approximately 90%. A pseudocode of an HMC implementation is depicted in Algorithm 1: the HMC function receives as input a configuration, \(\phi \), and the action of the target theory, S; after generating random momenta, the leapfrog function performs the molecular dynamics step and a configuration, \(\phi _{\text {new}}\), is chosen between the evolved field, \(\phi '\), and the old field, \(\phi \), with the usual MH accept–reject step.
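As an illustration in the spirit of Algorithm 1, the following numpy sketch of ours implements leapfrog HMC for a \(\phi ^{4}\) action in one common \(\beta \)–\(\lambda \) convention, \(S = \sum _x [ -\beta \sum _\mu \phi _x \phi _{x+\mu } + \phi _x^2 + \lambda (\phi _x^2 - 1)^2 ]\); this normalisation is an assumption of ours and Eq. (18) may differ.

```python
import numpy as np

def action(phi, beta, lam):
    """phi^4 action in an assumed beta-lambda convention (see lead-in)."""
    hop = sum(np.roll(phi, -1, axis=mu) for mu in (0, 1))
    return np.sum(-beta * phi * hop + phi**2 + lam * (phi**2 - 1) ** 2)

def force(phi, beta, lam):
    """F_x = -dS/dphi_x for the action above."""
    nn = sum(np.roll(phi, d, axis=mu) for mu in (0, 1) for d in (1, -1))
    return beta * nn - 2 * phi - 4 * lam * phi * (phi**2 - 1)

def leapfrog(phi, pi, eps, nsteps, beta, lam):
    """Reversible, area-preserving molecular dynamics integration."""
    pi = pi + 0.5 * eps * force(phi, beta, lam)
    for _ in range(nsteps - 1):
        phi = phi + eps * pi
        pi = pi + eps * force(phi, beta, lam)
    phi = phi + eps * pi
    pi = pi + 0.5 * eps * force(phi, beta, lam)
    return phi, pi

def hmc(phi, eps, nsteps, beta, lam, rng):
    """One HMC update: momentum refresh, leapfrog, MH accept-reject."""
    pi = rng.standard_normal(phi.shape)
    H0 = 0.5 * np.sum(pi**2) + action(phi, beta, lam)
    phi_new, pi_new = leapfrog(phi, pi, eps, nsteps, beta, lam)
    H1 = 0.5 * np.sum(pi_new**2) + action(phi_new, beta, lam)
    return (phi_new, True) if rng.random() < np.exp(H0 - H1) else (phi, False)
```

Two properties worth checking in any implementation are exact reversibility of the integrator (integrating back with negated momenta recovers the start) and energy violations \(|\Delta H|\) that shrink as \(\epsilon ^{2}\), which is what allows the acceptance to be tuned close to 90% by adjusting the step size.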
The proposed FHMC algorithm is in essence the HMC algorithm with the transformed action in Eq. (16), which arises from the change of variables \({\tilde{\phi }} = f(\phi )\) in Eq. (15). The new Hamilton equations of motion now include derivatives with respect to the new variables \({\tilde{\phi }}\),
The basic implementation is sketched in Algorithm 2. The main differences with respect to standard HMC in Algorithm 1 are line 2, where we transform from the variables \(\phi \) to the variables \({\tilde{\phi }}\) using the trained network f; and line 6, where we undo the change of variables to obtain the new configuration \(\phi _{\text {new}}\) from \({\tilde{\phi }}_{\text {new}}\).
Note that the molecular dynamics evolution and the accept–reject step, lines 4 and 5, are applied to the transformed field variables \({\tilde{\phi }}\) with the new action \({\tilde{S}}\). Irrespective of the transformation f, the acceptance rate can be made arbitrarily high by reducing numerical errors in the integration of the equations of motion in Eq. (36). This means that we will always be able to tune the FHMC acceptance to approximately 90% by increasing the number of integration steps, even for a poorly trained normalizing flow.
Note also that now the evaluation of the force \({\tilde{F}}_{x} \equiv - \nabla _{\phi _{x}}{\tilde{S}}[ {\tilde{\phi }} ]\) requires computing the derivative of \(\log \det J[ f^{-1}({\tilde{\phi }}) ]\). This cannot be written analytically for an arbitrary network, and we used PyTorch’s automatic differentiation methods for its evaluation [40].
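Whether the force comes from automatic differentiation, as in the paper, or from an analytic expression, it can always be validated against central finite differences of the action. The generic checker below is our own illustration (names are ours), shown here on the trivial free-action force \(F = -\phi \):

```python
import numpy as np

def check_force(S, F, phi, eps=1e-4, tol=1e-5):
    """Compare a force implementation F = -grad S against central finite
    differences of the action at a few randomly chosen sites."""
    rng = np.random.default_rng(5)
    Fval = F(phi)
    for _ in range(10):
        idx = tuple(int(rng.integers(0, n)) for n in phi.shape)
        d = np.zeros_like(phi)
        d[idx] = eps
        fd = -(S(phi + d) - S(phi - d)) / (2 * eps)   # -dS/dphi at one site
        if abs(fd - Fval[idx]) > tol:
            return False
    return True

# example: the force of the free action S = sum(phi^2) / 2 is F = -phi
ok = check_force(lambda p: 0.5 * np.sum(p**2),
                 lambda p: -p,
                 np.random.default_rng(6).standard_normal((8, 8)))
```

For FHMC the same check applies with \(S \rightarrow {\tilde{S}}\), where it also exercises the \(\log \det J\) term produced by the automatic differentiation.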
4 Results
We are interested in the scaling of the cost towards the continuum limit. Following the analysis in [22], we tuned the coupling \(\beta \) so that the correlation length satisfies
The continuum limit is therefore approached in the direction of increasing L. In Table 1 we summarise the parameters used in our simulations of FHMC and standard HMC. Results obtained with both HMC and FHMC can be found in Tables 3 and 4 of Appendix C, respectively.
4.1 Minimal network
An obvious strategy to reduce training costs is to build networks with few parameters to train. As mentioned in Sect. 3, \(N_{l} = 1\) coupling layer suffices to transform the whole lattice, while a kernel size \(k=3\) for the CNNs, which couples only nearest neighbours, is the smallest that can be used.Footnote 4 Such a network has only 37 trainable parameters in total,Footnote 5 and is the most minimal network that we will consider.
It is interesting to study whether such a simple network can learn physics of a target theory with a non-trivial correlation length. In Fig. 1 (left) we plot the evolution of the KL divergence during the training of a network with such minimal architecture, where the target theory has parameters \(\beta = 0.641\), \(\lambda = 0.5\), lattice size \(L=18\) and correlation length \(\xi = L / 4 = 4.5\). Since the network has very few parameters, the KL divergence reaches saturation after only \({\mathcal {O}}(100)\) iterations.
Once the network is trained, one can use it as a variable transformation for the FHMC algorithm as sketched in Algorithm 2. In Fig. 1 (right) we show the magnetizations of a slice of 4000 configurations from the FHMC and HMC Markov chains, yielding the autocorrelation times
All the results with this minimal network can be found in Table 4 of Appendix C, showing that FHMC leads to smaller autocorrelations compared to standard HMC, especially as the continuum limit is approached. This fact seems to indicate that a network with few parameters can indeed learn transformations with relevant physical information.
A measure of the closeness of the distributions \(p_f\) and p is given by the MH acceptanceFootnote 6 when sampling p with configurations drawn directly from \(p_f\). Focusing on the first three columns of Table 2, it is clear that the acceptances are low and decrease towards the continuum limit, in spite of the fact that the autocorrelation time of FHMC is better than that of HMC. This is because the networks used have very few parameters and hence limited expressivity: the map defined by the trained networks is not very accurate in reproducing the probability distribution of the target theory. Nevertheless FHMC, i.e. a molecular dynamics evolution using flowed variables, yields a clear gain in the autocorrelation times.
4.2 Infinite volume limit
An important advantage of using a translationally-invariant network architecture, such as the one containing CNNs, is that they can be trained at a small lattice size L and then used in a larger lattice \(L' > L\). Note that doing this in the approach of References [12, 22] would not be viable since it would lead to an exponential decrease in the MH acceptance due to the extensive character of the action.
In the last column of Table 2 we show the MH acceptance using networks with the minimal architecture of Sect. 4.1 trained at lattice size L when the target theory has lattice size 2L. One can see that the acceptances are significantly lower than those obtained with the target theory at size L. However, the acceptance of the FHMC algorithm (Algorithm 2) can be kept arbitrarily high by increasing the number of integration steps of the Hamilton equations, so reusing the networks for higher volumes does not pose any problem. The reduced MH acceptance does not translate into a change in the autocorrelation time.
In Fig. 2 we compare the autocorrelation times of the magnetization for networks trained at L and reused at 2L with those of networks trained directly at 2L. Since they agree within statistical uncertainty, the relevant physical information is evidently already learned at small volumes, reinforcing the intuition that the training does not need to be done at lattice sizes larger than \(\xi \).
4.3 Continuum limit scaling with fixed architecture
Finally, we want to determine whether the computational cost of FHMC scales better than that of standard HMC as we approach the continuum. First we consider a fixed network architecture as we scale L. We trained a different network for each of the lattice sizes in Table 1, with \(N_{l}=1\) and kernel sizes \(k=3,5,7\). The cost of the training is in all cases negligible with respect to the cost of the FHMC, and the integration step of the leapfrog scheme is tuned to achieve an acceptance rate of approximately 90% for every simulation, as for HMC.
In Fig. 3 (left) we show the autocorrelation times of the magnetization for both HMC (filled, blue circles) and FHMC with kernel sizes \(k=3,5,7\) (open circles, triangles and squares). One can see that the autocorrelations in FHMC are lower than the ones of HMC, and decrease as the kernel size of the CNNs is increased.Footnote 7
In order to study the scaling towards the continuum limit, we plot the ratio of autocorrelation times for HMC versus FHMC in Fig. 3 (right) for the three values of the kernel size. Although for the coarser lattices the ratio increases towards the continuum, it seems to saturate within the range of lattice spacings explored, indicating that the cost scaling of both algorithms is the same. The same behaviour is observed for the one-point susceptibilities of Eqs. (22) and (23).
4.4 Continuum limit scaling with \(k \sim \xi \)
As we take the continuum limit the correlation length increases in lattice units. If the footprint is chosen to scale with \(\xi \), the convolution implemented by the network covers the same physical region. Using our architecture, this can be done by adding more coupling layers or increasing the kernel size, k. More concretely, a kernel size k couples \((k-1) / 2\) nearest neighbours; therefore, since we have no hidden layers, if we have \(N_{l}\) coupling layers the network will couple \(N_{l}(k-1)\) nearest neighbours.
In Fig. 4 (left) we show again the scaling of the autocorrelation times of the magnetization, but now the networks used for the FHMC algorithm satisfy \(N_{l}(k-1) \approx \xi \). In particular, all networks of the plot have a single coupling layer, \(N_{l} = 1\), so only the kernel size varies: for \(L = 10\) to \(L = 16\) the networks have \(k = 5\); for \(L = 18\) and \(L=20\), \(k=7\); for \(L = 40\), \(k = 11\); and for \(L = 80\), \(k = 21\).
The curves of the plot correspond to fits to the function
with the result
Thus keeping the physical footprint size constant seems to yield a slight improvement in the scaling towards the continuum.Footnote 8 It is also interesting to see that the same happens with the smeared susceptibility in Fig. 4 (right). The latter is a non-local observable that has been measured in smeared configurations with a smoothing radius \(\sqrt{4t} = \xi \) (see Appendix B).
This slight improvement is in agreement with the fact that for a fixed network architecture the continuum scaling remains the same as HMC, while increasing the kernel size improves the global factor of the autocorrelations, as was seen in Fig. 3.
It is important to note that increasing the kernel size of the network increases the number of parameters in the training, as well as the number of operations needed to compute the force in the molecular dynamics evolution via automatic differentiation. In particular, the number of parameters of our networks is given by
In Fig. 5 we show the time needed to compute the force on a lattice with fixed length \(L=320\) as a function of the number of network parameters (keeping \(N_{l} = 1\) and varying k in the interval \(k \in [3,161]\)). Since the computing time seems to scale linearly with the number of parameters, if k scales with the correlation length \(\xi \) then there is an additional term proportional to \(\xi ^2\) in the FHMC cost.
According to this estimate, FHMC would not reduce the asymptotic simulation cost of a \(\phi ^4\) theory with respect to HMC. However, the implementation of the FHMC force does not require the integration of the flow equation in Eq. (7), unlike in [4]. Knowing that FHMC already reduces autocorrelation times with minimal architectures, it could probably be used to reduce simulation costs in Lattice QCD with minimal implementation effort.
5 Conclusions
We have tested a new algorithm, Flow HMC (FHMC), which implements the trivializing flow algorithm of References [4, 11] via a convolutional neural network, similar to those used in the Normalizing Flow algorithms introduced in Reference [12]. In contrast with previous works on Normalizing Flows, we use minimal network architectures, which leads to negligible training costs that are therefore unaffected by the poor scaling towards the continuum limit observed in [22]. The main new ingredient is the combination of a neural-network implementation of the trivializing flow with an HMC integration, which keeps the acceptance large for any network architecture.
We have tested the algorithm in a scalar theory in 2D and benchmarked it against standard HMC. We have observed a significant reduction of the autocorrelation times in FHMC for the observables measured: the magnetization and one-point susceptibility. This improvement is maintained as the physical volume is increased at fixed lattice spacing, meaning that the network can be trained at a small physical volume L and used at a larger one \(L'> L\) without any extra cost. For gauge theories, this opens up the possibility of doing the training at unphysical values of the quark masses or at small volumes – or both, where training is cheaper – and reusing the trained networks to sample the target theory at the physical values of the parameters.
However, for a fixed network architecture the scaling of the autocorrelation with the lattice spacing remains the same as that of HMC. A slight improvement in the scaling of the autocorrelation time is observed when the footprint of the network is kept constant in physical units. Although the training cost still remains negligible, scaling the footprint with the correlation length implies an extra cost in the computation of the force in the FHMC, leading to a worse overall scaling than HMC in the theory considered. This might be different in a theory with fermions, where the dominant cost is the inversion of the Dirac operator. Also, as discussed in Reference [43], the use of other machine learning training techniques, such as transfer learning, together with optimal architectures and stopping criteria, can help alleviate the training cost scaling.
Although the improvement observed in the autocorrelation times for fixed network architectures of FHMC might be of some practical use, particularly given the simplicity of the implementation, it remains to be demonstrated that a neural network training policy can be applied that avoids critical slowing down.
Data Availability Statement
This manuscript has no associated data or the data will not be deposited. [Authors’ comment: All necessary information to reproduce the reported results are available in the main text.]
Notes
An example of such an easy distribution is a multi-dimensional normal distribution.
See Reference [38] for an example of an actual implementation of all these concepts.
The transformation with \(k=1\) being a trivial rescaling.
A CNN with 1 layer has \(k^2\) parameters. The transformation in Eq. (30) with \(N_{l}\) affine coupling layers has \(4\times N_{l}\) different CNNs, and therefore \(4 \times N_{l} \times k^2\) parameters. Since we also add a global rescaling parameter as a final layer of our network, this architecture has \(N_{p} = 4 \times N_{l} \times k^2 + 1\) parameters.
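This parameter counting can be cross-checked with a short calculation (a minimal sketch; the function name is illustrative, not from the paper's code):

```python
def n_parameters(n_layers: int, k: int) -> int:
    """Parameter count of the flow described in the text:
    N_l affine coupling layers, each containing 4 single-layer
    CNNs with k*k weights, plus one global rescaling parameter.
    """
    return 4 * n_layers * k ** 2 + 1

# The k=3, N_l=1 architecture referred to in Appendix C
# has 4 * 1 * 9 + 1 = 37 trainable parameters.
print(n_parameters(1, 3))  # → 37
```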
Not to be confused with the HMC and FHMC acceptances, which are tuned to 90% in this work.
Note that an accurate cost comparison should include the overhead of computing the force via automatic differentiation. We have not tried to optimize this step and therefore postpone a detailed cost comparison to future work.
Note that \(x \equiv (x_{1}, x_{2})\) and \(p \equiv (p_{1}, p_{2})\). For summations we use the shorthand notation \(\sum _{x} \equiv \sum _{x_{1},x_{2}=0}^{L-1}\).
References
M. Campostrini, P. Rossi, E. Vicari, Monte Carlo simulation of \(\rm CP ^{N-1}\) models. Phys. Rev. D 46, 2647 (1992). https://doi.org/10.1103/PhysRevD.46.2647
E. Vicari, Monte Carlo simulation of lattice \({\mathbb{C}\mathbb{P} }^{N-1}\) models at large \(N\). Phys. Lett. B 309, 139 (1993). https://doi.org/10.1016/0370-2693(93)91517-Q. arXiv:hep-lat/9209025
L. Del Debbio, G.M. Manca, E. Vicari, Critical slowing down of topological modes. Phys. Lett. B 594, 315 (2004). https://doi.org/10.1016/j.physletb.2004.05.038. arXiv:hep-lat/0403001
G.P. Engel, S. Schaefer, Testing trivializing maps in the Hybrid Monte Carlo algorithm. Comput. Phys. Commun. 182, 2107 (2011). https://doi.org/10.1016/j.cpc.2011.05.004. arXiv:1102.1852 [hep-lat]
J. Flynn, A. Jüttner, A. Lawson, F. Sanfilippo, Precision study of critical slowing down in lattice simulations of the \({\mathbb{C}\mathbb{P}}^{N-1}\) model (2015). arXiv:1504.06292 [hep-lat]
C. Bonati, M. D’Elia, Topological critical slowing down: variations on a toy model. Phys. Rev. E 98, 013308 (2018). https://doi.org/10.1103/physreve.98.013308. arXiv:1709.10034 [hep-lat]
L. Del Debbio, H. Panagopoulos, P. Rossi, E. Vicari, Spectrum of confining strings in SU(N) gauge theories. JHEP 01, 009 (2002). https://doi.org/10.1088/1126-6708/2002/01/009. arXiv:hep-th/0111090
B. Alles, G. Boyd, M. D’Elia, A. Di Giacomo, E. Vicari, Hybrid Monte Carlo and topological modes of full QCD. Phys. Lett. B 389, 107 (1996). https://doi.org/10.1016/S0370-2693(96)01247-6. arXiv:hep-lat/9607049
M. Lüscher, Properties and uses of the Wilson flow in lattice QCD. JHEP 08, 071 (2010). [Erratum: JHEP 03, 092 (2014)]. https://doi.org/10.1007/JHEP08(2010)071. arXiv:1006.4518 [hep-lat]
S. Schaefer, R. Sommer, F. Virotta, (ALPHA), Critical slowing down and error analysis in lattice QCD simulations. Nucl. Phys. B 845, 93 (2011). https://doi.org/10.1016/j.nuclphysb.2010.11.020. arXiv:1009.5228 [hep-lat]
M. Lüscher, Trivializing maps, the Wilson flow and the HMC algorithm. Commun. Math. Phys. 293, 899 (2010). https://doi.org/10.1007/s00220-009-0953-7. arXiv:0907.5491 [hep-lat]
M.S. Albergo, G. Kanwar, P.E. Shanahan, Flow-based generative models for Markov chain Monte Carlo in lattice field theory. Phys. Rev. D 100, 034515 (2019). https://doi.org/10.1103/physrevd.100.034515. arXiv:1904.12072 [hep-lat]
G. Kanwar, M.S. Albergo, D. Boyda, K. Cranmer, D.C. Hackett, S. Racanière, D.J. Rezende, P.E. Shanahan, Equivariant flow-based sampling for lattice gauge theory. Phys. Rev. Lett. 125, 121601 (2020). https://doi.org/10.1103/PhysRevLett.125.121601. arXiv:2003.06413 [hep-lat]
K.A. Nicoli, C.J. Anders, L. Funcke, T. Hartung, K. Jansen, P. Kessel, S. Nakajima, P. Stornati, Estimation of thermodynamic observables in lattice field theories with deep generative models. Phys. Rev. Lett. 126, 032001 (2021). https://doi.org/10.1103/PhysRevLett.126.032001. arXiv:2007.07115 [hep-lat]
D. Boyda, G. Kanwar, S. Racanière, D.J. Rezende, M.S. Albergo, K. Cranmer, D.C. Hackett, P.E. Shanahan, Sampling using \(SU(N)\) gauge equivariant flows. Phys. Rev. D 103, 074504 (2021). https://doi.org/10.1103/PhysRevD.103.074504. arXiv:2008.05456 [hep-lat]
M.S. Albergo, G. Kanwar, S. Racanière, D.J. Rezende, J.M. Urban, D. Boyda, K. Cranmer, D.C. Hackett, P.E. Shanahan, Flow-based sampling for fermionic lattice field theories. Phys. Rev. D 104, 114507 (2021). https://doi.org/10.1103/PhysRevD.104.114507. arXiv:2106.05934 [hep-lat]
M.S. Albergo, D. Boyda, K. Cranmer, D.C. Hackett, G. Kanwar, S. Racanière, D.J. Rezende, F. Romero-López, P.E. Shanahan, J.M. Urban, Flow-based sampling in the lattice Schwinger model at criticality. Phys. Rev. D 106, 014514 (2022). https://doi.org/10.1103/PhysRevD.106.014514. arXiv:2202.11712 [hep-lat]
R. Abbott et al., Gauge-equivariant flow models for sampling in lattice field theories with pseudofermions. Phys. Rev. D 106, 074506 (2022). https://doi.org/10.1103/PhysRevD.106.074506. arXiv:2207.08945 [hep-lat]
E.G. Tabak, E. Vanden-Eijnden, Density estimation by dual ascent of the log-likelihood. Commun. Math. Sci. (2010). https://doi.org/10.4310/cms.2010.v8.n1.a11
E.G. Tabak, C.V. Turner, A family of nonparametric density estimation algorithms. Commun. Pure Appl. Math. 66, 145 (2012). https://doi.org/10.1002/cpa.21423
D.J. Rezende, S. Mohamed, Variational inference with normalizing flows (2015). https://doi.org/10.48550/ARXIV.1505.05770. arXiv:1505.05770 [stat.ML]
L. Del Debbio, J. Marsh Rossney, M. Wilson, Efficient modeling of trivializing maps for lattice \(\phi ^{4}\) theory using normalizing flows: a first look at scalability. Phys. Rev. D 104, 094507 (2021). https://doi.org/10.1103/PhysRevD.104.094507. arXiv:2105.12481 [hep-lat]
D. Albandea, L. Del Debbio, P. Hernández, R. Kenway, J. Marsh Rossney, A. Ramos, Learning trivializing flows, in 39th International Symposium on Lattice Field Theory (2022). arXiv:2211.12806 [hep-lat]
L. Dinh, D. Krueger, Y. Bengio, NICE: non-linear independent components estimation (2014). arXiv:1410.8516 [cs.LG]
L. Dinh, J. Sohl-Dickstein, S. Bengio, Density estimation using Real NVP (2016). arXiv:1605.08803 [cs.LG]
D.P. Kingma, P. Dhariwal, Glow: Generative flow with invertible 1x1 convolutions (2018). arXiv:1807.03039 [stat.ML]
S. Kullback, R.A. Leibler, On information and sufficiency. Ann. Math. Stat. 22, 79 (1951). https://doi.org/10.1214/aoms/1177729694
D.P. Kingma, J. Ba, Adam: a method for stochastic optimization (2014). arXiv:1412.6980 [cs.LG]
N. Metropolis, A.W. Rosenbluth, M.N. Rosenbluth, A.H. Teller, E. Teller, Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087 (1953). https://doi.org/10.1063/1.1699114
W.K. Hastings, Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97 (1970). https://doi.org/10.1093/biomet/57.1.97
S. Foreman, T. Izubuchi, L. Jin, X.-y. Jin, J.C. Osborn, A. Tomiya, HMC with normalizing flows. PoS LATTICE2021, 073 (2022). https://doi.org/10.22323/1.396.0073
X.-y. Jin, Neural network field transformation and its application in HMC. PoS LATTICE2021, 600 (2022). https://doi.org/10.22323/1.396.0600
S. Bacchio, P. Kessel, S. Schaefer, L. Vaitl, Learning trivializing gradient flows for lattice gauge theories (2022). arXiv:2212.08469 [hep-lat]
U. Wolff (ALPHA), Monte Carlo errors with less errors. Comput. Phys. Commun. 156, 143 (2004). [Erratum: Comput. Phys. Commun. 176, 383 (2007)]. https://doi.org/10.1016/S0010-4655(03)00467-3. arXiv:hep-lat/0306017
A. Ramos, Automatic differentiation for error analysis. PoS TOOLS2020, 045 (2021). https://doi.org/10.22323/1.392.0045. arXiv:2012.11183 [hep-lat]
A. Ramos, Automatic differentiation for error analysis of Monte Carlo data. Comput. Phys. Commun. 238, 19 (2019). https://doi.org/10.1016/j.cpc.2018.12.020. arXiv:1809.01289 [hep-lat]
A. Griewank, A. Walther, Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, 2nd ed. (Society for Industrial and Applied Mathematics, USA, 2008)
M.S. Albergo, D. Boyda, D.C. Hackett, G. Kanwar, K. Cranmer, S. Racanière, D.J. Rezende, P.E. Shanahan, Introduction to normalizing flows for lattice field theory (2021). arXiv:2101.08176 [hep-lat]
L. Baulieu, D. Zwanziger, QCD(4) from a five-dimensional point of view. Nucl. Phys. B 581, 604 (2000). https://doi.org/10.1016/S0550-3213(00)00176-0. arXiv:hep-th/9909006
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, Pytorch: an imperative style, high-performance deep learning library, in Advances in Neural Information Processing Systems, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’ Alché-Buc, E. Fox, R. Garnett, vol. 32 (Curran Associates, Inc., 2019). https://proceedings.neurips.cc/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf
N. Madras, A.D. Sokal, The pivot algorithm: a highly efficient Monte Carlo method for the self-avoiding walk. J. Stat. Phys. 50, 109 (1988). https://doi.org/10.1007/BF01022990
M. Lüscher, Schwarz-preconditioned HMC algorithm for two-flavour lattice QCD. Comput. Phys. Commun. 165, 199 (2005). https://doi.org/10.1016/j.cpc.2004.10.004. arXiv:hep-lat/0409106
R. Abbott et al., Aspects of scaling and scalability for flow-based sampling of lattice QCD (2022). arXiv:2211.07541 [hep-lat]
Acknowledgements
We acknowledge support from the Generalitat Valenciana grant PROMETEO/2019/083, the European projects H2020-MSCA-ITN-2019//860881-HIDDeN and 101086085-ASYMMETRY, and the national project PID2020-113644GB-I00. AR acknowledges financial support from Generalitat Valenciana through the plan GenT program (CIDEGENT/2019/040). DA acknowledges support from the Generalitat Valenciana grant ACIF/2020/011. JMR is supported by STFC grant ST/T506060/1. LDD is supported by the UK Science and Technology Facility Council (STFC) grant ST/P000630/1. This work has been performed under the Project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the EC Research Innovation Action under the H2020 Programme; in particular, we gratefully acknowledge the support of the computer resources and technical support provided by EPCC. This work used the ARCHER2 UK National Supercomputing Service (https://www.archer2.ac.uk). We also acknowledge the computational resources provided by Finis Terrae II (CESGA), Lluis Vives (UV), Tirant III (UV). The authors also gratefully acknowledge the computer resources at Artemisa, funded by the European Union ERDF and Comunitat Valenciana, as well as the technical support provided by the Instituto de Física Corpuscular, IFIC (CSIC-UV).
Appendices
Appendix A: \(\phi ^4\) theory on the lattice
Discretising the Laplacian as
and using the translational invariance of the action in Eq. (17) leads to
Equation (18) can be obtained with the transformations
Appendix B: Smearing
1.1 B.1 Discrete heat equation in \(D=2\)
The time evolution of the gradient flow is given by the heat equation
where the lattice discretization of the Laplacian is
We can solve the gradient flow exactly using the discrete Fourier transform (DFT) and the inverse DFT (IDFT):
Using this, Eq. (B1) becomes
Then, the expression in square brackets is
and therefore
where we have defined \(p_{\mu } \equiv p {\hat{e}}_{\mu }\), with \(p_{\mu } = 2\pi n / L\) for \(n = 0, \ldots , L-1\), and \({\hat{p}}^2 \equiv \sum _{\mu } 4 \sin ^2 \left( \frac{p_{\mu }}{2} \right) \). The solution of this equation in momentum space is
which we can express in position space using the IDFT,
Hence the final expression for the solution of Eq. (B1) is
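The momentum-space solution can be implemented directly with a fast Fourier transform: each mode is damped by \(e^{-{\hat{p}}^2 t}\). The following NumPy sketch assumes unit lattice spacing and an illustrative array convention; it is not the implementation used for the results in this paper.

```python
import numpy as np

def flow(phi0: np.ndarray, t: float) -> np.ndarray:
    """Exact solution of the discrete heat equation at flow time t.

    Each Fourier mode of the initial field phi0 (an L x L array) is
    damped by exp(-p_hat^2 * t), with the lattice momentum squared
    p_hat^2 = sum_mu 4 sin^2(p_mu / 2) and p_mu = 2*pi*n/L.
    """
    L = phi0.shape[0]
    p = 2 * np.pi * np.arange(L) / L
    # Broadcast the two directions into an L x L grid of p_hat^2 values
    phat2 = 4 * np.sin(p[:, None] / 2) ** 2 + 4 * np.sin(p[None, :] / 2) ** 2
    return np.real(np.fft.ifft2(np.exp(-phat2 * t) * np.fft.fft2(phi0)))
```

For a small flow time this reproduces a single explicit Euler step of the heat equation with the nearest-neighbour Laplacian, which provides a simple consistency check of the conventions.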
1.2 B.2 Continuum smearing radius in D dimensions
Doing the same derivation in the continuum for D dimensions one would get
where we have defined the smearing kernel
Analogously to the Yang–Mills gradient flow [9], Eq. (B10) shows that the heat equation is a smoothing operation with mean-square radius
In this work, the flow time t for the computation of observables in smeared configurations was tuned so that \(R_{2}(t) = \xi \).
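Assuming the continuum heat-kernel normalization \(R_{D}(t) = \sqrt{2Dt}\) (so that \(R_{2}(t) = 2\sqrt{t}\) in two dimensions; this convention should be checked against Eq. (B10)), the tuning condition \(R_{2}(t) = \xi \) fixes the flow time as in the following sketch:

```python
import math

def flow_time(xi: float, D: int = 2) -> float:
    """Flow time t at which the heat-kernel mean-square radius
    R_D(t) = sqrt(2*D*t) equals the correlation length xi.
    Assumes the continuum normalization <|x|^2> = 2*D*t.
    """
    return xi ** 2 / (2 * D)

# In D = 2 the condition R_2(t) = xi gives t = xi^2 / 4.
print(flow_time(2.0))  # → 1.0
```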
Appendix C: Supplementary plots and tables
Reference values from the simulations of HMC and FHMC with \(k=3\) and \(N_{l}=1\) can be found in Tables 3 and 4, respectively. Autocorrelation times for the unflowed and flowed one-point susceptibilities, \(\chi _{0}\) and \(\chi _{0,t}\), are displayed in Fig. 6, showing a behaviour similar to that of the autocorrelation time of the magnetization shown in Fig. 3.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Funded by SCOAP3. SCOAP3 supports the goals of the International Year of Basic Sciences for Sustainable Development.
Cite this article
Albandea, D., Del Debbio, L., Hernández, P. et al. Learning trivializing flows. Eur. Phys. J. C 83, 676 (2023). https://doi.org/10.1140/epjc/s10052-023-11838-8