1 Introduction

Lattice field theory provides a numerical approach to the study of non-perturbative properties of many field theories, using Markov Chain Monte Carlo (MCMC) techniques to generate representative samples of field configurations from which expectation values are computed. However, standard MCMC algorithms suffer from a phenomenon known as critical slowing down, whereby the autocorrelation time of the simulation increases dramatically as the continuum limit is approached. In many theories of interest, including Quantum Chromodynamics (QCD), this problem is exacerbated by topology freezing [1,2,3,4,5,6,7,8,9,10]. Autocorrelation times of topological observables have been shown to scale exponentially with the inverse lattice spacing, \(a^{-1}\), for the CP\(^{N-1}\) model [3], and at least as fast as \(a^{-6}\) for lattice QCD [9].

Trivializing maps are invertible field transformations that map a complicated theory to a trivial one, i.e. to a limit in which the field variables decouple and sampling is extremely efficient. Lüscher [11] originally proposed an augmentation of the Hybrid Monte Carlo (HMC) algorithm in which an approximate trivializing map is used to reduce autocorrelation times. However, when the method was tested on CP\(^{N-1}\) models, it was reported that the quality of this approximation, which involved computing the first few terms of a power series, was not sufficient to improve the scaling of the computational cost towards the continuum limit with respect to standard HMC [4].

The recent introduction of machine learning techniques for the sampling of lattice field theories has opened a new avenue to address critical slowing down [12,13,14,15,16,17,18]. Normalizing flows, a class of machine learning models, are also invertible transformations; they are parametrised by neural networks (NNs) and can hence be ‘trained’ to approximate a desired mapping [19,20,21]. Albergo, Kanwar and Shanahan [12] first demonstrated that direct sampling from a well-trained normalizing flow, combined with some form of reweighting such as a Metropolis test, produces unbiased samples of field configurations while completely avoiding critical slowing down. However, experiments with simple architectures have suggested that the overhead cost of training models to achieve a fixed autocorrelation time scales extremely poorly towards the continuum limit [22].

In this work we investigate an algorithm inspired by the original idea of Lüscher, but where a normalizing flow is used to approximate the trivializing map. Given the high training costs associated with the direct sampling strategy, we pose the question: is it possible to improve the scaling of autocorrelation times in HMC using minimal models that are cheap to train? To answer this we benchmark our method against standard HMC on a two-dimensional \(\phi ^{4}\) model.

The paper is organised as follows: in Sect. 2 we briefly review trivializing flows before describing the algorithm that is the focus of this work; in Sect. 3 we describe the experimental setup, which includes details about the \(\phi ^4\) theory, the normalizing flow architectures, and the HMC component of the algorithm; in Sect. 4 we provide the results of our experiments and compare the computational cost scaling against standard HMC. This work is based on results previously reported in Reference [23].

2 Learning trivializing flows

2.1 Trivializing flows

Consider the expectation value of an observable in the path integral formalism of a quantum field theory in Euclidean spacetime,

$$\begin{aligned} \langle {\mathcal {O}} \rangle = \frac{1}{{\mathcal {Z}}} \int _{ }^{ } {\mathcal {D}}\phi \; {\mathcal {O}}(\phi ) \, e^{-S(\phi )}, \end{aligned}$$
(1)

where \({\mathcal {O}}(\phi )\) is an observable defined for the field configuration \(\phi \), \(S(\phi )\) is the action of the theory, \({\mathcal {Z}}\) is its partition function,

$$\begin{aligned} {\mathcal {Z}} = \int {\mathcal {D}}\phi \; e^{-S(\phi )}, \end{aligned}$$
(2)

and \({\mathcal {D}}\phi \) is the integration measure,

$$\begin{aligned} {\mathcal {D}}\phi = \prod _{x} d\phi _{x}. \end{aligned}$$
(3)

The probability of a field configuration \(\phi \) is given by the Boltzmann factor

$$\begin{aligned} p(\phi ) = \frac{1}{{\mathcal {Z}}}\, e^{-S(\phi )}. \end{aligned}$$
(4)

We will refer to p as the target distribution. A change of variables \({\tilde{\phi }}={\mathcal {F}}^{-1}(\phi )\) in Eq. (1) yields

$$\begin{aligned} \langle {\mathcal {O}} \rangle = \frac{1}{{\mathcal {Z}}} \int _{ }^{ } D \tilde{\phi }\; {\mathcal {O}}({\mathcal {F}}({\tilde{\phi }})) e^{-S[{\mathcal {F}}({\tilde{\phi }})] + \log \det J[{\mathcal {F}}( {\tilde{\phi }} )]}, \end{aligned}$$
(5)

where \(J[ {\mathcal {F}}({\tilde{\phi }}) ]\) is the Jacobian coming from the change in the integral measure, \({\mathcal {D}}\phi ={\mathcal {D}}{\tilde{\phi }}\,\det J[ {\mathcal {F}} ]\). If \({\mathcal {F}}\) is chosen such that the effective action for the transformed field,

$$\begin{aligned} {\tilde{S}}({\tilde{\phi }}) \equiv S[{\mathcal {F}}({\tilde{\phi }})] - \log \det J[{\mathcal {F}}({\tilde{\phi }})] \, , \end{aligned}$$
(6)

describes a non-interacting theory, then \({\mathcal {F}}\) is known as a trivializing map.

In Reference [11] trivializing maps for gauge theories were constructed as flows

$$\begin{aligned} {\dot{\phi }}_t \equiv T[t, \phi _t], \end{aligned}$$
(7)

with boundary condition

$$\begin{aligned} \phi = \phi _0. \end{aligned}$$
(8)

The trivializing map is defined as

$$\begin{aligned} {\tilde{\phi }} = {\mathcal {F}}^{-1}(\phi ) = \phi _{1}. \end{aligned}$$
(9)

Though not known in closed form, the kernel \(T\) of the trivializing flow can be expressed as a power series in the flow time \(t\). In practice, this power series was truncated at leading order and the flow integrated numerically, resulting in an approximate trivializing map where the effective action in Eq. (6) is still interacting in general. Nevertheless, \({\tilde{S}}\) ought to be easier to sample than S.

The algorithm introduced in Reference [11] is essentially the HMC algorithm applied to the flowed field variables \({\tilde{\phi }}\). This algorithm was tested for the CP\(^{N-1}\) model, which suffers from topology freezing [3], by Engel and Schaefer [4]. The conclusion of the study was that, although there was a small improvement in the proportionality factor, the overall scaling of the computational cost towards the continuum did not change with respect to standard HMC.

2.2 Flow HMC (FHMC)

Normalizing flows were first applied as a sampling technique for lattice field theories in Reference [12]. The idea is similar to that of the trivializing map. Starting from an initial set of configurations \(\{z_{i}\}_{i=1}^{N}\) drawn from a probability distribution from which sampling is easy,Footnote 1

$$\begin{aligned} z_{i}\sim r(z), \end{aligned}$$
(10)

a transformation \(\phi = f^{-1}(z)\) is applied so that the transformed configurations \(\{\phi _{i}\}_{i=1}^{N} \equiv \{f^{-1}(z_{i})\}_{i=1}^{N}\) follow the new probability distribution

$$\begin{aligned} p_{f}(\phi ) = r\bigl (f(\phi )\bigr )\, \left| \det \frac{\partial f(\phi )}{\partial \phi } \right| . \end{aligned}$$
(11)

The probability density \(p_f\) is called the model distribution. The transformation \(f\) is implemented via NNs with a set of trainable parameters \(\{\theta _{i}\}\) which have been optimised so that \(p_{f}(\phi )\) is as close as possible to the target distribution \(p(\phi ) = e^{-S(\phi )} / {\mathcal {Z}}\), i.e. the distribution of the theory we are interested in. The determinant of this transformation can be easily computed if the network architecture consists of coupling layers [24,25,26] with a checkerboard pattern, as explained in Reference [12]. Normalizing flows are therefore NN parametrisations of trivializing maps.

Ideally, the NNs would be trained such that the Kullback–Leibler (KL) divergence [27] between the model and the target distribution,

$$\begin{aligned} D_{\text {KL}}(p_{f} || p) = \int {\mathcal {D}}\phi \, p_{f}(\phi ) \log \frac{p_{f}(\phi )}{p(\phi )}, \end{aligned}$$
(12)

is minimised. The KL divergence is a measure of the dissimilarity between two probability distributions, satisfying \(D_{\text {KL}}(p_{f} || p) \ge 0\) and

$$\begin{aligned} D_{\text {KL}}(p_{f} || p) = 0 \iff p_{f} = p\,. \end{aligned}$$

However, the partition function \({\mathcal {Z}}\) appearing in \(p(\phi )\) is generally not known, so in practice one minimises a shifted KL divergence,

$$\begin{aligned} L(\theta ) \equiv&\, D_{\text {KL}}(p_{f} \mid \mid p) - \log {\mathcal {Z}} \nonumber \\ =&\int {\mathcal {D}}\phi \, p_{f}(\phi ) \left[ \log p_{f}(\phi ) + S(\phi ) \right] . \end{aligned}$$
(13)

This loss function can be stochastically estimated by drawing samples \(\phi _{i} \sim p_{f}(\phi )\) from the model; there is no requirement to have a set of existing training data. Since f is differentiable, the loss can be minimised using standard gradient-based optimisation algorithms such as stochastic gradient descent and ADAM [28], the latter of which we used in this study. The absolute minimum of the loss function occurs when \(L(\theta ) = -\log {\mathcal {Z}}\), where \(p_{f} = p\). In practice this minimum is unlikely to be achieved due to both the limited expressivity of the model and a finite amount of training, but one expects to have an approximate trivializing map at the end of the training, i.e. \(p_{f} \approx p\).
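As an illustration, a minimal self-training loop of this kind could look as follows; the `flow` and `action` objects (a model exposing a `sample()` method that returns configurations together with their model log-density, and a callable evaluating \(S(\phi)\) on a batch) are assumptions for the sketch, not the code used in this work.

```python
import torch

# Minimal sketch of self-training with the shifted KL loss, Eq. (13).
# Assumes flow.sample(n) -> (phi, log_pf), with phi = f^{-1}(z) for z drawn
# from the prior r and log_pf = log p_f(phi), and action(phi) returning
# S(phi) for every configuration in the batch.
def train(flow, action, n_steps=1000, batch_size=1024, lr=1e-3):
    opt = torch.optim.Adam(flow.parameters(), lr=lr)
    for step in range(n_steps):
        phi, log_pf = flow.sample(batch_size)
        loss = (log_pf + action(phi)).mean()   # stochastic estimate of L(theta)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return flow
```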

The normalizing flow model generates configurations that are distributed according to \(p_{f}\), not p. To achieve unbiased sampling from p, the work in Reference [12] embeds the model in a Metropolis–Hastings (MH) algorithm [29, 30], where \(p_{f}\) serves as the proposal distribution. Since the proposals are statistically independent, the only source of autocorrelation is the rejection of proposals. However, training models to achieve a fixed low MH rejection rate can become prohibitively expensive for large systems and long correlation lengths [22].
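For concreteness, the accept–reject step of such an independence Metropolis–Hastings sampler amounts to comparing importance weights \(w(\phi ) \propto p(\phi )/p_{f}(\phi )\); a schematic version (not the implementation of Reference [12]) is:

```python
import math
import random

def mh_step(S_curr, logq_curr, S_prop, logq_prop):
    """Independence MH test with flow proposals: accept a configuration drawn
    from p_f with probability min(1, w(phi')/w(phi)), where
    log w = -S - log p_f (the constant log Z cancels in the ratio)."""
    log_ratio = (S_curr + logq_curr) - (S_prop + logq_prop)
    return math.log(random.random()) < min(0.0, log_ratio)
```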

In contrast, in this work we propose to use the trained model as an approximation to the trivializing map in the implementation of the trivializing flow algorithm described in Sect. 2.1. Thus we identify

$$\begin{aligned} {\mathcal {F}} = f^{-1}, \end{aligned}$$
(14)

so that the expectation value in Eq. (1) becomes,

$$\begin{aligned} \langle {\mathcal {O}} \rangle =&\, \frac{1}{{\mathcal {Z}}} \int _{ }^{ } D {\tilde{\phi }}\; {\mathcal {O}}\left( f^{-1}({{\tilde{\phi }}})\right) \, e^{-S[f^{-1}({\tilde{\phi }})] + \log \det J[f^{-1}( {\tilde{\phi }} )]} \nonumber \\ \equiv&\, \frac{1}{{\mathcal {Z}}} \int _{ }^{ } D{\tilde{\phi }} \; {\mathcal {O}}\left( f^{-1}({{\tilde{\phi }}})\right) \, e^{-{\tilde{S}}[{\tilde{\phi }}]}, \end{aligned}$$
(15)

where we have defined the new action \({\tilde{S}}\) in the transformed coordinates to be

$$\begin{aligned} {\tilde{S}}({\tilde{\phi }}) \equiv S(f^{-1}({\tilde{\phi }})) - \log \det J[f^{-1}({\tilde{\phi }})]. \end{aligned}$$
(16)

If the new probability distribution \(e^{-{\tilde{S}}({\tilde{\phi }})}\) is easier to sample than \(e^{-S(\phi )}\), then performing HMC in the new variables \({\tilde{\phi }}\) results in a Markov chain \(\{ {\tilde{\phi }}_{i}\}_{i=1}^{N}\) with lower autocorrelation times for the observable \({\mathcal {O}}\).

We will refer to this algorithm as Flow HMC (FHMC), and its workflow is as follows:

  1. Train the network \(f\) by minimising the KL divergence in Eq. (13).

  2. Run the HMC algorithm to build a Markov chain of configurations using the action \({\tilde{S}}\),

    $$\begin{aligned} \{ {\tilde{\phi }}_{1},\; {\tilde{\phi }}_{2},\; {\tilde{\phi }}_{3},\; \ldots ,\; {\tilde{\phi }}_{N} \} \sim e^{-{\tilde{S}}({\tilde{\phi }})}. \end{aligned}$$
  3. Apply the inverse transformation \(f^{-1}\) to every configuration in the Markov chain to undo the variable transformation. This yields a Markov chain of configurations following the target probability distribution \(p(\phi ) \propto e^{-S(\phi )}\),

    $$\begin{aligned} \left\{ f^{-1}({\tilde{\phi }}_{1}),\; f^{-1}({\tilde{\phi }}_{2}),\; f^{-1}({\tilde{\phi }}_{3}),\; \ldots ,\; f^{-1}({\tilde{\phi }}_{N}) \right\} \nonumber \\ = \{ \phi _{1},\; \phi _{2},\; \phi _{3},\; \ldots ,\; \phi _{N} \} \sim e^{-S(\phi )}. \end{aligned}$$

Note that the HMC acceptance in step 2 can be made arbitrarily high by increasing the number of integration steps in the molecular dynamics evolution of HMC. Contrary to what happens in the algorithm suggested in Reference [12], this acceptance does not measure how well \(f^{-1}\) approximates a trivializing map; the relevant question is whether this algorithm improves the autocorrelation of HMC.

The motivation behind this work is that a normalizing flow parametrised by NNs ought to be better able to approximate a trivializing map than the leading-order approximation of the flow equation introduced and tested in References [4, 11]. Similar ideas have been explored in References [31,32,33]. In particular, in Reference [32] a normalizing flow is optimised to approximate the target distribution from an input distribution corresponding to the action at a coarser lattice spacing, and training is done by minimising the force difference between the two theories instead of the KL divergence. In contrast to these previous works we focus on minimal models and cheap training setups in an attempt to avoid the exploding training costs reported in Reference [22].

3 Model and setup

For our study we focus on a \(\phi ^{4}\) scalar field theory in \(D=2\) dimensions with bare mass \(m_{0}\) and bare coupling \(g_{0}\). Its standard continuum action is

$$\begin{aligned} S[ \phi ]&= \int d^2 x \left[ \frac{1}{2} \left( \partial _{\mu } \phi (x) \right) \left( \partial _{\mu }\phi (x) \right) \right. \nonumber \\&\quad \left. + \frac{1}{2} m_{0}^{2}\phi (x)^2 + \frac{1}{4!} g_{0} \phi (x)^4 \right] . \end{aligned}$$
(17)

On the lattice we will work with the \(\beta \)\(\lambda \) parametrization,

$$\begin{aligned} S[ \phi ] = \sum _{x}^{} \left[ -\beta \sum _{\mu =1}^{2} \phi _{x + e_{\mu }} \phi _{x} + \phi _{x}^{2} + \lambda (\phi _{x}^{2} - 1)^2 \right] , \end{aligned}$$
(18)

where \(e_{\mu }\) represents a unit vector in the \(\mu \)th direction and the sum runs over all lattice points \(x\equiv (x_{1},x_{2})\). The relationship between these two actions is explained in Appendix A.

3.1 Observables

We will focus only on a handful of observables, the simplest one being the magnetization

$$\begin{aligned} M = \frac{1}{V} \sum _{x}^{} \phi _{x}, \end{aligned}$$
(19)

with V the lattice volume. The building block for the rest of the observables is the connected two-point correlation function

$$\begin{aligned} G(y)&= \frac{1}{V} \sum _{x }^{} \langle (\phi _{x+y} - \langle \phi \rangle ) (\phi _{x} - \langle \phi \rangle ) \rangle \nonumber \\&= \frac{1}{V} \sum _{x }^{} \langle \phi _{x+y}\phi _{x} \rangle - \langle \phi \rangle ^2, \end{aligned}$$
(20)

where we have used translational invariance to define \(\langle \phi \rangle = \langle \phi _x \rangle \). The correlation length, \(\xi \), corresponding to the inverse mass of the lightest mode in the spectrum, can be extracted from the spatially-averaged two-point function,

$$\begin{aligned} \sum _{y_{1}=0}^{L-1} G(y_{1},y_{2}) \propto \cosh \left( \frac{y_{2} - L / 2}{\xi } \right) \, , \end{aligned}$$
(21)

at sufficiently large Euclidean time separations \(y_{2}\).
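As an illustration of Eq. (21), the correlation length could be extracted with a simple cosh fit along the lines of the sketch below; the fit range and the absence of error propagation are illustrative choices, not those used in this work.

```python
import numpy as np
from scipy.optimize import curve_fit

def correlation_length(G, L):
    """Estimate xi from Eq. (21): sum the connected correlator G(y1, y2)
    over y1 and fit the result to A * cosh((y2 - L/2) / xi)."""
    C = G.sum(axis=0)                      # spatially averaged correlator
    y2 = np.arange(L)
    model = lambda y, A, xi: A * np.cosh((y - L / 2) / xi)
    sel = slice(L // 4, 3 * L // 4 + 1)    # keep only large separations
    popt, _ = curve_fit(model, y2[sel], C[sel], p0=(C[L // 2], L / 4))
    return abs(popt[1])
```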

We can also measure the one-point susceptibility,Footnote 2

$$\begin{aligned} \chi _{0} \equiv G(0) = \frac{1}{V} \sum _{x }^{} \left[ \langle \phi _{x}^2 \rangle - \langle \phi \rangle ^2 \right] . \end{aligned}$$
(22)

Since both the magnetization and the one-point susceptibility are local observables, we additionally studied the one-point susceptibility measured in smeared field configurations,

$$\begin{aligned} \chi _{0,t} \equiv \frac{1}{V} \sum _{x }^{} \left[ \langle \phi _{x,t}^2 \rangle - \langle \phi _{t} \rangle ^2 \right] . \end{aligned}$$
(23)

The smeared field configurations \(\phi _{t}\) are obtained by solving the heat equation

$$\begin{aligned} \partial _{t}\phi _{x,t} = \partial _{x}^2 \phi _{x,t}, \end{aligned}$$
(24)

up to flow time t, which we choose so that the smearing radius of the flow is equal to the correlation length of the system, i.e. \(\sqrt{4t} = \xi \). For more details, see Appendix B.
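A possible implementation of this smearing, assuming periodic boundary conditions and solving Eq. (24) exactly in Fourier space with the lattice Laplacian (the details of Appendix B may differ), is sketched below.

```python
import numpy as np

def smear(phi, xi):
    """Flow phi with the heat equation (24) up to t = xi**2 / 4, so that the
    smearing radius sqrt(4t) equals the correlation length xi."""
    L = phi.shape[0]
    p2 = 4 * np.sin(np.pi * np.arange(L) / L) ** 2   # lattice momenta squared
    phat2 = p2[:, None] + p2[None, :]
    t = xi ** 2 / 4
    return np.real(np.fft.ifft2(np.exp(-t * phat2) * np.fft.fft2(phi)))
```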

As usual, the estimator of the expectation values of these observables is the statistical average over the generated Markov chain of configurations,

$$\begin{aligned} \overline{{\mathcal {O}}} = \frac{1}{N} \sum _{i=1}^{N} {\mathcal {O}}(\phi ^{(i)}). \end{aligned}$$
(25)

where \({\mathcal {O}}\) is the observable studied. The error associated with this estimation is given by the statistical variance,

$$\begin{aligned} \sigma _{\overline{{\mathcal {O}}}}^2 = \frac{\sigma _{{\mathcal {O}}}^2}{N} 2 \tau _{\text {int},{\mathcal {O}}}, \end{aligned}$$
(26)

where \(\tau _{\text {int},{\mathcal {O}}}\) is the so-called integrated autocorrelation time. It is defined as

$$\begin{aligned} \tau _{\text {int},{\mathcal {O}}} = \frac{1}{2} + \sum _{m=1}^{\infty } \frac{\Gamma _{{\mathcal {O}}}(m)}{\Gamma _{{\mathcal {O}}}(0)}, \end{aligned}$$
(27)

with

$$\begin{aligned} \Gamma _{{\mathcal {O}}}(m) = {1\over N} \sum _i {\mathcal {O}}(\phi ^{(i+m)}){\mathcal {O}}(\phi ^{(i)}) - \overline{{\mathcal {O}}}^2, \end{aligned}$$
(28)

being the autocorrelation function of the observable at a separation of \(m\) configurations along the Markov chain. We estimate \(\tau _{\text {int}}\) using the automatic windowing procedure of the \(\Gamma \) method [10, 34]. In particular, we perform the autocorrelation analysis with ADerrors.jl [35], which combines the \(\Gamma \) method with automatic differentiation techniques [36, 37].
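For orientation, a simplified estimate of Eqs. (27)–(28) with a heuristic automatic window could look as follows; this is only a crude stand-in for the full \(\Gamma \)-method analysis of ADerrors.jl used in this work.

```python
import numpy as np

def tau_int(obs, S=5.0):
    """Crude estimate of the integrated autocorrelation time, Eq. (27).
    The sum over m is truncated at the first window W with W >= S * tau_int(W),
    a simplified stand-in for the automatic windowing of [10, 34]."""
    obs = np.asarray(obs, dtype=float) - np.mean(obs)
    N = len(obs)
    gamma0 = np.dot(obs, obs) / N          # Gamma(0)
    tau = 0.5
    for m in range(1, N // 2):
        gamma_m = np.dot(obs[:N - m], obs[m:]) / (N - m)
        tau += gamma_m / gamma0            # add rho(m) = Gamma(m) / Gamma(0)
        if m >= S * tau:
            break
    return tau
```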

3.2 Network architecture and training

As mentioned in Sect. 1, we focused on keeping the training costs negligible with respect to the cost of producing configurations with FHMC. The main choices we made to this end are:

  1. Use convolutional neural networks (CNNs) instead of fully connected networks. The action of Eq. (18) has translational symmetry, so the network f should apply the same transformation to \(\phi _{x}\) for all x. CNNs respect this translational symmetry, and also require fewer parameters than fully connected networks.

  2. Tune the number of layers and kernel sizes of the CNNs so that the footprint of the network f is not much bigger than the correlation length \(\xi \). Two-point correlation functions generally decay as \(e^{-\left| x - y \right| / \xi }\), so the transformation of \(\phi _{x}\) should not depend on \(\phi _{y}\) if \(\left| x - y \right| \gg \xi \). Limiting the number of layers also reduces the number of trainable parameters.

  3. Enforce f to satisfy \(f(-\phi ) = -f(\phi )\). The action in Eq. (18) is invariant under \(\phi \rightarrow -\phi \), so enforcing equivariance under this symmetry should reduce training costs.

Algorithm 1 HMC

Following [12], we partition the lattice using a checkerboard pattern, so that each field configuration can be split as \(\phi =\{\phi ^A,\phi ^B\}\), where \(\phi ^A\) and \(\phi ^B\) collectively denote the field variables belonging to one or the other partition. We then construct the transformation f as a composition of n layers,

$$\begin{aligned} f(\phi ) = g^{(1)}(g^{(2)}(\dots g^{(n)}(\phi ) \dots )), \end{aligned}$$
(29)

where each layer \(g^{(i)}\) applies an affine transformation to one of the two sets of field variables, \(\{\phi ^A,\phi ^B\}\), organised in a checkerboard pattern, e.g.

$$\begin{aligned} \phi _{x}^{A} = {\left\{ \begin{array}{ll} \phi _{x} \quad &{}\text {if } x_{1}+x_{2} \text { odd} \\ 0 \quad &{}\text {otherwise} \end{array}\right. }, \; \phi _{x}^{B} = {\left\{ \begin{array}{ll} 0 \quad &{}\text {if } x_{1}+x_{2} \text { odd} \\ \phi _{x} \quad &{}\text {otherwise} \end{array}\right. }, \end{aligned}$$

where \(x=(x_{1},x_{2})\). In the affine transformation

$$\begin{aligned} g^{(i)}(\{\phi ^A,\phi ^B\}) = \{\phi ^A, \phi ^B \odot e^{\left| s ^{(i)}(\phi ^A) \right| } + t^{(i)}(\phi ^{A})\} \end{aligned}$$
(30)

the partition \(\phi ^A\) remains unchanged and only the field variables \(\phi ^B\) are updated. \(s(\phi )\) and \(t(\phi )\) are CNNs with kernel size k. To make this transformation equivariant under \(\phi \rightarrow -\phi \) we enforce \(f(-\phi )=-f(\phi )\) by using a tanh activation function and no bias for the CNNs (see Sec. III.F of [22]). The checkerboard pattern ensures that the Jacobian matrix has a triangular form so its determinant can be easily computed, and reads

$$\begin{aligned} \left| \det \frac{\partial g^{(i)}(\phi )}{\partial \phi } \right| = \prod _{\{x | \phi ^{B}_{x} = \phi _{x}\}} e^{\left| s ^{(i)}_{x}(\phi ^{A}) \right| }, \end{aligned}$$
(31)

where the product runs over the lattice points for which \(\phi _{x}^{B} = \phi _{x}\). As an example, the action of a CNN with a single layer and a tanh activation function on a lattice field would be

$$\begin{aligned} s ^{(i)}_{x}(\phi ^{A}) = \tanh \left[ \sum _{y \in \left[ -\frac{k-1}{2},\frac{k-1}{2} \right] ^2 }^{} w^{(i)}(y) \phi _{x-y}^{A} \right] , \end{aligned}$$
(32)

where \(w^{(i)}(y) \equiv w^{(i)}_{y + (k+1) / 2}\) denotes the \(k \times k\) weight matrix of the CNN \(s\) with kernel size k. Choosing the same functional form for \(t_{x}^{(i)}(\phi ^{A})\), it is easy to check that the transformation in Eq. (30) is equivariant under \(\{\phi ^{A}, \phi ^{B}\} \rightarrow \{- \phi ^{A}, -\phi ^{B}\}\).

Algorithm 2 FHMC

Two different affine layers with alternating checkerboard patterns are necessary to transform the whole set of lattice points, and we will denote such a pair of layers as a coupling layer.Footnote 3

In this work we studied network architectures with a single affine coupling layer (\(N_{l} = 1\)), while the CNNs, s and t, have kernel size k and no hidden layers. The output configuration is rescaled by an additional trainable parameter. Finally, independent normal distributions are used as the prior distribution r in Eq. (10).
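To make the construction concrete, a minimal PyTorch sketch of one such affine layer is given below; the class name, the (batch, 1, L, L) field layout and the checkerboard helper are illustrative assumptions, not the code used in this work.

```python
import torch
import torch.nn as nn

def checkerboard(L):
    """Mask that is 1 on the 'frozen' sublattice (x1 + x2 odd) and 0 elsewhere."""
    x = torch.arange(L)
    return ((x[:, None] + x[None, :]) % 2).float().view(1, 1, L, L)

class AffineHalfLayer(nn.Module):
    """One affine transformation, Eq. (30): the sites selected by `mask` are
    left untouched and condition the update of the complementary sites.
    s and t are single convolutions with no bias followed by tanh, so the
    layer is equivariant under phi -> -phi."""
    def __init__(self, k, mask):
        super().__init__()
        pad = (k - 1) // 2
        self.s = nn.Conv2d(1, 1, k, padding=pad, padding_mode="circular", bias=False)
        self.t = nn.Conv2d(1, 1, k, padding=pad, padding_mode="circular", bias=False)
        self.register_buffer("mask", mask)            # 1 on the frozen partition

    def forward(self, phi):
        # phi has shape (batch, 1, L, L); returns (g(phi), log|det J|), Eq. (31)
        frozen = phi * self.mask
        s = torch.tanh(self.s(frozen)) * (1 - self.mask)
        t = torch.tanh(self.t(frozen)) * (1 - self.mask)
        updated = phi * (1 - self.mask) * torch.exp(s.abs()) + t
        logdet = s.abs().flatten(1).sum(dim=1)
        return frozen + updated, logdet
```

A coupling layer would then be the composition of two such layers with complementary masks, `mask` and `1 - mask`, followed by the global trainable rescaling mentioned above.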

3.3 FHMC implementation

The main focus of this work is the scaling of the autocorrelation times of the magnetization, \(\tau _{M}\), and one-point susceptibilities, \(\tau _{\chi _{0}}\) and \(\tau _{\chi _{0,t}}\). Using local update algorithms such as HMC, these autocorrelation times are expected to scale as [39]

$$\begin{aligned} \tau \sim \xi ^2. \end{aligned}$$
(33)

We will benchmark the scaling of the autocorrelation times in the FHMC algorithm against those in standard HMC.

For a scalar field theory, the HMC equations of motion read

$$\begin{aligned} {\dot{\phi }}_{x} =\, \pi _{x}, \qquad {\dot{\pi }}_{x} = -\nabla _{\phi _{x}}S[\phi ], \end{aligned}$$
(34)

where the force for the momenta \(\pi \) follows from the derivative of the action in Eq. (18),

$$\begin{aligned} F_{x} \equiv -\nabla _{x} S[ \phi ]&= \beta \sum _{\mu =\pm 1}^{\pm 2} \phi _{x+e_{\mu }} \nonumber \\&\quad + 2 \phi _{x} \left( 2 \lambda \left( 1-\phi _{x}^2 \right) -1 \right) . \end{aligned}$$
(35)

In our simulations we used a leapfrog integration scheme with a single time scale, and the step size of the integration was tuned to obtain acceptances of approximately 90%. A pseudocode of an HMC implementation is depicted in Algorithm 1: the HMC function receives as input a configuration, \(\phi \), and the action of the target theory, S; after generating random momenta, the leapfrog function performs the molecular dynamics step and a configuration, \(\phi _{\text {new}}\), is chosen between the evolved field, \(\phi '\), and the old field, \(\phi \), with the usual MH accept–reject step.
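A NumPy sketch in the spirit of Algorithm 1 is shown below; the function names and the accept flag returned alongside the configuration are illustrative, not the implementation used in this work.

```python
import numpy as np

def leapfrog(phi, pi, force, eps, n_steps):
    """Leapfrog integration of the equations of motion, Eq. (34)."""
    pi = pi + 0.5 * eps * force(phi)
    for _ in range(n_steps - 1):
        phi = phi + eps * pi
        pi = pi + eps * force(phi)
    phi = phi + eps * pi
    pi = pi + 0.5 * eps * force(phi)
    return phi, pi

def hmc(phi, action, force, eps, n_steps, rng=None):
    """One HMC update: draw momenta, integrate, MH accept-reject on the Hamiltonian."""
    rng = rng or np.random.default_rng()
    pi = rng.standard_normal(phi.shape)
    H_old = action(phi) + 0.5 * np.sum(pi ** 2)
    phi_new, pi_new = leapfrog(phi, pi, force, eps, n_steps)
    H_new = action(phi_new) + 0.5 * np.sum(pi_new ** 2)
    if rng.random() < np.exp(min(0.0, H_old - H_new)):
        return phi_new, True
    return phi, False
```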

Table 1 Studied parameters of the action in Eq. (18). \(\beta \) has been chosen so that \(\xi = L / 4\), so the continuum limit is taken in the direction of increasing L

The proposed FHMC algorithm is in essence the HMC algorithm with the transformed action in Eq. (16), which arises from the change of variables \({\tilde{\phi }} = f(\phi )\) in Eq. (15). The new Hamilton equations of motion now include derivatives with respect to the new variables \({\tilde{\phi }}\),

$$\begin{aligned} \dot{{\tilde{\phi }}}_{x} =\, \pi _{x}, \qquad {\dot{\pi }}_{x} = -\nabla _{{\tilde{\phi }}_{x}}{\tilde{S}}[{\tilde{\phi }}]. \end{aligned}$$
(36)

The basic implementation is sketched in Algorithm 2. The main differences with respect to standard HMC in Algorithm 1 are line 2, where we transform from the variables \(\phi \) to the variables \({\tilde{\phi }}\) using the trained network f; and line 6, where we undo the change of variables to obtain the new configuration \(\phi _{\text {new}}\) from \({\tilde{\phi }}_{\text {new}}\).

Note that the molecular dynamics evolution and the accept–reject step, lines 4 and 5, are applied to the transformed field variables \({\tilde{\phi }}\) with the new action \({\tilde{S}}\). Irrespective of the transformation f, the acceptance rate can be made arbitrarily high by reducing numerical errors in the integration of the equations of motion in Eq. (36). This means that we will always be able to tune the FHMC acceptance to approximately 90% by increasing the number of integration steps, even for a poorly trained normalizing flow.

Note also that the evaluation of the force \({\tilde{F}}_{x} \equiv - \nabla _{{\tilde{\phi }}_{x}}{\tilde{S}}[ {\tilde{\phi }} ]\) now requires computing the derivative of \(\log \det J[ f^{-1}({\tilde{\phi }}) ]\). This cannot be written analytically for an arbitrary network, and we used PyTorch’s automatic differentiation methods for its evaluation [40].
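A sketch of how such a force can be obtained with PyTorch autograd is given below; `flow.inverse` is an assumed interface returning \(f^{-1}({\tilde{\phi }})\) together with \(\log \det J\), not the actual code of this work.

```python
import torch

def fhmc_force(phi_tilde, flow, action):
    """Force of the transformed action, Eq. (16):
    F~ = -d/d(phi~) [ S(f^{-1}(phi~)) - log det J[f^{-1}(phi~)] ].
    flow.inverse(phi~) is assumed to return (phi, logdet) with
    phi = f^{-1}(phi~) and logdet = log det J."""
    phi_tilde = phi_tilde.detach().requires_grad_(True)
    phi, logdet = flow.inverse(phi_tilde)
    s_eff = action(phi) - logdet           # S~(phi~), one value per configuration
    s_eff.sum().backward()
    return -phi_tilde.grad
```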

4 Results

We are interested in the scaling of the cost towards the continuum limit. Following the analysis in [22], we tuned the coupling \(\beta \) so that the correlation length satisfies

$$\begin{aligned} \xi \approx \frac{L}{4}. \end{aligned}$$
(37)

The continuum limit is therefore approached in the direction of increasing L. In Table 1 we summarise the parameters used in our simulations of FHMC and standard HMC. Results obtained with both HMC and FHMC can be found in Tables 3 and 4 of Appendix C, respectively.

Fig. 1 (Left) History of the KL divergence during the training from independent Gaussians to a theory with parameters \(\beta = 0.641\), \(\lambda = 0.5\), lattice size \(L = 18\) and \(\xi = L / 4\). (Right) History of the magnetization for a simulation with HMC (blue) and FHMC (orange)

4.1 Minimal network

An obvious strategy to reduce training costs is to build networks with few trainable parameters. As mentioned in Sect. 3, using \(N_{l} = 1\) coupling layer suffices to transform the whole lattice, while a kernel size \(k=3\) for the CNNs, which couples only nearest neighbours, is the smallest that can be used.Footnote 4 Such a network has only 37 trainable parameters in total,Footnote 5 and is the most minimal network that we will consider.

It is interesting to study whether such a simple network can learn the physics of a target theory with a non-trivial correlation length. In Fig. 1 (left) we plot the evolution of the KL divergence during the training of a network with this minimal architecture, where the target theory has parameters \(\beta = 0.641\), \(\lambda = 0.5\), lattice size \(L=18\) and correlation length \(\xi = L / 4 = 4.5\). Since the network has very few parameters, the KL divergence saturates after only \({\mathcal {O}}(100)\) iterations.

Once the network is trained, one can use it as a variable transformation for the FHMC algorithm as sketched in Algorithm 2. In Fig. 1 (right) we show the magnetizations of a slice of 4000 configurations of the Markov chains produced by FHMC and HMC simulations, yielding the autocorrelation times

$$\begin{aligned} \tau _{M, \text {FHMC}} = 74.4(3), \quad \tau _{M, \text {HMC}} = 100.4(2). \end{aligned}$$
(38)

All the results with this minimal network can be found in Table 4 of Appendix C, showing that FHMC leads to smaller autocorrelation times than standard HMC, especially as the continuum limit is approached. This seems to indicate that a network with few parameters can indeed learn transformations carrying relevant physical information.

Table 2 MH acceptances of networks trained at lattice size L and used to sample a theory with coupling \(\beta \) at lattice sizes L and 2L
Fig. 2 Autocorrelation time of the magnetization at lattice size 2L using FHMC. In circles, the networks used were trained at lattice size 2L; in triangles, they were trained at L and used at 2L

A measure of the closeness of the distributions \(p_f\) and p is given by the MH acceptanceFootnote 6 when sampling p with configurations drawn directly from \(p_f\). Focusing on the first three columns of Table 2, it is clear that the acceptances are low and decrease towards the continuum limit, even though the autocorrelation times of FHMC are better than those of HMC. This is because the networks used have a very reduced set of parameters and therefore limited expressivity: the map defined by the trained networks does not reproduce the probability distribution of the target theory very accurately. Nevertheless FHMC, i.e. a molecular dynamics evolution using flowed variables, yields a clear gain in the autocorrelation times.

4.2 Infinite volume limit

An important advantage of using a translationally-invariant network architecture, such as one built from CNNs, is that the network can be trained at a small lattice size L and then used on a larger lattice \(L' > L\). Note that doing this in the approach of References [12, 22] would not be viable, since it would lead to an exponential decrease of the MH acceptance due to the extensive character of the action.

In the last column of Table 2 we show the MH acceptance using networks with the minimal architecture of Sect. 4.1 trained at lattice size L when the target theory has lattice size 2L. One can see that the acceptances are significantly lower than those obtained with the target theory at size L. However, the acceptance of the FHMC algorithm (Algorithm 2) can be kept arbitrarily high by increasing the number of integration steps of the Hamilton equations, so reusing the networks at larger volumes does not pose any problem: the reduced MH acceptance does not translate into a change in the autocorrelation time.

In Fig. 2 we compare the autocorrelation times of the magnetization for networks trained at L and reused at 2L with those of networks trained directly at 2L. Since they agree within statistical uncertainties, the relevant physical information is evidently already learned at small volumes, which reinforces the intuition that the training does not need to be done at lattice sizes larger than \(\xi \).

4.3 Continuum limit scaling with fixed architecture

Finally we want to determine whether the computational cost of FHMC scales better than that of standard HMC as we approach the continuum. First we consider a fixed network architecture as we scale L. We have trained a different network for each of the lattice sizes in Table 1, with \(N_{l}=1\) and kernel sizes \(k=3,5,7\). The cost of the training is in all cases negligible with respect to the cost of the FHMC simulations, and the integration step of the leapfrog scheme is tuned to obtain an acceptance rate of approximately 90% for every simulation, as we do for HMC.

In Fig. 3 (left) we show the autocorrelation times of the magnetization for both HMC (filled, blue circles) and FHMC with kernel sizes \(k=3,5,7\) (open circles, triangles and squares). One can see that the autocorrelations in FHMC are lower than the ones of HMC, and decrease as the kernel size of the CNNs is increased.Footnote 7

In order to study the scaling towards the continuum limit, we plot the ratio of autocorrelation times for HMC versus FHMC in Fig. 3 (right) for the three values of the kernel size. Although for the coarser lattices the ratio increases towards the continuum, it seems to saturate within the range of lattice spacings explored, indicating that the cost scaling of both algorithms is the same. The same behaviour is observed for the one-point susceptibilities of Eqs. (22) and (23).

Fig. 3 (Left) Scaling of the autocorrelation time of the magnetization towards the continuum for HMC (filled, blue circles) and FHMC with kernel sizes \(k=3,5,7\) (open circles, triangles and squares). (Right) Scaling of the ratio of autocorrelation times of the magnetization of HMC with respect to FHMC

Fig. 4 (Left) Scaling of the autocorrelation time of the magnetization for HMC and FHMC with a network with \(k\sim \xi \). (Right) Same for the flowed one-point susceptibility, Eq. (23). The fits correspond to the best fit function of Eq. (39)

4.4 Continuum limit scaling with \(k \sim \xi \)

As we take the continuum limit the correlation length increases in lattice units. If the footprint is chosen to scale with \(\xi \), the convolution implemented by the network covers the same physical region. Using our architecture, this can be done by adding more coupling layers or increasing the kernel size, k. More concretely, a kernel size k couples \((k-1) / 2\) nearest neighbours; therefore, since we have no hidden layers, if we have \(N_{l}\) coupling layers the network will couple \(N_{l}(k-1)\) nearest neighbours.

In Fig. 4 (left) we show again the scaling of the autocorrelation times of the magnetization, but now the networks used for the FHMC algorithm have \(N_{l}(k-1) \approx \xi \). In particular, all networks in the plot have \(N_{l} = 1\) coupling layer, so only the kernel size varies: for \(L = 10\) to \(L = 16\) the networks have \(k = 5\); for \(L = 18\) and \(L=20\), \(k=7\); for \(L = 40\), \(k = 11\); and for \(L = 80\), \(k = 21\).

The curves of the plot correspond to fits to the function

$$\begin{aligned} \tau = a \xi ^{z} + b, \end{aligned}$$
(39)

yielding

$$\begin{aligned} z_{M, \text {HMC}} = 2.19(4), \quad z_{M, \text {FHMC}} = 1.94(6). \end{aligned}$$
(40)
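The exponents quoted above come from fits of Eq. (39); an error-weighted fit of this form could be set up as in the following sketch (the actual fitting procedure of this work may differ in detail).

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_scaling(xi, tau, tau_err):
    """Fit tau = a * xi**z + b, Eq. (39), and return z with its error."""
    model = lambda x, a, z, b: a * x ** z + b
    popt, pcov = curve_fit(model, xi, tau, sigma=tau_err,
                           absolute_sigma=True, p0=(1.0, 2.0, 0.0))
    return popt[1], np.sqrt(pcov[1, 1])
```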

Thus keeping the physical footprint size constant seems to yield a slight improvement in the scaling towards the continuum.Footnote 8 It is also interesting to see that the same happens with the smeared susceptibility in Fig. 4 (right). The latter is a non-local observable that has been measured in smeared configurations with a smoothing radius \(\sqrt{4t} = \xi \) (see Appendix B).

This slight improvement is in agreement with the fact that for a fixed network architecture the continuum scaling remains the same as HMC, while increasing the kernel size improves the global factor of the autocorrelations, as was seen in Fig. 3.

It is important to note that increasing the kernel size of the network increases the number of parameters to be trained, and also the number of operations required to compute the force in the molecular dynamics evolution using automatic differentiation. In particular, the number of parameters of our networks is given by

$$\begin{aligned} N_{\text {params}} = 4 k^2 N_{l} + 1. \end{aligned}$$
(41)
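In our reading of this counting, each of the \(N_{l}\) coupling layers contains two affine layers with two unbiased CNNs (\(s\) and \(t\)) of \(k^{2}\) weights each, plus the single global rescaling parameter; for the minimal network of Sect. 4.1 this gives

$$\begin{aligned} N_{\text {params}} = 4 k^{2} N_{l} + 1 = 4 \cdot 3^{2} \cdot 1 + 1 = 37, \end{aligned}$$

in agreement with the count quoted there.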

In Fig. 5 we show the time needed to compute the force on a lattice with fixed length \(L=320\) as a function of the number of network parameters (keeping \(N_{l} = 1\) and varying k in the interval \(k \in [3,161]\)). Since the computing time seems to scale linearly with the number of parameters, if k scales with the correlation length \(\xi \) then there is an additional term proportional to \(\xi ^2\) in the FHMC cost.

According to this estimate, FHMC would not reduce the asymptotic simulation cost of a \(\phi ^4\) theory with respect to HMC. However, the implementation of the FHMC force does not require the integration of the flow equation in Eq. (7), unlike in [4]. Given that FHMC already reduces autocorrelation times with minimal architectures, it could probably be used to reduce simulation costs in lattice QCD with minimal implementation effort.

Fig. 5 Dependence of the time required to evaluate the force on the number of parameters in the network for a lattice with \(L=320\)

5 Conclusions

We have tested a new algorithm, Flow HMC (FHMC), that implements the trivializing flow algorithm of References [4, 11] via a convolutional neural network, similar to those used in the normalizing flow samplers introduced in Reference [12]. In contrast with previous works on normalizing flows, we use minimal network architectures, which leads to negligible training costs that are therefore not affected by the poor scaling towards the continuum limit observed in [22]. The main new ingredient is the combination of a neural network implementation of the trivializing flow with an HMC integration, which keeps a large acceptance for any network architecture.

We have tested the algorithm in a scalar theory in 2D and benchmarked it against standard HMC. We have observed a significant reduction of the autocorrelation times in FHMC for the observables measured: the magnetization and the one-point susceptibility. This improvement is maintained as the physical volume is increased at fixed lattice spacing, meaning that the network can be trained at a small physical volume L and used at a larger one \(L'> L\) without any extra cost. For gauge theories, this opens up the possibility of doing the training at unphysical values of the quark masses or at small volumes – or both, where training is cheaper – and reusing the trained networks to sample the target theory at the physical values of the parameters.

However, for a fixed network architecture the scaling of the autocorrelation times with the lattice spacing remains the same as that of HMC. A slight improvement in the scaling of the autocorrelation time is observed when the footprint of the network is kept constant in physical units. Although the training cost still remains negligible, scaling the footprint with the correlation length implies an extra cost in the computation of the force in the FHMC, leading to a worse overall scaling than HMC in the theory considered. This might be different in a theory with fermions, where the dominant cost is the inversion of the Dirac operator. Also, as discussed in Reference [43], the use of other machine learning training techniques such as transfer learning, together with optimal architectures and stopping criteria, can help alleviate the training cost scaling.

Although the improvement observed in the autocorrelation times of FHMC at fixed network architecture might be of some practical use, particularly given the simplicity of the implementation, it remains to be demonstrated that a neural network training policy can be devised that avoids critical slowing down.