Learning trivializing flows

The recent introduction of machine learning techniques, especially normalizing flows, for the sampling of lattice gauge theories has raised hopes of improving the sampling efficiency of the traditional hybrid Monte Carlo (HMC) algorithm. In this work we study a modified HMC algorithm that draws on the seminal work on trivializing flows by Lüscher. Autocorrelations are reduced by sampling from a simpler action that is related to the original action by an invertible mapping realised through normalizing flow models with a minimal set of training parameters. We test the algorithm in a $\phi^4$ theory in 2D, where we observe reduced autocorrelation times compared with HMC, and demonstrate that the training can be done at small unphysical volumes and the resulting network used in physical conditions. We also study the scaling of the algorithm towards the continuum limit under various assumptions on the network architecture.


Introduction

Normalizing flows
Normalizing flows are a machine learning sampling technique introduced to lattice field theories in [1]. As sketched in Fig. 1, the authors propose a flow-based Markov chain Monte Carlo algorithm whose workflow is
1. Generate a set of configurations z following a trivial probability distribution r(z).
2. Apply a function F⁻¹ to all configurations z, obtaining a new set of configurations φ via φ = F⁻¹(z), following a new probability distribution p_F(φ). F is a neural network which has been trained so that the new probability distribution p_F is as similar as possible to our target distribution p ∝ e^{−S}, the distribution of the theory to be studied.
3. Use a Metropolis–Hastings accept-reject step to correct for the bias in the approximation p_F ≈ p.
The training of the neural network F is done by minimizing the Kullback–Leibler (KL) divergence, which is a statistical distance satisfying D_KL(p_F | p) ≥ 0, with equality if and only if p_F = p. Therefore, after training, the network F is expected to be an approximate trivializing map [2].
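This training objective can be illustrated with a minimal toy sketch (our own, not the code of [1]): a hypothetical one-parameter "flow" φ = a·z acting on a 1D Gaussian prior, with a Gaussian target of width sigma. The sampled reverse-KL loss, up to an a-independent constant, is minimized exactly when the flow reproduces the target width.

```python
import math
import random

def S(phi, sigma=1.0):
    """Toy 1D 'action': the target p(phi) ∝ exp(-S) is a Gaussian of width sigma."""
    return phi * phi / (2.0 * sigma * sigma)

def reverse_kl_loss(a, n=20000, sigma=1.0, seed=0):
    """Monte Carlo estimate, up to an a-independent constant, of
    D_KL(p_F | p) for the one-parameter flow phi = F^{-1}(z) = a*z,
    with z ~ N(0,1), so that p_F is a Gaussian of width a:
        D_KL = E_{phi ~ p_F}[ log p_F(phi) + S(phi) ] + const.
    A fixed seed gives common random numbers across values of a."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        phi = a * z                            # push the prior sample through the flow
        log_pf = -z * z / 2.0 - math.log(a)    # log p_F(phi), dropping constants
        total += log_pf + S(phi, sigma)
    return total / n
```

For sigma = 1 the loss is minimized at a = 1, i.e. when p_F matches the target; a mismatched width in either direction gives a larger loss.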
As shown in Fig. 2 (left), the advantage of this algorithm is that autocorrelation times in the resulting Markov chain do not scale when taking the continuum limit if the neural networks are trained up to the same reference acceptance, meaning that the cost of producing configurations does not scale either; in contrast, autocorrelation times with the hybrid Monte Carlo (HMC) algorithm are expected to scale as τ ∼ ξ².
However, if one wants to study the scaling of the total computational cost, one needs to analyze the training costs as well, since

total cost = configuration production cost + network training cost. (2)

It was shown in [3] that the cost of keeping a reference Metropolis acceptance of 70% seems to scale approximately as ∼ ξ⁸ (see Fig. 2, right), indicating a transfer of the critical slowing down problem from the production of configurations to the training of the networks.
In what follows we want to address whether one can still benefit from normalizing flows while keeping the training costs low.

The algorithm
The main idea is to use Lüscher's trivializing flows [2], so that normalizing flows help the HMC algorithm rather than replace it. Consider the partition function of our target theory,

Z = ∫ Dφ e^{−S[φ]}.

We can make a change of variables φ = F⁻¹(φ̃) using our trained network F, so that the partition function becomes

Z = ∫ Dφ̃ e^{−S̃[φ̃]},

where we have defined the new action

S̃[φ̃] = S[F⁻¹(φ̃)] − log det J_{F⁻¹}[φ̃].

This new action is a combination of the old action S[φ] and the logarithm of the Jacobian of the variable transformation. If the Jacobian cancels out part of the action, then the probability distribution e^{−S̃[φ̃]} might be easier to sample from than e^{−S[φ]}, and using HMC with the new action might yield lower autocorrelation times. The workflow of the algorithm would then be
1. Train the network F by minimizing the KL divergence.
2. Run the HMC algorithm to build a Markov chain of configurations following p̃(φ̃) ∝ e^{−S̃[φ̃]}.
3. Apply the transformation F⁻¹ to every configuration in the Markov chain to undo the variable transformation. This way we obtain a Markov chain of configurations following the target probability distribution p(φ) ∝ e^{−S[φ]}.
The important point is that the acceptance of this algorithm does not depend on the transformation F: it only depends on how well one integrates the HMC equations of motion. This means that the algorithm will work no matter how well one trains the network F.
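The three steps above can be sketched in a few lines. The following toy (an illustration under stated assumptions, not the production code) runs leapfrog HMC on the transformed action S̃(φ̃) = S(F⁻¹(φ̃)) − log |dF⁻¹/dφ̃| for a hypothetical exactly trivializing linear map F⁻¹(φ̃) = A·φ̃ acting on a 1D Gaussian target of width A; with a trained network the cancellation in S̃ would only be approximate.

```python
import math
import random

rng = random.Random(1)

A = 2.0                                    # toy flow parameter (assumption)

def S(phi):                                # toy target action: Gaussian of width A
    return phi * phi / (2.0 * A * A)

def F_inv(phit):                           # "trained" map: phi = F^{-1}(phit) = A * phit
    return A * phit

def S_tilde(phit):
    """Transformed action S~(phit) = S(F^{-1}(phit)) - log |dF^{-1}/dphit|.
    Here the Jacobian cancellation is exact, so S~ is a unit Gaussian action."""
    return S(F_inv(phit)) - math.log(A)

def grad_S_tilde(phit):                    # dS~/dphit (here exactly phit)
    return phit

def hmc_step(phit, eps=0.1, nsteps=10):
    """One HMC trajectory on the transformed action with a leapfrog integrator."""
    p = rng.gauss(0.0, 1.0)
    h0 = 0.5 * p * p + S_tilde(phit)
    x = phit
    p -= 0.5 * eps * grad_S_tilde(x)       # half step in momentum
    for _ in range(nsteps - 1):
        x += eps * p
        p -= eps * grad_S_tilde(x)
    x += eps * p
    p -= 0.5 * eps * grad_S_tilde(x)       # closing half step
    h1 = 0.5 * p * p + S_tilde(x)
    if rng.random() < math.exp(min(0.0, h0 - h1)):   # Metropolis test on Delta H
        return x
    return phit

# steps 2 and 3: build the chain in phit, then map back to the target variables
chain, phit = [], 0.0
for _ in range(5000):
    phit = hmc_step(phit)
    chain.append(F_inv(phit))              # configurations following e^{-S[phi]}
```

The mapped-back chain should reproduce the target statistics (mean 0, variance A²), regardless of how crude the integrator tuning is; only the acceptance rate, not the correctness, depends on the integration.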
Lüscher proposed this algorithm using the Wilson flow as an approximate trivializing map [2], but it was not good enough to improve the scaling of autocorrelation times towards the continuum in a CP^{N−1} theory with topology [4]. The hope is that normalizing flows can do better as approximate trivializing maps; indeed, this idea has already been tested [5, 6], but here we focus on the scaling of cheap training setups.

The model
As in [1, 3], we worked with a ϕ⁴ theory of a massive scalar field in 2 dimensions,

S[ϕ] = Σ_x [ −β Σ_{μ=1,2} ϕ_x ϕ_{x+μ̂} + ϕ_x² + λ (ϕ_x² − 1)² ].

Among its features, it has a Z₂ symmetry, since the action is invariant under a sign flip of the scalar field, ϕ → −ϕ; the probability density of the model has two modes, which correspond to positive and negative magnetization M = (1/V) Σ_x ϕ_x; and it has a non-trivial correlation length ξ, yielding autocorrelations when building the Markov chain of configurations.
We will use these autocorrelations to benchmark our new algorithm against HMC, since the autocorrelation times of HMC are expected to scale as τ_int ∝ ξ². Also, this model does not have topology, as opposed to QCD, so we will not suffer from topology freezing.
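For concreteness, a lattice implementation of this model might look as follows. This is our own sketch: the (β, λ) parametrization of the action matches the form written above, but the precise normalization is an assumption, not necessarily the paper's.

```python
import numpy as np

BETA, LAM = 0.641, 0.5   # couplings of the order used later in the text

def action(phi, beta=BETA, lam=LAM):
    """2D phi^4 lattice action with periodic boundary conditions (one common
    parametrization, assumed here):
        S = sum_x [ -beta * sum_mu phi_x phi_{x+mu} + phi_x^2 + lam*(phi_x^2 - 1)^2 ]
    phi is an L x L numpy array."""
    nn = np.roll(phi, 1, axis=0) + np.roll(phi, 1, axis=1)   # forward neighbors
    return float(np.sum(-beta * phi * nn + phi ** 2 + lam * (phi ** 2 - 1.0) ** 2))

def magnetization(phi):
    """M = (1/V) * sum_x phi_x; the Z2 symmetry phi -> -phi flips its sign."""
    return float(phi.mean())
```

The Z₂ symmetry of the model is manifest in this form: flipping the sign of the whole field leaves the action unchanged while negating the magnetization.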

Keeping training costs low
Our intention is to keep training costs as low as possible, so that the total computational cost is essentially given by the cost of producing the Markov chain of configurations,

total cost ≈ configuration production cost. (8)

For this to happen it is helpful to use the information we know about the model:
• The action has translational symmetry, so there should be no difference between the transformation of the field at a point x and the transformation at any other point y. This indicates that one should use convolutional neural networks (CNNs) instead of fully connected networks, so that the same transformation kernel is applied at every point.
• The relevant physics is contained within the correlation length ξ of the system, so the transformation of ϕ_x should not depend on fields at distances much larger than ξ. This indicates that one should make the footprint of the network as small as possible, in accordance with the correlation length ξ.
With this in mind we studied very simple network architectures with only one affine coupling layer and no hidden layers. The affine transformation acts on the sites selected by a checkerboard mask as

ϕ'_x = ϕ_x e^{s(ϕ)_x} + t(ϕ)_x, (9)

where s(·) and t(·) are CNNs with kernel size k acting on the frozen (unmasked) sites, and k can be varied to control the footprint of the transformation. For the simplest case of k = 3 and a pair of checkerboard-masked affine layers [1], the transformation couples only next-to-nearest neighbors and the neural network has only 37 parameters. With such a simple network the training cost is negligible with respect to the HMC simulation, and one can compare both algorithms just by studying the scaling of the autocorrelation time τ_int towards the continuum.
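A sketch of such a checkerboard-masked affine layer (our own illustration; the helper names are hypothetical) makes explicit the two properties that matter here: the transformation is exactly invertible, because s and t depend only on the frozen sites, and its log-Jacobian is simply the sum of s over the active sites.

```python
import numpy as np

def checkerboard(shape, parity):
    """0/1 mask selecting the sites of a given checkerboard parity."""
    grid = np.indices(shape).sum(axis=0) % 2
    return (grid == parity).astype(float)

def conv_periodic(x, w):
    """k x k cross-correlation with periodic boundary conditions (naive rolls)."""
    k = w.shape[0]
    r = k // 2
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            out += w[i, j] * np.roll(np.roll(x, r - i, axis=0), r - j, axis=1)
    return out

def affine_layer(phi, ws, wt, parity):
    """Checkerboard-masked affine coupling: on active sites,
        phi' = phi * exp(s(frozen)) + t(frozen),
    with s, t single-layer CNNs of the frozen sites. Frozen sites are unchanged.
    Returns the transformed field and log|det J| = sum of s over active sites."""
    m = checkerboard(phi.shape, parity)          # active-site mask
    frozen = (1.0 - m) * phi
    s = conv_periodic(frozen, ws)
    t = conv_periodic(frozen, wt)
    phi_new = (1.0 - m) * phi + m * (phi * np.exp(s) + t)
    return phi_new, float((m * s).sum())

def affine_layer_inv(phi, ws, wt, parity):
    """Exact inverse: the frozen sites are unchanged, so s and t are recomputable."""
    m = checkerboard(phi.shape, parity)
    frozen = (1.0 - m) * phi
    s = conv_periodic(frozen, ws)
    t = conv_periodic(frozen, wt)
    return (1.0 - m) * phi + m * (phi - t) * np.exp(-s)
```

Counting parameters, each layer carries two k × k kernels (s and t), so a pair of layers with k = 3 has 2 × 2 × 3² = 36 weights, plus the global rescaling parameter: 37 in total.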

Reduction of τ_int with a minimal network architecture
A first thing to check is whether a network with only 37 parameters can learn anything at all. In Fig. 3 (left) we show the evolution of the KL divergence during the training of a network from independent Gaussians to a theory with parameters β = 0.641, λ = 0.5, lattice size L = 18 and ξ = L/4. We see that D_KL saturates fast, and this is because the network is very simple: this is the best one can do with it, no matter how much longer it is trained. (Counting parameters: a single-layer CNN with kernel size k has k² parameters; the transformation of Eq. (9) with N_l checkerboard-masked affine layers uses 2 × N_l different CNNs, and therefore has 2 × N_l × k² parameters; adding a global rescaling parameter as a final layer, our models have N_p = 2 × N_l × k² + 1 parameters, i.e. 37 for a pair of layers with k = 3.)
Having trained the network, one can run simulations with the two algorithms: one with plain HMC on the action S(ϕ), denoted as HMC from now on, and another with HMC on the transformed action S̃(φ̃), denoted as trivializing flow. The Monte Carlo history of the magnetization for both algorithms is shown in Fig. 3 (right). Since the trivializing flow reduces the autocorrelation time with respect to HMC, this indicates that a simple network can indeed learn something.
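The comparison between the two algorithms rests on estimating τ_int from such Monte Carlo histories. A simple estimator, exercised here on a synthetic AR(1) chain with known autocorrelation time as a stand-in for the magnetization history (a real analysis would use automatic windowing):

```python
import random

def tau_int(series, window):
    """Integrated autocorrelation time with a fixed summation window:
        tau_int = 1/2 + sum_{t=1}^{window} rho(t),
    where rho(t) is the normalized autocorrelation function of the series."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev) / n                 # variance (lag 0)
    tau = 0.5
    for t in range(1, window + 1):
        ct = sum(dev[i] * dev[i + t] for i in range(n - t)) / (n - t)
        tau += ct / c0
    return tau

# synthetic AR(1) chain: exact tau_int = (1 + rho) / (2 * (1 - rho)) = 4.5 for rho = 0.8
rng = random.Random(0)
rho, x, chain = 0.8, 0.0, []
for _ in range(20000):
    x = rho * x + rng.gauss(0.0, 1.0)
    chain.append(x)
```

On the AR(1) chain the estimator should land near the exact value 4.5, up to statistical noise from the finite chain length.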

Infinite volume limit
Another thing to note is that CNNs allow one to take a network trained at a lattice size L and reuse it at a bigger lattice size L' > L. In Tab. 1 we trained a different network for every value of L and checked the Metropolis acceptance that each would have if used directly for reweighting, as in [1, 3]. Then we reused each network, without further retraining, at lattice size 2L and checked its Metropolis acceptance again.
One can note that the acceptances are low, meaning that the output distribution p_F of our networks is not a good approximation of the target distribution p, which was expected because the networks are very simple; and that reusing a network on a larger lattice decreases the acceptance considerably, which is also expected because the action is an extensive quantity and we are simply reweighting at a bigger volume.
However, none of this matters if the network is used as a trivializing flow. We can see this in Fig. 4, where we show the autocorrelation time of the magnetization for different lattice sizes: whether one trains directly at lattice size 2L or trains at L and reuses the network at 2L, one gets the same autocorrelation times. This indicates that, to reduce training costs, the training can be done at a lattice size similar to the correlation length of the system, L ∼ ξ, where the physical information is contained.
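The mechanism behind this reuse is that a convolution kernel is independent of the lattice size, so applying trained weights at 2L requires no architectural change at all, as this small sketch with hypothetical toy weights illustrates:

```python
import numpy as np

def apply_kernel(phi, w):
    """Apply a k x k convolution kernel with periodic boundaries.
    The kernel is shape-agnostic: the same weights work for any lattice L >= k."""
    k = w.shape[0]
    r = k // 2
    out = np.zeros_like(phi)
    for i in range(k):
        for j in range(k):
            out += w[i, j] * np.roll(np.roll(phi, r - i, axis=0), r - j, axis=1)
    return out

w = np.random.default_rng(0).normal(size=(3, 3))   # stand-in for weights trained at L
small = apply_kernel(np.ones((8, 8)), w)           # training volume, L = 8
large = apply_kernel(np.ones((16, 16)), w)         # reused at 2L with no retraining
```

On a translationally invariant input the two applications produce identical local outputs, which is exactly the property that lets a network trained at L ∼ ξ be transplanted to larger volumes.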

The setup
Finally, we study how the computational cost of the machine-learned trivializing flow scales towards the continuum, comparing it with the cost of HMC. The theory parameters used are listed in the table below, where β was tuned to fix the physical size of the lattice to ξ = L/4.
• For each of the columns we train a different network from independent Gaussians to the target theory.

[Table of simulation parameters: for each lattice size L, the coupling β is tuned so that ξ = L/4.]
• We always used simple network architectures with one coupling layer and no hidden layers, and we trained them until saturation of the KL divergence. This saturation happens very fast, and the training costs are negligible with respect to the cost of the production of configurations. Therefore we only need to compare the scaling of the autocorrelation times of both algorithms.
• The integration step for the molecular dynamics evolution is tuned so that the Metropolis-Hastings acceptance of HMC and the trivializing flows is approximately 90%.

Scaling with fixed architecture
In Fig. 5 (left) we compare the autocorrelation times of the magnetization from simulations with both algorithms: the blue circles show the autocorrelations of HMC, and the remaining points come from trivializing flow simulations with kernel sizes k = 3, 5, 7.
First, we see that all autocorrelation times from trivializing flows are smaller than those of HMC. Also, for a fixed lattice size L, increasing the kernel size of the network decreases the autocorrelation time, but it does not change the scaling with respect to HMC.
The latter can be seen better if we plot the ratio τ_HMC / τ_flow, as we do in Fig. 5 (right): as we go to the continuum the ratio tends to a constant, indicating that the scaling of both algorithms is the same for a fixed network architecture.

Scaling increasing the kernel size
The correlation length ξ grows in lattice units when taking the continuum limit, so one should also scale the kernel size of the transformations so that they act on the same physical region, k ∼ ξ. When one chooses the optimal kernel size k of the CNNs for each lattice size, one gets the results displayed in Fig. 6 (left), where now the gap between HMC and the trivializing flow algorithm seems to increase towards the continuum.
Assuming that the autocorrelation time scales as τ ∝ ξ^z, we can fit our results and compare the exponents of the two algorithms, obtaining

z_{M,HMC} = 2.20(4), z_{M,flow} = 1.97(7). (11)

Therefore, scaling the kernel size of the CNNs leads to a slight improvement in the autocorrelation scaling of the magnetization M = (1/V) Σ_x ϕ_x, which is a local operator. In Fig. 6 (right) we plot instead the one-point susceptibility χ = (1/V) ⟨(Σ_x ϕ_x)²⟩, measured on configurations flowed up to flow time t with a smearing radius ∼ ξ. Assuming the same scaling we again find a slight improvement with respect to HMC:

z_{χ,HMC} = 2.20(2), z_{χ,flow} = 1.92(4). (12)
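Exponents like those above can be extracted with an ordinary least-squares fit in log-log space. A minimal sketch of the fitting step (without the error propagation behind the quoted uncertainties):

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = c * x**z in log-log space; returns (z, c).
    This is the kind of fit used to extract a dynamical exponent z
    from tau_int measured at several correlation lengths xi."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx = sum(lx) / n
    my = sum(ly) / n
    z = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
         / sum((a - mx) ** 2 for a in lx))
    c = math.exp(my - z * mx)
    return z, c
```

On exact power-law data the fit recovers the exponent and prefactor up to floating-point precision; on real τ_int measurements one would additionally weight the points by their errors.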

Conclusions and outlook
We have shown that, even working with very simple architectures, using neural networks as trivializing flows can improve the autocorrelation times of HMC, although the scaling is the same as HMC for a fixed network architecture.
Also, networks trained at a small lattice size can be reused at larger volumes without further training. Focusing on topology freezing, this could be useful in QCD: one could train at a volume of the order of the scale set by Λ_QCD with large values of the quark masses, and then reuse the network at larger volumes and smaller masses, thus reducing training costs.
Finally, scaling the kernel size of the networks slightly improves the scaling of the autocorrelations. An interesting question is then whether machine-learned trivializing flows could improve a much worse kind of continuum scaling, such as the one related to topological freezing in theories with topology.

Figure 1 :
Figure 1: Extracted from [1]: normalizing flow sketch. A set of configurations z is generated from a trivial probability distribution r(z); a neural network model F is used to generate a new set of configurations via φ = F⁻¹(z), which follow p_F(φ) ≈ p(φ), with p_F the model's output distribution and p the target distribution.

Figure 2 :
Figure 2: (Left) Extracted from [1]: autocorrelation time scaling for different observables using normalizing flows with networks trained to have Metropolis acceptances of 50% and 70%. (Right) Extracted from [3]: scaling of the number of configurations needed to train a network up to an acceptance of 70%.

Figure 3 :
Figure 3: (Left) History of the KL divergence during the training from independent Gaussians to a theory with parameters β = 0.641, λ = 0.5, lattice size L = 18 and ξ = L/4. (Right) History of the magnetization for a simulation with HMC (blue) and trivializing flows (orange).

Figure 4 :
Figure 4: Autocorrelation time of the magnetization at lattice size 2L using trivializing flows. In circles, the networks used were trained at lattice size 2L; in triangles, they were trained at L and used at 2L.

Figure 5 :
Figure 5: (Left) Scaling of the autocorrelation time of the magnetization towards the continuum for HMC (filled blue circles) and trivializing flows with kernel sizes k = 3, 5, 7 (open circles, triangles and hexagons). (Right) Scaling of the ratio of autocorrelation times of the magnetization of HMC with respect to trivializing flows.

Figure 6 :
Figure 6: (Left) Scaling of the autocorrelation time of the magnetization with k ∼ ξ. (Right) Scaling of the autocorrelation time of the flowed one-point susceptibility with k ∼ ξ. The fits correspond to the assumption τ_int ∝ ξ^z.

Table 1 :
Metropolis acceptances at lattice sizes L and 2L of networks trained at lattice size L.