Diffusion Models as Stochastic Quantization in Lattice Field Theory

In this work, we establish a direct connection between generative diffusion models (DMs) and stochastic quantization (SQ). The DM is realized by approximating the reversal of a stochastic process dictated by the Langevin equation, generating samples from a prior distribution to effectively mimic the target distribution. Using numerical simulations, we demonstrate that the DM can serve as a global sampler for generating quantum lattice field configurations in two-dimensional $\phi^4$ theory. We demonstrate that DMs can notably reduce autocorrelation times in the Markov chain, especially in the critical region where standard Markov Chain Monte-Carlo (MCMC) algorithms experience critical slowing down. The findings can potentially inspire further advancements in lattice field theory simulations, in particular in cases where it is expensive to generate large ensembles.


Introduction
To obtain physical observables in lattice field theory, it is essential to approximate the path integral as a sum over field configurations.Monte Carlo methods rely on random sampling from the probability distribution, determined by the action of the system [1].These methods can have high computational costs, particularly in the vicinity of a critical point in parameter space [2].Generative models, as a class of machine learning (ML) algorithms, can be trained to generate new data that follow the same distribution as a given data set, representing the underlying target distribution [3].An important potential application of generative models in lattice QCD is therefore to improve the efficiency of Monte Carlo simulations [4].
Depending on the way the likelihood is estimated, roughly two main categories of generative models are being developed.In implicit maximum likelihood estimation (MLE), Generative Adversarial Networks (GANs) can learn to generate new configurations via a min-max game using training on a dataset prepared using, e.g., a standard lattice simulation.A well-trained GAN can subsequently generate new samples, which should follow the same statistical distribution as the original dataset, but with reduced computational cost.Additionally, GANs can be used to improve the interpretability of lattice QCD simulations by generating visualizations of the system's behavior [5,6].In explicit MLE, flow-based models have recently been proposed to improve the efficiency of lattice simulations [7][8][9][10][11][12][13][14][15][16][17][18][19][20].Flow-based models are able to directly approach the underlying distribution of the quantum field without preparing training data.Although the method may encounter "modecollapse" and scalability problems [21,22], it allows for the generation of independent new configurations that can be used in MC simulations, again reducing the computational cost.The current progress in applying generative models also includes designing novel neural network architectures for specific field theories, e.g., autoregressive networks [23,24] and gauge-equivariant neural networks [8,18,[24][25][26][27].These approaches suggest that generative models have the potential to advance the capabilities of lattice QCD simulations.A comprehensive review can be found in Refs.[28,29].
Recently, the ML community has developed a new class of deep generative models known as Diffusion Models (DMs) [30].Impressive success has been obtained in generating high-quality images via stochastic denoising processes, as showcased in prominent applications [31] such as DALL-E 2 [32] and Stable Diffusion [33].As a prevalent class of generative models encoded in a stochastic process, DMs are grounded in Markov chains, and their implementation involves the utilization of variational inference methods.For an adoption in high-energy physics experiments, see, e.g., Refs.[34,35].
In numerical lattice field theory, there is an approach alternative to Monte Carlo methods to generate field configurations which relies on solving a stochastic differential equation, i.e., a Langevin equation.Its origin is stochastic quantization (SQ) as proposed by Parisi and Wu [36].The basic idea of SQ is to view the quantum system as the thermal equilibrium limit of a hypothetical stochastic process with respect to a fictitious time.For gauge fields, it does not require gauge fixing [36].Comprehensive reviews include Refs.[37,38].For theories with a sign or complex weight problem, such as Quantum Chromodynamics at nonzero baryon density or theories in real (rather than Euclidean) time, the Langevin process can be extended to complex Langevin dynamics [39], which evades the sign problem in certain cases [40][41][42][43][44], see also the reviews [45][46][47][48].Although the method has been put on firmer theoretical grounds [49][50][51][52][53], convergence and boundary condition issues continue to hinder obtaining a fully satisfying approach.Recent work to address this includes Refs.[54][55][56][57].
In this work, we first point out the correspondence between generative DMs from the ML community and SQ in quantum field theory.Then, we explore using DMs to generate lattice field configurations.Specifically, we demonstrate for a two-dimensional ϕ 4 lattice field theory that, combined with a hybrid Monte Carlo (HMC) algorithm, DMs can serve as an efficient global sampler along a Markov chain, leading to shorter autocorrelation times.
We provide self-consistency checks and numerical evidence that this approach is able to capture the SQ perspective-induced stochastic dynamics of quantum field theories.
The paper is organized as follows.In the subsequent section, we provide a brief overview of the concept of SQ.In Sec. 3, we present details of DMs.It is argued that the denoising process of DMs can explicitly serve as a SQ procedure.In Sec. 4, we explore the application of DMs to a two-dimensional scalar ϕ 4 field theory.Additionally, we demonstrate that DMs can accurately estimate the physical action in an unsupervised manner using the probabilistic flow approach.In Sec. 5, we summarize the correspondence between DMs and SQ, and outline how this novel method can motivate further research, in particular for theories where it is expensive to sample from the prior distribution using standard approaches.App.A contains information on the U-Net architecture employed, while Appendix B contains a brief discussion on the Skilling-Hutchinson estimator.A study of the dependence of observables and the acceptance rate on various system and DM parameters can be found in Appendix C and D respectively.

Stochastic Quantization
In Euclidean quantum field theory, as an alternative quantization scheme, one may invoke stochastic quantization (SQ) [36].We only consider real actions here, for which SQ is well understood and convergence is guaranteed [37].Starting from a generic Euclidean path integral with Euclidean action S E , SQ introduces a fictitious time, τ , for the field, ϕ, whose evolution is described by Langevin dynamics, ∂ϕ(x, τ where the noise term routinely satisfies, with α being the diffusion constant.Denoting the distribution as the drift term in the Langevin equation can be written as In the long-time limit (τ → ∞) the system reaches an equilibrium state of the dynamics [38] if the drift term has a damping effect, which is best analysed using the Fokker-Planck equation (see below).In addition, introducing the stochasticity is only one strategy for the quantization.The fictitious time itself can serve as a clue to develop microcanonical quantization, which describes a deterministic evolution with an ordinary differential equation [58].We will revisit this idea and its connection with DMs in Sec.3.3.

Fokker-Planck Equation and Observables
The Fokker-Planck equation [59] describes the time evolution of the probability distribution of field configurations, subject to a stochastic process, such as the Langevin equation.Also known as the Kolmogorov forward equation, in the context of SQ it describes the time evolution of the probability density function P [ϕ, τ ] for field configurations ϕ at time τ [37,60], as follows, Here n is the dimension of the Euclidean spacetime on which the field theory is formulated.Given this probability distribution function, the expectation value of an observable O[ϕ] is then given by [37] ⟨O where the left-hand side is understood as an average over many stochastic processes and on the right-hand side the integral is taken over all field configurations ϕ.In the long-time limit, the probability distribution P [ϕ, t] approaches an equilibrium distribution, which is the stationary solution of the Fokker-Planck equation (2.6), If α is set (formally) as ℏ and the equilibrium distribution is normalised, expectation values of observables are indeed observables in a quantum theory, with demonstrating that a stochastic process according to Eq. (2.2) can be used to quantize a system.Note that convergence to the stationary solution (2.8) follows from properties of the Fokker-Planck Hamiltonian associated to the Fokker-Planck equation (2.6) [37].In short, writing the time-dependent solution P [ϕ, t] as it is easy to show that Q[ϕ, t] satisfies the equation where the Fokker-Planck Hamiltonian reads The Fokker-Planck Hamiltonian is hence bounded from below by zero.Provided its spectrum is gapped, time-dependent components in Eq. (2.11) will decay exponentially and only the solution H FP Q[ϕ] = 0 survives.This solution is given by which, when combined with Eq. (2.10), results again in the equilibrium solution (2.8), completing the standard proof of convergence.

Langevin Simulations
In most cases, the Fokker-Planck equation cannot be solved analytically.Instead, one solves the stochastic process numerically by discretizing the Langevin equation Eq. (2.2).Using Itô calculus, the simplest discretized Langevin equation, using time step ∆τ , is A typical calculation includes, firstly, an initial condition to prepare field configurations {ϕ(x, τ = 0)}; secondly, the thermalization procedure, i.e, evolve the system to reach equilibrium by iterating through a sufficient number of time steps with an appropriate time step, ∆τ ; thirdly, after thermalization, record the values of the physical observable O[ϕ] for each configuration in an ensemble, e.g. after a time T .This is repeated for numerous configurations to improve the statistical accuracy of the ensemble average.Fig. 1 demonstrates one such typical Langevin process for a one-dimensional case.The target distribution is encoded via the drift, c.f. Eq. (2.5).This procedure provides an estimate of the quantum expectation value for the observables O[ϕ].Since the simple discretization scheme above results in finite-stepsize errors linear in ∆τ , one has to repeat the analysis to extrapolate to vanishing stepsize.The Langevin simulation provides an alternative to Monte Carlo approaches and has been used in simulations of lattice QCD [61].A characteristic is that no accept/reject step is used, which makes the algorithm not exact (as demonstrated by e.g., the finite-stepsize corrections).This can be remedied by introducing such an accept/reject step, which results in more sophisticated algorithms that outperform the basic Langevin algorithm and are hence favoured (such as Hybrid Monte Carlo).The absence of an accept/reject step is turned into a feature for theories with a complex-valued Boltzmann weight, for which importance sampling is not possible.The application of SQ to these theories is commonly referred to as complex Langevin dynamics [39], which is still an active area of research [40][41][42][43][44][45][46][47][48][49][50][51][52][53][54][55][56][57], but not further explored here.

Diffusion Models
Turning now to deep learning, DMs denote a class of deep generative algorithms that simulate a stochastic process to generate samples following a desired target distribution [62].The fundamental concepts underlying DMs involve modeling the data distribution through a continuous-time diffusion process or a discrete-time Markov chain, wherein noise is incrementally added to or removed from the data.This generative framework has gained prominence in recent works, such as score-based models (SBMs) [63,64] and denoising diffusion probabilistic models (DDPMs) [65]; for a comprehensive review, see Ref. [30].In this section, we will first introduce the denoising perspective of DMs and then point out the connections of DMs to SQ.

Denoising Model
In the context of DMs, the denoising model can be perceived as a mapping that reconstructs the original data from its noisy variant, which is attained through the application of a diffusion process.Similar to the Langevin dynamics introduced above, the first process (also called forward process) gradually introduces noise to the data, effectively "smoothing" the underlying (but typically unknown) probability distribution of data.Then the denoising model aims to learn its inverse process, i.e., denoising the data by eliminating the added noise.Upon training, the denoising model can be employed to generate samples following the data distribution by executing the reverse diffusion process only.This sampling process commences with random noise and iteratively applies the trained denoising model to "clean" it until convergence, resulting in a sample representative of the target data distribution.
In the limit of continuous time, the forward process follows a stochastic differential equation (SDE), where ξ ∈ [0, T ] is the Langevin time, η is the noise term, satisfying ⟨η(ξ)η(ξ ′ )⟩ = 2αδ(ξ−ξ ′ ) with α ≡ 1/2 as a convention in the machine learning community, and f (ϕ, ξ) is the drift term.Note that the noise and drift term have the same size as the field ϕ.Finally, the square of g(ξ) is the time-dependent scalar diffusion coefficient.The forward diffusion process can be modeled as the solution of such a generic SDE, provided the drift and diffusion coefficient are chosen appropriately.As Fig. 2 demonstrates, the aim is to obtain a sample from the data distribution p 0 by starting from a sample of the prior distribution p T and reversing the above diffusion process.This is achieved by solving a reverse SDE, representing a diffusion process evolving backward in time, which reads [66] where the reverse time t ≡ T − ξ and η is a noise term in the reverse time direction.Importantly, the drift term now contains two terms, including the gradient of the logarithm of p t (ϕ), the probabilistic distribution at time t in the forward diffusion process.This reverse SDE can be computed once we have specified the drift term and diffusion coefficient of the forward SDE, and determined the gradient of log p t (ϕ) for each t ∈ [0, T ].
The key to implementing a denoising model lies in accurately estimating the probability distribution, p t (ϕ), under a well-designed forward diffusion process of Eq. (3.1).In scorebased models [63,64], ∇ ϕ log p t (ϕ) is named as the score of each sample.Routinely, one can introduce a perturbation kernel for each forward diffusion step1 as p ξ (ϕ ξ |ϕ 0 ) ≡ N (ϕ ξ ; ϕ 0 , σ2 ξ I), where ϕ 0 ∼ p 0 .Thus, the score can be computed from A neural network s θ (ϕ, t) can be trained to approximate the score conveniently, which is named as score matching [67,68].The parameters of the neural network are optimized with the objective, where the interval T has been divided into N segments with index i (further details are given below).The optimal score-based model s θ(ϕ, t) with θ = arg min θ L θ should match the score at every time-step.Sampling turns to solving the reverse SDE shown in Eq. (3.2), with the score replaced with s θ(ϕ, t), For a concrete and concise example, by putting the forward drift term f (ϕ, ξ) to zero and simplifying the diffusion coefficient, see, e.g., Ref. [64], as g(ξ) ≡ σ ξ , one gets a variation expanding forward process whose transition kernel remains Gaussian, We note that both Eqs.(3.1) and (3.2) are Langevin equations. 2In the context of the reverse SDE, given the gradients ∇ ϕ log p(ϕ, t), it is convenient to generate samples from a prior distribution in a stochastic evolution process, as introduced in Sec.2.2.

Diffusion Model as Stochastic Quantization
The reverse SDEs of DMs are mathematically related to the Langevin dynamics.To explore the connection of DMs to stochastic quantization in more detail, we choose the variance expanding picture of the DMs.Its denoising process transfers ϕ T back to ϕ 0 with the same marginal densities as the forward process, but evolving according to where time t flows backward from T to 0. With the redefinition of time, τ ≡ T − t (dτ ≡ −dt), and denoting g τ = g(T − τ ), q τ (ϕ) = p T −τ (ϕ) of the forward process, the SDE now reads and in discretised form where the noise term ⟨η(τ n )⟩ = 0, ⟨η(τ n )η(τ n ′ )⟩ = 2ᾱδ nn ′ , with ᾱ ≡ ℏ = 1 as a convention in quantum field theories.Because the mean value of the noise term remains zero, the sign in front of the noise term is unrelated to the stochastic process.We note that g 2 (τ ) is referred to as a kernel, rescaling both the drift term and the variance of the noise term [37].Its effect can be absorbed by rescaling time with g 2 (τ ), or equivalently absorb it in the time step, g 2 τ ∆τ .The derivation of the corresponding Fokker-Planck is straightforward and one finds where is the score trained on the forward diffusion process, and p τ (ϕ) is the probability density of ϕ τ in the reverse diffusion process.One can immediately derive the following equilibrium condition, locally in time, where we recall that ᾱ = 1 and that S DM is an effective action that will be learned in DMs.When the noise scales are on par with the quantum scale, that is, O(ᾱ) ∼ O(ℏ), Eq. (3.12) encompasses the quantum fluctuations exhibited in Eq. (2.8).In the τ → T limit, where P [ϕ, T ] is nothing but the time evolution of the PDF in SQ.This implies that the "equilibrium state" of SQ can be achieved by denoising from a naive distribution.Concurrently, sampling from a DM is equivalent to optimizing a stochastic trajectory to approach the "equilibrium state" in Eq. (2.8) given a naive distribution from the SQ perspective.However, reverting the noise configuration back to the physics configuration is a highly intricate process.In principle, one could solve the reverse SDE of Eq. (3.8) to depict this denoising process, while computing the "time-dependent" drift term remains intractable.A particular type of deep neural network, known as the U-Net, is used here.The architecture of this network is discussed in more detail in the App. A. The U-Net is tailored to parameterize the score function, ŝθ (ϕ, τ ), for estimating this time-dependent drift term, −∇ ϕ log q τ (ϕ).It is designed to accept two inputs: time and a time-dependent configuration within a trajectory, while its output is the same size as the input (field configuration).

Building Effective Actions
To derive effective actions for quantum fields in DMs requires the probability flow ODE formulation [64,70], which provides a quantitative way to compute the negative log-likelihood of the field configuration.Assuming a differentiable one-to-one mapping h that transforms a data sample ϕ 0 ∼ p 0 to one from a prior distribution h(ϕ 0 ) = ϕ T ∼ p T , the likelihood of p 0 (ϕ 0 ) can be computed using the change-of-variable formula as where J h denotes the Jacobian of the mapping h, and it is assumed that the prior distribution p T is easy to evaluate.ODE trajectories also define a one-to-one mapping from ϕ 0 to ϕ T .In our case, for quantum field theories, It aims to find the trajectory of the state variable ϕ that has the same marginal probability density p t (ϕ) as the stochastic process described by the SDE, i.e., Eq. (3.1).An instantaneous change-of-variable formula can be derived to connect the probability of p 0 (ϕ 0 ) and p T (ϕ T ), as, where the divergence is the trace of the Jacobian, with redefinition of f (ϕ t , t) ≡ f (ϕ, t) − g(t) 2 ∇ ϕ log p t (ϕ).The practical computation of the trace can be found in App.B by using the Skilling-Hutchinson estimator [71].
From the perspective of probability current conservation, the stochastic process we derived in the previous section intrinsically connects with flow-based models, which have recently been widely explored in current lattice calculations [11,28].In a flow-based model, one can construct a flow mapping with a neural network whose parameters are {θ}, which represents a bijective transformation between the physical distribution p(ϕ) and a prior distribution, e.g., z ∼ N (0, I).Although it is common to build the flow mapping in discrete steps, the transformation f which follows, can also be designed in continuous steps. 4It refers to the continuous normalizing flow [18], in which the parameterized mapping, fθ , is the solution to a neural ODE for time t ∈ [0, T ], with boundary conditions: ϕ(0) ≡ z, ϕ(T ) ≡ ϕ.In which, a neural network with parameters, fθ (ϕ, t), can be introduced to represent the dynamical kernel.The probabilistic flow follows, ) and (3.17), one can immediately find that the flow mapping, fθ (ϕ(t), t), is equivalent to f (ϕ t , t) in Eq. (3.15) for DMs.

Effective Action in a Toy Model
From Eqs. (3.8) and (3.10), one can find a "time-dependent" stochastic process.In the variance-expanding DM, the forward process of Eq. (3.1) features a predetermined timedependent diffusion noise denoted as σ T −τ , which ensures field configurations will eventually be reduced to a noise configuration.The effective action and the corresponding denoising process can be understood from a field deformation perspective: the field is continuously deformed following a stochastic process from the noise distribution to the physical distribution.To demonstrate this in action, we introduce an over-simplified 0+0-dimensional field theory, i.e., a toy model which only has one degree of freedom, where the physical action and drift term are determined by the parameters, µ 2 and g.We prepared 5120 configurations as training data-sets in two set-ups: µ 2 1 = 1.0, g 1 = 0.4 (single-well action), and µ 2 2 = −1.0,g 2 = 0.4 (double-well action).A one-to-one neural network with time-embedding is implemented to represent the score function s θ (ϕ, τ ).
After 500 epochs of training, the learned score function ŝθ (ϕ, τ ) is seen to approximate the drift term f τ (ϕ) in the upper row of Fig. 3.Here the solid blue line represents the score function at the starting time (τ = 0), the rainbow-colored lines indicate different τ > 0 with a color-bar at rightest, while the solid red line indicates the score function at the end time (τ = 0.99) in a backward process of the well-trained DM.The learned effective action S DM (ϕ, τ ) = ϕ ŝθ ( φ, τ )d φ is shown in the middle row of Fig. 3; it approximates the physical action as τ increases.We have shifted S DM with a constant ∆S 0 , which is the difference between min[S(ϕ)] and min[S DM (ϕ, τ )].Generally, the learned drift terms and effective actions are accurate approximations in both the single and double-well cases, around ϕ ∼ 0. Samples generated with the trained DM are compared with samples from the underlying theory in the bottom row of Fig. 3.
The evolution trajectory of DM resembles an exact renormalizing group (RG) flow and its inversion.One intuitive understanding is that the gradually increasing noise scale (in the forward process of the DM) in coordinate space implicitly corresponds to a gradually decreasing momentum scale Λ in an RG. 5 With g τ →T = σ T −τ → 1, Λ → ∞, the full quantum theory is recovered.Although some works have discussed the running noise scales from a colored noise perspective in a Langevin process [73,74], the well-trained DM provides an exact RG flow in the fictitious time space, Because the probability density is positive-definite, the equation has a fixed point when ∂p τ (ϕ)/∂τ = 0.The RG flow is the evolution trajectory of the system from the trivial distribution at τ = 0 to the equilibrium state at τ = T , which has been shown in the middle panel of Fig. 3.
5 From Eq. (3.6), one can sample to get the evolving field (for a given initial field ϕ0) at any time during the forward process as ϕτ (x) = ϕ0(x) + σ 2τ −1 2 log σ ϵ(x) with ϵ ∼ N (0, I) a Gaussian distributed random noise, which in momentum space is ϕτ (p) = ϕ0(p) + σ 2τ −1 2 log σ ϵ(p).Considering the natural distribution of momentum modes for the physical fields, the above evolution will perturb (smear) higher momentum modes faster because of the gradually increasing noise level.This resembles somehow the functional renormalization group (FRG) procedure to progressively integrate out the high momentum modes.Conversely, in the reverse denoising process, the low momentum IR physics will be first generated, and only at a later time the UV details are filled by the denoising model, as seen in Fig. 4.

Diffusion Models for Lattice ϕ Scalar Fields
Lattice scalar field theories have been studied extensively, with applications spanning a wide range of topics.Scalar fields are frequently employed to showcase the efficacy of numerical algorithms, including with machine learning.Here we will implement a DM to generate configurations for a two-dimensional scalar field theory with quartic interaction, both in the broken and the symmetric phase.
We consider a real scalar field in d Euclidean dimensions with the action, where the subscript specifies bare quantities, including mass m 0 , coupling λ 0 , and field ϕ 0 (x).Discretising the derivative on a lattice with lattice spacing a in all directions, ∂ µ ϕ 0 (x) = [ϕ 0 (x + aμ) − ϕ(x)]/a, with μ the unit vector along the µ−direction, and redefining the field and parameters to yield dimensionless combinations, gives the lattice action (see e.g.Ref. [75]) where κ is the hopping parameter, related to the bare mass parameter via and λ = a 4−d κ 2 λ 0 /6 denotes the dimensionless coupling constant describing field interactions.Both parameters are positive.The field has been rescaled according to a d/2−1 ϕ 0 = (2κ) 1/2 ϕ.
In the case of d ≥ 2, for each coupling λ the hopping parameter exhibits a critical value κ c (λ), at which a second-order phase transition occurs.This quantum phase transition corresponds to a spontaneous Z 2 symmetry broken above the critical point in continuum limit [76].When the mass term becomes negative, the broken phase emerges classically.At the classical level, the critical value for the hopping parameter is given by which is determined by the vanishing of the mass term, but quantum fluctuations will change the value [75].As κ decreases across the critical value, the system transitions from the broken phase to the symmetric phase.

Configuration Generation and Model Setups
To implement the DM, we first prepare field configurations of the scalar ϕ 4 -theory in d = 2 dimensions on a 32 × 32 lattice, at fixed λ = 0.022, as defined above.To evaluate the performance of the model, we generated 5120 samples for training and 1024 samples for testing in the broken and symmetric phases at κ = 0.5 and κ = 0.21 respectively.The reason to choose these specific paramter values is that at λ = 0.022 the phase transition occurs at approximately κ c ≃ 0.239, as derived from Eq. (4.4).Hybrid Monte Carlo set-up.In order to prepare the dataset, we employ the Monte Carlo Markov Chain (MCMC) approach.The classical Metropolis-Hastings algorithm for local updating often suffers from long autocorrelation times, so we instead opt for the hybrid Monte Carlo (HMC) method.By incorporating classical Hamiltonian equations of motion [77], the HMC method improves the performance of the MCMC and allows larger steps to be taken while maintaining acceptable acceptance rates.In our setup, using a chosen set of action parameters, we prepare a 2-dimensional lattice of size 32 × 32.For each trajectory in the HMC, we initiate with a "100-trajectory burn-in sequence" for thermalization.Following this, after 64 additional trajectories, we select the outcome of the final trajectory update as a configuration.Each trajectory utilizes a well-established HMC integrator such as the leapfrog, with a certain stepsize ∆t.The total length of each trajectory is determined by τ = n step × ∆t, with n step representing the number of times the integrator is applied within a single HMC trajectory (commonly set so that τ = 1).Each configuration in the training dataset originates from one different Markov chain.
Diffusion Model set-up.We adopt the variance expanding DM as shown in Eq. (3.6) with σ = 25.The diffusion time is set as T = 1.In the training procedure the stepsize is not fixed, but we randomly sample the time points in the interval [0, T ], with the number of time points equal to the batch size in each batch optimization.The batch size is 64, and the maximum number of epochs for training is 250 to ensure convergence.In the sampling procedure, we use the Euler-Maruyama approach to solve the reverse-time SDE with time stepsize T /N = 0.002.The U-Net architecture [78] is utilized to represent s θ in Eq. (3.4); more details can be found in App. A. Unless otherwise indicated, we generated 1024 samples from the well-trained DM to demonstrate its performance.

Broken Phase
We start in the broken phase.Here one expects large clusters, in which field configurations fluctuate around a positive or negative average value, usually referred to as the vacuum expectation value, order parameter or magnetisation.We demonstrate that the clustering behavior of field configurations can be successfully captured by the well-trained DM.Fig. 4 visualizes the denoising process.Each row in the figure represents a different sample, and each column represents a different time point in the denoising process.The first column represents noise samples randomly drawn from the prior distribution, i.e., p ≈ N ϕ; 0, 1 2 log σ (σ 2 − 1)I , while the fifth column represents the generated samples obtained by denoising.The arrow at the bottom of Fig. 4 indicates the direction of time evolution in the stochastic differential equation, as per Eq.(3.2).
The order parameter or magnetization is defined as where V = N d denotes the number of lattice sites.A comparison between HMC generated testing samples and DM generated samples is shown in Fig. 5, where we see consistency between the distributions containing 1024 samples each.

Symmetric Phase
We now turn to the symmetric phase and take κ = 0.21, below but near the critical value.In the thermodynamic limit, the phase transition can be characterized by a divergence in the connected two-point susceptibility.This susceptibility is a measure of how a system responds to small perturbations and is defined as In addition, the Binder cumulant [79] is a dimensionless quantity often used to identify phase transitions, defined as a ratio of moments of a distribution, It is sensitive to the shape of the distribution, i.e., the kurtosis of the order parameter, and is particularly useful for detecting and analyzing phase transitions in finite-size systems.
In Fig. 6 we compare the distributions of the magnetization obtained independently from the trained DM and using HMC.Good consistency between the distributions is observed.To make this more quantitative we present in Table 1 the magnetization ⟨M ⟩,

data-set
⟨M ⟩ χ 2 U L Training (HMC) 0.0012± 0.0007 2.5160 ± 0.0457 0.1042 ± 0.0367 Testing (HMC) 0.0018 ± 0.0015 2.4463 ± 0.1099 -0.0198 ± 0.1035 Generated (DM) 0.0017± 0.0015 2.4227 ± 0.1035 0.0484 ± 0.0959 Table 1: Comparison of observables calculated on different datasets with lower error bounds determined using the statistical jackknife method.The sample size is comparable for the bottom two rows and a factor of 5 larger for the training set (top row).two-point susceptibility χ 2 , and the Binder cumulant U L obtained using both the trained DM and with HMC.For the latter, expectation values are shown both for the training set and the testing set, with the former having a factor of five more configurations, which is reflected in the uncertainty.We conclude that the DM is capable of accurately capturing the statistical properties of the system, as reflected by these observables, and is in good agreement with HMC, a well-established method for sampling from the target distribution.A study of the dependence on the size of the training set and the generated data set can be found in the Appendix C. We note that no accept-reject step has been applied in the DM.
The likelihood computation method described in Section 3.3 enables us to estimate the action for the field configurations generated by the trained DM.By comparing the action estimation from the trained DMs with the true action, see Eq. (4.2), we can confirm whether the DM has captured the underlying physics correctly.For this purpose, we first generate an ensemble of field configurations using the trained DM.We then compute the likelihoods, q(ϕ), for these configurations using the probability flow ODE approach.By applying the change-of-variable formula, we are able to compute the action values for these field configurations, as where the constant term Z has been omitted.In Fig. 7 we compare the evaluated effective action with the one obtained directly from the physical action, i.e., Eq. (4.2).The black dashed line means the x−axis is proportional to the y−axis, and the green sites are action values estimated from the probabilistic ODE (on the x−axis) and calculated from the physical action (on the y−axis).The coefficient of determination, R 2 , is a useful metric for quantitatively measuring how well the estimated action values, denoted as S eff (ϕ i ), match the actual ones, denoted as S(ϕ i ).Here i indicates the index in sampled configurations.We use where S(ϕ) is the average value for the actual action, It measures the proportion of variation in the dependent variable (in this case, the action) that is predictable from the independent variable (in this case, the DM-generated samples).R 2 values close to 1 indicate that the DM provides an accurate estimate of the action, suggesting that the trained DM has effectively learned the underlying physics of the scalar field theory correctly.In Fig. 7, with the default dataset size, i.e., 5120 samples for training, the R 2 value is 0.960 calculated on the 1024 generated configurations.

Autocorrelation Time
Critical slowing down [2] emerges when simulations approach a critical point and the correlation length of the system diverges.This should be explored at κ values close to the phase transition, which is located at κ c ≃ 0.27 (for L = 32, λ = 0.022); see e.g.Refs.[6,74,80].Critical slowing down can be studied using the autocorrelation function of an observable O, defined as where t 0 is the starting time and t denotes the time along the Markov chain.In an exponential fit, C O (t) ∼ exp{−t/t c }, the characteristic time t c should scale as a power of the correlation length ξ around the critical point, i.e., t c ∼ ξ z (the correlation length ξ should not be confused with the time variable in the forward SDE).The normalized autocorrelation functions, CO (t) ≡ C O (t)/C O (0) can be used for a clearer comparison.We investigate this for the "absolute value" of the magnetization, defined as and compare various updates.Because the likelihood of samples can be computed, as shown in the previous section, we can use the Metropolis-Hastings (MH) algorithm to construct the asymptotically exact Markov chain sampler with the well-trained DM.We refer to this as the DM-based Monte-Carlo method, following the convention of flow-based models.The acceptance criterion is corrected as where ϕ i−1 is the previous configuration and ϕ proposal is sampled from the well-trained DM.Further, we can estimate the efficiency gain of the DM by comparing the behavior of the autocorrelation function C |M | (t) with the one obtained using a Metropolis-Hastings Monte-Carlo (MC) and the HMC algorithm on 1024 configurations, as shown in Fig. 8.A more comprehensive discussion on the dependence of the acceptance rate on the parameters and the number of training epochs can be found in Appendix D. Beside, one can compute the integrated autocorrelation time [6,12,81] for a more quantitative comparision.It is defined as To get a more precise estimation with uncertainties, one can also choose the automatic windowing procedure in the computation [82,83].These comparisons demonstrate that our proposed method has the potential to significantly suppress autocorrelations along a Markov chain.

Conclusion and Outlook
In this work, we proposed a new method to generate quantum field configurations using diffusion models, popular in the ML community.We highlighted the connection with stochastic quantization (SQ), an approach to quantize field theories based on a stochastic process in a fictitious time direction.In SQ the drift term in the stochastic process is known, as it is derived from the probability distribution one wants to sample from, which is determined by the quantum theory under consideration.In DMs the drift term is not known but is learned from data in a forward stochastic process.After estimating the drift term, the trained DM is run in the opposite direction and independent configurations are created from noise, the "denoising" process.This latter part is akin to the dynamics in SQ, albeit with a learned and time-dependent drift term.In Figure 9, we summarize the relation between stochastic quantisation and diffusion models schematically.We have demonstrated the approach in a simple toy model and in a two-dimensional scalar field theory with a ϕ 4 interaction, both in the symmetric and in the broken phase.We have shown that the DM can serve as a global sampler to assist methods based on the Monte Carlo principle.We note that one can include an accept-reject step at the end of each trajectory.We have found in our numerical results that the well-trained DM can successfully reproduce configurations in both phases and that autocorrelation times are reduced along a Markov chain.This improvement in sampling efficiency can be beneficial for studying lattice field theories and should be analyzed further.
There are various directions to explore in the future.The forward ("smoothening") and backward ("denoising") processes are reminiscent of (inverse) renormalization group transformations and it would be interesting to make this precise.Because the action can be estimated relatively accurately, it is feasible to train a DM without preparing a training data-set, which needs to be investigated.The connection between DMs and SQ offers new perspectives.It is well known how to implement SQ for non-abelian gauge theories; hence it should be possible to formulate DMs for the non-abelian case.An intriguing direction is the following: imagine one has generated a set of configurations for a quantum field theory with fermions, such as QCD or the Schwinger model.Since the fermions are "integrated out", the configurations are given in terms of the gauge fields only.By using these configurations as the starting point of a DM, one can learn an effective stochastic process which incorporates the effect of the fermions implicitly.This may be used to create additional configurations in a numerically "cheap" manner.An accept-reject step based on the underlying theory can be included, which would require the evaluation of the fermion determinant.Finally SQ can also be used for theories with a sign or complex action problem, and hence a non-positive definite distribution.This is commonly known as complex Langevin (CL) dynamics.It is known that CL may become less reliable along long trajectories in the stochastic process.Combining CL with DMs, one may consider fairly short trajectories, but subsequently use the resulting configurations to train the DM to generate additional configurations to increase statistics.
Note added.Recent related work introduces the stochastic process into the hybrid Monte-Carlo algorithm [84] and explores the correspondence between the exact renormalizing group (ERG) and DMs based upon the heat equation [85].
A U-Net Architecture U-Net architectures are utilized to present s θ (ϕ, t) with a time-embedding input.This architecture is widely used in tasks that require semantic segmentation due to its ability to capture both local and global information in an image.The model is initialized with a function that calculates the standard deviation of the perturbation kernel, a list of channels for feature maps at each resolution, and an embedding dimension for Gaussian random feature embeddings.The model consists of an encoding path (contracting path) and a decoding path (expansive path), which are characteristic of the U-Net architecture, as Fig. 10 shows.The encoding path consists of four convolutional layers (conv1 to conv4), each followed by a dense layer (dense1 to dense4), and a group normalization layer (gnorm1 to gnorm4).The convolutional layers progressively increase the number of channels from 1 to 32, 64, 128, 256.The dense layers incorporate information from the time embedding into the feature maps, and the group normalization layers normalize the feature maps across the channels.
The decoding path consists of four transposed convolutional layers (tconv4 to tconv1), each followed by a dense layer (dense5 to dense7), and a group normalization layer (tg-norm4 to tgnorm2).The transposed convolutional layers progressively decrease the number of channels back to 1.The dense layers incorporate information from the time embedding into the feature maps, and the group normalization layers normalize the feature maps across the channels.The decoding path also includes skip connections from the encoding path, which help the model to better localize and learn representations with less loss of spatial information.
The model uses the Swish activation function, which is a smooth, non-monotonic function that has been found to work better than ReLU in deeper models.In the forward pass, the model first obtains the Gaussian random feature embedding for the input time t.It then passes the input ϕ through the encoding path, where it applies the convolutional, dense, group normalization, and activation layers sequentially.It then passes the output of the encoding path through the decoding path, where it applies the transposed convolutional, dense, group normalization, and activation layers sequentially, while also adding the skip connections from the encoding path.Finally, it normalizes the output by the standard deviation of the perturbation kernel at time t and returns it.

B Skilling-Hutchinson Estimator
The divergence function in Eq. (3.15) can be hard to evaluate in practice for general vectorvalued functions f , but we can use an unbiased estimator, named Skilling-Hutchinson estimator [71], to approximate the trace.Let ϵ ∼ N (0, I).The Skilling-Hutchinson estimator is based on the fact that Therefore, we can simply sample a random vector ϵ ∼ N (0, I), and then use ϵ ⊺ J f (ϕ)ϵ to estimate the divergence of f (ϕ).This estimator only requires computing the Jacobianvector product J f (ϕ)ϵ, which is typically efficient.As a result, for our probability flow ODE, we can compute the (log) data likelihood with the following, log p 0 (ϕ(0)) = log p T (ϕ(T )) With the Skilling-Hutchinson estimator, we can compute the divergence via Subsequently, we can compute the integral using numerical integrators.This provides us with an unbiased estimate of the true likelihood of the data, and we can make this estimate increasingly accurate by running the calculation multiple times and taking the average.The numerical integrator requires ϕ(t) as a function of t, which can be obtained by the probability flow ODE sampler.In our calculations, we tested different sampling sizes ranging from 1 to 100 and found that it converges when the size is larger than 10.Therefore, we chose the sample size ϵ as 10 to ensure the accuracy of the estimation.

C Statistics Examination
In Figure 11 (with the same physical parameter setup as in Sec.4.3), we demonstrate the dependence of four key physical observables including their uncertainties on the size of the    training datasets, while the size of the generated dataset remains constant (1024 configurations).The illustration uses gray markers to denote values derived from various training datasets, while colored markers represent values obtained from the generated datasets.The analysis reveals a trend towards convergence in the estimations of observables as the size of the training dataset increases, even with a relatively small generated dataset size.
In Figure 12, we present the dependence of the observables on the number of generated samples obtained from a DM trained on a dataset of 5120 configurations.The observables' values calculated on the training dataset are traced with black lines, contrasting with gray lines that represent calculations over a larger, unused dataset (51200 configurations).The results show that observables from the generated datasets tend to align with those from the larger dataset as their size increases.
Although there is a noticeable deviation in the ⟨M ⟩ estimations towards the training dataset values, it can be conceivably improved by introducing the Z 2 symmetry into the training data-sets explicitly, e.g., data augmentation with flipping all configurations.

D Acceptance Rate
We investigate the dependence of the acceptance rate on the value of the hopping parameter, the system size and the number of training epochs.We have calculated the dependence of the acceptance rate on the hopping parameter, specifically at κ = 0.1, 0.21, 0.27, 0.4, using an identical setup as in the main part of the paper, i.e. with 250 training epochs, T = 100 MC time-steps and 1024 configurations.In Figure 13, we observe that the acceptance rate lies between 0.4 and 0.6.This observation indicates that the efficacy of the DM-MC method remains robust against the variations in interaction couplings, demonstrating its utility as an efficient sampling technique across a broad range of parameter values.
Next we vary the linear size of the lattice, L = 16, 24, 32, 48, 64, at fixed κ = 0.27, using an otherwise identical setup as above.In Figure 13, we observe that the acceptance rate converges to around 0.6 for larger sizes, suggesting that the efficacy of the DM-MC will not be affected by the system size.
The dependence of the acceptance rate of DM-MC on the number of training epochs is shown in Figure 15 for parameter values closer to the critical point.It shows that the

𝜏Figure 1 :
Figure 1: Sketch of a Langevin simulation.Each line represents a different trajectory, evolving from an initial state at τ = 0 to one at a later time, T .The initial configurations are sampled from a prior but naive distribution, ϕ ∼ P [ϕ, τ = 0].After the simulation, the final configurations follow the equilibrated distribution, ϕ ∼ P [ϕ, τ = T ] ∼ P target [ϕ].

Figure 2 :
Figure 2: A sketch of the forward diffusion process (left panel) and the reverse denoising process (right panel).The two stochastic processes are described by two stochastic differential equations.The target distribution is typically unknown but learnt from the training data.

Figure 3 :
Figure 3: Drift terms (upper row) and effective actions (middle row) learned by the DM as a function of ϕ in both single-well (left column) and double-well (right column) actions, for various values of the time τ during the stochastic process.The action is shifted by a constant ∆S 0 .The dashed lines indicate the exact values.1024 samples generated using the exact and the DM are shown in the bottom row.

Figure 4 :
Figure 4: denoising process for generating four independent configurations from a well-trained DM.

Figure 5 :
Figure 5: Comparison of the distribution of the magnetization with 1024 samples generated from the well-trained DM and using HMC.

Figure 6 :
Figure 6: Comparison of the distribution of the magnetization in the symmetric phase from the test data-set (HMC) and from the well-trained DM.The number of independent configurations is 1024 in both cases.

Figure 7 :
Figure 7: Correlation between the effective action estimated from the trained DM using the probabilistic ODE and from the physical action on 1024 samples.

Figure 9 :
Figure 9: Schematic overview of the relation between stochastic quantisation and diffusion models in the case of lattice field theory.

Figure 10 :
Figure10: Sketch of the U-Net Architecture, where the gray lines with arrows indicate skip connections between encoding paths and decoding paths.In this study, the default shape of the input is set as (•,1,32,32), where the first dimension is the batch size, the second indicates the channel, and the final two represent dimensionalities of the field configuration.

Figure 11 :
Figure 11: Dependence of four observables and their uncertainties on the size of the training dataset within the symmetric phase (κ = 0.21, λ = 0.022), using a fixed number of 1024 generated configurations.

Figure 12 :
Figure 12: Dependence of observables on the number of generated samples obtained from a DM trained on a dataset of 5120 configurations within the symmetric phase (κ = 0.21, λ = 0.022).The results are compared with predictions from the training set and with those from a larger unused dataset with 51200 configurations.

Figure 13 :
Figure 13: Dependence of the acceptance rate on the hopping parameter κ on a 32 × 32 lattice.