A fast time domain solver for the equilibrium Dyson equation

We consider the numerical solution of the real time equilibrium Dyson equation, which is used in calculations of the dynamical properties of quantum many-body systems. We show that this equation can be written as a system of coupled, nonlinear, convolutional Volterra integro-differential equations, for which the kernel depends self-consistently on the solution. As is typical in the numerical solution of Volterra-type equations, the computational bottleneck is the quadratic-scaling cost of history integration. However, the structure of the nonlinear Volterra integral operator precludes the use of standard fast algorithms. We propose a quasilinear-scaling FFT-based algorithm which respects the structure of the nonlinear integral operator. The resulting method can reach large propagation times, and is thus well-suited to explore quantum many-body phenomena at low energy scales. We demonstrate the solver with two standard model systems: the Bethe graph, and the Sachdev-Ye-Kitaev model.


Introduction
Quantum many-body systems [1][2][3] can be described in terms of many-body Green's functions, which characterize the system's response to the addition and removal of one or more particles at different points in time. The number of degrees of freedom required to describe a many-body Green's function scales polynomially with the system size, independently of the number of particles. Green's function methods therefore provide an important complement to wavefunction-based methods, especially for macroscopic systems, since the dimensionality of a wavefunction is proportional to the number of particles it describes. A large number of physical properties can be determined just from single particle Green's functions, which give the expectation value associated with the addition and removal of a single particle.
The equation of motion for the single particle Green's function is called the Dyson equation [1]. The static thermal equilibrium properties of quantum many-body systems can be obtained by solving the equation of motion in imaginary time [3], and the dynamical properties by solving it in real time or frequency. The last decade has seen a revival of time domain Green's function methods [4], mainly driven by an interest in non-equilibrium phenomena. This has in turn spurred advances in the development of numerical algorithms both for imaginary time [5][6][7][8][9][10][11][12][13][14][15][16] and non-equilibrium real time [17][18][19][20][21][22] Green's functions. However, the real time Dyson equation can also be used to determine the dynamical properties of quantum many-body systems in equilibrium, by evolving the imaginary time Green's function along the real time axis.
As we will show in Section 3.2, the Dyson equation of motion for the equilibrium single particle Green's function in real time can be written as a system of coupled nonlinear Volterra integro-differential equations of the form

i y′(t) + ∫_0^t k(y(t − t′), t − t′) y(t′) dt′ = f(y(t), t), y(0) = y_0. (1)

In that context, the solution y is a real time single particle Green's function, and the interaction kernel k is called the self-energy. For simplicity of exposition, we take all quantities to be complex scalar-valued, but the matrix-valued case often encountered in Green's function calculations is a straightforward generalization. We also note that although we have assumed local nonlinearities k(t) = k(y(t), t) and f(t) = f(y(t), t), our method applies equally well to the more general case of causal nonlinearities k(t) = k(y(t′)|_{t′ ≤ t}, t) and f(t) = f(y(t′)|_{t′ ≤ t}, t), which appear in more complicated diagrammatic models of the self-energy. The cost of computing the history integral term in (1) scales as O(N^2) with the number N of time steps, severely limiting the propagation times achievable by time domain solvers [23]. This creates a practical barrier in studies of collective low energy phenomena in quantum many-body systems, which require high resolution in the frequency domain, and as a result the time domain approach is typically avoided. The standard alternative is to evaluate the self-energy in the time domain, where it is typically simpler, and transform to the frequency domain to solve the Dyson equation and obtain the corresponding Green's function. While this method eliminates the problem of history integration, and can be accelerated using the fast Fourier transform (FFT) [24], it requires solving a global nonlinear problem [25,26]. The convergence properties of a given nonlinear iteration procedure are problem dependent, and the number of iterations can grow in physically interesting regimes, such as the low temperature limit.
We also mention another method, which is to perform analytic continuation to obtain a simpler integral equation in the imaginary time domain, and then analytically continue the solution back to the real time or frequency domain by solving a severely ill-conditioned integral equation [27]. For sufficiently complicated approximations to the self-energy, direct solution in real time or frequency can be impractical, and this technique is the only option. However, despite significant recent progress in sophisticated regularization methods [28][29][30], the analytic continuation problem is fundamentally ill-posed, and obtaining high quantitative accuracy reliably, as we seek here, remains a challenge.
In this work, we present a time stepping algorithm for (1) which reduces the cost of Volterra history integration to O(N log^2 N), eliminating the main barrier to carrying out equilibrium Green's function calculations in the time domain. Using this approach, the nonlinear problem is local and well-controlled; the number of nonlinear iterations is independent of global features of the solution, and is typically very small. In combination with the recently introduced discrete Lehmann representation [16] of the imaginary time axis, our time domain algorithm enables the calculation of dynamical properties of quantum many-body systems at low temperatures and large propagation times, i.e. at low energy scales, which are essential in capturing emergent many-body excitations.
The problem of efficient Volterra history integration is well known in the scientific computing literature, and several techniques have been proposed, particularly in the context of solving Volterra integral equations and computing nonlocal transparent boundary conditions. We mention fast Fourier transform (FFT)-based methods [31], methods based on sum-of-exponentials projections of the history [32][33][34][35][36][37][38][39][40], convolution quadrature [41,42], and hierarchical low-rank or butterfly matrix compression [22,43,44]. Of the methods mentioned above, several have been applied to the efficient numerical solution of convolutional nonlinear Volterra integral equations [31,42,43]. However, we emphasize that the form of the nonlinearity in (1) is different from that typically considered,

y(t) = f(t) + ∫_0^t k(t − t′) g(y(t′), t′) dt′, (2)

in which the kernel k, or its Laplace transform, is given either explicitly or numerically. All of the methods cited above make use of this a priori access to k in order to obtain a fast history summation scheme, and are therefore not applicable in the present setting, in which k itself is determined during the course of time stepping by a self-consistency condition. This can be made evident by considering a simple example, k(y(t), t) = y(t), which yields the integral operator ∫_0^t y(t − t′) y(t′) dt′ in (1). We refer to the form of the nonlinearity considered here as a kernel nonlinearity, to distinguish it from nonlinearities of the form (2).
Our algorithm is inspired by the FFT-based method of Hairer, Lubich, and Schlichte (HLS), proposed for the numerical solution of Volterra integral equations with nonlinearities of the typical form (2) [31]. Their approach, which will be discussed in detail in Section 2.1, is to restructure the history sums into a hierarchy of Toeplitz block matrix-vector products, and to apply the blocks in quasi-linear time using the FFT. This yields an O(N log^2 N) algorithm. Although their method does not apply directly to the present case because of the kernel nonlinearity, we find that a more sophisticated block partitioning gives a compatible algorithm with the same computational complexity. This paper is organized as follows. We begin in Section 2 by describing our fast algorithm to compute the history integrals in (1). In Section 3, we give a brief introduction to many-body Green's functions and the equilibrium Dyson equation for single particle Green's functions, and show that the latter can be reduced to a system of coupled equations of the form (1). In Section 4, we apply our method to two canonical model systems: the Bethe graph [45,46] and the Sachdev-Ye-Kitaev (SYK) model [47][48][49]. We demonstrate a significant increase in attainable propagation times, and therefore improvements in the resolution of the fine spectral properties of solutions. A concluding discussion is given in Section 5. We also provide appendices describing a specific high-order time stepping method for (1), and technical algorithmic details of the proposed method.

Fast history summation
To illustrate the appearance of the history sums in a simple manner, we first write down a forward Euler discretization of (1). We do not recommend using this method in practice, due to its poor accuracy and stability properties. Rather, we propose a high-order implicit multistep method in Appendix A. History sums of the same form appear no matter the choice of discretization, and the details of our fast history summation algorithm are independent of this choice.
Applying the forward Euler method to (1) yields

i (y_{n+1} − y_n)/∆t + s_n = f_n.

Here t_n = n∆t for a time step ∆t and n = 0, . . ., N − 1, y_n ≈ y(t_n), and f_n = f(t_n). For notational simplicity, we have suppressed the dependence of k and f on y. Applying the trapezoidal rule to the integral gives

s_n = ∆t Σ_{m=0}^{n} k_{n−m} y_m, (3)

with the first and last terms of the sum halved (we suppress the endpoint weights in what follows). The computational bottleneck is evidently the calculation of the history sum at each time step. Indeed, the cost to compute all history sums for n = 0, . . ., N − 1 is O(N^2), and dominates the otherwise O(N) cost of time stepping. The collection of history sums takes the form of a lower triangular Toeplitz matrix-vector product s = Ky. Here, s, y ∈ C^N and K ∈ C^{N×N}, with K_{nm} = k_{n−m} for n ≥ m and K_{nm} = 0 for n < m, n, m = 0, . . ., N − 1. An N × N Toeplitz matrix can be applied to a vector in O(N log N) operations using the FFT; a detailed description of the algorithm is given in Appendix B. Therefore, if y and k were known, then computing the sums (3) with quasi-linear cost would be straightforward.
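To make the Toeplitz structure concrete, the FFT-based application of K can be sketched as follows. This is a minimal NumPy illustration of the standard zero-padding technique, not the paper's Fortran implementation; the Appendix B version may differ in detail.

```python
import numpy as np

def toeplitz_lower_matvec(k, y):
    """Compute s_n = sum_{m=0}^{n} k[n-m] * y[m] for all n at once.

    This is the lower triangular Toeplitz matrix-vector product s = K y
    with K[n, m] = k[n-m] for n >= m, applied in O(N log N) operations
    by embedding it in a circular convolution of length 2N.
    """
    N = len(y)
    M = 2 * N  # pad so the circular wrap-around does not pollute s[0:N]
    return np.fft.ifft(np.fft.fft(k, M) * np.fft.fft(y, M))[:N]

# Compare against direct O(N^2) summation
rng = np.random.default_rng(0)
N = 128
k = rng.standard_normal(N) + 1j * rng.standard_normal(N)
y = rng.standard_normal(N) + 1j * rng.standard_normal(N)
s_fft = toeplitz_lower_matvec(k, y)
s_dir = np.array([sum(k[n - m] * y[m] for m in range(n + 1)) for n in range(N)])
assert np.allclose(s_fft, s_dir)
```

The point of the padding is that entries of the circular convolution with index below N receive no contribution from wrapped-around lags, so the first N outputs coincide with the triangular sums.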
However, in the present setting, there are two additional complications:
1. y_n must be computed in order to obtain s_n, which must in turn be computed in order to obtain y_{n+1}.
2. y_n must be computed in order to obtain k_n, which must in turn be computed in order to obtain y_{n+1}.
The first, which is familiar in the literature on fast algorithms for Volterra-type equations and integral operators, prevents us from computing the history sums all at once as a matrix-vector product, so we cannot simply use the standard algorithm for the fast application of Toeplitz matrices. The HLS algorithm [31], which we review in Section 2.1, offers a solution to this problem, as do the many other methods for the efficient evaluation of Volterra integral operators mentioned in the introduction. The second problem has not, to our knowledge, been addressed in this literature. We will propose a solution in Section 2.2, which appropriately modifies the HLS algorithm to respect the causal structure of the kernel nonlinearity.

Figure 1: Partitioning of the matrix K used in the HLS algorithm. Here, we have L = 3 levels. The lettered blocks correspond to entries which are incorporated into the partial sums by direct summation, and the numbered blocks are applied using a fast Toeplitz matrix-vector product. The blocks are processed in the order corresponding to their first row indices: a-1-b-2-c-3-d-4-e-5-f-6-g-7-h.

HLS algorithm for fast history summation with a fixed kernel
The HLS algorithm computes the sums (3), for the case in which the kernel k is known, in O(N log^2 N) operations [31]. The matrix K_{nm} = k_{n−m} is partitioned into square blocks as in Figure 1. In particular, we define n_j = ⌈jN/2^L⌉ for j = 1, . . ., 2^L − 1 and n_{2^L} = N − 1, where L is the number of levels of subdivision. We choose L = O(log_2 N) so that n_1 is a small, fixed constant. Each block is Toeplitz, and therefore can be applied with quasi-optimal complexity using the FFT.
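For concreteness, the block boundaries n_j can be computed as follows (a small sketch; indexing conventions may differ slightly from the paper's):

```python
import math

def hls_boundaries(N, L):
    """Block boundaries n_j = ceil(j*N/2^L) for j = 1, ..., 2^L - 1,
    with the final boundary set to n_{2^L} = N - 1."""
    n = [math.ceil(j * N / 2**L) for j in range(1, 2**L)]
    n.append(N - 1)
    return n

# For N = 16 and L = 3, the first triangular block has size n_1 = 2
print(hls_boundaries(16, 3))  # [2, 4, 6, 8, 10, 12, 14, 15]
```

Choosing L so that N/2^L is a fixed constant keeps the directly-summed triangular blocks small while the numbered square blocks carry the bulk of the work.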
The algorithm proceeds by accumulating partial sums s̃_n in an efficient manner, such that by the nth time step, s̃_n = s_n as needed. Contributions to the partial sums come both from triangular blocks (shown as the lettered blocks in Figure 1), applied directly row-by-row, and square blocks (shown as the numbered blocks in Figure 1), applied using the FFT. The algorithm amounts to a re-ordering of the calculation of the history sums from the standard row-by-row approach, so as to enable the use of the FFT.
A pseudocode is given in Algorithm 1. We note the use of colon notation for indexing into vectors; for a vector x, we define x_{i:j} ≡ (x_i, . . ., x_j)^T. Let us go through the first few steps of the algorithm.
Algorithm 1 Time-stepping using the HLS algorithm
1: Given: the first and last row indices, r^f_j and r^l_j, and the first and last column indices, c^f_j and c^l_j, of block j in Figure 1, for j = 1, . . ., 2^L − 1
2: Initialize s̃_{0:N−1} = 0
3: for n = 0, . . ., n_1 − 1 do
4:    Compute the history sum s_n directly
5:    Take a time step to obtain y_{n+1}
6: end for
7: for j = 1, . . ., 2^L − 1 do
8:    Apply block j to y_{c^f_j : c^l_j} using the FFT-based algorithm, and add the result to s̃_{r^f_j : r^l_j}
9:    for n = n_j, . . ., n_{j+1} − 1 do
10:      Compute the local contribution to the history sum: s̃_n ← s̃_n + Σ_{m=n_j+1}^{n} k_{n−m} y_m
11:      Set s_n = s̃_n and take a time step to obtain y_{n+1}
12:   end for
13: end for

First, the time steps n = 0, . . ., n_1 − 1 are carried out by the standard method, with the history sums s_0, . . ., s_{n_1−1} computed by direct summation. Once y_{n_1} has been computed, block 1 in Figure 1 can be applied to the vector y_{0:n_1}, yielding partial contributions to the sums s_{n_1:n_2−1}. These partial contributions are stored in s̃_{n_1:n_2−1}.
Next, we carry out time steps n = n_1 to n = n_2 − 1. At the nth time step, we add the local contribution corresponding to the nth row of block b to the stored partial sum s̃_n: s̃_n ← s̃_n + Σ_{m=n_1+1}^{n} k_{n−m} y_m. We then take s_n = s̃_n, and can compute y_{n+1}.
Once y_{n_2} has been obtained, block 2 can be applied to y_{0:n_2}, and the result stored in s̃_{n_2:n_4−1}. Then we proceed with time steps n = n_2 to n = n_3 − 1. At the nth time step, we add the local contribution corresponding to the nth row of block c to s̃_n, s̃_n ← s̃_n + Σ_{m=n_2+1}^{n} k_{n−m} y_m, set s_n = s̃_n, and obtain y_{n+1}. Once y_{n_3} is obtained, block 3 can be applied to y_{n_2+1:n_3}, and its contribution added to s̃_{n_3:n_4−1}. Then the local contributions corresponding to the rows of block d are added to the partial sums in order to take the corresponding time steps. The algorithm proceeds in this manner, with contributions from each block added to the corresponding partial sums in the indicated order, such that by the end of the nth time step, s̃_n = s_n as needed.
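The interleaving of block applications with time steps can be illustrated with a short sketch. The recursive divide-and-conquer formulation below is equivalent in spirit and complexity to the HLS block schedule, though it does not reproduce the exact block ordering of Figure 1; the kernel k and the step rule are generic stand-ins for the quantities appearing in (1), with k assumed known in advance:

```python
import numpy as np

def fft_convolve(a, b):
    # Linear convolution via the FFT: c[p] = sum_q a[p-q] * b[q]
    n = len(a) + len(b) - 1
    N = 1
    while N < n:
        N *= 2
    return np.fft.ifft(np.fft.fft(a, N) * np.fft.fft(b, N))[:n]

def fast_volterra(k, step, y0, N, base=8):
    """Time step y[n] = step(n-1, s[n-1]), where s[n] = sum_{m<=n} k[n-m] y[m]
    is accumulated block-wise via the FFT rather than row-by-row."""
    y = np.zeros(N, dtype=complex)
    s = np.zeros(N, dtype=complex)  # partial sums (s-tilde in the text)
    y[0] = y0

    def solve(lo, hi):
        # On entry: y[0:lo] is known, and s[lo:hi] already holds all
        # contributions from y[0:lo]; on exit, y[lo:hi] and s[lo:hi] are final.
        if hi - lo <= base:
            for i in range(lo, hi):
                if i > 0:
                    y[i] = step(i - 1, s[i - 1])
                s[i] += sum(k[i - m] * y[m] for m in range(lo, i + 1))
            return
        mid = (lo + hi) // 2
        solve(lo, mid)
        # Off-diagonal Toeplitz block: rows [mid, hi), columns [lo, mid)
        v = fft_convolve(k[: hi - lo], y[lo:mid])
        s[mid:hi] += v[mid - lo : hi - lo]
        solve(mid, hi)

    solve(0, N)
    return y, s

# Verify against direct O(N^2) time stepping for a toy linear step rule
rng = np.random.default_rng(1)
N = 200
k = 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
f = rng.standard_normal(N)
step = lambda n, s_n: f[n] - 0.01 * s_n
y_fast, s_fast = fast_volterra(k, step, 1.0, N)
y_dir = np.zeros(N, dtype=complex)
s_dir = np.zeros(N, dtype=complex)
y_dir[0] = 1.0
for n in range(N):
    if n > 0:
        y_dir[n] = step(n - 1, s_dir[n - 1])
    s_dir[n] = sum(k[n - m] * y_dir[m] for m in range(n + 1))
assert np.allclose(y_fast, y_dir) and np.allclose(s_fast, s_dir)
```

Each recursion level applies one off-diagonal Toeplitz block in O(n log n), giving the same O(N log^2 N) total as the level-by-level schedule described above.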
To derive the computational complexity of this algorithm, we simply add up the cost of applying all blocks. Recall that L = O(log_2 N). Since each triangular block is of dimension approximately N/2^L = O(1), and there are 2^L = O(N) such blocks, the total cost of applying these blocks is O(N). For the square blocks, we have one block of dimension approximately N/2, two blocks of dimension N/4, four blocks of dimension N/8, and so on. The total cost of applying these blocks using the fast Toeplitz algorithm is therefore approximately

Σ_{l=1}^{L} 2^{l−1} · C (N/2^l) log(N/2^l) ≤ (C/2) N L log N = O(N log^2 N),

for a constant C.

Fast history summation with a kernel nonlinearity
The HLS algorithm cannot be applied if k_{n+1}, . . ., k_{N−1} are unknown at the nth time step. Indeed, block j is applied as soon as y_{n_j} is obtained. However, block j may contain values k_n of the kernel with n > n_j; for example, block 4 contains values up to k_{N−1}. Since k_n depends on y_n, it is not possible to apply the full blocks in the order prescribed by the algorithm. In particular, only the upper triangular parts of blocks 1, 2, and 4 are known at the step at which they must be applied.
A modified partitioning of the same matrix is depicted in Figure 2. It contains square, triangular, and parallelogram-shaped blocks. In Appendix B, we generalize the standard FFT-based algorithm for Toeplitz matrices to triangular and parallelogram-shaped blocks (appropriately completed to full rectangular matrices by zero-padding), allowing these blocks to be applied in O(n log n) operations as well.
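The zero-padding idea can be sketched as follows: masking the kernel outside a band of lags turns a parallelogram-shaped (or triangular) block into an ordinary rectangular Toeplitz matrix, which is then applied by FFT-based convolution. This is an illustration of the principle only; the actual construction is given in Appendix B.

```python
import numpy as np

def banded_toeplitz_matvec(k, y, lag_min, lag_max):
    """Apply the block B[i, j] = k[i-j] if lag_min <= i-j <= lag_max, else 0.

    Zeroing the kernel outside the band completes the parallelogram-shaped
    block to a full rectangular Toeplitz matrix, applied via the FFT.
    """
    kz = np.zeros(lag_max + 1, dtype=complex)
    kz[lag_min : lag_max + 1] = k[lag_min : lag_max + 1]
    n = lag_max + len(y)  # output rows 0 .. lag_max + len(y) - 1
    N = 1
    while N < n:
        N *= 2
    return np.fft.ifft(np.fft.fft(kz, N) * np.fft.fft(y, N))[:n]

# Compare against a dense construction of the same parallelogram block
rng = np.random.default_rng(2)
k = rng.standard_normal(32) + 1j * rng.standard_normal(32)
y = rng.standard_normal(10) + 1j * rng.standard_normal(10)
lag_min, lag_max = 3, 12
out = banded_toeplitz_matvec(k, y, lag_min, lag_max)
dense = np.zeros((lag_max + len(y), len(y)), dtype=complex)
for i in range(dense.shape[0]):
    for j in range(len(y)):
        if lag_min <= i - j <= lag_max:
            dense[i, j] = k[i - j]
assert np.allclose(out, dense @ y)
```

Setting lag_min = 0 recovers a lower triangular block, so one routine covers all three block shapes in the modified partitioning.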
The structure of our algorithm is similar to that of the HLS algorithm, but with the modified partitioning. The triangular blocks indicated in Figure 2 by letters contain rows which are applied directly, and the numbered blocks are applied as soon as the time step corresponding to their first row n_j is reached. The shapes of the numbered blocks ensure that their entries depend only on k_n for n ≤ n_j. The results of both the direct and block applications are accumulated in partial sums s̃_n as before. Unlike in the HLS algorithm, at each time step there are two local contributions that must be made; the first, as before, corresponds to values y_m for m near the current time step n, and the second to values k_m for m near n. Also, at some steps, two blocks are applied instead of one. A pseudocode for the full procedure is given in Algorithm 2. The partitioning for L = 5 is shown in Figure 3. This is similar to Figure 2, but contains more levels.
To obtain the computational complexity of the new algorithm, we consider the two types of units separately. As in the previous section, the total cost of applying the triangular HLS-type units, indicated in blue in Figure 3, is O(N log^2 N). The total cost of applying the blocks in a square unit of dimension N/2^l is O((N/2^l) log(N/2^l)); here, one contribution comes from the top-right upper triangular block, and another from the rest of the blocks. Summing over all square units gives a total cost of O(N log^2 N). Thus the computational complexity of the modified algorithm is O(N log^2 N), the same as that of the original HLS algorithm.

Green's functions and the equilibrium Dyson equation
The aim of this section is to give a brief introduction to quantum many-body theory, the single particle Green's function, and the Dyson equation of motion, starting from basic quantum mechanics. We refer the reader to [4] for a more thorough introduction to the subject. In thermal equilibrium, the state of a quantum many-body system is described by a many-body density matrix ρ, which represents a statistical mixture of states |ψ_n⟩ weighted, according to the Gibbs distribution, by e^{−βE_n}/Z:

ρ = (1/Z) Σ_n e^{−βE_n} |ψ_n⟩⟨ψ_n|.

Here Ĥ is the Hamiltonian of the system for times t < 0, β is the inverse temperature, |ψ_n⟩ is an eigenstate of the system with energy E_n, Ĥ|ψ_n⟩ = E_n|ψ_n⟩, and Z is the partition function Z = Σ_n e^{−βE_n}. In this context, the expectation value of an operator Ô is given by the ensemble average ⟨Ô⟩ = Tr[ρ Ô].

Algorithm 2 Time-stepping using the fast history summation algorithm for kernel nonlinearities
1: Given: the first and last row indices, r^f_j and r^l_j, and the first and last column indices, c^f_j and c^l_j, of block j in Figure 2, for j = 1, 2, 3_1, 3_2, 4, 5_1, 5_2, . . .
2: Initialize s̃_{0:N−1} = 0
3: for n = 0, . . ., n_1 − 1 do
4:    Compute the history sum s_n directly
5:    Take a time step to obtain y_{n+1}
6: end for
7: for each block row index n_j do
8:    if there is one numbered block at row n_j then
9:       Apply block j to y_{c^f_j : c^l_j} using the FFT-based algorithm, and add the result to s̃_{r^f_j : r^l_j}
10:   else
11:      There are two blocks at row n_j; do the same as above, for j = j_1 and j = j_2
12:   end if
13:   for n = n_j, . . ., n_{j+1} − 1 do
14:      Compute the local contributions to the history sums
15:      Set s_n = s̃_n and take a time step to obtain y_{n+1}
16:   end for
17: end for

For a system with a time-dependent Hamiltonian Ĥ(t) for t ≥ 0, the operator expectation values are time-dependent, and can be calculated in the Heisenberg picture by time evolving the operator: Ô → Û(0, t) Ô Û(t, 0) ≡ Ô(t). Here Û(t, t′) is the unitary time evolution operator from time t′ to t, the solution operator for the Schrödinger equation with Hamiltonian Ĥ(t), given formally by a time ordered exponential for t > t′ and an anti-time ordered exponential for t < t′, with T the time ordering operator and T̄ the anti-time ordering operator. The time-dependent ensemble average is then given by

⟨Ô(t)⟩ ≡ Tr[ρ Û(0, t) Ô Û(t, 0)]. (4)

The notion of time propagation can also be extended to the initial state, since ρ can be expressed as ρ = e^{−βĤ}/Z = Û(−iβ, 0)/Z. Hence, the initial thermal equilibrium density matrix can be viewed in terms of evolution in imaginary time from t = z = 0 to −iβ, and the time-dependent ensemble average (4) can be expressed as

⟨Ô(t)⟩ = (1/Z) Tr[Û(−iβ, 0) Û(0, t) Ô Û(t, 0)]. (5)

One can now interpret the contents of the trace in (5) as the following sequence of operations: i) propagate from time 0 to t, and apply the operator Ô; ii) propagate backwards from time t to 0; and iii) propagate in imaginary time from 0 to −iβ. More generally, one can consider expectation values of multiple operators acting at different times, like the two-time correlation function between operators Ô_1 and Ô_2.

Figure 3: The full partitioning is comprised of triangular units (blue) from the HLS algorithm (Figure 1), and square units (red), which account for the self-consistent nature of (1).

Single particle Green's functions
Quantum many-body systems can be described using a special class of expectation values called many-body Green's functions. Many-body Green's functions are the expectation values associated with the addition and removal of particles from a system at a collection of times on C. For example, the single particle Green's function G(z, z′) is given by [4]

G(z, z′) = −i ⟨T_C c(z) c†(z′)⟩, (6)

where c†(z′) is a particle creation operator, which adds a particle to the system at a generalized time z′, c(z) is a particle annihilation operator, which removes a particle from the system at another generalized time z, and T_C is the contour time ordering operator, which orders operators according to their contour ordering; see Figure 4. In the standard treatment [4], Green's function components are introduced for each possible restriction of z and z′ to the three branches C^+, C^−, and C^M, using the real-valued arguments t and τ, where t ≡ z for z ∈ C^±, and τ ≡ iz for z ∈ C^M; see also Figure 4.

The real time equilibrium Dyson equation
The Dyson equation of motion for the single particle Green's function (6) can be derived from the Martin-Schwinger hierarchy of coupled equations for n-particle Green's functions by collecting higher-order many-body corrections into an integral kernel called the self-energy Σ. Here we write the Dyson equations of motion for the Green's function components relevant in the case of equilibrium real time evolution. We note that in equilibrium, the Green's function is translation invariant in real time. Here, t_max indicates a finite propagation time which can in principle be taken to ∞.
The initial thermal equilibrium state is described by the restriction G^M of the contour Green's function to the imaginary time branch; G^M is referred to as the Matsubara Green's function. The Dyson equation of motion (7) for G^M(τ), for τ ∈ [0, β], involves a convolution on the imaginary time interval. In (7), ξ = ±1 for bosonic and fermionic particles, respectively, and G^M is extended to (−β, 0) in the convolution by the periodicity or antiperiodicity condition G^M(−τ) = ξ G^M(β − τ). The interaction kernel in these equations is the imaginary time self-energy Σ^M, which depends self-consistently on G^M. In equilibrium systems, it is sufficient to solve for the Green's function restricted to mixed real and imaginary time arguments. The left mixing component G^⌉(t, τ), with the second argument on the vertical imaginary time branch C^M, satisfies the equilibrium Dyson equation of motion (8), with source term Q^⌉ given by the convolution (9). Here, Σ^⌉(t, τ) and Σ^R(t) are called the left mixing and retarded self-energies, respectively, and they depend self-consistently on G^⌉(t, τ). We note that the equation (7) for the Matsubara Green's function can be solved first, independently, and its solution then used in the initial condition and source term in (8). The solution of (7) and (8) completely determines the single particle Green's function. However, several other components are commonly used in the literature, and we mention them for completeness. The lesser component is defined as G^<(t, t′) = G(z, z′) with z ∈ C^−, z′ ∈ C^+, and it is related to the occupied density of states. The greater component G^>(t, t′) is defined analogously, with the branches of z and z′ interchanged, and it is related to the unoccupied density of states. The retarded component G^R(t, t′) = θ(t − t′) (G^>(t, t′) − G^<(t, t′)) is related to the full density of states. Again taking G^R(t, t′) ≡ G^R(t − t′), we therefore obtain the expression (10) for G^R(t). For later convenience, we note that G^R(t) satisfies the Dyson equation (11) for t ≥ 0, and is taken to be zero for t < 0.

Semi-discretization in the imaginary time variable
To write the Dyson equation as a system of equations of the form (1), we must semi-discretize in the imaginary time variable τ. In addition to the equation (7) for G^M, we will obtain r equations from (8), where r is the number of grid points discretizing τ ∈ [0, β]. It is therefore important to find as efficient a discretization of the imaginary time domain as possible.
Due to the importance of the Matsubara formalism in many-body calculations, there is a significant literature on the efficient discretization of G^M. The most common approach is to use a uniform grid on τ ∈ [0, β], and to solve (7) by an FFT-based method. However, since G^M is not periodic, this discretization suffers from low-order accuracy, and it struggles to resolve G^M in low temperature calculations. Methods based on orthogonal polynomial expansions [6][7][8] and adaptive grids [9][10][11] offer an improvement, but remain suboptimal. Recently, two methods have been proposed which use the specific structure of imaginary time Green's functions, along with low rank compression techniques, to obtain highly compact representations: the intermediate representation with sparse sampling [12][13][14][15], and the discrete Lehmann representation (DLR) [16]. Here, we work with the DLR.
Using the DLR, G^M can be expanded in a basis of a small number r of exponential functions,

G^M(τ) ≈ Σ_{k=1}^{r} g^M_k K(τ, ω_k), K(τ, ω) = e^{−ωτ}/(1 + e^{−βω}), (12)

for carefully chosen DLR frequencies ω_k. The expansion coefficients g^M_k can be recovered from the values g^M_j = G^M(τ_j) at a collection of r DLR imaginary time grid points τ_j. The frequencies and imaginary time grid points can be obtained using the interpolative decomposition [50,51], and they depend only on a dimensionless cutoff parameter Λ = βω_max, with ω_max a frequency cutoff, and an accuracy parameter ϵ. Λ can typically be estimated based on physical considerations, but in practice calculations should be converged with respect to that parameter. The size of the basis scales as r = O(log(Λ) log(1/ϵ)), and is exceptionally small in practice. The Matsubara component (7) of the Dyson equation can be solved on the DLR imaginary time grid. We refer to [16] for further details on the DLR.
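The interpolation step, recovering expansion coefficients from values on the grid, can be sketched schematically as follows. The frequencies and grid points below are hand-picked stand-ins for those produced by the interpolative decomposition (the actual DLR construction, as implemented in libdlr, selects them automatically), and the fermionic kernel convention used here is an assumption, since sign conventions vary in the literature:

```python
import numpy as np

def kernel(tau, omega, beta):
    """Fermionic analytic continuation kernel K(tau, omega), evaluated in a
    numerically stable form for both signs of omega (assumed convention)."""
    if omega >= 0:
        return np.exp(-tau * omega) / (1 + np.exp(-beta * omega))
    return np.exp((beta - tau) * omega) / (1 + np.exp(beta * omega))

beta = 10.0
# Hand-picked stand-ins for the DLR frequencies and imaginary time nodes
omegas = np.array([-4.0, -1.0, 0.5, 2.0])
taus = np.array([0.1, 1.0, 4.0, 9.0])
r = len(omegas)

# Interpolation matrix A[j, k] = K(tau_j, omega_k)
A = np.array([[kernel(t, w, beta) for w in omegas] for t in taus])

# A "Green's function" lying exactly in the span of the basis
c_true = np.array([0.2, 0.5, 0.0, 0.3])
vals = A @ c_true  # values g_j = G(tau_j) on the grid

# Recover the expansion coefficients from the grid values ...
c = np.linalg.solve(A, vals)
assert np.allclose(c, c_true)

# ... and evaluate the expansion at an arbitrary tau
tau_new = 2.5
G_new = sum(c[k] * kernel(tau_new, omegas[k], beta) for k in range(r))
```

The real DLR chooses the r frequencies and nodes so that this interpolation problem is well-conditioned to accuracy ϵ for any Green's function with spectral support in [−ω_max, ω_max].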
Once the DLR expansion of G^M is known, it can be used to compute Q^⌉ by the convolution in (9). In particular, let σ^⌉_j(t) and q^⌉_j(t) be the DLR grid discretizations of Σ^⌉(t, τ_j) and Q^⌉(t, τ_j), respectively, so that q^⌉(t), σ^⌉(t) ∈ C^r. Then the discretization of (9) on the DLR grid is given by

q^⌉(t) = G^M σ^⌉(t), (13)

for an r × r matrix G^M which can be computed from the values g^M_j. We refer to [16] for a detailed description of this process. We note that we have assumed that Σ^⌉ and Q^⌉ can also be represented by the DLR, which is the case for typical problems of physical interest.
As we did for σ^⌉ and q^⌉, we define g^⌉_j(t) as the discretization of G^⌉(t, τ_j), so that g^⌉(t) ∈ C^r, and σ^R(t) as the discretization of Σ^R(t). The nature of the dependence of σ^⌉ and σ^R on g^⌉ follows from that of Σ^⌉ and Σ^R on G^⌉; namely, we have σ^⌉(t) = σ^⌉(g^⌉(t), t) and σ^R(t) = σ^R(g^⌉(t), t). With these definitions, the semi-discretization of (8) is given by a coupled system of integro-differential equations (14), for j = 1, . . ., r. The exponential change of variables y(t) ≡ e^{iht} g^⌉(t) yields a system of equations of the form (1), with k(y(t), t) ≡ σ^R(e^{−iht} y(t), t), f(y(t), t) ≡ e^{iht} q^⌉(t) = e^{iht} G^M σ^⌉(e^{−iht} y(t), t), and initial conditions y_j(0) = iξ G^M(β − τ_j).
We note that, as for G^M(τ) in (12), G^⌉(t, τ) can be expanded in the DLR basis:

G^⌉(t, τ) ≈ Σ_{k=1}^{r} g^⌉_k(t) K(τ, ω_k). (15)

The r expansion coefficients g^⌉_k(t) can be obtained from the r values g^⌉_j(t). This expression can be used, for example, to compute G^R(t) using (10).

Numerical examples
As a benchmark and proof of concept, we apply the fast history summation algorithm to two simple but nontrivial systems with different self-energy expressions: the Bethe graph [45,46], for which Σ depends linearly on G, and the SYK model [47][48][49], for which the dependence is cubic.
All numerical experiments were implemented in Fortran, and carried out on a single CPU core of a laptop with an Intel Xeon E-2176M 2.70 GHz processor. The Fortran library libdlr was used for an implementation of the DLR [52,53]. The FFTW library [54] was used for FFTs.

The Bethe graph
The Bethe graph [45,46] is a simplified model of a periodic lattice system in which each lattice site is connected to q other sites. It is commonly employed in model calculations since the dynamics on the infinite graph can be described by the simple self-energy

Σ(z, z′) = c^2 G(z, z′). (16)

Here, the real constant c = qJ is determined by the hopping parameter J, given by the matrix element of the kinetic energy operator coupling states at neighbouring sites, and the coordination number q. The Green's function is known analytically in this case, so it provides a straightforward performance test for our time stepping method and history summation algorithm.
Let us first derive G^R analytically, starting from the Dyson equation (11). We note that, following the convention in the literature [3,4], we define the Fourier transform of a function f by f(ω) = ∫_{−∞}^{∞} e^{iωt} f(t) dt; the function argument itself indicates that we are working in the transform domain. Since by definition G^R(t) = 0 and Σ^R(t) = c^2 G^R(t) = 0 for t < 0, (11) can be extended to the whole real axis. Applying the Fourier transform yields a quadratic equation for G^R(ω),

c^2 G^R(ω)^2 − (ω − h) G^R(ω) + 1 = 0.

It follows, in particular, that

G^R(ω) = ((ω − h) − √((ω − h)^2 − 4c^2))/(2c^2). (17)

From (17), we see that the spectral function A(ω) = −(1/π) Im G^R(ω) is the well known semi-circular density of states:

A(ω) = √(4c^2 − (ω − h)^2)/(2πc^2) for |ω − h| ≤ 2c, and A(ω) = 0 otherwise.

Applying the inverse Fourier transform, integrating by parts, and making the change of variables 2c sin θ = ω − h, we recognize the integral formula for the Bessel function J_1 of the first kind, and obtain

G^R(t) = −i e^{−iht} J_1(2ct)/(ct), t ≥ 0. (18)

For the numerical calculations, we fix c = 1, h = −1 and β = 10, and consider the fermionic case ξ = −1. Since the initial condition for the real time propagation of G^⌉(t, τ) in (8) is determined by G^M(τ), we first solve the imaginary time equation of motion (7). The imaginary time axis τ ∈ [0, β] is discretized using the DLR tolerance ϵ = 10^{−15}, and a high energy cutoff Λ = 40, which we have verified is sufficient to eliminate the discretization error in τ to high accuracy. The resulting DLR contains r = 31 basis functions and DLR grid points τ_j. We obtain G^M(τ) on the DLR grid using the method described in [16] for the equivalent integral form of (7), treating the nonlinearity by a weighted fixed point iteration with pointwise tolerance 10^{−15}.
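The closed form (18), G^R(t) = −i e^{−iht} J_1(2ct)/(ct) for t ≥ 0 (in the Fourier convention stated above), can be checked against the inverse Fourier transform of the semicircular density of states by direct numerical quadrature. The sketch below uses NumPy only, evaluating the Bessel function via its integral representation:

```python
import numpy as np

c, h = 1.0, -1.0

def trap(f, x):
    # Composite trapezoidal rule
    return np.sum((f[1:] + f[:-1]) * (x[1:] - x[:-1])) / 2.0

def j1(z):
    # Bessel function of the first kind via its integral representation,
    # J_1(z) = (1/pi) * int_0^pi cos(theta - z sin(theta)) dtheta
    theta = np.linspace(0.0, np.pi, 20001)
    return trap(np.cos(theta - z * np.sin(theta)), theta) / np.pi

def gr_bessel(t):
    # Closed form G^R(t) = -i exp(-i h t) J_1(2ct)/(ct) for t > 0
    return -1j * np.exp(-1j * h * t) * j1(2 * c * t) / (c * t)

def gr_dos(t):
    # G^R(t) = -i int A(w) exp(-i w t) dw, with the semicircular
    # density of states A(w) = sqrt(4c^2 - (w - h)^2)/(2 pi c^2)
    w = np.linspace(h - 2 * c, h + 2 * c, 40001)
    A = np.sqrt(np.maximum(4 * c**2 - (w - h) ** 2, 0.0)) / (2 * np.pi * c**2)
    return -1j * trap(A * np.exp(-1j * w * t), w)

for t in [0.5, 1.0, 5.0, 10.0]:
    assert abs(gr_bessel(t) - gr_dos(t)) < 1e-5

# As t -> 0+, J_1(2ct)/(ct) -> 1, recovering the initial value G^R(0+) = -i
assert abs(gr_bessel(1e-3) - (-1j)) < 1e-2
```

This is exactly the comparison used below to measure the convergence of the time stepping method, with the quadrature-based reference replaced by the exact expression.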
We then solve the real time equation of motion (8) for G^⌉(t, τ) using the high-order Adams-Moulton method described in Appendix A, and compute G^R(t) from (10) by evaluating the DLR expansion (15) of G^⌉. We use fixed point iteration with a pointwise tolerance 10^{−15} for the nonlinear solve. Figure 5 shows a plot of G^⌉(t, τ) for t ∈ [0, 15]. In Figure 6a, we show convergence of the fourth and eighth-order Adams-Moulton methods with respect to ∆t by comparing the computed value of G^R(t) with the exact solution (18) for t ∈ [0, 1000]. We then fix ∆t = 1/64, sufficient to achieve near machine precision accuracy for all calculations using the eighth-order method, and vary the final propagation time from t = 1 to t = 131072 by increasing the number N of time steps. Figure 6b shows wall clock timings using direct history summation (for some choices of N) and our fast history summation algorithm. We observe the expected O(N log^2 N) scaling for the fast method, and note that it is superior for N ≥ 256, indicating a very small pre-factor in the scaling. For the longest propagation carried out here, with N = 8388608 and T = 131072, only N ≈ 110000 time steps to T ≈ 1700 would be possible by the direct method with the same cost. We also note that only one fixed point iteration is required to reach self-consistency for all time steps after the first 500, and before that at most two, indicating that the equations are non-stiff and an Adams predictor-corrector-type method is sufficient in this case [55, Sec. 5.4.2].

The Sachdev-Ye-Kitaev model
The fermionic Sachdev-Ye-Kitaev (SYK) model demonstrates the challenge of resolving emergent low energy features in quantum many-body systems. It is used to model certain features of strange metals, a poorly understood phase of matter with properties distinct from those of simple metals, and is connected to a variety of other phenomena of physical interest [47-49, 56, 57]. The SYK equations of motion in real time are given by (7) and (8) with ξ = −1, and a self-energy consisting of a single second-order diagram in the interaction J, containing a product of three Green's functions. In the limit of zero temperature T = 1/β → 0, the SYK model displays an emergent low energy feature in the spectral function, an |ω|^{-1/2} divergence at the origin [47]. At finite temperature T, the divergence is regularized by thermal fluctuations at frequencies |ω| ≲ T.
In order to resolve this feature, the Green's function must be propagated to a large time at low temperature. Our fast history summation algorithm makes this practical. We demonstrate this by solving the SYK equations and computing the spectral density at temperatures up to four orders of magnitude smaller than the coupling strength J, and establish the scaling of the cost required to reach even lower temperatures.
We solve the SYK equations and compute G^R(t) and A(ω) for J = 1, h = 0, and β = 100, 1000, 10000, using the eighth-order Adams-Moulton method described in Appendix A. We tune all discretization parameters, and the propagation time, so that the spectral function A(ω) is resolved.
For β = 10000, we use the DLR parameters Λ = 10^5 and ϵ = 10^{-10}, which yield r = 92 degrees of freedom in τ. We use a fixed point tolerance of 10^{-14} for the nonlinear iteration. We propagate to a final time t = 50000 with N = 1048576 time steps. The computation takes less than five minutes to complete using our fast method; it would take approximately two days using the direct method.
We note that for all three choices of β, only one fixed point iteration is required to reach self-consistency for all time steps after the first 100, and before that at most three.
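The weighted fixed point iterations used throughout can be illustrated on a toy self-consistency problem. The sketch below is not the paper's solver: it solves the scalar Bethe-lattice-type Dyson equation g = 1/(z − h − c^2 g) at a single complex frequency z by damped mixing, with an illustrative damping parameter alpha of our choosing:

```python
import numpy as np

# Toy damped (weighted) fixed point iteration for the scalar self-consistency
#   g = 1 / (z - h - c^2 * g),
# i.e. a Bethe-lattice-type Dyson equation at a single frequency z.
# Both z and the mixing parameter alpha are illustrative choices.
z, h, c, alpha = 2j, 0.0, 1.0, 0.7

g = 0.0 + 0.0j
for _ in range(200):
    g_new = (1 - alpha) * g + alpha / (z - h - c**2 * g)
    if abs(g_new - g) < 1e-14:
        g = g_new
        break
    g = g_new

# exact physical root of g^2 - z*g + 1 = 0 at z = 2i is g = i*(1 - sqrt(2))
exact = 1j * (1 - np.sqrt(2.0))
```

Undamped iteration already converges here since the map is contractive near the physical root; the damping becomes important in stiffer regimes, where a plain fixed point map may diverge.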
Plots of G^R and A are shown in Figure 7. G^R(t) decays exponentially as t → ∞, at a rate proportional to the temperature (Figure 7a). As a result, for large β, the spectral density is highly localized (Figure 7b). In the zero-temperature limit, the spectral density has an |ω|^{-1/2} singularity at the origin. Although A(ω) remains bounded at finite temperature, we observe it approaching this singularity as β → ∞, with a corresponding t^{-1/2} decay regime in G^R(t) at intermediate times.

Conclusion
We have presented an efficient solver for a class of nonlinear convolutional Volterra integro-differential equations for which the integral kernel depends causally on the solution. As is typical for Volterra-type equations, the computational bottleneck is the evaluation of history integrals at each time step. However, in the present setting, limited knowledge of the kernel at a given time step precludes the use of the various Volterra history summation techniques which have been introduced in the literature. We present an O(N log^2 N) algorithm which properly handles the structure of the integral operator.
We are motivated to study equations of the form (1) by their appearance in quantum many-body Green's function methods. In particular, the equilibrium real time Dyson equation of motion (8) for the single particle Green's function can be reduced to a system of equations of this form. By using the imaginary time Green's function, which is relatively easy to compute, to initialize the real time propagation, our approach requires only local nonlinear iteration. By contrast, the standard method requires global nonlinear iteration in real time and frequency, which can lead to slow convergence in regimes of physical interest.
As proofs of concept, we present numerical experiments for the Bethe graph [45,46] and the Sachdev-Ye-Kitaev model [47,49], demonstrating propagation to extremely large times. In particular, we are able to resolve the square root divergence in the SYK model at a temperature 1/β four orders of magnitude lower than the coupling J, β^{-1}/J = 10^{-4}, while resolving energy scales ω three orders of magnitude smaller than J, ω/J ≈ 10^{-3}, in a calculation taking less than five minutes on a single CPU core of a laptop. More generally, our approach is applicable to many-body Green's function methods with low order self-energy expressions, as in low order perturbation theory [1][2][3][4], as well as bubble and ladder resummations like Hedin's GW approximation and beyond [58][59][60][61].
Integrating both sides from t_n to t_{n+1} gives (21). The Adams-Moulton method of order p is obtained by replacing the integrand with its polynomial interpolant at the points {t_{n+1−j}}_{j=0}^{p−1}, and computing the resulting integrals analytically. This yields the discretization (21), where μ_j^{AM} are the Adams-Moulton weights. For a procedure to obtain the weights, along with tabulated weights for methods of up to eighth order, we refer the reader to [62, Chap. 24].
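The Adams-Moulton weights can be generated directly from this definition by integrating the Lagrange interpolant in exact rational arithmetic. A minimal sketch (the function name is ours; nodes are written as s_j = 1 − j on the normalized step [0, 1]):

```python
from fractions import Fraction

def adams_moulton_weights(p):
    # p-th order Adams-Moulton weights mu_j, j = 0..p-1, such that
    #   int_{t_n}^{t_{n+1}} f dt  ~  dt * sum_j mu_j f(t_{n+1-j}),
    # obtained by integrating the degree-(p-1) Lagrange interpolant through
    # the equispaced nodes s_j = 1 - j over the normalized interval [0, 1].
    nodes = [Fraction(1 - j) for j in range(p)]
    mu = []
    for j in range(p):
        # Lagrange basis polynomial l_j(s), coefficients lowest degree first
        poly = [Fraction(1)]
        for i in range(p):
            if i == j:
                continue
            denom = nodes[j] - nodes[i]
            # multiply poly by (s - nodes[i]) / denom
            shifted = [Fraction(0)] + poly                     # s * poly
            scaled = [-nodes[i] * a for a in poly] + [Fraction(0)]
            poly = [(a + b) / denom for a, b in zip(shifted, scaled)]
        # integrate l_j(s) over [0, 1] term by term
        mu.append(sum(a / (k + 1) for k, a in enumerate(poly)))
    return mu
```

For p = 2 this reproduces the trapezoidal weights (1/2, 1/2), for p = 3 the weights (5/12, 2/3, −1/12), and for p = 4 the weights (3/8, 19/24, −5/24, 1/24), matching the standard tables.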
It remains to discretize the history integrals to high-order accuracy. Since y_j(t) is known only at the grid points t = t_n, a standard high-order rule, such as a composite Gauss-Legendre rule, would require interpolation, and is inconsistent with our fast algorithm without significant modification. Instead, we require a high-order rule of the form (22), for some weights μ_m^G; that is, a standard equispaced rule with endpoint corrections. A simple method is the Gregory integration rule. The Gregory rule with q modified weights is correct to order q + 1. Although this approach becomes unstable at very high orders, we have found it to be effective at least up to order eight. More stable endpoint-corrected rules have been introduced, including modifications of the Gregory rules [63,64], and stable rules of very high order which require modifying the endpoint node locations [65,66], but we have found the standard Gregory rules to be sufficient for our purposes. For a procedure to obtain the weights, we refer the reader to [64]; the weights are obtained from the Gregory coefficients, which are tabulated to high order in [67,68].
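As a concrete illustration (not the paper's production quadrature), the fourth-order Gregory rule replaces the three trapezoid-type weights at each end of the equispaced sum with the endpoint-corrected values 3/8, 7/6, 23/24, which follow from the Gregory coefficients. A minimal numpy sketch, tested on a smooth integrand:

```python
import numpy as np

def gregory_4(f_vals, dt):
    # fourth-order Gregory rule on an equispaced grid: unit weights, with
    # the first and last three replaced by the endpoint-corrected values
    # 3/8, 7/6, 23/24 (obtained from the Gregory coefficients)
    w = np.ones(len(f_vals))
    w[:3] = [3/8, 7/6, 23/24]
    w[-3:] = [23/24, 7/6, 3/8]
    return dt * np.dot(w, f_vals)

# test on int_0^1 e^x dx = e - 1, with 65 equispaced points
x = np.linspace(0.0, 1.0, 65)
dt = x[1] - x[0]
exact = np.e - 1.0
err_gregory = abs(gregory_4(np.exp(x), dt) - exact)

# plain trapezoid rule (weights 1/2, 1, ..., 1, 1/2) for comparison
w_trap = np.ones(65)
w_trap[0] = w_trap[-1] = 0.5
err_trap = abs(dt * np.dot(w_trap, np.exp(x)) - exact)
```

The endpoint corrections raise the order from two to four while leaving all interior weights equal to one, which is exactly the structure required by the fast history summation.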
Inserting (22) into (21) and rearranging to place all unknown quantities on the left hand side, we obtain (23), where we have defined the truncated endpoint-corrected history sum (24). This is a collection of d coupled nonlinear equations, which are closed by prescribing the functional dependence of k_j^n and f_j^n on y_1^n, …, y_d^n. Once these are solved, S_j^{n+1} must be computed by updating the value S_j^n using (24). In any given time step, the previous p − 1 computed values of f_j^n and S_j^n must be stored, along with the full history of y_j, for j = 1, …, d.
We can split the endpoint-corrected history sums S_j^n into two terms, S_j^n = ∆t s_j^n + c_j^n, with s_j^n a standard equispaced history sum and c_j^n a local correction. Thus, as for the simple discretization given in Section 2, we recognize the computation of the sums s_j^n as the quadratic-scaling bottleneck of the time stepping procedure, and we can apply the same fast algorithm.
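This splitting is a simple algebraic identity and is easy to check numerically. The sketch below uses stand-in random samples for the kernel and integrand, and the fourth-order Gregory endpoint weights 3/8, 7/6, 23/24 as the modified weights (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, dt, q = 40, 0.01, 3
k = rng.standard_normal(n + 1)   # stand-in kernel samples k_0, ..., k_n
f = rng.standard_normal(n + 1)   # stand-in integrand samples f_0, ..., f_n

# Gregory-corrected weights: unit weights, with q modified entries per end
mu = np.ones(n + 1)
mu[:q] = [3/8, 7/6, 23/24]
mu[-q:] = [23/24, 7/6, 3/8]

# endpoint-corrected history sum S = dt * sum_m mu_m k_{n-m} f_m
S_direct = dt * np.sum(mu * k[::-1] * f)

# equivalent splitting S = dt * s + c: a plain equispaced sum s, plus a
# local correction c involving only the q samples at each endpoint
s = np.sum(k[::-1] * f)
c = dt * np.sum((mu[:q] - 1) * (k[::-1] * f)[:q]) \
  + dt * np.sum((mu[-q:] - 1) * (k[::-1] * f)[-q:])
S_split = dt * s + c
```

Since the correction c touches only O(1) samples per time step, the expensive part is the uniform-weight sum s, which is precisely the object handled by the fast algorithm.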
We do not suggest a specific nonlinear iteration procedure to solve (23); in our numerical examples, we use a simple fixed point iteration. We do, however, propose using an Adams-Bashforth method to obtain an initial guess. The Adams-Bashforth method is derived in a similar manner to the Adams-Moulton method, but it is explicit. The pth order Adams-Bashforth discretization is given by (25). We again refer the reader to [62, Chap. 24] for a method of computing the Adams-Bashforth weights μ^{AB}, along with tabulated weights for methods of up to eighth order. In practice, using this guess can significantly reduce the number of nonlinear iterations required at each time step.
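The predictor-corrector pairing can be sketched on a scalar test problem. The example below (our own demonstration, not the paper's solver) uses the second-order Adams-Bashforth predictor followed by one trapezoidal Adams-Moulton correction, in PECE form, on y' = −y:

```python
import numpy as np

# Predictor-corrector demonstration on the test problem y' = -y, y(0) = 1:
# second-order Adams-Bashforth predictor, then one trapezoidal
# (second-order Adams-Moulton) correction per step.
def pece_solve(dt, T):
    n = int(round(T / dt))
    y = np.empty(n + 1)
    y[0] = 1.0
    y[1] = np.exp(-dt)   # seed the two-step method (exact, for the demo)
    f = lambda v: -v
    for m in range(1, n):
        pred = y[m] + dt * (1.5 * f(y[m]) - 0.5 * f(y[m - 1]))   # AB2
        y[m + 1] = y[m] + 0.5 * dt * (f(y[m]) + f(pred))         # AM2
    return y[-1]

err_coarse = abs(pece_solve(0.02, 1.0) - np.exp(-1.0))
err_fine = abs(pece_solve(0.01, 1.0) - np.exp(-1.0))
```

Halving the step size reduces the error by roughly a factor of four, confirming second-order convergence; for non-stiff problems a single correction per step typically suffices, consistent with the iteration counts reported above.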
To obtain a method of order p, it suffices to set q = p − 1. Then (23) is only applicable when n ≥ p − 2, and (25) when n ≥ p − 1. An alternative method is then required to obtain y_j^n, f_j^n, and S_j^n for n = 0, …, p − 1. A convenient approach is to use Richardson extrapolation on the second-order version of the method, the implicit trapezoidal rule, which does not require any initialization. The method, which requires p to be even, proceeds as follows: 1. Use the second-order method, (23) with p = 2, until a final time t = (p − 1)∆t, taking time steps of size ∆t, ∆t/2, ∆t/4, …, ∆t/2^{p/2−1}. Record the computed values of y_j^n, f_j^n, and S_j^n for each time step size.
2. Apply Richardson extrapolation to these values to obtain approximations of y_j(t), f_j(t), and S_j(t) at t = ∆t, 2∆t, …, (p − 1)∆t which are accurate to order p.
Here, we have used that the truncation error of the implicit trapezoidal rule contains only even-order terms.
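The even-power error expansion is what makes each extrapolation stage gain two orders rather than one. A minimal sketch (our own demonstration, applied to the terminal value of y' = −y, for which each implicit trapezoid step has the closed form y_{n+1} = y_n (1 − ∆t/2)/(1 + ∆t/2)):

```python
import numpy as np

# Implicit trapezoidal rule for y' = -y, y(0) = 1, solved to time T;
# each step has the closed form y_{n+1} = y_n * (1 - dt/2) / (1 + dt/2).
def trap(dt, T):
    n = int(round(T / dt))
    return ((1 - dt / 2) / (1 + dt / 2)) ** n

dt, T = 0.1, 1.0
T0 = [trap(dt / 2**i, T) for i in range(3)]   # step sizes dt, dt/2, dt/4

# the trapezoid error expands in even powers of dt, so each Richardson
# stage cancels one power of dt^2:
T1 = [(4 * T0[i + 1] - T0[i]) / 3 for i in range(2)]   # accurate to O(dt^4)
T2 = (16 * T1[1] - T1[0]) / 15                         # accurate to O(dt^6)

exact = np.exp(-1.0)
```

Two extrapolation stages turn three second-order solves into a sixth-order approximation, which is the mechanism used here to generate the startup values.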
For further details of the Richardson extrapolation procedure, we refer the reader to [69, Sec. 3.4.6].
Here, the top-left block, with bolded entries, is the original Toeplitz matrix A, which has been extended to a circulant matrix. The vector x is zero-padded as shown. Then, the result is truncated; here, the desired result is the vector b = (b_1, b_2, b_3, b_4)^T, and the dashes indicate entries that are ignored.
A triangular Toeplitz matrix is simply a special case of a square Toeplitz matrix; in the above example, we would take e = f = g = 0. Therefore the algorithm for this case is the same.
We could also treat the case of a Toeplitz matrix whose non-zero entries form a parallelogram as simply another special case. However, this requires forming a circulant matrix of approximately twice the necessary size. We illustrate a more efficient method using the example of a 4 × 7 parallelogram: here, the parallelogram has been embedded in the top half of a circulant matrix. No zero-padding of the input vector is necessary, but the result is still truncated to obtain b.

Figure 1 :
Figure 1: Illustration of the HLS algorithm to compute the history sums (3) on the fly with a given kernel k. Here, we have L = 3 levels. The lettered blocks correspond to entries which are incorporated into the partial sums by direct summation, and the numbered blocks are applied using a fast Toeplitz matrix-vector product. The blocks are processed in the order corresponding to their first row indices: a-1-b-2-c-3-d-4-e-5-f-6-g-7-h.

Figure 2 :
Figure 2: Modification of the block structure depicted in Figure 1 to account for kernels which are obtained on the fly. The blocks are again processed in the order corresponding to their first row indices: a-1-b-2-c-3-d-4-e-5-f-6-g-7-h. Blocks with the same label and subscripts 1 and 2, like 3_1 and 3_2, can be processed simultaneously.

9: Apply block j to y_{c_f^j : c_l^j} using the FFT-based algorithm, and add the result to s_{r_f^j : …}

Time-dependent correlation functions of quantum many-body systems can be treated by means of propagation on the generalized time contour C shown in Figure 4. This contour is comprised of three branches, representing the three directions of propagation used in calculating expectation values: a forward time propagation branch C_+, a backward time propagation branch C_−, and an imaginary time propagation branch C^M.

Figure 3 :
Figure 3: Illustration of the block partitioning of the kernel matrix for the full fast algorithm, with L = 5. The full partitioning is comprised of triangular units (blue) from the HLS algorithm (Figure 1), and square units (red), which account for the self-consistent nature of (1).

Figure 4 :
Figure 4: Generalized time contour for time-dependent correlation functions of quantum many-body systems. Here, t_max indicates a finite propagation time which can in principle be taken to ∞.

Figure 6 :
Figure 6: Numerical experiments with the Bethe lattice self-consistency Σ = G. (a) Convergence of the Adams-Moulton method as ∆t → 0. (b) Wall clock timings using direct and fast history summation with fixed ∆t and increasing propagation time.

Figure 7 :
Figure 7: Numerical experiments for the SYK model. (a) Asymptotic behavior of the retarded Green's function G^R(t) at different temperatures. The initial t^{-1/2} decay (inset) transitions into exponential decay at a rate proportional to the temperature, requiring large propagation times at low temperature. (b) Structure of the spectral density A(ω) at different temperatures. The spectral function becomes highly localized at low temperatures (left inset). In the β → ∞ limit, it approaches an |ω|^{-1/2} singularity at the origin (right inset).

B Fast application of square, triangular, and parallelogram-shaped Toeplitz blocks

We first review the standard fast algorithm to apply an n × n Toeplitz matrix A in O(n log n) operations [70, Sec. 4.7.7]. Let D be the n × n discrete Fourier transform matrix, D_{jk} = ω_n^{jk} with ω_n = e^{−2πi/n}, and let C be an n × n circulant matrix with first column c. Then D diagonalizes C: C = D^{−1} Φ D, where Φ is the diagonal matrix with entries Dc. Since D and D^{−1} can be applied to a vector in O(n log n) operations using the FFT, this gives an O(n log n) algorithm to compute a matrix-vector product b = Cx:

Algorithm 3 Fast matrix-vector product b = Cx for a circulant matrix C with first column c
1: Compute Dc by FFT
2: Compute Dx by FFT
3: Compute the entrywise product Dc ⊙ Dx
4: Compute b = D^{−1}(Dc ⊙ Dx) by inverse FFT

To obtain an O(n log n) algorithm to compute the product b = Ax with an n × n Toeplitz matrix A, we simply embed A in a 2n × 2n circulant matrix C, apply C to a zero-padded vector x using Algorithm 3, and truncate the result. An example for n = 4 illustrates the method:

\begin{pmatrix}
\mathbf{a} & \mathbf{e} & \mathbf{f} & \mathbf{g} & 0 & d & c & b \\
\mathbf{b} & \mathbf{a} & \mathbf{e} & \mathbf{f} & g & 0 & d & c \\
\mathbf{c} & \mathbf{b} & \mathbf{a} & \mathbf{e} & f & g & 0 & d \\
\mathbf{d} & \mathbf{c} & \mathbf{b} & \mathbf{a} & e & f & g & 0 \\
0 & d & c & b & a & e & f & g \\
g & 0 & d & c & b & a & e & f \\
f & g & 0 & d & c & b & a & e \\
e & f & g & 0 & d & c & b & a
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \\ - \\ - \\ - \\ - \end{pmatrix}
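These constructions are straightforward to sketch with numpy's FFT. The following illustrative implementation (function and variable names are ours) applies a circulant matrix, a general Toeplitz matrix via the 2n embedding, and a parallelogram-shaped Toeplitz block via the more economical same-size embedding, each in O(n log n):

```python
import numpy as np

def circulant_matvec(c, x):
    # Algorithm 3: b = Cx for the circulant matrix C with first column c,
    # using the diagonalization C = D^{-1} diag(Dc) D
    return np.fft.ifft(np.fft.fft(c) * np.fft.fft(x))

def toeplitz_matvec(col, row, x):
    # b = Ax for the n x n Toeplitz matrix A with first column col and
    # first row row (row[0] == col[0]): embed A in a 2n x 2n circulant,
    # zero-pad x, apply, and truncate
    n = len(col)
    c = np.concatenate([col, [0.0], row[:0:-1]])
    return circulant_matvec(c, np.concatenate([x, np.zeros(n)]))[:n]

def parallelogram_matvec(t, x):
    # b_i = sum_d t_d x_{i+d}: a p x n parallelogram-shaped Toeplitz block
    # with band t of width w (so p = n - w + 1), embedded in the top rows
    # of an n x n circulant; no zero-padding is needed, only truncation
    n, w = len(x), len(t)
    c = np.zeros(n, dtype=complex)
    c[0] = t[0]
    c[-(w - 1):] = t[:0:-1]
    return circulant_matvec(c, x)[:n - w + 1]
```

For the 4 × 7 parallelogram example above, `parallelogram_matvec` uses a length-7 FFT rather than the length-14 FFT a naive circulant embedding would require.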