Massively parallel simulations of strong electronic correlations: Realistic Coulomb vertex and multiplet effects

We discuss the efficient implementation of general impurity solvers for dynamical mean-field theory. We show that both Lanczos and quantum Monte Carlo in different flavors (Hirsch-Fye, continuous-time hybridization- and interaction-expansion) exhibit excellent scaling on massively parallel supercomputers. We apply these algorithms to simulate realistic model Hamiltonians including the full Coulomb vertex, crystal-field splitting, and spin-orbit interaction. We discuss how to remove the sign problem in the presence of non-diagonal crystal-field and hybridization matrices. We show how to extract the physically observable quantities from imaginary time data, in particular correlation functions and susceptibilities. Finally, we present benchmarks and applications for representative correlated systems.


Introduction
The generalized Hubbard model is given by where c † imσ creates an electron at site i with orbital quantum number m and spin σ, t ii mσ,m σ are the hopping integrals (i = i ) and the elements of the crystal-field matrices (i = i ), U i mm pp the local screened Coulomb tensor, and H dc the doublecounting correction. Assuming that the electron-electron interaction retains spherical symmetry, the Coulomb tensor for d electrons can be written in terms of three Slater integrals, F 0 , F 2 and F 4 . For t 2g or e g electrons only, the independent parameters reduce to two. In the basis of real harmonics, the essential terms Fig. 1. The self-consistency loop in the DFT+DMFT approach with a quantum Monte Carlo (QMC) impurity solver [8]. The method is also known as LDA + DMFT, as the localdensity approximation (LDA) is one of the most popular among the known approximations to the exact DFT exchange-correlation functional.
are the screened direct Coulomb integrals U mm mm = U − 2J(1 − δ mm ), the exchange (U mm m m = J), as well as the pair hopping (U mmm m = J) and spin-flip (U mmm m = J) term. These can be expressed as linear combinations of Slater integrals, with U = F 0 + 4 49 (F 2 + F 4 ) and J = 1 49 (3F 2 + 20 9 F 4 ) for t 2g states, while J = 1 49 (4F 2 + 5 3 F 4 ) for e g states [1,2]. The one-electron parameters of the Hamiltonian (1), i.e., hopping integrals, crystal-field splittings and spin-orbit couplings, can be obtained by building Wannier functions that span the correlated bands, either using the downfolding approach based on the N th-Order Muffin-Tin Orbital method (NMTO) [3,4] or the maximally-localized Wannier function technique [5]. The calculation of the screened Coulomb interaction would in principle require the full solution of the many-body problem, and can only be achieved via approximate approaches; popular schemes [6,7] are the constrained random-phase approximation (cRPA) or the constrained localdensity approximation (cLDA). The double-counting correction is also not known exactly; if only correlated electrons of the same kind are retained, it amounts to a shift of the chemical potential and can be neglected. In the more general case, approximations such as the around-mean-field or the fully-localized Ansatz are used [6,7].
In dynamical mean-field theory (DMFT), the generalized Hubbard Hamiltonian (1) is mapped onto a multi-orbital quantum-impurity problem, whose bath parameters are determined via the self-consistency criterium (see Fig. 1), which requires that the local lattice Green function equals the local impurity Green function [6,7]. The complexity of the quantum-impurity problem increases dramatically with the number of correlated sites and orbitals, the (lack of) symmetry of the Hamiltonian and the type of Coulomb terms included in the calculation, as well as the strength of crystalfield splittings and spin-orbit interaction. Depending on the approach used for solving the quantum impurity problem, the principal numerical bottleneck can be either the computational time (quantum Monte Carlo methods) or the memory (exact diagonalization or Lanczos). In the following we will review two particularly powerful general DMFT solvers, discuss their efficient implementation on massively parallel machines, and illustrate their abilities by presenting a number of paradigmatic applications.

The Lanczos approach
Using the Lanczos approach [9][10][11] we can find the ground state and spectral function of many-body Hamiltonians of the type (1), including the generalized quantum impurity problems at the core of the DMFT method [12]. The basic idea for numerically exactly calculating the ground state is strikingly simple: The vector (H − E[Ψ])|Ψ / Ψ|Ψ is the gradient of the energy expectation value E[Ψ] = Ψ|H|Ψ / Ψ|Ψ . Since the total energy functional E[Ψ] has no local minima, it can be minimized very efficiently by the method of steepest descent. Convergence is even more rapid when the minimization is performed over all successive gradient vectors simultaneously. For L steepest descent steps, starting from some arbitrary normalized state |v 0 , this corresponds to minimizing the energy functional on the (L + 1)-dimensional Krylov space K L (|v 0 ) = span(|v 0 , H|v 0 , . . . , H L |v 0 ). The Lanczos algorithm iteratively constructs an orthonormal basis of Krylov spaces of increasing dimension L using the recursion b n+1 |v n+1 = H|v n − a n |v n − b n |v n−1 , where a n = v n |H|v n while b n+1 is determined by the normalization of the new basis vector. The starting vector |v 0 must be normalized and we define |v −1 = 0. Rearranging (2) it immediately follows that the Hamiltonian is a tri-diagonal matrix H(L) in the L + 1-dimensional Lanczos basis: H|v n = b n |v n−1 + a n |v n + b n+1 |v n+1 .
With increasing dimension of the Krylov space the lowest eigenvalueĚ 0 (L) of H(L) very rapidly approaches the ground state energy of H and the variational state |Ψ 0 (L) = L n=0 u 0,n (L)|v n constructed from the lowest eigenvector u 0 (L) of the tridiagonal matrix H(L) gives, for increasing L, an excellent approximation to the ground state |Ψ 0 of H.
The Krylov space approach also provides rapidly converging approximations to spectral functions by simply replacing the full spectrum in the Lehmann representation by the L + 1 eigenstates of H in K L (|Ψ c ): Here the Krylov space is constructed starting from the (normalized!) state |Ψ c . To see how the approximate spectral functionǍ(ω) approaches the true spectral function A(ω ± iη) = ∓ G(ω ± iη)/π, we consider the m-th moment ofǍ (4) SinceȞ is the projection of the full Hamiltonian H onto K L (|Ψ c ), it follows by the construction of the Krylov space thatȞ m |Ψ c = H m |Ψ c for all m ≤ L. Thus the 2528 The European Physical Journal Special Topics approximate spectral functionǍ(ω) correctly reproduces the moments 0, . . . , 2L of the true spectral function.
The implementation of the Lanczos method for the Hubbard model is straightforward. The occupation number representation |n 0σ , n 1σ , . . . for electrons with spin σ is interpreted as the bit representation of an integer i σ = j n jσ 2 j so that a many-body basis state can be characterized by a pair of integers The hopping term can only connect basis states that differ in at most one electron. The simple density-density Coulomb term is diagonal and easily evaluated by counting the set bits in (i ↑ & i ↓ ). Counting bits is most efficiently done using the popcnt machine instruction. The full Coulomb term connects states that differ in at most two electrons. This means that even in the case of long-ranged hopping and Coulomb repulsion, for large systems, the manybody Hamiltonian will be extremely sparse so that the matrix-vector products H|v n that are at the core of the Lanczos approach can be calculated efficiently. The Lanczos approach has a number of striking advantages: (i) it is conceptually simple and easily implemented, (ii) it quickly converges to the ground state energy and gives a good approximation of the ground state vector, and (iii) it allows to accurately calculate spectral functions directly on the real axis. The main limitation is caused by the need to handle many-body vectors: For a given electron density the dimension of the Hilbert space increases exponentially with system size, limiting Lanczos calculations to quite small systems. For half-filling, e.g., a single many-body vector of a system with 12 orbitals requires just 6 MBytes of memory, for 18 orbitals already 17 GBytes are needed, while for 24 orbitals it takes 53 TBytes.
In practice, Lanczos calculations are thus limited by the amount of memory that can be efficiently accessed. The largest memories are available on modern supercomputers. This is the motivation for porting the Lanczos algorithm to massively parallel machines. For shared-memory systems this can be done quite easily: When calculating the matrix-vector product H|v n in (2), different threads can work on different sections of the resulting vector independently. The off-diagonal elements of the Hamiltonian (5) lead to non-local memory access of the many-body vector |v n , but the vector elements are only read, so that there is no need for locking. An OpenMP parallelization thus only requires a single pragma for parallelizing the sparse matrixvector product. Parallelizing the scalar products required for obtaining the Lanczos parameters a n and b n in a similar way, we obtain almost ideal speed-up on a wide range of shared-memory systems. Unfortunately, such systems are limited to between a few hundred to a thousand GBytes of memory, restricting calculations to at most 20 orbitals at half-filling. To go to even larger systems we need to find an efficient distributed-memory implementation. A naive approach uses the MPI-2 extensions to the Message Passing Interface which provide one-sided communication to emulate the shared-memory strategy of directly accessing remote memory. In practical implementations this leads, however, to a severe speed-down, i.e., the more CPUs we use, the longer we have to wait for the result.
An efficient distributed-memory implementation [13] is instead based on the fact that hopping does not change spin. Hopping of the up-electron mixes only different ( up-hopping configurations, while the down-electron configuration remains unchanged. If we group all up configurations for a fixed down configuration together in a single thread, this hopping can be carried out locally. Figure 2 illustrates how the amplitudes of the 36 basis states |i ↑ , i ↓ of a half-filled 4-orbital system are distributed over three nodes. Since all states with a fixed configuration of down-spin electrons i ↓ are stored on a single node, the terms involving hopping of an up-spin electron can be done in local memory. For the hopping of the down-spin electrons this is not true. Here we would need to access amplitudes stored on different nodes. If we, however, exchange the indices i ↑ ↔ i ↓ , i.e., perform a transpose operation as indicated in Figure 2, the terms involving the hopping of a down-spin electron can be performed in local memory. Transposing again restores the original storage arrangement. We use MPI Alltoall to implement an efficient matrix transpose. This routine sends data between all threads of the code and expects the data packages that will be sent to a given thread to be stored contiguously in memory. As shown in Figure 2 this is not quite the case since we actually store the down-spin amplitudes sequentially, which is equivalent to a matrix that is stored column-wise. For MPI Alltoall to work properly, we would have to bring the data elements into row-major order. This could be done by performing a local matrix transpose. The involved matrices are, however, in general rectangular, leading to expensive local-copy and reordering operations. We can avoid this by calling MPI Alltoall for each column separately as indicated by the red arrows in Figure 2. After this, only a local strided transposition has to be performed (small white arrows) to obtain the fully transposed Lanczos vector. The implementation described so far uses MPI Alltoall which assumes that the matrix to be transposed is square and that the dimension dim ↑ = dim ↓ is divisible by the number of MPI processes. To overcome these restrictions we have generalized the algorithm to use MPI Alltoallv, which allows the size of the data packages to depend on the thread. This is the implementation that is used in practice. 16 ( 8,8) 18 ( 9,9) 20(10,10) 22 (11,11) 22 (11,11) 24 (10,10) 1e-07   (9,9), 254 GBytes for 20(10,10), 3.8 TBytes for 22 (11,11), and 28 TBytes for 24 (10,10). Despite massive communication in each Lanczos iteration, the code shows excellent speed up. Only when the message size per process becomes too small, performance degrades due to network latency. This is shown in the lower plot. Properly scaling the execution times we obtain a universal scaling (ParLaw) for system sizes ranging over more than five and process counts ranging over three orders of magnitude.
Despite the massive communication that is involved in sending the entire manybody vector across all the nodes, our approach shows an impressive performance on massively parallel machines as can be seen in the upper panel of Figure 3: The speedup for a given problem size is nearly ideal until the node count becomes so large that the individual messages that are sent become very small. In that regime the performance is limited not by the bandwidth of the interconnect but by its latency. This is shown in the lower panel of Figure 3: When the number of processes #MPI proc is much larger than the dimension of one spin-configuration dim σ ≈ dim(H) the time per iteration decreases linearly with message size. In this regime the speedup measures the bandwidth of the network. When the number of processes keeps increasing, the speedup is no longer linear, as for the resulting shorter and shorter messages the network latency becomes more and more important. Figure 3 shows that timings of problems with dimensions varying over more than five and processor counts varying over more than three orders of magnitude show universal scaling. Here the size of a single many-body vector varies from below 90 MBytes (14 sites with N ↑ = 7 = N ↓ ) to about 28 TBytes (24 sites with N ↑ = 10 = N ↓ ) while the number of nodes varies between 64 and 65 536. This allows us to predict the run-time of our code on even larger problem sizes or processor counts.

General quantum Monte Carlo solvers
Quantum Monte Carlo methods are particularly well suited as impurity solvers for realistic applications of the DFT + DMFT approach. Thus, we have implemented general QMC solvers of different flavors: Hirsch-Fye QMC [14] and continuous-time hybridization-expansion (CT-HYB QMC) as well as continuous-time interactionexpansion (CT-INT QMC) [15][16][17][18][19]. These three approaches all have some advantages and some disadvantages and complement each other. Here we will illustrate in particular the generalized CT-HYB solver [17]. To this end, we first rewrite the quantum-impurity Hamiltonian as the sum of three terms, the local Hamiltonian, the bath Hamiltonian and the hybridization, H = H loc + H bath + H hyb . The first two terms can be collected in Here α = mσ labels the flavors, i.e., spin and orbital degrees of freedom and γ is the basis which diagonalizes the Hamiltonian describing the bath, H bath . The term ε αᾱ = ε αᾱ −Δε DC αᾱ is the sum of the crystal-field matrix ε αᾱ , and of the system-specific double-counting correction, Δε DC αᾱ . The next step is the expansion of the partition function in powers of H I . In the interaction picture H I (τ ) = e τH0 H I e −τH0 with β = 1/k B T this yields the series where T is the time ordering operator, such that τ = (τ 1 , τ 2 , . . . τ m ) with τ i+1 ≥ τ i and the integral is given by Only even expansion orders m = 2n (i.e., terms containing an equal number n of creation and annihilation operators both in the bath and impurity sector) contribute to Z. For a specific order m, we define the n-dimensional flavor vectors α = (α 1 , α 2 . . . α n ) andᾱ = (ᾱ 1 ,ᾱ 2 . . .ᾱ n ), where α i (ᾱ i ) are the flavors of the n impurity annihilation (creation) operators at imaginary times τ i (τ i ). The partition function can then be written as where Z bath = Tr e −βH bath . The term t (n) α,ᾱ (τ ,τ ) is the trace over the impurity states

The European Physical Journal Special Topics
Here c and N is the total number of electrons on the impurity. The second factor is the trace over bath states, which is the determinant of the n × n square hybridization-function matrix where ω n is a fermionic Matsubara frequency, is related to the inverse of the bath Green function G(ω n ) by In order to speed up the calculations, we exploit symmetries, splitting the sums in decoupled blocks. Furthermore, for calculating the local trace we adopt a multiapproach scheme. For local Hamiltonians that conserve the flavors, we use the fast segment approach [15]. In this case the partition function can be expressed, for each flavor a, in expansion orders n a Here τ = Na a=1 τ a and α = Na a=1 α a ; the same kind of definition holds forτ andᾱ. The local trace takes then the simple form where l aa is the length of the overlap of the τ segments a and a , s a = sgn(τ a1 −τ a1 ) is the fermionic sign. In all other cases, i.e., when the local Hamiltonian mixes flavors, we adopt the Krylov method [20]. In this approach, at the beginning of the DMFT loop we compute all the eigenstates of H loc , {|Ψ n }, and their energies {E n }. To calculate the trace at a certain order, a given state |Ψ n is initially propagated with e −τ1En ; the first creation or annihilation operator met generates a new state |Ψ , which we propagate via the Krylov approach obtaining |Ψ(τ 2 − τ 1 ) ; the procedure continues until the last creation or annihilation operator in the local trace is met. The calculation is performed in the occupation number basis, so that all operators appearing in the calculations are represented by sparse matrices. To exploit to the maximum efficient sparse-matrix multiplication algorithms, we collect the states according to the symmetries of the local Hamiltonian. Figure 4 shows the convergence of the Krylov procedure with the number of Lanczos steps r. The propagation and creation/annihilation is carried out from both the left and the right side of the trace; this minimizes the work-load to measure the Green function. When possible -far from phase transitions -we use the eigenvalues of H loc to adaptively truncate the outer bracket of the trace. Figure 5 shows in representative cases the performance of our CT-HYB QMC solver on the massively parallel Jülich BlueGene/Q in comparison with the performance of the Hirsch-Fye QMC solver on the same machine.  Concerning the minus sign problem, we find that it appears mostly when offdiagonal crystal-field terms are present. Remarkably, we find that the sign problem is strongly suppressed rotating the one-electron basis to crystal-field states, even when the off-diagonal terms of the hybridization function matrix remain of comparable size before and after the rotation [21]. We use this basis transformation to study, e.g., the low-temperature ferromagnetic transition of the low-symmetry t 1 2g system YTiO 3 ,

2534
The European Physical Journal Special Topics one of the few ferromagnetic Mott insulators. Figure 6 shows that the theoretical critical temperature is about 50 K, in very good agreement with experiments [22], which yield T C ∼ 30 K. For systems in which spin-orbit is included, we find that the sign problem can be tamed by a unitary transformation to a basis in which the local Green function is real. The specific form of the transformation will be discussed later, in the subsection about the Fermi surface of Sr 2 RuO 4 [19]; the precise form of the transformation depends on the system as can be seen by comparing to the case of Ca 2 RuO 4 [23].

Generalized linear-response functions
In addition to single-particle Green functions, in order to compare with experiments, linear-response functions are crucial. The local susceptibility tensor χ α (τ ) can be defined as follows [8] . We define τ = (τ 1 , τ 2 , τ 3 , τ 4 ) and α = (α 1 , α 2 , α 3 , α 4 ) where the α j are flavors. The Fourier transform to Matsubara frequencies yields χ α (ν), where ν = (ν n , −ν n − ω m , ν n + ω m , −ν n ), ν n and ν n are fermionic and ω m bosonic Matsubara frequencies. The local tensor elements build a square matrix [χ(ω m )] N,N with elements N = α 1 n, α 2 n, N = α 3 n , α 4 n . The local matrix χ(ω m ), calculated via QMC, does not couple quantum-impurity blocks, which have dimension N c . In the local-vertex approximation [24] the lattice susceptibility χ(q; ω m ) is given by the solution of the Bethe-Salpeter equation The equation is illustrated diagrammatically as a Dyson-like series in Figure 7. The terms χ(q; ω m ), χ 0 (q; ω m ), and Γ(ω m ) are all N × N matrices; the elements of the matrix χ 0 (q; ω m ) can be written as where G k α i αj (ν n ) is the DMFT (or, depending on the case, cellular DMFT) lattice Green function. The local vertex matrix Γ(ω m ) can be obtained by solving the local Bethe-Salpeter equation within a quantum-impurity block with By replacing the vertex tensor in equation (8) we obtain the lattice susceptibility tensor [χ(q; ω m )] N,N . The most time-consuming part is the computation, via QMC, of the local susceptibility tensor; to speed up the calculations we again use symmetries and have optimized our code for modern massively parallel architectures. Finally, we use the Filon-trapezoid method to minimize the time required to perform the Fourier transform to Matsubara space as well as an extrapolation scheme to sum up Matsubara frequencies. This approach is implemented in our generalized HF-QMC solver [25].

Applications
In this section we present a selection of representative applications that highlight the capabilities of our generalized impurity solvers. First we analyze the mechanism of orbital order in correlated insulators, comparing the strength of Kugel-Khomskii super-exchange and of mechanisms that are driven by lattice distortions. Next we turn to correlated metals, analyzing the interplay of crystal-field splitting, spin-orbit coupling, and Coulomb correlations (and in particular its low-symmetry terms) on the Fermi surface of Sr 2 RuO 4 . Finally we study the degree of frustration in twodimensional layered vanadates via the calculation of the magnetic response function.

The origin of orbital ordering
Orbital ordering phenomena are very common in Mott insulators with orbital degrees of freedom and they are believed to play a crucial role in determining their magnetic and electronic properties. Despite of the importance of the phenomenon, the origin of orbital order has been debated for decades. Two rather different mechanisms have been proposed to explain it. The first is based on the Jahn-Teller theorem, and is well presented in a seminal work of Kanamori [27]. In this view, in the presence of orbital degeneracy, electron-phonon coupling yields a Jahn-Teller distortion, and thus gives rise to an ordered pattern of occupied or empty orbital states in localized systems. Recently, it has been shown via DFT+DMFT that Coulomb repulsion strongly enhances orbital-order in the presence of a small crystal-field splitting [3,4], so that even a correspondingly small static distortion can lead to orbital order at rather high temperatures. The second mechanism, due to Kugel and Khomskii [28], takes a quite different perspective. In this view the orbitally ordered state forms because of multiorbital super-exchange, a purely electronic many-body effect, after which the systems undergoes a Jahn-Teller-like distortion as a consequence of the electronic order. Since the two mechanisms yield often the same final state, it is difficult to say, based on experiments only, which mechanism drives the transition [29]. The two paradigmatic and most studied compounds are the e g systems KCuF 3 (t 6 2g e 3 g ) and LaMnO 3 (t 4 2g e 1 g ). For both materials, DFT + U calculations have shown that the Coulomb interaction stabilizes the Jahn-Teller distortion [30,31]. This lead early-on to the conclusion that the Kugel-Khomskii mechanism drives the transition [30]. The DFT+U approximation is, however, based on a Hartree-Fock treatment of Coulomb effects, and yields an electronic gap only at the price of introducing longrange magnetic order, while in the real materials orbital order is retained well above the magnetic phase. One could thus conjecture that the results obtained via DFT + U are a spurious effect of the approximation. They were, however, recently confirmed via finite-temperature DFT+DMFT calculations for the paramagnetic phase [32], which yield basically the same energies and ground state geometries as DFT+U , indicating that the Hartree-Fock approximation gives reliable results for these properties. This somewhat unexpected result, as will become clear below, is due to the smallness of the  [21]. We find that T KK ∼ |2ΔE|/kB, where TKK is the transition temperature determined from p(T ).
energy gained by Kugel-Khomskii orbital polarization, which for the case of LaMnO 3 is shown in Figure 9.
Although these studies show that the Coulombic electron-electron repulsion plays a key role, they did not really clarify what the main mechanism is. Coulomb repulsion could, as a matter of fact, help to stabilize the Jahn-Teller structure for other reasons than the Kugel-Khomskii super-exchange mechanism. Furthermore, the fact that the magnetic ordering temperature is small, e.g., 40 K in KCuF 3 , while orbital order persists up to the melting temperature, makes it difficult to explain both phenomena via super-exchange interaction. In order to clarify this problem, we devised a scheme to separate the effects of Kugel-Khomskii super-exchange from the rest. We achieved this via the construction of a series of idealized structures in which the Jahn-Teller distortion is progressively reduced to zero. In the undistorted limit, we calculate the order parameter p (the orbital polarization), as a function of temperature and thus determine the critical temperature T KK for the pure Kugel-Khomskii mechanism (see, e.g., Figs. 9 and 10). With this approach, we find that T KK is 350 K [26] in KCuF 3 and T KK ∼ 600 − 700 K [33] in LaMnO 3 ; the results for the latter are presented in Figure 10. This shows that Kugel-Khomskii super-exchange is remarkably strong, but not sufficient to induce an orbital-ordered phase which persists to the melting temperature, as reported experimentally. It also shows that Kugel-Khomskii superexchange plays a larger role in LaMnO 3 than in KCuF 3 . The case of LaMnO 3 is more complicated than the one of KCuF 3 , and this has two reasons. First, the t 2g states are half-filled, and interact with the e g states via the Coulomb interaction; however, since the t 2g states are much lower in energy than the e g , they can be treated in first approximation as local effective spins S = 3/2 interacting with e g electrons, so that one can show that an e g model with renormalized hoppings and an effective magnetic field [34] is sufficient to describe the low-energy properties. Second, apart from the Jahn-Teller distortion, other distortions are present (tilting and rotation of the octahedra and orthorhombic distortions), which lead to a bandwidth reduction and an additional crystal field, whose symmetry differs from the one of the Jahn-Teller mode [3,4,33]. In order to account for those distortions and their effect, we performed calculations for idealized structures in which only the Jahn-Teller interaction is progressively removed. These structures yield, as discussed, a crystal-field splitting of e g states, hence an orbital polarization is present at any temperature. The onset of the super-exchange regime, and thus the super-exchange critical temperature T KK , can in this case be determined only by the change in the occupied orbital character which occurs when decreasing the temperature; we find that, indeed, at T KK , the occupied orbital rotates towards the state predicted by the super-exchange interaction. This can be seen in the right panel of Figure 10.
Remarkably, in the case of LaMnO 3 , while Jahn-Teller distortions survive in nanoclusters up to 1000 K or more [35], an order-to-disorder distortion is observed at T OO ∼ 750 K [36]; our calculated T KK is very close to T OO . Could it be that superexchange is driving such a transition? To answer this question we calculated T KK for representative elements of the REMnO 3 series (rare earths RE = La, Nd, Tb, Dy), systems for which T OO has been measured [37][38][39][40]. The results are shown in Figure 11. The figure shows that the super-exchange transition temperature T KK is basically independent of the rare-earth ion, while T OO strongly increases when the rare-earth radius decreases. This lead us to the conclusion that super-exchange alone cannot explain the observed trends in T OO and thus the fact that T OO ∼ T KK in LaMnO 3 is accidental. This study was based on an e g Hubbard model in which the t 2g electrons were treated as local effective spins S = 3/2. More recently, thanks to our generalized CT-HYB QMC solver, we confirmed these conclusions using the full 5-band Hubbard model [17].
If super-exchange is not the main mechanism, is it simply Kanamori's Jahn-Teller mechanism? To find an answer to this question we re-analyzed the case of KCuF 3 [41], for which, against all expectations, orbital ordering persists to melting. Actually, the transition to a symmetric phase is not simply absent, but the Jahn-Teller distortion parameter, in fact, increases with temperature. Applying hydrostatic pressure, Fig. 11. Super-exchange transition temperature TKK for representative elements of the ReMnO 3 series. Also shown (crosses) are experimental orbital-order to orbital-disorder transition temperatures, T OO [21]. For LaMnO3 circles of decreasing size are theoretical results for increasing pressures, from ambient pressure to 9.87 GPa.
instead, reduces the order parameter, while chemical pressure, replacing K with the larger ions Rb or NH 4 increases it. This shows that the structural change is hardly affected by thermal effects, instead the lattice constant plays the dominant role. The increase of the order parameter with temperature thus appears to be a consequence of thermal expansion. The two established mechanisms, Kugel-Khomskii super-exchange and the Jahn-Teller mechanism cannot explain this behavior. Super-exchange decreases with distance and for the Jahn-Teller mechanism we would have to assume a completely unrealistic softening of the distortion mode with lattice constant. Instead, we find a new mechanism for ionic Mott insulators, where the energy gain from the distortion is driven by the crystal-field splitting and the Ewald energy of the shifted ions while it is opposed by the Born-Mayer repulsion of the electron clouds of the F and Cu ions. Such a model can accurately reproduce the energy curves calculated within DFT+U , which describe the experimental structures very well. Since the Born-Mayer potential increases rapidly at short distances, we find that the short Cu-F distance is nearly constant, independently of temperature and hydrostatic or chemical pressure. Moreover, the Cu-F repulsion is essentially a property of the touching ions, which explains why an almost constant short Cu-F distance is found in a whole variety of compounds with copper ions surrounded by fluorine octahedra, but of otherwise widely varying composition. We note that the structures calculated in DFT+U are fairly robust under changes of U as long as the parameter is large enough to open a gap. The essential effect of increasing the Coulomb term is to slightly increase the size of the Cu 2+ ion [42]. Our new mechanism in which only the Born-Mayer repulsion stabilizes the undistorted phase suggests the following intriguing scenario: Since the contribution of the Born-Mayer term at the undistorted position decreases exponentially with the distance of the ions, even for non-Jahn-Teller-active ions the Ewald energy can drive a distortion once the lattice constant becomes large enough. The result would be an inverted phase transition: no distortion at low temperature, while thermal expansion would induce a distortion above some critical temperature.

The fermi surface of Sr 2 RuO 4 : spin-orbit and Coulomb anisotropy effects
The single-layered ruthenate Sr 2 RuO 4 (t 4 2g electronic configuration) exhibits an amazing range of exotic properties, among which are low-temperature p-wave superconductivity [43][44][45][46], Hund's metal physics [47] and other anomalous behavior [48,49], as well as a peculiar Mott metal-insulator transition after Sr→Ca substitution [50]. These surprising effects are believed to stem from the atypical low-energy electronic structure of this material, made of a quasi-two-dimensional xy band and two quasione-dimensional xz/yz bands. In order to understand the low-energy properties of Sr 2 RuO 4 , the description of its Fermi surface (FS) is crucial. It is thus not a surprise that the latter has been investigated intensively, both experimentally and theoretically. Angle-resolved-photoemission spectroscopy (ARPES) measurements [51] revealed early-on the presence of three Fermi sheets, the so-called γ (xy), and α and β (xz, yz) sheets. First principles calculations based on DFT in the local-density approximation (LDA) qualitatively reproduce the experimental FS, provided that the spinorbit (SO) interaction is taken into account [52,53]. They fail, however, in describing the relative size of the three sheets, suggesting that many-body effects could play an important role. In order to identify the actual part of the electron-electron interaction, we recently analyzed the problem by means of the DFT + DMFT approach, accounting for spin-orbit interaction and a realistic Coulomb tensor. First, we performed LDA and LDA + SO calculations using the full-potential linearized augmented plane-wave method as implemented in the WIEN2K [54] code. We obtained localized t 2g Wannier functions via the Marzari-Vanderbilt localization procedure [5,55,56] and using symmetries. Because of the D 4h symmetry at the Ru site, the Wannier t 2g states split into an e g doublet (xz, yz) and a b 2g singlet xy. The tetragonal crystal-field splitting, ε CF = ε xz − ε xy , is 120 meV in LDA. We find that the LDA spin-orbit coupling is basically isotropic, with the value λ xy ∼ λ z ∼ 100 meV. For what concerns the screened Coulomb tensor, it is typically assumed that the symmetry is O(3), since the bare electron-electron interaction is spherical; in this commonly adopted approximation all terms relevant for the t 2g states can be expressed as a function of two parameters, U and J, as discussed in the introduction. Once constructed, we solve the materialspecific t 2g Hubbard model (1) with DMFT via our CT-INT quantum Monte Carlo quantum-impurity solver [18,19]. In the case of Sr 2 RuO 4 the self-energy matrix in spin-orbital space is a 6 × 6 self-energy matrix. We performed the calculations with spin-orbit coupling in the basis |m σ =T |m σ , where the unitary operatorT is chosen such that the local imaginary-time Green function matrix is real. For Sr 2 RuO 4 this transformation merely amounts to an extra (−1) σ π phase (where σ = 1 for spin up and σ = −1 for spin down) for the |xz σ orbital.
The reshaping of the Fermi surface due to electron-electron correlation arises, for a Fermi-liquid in the zero-temperature limit, from the change in the elements of the onsite Hamiltonian via the real-part of the self-energy matrix. For Sr 2 RuO 4 in particular, the following parameters are relevant ε CF → ε CF + Δε CF , λ xy → λ xy + Δλ xy , λ z → λ z + Δλ z .
We calculated them with LDA + DMFT extrapolating to the T → 0 limit. Figure 12 summarizes the results. It shows (panels a and b) the Fermi surface obtained in LDA without (a) and with (b) SO interaction as well as the corresponding LDA+DMFT results (panels c and d). The right side of the figure shows LDA+DMFT calculations for (U, J) = (2.3, 0.4) eV, the constrained random-phase-approximation (cRPA) estimate [57], while the left side exhibits calculations for (U, J) = (3.1, 0.7) eV, the constrained LDA estimate [58]. ARPES data from reference [51] are also shown (grey map) for comparison. The figure shows that LDA describes well the α and γ sheets; however, the β and γ sheets cross, in contrast to ARPES, and the area enclosed by the β sheet is larger in LDA than in ARPES. Accounting for the spin-orbit interaction removes the crossing and strongly improves the agreement, but the β sheet remains too large with respect to experiments. These results are in line with previous LDA and LDA+SO calculations [52,53]. Remarkably, further including the O(3) Coulomb interaction does not really improve the agreement. The LDA + DMFT Fermi surface  [19]. Light lines: α and β sheets. Dark lines: γ sheet. Grey density maps: experimental data taken from reference [51].
deviates from ARPES in particular around the M point (γ sheet), which approaches the boundary of the first Brillouin zone. In the LDA + SO + DMFT calculation, with respect to LDA+DMFT results, the agreement worsens for the γ sheet and it improves for the α and β sheets. These results indicate that an important mechanism is missing.
We have identified it in the D 4h -symmetric Coulomb terms, which account for the actual site symmetry of the Coulomb tensor. More specifically, important Coulomb parameters are ΔU = U xy,xy − U xz,xz and ΔU = U xy,yz − U xz,yz . Due to the elongation of the RuO bond in the c direction, the e g (xz, yz) Wannier orbitals have a larger spread than the xy orbital, suggesting positive ΔU and ΔU . This is in line with the results of cRPA calculations [57], ΔU ∼ 0.3 eV. Coulomb low-symmetry terms yield, differently from the O(3) terms, an orbital-dependent double-counting correction H dc . For the latter we adopt the around-mean-field approximation, a typical choice for correlated metals, which yields H dc = n 6 [ΔU + 2ΔU ] σn σ xy , where n = 4 is the number of t 2g electrons. By performing a set of LDA + DMFT and LDA + SO + DMFT results for several values in the range 0 < ΔU < 0.45 eV, we could estimate a realistic range of on-site parameter enhancements, with Δε CF in the interval [-0.08,-0.02] eV, Δλ xy and Δλ z in the intervals [0.10,0.16] eV and [0.04,0.08] eV. In this range the theoretical Fermi surface is in good agreement with experiments, as shown in a representative case in Figure 13.
To summarize, we have shown that including the effects of the sphericallysymmetric (point group (O(3)) Coulomb interaction alone does not improve the agreement between theoretical and experimental Fermi surface. It is essential to take also into account both the spin-orbit interaction and D 4h -symmetric terms of the Coulomb tensor. The O(3) terms enhance the crystal-field splitting and the spin-orbit coupling. The Coulomb-enhanced spin-orbit coupling shrinks the β sheet and extends the γ sheet. The D 4h term ΔU reduces the Coulomb crystal-field enhancement. Similar effects are likely to occur in many other multi-orbital correlated metals. For Sr 2 RuO 4 , our results support the recent suggestions of strong spin-orbital entanglement for Cooper pairs [59]. The full Coulomb vertex is important not only for strongly correlated metals but also for insulators, as we have shown, e.g., in the case of spin-state transition cobaltates [60]. Coulomb vertex, T → 0 limit. Grey density maps: experimental data taken from reference [51]. Shown is a representative calculation corresponding to ΔU = 3ΔU = 0.3 eV (parameters: Δε CF ∼ −0.02 eV , Δλxy ∼ 0.13 eV, Δλz ∼ 0.08 eV) [19].

Are layered vanadates 2-dimensional frustrated systems?
The layered vanadates Li 2 VOSiO 4 and VOMoO 4 (see Fig. 14) have been proposed as possible realizations of the two-dimensional J 1 -J 2 quantum Heisenberg model [61][62][63][64]. The analysis of early experiments supported a strong frustration picture (with J 1 ∼ J 2 ). Nuclear magnetic resonance (NMR), muon-spin rotation and thermodynamic measurements lead to the estimates J 1 + J 2 ∼ 8.5 K and J 2 /J 1 ∼ 1.1 [62,63]. Similar conclusions, although with sizably larger couplings, J 1 + J 2 ∼ 155 K and J 1 ∼ J 2 , were reached for VOMoO 4 [64]. Furthermore, in both systems, small saturated magnetic moments have been inferred from thermodynamic data [62][63][64], a typical signature of strong frustration [65]. Remarkably, a very different picture emerged from ab-initio studies based on density-functional theory [64,66,67], which placed both systems in the weakly frustrated regime, with J 2 /J 1 ∼ 12 for Li 2 VOSiO 4 (collinear regime), and J 2 /J 1 ∼ 0.2 for VOMoO 4 (Néel antiferromagnetic regime). Later-on, a high-temperature expansion study of the J 1 -J 2 Heisenberg model pointed out that the experimental specific heat and susceptibility of Li 2 VOSiO 4 from references [62,63] could be also compatible with large J 2 /J 1 values obtained in DFT calculations [68]. More recently, neutron diffraction and resonant X-ray scattering experiments have reported magnetic order in three dimensions for both systems [69,70]. These theoretical and experimental results, taken together, suggest a weak frustration scenario.
This conclusion strongly relies, however, on the DFT estimate of the hopping integrals and the associated magnetic exchange couplings of the J 1 -J 2 Heisenberg model. It becomes therefore crucial to put the latter to a test. Recently we have addressed this issue via the DFT+DMFT method, calculating the effective spin model via linear response theory. To this end, the minimal material-specific many-body model is the half-filled one-band Hubbard model describing the xy low-energy states. By using linear response-function theory on top of DFT+DMFT calculations we first calculate the local correlation function, which yields the effective moments, showing that charge-fluctuations are negligible and that the layered vanadates behave indeed as a set of local S = 1/2 spins for all realistic U values. Next we calculated the magnetic susceptibility in the high-temperature (T T N ) regime, showing that it scales as μ 2 /(T − J(q)/μ 2 ), where μ is the effective magnetic moment with μ 2 = S(S + 1)/3. This enables us to extract the actual effective super-exchange coupling J(q), including non-trivial many-body effects, and construct the effective super-exchange magnetic model. As a final result, we find that for both systems considered J(q) ∼ 4J 1 cos q x 2 cos q y 2 1+2 J 1z J 1 cos q z + J 1z J 1 2 1/2 + 2J 2 (cos q x + cos q y ) + 2J z cos q z + 4J 2z (cos q x + cos q y ) cos q z + . . . .
Longer range terms are small and can be neglected. The Fourier transform of J(q) to real space yields the couplings of a generalized quantum Heisenberg model. The magnetic response function for VOMoO 4 is shown in Figure 15. Our calculations [25] support a weak frustration picture for both vanadates, with a small but non-negligible degree of three-dimensionality, which fully account for the three-dimensional magnetic order reported in experiments. Although weak, the frustration is large enough to explain the partial reduction of the ordered magnetic moments measured via neutron scattering experiments [69,70].

Conclusions
We have shown that massively parallel algorithms in combination with a multi-solver technique are key for solving complex many-body problems in real materials. This approach is then used to tackle problems that would be otherwise out of reach. Our generalized Lanczos solver is ideal for the very low-temperature regime [12,13], and The special points in the q x, qy plane are Γ1 = (2π, 0), X= (π, 0) and M= (π, π). allows us to reliably treat large systems. For intermediate and high temperatures, combining our three quantum Monte Carlo solvers [17,19] allows us to study problems which involve general Coulomb interaction with crystal-field matrices and spin-orbit coupling, cluster DMFT, and linear response-function calculations. We have presented results for paradigmatic applications which showcase the advantages of our massivelyparallel multi-solver approach. First we address the origin of orbital order, a problem which has been debated for decades. To solve it required calculations involving 2 to 5 orbitals, full Coulomb vertex and crystal-field matrix, and 2-4 sites. Our calculations [17,21,26,33] show that the Kugel-Khomskii super-exchange alone, although very strong, cannot explain orbital order in the two textbook examples of orbitally ordered materials, LaMnO 3 and KCuF 3 . This ultimately lead us to propose [41] for KCuF 3 an alternative mechanism, which explains the absence of an orbital-ordering transition in this material. The second application is the riddle of the Fermi surface of Sr 2 RuO 4 , a system famous for exhibiting p-wave superconductivity. The Fermi surface of Sr 2 RuO 4 has been studied for decades, but the role of Coulomb interaction and its interplay with spin-orbit coupling were not fully understood. To solve this problem, we performed calculations involving three orbitals, crystal-field splitting, spin-orbit interaction, and low-symmetry Coulomb interactions. For the latter, we additionally generalized the around-mean-field approximation in order to properly account for the double-counting correction. Our study [19] lead us to identify the effect of spinorbit and high-and low-symmetry Coulomb terms and their interplay. For the last application we chose layered vanadates, systems which had been proposed as possible candidates for highly-frustrated materials. By calculating correlation functions and high-temperature linear response functions, we could construct material-specific spin models for these systems. The latter lead us to conclude that these systems are well described by a weak frustration scenario. Finally, QMC solvers are very powerful but they are also hampered by the infamous sign problem. We have shown how, in specific cases, the sign problem can be strongly reduced by changing the orbital basis in which the calculation is set up; e.g., in the presence of a crystal field, this can be achieved rotating to the basis in which the crystal-field matrix is diagonal (although the hybridization function remains non-diagonal) [17] or, in the presence of spin-orbit interaction, to the basis which makes the local Green function real [19,23]. Support of the Deutsche Forschungsgemeinschaft through FOR 1346 is gratefully acknowledged.
Open Access This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.