Main

It has been a long-standing quest in condensed-matter and quantum many-body physics to capture the essence of quantum many-body systems hidden behind their exponential complexity. Although many numerical methods have been developed to tackle the strongly interacting quantum many-body problem, obtaining accurate ground-state solutions remains an extraordinary challenge, especially for large and complex two-dimensional systems. The specific difficulties depend on the method used, such as the ‘curse of dimensionality’ in exact diagonalization1, the notorious sign problem2 in quantum Monte Carlo approaches3 or the growth of entanglement and matrix contraction complexity in tensor network methods4. One of the paradigmatic instances of such complex two-dimensional quantum matter is the putative quantum-spin-liquid (QSL) phase in frustrated magnets5. Although a large variety of numerical methods has been applied, the nature of many of the presumed QSLs remains debated, as for the prototypical frustrated Heisenberg J1-J2 magnets on square6,7,8,9,10,11,12 or triangular lattices13,14,15,16,17,18,19,20,21,22.

Recently, neural quantum states (NQSs) have been introduced as a promising alternative for solving the quantum many-body problem by means of artificial neural networks23. This approach has already seen tremendous progress for QSLs24,25,26. However, the method faces an outstanding challenge that critically limits its capabilities and potential to date. Owing to the rugged quantum landscape27 with many saddle points, it is typically necessary to utilize stochastic reconfiguration (SR)28 in the optimization. SR is a quantum generalization of natural gradient descent29 and has \({{{\mathcal{O}}}}({N}_\mathrm{p}^{3})\) complexity for a network with Np parameters, which impedes the training of deep networks. Consequently, current applications of NQS mainly focus on shallow networks, such as a restricted Boltzmann machine (RBM)23,30 or shallow convolutional neural networks (CNNs)25,31 with no more than ten layers and around 10³ parameters. Many efforts have been made to overcome the optimization difficulty in deep NQS based on iterative solvers23, approximate optimizers32,33,34,35,36 or large-scale supercomputers37,38. However, the cost of SR still represents the key limitation in increasing the network size and, thereby, fully materializing the exceptional power of artificial neural networks for outstanding physics problems.

In this work, we introduce an alternative training algorithm for NQS, which we term minimum-step stochastic reconfiguration (MinSR). We show that the optimization cost of MinSR is reduced massively while it remains as accurate as SR. Concretely, the training cost of MinSR is only linear in Np, which represents an enormous acceleration compared to SR. This, in turn, allows us to push NQS towards the deep era by training deep networks with up to 64 layers and 10⁶ parameters. We apply the resulting algorithm to paradigmatic two-dimensional quantum spin systems, such as the spin-1/2 Heisenberg J1-J2 model, both to demonstrate the resulting accuracies for large system sizes beyond what is achievable with other computational methods and to address an outstanding question relating to the gaps in the model’s QSL phases.

Results

Minimum-step stochastic reconfiguration

In the NQS approach, a neural network is utilized to encode and compress the many-body wavefunction. In a system with N spin-1/2 degrees of freedom, the Hilbert space can be spanned by the Sz spin configuration basis \(\left\vert \sigma \right\rangle =\left\vert {\sigma }_{1},\ldots ,{\sigma }_{N}\right\rangle\) with σi = ↑ or ↓. An NQS with parameters θ maps every σ at the input to a wavefunction component ψθ,σ at the output23, as shown in Fig. 1a. The full quantum state is then given by the superposition \(\left\vert {\varPsi }_{\theta }\right\rangle ={\sum }_{\sigma }{\psi }_{\theta ,\sigma }\left\vert \sigma \right\rangle\). When searching for ground states within the variational Monte Carlo (VMC) framework, θ is optimized to minimize the variational energy \({E}_{\theta }=\left\langle {\varPsi }_{\theta }\right\vert {{{\mathcal{H}}}}\left\vert {\varPsi }_{\theta }\right\rangle /\left\langle {\varPsi }_{\theta }| {\varPsi }_{\theta }\right\rangle\).

Fig. 1: Illustration of NQS and MinSR.
figure 1

a, In the NQS approach, an artificial neural network is used to represent a quantum many-body state. A change of the network parameters for the NQS leads to a new quantum state, whose distance to the previous NQS is given by the quantum metric \(S\in {{\mathbb{C}}}^{{N}_\mathrm{p}\times {N}_\mathrm{p}}\), where Np is the number of variational parameters. b, The quantum metric \(S={\overline{O}}^{{\dagger} }\overline{O}\) can be decomposed into a smaller matrix \(\overline{O}\in {{\mathbb{C}}}^{{N}_\mathrm{s}\times {N}_\mathrm{p}}\) with Ns ≪ Np the number of Monte Carlo samples. The optimization of an NQS involves the inversion of the quantum metric S, which is equivalent to determining its non-zero eigenvalues λi with i = 1, …, Ns. In MinSR, a neural tangent kernel \(T=\overline{O}\,{\overline{O}}^{{\dagger} }\in {{\mathbb{C}}}^{{N}_\mathrm{s}\times {N}_\mathrm{s}}\) is introduced, which has the same non-zero eigenvalues λi and, therefore, contains the essential information of S.

The standard numerical approach for finding the minimal variational energy for NQS is SR. This is done by approximately implementing imaginary-time evolution. Thus, as the training progresses, the contributions from eigenstates with higher energies are systematically reduced, thereby pushing the state towards the ground state step by step. In every training step, this requires minimizing the quantum distance d between the new variational state \(\left\vert {\varPsi }_{\theta +\delta \theta }\right\rangle\) and the exact imaginary-time evolved state \(\operatorname{e}^{-{{{\mathcal{H}}}}\delta \tau }\left\vert {\varPsi }_{\theta }\right\rangle\), where δτ is the imaginary-time interval.

As proven in the Supplementary Information, the quantum distance d can be estimated for a group of samples σ with Pσ ∝ ∣ψσ∣² as \({d}^{\;2}={\sum }_{\sigma }{\left\vert {\sum }_{k}{\overline{O}}_{\sigma k}\delta {\theta }_{k}-{\overline{\epsilon }}_{\sigma }\right\vert }^{2}\), where the sum ∑σ runs over the sampled spin configurations. We adopt the following notation: \({\overline{O}}_{\sigma k}=({O}_{\sigma k}-\left\langle {O}_{\sigma k}\right\rangle )/\sqrt{{N}_\mathrm{s}}\) with \({O}_{\sigma k}=\frac{1}{{\psi }_{\sigma }}\frac{\partial {\psi }_{\sigma }}{\partial {\theta }_{k}}\), and \({\overline{\epsilon }}_{\sigma }=-\delta \tau\left({E}_{{{{\rm{loc}}}},\sigma }-\left\langle {E}_{{{{\rm{loc}}}},\sigma }\right\rangle\right)/\sqrt{{N}_\mathrm{s}}\) with local energy \({E}_{{{{\rm{loc}}}},\sigma }={\sum }_{{\sigma }^{{\prime} }}\frac{{\psi }_{{\sigma }^{{\prime} }}}{{\psi }_{\sigma }}{H}_{\sigma {\sigma }^{{\prime} }}\), where Ns is the number of samples and \(\left\langle \ldots \right\rangle\) represents the mean value over the given set of samples.
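For orientation, the following minimal NumPy sketch shows how these centred quantities could be assembled from per-sample estimates; the function name and array layout are our own illustrative choices, not part of the original implementation.

```python
import numpy as np

def centred_inputs(O, E_loc, dtau):
    """Assemble O_bar and eps_bar from per-sample quantities as defined above:
    O[s, k] = (1/psi_s) * dpsi_s/dtheta_k for sample s and parameter k,
    E_loc[s] is the local energy of sample s, dtau the imaginary-time step."""
    N_s = O.shape[0]
    O_bar = (O - O.mean(axis=0)) / np.sqrt(N_s)                 # centred log-derivatives
    eps_bar = -dtau * (E_loc - E_loc.mean()) / np.sqrt(N_s)     # centred local energies
    return O_bar, eps_bar
```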

Thus, the quantum distance d can be rewritten as \(d=| | \overline{O}\delta \theta -\overline{\epsilon }| |\) if we treat δθ and \(\overline{\epsilon }\) as vectors and \(\overline{O}\) as a matrix. As a key consequence, we introduce a new linear equation

$$\overline{O}\delta \theta =\overline{\epsilon },$$
(1)

whose least-squares solution minimizes the quantum distance d and leads to the SR equation. Conceptually, one can understand the left-hand side of this equation as the change of the variational state induced by an optimization step of the parameters, and the right-hand side as the change of the exact imaginary-time-evolved state. The traditional SR solution minimizing their difference is

$$\delta \theta ={S}^{-1}{\overline{O}}^{{\dagger} }\overline{\epsilon }\quad {{{\rm{with}}}}\,S={\overline{O}}^{{\dagger} }\overline{O}.$$
(2)

As illustrated in Fig. 1a, the matrix S in equation (2) plays an important role as the quantum metric in VMC29,39,40, which links variations in the Hilbert space and the parameter space. However, inverting the matrix S, which has Np × Np elements, has \({{{\mathcal{O}}}}({N}_\mathrm{p}^{3})\) complexity, and this is a major difficulty when optimizing deep NQSs with large Np. To reduce the cost of SR, we focus on a specific optimization setting with a deep network that has a large number of parameters Np but a relatively small number of samples Ns per batch, as is the case in most deep learning research. In this case, as shown in Fig. 1b, the rank of the Np × Np matrix S is at most Ns, meaning that S contains much less information than its capacity. As a more efficient way to express the information of the quantum metric, we introduce the neural tangent kernel \(T=\overline{O}\,{\overline{O}}^{{\dagger} }\) (ref. 41), which has the same non-zero eigenvalues as S while its size is reduced from Np × Np to Ns × Ns.

As derived in Methods, we propose a new method termed MinSR using T as the compressed matrix,

$$\delta \theta ={\overline{O}}^{{\dagger} }{T}^{-1}\overline{\epsilon }\quad {{{\rm{with}}}}\,T=\overline{O}\,{\overline{O}}^{{\dagger} },$$
(3)

which is mathematically equivalent to the traditional SR solution but only has \({{{\mathcal{O}}}}({N}_\mathrm{p}{N}_\mathrm{s}^{2}+{N}_\mathrm{s}^{3})\) complexity. For large Np, it provides a tremendous acceleration with a time cost proportional to Np instead of \({N}_\mathrm{p}^{3}\). Therefore, it can be viewed as a natural reformulation of traditional SR, which is particularly useful in the limit Np ≫ Ns, as relevant in deep learning situations. For a performance comparison, Extended Data Fig. 1 shows the time cost and accuracy of different optimization methods.
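As a minimal numerical illustration of the equivalence of equations (2) and (3) (a sketch with random placeholder arrays, not the production implementation), both updates can be computed and compared directly:

```python
import numpy as np

rng = np.random.default_rng(0)
N_s, N_p = 64, 2048     # few samples, many parameters: N_s << N_p

# placeholder O_bar and eps_bar; in a real run they come from Monte Carlo samples
O_bar = (rng.normal(size=(N_s, N_p)) + 1j * rng.normal(size=(N_s, N_p))) / np.sqrt(N_s)
eps_bar = (rng.normal(size=N_s) + 1j * rng.normal(size=N_s)) / np.sqrt(N_s)

# SR, eq. (2): invert the N_p x N_p quantum metric S = O_bar^dagger O_bar
S = O_bar.conj().T @ O_bar
dtheta_sr = np.linalg.pinv(S) @ (O_bar.conj().T @ eps_bar)        # O(N_p^3)

# MinSR, eq. (3): invert only the N_s x N_s neural tangent kernel T = O_bar O_bar^dagger
T = O_bar @ O_bar.conj().T
dtheta_minsr = O_bar.conj().T @ (np.linalg.pinv(T) @ eps_bar)     # O(N_p N_s^2 + N_s^3)

print(np.allclose(dtheta_sr, dtheta_minsr))   # True: the same minimum-norm update
```

The pseudo-inverse appears in place of the plain inverse here, in line with the regularization discussed in Methods.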

Benchmark models

To demonstrate the exceptional performance of MinSR, we consider in the following the paradigmatic spin-1/2 Heisenberg J1-J2 model on a square lattice. This choice serves two purposes. On the one hand, this model is a standard benchmark system in various NQS studies and provides a convenient comparison to other state-of-the-art methods. On the other hand, it represents a paradigmatic reference case of QSLs in frustrated magnets, for which an outstanding question is whether the QSL phase is gapped or gapless. The Hamiltonian of the system is given by

$${{{\mathcal{H}}}}={J}_{1}\mathop{\sum}\limits_{\left\langle i,\;j\right\rangle }{{{{\bf{S}}}}}_{i}\cdot {{{{\bf{S}}}}}_{j}+{J}_{2}\mathop{\sum}\limits_{\left\langle \left\langle i,\;j\right\rangle \right\rangle }{{{{\bf{S}}}}}_{i}\cdot {{{{\bf{S}}}}}_{j},$$
(4)

where \({{{{\bf{S}}}}}_{i}=({S}_{i}^{x},{S}_{i}^{y},{S}_{i}^{z})\) with \({S}_{i}^{x},{S}_{i}^{y},{S}_{i}^{z}\) the spin-1/2 operators at site i, \(\left\langle i,j\right\rangle\) and \(\left\langle \left\langle i,j\right\rangle \right\rangle\) indicate pairs of nearest-neighbour and next-nearest-neighbour sites, respectively, and J1 is set to 1 throughout this work for simplicity.
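To make the lattice geometry explicit, a short helper along the following lines (an illustrative sketch; the function and variable names are ours) enumerates the J1 and J2 bonds of equation (4) on an L × L square lattice with periodic boundary conditions.

```python
def square_lattice_bonds(L):
    """Return (j1_bonds, j2_bonds) for an L x L periodic square lattice,
    with each bond a pair of site indices i = x * L + y, counted once."""
    idx = lambda x, y: (x % L) * L + (y % L)
    j1, j2 = [], []
    for x in range(L):
        for y in range(L):
            i = idx(x, y)
            j1 += [(i, idx(x + 1, y)), (i, idx(x, y + 1))]          # nearest neighbours
            j2 += [(i, idx(x + 1, y + 1)), (i, idx(x + 1, y - 1))]  # diagonal neighbours
    return j1, j2

j1_bonds, j2_bonds = square_lattice_bonds(10)
print(len(j1_bonds), len(j2_bonds))   # 200 200 for the 10 x 10 lattice
```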

We will specifically focus on two points in the parameter space: J2/J1 = 0 and J2/J1 = 1/2. At J2/J1 = 0, the Hamiltonian reduces to the non-frustrated Heisenberg model. At J2/J1 = 1/2, the J1-J2 model becomes strongly frustrated close to the maximally frustrated point where the system resides in a QSL phase24, which imposes a great challenge for existing numerical methods, including NQS31,42. Two different designs of residual neural networks (ResNet), whose details we describe in Methods, will be employed for variationally learning the ground states of these benchmark models. A direct comparison with exact diagonalization results for the 6 × 6 square lattice can be found in Extended Data Fig. 2, which shows that our network can even approach machine precision on modern GPU and TPU hardware.

For the non-frustrated Heisenberg model on a 10 × 10 square lattice, a deep NQS trained by MinSR provides an unprecedentedly precise result that surpasses all existing variational methods, as shown in Fig. 2a. The adopted reference ground-state energy per site is EGS/N = −0.67155267(5), obtained from a stochastic series expansion43 simulation that we performed ourselves, instead of the commonly used reference E/N = −0.671549(4) from ref. 44, because our best NQS variational energy E/N = −0.67155260(3) is already more accurate than this common reference energy. Thanks to the deep network architecture and the efficient MinSR method, the relative error of the variational energy ϵrel = (E − EGS)/∣EGS∣ drops much faster with increasing Np than for the one-layer RBM and finally reaches a level of 10⁻⁷, greatly outperforming existing results.

Fig. 2: Relative error of the variational energy ϵrel = (E − EGS)/∣EGS∣ for a square lattice, where EGS is the exact ground-state energy estimated by stochastic series expansion in the non-frustrated case and zero-variance extrapolation in the frustrated case.
figure 2

a, Non-frustrated 10 × 10 Heisenberg model. The variational energies obtained in this work by using a deep ResNet trained with MinSR are compared to previous results in the literature including an RBM23, shallow CNN31 and RBM with a Lanczos step (RBM+LS)38. As no tensor network (TN) data are available for the periodic boundary condition, the best result with an open boundary condition is included as a dashed line51. b, Frustrated 10 × 10 J1-J2 model at J2/J1 = 0.5. The results obtained in this work with MinSR for two designs of ResNet are compared to previous results in the literature for a shallow CNN31, RBM+LS38, group convolutional neural network (GCNN)26 and medium CNN37. Further results from methods other than NQS are included as dashed lines, such as a tensor network9, the Gutzwiller wavefunction with two Lanczos steps (GWF+2LS)8, and a combination of the pair product state and RBM (PP+RBM)24. As a further reference, the so-called MSR limit is included. This was obtained from an NQS trained for a wavefunction where the sign structure was not learned but rather fixed by the MSR. c, Frustrated 16 × 16 J1-J2 model at J2/J1 = 0.5.

To attain the next level of complexity, we now focus on the frustrated J1-J2 model, whose accurate ground-state solution has remained a key challenge for all available computational approaches. Figure 2b shows that, for a 10 × 10 square lattice, our method based on MinSR allows us to reach ground-state energies below what has been possible with any other numerical scheme so far. In this context, the Marshall sign rule (MSR) limit shows the energy one can obtain without accounting for any frustration in the sign structure. As shown in the figure, the use of deep NQS becomes absolutely crucial, as the shallow CNN is not guaranteed to beat the MSR limit. Most importantly, the variational energy was reduced upon increasing the network size for both networks trained by MinSR. We finally trained unprecedentedly large networks, with 64 convolutional layers in ResNet1 and more than one million parameters in ResNet2, to attain the best variational energy E/N = −0.4976921(4), which outperforms all existing numerical results. These variational outcomes allow us to accurately estimate the ground-state energy EGS/N = −0.497715(9) by zero-variance extrapolation, as described in Methods. Compared with the previous best result24, ϵrel for our biggest network is around four times smaller, suggesting that our deep NQS result is substantially more accurate. From this, we conclude that the deep NQS trained by MinSR is superior even in the frustrated case, which had been argued to be challenging for NQS on a general level45. The variational energies of different methods for this prototypical model are summarized in Extended Data Table 1.

Finally, we aim to provide evidence that our approach retains its advantage over other computational methods upon further increasing the system size. Figure 2c presents the variational energy obtained for a 16 × 16 square lattice and compares it with existing results in the literature. One can clearly see that our approach yields the best variational energy E/N = −0.4967163(8) for the frustrated J1-J2 model on such a large lattice. Compared with the best existing variational result, given in ref. 37, ϵrel in this work is 2.5 × 10⁻⁴ lower. In summary, the deep NQS trained by MinSR provides results for large frustrated models that are not only on a par with other state-of-the-art methods but can substantially outperform them.

Energy gaps of a QSL

Although so far we have focused on demonstrating the exceptional performance of the MinSR method, we now take the next step by addressing an outstanding physical question regarding the J1-J2 Heisenberg model. Concretely, we utilize the combination of deep NQS and MinSR to study the gaps of two famous QSL candidates in the J1-J2 model on square and triangular lattices. In these systems, several works in the literature6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22 have shown the existence of QSL phases, although the energy gaps in the thermodynamic limit, especially for the triangular lattice, remain debated. Figure 3 presents an extrapolation of the energy gaps between states with total spin S = 0 and S = 1 to the thermodynamic limit within the most frustrated regime, in which the QSL candidates reside. As explained in Extended Data Figs. 3 and 4, the energies are estimated by NQS trained by MinSR, with a Lanczos step and zero-variance extrapolation to increase accuracy. In the Supplementary Information, we provide the spin and dimer structure factors to support the existence of a QSL phase on the triangular lattice and compare gap estimates with and without zero-variance extrapolation.

Fig. 3: Energy gap Δ between the ground state with total spin S = 0 and the excited state with S = 1 as a function of inverse linear length 1/L at the maximally frustrated point.
figure 3

The inset includes the behaviour of the rescaled gap Δ × L versus 1/L.

On the square lattice, the gaps are measured in the total spin S = 1 sector at momentum k = (π, π) (M-point) at the most frustrated point J2/J1 = 0.5 for different system sizes, including 6 × 6, 8 × 8, 10 × 10, 12 × 12, 16 × 16 and 20 × 20. As shown by the small fitting error in Fig. 3 for Δ = a + b/L + c/L², the vanishing gap Δ = 0.00(3) in the thermodynamic limit provides unprecedented precision and is so far the most accurate extrapolation at this most frustrated point. In addition to the direct extrapolation of the energy gap Δ, we support our finding of a vanishing gap in the inset of Fig. 3, which displays Δ × L as a function of 1/L (ref. 24). Although a finite gap would imply a divergent Δ × L, we observe a constant value, further corroborating our conclusion of a gapless phase in the thermodynamic limit. Combined with the large lattice sizes used, this result provides strong evidence for gapless QSLs as suggested by refs. 10,11,12,24, in contrast to the conclusion of gapped QSLs in ref. 6.

The triangular J1-J2 model is even more strongly frustrated than the square one, leading to larger variational errors for the different methods and more controversy regarding the nature of the QSLs. To target the QSLs in this model, we also studied the most frustrated point, at J2/J1 = 0.125. The gaps were measured for the S = 1, k = (4π/3, 0) state on 6 × 6, 6 × 9, 12 × 12 and 18 × 18 triangular lattices. Due to the larger variational error on the triangular lattice compared to the square case, a linear fit Δ = a + b/L was utilized instead of the quadratic one to prevent overfitting. For a lattice with unequal extents Lx and Ly in the two directions, L is defined as \(\sqrt{{L}_{x}{L}_{y}}\). Our data match well the linear relation Δ ∝ 1/L expected for Dirac spin liquids, and the gap extrapolated to the thermodynamic limit is Δ = −0.05(6), consistent with zero. Furthermore, we also performed an extrapolation of Δ × L (inset of Fig. 3). We found a finite Δ × L upon increasing the system size L, indicating a vanishing gap in the thermodynamic limit. We take these results as strong numerical evidence for a gapless QSL, as also indicated in refs. 13,16,20,22, rather than a gapped QSL as in refs. 14,15,21. Consequently, these numerical results demonstrate the exceptional computational power of the MinSR method applied to NQS wavefunctions, especially for the challenging regime of frustrated quantum magnets in two dimensions.

Discussion

To date, there have been tremendous efforts to solve quantum many-body problems along two major directions: exploring a simplified Hilbert space motivated by specific physical insight on classical computers, and traversing the full Hilbert space on quantum computers. In this work, we present another promising approach, supported by deep NQSs. This method allows us to capture the complexity of quantum many-body problems through the emergent expressive power of large-scale neural networks.

For the future, we envision promising research directions, for instance, studying fermionic systems including the celebrated Hubbard model46,47 or ab initio quantum chemistry48, in which the traditional methods have limited accuracy, especially in the strongly interacting regime. Moreover, it is key to point out that the MinSR method is not at all restricted to NQS. As a general optimization method in VMC, it can also be applied to other variational wavefunctions, like tensor networks, so that a more complex ansatz can be introduced in these conventional methods to enhance the expressivity. It will also be of great importance to exploit the expressive power of large-scale variational wavefunctions through a suitable design that would lower the computational cost and increase the accuracy.

We can further envision the application of MinSR beyond the scope of physics for general machine learning tasks, if a suitable space for optimization like the Hilbert space in physics can be defined for which we can construct an equation like equation (1). In reinforcement learning tasks, for instance, obtaining gradients from the action in the environment is usually the most time-consuming part of the training, so a MinSR-like natural policy gradient49 can provide more accurate optimization directions without substantially increased time cost and greatly improve the training efficiency, even for very deep neural networks. Recently, a method inspired by MinSR has already found applications in general machine learning tasks50.

Methods

Derivation of the MinSR equation

MinSR was derived based on the observation that equation (1) is underdetermined when Ns < Np. To obtain a unique δθ solution, we employed the least-squares minimum-norm condition, which is widely used for underdetermined linear equations. To be specific, we chose, among all solutions with minimum residual error \(| | \overline{O}\delta \theta -\overline{\epsilon }| |\), the one minimizing the norm of the variational step ∣∣δθ∣∣, which helps to reduce higher-order effects, prevent overfitting and improve stability. We called this method MinSR due to the additional minimum-step condition. In this section, we adopt two different approaches, namely the Lagrangian multiplier method and the pseudo-inverse method, to derive the MinSR formula in equation (3).

Lagrangian multiplier

The MinSR solution can be derived by minimizing the variational step ∑k∣δθk∣² under the constraint of minimum residual error \({\sum }_{\sigma }| {\sum }_{k}{\overline{O}}_{\sigma k}\delta {\theta }_{k}-{\overline{\epsilon }}_{\sigma }{| }^{2}\). To begin, we assume that the minimum residual error is 0, which can always be achieved by letting Ns < Np and assuming the typical VMC situation in which the rows \({\overline{O}}_{\sigma k}\) obtained from different samples are linearly independent. This leads to the constraints \({\sum }_{k}{\overline{O}}_{\sigma k}\delta {\theta }_{k}-{\overline{\epsilon }}_{\sigma }=0\) for each σ. The Lagrange function is then given by

$${{{\mathcal{L}}}}(\{\delta {\theta }_{k}\},\{{\alpha }_{\sigma }\})=\sum_{k}| \delta {\theta }_{k}{| }^{2}-\left[\sum_{\sigma }{\alpha }_{\sigma }^{* }\sum_{k}({\overline{O}}_{\sigma k}\delta {\theta }_{k}-{\overline{\epsilon }}_{\sigma })+\mathrm{h.c.}\right],$$
(5)

where ασ is the Lagrangian multiplier. Written in matrix form, the Lagrangian function becomes

$${{{\mathcal{L}}}}(\delta \theta ,\alpha )=\delta {\theta }^{{\dagger} }\delta \theta -{\alpha }^{{\dagger} }(\overline{O}\delta \theta -\overline{\epsilon }\;)-(\delta {\theta }^{{\dagger} }{\overline{O}}^{{\dagger} }-{\overline{\epsilon }}^{{\dagger} })\alpha.$$
(6)

From \(\partial {{{\mathcal{L}}}}/\partial (\delta {\theta }^{{\dagger} })=0\), one obtains

$$\delta \theta ={\overline{O}}^{{\dagger} }\alpha.$$
(7)

Putting equation (7) back into \(\overline{O}\delta \theta =\overline{\epsilon }\), one can solve α as

$$\alpha ={(\overline{O}\;{\overline{O}}^{{\dagger} })}^{-1}\overline{\epsilon }.$$
(8)

Combining equation (8) with equation (7), one obtains the final solution as

$$\delta \theta ={\overline{O}}^{{\dagger} }{(\overline{O}\;{\overline{O}}^{{\dagger} })}^{-1}\overline{\epsilon },$$
(9)

which is the MinSR formula in equation (3). A similar derivation also applies when \(\overline{O},\delta \theta\) and \(\overline{\epsilon }\) are all real.

In our simulations, the residual error is non-zero, which differs from our previous assumption. This is because the inverse in equation (9) is replaced by a pseudo-inverse with finite truncation to stabilize the solution in the numerical experiments.

Pseudo-inverse

To simplify the notation, we use \(A=\overline{O},x=\delta \theta\) and \(b=\overline{\epsilon }\). We will prove that for a linear equation Ax = b,

$$x={A}^{-1}b={({A}^{{\dagger} }A)}^{-1}{A}^{{\dagger} }b={A}^{{\dagger} }{(A{A}^{{\dagger} })}^{-1}b$$
(10)

is the least-squares minimum-norm solution, where the matrix inverse denotes the pseudo-inverse.

First, we prove that x = A⁻¹b is the solution we need. The singular value decomposition of A gives

$$A=U\varSigma {V}^{\;{\dagger} },$$
(11)

where U and V are unitary matrices, and Σ is a diagonal matrix whose diagonal elements σi = Σii vanish if and only if i > r, with r the rank of A. The least-squares solution is given by minimizing

$$\begin{aligned}| | Ax-b| {| }^{\;2}&=| | U\varSigma {V}^{\;{\dagger} }x-b| {| }^{2}\\&=| | \varSigma {x}^{{\prime} }-{b}^{{\prime} }| {| }^{2}\\ &=\sum_{i=1}^{r}{\left({\sigma }_{i}{x}_{i}^{{\prime} }-{b}_{i}^{{\prime} }\right)}^{2}+\sum_{i=r+1}^{{N}_{s}}{b}_{i}^{{\prime} 2},\end{aligned}$$
(12)

where \({x}^{{\prime} }={V}^{\;{\dagger} }x\), \({b}^{{\prime} }={U}^{\;{\dagger} }b\) and Ns is the dimension of b; the second step follows because applying a unitary matrix does not change the norm of a vector. Therefore, all the least-squares solutions take the form

$${x}_{i}^{{\prime} }=\begin{cases}{b}_{i}^{{\prime} }/{\sigma }_{i},&i\le r,\\ {{{\text{any value}}}},&i > r.\end{cases}$$
(13)

Among all these possible solutions, the one that minimizes \(| | x| | =| | {x}^{{\prime} }| |\) is

$${x}_{i}^{{\prime} }=\begin{cases}{b}_{i}^{{\prime} }/{\sigma }_{i},&i\le r,\\ 0,&i > r.\end{cases}$$
(14)

With the following definition of a pseudo-inverse

$$\begin{aligned}{A}^{-1}&=V{\varSigma }^{+}{U}^{\;{\dagger} },\\ {\varSigma }_{ij}^{+}&={\delta }_{ij}\times \begin{cases}1/{\sigma }_{i},&{\sigma }_{i} > 0,\\ 0,&{\sigma }_{i}=0,\end{cases}\end{aligned}$$
(15)

we have \({x}^{{\prime} }={\varSigma }^{+}{b}^{{\prime} }\), so the final solution is

$$x=V{x}^{{\prime} }=V{\varSigma }^{+}{U}^{\;{\dagger} }b={A}^{-1}b.$$
(16)

Furthermore, we show the following equality

$${A}^{-1}={({A}^{{\dagger} }A)}^{-1}{A}^{{\dagger} }={A}^{{\dagger} }{(A{A}^{{\dagger} })}^{-1}.$$
(17)

With the singular value decomposition of A in equation (11), equation (17) can be directly proved by

$$\begin{aligned}{({A}^{{\dagger} }A)}^{-1}{A}^{{\dagger} }&={(V\varSigma {U}^{\;{\dagger} }U\varSigma {V}^{\;{\dagger} })}^{-1}V\varSigma {U}^{\;{\dagger} }\\ &=V{({\varSigma }^{+})}^{2}{V}^{\;{\dagger} }V\varSigma {U}^{\;{\dagger} }\\ &=V{\varSigma }^{+}{U}^{\;{\dagger} }\\&={A}^{-1},\end{aligned}$$
(18)

and

$$\begin{aligned}{A}^{{\dagger} }{(A{A}^{{\dagger} })}^{-1}&=V\varSigma {U}^{\;{\dagger} }{(U\varSigma {V}^{\;{\dagger} }V\varSigma {U}^{\;{\dagger} })}^{-1}\\ &=V\varSigma {U}^{\;{\dagger} }U{({\varSigma }^{+})}^{2}{U}^{\;{\dagger} }\\ &=V{\varSigma }^{+}{U}^{\;{\dagger}}\\&={A}^{-1}.\end{aligned}$$
(19)

In the derivation, the shapes of the diagonal matrices Σ and Σ+ are not fixed but are assumed to match the neighbouring matrices so that the matrix multiplications are valid.

Equation (17) shows that the SR solution in equation (2) and MinSR solution in equation (3) are both equivalent to the pseudo-inverse solution \(\delta \theta ={\overline{O}}^{-1}\overline{\epsilon }\), which justifies MinSR as a natural alternative to SR when Ns < Np.
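The identities in equation (17) are easy to check numerically; the sketch below uses a random real matrix with Ns < Np and is not specific to NQS.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(20, 200))     # N_s x N_p with N_s < N_p and full row rank
b = rng.normal(size=20)

# minimum-norm least-squares solution built from the SVD, eqs. (11)-(16)
U, s, Vh = np.linalg.svd(A, full_matrices=False)
x_svd = Vh.T @ ((U.T @ b) / s)

# the two closed forms of eq. (17)
x_sr = np.linalg.pinv(A.T @ A) @ (A.T @ b)      # (A^dagger A)^{-1} A^dagger b
x_minsr = A.T @ np.linalg.solve(A @ A.T, b)     # A^dagger (A A^dagger)^{-1} b

print(np.allclose(x_svd, x_sr), np.allclose(x_svd, x_minsr))   # True True
```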

MinSR solution

Numerical solution

In this section, we focus on how to solve the MinSR equation numerically:

$$\delta \theta ={\overline{O}}^{{\dagger} }{T}^{-1}\overline{\epsilon }.$$
(20)

The whole computation, starting from \(T=\overline{O}\,{\overline{O}}^{{\dagger} }\), should be executed under double-precision arithmetic to ensure that small eigenvalues are reliable.

Then a suitable pseudo-inverse should be applied to obtain a stable solution. In practice, the Hermitian matrix T is first diagonalized as T = UDU†, and the pseudo-inverse is given by

$${T}^{-1}=U{D}^{+}{U}^{\;{\dagger} },$$
(21)

where D+ is the pseudo-inverse of the diagonal matrix D, numerically given by a cutoff below which the eigenvalues are regarded as 0, that is

$${\lambda }_{i}^{+}=\begin{cases}1/{\lambda }_{i},&| {\lambda }_{i}| \ge {r}_{{{{\rm{pinv}}}}}| {\lambda }_{\max }| +{a}_{{{{\rm{pinv}}}}},\\ 0,&| {\lambda }_{i}| < {r}_{{{{\rm{pinv}}}}}| {\lambda }_{\max }| +{a}_{{{{\rm{pinv}}}}},\end{cases}$$
(22)

where λi and \({\lambda }_{i}^{+}\) are the diagonal elements of D and \({D}^{+}\), \({\lambda }_{\max }\) is the largest of the λi, and rpinv and apinv are the relative and absolute pseudo-inverse cutoffs. In most cases, we choose rpinv = 10⁻¹² and apinv = 0. Furthermore, we modify the aforementioned direct cutoff to a soft one52:

$${\lambda }_{i}^{+}={\left[{\lambda }_{i}\left(1+{\left(\frac{{r}_{{{{\rm{pinv}}}}}| {\lambda }_{\max }| +{a}_{{{{\rm{pinv}}}}}}{| {\lambda }_{i}| }\right)}^{6}\right)\right]}^{-1}$$
(23)

to avoid abrupt changes when the eigenvalues cross the cutoff during optimization.
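Putting equations (20)–(23) together, a double-precision solver could look as follows. This is a sketch assuming a positive-semidefinite T, for which the soft cutoff of equation (23) can be rewritten as λ⁵/(λ⁶ + c⁶) to avoid dividing by vanishing eigenvalues.

```python
import numpy as np

def minsr_update(O_bar, eps_bar, r_pinv=1e-12, a_pinv=0.0):
    """MinSR update of eq. (20) with the regularized pseudo-inverse of
    T = O_bar O_bar^dagger from eqs. (21)-(23); double precision throughout."""
    T = O_bar @ O_bar.conj().T
    lam, U = np.linalg.eigh(T)                    # Hermitian T = U D U^dagger
    cutoff = r_pinv * np.max(np.abs(lam)) + a_pinv
    # soft cutoff of eq. (23), written for lam >= 0 as lam^5 / (lam^6 + cutoff^6)
    lam_plus = lam**5 / (lam**6 + cutoff**6)
    T_inv = (U * lam_plus) @ U.conj().T           # U D^+ U^dagger
    return O_bar.conj().T @ (T_inv @ eps_bar)
```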

Complex neural networks

Our original MinSR formula, equation (3), can be applied when the network is real or complex holomorphic. In our ResNet2 architecture, however, the neural network parameters are real but the network outputs can be complex, in which case equation (3) cannot be applied directly. For other non-holomorphic networks, a complex parameter can be treated as two independent real parameters, but the same problem still occurs. To obtain the MinSR equation in these special cases, notice that the quantum distance d between \(\left\vert {\varPsi }_{\theta +\delta \theta }\right\rangle\) and \(\operatorname{e}^{-{{{\mathcal{H}}}}\delta \tau }\left\vert {\varPsi }_{\theta }\right\rangle\) can be reformulated as

$$\begin{aligned}{d}^{\;2}&=| | \overline{O}\delta \theta -\overline{\epsilon }| {| }^{2}\\ &=| | \operatorname{Re}(\overline{O})\delta \theta -\operatorname{Re}(\;\overline{\epsilon }\;)| {| }^{2}+| | \operatorname{Im}(\overline{O})\delta \theta -\operatorname{Im}(\;\overline{\epsilon }\;)| {| }^{2},\end{aligned}$$
(24)

assuming \(\overline{O}\) and \(\overline{\epsilon }\) are complex while δθ is real. By defining

$${\overline{O}}^{\;{\prime} }=\left(\begin{array}{c}\operatorname{Re}\overline{O}\\ \operatorname{Im}\overline{O}\end{array}\right),\quad {\overline{\epsilon }}^{\;{\prime} }=\left(\begin{array}{c}\operatorname{Re}\overline{\epsilon }\\ \operatorname{Im}\overline{\epsilon }\end{array}\right),$$
(25)

one can rewrite the quantum distance again as \({d}^{\;2}=| | {\overline{O}}^{\;{\prime} }\delta \theta -{\overline{\epsilon }}^{\;{\prime} }| {| }^{2}\) with all entities real. The MinSR solution, in this case, is similarly given by

$$\delta \theta ={\overline{O}}^{\;{\prime} {\dagger} }{T}^{{\prime} -1}\overline{\epsilon }\quad {{{\rm{with}}}}\,{T}^{{\prime} }={\overline{O}}^{\;{\prime} }{\overline{O}}^{\;{\prime} {\dagger} }.$$
(26)

Similar arguments can also provide the SR equation in the non-holomorphic case as

$$\begin{aligned}\delta \theta &={S}^{\,{\prime} -1}{F}^{\,{\prime} }\\ {{{\rm{with}}}}\,{S}^{\,{\prime} }&={\overline{O}}^{\,{\prime} {\dagger} }{\overline{O}}^{\,{\prime} }=\operatorname{Re}S,\;{F}^{{\prime} }={\overline{O}}^{\,{\prime} {\dagger} }{\overline{\epsilon }}^{\,{\prime} }=\operatorname{Re}F,\end{aligned}$$
(27)

where \(S={\overline{O}}^{{\dagger} }\overline{O}\) and \(F={\overline{O}}^{{\dagger} }\overline{\epsilon }\) are the same as for the ordinary SR solution. This solution agrees with the widely used non-holomorphic SR solution53.
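Concretely, equations (25) and (26) amount to stacking real and imaginary parts before applying the MinSR formula. A minimal sketch (the pseudo-inverse regularization of equation (23) is omitted here for brevity):

```python
import numpy as np

def minsr_update_real_params(O_bar, eps_bar):
    """MinSR update for a network with real parameters and complex outputs,
    eqs. (25)-(26): stack Re and Im so that all quantities become real."""
    O_r = np.concatenate([O_bar.real, O_bar.imag], axis=0)       # (2 N_s, N_p)
    eps_r = np.concatenate([eps_bar.real, eps_bar.imag])         # (2 N_s,)
    T_r = O_r @ O_r.T                                            # (2 N_s, 2 N_s)
    return O_r.T @ np.linalg.lstsq(T_r, eps_r, rcond=None)[0]    # real dtheta
```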

Neural quantum states

In this work, we adopt two different designs of ResNets. Several techniques are also applied to reduce the error.

ResNet1

The first architecture, as suggested in ref. 54, has two convolutional layers in each residual block, each given sequentially by a layer normalization, a ReLU activation function and a convolutional layer. All the convolutional layers are real-valued with the same number of channels and kernel size. After the forward pass through all residual blocks, a final activation function \(f(x)=\cosh x\,(x > 0),\,2-\cosh x\,(x < 0)\) is applied, which resembles the \(\cosh (x)\) activation in an RBM but can also give negative outputs, so that the whole network is able to express sign structures while remaining real-valued. In the non-frustrated case, ∣f(x)∣ is used as the final activation function to make all outputs positive. After the final activation function, the outputs vi are used to compute the wavefunction as \({\psi }_{\sigma }^{{{{\rm{net}}}}}={\prod }_{i}({v}_{i}/t)\), where t is a rescaling factor, updated in every training step, that prevents numerical overflow in the product.
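The final activation and the rescaled product described above could be sketched as follows; the function names and the exact handling of t are our own simplifications (in the actual training, t is updated every step).

```python
import numpy as np

def final_activation(x, frustrated=True):
    """f(x) = cosh(x) for x > 0 and 2 - cosh(x) for x < 0; the absolute value
    is used in the non-frustrated case, where the wavefunction is positive."""
    f = np.where(x > 0, np.cosh(x), 2.0 - np.cosh(x))
    return f if frustrated else np.abs(f)

def resnet1_amplitude(outputs, t):
    """psi^net = prod_i (v_i / t), with v_i the activated network outputs and
    t a rescaling factor that keeps the product from overflowing."""
    v = final_activation(outputs)
    return np.prod(v / t)
```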

ResNet2

The second design of ResNet basically follows ref. 26. In this architecture, the residual blocks are the same as ResNet1 but the normalization layers are removed. In the last layer, two different kinds of activations can be applied. For real-valued wavefunctions, we chose \(f(x)=\sinh (x)+1\). For complex-valued wavefunctions, we split all channels in the last layer into two groups and employ \(f({x}_{1},{x}_{2})=\exp ({x}_{1}+\mathrm{i}{x}_{2})\). A rescaling factor t is also inserted in suitable places in f to prevent an overflow.

Finally, a sum is performed to obtain the wavefunction. Considering the possible non-zero momentum q, the wavefunction is given by

$${\psi }_{\sigma }^{{{{\rm{net}}}}}=\sum_{i}\operatorname{e}^{-\mathrm{i}{{{{\bf{q \cdot r}}}}}_{i}}\sum_{c}{v}_{c,i},$$
(28)

where vc,i is the last-layer neuron at channel c and site i, and ri is the real-space position of site i. This definition ensures that the whole NQS has a momentum q.
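Equation (28) translates into a few lines of code; the array layout below is an assumption of this sketch rather than the original implementation.

```python
import numpy as np

def resnet2_amplitude(v, positions, q):
    """Momentum-projected output of eq. (28):
    psi^net = sum_i exp(-i q . r_i) * sum_c v[c, i],
    with v of shape (channels, sites), positions of shape (sites, 2) holding the
    real-space coordinates r_i, and q the momentum vector."""
    phases = np.exp(-1j * positions @ np.asarray(q))   # one phase per site
    return np.sum(phases * v.sum(axis=0))
```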

In summary, ResNet1 performs better when one applies transfer learning from a small lattice to a larger one, but ResNet2, in general, has better accuracy and stability. Moreover, ResNet2 allows one to implement non-zero momentum, which is key to finding low-lying excited states.

Sign structure

On top of the raw output from the neural network \({\psi }_{\sigma }^{{{{\rm{net}}}}}\), the MSR55 is applied to wavefunctions on the square lattice; this is the exact sign structure for the non-frustrated Heisenberg model but only an approximate one in the frustrated region around J2/J1 ≈ 0.5. The sign structure representing the 120° magnetic order is likewise applied for the triangular lattice. Although these sign structures are additional physical inputs for specific models, the generality of the approach is not reduced, because it has been shown that simple sign structures such as the MSR can be learned exactly by an additional sign network56,57.

Symmetry

Symmetry plays an important role in improving the accuracy and finding low-lying excited states for NQS30,58. In this work, we apply symmetry on top of the well-trained \({\psi }_{\sigma }^{{{{\rm{net}}}}}\) to project variational states onto suitable symmetry sectors. Assuming the system permits a symmetry group of order ∣G∣ represented by operators Ti with characters ωi, the symmetrized wavefunction is then defined as30,59

$${\psi }_{\sigma }^{{{{\rm{symm}}}}}=\frac{1}{| G| }\sum_{i}{\omega }_{i}^{-1}{\psi }_{{T}_{i}\sigma }^{{{{\rm{net}}}}}.$$
(29)

With translation symmetry already enforced by the CNN architecture, the remaining symmetries applied by equation (29) are the point group symmetry, which is C4v for the square lattice and D6 for the triangular lattice, and the spin inversion symmetry σ → −σ (refs. 60,61,62,63,64).
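In code, the projection of equation (29) can be sketched as below, assuming the symmetry operations are available as callables acting on a spin configuration (the interfaces are ours, chosen for illustration).

```python
def symmetrized_amplitude(psi_net, sigma, group_ops, characters):
    """Symmetry projection of eq. (29):
    psi^symm(sigma) = (1/|G|) * sum_i omega_i^{-1} * psi^net(T_i sigma),
    where group_ops[i] maps a configuration to T_i sigma and characters[i] = omega_i."""
    terms = [psi_net(T(sigma)) / omega for T, omega in zip(group_ops, characters)]
    return sum(terms) / len(group_ops)
```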

Zero-variance extrapolation

The variational wavefunction provides an inexact estimate of the ground-state energy due to the variational error. Fortunately, in VMC one can compute the energy variance

$${\sigma }^{2}=\left\langle {{{{\mathcal{H}}}}}^{2}\right\rangle -{\left\langle {{{\mathcal{H}}}}\right\rangle }^{2}$$
(30)

as an estimate of the variational error. Hence, an extrapolation to zero energy variance gives a better estimate of the ground-state energy65,66, which has been successfully applied to NQS in refs. 30,37. In the following, we adopt the derivation in ref. 66 to show how to perform the extrapolation.

Assuming the normalized variational state \(\left\vert \psi \right\rangle\) deviates only slightly from the exact ground state \(\left\vert {\psi }_\mathrm{g}\right\rangle\), one can express it as

$$\left\vert \psi \right\rangle =\sqrt{1-{\lambda }^{2}}\left\vert {\psi }_\mathrm{g}\right\rangle +\lambda \left\vert {\psi }_\mathrm{e}\right\rangle ,$$
(31)

where \(\left\vert {\psi }_\mathrm{e}\right\rangle\) represents the error in the variational state orthogonal to the ground state and λ is a small positive number indicating the error strength. Denoting \({E}_\mathrm{g}=\left\langle {\psi }_\mathrm{g}| {{{\mathcal{H}}}}| {\psi }_\mathrm{g}\right\rangle\), \({E}_\mathrm{e}=\left\langle {\psi }_\mathrm{e}| {{{\mathcal{H}}}}| {\psi }_\mathrm{e}\right\rangle\) and \({\left\langle {{{{\mathcal{H}}}}}^{2}\right\rangle }_\mathrm{e}=\left\langle {\psi }_\mathrm{e}| {{{{\mathcal{H}}}}}^{2}| {\psi }_\mathrm{e}\right\rangle\), one can express the variational energy as

$$E=\left\langle \psi | {{{\mathcal{H}}}}| \psi \right\rangle ={E}_\mathrm{g}+{\lambda }^{2}({E}_\mathrm{e}-{E}_\mathrm{g}).$$
(32)

Similarly, the energy variance can be written as

$${\sigma }^{2}={\lambda }^{2}\left({\left\langle {{{{\mathcal{H}}}}}^{2}\right\rangle }_\mathrm{e}-2{E}_\mathrm{g}{E}_\mathrm{e}+{E}_\mathrm{g}^{\,2}\right)+{{{\mathcal{O}}}}({\lambda }^{4}).$$
(33)

If the error state \(\left\vert {\psi }_\mathrm{e}\right\rangle\) does not change substantially in different training attempts, there is a linear relation

$$(E-{E}_\mathrm{g})\propto {\sigma }^{\;2}$$
(34)

for small λ, so a linear extrapolation to σ² = 0 gives E = Eg.

As shown in Extended Data Fig. 3, the ratio (E − Eg)/σ² also remains nearly unchanged for different lattice sizes and symmetry sectors. This empirical observation is used to estimate the ratio for large lattices from smaller ones, so as to reduce both the error and the time cost.
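In practice, the extrapolation of equation (34) is a simple linear fit of E against σ². The sketch below uses hypothetical placeholder values, not the measured data of this work.

```python
import numpy as np

# hypothetical (sigma^2, E) pairs from several variational states of increasing quality
var = np.array([4.0e-4, 2.5e-4, 1.2e-4, 0.6e-4])       # energy variance per site
E = np.array([-0.4972, -0.4974, -0.4975, -0.4976])     # variational energy per site

slope, E_g = np.polyfit(var, E, deg=1)                  # E = E_g + slope * sigma^2
print("zero-variance estimate of the ground-state energy per site:", E_g)
```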

Lanczos step

The Lanczos step is a popular method in VMC for improving the variational accuracy67. It is also used in NQS26,38.

The key idea of a Lanczos step is to construct new states \(\left\vert {\psi }_\mathrm{p}\right\rangle\) orthogonal to the well-trained variational wavefunction \(\left\vert {\psi }_{0}\right\rangle\) and to minimize the energy of the state formed by a linear combination of \(\left\vert {\psi }_{0}\right\rangle\) and \(\left\vert {\psi }_\mathrm{p}\right\rangle\). The resulting energy is then guaranteed to be no higher than the initial one.

Only one Lanczos step is applied in this work, so we have one state \(\left\vert {\psi }_{1}\right\rangle\) satisfying \(\left\langle {\psi }_{0}| {\psi }_{1}\right\rangle =0\) given by

$$\left\vert {\psi }_{1}\right\rangle =\frac{{{{\mathcal{H}}}}-{E}_{0}}{\sigma }\left\vert {\psi }_{0}\right\rangle ,$$
(35)

where \({E}_{0}=\left\langle {\psi }_{0}| {{{\mathcal{H}}}}| {\psi }_{0}\right\rangle\) and \({\sigma }^{\;2}=\left\langle {\psi }_{0}| {{{{\mathcal{H}}}}}^{2}| {\psi }_{0}\right\rangle -{E}_{0}^{\;2}\). The linear combination of \(\left\vert {\psi }_{0}\right\rangle\) and \(\left\vert {\psi }_{1}\right\rangle\) can be written as

$$\left\vert {\psi }_{\alpha }\right\rangle =\left\vert {\psi }_{0}\right\rangle +\alpha \left\vert {\psi }_{1}\right\rangle ,$$
(36)

whose energy is

$${E}_{\alpha }={E}_{0}+\frac{\left\langle {\psi }_{\alpha }\right\vert ({{{\mathcal{H}}}}-{E}_{0})\left\vert {\psi }_{\alpha }\right\rangle }{\left\langle {\psi }_{\alpha }| {\psi }_{\alpha }\right\rangle }={E}_{0}+\sigma \frac{{\alpha }^{2}{\mu }_{3}+2\alpha }{{\alpha }^{2}+1},$$
(37)

where

$${\mu }_{n}=\frac{\left\langle {\psi }_{0}\right\vert {({{{\mathcal{H}}}}-{E}_{0})}^{n}\left\vert {\psi }_{0}\right\rangle }{{\sigma }^{n}}.$$
(38)

The minimal energy is achieved at

$${\alpha }_{* }=\frac{{\mu }_{3}-\sqrt{{\mu }_{3}^{2}+4}}{2},$$
(39)

and the lowest energy is

$${E}_{{\alpha }_{* }}={E}_{0}+\sigma \frac{{\alpha }_{* }^{2}{\mu }_{3}+2{\alpha }_{* }}{{\alpha }_{* }^{2}+1}={E}_{0}+\sigma {\alpha }_{* }.$$
(40)

Initial guess of α

A direct way to compute μn is by measuring suitable quantities as expectation values of the initial state \(\left\vert {\psi }_{0}\right\rangle\). However, the measurement becomes more accurate if it is performed with a state \(\left\vert {\psi }_{{\alpha }_{0}}\right\rangle\) closer to the ground state67.

In this paper, we first estimate a suitable α0 such that \(\left\vert {\psi }_{{\alpha }_{0}}\right\rangle\) is closer to the true ground state than \(\left\vert {\psi }_{0}\right\rangle\). Then, from equation (37), one can compute μ3 as

$${\mu }_{3}=\frac{\left({\alpha }_{0}^{2}+1\right)\left({E}_{{\alpha }_{0}}-{E}_{0}\right)/\sigma -2{\alpha }_{0}}{{\alpha }_{0}^{2}},$$
(41)

where \({E}_{{\alpha }_{0}}\) can be measured by Monte Carlo sampling. The optimal α* can be derived from μ3 by equation (39), and the lowest energy is then given by equation (40).
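Combining equations (39)–(41), the improved energy follows from the measured quantities in a few lines (a sketch; E0, sigma and E_alpha0 stand for the Monte Carlo estimates defined above).

```python
import numpy as np

def lanczos_step_energy(E0, sigma, alpha0, E_alpha0):
    """One Lanczos step: recover mu_3 from a trial mixing alpha0 via eq. (41),
    the optimal alpha_* via eq. (39) and the improved energy via eq. (40)."""
    mu3 = ((alpha0**2 + 1) * (E_alpha0 - E0) / sigma - 2 * alpha0) / alpha0**2
    alpha_star = (mu3 - np.sqrt(mu3**2 + 4)) / 2
    E_star = E0 + sigma * alpha_star
    return alpha_star, E_star
```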

Energy variance

To compute the energy variance of \(\left\vert {\psi }_{\alpha }\right\rangle\), we start with an intermediate quantity

$${v}_{\alpha }=\frac{\left\langle {\psi }_{\alpha }\right\vert {({{{\mathcal{H}}}}-{E}_{0})}^{2}\left\vert {\psi }_{\alpha }\right\rangle }{{\sigma }^{\,2}\left\langle {\psi }_{\alpha }| {\psi }_{\alpha }\right\rangle }=\frac{{\alpha }^{2}{\mu }_{4}+2\alpha {\mu }_{3}+1}{{\alpha }^{2}+1}.$$
(42)

Like the previous case, one can measure \({v}_{{\alpha }_{0}}\) by Monte Carlo sampling and determine μ4 as

$${\mu }_{4}=\frac{\left({\alpha }_{0}^{2}+1\right){v}_{{\alpha }_{0}}-2{\alpha }_{0}{\mu }_{3}-1}{{\alpha }_{0}^{2}}.$$
(43)

Then \({v}_{{\alpha }_{* }}\) can be computed given μ3 and μ4, which gives the required energy variance as

$${\sigma }_{{\alpha }_{* }}^{2}=\frac{\left\langle {\psi }_{{\alpha }_{* }}\right\vert {\left({{{\mathcal{H}}}}-{E}_{{\alpha }_{* }}\right)}^{2}\left\vert {\psi }_{{\alpha }_{* }}\right\rangle }{\left\langle {\psi }_{{\alpha }_{* }}| {\psi }_{{\alpha }_{* }}\right\rangle }={\sigma }^{2}{v}_{{\alpha }_{* }}-{\left({E}_{0}-{E}_{{\alpha }_{* }}\right)}^{2}.$$
(44)