The Expressivity of Classical and Quantum Neural Networks on Entanglement Entropy

Analytically continuing the von Neumann entropy from R\'enyi entropies is a challenging task in quantum field theory. While the $n$-th R\'enyi entropy can be computed using the replica method in the path integral representation of quantum field theory, the analytic continuation can only be achieved for some simple systems on a case-by-case basis. In this work, we propose a general framework to tackle this problem using classical and quantum neural networks with supervised learning. We begin by studying several examples with known von Neumann entropy, where the input data is generated by representing $\text{Tr} \rho_A^n$ with a generating function. We adopt KerasTuner to determine the optimal network architecture and hyperparameters with limited data. In addition, we frame a similar problem in terms of quantum machine learning models, where the expressivity of the quantum models for the entanglement entropy as a partial Fourier series is established. Our proposed methods can accurately predict the von Neumann and R\'enyi entropies numerically, highlighting the potential of deep learning techniques for solving problems in quantum information theory.


Introduction
The von Neumann entropy is widely regarded as an effective measure of quantum entanglement, and is often referred to as entanglement entropy. The study of entanglement entropy has yielded valuable applications, particularly in the context of quantum information and quantum gravity (see [1,2] for a review). However, the analytic continuation from the Rényi entropies to von Neumann entropy remains a challenge in quantum field theory for general systems. We tackle this problem using both classical and quantum neural networks to examine their expressive power on entanglement entropy and the potential for simpler reconstruction of the von Neumann entropy from Rényi entropies.
Quantum field theory (QFT) provides an efficient method to compute the $n$-th Rényi entropy with integer $n > 1$, which is defined as [3]
$$S_n(\rho_A) \equiv \frac{1}{1-n} \ln \text{Tr}(\rho_A^n). \qquad (1.1)$$
The computation is done by replicating the path integral representation of the reduced density matrix $\rho_A$ $n$ times. This step is non-trivial; however, we will mainly look at examples where explicit analytic expressions of the Rényi entropies are available, especially in two-dimensional conformal field theories (CFT$_2$) [4][5][6][7]. Then, upon analytic continuation $n \to 1$, we have the von Neumann entropy
$$S(\rho_A) \equiv \lim_{n \to 1} S_n(\rho_A).$$
The continuation can be viewed as an independent problem from computing the $n$-th Rényi entropy. Although the uniqueness of $S(\rho_A)$ from the continuation is guaranteed by Carlson's theorem, analytic expressions in closed form are currently unknown for most cases. Furthermore, while $S_n(\rho_A)$ is well-defined for both integer and non-integer $n$, determining it for a set of integer values $n > 1$ is not sufficient. To obtain the von Neumann entropy, we must also take the limit $n \to 1$ through real $n > 1$. The relationship between the Rényi entropies and the von Neumann entropy is therefore subtle, and it is not clear how large $n$ must be taken for a precise numerical approximation of $S(\rho_A)$.
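The limit $n \to 1$ can be illustrated numerically on a toy example. The sketch below assumes a hypothetical three-level spectrum for $\rho_A$ (not one of the CFT examples studied later) and compares Rényi entropies at integer $n$ with the value obtained for real $n$ close to 1.

```python
import math

def renyi_entropy(probs, n):
    """S_n = ln(sum_i p_i^n) / (1 - n) for eigenvalues p_i of rho_A, n != 1."""
    return math.log(sum(p ** n for p in probs)) / (1.0 - n)

def von_neumann_entropy(probs):
    """S = -sum_i p_i ln p_i, skipping zero eigenvalues."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical spectrum of a reduced density matrix (eigenvalues sum to 1).
spectrum = [0.5, 0.3, 0.2]

s_exact = von_neumann_entropy(spectrum)
# Integer Renyi entropies alone only bound the von Neumann entropy from below.
integer_renyi = [renyi_entropy(spectrum, n) for n in (2, 3, 4)]
# Approaching n = 1 through real n > 1 recovers the von Neumann entropy.
near_one = renyi_entropy(spectrum, 1.0 + 1e-6)
```

For a known spectrum the limit is trivial to take; the difficulty discussed in the text arises because field theory typically provides $\text{Tr}\,\rho_A^n$ only at integer $n$.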
Along this line, we are motivated to adopt an alternative method proposed in [8], which allows us to study the connection between higher Rényi entropies and the von Neumann entropy "accumulatively." This method relies on defining a generating function that manifests as a Taylor series
$$G(w; \rho_A) \equiv -\text{Tr}\big(\rho_A \ln[1 - w(1 - \rho_A)]\big) = \sum_{k=1}^{\infty} \frac{w^k}{k}\, \text{Tr}\big[\rho_A (1 - \rho_A)^k\big].$$
Summing over $k$ explicitly yields an absolutely convergent series that approximates the von Neumann entropy with increasing accuracy as $w \to 1$. This method has both numerical and analytical advantages; we refer to [8] for explicit examples.
Note that the accuracy we can achieve in approximating the von Neumann entropy depends on the truncation of the partial sum in $k$, which is case-dependent and can be difficult to evaluate. It becomes particularly challenging when evaluating the higher-order Riemann-Siegel theta function in the general two-interval case of CFT$_2$ [8], which remains an open problem. On the other hand, deep learning techniques have emerged as powerful tools for tackling the analytic continuation problem [9][10][11][12][13][14], thanks to their universal approximation property. The universal approximation theorem states that artificial neural networks can approximate any continuous function under mild assumptions [15], and the von Neumann entropy is no exception. A neural network is trained on a dataset of known function values, with the objective of learning a latent manifold that can approximate the original function within the known parameter space. Once trained, the model can be used to make predictions outside this space by extrapolation. Training minimizes the prediction errors between the model's outputs and the actual function values. In our study, we frame the supervised learning task in two distinct ways: the first approach uses densely connected neural networks to predict the von Neumann entropy, while the second uses sequential learning models to extract higher Rényi entropies.
Instead of using a static "define-and-run" scheme, where the model structure is defined beforehand and remains fixed throughout training, we have opted for a dynamic "define-by-run" approach. Our goal is to determine the optimal model complexity and hyperparameters based on the input validation data automatically. To achieve this, we employ KerasTuner [16] with Bayesian optimization, which efficiently explores the hyperparameter space by training and evaluating different neural network configurations using cross-validation. KerasTuner uses the results to update a probabilistic model of the hyperparameter space, which is then used to suggest the next set of hyperparameters to evaluate, aiming to maximize expected performance improvement.
A similar question can be explicitly framed in terms of quantum machine learning, where a trainable quantum circuit can be used to emulate neural networks by encoding both the data inputs and the trainable weights using quantum gates. This approach bears many different names [17][18][19][20][21][22], but we will call it a quantum neural network. Unlike classical neural networks, quantum neural networks are defined through a series of well-defined unitary operations, rather than by numerically optimizing the weights for the non-linear mapping between targets and data. This raises a fundamental question for quantum computing practitioners: can any unitary operation be realized, or is there a particular characterization for the learnable function class? In other words, is the quantum model universal in its ability to express any function with the given data input? Answering these questions will not only aid in designing future algorithms, but also provide deeper insights into how quantum models achieve universal approximation [23,24].
Recent progress in quantum neural networks has shown that data-encoding strategies play a crucial role in their expressive power. The problem of data encoding has been the subject of extensive theoretical and numerical studies [25][26][27][28]. In this work, we build on the idea introduced in [29,30], which demonstrated the expressivity of quantum models as partial Fourier series. By rewriting the generating function for the von Neumann entropy in terms of a Fourier series, we can similarly establish the expressivity using quantum neural networks. However, the Gibbs phenomenon in the Fourier series poses a challenge in recovering the von Neumann entropy. To overcome this, we reconstruct the entropy by expanding the Fourier series into a basis of Gegenbauer polynomials.
The structure of this paper is as follows. In Sec. 2, we provide a brief overview of the analytic continuation of the von Neumann entropy from Rényi entropies within the framework of QFT. In addition, we introduce the generating function method that we use throughout the paper. In Sec. 3, we use densely connected neural networks with KerasTuner to extract the von Neumann entropy for several examples where analytic expressions are known. In Sec. 4, we employ sequential learning models for extracting higher Rényi entropies. Sec. 5 is dedicated to studying the expressive power of quantum neural networks in approximating the von Neumann entropy. In Sec. 6, we summarize our findings and discuss possible applications of our approach. Appendix A is devoted to the details of rewriting the generating function as a partial Fourier series, while Appendix B addresses the Gibbs phenomenon using Gegenbauer polynomials.

Analytic continuation of von Neumann entropy from Rényi entropies
Let us discuss how to calculate the von Neumann entropy in QFTs [31][32][33][34]. Suppose we start with a QFT on a $d$-dimensional Minkowski spacetime with its Hilbert space specified on a Cauchy slice $\Sigma$ of the spacetime. Without loss of generality, we can divide $\Sigma$ into two disjoint sub-regions $\Sigma = A \cup A^c$, where $A^c$ denotes the complement of $A$. Accordingly, the Hilbert space factorizes into the tensor product
$$\mathcal{H}_\Sigma = \mathcal{H}_A \otimes \mathcal{H}_{A^c}.$$
We then define a reduced density matrix $\rho_A$ from a pure state on $\Sigma$, which is therefore mixed, to capture the entanglement between the two regions. The von Neumann entropy $S(\rho_A)$ allows us to quantify this entanglement
$$S(\rho_A) \equiv -\text{Tr}(\rho_A \ln \rho_A) = \frac{\text{Area}(\partial A)}{\epsilon^{d-2}} + \cdots.$$
Since it vanishes for pure states and is maximal for maximally mixed states, it is a fine-grained measure for the amount of entanglement between $A$ and $A^c$. The second equality holds for field theory, where we require a length scale $\epsilon$ to regulate the UV divergence encoded in the short-distance correlations. The leading-order divergence is captured by the area of the entangling surface $\partial A$, a universal feature of QFTs [35]. There have been efforts to better understand the structure of the entanglement in QFTs, including free theory [36], heat kernels [37,38], CFT techniques [39] and holographic methods based on AdS/CFT [40,41]. Operationally, however, computing the von Neumann entropy analytically or numerically is still a daunting challenge for generic interacting QFTs. For a review, see [1].
The path integral provides a general method to access $S(\rho_A)$. The method starts with the Rényi entropies [3]
$$S_n(\rho_A) = \frac{1}{1-n} \ln \text{Tr}\,\rho_A^n$$
for real $n > 1$. As previously mentioned, obtaining the von Neumann entropy via analytic continuation $n \to 1$ requires two crucial steps. An analytic form for the $n$-th Rényi entropy must first be derived from the underlying field theory, and then we need to perform the analytic continuation toward $n \to 1$. These two steps are independent problems and often require different techniques. We will briefly comment on the two steps below. Computing $\text{Tr}\,\rho_A^n$ directly is not easy, which is where the replica method enters. The early form of the replica method was developed in [34], and was later used to compute various examples in CFT$_2$ [4][5][6][7], which can be compared with holographic ones [42]. The idea behind the replica method is to consider an orbifold of $n$ copies of the field theory to compute $\text{Tr}\,\rho_A^n$ for positive integers $n$. The computation reduces to evaluating the partition function on an $n$-sheeted Riemann surface, which can alternatively be computed by correlation functions of twist operators in the $n$ copies. For more details on the construction in CFTs, see [4][5][6][7]. If we are able to compute $\text{Tr}\,\rho_A^n$ for any positive integer $n \geq 1$, the corresponding Rényi entropies follow directly from the definition above. Such a computation is feasible for special states and regions, such as ball-shaped regions in the vacuum of a CFT$_d$. In CFT$_2$, however, the infinite-dimensional symmetry is sufficient to fix lower-point correlation functions, so we are able to compute $\text{Tr}\,\rho_A^n$ in several instances.
The analytic continuation $n \to 1$ is more subtle. Ensuring the existence of a unique analytic extension away from integer $n$ typically requires the application of Carlson's theorem. This theorem guarantees the uniqueness of the analytic continuation from Rényi entropies to the von Neumann entropy, provided that we can find some locally holomorphic function $\mathcal{S}_\nu$ with $\nu \in \mathbb{C}$ such that $\mathcal{S}_n = S_n(\rho)$ for all integers $n > 1$, with appropriate asymptotic behavior as $\nu \to \infty$. The continuation is then unique, $S_\nu(\rho) = \mathcal{S}_\nu$ [43,44]. Carlson's theorem addresses not only the problem of unique analytic continuation but also the issue of continuing across non-integer values of the Rényi entropies.
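A standard illustration of why the asymptotic condition matters is sketched below; the symbol $\tilde{\mathcal{S}}_\nu$ is introduced here only for illustration and does not appear in the references. Adding $\sin(\pi\nu)$ leaves every integer value unchanged, yet produces a different function of $\nu$. It is excluded because it grows along the imaginary direction at exactly the rate that Carlson's theorem forbids.

```latex
\tilde{\mathcal{S}}_\nu = \mathcal{S}_\nu + \sin(\pi \nu),
\qquad
\tilde{\mathcal{S}}_n = \mathcal{S}_n \quad \text{for all integers } n,
\qquad
|\sin(\pi \nu)| \sim \tfrac{1}{2}\, e^{\pi |\mathrm{Im}\, \nu|}
\quad \text{as } |\mathrm{Im}\, \nu| \to \infty .
```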
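There are other methods to evaluate $S(\rho_A)$ in the context of string theory and AdS/CFT; see for example [45][46][47][48][49][50]. In this work, we focus on an effective method outlined in [8] that is suitable for numerical considerations. In [8], the following generating function is used for the analytic continuation in $n$ with a variable $z$
$$G(z; \rho_A) \equiv -\text{Tr}\left(\rho_A \ln \frac{1 - z\rho_A}{1 - z}\right) = \sum_{n=1}^{\infty} \frac{z^n}{n} \left(\text{Tr}(\rho_A^{n+1}) - 1\right).$$
This manifest Taylor series is absolutely convergent in the unit disc $|z| < 1$. We can analytically continue the function from the unit disc to a holomorphic function in $\mathbb{C} \setminus [1, \infty)$ by choosing the branch cut of the logarithm to be along the positive real axis. The limit $z \to -\infty$ is within the domain of holomorphicity and is exactly where we obtain the von Neumann entropy
$$S(\rho_A) = \lim_{z \to -\infty} G(z; \rho_A).$$
However, a more useful form can be obtained by performing a Möbius transformation to a new variable $w$
$$G(w; \rho_A) = -\text{Tr}\big(\rho_A \ln[1 - w(1 - \rho_A)]\big), \qquad w = \frac{z}{z - 1}.$$
It again manifests as a Taylor series
$$G(w; \rho_A) = \sum_{k=1}^{\infty} \frac{w^k}{k}\, \text{Tr}\big[\rho_A (1 - \rho_A)^k\big], \qquad \text{Tr}\big[\rho_A (1 - \rho_A)^k\big] = \sum_{m=0}^{k} (-1)^m \binom{k}{m} \text{Tr}(\rho_A^{m+1}).$$
We again have a series written in terms of $\text{Tr}\,\rho_A^n$, and it is absolutely convergent in the unit disc $|w| < 1$. The convenience of using $w$ is that by taking $w \to 1$, we have the von Neumann entropy
$$S(\rho_A) = \lim_{w \to 1} G(w; \rho_A) = \sum_{k=1}^{\infty} \frac{1}{k}\, \text{Tr}\big[\rho_A (1 - \rho_A)^k\big]. \qquad (2.9)$$
This provides an exact expression of $S(\rho_A)$ starting from a known expression of $\text{Tr}\,\rho_A^n$. Numerically, we can obtain an accurate value of $S(\rho_A)$ by computing a partial sum in $k$. The method guarantees that by summing to sufficiently large $k$, we approach the von Neumann entropy with increasing accuracy.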
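The partial sum at $w \to 1$ can be sketched in a few lines. The example below assumes a hypothetical three-level spectrum, builds the moments $\text{Tr}\,\rho_A^n$, and evaluates the coefficients $\text{Tr}[\rho_A(1-\rho_A)^k]$ from those moments alone through the binomial expansion. Exact rational arithmetic is an implementation choice made here, since the alternating binomial sums cancel badly in floating point.

```python
import math
from fractions import Fraction

def moments(probs, n_max):
    """Return {n: Tr rho^n} for n = 1..n_max from an exact spectrum."""
    return {n: sum(p ** n for p in probs) for n in range(1, n_max + 1)}

def series_coefficient(tr, k):
    """Tr[rho (1 - rho)^k] written purely in terms of the moments Tr rho^n."""
    return sum((-1) ** m * math.comb(k, m) * tr[m + 1] for m in range(k + 1))

def entropy_partial_sum(tr, k_max):
    """Partial sum of G(w -> 1): sum_{k=1}^{k_max} Tr[rho (1 - rho)^k] / k."""
    return sum(series_coefficient(tr, k) / k for k in range(1, k_max + 1))

# Hypothetical spectrum; Fractions keep the alternating binomial sums exact.
spectrum = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
tr = moments(spectrum, 51)  # k_max = 50 needs moments up to n = 51

exact = -sum(float(p) * math.log(float(p)) for p in spectrum)
approx_10 = float(entropy_partial_sum(tr, 10))
approx_50 = float(entropy_partial_sum(tr, 50))
```

Every term of the series is non-negative, so the partial sums approach the entropy monotonically from below. The convergence is fast for this toy spectrum because no eigenvalue is close to zero.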
However, a difficulty is that we need to sum up $k \sim 10^3$ terms to achieve precision within $10^{-3}$ in general [8]. This is computationally costly for cases with complicated $\text{Tr}\,\rho_A^n$. Therefore, one advantage the neural network framework offers is the ability to give accurate predictions with only a limited amount of data, making it a more efficient method.
In this paper, we focus on various examples from CFT$_2$ with known analytic expressions of $\text{Tr}\,\rho_A^n$ [6], and we use the generating function $G(w; \rho_A)$ to generate the required training datasets for the neural networks.

Deep learning von Neumann entropy
This section uses deep neural networks to predict the von Neumann entropy via a supervised learning approach. By leveraging the gradient-based learning principle of the networks, we expect to find a non-linear mapping between the input data and the output targets. In the analytic continuation problem from the $n$-th Rényi entropy to the von Neumann entropy, such a non-linear mapping naturally arises. Accordingly, we consider $S_n(\rho_A)$ (equivalently $\text{Tr}\,\rho_A^n$ and the generating function) as our input data and $S(\rho_A)$ as the target function for the training process. Since the task is supervised, we consider examples where analytic expressions of both sides are available. Ultimately, we will employ the trained models to predict the von Neumann entropy across various physical parameter regimes, demonstrating the efficacy and robustness of the approach.
The major advantage of deep neural networks is that they improve on the accuracy of the generating function for computing the von Neumann entropy. As we mentioned, the accuracy of this method depends on where we truncate the partial sum, and it often requires summing up to large $k$ in (2.9), which is numerically difficult. In a sense, it requires knowing much more information, such as that of the higher Rényi entropies indicated by $\text{Tr}\,\rho_A^n$ in the series. Trained neural networks are able to predict the von Neumann entropy more accurately given far fewer terms in the input data. We can even predict the von Neumann entropy for other parameter regimes without resorting to any data from the generating function. Furthermore, the non-linear mappings the deep neural networks uncover can be useful for investigating the expressive power of neural networks on the von Neumann entropy. Additionally, they can be applied to study cases where analytic continuations are unknown and other entanglement measures that require analytic continuations.
In the following subsections, we will give more details on our data preparation and training strategies, then we turn to explicit examples as demonstrations.

Model architectures and training strategies
Generating suitable training datasets and designing flexible deep learning models are empirically driven. In this subsection, we outline our strategies for both aspects.

Data preparation
To prepare the training datasets, we consider several examples with known $S(\rho_A)$. We use the generating function $G(w; \rho_A)$, which can be computed from $\text{Tr}\,\rho_A^n$ for each example. This is equivalent to computing the higher Rényi entropies with different choices of physical parameters, since the "information" available is always $\text{Tr}\,\rho_A^n$. Note, however, that each higher Rényi entropy carries distinct information. Adopting the generating function is therefore preferable to using $S_n(\rho_A)$ itself, as it approaches the von Neumann entropy with increasing accuracy, making the comparison more transparent.
We generate $N = 10000$ input datasets for a fixed range of physical parameters, where each set contains $k_{\max} = 50$ terms in (2.9); their corresponding von Neumann entropies are the targets. We limit the amount of data to mimic the computational cost of using the generating function. We shuffle the input datasets randomly and then split the data into 80% for training, 10% for validation, and 10% for testing. Additionally, we use the trained neural networks to make predictions on another set of 10000 test datasets in a different physical parameter regime and compare them with the correct values as a non-trivial test for each example.
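The shuffle-and-split step can be sketched as follows. This is a minimal illustration in plain Python; the helper name `train_val_test_split`, the fixed seed, and the integer placeholders standing in for (input, target) pairs are assumptions made for the example.

```python
import random

def train_val_test_split(samples, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the samples and split them into train, validation and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(round(fractions[0] * len(shuffled)))
    n_val = int(round(fractions[1] * len(shuffled)))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Placeholder dataset: each integer stands for one (50-term input, entropy target) pair.
dataset = list(range(10000))
train, val, test = train_val_test_split(dataset)
```

The additional 10000 test sets from a different parameter regime are kept entirely outside this split, so they probe extrapolation rather than interpolation.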

Model design
To prevent overfitting and enhance the generalizability of our model, we employ a combination of techniques in the design of the neural networks. The ReLU activation function is used throughout this section. We adopt the Adam optimizer [51] in the training process with the mean squared error (MSE) as the loss function.
We consider a neural network consisting of a few hidden Dense layers with varying numbers of units in TensorFlow-Keras [52,53]. In this case, each neuron in a layer receives input from all the neurons in the previous layer. The Dense connection allows the model to find non-linear relations between the input and output, which is the case for analytic continuation. The final layer is a Dense layer with a single unit that outputs a unique value for each training dataset, which is expected to correspond to the von Neumann entropy. As an example, we show a neural network with 3 hidden Dense layers, each with 8 units, in Figure 1.
Bayesian optimization is a method for finding the optimal set of designs and hyperparameters for a given dataset, by iteratively constructing a probabilistic model from a prior distribution for the objective function and using it to guide the search. Once the tuner search loop is complete, we extract the best model in the final training phase by including both the training and validation data.
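To make the densely connected architecture concrete, the following sketch implements the forward pass of a network with 3 hidden layers of 8 units and a single output unit in plain Python. It is an untrained illustration, not the TensorFlow-Keras model used in the paper; the input width of 50 (one entry per term of the partial sum) and the random initialization are assumptions made for the example.

```python
import random

def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """Fully connected layer: every unit sees every input (weights[j][i])."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases; the scaling is an illustrative choice."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_network(layer_sizes, rng):
    return [init_layer(a, b, rng) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(network, inputs):
    """ReLU on hidden layers, linear single-unit output (the predicted entropy)."""
    hidden = inputs
    for weights, biases in network[:-1]:
        hidden = relu(dense(hidden, weights, biases))
    weights, biases = network[-1]
    return dense(hidden, weights, biases)[0]

def count_parameters(network):
    return sum(len(w) * len(w[0]) + len(b) for w, b in network)

rng = random.Random(0)
network = build_network([50, 8, 8, 8, 1], rng)  # assumed input width of 50
prediction = forward(network, [0.01] * 50)
n_params = count_parameters(network)
```

With these sizes the network has 561 trainable parameters, small compared with the 8000 training samples.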
To determine the optimal setting of our neural networks, we employ KerasTuner [16], a powerful tool that allows us to explore different combinations of model complexity, depth, and hyperparameters for a given task. An illustration of the KerasTuner process can be found in Figure 2. We use Bayesian optimization, and adjust the following designs and hyperparameters:
• We allow a maximum of 4 Dense layers. For each layer, we allow variable units in the range of 16 to 128 with a step size of 16. The number of units for each layer is independent of the others.
• We allow BatchNormalization layers after the Dense layers as a Boolean choice to improve generalization and act as a regularizer.
• A final dropout with log sampling of a dropout rate in the range of 0.1 to 0.5 is added as a Boolean choice.
• In the Adam optimizer, we only adjust the learning rate with log sampling from the range of $3 \times 10^{-3}$ to $9 \times 10^{-3}$. All other parameters are taken as default values in TensorFlow-Keras. We also use the AMSGrad [54] variant of this algorithm as a Boolean choice.
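The search space listed above can be written down explicitly. The sketch below draws random configurations from it in plain Python; it does not use the KerasTuner API and it performs plain random sampling rather than Bayesian optimization. The function names and the minimum of one Dense layer are assumptions made for illustration.

```python
import math
import random

def log_uniform(rng, low, high):
    """Sample uniformly in log space between low and high ("log sampling")."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def sample_configuration(rng):
    """Draw one candidate design from the search space described in the text."""
    n_layers = rng.randint(1, 4)  # at most 4 Dense layers; minimum of 1 is assumed
    config = {
        # Units per layer are drawn independently: 16, 32, ..., 128.
        "units": [rng.randrange(16, 129, 16) for _ in range(n_layers)],
        "batch_norm": rng.random() < 0.5,   # Boolean choice
        "learning_rate": log_uniform(rng, 3e-3, 9e-3),
        "amsgrad": rng.random() < 0.5,      # Boolean choice
        "dropout_rate": None,               # None means no final dropout
    }
    if rng.random() < 0.5:                  # Boolean choice for the final dropout
        config["dropout_rate"] = log_uniform(rng, 0.1, 0.5)
    return config

rng = random.Random(0)
trials = [sample_configuration(rng) for _ in range(100)]
```

A Bayesian tuner differs from this sketch in how the next configuration is chosen: it fits a probabilistic model to the validation losses observed so far instead of sampling blindly.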
We deploy the KerasTuner for 100 trials with 2 executions per trial and monitor the validation loss with EarlyStopping of patience 8. Once the training is complete, since we will not be making any further hyperparameter changes, we no longer evaluate performance on the validation data. A common practice is to initialize new models using the best model designs found by KerasTuner while also including the validation data as part of the training data. Indeed, we select the top 5 best designs and train each one 20 times with EarlyStopping of patience 8. We pick the one with the smallest relative errors from the targets among the 5 × 20 models as our final model. We set the batch size in both the KerasTuner and the final training to be 512.
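The stopping rule and the final selection criterion can be sketched as follows. This is a plain-Python illustration of patience-based early stopping and of a mean relative error metric, assuming that any strict decrease of the monitored loss counts as an improvement; the function names and the loss history are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=8):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once `patience` consecutive epochs pass without the
    monitored loss improving on its best value so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

def mean_relative_error(predictions, targets):
    """Average of |prediction - target| / |target| over a dataset."""
    return sum(abs(p - t) / abs(t) for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical loss history: improves for three epochs, then plateaus.
history = [1.0, 0.5, 0.4] + [0.45] * 20
stop_epoch, best_epoch = early_stopping_epoch(history, patience=8)
error = mean_relative_error([1.1, 1.9], [1.0, 2.0])
```

In this history the best loss occurs at epoch 2 and training halts at epoch 10, eight epochs later. The final model among the retrained candidates would then be the one with the smallest relative error.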
In the following two subsections, we will examine examples from CFT 2 with Tr ρ n A and their corresponding von Neumann entropies S(ρ A ) [4][5][6][7][8]. These instances are distinct and worth studying for several reasons. They have different mathematical structures and lack common patterns in their derivation from the field theory side, despite involving the evaluation of certain partition functions. Moreover, the analytic continuation for each case is intricate, providing strong evidence for the necessity of independent model designs.

Entanglement entropy of a single interval
Throughout the following, we will only present the analytic expression of $\text{Tr}\,\rho_A^n$ since it is the only input of the generating function. We will also keep the UV cutoff $\epsilon$ explicit in the formulas.

Single interval
The simplest example corresponds to a single interval $A$ of length $\ell$ in the vacuum state of a CFT$_2$ on an infinite line. In this case, both the analytic forms of $\text{Tr}\,\rho_A^n$ and $S(\rho_A)$ are known [4], where $S(\rho_A)$ reduces to a simple logarithmic function that depends on $\ell$. We have the following analytic form with a central charge $c$
$$\text{Tr}\,\rho_A^n = \left(\frac{\ell}{\epsilon}\right)^{-\frac{c}{6}\left(n - \frac{1}{n}\right)},$$
which defines $G(w; \rho_A)$. The corresponding von Neumann entropy is given by
$$S(\rho_A) = \frac{c}{3} \ln \frac{\ell}{\epsilon}.$$
We fixed the central charge $c = 1$ and the UV cutoff $\epsilon = 0.1$ when preparing the datasets. We generated 10000 sets of data for the train-validation-test split from $\ell = 1$ to 50, with an increment of $\Delta\ell = 5 \times 10^{-3}$ between each step, keeping terms up to $k = 50$ in $G(w; \rho_A)$.
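One input sample for this example can be generated as sketched below, assuming the standard single-interval result $\text{Tr}\,\rho_A^n = (\ell/\epsilon)^{-\frac{c}{6}(n - 1/n)}$ with target $S(\rho_A) = \frac{c}{3}\ln(\ell/\epsilon)$. The use of Python's `decimal` module is an implementation choice made here, since the binomial coefficients at $k = 50$ reach about $10^{14}$ and the alternating sum loses accuracy in double precision.

```python
from decimal import Decimal, getcontext
import math

getcontext().prec = 60  # extra digits absorb cancellations in the binomial sums

def trace_rho_n(n, length, cutoff, c=1):
    """Tr rho_A^n = (l/eps)^(-(c/6)(n - 1/n)), evaluated in high precision."""
    n = Decimal(n)
    exponent = -(Decimal(c) / 6) * (n - 1 / n)
    return (exponent * (Decimal(length) / Decimal(cutoff)).ln()).exp()

def partial_sums(length, cutoff, k_max=50, c=1):
    """Running partial sums of sum_k Tr[rho (1 - rho)^k] / k for k = 1..k_max."""
    tr = [None] + [trace_rho_n(n, length, cutoff, c) for n in range(1, k_max + 2)]
    sums, total = [], Decimal(0)
    for k in range(1, k_max + 1):
        coeff = sum((-1) ** m * math.comb(k, m) * tr[m + 1] for m in range(k + 1))
        total += coeff / k
        sums.append(float(total))
    return sums

def exact_entropy(length, cutoff, c=1):
    """Target value S = (c/3) ln(l/eps)."""
    return (c / 3.0) * math.log(float(length) / float(cutoff))

# One sample at l = 10 with eps = 0.1 (parameters passed as strings for Decimal).
sums = partial_sums("10", "0.1")
target = exact_entropy("10", "0.1")
```

The list `sums` plays the role of a 50-term input and `target` the corresponding label. The partial sums increase monotonically but stay below the target at $k = 50$, which is the gap the network is trained to close.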
To further validate our model, we generated an additional 10000 test datasets for the following physical parameters: $\ell = 51$ to 100 with $\Delta\ell = 5 \times 10^{-3}$. For a density plot of the data distribution with respect to the target von Neumann entropy, see Figure 3. Figure 4 illustrates that the process outlined in the previous subsection reduces the relative errors in predicting the test data to a very small level. Moreover, the model's effectiveness is further confirmed by its ability to achieve similarly small relative errors when predicting the additional test datasets. The accuracy of the model's predictions for the two test datasets significantly surpasses the approximate entropy obtained by summing the first 50 terms of the generating function, as can be seen in Figure 5. We emphasize that in order for the generating function to achieve the same accuracy as the deep neural networks, we generally need to sum $k \geq 400$ terms from (2.9) [8]. This applies to all the following examples.
Figure 4. Left: The MSE loss function as a function of epochs. We monitor the loss function with EarlyStopping, where the minimum loss is achieved at epoch 410 with loss $\approx 10^{-7}$ for this instance. Right: The density plot of relative errors between the model predictions and targets. Note that the blue color corresponds to the test datasets from the initial train-validation-test split, while the green color is for the additional test datasets. We can see clearly that for both datasets, we have achieved high accuracy with relative errors $\lesssim 0.30\%$.
In this example, the von Neumann entropy is a simple logarithmic function, making it relatively straightforward for the deep learning models to decipher. However, we will now move on to a more challenging example.

Single interval at finite temperature and length
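We extend the single-interval case to finite temperature and length, where $\text{Tr}\,\rho_A^n$ becomes a complicated function of the inverse temperature $\beta = T^{-1}$ and the length $\ell$. The analytic expression of the Rényi entropies was first derived in [55] for a two-dimensional free Dirac fermion on a circle from bosonization. We can impose periodic boundary conditions that correspond to finite size and finite temperature. For simplicity, we set the total spatial size $L$ to 1, and use $\ell$ to denote the interval length. In this case we have [55]
$$\text{Tr}\,\rho_A^n = \prod_{k=-\frac{n-1}{2}}^{\frac{n-1}{2}} \left| \frac{2\pi\epsilon\, \eta(\tau)^3}{\theta_1(\ell|\tau)} \right|^{\frac{2k^2}{n^2}} \frac{\left|\theta_\nu\!\left(\frac{k\ell}{n}\big|\tau\right)\right|^2}{|\theta_\nu(0|\tau)|^2}, \qquad (3.3)$$
where $\epsilon$ is a UV cutoff and $\tau = i\beta$. We study the case of $\nu = 3$, which is the Neveu-Schwarz (NS-NS) sector. We then have the following Dedekind eta function $\eta(\tau)$ and the Jacobi theta functions $\theta_1(z|\tau)$ and $\theta_3(z|\tau)$
$$\eta(\tau) \equiv q^{\frac{1}{24}} \prod_{n=1}^{\infty} (1 - q^n), \qquad q = e^{2\pi i \tau},$$
$$\theta_1(z|\tau) \equiv \sum_{n \in \mathbb{Z}} (-1)^{n - \frac{1}{2}} e^{(n + \frac{1}{2})^2 i\pi\tau} e^{(2n+1)\pi i z}, \qquad \theta_3(z|\tau) \equiv \sum_{n \in \mathbb{Z}} e^{n^2 i\pi\tau} e^{2n\pi i z}.$$
Previously, the von Neumann entropy after analytically continuing (3.3) was only known in the high- and low-temperature regimes [55]. In fact, only the infinite-length or zero-temperature pieces are universal. However, the analytic von Neumann entropy for all temperatures was recently worked out in [56,57]. The resulting expression (3.6) is written in terms of the Weierstrass sigma function $\sigma$ and zeta function $\zeta$ with periods 1 and $i\beta$. The analytic expressions for both $\text{Tr}\,\rho_A^n$ and $S(\rho_A)$ are clearly rather different from those of the previous example.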
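A numerical evaluation of $\text{Tr}\,\rho_A^n$ for this example can be sketched as follows. The code assumes the product form of [55] with $\nu = 3$, modular parameter $\tau = i\beta$, total size $L = 1$, and the standard series for $\eta$, $\theta_1$ and $\theta_3$, which are real for purely imaginary $\tau$. The truncation orders are illustrative choices.

```python
import math

def dedekind_eta(beta, terms=200):
    """eta(i*beta) = q^(1/24) * prod_n (1 - q^n) with q = exp(-2*pi*beta)."""
    q = math.exp(-2.0 * math.pi * beta)
    product = 1.0
    for n in range(1, terms + 1):
        product *= 1.0 - q ** n
    return math.exp(-math.pi * beta / 12.0) * product

def theta1(z, beta, terms=50):
    """theta_1(z | i*beta) as a sine series (real for real z)."""
    return 2.0 * sum((-1) ** n * math.exp(-math.pi * beta * (n + 0.5) ** 2)
                     * math.sin((2 * n + 1) * math.pi * z) for n in range(terms))

def theta3(z, beta, terms=50):
    """theta_3(z | i*beta) as a cosine series (real for real z)."""
    return 1.0 + 2.0 * sum(math.exp(-math.pi * beta * n * n)
                           * math.cos(2.0 * math.pi * n * z) for n in range(1, terms))

def trace_rho_n(n, ell, beta, eps):
    """Product over k = -(n-1)/2, ..., (n-1)/2 in the nu = 3 (NS-NS) sector."""
    base = abs(2.0 * math.pi * eps * dedekind_eta(beta) ** 3 / theta1(ell, beta))
    result = 1.0
    for j in range(n):
        k = -(n - 1) / 2.0 + j
        result *= base ** (2.0 * k * k / n ** 2)
        result *= (theta3(k * ell / n, beta) / theta3(0.0, beta)) ** 2
    return result

# Parameters in the range used for the datasets: l = 0.5, eps = 0.1.
tr2 = trace_rho_n(2, 0.5, 0.75, 0.1)
# Low-temperature check: beta >> 1 should approach the finite-size vacuum result.
tr2_cold = trace_rho_n(2, 0.5, 10.0, 0.1)
cold_limit = (math.sin(math.pi * 0.5) / (math.pi * 0.1)) ** (-0.25)
```

In the low-temperature limit the expression reduces to $(\sin(\pi\ell)/(\pi\epsilon))^{-\frac{1}{6}(n - 1/n)}$, which provides a simple consistency check on the implementation.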
In preparing the datasets, we fixed the interval length $\ell = 0.5$ and the UV cutoff $\epsilon = 0.1$. We generated 10000 sets of data for the train-validation-test split from $\beta = 0.5$ to 1.0, with an increment of $\Delta\beta = 5 \times 10^{-5}$ between each step, keeping terms up to $k = 50$ in $G(w; \rho_A)$. Since $\beta$ corresponds to the inverse temperature, this is a natural parameter to vary as the formula (3.6) is valid for all temperatures. To further validate our model, we generated 10000 additional test datasets for the following physical parameters: $\beta = 1.0$ to 1.5 with $\Delta\beta = 5 \times 10^{-5}$. A density plot of the data with respect to the von Neumann entropy is shown in Figure 6. As shown in Figure 7 and Figure 8, our model demonstrates its effectiveness in predicting both test datasets, providing accurate results for this highly non-trivial example.
Figure 8: ... (3.6) for the two test datasets. Again, the approximate entropy obtained by summing $k = 50$ terms in the generating function is included.

Entanglement entropy of two disjoint intervals
We now turn to von Neumann entropy for the union of two intervals on an infinite line. In this case, several analytic expressions can be derived for both Rényi and von Neumann entropies. The theory we will consider is a CFT 2 for a free boson with central charge c = 1, and the von Neumann entropy will be distinguished by two parameters, a cross-ratio x and a universal critical exponent η. The latter is proportional to the square of the compactification radius.
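To set up the system, we define the union of the two intervals as $A \cup B$ with $A = [x_1, x_2]$ and $B = [x_3, x_4]$. The cross-ratio is defined to be
$$x = \frac{x_{12}\, x_{34}}{x_{13}\, x_{24}}, \qquad x_{ij} = x_i - x_j.$$
With this definition, we can write down the traces entering the generating function for two intervals in a free boson CFT with finite $x$ and $\eta$ [5]
$$\text{Tr}(\rho^n) = c_n^2 \left( \frac{\epsilon^2\, x_{13}\, x_{24}}{x_{12}\, x_{34}\, x_{14}\, x_{23}} \right)^{\frac{c}{6}\left(n - \frac{1}{n}\right)} F_n(x, \eta), \qquad (3.8)$$
where $\epsilon$ is a UV cutoff and $c_n$ is a model-dependent coefficient [6] that we set to $c_n = 1$ for simplicity. An exact expression for $F_n(x, \eta)$ is given by
$$F_n(x, \eta) = \frac{\Theta(0|\eta\Gamma)\, \Theta(0|\Gamma/\eta)}{[\Theta(0|\Gamma)]^2} \qquad (3.9)$$
for integers $n \geq 1$. Here $\Theta(z|\Gamma)$ is the Riemann-Siegel theta function defined as
$$\Theta(z|\Gamma) \equiv \sum_{m \in \mathbb{Z}^{n-1}} \exp\left[ i\pi\, m^T \Gamma\, m + 2\pi i\, m^T z \right],$$
and $\Gamma$ is an $(n-1) \times (n-1)$ matrix with elements
$$\Gamma_{rs} = \frac{2i}{n} \sum_{k=1}^{n-1} \sin\left(\frac{\pi k}{n}\right) \beta_{k/n} \cos\left[\frac{2\pi k}{n}(r - s)\right], \qquad \beta_y = \frac{{}_2F_1(y, 1-y; 1; 1-x)}{{}_2F_1(y, 1-y; 1; x)},$$
where ${}_2F_1$ is the hypergeometric function. A property of this example is that (3.9) is manifestly invariant under $\eta \leftrightarrow 1/\eta$. The analytic continuation towards the von Neumann entropy is not known, making it impossible to study this example directly with supervised learning. Although the Taylor series of the generating function guarantees convergence towards the true von Neumann entropy for sufficiently large values of $k$ in the partial sum, evaluating the higher-dimensional Riemann-Siegel theta function becomes increasingly difficult. For efforts in this direction, see [58,59]. We will revisit this example in the next section when discussing the sequence model.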
However, there are two limiting cases where analytic perturbative expansions are available, and approximate analytic continuations of the von Neumann entropies can be obtained. The first limit corresponds to small values of the cross-ratio x, where the von Neumann entropy has been computed analytically up to second order in x. The second limit is the decompactification limit, where we take η → ∞. In this limit, there is an approximate expression for the von Neumann entropy.

Two intervals at small cross-ratio
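Let us consider the following expansion of $F_n(x, \eta)$ at small $x$ for some $\eta \neq 1$
$$F_n(x, \eta) = 1 + \left(\frac{x}{4n^2}\right)^{\alpha} s_2(n)\, N + \cdots,$$
where we can look at the first-order contribution with
$$s_2(n) \equiv \frac{n}{2} \sum_{j=1}^{n-1} \frac{1}{[\sin(\pi j/n)]^{2\alpha}}. \qquad (3.14)$$
The coefficient $\alpha$ for a free boson is given by $\alpha = \min[\eta, 1/\eta]$. $N$ is the multiplicity of the lowest-dimension operators, where for a free boson we have $N = 2$. Up to this order, the analytic von Neumann entropy is given by
$$S(\rho_{AB}) = \frac{1}{3} \ln\left( \frac{x_{12}\, x_{34}\, x_{14}\, x_{23}}{\epsilon^2\, x_{13}\, x_{24}} \right) - N \left(\frac{x}{4}\right)^{\alpha} \frac{\sqrt{\pi}\, \Gamma(\alpha + 1)}{4\, \Gamma\!\left(\alpha + \frac{3}{2}\right)} + \cdots, \qquad (3.15)$$
where $\Gamma(\cdot)$ here denotes the gamma function. We can set up the numerics by taking $|x_{12}| = |x_{34}| = r$ and the distance between the centers of $A$ and $B$ to be $L$. The cross-ratio is then simply
$$x = \frac{r^2}{L^2}.$$
Similarly, we can express
$$\frac{x_{12}\, x_{34}\, x_{14}\, x_{23}}{x_{13}\, x_{24}} = x(1 - x) L^2.$$
This allows us to express everything in terms of $x$ and $L$.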
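The geometric setup can be checked directly. The sketch below assumes the standard cross-ratio $x = x_{12} x_{34} / (x_{13} x_{24})$ for two intervals of equal length $r$ whose centers are a distance $L$ apart; the helper names and the choice of origin are made for illustration.

```python
def interval_endpoints(r, L, x1=0.0):
    """Endpoints of A = [x1, x2] and B = [x3, x4]: equal lengths r, centers L apart."""
    return x1, x1 + r, x1 + L, x1 + L + r

def cross_ratio(x1, x2, x3, x4):
    """x = x12 * x34 / (x13 * x24) with xij = xi - xj (signs cancel in the ratio)."""
    return (x2 - x1) * (x4 - x3) / ((x3 - x1) * (x4 - x2))

def prefactor(x1, x2, x3, x4):
    """x12 * x34 * x14 * x23 / (x13 * x24), the combination inside the logarithm."""
    return (x2 - x1) * (x4 - x3) * (x4 - x1) * (x3 - x2) / ((x3 - x1) * (x4 - x2))

def length_from_cross_ratio(x, L):
    """Invert x = r^2 / L^2 for the interval length r."""
    return L * x ** 0.5

L = 14.0
x_target = 0.05
r = length_from_cross_ratio(x_target, L)
points = interval_endpoints(r, L)
x_check = cross_ratio(*points)
pref_check = prefactor(*points)
```

Scanning $x$ at fixed $L$ therefore amounts to scanning the interval length $r = L\sqrt{x}$, and the non-universal prefactor follows as $x(1-x)L^2$.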
For the datasets, we fixed $L = 14$, $\alpha = 0.5$, and $\epsilon^2 = 0.1$. We generated 10000 sets of data for the train-validation-test split from $x = 0.05$ to 0.1, with an increment of $\Delta x = 5 \times 10^{-6}$ between each step, keeping terms up to $k = 50$ in $G(w; \rho_A)$. To further validate our model, we generated 10000 additional test datasets for the following physical parameters: $x = 0.1$ to 0.15 with $\Delta x = 5 \times 10^{-6}$. A density plot of the data with respect to the von Neumann entropy is shown in Figure 9. We refer to Figure 10 and Figure 11 for a clear demonstration of the learning outcomes.
The study up to second order in $x$ using the generating function method is available in [8], as well as through the use of holographic methods [60]. Additionally, an analytic continuation toward the von Neumann entropy up to second order in $x$ for general CFT$_2$ can be found in [61]. Although this is a subleading correction, it can also be approached using our method.
Figure 11: ... (3.15) for the two test datasets. We also include the approximate entropy obtained by summing $k = 50$ terms in the generating function.

Two intervals in the decompactification limit
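There is a different limit, other than the small cross-ratio, in which approximate analytic Rényi entropies can be obtained. This is called the decompactification limit, where we take $\eta \to \infty$. For each fixed value of $x$ we then have
$$F_n(x, \eta) = \left[ \frac{\eta^{\,n-1}}{\prod_{k=1}^{n-1} F_{k/n}(x)\, F_{k/n}(1 - x)} \right]^{\frac{1}{2}}, \qquad F_y(x) \equiv {}_2F_1(y, 1-y; 1; x), \qquad (3.17)$$
where ${}_2F_1$ is the hypergeometric function. Since the exact $F_n(x, \eta)$ is invariant under $\eta \leftrightarrow 1/\eta$, we will instead use the result with $\eta \ll 1$
$$F_n(x, \eta) = \left[ \frac{\eta^{-(n-1)}}{\prod_{k=1}^{n-1} F_{k/n}(x)\, F_{k/n}(1 - x)} \right]^{\frac{1}{2}}.$$
In this case, the exact analytic continuation of the von Neumann entropy is not known, but there is an approximate result following the expansion
$$S(\rho_{AB}) \simeq S^W(\rho_{AB}) + \frac{1}{2} \ln \eta - \frac{D_1'(x) + D_1'(1 - x)}{2} + \cdots, \qquad (3.19)$$
with $S^W(\rho_{AB})$ being the von Neumann entropy computed from the Rényi entropies without the special function $F_n(x, \eta)$ in (3.8). Note that $D_1'(x)$ denotes the analytic continuation of $-\partial_n \sum_{k=1}^{n-1} \ln F_{k/n}(x)$ to $n = 1$. This approximate von Neumann entropy has been well tested in previous studies [5,8], and we will adopt it as the target values in our deep learning models.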
For the datasets, we fixed $L = 14$, $x = 0.5$ and $\epsilon^2 = 0.1$. We generated 10000 sets of data for the train-validation-test split from $\eta = 0.1$ to 0.2, with an increment of $\Delta\eta = 10^{-5}$ between each step, keeping terms up to $k = 50$. To further validate our model, we generated 10000 additional test datasets for the following physical parameters: $\eta = 0.2$ to 0.3 with $\Delta\eta = 10^{-5}$. A density plot of the data with respect to the von Neumann entropy is shown in Figure 12. We again refer to Figure 13 and Figure 14 for a clear demonstration of the learning outcomes.
Figure 14: ... (3.19) for the two test datasets. We also include the approximate entropy obtained by summing $k = 50$ terms in the generating function.
We have seen that deep neural networks, when treated as supervised learning, can achieve accurate predictions for the von Neumann entropy that extends outside the parameter regime in the training phase. However, the potential for deep neural networks may go beyond this.
As we know, the analytic continuation must be worked out on a case-by-case basis (see the examples in [4][5][6][7]) and may even depend on the method we use [8]. Finding general patterns in the analytic continuation is still an open question. Although it remains ambitious, the non-linear mapping that the neural networks uncover would allow us to investigate the expressive power of deep neural networks for the analytic continuation problem of the von Neumann entropy.
Our approach also opens up the possibility of using deep neural networks to study cases where analytic continuations are unknown, such as the general two-interval case. Furthermore, it may enable us to investigate other entanglement measures that follow similar patterns or require analytic continuations. We leave these questions as future tasks.

Rényi entropies as sequential deep learning
In this section, we focus on higher Rényi entropies using sequential learning models. Studying higher Rényi entropies that depend on Tr ρ n A is equivalent to studying the higher-order terms in the Taylor series representation of the generating function (2.9). There are a few major motivations. Firstly, although the generating function can be used to compute higher-order terms, it becomes inefficient for more complex examples. Additionally, evaluating Tr ρ n A in (3.8) for the general two-interval case involves the Riemann-Siegel theta function, which poses a challenge in computing higher Rényi entropies [8,58,59]. On the other hand, all higher Rényi entropies should be considered independent and cannot be obtained in a linear fashion. They can all be used to predict the von Neumann entropy, but in the Taylor series expansion (2.9), knowing higher Rényi entropies is equivalent to knowing a more accurate von Neumann entropy. As we cannot simply extrapolate the series, using a sequential learning approach is a statistically robust way to identify underlying patterns.
Recurrent neural networks (RNNs) are a powerful type of neural network for processing sequences due to their "memory" property [62]. RNNs use internal loops to iterate through sequence elements while keeping a state that contains information about what has been observed so far. This property allows RNNs to identify patterns in a sequence regardless of their position in the sequence. To train an RNN, we initialize an arbitrary state and encode a rank-2 tensor of size (steps, input features), looping over multiple steps. At each step, the networks consider the current state at k with the input, and combine them to obtain the output at k + 1, which becomes the state for the next iteration.
RNNs incorporate both feedforward networks and back-propagation through time (BPTT) [63,64], with "time" representing the steps k in our case. The networks connect the outputs from a fully connected layer to the inputs of the same layer, referred to as the hidden states. These inputs receive the output values from the previous step, with the number of inputs to a neuron determined by both the number of inputs to the layer and the number of neurons in the layer itself, known as recurrent connections. Computing the output involves iteratively feeding the input vector from one step, computing the hidden states, and presenting the input vector for the next step to compute the new hidden states.
RNNs are useful for making predictions based on sequential data, or "sequential regression," as they learn patterns from past steps to predict the most probable values for the next step.
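The recurrence described above can be sketched in a few lines of NumPy. This is only an illustration of the state update, not the TensorFlow-Keras implementation used in this work: the function name `simple_rnn_forward`, the weight names `W_x`, `W_h`, `b`, and the random inputs are all assumptions made for the example, and the ReLU activation follows the choice stated later for our models.

```python
import numpy as np

def simple_rnn_forward(inputs, W_x, W_h, b, h0=None):
    """Run a single-layer ReLU RNN over `inputs` of shape (steps, features).

    Returns the hidden state after every step, shape (steps, units).
    """
    steps = inputs.shape[0]
    units = b.shape[0]
    h = np.zeros(units) if h0 is None else h0
    states = np.empty((steps, units))
    for k in range(steps):
        # Combine the current input with the state carried over from the previous step.
        h = np.maximum(0.0, inputs[k] @ W_x + h @ W_h + b)
        states[k] = h
    return states

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
steps, features, units = 5, 1, 4
x = rng.normal(size=(steps, features))
W_x = rng.normal(size=(features, units))
W_h = rng.normal(size=(units, units))
b = np.zeros(units)

states = simple_rnn_forward(x, W_x, W_h, b)
print(states.shape)  # (5, 4)
```

The recurrent matrix `W_h` is what distinguishes this from a dense layer applied step by step: setting it to zero removes the memory and each output then depends only on the current input.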

Model architectures and training strategies
In this subsection, we discuss the methodology of treating the Rényi entropies (the Taylor series of the generating function) as sequence models.

Data preparation
To simulate the scenario where k max in the series cannot be efficiently computed, we generate N = 10000 datasets for different physical parameters, with each dataset having a maximum of k max = 50 steps in the series. We also shuffle the N datasets since samples of close physical parameters will have most of their values in common. Among the N datasets, we only take a fraction p < N for the train-validation-test split. The other fraction q = N − p will all be used as test data for the trained model. This serves as a critical examination of the sequence models we find. The ideal scenario is that we only need small p datasets while achieving accurate performance for the q datasets.
Because the number of steps available is rather small, we adopt the SimpleRNN structure in TensorFlow-Keras$^{2}$ instead of more complicated ones such as LSTM or GRU networks [66,67].
We also need to be careful about the train-validation-test splitting process. In this type of problem, it is important to use validation and test data that is more recent than the training data. This is because the objective is to predict the next value given the past steps, and the data splitting should reflect this fact. Furthermore, by giving more weight to recent data, it is possible to mitigate the vanishing gradient (memory loss) problem that can occur early in the BPTT. In this work, the first 60% of the steps (k = 1 ∼ 30) are used for training, the middle 20% (k = 31 ∼ 40) for validation, and the last 20% (k = 41 ∼ 50) for testing.
We split the datasets in the following way: for a single dataset, at each step we use a fixed number of past steps,$^{3}$ specified by $\ell$, to predict the next value. This creates (steps − $\ell$) sequences from each dataset, resulting in a total of (steps − $\ell$) × p sequences for the p datasets in the train-validation-test splitting. Using a fixed sequence length allows the network to focus on the most relevant and recent information for predicting the next value, while also simplifying the input size and making it more compatible with our network architectures. We take p = 1000, q = 9000, and $\ell$ = 5. An illustration of our data preparation strategy is shown in Figure 15.
Figure 15: Data preparation process for the sequential models. A total of N datasets are separated into two parts: the p datasets are for the initial train-validation-test split, while the q datasets are treated purely as test datasets. The zoomed-in figure on the right-hand side illustrates how a single example sequence is generated, where we have used a fixed number of past steps $\ell$ = 5. Note that for the additional q test datasets, a total of (steps − $\ell$) × q = 405000 sequences are generated.
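The windowing can be made concrete with a short sketch. The helper names `make_windows` and `split_by_target_step` and the placeholder series are hypothetical; in particular, assigning each window to a split according to the step of its target is an assumption about how the chronological 60/20/20 split is applied to windows, not a detail stated in the text.

```python
def make_windows(series, window):
    """Return (inputs, targets): each input holds `window` consecutive terms,
    and the target is the term that immediately follows."""
    count = len(series) - window
    inputs = [series[i:i + window] for i in range(count)]
    targets = [series[i + window] for i in range(count)]
    return inputs, targets

def split_by_target_step(inputs, targets, window, train_end=30, val_end=40):
    """Assign each window to a split by the 1-based step k of its target (assumed rule)."""
    splits = {"train": [], "val": [], "test": []}
    for i, (x, y) in enumerate(zip(inputs, targets)):
        k = i + window + 1  # 1-based step index of the target value
        name = "train" if k <= train_end else "val" if k <= val_end else "test"
        splits[name].append((x, y))
    return splits

steps, window, q = 50, 5, 9000
series = [1.0 / k for k in range(1, steps + 1)]  # placeholder values, not CFT data
inputs, targets = make_windows(series, window)
splits = split_by_target_step(inputs, targets, window)

print(len(inputs), len(inputs) * q)  # 45 405000
```

With 50 steps and a window of 5 past steps, each dataset yields 45 sequences, which reproduces the count of 405000 sequences quoted for the q = 9000 additional test datasets.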

Model design
After the pre-processing of data, we turn to the model design. Throughout the section, we use the ReLU activation function and Adam optimizer with MSE as the loss function.
In KerasTuner, we employ Bayesian optimization by adjusting a few crucial hyperparameters and designs. We summarize them in the following list:
• We introduce one or two SimpleRNN layers, with or without recurrent dropouts. The units of the first layer range from 64 to 256 with a step size of 16. If a second layer is used, the units range from 32 to 128 with a step size of 8. Recurrent dropout is applied with a dropout rate in the range of 0.1 to 0.3 using log sampling.
• We take LayerNormalization as a Boolean choice to enhance the training stability, even with shallow networks. The LayerNormalization is added after the SimpleRNN layer if there is only one layer, and in between the two layers if there are two SimpleRNN layers.
• We allow a Dense layer with units ranging from 16 to 32 and a step size of 8 as an optional regressor after the recurrent layers.
• A final dropout with log sampling of a dropout rate in the range of 0.2 to 0.5 is added as a Boolean choice.
• In the Adam optimizer, we only adjust the learning rate with log sampling from the range of $10^{-5}$ to $10^{-4}$. All other parameters are taken as the default values in TensorFlow-Keras. We take the AMSGrad [54] variant of this algorithm as a Boolean choice.
The KerasTuner is deployed for 300 trials with 2 executions per trial. During the process, we monitor the validation loss using EarlyStopping of patience 8. Once the best set of hyperparameters and model architecture are identified based on the validation data, we initialize a new model with the same design, but with both the training and validation data. This new model is trained 30 times while monitoring the training loss using EarlyStopping of patience 10. The final predictions are obtained by averaging the results of the few cases with close yet overall smallest relative errors from the targets. The purpose of taking the average instead of picking the case with minimum loss is to smooth out possible outliers. We set the batch size in both the KerasTuner and the final training to be 2048.
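The averaging step at the end of this procedure can be sketched as follows. The function name `average_best_runs`, the selection rule (keep a fixed number of runs with the smallest mean relative error), and the toy numbers are assumptions for illustration; the text only specifies that a few runs with close and overall smallest relative errors are averaged.

```python
import numpy as np

def average_best_runs(predictions, targets, keep=3):
    """Average the `keep` runs with the smallest mean relative error.

    predictions: array-like of shape (runs, samples); targets: shape (samples,).
    Returns the averaged prediction and the indices of the selected runs.
    """
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    rel_err = np.abs(predictions - targets) / np.abs(targets)
    selected = np.argsort(rel_err.mean(axis=1))[:keep]
    return predictions[selected].mean(axis=0), selected

# Toy example: four hypothetical training runs, one of which is an outlier.
targets = [1.0, 2.0]
runs = [[1.01, 2.02], [0.99, 1.98], [1.50, 3.00], [1.02, 2.04]]
averaged, selected = average_best_runs(runs, targets, keep=2)
print(averaged, sorted(selected.tolist()))
```

Averaging the two best runs here cancels their opposite-sign errors, while the outlier run never enters the average; this is the smoothing effect the procedure is meant to provide.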
We will also use the trained model to make predictions on the q test data and compare them with the correct values as validation for hitting the benchmark.

Examples of the sequential models
The proposed approach will be demonstrated using two examples. The first example is a simple representative case of a single interval (3.1), while the second is a more challenging case of two intervals in the decompactification limit (3.19), where the higher-order terms in the generating function cannot be efficiently computed. Additionally, we will briefly comment on the most non-trivial example of the general two-interval case.

Single interval
In this example, we have used the same N datasets for the single interval as in Sec. 3.2. Following the data splitting strategy we just outlined, it is worth noting that the ratio of training data to the overall dataset is relatively small. We have plotted the losses of the three best-performing models, as well as the density plot of relative errors for the two test datasets in Figure 16. Surprisingly, even with a small ratio of training data, we were able to achieve small relative errors on the additional test datasets.
Figure 16: The density plots as a function of relative errors for the two test datasets. The relative errors for the p test datasets are concentrated at around 1%, while for the additional q test datasets, they are concentrated at around 2.5% with a very small ratio of outliers.

Two intervals in the decompactification limit
Again, we have used the same N datasets for the two intervals in the η → ∞ limit as in Sec. 3.3. In Figure 17, we have plotted the losses of the four best-performing models and the density plot of relative errors for the two test datasets. In this example, the KerasTuner identified a relatively small learning rate, which led us to truncate the training at a maximum of 1500 epochs since we had achieved the required accuracy. In this case, the predictions are of high accuracy, essentially without outliers.
Figure 17: Top: The loss function for the best 4 models as functions of epochs. We monitor the loss function with EarlyStopping. Bottom: The density plot as a function of relative errors for the two test datasets. The relative errors for the p test datasets are well within 1.5%, while for the additional q test datasets, they are well within 2%.
Let us briefly address the most challenging example discussed in this paper, which is the general two-interval case (3.8) where the analytic expression for the von Neumann entropy is not available. In this example, only Tr ρ n A is known, and since it involves the Riemann-Siegel theta function, computing the generating function for large k in the partial sum becomes almost infeasible. Therefore, the sequential learning models we have introduced represent the most viable approach for extracting useful information in this case.
Since only k max ≈ 10 can be efficiently computed from the generating function in this case, we have much shorter sequences for the sequential learning models. We have tested the above procedure with N = 10000 datasets and k max = 10; however, we could only achieve an average relative error of 5%. Improvements may come from a larger dataset with a longer training time, which we leave as a future task.
In general, sequential learning models offer a potential solution for efficiently computing higher-order terms in the generating function. To extend our approach to longer sequences beyond the k max steps, we can treat the problem as self-supervised learning. However, this may require a more delicate model design to prevent error propagation. Nonetheless, exploring longer sequences can provide a more comprehensive understanding of the behavior of von Neumann entropy and its relation to Rényi entropies.

Quantum neural networks and von Neumann entropy
In this section, we explore a similar supervised learning task by treating quantum circuits as models that map data inputs to predictions, which allows us to examine the expressive power of quantum circuits as function approximators.

Fourier series from variational quantum machine learning models
We will focus on a specific function class that a quantum neural network can explicitly realize, namely a simple Fourier-type sum [29,30]. Before linking it to the von Neumann entropy, we shall first give an overview of the seminal works in [30].
Consider a general Fourier-type sum of the following form
$$ f_{\theta_i}(x) = \sum_{\omega \in \Omega} c_\omega(\theta_i)\, e^{i\omega x}, $$
with the frequency spectrum specified by $\Omega \subset \mathbb{R}^N$. Note that $c_\omega(\theta_i)$ are the (complex) Fourier coefficients. We need to come up with a quantum model that can learn the characteristics of the sum through the model's control over the frequency spectrum and the Fourier coefficients. Now we define the quantum machine learning model as the following expectation value
$$ f_{\theta_i}(x) = \langle 0 |\, U^\dagger(x, \theta_i)\, M\, U(x, \theta_i)\, | 0 \rangle, $$
where $|0\rangle$ is taken to be some initial state of the quantum computer and M is the physical observable. Note that we have omitted writing the vector symbol and the hat on the operator, which should be clear from the context. The crucial component is $U(x, \theta_i)$, which is a quantum circuit that depends on the data input x and the trainable parameters $\theta_i$ with L layers. Each layer has a data-encoding circuit block S(x) and a trainable circuit block $W(\theta_i)$. Schematically, it has the form
$$ U(x, \theta_i) = W^{(L+1)}(\theta_i)\, S(x)\, W^{(L)}(\theta_i) \cdots W^{(2)}(\theta_i)\, S(x)\, W^{(1)}(\theta_i), $$
where we refer to Figure 18 for a clear illustration.
Figure 18: Quantum neural networks with repeated data-encoding circuit blocks S(x) (whose gates are of the form $g(x) = e^{-ixH}$) and trainable circuit blocks $W^{(i)}$. The data-encoding circuit blocks determine the available frequency spectrum for ω, while the remainder determines the Fourier coefficients $c_\omega$.
Let us discuss the three major components of the quantum circuit in the following:
• The repeated data-encoding circuit block S(x) prepares an initial state that encodes the (one-dimensional) input data x and is not trainable due to the absence of free parameters. It is represented by certain gates that embed classical data into quantum states, with gates of the form $g(x) = e^{-ixH}$, where H is the encoding Hamiltonian, which can be any Hermitian operator. In this work, we use the Pauli X-rotation gate, and the encoding Hamiltonians in S(x) will determine the available frequency spectrum Ω.
• The trainable circuit block $W(\theta_i)$ is parametrized by a set of free parameters $\theta_i = (\theta_1, \theta_2, ...)$. There is no special assumption made here and we can take these trainable blocks as arbitrary unitary operations. The trainable parameters will contribute to the coefficients $c_\omega$.
• The final piece is the measurement of a physical observable M at the output. This observable is general; it could be local for each wire or a subset of wires in the circuit.
Our goal is to establish that f(x) can be written as a partial Fourier series [29,30]
$$ f_{\theta_i}(x) = \sum_{n \in \Omega} c_n(\theta_i)\, e^{i n x}. $$
Note that here, for simplicity, we have taken the frequencies to be integers, $\Omega \subset \mathbb{Z}^N$. The training process goes as follows: we sample a quantum model with $U(x, \theta_i)$, and then define the mean square error as the loss function. To optimize the loss function, we need to tune the free parameters $\theta = (\theta_1, \theta_2, ...)$. The optimization is performed by a classical optimization algorithm that queries the quantum device, where we can treat the quantum process as a black box and only examine the classical data input and the measurement output. The output of the quantum model is the expectation value of a Pauli-Z measurement.
We use the single-qubit Pauli rotation gate as the encoding g(x) [30]. The frequency spectrum Ω is determined by the encoding Hamiltonians. Two scenarios can be considered to determine the available frequencies: the data reuploading [68] and the parallel encodings [69] models. In the former, we apply a Pauli rotation gate r times in sequence, acting on the same qubit with multiple layers r = L; in the latter, we perform similar operations in parallel on r different qubits, but with a single layer L = 1. These models allow quantum circuits to access increasingly rich frequencies, where Ω = {−r, ..., −1, 0, 1, ..., r} is a spectrum of integer-valued frequencies up to degree r. This corresponds to the maximum degree of the partial Fourier series we want to compute.
From the discussion above, one can immediately derive the maximum accessible frequencies of such quantum models [30]. In practice, however, if the degree of the target function is greater than the number of layers (for example, in the single-qubit case), the fit will be much less accurate.$^{4}$ Increasing the value of L typically requires more training epochs to converge at the same learning rate. This is relevant to a more difficult question of how to control the Fourier coefficients in the training process, given that all the blocks $W^{(i)}(\theta_i)$ and the measurement observable contribute to "every" Fourier coefficient. However, these coefficients are functions of the quantum circuit with limited degrees of freedom. This means that a quantum circuit with a certain structure can only realize a subset of all possible Fourier coefficients, even with enough degrees of freedom. While a systematic understanding is not yet available, a simulation exploring which Fourier coefficients can be realized can be found in [30]. In fact, it remains an open question whether, for asymptotically large L, a single-qubit model can approximate any function by constructing arbitrary Fourier coefficients.
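The statement about the accessible frequencies can be checked numerically with a small state-vector simulation. The following is a minimal sketch under stated assumptions: it uses plain NumPy rather than PennyLane, the trainable blocks are replaced by random single-qubit unitaries, and the helper names (`rx`, `random_unitary`, `serial_model`, `fourier_coefficients`) are hypothetical. It simulates the serial (data reuploading) model with r encodings and extracts the Fourier coefficients of the Pauli-Z expectation value with a discrete Fourier transform.

```python
import numpy as np

def rx(x):
    """Pauli X-rotation exp(-i x X / 2), used as the data-encoding gate."""
    c, s = np.cos(x / 2), np.sin(x / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def random_unitary(rng):
    """Random 2x2 unitary standing in for a trainable block W (assumption)."""
    z = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
    q, _ = np.linalg.qr(z)
    return q

def serial_model(x, blocks):
    """<0| U^dagger Z U |0> with U = W_(r+1) RX(x) W_r ... RX(x) W_1."""
    state = blocks[0] @ np.array([1.0, 0.0], dtype=complex)
    for W in blocks[1:]:
        state = W @ (rx(x) @ state)
    pauli_z = np.diag([1.0, -1.0])
    return float(np.real(np.vdot(state, pauli_z @ state)))

def fourier_coefficients(r, rng, n_samples=33):
    """Sample the model on one period and return {frequency: coefficient}."""
    blocks = [random_unitary(rng) for _ in range(r + 1)]
    xs = 2 * np.pi * np.arange(n_samples) / n_samples
    values = np.array([serial_model(x, blocks) for x in xs])
    coeffs = np.fft.fft(values) / n_samples
    freqs = np.fft.fftfreq(n_samples, d=1.0 / n_samples).round().astype(int)
    return dict(zip(freqs.tolist(), coeffs))

r = 3
spectrum = fourier_coefficients(r, np.random.default_rng(1))
leak = max(abs(c) for w, c in spectrum.items() if abs(w) > r)
print("largest |c_w| beyond degree r:", leak)
```

For any choice of the random blocks, the coefficients with |ω| > r vanish up to floating-point error, while which of the coefficients with |ω| ≤ r are sizeable depends on the blocks. This is the sense in which the encoding fixes the spectrum and the trainable part fixes the coefficients.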

The generating function as a Fourier series
Given the framework of the quantum model and its relation to a partial Fourier series, a natural question arises as to whether the entanglement entropy can be realized within this setup. To approach this question, it is useful to revisit the generating function for the von Neumann entropy as a manifest Taylor series. The goal is to rewrite the generating function in terms of a partial Fourier series, which would allow us to determine whether the von Neumann and Rényi entropies belong to the function classes that the quantum neural network can describe. Note that we will only focus on small-scale tests with a low depth or width of the circuit, as the depth or width of the circuit corresponds exactly to the orders that can be approximated in the Fourier series.
However, we cannot simply convert either the original generating function or its Taylor series form to a Fourier series. Doing so will generally involve special functions of ρ A , which we are unable to express in terms of Tr ρ n A . Therefore, it is essential to have an expression of the Fourier series that allows us to compute the corresponding Fourier coefficients at different orders using Tr ρ n A , for which we know the analytic form from CFTs. This can indeed be achieved; see Appendix A for a detailed derivation. The Fourier series representation of the generating function on an interval [w 1 , w 2 ] with period T = w 2 − w 1 is given by (5.6), where C cos and C sin are special functions defined in terms of the generalized hypergeometric function p F q . The zeroth-order Fourier coefficient is given by a similar expression. Summing to m = 10 suffices for our purpose, while the summation in n corresponds to the degree of the Fourier series. The complex-valued Fourier coefficients c n to be used in our simulation can be easily reconstructed from this expression.
Therefore, the only required input for evaluating the Fourier series is f(m), with Tr ρ k+1 A explicitly given. This is exactly what we anticipated and allows for a straightforward comparison with the Taylor series form.
Note that the interval for the Fourier series is not arbitrary. We take the interval [w 1 , w 2 ] to be [−1, 1], which is the maximum interval on which the Fourier series (5.6) is convergent. Furthermore, we expect that as w → 1 in (5.6), we arrive at the von Neumann entropy. However, as we can see in Figure 19, there is a rapid oscillation near the end points of the interval for the Fourier series. Such oscillation near a jump discontinuity is a generic feature of approximating discontinuous or non-periodic functions with a Fourier series, known as the Gibbs phenomenon. This phenomenon poses a serious problem in recovering accurate values of the von Neumann entropy because we are taking the limit to the boundary point w → 1. We will return to this issue in Section 5.4.
Figure 19: Gibbs phenomenon for the Fourier series near the end point for w → 1. We take the single interval example, where the yellow curve represents the generating function as a Taylor series, and the blue curve is the Fourier series approximation of the generating function.
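The Gibbs phenomenon can be reproduced with a simple stand-in function. The sketch below does not use the generating function itself; as an assumption for illustration it takes f(x) = x on [−1, 1], which is analytic but non-periodic, and whose Fourier coefficients are known in closed form. The function name `sawtooth_partial_sum` and the grid size are likewise illustrative choices.

```python
import numpy as np

def sawtooth_partial_sum(x, N):
    """Degree-N partial Fourier sum of f(x) = x on [-1, 1] (period 2)."""
    k = np.arange(1, N + 1)
    coeffs = 2.0 * (-1.0) ** (k + 1) / (np.pi * k)
    return np.sin(np.pi * np.outer(x, k)) @ coeffs

x = np.linspace(-1.0, 1.0, 8001)
results = {}
for N in (32, 64, 128):
    s = sawtooth_partial_sum(x, N)
    # Maximum of the partial sum (f never exceeds 1) and the error at x = 1.
    results[N] = (float(s.max()), float(abs(s[-1] - 1.0)))
    print(N, results[N])
```

Increasing N moves the overshoot closer to the boundary but does not shrink it, and the value of the partial sum at x = 1 stays at the midpoint of the jump rather than at f(1) = 1. This is the same obstruction that affects the limit w → 1 of the generating function.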

The expressivity of the quantum models on the entanglement entropy
In this subsection, we will demonstrate the expressivity of the quantum models of the partial Fourier series with examples from CFTs. We will focus on two specific examples: a single interval and two intervals at small cross-ratio x. While these examples suffice for our purpose, it is worth noting that once the Fourier series representation is derived using the expression in (5.6), all examples with a known analytic form of Tr ρ n A can be studied.
The demonstration is performed using PennyLane [71]. We have adopted the Adam optimizer with a learning rate of 0.005 and a batch size of 100, with MSE as the loss function. Note that we have chosen a smaller learning rate compared to [30] and monitor the training with EarlyStopping. For the two examples we study, we have considered both the serial (data reuploading) and parallel (parallel encodings) models for the training. Note that in the parallel model, we have used the StronglyEntanglingLayers template in PennyLane, which itself consists of 3 user-defined layers. In each case, we start by randomly initializing a quantum model and use 300 sample points to fit the target function, whose complex-valued Fourier coefficients are calculated from the real coefficients in (5.6). We have chosen k = 4 with prescribed physical parameters in the single- and two-interval examples. Therefore, we will need r in the serial and parallel models to be larger than k = 4. We have executed multiple trials for each case, and we include the most successful results, with maximum relative errors kept within 3%, in Figures 20∼23.
As observed from Figures 20∼23, a rescaling of the data is necessary to achieve precise matching between the quantum models and the Fourier spectrum of our examples. This rescaling is possible because the global phase is unobservable [30], which introduces an ambiguity in the data-encoding. Consider our quantum model in the case of a single qubit with L = 1, for which
$$ U(x) = W^{(2)}\, g(x)\, W^{(1)}. \tag{5.14} $$
Note that the frequency spectrum Ω is determined by the eigenvalues of the data-encoding Hamiltonians. The operator H has two eigenvalues $(\lambda_1, \lambda_2)$, but we can rescale the energy spectrum to $(-\gamma, \gamma)$ as the global phase is unobservable (e.g. for Pauli rotations, we have $\gamma = \frac{1}{2}$).
We can absorb γ from the eigenvalues of H into the data input by re-scaling $x \to \gamma x$. We should emphasize that we are not re-scaling the original target data, but instead, we are re-scaling how the data is encoded. Effectively, we are re-scaling the frequency of the quantum model itself. The intriguing part is that the global phase shift of the operator acting on a quantum state cannot be observed, yet it affects the expressive power of the quantum model. This can be understood as a pre-processing of the data, which is argued to extend the function classes that the quantum model can represent [30].
This suggests that one may consider treating the re-scaling parameter γ as a trainable parameter [68]. This would turn the scaling into an adaptive "frequency matching" process, potentially increasing the expressivity of the quantum model. Here we only treat γ as a tunable hyperparameter. The scaling does not need to match with the data, but finding an appropriate scaling parameter is crucial for model training.

Recovering the von Neumann entropy
So far, we have managed to rewrite the generating function into a partial Fourier series $f_N(w)$ of degree N, defined on the interval w ∈ [−1, 1]. By leveraging variational quantum circuits, we have been able to reproduce the Fourier coefficients of the series accurately. In principle, with appropriate data-encoding and re-scaling strategies, increasing the depth or width of the quantum models would enable us to capture the series to any arbitrary degree N. Thus, the expressivity of the Rényi entropies can be established in terms of quantum models. However, a crucial problem remains: we need to recover the von Neumann entropy in the limit w → 1,
$$ \lim_{w \to 1} G(w; \rho_A) = S(\rho_A), \tag{5.17} $$
where the limiting point lies exactly at the boundary of the interval that we are approximating. As we can see clearly from Figure 24, taking such a limit naïvely gives a very inaccurate value compared to the true von Neumann entropy. This effect does not diminish even when N is increased to achieve a better approximation of the series when compared to its Taylor series form, as shown in Figure 24. This is because the Fourier series approximation is always oscillatory at the endpoints, a general feature known as the Gibbs phenomenon for the Fourier series when approximating discontinuous or non-periodic functions. A priori, a partial Fourier series of a function f(x) is a very accurate way to reconstruct the point values of f(x), as long as f(x) is smooth and periodic. Furthermore, if f(x) is analytic and periodic, then the partial Fourier series $f_N$ converges to f(x) exponentially fast with increasing N. However, $f_N(x)$ is in general not an accurate approximation of f(x) if f(x) is either discontinuous or non-periodic. Not only is the convergence slow, but there is also an overshoot near the boundary of the interval. There are many different ways to understand this phenomenon.
Broadly speaking, the difficulty lies in the fact that we are trying to obtain accurate local information from the global properties of the Fourier coefficients defined via an integral over the interval, which seems to be inherently impossible.
Mathematically, the occurrence of the Gibbs phenomenon can be easily understood in terms of the oscillatory nature of the Dirichlet kernel, which arises when the Fourier series is written as a convolution. Explicitly, for a function on [−π, π], the Fourier partial sum can be written as
$$ f_n(x) = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(\xi)\, D_n(x - \xi)\, d\xi, $$
where the Dirichlet kernel $D_n(x)$ is given by
$$ D_n(x) = \sum_{k=-n}^{n} e^{ikx} = \frac{\sin\big((n + \tfrac{1}{2})x\big)}{\sin(x/2)}. \tag{5.19} $$
This function oscillates between positive and negative values, and this behavior is responsible for the appearance of the Gibbs phenomenon near the jump discontinuities of the Fourier series at the boundary. Our problem can therefore be framed as follows: given the 2N + 1 Fourier coefficients $\hat f_k$ of our generating function (5.6) for −N ≤ k ≤ N, with the generating function defined on the interval w ∈ [−1, 1], we need to reconstruct the point value of the function in the limit w → 1. The point value of the generating function in this limit corresponds exactly to the von Neumann entropy. In particular, we need the reconstruction to converge exponentially fast with N to the correct point value of the generating function, that is
$$ \lim_{w \to 1} |G(w; \rho_A) - f_N(w)| \le e^{-\alpha N}, \qquad \alpha > 0. \tag{5.20} $$
This is required for a realistic application of the quantum model, where currently the degree N we can approximate for the partial Fourier series is limited by the depth or the width of the quantum circuits. We are in need of an operation that can diminish the oscillations or, even better, remove them completely. Several filtering methods have been developed to ameliorate the oscillations, including the non-negative and decaying Fejér kernel, which smooths out the Fourier series over the entire interval, and the Lanczos σ factor, which locally reduces the oscillations near the boundary. For a comprehensive discussion of the Gibbs phenomenon and these filtering methods, see [72]. However, we emphasize that none of these methods is satisfactory, as they still cannot recover accurate point values of the function f(x) near the boundary.
Therefore, we need a more effective method to remove the Gibbs phenomenon completely. Here we adopt a powerful method that re-expands the partial Fourier series in a basis of Gegenbauer polynomials.$^{5}$ This method was developed in the 1990s in a series of seminal works [74][75][76][77][78][79]; we also refer to [80,81] for more recent reviews.
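The limitation of the filtering methods can be illustrated with a stand-in function. As an assumption for illustration, the sketch below uses f(x) = x on [−1, 1] rather than the generating function, with the Fejér weights 1 − k/(N+1) and one common convention for the Lanczos σ factor, sinc(k/N); the helper names are hypothetical.

```python
import numpy as np

def filtered_sum(x, N, sigma):
    """Degree-N Fourier sum of f(x) = x on [-1, 1], with each mode weighted by sigma(k, N)."""
    k = np.arange(1, N + 1)
    coeffs = 2.0 * (-1.0) ** (k + 1) / (np.pi * k)
    return np.sin(np.pi * np.outer(x, k)) @ (sigma(k, N) * coeffs)

def no_filter(k, N):
    return np.ones(len(k))

def fejer(k, N):
    # Weights of the Fejer (Cesaro) mean, corresponding to a non-negative kernel.
    return 1.0 - k / (N + 1.0)

def lanczos(k, N):
    # Lanczos sigma factor; np.sinc(t) = sin(pi t) / (pi t).
    return np.sinc(k / N)

N = 64
x = np.linspace(-1.0, 1.0, 4001)
sums = {name: filtered_sum(x, N, f) for name, f in
        [("raw", no_filter), ("fejer", fejer), ("lanczos", lanczos)]}
for name, s in sums.items():
    print(name, "max:", float(s.max()), "value at x = 1:", float(s[-1]))
```

Both filters tame the overshoot, and the Fejér mean never exceeds the range of f, but the value at the boundary x = 1 is unchanged: every filtered sum still returns the midpoint of the jump instead of f(1) = 1. Filtering alone therefore does not give the boundary value needed for the von Neumann entropy.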
The Gegenbauer expansion method allows for an accurate representation, within exponential accuracy, by only summing a few terms from the Fourier coefficients. Given an analytic and non-periodic function f(x) on the interval [−1, 1] (or a sub-interval [a, b] ⊂ [−1, 1]) with the Fourier coefficients
$$ \hat f_k = \frac{1}{2} \int_{-1}^{1} f(x)\, e^{-i\pi k x}\, dx \tag{5.21} $$
and the partial Fourier series
$$ f_N(x) = \sum_{k=-N}^{N} \hat f_k\, e^{i\pi k x}, $$
the following Gegenbauer expansion represents the original function we want to approximate with the Fourier information:
$$ f(x) \simeq \sum_{n=0}^{M} g^{\lambda}_{n,N}\, C^{\lambda}_n(x), $$
where $g^{\lambda}_{n,N}$ are the Gegenbauer expansion coefficients and $C^{\lambda}_n(x)$ are the Gegenbauer polynomials.$^{6}$ Note that the coefficients $g^{\lambda}_{n,N}$ can be computed from an integral formula in which we only need the Fourier coefficients $\hat f_k$. In fact, the Gegenbauer expansion is a two-parameter family of functions, characterized by λ and M. It has been shown that by setting $\lambda = M = \beta \epsilon N$, where $\epsilon = (b - a)/2$ and $\beta < \frac{2\pi e}{27}$ for the Fourier case, the expansion can achieve exponential accuracy with N. Note that M determines the degrees of the Gegenbauer polynomials, and as such, we should allow the degree of the original Fourier series to grow with M. For a clear demonstration of how the Gegenbauer expansion approaches the generating function from the Fourier data, see Figure 25. We will eventually be able to reconstruct the point value of the von Neumann entropy near w → 1 with increasing order in the expansion. A more precise statement regarding the exponential accuracy can be found in Appendix B. This method is indeed a process of reconstructing local information from global information with exponential accuracy, thereby effectively removing the Gibbs phenomenon.
5 Note that other methods exist based on periodically extending the function to give an accurate representation within the domain of interest, which involves reconstructing the function based on Chebyshev polynomials [73]. However, we do not explore this method in this work.
6 The Gegenbauer expansion coefficients $g^{\lambda}_{n,N}$ are defined with the partial Fourier series $f_N(x)$ as
$$ g^{\lambda}_{n,N} = \frac{1}{h^{\lambda}_n} \int_{-1}^{1} (1 - x^2)^{\lambda - \frac{1}{2}}\, f_N(x)\, C^{\lambda}_n(x)\, dx. $$
For λ ≥ 0, the Gegenbauer polynomial of degree n is defined to satisfy
$$ \int_{-1}^{1} (1 - x^2)^{\lambda - \frac{1}{2}}\, C^{\lambda}_k(x)\, C^{\lambda}_n(x)\, dx = \delta_{kn}\, h^{\lambda}_n, $$
where $h^{\lambda}_n$ is the normalization constant. We refer to Appendix B for a more detailed account of the properties of the Gegenbauer expansion.
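A minimal numerical sketch of the Gegenbauer re-expansion is given below. Several simplifications are assumptions made for illustration and are not the setup used in this work: the target is the stand-in function f(x) = exp(x) on [−1, 1] with closed-form Fourier coefficients, the parameters λ = 2.5 and M = 4 are small fixed values rather than the prescription that grows with N, and the Gegenbauer coefficients are obtained by Gauss-Legendre quadrature of the weighted projection of the partial Fourier sum (which requires λ − 1/2 to be a non-negative integer so that the weight is polynomial). All function names are hypothetical.

```python
import numpy as np

def gegenbauer(x, lam, max_degree):
    """C_n^lam(x) for n = 0..max_degree via the three-term recurrence."""
    x = np.asarray(x, dtype=float)
    polys = [np.ones_like(x), 2.0 * lam * x]
    for n in range(2, max_degree + 1):
        polys.append((2.0 * (n + lam - 1.0) * x * polys[n - 1]
                      - (n + 2.0 * lam - 2.0) * polys[n - 2]) / n)
    return np.array(polys[: max_degree + 1])

def fourier_partial_sum(x, coeffs, N):
    """f_N(x) = sum_{k=-N}^{N} fhat_k exp(i pi k x) on [-1, 1]."""
    k = np.arange(-N, N + 1)
    return np.real(np.exp(1j * np.pi * np.outer(np.atleast_1d(x), k)) @ coeffs)

def gegenbauer_reconstruct(x, coeffs, N, lam, M, n_nodes=200):
    """Project f_N onto Gegenbauer polynomials up to degree M and evaluate at x."""
    nodes, weights = np.polynomial.legendre.leggauss(n_nodes)
    w = (1.0 - nodes**2) ** (lam - 0.5)           # Gegenbauer weight function
    C = gegenbauer(nodes, lam, M)                 # shape (M + 1, n_nodes)
    f_N = fourier_partial_sum(nodes, coeffs, N)
    h = (weights * w * C**2).sum(axis=1)          # normalization constants h_n
    g = (weights * w * C * f_N).sum(axis=1) / h   # coefficients built from f_N only
    return g @ gegenbauer(np.atleast_1d(x), lam, M)

# Stand-in target f(x) = exp(x): fhat_k = (-1)^k sinh(1) / (1 - i pi k).
N, lam, M = 16, 2.5, 4
k = np.arange(-N, N + 1)
coeffs = (-1.0) ** k * np.sinh(1.0) / (1.0 - 1j * np.pi * k)

fourier_at_1 = fourier_partial_sum(1.0, coeffs, N)[0]
gegen_at_1 = gegenbauer_reconstruct(1.0, coeffs, N, lam, M)[0]
print("exact:", np.e, "Fourier:", fourier_at_1, "Gegenbauer:", gegen_at_1)
```

The partial Fourier sum at the boundary sits near the midpoint of the jump of the periodic extension, while the re-expanded series, built from the same 2N + 1 coefficients, lands close to the true boundary value. In an actual application one would let λ and M grow with N as described above, and the quadrature step could be replaced by a closed-form expression for the projection of each Fourier mode.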

Discussion
In this paper, we have considered a novel approach that uses classical and quantum neural networks to study the analytic continuation of the von Neumann entropy from the Rényi entropies. We make the analytic continuation problem amenable to deep learning techniques by rewriting $\text{Tr}\,\rho_A^n$ in the Rényi entropies in terms of a generating function that manifests as a Taylor series (2.9). We show that our deep learning models achieve this goal with a limited number of Rényi entropies.
Instead of using a static model design for the classical neural networks, we adopt KerasTuner to find the optimal model architecture and hyperparameters. There are two supervised learning scenarios: predicting the von Neumann entropy given the knowledge of Rényi entropies using densely connected neural networks, and treating the higher Rényi entropies as a sequence learning problem using RNNs. In both cases, we have achieved high accuracy in predicting the corresponding targets.
For the quantum neural networks, we frame a similar supervised learning problem as a mapping from inputs to predictions. This allows us to investigate the expressive power of quantum neural networks as function approximators, particularly for the von Neumann entropy. We study quantum models that can explicitly realize the generating function as a partial Fourier series. However, the Gibbs overshooting hinders the recovery of an accurate point value for the von Neumann entropy. To resolve this issue, we re-expand the series in terms of Gegenbauer polynomials, which leads to exponential convergence and improved accuracy.
Several relevant issues and potential improvements arise from our approach:
• It is crucial to choose the appropriate architectures before employing KerasTuner, for instance, densely connected layers in Sec. 3 and RNNs in Sec. 4, because these architectures are built for certain tasks a priori. KerasTuner only serves as an effective method to determine the optimal complexity and hyperparameters for model training. However, since the examples from CFT$_2$ have different analytic structures for both the von Neumann and Rényi entropies, it would be interesting to explore how the different hyperparameters correlate with each example.
• Although the search is efficient, the parameter spaces sketched in Sec. 3.1 and Sec. 4.1 that KerasTuner explores are not guaranteed to contain the optimal setting, and there could be better approaches.
• We can generate datasets by fixing different physical parameters, such as temperature for (3.6) or cross-ratio $x$ for (3.15). While we have considered the natural parameters to vary, exploring different parameters may offer more representational power. It is possible to find a Dense model that provides feasible predictions in all parameter ranges, but this may require an ensemble of models.
• Regularization methods, such as K-fold validation, can potentially reduce the model size or datasets while maintaining the same performance. It would be valuable to determine the minimum datasets required or whether models with low complexity still have the same representational power for learning entanglement entropy.
• On the other hand, training the model with more data and resources is the most effective approach to improve the model's performance. One can also scale up the search process in the KerasTuner or use ensemble methods to combine the models found by it.
• For the quantum neural networks, note that our approach does not guarantee convergence to the correct Fourier coefficients, as we outlined in Sec. 5.1. It may be beneficial to investigate various pre-processing or data-encoding strategies to improve the approximation of the partial Fourier series with a high degree r.
There are also future directions worth exploring, on which we comment briefly:
• Mutual information: We can extend our study to mutual information for two disjoint intervals $A$ and $B$, which is an entanglement measure related to the von Neumann entropy defined as $I(A:B) \equiv S(\rho_A) + S(\rho_B) - S(\rho_{A \cup B})$. In particular, there is a conjectured form of the generating function in [8], with $\text{Tr}\,\rho_A^n$ being replaced by $\text{Tr}\,\rho_A^n \, \text{Tr}\,\rho_B^n / \text{Tr}\,\rho_{A \cup B}^n$. It is worth exploring the expressivity of classical and quantum neural networks using this generating function, particularly as mutual information allows eliminating the UV-divergence and can be compared with some realistic simulations, such as spin-chain models [82].
• Self-supervised learning for higher Rényi entropies: Although we have shown that RNN architecture is effective in the sequence learning problem in Sec. 4, it is worth considering other architectures that could potentially offer better performance. For instance, a time-delay neural network, depthwise separable convolutional neural network, or a Transformer may be appropriate for certain types of data. These architectures may be worth exploring in extending the task of extracting higher Rényi entropies as self-supervised learning, particularly for examples where analytic continuation is not available.
• Other entanglement measures from analytic continuation: There are other important entanglement measures, say, relative entropy or entanglement negativity that may require analytic continuation and can be studied numerically based on neural networks. We may also consider entanglement entropy or entanglement spectrum that can be simulated in specific models stemming from condensed matter or holographic systems.
• Expressivity of classical and quantum neural networks: We have studied the expressivity of classical and quantum neural networks for the von Neumann and Rényi entropies, with the generating function as the medium. This may help us in designing good generating functions for other entanglement measures suitable for neural networks. It is also worth understanding whether other entanglement measures are also in the function classes that the quantum neural networks can realize.

with manifest $\text{Tr}\,\rho_A^{k+1}$ appearing in the expression. Now we need to work out $C_{\cos}(n, m)$ and $C_{\sin}(n, m)$. First, let us consider in general the coefficients $a_n$, where we have written $G(w; \rho_A)$ as $f(t)$ for simplicity. We can write down the Taylor series of both pieces.

Note that the polynomials are not orthonormal; the norm of $C^\lambda_n(x)$ is given by a normalization constant $h^\lambda_n$.

Here we will sketch briefly how the Gegenbauer expansion leads to a resolution of the Gibbs phenomenon as we discussed in Section 5.4. In fact, one can prove that there is an exponential convergence between the function $f(x)$ we want to approximate and the $m$-th degree Gegenbauer polynomials. We will only sketch the idea behind the proof, and we refer the readers to the review in [79] for the details.
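As a reference for the normalization of the Gegenbauer polynomials used in the sketch that follows, the orthogonality relation and the norm $h^\lambda_n$ take the following form in the standard convention of the literature on the Gibbs phenomenon. We assume this is the convention intended here; the expression for $g^\lambda_{n,N}$ in the last line is the corresponding projection of the partial Fourier series $f_N(x)$.

```latex
\begin{align}
  \int_{-1}^{1} (1 - x^2)^{\lambda - \frac{1}{2}}\,
    C^{\lambda}_{k}(x)\, C^{\lambda}_{n}(x)\, dx
    &= \delta_{k,n}\, h^{\lambda}_{n}, \\
  h^{\lambda}_{n}
    &= \pi^{\frac{1}{2}}\, C^{\lambda}_{n}(1)\,
       \frac{\Gamma\!\left(\lambda + \frac{1}{2}\right)}
            {\Gamma(\lambda)\,(n + \lambda)},
  \qquad
  C^{\lambda}_{n}(1) = \frac{\Gamma(n + 2\lambda)}{n!\,\Gamma(2\lambda)}, \\
  g^{\lambda}_{n,N}
    &= \frac{1}{h^{\lambda}_{n}} \int_{-1}^{1}
       (1 - x^2)^{\lambda - \frac{1}{2}}\, f_N(x)\, C^{\lambda}_{n}(x)\, dx .
\end{align}
```

As a quick check of the normalization, for $\lambda = 1$ and $n = 0$ the second line gives $h^{1}_{0} = \pi/2$, which agrees with $\int_{-1}^{1} (1 - x^2)^{1/2}\, dx$.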
One can establish exponential convergence by demonstrating that the error of the $N$-th partial Fourier sum, re-expanded into Gegenbauer polynomials, can be made exponentially small. This error is bounded by an inequality whose right-hand side is a sum of two norms: we call the first norm the regularization error and the second the truncation error. Note that we take the norm to be the maximum norm over the interval $[-1, 1]$. More precisely, the truncation error involves $\hat{f}^\lambda_k$, which we take to be the unknown Gegenbauer coefficients of the function $f(x)$. If both $\lambda$ and $m$ grow linearly with $N$, this error is shown to be exponentially small. On the other hand, the regularization error can also be shown to be exponentially small for $\lambda = \gamma m$ with a positive constant $\gamma$. Since both the regularization and truncation errors can be made exponentially small with the prescribed conditions, the Gegenbauer expansion achieves uniform exponential accuracy and removes the Gibbs phenomenon from the Fourier data.
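The error decomposition described above can be written schematically as follows. This is a sketch under assumed notation rather than a quotation of the original displays: $f^{\lambda}_{m}$ denotes the degree-$m$ Gegenbauer expansion of $f$ itself, $g^{\lambda}_{m,N}$ denotes the degree-$m$ expansion built from the partial Fourier series, and both symbols are introduced here for illustration only.

```latex
\begin{align}
  \left\| f - g^{\lambda}_{m,N} \right\|
    &\leq \left\| f - f^{\lambda}_{m} \right\|
        + \left\| f^{\lambda}_{m} - g^{\lambda}_{m,N} \right\|, \\
  \left\| f - f^{\lambda}_{m} \right\|
    &= \max_{-1 \leq x \leq 1}
       \left| f(x) - \sum_{k=0}^{m} \hat{f}^{\lambda}_{k}\, C^{\lambda}_{k}(x) \right|
       \quad \text{(regularization error)}, \\
  \left\| f^{\lambda}_{m} - g^{\lambda}_{m,N} \right\|
    &= \max_{-1 \leq x \leq 1}
       \left| \sum_{k=0}^{m}
         \left( \hat{f}^{\lambda}_{k} - g^{\lambda}_{k,N} \right)
         C^{\lambda}_{k}(x) \right|
       \quad \text{(truncation error)}.
\end{align}
```

The first line is the triangle inequality in the maximum norm. The regularization error measures how well a finite Gegenbauer expansion captures $f$, while the truncation error measures how much is lost by computing the coefficients from only the first $N$ Fourier modes.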