Extensive Studies of the Neutron Star Equation of State from the Deep Learning Inference with the Observational Data Augmentation

We discuss deep learning inference for the neutron star equation of state (EoS) using the real observational data of the mass and the radius. We make a quantitative comparison between the conventional polynomial regression and the neural network approach for the EoS parametrization. For our deep learning method to incorporate uncertainties in observation, we augment the training data with noise fluctuations corresponding to observational uncertainties. Deduced EoSs can accommodate a weak first-order phase transition, and we make a histogram for likely first-order regions. We also find that our observational data augmentation has a byproduct to tame the overfitting behavior. To check the performance improved by the data augmentation, we set up a toy model as the simplest inference problem to recover a double-peaked function and monitor the validation loss. We conclude that the data augmentation could be a useful technique to evade the overfitting without tuning the neural network architecture such as inserting the dropout.


Introduction
In the central cores of neutron stars baryonic matter is highly compressed by the gravitational force. The inward force by gravity is balanced by the outward pressure which strongly depends on intrinsic properties of cold and dense matter subject to the strong interaction.
The study of inner structures of the neutron star requires a relation between the pressure p and the energy density ε; namely, the equation of state (EoS), p = p(ε), at T = 0 (see, e.g., Refs. [1][2][3][4][5] for recent reviews). At the center the baryon number density n B may reach 5-10 times as large as the saturation density of nuclear matter, i.e., n 0 0.16 (nucleons)/fm 3 . Towards the model-independent EoS determination, we should solve quantum chromodynamics (QCD), which is the first-principles theory of the strong interaction. At the moment, however, the model-independent results are only available at limited ranges of densities. At low density range of n B ∼ 1-2 n 0 we can apply ab initio methods combined with the nuclear force derived from Chiral Effective Theory (χEFT) with controlled uncertainty estimates [6][7][8][9][10][11][12][13][14] (see Ref. [15] for a recent review). At asymptotically high densities of n B 50 n 0 perturbative QCD calculations reliably converge [16][17][18][19][20][21] (see Ref. [22] for a recent review). The intermediate density region around 2-10 n 0 , which is relevant for the neutron star structure, still lacks trustable QCD predictions (see also Ref. [23] for a recent attempt to improve the convergence in the resummed perturbation theory). To tackle the intermediate density region from QCD, we need to develop non-perturbative methods such as the Monte-Carlo simulation of QCD on the lattice (lattice QCD), but the lattice-QCD application to finite density systems is terribly hindered by the notorious sign problem (see, e.g., Ref. [24] for a review). This is why neutron-star-physics studies still rely on phenomenological EoS constructions: rather ab initio approaches [25,26] near the saturation density, estimates employing the Skyrme interactions [27], the relativistic mean field theories [28], and the functional renormalization group [29], etc.
Meanwhile, we have witnessed significant advances in neutron star observations over the last decades. Now these advances have opened an alternative pathway for the modelindependent extraction of the EoS by statistical methods. Observations include the Shapiro delay measurements of massive two-solar-mass pulsars [30][31][32][33], the radius determination of quiescent low-mass X-ray binaries and thermonuclear bursters [34][35][36][37][38] (see reviews [2,39,40] for discussions on the methods and associated uncertainties), detections of gravitational waves from binary mergers involving neutron stars by the LIGO-Virgo collaboration [41,42] as well as the X-ray timing measurements of pulsars by the NICER mission [43,44]. Typical observable quantities of neutron stars involve mass M , radius R, compactness M/R, tidal deformability Λ (and their variants, e.g., Love number k 2 ), quadrupole moment Q, moment of inertia I, etc. The gravitational waves from binary neutron star mergers provide us with the information about the tidal deformability. Also, the NICER mission particularly targets at the compactness M/R of stars by measuring the gravitational lensing of the thermal emission from the stellar surface. Some of these neutron star quantities are connected through the universal relations that are insensitive to the EoS details. Among such relations, the most well-known example is the I-Love-Q relation [45,46], which relates the moment of inertia I, the Love number k 2 , and the quadrupole moment Q (see Refs. [4,47] and reference therein for other universal relations and usages).
These observations have provided some insights on the EoS and invoked discussions about non-perturbative aspect of QCD: for example, the emergence of quark matter in the neutron star [48] and a certain realization of duality between the confined and the deconfined matter [49][50][51][52][53][54][55][56][57]. Moreover, these observations, particularly the gravitational wave observation, may provide a unique opportunity for studying the hadron-to-quark phase transition occurring in the binary star merger and even better sensitivity will be achieved with the future detector [58][59][60].
For the purpose of extracting the most likely EoS from the observational data, there are diverse statistical technologies. The commonly used method is the Bayesian inference [34][35][36][61][62][63][64][65][66]. There are still other methods such as the one based on the Gaussian processes which is a variation of the Bayesian inferences with nonparametric representation of the EoS [67][68][69], etc. Despite numerous efforts to quantitatively constrain the EoS, it is still inconclusive what the genuine EoS should look like because of uncertainties from the assumed prior distribution in the Bayesian analysis. Therefore, the EoS inference program is in need of complementary tools. As an alternative to these conventional statistical methods, the deep learning; namely, the machine learning devising deep neural networks (NNs), has been successfully applied to a wide range of physics fields [70][71][72][73][74][75][76]. Several studies have already put the NN in motion in the context of gravitational wave data analysis of binary neutron star mergers [77][78][79][80]. Along these lines, in our previous publications [81,82] we proposed a novel approach to the EoS extraction based on the deep machine learning (see also Refs. [83][84][85] for related works).
In the present work we make further comprehensive analyses using the machine learning method developed previously [81,82]. Here we give a brief synopsis of our machine learning approach to the neutron star EoS problem. Given an EoS, various pairs of stellar mass and radius, M -R, follow from the general relativistic structural equation, i.e., Tolman-Oppenheimer-Volkoff (TOV) equation [86,87]. We will express the inverse operation of solving the TOV equation in terms of the deep NN, train the NN with sufficiently large data set of mock observations, and eventually obtain a regressor that properly infers the most likely EoS corresponding to the observational M -R inputs. While the other approaches to the neutron star EoS explicitly assume a prior distribution for the EoS probability distributions, our method only implicitly depends on the prior assumption. Therefore, in a sense that the implementations are based on different principles and the prior dependences appear differently, we can consider our method as complementary to other conventional approaches. We also note that independent algorithms would help us with gaining some information about systematics. For more detailed discussions on the prior, see Sec. 2.2.
This paper is organized as follows. Section 2 provides a detailed account on our problem setting, i.e., mapping neutron star observables to the EoS. Then, we outline our method by setting forth our parametrization for the EoS and reviewing the machine learning technique in general. Section 3 is devoted for the methodology study using mock data. We describe our data generating procedure and training setups, and then give a brief overview of the evolution of learning processes, which motivates the later analyses in Sec. 5. We showcase the typical examples out of the whole results, and then we carry out statistical analyses for the data as a whole. We confront our NN method with a different method; namely, the polynomial regression, and compare two methods to show that our method surpasses the polynomial regression. In Sec. 4, we present the EoSs deduced from the several different neutron star observables in real. We explain our treatment of uncertainty quantification along with the possibility of a weak first-order phase transition in the deduced EoS results. In Sec. 5 we study the efficiency of data augmentation in the context to remedy the overfitting problem. Finally, we summarize this work and discuss future outlooks in Sec. 6. Throughout this paper we use the natural unit system with G = c = 1 unless otherwise specified.

Supervised Learning for the EoS Inference Problem
In this section we explicitly define our problem and summarize the basic strategy of our approach with the supervised machine learning. We will explain the concrete setup in each subsection where we adjust the strategy in accordance with the goal of each subsection.
In the present study we want to constrain the EoS from the stellar observables. The EoS and the observables are non-trivially linked by the TOV equation which gives a means to calculate the neutron star structure from the EoS input. Thus, constraining the EoS from the observables is the inverse process of solving the TOV equation, but this inverse problem encounters difficulties from the nature of the observations. In Sec. 2.1 we formulate the inverse problem of the TOV equation and discuss its difficulties. We proceed to our approach to this problem using the supervised machine learning in Sec. 2.2. We closely describe the EoS parametrization and the data generation in Sec. 2.3. Then, we explain the design of the deep NN in Sec. 2.4 and its training procedures in Sec. 2.5.

TOV mapping between the EoS and the M -R relation
In the present analysis we focus on the mass M and the radius R as the neutron star observables. Given a boundary condition of the core pressure p c , the observables M and R of such a neutron star can be determined by solving the TOV equation [86,87]: where r is the radial coordinate which represents a distance from the stellar center. The functions, p(r) and ε(r), are the pressure and the energy density (i.e., the mass density), respectively, at the position r. The function m(r) represents a mass enclosed within the distance r. We can readily integrate these integro-differential equations from r = 0 with the initial condition, p(r = 0) = p c , towards the outward direction. The radius of the neutron star, R, is defined by the surface condition: p(R) = 0. The mass of the neutron star is the total mass within R, i.e., M = m(R).
The only missing information in the combined TOV equations (2.1) and (2.2) is a relationship between p and ε, which is nothing but the EoS, i.e., p = p(ε). Once an EoS is fixed, we can draw a p c -parametrized trajectory of (M (p c ), R(p c )) in the M -R plane. This trajectory is called the M -R relation. Thus, the operation of solving the TOV equation can be considered as a mapping from the functional space of the EoSs onto the functional space of the M -R relations. Here we call this mapping the "TOV mapping" denoted by Ψ TOV : It is known that the mapping is invertible, i.e., the inverse mapping Ψ −1 TOV exists [88]. Thus, ideally, one could uniquely reconstruct the EoS from an observed M -R relation. However, practically, we cannot access the entire continuous curve on the M -R relation (which we will briefly call the M -R curve) from neutron star observations. In reconstructing the EoS from the observational data, we are facing twofold obstacles: (i) Incompleteness -M -R data from the observation does not form a continuous curve.
Since one neutron star corresponds to one point on the M -R curve, the observation of neutron stars gives a limited number of M -R points and only partial information on the original M -R curve is available. Even with a sufficiently large number of neutron stars, the information on possible unstable branches of the M -R curve, on which the neutron stars do not exist, would be missing.
(ii) Uncertainties -Real data is not a point but a distribution. Each M -R point is accompanied by observational uncertainties represented as a credibility distribution. Moreover, the central peak position of the distribution is not necessarily on the genuine M -R relation.
We can introduce the following symbolic representation of the above complication associated with the observational limitation: This mapping by Π obs (ω) is a convolution of a projection from the M -R relation to a set of points along the M -R curve and a probability distribution due to observational uncertainties. The distribution generally depends on observational scenarios (how many and what kinds of neutron stars are detected by which signals) and such dependence is collectively represented by ω. With this mapping, the relation between the EoS and the observed M -R data is more correctly expressed as Because of this additional filter by Π obs (ω), one cannot reconstruct the correct EoS from the observational data just by calculating Ψ −1 TOV . Since a part of original information is lost through Π obs (ω), this is an ill-posed problem, i.e., there is no unique well-defined answer due to insufficient information and uncertainties inherent in this problem.
Here, instead of reconstructing the unique EoS, we try to infer the most likely EoS from M -R observations with uncertainty; that is, the regression analysis of the inverse mapping including Π obs (ω) (see Fig. 1 for a schematic illustration). In particular it is crucial to develop an approach leading to outputs robust against the observational uncertainties of the M -R data.

General strategy for the machine learning implementation
To solve this ill-posed problem we employ a method of machine learning with deep NNs to infer the neutron star EoS. In particular we consider the supervised machine learning to train the NN with as many and different ω's as possible. The trained NN can then receive the M -R data for one particular ω and return the output of the most likely EoS correspondingly. To put it another way, the "inversion" of the mapping Π obs • Ψ TOV is approximated by the regression, particularly by the deep NN in our approach, which would be symbolically written as Reg[Π obs • Ψ TOV ] −1 .
In the supervised learning, we need to prepare the training data and the validation data composed of pairs of the input M -R data and the desired output EoS. To generate the training data efficiently, we make use of the asymmetry between Π obs • Ψ TOV and Reg[Π obs • Ψ TOV ] −1 ; that is, we can straightforwardly calculate the forward mapping of Π obs • Ψ TOV by modeling the observation Π obs (ω), while the latter inverse mapping, which is what we currently want to know, is more non-trivial. Thus, first, we randomly generate many possible answer EoSs represented by several parameters. We next generate the corresponding M -R data by applying the forward mapping with various ω's. We will explain technical details of handling and simplifying ω in Sec. 2.3. Now, we turn to the description of the architecture of our NN model and we will optimize the model parameters so that the model can infer the answer EoS corresponding to the input training M -R data. During the training, it is important to monitor the training quality by checking the prediction error behavior for the validation data. After the optimization process is complete with good convergence of the validation error, we can test the predictive power using the mock data for which the true answer EoS is known (see Sec. 3 for actual calculations). Once the method passes all these qualification tests, we finally proceed to the application of the real observational data to obtain the most realistic EoS (see Sec. 4 for details).
Here, we comment on an alternative possibility of the inference problem formulation. As already mentioned, the inverse TOV mapping, Ψ −1 TOV , is well-defined by itself. Thus, it is also feasible to decompose the inference as Reg[Π obs • Ψ TOV ] −1 = Ψ −1 TOV • Reg[Π obs ] −1 and train the NN aimed to approximate Reg[Π obs ] −1 ; that is, the NN model would predict the M -R curve from the input M -R data. We will not adopt this strategy since it is unclear in this approach how to measure the reconstruction performance of the M -R curve. Our target space is the EoS space, but the distance in the M -R space is not uniformly reflected in the distance in the desired EoS space. We are ultimately interested in the EoS, so it is natural to directly model Reg[Π obs • Ψ TOV ] −1 by maximizing the performance for the EoS reconstruction.
We also make a remark on the discriminative modeling and the generative modeling to clarify a difference between our machine learning method and other statistical approaches such as the Bayesian inference. The approach we are advocating in this paper belongs to the discriminative modeling: our NN model directly predicts the response variable, i.e., the EoS, for a fixed set of explanatory variables, i.e., the M -R data. In contrast, the statistical approaches such as the Bayesian inference for the neutron star EoS are to be categorized as the generative modeling. The Bayesian inference first models occurrence of the combination of the explanatory and response variables by the joint probability, Pr(M -R data, EoS) ≡ Pr(M -R data|EoS) Pr(EoS). Here, Pr(EoS) models the prior belief on the EoS, and Pr(M -R data|EoS) models the observation process. Then, the Bayes theorem gives a posterior distribution of EoS, Pr(EoS|M -R data), from the joint probability, which corresponds to the likely EoS distribution for given M -R data.
These discriminative and generative approaches can be considered to be complementary to each other. They have the pros and cons as follows: • The advantage in the generative approaches is that the EoS prediction for a given M -R data takes a distribution form, so that the prediction uncertainties can be estimated naturally. In the discriminative model, on the other hand, additional analysis is needed to estimate the prediction uncertainties. We will perform the additional uncertainty analysis within our method in Sec. 4.3. At the same time, we will utilize our uncertainty analysis for a physics implication as discussed in Sec. 4.5.
• In the discriminative modeling, once the NN training is completed, the most likely EoS corresponding to a given M -R data can immediately result from the mapping represented by the NN: the NN models the inference procedure by itself. Thus, the computational cost is much smaller than statistical analyses, which enables us to perform supplementary estimation on observables and/or quality assessments if necessary. In the generative approaches, on the other hand, the posterior calculation is computationally expensive because the EoS parameter space is large. In addition, one needs to calculate the posterior for each M -R data separately. If one applies the Bayesian inference to the real observational data only once, this computational cost would not matter. The computational cost becomes problematic, however, if one wants to examine the statistical nature of the inference using a large number of generated EoSs and M -R data.

Random EoS generation with parametrization by the speed of sound
The training data consists of pairs of the input M -R data and the output EoS, p(ε), parametrized by a finite number of variables. We first generate EoSs by randomly assigning values to the EoS parameters, solve the TOV equation to find the corresponding M -R curve, and finally generate the M -R data. The M -R data generation involves Π obs (ω), namely, we sample observational points of neutron stars on the M -R curve, and probabilistically incorporate finite observational uncertainties, which should mimic observations in the scenario ω. The real M -R data is composed of the probability distribution for each neutron star on the M -R plane, but in practice we should simplify ω and need to parametrize the M -R data as well. We will discuss the parametrization later and in this section we focus on our EoS parametrization and generation.
In designing the EoS parametrization, it is important to consider its affinity to the physical constraints such as the thermodynamic stability and the causality, i.e, p should be a monotonically non-decreasing function of ε, and the speed of sound, c s , should not be larger than the speed of the light. For this reason, c s , which is derived from is the convenient choice as the EoS parameters to impose the physical constraints easily. Also, we need to smoothly connect the parametrized EoS to the empirical EoS known from the low density region. Up to the energy density of the saturated nuclear matter, ε 0 , we use a conventional nuclear EoS, specifically SLy4 [27]. Because the EoS is well constrained by the saturation properties and the symmetry energy measurements near the saturation density, this specific choice of SLy4 would not affect our final results. In the energy region above ε 0 we employ the standard piecewise polytropic parametrization and partition a density range into a certain number of segments. In our present analysis we choose the density range, [ε 0 , 8ε 0 ], and six energy points, ε i (i = 0, . . . , 5) with ε 5 = 8ε 0 , equally separated in the logarithmic scale. Then, (ε i−1 , ε i ) (i = 1, . . . 5) form 5 segments. To be more specific, these values are (ε 0 , ε 1 , ε 2 , ε 3 , ε 4 , ε 5 ) = (150, 227, 345, 522, 792, 1200) MeV/ fm 3 . The EoS parameter is the average speed of sound, c 2 s,i ≡ c 2 s = ∂p/∂ε (i = 1, . . . , 5), in i-th segment. From these definitions we see that the pressure values at the i-th segment boundaries, p i , are read as 1 where p 0 is determined by p 0 = p(ε 0 ) from our chosen nuclear EoS, i.e., SLy4. We make polytropic interpolation for the EoS by p = k i ε γ i whose exponent and coefficient are given by For the EoS generation we randomly assign the average sound velocities, c 2 s,i , of the uniform distribution within δ ≤ c 2 s,i < 1 − δ. The upper bound comes from the causality 1 We can confirm that c 2 s,i is indeed an average in each segment under this construction as c 2 s < c 2 = 1, and the lower bound from the thermodynamic stability ∂p/∂ε ≥ 0. Here, a small margin by δ = 0.01 is inserted as a regulator to avoid singular behavior of the TOV equation. We here note that we allow for small c 2 s corresponding to a (nearly) first-order phase transition. Repeating this procedure, we generate a large number of EoSs. For each generated EoS, we solve the TOV equation to obtain the M -R curve as explained in Sec. 2.1. In Fig. 2 we show the typical EoS data and the corresponding M -R curves used in our later analyses. It should be noted here that we excluded artificial biases in the EoS generation and allowed for any possible EoSs including ones disfavored by the model calculations or the observational data (e.g., an EoS whose maximum mass exceeds 3M in the right panel of Fig. 2). This is important for the NN to be free from biases coming from specific model preference. In this way our analysis gains an enough generalization ability, covers even exotic scenario cases, and captures the correct underlying relations between the input and the output.
Before closing this part, let us mention a possible improvement for the random EoS generation, though we would not utilize this improvement to keep our present approach as simple as possible. The problem arises from our assumed uniform distribution of c 2 s,i . The parametrization and generation algorithm as explained above definitely allows for a firstorder phase transition for sufficiently small c 2 s,i 's; however, due to the uniform distribution, it is a tiny percentage among the whole data for the generated EoSs to accommodate a firstorder phase transition, and this may be a part of our "prior" dependence. Since a strong first-order transition scenario has already been ruled out from phenomenology, this prior dependence should be harmless. Nevertheless, if necessary, we can naturally increase the percentage of the EoSs with a first-order phase transition in a very simple way as follows: In Eq. (2.8) we carry out the linear interpolation in the log-log plane. Alternatively, we can use the spline interpolation in the log-log plane. With such a smooth interpolation, there appear the energy density regions with negative ∂p/∂ε, i.e., the EoS can be non-monotonic. We can then replace this non-monotonic part by the first-order phase transition using the Maxwell construction. This is one effective and natural way to enhance a finite fraction of EoSs with a first-order phase transition while keeping the same c 2 s,i 's. We also note that one more merit in this procedure is that the end points of the first-order energy regions are not restricted to be on the segment boundaries. Thus, the strength of the first-order phase transition is not necessarily discretized by the segment width but an arbitrarily weak first-order phase transition could be realized with the same parametrization. We say again that we would not adopt this procedure: the present observational precision gives a tighter limitation and for the moment we do not benefit from this EoS improvement. In the future when the quality and the quantity of the observational data ameliorate, the above mentioned procedure could be useful.

Neural network design: a general introduction
The neural network, NN, is one representation of a function with fitting parameters. The deep learning, i.e., the machine learning method using a deep NN, is nothing but the optimization procedure of the parameters contained in the function represented by the NN. In particular, the supervised learning can be considered as the fitting problem (namely, regression) for given pairs of the input and the output, i.e., the training data. We often refer to a NN "model" to mean the optimized system of the function given in terms of the NN.
The advantage of the deep learning, as compared to naive fitting methods, lies in the generalization properties of the NN. We need not rely on any preknowledge about reasonable forms of fitting functions: the NN with a sufficient number of neurons is capable of capturing any continuous functions [89,90]. With a large number of neurons (and layers), there are many fitting parameters, and recent studies of machine learning have developed powerful optimization methods for such complicated NNs.
The model function of feedforward NN can be expressed as follows: where x and y are input and output, respectively. The fitting parameters, {W ( ) , b ( ) } L =1 , are given in the form of matrices and vectors, respectively. For calculation steps, we first prepare (L+1) layers (including the input and the output layers). Each layer x ( ) ( = 0, . . . L) consists of nodes called neurons corresponding to the vector components, x ( ) 1 , . . . , x ( ) n . We note that the number of neurons, n , can be different for different . The input x is set to the first layer x (0) (which is labeled as = 0 in this paper), and the subsequent layers are calculated iteratively as The L-th layer becomes the final output, y = f (x) = x (L) . Here, σ ( ) (x)'s are called activation functions, whose typical choices include the sigmoid function: σ(x) = 1/(e x + 1), the ReLU: σ(x) = max{0, x}, hyperbolic tangent: σ(x) = tanh(x), etc. In the above notation σ ( ) returns a vector by applying the activation to each vector component of the function argument, i.e., σ(x) = σ(x 1 ), . . . , σ(x n ) T . The fitting parameters, W ( ) and b ( ) , on the -th layer, denote the weights between nodes in the two adjacent layers and the activation offset at each neuron called bias, respectively. These fitting parameters have no physical interpretation at the setup stage. The general architecture is schematically depicted in Fig. 3, in which the calculation proceeds from the left with input x to the right with output y.
The complexity choice of the NN structure, such as the number of layers and neurons, has a tradeoff problem between the fitting power and the overfitting. To capture the essence of the problem, the complexity of layers and neurons should be sufficiently large. At the same time, to avoid the overfitting problem, and to train the NN within reasonable time, the number of layers and neurons should not be too large. There is no universal prescription and the performance depends on the problem to be solved. For our setup we found overall good performance when we chose the neuron number on the first layer greater than the input neuron number. A more sophisticated NN may be possible, but for our present purpose a simple network structure as in Fig. 3 is sufficient.

Loss function and training the neural network
For the actual procedure of machine learning we need to choose a loss function and minimize it, which is called training. We define a loss function as where (y, y ) quantifies the distance or error between the NN prediction y and the answer y provided in the training data. The exact formula for the distribution Pr(x) is not necessarily known in general. Moreover, even if it is known, multidimensional integration in Eq. (2.12) is practically intractable, so let us approximate it with the sum over the samples in the training data, D = {(x n , y n )} |D| n=1 , with |D| denoting the sample size: The optimization problem we deal with here is to minimize the above loss function by tuning the network parameters {W ( ) , b ( ) } . The minimization is usually achieved with the stochastic gradient descent method or its improved versions. The deterministic methods would be easily trapped in local minima or saddle points of the loss function. In practice, therefore, the stochastic algorithms are indispensable. In gradient descent method the parameters {W ( ) (t), b ( ) (t)} at the iteration step t are updated along the direction of the derivative of the loss function with respect to the parameters: where η is called learning rate. We can update b ( ) in the same way. We usually evaluate the derivatives by using the mini-batch method. In the mini-batch method, the training data set is first randomly divided into multiple subsets, which is called mini-batches. Then, the derivative of the loss function is estimated within a mini-batch B. Using the explicit expression (2.13), the approximated derivatives read: where the batch size |B| denotes the number of sample points in B, whose optimal choice differs case-by-case. The epoch counts the number of scans over the entire training data set D. The parameters are updated for each mini-batch, so one epoch amounts to |D|/|B| updates. Usually, the derivative, ∂ /∂W appearing in Eq. (2.15), is evaluated by a method called backpropagation. For numerics we basically use a Python library, Keras [91] with TensorFlow [92] as a backend. In this work we specifically employ msle, i.e., the mean square logarithmic errors, for the choice of the loss (y, y ) appearing in Eq. (2.12); that is, We use a specific fitting algorithm, Adam [93], with the mini-batch size being 100 for the analysis in Sec. 3 and 1000 for the analysis in Sec. 4. We initialize our NN parameters with the Glorot uniform distribution.

Extensive Methodology Study and the Performance Tests
This section is an extensive continuation from our previous work on the methodology study [81]. We here present more detailed and complete investigations of various performance tests. In this section, to put our considerations under theoretical control, we assume ω to be a hypothetical observational scenario that 15 neutron stars are observed and (M i , R i ) (i = 1, . . . , 15) are known with a fixed size of observational uncertainty following the previous setup in Refs. [2,37,38,81]. We design a NN, generate the validation data as well as the training data, and train the NN to predict the EoS parameters, c 2 s,i , from the artificially generated M -R data, (M i , R i ) (i = 1, . . . , 15) (i.e., mock data). We discuss the training behavior of the NN, the reconstruction performance of the EoS and the M -R curve, and the distribution of reconstruction errors.

Mock data generation and training
For the current study in this section, we deal with the input M -R data, , and the output EoS parameters, c 2 s,i (i = 1, . . . 5), as described in the previous section. A combination of these input and output data forms a sample point in the training and the validation data, and we need to generate many independent combinations to prepare the training and the validation data.
The procedure for sampling the M -R points is sketched in Fig. 4. We follow the basic strategy described in Sec. 2.3 and introduce minor modifications for the current setup. For each randomly generated EoS, p = p(ε), we solve the TOV equation to get the M -R curve and identify the maximum mass M max of the neutron star, corresponding to (1) and (2) in Fig. 4. If M max does not reach the observed maximum, such an EoS is rejected from the ensemble. Then, for each EoS, we randomly sample 15 pairs of (M i , R i ) on the corresponding M -R curve, as shown in (3)  If there are multiple values of R corresponding to one M , we take the largest R discarding others belonging to unstable branches. In this way we select 15 points of (M * i , R * i ) on the M -R curve. We also expect the NN to learn that observational data should include uncertainties, ∆M i and ∆R i . We randomly generate ∆M i and ∆R i according to the normal distribution with assumed variances, σ M = 0.1M for the mass and σ R = 0.5 km for the radius. These variances should be adjusted according to the real observational uncertainty as we will treat in Sec. 4, but for the test purpose in this section, we simplify the treatment by fixing σ M and σ R by hand. Furthermore, the distributions are not necessarily centered on the genuine M -R curve as in (3) in Fig. 4. So, we shift data points from an ensemble of randomly shifted data for each EoS, and the repeating number (and thus the number of obtained shifted data) is denoted by n s in this work. Our typical choice is n s = 100, whose effects on learning will be carefully examined in later discussions. For studies in this section, we have prepared 1948 EoSs (2000 generated and 52 rejected), so the total size of the training data set is 1948 × n s . We find that the learning proceeds faster if the data is normalized appropriately to be consistent with the Glorot initialization; we use M i /M norm and R i /R norm with M norm = 3M and R norm = 20 km. The NN is optimized to fit the training data so as to have a predictive power. To test it, we use the validation loss calculated for the validation data, i.e., independent mock data of the neutron star observation. We separately generate 200 EoSs, among which 196 EoSs pass the two-solar-mass condition, and sample an n s = 1 observation for each EoS to generate the validation data set.
It is worth mentioning here that calculating M -R curves requires repeated numerical integration of the TOV equation with various boundary conditions, which is computationally expensive. Generally speaking, in the machine learning approach, the preparation of the training data is often the most time-consuming part. In contrast, increasing the sample size by repeating the observation procedure does not need further computational cost. This procedure can actually be regarded as a kind of the data augmentation by the noise injection, where the observational errors play the role of the injected noise. Let us call this procedure observational data augmentation hereafter. This observational data augmentation has actually another virtue, which will be discussed in later sections.
We are constructing a NN that can give us one EoS in the output side in response to one "observation" of 15 neutron stars, (M i , R i ) (i = 1, . . . , 15) in the input side. Thus, the number of neurons in the input and the output layers should be matched with the size of the observational data and the number of EoS parameters, respectively. The input layer with 30 neurons matches 15 M -R points (30 input data). We sorted 15 M -R pairs by their masses in ascending order, but we found that the sorting does not affect the overall performance. The output neurons in the last layer correspond to the prediction target, i.e., 5 parameters of the speed of sound.
The concrete design of our NN is summarized in Tab. 1. We choose the activation function at the output last layer as σ (4) (x) = tanh(x) so that the speed of sound is automatically bounded in a causal range c 2 s < 1. In principle the output could be unphysical: c 2 s < 0 may occur, and then we readjust it as c 2 s = δ. For the other layers we choose the ReLU, i.e., σ ( ) (x) = max{0, x} ( = 1, . . . , L − 1), which is known to evade the vanishing gradient problem.

Learning curves with the observational data augmentation
The stochastic gradient descent method is an iterative procedure to adjust the NN parameters step by step and to gradually make the loss function smaller. To judge the point where we stop the iteration, we need to monitor the convergence of the loss function in training. For this purpose we plot the loss function as a function of training time (i.e., the number of iteration units), which we call the learning curve in this work.
The actual shape of the learning curve depends on the initial parameters of the NN and also on the choice of mini-batches in the stochastic gradient descent method. Although the learning curve differs every time we start the new training, we can still find some common tendency. In Fig. 5 we show the typical behavior of learning curves for our neutron star problem. It should be noted that the horizontal axis in Fig. 5 is the number of mini-batches. In this study we will compare the training speeds with different n s but with the fixed minibatch size. The number of epochs -the number of scans over the whole training data setis often used as the unit of training time, but in the present study for the training speeds, we need to adopt the number of mini-batches which is proportional to the actual number of parameter updating and the computational time of the training.
In Fig. 5 we plot two different learning curves for a single training: one for the training loss (dashed line) and the other for the validation loss (solid line). The training and the validation losses are the loss functions calculated for the training and the validation data sets, respectively, by using Eq. (2.13). The former diagnoses how well the model remembers the training data, and the latter diagnoses how well the model properly predicts unseen data by generalizing the learned information.
Here, we compare the learning curves for two different training data sets with n s = 100 and n s = 1. By observing the plot like Fig. 5 for many trials on the training, we can read out two general tendencies: (i) The value of validation loss of n s = 1 is larger than that of n s = 100. A possible interpretation is that the loss function can be more trapped in local minima/saddle points for smaller n s .
(ii) For the number of batches 10 3 , for n s = 1, the training loss keeps decreasing with increasing number of batches, whilst the validation loss turns to increasing behavior. The loss function of n s = 1 apparently shows overfitting behavior but the overfitting is not seen for n s = 100.
From these results, we may argue that the data augmentation with sufficiently large n s could be useful to improve the problems of local minimum trapping and overfitting. To justify or falsify this speculation, we will perform further quantitative analyses on the observational data augmentation under a simple setup in Sec. 5. We note in passing that recent developments show a possibility of over-parametrized networks, which has a very large number of parameters, to reconcile a good generalization ability and an elusion of overfitting [94,95]. Here, our point is that the data augmentation is needed to incorporate the physical uncertainty rather than to improve the learning quality, but it may have such a byproduct.

Typical examples -EoS reconstruction from the neural network
Once the loss function converges, we can use the trained NN model to infer an EoS from an observation of 15 M -R points. We picked two typical examples represented by orange and blue in Fig. 6. In the left panel of Fig. 6 the dashed lines represent randomly generated EoSs. We note that two EoSs are identical in the low density region because the SLy4 is employed at ε ≤ ε 0 . The corresponding genuine M -R relations are shown by the dashed lines in the right panel of Fig. 6. Randomly sampled mock observations consisting of 15 M -R points are depicted by ellipses where the centers are the "observed" (M, R) and the major and minor axes show observational uncertainties in the R and M directions, respectively. The reconstructed EoSs are depicted by solid lines in the left panel of Fig. 6. We can see that the reconstructed EoSs agree quite well with the original EoSs for these examples. It would also be interesting to make a comparison of the M -R relations corresponding to the original and the reconstructed EoSs. The solid lines in the right panel of Fig. 6 represent the M -R relations calculated with the reconstructed EoSs. As is expected from agreement in the EoSs in the left panel of Fig. 6, the original and the reconstructed M -R relations are close to each other. More details are already reported in our previous publication [81].

Error correlations
We make the detailed error analysis of the reconstructed EoSs in terms of the speed of sound. To increase the statistics, for the present analysis, we set n s = 1000 (remember that increasing n s is almost costless). Now, we introduce the prediction error in the speed of sound: where (c 2 s,i ) (rec) and (c 2 s,i ) * are the reconstructed and original values, respectively, for the speed of sound. The first row of Fig. 7 shows the histograms of δc 2 s,i (i = 1, . . . 5). We see that the prediction error at the lower density (i.e., smaller i) is centered at the origin while that at the higher density (i.e., larger i) is distributed widely, which means that the observational data efficiently constrains lower density regions. At the highest density with i = 5, the prediction error roughly follows the uniform distribution within the physical range. This behavior can be understood as follows. The original values of the speed of sound in the validation data are generated by the uniform distribution in the range [0, 1] while the distribution of the predicted c 2 s,i (i = 1, . . . 5) are shown in the second row of Fig. 7. At lower density, the distribution of the prediction roughly follows the original uniform distribution. However, at higher density, the distribution of the prediction peaks around c 2 s,i ∼ 0.5. At the highest density, therefore, the NN always predicts c 2 s,i ∼ 0.5 for any inputs. This is because c 2 s,i ∼ 0.5 is an optimal prediction to minimize the loss function when the constraining information is inadequate. The exact peak position is slightly smaller than 0.5, which is because our choice of the loss function, msle, gives larger weights for smaller values. Thus, the approximate uniform distribution of the prediction errors at the highest density (as seen in the upper rightmost figure in Fig. 7) just reflects the distribution of the original values relative to the fixed prediction.
From this analysis we conclude that the observation data should contain more information on the lower density EoS but not much on the high density EoS around ε ∼ 8ε 0 . This is quite natural, but such a quantitative demonstration in Fig. 7 is instructive. Also, the results in Fig. 7 tell us two important aspects of our NN approach. First, the NN tends to give a moderate prediction, not too far from all possible values, when the uncertainty is large. This is nontrivial: the posterior distribution of EoS is random, but such information is lost after the NN filtering. This is reasonable behavior in a sense that we want to design the NN to predict the most likely EoS instead of the posterior distribution. Secondly, the moderate prediction for unconstrained cases may explicitly depend on the choice of the loss function and also on the prior distribution, i.e., the distribution of the EoSs in the training data set in our problem. Thus, we should be careful of possible biases when we discuss unconstrained results in the high density regions.
The error analysis so far is separately made for each EoS parameter. In Fig. 8, to extract more detailed information, we plot the marginalized distributions for different pairs of the EoS parameters with corresponding Pearson's correlation coefficients r ij . We see that the speeds of sound of neighboring density segments have negative correlations, which is similar to the behavior seen in the Bayesian analysis [63]. The underlying mechanism for this behavior can be explained in the same way as in Ref. [63]: an overprediction in a segment can be compensated by an underprediction in the adjacent segment, and vice versa. The correlations are clearer in the lower density regions in which predicted c 2 s,i 's have better accuracy. We also observe a small and positive correlation, e.g., r 13 > 0, between the parameters separated by two segments. This is also expected from two negative correlations in between. The other correlations involving parameters in the high density regions are consistent with zero within the statistical error.

Conventional polynomial regression
It is an interesting question how the machine learning method could be superior to other conventional approaches as a regression tool. In this section we make a detour, introducing a polynomial model, and checking performance of a traditional fitting method. We can model the regression of EoS from the M -R data with the following polynomial fitting of degree n:  In this section we shall quantify the overall agreement, and for this purpose we enlarge the size of the validation data; namely, we randomly generate 1000 EoSs, among which 967 EoSs pass the two-solar-mass condition, and we carry out statistical analyses with 967 EoSs. We define the overall reconstruction accuracy by calculating the radius deviation, i.e., the distance between the solid and dashed lines as already shown in the right panel of Fig. 6: In the plot of the logarithmic probability density the Gaussian distribution or the normal distribution takes a quadratic shape, so the δR histograms from both the NN and the polynomial regression have a strong peak at δR = 0 for all M . The tails are wider than the normal distribution, and the polynomial results (lower panels) exhibit even slightly wider tails than the NN results (upper panels).

Comparison between the neural network and the polynomial regression
We can look into this tendency in more quantitative manner in Fig. 10. In the leftmost (a) of Fig. 10 we plot the RMS of δR(M ), i.e., ∆R RMS . We see that ∆R RMS from the NN is around ∼ 0.12 km for a wide range of masses. This indicates that our NN method works surprisingly good; remember that data points have random fluctuations by ∆R ∼ 0.5 km. It should be noticed that, even without neutron stars around M = 0.6-0.8M in mock M -R data of our setup, the RMS of the corresponding radii are still reconstructed within the accuracy ∼ 0.1 km. One might think that this should be so simply because we use the SLy4 in the low energy region, but in view of Fig. 6, different EoSs lead to significantly different radii even in this region of M = 0.6-0.8M . Figure 10 (a) clearly shows that the NN results surpass the polynomial outputs with n = 2.
When the distribution has long tails, as seen in Fig. 9, the RMS may not fully characterize the data behavior. To quantify the deviation of the reconstruction more differentially, we can define quantile-based widths, ∆R 1σ and ∆R 2σ , corresponding to 68% and 95% of δR contained within the half widths, respectively. They can be explicitly expressed as where F −1 δR (p) is the quantile function of the δR distribution, i.e., the value of δR at the fraction p integrated from small δR. The error function, erf(α/ √ 2) ≡ (1/ √ 2π) α −α dxe −x 2 /2 , is used to translate the statistical significance by σ to the probability, e.g., erf(1/ √ 2) 68% and erf(2/ √ 2) 95%. If δR obeys the normal distribution N (0, σ 2 ), in general, ∆R ασ = ασ follows. Therefore, we can use ∆R ασ /α as alternatives to ∆R RMS = σ. For more general distributions it is important to note that these quantile-based widths have a clear meaning as the 68% and 95% confidence intervals unlike ∆R RMS .
In Figs. 10 (b) and (c) we plot ∆R 1σ and ∆R 2σ . It is interesting that the ∆R 2σ results appear very different while the ∆R 1σ results are almost indistinguishable between the NN and the polynomial results. The polynomial results are consistently larger than the NN results. This behavior of ∆R 2σ implies that the polynomial results may be contaminated by outliers of wrong values far from the true answer. We can understand this by relating it to Runge's phenomenon of the polynomial approximation; higher order terms can cause large oscillation in the interpolation and the extrapolation. In particular, the extrapolation may easily break down by higher order terms that blow up for large inputs. In contrast, the NNs with ReLU activation only contain up to the linear terms so that the outputs are stable.

EoS Estimation from the Real Observational Data
In the previous section we have studied our methodology to infer the most likely EoS from observational data and quantitatively confirmed satisfactory performance to reproduce the correct results. Now, we shall apply the method to analyze the real observational data and constrain the real neutron star EoS.
For practical purposes it is important to develop a way to estimate uncertainties around the most likely EoS within the framework of the machine learning method. As we have already discussed, the EoS inference from the M -R data is an ill-posed problem and the solution cannot be uniquely determined. Thus, the prediction from the NN method cannot avoid uncertain deviations from the true EoS. To employ a predicted EoS as a physical output, we need to quantify the uncertainty bands on top of the results. In this section we consider two different approaches for the uncertainty quantification.

Compilation of the neutron star data
For the real data we use the M -R relation of the neutron stars determined by various X-ray observations. There are three sorts of sources we adopt here: 8 neutron stars in quiescent low-mass X-ray binaries (qLMXB) [2,37,38], 6 thermonuclear bursters with better constraints than the qLMXBs [2,37] as well as a rotation-powered millisecond pulsar, PSR J0030+0451, with thermal emission from hot spots on the stellar surface [43,44]. The data from the first two sources is tabulated in Refs. [2,37,38], especially in Fig. 4 of Ref.
[2] 2 . The data from the last source is recently investigated in the NICER mission, and there are two independent analyses based on the same observation, among which here we use the data given in Ref. [43]. All of these M -R relations are provided in the form of the Bayesian posterior distribution, which takes the two-dimensional probability distribution on the R-M plane (see also Fig. 1 in Ref. [82] for a graphical representation of the probability distribution).
Ideally, with sufficient computational resources, the machine learning would be capable of directly dealing with such two-dimensional probability distribution using, for example, the convolutional neural network (CNN). In the present work, however, we simplify our analysis by approximately characterizing a two-dimensional probability distribution with four parameters following the prescription proposed in our previous publication [82]: the means and the variances of the marginal distributions with respect to M and R. More specifically, we make projection of the two-dimensional distribution onto the one-dimensional M -axis (and R-axis) by integrating over R (and M , respectively). In this way we marginalize the two-dimensional distribution with respect to M and R into two one-dimensional distributions, which are eventually fitted by the Gaussians with the mean and the variance.

Training data generation with observational uncertainty
We should revise the data generation procedure previously illustrated in Fig. 4   and σ R = 0.5 km, for all ∆M i and ∆R i . In reality, however, they should vary for different i, i.e., σ M,i and σ R,i should correspond to the observational uncertainties of (M i , R i ). To deal with the real observational data, the revised procedure for sampling the M -R points is sketched in Fig. 11. We need to design the NN with an input of information including σ M,i and σ R,i : the input variables are extended to (M i , R i ; σ M,i , σ R,i ). We shall recapitulate the data generation scheme as follows. In the same way as in Sec. 3 we prepare 5 EoS parameters c 2 s,i (i = 1, . . . 5) in the output side. In this section the training data comprises 14 × 4 input parameters, i.e., (M i , R i ; σ M,i , σ R,i ) (i = 1, . . . , 14). We note that i runs to not 15 but 14 corresponding to the number of observed neutron stars as explained in Sec. 4.1. We calculate the M -R curve for each EoS, and then select 14 points of (M * i , R * i ) on the M -R curve and add statistical fluctuations of ∆M i and ∆R i [see Fig. 11 (3)]. Let us go into more detailed procedures now. Unlike σ M and σ R in Sec. 3 here we randomly generate σ M,i and σ R,i differently for i = 1, . . . 14. These variances, σ M,i and σ R,i , are sampled from the uniform distributions, [0, M ) and [0, 5 km), respectively. In view of the observational data, these ranges of the distributions should be sufficient to cover the realistic situations. Then, ∆M i and ∆R i are sampled according to the Gaussian distributions with these variances, σ M,i and σ R,i . Finally we obtain the training data, Fig. 11 (4)]. Hereafter we call these 14 tetrads of (M i , R i ; σ M,i , σ R,i ) an observation. Now we prepare the training data set by taking multiple observations. For each EoS we randomly generate 100 different pairs of (σ M,i , σ R,i ), and then we make another 100 observations for each (σ M,i , σ R,i ). From the former 100 pairs the NN is expected to learn that the observational uncertainties may vary, and the latter tells the NN that the genuine M -R relation may deviate from the observational data. In total we make n s = 10000 Function  0  56  N/A  1  60  ReLU  2  40  ReLU  3  40  ReLU  4 5 tanh Table 2. Neural network architecture used in Sec. 4. In the input layer, 56 neurons correspond to the input 14 points of the mass, the radius, and their variances. The design of the other layers is kept the same as in Tab. 1.

Layer index Neurons Activation
(= 100 × 100) "observations" per one EoS. The size of the whole training data set is thus 100 times larger than before. We modify the architecture of the NN used in this section accordingly. The number of neurons in the input layer becomes 56 (= 4 × 14) (the performance test was done with 15 points, but we have numerically confirmed that the same level of performance is achieved with 14 points as well). Since we already know from the mock data analysis that the mass sorting does not affect the performance, we keep the mass ordering as generated randomly, unlike in Sec. 3.1. We normalize the input data as M i /M norm , R i /R norm , σ M,i /σ M norm , and σ R,i /σ R norm with M norm = 3M , R norm = 20 km, σ M norm = 1M , and σ R norm = 5 km. Aside from the input layer, the NN design of the other layers is chosen to be the same as before, as summarized in Tab. 2. For the NN optimization we adopt Adam, the same as in Sec. 3, but with different mini-batch size, 1000. Incorporating the observational uncertainties the training data set becomes larger as compared to before, and we expect that a larger mini-batch size would fasten the training.

Two ways for uncertainty quantification
Here we prescribe two independent methods to quantify uncertainties in the output EoS based on different principles.
The first one utilizes the validation data, and the procedure is similar to that in Sec. 3.6. The basic idea is that, once we have trained the NN, we can evaluate the prediction accuracy from the validation data. We generate 100 samples of the validation data set whose input variances are chosen to be in accord with the real observational uncertainties as explained in Sec. 4.1. For the validation data, we know the true M -R relation so that we can evaluate the deviation δR(M ) as defined in Eq. (3.4). Using the whole validation data set, we calculate the root-mean-square deviation, ∆R RMS , and regard it as the systematic error inherent in the NN approach. Later we show this uncertainty by the label "validation" as seen in Fig. 13.
The second one is the ensemble method in machine learning. This method is usually used to enhance the stability and performance of the predicted output from NNs. Here we repurpose it to quantify uncertainty of the predicted output. The idea is concisely summarized in Fig. 12. In this method we set up multiple NNs independently: NN 1 , NN 2 , . . . ,NN N with N being the number of prepared NNs. We perform random sampling from D to gener-  ate different subsets, D 1 , D 2 , · · · , D N and train each NN using D 1 , D 2 , . . . , D N . This random sampling is commonly referred to as bootstrapping. After the training, by feeding the input data, each NN predicts output values, which we symbolically denote byô 1 ,ô 2 , . . . ,ô N as in Fig. 12. Finally, aggregating the outputs from these multiple NNs, i.e., averaging all the outputs, we get the overall outputô = ô i ≡ N −1 iô i . Here we can also calculate the variance by ∆ô 2 = (ô i −ô) 2 , and regard it as "uncertainty". This whole procedure, comprising bootstrapping and aggregating, is named bagging 3 [96]. In this work, we choose N = 10. If some regions of the EoS are insensitive to the M -R observation, independently trained 10 NN models would lead to different EoSs in such unconstrained regions. From ∆ô that quantifies how much 10 output EoSs vary, we can estimate the uncertainty around the outputô. We use this bagging for most of the analyses as shown by the band labeled by "Bagging" in Figs. 13.
We have introduced the two natural ways to quantify uncertainties, but our working prescriptions are still to be developed. In fact there is no established method yet. A more systematic way for the uncertainty quantification might be possible. This is an interesting problem left for future research in the general context of machine learning.

The most likely neutron star EoS
In Fig. 13 the orange line is the most likely neutron star EoS deduced from our NN approach. We estimated uncertainty from the bagging (shown by the band with a hatch pattern labeled by "bagging") and the validation (shown by the blue band labeled by "validation"). We plot bands from other works in the figure for comparison. The gray band represents an estimate from the χEFT calculation combined with polytropic extrapolation and the two-solar-mass pulsar constraint [97] (labeled by "χEFT+astro"). Because χEFT is an ab initio approach, any reasonable predictions should lie within the gray band, and indeed our results are found to be near the middle of the gray band. The preceding Bayesian analyses [2,[35][36][37][38] are also overlaid on Fig. 13. While Özel et al. [2,37,38] and our present analysis use the same astrophysical data, Steiner et al. [35,36] employs a subset of the data, i.e., 8 of X-ray sources. One may think that our prediction gives a tighter constraint than the others, but the narrowness of the band may be related with the implicit assumption in our EoS parametrization; we will come back to this point in Sec. 4.5 (see Fig. 17). Figure 13 (b) shows the M -R curves corresponding to the EoSs in (a). We see that our EoS (blue curve) certainly supports neutron stars with M > 2M [30][31][32][33].

Figures 14 (a) and (b)
show the tidal deformability and their correlation, respectively, in the binary neutron star merger GW170817. Once an EoS is given, the dimensionless tidal deformability, Λ, results from a quantity called the Love number k 2 , which is derived from the Einstein equation under static linearized perturbations to the Schwarzschild metric due to external tidal fields. Practically, we solve a second-order ordinary differential equation in combination with the TOV equation; see Refs. [98,99] for the explicit form of the equations. The blue band in Fig. 14 (a) represents Λ from the EoS we inferred in the present work, which is consistent with the merger event GW170817 indicated by the red bar. In Fig. 14 (b) we show the correlation of the tidal deformabilities, Λ 1 of the star 1 and Λ 2 of the star 2, using the relation between Λ and M as given in Fig. 14 (a). The orange lines in Fig. 14 (b) refer to the constraints (solid:90% and dashed:50%) for which Λ 1 and Λ 2 are sampled independently [41], while the green lines refer to the constraints for which Λ 1 and Λ 2 are related through Λ a (Λ s , q) with Λ a = (Λ 2 − Λ 1 )/2, Λ s = (Λ 2 + Λ 1 )/2, and q = M 2 /M 1 . In Fig. 14 (b) we clearly see that our predicted band is located within the 90% contours of the LIGO-Virgo data [41,100].
So far we have only used the observational data from the thermonuclear bursters and qLMXBs, which may contain large systematic errors related with the uncertain atmospheric model of neutron stars. The results in Figs. 15 (a) and (b) include the M -R constraint from the NICER mission as well. There are two independent analyses, as spotted on Fig. 15 (b), on the same observation of PSR J0030+0451 [43,44], and we adopt the one [green bar in (b)] in Ref. [43]. We see that the uncertainty becomes slightly larger by the inclusion of the NICER data, which is attributed to the relatively large deviation of the NICER data from others.

Possible EoSs with a weak first-order phase transition
In the analyses we have presented so far, we used the piecewise polytrope with 5 segments of density. Here, we change the number of segments from 5 to 7 and repeat the inference with the finer bins. There are mainly two issues argued in this subsection: a possibility of a weak first-order phase transition with the finer bins and its implication on the uncertainty quantification.
To this end we prepared N = 100 NNs in the bagging outlined in Sec. 4.3. We note that each NN is trained so as to predict an EoS in response to the real observational data. Here we use the M -R data of qLMXBs and thermonuclear bursters without the NICER data. In the left panel of Fig. 16 we show 100 EoSs predicted from 100 independent NNs. We remind that the activation function in the output layer is chosen to be tanh which takes a value over [−1, 1]. When the predicted values is smaller than the threshold: c 2 s < δ = 0.01, we adjust it to c 2 s = δ and identify a first-order phase transition. We found 44 predicted EoSs out of 100 that have a first-order phase transition. We highlight the region of first-order phase transition with orange thick lines in Fig. 16. From this plot we can understand why we increased the number of segments. If we use the EoS parametrization with 5 segments, weak first-order phase transitions are too strongly prohibited by coarse discretization.
We also make a histogram in the right panel of Fig. 16 to show a breakdown of the EoS regions with a first-order phase transition. This histogram counts the number of first-order transition EoSs in each energy density region. It is interesting to see that the most of the first-order phase transition is centered around the energy region 202-272 MeV. On the one hand, in the lower energy region 150-202 MeV the first-order phase transition is less likely, and this tendency is consistent with the fact that a stronger first-order phase transition in a lower energy region is more disfavored by the two-solar-mass pulsar constraint [101]. In the higher energy region, on the other hand, there are also less EoSs with a first-order phase transition. One may think that a first-order phase transition would be more allowed in the higher energy region, but it is not the case in the NN analysis. In Sec. 3.4 we already discussed that the NN model tends to predict the most conservative value around c 2 s ∼ 0.5 in the high energy density regions where the constraints are inadequate. Therefore, the correct interpretation of the absence of the first-order transition in the high density regions as shown in Fig. 16 should be, not that our results exclude a first-order transition there, but merely that the observational data analyzed in our NN method does not favor a firstorder transition there. Another artificial factor in the high energy density region is that our piecewise polytrope is equally spaced in the log scale, so the higher density segments have larger energy widths in the linear scale. Then, the parametrization does not have a resolution to represent a weak first-order phase transition in the high energy region. Here, we also make a comment that our NN approach does not take account of the third family scenario [102,103] in which separate branches may appear in the M -R relation (see also Refs. [66,104] for recent discussions).
In Fig. 17 we show the EoS results with 7 segments (fine binning) after bagging. We see that uncertainty is widened compared with the results with 5 segments (coarse binning). This is partially because the number of output parameters, c 2 s,i (i = 1, . . . , 7), is increased, while the amount of the input data is unchanged, so the corresponding uncertainty should increase. Furthermore we see that the uncertainty is enhanced by the effect of first-order phase transition. If there is a first-order phase transition, c 2 s becomes (nearly) zero and these zero values enlarge the variance of the output in the bagging and increase the uncertainty accordingly. Now the uncertainty appears comparable with other Bayesian methods as seen in Fig. 17. Our previous results with 5 segments as shown in Ref. [82] and in Sec. 4.4 of this paper had smaller uncertainty, which implies that the choice of 5 segments turns out to be optimal.
Finally we plot the speed of sound in Fig. 18. The bare output from the NN model actually comprises these values of c 2 s,i . It is clear from the plots that the speed of sound exceeds the conformal limit c 2 s = 1/3; see Refs. [13,105] for thorough discussions on the conformal limit. The inclusion of the NICER data only slightly widens uncertainty, and changing the segment number is a more dominant effect on uncertainty bands as quantified in the left panel of Fig. 18.

More on the Performance Test: Taming the Overfitting
In Sec. 3.2, we observed a quantitative difference between the learning curves for the training data sets with and without data augmentation by n s = 100 as demonstrated in Fig. 5. Then, it would be a natural anticipation to consider that this observational data augmentation may be helpful to overcome the problems of local minimum trapping and overfitting that we often meet during the NN training. This section is aimed to discuss numerical experiments to understand the behavior of the learning curve and the role of n s thereof. In particular, we will focus on the overfitting problem here 4 .
Neutron star (i = 1, . . . , 15) f (x) fitting Input 1948 n base = 20 Number of training data (input) 1948 n s = 1948000 n base · n s = 20 n s Table 3. Literal correspondence between the neutron star calculation (mock data analysis) and the f (x) fitting.

A toy model
Here we consider a specific and simple problem as a toy model. We choose the following double-peaked function f (x) 5 : in the domain of x ∈ [0, 10]. A concrete shape of above f (x) is depicted in Fig. 19. In Tab. 3 we make a comparison between the neutron star calculation and this toy-model study of f (x) fitting.
For the training data, we first generate a set of the "true" inputs, denoted by x * , in the interval [0, 10) and calculate the corresponding outputs, y * = f (x * ). We denote this base data set as T * = {(x * p , y * p )} n base p=1 ⊂ R 2 with n base = |T * | being the size of the base data set. Then, similarly to the neutron star case, we augment the training data set by duplicating n s copies of the base data with the input fluctuated by "observational uncertainty", ∆x.
Here, we set the distribution of ∆x as the normal distribution with the standard deviation fixed as σ(x * ) = 0.3(x * + 0.5) of our choice 6 . Then, the whole training data set can be expressed as The size of the training data set T is found to be |T | = n s n base . The trivial case with n s = 1 corresponds to the naive training data set without augmentation, while n s > 1 augments the training data set. Figure 19 exemplifies the training data set with n base = 20 and n s = 5.
Since the task we deal with here is idealized and as simplified as possible, we can reasonably keep the NN architecture simple and relatively small (albeit deep). In the present study we basically use two types: NN1221 and NN1991 as tabulated in Tab. 4. We 4 We also tested the performance to escape from the local minimum trapping, but the data augmentation seems not to be useful to resolve the trapping problem. 5 This choice of f (x) is motivated from spectral functions which often appear in the lattice-QCD calculation. 6 With this choice of σ(x * ) the uncertainty is smaller in the x region with a narrow peak. For the demonstration purpose the toy model is designed in such a way that the problem is not too easy but not intractable.  Table 4. NN architectures used in this work: NN1221 (left) and NN1991 (right). We have chosen the leaky ReLU f (x) = αx (x < 0), x (x ≥ 0) to avoid the dying ReLU problem. We fixed the leaky ReLU parameter α = 0.1. also choose the simplest loss function here as mae, i.e., the mean absolute errors:

Input noises, parameter fluctuations, and dropout
To deepen our understanding of the data augmentation, we argue that including noises in the input is equivalent to introducing fluctuations to the NN parameters in the first layer. For the moment we will not limit ourselves to the special toy model in Eq. (5.1) but here we shall develop a general formulation using the schematic notation introduced in Sec. 2.4. We consider the following case that a noiseẽ is added to input values:x = x +ẽ. The noise is assumed to be independent of the input value, i.e., Pr(x,ẽ) = Pr(x) Pr(ẽ). In this case the input noiseẽ can be absorbed by redefinition of the first layer parameters; that is, The loss function with the input noise is given by This is nothing but the loss function with fluctuations in the first-layer biases b (1) ; in summary, the noiseẽ in the input is translated into the fluctuation by W (1)ẽ in the NN parameter b (1) . Once we realize this correspondence, we can generalize this latter prescription of fluctuating NN parameters not only in the first-layer b (1) but also in all the parameters {W ( ) , b ( ) } . We will test this idea of generalization later. We can further relate the parameter fluctuations to a standard NN technique called dropout. Instead of adding random fluctuations to the parameters (additive parameter noise), we can also think of multiplying random fluctuations to the parameters (multiplicative parameter noise), i.e., W i , putting these multiplicative parameter noise is exactly the procedure commonly referred to as "dropout" with the dropout rate identified as p.
In this way we can understand the parameter noises as generalization of the data augmentation by noise injection and the dropout. It is already well known that the dropout is indeed an established tool to avoid the overfitting problem. Thus, we can expect that our observational data augmentation, which was originally needed to handle the observational uncertainties, have an effect similar to the dropout and can tame the overfitting behavior.

Numerical tests
Motivated by the relations between the data augmentation and the dropout through parameter noise, as argued in the previous subsection, we shall make a numerical comparison involving all of the data augmentation, the NN parameter noise, and the dropout using the toy model defined in Eq. (5.1). For the parameter fluctuation noise, we consider the additive parameter noise of the normal distribution N (0, σ 2 ) in the first-layer parameters, (W (1) , b (1) ), or in all the parameters, {W ( ) , b ( ) } . To manipulate this special type of the training presented in this section, we implement the NN and its optimization from scratch in C++. As an ideal test case we want a "troublesome" NN suffering from the overfitting problem. We have first verified that the NN method with NN1221 in Tab. 4 can reasonably solve the reconstruction problem of f (x) while NN1991 encounters the overfitting problem.
In what follows below, we will present the results from NN1991 to analyze the overfitting in details. We fix n base = 20 and the mini-batch size is 10. We repeat carrying out the independent training procedures with the same setup 100 times and take the average of the loss function to draw the learning curves with probabilistic distribution.
In Fig. 20 we show the averaged learning curves for several choices of n s to check the effect of the data augmentation. The mean values of the loss function L(t) are represented by the dashed (training) and the solid (validation) curves along with the standard error band, ∆ L(t) , for the validation loss. Here the (population) mean is estimated by the sample mean over the 100 training, i.e., L(t) where t is the training time in the unit of the net number of processed mini-batches, and i is the index of the training. We should stress that this process with N = 100 is not the bagging, but we do so to quantify the performance on average. We also note that the learning curves of the validation loss fluctuate, and the standard deviation ∆L(t) [as shown in Fig. 20 (b)] is much larger than the standard error band [as read from Fig. 20 (a)] that quantifies the typical deviation of the sample mean L(t) from the "true" population mean L(t) . Here, the standard deviation and the standard error are estimated by the unbiased As we speculated before, the results in Fig. 20 (a) evidence that incorporating the data augmentation with n s > 1 reduces the overfitting behavior (i.e., the validation loss increasing while the training loss decreasing). Obviously the learning curve for n s = 1 indicates the overfitting. For a sufficiently large n s 10, the validation loss does not increase any more and saturates towards a certain value. The improvement with n s 10 can be understood from the size of the data set, n s n base 10·20 = 200, which is larger than the number of the NN parameters, i.e., 118 for NN1991. When n s is small, the size of the data set is too small as compared to the NN parameters, so that the NN parameters cannot be well constrained. Even for n s = 100 there are still discrepancies between the training and the validation losses as seen in Fig. 20 (a), which can be explained as follows; the size of the independent training base data with n base = 20 is too small. The data augmentation does not supplement any additional information on the training data T but the contained information is the same as the base data set T * . Therefore, the improvement should be eventually saturated with increasing n s . The standard deviation in Fig. 20 (b) at an early time is large reflecting the variance of initial NN parameters. The variance of the initial parameters quickly disappears around t 100. An interesting observation is that the standard deviation becomes larger again when the overfitting occurs, and its magnitude is of the same order as the loss increase by the overfitting. This means that in some trials of training the performance does not necessarily get worse even in the overfitting time scale. The actual distribution of the final validation loss at t = 10 6 is shown in Fig. 21 in a form of histogram. We can see that the validation loss spreads to larger values in the overfitting case (n s = 1), while the validation loss with the data augmentation (n s = 10 and 100) is well localize around L = 0.25.
In Fig. 22 we next check the effect of noise in the NN parameters. We introduce the additive parameter noise of the normal distribution N (0, σ 2 ) with several variances σ = 0, 0.2, and 0.4 to the first-layer parameters (W (1) , b (1) ). From Fig. 22 we can confirm the same trend as the observational data augmentation, and indeed the overfitting is evaded similarly. We have tried different values of σ = 0.2 and 0.4, but they are consistent with each other within the standard errors. We can also observe the behavior of the standard deviation similar to the case of the data augmentation.
We also compare different ways to introduce noise in the NN parameters. We here consider the following types of the parameter noise to make a plot in Fig. 23: (i) Add random values sampled from the normal distribution N (0, σ 2 ) with σ = 0.2 to the first-layer parameters (W (1) , b (1) ). We resample parameter noises every time for each sample point in the training data ["Pointwise Noised (W (1) , b (1) )" in Fig. 23].
(ii) Add random values in the same way as (i) but resample parameter noises every time a mini-batch starts. We use a common set of the parameter noise values throughout a mini-batch. ["Batchwise Noised (W (1) , b (1) )" in Fig. 23].
We find that the performance is slightly improved for the batchwise noise rather than the pointwise noise. Also, better performance, i.e., smaller validation losses are reached for all-layer noises as prescribed in (iii) rather than for the first-layer noises as in (i) and (ii). Finally, we plot the results for the case with a dropout in the NN in Fig. 24. Here again, we can validate the role of the dropout to overcome the overfitting problem. We have tried several different dropout rates; namely, 10%, 20%, and 50%, and numerically confirmed that the performance is better for small values of the dropout rate in the current setup. Figure 25 is the summary plot of the performance comparison between three different approaches; namely, the observational data augmentation, the parameter noise, and the dropout. The parameter noise and the dropout results are close to each other and almost identical within the errors. Among these three approaches, it has turned out that the best performance is realized for the observational data augmentation in the present toy-model setup. The best strategy can be different for different model circumstances. The important point is not a question of which is the best but the finding that the data augmentation can tame the overfitting problem at the same level as the standard procedure of the dropout. This is a very interesting observation. In physics all observational data has uncertainties, and the data is intrinsically augmented by repeated measurements with limited accuracy. Then, for physics problems, this NN nature of the "observational data augmentation" can naturally evade the overfitting problem by itself.

Summary and Outlooks
In this work, we have performed a comprehensive analysis of inferring the equation of state (EoS) from the observed masses and radii of neutron stars using the previously proposed method of deep machine learning [81,82]. Following the method elaborated in Sec. 2, we have subdivided our analyses into three parts: the examination of the deep learning methodology using the mock data mimicking the real observations (Sec. 3), the application of the method to the actual observational data of neutron stars (Sec. 4), and the discussion of our observational data augmentation by noise fluctuations in a simple setup idealized for theorization (Sec. 5).
In the first part (Sec. 3) we have demonstrated how our method works by presenting the typical two examples of the EoS prediction. Then we have quantified the uncertainty or error between the prediction and the original value as given in Eq. (3.1) and inspected the prediction reliability for the whole validation data set. The upshot is that, by looking at the histogram and the correlation between the prediction uncertainties in different energy density regions, we have verified that the EoS reconstruction fares well for the lower density regions, but the higher density regions are not sufficiently constrained. We have also compared the NN method to a more conventional approach, i.e., the polynomial regression. We then quantified the results in terms of ∆R RMS , ∆R σ , and ∆R 2σ as given in Eqs. (3.4) and (3.5). We have reached a conclusion that our NN method leads to more robust outputs, and the polynomial regression may have outliers that damage the overall performance.
In the second part (Sec. 4) we have applied the EoS inference for real neutron star data from the various sources; namely, qLMXBs, thermonuclear bursters, and the pulsar with hot spots on the surface, at which the NICER mission aims. We have also introduced natural and intuitive ways for uncertainty quantification. One is based on the validation loss and the other is based on the procedure called the bagging. We trained multiple NNs independently and took the average for the bagging. We have utilized independent outputs from the multiple NNs to discuss the likelihood of EoSs with a first-order phase transition. We made a histogram to count the rate of EoSs with a first-order transition in each energy density segment.
In the third part (Sec. 5) we have brought to light a byproduct from our prescription of incorporating observational uncertainty. The training data is augmented with fluctuations corresponding to observational uncertainty, which we call the observational data augmentation. Then, the data augmentation with n s fluctuating copies has turned out to tame the overfitting problem. We have explicitly confirmed this behavior of the overfitting reduction by monitoring a time evolution of the validation loss.
Although we have carried out extensive analyses in this work, there are still a lot to be done in the future to improve the applicability of our method to the neutron star EoS inference. As our NN method reliably predicts the most likely EoS in lower density region, it would be supplementary to provide information on the TOV limiting mass (the heaviest observed mass) directly in the NN input. At present we have incorporated this information indirectly by just kicking out the training data that does not reach the TOV limit. Also, our NN model currently copes with the pairs of mass-radius data only, but it would be an important next step to extend the NN model so as to process other forms of data, e.g., the tidal deformability.
Currently we employ the feedforward network. It is also a natural future extension to upgrade the NN architecture. For instance, we can replace the NN with the convolutional neural network (CNN) as frequently used in the image processing. With CNN we can make full use of the two-dimensional Bayesian posterior distribution for the input observable data. On top of the improvement in the NN part, we can also think of another extension of better EoS parametrization. It would be possible to use parametrizations other than the piecewise polytropes, such as the spectral representation [106,107] and the quarkyonic parametrization [56]. In this work we jointed our predicted EoS above n 0 with the SLy4 below n 0 . The dependence of the joint density and/or choices of the nuclear EoS should be systematically examined. Our naive expectation is that none of these improvements would affect the results drastically.
The uncertainty quantification is another interesting issue. In this work we have used the practical but oversimplified techniques; namely, the bagging and the validation loss. More rigorous treatments of uncertainties may be possible by adopting the CNN as mentioned above, or by combining the Bayesian method and NN approach. Using such a Bayesian NN, in which the network weights are sampled from a probability distribution, we can obtain the posterior probabilities of the outputs and thus we can capture the NN model uncertainties correctly. One efficient way to implement this computation is to set up a dropout as outlined in Refs. [108,109] (see also Ref. [110] for physics applications). In analogy to the explanation in Sec. 5.2, we can regard a dropout as a Bayesian inference with likelihood functions set to the Bernoulli distribution B(1, p). All above-mentioned extensions deserve further investigations.