Neural Network Parameterizations of Electromagnetic Nucleon Form Factors

The electromagnetic nucleon form-factors data are studied with artificial feed forward neural networks. As a result the unbiased model-independent form-factor parametrizations are evaluated together with uncertainties. The Bayesian approach for the neural networks is adapted for chi2 error-like function and applied to the data analysis. The sequence of the feed forward neural networks with one hidden layer of units is considered. The given neural network represents a particular form-factor parametrization. The so-called evidence (the measure of how much the data favor given statistical model) is computed with the Bayesian framework and it is used to determine the best form factor parametrization.


Introduction
The electromagnetic (EM) form-factors (FF) of the nucleon are the quantities which embody the information about the complex electromagnetic structure of the proton and neutron [1]. In practice, the form-factors are introduced in order to model (on effective level) the electromagnetic hadronic current for elastic ep (n) scattering. In the one photon exchange approximation it has the following form: where q µ = p ′ −p denotes the four-momentum transfer; M p(n) is the proton (neutron) mass; p ′ and p are outgoing and incoming nucleon momenta; Q 2 ≡ −q 2 ; F p(n) 1 is the helicity nonflip Dirac proton (neutron) form-factor, while F p(n) 2 denotes the helicity-flip Pauli proton (neutron) form-factor. The form factors are normalized as follows: F p 1 (0) = 1, F p 2 (0) = µ p − 1, F n 1 (0) = 0, F n 2 (0) = µ n , (1.2) where µ p,n is anomalous magnetic moment of the proton, neutron.
The nucleon is the many-body system of strongly interacting quarks (three valence quarks and any number of quark-antiquark pairs) and gluons. This complex system is described by the QCD (quantum chromodynamics) in the confinement regime. Study of the EM form-factors gives an opportunity for testing the models describing the strong interactions. However, computing the EM form-factors from the first principles is an extremely difficult task. Nevertheless, some effort has been done with the effective approaches and the lattice QCD.
A good approximation of the FF is performed within the vector meson dominance models (VMD) [2,3]. There are interesting results obtained with constituent quark models [4] as well as with other approaches (see for review [5]). However, the given theoretical description usually works well only on limited Q 2 range. In order to describe the full Q 2 domain various approaches must be combined. Hence a proper prediction of the FF in wide Q 2 range requires to use complex phenomenological models which contain plenty of internal parameters.
On the other hand, the experimental data, which have been collected during the last sixty years, covers a wide Q 2 domain and are accurate enough to provide reasonable information about the nucleon electromagnetic structure [6]. Therefore one can try to represent the nucleon form-factors by the data itself without assuming any model constraints. In this article we follow this philosophy.
Description of the electromagnetic properties of the nucleon is a problem of great interest of modern particle physics. The knowledge of the nucleon form-factors is also important for practical applications. We mention two of them: (i) predicting the cross sections for the quasi-elastic charged current (CC) and elastic neutral current (NC) neutrino scattering off nucleon and nucleus [7]; (ii) investigation of the strange content of the nucleon in elastic lepton scattering off nucleons/nuclei [8,9].
An accurate modeling of the neutrino-nucleus cross sections plays a crucial role in the analysis of the ν µ → ν τ neutrino oscillation data, collected in the long-baseline experiments. For instance in the experiments like K2K [10] or T2K [11] the neutrino energy spectrum is reconstructed from the quasi-elastic-like events. Observing the distortion of the energy spectrum in the far detector gives an indication for neutrino oscillation.
The investigation of the quasi-elastic CC neutrino-nucleon interactions gives an opportunity to explore the axial structure of the nucleon. The weak hadronic current is formulated assuming the conserved vector current (CVC) theorem. Then the vector part of the current is expressed in terms of the electromagnetic FF of the proton and neutron, while the axial contribution is described with two axial form factors: G A and G P (pseudoscalar axial form-factor). The hadronic weak current for the CC νn quasi-elastic scattering reads [12] J µ νn, If the partially conserved vector current hypothesis (PCAC) is assumed then the axial form-factors can be related: G P (Q 2 ) = 4M 2 G A (Q 2 )/(m 2 π + Q 2 ). The G A is usually parameterized with dipole functional form: , g A = −1.2695 ± 0.0029. (1.5) M A denotes the axial mass. Notice that recent studies [13,14] suggest M A value larger by about 20% with respect to the old measurements [15,16,17]. The impact of the electromagnetic form-factors on the axial mass extraction is small, but it can play a role in the future, when more precise measurements of the neutrino-nucleon cross-sections will be performed. The precise knowledge of the EM form-factors together with uncertainties is more important for predicting the NC elastic νN reaction cross-section. The structure of the weak NC hadronic current is similar to (1.3) [18], namely: θ W is the Weinberg angle. F s 1,2 (Q 2 ) and G s A (Q 2 ) describe the strange content of the nucleon. We see that the investigation of the elastic NC neutrino-nucleon scattering gives the opportunity to explore the nucleon strangeness [18,19] (mainly the axial strange part). The strangeness of the nucleon is also investigated in the elastic ep scattering [9,20]. The extraction of this contribution is sensitive to the accuracy of the EM form-factors. Therefore it is necessary to use the well determined FF parametrization together with the uncertainties.
There are many different phenomenological parametrizations of the EM form-factors [3,21,22,23,24,25,26,27,28,29]. Some of these are based on the theoretical models, but mostly in practical applications simple functional parametrizations fitted to the data are applied [30]. The functional form is chosen to satisfy some general properties (proper behavior at Q 2 → 0 and Q 2 → ∞, scaling behavior). However, a particular choice of the parametrization determines the final fit and affects also the uncertainty. The form-factors parameterized by the large number of degrees of freedom have a tendency to describe the data too accurately, and the generality of the fit is lost. On the other hand, the model with a small number of the parameters may describe the data imprecisely. Moreover the complexity of the fit has an impact on its uncertainties.
Searching for the proper parametrization, which describes the data well enough without losing the generality of the fit is just solving the problem, known in statistics as biasvariance trade-off [31,32]. Usually the most reasonable solution is chosen with a use of common sense, i.e. the fit which leads to the low enough χ 2 min value is accepted, and more complex models are not considered. The task of this paper is to evaluate a model independent FF parametrizations, which will not be affected by the problems described above i.e. the common sense will be replaced by the objective Bayesian procedure.
One of the possible fitting techniques is to apply artificial neural networks (ANN). The ANN has already been used in the high energy physics for decades [33] and it has been shown to be a powerful tool in the field. The pattern recognition tasks like particle or interaction identification are efficiently addressed with the ANN based methods also in present experiments [34,35]. The ANN are also applied to the function approximation and parameter estimation problems [36,37].
The ANN techniques have already been applied by NNPDF collaboration [38] to represent the nucleon and deuteron EM structure functions [36,39,40,41,42,43]. The method is based on the large collection of networks [41] of the same architecture prepared on the artificial data sets generated from original experimental measurements. Obtained fits are claimed to be unbiased due to networks being intentionally oversized -the number of free parameters, the network weights, is larger than required to solve the problem. To avoid potential over-fitting (representing the statistical fluctuations of the experimental data), that may arise under these conditions, the optimization of the network weights (so-called training) is stopped before reaching the minimum of the figure of merit (error function) calculated on training data. Stopping condition is based on the cross-validation technique, where the portion of available data is excluded from training. Such created subset is then used to calculate the test error function which starts to increase when the network becomes fitted to training data more than to the testing data. This observation is used to break the training. The best fit values and the uncertainties given by the NNPDF are computed by taking the average and standard deviation respectively, over the set of solutions obtained from the whole collection of the networks.
In the case of the present analysis the number of experimental points varies from 26 to 57, and we do not generate the Monte Carlo data. Therefore the cross-validation technique is unsuitable because constructing the testing data set can significantly restrict the information about the underlying data model used in training. Additionally our intention is to compare statistical models which are represented by the networks of various architectures and among them choose the most appropriate parametrization. It motivated us to consider another idea for finding the best fit and the choice of the neural network architecture. We apply Bayesian framework (BF) for the ANN. It is a different philosophy of building the statistical model than the NNPDF approach. However, both techniques are complementary and face with the same bias-variance trade-off. A pedagogical description of the main ingredients of both methodologies can be found in Ch. Bishop's book [31] (chapters 9 and 10 respectively).
In the BF approach the sequence of neural networks characterized by different number of hidden units is considered. A given network of a particular size has its specific ability to adjust to training data i.e. small networks give smooth approximation, large networks can over-fit the data. One can think that the network of a particular architecture represents the particular statistical model. With the help of the Bayesian technique we compare the models and choose the most appropriate one. This method has been developed for the ANN [31,44,45,46] in nineties of last century. We adapted this approach for the purpose of χ 2 minimization. In practice, the so-called evidence is computed for every network type in order to select the most appropriate parametrization for given data set. The evidence is a probabilistic measure which indicates the best solution.
The network of particular architecture has weights that need to be optimized i.e. the global minimum of the error function is searched for. In order to get the solution we consider various gradient algorithms. However, the training done with these algorithms can stick in local minimum. Therefore for a given network architecture the sample of networks with randomized initial weights is trained to find a single configuration at the global minimum (this procedure is described in Sec. 2.2). The error function is modified with so-called regularization term to improve generalization ability (to control the overfitting); the extent of regularization is controlled in the statistically optimal way, also as a part of the Bayesian algorithm.
The main results of our studies are unbiased proton and neutron FF parametrizations, available in the numerical form at [47] as well as in the analytical ones (see Appendix A). The proposed statistical method also allows to compute the form-factor uncertainties (from the covariance matrix). One of the strengths of this methodology is its ability of studying the deviations of the form-factors from the dipole form.
Eventually, let us mention that the previous (non-neural) form-factor data analysis (with ah-hoc parametrizations) have been done in the non-Bayesian spirit i.e. authors do not compare the possible FF parametrizations in order to choose the most suitable. Usually the one particular functional form was discussed and analyzed with the χ 2 framework.
The paper is organized as follows. In Sec. 2 the feed forward neural networks are shortly reviewed. Sec. 3 describes the Bayesian approach to neural networks. The last section contains the numerical results and discussion. We supplement the article with the appendix, which presents the fits in the analytical form.

Multi-Layer Perceptron
We consider the feed-forward neural network in the so-called multi-layer perceptron (MLP) configuration. The network structure (shown in Fig. 1) contains: the input layer, the layer of M hidden neurons and a single neuron in the output layer. We will say that the network of type 1-M-1 is considered. Each neuron (see Fig. 2) calculates the output value as an activation function f act of the weighted sum of its inputs: where w i denotes the weight parameter, while µ i represents the output value of the unit from previous layer. Neurons in the hidden layer are usually non-linear, with the sigmoid or hyperbolic tangent functions denoted as f act ; in this analysis the output neuron is linear function. In general, the ANN gives a map ( y) of the input into the output vector spaces. The overall network response is then a deterministic function of the input variable (vector in), and the weight parameters: In our analysis the ANN is expected to approximate the given form-factor G depending on the input variable Q 2 : Let D denotes the training data set of N points: where t i is the measured value of the nucleon form-factor at the point x i = Q 2 i , while the ∆t i denotes the total experimental error. The network training goal is to find w that minimizes an error function defined here as: (2.5) χ 2 term is the error on data: α parameter is the factor for the regularization term E w . In this work we apply the weight decay formula [49]: where W denotes the total number of weights in the network (including bias weights). In general, the output of the MLP with M hidden neurons and the linear output neuron can be written in the form: In this paper we consider the neural networks with (L = 1): one input unit µ 1 = Q 2 , and one bias unit µ 0 = 1 in the first layer. The bias of the output neuron in the above formula is considered as the hidden neuron with the constant output, f act = 1. Such representation closely corresponds to the Kolmogorov function superposition theorem [48]. Basing on this relation it was shown [50,51] that the MLP can approximate any continuous function of its inputs, to the extent that depends on the number of the hidden neurons. However, in the practical problem we are faced, the desired function is not known and only the limited number of experimental points is available instead. It leads to the mentioned earlier bias-variance problem. The output of the oversized network tends to approach closely to the training data points if weights are not constrained during the training. Usually this means that statistical fluctuations are captured. The weight regularizing term (Eq. 2.7) penalizes the large weight values and smooths out the network output, but on the other hand, applying the regularization with overestimated value of the factor α leads to the fit which does not reproduce significant features of the training data. The effect of applying regularization is illustrated in Figs. 3 and 4, where the relatively large network was trained with various values of the factor α. Similarly, the network with the low number of the hidden neurons may be not capable to represent the desired function. Sec. 3 presents the statistical approach to determine the network size appropriate to the given data set and to predict the optimal value of α.

Training of Network
It has been already mentioned that the training of the network is the process of establishing w which minimizes the error function (2.5). We denote the minimal error by S( w M P , D) (the notation will become clear latter). The first algorithm for the MLP weights optimization, the back-prop, was proposed by D. E. Rumelhart et al. in [52]. Currently there is a wide range of gradient descent and stochastic algorithms available for the network training. We use mainly the Levenberg-Marquardt algorithm [53,54], since it converges efficiently and does not require precise parameters tuning. However, we trained the networks also with quick-prop [55], and rprop [56] algorithms. The obtained results were very similar.
The algorithms we use, as all gradient based optimization patterns, may suffer from local minima. Therefore for given network type 1-M-1 we consider a large sample of networks with different (randomized) initial weights. We use a limited range of initial weight values according to the properties of the neuron activation function 1 .
After the training of the sample of networks of the same type the distribution of the total error value S( w M P , D) is obtained (see Figs. 5 and 6). Notice that the distribution sharply starts at particular S cut value. Such clear cut on the error value gives us an indication that the global minimum is well approximated. The number of networks in the sample required to determine the clear S cut value depends on the complexity of the data. The typical number we obtained were as follows: 150 (G En data), 250 (G M p data), 700 (G M n data), and 1300 (G Ep data).
The Bayesian framework allows to choose from the sample the best model. It is the solution characterized by the highest evidence (as it is described in Sec. 3). In practice, if the total error is too big then the evidence is too low and the given network can be discarded from further analysis. Hence to simplify the numerical procedure we take into consideration ten fits (neural networks) with the lowest total error values. They are also characterized by the low χ 2 value, namely χ 2 ( w M P , D)/(N − W ) < 1; N − W is the number of degrees of freedom. Among them the one with the maximal evidence is selected 1 High weight values make the sigmoid activation function very steep. Then the neuron input values have a very narrow range, where the neuron output is not saturated -this would efficiently block the training, where the output derivative is used extensively. Hence we restrict the initial weight range to |w initial | = f sat act /(Lµ), where f sat act is the value for which activation function saturates, L is the number of neuron inputs, µ is the mean neuron input value.
for further comparison with the network of other types. It was interesting to observe that the fit parametrizations given by the average over the fits selected by lowest error value were found to be very similar to those indicated by the highest evidence in each sample. This observation confirms that all solutions we select from the sample are localized in close neighborhood of the global minimum and are very similar to the one indicated by the highest evidence.

Bayesian Approach to Neural Networks
The Bayesian framework (BF) for the model comparison [44,45,46,31,57] is taken into consideration. We adapt this framework for χ 2 minimization purpose. The data is analyzed with the set of various neural networks types A M : 1-M-1. Given neural network of architecture A i corresponds to a particular statistical model (hypothesis) describing data. The BF allows to: • quantitatively classify the hypothesis; • choose objectively the best model (neural network) for representing a given data set; • establish objectively the weight decay parameter α (see Eq. 2.7); • compute the uncertainty for the neural network response (output), and uncertainties for other network parameters.
The approach in natural way embodies the so-called Occam's razor criterium which penalizes more complex models and prefers simpler solutions.

Bayesian Algorithm
At the beginning of the fitting procedure every neural network architecture A M is classified by the prior probability P(A M ). After the training of the network with the data D, the posterior probability is evaluated P (A M | D) i.e. a probability of the model A M given data D. It classifies quantitatively considered hypothesis. On the other hand applying the Bayes' theorem allows to express the posterior probability in the following way: where: is called evidence [44] (probability of the data D given A M ).
There is no reason to prefer some particular model before starting data analysis, hence: Then if one neglects the normalization factor P(D) the evidence (3.2) is the probability distribution which quantitatively classifies hypothesis.
The evidence is constructed in so called hierarchical approach. It is a three level procedure. Applying Bayes' theorem the probability distribution for the weights parameters is constructed, then the probability distribution of the decay parameter α, and eventually the evidence are evaluated.
Below the short description of the Bayesian approach is presented.

Constructing the weight parameter distribution
The probability distribution for the neural network weights is built, assuming that regularization parameter α is fixed: where P ( w| α, A M ) is a prior probability distribution of weights, while P (D| w, α, A M ) is the likelihood function. In the case of present analysis the likelihood function is given by the χ 2 function, namely: (3.8) The prior probability should be as general as possible. Indeed, there are plenty of possibilities (e.g. Laplacian or entropy-based priors see discussion in Ref. [58]). We assume that every weight parameter is equally distributed according to a Gaussian distribution (with the zero mean and the variance of 1/ √ α) (the arguments supporting above choice of the prior are presented in Sec. 3.2). It gives the probabilistic interpretation for the regularization function E w defined in the previous section (see Eq. 2.7). Then we see that: The last integral was computed by expanding the error function up to the Hessian term: where w M P is the vector of weights which minimizes S( w, D) (maximizes the posterior probability (3.7)). The Hessian matrix reads (3.14) We compute the full Hessian matrix [59]. Usually the double differential term in (3.14) is neglected, which is a good approximation only at the minimum. Taking into account full Hessian plays a crucial role in optimizing α parameter, as it will become clear below.
The network response uncertainty ∆y is defined by the variance: In the first approximation it is expressed by the covariance matrix, i.e. inverse of the Hessian matrix: (∆y(x)) 2 = (∇y(x, w M P )) T A −1 ∇y(x, w M P ). (3.16) In Appendix A the covariance matrices obtained for every considered problem are presented.

Constructing α the distribution of the parameter α
The α parameter is established by applying the so-called evidence approximation [44,45,60], the method, which is equivalent to type II maximum likelihood in conventional statistics. The Bayes' rule leads to: where the P (D| α, A M ) has been obtained in the previous section (see Eq. 3.10). We are searching for the α M P parameter, i.e. the one which maximizes the prior probability (3.17). It can be shown that in the Hessian approximation it is given by the solution of the equation: where λ i 's are eigenvalues of the matrix ∇ n ∇ m χ 2 w= w M P . In practice, the eigenvalues depend on α, therefore to get a proper α M P the α parameter is iteratively changed during the training process i.e.: The iteration procedure fixes in the optimal way the α parameter. The typical dependence of α k on the iteration step is presented in Fig. 8. In Sec. 3.2 it is shown that the choice of the initial α value has a small impact on the final results. At the end of the training procedure one can approximate (3.10) as follows: where in the Hessian approximation σ ln α ≈ 2/γ.

Constructing the evidence
The evidence for given model is defined by denominator of (3.17). If one assumes the uniform prior distribution of ln α parameter 2 on some large ln Ω region then the evidence can be approximated by: The ln Ω is a constant which is the same for the all hypotheses. The ln of evidence (we show only model independent terms) reads The first term in the above expression, −χ 2 ( w M P ), (usually of low-value for simple models) is the misfit of the approximated data, while the next four terms constitute the so called Occam factor, which penalizes the complex models. Since in this work we consider only the networks of type 1-M-1 (only one hidden layer) in the rest of the paper we will denote the evidence P (D| A M ) by P (D| M ).

Prior Function
We have already mentioned that the various possible prior distributions are considered in the literature [61]. In this analysis the likelihood function is given by χ 2 distribution, which has a Gaussian probabilistic interpretation. Therefore it seems to be reasonable to assume that the weight parameters distribution should also be described by the Gaussian-like prior function. Additionally we assume, without losing the generality, that: • negative, and positive values of the weight parameters are equally likely; • at the beginning of the learning procedure the weight parameters are independent; • small 3 weight values are more likely than the large values.
Then the Gaussian-like prior distribution can have a form: Notice that every w i parameter has its own α i regularization parameter. As it was mentioned in the previous section the α is the so-called scale parameter. The number of the scale parameters can be reduced if the symmetry property of the given network architecture is taken into account. The network of the type 1-M-1 has: M hidden weights; M corresponding bias weights and M + 1 linear weighs (output weights + one bias parameter).
The permutation between the hidden units does not change the network functional type. Permuting two hidden units is realized by exchange between the weight parameters of the same type (hidden, bias and linear weights). This symmetry property allows us to reduce the number of α's to three independent scale parameters: • α h for the hidden weights; • α b for bias weights (in hidden layer); • α l for linear weights in the output layer.
Then the prior function reads

(3.24)
We made an effort to compare results which are obtained with both (3.9) and (3.24) priors. It was observed that final results are very similar. Analogically as in the case of (3.9) prior the α h , α b and α l parameters were iteratively changed during the training procedure. The typical results, obtained for the G M n /µ n G D and G En data sets, are shown in Fig. 7. The differences between the final best fits are negligible. In the left column of the same figure we plot the dependence of the S( w, D) on the iteration step. We see that the minimal value of the total error is almost the same for both prior functions. For both cases the training started from the same initial weight configuration.
All above seem to justify the simplest choice of the prior function, namely the one given by Eq. 3.9. Nevertheless, it may happen that for more complex data then we discuss, the results will significantly depend on prior assumptions. In such case the Bayesian framework can be used to indicate the best prior function.
Eventually, we discuss the dependence of the final results on the initial α 0 value. We considered several initial values of α 0 (see Table 1). After training we noticed that the choice of the initial α 0 had a small impact on the final α M P value (see Fig 8) as well as the fits. It is shown in Table 1 where the relative distances, in the weight space, between fits are presented. Notice that the only one solution computed for α 0 = 1 is out of others.
It is worth to mention that decreasing the α 0 parameter can be understood as enlarging the effective prior domain. For the final analysis we set α 0 = 0.001.  2 between fits obtained for various initial α 0 values. The computations are done for the G En data for the network of 1-2-1 type.
In this section we have demonstrated that our results weakly depend on the prior assumptions. It has been also shown that it is relatively easy to construct the prior function if the symmetry properties of network are taken into consideration. Usually, it is not the case in the conventional form-factor data analysis, where the ad-hoc parametrizations are discussed. The typical phenomenological parametrization has no straightforward symmetries. As an example consider the function [25,62]: (3.25) Constructing the prior function for above form-factor parametrization seems to be more complicated than in the ANN case. One can postulate the values of the ratios a 0 /b 0 and a 2 /b 4 , which describe the low and high Q 2 behavior of the FF. However, the rest of parameters, which seem to model the intermediate Q 2 region, can have any arbitrary values. Therefore building the prior distribution for above FF would require an extra phenomenological and theoretical knowledge.

Data
We consider the electric and magnetic proton and neutron form-factor data. The electric and magnetic nucleon form-factors are defined as follows: where: G M p,n = µ p,n , G Ep = 1, G En = 0. The experimental data is usually normalized to the dipole form-factor G D = 1/(1 + Q 2 /0.71) 2 . The electric G Ep and magnetic G M p proton FF data have been obtained via Rosenbluth separation technique from elastic ep scattering [63]. Additionally since the beginning of nineties of last century the measurement of the form-factor ratio µ p G Ep /G M p in the spin dependent elastic ep scattering have been performed [64].
It turned out that systematic discrepancy error of artificial G M n /µ n G D error point (∆t) Q 2 =0  exists between so-called Rosenbluth and polarization transfer µ p G Ep /G M p ratio data. The difference can be explained when the two photon exchange effect (TPE) [65] is taken into account (for review see [66]). Hence, a proper fit of the EM form-factors requires to take into account the TPE correction [30]. In this work we consider the re-analyzed (TPE corrected Rosenbluth) G M p /µ p G D and G Ep /G D data (Tabs. 2 and 3 of Ref. [62]). However, to see the TPE effect we consider also the original, (called here old Rosenbluth data) G M p /µ p G D [63,67,68] and G Ep /G D [63,67,69] data sets 4 . The neutron form-factor data (G En and G M n ) are obtained from the electron scattering off light nuclei (deuteron [71], helium [72]). Since the complexity of nuclear target, getting nucleon formfactors is more demanding than in the case of the elastic ep scattering. The ground and final states of the nucleon must be properly described. In this analysis we consider the same G En and G M n /µ n G D data sets as in Ref. [30]. Let us mention that to obtain proper fits of the form-factors at Q 2 = 0 we added to every data set one artificial point, namely (Q 2 = 0, t = 1, ∆t = 0.001) for G M n /µ n G D , G M p /µ p G D and G Ep /G D data sets, and (Q 2 = 0, t = 0, ∆t = 0.001) for G En data set. This constraints have an effect on the final fit value and the uncertainty only in the close surrounding of the added point, as it is shown in Table 2, where we present how the best fit values and its uncertainties depend on the artificial point error. We present results for G M n data but for other considered data sets we got analogical conclusions. The ∆t value assigned to the additional point should be comparable to data uncertainties used in the network training. We have found that using ∆t = 0.01 and higher is not sufficient to attract the fit to desired value at constraint point, while ∆t = 0.0001 causes numerical difficulties during the training since the point has dominant contribution to the overall network error value.

Numerical Procedure
The numerical analysis was done with two independent neural network softwares (in order to cross-validate the results). One written by R.S. and P.P.
The procedure for finding the best neural network model for each data set consists of the five major steps: 4. the network (from the step above) with the highest evidence is chosen as the best fit candidate for given network type; 5. the best fits obtained for every network type are compared; the one with the highest evidence is chosen to represent the data.
Let us remind that in the second step the large number (from 150 to 1300) of networks in the sample (as it is explained in Sec. 2.2) is considered in order to find the solutions which maximizes the posterior probability for the given model. The procedure for the single network training is as follows (see Fig. 9): • initialize the network weights as small random values; • initialize the regularization factor (Eq. 2.7), in this analysis α 0 = 0.001; • perform the network training iterations, according to the Levenberg-Marquardt, quickprop, or rprop algorithms; • calculate the updated regularization factor α k+1 (Eq. 3.19) every 20 iterations of the training algorithm; eigenvalues of Hessian matrix below 10 −6 are rejected from the evaluation of γ(α k ) (Eq. 3.18); • calculate the network output (Eq. 2.3) and uncertainty (Eq. 3.16) values for the given range of Q 2 values; • calculate the ln of evidence (Eq. 3.22) Eventually, we will shortly highlight the major differences between the NNPDF approach and the one presented in this article.
In this work we consider the sequence of networks with graded number of hidden units. With the help of the Bayesian framework the best solution is chosen. The NNPDF group considers one particular network architecture (2-5-3-1 type) to fit the data [41]. But some discussion of the dependence of results on the network architecture is presented.
The NNPDF group prepares the sample of the networks. Each network from the sample is trained with the artificial data which is Monte Carlo generated from the original measurements. Then the best fit and its uncertainty are obtained as an average and standard deviations computed over the sample. In this work every network is always trained with the original data set. Nevertheless the large sample of networks of given type is prepared but in order to find the architecture and the weights which maximize the evidence. The network response uncertainty is computed from the covariance matrix (Eq. 3.16).
Both approaches deal with the over-fitting problem but in different ways. The NNPDF applies the early stopping in the training (cross-validation algorithm is imposed). Whereas we consider the regularization penalty term in the error function, which is optimized by the Bayesian procedure. Hence the approach we apply does not require validation of the solutions by comparing with the test data set.

Numerical Results
The numerical procedures described in the previous section were applied to all (six) the data sets. We consider networks with M = 1 − 5 hidden units for G M n , G En , and G Ep data and with M = 1 − 6 for the G M p data. The evidence quantitatively classifies the networks i.e. the most suitable network architecture for representing the data is indicated by the maximum of the evidence. Notice that the optimal way to deal with these results would be taking an average over all solutions weighted by the evidence. However, in all problems considered here we obtained clear signal (a peak at the evidence) for particular solution. It allowed us to neglect the contribution from networks of other size.
We start the presentation of the numerical results by the discussion of the G M n /µ n G D FF data. As it was described above, we consider a set of networks, which differ by number of hidden units M . In Fig. 10 we show the scatter plot presenting the dependence of given network size on error function and log of evidence. One can notice that the networks 1-2-1 and 1-3-1 have the highest evidences, but the networks with M = 2 hidden units are not able to reproduce as low total error value as 1-3-1 networks. It is interesting also to mention that for M ≥ 3 the total error slowly varies, i.e. increasing the number of the hidden units lowers the total error by the minor amount. The clear indication for 1-3-1 network type is seen in Fig. 11, where only dependence of ln P (D| M ) on M is shown. In this figure we plot the maximal evidences obtained for given network type. However, in order to control the stability of numerical procedure we plot also the ln of evidence averaged over the networks around global minimum (solutions selected in step 3, Sec. 4.2), as well as the ln of the minimal values of P(D |M ).
All together suggest the network of type 1-3-1 (with the highest evidence) for the best fit of the G M n data. The network output is drawn in Fig. 12 together with the experimental data. The neural network response uncertainty is computed with (3.16) expression and shown in Fig. 13. In Fig. 12 we plot also the best fits obtained for networks: 1-1-1, 1-2-1, 1-4-1 and 1-5-1. As could be expected increasing the number of hidden units makes the fit more flexible.
The electric neutron FF data (G En ) is analyzed in the same way as the magnetic neutron one. In Figs. 14, 15 and 16 the plots of evidence and G En form-factor are shown. For M = 2 we obtained the peak of the Occam's hill, what indicates 1-2-1 network architecture as the most representative parametrization.
The results for the electric and magnetic FF data are presented in Figs. 17, 18 and 21, 22 (scatter and evidence plots) and Figs. 19 and 20 (form-factor plots). The network of type 1-3-1 is preferred by the both electric and magnetic data sets. As it has been mentioned above we analyzed also the old form-factor data, which are not TPE corrected. It was obtained that the old G Ep prefers representation by the network of type 1-1-1. Hence, the old Rosenbluth G Ep /G D data fit is almost linear constant function in Q 2 . But the data seems to be not conclusive enough, so the Bayesian procedure leads to the simplest possible solution. On the other hand, it means that the old proton electric data does not show clear indication for deviation from the dipole form.

Summary
We have analyzed the form-factor data by the means of the artificial neural networks. The Bayesian approach has been adapted for the χ 2 minimization and then applied to the data analysis. For every form-factor data set sequence of neural networks have been considered. The Bayesian approach provided us with an objective criteria for choosing the most suitable form-factor parametrization (neural network) with the statistically optimal balance of the fit complexity and its uncertainty. Therefore the resulting fits are unbiased and model independent. It has been demonstrated also that the final results weakly depend on the prior assumptions.
The approach allowed to investigate objectively the non-dipole deviations of the formfactors. It is interesting to mention that the G Ep /G D , G M p /µ p G D as well as G M n /µ n G D form-factor data prefer the same type (size) network 1-3-1. The form-factor parametrizations, obtained in this analysis can be easily applied to any phenomenological and experimental analysis. Additionally, a part of the our software used in the analysis is available at [47,73].
Presented method seems to be a promising statistical framework for studying and representing the experimental data. Especially, if the theoretical predictions are not able to reproduce measurements with desired accuracy, but the experimental data is sufficiently comprehensive to describe physical quantity by itself.
where g = 1 for proton electric form-factor and g = µ p,n for the proton, neutron magnetic form-factors. The activation function reads The weights obtained for G En : with the covariance matrix: The weights obtained for G M n /µ n G D : with the covariant matrix: The weights obtained for G Ep /G D : with the covariance matrix: The weights obtained for G M p /µ p G D : with the covariant matrix:                Figure 12: Fits of the G Mn /µ n G D data parametrized with networks of 1-1-1 (green line), 1-2-1 (violet line), 1-3-1 (blue line), 1-4-1 (cyan line) and 1-5-1 (magenta line) types. The best fit (shown with 1σ uncertainty), which was indicated by the maximal evidence, is given by 1-3-1 network. The blue area denotes fit uncertainty computed with (3.16). The experimental data is the same as the one discussed in Ref. [30].   The best fit of G En data given by the 1-2-1 network. The blue area denotes fit uncertainty computed with Eq. 3.16. The experimental data is the same as the one discussed in Ref. [30].   The best fit of G Ep /G D data. The fit to TPE corrected data is given by 1-3-1 network (blue line), the data (red points) is taken from [62]. The fit to "old Rosenbluth data" (green points) is given by 1-1-1 network (violet line), the data is taken from [63,67,69]. The fit uncertainty is computed with Eq. 3.16. Figure 20: The best fit of G Mp /µ p G D data given by the 1-3-1 network. The fit to TPE corrected data is given by 1-3-1 network (violet line), the data (red points) is taken from [62]. The fit to "old Rosenbluth data" (green points) is given by 1-1-1 network (violet line), the data is taken from [63,67,68]. The fit uncertainty is computed with Eq. 3.16.