The NNPDF methodology
The NNPDF collaboration implements by default the Monte Carlo approach to PDF fits. The goal of this strategy is to provide an unbiased determination of PDFs with a reliable uncertainty estimate. The NNPDF methodology is based on a Monte Carlo treatment of the experimental data, the parametrization of PDFs with artificial neural networks, and a minimization strategy based on genetic algorithms.
In the next paragraphs we outline the most relevant aspects of the NNPDF3.1 methodology. An exhaustive overview is beyond the scope of this paper; we refer the reader to Refs. [4, 8] for further details.
The Monte Carlo approach to experimental data consists of generating artificial data replicas based on the experimental covariance matrix of each experiment. This procedure propagates the experimental uncertainties into the PDF model by performing a PDF fit for each data replica. PDF sets generated with this approach are usually composed of 100–1000 replicas.
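As an illustration, a minimal sketch of such a replica generation under a plain multivariate Gaussian assumption could look as follows; the actual NNPDF prescription additionally treats multiplicative and normalization uncertainties separately, and all names and numbers below are illustrative:

```python
import numpy as np

def generate_replicas(central_values, covariance, n_replicas=100, seed=0):
    """Sample artificial data replicas from a multivariate Gaussian
    centred on the measured values with the experimental covariance."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(central_values, covariance, size=n_replicas)

# Toy example: 3 data points with correlated and uncorrelated uncertainties
data = np.array([1.0, 2.0, 3.0])
cov = 0.05**2 * np.outer(data, data) + np.diag((0.02 * data)**2)
replicas = generate_replicas(data, cov, n_replicas=100)  # shape (100, 3)
```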
The experimental data used in the PDF fit are preprocessed according to a cross-validation strategy based on randomly splitting the data of each replica into a training set and a validation set. The optimization is then performed on the training set, while the validation loss is monitored and used as a stopping criterion to reduce overlearning.
In the NNPDF fits, PDFs are parametrized at a reference scale \(Q_0\) and expressed in terms of a set of neural networks, one per basis function. Each of these neural networks is a fixed-size feed-forward multi-layer perceptron with architecture 1-2-5-3-1. The first layer splits the input node (x) into the pair \((x, \log (x))\). The two hidden layers (of 5 and 3 nodes) use the sigmoid activation function while the output node is linear:
$$\begin{aligned} f_i(x,Q_0) = A_i x^{-\alpha _i} (1-x)^{\beta _i} \mathrm{NN}_i(x), \end{aligned}$$
(1)
where \(\mathrm{NN}_{i}\) is the neural network corresponding to a given flavour i, usually expressed in terms of the PDF evolution basis \(\{g,\,\Sigma ,\,V,\,V_3,\,V_8,\,T_3,\,T_8,\,c^+\}\). \(A_i\) is an overall normalization constant which enforces sum rules and \(x^{-\alpha _i} (1-x)^{\beta _i}\) is a preprocessing factor which controls the PDF behaviour at small and large x. In order to guarantee unbiased results, in the current NNPDF methodology both the \(\alpha _i\) and \(\beta _i\) parameters are randomly selected within a defined range for each replica at the beginning of the fit and kept constant thereafter.
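For illustration, a minimal sketch of how Eq. (1) combines the normalization, the preprocessing factor and the network output is given below; the toy_nn placeholder stands in for the actual 1-2-5-3-1 perceptron and all parameter values are illustrative:

```python
import numpy as np

def pdf_flavour(x, nn_i, A_i, alpha_i, beta_i):
    """Evaluate Eq. (1): normalization * preprocessing factor * NN_i(x)."""
    return A_i * x**(-alpha_i) * (1.0 - x)**beta_i * nn_i(x)

# Toy stand-in for NN_i(x); the real network is a 1-2-5-3-1 perceptron
toy_nn = lambda x: 1.0 / (1.0 + np.exp(-x))

x = np.logspace(-5, -0.01, 50)   # grid of x values in (0, 1)
f_i = pdf_flavour(x, toy_nn, A_i=1.0, alpha_i=1.1, beta_i=3.0)
```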
Unlike in usual regression problems, where during the optimization the model is compared directly to the training data, in PDF fits the theoretical prediction for each data point is constructed through a convolution between the PDF model and a FastKernel (FK) table [9, 10], which encodes the theoretical computation for the process type of that data point. For DIS-like processes a single PDF enters the convolution, while for hadronic processes two PDFs are convoluted.
The optimization procedure consists of minimizing the loss function
$$\begin{aligned} \chi ^2 = \sum _{i,j}^{N_{\mathrm{dat}}} (D-P)_i \sigma _{ij}^{-1} (D-P)_j, \end{aligned}$$
(2)
where \(D_i\) is the i-th artificial data point from the training set, \(P_i\) is the convolution product between the FastKernel tables for point i and the PDF model, and \(\sigma _{ij}\) is the covariance matrix between data points i and j following the \(t_0\) prescription defined in the appendix of Ref. [11]. This covariance matrix can include both experimental and theoretical components, as presented in Ref. [12].
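A straightforward sketch of Eq. (2) for a single replica, assuming the \(t_0\) covariance matrix is already available as a dense array, could be:

```python
import numpy as np

def chi2(data, predictions, covmat):
    """Compute Eq. (2): (D - P)^T C^{-1} (D - P).

    In the actual fit covmat would be the t0 covariance matrix;
    here any positive-definite matrix works.
    """
    diff = data - predictions
    # Solve C x = diff instead of inverting C explicitly (numerically safer)
    return float(diff @ np.linalg.solve(covmat, diff))
```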
Concerning the optimization procedure, so far only genetic algorithms (GA) have been tuned and used. In summary, the weights of the neural network for each PDF flavour are initialized from a random Gaussian distribution and checked to satisfy the sum rules. From this first network, 80 mutant copies are created at every iteration, with the weights updated according to a mutation probability and size. The training is fixed to 30k iterations, and stopping is determined by a simple look-back algorithm which stores the weights corresponding to the lowest validation loss.
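A heavily simplified, schematic sketch of such a GA loop is given below; model, mutate, train_loss and val_loss are placeholder objects for illustration only and are not taken from the NNPDF code:

```python
import copy
import numpy as np

def genetic_fit(model, mutate, train_loss, val_loss,
                n_mutants=80, n_iterations=30000):
    """Schematic GA loop: keep the best of n_mutants copies per iteration
    and remember the weights with the lowest validation loss (look-back)."""
    best_val, best_weights = np.inf, copy.deepcopy(model.weights)
    for _ in range(n_iterations):
        mutants = [mutate(copy.deepcopy(model)) for _ in range(n_mutants)]
        model = min(mutants, key=train_loss)          # best mutant survives
        current_val = val_loss(model)
        if current_val < best_val:                    # look-back storage
            best_val, best_weights = current_val, copy.deepcopy(model.weights)
    model.weights = best_weights
    return model
```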
A new methodological approach
The methodology presented above is currently implemented in a C++ code, introduced for the first time in the official NNPDF3.0 release [8], which relies on a very small set of external libraries. This feature can become a shortcoming, as the monolithic structure of the codebase greatly complicates the study of novel architectures and the introduction of modern machine learning techniques developed over the last few years.
Our goal in this work is to construct a new framework that allows the methodology to be enhanced. To achieve this, we rebuild the code following an object-oriented approach which allows us to modify and study each piece of the methodology separately.
We implement the NNPDF regression model from scratch in a Python-based framework in which every piece aims to be completely independent. We choose Keras [13] and TensorFlow [14] to provide the neural-network capabilities of the framework, as they are among the most widely used and best documented libraries, and have also been used in the context of PDFs [15, 16]. In addition, the code design abstracts away any dependence on these libraries in order to be able to easily incorporate other machine learning technologies in the future. This new framework, by making every piece subject to change, opens the door to a plethora of new studies which were out of reach before.
For all fits shown in this paper we use gradient descent (GD) methods in place of the previously used genetic algorithm. This change greatly reduces the computing cost of a fit while maintaining a very similar (and on occasion improved) \(\chi ^{2}\)-goodness. The less stochastic nature of GD methods also produces more stable fits than their GA counterparts. The main reason GD methods had not been tested before is the difficulty of computing the gradient of the loss function (mainly due to the convolution with the FastKernel tables) in an efficient way. This is one example of how the usage of new technologies, in this case differentiable programming and distributed computing, facilitates new studies.
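The following sketch illustrates the point with TensorFlow: once a DIS-like convolution is written as a tensor operation, its gradient with respect to the fittable parameters is obtained automatically. The shapes, random stand-in tensors and loss are illustrative and not taken from the actual code:

```python
import tensorflow as tf

# Illustrative dimensions: nflav flavours, nx points in x, ndata data points
nflav, nx, ndata = 8, 30, 10
fk_dis = tf.random.normal((ndata, nflav, nx))        # stand-in DIS FK table
pdf_values = tf.Variable(tf.random.normal((nflav, nx)))  # stand-in PDF on the grid

with tf.GradientTape() as tape:
    observables = tf.einsum('nia,ia->n', fk_dis, pdf_values)  # DIS convolution
    loss = tf.reduce_sum(observables**2)                      # stand-in loss

# The gradient of the convolution is provided by automatic differentiation
grads = tape.gradient(loss, pdf_values)
```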
We also use one single densely connected network, as opposed to a separate network for each flavour. As previously done, we fix the first layer to split the input x into the pair \((x, \log (x))\). We also fix 8 output nodes (one per flavour) with linear activation functions. By connecting all PDFs in a single network we can directly study cross-correlations between the different PDFs which were not captured by the previous methodology.
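A minimal Keras sketch of such a single network is shown below; the hidden-layer sizes and activations are illustrative, as the actual values are selected by the hyperparameter scan of Sect. 3:

```python
import tensorflow as tf
from tensorflow.keras.layers import Input, Lambda, Dense
from tensorflow.keras.models import Model

x_input = Input(shape=(1,), name="x")
# First (non-trainable) layer: split x into the pair (x, log x)
x_pair = Lambda(lambda x: tf.concat([x, tf.math.log(x)], axis=-1))(x_input)
hidden = Dense(25, activation="tanh")(x_pair)   # illustrative hidden sizes
hidden = Dense(20, activation="tanh")(hidden)
# 8 output nodes, one per flavour of the fit basis, with linear activation
pdf_out = Dense(8, activation="linear", name="pdf")(hidden)
single_net = Model(inputs=x_input, outputs=pdf_out)
```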
As we change both the optimizer and the architecture of the network, it is not immediately obvious which would be the best choice of parameters for the NN (collectively known as hyperparameters). Thus, we integrate in this framework the hyperopt library [17], which allows us to systematically scan over many different combinations of hyperparameters and find the optimal configuration for the neural network. We detail the hyperparameter scan in Sect. 3.
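As an illustration, a reduced hyperopt scan might be set up as below; the search space is a toy example and run_fit is a placeholder for a function performing a full fit with the given hyperparameters and returning a figure of merit:

```python
from hyperopt import fmin, tpe, hp, STATUS_OK

search_space = {
    "nodes_per_layer": hp.choice("nodes_per_layer", [[15, 10], [25, 20], [40, 30]]),
    "activation": hp.choice("activation", ["sigmoid", "tanh"]),
    "optimizer": hp.choice("optimizer", ["Adadelta", "Adam", "RMSprop"]),
    "learning_rate": hp.loguniform("learning_rate", -7, 0),
}

def objective(hyperparams):
    # run_fit is a placeholder: it trains the model with these hyperparameters
    # and returns a validation chi2 used as figure of merit
    chi2_val = run_fit(**hyperparams)
    return {"loss": chi2_val, "status": STATUS_OK}

best = fmin(objective, search_space, algo=tpe.suggest, max_evals=100)
```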
A new fitting framework: n3fit
In Fig. 1 we show a schematic view of the full new methodology which we will refer to from now on as n3fit. The \(\text {xgrid}_{1}\cdots \text {xgrid}_{n}\) are vectors containing the x-inputs of the neural network for each of the experiments entering the fit. These values of x are used to compute both the value of the NN and the preprocessing factor, thus computing the unnormalized PDF. The normalization values \(A_{i}\) are then computed at every step of the fitting (using the \(\text {xgrid}_{int}\) vector as input), updating the “norm” layer and producing the corresponding normalized PDF of Eq. (1).
Before obtaining a physical PDF we apply a basis rotation from the fit basis, \(\{g,\,\Sigma ,\,V,\,V_3,\,V_8,\,T_3,\,T_8,\,c^+\}\), to the physical one, namely, \(\{\bar{s}, \bar{u}, \bar{d}, g, d, u, s, c(\bar{c})\}\). After this procedure we have everything necessary to compute the value of the PDF for any flavour at the reference scale \(Q_{0}\).
All fittable parameters live in the two red blocks: the first, named NN, is by default a neural network composed of densely connected layers, corresponding to the NN of Eq. (1); the second contains the preprocessing exponents \(\alpha \) and \(\beta \), which are free to vary during the fit (in NNPDF3.1 \(\alpha _{i}\) and \(\beta _{i}\) are fixed during the fit of each replica). In what follows, when we refer to the neural network parameters we refer collectively to the parameters of these two blocks.
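As an example of how such a normalization can be recomputed at every training step, the sketch below evaluates the gluon normalization from the momentum sum rule, \(\int _0^1 x\,(g(x)+\Sigma (x))\,dx = 1\), using a simple trapezoidal rule on the integration grid; this is an illustration under the assumption that the network outputs \(f(x)\) rather than \(xf(x)\), and is not the actual implementation:

```python
import numpy as np

def gluon_normalization(xgrid_int, gluon_unnorm, sigma_unnorm):
    """Momentum sum rule, int_0^1 x (g(x) + Sigma(x)) dx = 1:
    return the factor A_g applied to the unnormalized gluon.
    Integrals are approximated with the trapezoidal rule on xgrid_int."""
    sigma_mom = np.trapz(xgrid_int * sigma_unnorm, xgrid_int)
    gluon_mom = np.trapz(xgrid_int * gluon_unnorm, xgrid_int)
    return (1.0 - sigma_mom) / gluon_mom
```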
As in this new methodology each block is completely independent, we can swap any of them at any point, allowing us to study how the different choices affect the quality of the fit. All the hyperparameters of the framework are also abstracted and exposed (crucial for the study shown in Sect. 3). This also allows us to study many different architectures unexplored until now in the field of PDF determination.
The PDFs, as seen in Sect. 2.1, cannot be compared directly to data; it is therefore necessary to bring the prediction of the network (the pdf\(_{i}\) of Fig. 1) to a state in which it can be compared to experimental data. For that we compute the convolution of the PDFs with the FastKernel tables discussed in Sect. 2.1, which produces a set of observables \(\mathcal {O}_{1}\cdots \mathcal {O}_{n}\) with which we can compute the loss function of Eq. (2).
The first step of the convolution is to generate a rank-4 luminosity tensor (for DIS-like scenarios this tensor is equivalent to the PDF):
$$\begin{aligned} \mathcal {L}_{i\alpha j \beta } = f_{i\alpha }f_{j\beta }, \end{aligned}$$
(3)
where the Latin letters refer to flavour indices while the Greek characters refer to the indices on the respective grids in x. The observable is then computed by contracting the luminosity tensor with the rank-5 FastKernel table of each dataset:
$$\begin{aligned} \mathcal {O}^{n} = \mathrm{FK}^{n}_{i\alpha j \beta } \mathcal {L}_{i\alpha j \beta }, \end{aligned}$$
(4)
where n corresponds to the index of the experimental data point within the dataset. The computation of the observables is the most computationally expensive piece of the fit, and the optimization and enhancement of this operation will be the subject of future studies.
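Written with tf.einsum, Eqs. (3) and (4) correspond to the two contractions sketched below; the tensors are random stand-ins with illustrative dimensions:

```python
import tensorflow as tf

# Illustrative dimensions: nflav flavours, nx points on each x grid,
# ndata experimental points in the dataset
nflav, nx, ndata = 8, 30, 10
pdf_1 = tf.random.normal((nflav, nx))                      # f_{i alpha}
pdf_2 = tf.random.normal((nflav, nx))                      # f_{j beta}
fktable = tf.random.normal((ndata, nflav, nx, nflav, nx))  # FK^n_{i a j b}

# Eq. (3): rank-4 luminosity tensor L_{i alpha j beta}
lumi = tf.einsum('ia,jb->iajb', pdf_1, pdf_2)
# Eq. (4): contract with the rank-5 FK table to obtain the observables O^n
observables = tf.einsum('niajb,iajb->n', fktable, lumi)
```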
Before updating the parameters of the network we split the output into a training and a validation set (selected randomly for each replica) and monitor the value of \(\chi ^{2}\) for each of these sets. The training set is used for updating the parameters of the network while the validation set is only used for early stopping. The stopping algorithm is presented in Fig. 2. We train the network until the validation loss stops improving. From that point onwards, and to avoid false positives, we enable a patience algorithm which waits for a number of iterations before actually considering the fit finished.
The last block to review is the positivity constraint in Fig. 2. We only accept stopping points for which the PDF is known to produce positive predictions for a special set of pseudo-data which tests the predictions for multiple processes in different kinematic ranges \((x, Q^2)\). This mechanism follows closely the one used in previous versions of NNPDF [4, 8].
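A schematic sketch of this stopping logic, combining the patience counter with the positivity check, is given below; all names are illustrative and the actual implementation in the code may differ:

```python
def should_stop(epoch, val_chi2, positivity_ok, state, patience=100):
    """Only accept a stopping point when the positivity pseudo-data yield
    positive predictions, and stop once the validation chi2 has not
    improved for `patience` epochs."""
    if positivity_ok and val_chi2 < state["best_chi2"]:
        state["best_chi2"] = val_chi2
        state["best_epoch"] = epoch      # the best weights are stored here
    return (epoch - state["best_epoch"]) > patience

# Initial state for one replica
state = {"best_chi2": float("inf"), "best_epoch": 0}
```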
The loss function defined in Eq. (2) is minimized in order to obtain the best set of parameters for the NN. We restrict ourselves to the family of gradient descent algorithms with adaptive moment estimation, in which the learning rate of the weights is dynamically adjusted. In particular we focus on Adadelta [18], Adam [19] and RMSprop [20]. These three optimizers follow a similar gradient descent strategy but differ in the prescription for the weight update.
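For illustration, selecting one of these optimizers and attaching it to a Keras model could be done as below; the learning rates shown are illustrative and not the values used in the fits, which are selected by the hyperparameter scan:

```python
from tensorflow.keras.optimizers import Adadelta, Adam, RMSprop

# Illustrative settings; the actual values come from the hyperparameter scan
optimizers = {
    "Adadelta": Adadelta(learning_rate=1.0),
    "Adam": Adam(learning_rate=1e-3),
    "RMSprop": RMSprop(learning_rate=1e-3),
}

def compile_model(model, optimizer_name, loss_fn):
    """Attach the chosen adaptive gradient descent optimizer to a Keras model."""
    model.compile(optimizer=optimizers[optimizer_name], loss=loss_fn)
    return model
```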
Environment setup: data and theory
Benchmarking and validation of the new approach is done using as baseline the setup of NNPDF3.1 NNLO [4]. This means we use the same datasets and cuts, together with the same fraction of validation data for cross-validation, although the stopping criterion is different (Fig. 2). This setup is named “global”, as it includes all datasets used in NNPDF3.1 NNLO, with 4285 data points.
In order to facilitate the process of benchmarking and validation, we also define a reduced dataset containing only DIS-type data, with 3092 data points, namely all datasets from the “global” setup that are not hadronic. We call this setup “DIS”. This reduced setup has a main advantage: in a DIS-like process only one PDF is involved, which simplifies the fit enormously, making it much faster and lighter. These light fits, together with the new methodology, allow us to explore a parameter space previously inaccessible.
Performance benchmark
Table 1 Comparison of the average computing resources consumed by the old and new methodologies for the DIS and global setups. We find n3fit to be \(\sim 20\) times faster on average. The only drawback is the larger memory consumption in the global fit. Each fit comprises 100–200 replicas. Good replicas are those which pass all post-fit criteria defined in [8]
In order to obtain a good quality and reliable PDF model it is necessary to perform the fit for many artificial data replicas. These are complex computations which require a great deal of CPU hours and memory; therefore one of the goals of any new study is to find a more efficient way of performing PDF fits. As previously stated, GD methods improve the stability of the fits, producing fewer “bad replicas” (which need to be discarded) than their GA counterparts, and this translates into a much smaller computing time. In Table 1 we find a factor of 20 improvement with respect to the old methodology and close to a factor of 1.5 in the percentage of accepted replicas for the global fit setup.
In the old code the memory usage is driven by APFEL [21] and does not depend on the set of experiments being used. Instead, the memory consumption of the new code is driven by the TensorFlow optimization strategy, which in the case of hadronic data requires the implementation of Eq. (4) and its gradient. This difference translates into an important increase in the memory usage of n3fit which only manifests itself in the global fit.
We are currently working on ways to reduce the memory consumption without introducing a penalty on the execution speed of the code, as currently we favour speed over memory.