The DNNLikelihood: enhancing likelihood distribution with Deep Learning

We introduce the DNNLikelihood, a novel framework to easily encode, through Deep Neural Networks (DNN), the full experimental information contained in complicated likelihood functions (LFs). We show how to efficiently parametrise the LF, treated as a multivariate function of parameters and nuisance parameters with high dimensionality, as an interpolating function in the form of a DNN predictor. We do not use any Gaussian approximation or dimensionality reduction, such as marginalisation or profiling over nuisance parameters, so that the full experimental information is retained. The procedure applies to both binned and unbinned LFs, and allows for an efficient distribution to multiple software platforms, e.g. through the framework-independent ONNX model format. The distributed DNNLikelihood can be used for different use cases, such as re-sampling through Markov Chain Monte Carlo techniques, possibly with custom priors, combination with other LFs, when the correlations among parameters are known, and re-interpretation within different statistical approaches, i.e. Bayesian vs frequentist. We discuss the accuracy of our proposal and its relations with other approximation techniques and likelihood distribution frameworks. As an example, we apply our procedure to a pseudo-experiment corresponding to a realistic LHC search for new physics already considered in the literature.


Introduction
The Likelihood Function (LF) is the fundamental ingredient of any statistical inference. It encodes the full information on experimental measurements and allows for their interpretation both from a frequentist (e.g. Maximum Likelihood Estimation (MLE)) and a Bayesian (e.g. Maximum a Posteriori (MAP)) perspectives. 1 On top of providing a description of the combined conditional probability distribution of data given a model (or vice versa, when the prior is known, of a model given data), and therefore of the relevant statistical uncertainties, the LF may also encode, through the so-called nuisance parameters, the full knowledge of systematic uncertainties and additional constraints (for instance coming from the measurement of fundamental input parameters by other experiments) affecting a given measurement or observation, as for instance discussed in ref. [3].
Current experimental and phenomenological results in fundamental physics and astrophysics typically involve complicated fits with several parameters of interest and hundreds of nuisance parameters. Unfortunately, it is generically considered a hard task to provide all the information encoded in the LF in a practical and reusable way. Therefore, experimental analyses usually deliver only a small fraction of the full information contained in the LF, typically in the form of confidence intervals obtained by profiling the LF on the nuisance parameters (frequentist approach), or in terms of probability intervals obtained by marginalising over nuisance parameters (Bayesian approach), depending on the statistical method used in the analysis. This way of presenting results is very practical, since it can be encoded graphically into simple plots and/or simple tables of expectation values and correlation matrices among observables, effectively making use of the Gaussian approximation. However, such 'partial' information can hardly be used to reinterpret a result within a different physics scenario, to combine it with other results, or to project its sensitivity to the future. These tasks, especially outside experimental collaborations, are usually done in a naïve fashion, trying to reconstruct an approximate likelihood for the quantities of interest, employing a Gaussian approximation, assuming full correlation/uncorrelation among parameters, and with little or no control on the effect of systematic uncertainties. Such control on systematic uncertainties could be particularly useful to project the sensitivity of current analyses to future experiments, an exercise particularly relevant in the context of future collider studies [4][5][6]. One could for instance ask how a certain experimental result would change if a given systematic uncertainty (theoretical or experimental) is reduced by some amount. This kind of question is usually not addressable using only public results for the aforementioned reasons. This and other limitations could of course be overcome if the full LF were available as a function of 1 To be precise, the LF does not contain the full information in the frequentist approach, since the latter does not satisfy the Likelihood Principle (for a detailed comparison of frequentist and Bayesian inference see, for instance, refs. [1,2]). In particular, frequentists should specify assumptions about the experimental setup and about experiments that are not actually performed which are not relevant in the Bayesian approach. Since these assumptions are however usually well spelled in fundamental physics and astrophysics (at least when classical inference is carefully applied), we ignore this issue and assume that the LF encodes the full experimental information.
the observables and of the elementary nuisance parameters, allowing for: notably some Higgs study in the four-leptons golden decay mode [17] and the majority of the analyses carried out at B-physics experiments. Second, the use of a DNN does not impose on the user any specific software choice. Neural Networks are extremely portable across multiple software environments (e.g. C++, Python, Matlab, R, or Mathematica) through the ONNX format [18]. This aspect could be important whenever different experiments make different choices in terms of how to distribute the likelihood.
In this respect, we believe that the use of the DNNLikehood could be relevant even in a future scenario in which every major experiment has followed the remarkable example set by ATLAS. For instance, it could be useful to overcome some technical difficulty with specific classes of analyses. Or, it could be seen as a further step to take in order to import a distributed likelihood into a different software environment. In addition, the DNNLikehood could be used in other contexts, e.g. to distribute the outcome of phenomenological analyses involving multi-dimensional fits such as the Unitarity Triangle Analysis [19][20][21][22][23], the fit of electroweak precision data and Higgs signal strengths [24][25][26][27][28], etc.
There are two main challenges associated to our proposed strategy: on one hand, in order to design a supervised learning technique, an accurate sampling of the LF is needed for the training of the DNNLikelihood. On the other hand a (complicated) interpolation problem should be solved with an accuracy that ensures a real preservation of all the required information on the original probability distribution.
The first problem, i.e. the LF sampling, is generally easy to solve when the LF is a relatively simple function which can be fastly evaluated in each point in the parameter space. In this case, Markov Chain Monte Carlo (MCMC) techniques [29] are usually sufficient to get dense enough samplings of the function, even for very high dimensionality, in a reasonable time. However the problem may quickly become intractable with these techniques when the LF is more complicated, and takes much longer time to be evaluated. This is typically the case when the sampling workflow requires the simulation of a data sample, including computationally costly corrections (e.g. radiative corrections) and/or a simulation of the detector response, e.g. through Geant4 [30]. In these cases, evaluating the LF for a point of the parameter space may require O(minutes) to go through the full chain of generation, simulation, reconstruction, event selection, and likelihood evaluation, making the LF sampling with standard MCMC techniques impractical. To overcome this difficulty, several ideas have recently been proposed, inspired by Bayesian optimisation and Gaussian processes, known as Active Learning (see, for instance, refs. [31,32] and references therein). These techniques, though less robust than MCMC ones, allow for a very "query efficient" sampling, i.e. a sampling that requires the smallest possible number of evaluations of the full LF. Active Learning applies machine learning techniques to design the proposal function of the sampling points and can be shown to be much more query efficient than standard MCMC techniques. Another possibility would be to employ deep learning and MCMC techniques together in a way similar to Active Learning, but inheriting some of the nice properties of MCMC. We defer a discussion of this new idea to a forthcoming publication [33], while in this work we focus on the second of the aforementioned tasks: we assume that an accurate sampling of the LF is available and design a technique to encode it and distribute it through DNNs.
A Jupyter notebook and the Python source files which allow to reproduce all results presented in this paper are available on GitHub . A dedicated Python package allowing to sample LFs and to build, optimize, train, and store the corresponding DNNLikelihoods is in preparation. This will allow not only allow to construct and distribute DNNLikelihoods, but also to use them for inference both in the Bayesian and frequentist frameworks. This paper is organized as follows. In Section 2 we discuss the issue of interpolating the LF from a Bayesian and frequentist perspective and set up the procedure for producing suitable training datasets. In Section 3 we describe a benchmark example, consisting in the realistic LHC-like New Physics (NP) search proposed in ref. [9]. In Section 4 we show our proposal at work for this benchmark example, whose LF depends on one physical parameter, the signal strength µ, and 94 nuisance parameters. Finally, in Section 5 we conclude and discuss some interesting ideas for future studies.

Interpolation of the Likelihood Function
The problem of fitting high-dimensional multivariate functions is a classical interpolation problem, and it is nowadays widely known that DNNs provide the best solution to it. Nevertheless, the choice of the loss function to minimise and of the metrics to quantify the performance, i.e. of the "distance" between the fitted and the true function, depends crucially on the nature of the function and its properties. The LF is a special function, since it represents a probability distribution. As such, it corresponds to the integration measure over the probability of a given set of random variables. The interesting regions of the LF are twofold. From a frequentist perspective the knowledge of the local maxima, and of the global maximum, is needed. This requires a good knowledge of the LF in regions of the parameter space with high probability (large likelihood), and, especially for high dimensionality, very low probability mass (very small prior volume). These regions are therefore very hard to populate via sampling techniques [34] and give tiny contributions to the LF integral, the latter being increasingly dominated by the "tails" of the multidimensional distribution as the number of dimensions grows. From a Bayesian perspective the expectation values of observables or parameters, which can be computed through integrals over the probability measure, are instead of interest. In this case one needs to accurately know regions of very small probabilities, which however correspond to large prior volumes, and could give large contributions to the integrals.
Let us argue what is a good "distance" to minimise to achieve both of the aforementioned goals, i.e. to know the function equally well in the tails and close to the local maxima. Starting from the view of the LF as a probability measure (Bayesian perspective), the quantity that one is interested in minimising is the difference between the expectation values of observables computed using the true probability distribution and the fitted one.
For instance, in a Bayesian analysis one may be interested in using the probability density P = L × Π, where L denotes the likelihood and Π the prior, to estimate expectation values as where the probability measure is dP(x) = P(x)dx, and we collectively denoted by the n-dimensional vector x the parameters on which f and P depend, treating on the same footing the nuisance parameters and the parameters of interest. Let us assume now that the solution to our interpolation problem provides a predicted pdf P P (x), leading to an estimated expectation value This can be rewritten, by defining the ratio r(x) ≡ P P (x)/P(x) = L P (x)/L(x), as so that the absolute error in the evaluation of the expectation value is given by For a finite sample of points x i , with i = 1, . . . , N , the integrals are replaced by sums and eq. (4) becomes Here, the probability density function P(x) has been replaced with F(x i ), the frequencies with which each of the x i occurs, normalised such that x i | U (x) F(x i ) = N , the notation x i | U (x) indicates that the x i are drawn from a uniform distribution, and the 1/N factor ensures the proper normalisation of probabilities. This sum is very inefficient to calculate when the probability distribution P(x) varies rapidly in the parameter space, i.e. deviates strongly from a uniform distribution, since most of the x i points drawn from the uniform distribution will correspond to very small probabilities, giving negligible contributions to the sum. An example is given by multivariate normal distributions, where, increasing the dimensionality, tails become more and more relevant (see Appendix A). A more efficient way of computing the sum is given by directly sampling the x i points from the probability distribution P(x), so that eq. (5) can be rewritten as This expression clarifies the aforementioned importance of being able to sample points from the probability distribution P to efficiently discretize the integrals and compute expectation values. The minimum of this function for any f (x) is in r(x i ) = 1, which, in turn, implies L(x i ) = L P (x i ). This suggests that an estimate of the performance of the interpolated likelihood is given by the Mean Percentage Error (MPE) Technically, formulating the interpolation problem on the LF itself introduces the difficulty of having to fit the function over several orders of magnitude, which leads to numerical instabilities. For this reason it is much more convenient to formulate the problem using the natural logarithm of the LF, the so-called log-likelihood log L. Let us see how the error on the log-likelihood propagates to the actual likelihood. Consider the Mean Error (ME) on the log-likelihood The last logarithm can be expanded for r(x i ) ∼ 1 to give It is interesting to notice that ME log L defined in eq. (8) corresponds to the Kullback-Leibler divergence [35], or relative entropy, between P and P P : While eq. (10) confirms that small values of D KL = ME log L ∼ MPE L correspond to a good performance of the interpolation, D KL , as well as ME and MPE, do not satisfy the triangular inequality and therefore cannot be directly optimised for the purpose of training and evaluating a DNN. Eq. (8) suggests however that the Mean Absolute Error (MAE) or the Mean Square Error (MSE) on log L should be suitable losses for the DNN training: we explicitly checked that this is indeed the case, with MSE performing slightly better for well-known reasons.
Finally, in the frequentist approach, the LF can be treated just as any function in a regression (or interpolation) problem, and, as we will see, the MSE provides a good choice for the loss function.

Evaluation metrics
We have argued above that the MAE or MSE on log L(x i ) are the most suitable loss functions to train our DNN for interpolating the LF on the sample x i . We are then left with the question of measuring the performance of our interpolation from the statistical point of view. In addition to D KL , several quantities can be computed to quantify the performance of the predictor. First of all, we perform a Kolmogorov-Smirnov (K-S) two-sample test [36,37] on all the marginalised one-dimensional distributions obtained using P and P P . In the hypothesis that both distributions are drawn from the same pdf, the p-value should be distributed uniformly in the interval [0, 1]. Therefore, the median of the distribution of p-values of the one-dimensional K-S tests is a representative single number which allows to evaluate the performance of the model. We also compute the error on the width of Highest Posterior Density Intervals (HPDI) PI i for the marginalised one-dimensional distribution of the i-th parameter, E i PI = PI i − PI i P , as well as the relative error on the median of each marginalised distribution. From a frequentist point of view, we are interested in reproducing as precisely as possible the test statistics used in classical inference. In this case we evaluate the model looking at the mean error on the test statistics t µ , that is the likelihood ratio profiled over nuisance parameters.
To simplify the presentation of the results, we choose the best models according to the median K-S p-value, when considering bayesian inference, and the mean error on the t µ test-statistics, when considering frequentist inference. These quantities are compared for all the different models on an identical test set statistically independent from both the training and validation sets used for the hyperparameter optimisation.

Learning from imbalanced data
The loss functions we discussed above are all averages over all samples and, as such, will lead to a better learning in regions that are well represented in the training set and to a less good learning in regions that are under-represented. On the other hand, an unbiased sampling of the LF will populate much more regions corresponding to a large probability mass than regions of large LF. Especially in large dimensionality, it is prohibitive, in terms of the needed number of samples, to make a proper unbiased sampling of the LF, i.e. converging to the underlying probability distribution, while still covering the large LF region with enough statistics. In this respect, learning a multi-dimensional LF raises the issue of learning from highly imbalanced data. This issue is extensively studied in the ML literature for classification problems, but has gathered much less attention from the regression point of view [38][39][40].
There are two main approaches in the case of regression, both resulting in assigning different weights to different examples. In the first approach, the training set is modified by oversampling and/or undersampling different regions (possibly together with noise) to counteract low/high population of examples, while in the second approach the loss function is modified to weigh more/less regions with less/more examples. In the case where the theoretical underlying distribution of the target variable is (at least approximately) known, as in our case, either of these two procedures can be applied by assigning weights that are proportional to the inverse frequency of each example in the population. This approach, applied for instance by adding weights to a linear loss function, would really weigh each example equally, which may not be exactly what we need. Moreover, in the case of large dimensionality, the interesting region close to the maximum would be completely absent from the sampling, making any reweighting irrelevant. In this paper we therefore apply an approach belonging to the first class mentioned above, consisting in sampling the LF in the regions of interest and in constructing a training sample that effectively weighs the most interesting regions. As we clarify in Section 3, this procedure consists in building three samples: an unbiased sample, a biased sample and a mixed one. Training data will be extracted from the latter sample. Let us briefly describe the three: • Unbiased sample: a sample that has converged as accurately as possible to the true probability distribution. Notice that this sample is the only one which allows posterior inference in a Bayesian perspective, but would generally fail in making frequentist inference [34].
• Biased sample: a sample concentrated around the region of maximum likelihood. It is obtained by biasing the sampler in the region of large LF, only allowing for small moves around the maximum. Tuning this sample, targeted to a frequentist MLE perspective, raises the issue of coverage, that we discuss in Section 3. One has to keep in mind that the region of the LF that needs to be well known, i.e. around the maximum, is related to the coverage of the frequentist analysis being carried out.
• Mixed sample: this sample is built by enriching the unbiased sample with the biased one, in the region of large values of the LF. This is a tuning procedure, since, depending on the number of available samples and the statistics needed for training the DNN, this sample needs to be constructed for reproducing at best the results of the given analysis of interest both in a Bayesian and frequentist inference framework.
Some considerations are in order. The unbiased sample is enough if one wants to produce a DNNLikelihood to be used only for Bayesian inference. As we show later, this does not require a complicated tuning of hyperparameters (at least in the example we consider) and reaches very good performances, evaluated with the metrics that we discussed above, already with a relatively small statistics in the training sample (considering the high dimensionality). The situation complicates a bit when one wants to be able to also make frequentist inference using the same DNNLikelihood. In this case the mixed sample (and therefore the biased one) is needed, and more tuning of the network as well as more samples in the training set are required. In order to optimise the predictions made by the DNNLikelihood close to the maximum (or in other words to get an accurate estimate of the t µ test statistics), allowing for a reliable frequentist inference, we average over a few models with the same architecture, but trained with different randomly generated training sets, therefore employing ensemble learning. This averaging can be done both at the level of the predicted quantities, or at the level of the DNNs. As an example we show the results obtained by averaging the values of the t µ test-statistics obtained with a few models trained with different training sets. We have obtained very similar results also by stacking these models together, training a small neural network to optimise their combined prediction, and using this neural network to predict t µ .
The final issue we have to address when training with the mixed sample, which is biased by construction, is to ensure that the DNNLikelihood can still produce accurate enough Bayesian posterior estimates. This is actually guaranteed by the fact that a regression (or interpolation) problem, contrary to a classification one, is insensitive to the distribution in the target variable, since the output is not conditioned on such probability distribution. This, as can be clearly seen from the results presented in Section 3, is a crucial ingredient for our procedure to be useful, and leads to the main result of our approach: a DNNLikelihood trained with the mixed sample can be used to perform a new MCMC that converges to the underlying distribution, forgetting the biased nature of the training set.
In the next Section we give a thorough example of the procedure discussed here in the case of a prototype LHC-like search for NP corresponding to a 95-dimensional LF.

A realistic LHC-like NP search
In this section we introduce the prototype LHC-like NP search presented in ref. [9], which we take as a representative example to illustrate how to train the DNNLikelihood. We refer the reader to ref. [9] for a detailed discussion of this setup and repeat here only the information that is strictly necessary to follow our analysis.
The toy experiment consists in a typical "shape analysis" in a given distribution aimed at extracting information on a possible NP signal from the Standard Model (SM) background. The measurement is divided in three different event categories, containing 30 bins each. The signal is characterized by a single "signal-strength" parameter µ and the uncertainty on the signal is neglected. 3 All uncertainties affecting the background are parametrised in terms of nuisance parameters, which may be divided into three categories: 1. fully uncorrelated uncertainties in each bin: they correspond to a nuisance parameter for each bin δ MC,i , with uncorrelated priors, parametrising the uncertainty due to the limited Monte Carlo statistics, or statistics in a control region, used to estimate the number of background events in each bin.
2. fully correlated uncertainties in each bin: they correspond to a single nuisance parameter for each source of uncertainty affecting in a correlated way all bins in the distribution. In this toy experiment, such sources of uncertainty are the modeling of the Initial State Radiation and the Jet Energy Scale, parametrised respectively by the nuisance parameters δ ISR and δ JES .
3. uncertainties on the overall normalisation (correlated among event categories): they correspond to the previous two nuisance parameters δ ISR and δ JES , that, on top of affecting the shape, also affect the overall normalisation in the different categories, plus two typical experimental uncertainties, that only affect the normalisation, given by a veto efficiency and a scale-factor appearing in the simulation, parametrised respectively by δ LV and δ RC .
In summary, the LF depends on one physical parameter µ and 94 nuisance parameters, that we collectively indicate with the vector δ, whose components are defined by The full model likelihood can be written as 4 Pr n obs where the product runs over all bins I. The number of expected events in each bin is given by n I (µ, δ) = n s,I (µ) + n b,I (δ), and the probability distributions are given by Poisson distributions in each bin Pr n obs In this toy LF, the number of background events in each bin n b,I (δ) is known analytically as a function of the nuisance parameters, through various numerical parameters that interpolate the effect of systematic uncertainties. The parametrisation of n b,I (δ) is such that the nuisance parameters δ are normally distributed with vanishing vector mean and identity covariance matrix Moreover, due to the interpolations involved in the parametrisation of the nuisance parameters, in order to ensure positive probabilities, the δs are only allowed to take values in the range [−5, 5]. In our approach, we are interested in setting up a supervised learning problem to learn the LF as a function of the parameters. Independently of the statistical perspective, i.e. whether the parameters are treated as random variables or just variables, we need to choose some values to evaluate the LF. For the nuisance parameters the function π(δ) already tells us how to choose these points, since it implicitly treats the nuisance parameters as random variables distributed according to this probability distribution. For the model parameters, in this case only µ, we have to decide how to generate points, independently of the stochastic nature of the parameter itself. In the case of this toy example, since we expect µ to be relatively "small" and most probably positive, we generate µ values according to a uniform probability distribution in the interval [− 1,5]. This could be considered as the prior on the stochastic variable µ in a Bayesian perspective, while it is just a scan in the parameter space of µ in the frequentist one. 5 Notice that we allow for small negative values of µ. Whenever the NP contribution comes from the on-shell production of some new physics, this assumption is not consistent. However, the "signal" may come, in an Effective Field Theory (EFT) perspective, from the interference of the SM background with higher dimensional operators. This interference could be negative depending on the sign of the corresponding Wilson coefficient, and motivates our choice to allow for negative values of µ in our scan.

Sampling the full likelihood
To obtain the three samples discussed in Section 2.2 from the full model LF in eq. (11) we used the emcee3 Python package [42], which implements the Affine Invariant (AI) MCMC  Ensemble Sampler [43]. We proceeded as follows: In the first sampling, the values of the proposals have been updated using the default StretchMove algorithm implemented in emcee3, which updates all values of the parameters (95 in our case) at a time. The default value of the only free parameter of this algorithm a = 2 delivered a slightly too low acceptance rate ≈ 0.12. We have therefore set a = 1.3, which delivers a better acceptance rate of about 0.36. Walkers 6 have been initialised randomly according to the prior distribution of the parameters. The algorithm efficiently achieves convergence to the true target distribution, but, given the large dimensionality, hardly explores large values of the LF.
In Figure 1 we show the evolution of the walkers for the parameter µ (left) together with the corresponding values of − log L (right) for an illustrative set of 100 walkers. From these figures a reasonable convergence seems to arise already after roughly 10 3 steps, which gives an empirical estimate of the autocorrelation of samples within each walker.
Notice that, in the case of ensemble sampling algorithms, the usual Gelman, Rubin and Brooks statistics, usually denoted asR c [44,45], is not expected to be a robust tool to asses convergence, due to the correlation between different walkers. However, in order to reduce this correlation, one could consider a number of independent samplers, extract a few walkers from each run, and computeR c for this set [46]. Considering the aforementioned empirical estimate of the number of steps for convergence, i.e. roughly few 10 3 , we have run 50 independent samplers for a larger number of steps (3 · 10 4 ), extracted randomly 4 chains from each, joined them together, and computedR c for this set. This is shown in the upper-left plot of Figure 2. With a requirement of R c < 1.2 [45] we see that chains have already converged after around 5 · 10 3 steps, which is roughly what we empirically estimated looking at the chains evolution in Figure 1. An even more robust requirement for convergence is given byR c < 1.1, together with stabilized evolution of both variancesV and W [45]. In the center and right plots of Figure 2 we show this evolution, from which we see that convergence has robustly occurred after 2 − 3 · 10 4 steps. In order to check the statement that theR c metric cannot be directly applied to ensemble sampling techniques, we have performed the same analysis using 200 walkers from a single sampler. The result is shown in the lower panels of Figure 2. As it can be seen comparing the upper and lower plots, we would have got to the exact same conclusions about convergence, showing that the correlation of walkers does not, at least in this case, affect diagnostics of convergence based on theR c measure.
An alternative and pretty general way to diagnose MCMC sampling is the autocorrelation of chains, and in particular the Integrated Autocorrelation Time (IAT). This quantity represents the average number of steps between two independent samples in the chain. For unimodal distributions, one can generally assume that after a few IAT the chain forgot where it started and converged to generating samples distributed according to the underlying target distribution. There are more difficulties in the case of multimodal distributions, which are however shared by most of the MCMC convergence diagnostics. We do not enter here in such a discussion, and refer the reader to the overview presented in ref. [49]. An exact calculation of the IAT for large chains is computationally prohibitive, but there are several algorithms to construct estimators of this quantity. The emcee3 package comes with tools that implement some of these algorithms, which we have used to study our sampling [47,48]. To obtain a reasonable estimate of the IAT τ , one needs enough samples, a reasonable empirical estimate of which, that works well also in our case, is at least 50τ . An illustration of this, for the parameter µ, is given in the left panel of Figure 3, where we show, for a sampler with 10 3 chains and 10 6 steps, the IAT estimated after different numbers of steps with two different algorithms, "G&W 2010" and "DFM 2017" (see refs. [47,48] for details). It is clear from the plot that the estimate becomes flat, and therefore converges to the correct value of the IAT, roughly when the estimate curves cross the empirical value of 50τ (this is an order of magnitude estimate, and obviously, the larger the number of steps, the better the estimate of τ ). The best estimate that we get for this sampling for the parameter µ is obtained with 10 6 steps using the "DFM 2017" method and gives τ ≈ 1366, confirming the order of magnitude estimate empirically extracted from Figure 1. In the right panel of Figure 3 we show the resulting one-dimensional (1D) marginal posterior distribution of the parameter µ obtained from the corresponding run. Finally, we have checked that Figures 1 and 3 are quantitatively similar for all other parameters.
As we mentioned above, the IAT gives an estimate of the number of steps between independent samples (it roughly corresponds to the period of oscillation, measured in number of steps, of the chain in the whole range of the parameter). Therefore, in order to have a true unbiased set of independent samples, one has to "thin" the chain with a step size of roughly τ . This greatly decreases the statistics available from the MCMC run. Conceptually there is nothing wrong with having correlated samples, provided they are distributed according to the target distribution, however, even though this would increase the effective available statistics, it would generally affect the estimate of the uncertainties in the Bayesian inference [50,51]. We defer a careful study of the issue of thinning to a forthcoming publication [52], while here we limit ourselves to describe the procedure we followed to get a rich enough sample.
We have run emcee3 for 10 6 + 5 · 10 3 steps with 10 3 walkers for 11 times. From each run we have discarded a pre-run of 5 · 10 3 steps, which is a few times τ , and thinned the chain with a step size of 10 3 , i.e. roughly τ . 7 Each run then delivered 10 6 roughly independent samples. With parallelization, the sampler generates and stores about 22 steps per second. 8 The final sample obtained after all runs consists of 1.1 · 10 7 samples. We stored 10 6 of them as the test set to evaluate our DNN models, while the remaining 10 7 are used to randomly draw the different training and validation sets used in the following.

Biased sample S 2
The second sampling has been used to enrich the training and test sets with points corresponding to large values of the LF, i.e. points close to the maximum for each fixed value of µ. In this case we initialised 200 walkers in maxima of the LF calculated for random values of µ, extracted according to a uniform probability distribution in the interval [−1, 1]. 9 Moreover, the proposals have been updated using a Gaussian random 8 All samplings presented in the paper were produced with a SYS-7049A-T Supermicro R workstation configured as follows: Dual Intel R Xeon R Gold 6152 CPUs at 2.1GHz (22 physical cores), 128 Gb of 2666 MHz Ram, Dual NVIDIA R RTX 2080-Ti GPUs and 1.9Tb M.2 Samsung R NVMe PM963 Series SSD (MZ1LW1T9HMLS-00003). Notice that speed, in our case, was almost constant for a wide choice of the number of parallel processes in the range ∼ 30 − 88, with CPU usage never above about 50%. We therefore conclude that generation speed was, in our case, limited by data transfer and not by CPU resources, making parallelization less than optimally efficient. 9 This interval has been chosen smaller than the interval of µ considered in the unbiased sampling since values of µ outside this interval correspond to values of the LF much smaller than the global maximum, that are not relevant from the frequentist perspective. The range for the biased sampling can be chosen a posteriori by looking at the frequentist confidence intervals on µ. move with variance 5 · 10 −4 (small moves) of a single parameter at a time. In this way, the sampler starts exploring the region of parameters corresponding to local maxima, i.e. large values of the LF, and then slowly moves towards the tails. Once the LF gets further and further from the local maxima, the chains do not explore the region of local maxima anymore. Therefore, in this case we do not want to discard a pre-run, neither to check convergence, which implies that this sampling will have a strong bias (obviously, since we forced the sampler to explore only a particular region).
In Figure 4 we show the evolution of the chains for the parameter µ (left panel) together with the corresponding values of log L (right panel) for an illustrative (random) set of 100 chains. Comparing Figure 4 with Figure 1, we see that now the moves of each chain are much smaller and the sampler generates many points all around the local maxima at which the chains are initialised.
In order to ensure a rich enough sampling close to local maxima of the LF, we have made 10 5 iterations for each walker. Since moves are much smaller than in the previous case (only one parameter is updated at a time), the efficiency in this case is very large, ε ≈ 1. We could therefore obtain a sampling of 1.1 · 10 7 points by randomly picking points within the 10 5 ·200·ε samples. As for S 1 , two samples of 10 6 and 10 7 points have been stored separately: the first serves to build the test set, while the second is used to construct the training and validation sets. As mentioned before, this is a biased sample, and therefore should only be used to enrich the training sample to properly learn the LF close to the maximum (and to check results of the frequentist analysis), but it cannot be used to make any posterior inference. Due to the large efficiency, this sampling took less than one hour to be generated.

Mixed sample S 3
The mixed sample S 3 is built from S 1 and S 2 in order to properly populate both the large probability mass region and the large log-likelihood region.  The distribution of the LF values in the three samples are shown in Figure 5 (for the 10 7 points in the training/validation set).
We have used the three samples as follows: examples drawn from S 3 were used to train the full DNNLikelihood, while results have been checked against S 1 in the case of Bayesian posterior estimations and against S 2 (together with results obtained from a numerical maximisation of the analytical LF) in the case of frequentist inference. Moreover, we also present a "Bayesian only" version of the DNNLikelihood, trained using only points from S 1 .

Bayesian inference
In the Bayesian approach one is interested in marginal distributions, used to compute marginal posterior probabilities and credibility intervals. For instance, in the case at hand, one may be interested in two-dimensional (2D) marginal probability distributions in the parameter space (µ, δ), such as or in 1D HPDI corresponding to probabilities 1 − α, such as   Table 1: HPDIs obtained using all 10 7 samples from the training set of S 1 . The result is shown both for µ > −1 and µ > 0 (only the upper bound is given in the latter case).
All these integrals can be discretized and computed by just summing over quantities evaluated on a proper unbiased LF sampling.
This can be efficiently done with MCMC techniques, such as the one described in Section 3.1. For instance, using the sample S 1 we can directly compute HPDIs for the parameters. Figure 6 shows the 1D and 2D posterior marginal probability distributions of the subset of parameters (µ, δ 4 , δ 7 , δ 31 , δ 62 , δ 86 ) obtained with the training set (10 7 points, green (darker)) and test set (10 6 points, red (lighter)) of S 1 . Figure 6 also shows the 1D and 2D 68.27%, 95.45%, 99.73% HPDIs. All the HPDIs, including those shown in Figure 6, have been computed by binning the distribution with 50 bins, estimating the interval, and increasing the number of bins by 50 until the interval splits due to statistical fluctuations.
The results for µ with the assumptions µ > −1 and µ > 0, estimated from the training set, which has the largest statistics, are given in Table 1. Figure 6 shows how the 1D marginal probability distributions are extremely accurate up to the 99.73%, while the 2D ones, for the same interval, start showing differences, due to the finite sample size.
Considering that the sample sizes used to train the DNN range from 10 5 to 5 · 10 5 , we do not consider probability intervals higher than 99.73%. Obviously, if one is interested in covering higher HPDIs, larger training sample sizes need to be considered (for instance, to cover a Gaussian 5σ interval, that corresponds to a probability 1 − 5.7 · 10 −7 , even only on the 1D marginal distributions, a sample with 10 7 points would be necessary). We will not consider this case in the present paper.

Frequentist inference
In a frequentist inference one usually constructs a test statistics λ(µ, θ) based on the LF ratio Since one would like the test statistics to be independent of the nuisance parameters, it is common to use instead the profiled likelihood, obtained replacing the LF at each value of µ with its maximum value (over the nuisance parameters volume) for that value of µ. One can then construct a test statistics t µ based on the profiled (log)-likelihood ratio, given by (17) Whenever suitable general conditions are satisfied, and in the limit of large data sample, by Wilks' theorem [53] the distribution of this test-statistics approaches a χ 2 distribution that is independent of the nuisance parameters δ and has a number of degrees of freedom equal to dim L − dim L prof [54]. In our case t µ can be computed using numerical maximisation on the analytic LF, but it can also be computed from S 2 (and S 3 , which is identical in the large likelihood region), which was constructed with the purpose of describing the LF as precisely as possible close to local maxima. In Figure 7 we show the result for t µ using both approaches for different sample sizes drawn from S 2 . The three samples from S 2 used for the maximisation, with sizes 10 5 , 10 6 , and 10 7 (full training set of S 2 ), contain in the region µ ∈ [0, 1] around 5 · 10 4 , 5 · 10 5 , and 5 · 10 6 points respectively, which results in increasing statistics in each bin and a more precise and stable prediction for t µ . As it can be seen 10 5 points, about half of which contained in the range µ ∈ [0, 1], are already sufficient, with a small bin size of 0.02, to reproduce the t µ curve with great accuracy. As expected, larger bin sizes result in too high local maxima estimates, leading to an underestimate of t µ . Under Wilks' theorem assumptions, t µ should be distributed as a χ 2 1 (1 d.o.f.) distribution, from which we can determine CL upper limits. The 68.27%(95.45%) CL upper limit (under the Wilks' hypotheses) is given by t µ = 1(4), corresponding to µ < 0.37(0.74). These upper limits are compatible with the ones found in ref. [9], and are quite smaller than the corresponding upper limits of the HPDI obtained with the Bayesian analysis in Section 3.1.1 (see Table 1). This may suggest that the result obtained using the asymptotic approximation for t µ is underestimating the upper limit (undercoverage). This is somehow expected for the search under consideration, given the large number of bins in which the observed number of events is below 5 or even 3 (see Figure 2 of ref. [9]). Indeed, the true distribution of t µ is expected to depart from a χ 2 1 distribution when the hypotheses of Wilks' theorem are violated. The study of the distribution of t µ is related to the problem of coverage of frequentist confidence intervals, and requires to perform pseudo-experiments. We present results on the distribution of t µ obtained through pseudo-experiments in Appendix B. The important conclusion is that using the distribution of t µ generated with pseudo-experiments, CL upper limits become more conservative by up to almost a factor of two, depending on the choice of the approach used to treat nuisance parameters. This shows that the upper limits computed through asymptotic statistics undercover, in this case, the actual upper bounds on µ.

The DNNLikelihood
The sampling of the full likelihood discussed above has been used to train a DNN regressor constructed from multiple fully connected layers, i.e. a multilayer perceptron (MLP). The regressor has been trained to predict values of the LF given a vector of inputs made by the physical and nuisance parameters. In order to introduce the main ingredients of our regression procedure and DNN training, we first show how models trained using only points from S 1 give reliable and robust results in the case of the Bayesian approach. Then we discuss the issue of training with samples from S 3 to allow for maximum likelihood based inference. Finally, once a satisfactory final model is obtained, we show again its performance for posterior Bayesian estimates.

Model architecture and optimisation
We used Keras [55] with TensorFlow [56] backend, through their Python implementation, to train a MLP and considered the following hyperparameters to be optimised, the value of which defines what we call a model or a DNNLikelihood.

• Size of training sample
In order to assess the performance of the DNNLikelihood given the training set size we considered three different values: 10 5 , 2·10 5 and 5·10 5 . The training set (together with an half sized evaluation set) has been randomly drawn from S 1 for each model training, which ensures the absence of correlation between the models due to the training data: thanks to the large size of S 1 (10 7 samples) all the training sets can be considered roughly independent. In order to allow for a consistent comparison, all models trained with the same amount of training data have been tested with a sample from the test set of S 1 , and with half the size of the training set. In general, and in particular in our interpolation problem, increasing the size of the training set allows to reduce the generalization error and therefore to obtain the desired performance on the test set.
• Loss function In Section 2 we have argued that both MAE and MSE are suitable loss functions to learn the log-likelihood function. In our optimisation procedure we tried both, always finding (slightly) better results for the MSE. We therefore choose the MSE as our loss function in all results presented here.
• Number of hidden layers From a preliminary optimisation we concluded that more than a single HL (deep network) always performs better than a single HL (shallow network). However, in the case under consideration, deeper networks do not seem to perform much better than 2HL networks, even though they are typically much slower to train and to make predictions. Therefore, after this preliminary assessment, we focused on 2HL architectures.

• Number of nodes on hidden layers
We considered architectures with the same number of nodes on the two hidden layers. The number of trainable parameters (weights) in the case of n fully connected HLs with the same number of nodes d HL is given by where d input is the dimension of the input layer, i.e. the number of independent variables, 95 in our case. DNNs trained with stochastic gradient methods tend to small generalization errors even when the number of parameters is larger than the training sample size [57]. Overfitting is not an issue in our interpolation problem [58]. In our case we considered HLs not smaller than 500 nodes, which should ensure enough bandwidth throughout the network and model capacity. In particular we compared results obtained with 500, 1000, 2000, and 5000 nodes on each HL, corresponding to 299001, 1098001, 4196001, and 25490001 trainable parameters.
• Activation function on hidden layers We compared RELU [59], ELU [60], and SELU [61] activation functions and the latter one showed to fit better our problem. In order to correctly implement the SELU activation in Keras we initialised all weights using the Keras "lecun normal" initialiser [61,62].
• Batch size When using a stochastic gradient optimisation technique, of which Adam is an example, the minibatch size is an hyperparameter. For the training to be stochastic, the batch size should be much smaller than the training set size, so that each minibatch can be considered roughly independent. Large batch sizes lead to more accurate weight updates and, due to the parallel capabilities of GPUs, to faster training time. However, smaller batch sizes usually contribute to regularize and avoid overfitting. After a preliminary optimisation obtained changing the batch size from 256 to 4096, we concluded that the best performances were obtained by keeping the number of batches roughly fixed to 200 when changing the training set size. In particular, choosing batch sizes among powers of two, we have used 512, 1024 and 2048 for 10 5 , 2 · 10 5 and 5 · 10 5 training set sizes respectively. Notice that increasing the batch size when enlarging the training set, also allowed us to keep the initial learning rate fixed [63]. Similar results could be obtained by keeping a fixed batch size of 512 and reducing the starting learning rate when enlarging the training set.

• Optimiser
We used the Adam optimiser with default parameters, and in particular with learning rate = 0.001. We reduced the learning rate by a factor 0.2 every 40 epochs without improvements on the validation loss within an absolute amount (min delta in Keras) 1/N points , with N points the training set size. Indeed, being the Keras min delta parameter absolute and not relative to the value of the loss function, we needed to reduce it when getting smaller losses (better models). We have found that 1/N points corresponded roughly to one to few permil of the best minimum validation loss obtained for all different traning set sizes. This value turned out to give the best results with reasonably low number of epochs (fast enough training). Finally, we performed early stopping [64,65] using the same min delta parameter and no improvement in the validation loss for 50 epochs. This ensured that training did not go on for too long without substantially improving the result. We also tested the newly proposed Ad-aBound optimiser [66] without seeing, in our case, large differences.
Notice that the process of choosing and optimising a model depends on the LF under consideration (dimensions, number of modes, etc.) and this procedure should be repeated for different LFs. However, good initial points for the optimisation could be chosen using experience from previously constructed DNNLikelihoods.
As we discussed in Section 2, there are several metrics that we can use to evaluate our model. Based on the results obtained by re-sampling the DNNLikelihood with emcee3, we see a strong correlation between the quality of the re-sampled probability distribution (i.e. of the final Bayesian inference results) and the metric corresponding to the median of the K-S test on the 1D posterior marginal distributions. We therefore present results focusing on this evaluation metric. When dealing with the Full DNNLikelihood trained with the biased sampling S 3 we also consider the performance on the mean relative error on the predicted t µ test statistics when choosing the best models.

The Bayesian DNNLikelihood
From a Bayesian perspective, the aim of the DNNLikelihood is to be able, through a DNN interpolation of the full LF, to generate a sampling analog to S 1 , which allows to produce Bayesian posterior density distributions as close as possible to the ones obtained using the true LF, i.e. the S 1 sampling. Moreover, independently on how complicated to evaluate the original LF is, the DNNLikelihood is extremely fast to compute, allowing for very fast sampling. 10 The emcee3 MCMC package allows, through vectorization of the input function for the log-probability, to profit of parallel GPU predictions, which made sampling of the DNNLikelihood roughly as fast as the original analytic LF.
We start by considering training using samples drawn from the unbiased S 1 . The independent variables all vary in a reasonably small interval around zero and do not need any preprocessing. However, the log L values in S 1 span a range between around −380 and −285. This is both pretty large and far from zero for the training to be optimal. For this reason we have pre-processed data scaling them to zero mean and unit variance. Obviously, when predicting values of log L we applied the inverse function to the DNN output.
We rank models trained during our optimisation procedure by the median p-value of 1D K-S test on all coordinates between the test set and the prediction performed on the validation set. The best models are those with the highest median p-value. In Table 2 we show results for the best model we obtained for each training sample size. Results have been obtained by training 5 identical models and taking the best one. We call these four best models B 1 − B 3 (B stands for Bayesian). All three models have two HLs with 5 · 10 3 nodes each, and are therefore the largest we consider in terms of number of parameters. However, it should be clear that the gap with smaller models is extremely small in some cases with some of the models with less parameters in the ensemble of 5 performing better than some others with more parameters. This also suggests that results are not too sensitive to model dimension, making the DNNLikelihood pretty robust.     in the legends. Early stopping is usually triggered after a few hundred epochs (ranging from around 200 to 500, with the best models around 200 − 300) and values of the validation loss (MSE) that range in the interval ≈ [0.01, 0.003]. Values of the validation ME, which, as explained in Section 2 correspond to the K-L divergence for the LF, range in the ≈ [1, 5]·10 −3 , which, together with median of the p-value of the 1D K-S tests in the range 0.2 − 0.4 deliver very accurate models. Training times are not prohibitive, and range from less than one hour to a few hours for the models we considered on a Nvidia Tesla V100 GPU with 32GB of RAM. Prediction times, using the same batch sizes used during training, are in the ballpark of 10 − 15µs/point, allowing for very fast sampling and inference using the DNNLikelihood. Finally, as shown in Table 2, all models present very good generalization when going from the evaluation to the test set, with the generalization error decreasing with the sample size as expected.
In order to get a full quantitative assessment of the performances of the Bayesian DNN-Likelihood, we compared the results of a Bayesian analysis performed using the test set of S 1 and each of the models B 1 − B 3 . This was done in two ways. Since the model is usually a very good fit of the LF, we reweighted each point in S 1 using the ratio between the original likelihood and the DNNLikelihood (reweighting). This procedure is so fast that can be done for each trained model during the optimisation procedure giving better insights on the choice of hyperparameters. Once the best model has been chosen, the result of reweighting has been checked by directly sampling the DNNLikelihoods with emcee3. 11 We present results obtained by sampling the DNNLikelihoods in the form of 1D and 2D marginal posterior density plots on a chosen set of parameters (µ, δ 10 , δ 40 , δ 70 , δ 95 ).
We have sampled the LF using the DNNLikelihoods B 1 − B 3 with the same procedure used for S 1 . 12 Starting from model B 1 (Figure 9), Bayesian inference is well reproduced up     to probability intervals of 95.45%, while large deviations start to arise at 99.73%. This is reasonable, since model B 1 has been trained with only 10 5 points, which are not enough to carefully interpolate in the tails, so that this region is described by the DNNLikelihood through extrapolation. Small deviations also arise for smaller probability intervals, which are again the outcome of a model trained with too few points. Nevertheless, we want to stress that, considering the very small training size and the large dimensionality of the LF, model B 1 already works surprisingly well. This is a common feature of the DNNLikelihood, which, as anticipated, works extremely well in predicting posterior probabilities without the need of a too large training sample, nor a hard tuning of the DNN hyperparameters. When going to models B 2 and B 3 (Figures 10 and 11) predictions become more and more reliable, improving as expected with the number of training points. Notice that the observed deviations are of the same size, and often smaller, than the ones observed in Figure 6 between the larger training set and the smaller test set. Therefore, at least part of the small deviations observed in the DNNLikelihood prediction have to be attributed to the finite size of the training set, and are expected to disappear when further increasing the number of points. Considering the relatively small training and prediction times shown in Table 2, it should be possible, once the desired level of accuracy has been chosen, to enlarge the training and test sets enough to match that precision. For the purpose of this work, we consider the results obtained with models B 1 − B 3 already satisfactory, and do not go beyond 5 · 10 5 training samples.
In order to allow for a fully quantitative comparison, in Table 3 we summarize the Bayesian 1D HPDI obtained with the DNNLikelihoods B 1 − B 3 for the parameter µ both using reweighting and re-sampling (only upper bounds for the hypothesis µ > 0). We find that, taking into account the uncertainty arising from our algorithm to compute HPDI (finite binning) and from statistical fluctuations in the tails of the distributions for large probability intervals, the results of Table 3 are in good agreement with those in Table 1. This shows that the Bayesian DNNLikelihood is accurate even with a rather small training sample size of 10 5 and its accuracy quickly improves by increasing the training sample size.

Frequentist extension and the full DNNLikelihood
We have trained the same model architectures considered for the Bayesian DNNLikelihood using the S 3 sample. In Table 4 we show results for the best models we obtained for each training sample size. Results have been obtained by training 5 identical models and taking the best one. We call these models F 1 − F 3 (F may stand for both Full and Frequentist, bearing in mind that these models also allow for Bayesian inference). As we anticipated in the previous Section, the performance gap between models with different number of parameters is very small and often models with less parameters overperform, at least in the hyperparameter space we considered, models with more parameters. This is especially due to our choice of leaving the initial learning rate constant for all architectures, which resulted in a slightly too large learning rate for the bigger models, with a consequently less stable training phase. This can be seen in Figure 12, where we show the learning curves obtained for the values of the hyperparameters shown in the legends. The training curves for models F 1 and F 2 are less regular than those of model F 3 . Nevertheless, as the learning rate gets reduced, they quickly reach a good validation loss, which promoted them best models for 10 5 and 2 · 10 5 training set sizes. This did not happen for models trained with 5 · 10 5 points, which could have benefited from reducing further the initial learning rate. However, we stress that no strong fine-tuning is needed for the DNNLikelihood to perform extremely well, so that we have chosen the best model F 3 with less parameters without pushing optimisation further.    Before repeating the Bayesian analysis presented for the Bayesian DNNLikelihood, we present results of frequentist inference using the Full DNNLikelihood. We used the models F 1 − F 3 to evaluate the test statistics t µ . The left panel of Figure 13 shows t µ obtained using the Full DNNLikelihoods, while the right panel of the same Figure shows the relative error with respect to numerical maximisation of the analytic LF. Apart for some visible deviation in the prediction of model F 1 , it is clear that the Full DNNLikelihood is perfectly able to reproduce the test statistics, allowing for robust frequentist inference, already starting with relatively small training sets of few hundred thousand points. Clearly, the larger the training set, the smaller the error, with an average percentage error on t µ (dashed lines in the right panel of Figure 13) that gets as low as a few percent for our models. We found that ensemble learning can help in reducing differences further (keeping under control statistical fluctuations in the training set and in the DNN weights). However, in the particular case under consideration, results are already satisfactory by taking the best out of five identical models trained with random subsets of the training set. For this reason we do not expand on ensemble learning in this work and just consider it as a tool to improve performance in cases where the LF is extremely complicated or has very high dimensionality.
We now show how the full DNNLikelihood is also able to catch all the features necessary for Bayesian inference. We do so by repeating the analysis done for models B 1 −B 3 for the full DNNLikelihoods F 1 − F 3 . Plots obtained by sampling the DNNLikelihood are shown in Figs. 14-16. Results are quantitatively unchanged with respect to the Bayesian DNNLikelihood. For completeness we also summarize in Table 5 the Bayesian 1D HPDI obtained with the DNNLikelihoods F 1 − F 3 for the physics parameter µ. Results are compatible with what we found in Table 3

Conclusion and future work
Publishing and distributing likelihoods in a simple yet general way is becoming a key issue in many fields, and in particular in the physics of fundamental interactions, where experimental (and phenomenological) results typically involve complicated (and often multi-modal or degenerate) likelihoods which depend on hundreds of parameters. We have introduced the DNNLikelihood framework, in which the full likelihood, binned or unbinned, including the dependence on all nuisance parameters, is published by providing a suitably trained DNN predictor. Distributing the results of experimental or phenomenological analyses using the DNNLikelihood framework allows for the combination of different analyses, for the reinterpretation of the results under different hypotheses, and for the use of the likelihood in a different statistical framework, without loss of information.  We have illustrated the power of the DNNLikelihood discussing in detail the toy experiment presented in ref. [9], which mimics a realistic LHC-like NP search. We found that the DNNLikelihood is able to catch the main features of the true LF, allowing for both frequentist and Bayesian inference, already with a limited amount of training data.
A Jupyter notebook, together with Python source files which allow to reproduce all results presented above are available on GitHub . A dedicated Python package allowing to sample LFs and to build, optimize, train, and store the corresponding DNNLikeliihoods is in preparation. This will allow not only allow to construct and distribute DNNLikelihoods, but also to use them for inference both in the Bayesian and frequentist frameworks.
We plan to release soon the first examples of the use of the DNNLikelihood for true experimental likelihoods, including unbinned, multimodal likelihoods, in forthcoming publications.  As a final remark, we stress the fact that the proposed approach supports and complements the remarkable step taken by the ATLAS collaboration in publishing full experimental likelihoods [12,14,15]. On one hand, the DNNLikelihood framework can be seen as a way to export a HistFactory likelihood to a software environment that doesn't meet the needed dependencies, or as an alternative option for experimental collaborations that don't use HistFactory. On the other, the use of a DNN model doesn't imply specific choices on the likelihood function (e.g. it covers equally well unbinned likelihoods and/or likelihoods built from analytical functions and not represented as histograms).
Last but not least, using a DNNLikelihood could be a viable solution to distribute the outcome of phenomenological studies, such as the ones presented in refs. [19-28, 67, 68].
On a long term, a large scale adoption of likelihood publishing towards the DNNLikelihood framework would motivate the possibility of publishing DNN models on HEPData (which, more generically, would be beneficial for reproducibility issues related to physics analysis using DNNs for selection, etc.), for instance supporting the ONNX format. In this respect, and considering that there could be interest in likelihood publishing in other domains (e.g. for the Λ CDM likelihood in cosmology), it might be worth considering a less hyerarchical submission procedure for HEPData (e.g. allowing individuals outside a structured organization/experimental collaboration to submit a likelihood function) or the opportunity to create a separate (and not HEP specific) likelihood function repository, e.g. based on Zenodo.
where we have denoted by x ∼ µ − nσ a point that is approximately nσ away from the l dimensional mean. The target set of the distribution within nσ from the mean is therefore given by: where min(N l (x))| x µ−nσ is the minimum of N l for x within nσ from the mean divided by the normalisation factor (2π) −l/2 . In Table 6 we show the value of − log 10 (min(N l (x))| x µ−nσ ) for l = 1, 10, 100, 1000 dimensions and for confidence intervals up to 6σ.  Table 6: Value of − log 10 (min(N l (x))| x µ−nσ ) for l = 1, 10, 100, 1000 dimensions and for confidence intervals up to 6σ. See the text for details.

B Pseudo-experiments and frequentist coverage
In order to check the validity of the asymptotic approximation given by Wilks' theorem, and to obtain more robust upper bounds from the t µ test-statistics for the LHC-like search discussed in Section 3, we need to generate several pseudo-experiments for each hypothesised "true" value of µ, compute the log-likelihood and the test statistics t µ for each pseudoexperiment, and study the pdf of t µ , denoted by f (t µ |µ). The cumulative distribution of f (t µ |µ) determines the coverage properties of the test-statistics t µ through the relation where t µ,obs is the t µ of our actual experiment at the given value of µ. By solving this equation for µ at specified value of p µ = α we get the value of µ excluded at a CL of 1 − α.
We generate pseudo-experiments following two different procedures for the treatment of nuisance parameters and compare the results.

• Profile construction
First we consider a frequentist treatment of nuisance parameters, referred to as profile construction [41,69]: we determine the valuesθ µ of the nuisance parameters at the maximum of the LF of our actual experiment for different values of µ, and keep them fixed to generate several pseudo-experiments for each µ according to eq. (12). This approach encodes statistical fluctuations arising in repeated experiments, but does not take into account systematic uncertainties. Although this approach violates the "anticipation criterion" [70], it is expected to work well, and have good coverage properties when the uncertainty is statistically dominated. 13 Indeed the coverage of this approach grows as the profiled valueθ µ approaches the "true" value of θ. This happens when systematic uncertainties become less and less relevant compared to statistical fluctuations.
Following this procedure we generated 5 · 10 4 pseudo-experiments for each value of µ in the range [0, 1] with steps of 0.1.   Table 8: Upper bounds on t µ and corresponding µ at different CL for the LHC-like new physics search discussed in Section 3, obtained from pseudo-experiment employing a profiled frequentist and hybrid frequentist-Bayesian tratement of nuisance parameters. The last two rows show the asymptotic result obtained with a χ 2 1 distribution.
• Hybrid frequentist-Bayesian (marginal model) In order to understand the impact of systematic uncertainties we also consider the hybrid frequentist-Bayesian treatment of nuisance parameters (referred to as marginal model in ref. [41]): pseudo-experiments are generated including both variations of the nuisance parameters according to their distribution, and statistical uncertainty through eq. (12). This allows to include the effect of systematic uncertainties in the generation of pseudo-experiments.
In this case we generated, for the same values of µ considered above, 10 4 expected counts (for all 90 bins) corresponding to variations of the nuisance parameters over their multivariate distibution, and subsequently used each of these expected counts to generate 10 pseudo-experiments according to eq. (12). This delivers a total of 10 5 pseudo-experiments for each vaue of µ, encoding both statistical and systematic uncertainties. Figure 17 shows the distributions f (t µ |µ) for each value of µ, while the probability values covered for each value of µ and the corresponding t µ,obs are given in Table 7. Using the distributions f (t µ |µ) we performed the integral in eq. (25) for 1−p µ = 0.6827, 0.9545, 0.9973, getting the upper bounds on µ reported in Table 8. All results are shown for both the profile construction and the hybrid frequentist-Bayesian approach. As expected, due to the small statistics in several bins, the asymptotic result is inaccurate, and in particular tends to undercover the true value. This delivers a too "aggressive" upper bound for µ. The hybrid approach gives relatively more conservative upper bounds, very similar (expectedly, given the marginal model used for the nuisance parameters) to the Bayesian result reported in Table 1. Finally, the profile construction gives the most conservative bound, that is substantially more conservative than the asymptotic one. We do not dare here to discuss which upper bound is to be quoted, since we are only interested in assessing the validity of the asymptotic approximation.