Simplified likelihoods using linearized systematic uncertainties

This paper presents a simplified likelihood framework designed to facilitate the reuse, reinterpretation and combination of LHC experimental results. The framework is based on the same underlying structure as the widely used HistFactory format, but with systematic uncertainties considered at linear order only. This simplification leads to large gains in computing performance for the evaluation and maximization of the likelihood function, compared to the original statistical model. The framework accurately describes non-Gaussian effects from low event counts, as well as correlated uncertainties in combinations. While primarily targeted towards binned descriptions of the data, it is also applicable to unbinned models.


Introduction
The statistical models describing experimental measurements are a key component of LHC data analysis. Consisting of the probability distribution function (PDF) of the measurement together with the observed dataset, they are used to compute the final experimental results, e.g. confidence intervals for model parameters or significance values for possible excesses over background, often through the use of frequentist profile-likelihood ratio (PLR) methods [1]. They can also be used to extract further information from the measurement, for instance in combinations with other results, or in reinterpretations in the context of alternative signal models.
Despite this central role, statistical models are not systematically made available as part of experimental publications. This is partly for technical reasons: first, they are often complex, with up to O(10^4) parameters in some cases [2]. A single maximization of the likelihood function, which is needed to compute the PLR, can therefore require up to several hours or days of computation time. Another limitation is the fact that the statistical models of LHC measurements are typically implemented within formats and tools not widely used in other fields, such as the ROOT framework [3].
The information provided in publications, such as the best-fit values of the parameters of interest (POIs) and the covariance matrix of their measurement, typically allows a partial reconstruction of the model. However, this is only possible under additional assumptions, in particular Gaussian approximations that do not accurately describe data taken in the Poisson regime with low expected event counts. In cases where full PLR scans are published, the description of systematic uncertainties also does not typically allow a full separation of the different sources of uncertainty, so that correlations across different measurements cannot be properly accounted for when performing their combination.
For these reasons, recent efforts have encouraged the publication of faithful representations of the experimental statistical models under FAIR (Findable, Accessible, Interoperable, Reusable) principles [4], in particular with a view towards reinterpretations targeting alternative physics models [5,6]. This objective can be realized in particular through their publication in open formats. Some recent progress has been achieved in this direction, such as the publication of statistical models by the ATLAS collaboration using the pyhf [7,8] framework. These cases however remain rare so far, in particular due to the limitations described above.
Simplified likelihoods offer a compromise, aiming to provide less complex descriptions of the experimental statistical models that nevertheless remain more accurate than Gaussian approximations. Several approaches have been proposed [9][10][11][12][13]. This work describes a simplification applied to statistical models, in which the dependence on the POIs of the measurement is treated exactly, but the remaining nuisance parameters (NPs) are considered at linear order only. This allows the maximization of the likelihood function with respect to the NPs (usually denoted as profiling the likelihood function) to be performed in closed form using matrix algebra techniques. This in turn can significantly decrease the computing times of the PLR computation, since the NPs, which are used in particular to describe systematic uncertainties, typically form a large fraction of the model parameters. The structure of the simplified model, in terms of the POIs, NPs, measurement regions and event samples, remains faithful to the original model. The models are stored in plain text, and computations are performed using python-based tools. The method is applicable to both binned and unbinned descriptions of the experimental data, with unbinned models treated in a binned approximation. This flavor of simplified likelihoods is denoted as Simplified Likelihoods with Linearized Systematics (SLLS) in the rest of this paper, to avoid confusion with other simplified likelihood formats.
The paper is organized as follows: the SLLS formalism is presented in detail in Section 2; Section 3 shows a realistic application to an ATLAS search for supersymmetric particles; an application to an unbinned model is presented in Section 4, and Sections 5 and 6 present a discussion of these results and conclusions.

The HistFactory framework
The simplified likelihoods described in this work are based on the HistFactory framework [14], which is widely used in LHC experiments and implemented within both ROOT and pyhf.
It encodes measurements derived from multiple event counts as a set of channels, each corresponding to an independent set of data consisting of one or several counting bins. In each bin, a counting experiment is described using a Poisson distribution. Each expected event yield is expressed as a sum of contributions from several samples, representing both signal(s) and background(s), and each is a function of the POIs and NPs of the model. Systematic uncertainties are represented as NPs that are constrained by external information described by a constraint PDF. This constraint is a representation of a separate auxiliary experiment, sensitive to the value of the NP through the measurement of an auxiliary observable. The full likelihood function is written as

  L(µ, θ) = ∏_{c=1..N_channels} ∏_{b=1..N_bins,c} Pois( n_cb ; Σ_{s=1..N_samples,c} ν_cbs(µ, θ) ) × ∏_{l=1..N_constraints} C_l( θ̃_l ; θ_l )    (2.1)

where the index c runs over the N_channels measurement channels, b runs over the N_bins,c bins in channel c, and s over the N_samples,c samples. The observed event yield in bin b of channel c, denoted by n_cb, is described by the Poisson PDF Pois in terms of the expected yields ν_cbs(µ, θ) for each sample s. The µ and θ refer collectively to the POIs and the NPs, respectively, and the index l runs over the N_constraints constrained NPs θ_l and their respective auxiliary observables θ̃_l. The constraints C_l are in principle arbitrary, but in practice either Poisson or Gaussian forms are used, depending on the properties of the associated systematic uncertainty.
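To make this structure concrete, the following sketch evaluates the negative log-likelihood of a HistFactory-style toy model with one channel, two bins, one POI and a single constrained NP. All yields and uncertainties here are invented for illustration; this is not the HistFactory implementation itself.

```python
import numpy as np
from scipy.stats import norm, poisson

# Toy model: one channel with two bins, a signal sample scaled by the POI mu,
# and a background sample whose yield varies with one constrained NP theta.
sig_nom = np.array([5.0, 2.0])    # nominal signal yields per bin (made up)
bkg_nom = np.array([50.0, 20.0])  # nominal background yields per bin (made up)

def nll(mu, theta, n_obs, theta_aux=0.0):
    """Negative log-likelihood: Poisson terms plus a unit-Gaussian constraint."""
    nu = mu * sig_nom + bkg_nom * (1.0 + 0.10 * theta)  # 10% bkg uncertainty
    return (-poisson.logpmf(n_obs, nu).sum()
            - norm.logpdf(theta, loc=theta_aux, scale=1.0))

n_obs = np.array([57, 21])
print(nll(1.0, 0.0, n_obs))  # NLL at the nominal parameter values
```

Pulling the NP away from its auxiliary observable both distorts the Poisson terms and incurs a constraint penalty, which is the mechanism the SLLS simplification linearizes.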

Simplified likelihoods with linearized systematics
The SLLS formalism introduced in this paper brings two simplifications to the HistFactory description. Firstly, the impact of the NPs on the log-likelihood value is described at linear order only. In particular, the ν_cbs are expressed as a linear function of the NPs,

  ν_cbs(µ, θ) = ν^nom_cbs(µ) + Σ_{k=1..N_NP} ∆_cbsk ( θ_k − θ^nom_k ).    (2.2)

The ν^nom_cbs(µ) are the expected event yields computed at the nominal values θ^nom_k of the NPs. The ∆_cbsk are linear coefficients specifying the impact of θ_k on ν_cbs, for each of the N_NP parameters θ_k. As noted above, the dependence of the ν_cbs(µ, θ) on the parameters of interest µ is described exactly. The linear approximation in the impact of the NPs is also applied to the Poisson distributions, as described in Appendix A. Secondly, the constraints C_l are all assumed to be Gaussian, and are collectively represented as a single multivariate Gaussian PDF with central value θ̃ and inverse covariance matrix Γ.
With these assumptions, the profiled value θ̂_k(µ) = arg max_{θ_k} L(µ, θ) of the parameter θ_k at a given value µ of the POIs can be computed in closed form as

  θ̂(µ) = θ^nom + P(µ)^{−1} Q(µ),    (2.3)

with the vector Q(µ) and the matrix P(µ) given by

  Q_k(µ) = Σ_{c,b} ( n_cb / ν^nom_cb(µ) − 1 ) ∆_cbk + Σ_l Γ_kl ( θ̃_l − θ^nom_l ),    (2.4)

  P_kk'(µ) = Σ_{c,b} ( n_cb / [ν^nom_cb(µ)]² ) ∆_cbk ∆_cbk' + Γ_kk',    (2.5)

where ∆_cbk = Σ_s ∆_cbsk and ν^nom_cb(µ) = Σ_s ν^nom_cbs(µ). The Q_k and P_kk' terms in Eq. 2.3 encode the impact of the data on θ̂_k(µ), while the terms involving Γ_kl originate from the constraint PDFs in the likelihood function. While P_kk' is quadratic in the ∆_cbsk, it generally cannot be neglected, in particular in the case of NPs that are not associated with a constraint PDF.
Using these relations, the profiling of the NPs at a given µ can be performed using simple matrix algebra. The sizes of the matrices are set by the number of NPs, which can be fairly large, in some cases up to O(10^4), but building these matrices and performing multiplication and inversion operations is nevertheless far quicker than the non-linear maximization of the full likelihood function using e.g. a gradient-descent algorithm.
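As a minimal sketch of this closed-form profiling, consider a toy three-bin model with two NPs carrying unit Gaussian constraints, nominal NP values at zero, and invented yields. The matrix relations can be written directly with numpy and cross-checked against a numerical minimization of the likelihood with linear-in-θ yields:

```python
import numpy as np
from scipy.optimize import minimize

# Toy model: 3 bins, 2 NPs with unit Gaussian constraints centred on the
# nominal values (theta_nom = 0, auxiliary observables at 0). Numbers invented.
n     = np.array([12.0, 9.0, 4.0])    # observed yields
nu0   = np.array([10.0, 8.0, 5.0])    # nominal expected yields at this mu
delta = np.array([[1.0, 0.5],         # Delta[b, k]: linear impact of NP k
                  [0.8, 0.3],         # on the yield in bin b
                  [0.2, 0.6]])
gamma = np.eye(2)                     # inverse covariance of the constraints

# Closed-form profiling: solve P theta_hat = Q
Q = delta.T @ (n / nu0 - 1.0)
P = delta.T @ np.diag(n / nu0**2) @ delta + gamma
theta_closed = np.linalg.solve(P, Q)

# Cross-check: numerically minimize the NLL with linear-in-theta yields
def nll(theta):
    nu = nu0 + delta @ theta
    return (nu - n * np.log(nu)).sum() + 0.5 * theta @ gamma @ theta

theta_num = minimize(nll, np.zeros(2)).x
print(theta_closed, theta_num)  # agree up to the residual linearization error
```

The two results differ only by the higher-order terms dropped when linearizing the Poisson dependence, and the matrix solve replaces an iterative minimization entirely.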
While the form given in Eq. 2.2 is used to profile the NPs, the evaluation of the likelihood function uses instead the alternative form

  ν_cbs(µ, θ) = ν^nom_cbs(µ) exp( Σ_k ∆_cbsk ( θ_k − θ^nom_k ) / ν^nom_cbs(µ) ).    (2.6)

This guarantees that ν_cbs(µ, θ) ≥ 0 for all θ, as required for the expected event yield of a Poisson PDF, and provides a suitable approximation to Eq. 2.2 for small values of |θ_k − θ^nom_k|, since the two forms are equal at leading order in this quantity. 1
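The difference between the two forms can be seen in a one-yield toy (illustrative numbers only, not the fastprof API):

```python
import numpy as np

nu_nom, delta = 10.0, 2.0   # nominal yield and linear impact of a single NP

def nu_linear(theta):
    # linear form (Eq. 2.2), used for profiling; can go negative
    return nu_nom + delta * theta

def nu_exp(theta):
    # exponential form used for likelihood evaluation: equal to the linear
    # form at first order in theta, but positive for all theta
    return nu_nom * np.exp(delta * theta / nu_nom)

print(nu_linear(-6.0), nu_exp(-6.0))  # linear form is negative here
```

For small NP deviations the two forms are numerically indistinguishable, while far from the nominal the exponential form remains a valid Poisson mean.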

Implementation and storage format
A python implementation of the SLLS formalism is provided in the fastprof public package 2 . It describes the full statistical model, including both the PDF of the measurement and the observed data. It includes tools to evaluate and profile the likelihood function and perform maximum-likelihood fits, as well as higher-level computations such as hypothesis testing, limit setting and confidence-interval estimation. Other tools are provided to validate the simplified models and perform other operations such as combining or pruning models. The computations make use of the linear algebra routines included in numpy [15] and the minimization routines provided by scipy [16].
The statistical models are stored in a plain-text format based on JSON. The format specifies the POIs, NPs, auxiliary observables, and measurement channels. Each channel is described as a list of samples, specified by the nominal expected bin yields N^nom_cbs, the linear impacts ∆_cbsk of each NP on the expected yields, and an optional normalization factor K(µ) that can be an arbitrary function of µ. The expected yields are then expressed as in Eq. 2.2, with ν^nom_cbs(µ) = N^nom_cbs K(µ)/K(µ^nom), where µ^nom is the value of the POIs for which the nominal yields N^nom_cbs are provided. The format also specifies the observed data, in terms of the observed yields for each bin of each channel and the observed values of the auxiliary observables.
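Schematically, such a specification might look as follows. The field names below are illustrative placeholders only; the exact schema is defined by the fastprof package itself.

```python
import json

# Schematic SLLS-style model specification. Field names are illustrative
# placeholders, not the actual fastprof JSON schema.
model = {
    "model": {
        "POIs": [{"name": "mu", "min": 0.0, "max": 10.0}],
        "NPs": [{"name": "np_bkg", "aux_obs": "aux_bkg"}],
        "channels": [{
            "name": "SR",
            "samples": [
                {"name": "signal", "norm": "mu",
                 "nominal_yields": [5.0], "impacts": {"np_bkg": [0.0]}},
                {"name": "background",
                 "nominal_yields": [50.0], "impacts": {"np_bkg": [5.0]}},
            ],
        }],
    },
    "data": {
        "channels": [{"name": "SR", "counts": [57]}],
        "aux_obs": [{"name": "aux_bkg", "value": 0.0}],
    },
}
print(json.dumps(model, indent=2).splitlines()[1])
```

The point is that the entire model, including observed data and auxiliary observables, round-trips through standard JSON tooling with no experiment-specific software.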

Example
As an illustration, we consider a simple example measurement consisting of a single-bin counting experiment in the presence of both signal and background contributions. The expected background yield is b_0 = 1, with a relative uncertainty of 25%. The background yield is treated as an NP in the fit, associated with a Gaussian constraint with an auxiliary observable b̃ (as would occur in the case where the background is determined from a control region with a sufficiently large number of events). The observed yield is n = 2. The JSON specification of the statistical model is given in Figure 1.
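This setup can also be reproduced numerically in a few lines, profiling the background yield by direct minimization; the sketch below implements the statistical model described above independently of fastprof:

```python
import numpy as np
from scipy.optimize import minimize_scalar

n, b0, sigma_b = 2, 1.0, 0.25   # observed count, nominal bkg, abs. uncertainty

def nll(s, b):
    # Poisson term (up to an (s,b)-independent constant) plus the
    # Gaussian constraint on the background yield b
    return (s + b) - n * np.log(s + b) + 0.5 * ((b - b0) / sigma_b) ** 2

def profiled_b(s):
    """Conditional MLE of b at fixed signal yield s."""
    return minimize_scalar(lambda b: nll(s, b), bounds=(1e-6, 10.0),
                           method="bounded").x

def t_s(s):
    """PLR test statistic -2 log Lambda(s)."""
    glob = minimize_scalar(lambda sv: nll(sv, profiled_b(sv)),
                           bounds=(0.0, 10.0), method="bounded")
    return 2.0 * (nll(s, profiled_b(s)) - glob.fun)

print(round(profiled_b(0.0), 3), round(t_s(0.0), 3))
```

With n = 2 observed against b_0 = 1 expected, the profiled background at s = 0 is pulled above its nominal value, and the test statistic at s = 0 is modest, consistent with only a mild excess.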
The results for the signal yield s are computed using its maximum likelihood estimator (MLE) ŝ and the profile-likelihood ratio

  Λ(s) = L(s, b̂(s)) / L(ŝ, b̂),

where L(s, b) is the likelihood function of the measurement, b̂ is the MLE of b, and b̂(s) its conditional MLE at a fixed value s of the signal yield. The values of Λ(s) can then be used to derive results such as confidence intervals on s or the discovery significance of the signal. Figure 2 shows the values of Λ(s) and b̂(s) computed from the model given in Figure 1 for a range of values of s. In this simple case both results can also be computed in closed form (the conditional MLE b̂(s) is the solution of a quadratic equation obtained by setting the derivative of the log-likelihood with respect to b to zero), and excellent agreement is observed between these expressions and the SLLS results. The asymmetric shape of Λ(s) is driven by the Poisson nature of the measurement, and the good agreement in this case is due to the fact that this feature is accounted for exactly in the simplified likelihood. Using a Gaussian approximation would yield a parabolic shape for Λ(s) that would provide a less accurate description. While the systematic uncertainty on the background yield plays only a small role in this example, the good agreement in the profiled values b̂(s) of the corresponding NP shows that systematic effects are also described accurately within the linear approximation.

Application to an ATLAS search for new phenomena

Full statistical model
This section presents a realistic application of the SLLS framework to a search for new phenomena by the ATLAS collaboration [17] for which the full experimental statistical model has been published [18]. The search targets supersymmetric particles in final states with at least three charged leptons, originating from chargino decays into a lepton and a W, Z or H boson. The analysis considers three signal regions (SRs), targeting signatures with 3 leptons (3ℓ), 4 leptons (4ℓ) and 4 leptons with a fully reconstructed W, Z or H boson (FR). Each signal region is divided into 16 bins of the invariant mass of the trilepton system. Three single-bin control regions are also included to provide data-driven estimates of the main backgrounds, from the Standard Model production of a WZ boson pair, a ZZ pair, or a top-quark pair accompanied by a Z boson (ttZ). The model includes a single parameter of interest, the signal strength µ_signal, and 624 NPs: three unconstrained parameters representing normalization terms for the main backgrounds and 621 constrained parameters representing systematic uncertainties.
The full statistical model of the analysis was published by the ATLAS collaboration as a pyhf model available in the HEPData repository [18]. In this example we consider the case of a chargino with a mass of 500 GeV, with branching ratios to W, Z and H bosons of 20%, 60% and 20% respectively, and equal branching ratios to e, µ and τ for the accompanying lepton.

Simplified model
The SLLS model is computed by taking the nominal event yield for each signal and background sample in each bin of each region from the pyhf model, as published by the ATLAS collaboration. 3 The 1σ impacts of the systematic NPs are similarly determined from the definition of the systematic effects in the pyhf model. The impacts of the background normalization NPs are derived from the relative fractions of the corresponding backgrounds. The conversion is performed using an automated tool included in the fastprof package. The measurement regions of the analysis as implemented in the simplified model are shown in Figure 3.
The profile likelihood scan of the signal strength parameter µ_signal using the simplified model is shown in Figure 4a. A reference scan computed using the full model is also presented for comparison, and shows that the simplified likelihood provides an adequate description of the full result. A simple Gaussian model, using the best-fit value of µ_signal in the observed data computed by pyhf and the corresponding parabolic error, is also displayed and shows worse agreement. The 95% CL_s upper limit on µ_signal computed using the simplified model is 0.126, in good agreement with the value of 0.124 obtained using the full model. The Gaussian model yields a value of 0.114.
The fits to the SLLS likelihood with fixed µ_signal take about 50 ms on a laptop computer equipped with a 16-core Intel i7-10875H CPU. The fit with free µ_signal, which relies on non-linear rather than linear minimization for this parameter (since POIs are treated exactly), takes about 0.5 s. A full-likelihood fit performed with pyhf requires approximately 10 min.
Better performance can however likely be achieved using other pyhf backends interfacing to tensorflow [19] or pytorch [20].
To validate the SLLS linear profiling, the profiled values of selected NPs are shown in Figures 4b and 4c for both the SLLS and the full model. Good agreement is seen between the two cases, illustrating that the original likelihood function is modeled to good approximation at the level of individual NPs. The largest deviation is seen in the scale factor for the ttZ background, amounting to about 30% of the fit uncertainty. As a further illustration, the exclusion contour presented in Figure 9 of the original publication is recomputed using the SLLS models and compared to the full-model results. The results are shown in Figure 5, and good agreement is again observed. An exclusion contour based on Gaussian models built as described above is also presented for comparison and shows similar agreement in this case, in part due to the fact that the signal production cross-sections vary rapidly with the chargino mass.

Binned description of unbinned models
The previous examples use a binned description of the experimental measurement, which employs only two types of PDFs: Poisson distributions to describe the counting experiments in each bin, and Gaussian distributions for the constraints. Another common option is unbinned modeling, which describes the continuous probability distribution of the measurement observables. It is used for instance to study the H → γγ decay of the Higgs boson at the LHC [21,22], as well as in many results published by LHCb (see for instance Refs. [23,24]). It requires support for arbitrary PDF forms, as needed to describe each measurement, and therefore more general and flexible tools than binned models. For LHC measurements, this functionality is usually provided by the roofit package [25] distributed as part of ROOT, but this and other similar tools are not widely used outside the experimental high-energy physics community. While there are recent ongoing efforts to provide more portable alternatives, none is currently in wide use.
A possible way forward is based on the observation that unbinned models can be approximated by binned models with a sufficiently fine binning (see Appendix B). While this approach typically runs into practical difficulties for full likelihoods due to the large number of bins required, it is feasible for simplified likelihoods which are quick to evaluate even for relatively large bin numbers. In the rest of this section, we present the application of the SLLS framework to an unbinned model loosely inspired by an ATLAS H → γγ measurement.
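The following sketch illustrates such a fine-binned approximation for a toy Gaussian-peak-plus-falling-exponential model of the kind discussed below; all numbers here are illustrative:

```python
import numpy as np
from scipy.stats import norm

lo, hi, width = 105.0, 160.0, 0.1
n_bins = int(round((hi - lo) / width))      # 550 bins of 0.1 GeV
edges = np.linspace(lo, hi, n_bins + 1)

def expected_yields(mu, peak=125.0, res=1.5, slope=0.02,
                    n_sig=100.0, n_bkg=10000.0):
    """Per-bin expected yields: integrals of the signal (Gaussian peak) and
    background (falling exponential, normalized on [lo, hi]) over each bin."""
    sig_cdf = norm.cdf(edges, loc=peak, scale=res)
    bkg_cdf = (1.0 - np.exp(-slope * (edges - lo))) / \
              (1.0 - np.exp(-slope * (hi - lo)))
    return mu * n_sig * np.diff(sig_cdf) + n_bkg * np.diff(bkg_cdf)

nu = expected_yields(1.0)
print(len(nu), round(nu.sum(), 1))
```

Evaluating a few hundred per-bin yields in vectorized form is cheap, which is what makes the fine-binned SLLS treatment of unbinned models practical.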

Full model example
We consider a simple example based on the ATLAS H → γγ analysis of Ref. [21]. The analysis uses an unbinned model based on the distribution of the invariant mass m γγ of the two photons in the range 105 < m γγ < 160 GeV. The Higgs boson signal manifests itself as a sharp peak in the m γγ distribution, with a position close to the Higgs boson mass and a width of 1.1-2.1 GeV depending on event kinematics. The background contributions follow smoothly falling shapes. Several signal regions (referred to as categories in the rest of this section) are defined according to the properties of the signal photons and of the rest of the event.
The example uses a simplified description of the 33 categories defined in Ref. [21] to study Higgs boson production in the gluon-fusion process. The signal and background distributions are represented respectively by Gaussian and exponential distributions, instead of the more complex shapes used in Ref. [21]. The peak position and width of the Gaussian, as well as the expected signal and background yields, are taken from Ref. [21], while the exponential slope of the background is assumed to be −0.02 GeV^−1 in all categories. The background normalizations and exponential slopes are free to vary in the fit, except for the slopes in five low-statistics categories which are kept fixed to avoid unstable fits. Five NPs are used to describe the leading systematic uncertainties: the uncertainty on the integrated luminosity of the dataset; on the reference cross-section for the gluon-fusion production process; on the effect of parton shower modeling on the signal yields; on the H → γγ reconstruction efficiency; and on the photon energy resolution. This last uncertainty leads to a change in the width of the signal peak and therefore induces highly non-linear effects in the per-bin signal yields in a binned description of the likelihood. Systematic uncertainties on the background model are implemented using separate NPs in each category, following the spurious signal method described in Ref. [21]. The values of the uncertainties listed above are all taken from Ref. [21]. In total, 99 NPs are defined. The single POI is the Higgs boson signal strength µ, applied as a scaling factor to the expected signal yield in all categories. A dataset of events randomly generated from the model PDF is used as the "observed" data in this example.

Simplified model
The SLLS model is built as a binned approximation to the full model. A fine binning is required to obtain an accurate description of the signal peak. In this example a uniform bin width of 0.1 GeV is used, leading to 18150 bins in total for the 33 categories 4 . The m γγ distributions for two selected categories (the first and last in the order used in Ref. [21]) are shown in Figure 6.
The conversion is performed using an automated tool distributed as part of the fastprof package. All the NPs are retained, and their effect is described in terms of their linear impact on the event yield in each measurement bin, following the SLLS procedure. In some cases, in particular for normalization parameters such as the one shown in Figure 7a, impacts are linear by construction. By contrast, the photon energy resolution systematic shown in Figure 7b exhibits non-linear behavior, since it induces a change in the width of the signal peak which does not propagate linearly to the bin contents. Non-linearities become larger closer to the tails of the signal peak, but with a smaller impact on the results due to lower signal yields. The linear approximation in any case typically remains accurate for small deviations of the NP from the nominal.

Figure 8a shows the profile likelihood scan for the signal strength parameter µ obtained with the linearized model. The reference result obtained with the full unbinned likelihood, computed using the RooFitUtils package 5 , is also shown for comparison, and excellent agreement is observed. The resulting 68% CL likelihood intervals are µ = 1.082 +0.117 −0.093 for the full model and µ = 1.082 +0.113 −0.093 for the simplified model. A fully Gaussian approximation, constructed as described in the previous section, yields µ = 1.082 ± 0.098. The fits take about 15 min to perform on the full model, compared to about 50 ms and 1 s for simplified-likelihood fits with respectively a fixed and a free µ.
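The non-linearity induced by a width-changing NP, as for the photon energy resolution above, can be illustrated with a toy Gaussian peak. The numbers below are invented, and the ±1σ width variation is exaggerated to 50% to make the effect visible:

```python
import numpy as np
from scipy.stats import norm

def bin_yield(width, edges=(124.5, 125.5), peak=125.0, n_sig=100.0):
    """Signal yield in one bin under the peak, for a given resolution width."""
    lo, hi = edges
    return n_sig * (norm.cdf(hi, peak, width) - norm.cdf(lo, peak, width))

nom = bin_yield(1.5)              # nominal width
up  = bin_yield(2.25)             # +1 sigma: width increased by 50%
slope = up - nom                  # linear impact per +1 sigma of the NP
exact_2sig  = bin_yield(3.0)      # exact yield at +2 sigma
linear_2sig = nom + 2.0 * slope   # linear extrapolation to +2 sigma
print(round(exact_2sig, 2), round(linear_2sig, 2))
```

At ±1σ the linear impact is exact by construction, but extrapolating it to ±2σ visibly undershoots the true yield, which is the behavior seen in Figure 7b.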
To better compare the treatment of systematic effects, the profiled values of the five NPs describing the leading systematic uncertainties are shown in Figure 8b. The agreement between the simplified and the full model is found to be accurate to about 10% of the parameter uncertainties. This agreement is crucial to the description of this example since the uncertainty on µ is dominated by systematic effects. In particular, the profiling of the photon energy resolution systematic (which represents about 20% of the total uncertainty) shows good agreement with the full model in spite of the non-linear effects highlighted in Figure 7b. Figure 8c shows the difference between the profiled values of the other NPs in the simplified and full models, normalized to their fit uncertainty. This difference is below 10% of the fit uncertainty for about 80% of the parameters.

Discussion
As observed in the examples shown in this paper, linearized NP impacts provide a generally adequate approximation of their behavior in the full model, in particular in the description of systematic effects. It can be noted that the approximation approaches the exact description in situations where the impact of the NP is naturally linear, such as the case shown in Figure 7a. Discrepancies are expected in cases which deviate from this ideal configuration, in particular for:
• Large non-linear systematic uncertainties, with effects that are not fully accounted for in the linear approximation, such as the one shown in Figure 7b.
• Asymmetric systematic uncertainties, with different impacts for NPs above or below their nominal value. These effects cannot be included in the linearized profiling, although they can be taken into account for the evaluation of the likelihood function.
• Low expected event yields, leading to Poisson counts that are not well-described by Gaussian distributions. While the Poisson distribution itself is described exactly in the SLLS formalism, non-linearities can occur due to systematics on the expected event yield, since the Poisson PDF does not depend linearly on its expected yield.
These situations are partially covered in the examples described in this paper, and it is encouraging that, in these cases at least, the linear description seems adequate. Simplified likelihoods should nevertheless be carefully validated against the full model in each case. Tools to perform these checks are included in the fastprof package, using methods similar to those shown in this paper. Another limitation to take into consideration is the memory footprint of the ∆_cbsk coefficients which encode the linear impacts: their number is given by N_NPs × N_bins × N_samples, which can reach O(10^8) or more for complex models. For models with large numbers of bins and NPs, such as converted unbinned models, memory constraints can be more stringent than those related to computation times, since the computations mainly involve matrix operations that remain efficient even for large models.
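A back-of-the-envelope estimate of this footprint, for illustrative model sizes and assuming dense float64 storage:

```python
# Illustrative sizes: e.g. a converted unbinned model with 100 NPs,
# 18150 bins and 5 samples per bin, stored as dense 8-byte floats.
n_nps, n_bins, n_samples = 100, 18150, 5
n_coeffs = n_nps * n_bins * n_samples
print(n_coeffs, "coefficients,", n_coeffs * 8 / 1e6, "MB")
```

Even this moderately sized example reaches tens of MB; models approaching O(10^8) coefficients would need several GB unless sparse or compressed storage is used.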

Conclusion
Simplified likelihoods provide a convenient setting for the reuse of experimental results, and functionality that is complementary to that of full models and Gaussian approximations. The SLLS framework is based on a linear description of NP impacts which provides an approximation with two main benefits: it describes the Poisson behavior of counting measurements exactly, and it preserves the NPs of the full model and therefore a fully granular description of its systematic uncertainties. Both of these aspects make it well-suited to the description of LHC measurements, where systematic uncertainties and non-Gaussian effects from low event counts (e.g. in tails of distributions) both play important roles. The preservation of the NP structure allows in particular a proper treatment of correlated systematic effects when performing combinations of measurements, by identifying parameters associated with identical sources of uncertainty in the combination inputs. Since the POIs of the original model are also preserved, reinterpretations of the simplified model can be performed as for the original model. These properties have particular relevance to global combinations of LHC measurements, for instance those performed in the context of effective-field theory models [26][27][28][29][30], which are based on measurements dominated by systematic uncertainties as well as others performed in high-momentum regions with low expected event counts. Currently these combinations are typically performed under Gaussian approximations without accounting for correlated uncertainties and Poisson behavior, and the use of simplified likelihoods could improve their accuracy. SLLS models can be built automatically from binned likelihoods implemented using the HistFactory formalism within the ROOT and pyhf frameworks, or from unbinned likelihoods using binned approximations.
An implementation of the SLLS framework is provided in the fastprof package at https://github.com/fastprof-hep/fastprof. The models are stored in a plain-text JSON format, and computations and other operations are performed using python tools based on the widely available numpy and scipy libraries. Together with other full and simplified likelihood formats with complementary functionality, it is hoped that the SLLS framework will encourage the further publication of detailed statistical models by LHC experiments and beyond.

A Linearization procedure
This section provides a sketch of the derivation of the profiled value θ̂_k(µ) of the NP θ_k given by Eq. 2.3. Starting from the likelihood in Eq. 2.1 and applying the linearization procedure described in Section 2.2, we obtain the negative log-likelihood

  λ(µ, θ) = Σ_{c,b} [ ν_cb(µ, θ) − n_cb ln ν_cb(µ, θ) ] + ½ ( θ − θ̃ )ᵀ Γ ( θ − θ̃ )    (A.1)

up to an additive constant. In the expression above, indices c, b and s run respectively over measurement channels, bins within each channel, and event samples. The ν_cb(µ, θ) = Σ_s ν_cbs(µ, θ) are the total expected event yields for the corresponding channel bin, and the per-sample yields ν_cbs(µ, θ) are given by Eq. 2.2. The Gaussian constraints on the NPs are parameterized using the auxiliary observables θ̃ and the inverse covariance matrix Γ.
The derivative of λ(µ, θ) with respect to the NPs θ is

  ∂λ/∂θ_k = Σ_{c,b,s} ( 1 − n_cb / ν_cb(µ, θ) ) ∆_cbsk + [ Γ ( θ − θ̃ ) ]_k,    (A.2)

where ∆_cbs is the vector with components ∆_cbsk, the linear impacts of the parameter θ_k on ν_cbs. The linear approximation of NP impacts is applied to the denominator as

  n_cb / ν_cb(µ, θ) ≈ ( n_cb / ν^nom_cb(µ) ) ( 1 − Σ_{k'} ∆_cbk' ( θ_k' − θ^nom_k' ) / ν^nom_cb(µ) ),    (A.3)

so that

  ∂λ/∂θ = P(µ) ( θ − θ^nom ) − Q(µ),    (A.4)

with Q(µ) and P(µ) defined by Eqs. 2.4 and 2.5, and the profiled values θ̂(µ) defined by ∂λ/∂θ (µ, θ̂(µ)) = 0 are therefore given by Eq. 2.3.

B Binned approximation to an unbinned PDF
We consider an extended unbinned PDF for an observable x,

  P(x_1 ⋯ x_n ; θ) = ( e^{−N(θ)} N(θ)^n / n! ) ∏_{i=1..n} f(x_i, θ) dx_i,    (B.1)

where f(x, θ) is the PDF for one observation of x, the dataset consists of the values x_1 ⋯ x_n, θ are the model parameters, and the expected number of observations is N(θ). We include the infinitesimal volume elements dx_i in the expression since these will be useful below.
We introduce a set of bins B_a, a = 1 ⋯ N_bins, that span the allowed range of x. In the spirit of finite-element analysis, we approximate f(x, θ) by a form that is constant over each bin,

  f(x, θ) ≈ Σ_{a=1..N_bins} I_a(x) f_a(θ),   with   f_a(θ) = (1/w_a) ∫_{B_a} f(x, θ) dx,

where the indicator I_a(x) is 1 if x ∈ B_a and 0 otherwise, and w_a = ∫ I_a(x) dx is the measure of bin B_a. The value f_a(θ) is the average of f(x, θ) over the bin B_a, so that for a sufficiently fine binning and smooth f(x, θ), f(x, θ) ≈ f_a(θ) for x ∈ B_a.
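A quick numerical check of this approximation, for a toy unit-Gaussian f(x) and a uniform binning:

```python
import numpy as np
from scipy.stats import norm

edges = np.linspace(-5.0, 5.0, 201)     # 200 uniform bins B_a over x
w = np.diff(edges)                       # bin measures w_a
f_a = np.diff(norm.cdf(edges)) / w       # bin averages f_a of a unit Gaussian

# The piecewise-constant approximation integrates to (almost) 1 ...
print((w * f_a).sum())
# ... and tracks f(x) closely at the bin centres for this binning
centres = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(f_a - norm.pdf(centres))))
```

The residual deviation shrinks quadratically with the bin width, which is why a sufficiently fine binning makes the binned likelihood a faithful stand-in for the unbinned one.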
One can remove the explicit dependence on the x_i by integrating them out of the likelihood. The integration of the product term of Eq. B.1 can be written as

  ∏_{i=1..n} ∫_{B_{a_i}} f(x_i, θ) dx_i = ∏_{i=1..n} w_{a_i} f_{a_i}(θ) = ∏_{a=1..N_bins} [ w_a f_a(θ) ]^{n_a},

where a_i is the index of the bin to which x_i belongs, n_a is the number of observations that fall in bin B_a, and we have used the fact that the x_i are independent to propagate the integral through the product. Returning to the full expression of Eq. B.1, we can write the likelihood as a function of the n_a as

  P(n ; θ) = ( e^{−N(θ)} / ( n_1! ⋯ n_{N_bins}! ) ) N(θ)^n ∏_{a=1..N_bins} [ w_a f_a(θ) ]^{n_a},    (B.2)

after including an additional multiplicative factor n!/(n_1! n_2! ⋯ n_{N_bins}!) to account for the number of different orderings of the x_i that can yield a given set of n_a. One can introduce the per-bin expected yields

  N_a(θ) = w_a f_a(θ) N(θ)    (B.3)

and note that, since Σ_a w_a f_a(θ) ≈ 1 by the normalization of f, one has Σ_a N_a(θ) ≈ N(θ) as well as n = Σ_a n_a, so that Eq. B.2 factorizes as

  P(n ; θ) = ∏_{a=1..N_bins} e^{−N_a(θ)} N_a(θ)^{n_a} / n_a! = ∏_{a=1..N_bins} Pois( n_a ; N_a(θ) ),    (B.4)

i.e. the binned dataset is described by independent Poisson counts in each bin, with expected yields N_a(θ).