Testable Likelihoods for Beyond-the-Standard Model Fits

Studying potential BSM effects at the precision frontier requires accurate transfer of information from low-energy measurements to high-energy BSM models. We propose to use normalising flows to construct likelihood functions that achieve this transfer. Likelihood functions constructed in this way provide the means to generate additional samples and admit a ``trivial'' goodness-of-fit test in form of a $\chi^2$ test statistic. Here, we study a particular form of normalising flow, apply it to a multi-modal and non-Gaussian example, and quantify the accuracy of the likelihood function and its test statistic.


I. INTRODUCTION
Contemporary experimental analyses at the Large Hadron Collider have, so far, not been able to discover particles beyond the Standard Model (BSM) at energy scales below ≃ 1 TeV.As a consequence, model building has increasingly turned toward using effective field theories (EFT) to describe any potential BSM effects below these scales.The Standard Model Effective Field Theory (SMEFT) [1] is one of the main choices, as is the Higgs Effective Field Theory (HEFT) [2].Constraining the EFT parameters is a challenge, due the large dimensionality of the parameter space.Taking the SMEFT as an example, the basis of operators in the leading mass-dimension six Lagrangian amounts to 2499 independent EFT Wilson coefficients [3].The observed hierarchies among masses and mixing of quark flavours at low-energies has inspired systematic approaches like Minimal Flavour Violation to reduce the number of free parameters [4].
Low-energy phenomena beside quark mixing have long been used to further our understanding of BSM physics at large scales; flavour-changing processes and anomalous electric dipole moments are excellent examples therof, which contribute substantial statistical power to constrain BSM effects [5].Flavour-changing processes in particular are commonly interpreted in a "model-independent" fashion by inferring their relevant parameters within another EFT, the Weak Effective Theory (WET) [6][7][8].However, including low-energy constraints in this way provides for a challenge: the interpretation of a majority of the low-energy constraints relies on our understanding of hadronic physics in some capacity.This leads to a proliferation of hadronic nuisance parameters, which renders a global interpretation of all WET parameters impractical if not practically impossible.Examples for this proliferation are plentiful in bquark decay and include analyses of exclusive b → sℓ + ℓ − processes with 113 hadronic nuisance parameters and only 2 parameters of interest [9] as well as analyses of exclusive b → uℓ − ν processes with 50 hadronic nuisance parameters and only 5 parameters of interest [10].To overcome this problem, the following strategy has been devised [11]: • divide the free parameters of the WET into so-called "sectors", which are mutually independent to leading power in G F ; • identify sets of observables that constrain a single sector of the WET and infer that sectors parameters; • repeat for as many sectors as possible.* anja.beck@cern.ch† merilreboud@gmail.com‡ danny.van.dyk@gmail.comIn this way, statistical constraints on the parameters in the individual WET sectors in form of posterior densities can be obtained, providing a local picture of the low-energy effects of BSM physics.
To gain a global picture of the BSM landscape, either in terms of the parameters of the SMEFT, the HEFT, or a UV-complete BSM model, the previously obtained local constraints for individual WET sectors can, in principle, be used as a likelihood function.On the physics side, this requires matching between the WET and the genuine parameters of interest, i.e, the parameters of a UV-complete model, the SMEFT, or the HEFT.In the case of the SMEFT, a complete matching of all dimension-6 operators to the full dimension-6 wet has been achieved at the one-loop level [12].
To date, the strategy outlined above has not yet been implemented: no library of statistical WET constraints is available.The key obstacles in the implementation are not specific to the underlying physics.Instead, they are the accurate transfer of the statistical results for even a single sector; the potential to test a likelihood's goodness of fit through a suitable test statistic; and the ease of use by providing a reference code that exemplifies the approach. 1 Here, we present a small piece to solving the overall puzzle of how to efficiently and accurately include the low-energy likelihoods in BSM fits: the construction of a likelihood function that encodes the WET constraints and that admits a test statistic.Our approach uses methods from the field of automatized learning and generational models.We illustrate the approach at the hand of a concrete example likelihood, for which we use posterior samples for the WET parameters obtained in the course of a previous analysis [10]; cf. also a direct determination of SMEFT parameters in a similar setup [14].Our example likelihood is multi-modal, and the shape of each mode is distinctly non-Gaussian.This renders the objective of providing a testable likelihood function in terms of the BSM parameters quite challenging.
The possible applications for obtaining such a likelihood function are two-fold: first, it permits to generate additional samples that are (ideally) identically distributed as the training samples.Second, it provides a test statistic to use in a subsequent EFT fit.We illustrate how normalising flows make both applications possible, by translating our example posterior "target" density to a unimodal multivariate Gaussian "base" density.We propose a number of tests to check the quality of this translation, gauging the validity of both applications.

II. PRELIMINARIES A. Notation and objective
We begin by introducing our notation for the physics parameters and their associated statistical quantities.Let P * T ( ⃗ ϑ | WET) be the "true" posterior density for the low-energy WET parameters ⃗ ϑ ∈ T ≡ R D , also known as the WET Wilson coefficients.We refer to the vector space T as the target space of dimension D. In our later example, we will consider T to be the space of WET Wilson coefficients in the ubℓν sector as studied in Ref. [10].We assume that we can access P * T numerically by means of Monte Carlo importance samples.Our intent is to construct a likelihood where P T is a model of P * T and ⃗ φ ∈ F represents some set of BSM physics parameters within some vector space F .In an envisaged SMEFT application, F would represent the vector space of SMEFT Wilson coefficients.We stress at this point that, although P ( * ) T is a probability density, L is in general not a probability density and is hence labelled a likelihood function.
Next, we introduce a bijective mapping f between our target space T and some base space Our objective is to find a mapping f and its associated Jacobian 1 An alternative to this strategy is to use the most constraining observables for each WET sector as low-energy likelihoods and use the latter directly within a global (SMEFT) likelihood.Despite the large number of nuisance parameters, this approach has been shown to be very useful and has been implemented as part of the smelli software [13].However, it currently comes with the drawback of repeating the low-energy analyses for every point in the global (here: SMEFT) parameter space, leading to a waste of computational resources compared to the strategy proposed above.Improvements to smelli overcoming this issue are presently under development.
that ensures that P B ( ⃗ β) is multivariate standard Gaussian density: ).As a consequence, one finds immediately that the two-norm in base space follows a χ 2 distribution: The likelihood function is thus fully defined by the mapping f .A natural test statistic is provided by the χ 2 statistic for the squared 2-norm in base space.

B. A normalising flow for a real-valued non-volume preserving model
The framework of normalising flows [15] has been developed with the explicit intent to transform an existing probability density to a (standard)normal density, i.e., to "normalise" the density.To this end, f is constructed as a composition of K individual bijective mapping layers f (k) , Here, we restrict ourselves to a particular class of mapping layer, following the real non-volume preserving (RealNVP) model [16]: the affine coupling layers.This class of layers are chosen here for their proven ability to map multi-modal densities of real-valued parameters to a standard normal through non-volume-preserving transformations [16].Each of these layers has the structure In the above, ⃗ x = (x 1 , x 2 , . . ., x D ) T , and p is a permutation operation with a trivial Jacobian |J p | = 1.The power of the RealNVP model lays in the use of affine coupling layers a (k) [16].
Each of these layers splits its input vector into two parts ⃗ x = (⃗ x T lo , ⃗ x T hi ) T , where d lo ≡ dim ⃗ x lo = ⌈D/2⌉ and d hi ≡ dim ⃗ x hi = ⌊D/2⌋.The affine coupling layers then map their inputs to where s (k) and t (k) are real-valued functions R d lo → R d hi , and ⊙ and ⊕ indicates element-wise multiplication and addition.Since both s (k) and t (k) are independent of ⃗ x hi , the Jacobian for each affine coupling layer a (k) takes the simple form: There are no further restrictions on the functions s (k) and t (k) and they are commonly learned from data using a neural network [17].

III. PHYSICS CASE: SEMILEPTONIC B DECAYS
To illustrate the viability of our approach, we carry out a proof-of-concept (POC) study using existing posterior samples [18] from a previous BSM analysis [10].This analysis investigates the BSM reach in exclusive semileptonic b → uℓν processes within the framework of the ubℓν sector of the WET: Here C ℓ i represents a WET Wilson coefficient (i.e., a free parameter in the fit to data) while O ℓ i represents a local dimension-six effective field operator (i.e., the source of the 50 hadronic nuisance parameters in the original analysis [10]).The assumptions inherent to that analysis lead to a basis composed of five WET operators: For the purpose of a BSM study, only the marginal posterior of the WET parameters is of interest.One readily obtains this marginal posterior by discarding the sample columns corresponding to any (hadronic) nuisance parameters.A corner plot of this marginal posterior originally published in Ref. [10] is shown in Fig. 1.As illustrated, the 5D marginal posterior is very obviously non-Gaussian with a substantial number of isolated modes.The samples obtained in this analysis are overlaid with a kernel-density estimate (KDE) of the marginalised posterior.Although estimating the density is one of our main goals, the use of a KDE is neither sufficient nor advisable for the purpose we pursue.First, KDEs are notorious for their computational costs.Second, KDEs are sensitive to misestimation of densities close to the boundaries of the parameter space.Last but not leas, KDEs do not provide a test statistic.
The objective of the POC study is now to determine whether a generative and testable likelihood can be constructed from the available posterior samples by employing normalising flows.To achieve this objective, we investigate a twodimensional subset of our posterior samples before we progress to the full five-dimensional problem.For both cases, we use normalising flows f of the form shown in Eq. ( 5), i.e., a sequence of K individual mapping layers Eq. ( 6), each consisting of a composition of an affine coupling layer [16] and a permutation.This class of mapping layer is implemented as part of the normflows software package [17], which we use to carry out our analysis.The normflows software is based on pytorch [19], which is used in the training of the underlying neutral networks.
We train the normalising flow on a total of 144k posterior samples using a total of K = 32 mapping layers.Each mapping layer is associated with a 4-layer perceptron, consisting of the input layer with ⌈D/2⌉ nodes, two hidden layers with 64 nodes each, and the output layer with 2 × ⌊D/2⌋ nodes.The outputs correspond to the vector-valued functions s (k) and t (k) as used in Eq. ( 6).We use the default choice of loss function provided by normflows, the "forward" Kullback-Leibler divergence [20] The loss function is minimised with respect to the neural network parameters ⃗ α using the "Adam" algorithm [21] as implemented in pytorch.Although we use a total of 10k optimisation iterations, we find that a close-to-optimal solution is found around the 2k iterations mark in both cases.We show the evolution of the loss functions in terms of the number of optimisation steps in Fig. 2.

A. Application to a two-dimensional subset of samples
In a first step, we perform a POC study at the hand of a subset of WET Wilson coefficients.The target space only consists of the two parameters: Our choice of 2D example exhibits two difficult features in target space as illustrated in Fig. 1.First, the 2D marginal posterior is multi-modal.Second, each mode is distinctly non-Gaussian and its isoprobability contours resemble a bean-like shape.In the following discussion, we will frequently refer to these shapes simply as "beans".The central objective of this 2D POC study is to train a normalising flow f such that the base space variables have a bivariate standard normal distribution: ⃗ β ∼ N 2 ( ⃗ 0, 1).

a. Distribution of the parameters in target and base space
In Fig. 3 we show the result of the training in the target and base spaces.The top plots contain the model in the target space after training and the base distribution used as starting point for the transformation.The bottom plots show the distribution of the posterior samples in the target space, used in the training, and after transformation to the base space.Visually, the trained model in the target space (top left) captures the main features of the training sample (bottom left).Nevertheless, the model contains a faint filament connecting the left and right beans (top left), which is not present in the true posterior samples (bottom left).Although this additional structure becomes less prominent as the value of the loss function decreases, it never fully disappears using our analysis setup.This additional feature translates to a less populated line across the sample distribution in the base space (bottom right).This line however becomes invisible after a sufficient number of training iterations as is shown in Fig. 4, where we juxtapose two transformation of the samples of P * T from T space to B space using the normalising flows trained with 2000 and trained with 10000 iterations.We conclude that the presence of the filament requires careful diagnosis if the RealNVP model can be used to achieve our stated objective: to obtain a generative and testable likelihood function.

b. Comparison of the modelled and true distributions
We investigate the quality of our trained normalising flows with a comparison of the modelled and true distributions.The comparison is first performed using histograms of the distributions in both the base and the target space.The value n i of the true distribution in bin x i is obtained by dividing the number of samples in the bin by the total number of samples N .The value of the modelled distribution is represented by the value of the modelled density at the bin centre, P X (x i ), multiplied with the area of the bin, A i ≡ A(x i ).Our measure for the comparison is the deviation defined as where the uncertainty on the true distribution in bin x i is assumed to be The deviation is shown in Fig. 5.As expected, the deviation is close to zero in highly populated regions and the connecting filament between the beans in the target space is clearly visible.The larger deviations in sparsely populated regions are a result of the limited size of the data sample, where the red bins contain a single data point and the blue regions are empty bins.

c. Sample classification with a BDT
The next step is to quantify the quality of the normalising flows in regard to the two primary purposes, generating samples from the posterior and determining the test statistic for the actual distribution.To this end, we train classifiers to distinguish samples generated from the true and modelled posterior distributions on the sample position in the base space, the position in target space, and the density in target space together with ∥ ⃗ β∥ 2 .In all three cases, the classification is carried out with a Boosted Decision Tree (BDT) [22] with the default settings of the XGBClassifier class provided by the python library xgboost [23].
The data generated from the true posterior is split into two equal parts, which are then only used for training and testing respectively.A sample of the same statistical power is generated from the modelled distribution and also split into two equal parts.The receiver-operator statistic is the dependence of the true positive rate on the false positive rate on the testing sample.The area under its curve (AUC), for a set of features x used in the classification, is a measure for the quality of the classification.Perfect classification results in AUC(x) = 1, while random classification results in AUC(x) = 0.5.

d. Differences between the modelled and the true posterior density
Training and testing the BDT on the position in target and base space, results in AUC( ⃗ ϑ) = 0.577 ± 0.002 and AUC( ⃗ β) = 0.549 ± 0.003 , respectively.These low AUC values indicate that the true and modelled density are very similar.However, they are significantly larger than 0.5 showing that the BDT can find exploitable differences.The smaller AUC score in the base space compared to the target space is in line with the observation of a small connecting artefact in target space and the resulting diagonal gap in the base space, as discussed in relation to Fig. 4 and Fig. 5.As a sanity check, the BDT is also trained on two samples which are both generated from the model resulting in AUC values compatible with 0.5 as expected for two indistinguishable samples.
e. Testing the modelled test statistic Finally, Fig. 6 shows the distribution of the 2-norm of the data samples in the base space.They appear to follow a two-dimensional χ 2 distribution as anticipated in Eq. ( 4).We determine the reliability of the test statistic by investigating the ability of a BDT to distinguish the true and modelled posterior densities.This BDT is trained on two variables: the posterior density in the target space and the 2-norm in the base space, ∥ ⃗ β∥ 2 , which resembles a χ 2 distribution.The posterior density is not available as a function.We are therefore forced to use an empirical estimation.Because the value of the density relies on the data itself, the AUC value for classification of two samples generated from the same model is larger than 0.5 and depends on the estimation method used; we investigate KDEs with different kernels as well as nearest neighbours density estimation.
The AUC scores for classification of samples generated from the true and modelled posterior density are al- ways larger than the corresponding value for classification of two model-based samples.The AUC value, obtained from estimating the density from the number of points with a distance below 0.015 around a given point, is AUC(density) = 0.614 ± 0.006 which is an increase of 0.039 with respect to the baseline value calculated when classifying two model-based samples.
f. Conclusion of the two-dimensional study Our conclusion for this 2D POC study is that our approach is a viable one and provides both an evaluatable likelihood and a reliable test statistic.Publishing 2D WET posteriors in form of a normalising flow therefore clearly beats current best practices such as Gaussian approximations or Gaussian mixtures models (as for example used in Ref. [10]) in terms of both usefulness and reliability.We find that the presence of the filament does not adversely affect achieving our stated objectives at the current level of precision.We leave an investigation as to other choices of mapping layers, in particular autoregressive rational quadratic splines as used in Ref. [24], to future work.Instead, we move on to apply our approach to the full five-dimensional POC.

B. Application to the full five-dimensional samples
We carry out the same analysis on the full five-dimensional posterior of Fig. 1.This is a much more complicated case in terms of possible artefacts between beans.Since this POC study is only illustrative, we do not try to optimise the training of the normalising flows and keep the same setup as for the two-dimensional case.As visual comparisons of the distributions suffer from loss of information due to the projection from five to two dimension, we rely on the BDT tests that we introduced for the two-dimensional case.Fig. 7 shows the distribution of the 2-norm of the samples transformed to the five-dimensional base space.The distribution is overall consistent with a χ 2 distribution with five dimensions.the residuals reveal small systematic differences between the shape of the distribution of the data points and the expected PDF.In particular, the residuals tend to aggregate positive values at small χ 2 < 5 and negative values at large 10 < χ 2 < 20.This conclusions are confirmed by the BDT classification in the base space which reaches an AUC score of AUC( ⃗ β) = 0.5708 ± 0.0005 .
This value is slightly larger than the value obtained in Sec.III A but corroborates that the data sampled from the true posterior distribution resembles a five-dimensional Gaussian distribution after transformation with the trained normalising flow.
On the other hand, the closeness of the true and modelled distributions in the target space is much less convincing.The BDT classification in the target space reaches AUC values close to one.Similarly, training and testing the BDT on the value of the posterior density in target space and the 2-norm in base space gives AUC(density) = 0.9150 ± 0.0005 which confirms a poor modelling of the true distribution.
We conclude that the full five-dimensional variable space presents a more complex challenge compared to the twodimensional POC.The basic normalising flow training as pursued in this work does not provide a satisfactory model for the multi-modal distribution in target space.This is not surprising since we use the same setup for the two-and five-dimensional tests.The small number of iterations needed to train the five-dimensional normalising flow, visible in Fig. 2, indicates that more complicated models can be trained to achieve better performances.The tuning of the model and the training procedure are however beyond the scope of this paper.

IV. SUMMARY
We show that normalising flows can overcome two major hurdles to the full exploitation of statistical constraints on EFT parameters stemming from phenomenological analyses of low-energy processes.On the one hand, they enable sampling from a previous analysis' posterior distribution by converting it into a simple multivariate Gaussian density.On the other hand, they provide a simple test statistics for the distribution.
We investigate this procedure by training a RealNVP normalising flow to a physics case that exhibits multimodal posterior densities in two and five dimensions.The compatibility of the modelled and the empirical posterior distributions are examined using different statistical tools.
Our conclusions are as follows.When facing multi-modal distributions, the modelling of the true posterior densities leads to the presence of artefacts as observed here.Nevertheless, these artefacts do not prevent the successful usage of normalising flows as proposed here.Instead, they require careful (supervised) learning of the features of the true posterior density.In the two-dimensional case, we have illustrated that both of our objectives, the sampling and the presence of a test statistics, are achieved.However, applying our approach to the five-dimensional case without modification leads to less satisfactory performances.Improvements to performance through detailed modelling, at the cost of more computationally-intensive training, are left for future work.
We are currently modifying the EOS software [25] to provide users with an interface that transparently makes use of our proposed method of constructing low-energy likelihoods for existing and future analyses of flavour-changing processes.

FIG. 1 .
FIG. 1. Marginalized 5D Posterior for the parameters of interest ⃗ ϑ in the ubℓν example.The black dots indicate the distribution of the posterior samples in form of a scatter plot.The blue-tinted areas correspond to the 68%, 95%, and 99% probability regions as obtained from a smooth histogram based on kernel-density estimation.

5 FIG. 2 .
FIG.2.Evolution of the loss function for the two-and five-dimensional case as the blue solid and red dashed lines, respectively.The loss values have been scaled so that all values are confined to the interval [0, 1].

2 FIG. 3 .
FIG. 3. Modelled (top) and empirical (bottom) distributions in the target (left) and base (right) space.Top: model in the target space after training (left) and base space (right).Bottom: distributions of the samples of the true posterior used for training the normalising flows in the target space (left) and after transformation to the base space using the trained flow (right).The circles in the base space serve only to guide the eye and show the contours || ⃗ β||2 = 1 and || ⃗ β||2 = 9.