Using Machine Learning techniques in phenomenological studies in flavour physics

An updated analysis of New Physics violating Lepton Flavour Universality, by using the Standard Model Effective Field Lagrangian with semileptonic dimension six operators at $\Lambda = 1\,\mathrm{TeV}$ is presented. We perform a global fit, by discussing the relevance of the mixing in the first generation. We use for the first time in this context a Montecarlo analysis to extract the confidence intervals and correlations between observables. Our results show that machine learning, made jointly with the SHAP values, constitute a suitable strategy to use in this kind of analysis.


Introduction
The experimental measurement of Lepton Flavour Universality Violating (LFUV) processes in B meson decays, in tension with the Standard Model (SM) predictions, would represent a clear sign for physics beyond the SM. For the b → s + − processes, observables such as the ratios of branching fractions R K ( * ) , provides evidence of LFUV and are of particular interest because much of the theoretical uncertainty cancels in the ratio. It is well known that in the SM, as a consequence of Lepton Flavour Universality (LFU), R K = R K * = 1 with uncertainties of the order of 1% [1,2]. However, the latest experimental results from LHCb, in the specified regions of q 2 di-lepton invariant mass, are: Clearly, the results for the compatibility of the individual measurements with respect to the SM predictions depend of the q 2 di-lepton invariant mass region, being of 3.1 σ for the R K ratio, 2.3 σ for the R K * ratio in the low-q 2 region and 2.4 σ in the central-q 2 region.
The Belle collaboration has also reported experimental results for the R K ( * ) ratios [5,6], although with less precision than the LHCb measurements. Other targets of flavour violating processes are the b → c ν transitions. The ratios of branching fractions R D ( * ) and R µ D ( * ) , defined by, Their measurements at BaBar [8], Belle [9] and LHCb [10] experiments are larger than the SM prediction. By assuming universality in the lighter leptons, the world average of the experimental values for the R D ( * ) ratios, as obtained by the Heavy Flavour Averaging Group (HFLAV), are [7] R ave D = 0.340 ± 0.027 ± 0.013, R ave D * = 0.295 ± 0.011 ± 0.008. (1.6) These values imply a 1.4 σ discrepancy with the SM predictions for R D , and 2.5 σ for R D * . When combined together, including their correlation, the excess is 3.08 σ.
There exist other observables displaying some discrepancies with SM predictions even when larger theoretical uncertainties are taken into account [11][12][13][14]. It is clear than when investigating the implications of the experimental measurements in flavour physics observables, a global fit should be considered. Several global fits can be found in the literature (see, for example [15][16][17][18][19][20][21][22][23][24][25] and references therein). We have recently done a global fit to the updated experimental information in [24,25], where an extensive list of references to previous analyses is included.
From the theoretical point of view, Effective Field Theory is one of the most widely used tools to study any possible New Physics (NP) contribution. The effective Hamiltonian approach allows us to perform a model-independent analysis of NP effects. In this paper, we consider the Standard Model Effective Field Theory (SMEFT) Lagrangian and we perform a global fit to the Wilson coefficients using the packages flavio v2.3 [26] and smelli v2.3 [27] (as described in details in section 3). The global fit includes the R K ( * ) and R D ( * ) observables, the electroweak precision observables; W and Z decay widths and branching ratios to leptons, superallowed nuclear β decays, all the available experimental data for the related b → s + − observables; i.e. the b → sµ + µ − observables (including the optimized angular observable P 5 and the branching ratio of B s → µ + µ − , as well as all the available data on angular observables in B → K ( * ) µ + µ − decay), the relevant data related to B → K ( * ) e + e − decays, and also the angular observables measured in different bins for B s → φµµ decays. Finally, the b → sνν observables are also included in the global fit.
Because the Gaussian approximation to characterize the fit is not successful, we will use for the first time in this context a Montecarlo analysis to extract the confidence intervals and other relevant statistics, and we explicitly show that machine learning, taking jointly with the SHAP (SHAPley Additive exPlanation) values, constitute a suitable strategy to use in this analysis.
The rest of this work is organized as follows: Section 2 presents a brief summary of the Effective Field Theory used to describe possible NP contributions to B decays observables. We then discuss in section 3 the details of the global fits performed, introducing the phenomenological scenarios that we used in the analysis and presenting our results. We found that the Gaussian approximation is not suitable to characterize the fit and, therefore, in order to extract the confidence intervals and other relevant statistics, we use a Montecarlo analysis that is described in section 4. The agreement of the results obtained by the Machine Learning Montecarlo algorithm that we have proposed and the ones obtained by using the Renormalization Group equations is also included in this section. Section 5 includes a discussion of the phenomenological implications of our analysis in leptoquark models. The conclusions are presented in section 6. Appendix A contains the list of observables that contribute to the global fit, as well as their prediction in the most general scenario considered in this work.

Brief summary of the Effective Field Theory
This section presents a short summary of the Effective Field Theory used in our analysis. First, at energy scales relevant for flavour processes it is convenient to work at an energy scale below the electroweak (EW) scale, for example µ WET = m b , with the top quark, Higgs, W and Z bosons being integrated out. The relevant terms of the Weak Effective Theory (WET) Lagrangian [28][29][30][31] for the semileptonic decays of B mesons are: where G F is the Fermi constant, e is the electromagnetic coupling and V qq are the elements of the Cabibbo-Kobayashi-Maskawa (CKM) matrix. The dimension six operators are defined as, being their corresponding Wilson coefficients C V L , C 9 , C 10 and C ν , respectively. The last three coefficients have contributions from both the SM processes (C SM i ), and NP contribution (C NP i ), 3) The dependence of the R K ( * ) ratios on the Wilson coefficients has been previously obtained in [15], | 2 + 0.0357|C NP e 10 | 2 .
(2.4) where an analytic computation of this ratio as a function of C NP µ 9 , C NP µ 10 in the region 1.1 ≤ q 2 ≤ 6.0 GeV 2 was performed.
For the R D ( * ) ratios, the dependence of the R D ( * ) ratios on the Wilson coefficients is given by [32,33]: (2.5) Second, the NP contributions at an energy scale Λ (Λ ∼ O(TeV)) is defined via the Standard Model Effective Field Theory (SMEFT) Lagrangian [34], where and q are the lepton and quark SU (2) L doublets in the basis of electroweak eigenstates, and i, j, k, l denote generation indices. The dimension six operators are defined as with τ I being the Pauli matrices. We note that we will use the SMEFT operators for our numerical analysis, and will refer to the WET operators only for discussion and comparison with other previous results in the literature.
The translation between the SMEFT Lagrangian in the electroweak basis and in the mass basis was obtained in [29]. The SMEFT Lagrangian in the mass basis is The relation between the C q coefficients in the electroweak basis and the C q coefficients in the mass basis is given by [29] where U d L and U u L are the SM rotation matrices for the left-handed quarks. The only constraint for these matrices is given by the CKM matrix, V = U u L U † d L . The choice U d L = 1, U u L = V defines the "Warsaw-down" basis of the SMEFT [35], where C q(1) = C q(1) and Finally, there is a recent proposal that links the B meson anomalies with NP in the top sector [33,36,37]. In the interaction basis, denoted by double-primed fermions, only the third generation particles exhibit NP couplings, where C 1 = C 3333 q(1) and C 3 = C 3333 q (3) . The interaction basis is related to the basis where the mass matrices are diagonal via the unitary transformations, where ψ L = P L ψ (ψ = u, d, ν, e),Û ψ are unitary matrices, and the quark unitary matrices are related to the CKM matrix asÛ uÛ † d = V . The fermionic bilinears are transformed as follows, with the flavour matrices λ given by We can write all the quark matrices in terms of λ q , so every u-type quark picks an additional CKM matrix, which are exactly the same factors appearing in the Lagrangian for the mass basis in Eq. (2.8). For example, if we expand the first term in Eq. (2.10), we obtain which agrees with Eq. (2.8) with the identification C ijkl q(1) = C ijkl q(1) = C 1 λ ij λ q kl . Repeating the same steps with the other term in Eq. (2.10), we arrive to C ijkl q(3) = C ijkl q(3) = C 3 λ ij λ q kl . In conclusion, the Lagrangian of Eq. (2.10) in the "Warsaw-down" basis becomes We perform the Renormalization Group (RG) running of the SMEFT Wilson coefficients from Λ = 1 TeV down to µ EW [38], where we match the SMEFT and WET operators [39], and finally we perform the RG running of the WET coefficients down to µ = m b [40]. We check that the analytical expressions are in agreement with the numerical results obtained by the package Wilson [41]. This operation is performed for all the effective operators in the WET that receive contributions from the Lagrangian in Eq. (2.16). Here we reproduce the matching conditions for the Wilson coefficients with the largest impact on the semileptonic B meson decays, that is, C i NP 9 and C i NP 10 for the B → K ( * ) + − decays, C i NP V L for the B → D ( * ) ν decays, and C i ν for the B → K ( * ) νν decays: We find out that there is a sizeable subleading term that affects C NP i 9 and not C NP i 10 , thus breaking the leading-order relation C NP i 9 = −C NP i 10 . However, this subleading term is LFU, since it does not depend on the leptonic rotation matrix λ , and consequently has a negligible effect on the universality ratios R K ( * ) . The spoiling of the tree-level relation will become relevant in observables that include only one lepton flavour, such as the branching ratios and angular observables for B → K ( * ) µ + µ − ; which depend on C µ 9 and C µ 10 ; and B s → µ + µ − only depending on C µ 10 . The interplay between the tree-level and loop-induced terms is well known and was also previously discussed by [42].
In order to describe the rotation from the two bases, the λ matrices introduced in Eq. (2.13) must be hermitian, idempotent λ 2 = λ, and trλ = 1. These properties are consequences of the fact that, in the interaction basis, NP only affects one generation, and follow immediately from the definitions: A 3 × 3 hermitian idempotent matrix with trace one has 4 free real parameters, or equivalently, 2 free complex parameters. Without loss of generality, we can use the parameterization [33] λ , q = 1 where α , q and β , q are complex numbers, which are related to the unitary rotation matrices as 33 (2.20) We can therefore understand the parameters α and β as the relative degree of mixing to the first and second generations of leptons, respectively, produced by the rotation from the interaction basis to the mass basis. Analogously, the parameters α q and β q represent the relative degree of mixing to the first and second generations of d-type quarks (remember that the u-type quarks pick additional CKM factors).
The conditions in Eq. (2.19) impose several relations between the LFUV operators, which are proportional to the diagonal entries of λ , and the LFV operators, proportional to the off-diagonal entries: (2.21) On the other hand, the O q operators also produce unwanted contributions to the B → K ( * ) νν decays [33]. In order to obey these constraints, we will fix at the scale µ = Λ the relation This relation cancels the tree-level contribution to the B → K ( * ) νν, but there is still a loopinduced contribution, proportional to the C q(3) coefficients. However, we have checked that in our scenarios, this is only a 0.1% correction of the SM predictions.

Global fits
Since the effective operators affect a large number of observables, connected between them via the Wilson coefficients, any NP prediction based on Wilson coefficients has to be confronted not only with the R K ( * ) an R D ( * ) measurements, but also with additional several measurements involving the B-mesons decays. The RG evolution in the SMEFT produces a mix of the low-energy effective operators, which implies that the W and Z couplings to leptons are modified [43,44]. Then, several EW observables are affected, such as the W boson mass, the hadronic cross-section of the Z boson (σ 0 had ) or the branching ratios of the Z to different leptons. In order to keep the predictions consistent with this range of experimental test, global fits have proven to be a valuable tool [18][19][20][21]. We have previously done in [24,25] an analysis of the effects of the global fits to the Wilson coefficients, assuming a model independent effective Hamiltonian approach.
In the current paper the global fits to the C q Wilson coefficients have been performed by using the packages flavio v2.3 [26] and smelli v2.3 [27] 1 . This code assumes unitarity of the CKM matrix. Note that the experimental measurements used to determine the SM input parameters, such as the µ → eνν decay, are not included in the fit in order to ensure the consistency of the procedure.
In our analysis, the goodness of each fit is evaluated with its difference of χ 2 with respect to the SM, ∆χ 2 SM = χ 2 SM − χ 2 fit . The package smelli actually computes the differences of the logarithms of the likelihood function ∆ log L = − 1 2 ∆χ 2 SM . In order to compare two fits A and B, we use the pull between them in units of σ, defined as [45,46] where Erf −1 is the inverse of the error function, F is the cumulative distribution function of the χ 2 distribution and n is the number of degrees of freedom of each fit. The SM input parameters used for these fits are the same as in our previous work [24]. The Renormalization Group effects of the SMEFT operators that shift the Fermi constant G F [39] from its SM value G 0 F are considered. The effects on the CKM matrix [47] are not implemented, and its parameters are treated as nuance parameters instead. Now we proceed to fit the set of flavour observables to the parameters C 1 = C 3 ≡ C, α , q and β , q of Eqs. (2.16) and (2.19). In this setting, we consider two Scenarios: • Scenario I: λ , q 11 = λ , q 12 = λ , q 13 = 0, that is, α = α q = 0, and C 1 = C 3 .
• Scenario II: The only assumption is C 1 = C 3 .
In both scenarios C 1 = C 3 in order to implement the constraints from the B → K ( * )ν ν observables, as previously mentioned (see Eq. (2.22)). In Scenario I we also set λ , q 11 = λ , q 12 = λ , q 13 = 0, i.e. α = α q = 0, assuming that the mixing affecting the first generation are negligible; this is the same assumption used in [33]. Scenario II is more general, including non-negligible mixings to the first generation, allowing us to check the validity of the above assumption and to discuss the results in a more general situation; focusing in the relevance of the mixing in the first generation. In both scenarios, we only consider real values for the parameters of the fit.
The best fits to the rotation parameters α and β for leptons and quarks and to the Wilson coefficient C ≡ C 1 = C 3 in these two Scenarios are summarized in Table 1. The best fit is found for Scenario II, with a pull of 6.70 σ with respect to the Standard Model, 3.77 σ with respect to Scenario I. We note that the β parameter, which mixes the second and third generations of leptons at tree level, is negligible in both fits. Figure 1 shows the twodimensional sections of the likelihood function ∆χ 2 SM for the α -β and α q -β q parameters in Scenario II, at 1σ and 2σ. The rest of parameters are given as in the best fit point of  this Scenario. Results for the R K ( * ) and R D ( * ) observables and for the LFV observables, as well as for the global fit are included. We can observe that, due to the non-linear relations imposed by Eq. (2.19), the regions of equal probability are highly non-ellipsoidal. Therefore, we cannot use the Gaussian approximation to characterise the fit. Instead, we will use a Montecarlo analysis, described in section 4, in order to extract the confidence intervals and correlations between observables. The values of the parameters of the Lagrangian (2.16) in Scenario II are C = C 1 = C 3 = −0.126 ± 0.010, and The most notable effect of the mass rotation is the mixing of the second and third generation quarks, and there is also some mixing between the first and third generation leptons.
The more relevant WET Wilson coefficients in Scenario II are As established in Eq. (2.17), subleading RG effects cause a notable deviation from the leading-order relation C NP µ 9 = −C NP µ 10 . This is in agreement with the fits performed in [15,16,46,[50][51][52][53][54][55][56][57][58][59][60][61], where the Wilson coefficient C NP µ 9 receives a greater NP contribution than C NP µ 10 . According to our fit, C NP µ 10 ≈ 0: from the matching conditions, this operator is generated at tree level and is proportional to λ 22 ∼ |β | 2 . From the plot in Fig. 1 we learn that the parameter β is severely constrained by the LFV observables, in green lines. Consequently C NP µ is dominated by the loop-generated term in Eq. (2.17). Clearly, the logarithmic term that appear in the first equation of (2.17) is relevant in the phenomenological analysis. In the electron sector, the mixing parameter α does not suffer large constraints from the LFV sector. In this case, the tree-level and loop-level terms are similar, and therefore C NP e 9 = −C NP e 10 +C loop , which is of the same order of magnitude as C NP e 10 . In Section 5.1, we assess an specific model of leptoquarks where these relations are met.
The predictions for R K ( * ) and R D ( * ) observables in the best fit points for both scenarios are displayed in Figure 2, where the central value and 1σ uncertainty of the observables are included. The yellow area corresponds with the SM prediction, and the green area with the experimental measurements for each observable. Table 2 summarizes the results for the R K ( * ) and R D ( * ) observables in Scenarios I and II for the corresponding best fit points. For comparison, an statistical combination of all the available measurements of each observable, performed by flavio is included in the last column of this table.
From the above results, it is clear that the assumptions of Scenario I do not allow for a simultaneous explanation of the R K ( * ) and R D ( * ) anomalies, as already pointed out in [33]. In particular, a value of the mixing between the second and third generation leptons β is large enough to describe R K ( * ) through the tree-level C NP µ . Instead, our fit shows a preference for a negligible β , and

Observable Scenario I Scenario II Measurement
0.289 ± 0.011 0.290 ± 0.009 0.31 ± 0.03 Table 2. Values of the R K ( * ) and R D ( * ) observables in Scenarios I and II for the best fit points.
therefore the R D ( * ) anomalies are explained only through NP in C τ V L . The predictions for the branching ratios and angular observables of the B → K ( * ) µ + µ − decays are improved thanks to the flavour-universal loop-induced contribution to C NP 9 = C loop 9 , while the R K ( * ) ratios are not sensible to the universal contribution and remain SM-like.
The parameters in the fit of Scenario II, on the other hand, are able to describe the R K ( * ) and R D ( * ) anomalies at the same time, as it is shown in Figure 2 and Table 2.
To consider the mixing between the first and third lepton generation does not notably alter the prediction for R D ( * ) . At the same time, it originates a tree-level contribution to C NP e 9 = −C NP e 10 , that breaks the universality between the electron and muon Wilson coefficients, allowing for R K ( * ) = 1. The comparison of the pull of each observable for this scenario with respect to their experimental measurement (blue line), compared to the same pull in the SM (orange line) is presented in Figure 3. The observables whose pull changes in more than 1.5 σ between the SM and Scenario II are specially marked in the plot, i.e. R l D ( * ) , R K and R µ D ( * ) (observables 2, 7 and 13 in the table presented in Appendix A). It is clear that for these observables NP improves their prediction. For completeness, the full list of predictions and pulls is also included in Appendix A. We have checked that all the observables in the appendix, with the only exception of | K | (observable 33), can receive a contribution from the Wilson coefficients in Scenario II when considering the full RG equations. It is also important to note that the muon lifetime is not included in the above list of observables because it is used to determine the SM value of G F ; an input parameter. Finally, we also investigate which class of observables constraint each parameter of the fit. For this purpose we modify the rotation parameters α and β for leptons and quarks and the Wilson coefficient C ≡ C 1 = C 3 independently, and we compare the results with respect to the likelihood for R K ( * ) , R D ( * ) and LFV observables, and to the global likelihood. Figure 4 shows the evolution of the likelihood for R K ( * ) and R D ( * ) observables and LFV observables, as well as the global likelihood, when one parameter is modified from its best fit value. The interplay between all observables is clearly established when the Wilson coefficient C is modified (Figure 4 (a)). In the case of the lepton mixing, it is clear that the R K ( * ) observables determine the best values of α (Figure 4 (b)), while the LFV observables limit the allowed values of β to a narrow region around zero; being the observables that determine the behaviour of the global fit in this case (Figure 4 (c)). In the quark mixing ( Figure 4 (d) and (e)), we found that α q is constrained by the observable BR(K + → π + νν) (observable 406), while β q is determined by the interplay of R K ( * ) and R D ( * ) , that prefer larger values, and the LFV observables, that disallow β q > 1. Clearly, the above results show the interplay between all parameters and confirm the relevance of considering all observables when performing phenomenological studies in the context of B-anomalies and the discussion of possible explanation of these anomalies through NP models.

Montecarlo analysis using Machine Learning
In this section we study the parameter points in the neighbourhood of the best fit point. We will generate samples of parameter points following the χ 2 distribution given by the likelihood of the fit. The Montecarlo algorithm is the standard procedure to generate samples that follow a known distribution. In our case, the computation time needed to calculate the likelihood of each candidate point is a huge drawback. Instead, we opted to use a Machine Learning algorithm to construct an approximation to the likelihood function and that can be evaluated in a much shorter time. As far as we know, this is the first time that these procedure is used in the analysis of flavour anomalies. There exist a previous paper that address the problem of NP model in b → cτ ν τ decays by using a specific machine learning algorithm [62], but the techniques used in this paper are different to the ones we used here. In the following we give some details of the Machine Learning procedure and then, we present our results.

Methodology
The first Machine Learning tool that we will use for our analysis is a model able to approximate any arbitrary function f : R n → R, that we will use to create an approximation of the log-likelihood function of our fit. We have chosen an ensemble method based on regression trees, which is implemented by the XGBoost (eXtreme Gradient Boosting) algorithm [63].
Regression trees are a type of decision tree. A decision tree is a diagram that recursively partitions data into subsets, based on the binary (true/false) conditions located at the nodes of the tree. The final subsets in which the data are classified are called "leaves". A decision tree with T leaves is formally a function q : R n → {1, 2, . . . , T } which associates to each data point x ∈ R n its leaf q(x). A regression tree assigns to each leaf i a real number w i ∈ R. The regression tree therefore defines a function f : R n → R, given by (4.1) An example of a regression tree with four leaves is depicted in Fig. 5. In practice, a single tree is not general enough to reproduce an arbitrary function. For this reason, we consider instead an ensemble of K regression trees The function φ(x) will represent the approximation for the log-likelihood function. It will be calculated using supervised learning, that is, the trees are obtained from a dataset D = {(x i , y i )} where x 1 , . . . x N ∈ R n are the inputs and y 1 , . . . y N ∈ R are the pre-computed outputs for each input. In our case, the input data will be of the form , and the outputs will be y i = log L(x i ). In order to train the model from the dataset, we need to define an objective function L[φ] that measures how well the model fits the data, which has two components: • The loss function l(φ(x i ), y i ) is a differentiable function that measures the similarity between the true output y i and its approximation φ(x i ). We use as loss function the mean absolute error, • The function Ω is the regularization term, that penalizes the complexity of trees, that is, trees with many leaves or with large ||w||. The purpose of the regularization is to prevent over-fitting, that is, the model learning "by heart" the training data and being unable to extrapolate from them.
The ensemble is constructed in an iterative way, starting from one single tree f (0) that contains just one leaf. At the step t of the iteration, the tree f (t) is obtained by splitting one Figure 6. Prediction models that would we necessary to train for three features in order to calculate the SHAP values. The edges represent the marginal contributions for each feature: in red for x 1 , green for x 2 and blue for x 3 .
of the leaves of the tree f (t−1) into two leaves; the splitting is determined by the optimization of the objective function. In order to prevent over-fitting, the shrinkage technique is used, that scales newly added weights by a factor η < 1, similar to the learning rate in other Machine Learning algorithms.
Once we have an approximation of the log-likelihood function, we put it to use and generate new samples of datapoints We use a Montecarlo algorithm to produce the data distributed according to the χ 2 distribution of the fit. At each step of the Montecarlo algorithm, a new tentative x i is proposed, which is accepted if the ratio of its probability divided by the probability of the best fit point is greater than a random number u distributed uniformly in the interval [0, 1], and rejected otherwise. Expressed in terms of the logarithms of the likelihood function instead, where L bf is the likelihood of the best fit. This algorithm requires many calls to the likelihood function, which are computationally very tasking, and most of the proposed points are rejected. As a way to ease the burden, we use the approximated log-likelihood φ(x i ) instead of the true function. We can asses the importance of each parameter in the Machine Learning approximation at any point of the generated samples by using SHAP (SHAPley Additive exPlanation) values [64,65]. SHAP values are based in Lloyd Shapley's work on game theory [66], who won the Nobel Prize in Economics for it in 2012.
The SHAP values are designed with three properties in mind:

Normalized frequency
Predicted histogram • Local accuracy: The sum of the SHAP values is equal to the model prediction.
• Missingness: If any feature is missing, its SHAP value is zero.
• Consistency: If the model is changed so any feature has larger impact, its SHAP value will increase.
Given a model φ(x), the SHAP trains 2 n new models φ z (x) for z ∈ {0, 1} n binary vectors. The model φ z (x) contains the feature x (i) only if z (α) = 1, while that feature is ignored when training if z (α) = 0. The marginal contribution φ z (x i )−φ z (x i ) for two models differing only in the presence of one feature (i.e. z (α) = 0, z (α) = 1 and z (β) = z (β) ∀ β = α), gives the importance of adding the feature α to the model z. The SHAP value for the feature α in the point x i is just the weighted average of all marginal contributions, with the weight given by a combinatorial factor. An example is depicted in Fig. 6. The prediction without any features φ 0···0 is simply the average of the values y i in the dataset, and acts as a base value common for all x i .
Finally, we will analyze the correlations between the points in the generated samples, in order to understand the physical relations caused by NP.

Procedure and results
In the first place we create a sample of 5000 parameter points and their likelihood using the traditional algorithm. We discard the points with ∆χ 2 SM < 20, retaining 3760 points. We train a Machine Learning predictor using the pre-computed sample. We used the XGBoost (eXtreme Gradient Boosting) algorithm [63] Table 3. SHAP values and Machine Learning prediction for the best fit point.
xgboost. We split the sample in two parts, 75% of the points for the training and 25% points for the validation of the model. The algorithm uses a learning rate of 0.05 and 1000 estimators, allowing early stopping at 5 rounds. The performance of the Machine Learning predictor can be seen in Figure 7 (a). The horizontal axis represents the actual value of the ∆χ 2 SM for each point of the validation dataset, computed using the full flavio and smelli code, with the best fit point found in Section 3 corresponding to the maximum value. The vertical axis represents the predicted value for the same points obtained using the Machine Learning Montecarlo algorithm. The predicted values for the ∆χ 2 SM reproduce their actual values, with a Pearson regression coefficient r = 0.970 and Mean Absolute Error of 0.719 in the validation dataset. The agreement between the predicted and actual values is specially good for parameters near the best fit point (∆χ 2 SM > 55). Finally we implement the Montecarlo algorithm: we generate random points C near the best fit. The predictor produces an approximation of the ∆χ 2 , and therefore, also an approximation of the logarithm of the likelihood, logL( C). The point is accepted if this approximation verifies the Montecarlo condition logL( C) > log L bf + log u , (4.5) where L bf is the likelihood of the best fit and u is a number randomly chosen from an uniform distribution in the interval [0, 1). To check if the Machine Learning Montecarlo algorithm can actually reproduce the χ 2 distribution, we generate a sample of 1000 points. The histogram for the predicted values of the χ 2 is plotted in Figure 7 (b). The histogram follows the general shape of the χ 2 distribution, although there is an excess of points near the best fit and a deficit of points in the region of low likelihood. In order to understand how each parameter affects the prediction of the likelihood, we use the SHAP values as described above. Remember that once we have an approximation of the log-likelihood function, we put it to use generating new samples of datapoints , and we use a Montecarlo algorithm to produce the data distributed according to the χ 2 distribution of the fit. Table 3 contains an example of the SHAP values for log L at the best fit point. According to the Machine Learning model, the values of C and α and β q have the larger impact in the Machine Learning prediction. Figure 8 shows the impact of each parameter to the final prediction, measured as the mean of the absolute values of their SHAP values across a sample of 10000 Montecarlo points. The SHAP values allow us to quantify the relative importance of each parameter in the fit. The parameters β q and α have the largest contribution and β and C contribute the less. This result is in disagreement with the assumption of Scenario I regarding the  parameter α describing the mixing to the first generation of leptons. Therefore, the obtained result is in agreement with the previous section, where we already concluded that the mixing with the first generation were necessary in order to describe both anomalies simultaneously. On the other hand we see that, while moderate values of β are relatively unimportant compared to the other mixing parameters, extreme values have a large negative impact in the prediction for the likelihood. We calculate the SHAP values for the logarithm of the likelihood at each point of the Montecarlo sample. In this way, we can determine how each parameter contributes to the fit, as shown in Figure 9. We can compare these SHAP values with Figure 4, where only one parameter was changed at a time. Then, it is clear the agreement between the results obtained by the Machine Learning Montecarlo algorithm proposed in this work and the ones obtained by following the RG equations. Therefore, we can conclude that the SHAP values reproduce correctly the general features of the fit.
The above results show that the Machine Learning Montecarlo algorithm can be very useful in this kind of analysis, being able to reproduce the results obtained in the previous section in a shorter time. We can conclude that the machine learning, made jointly with the SHAP values, constitute a suitable strategy to use in complex fitting problems with large dimensionalities and complicated constraints, where a direct evaluation is too timeconsuming.

Correlations between observables
In order to check our Machine Learning procedure, we now discuss on the agreement of the results obtained by the Machine Learning Montecarlo algorithm that we have proposed and the ones obtained by using the RG equations defined as given in Section 2.
The Lagrangian in Eq. (2.16) exhibit a flavour structure, given by the λ matrices, relating the different entries of the tensor of Wilson coefficients C ijkl q . Under the RG evolution and matching, this flavour structure is imprinted in the WET Lagrangian in Eq. (2.1), and therefore in the related observables. Using the Machine-Learning Montecarlo algorithm described in the previous section, we generate a sample of 15000 points in parameter space around the best fit point. In each point we run the RG equations down to the electroweak scale, perform the matching with the WET, and run the RG equations again down to µ = m b . We compute the correlations between the semileptonic b → s and b → c coefficients C 9 , C 10 , C V L and C ν for the different lepton generations. Figure 10 shows the matrix of Pearson coefficients describing linear correlations between the WET Wilson Coefficients. In the electron sector, C e 10 , C e V L and C e ν show strong correlations close to ±1. In the muon sector, C µ 10 , C µ V L and C µ ν are also correlated between them, however they are linearly independent of C µ 9 . Instead, C µ 9 is correlated with the tau coefficients C τ V L and C τ ν , and to a lesser extent to C e 9 . The correlations that we have found are consistent with the results of RG evolution and matching in Eq. (2.17). In the case of the electron sector, the C e 10 , C e V L and C e ν coefficients are all proportional to the product Cλ q 23 λ 11 appearing in the tree-level contribution. Analogously in the muon sector C µ 10 , C µ V L and C µ ν , depend on Cλ q 23 λ 22 and in the tau sector C τ V L and C τ ν depend on Cλ q 23 λ 33 . The coefficient C µ 9 is not correlated to the rest of the muonic   , which depends on the product Cλ q 23 . The coefficient C e 9 receives sizeable contributions both from the treelevel and the one-loop terms, and consequently shows a mild correlation with C µ 9 and a total correlation with the combination C µ 9 − C e 10 . Lastly, there is a perfect correlation of ±1 between C µ 9 and the tau coefficients, which is caused by the fact that λ l 33 = 0.994 ± 0.001 is almost constant, so Cλ q 23 λ l 33 ≈ Cλ q 23 . We can therefore conclude that the obtained data is in agreement with the arrangement of Wilson coefficients presented in Eq. (2.17).
Besides, in the same sample of 15000 points, we determine the predictions of our model for several selected observables of various flavour sectors, with large pull differences between the SM and Scenario II predictions: R νν) (observable 406) from s → d decays that has a great impact in the fit value of α q , and the tau decay BR(τ − → µ − νν) (observable 37). The correlation matrices are depicted in Figure 11.
From the above results, it is clear that the observables R D * and BR(B + → K + νν) show an almost-perfect correlation. Then, predictions for these two observables in the generated sample are shown in Figure 12. The green vertical band in this figure corresponds to the R D * measurement [7], the red horizontal band to the 90% CL excluded region for BR(B → K * νν) [67] and the grey band to the 2021 world average obtained by Belle II [68]. The yellow horizontal band summarizes the SM prediction. The obtained values of Montecarlo points and the best fit prediction of our computations are also included. It is important to stress that R D * depends on the Wilson coefficient C τ V L , and BR(B + → K + νν) on C τ ν , and both of them are proportional to the product Cλ q 23 λ 33 . This is in contrast with the conclusions of [69], where several leptoquark scenarios coupling to right-handed neutrinos did not find a significant correlation between both observables. Even if the correlation is strong, the prediction for the B + → K + νν decay remains compatible with the 90% confidence level (CL), BR(B → K + νν) < 1.6 × 10 −5 [67], for the whole range of experimentally-compatible values of R D * . The world average for the branching ratio obtained by Belle II [68] (not included in our numerical analysis) shows an enhancement of a factor of 2.4 ± 0.9 compared to the SM prediction [69]. While our data is in tension with this world average, it is an encouraging sign of a possible interplay between R D ( * ) and BR(B + → K + νν). Future experimental results from Belle II will further clarify the situation.
It is worth stressing that the observable R K * displays a moderate correlation with R D * and BR(B + → K + νν), caused by the relation of the Wilson coefficient C µ 9 with C τ V L and C τ ν . On the other hand, R K * shows almost no correlation to BR(B s → µ + µ − ), even though both observables depend on C µ 10 . This is a result that sets us apart from many NP models that impose the relation C µ 9 = −C µ 10 . Finally, none of the selected observables display a large correlation to the goodness of fit measured by ∆χ 2 . This is a sign that there is not a single observable dominating the fit, and reaffirms that global fits are in fact a necessity on the analysis of flavour anomalies. The green vertical band corresponds to the R D * [7] measurement, the red horizontal band to the 90% CL excluded region for BR(B → K * νν) [67], the grey band to the 2021 world average obtained by Belle II [68] and the yellow horizontal band to its SM prediction.

Connection to leptoquark models
In this section we discuss the phenomenological implications of our assumptions in the vector leptoquark model. The vector leptoquark U 1 = (3, 1) 2/3 couples to left-handed and right-handed fermions as An U 1 leptoquark with mass M U , when matched with the SMEFT at the scale Λ, contributes to the following Wilson coefficients [70]: Our model does not include couplings to right-handed leptons in the interaction Lagrangian, and therefore all the x R couplings are set to zero. The left-handed couplings x L are related to the parameters of the Lagrangian (2.16) according to where θ is a free global complex phase. Since the rotation matrices λ are hermitian (λ ii and λ q jj are real and positive), we need C 1 = C 3 to be a real negative number. This condition is fulfilled in both Scenarios I and II.
Without loss of generality we set θ = 0. The mass of the leptoquark is chosen to be M U = 1.5 TeV, the lowest mass not excluded by direct searches [71]. The results in Scenario In both scenarios, the most important coupling are x 23 L to second generation quarks and third generation leptons, and x 33 L to third generation quarks and leptons. A similar leptoquark model has been proposed previously, as scenario RD2A in [72] as a solution for the R D ( * ) anomaly. The advantage of our proposal is that the inclusion of small non-zero values of the couplings x 21 L and x 31 L is able to explain the R K ( * ) anomalies at the same time. The values of x 23 L and x 33 L are compatible with the exclusion limits set in [72]. Other leptoquark models do not retain the C q(1) = C q(3) condition [32,70], and therefore produce large contributions to the B → K ( * ) νν decays. That is the case of the scalar S 3 = (3, 3) 1/3 , that predicts C q(1) = 3C q(3) , and the vector U 3 = (3, 3) 2/3 , where C q(1) = −3C q(3) . The scalar S 1 = (3, 1) 1/3 is even less suited, as it predicts C q(1) = −C q(3) , which would result in no NP contributing to b → s + − at all. New vector bosons W and Z would also be in conflict with the B → K ( * ) νν decays, as they predict C q(1) = 0 while C q(3) has a non-zero value.

A simplified model
In this section, we will propose a simplified U 1 leptoquark model, depending only on two coupling constants, that reproduces the numerical results that we have obtained in section 3. This scenario implies that the NP contributions to the C e 9 and C µ 9 Wilson coefficients are of the same order, but the ones to C e 10 is two orders of magnitude larger than to C µ 10 . In the quark sector, we assume that the leptoquark does not interact with the first generation quarks, and that it interacts equally with second and third generation quarks. The rotation matrix corresponding to this assumption has elements (Û q ) 31 = 0 and (Û q ) 32 = (Û q ) 33 = 1 √ 2 . The parameters of the mixing matrix, α q = 0 and β q = 1, are compatible with the results of the fit obtained in Table 1.
For the leptonic sector, we assume that the leptoquark interacts differently with each generation, being the interaction with the second generation leptons negligible. That is, β = 0 and a non-zero value for α .
Using these simplifying assumptions in Eq. (5.3), we obtain the following couplings for the U 1 leptoquark with the left-handed fermions: In this model, the interactions of the U 1 leptoquark with fermions are governed by just two couplings, x 1 and x 3 . Their numerical values, assuming again a leptoquark mass of M U = 1.5 TeV and the best fit values for C and α in Scenario II given in Table 1, are

Conclusions
In this paper, we present the results of the global fit to the flavour physics observables that exhibit some discrepancies with respect to the SM values, by considering the NP effects on the Wilson coefficients of the SMEFT Lagrangian. The global fit includes the b → sµµ observables; i.e. the Lepton Flavour Universality ratios R K ( * ) , the angular observables P 5 , the branching ratio of B s → µµ and all the available data on angular observables in B → K ( * ) µ + µ − decay, as well as the R D ( * ) observable, the relevant data related to B → K ( * ) e + e − decays and the angular observables measured in different bins for B s → φµµ decays, b → sνν and electroweak precision observables (W and Z decay widths and branching ratios to leptons). We choose two scenarios in which the condition C 1 = C 3 is imposed in order to avoid unwanted contributions to the B → K ( * ) νν decays. In Scenario I we fix parameters by assuming that the mixing in the first generation is negligible, as already considered in [33]. Scenario II includes non-negligible mixings to the first generation, allowing us to check the validity of the above assumption. We found that the better fit is obtained for Scenario II, with a pull of 6.70 σ with respect to the Standard Model, 3.77 σ with respect to Scenario I (Table 1). Simultaneous explanation of the R K ( * ) and R D ( * ) anomalies have been also found in Scenario II ( Figure 2 and Table 2). We show that the Gaussian approximation to characterize the fit is not successful (see Figure 1) and therefore, we use for the first time in the context of the so-called B-anomalies a Machine-Learning Montecarlo analysis to extract the confidence intervals and correlations between observables. We found that our procedure reproduce the results obtained in Section 3 for both the ∆χ 2 distribution and the analysis of the impact of each parameter on the global fit. We also have checked the agreement between the results obtained by the Machine Learning Montecarlo algorithm proposed in this work and the ones obtained by following the RG equations. Therefore, we conclude that machine learning, jointly with the SHAP (SHAPley Additive exPlanation) values, constitute a suitable strategy to use in this kind of analysis. This is a promising area of study even if present uncertainties do not allow us to conclusively establish the presence of physics beyond the SM, and further analyses are needed.
An observation of the B + → K + νν decay in the near future at Belle II could provide further insight in the R K ( * ) and R D ( * ) anomalies, especially if the excess in the current world average is confirmed. This, together with the expected improved measurements of the electroweak observables in the future linear colliders that we previously studied in [24,25]

A Pulls of the observables in Scenario II
In this appendix we collect the list of all observables that contribute to the global fit, as well as their prediction in Scenario II and their pull in both scenario II (NP pull) and SM (SM pull). Observables are ordered according to their SM pull, and color-coded according to the difference between the scenario II and SM pulls: green observables have a better pull in scenario II, red observables have a better pull in the SM and white observables have a similar pull in both cases.