Potential and limitations of machine-learning approaches to inclusive |V_ub| determinations

Introduction
The determination of the parameters of the CKM matrix is an important test of the Standard Model (SM). Its least known element is |V_ub|, which can be determined at B factories from semileptonic B decays in the exclusive B → π ℓν channel [1][2][3][4] as well as from inclusive B → X_u ℓν decays [5][6][7]. Moreover, it can be probed at the LHCb experiment in Λ_b → p μ ν̄_μ decays [8]. The current average of inclusive and exclusive measurements is |V_ub| = (3.82 ± 0.24) × 10⁻³ [9]. However, there is a long-standing 3σ tension between them, making the determination of |V_ub| in the inclusive mode an exciting future measurement for Belle II.
From the theoretical standpoint, the total B → X_u ℓν decay rate would offer the cleanest extraction of |V_ub|. It can be calculated using the local operator product expansion (OPE) familiar from inclusive semileptonic decays into charm quarks, B → X_c ℓν [10][11][12][13]. At leading order in this 1/m_b expansion the result for the inclusive decay is equal to that for the quark-level process b → u ℓν, whose total [14] and differential [15] decay rates are known up to next-to-next-to-leading order in QCD. At relative order 1/m_b² only a handful of non-perturbative parameters appear, and recently even for these power corrections the next-to-leading-order QCD corrections have been calculated [16].
From the experimental standpoint, the large background from charmed final states precludes a straightforward measurement of the total inclusive B → X_u ℓν decay rate. The traditional approach to inclusive |V_ub| measurements has thus been to apply kinematic cuts that restrict measurements to phase-space regions which, neglecting detector effects, are free from charm background. Examples of such cuts are M_X < m_D, where M_X is the invariant mass of the hadronic final state X and m_D is the D-meson mass, or P_+ < m_D²/m_B, where P_+ = E_X − |P⃗_X| is the energy-momentum difference of the hadronic final state and m_B is the B-meson mass.
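Neglecting detector effects, these charm-free regions can be selected directly from the hadronic four-momentum. A minimal sketch in Python (the function names and the rounded PDG mass values are our own choices, not taken from the analysis):

```python
import math

M_D, M_B = 1.8648, 5.2793  # GeV; D0 and B masses (rounded PDG values)

def hadronic_cut_variables(p_X):
    """Given the hadronic four-momentum (E, px, py, pz) in the B rest frame,
    return the invariant mass M_X and the light-cone variable P+ = E_X - |P_X|."""
    E, px, py, pz = p_X
    p3 = math.sqrt(px**2 + py**2 + pz**2)
    m_X = math.sqrt(max(E**2 - p3**2, 0.0))
    p_plus = E - p3
    return m_X, p_plus

def charm_free(p_X):
    """True if the event lies in a phase-space region kinematically closed
    to B -> X_c l nu decays (neglecting detector effects)."""
    m_X, p_plus = hadronic_cut_variables(p_X)
    return (m_X < M_D) or (p_plus < M_D**2 / M_B)
```

For a light, boosted hadronic system the cuts are satisfied; for a heavy system at the charm threshold they are not.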
Even apart from the fact that detector effects cause the charm background to populate these theoretically charm-free phase-space regions (see Figure 1 below), requiring a non-trivial separation of signal and background even for these restrictive kinematic cuts, the theoretical description of the partial B → X_u ℓν decay rates becomes considerably more involved. To the extent that the phase-space cuts limit the partial decay rates to the "shape-function region", where the hadronic final state is a collimated jet whose energy is much larger than its invariant mass, the local OPE breaks down and is replaced by a non-local, shape-function OPE. The leading-order contribution in the corresponding 1/m_b expansion involves a single non-perturbative shape function [17,18], which depends on one light-cone variable. Analyses in soft-collinear effective theory have shown that the 1/m_b power corrections in this non-local OPE involve a plethora of subleading shape functions beyond tree level, some of which depend on up to three light-cone variables [19][20][21], and that the next-to-next-to-leading-order QCD corrections to the leading-power decay rate can be substantial [22].
Phenomenologically, several theoretical approaches to partial B → X_u ℓν decay rates are used in |V_ub| extractions, going under the acronyms ADFR [23], BLNP [24,25], DGE [26] and GGOU [27]. These differ in the treatment of QCD effects in the shape-function region, but all reduce to the conventional, local OPE results if the kinematic cuts do not introduce new scales which are parametrically much smaller than the b-quark mass. Given the complicated structure of the factorisation theorems, the debate over the precise nature of the shape-function OPE, and the fact that there is no obvious new-physics explanation for the current discrepancies between inclusive and exclusive determinations [28], it is clearly desirable to extend measurements over as large a region of phase space as possible, such that the theoretically clean local OPE results can be applied.
Multivariate analysis techniques based on machine learning (ML) are ideally suited for accessing B → X_u ℓν decays in regions dominated by the B → X_c ℓν background, while still achieving good signal-to-background ratios. From the ML perspective, the challenge is to build a classifier between signal (B → X_u ℓν) and background (B → X_c ℓν and other decays). The first example of such an ML approach to |V_ub| determinations was the Belle analysis of Ref. [5]. It used a boosted-decision-tree (BDT) based classifier taking various high-level kinematic and global features as input, and gave a result for the partial decay rate with the single restriction that the charged lepton carry momentum greater than 1 GeV in the B-meson rest frame. It thereby samples more than 90 % of the inclusive B → X_u ℓν phase space, such that a theoretical description based on the local OPE is applicable. A potential criticism is that such a classifier needs to be trained on Monte Carlo (MC) samples of signal and background events, and is thus especially susceptible to systematic errors from the kinematic modelling of the signal. A possible approach to evading this criticism was presented recently in the reanalysis of the Belle data in Ref. [7], where kinematic properties were not included as input features in a BDT classifier. Although the classification power of such a BDT is reduced when viewed in terms of typical machine-learning metrics such as the area under the curve, it can be used to enhance the signal-to-background ratio to a level which permits binned one- and two-dimensional likelihood analyses of the kinematic features of the signal and background after event selection, resulting in a similar significance after the likelihood analysis.
The purpose of this paper is to perform a systematic study of the use of ML-based classifiers for inclusive |V_ub| analyses. We focus on two main aspects. First, we explore the use of deep neural networks (NNs) as an alternative ML architecture to BDTs. While BDTs typically work best when given a small set of carefully engineered, high-level features such as the hadronic invariant mass, NNs can take as input the very high-dimensional set of low-level features characterising the event (such as the four-momenta of the final-state particles) and use it to learn an optimal way to classify signal and background. Second, we study in detail the inclusivity of the classifiers and their sensitivity not only to the set of input features chosen, but also to the event generator used to produce the training data. In particular, while present |V_ub| analyses rely on the generator EVTGEN [32], in this paper we compare results using combinations of SHERPA [33] and EVTGEN event samples, which differ very little in their description of the B → X_c ℓν background but much more so in the description of the B → X_u ℓν signal. This paper is organised as follows. In Section 2, we discuss the generation of the MC event samples used in our analysis, both with EVTGEN and SHERPA, and show selected distributions before and after an in-house detector simulation. In Section 3, we present the input features of our ML analysis and compare the performance of BDTs and NNs for different levels of input variables. While ML techniques have great potential for extending the fiducial regions of experimental analyses, it is also vital to understand their limitations. Therefore, in Section 4, we study the inclusivity of different ML approaches and their dependence on the MC generator used to produce the training data. Finally, we conclude in Section 5.

Event generation
Our analysis aims at distinguishing B → X_u ℓν signal events from the ∼ 50 times larger background induced by the CKM-favoured B → X_c ℓν process. Other contributions from continuum and combinatorial backgrounds are neglected. The training and test samples of signal and background events for our ML analyses are produced using MC event generators. In this section we explain our simulation set-up and explore characteristics of the signal and background before and after a detector simulation. We also compare MC samples produced with the default generator for B-physics analyses, EVTGEN-v01.07.00 [32], with those from SHERPA-v2.2.8 [33].
For the EVTGEN sample, we generate signal and background events with the default run card. For the B → X_u ℓν signal we use the built-in hybrid model for combining resonant and non-resonant modes, with the default input values m_b = 4.8 GeV for the b-quark mass, a = 1.29 for the Fermi-motion parameter and α_s(m_b) = 0.22 for the strong coupling at the b-quark mass. The fragmentation of the X_u system into final-state hadrons is performed by PYTHIA8 [34,35], and final-state QED radiation by PHOTOS [36,37].
In the SHERPA simulations, we make use of the standard run card for B-hadron pair production on the Υ(4S) pole and use the SHERPA default settings for fragmentation.
In both cases, our baseline event selection is based on Ref. [5]. We select events with one fully hadronically decaying B meson on the tagging side (B_tag), and require the other B meson on the signal side (B_sig) to decay semileptonically to an electron or muon with p* > 1.0 GeV, where p* is the magnitude of the electron or muon momentum in the B-meson rest frame.
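The p* requirement involves boosting the lab-frame lepton momentum into the B_sig rest frame. A small illustration of this boost (the function name and interface are hypothetical; the content is standard special relativity):

```python
import numpy as np

def p_star(p_lep, p_B):
    """Magnitude of the lepton momentum in the B rest frame, obtained by a
    pure Lorentz boost with velocity beta = p_B / E_B.
    Four-vectors are (E, px, py, pz)."""
    p_lep, p_B = np.asarray(p_lep, float), np.asarray(p_B, float)
    E, p3 = p_lep[0], p_lep[1:]
    beta = p_B[1:] / p_B[0]
    b2 = beta @ beta
    if b2 == 0.0:          # B already at rest: nothing to boost
        return float(np.linalg.norm(p3))
    gamma = 1.0 / np.sqrt(1.0 - b2)
    bp = beta @ p3
    # standard boost of the spatial components into the frame moving with beta
    p3_rest = p3 + ((gamma - 1.0) * bp / b2 - gamma * E) * beta
    return float(np.linalg.norm(p3_rest))
```

A sanity check: a particle comoving with the B has vanishing momentum in the B rest frame.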

Detector effects
In order to mimic detector effects, we pass our MC data through an in-house detector simulation described in Appendix A. In that appendix we also show some validation plots comparing our MC samples with those produced by the Belle collaboration (see Figure 10).
Our detector simulation includes detector efficiencies and mistagging for particles on the signal side; it does not take into account that decay products from the tag side can be incorrectly assigned as signal-side particles. While this in-house detector simulation is too simplified to create completely realistic event samples, it shows reasonable agreement with MC results from the Belle collaboration and can be considered sufficient for the qualitative studies performed in this paper.
In Figure 1, we show normalized distributions of signal and background events in the EVTGEN MC sample before and after detector simulation for three kinematic variables: the hadronic invariant mass M_X, the energy-momentum difference P_+, and the lepton momentum in the B-meson rest frame p*. The distributions of M_X and P_+, which are built from multiple final-state particles and are therefore subject to a cumulative effect from detector inefficiencies and mistagging, are strongly affected by detector effects. In the low-M_X and low-P_+ regions, detector effects cause the charm background to populate even the theoretically inaccessible phase-space regions M_X < m_D and P_+ < m_D²/m_B. The lepton momentum, on the other hand, can be determined quite precisely, and detector effects have only a marginal impact on it. These plots make clear that, to achieve an efficient separation of signal and background after detector effects, kinematic cuts on their own are insufficient. We will list the full set of distinguishing features of the signal used in our ML analysis in Section 3.1.

While EVTGEN and SHERPA follow the same general principle in modelling resonant contributions, they differ in the treatment of the non-resonant modes. In this section we highlight the effects of these modelling choices on distributions of the signal and background.
In Figure 2, we compare distributions for the B → X_c ℓν background. In addition to the kinematic features M_X and P_+ we also show the number of charged kaons N_kaons in the event. Given that inclusive semileptonic decays into charm are nearly saturated by a small number of resonant contributions, it is not surprising that the EVTGEN and SHERPA results agree closely. Minor differences, for instance in the number of kaons, are caused by small discrepancies in the assumed branching ratios for high-mass X_c resonances, as well as by the different hadronization modelling in PYTHIA8 and SHERPA.
The analogous distributions for the B → X_u ℓν signal are shown in the upper panel of Figure 3. There are clear differences between the EVTGEN and SHERPA distributions of kinematic features such as M_X, which are caused by the different treatment of the non-resonant modes. In EVTGEN, the built-in hybrid model describes the non-resonant decay modes at leading order in the heavy-quark expansion using the DeFazio-Neubert (DFN) model [38], including a non-perturbative shape function to describe the Fermi motion of the b quark inside the B meson. The non-resonant contribution is modelled such that the M_X distribution for the sum of the resonant and non-resonant contributions matches the distribution predicted by the DFN model. This is achieved through a bin-by-bin reweighting of the non-resonant modes. In SHERPA the non-resonant signal decay modes are modelled by parton showering and hadronizing the leading-order partonic decay. Non-perturbative shape-function effects characterising the low-M_X region are not taken into account, and no reweighting of the events is performed to match state-of-the-art theory calculations.
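The bin-by-bin reweighting idea behind such a hybrid model can be sketched as follows, assuming per-bin event counts for the inclusive DFN prediction, the resonant modes, and the raw non-resonant sample (this is our own simplified rendering of the idea, not the EVTGEN implementation):

```python
import numpy as np

def hybrid_weights(n_dfn, n_res, n_nonres):
    """Per-bin weights for the non-resonant sample such that
    (resonant + weighted non-resonant) reproduces the inclusive DFN
    prediction in each M_X bin. Weights are clipped at zero in bins where
    the resonant contribution alone already saturates the prediction."""
    n_dfn, n_res, n_nonres = map(np.asarray, (n_dfn, n_res, n_nonres))
    w = np.zeros_like(n_dfn, dtype=float)
    mask = n_nonres > 0
    w[mask] = np.clip((n_dfn[mask] - n_res[mask]) / n_nonres[mask], 0.0, None)
    return w
```

Sharp features of this per-bin matching are one plausible origin of artefacts such as the bump discussed below.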
Comparing these two different approaches to the signal modelling in Figure 3, we find that, on the one hand, the EVTGEN results have a non-physical bump in the 1.5 GeV region of the M_X distribution, which is an artefact of the bin-by-bin reweighting to match the DFN results. The SHERPA distributions do not share this characteristic, since the non-resonant events are instead obtained by excluding resonant events from the parton shower. On the other hand, the current implementation of the SHERPA parton-shower model also produces a smaller proportion of the non-resonant signal contribution and generates fewer events in the high-M_X and high-P_+ regions compared to EVTGEN, which is precisely the region where the inclusive QCD predictions should be reliable. We further highlight this in the lower panel of Figure 3, where we compare the state-of-the-art OPE results from the BLNP approach [25] with EVTGEN and SHERPA results at the level of cumulative distributions. Overall, the agreement of the EVTGEN-generated distributions with the BLNP predictions is stronger, which is not surprising since the underlying inclusive model is the OPE-based DFN result.
Clearly, the B → X_u ℓν modelling in SHERPA needs a more sophisticated matching of the non-resonant, parton-shower contributions with (shape-function) OPE results before it can be used in |V_ub| extractions by experiments. For this reason, we use EVTGEN in the following section when studying the performance of ML-based classifiers, in spite of its own deficiencies in the low and intermediate invariant-mass regions. However, for the purposes of this paper, the present situation allows us to study an interesting question: how do ML approaches to |V_ub| extractions perform when the training and testing data are substantially different? This is the subject of Section 4.

BDTs vs Deep Neural Networks
In this section we give a systematic analysis of signal vs. background event classification using BDTs and deep neural networks. We use Bayesian neural networks (BNNs), which have been argued to deliver stable results and avoid overfitting [39]. The details of the architecture of the BDTs and NNs used in our study can be found in Appendix B, along with a breakdown of the data used in the training and testing procedure. We describe the input features of the ML algorithms in Section 3.1 and the metrics used in evaluating their performance in Section 3.2, and then move on to the results in Section 3.3. Throughout this section we use EVTGEN to generate the training and testing samples.

Input features
The features used in our multivariate analysis fall into two sets. One is based on physical high-level features, such as invariant masses and the number of final-state particles of a specific type, e.g. the number of kaons or slow pions; the other is based on low-level features, i.e. single-particle properties. In particular, the low- and high-level features are

low-level: { p_Btag, Q_Btag, p_i, Q_i, ID_i (i = 1, ..., 10) } , (3.1)
high-level: { M_X, P_+, q², p*, N_ℓ, N_K±, N_K0, N_hadron, M²_miss, Q_tot, N_π0_slow, N_π±_slow, M²_miss,D* } . (3.2)

The low-level features comprise the four-momentum p_Btag and charge Q_Btag of the tagged B meson. In addition, we pick out the 10 most energetic (as measured in the lab frame) detected final-state particles, label them with an index i = 1, ..., 10, and use as features the lab-frame four-momenta p_i, the charges Q_i and the identities ID_i of these particles. Events with fewer than 10 detected final-state particles have the corresponding particle features filled in with zeros. The high-level features are defined as follows. The four-momentum transfer squared is q² = (p_B − p_X)². N_ℓ denotes the number of leptons, which can only be greater than one if the secondary leptons have momenta smaller than 1 GeV. Since the B → X_u ℓν signal is very unlikely to contain secondary leptons, this feature can be used to suppress the background, see the left panel of Figure 4.
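The assembly of the flat low-level feature vector, including the zero-padding for events with fewer than 10 detected particles, might look as follows (a schematic of our own; the exact ordering and encoding of the features is an implementation choice, not documented in the analysis):

```python
import numpy as np

N_PART = 10  # keep the 10 most energetic detected final-state particles

def low_level_features(p_Btag, Q_Btag, particles):
    """Build the flat low-level feature vector described in the text.
    `particles` is a list of (four_momentum, charge, pid) tuples in the lab
    frame; four_momentum is (E, px, py, pz)."""
    # sort by lab-frame energy, descending, and keep the leading N_PART
    particles = sorted(particles, key=lambda p: p[0][0], reverse=True)[:N_PART]
    feats = list(p_Btag) + [Q_Btag]
    for p4, q, pid in particles:
        feats.extend(list(p4) + [q, pid])
    # zero-pad missing particle slots (4-momentum + charge + ID = 6 numbers)
    feats.extend([0.0] * (6 * (N_PART - len(particles))))
    return np.asarray(feats, dtype=float)
```

The resulting vector has a fixed length of 4 + 1 + 10 × 6 = 65 entries, as required for a fixed-size NN input layer.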
N_K± and N_K0 denote the number of charged and neutral kaons, respectively, where neutral kaons K0_S are reconstructed from charged pions with an invariant mass in the range m_π+π− ∈ [0.490, 0.505] GeV. Kaons are frequently produced in D-meson decays, and their presence hence indicates a B → X_c ℓν background event, see the central panel of Figure 4. The number of final-state particles resulting from the hadron decay, N_hadron, is typically larger for hadrons with a higher mass, such as the background D mesons. The missing mass squared M²_miss, defined as the square of the missing momentum p_miss = p_sig − p_X − p_ℓ, where p_sig = p_Υ(4S) − p_Btag is the reconstructed momentum of the signal-side B meson, would always be compatible with zero in the absence of detector effects. For background events, which as discussed above have a higher final-state particle multiplicity, the probability of missing a final-state particle is higher, resulting in positive values of the missing mass squared, see the right panel of Figure 4. The total charge Q_tot of all particles in the event, on both the signal and the tag side, is also subject to detector effects. It will only be non-zero for events where charged particles have been missed, which happens more often for the background events due to their larger final-state particle multiplicity. Slow pions, i.e. pions with momentum |p_π| < 220 MeV, can originate from D* → Dπ transitions and hence appear more often in the B → X_c ℓν background. We therefore include the number of neutral and charged slow pions, N_π0_slow and N_π±_slow, in our high-level feature set. To test the compatibility of a slow pion with a D* → Dπ transition, we further define the missing mass squared M²_miss,D* = (p_sig − p_D* − p_ℓ)², where the D* four-momentum is estimated from the slow pion.
Here we have explicitly assumed that the slow-pion direction is strongly correlated with the D* direction. The quantity M²_miss,D* is more likely to peak at zero for true D* → Dπ transitions. Distributions of the high-level input features not shown in Figure 4 are displayed in Figure 12 in Appendix C.
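Two of the high-level features above are straightforward to compute from four-momenta. The sketch below illustrates M²_miss and the K0_S invariant-mass window (function names are ours; four-vectors are (E, px, py, pz) in GeV):

```python
import math
import numpy as np

def minkowski_sq(p):
    """Minkowski square E^2 - |p|^2 of a four-vector (E, px, py, pz)."""
    E, px, py, pz = p
    return E**2 - px**2 - py**2 - pz**2

def missing_mass_sq(p_upsilon, p_Btag, p_X, p_lep):
    """M^2_miss = (p_sig - p_X - p_lep)^2, with p_sig = p_Y(4S) - p_Btag."""
    p_sig = np.asarray(p_upsilon, float) - np.asarray(p_Btag, float)
    p_miss = p_sig - np.asarray(p_X, float) - np.asarray(p_lep, float)
    return minkowski_sq(p_miss)

def is_K0S_candidate(p_pip, p_pim):
    """Apply the pi+ pi- invariant-mass window [0.490, 0.505] GeV used to
    reconstruct K0_S candidates."""
    p_pair = np.asarray(p_pip, float) + np.asarray(p_pim, float)
    m = math.sqrt(max(minkowski_sq(p_pair), 0.0))
    return 0.490 <= m <= 0.505
```

With perfect reconstruction the missing four-momentum vanishes and M²_miss is compatible with zero, as stated in the text.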
We have chosen this set of high-level features to mimic the feature selection in the BDT analyses performed by Belle in Refs. [5,7]. Some differences with respect to the sets used in those papers arise because we do not have access to all experimental features in our simplified detector simulation, for instance features related to the quality of the signal reconstruction.

Metrics
Before we compare the performance of different ML approaches and input feature setups, let us briefly introduce some notation for the ML output and review metrics used to quantify performance.
Our binary classifiers take as input the multidimensional features of an event and return a classifier output, a single number ζ ∈ [0, 1]. Events with classifier output ζ ∼ 1 are likely to be signal, while events with ζ ∼ 0 are likely to be background. We define our signal (fiducial) region through a cut on the classifier output: all events with ζ > ζ_cut are classified as signal events. Signal events which are correctly classified as signal are denoted true positive (TP) events, while background events which are incorrectly classified as signal are denoted false positive (FP) events.
Standard performance metrics in ML are the receiver operating characteristic (ROC) curve, i.e. the true positive rate (TPR, signal acceptance) as a function of the false positive rate (FPR, background acceptance), and the corresponding area under the curve (AUC), the integral of the ROC curve. It is also customary to plot the inverse of the FPR as a function of the TPR. A quantity which is often used as a metric in particle physics is the statistical significance σ, defined as

σ = TP / √(TP + FP) = S / √(S + B) , (3.3)

where in the second equality we have used S and B to denote the number of signal and background events in the signal region to bring the expression into a more familiar form.
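In code, the significance for a given cut on the classifier output, and the TPR/FPR points tracing the ROC curve, can be computed as follows (a plain-NumPy sketch with our own function names):

```python
import numpy as np

def significance(scores_sig, scores_bkg, zeta_cut):
    """sigma = TP / sqrt(TP + FP) for a given cut on the classifier output."""
    tp = np.sum(np.asarray(scores_sig) > zeta_cut)
    fp = np.sum(np.asarray(scores_bkg) > zeta_cut)
    return tp / np.sqrt(tp + fp) if tp + fp > 0 else 0.0

def roc_points(scores_sig, scores_bkg, cuts):
    """TPR and FPR as functions of the cut; the ROC curve is TPR vs FPR,
    and the AUC is its integral."""
    tpr = np.array([np.mean(np.asarray(scores_sig) > c) for c in cuts])
    fpr = np.array([np.mean(np.asarray(scores_bkg) > c) for c in cuts])
    return tpr, fpr
```

Scanning `cuts` over [0, 1] traces out the full ROC curve, which can then be integrated numerically to obtain the AUC.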
To remove the dependence on the data-sample size from the significance, we use the significance improvement σ̂, i.e. the significance normalized to its value at the baseline selection,

σ̂ = σ / σ_baseline . (3.4)

A significance improvement greater than one signals a performance increase. Plotting the significance improvement as a function of the true positive rate defines the significance improvement characteristic (SIC) curve [40].
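The SIC curve then follows by normalizing the significance at each cut to its value at the baseline selection (in this sketch the baseline is taken to accept every event, an illustrative simplification):

```python
import numpy as np

def sic_curve(scores_sig, scores_bkg, cuts):
    """Significance improvement sigma/sigma_baseline as a function of TPR."""
    def sig(cut):
        tp = np.sum(np.asarray(scores_sig) > cut)
        fp = np.sum(np.asarray(scores_bkg) > cut)
        return tp / np.sqrt(tp + fp) if tp + fp else 0.0
    sigma_baseline = sig(-np.inf)  # baseline selection: accept everything
    tpr = [float(np.mean(np.asarray(scores_sig) > c)) for c in cuts]
    sic = [sig(c) / sigma_baseline for c in cuts]
    return tpr, sic
```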

BDT and BNN performance on different levels of input features
We first contrast the performance of the BDT and BNN on signal vs. background classification using different levels of input features. We consider three scenarios: (i) using only the low-level features in Eq. (3.1); (ii) using only the high-level features in Eq. (3.2); (iii) using a combination of these low- and high-level features.
The ROC and SIC curves for the BDT and BNN analyses using these input feature scenarios are shown in Figure 5.
As expected, the BDT performs well on high-level input features, the most used features being the number of kaons, the number of leptons, the hadronic invariant mass M_X, the hadron multiplicity and the missing mass squared M²_miss. However, it performs poorly when trained only on low-level features, indicating that it cannot use them to construct additional non-linear features such as invariant masses. Using a combination of low- and high-level features slightly improves the BDT performance compared to high-level features only. We have explicitly checked that this performance increase results almost entirely from adding the particle energies. The particle three-momenta, on the other hand, do not seem to contain additional usable information for the BDT.
For the BNN the situation is very different. It performs slightly better when trained only on low-level features than when trained only on high-level features. This indicates that, as expected, it is able to learn new and efficient discriminating features from the low-level inputs. Training on a combination of low- plus high-level inputs improves its performance only marginally compared to low-level features alone (mainly due to the inclusion of M_X as a feature), showing that the BNN has learned the most important high-level features on its own.
The maximum of the SIC curves is reached for a cut on the classifier output of ζ_cut ≈ 0.97, which corresponds to a signal acceptance, or true positive rate TPR = TP/(TP+FN), of approximately 75 %. Explicitly, we find the following values for the maximum significance improvement and the AUC for a BDT or BNN trained and tested on a combination of high- and low-level features from the EVTGEN data:

AUC = 0.981 , σ̂ = 5.59 (BDT) ,  AUC = 0.986 , σ̂ = 5.67 (BNN) . (3.5)

The AUC and σ̂ for the BNN are only about 2 % better than for the BDT. Training on high-level features only puts the BNN on an equal footing with the BDT; in fact, we find that in that case they reach the exact same significance improvement, σ̂ = 5.42.
The very small loss of performance compared to Eq. (3.5) indicates that the high-level features are well chosen for discriminating signal and background, containing (almost) the full relevant information that the BNN can learn from the low-level features.
It is interesting to contrast the significance improvements of the BDT and BNN with those obtained from a typical cut-and-count analysis based on the cuts provided in Ref. [6]. With the minimal requirement of having exactly one lepton, a total charge of zero, a veto on kaons and a low missing mass squared, we obtain a significance improvement of σ̂ = 1.9. If in addition to these cuts we select a theoretically background-free region, the significance improvement increases further, Eq. (3.7). Comparing these values with those from the BDT and BNN analyses in Eq. (3.5), we see that the ML approaches clearly outperform the cut-and-count analyses. In Appendix A.4, we study the dependence of these results on the detector simulation.
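A cut-and-count selection of this kind reduces to a few boolean requirements per event. A schematic version (the M²_miss threshold of 0.5 GeV² is an illustrative value of ours, not the published cut of Ref. [6]):

```python
def cut_and_count_select(event):
    """Baseline cut-and-count selection modelled on the cuts of Ref. [6]:
    exactly one lepton, zero total charge, no kaons, low missing mass squared.
    `event` is a dict of high-level features; the 0.5 GeV^2 threshold is an
    illustrative assumption."""
    return (event["n_leptons"] == 1
            and event["q_total"] == 0
            and event["n_kaons"] == 0
            and event["m2_miss"] < 0.5)
```

Unlike a trained classifier, such hard cuts cannot exploit correlations between features, which is one way to understand the gap to the ML significance improvements.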

Inclusivity of ML approaches
A main motivation for the application of ML techniques to |V_ub| determinations is to widen the experimentally accessible fiducial region to a level of inclusivity where the theoretically clean, local OPE is unambiguously applicable. This amounts to two conditions on the measured X_u final state: first, that it is not subject to severe kinematic cuts (in which case the shape-function OPE would apply), and second, that it contains a sufficiently broad sample of exclusive hadronic final states in a given kinematic region (such that quark-hadron duality applies). A concern in supervised ML approaches is that the classifiers will overuse either inclusive kinematic properties or IR-unsafe hadron-level properties of the final state, thereby limiting the signal output to a restricted fiducial region which is very sensitive to MC modelling, regardless of the inclusivity of the input events.
In this section we study the inclusivity of the signal acceptance in ML approaches to event classification. As the inclusivity depends crucially on the input features used in the ML classifier, we consider two scenarios:

• NN_tight : a NN using as input both the low- and high-level features listed in Eq. (3.1) and Eq. (3.2), respectively. This is a more sophisticated implementation of the basic approach of Ref. [5], and its classification power was explored in Section 3.3.
• NN_loose : a NN using as input the high-level features listed in Eq. (3.2), but excluding the kinematic features M_X, P_+, q² and p*. This is a proxy for the BDT used in the recent reanalysis of Belle data [7].
In both cases the classifier threshold is chosen to maximize the significance of the accepted event set. Obviously NN_loose, which intentionally excludes discriminating kinematic features of the signal and background, will not reach the same signal purity as NN_tight. In our analysis NN_tight reaches a signal-over-background ratio of S/B ∼ 13, while for NN_loose S/B ∼ 0.3, such that the background contribution is still dominant even after event selection by the BNN. In this latter case it is thus essential to perform binned one- and two-dimensional likelihood analyses of the kinematic features of the signal and background after event selection by NN_loose, as was done in Ref. [7]; this procedure can be useful for NN_tight as well, even though the S/B ratio is much higher.
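Choosing the classifier threshold that maximizes the significance of the accepted event set amounts to a one-dimensional scan over ζ_cut, for example (a sketch of ours; a real analysis would scan the actual classifier outputs):

```python
import numpy as np

def best_threshold(scores_sig, scores_bkg, n_grid=200):
    """Scan the classifier output in [0, 1) for the cut maximizing
    the significance TP / sqrt(TP + FP)."""
    best_cut, best_sig = 0.0, -1.0
    for cut in np.linspace(0.0, 1.0, n_grid, endpoint=False):
        tp = np.sum(np.asarray(scores_sig) > cut)
        fp = np.sum(np.asarray(scores_bkg) > cut)
        if tp + fp == 0:
            continue
        s = tp / np.sqrt(tp + fp)
        if s > best_sig:
            best_cut, best_sig = cut, s
    return best_cut, best_sig
```

Maximizing the significance rather than, say, the purity is what allows NN_loose to retain a large but background-rich event sample for the subsequent likelihood fit.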
A main focus of our study is how changes of the testing and training data affect the inclusivity of the ML analyses. Testing and training the BNNs on differently modelled event sets provides a good test for overtraining and gives insight into how well the classifier might perform when applied to real-world events, which are not expected to agree perfectly with MC data. The existing ML-based Belle analyses [5,7] estimate uncertainties stemming from input-data modelling by testing on samples produced with different parameter choices within the EVTGEN framework while fixing the ML configuration. Here we explore the alternative method of using a fundamentally different MC event-generation framework, namely SHERPA. In this section we train all BNNs on EVTGEN data and then study their classification properties on both SHERPA and EVTGEN data; in Appendix D we show equivalent results when the BNNs are trained instead on SHERPA data. All MC samples used in testing the BNNs, whether generated by SHERPA or EVTGEN, contain the same ratio of signal to background events after detector simulation.
We compare the inclusivity of the two BNN set-ups in two main ways. In Section 4.1, we study the inclusivity in kinematic phase space, and in Section 4.2 we focus on inclusivity in the available hadronic final states. In the latter section we also study the sensitivity to changes of hadronization parameters within the EVTGEN framework.

Inclusivity in kinematics
We illustrate the salient features of event selection by NN_tight and NN_loose as a function of M_X, q², and p* in Figure 6. The binning of the kinematic variables matches that used in the fitting procedure of the recent |V_ub| extraction in Ref. [7].

[Figure 6: Distributions and signal acceptance of SHERPA and EVTGEN Monte Carlo data as functions of M_X, q², and p* for NN_tight (left) and NN_loose (right), trained on EVTGEN data. The distributions in the upper panels of each plot are normalized to the total number of signal events. For NN_loose the dashed lines in the lower panels show the background acceptance, using the scale for the y-axis displayed on the right. In the lower panels, error bars highlight the MC uncertainty on the acceptance; the error bars for the background uncertainty, which become visible at high p*, use a lighter shade. For bins with very low event numbers, we have used the tabulated uncertainties from Ref. [41].]
In all cases, the bins are sufficiently wide that the results can be compared with predictions from the (shape-function) OPE, after correcting for acceptances and detector effects. Each plot in the figure shows the following three results for the indicated MC event sample: the detector-level signal distributions and the total number of events (TP+FP) accepted by the given BNN (upper panels), and the signal acceptance of the BNN (lower panels), all normalized to the number of detector-level signal events. The left (right) column uses NN_tight (NN_loose). The BNNs are trained on EVTGEN data and then tested on both EVTGEN and SHERPA data. For NN_loose, we also display the background acceptance in the lower panels, using the scale for the y-axis displayed on the right of the plots. The background acceptance for NN_tight is negligible across phase space and is thus not shown.
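The per-bin signal acceptances shown in the lower panels, together with a simple binomial MC uncertainty, can be computed along these lines (our own sketch; the actual uncertainty treatment for sparsely populated bins follows the tabulated values of Ref. [41]):

```python
import numpy as np

def binned_acceptance(values, accepted, bin_edges):
    """Signal acceptance per kinematic bin: accepted / total events in each
    bin, with a naive binomial uncertainty sqrt(eff * (1 - eff) / n)."""
    values = np.asarray(values)
    accepted = np.asarray(accepted, dtype=bool)
    acc, err = [], []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (values >= lo) & (values < hi)
        n = in_bin.sum()
        k = (in_bin & accepted).sum()
        eff = k / n if n else 0.0
        acc.append(eff)
        err.append(np.sqrt(eff * (1 - eff) / n) if n else 0.0)
    return np.array(acc), np.array(err)
```

Normalizing to the number of detector-level signal events in each bin, as here, is what makes acceptances from differently sized EVTGEN and SHERPA samples directly comparable.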
The figure highlights an inevitable fact: since NN_tight uses kinematic features to discriminate between signal and background, its acceptance is kinematics-dependent. The acceptance is higher in the theoretically background-free regions of low M_X, high q², and high p*, and lower in regions where the charm background is large.
It is interesting and important to study the MC-data dependence of the signal acceptance in these two regions, and to connect it to kinematic-modelling uncertainties in the MCs. Take for example the results as a function of M_X in the top left of the figure. In the 0 < M_X < 1.5 GeV bin, the EVTGEN and SHERPA modelling of the b → u signal differs dramatically, with far more events in the SHERPA sample and also a very different shape, as seen in the finely binned distributions shown in Figure 3. This is not entirely unreasonable, as the details of the low-M_X distributions depend on the method for matching resonant and non-resonant modes, and even the integrated distribution over the entire bin depends on the exact implementation of the shape-function OPE. However, the MC-dependence of the signal distribution in this theoretically intricate region does not propagate into the signal acceptance of NN_tight, which is essentially MC-independent.
Contrast this with the high-M_X region, especially the bins above 1.9 GeV where the charm background is large. In this case, the marked difference in the shapes of the EVTGEN and SHERPA signals as a function of M_X does lead to noticeably different signal acceptances. On the other hand, kinematic distributions in the high-M_X region, where this effect is most significant, are reliably calculable within the local OPE (before detector effects), so the MC-dependence can be viewed as an improvable deficiency in the current implementation of SHERPA, which does not perform a matching with first-principles predictions as described in Section 2.3, rather than as an irreducible kinematic-modelling uncertainty. One would therefore expect a reasonable MC uncertainty associated with extrapolating the accepted events to the full fiducial region, although this deserves careful quantitative study in actual experimental analyses.
Similar qualitative comments hold for the p * and q 2 distributions: the signal acceptances are essentially MC-independent in the highest bins, where kinematic modelling dependence due to non-perturbative shape-function effects is expected to be significant, but become MC-dependent in the lower bins, where the local OPE is applicable. On the other hand, the acceptances are somewhat flatter in these variables than in M X , never dropping below 60% in any of the bins.
The exclusion of kinematic input features from NN loose leads to a qualitatively different picture of event acceptance compared to NN tight . The right-hand side of Figure 6 shows that its signal acceptance as a function of M X is considerably flatter, remaining large at and above the m D resonance, although at the price of rejecting far less background. In total, NN loose also accepts less of the signal: whereas NN tight accepts 75% (85%) of the EVTGEN (SHERPA) signal, the corresponding numbers for NN loose are 61% (53%) at the value of the classifier threshold which optimizes the significance improvement. For the q 2 and p * distributions the acceptances of NN loose are only moderately flatter than those of NN tight , if at all. The signal acceptances of NN loose are reasonably independent of the MC testing data across the kinematic phase space. However, unlike for NN tight , noticeable differences can be seen in the lowest M X and highest q 2 and p * bins, where shape-function effects and kinematic modelling are expected to be most important. The background acceptance of NN loose is relatively flat at high M X and low p * , but not at low q 2 . Moreover, in the lowest M X bins as well as the high-q 2 region the background is largely excluded; these regions correlate with a large missing mass squared. These observations show that the MC dependence of the acceptances of a given BNN is subtle: avoiding sensitivity to kinematic modelling by excluding kinematic features is not always possible. As a further illustration, consider a NN, NN binned , which is the same as NN tight except that particle 4-momenta are excluded and the high-level kinematic features are defined in bins. This binning matches that used in the construction of the hybrid Monte Carlo implemented within EVTGEN in Ref.
[7], and is sufficiently wide that fully inclusive distributions within these bins are accessible to the (shape-function) OPE. In other words, unlike NN tight , this set-up is blind to the heavily model-dependent point-by-point distributions of the hybrid Monte Carlo in the low-M X and high-p * and q 2 region, at least as far as the explicit input features are concerned.
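The role of the coarse binning can be illustrated with a short sketch: replacing continuous kinematic values by bin indices hides the point-by-point distributions from the classifier. The bin edges below are illustrative placeholders, not the actual hybrid-MC binning of Ref. [7]:

```python
import numpy as np

# Hypothetical bin edges for illustration only; the hybrid-MC
# binning of Ref. [7] is not reproduced here.
MX_EDGES = np.array([0.0, 1.5, 1.9, 2.5, np.inf])   # GeV
Q2_EDGES = np.array([0.0, 4.0, 8.0, 12.0, np.inf])  # GeV^2

def binned_features(mx, q2):
    """Map continuous kinematics to coarse bin indices, so that only
    the bin (not the position within it) is visible to the classifier."""
    return np.stack([np.digitize(mx, MX_EDGES) - 1,
                     np.digitize(q2, Q2_EDGES) - 1], axis=-1)
```

Only the bin index, not the location within a bin, is then available as an explicit input feature.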
In Figure 7 we compare the acceptances of NN tight and NN binned as functions of the kinematic variables, using the same binning as in Figure 6. Examining the figure shows that the MC dependence of the NN binned acceptances is not reduced compared to NN tight , and they depend more strongly on the kinematic variables. In particular, when viewed as a function of M X , NN binned shows a considerable drop in classification power in the higher bins, where kinematic modelling uncertainties are expected to be best under control as long as the hybrid Monte Carlo is matched to OPE predictions. Moreover, the maximal significance improvement σ drops: when tested on EVTGEN data, NN tight reaches σ = 5.67 while NN binned reaches σ = 5.46. It is thus far from clear that using a set-up such as NN binned would lead to a reduced theory uncertainty in |V ub | extractions compared to NN tight , even though its explicit kinematic input features can be calculated within the (shape-function) OPE.
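For concreteness, one common definition of the significance improvement at a given classifier threshold is ε_S/√ε_B, in line with the SIC curves of Figure 5; whether this matches the exact definition behind the quoted σ values is an assumption here. A minimal threshold scan over toy classifier scores:

```python
import numpy as np

def max_sic(sig_scores, bkg_scores, n_thresholds=200):
    """Scan classifier thresholds and return the maximal significance
    improvement eps_S / sqrt(eps_B) over all thresholds."""
    cuts = np.linspace(0.0, 1.0, n_thresholds, endpoint=False)
    eps_s = np.array([(sig_scores > c).mean() for c in cuts])
    eps_b = np.array([(bkg_scores > c).mean() for c in cuts])
    ok = eps_b > 0  # avoid division by zero for empty background
    return (eps_s[ok] / np.sqrt(eps_b[ok])).max()

rng = np.random.default_rng(0)
sig = rng.beta(5, 2, 10_000)  # toy signal scores peaked near 1
bkg = rng.beta(2, 5, 10_000)  # toy background scores peaked near 0
```

In a real analysis the scores would come from the trained BNN or BDT rather than toy distributions.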

Inclusivity in hadronic final states
We now shift our focus to inclusivity in properties of the final-state X u system which appear only after fragmentation into hadrons. Such features are by definition inaccessible to OPE-based QCD calculations, which rely on a sum over hadronic final states in order for quark-gluon duality to apply.
In Figure 8 we display the same information as in Figure 6, but this time as a function of the number of kaons and the total charge in the event. The number of kaons is an explicit probe of the flavour structure of the final state, whereas the total charge is closely related to the charged-hadron multiplicity (see the discussion after Eq. (3.2) above). Comparing the acceptances of NN tight and NN loose , we find that NN loose effectively vetoes both signal and background events with kaons or a non-zero total charge. Therefore, when performing fits of the kinematic distributions after the NN loose analysis, a good understanding of both the signal and the charm background after strict cuts on the hadronic final state is required. NN tight , on the other hand, accepts a large proportion of events with kaons or a non-zero total charge and is thus more inclusive in (and less dependent on) these hadronization-model-dependent features.
The number of signal events containing kaons in the final state is directly related to the ss-popping probability γ s , which determines how often an ss pair is produced in the decay of the hadronic X system. It is interesting to further investigate the hadronization-modelling sensitivity of the classifiers NN tight and NN loose resulting from their different kaon acceptances. Since the number of kaons in the background, which is entirely dominated by resonant contributions, is largely unaffected by changes of γ s , we investigate the sensitivity of the signal acceptance only. We have produced additional EVTGEN test samples with a modified ss-popping probability in the range γ s ∈ [0.1, 0.4] and apply NN tight and NN loose to these. In Figure 9 we display the relative change of the number of TP events as a function of γ s , taking the PYTHIA8 default γ s = 0.217 [44] as our reference value. As events containing kaons are more likely to be classified as background by the BNNs, the number of TP events decreases with an increasing value of γ s . For NN loose , which relies more heavily on the number of kaons as a feature, the decrease of the signal acceptance is stronger.
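The qualitative trend in Figure 9 can be reproduced with a toy model. We assume, purely for illustration, that the signal kaon multiplicity is Poisson-distributed with a mean proportional to γ s , and that an analysis rejects events containing kaons with some probability (1 for a hard kaon veto, smaller for a softer NN tight -like selection); neither assumption is taken from the analysis itself, and the mean multiplicity below is a made-up number:

```python
import math

GAMMA_REF = 0.217          # PYTHIA8 default ss-popping probability [44]
MEAN_KAONS_AT_REF = 0.6    # hypothetical mean kaon multiplicity in signal

def relative_tp(gamma_s, p_reject_kaon_event=1.0):
    """Toy estimate of TP(gamma_s)/TP(gamma_ref) when events with >=1 kaon
    are rejected with probability p_reject_kaon_event (1.0 mimics a hard
    kaon veto; smaller values mimic the softer behaviour of NN_tight)."""
    def accept(g):
        lam = MEAN_KAONS_AT_REF * g / GAMMA_REF  # mean scales with gamma_s
        p_no_kaon = math.exp(-lam)               # Poisson P(n_K = 0)
        return p_no_kaon + (1 - p_no_kaon) * (1 - p_reject_kaon_event)
    return accept(gamma_s) / accept(GAMMA_REF)
```

A smaller rejection probability flattens the γ s dependence, mirroring the behaviour of NN tight relative to the kaon veto.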
We contrast the effect of γ s on our ML analyses with a simple kaon veto as well as with a cut-based approach defined by the cuts listed in Eq. (3.6) plus an additional cut M X < 1.5 GeV (tight cuts). The ML approach NN loose shows the same dependence on γ s as a kaon veto, as expected from the signal acceptance shown in Figure 8. NN tight , however, is less affected by an increased value of γ s than its cut-and-count counterpart, as it does not apply a stringent veto on kaons in signal events. Overall, our findings highlight the ability of ML approaches to reduce the reliance on single observables.

Discussion
The above results show that conclusions on the inclusivity of NN tight and NN loose depend heavily on how one frames the issue. If the focus is on a flat coverage of kinematic phase space, especially as a function of M X , then NN loose , which does not include kinematic features, is preferable. If, on the other hand, one wishes to be more inclusive in the sum over exclusive hadronic final states on which quark-gluon duality is based, then NN tight , which accepts more events overall due to its increased discriminating power, is more attractive. An important point to keep in mind when considering |V ub | extractions is that in both cases MC modelling is used to extrapolate the signal from the fiducial region singled out by the NN to the partial inclusive branching fractions with a baseline kinematic cut of p * > 1.0 GeV (with no restrictions on the hadronic decomposition of the X u final state). For NN tight this extrapolation is mainly sensitive to the shape of the signal distribution at relatively high M X , which can be reliably calculated in the local OPE. For NN loose it is mainly sensitive to non-perturbative phenomena such as the flavour decomposition and multiplicity of the hadronic final state across all kinematics. Given that the extrapolations are sensitive to different effects, it may be wise to pursue both approaches in real-life |V ub | extractions.
It is worth mentioning that the signal acceptance of the kinematics-independent "background suppression" BDT used in the recent analysis of Ref. [7] is significantly smaller than that found using NN loose and our in-house detector simulation, so that the extrapolation from the accepted fiducial region to fully inclusive partial branching fractions with kinematic cuts is correspondingly larger. By the same token, we expect that the acceptance of NN tight in the high-M X region would be considerably lower in the full experimental environment, again requiring a larger extrapolation than seen in our simplified set-up.

Conclusions
We have performed a systematic study of the use of ML techniques in inclusive |V ub | determinations. While our analysis is based on a simplified set-up, using an in-house detector simulation and seeking only to separate the B → X u ν signal from the B → X c ν background, it has revealed several important qualitative points.
First, in Section 3, we showed that using a deep neural network trained on low-level single-particle features leads to only a small performance increase with respect to a BDT analysis based on high-level features of the type used in the Belle analysis [5]. While upgrading such analyses to modern ML standards is certainly worthwhile, the modest performance increase produced by the more sophisticated ML architecture implies that the high-level features used in current BDTs are well chosen: the most important aspects of discriminating the b → u signal from the b → c background can be captured with physicist-engineered observables.
Second, in Section 4 we studied the inclusivity of the fiducial region selected by cuts on the classifier output of two types of neural networks: NN tight , based on both kinematic and hadron-level features of the final state, such as the one just described and used in Ref. [5], and NN loose , which excludes the kinematic properties and is similar to the BDT used in the recent analysis in Ref. [7]. While the signal acceptance of NN loose is fairly flat across the kinematic phase space, it effectively makes hard cuts on hadronic properties of the event such as the number of kaons and the total charge. On the other hand, NN tight is significantly more inclusive in the hadronic decomposition of the final state, and also in general, but tends to give less weight to kinematic regions where there is a large overlap with the b → c background. Both of these issues deserve careful consideration when assessing systematic theory uncertainties related to the MC extrapolation from the fiducial regions to partial branching fractions that are calculable within the (shape-function) OPE in QCD.
Finally, as the Belle II measurements become systematics-dominated, it will be important to pay close attention to the sensitivity of supervised ML approaches to the MC data on which they are trained. We have investigated the influence of a modified ss-popping probability on the signal acceptance using EVTGEN data. An ML approach based on kinematic information, such as NN tight , is generally less biased by changes of global event parameters. Furthermore, in Section 2 we showed results from the multipurpose MC event generator SHERPA in addition to those from EVTGEN, which has been the exclusive MC tool for all previous |V ub | analyses, and in Section 4 we discussed features appearing when the BNNs were trained and tested on event sets produced by different MCs. While SHERPA requires an improved matching with OPE-based theory predictions before it can be used in experimental analyses, investigating the stability of ML approaches against MCs whose modelling is based on different theory assumptions provides a powerful stress-test of MC uncertainties, beyond the current practice of exploring modifications within EVTGEN.

A Detector simulation
Theoretically, the signal and background processes are well separated through the kinematic boundaries at M X = m D , P + = m 2 D /m B and p * = (m 2 B − m 2 D )/(2m B ). However, detector effects lead to large contributions from the B → X c ν background in the B → X u ν signal region, and it is necessary to include them in order to mimic the challenges of the experimental environment.
In the following, we describe our in-house detector simulation, which is meant to capture the main features of a more complete one. We list the assumed parameters for the detector resolution in Section A.1 and for the detector efficiencies and mistagging probabilities in Section A.2. Most of these values are based on the description of the BaBar detector in Ref. [45], on the BaBar analysis of the inclusive determination of |V ub | [6] and on the corresponding PhD thesis on the same subject [46]. We compare the resulting distributions after our detector simulation to those shown in the recent reanalysis of Belle events in Ref. [7]. We highlight that the beam energies at Belle (3.5 GeV and 8.0 GeV) are slightly different from the values we used in our MC event generation (4.0 GeV and 7.0 GeV), see Section 2. We therefore expect deviations of the lab-frame momenta at the level of 10 %.
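The effect of the different beam energies can be made quantitative with a two-line calculation (our own arithmetic, neglecting beam masses): both configurations give the same centre-of-mass energy 2√(E HER E LER ), but the boost of the CM frame differs, which is what shifts lab-frame momenta:

```python
import math

def cm_energy(e_her, e_ler):
    """sqrt(s) for head-on ultra-relativistic beams (masses neglected)."""
    return 2.0 * math.sqrt(e_her * e_ler)

def cm_boost(e_her, e_ler):
    """Velocity beta of the CM frame in the lab (masses neglected)."""
    return (e_her - e_ler) / (e_her + e_ler)

# Belle: 8.0 GeV on 3.5 GeV; our MC generation: 7.0 GeV on 4.0 GeV
print(cm_energy(8.0, 3.5), cm_boost(8.0, 3.5))  # same sqrt(s) ~10.58 GeV
print(cm_energy(7.0, 4.0), cm_boost(7.0, 4.0))  # smaller boost
```

Since 8.0 × 3.5 = 7.0 × 4.0 = 28 GeV², the Υ(4S) kinematics in the CM frame are unchanged; only the lab-frame boost differs, consistent with the 10 % level deviations quoted above.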

A.1 Detector resolution
We assume perfect reconstruction of the direction of each detected particle and only smear the energy (momentum) of photons (charged particles).
The energy resolution of photons is parametrized as in Ref. [46] (Eq. (A.1)). For the resolution of charged particles, we use the p T resolution of the Drift Chamber (DCH), which is the main tracking device for charged particles with p T ≥ 120 MeV [46]: σ p T /p T = 0.45 % ⊕ 0.13 % · p T , with p T in GeV. (A.2) We apply this formula to all charged particles, including those with p T < 120 MeV.
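Eq. (A.2) amounts to a Gaussian smearing of each charged particle's transverse momentum, with the two terms added in quadrature (⊕). A minimal sketch, including the broadening factor used in Appendix A.4:

```python
import numpy as np

def smear_pt(pt, rng, broadening=1.0):
    """Smear charged-particle pT according to Eq. (A.2):
    sigma_pT / pT = 0.45% (+) 0.13% * pT, with pT in GeV.
    `broadening` rescales the resolution (10 mimics Appendix A.4)."""
    rel_sigma = np.hypot(0.45e-2, 0.13e-2 * pt) * broadening
    return pt * (1.0 + rel_sigma * rng.standard_normal(np.shape(pt)))
```

For a 1 GeV track the relative resolution is √(0.0045² + 0.0013²) ≈ 0.47 %.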

Photons
Photons are detected with an efficiency of 96 % for energies above 20 MeV.
eff γ (E γ ) = 0.96 (E γ ≥ 0.02), with E γ in GeV. (A.3)
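Eq. (A.3) corresponds to a step-function detection probability; a sketch of applying it, under the assumption that photons below 20 MeV are simply undetected:

```python
import numpy as np

def detect_photons(e_gamma, rng):
    """Keep each photon with probability 0.96 if E_gamma >= 20 MeV
    (Eq. (A.3)); photons below threshold are assumed undetected."""
    e_gamma = np.asarray(e_gamma)
    eff = np.where(e_gamma >= 0.02, 0.96, 0.0)  # energies in GeV
    return e_gamma[rng.random(e_gamma.shape) < eff]
```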

A.4 Broader resolution
Some of the input features in our analysis do not fully resemble the experimental input features. In this appendix, we study the dependence of our findings in Section 3.3 on the detector simulation. As a test case, we broaden the smearing of the charged-particle momenta and photon energies. Increasing the smearing by a factor of 10 brings the M X resolution to a level close to what is seen in experiment. The resulting M X distribution is shown in the top panel of Fig. 11. The modified particle resolution affects low-level and high-level input features alike and allows us to study its impact on the different multivariate analysis set-ups. We re-perform our tight NN and BDT analyses using training and test data with the increased smearing and show the corresponding significance improvement of these analyses in the bottom panel of Fig. 11. Qualitatively, the comparison of the high-level and low-level data sets is unchanged. There are, however, some quantitative changes in the maximum significance reached. For the NN, the ratio of the maximum significances σ(NN low )/σ(NN high ) changes from 1.03 in the standard set-up to 1.09 when increasing the smearing by a factor of ten. For the BDT, the ratio σ(BDT high )/σ(BDT low ) changes from 1.38 in the standard set-up to 1.12. In both cases the classifiers using high-level input features are more strongly affected than those using low-level features. We emphasise, however, that although broadening the detector resolution by a factor of ten brings the invariant-mass resolution in line with that seen in experimental simulations, it is not a realistic scenario, and these results should therefore be taken with a grain of salt.

Our Bayesian NN is implemented with TensorFlow [48], TensorFlow-Probability [49] and Keras [50], with a total of 5 layers. The number of nodes of the input layer equals the number of input features. There are 3 hidden DenseFlipout layers
[51], each of them containing 256 nodes, using the Kullback-Leibler (KL) divergence as the kernel divergence function. The KL divergence is defined as KL[q(ω), p(ω|C)] = ∫ dω q(ω) log( q(ω) / p(ω|C) ), (B.1) where p(ω|C) is the posterior probability distribution given the classifier C and q(ω) is the approximation created through the classifier [52]. We use a sigmoid activation function for all hidden layers. The first two hidden layers are each followed by a batch-normalisation layer, which normalises the activations to zero mean and unit standard deviation. This helps to avoid the vanishing-gradient problem with sigmoid functions. The output layer has a single node with a sigmoid activation function; the posterior functions for the kernel and bias are both assumed to be mean-field normal distributions. The kernel divergence function for the output layer is also the KL divergence. We use binary cross-entropy as our loss function and apply the Adam [53] optimizer.

The BDT is implemented with XGBoost [54]. We allow for a maximum depth of 10, as a higher depth did not improve performance. The learning rate is fixed at 0.4. The number of estimators is set to 300, with early stopping in place. The gamma factor is fixed at 1. The subsample ratio of the training instances is 0.9 and the subsample ratio of columns when constructing each tree is set to 0.7 to reduce the risk of overfitting. The BDT set-up is summarized in Table 2.
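Eq. (B.1) can be cross-checked numerically; the following sketch (our own cross-check, not part of the analysis code) compares a Monte Carlo estimate of KL[q, p] for one-dimensional Gaussians with the closed-form result:

```python
import numpy as np

def kl_mc(rng, mu_q, s_q, mu_p, s_p, n=200_000):
    """Monte Carlo estimate of KL[q, p] = E_q[log q(w) - log p(w)]
    for one-dimensional Gaussians q and p (normalisation constants
    common to both log-densities cancel in the difference)."""
    w = rng.normal(mu_q, s_q, n)
    log_q = -0.5 * ((w - mu_q) / s_q) ** 2 - np.log(s_q)
    log_p = -0.5 * ((w - mu_p) / s_p) ** 2 - np.log(s_p)
    return np.mean(log_q - log_p)

def kl_exact(mu_q, s_q, mu_p, s_p):
    """Closed-form KL divergence between two Gaussians."""
    return np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5
```

In the BNN itself this divergence is evaluated over the variational weight distributions by TensorFlow-Probability, not by such an explicit sampling.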
In training the algorithms, the hyperparameters displayed in Tables 1 and 2 were predetermined with minimal optimization through HyperOpt [55].
The figure shows that the signal acceptances of NN tight are fairly independent of the training and testing data up until about M X ∼ 1.5 GeV, even though the finely binned signal modelling of the two MCs is vastly different. For M X > 1.5 GeV, on the other hand, the acceptances depend crucially on which MC is used in the training. The reason is that the SHERPA signal drops quickly to zero beyond this point, and is already negligible at the D-meson resonance at M X = 1.9 GeV. Consequently, as seen in the top-right plot, a SHERPA-trained NN tight tends to reject the higher-M X region of the EVTGEN signal, as it has not seen signal events in that region during training.
This artificial separation of signal and background in SHERPA is an unphysical effect that can be remedied by a matching with OPE-based results, which give a model-independent description of fully inclusive rates in the higher-M X region. We note further that the signal acceptance of NN loose is fairly flat as a function of M X , whether trained on EVTGEN or SHERPA data, and in particular even the SHERPA-trained version accepts EVTGEN signal events across the entire region. In this case, however, the unphysical behaviour of the signal modelling would inevitably show up as a poor fit quality in the second stage of the analysis. For these reasons we have not considered SHERPA-trained NNs in the body of the text.
Still, for completeness, we show in Figures 14 and 15 the SHERPA-trained versions of Figures 6 and 8. The most prominent feature is the expected reduction in the acceptance of the EVTGEN signal by NN tight in the regions of high M X and low q 2 and p * in Figure 14 compared to the EVTGEN-trained version in Figure 6, as well as a higher acceptance of the SHERPA signal overall, regardless of the NN.

Figure 1 .
Figure 1. EVTGEN hadronic mass distribution M X , energy-momentum difference P + and lepton momentum in the B-meson rest frame p * before (top) and after (bottom) detector simulation. The gray lines highlight the boundaries of the theoretically background-free regions.

Figure 2 .
Figure 2. High-level features of B → X c ν events generated with EVTGEN and SHERPA.

Figure 3 .
Figure 3. Upper panel: Comparison of EVTGEN and SHERPA high-level features for B → X u ν signal events.Lower panel: Cumulative sum of the differential distributions M X , P + and p * in EVTGEN and SHERPA, compared to BLNP prediction.

Figure 4 .
Figure 4. High-level features of the EVTGEN sample.Number of leptons N (left), number of kaons N kaons (middle) and missing mass squared M 2 miss (right).Notice the logarithmic scale for some of the distributions.

Figure 5 .
Figure 5. ROC (top) and SIC curves (bottom) for the BDT (left) and BNN (right) for different levels of input features, trained and tested on EVTGEN data with a physical ratio of signal-to-background events in the test set. The dashed lines in the upper panels are ROC curves for the case of no separation. As a reference, the gray lines in the bottom panels show the significance improvement of the three cut-and-count scenarios in Eq. (3.7). A: M X < m D , B: M X < 1.5 GeV, C: P + < m 2 D /m B .

Figure 6
Figure 6. Distributions and signal acceptance of SHERPA and EVTGEN Monte Carlo data as functions of M X , q 2 , and p * for NN tight (left) and NN loose (right), trained on EVTGEN data. The distributions in the upper panels of each plot are normalized to the total number of signal events. For NN loose the dashed lines in the lower panels show the background acceptance, using the y-axis scale displayed on the right. In the lower panels, error bars highlight the MC uncertainty on the acceptance. The error bars for the background uncertainty, which becomes visible at high p * , use a lighter shade. For bins with very low event numbers, we have used the tabulated uncertainties from Ref. [41].

Figure 7 .
Figure 7. Signal acceptance as a function of M X , p * and q 2 for NN tight (solid lines) compared to NN binned (dashed lines) defined in Eq. (3.2).

Figure 8 .
Figure 8. Q tot and N kaons distributions and signal acceptances for NN tight (left) and NN loose (right) trained on EVTGEN data. For NN loose the dashed lines in the lower panels show the background acceptance, using the y-axis scale displayed on the right. In the lower panels, error bars highlight the MC uncertainty on the acceptance, which for most bins (all bins for the background acceptance) is too small to be visible in the plots.

Figure 9 .
Figure 9. Sensitivity of the number of TP events to the ss-popping probability γ s .The number of TP events at the PYTHIA8 default is chosen as a reference value for each of the considered ML and cut-and-count approaches, TP ref = TP(γ s = 0.217).The tight cuts are defined by the cuts listed in Eq. (3.6) plus M X < 1.5 GeV.

Figure 10
Figure 10. Detector simulation validation plots for signal (left) and background (right) contributions. We compare the distributions of our MC events after detector simulation (detector sim) with the MC events produced by the Belle collaboration displayed in Figure 14 of Ref. [7]. See the paragraph below Eq. (3.2) for the feature definitions.

Figure 11 .
Figure 11. Top: M X distribution before (left) and after (right) the background suppression BDT (plot analogous to Fig. 6 in Ref. [7]). Bottom: significance improvement for a NN (left) and BDT (right) trained on high-level or low-level input features. We compare our standard set-up to training and testing on a sample with the detector resolution for photons and charged particles broadened by a factor of 10.

Figure 13 .
Figure 13. M X distributions and signal acceptance for NN tight (top) and NN loose (bottom) trained on EVTGEN (left) and SHERPA (right) data. For NN loose the dashed lines in the lower panels show the background acceptance, using the y-axis scale on the right. A broader binning has been chosen to show the acceptance at M X > 2 GeV, where event statistics are low.

Figure 14 .
Figure 14.As in Figure 6, but using SHERPA instead of EVTGEN data for training the BNNs.

Figure 15 .
Figure 15.As in Figure 8, but using SHERPA instead of EVTGEN data for training the BNNs.

Table 1. Neural network architecture.

B Machine Learning analysis set-up

B.1 Training and test sets

To train our classifiers, we create balanced data sets with 10M B → X u ν signal events and 10M B → X c ν background events. The data preparation includes the application of the in-house detector simulation and a standard scaling of the data based on the training set. Categorical features are one-hot encoded and are not scaled. The training set is shuffled and 20 % of it is used for cross-validation. For testing, we create two test sets with a physical signal-to-background ratio (1/45). Each test set contains 40K signal and 1.8M background events after detector simulation, which roughly corresponds to the number of semileptonic B-decays in a sample of 22.6M B B̄ events.
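The preprocessing just described (standard scaling fit on the training set only, categorical features one-hot encoded and left unscaled) can be sketched as follows; the helper names are ours, not those of the analysis code:

```python
import numpy as np

def fit_standard_scaler(x_train):
    """Compute per-feature mean and standard deviation on the
    training set only, to avoid leaking test-set information."""
    mu = x_train.mean(axis=0)
    sigma = x_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return mu, sigma

def transform(x, mu, sigma):
    """Apply the training-set scaling to any data set."""
    return (x - mu) / sigma

def one_hot(labels, n_classes):
    """One-hot encode an integer categorical feature (left unscaled)."""
    return np.eye(n_classes)[labels]
```

The same (mu, sigma) fitted on the training set would then be applied unchanged to the validation and test sets.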

Table 2. Boosted decision tree architecture.

The KL divergence is automatically added to the loss during training. Early stopping and model checkpoints are in place to monitor the validation loss of each epoch. The model weights from the best-performing epoch are saved and loaded back in before inference. We summarise the BNN architecture in Table 1.