The DL Advocate: playing the devil’s advocate with hidden systematic uncertainties

We propose a new method based on machine learning to play the devil’s advocate and investigate the impact of unknown systematic effects in a quantitative way. This method proceeds by reversing the measurement process and using the physics results to interpret systematic effects under the Standard Model hypothesis. We explore this idea with two alternative approaches: the first one relies on a combination of gradient descent and optimisation techniques, its application and potentiality is illustrated with an example that studies the branching fraction measurement of a heavy-flavour decay. The second method employs reinforcement learning and it is applied to the determination of the P5′\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$P_{5}^{'}$$\end{document} angular observable in B0→K∗0μ+μ-\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$B^0 \rightarrow K^{*0} {\mu ^+\mu ^-}$$\end{document} decays. We find that for the former, the size of a hypothetical hidden systematic uncertainty strongly depends on the kinematic overlap between the signal and normalisation channel, while the latter is very robust against possible mismodellings of the efficiency.


Introduction
The potential for hidden systematic effects, so-called "unknown-unknowns", in a physics measurement is difficult to address.This problem can be alleviated by an independent confirmation in a different experiment.A good example, in particle physics, is the discovery of the Higgs boson, whereby the simultaneous announcement from both the ATLAS and CMS experiments of an excess led to high confidence in the discovery [1,2].However, the size of the collaborations and the complexity of the experiments involved can make such independent confirmations prohibitively expensive for future particle physics experiments.The confidence in the Higgs discovery was also aided by the fact that it was not completely unexpected: it was predicted by the Standard Model (SM) of particle physics to emerge with a distinctive pattern of couplings to the known particles.
The discovery of physics beyond the SM is expected to answer many of the open questions of the SM and it is therefore the main focus of experimental particle physics.This discovery will require even more experimental evidence to be confirmed, particularly if it manifests itself in ways that are unexpected.The question in this case is therefore: under which conditions can one claim a physics discovery in an experiment which has unique physics sensitivity and therefore no direct competitors?The answer to this question is normally qualitative, such as seeing a particular control channel pass a set of compatibility tests or observe new physics appear with multiple and complementary experimental signatures.The goal of this paper is to introduce a method that provides quantitative answers to these questions.
The philosophy this paper follows is to apply deep learning techniques to play the devil's advocate (DL Advocate), with respect to deviations from the SM.This implies assuming the true value of the parameters of interest is indeed SM-like and determine what experimental effects could cause the observed set of measurements.Concretely, one can consider the set of measurements of the experiment as a system of equations: where F represents the measurement process, η are parameters in common to all measurements, such as the detector response, and Ω i are parameters specific to a particular measurement, such as theoretical parameters.When a measurement deviates from its SM prediction, it is tempting to interpret the observed deviation as a sign of physics beyond the SM.However, such an important claim must be supported by an equally strong confidence in the understanding of the experimental apparatus.The idea developed in this paper reverses the classical reasoning and, instead of attributing the observed deviation to physics beyond the SM, it starts from the SM hypothesis by fixing the theory parameters to their predicted values, uses the simulation to model F and uses a neural network to find possible values of η that reproduce the observed measurements M i .In other words, it tries to find possible detector effects that can cause the observed deviation.For the examples illustrated in this paper, the parameters η represent a mismodelling of the detector efficiency, but can be extended to any assumption in the analysis.
The system in Eq. 1 could in principle include all measurements by the experiment.However, in practice one can only consider measurements which are correlated either through theory or experiment.
The advantage of this approach is that the resulting values of η can then be used to make quantitative predictions of mismodelling that can be falsified by additional crosschecks.There are two categories of information that η can represent.One is high-level information, such as kinematic information of particles.A mismodelling as a function of a particle momentum would fall into this class of information.Such information can be readily implemented with existing tools, but requires physics intuition and therefore has to be tuned to each specific case.The other category of information is low-level quantities, such as the material budget and hit resolution.Exploring low-level information would require tuning simulation in real time and would be a challenge to implement, but would be fully general, applicable for any measurement that relies on simulation.
In recent years increasingly more interest has been devoted to apply machine learning techniques to systematic uncertainties.The main emphasis however has been on the optimisation of statistical inference in the presence of systematic uncertainties (see e.g.[3,4,5]) or incorporating known systematic effects in simulation (see e.g.[6,7,8]).In this work we propose a method to bound the size of unknown or underestimated systematic effects.A quantitative and incontrovertible demonstration of the control of systematic uncertainties is in fact essential for any scientific discovery, whose claim must be supported with a confidence higher than one-in-a-million chance not to be a fluctuation of statistical or systematic origin.
In order to demonstrate the potential of such an approach, we apply this method to a simple toy example of a branching fraction measurement of a particle decay and restrict our attention to a potential mis-modelling of the efficiency.Such measurements are often normalised to a decay mode with a known branching fraction and ideally the same final state as the signal, which cancels systematic uncertainties due to efficiency mismodelling to a high degree.It is therefore an ideal testing ground for our approach.We also consider a more concrete example which is the observable P ′ 5 , which arises in the angular distribution of B 0 → K * 0 µ + µ − decays and is highly sensitive to physics beyond the SM [9,10].In both cases we show in a simplified scenario the use of the DL Advocate technique to investigate the impact of possible efficiency mismodelling on the considered measurements.Finally, another interesting system to test this methodology would be the W mass, recently been measured by the CDF collaboration [11] to be significantly different from previous measurements [12,13,14] and the Standard Model prediction [15].This paper is structured as follows: The general idea and the algorithm implementation is described in Sect. 2. The concrete example of the branching fraction measurement of a particle decay is briefly summarised in Sect.3.An implementation with Deep Reinforcement learning applied to the observable P ′ 5 is discussed in Sect.4, which is followed by a summary section 5 and a conclusion.

Methodology
Measurements in experimental particle physics typically involve the determination of a signal yield, either integrated over a particular decay channel or differentially in regions of phase-space.Due to imperfections of the detection process, the recorded signal yield must be corrected for the finite detector efficiency in order to retrieve the true number of signal events originally produced.This efficiency is typically estimated based on simulation or calibration samples.
Taking the example laid out in Eq. 1, we consider a set of measurements M i , where one is the measurement of interest and the others are control channels which are used to validate the analysis and to check for systematic uncertainties.For each of these measurements there are some observed candidates N i and an associated efficiency e i , so that M i = N i e i .The candidates of each channel are characterised by a set of variables (features) such as the kinematics of the produced particles.Differences in these distributions, together with a detection efficiency which can depend on the same kinematic variables, can result in different total efficiencies between the signal and control channels.
Broadly speaking, mismodelling of the efficiency and unaccounted or mismodelled backgrounds can create a bias in the measurements, which are accounted for by introducing ad-hoc systematic uncertainties.As a consequence, partial or incorrect evaluation of these effects would result in underestimated systematic uncertainties.In this paper we will focus on the role of the efficiency.
We describe possible mismodelling of the efficiency with a weighting function w(x), which depends on the kinematic variables x of the event.Values of w(x) = 1 correspond to a perfect modelling of the efficiency, while values below/above unity correspond to efficiency under/over estimated.The key idea is that, while the detector response depends entirely on the kinematics of the single event, the total signal/control channel efficiency can suffer from different biases once integrated over the individual kinematic distribution of each decay channel.We can then define the true total efficiency for a given channel i as where ε(x) is the per-event estimated efficiency in the experiment and the expectation value indicates the weighted average over the kinematic distribution of each decay channel p(x|i).
Since our goal is to study the impact of possible mismodelling of the efficiency in a given set of measurements we can safely assume ε(x) = 1 without loss of generality.This simplifies the expression of the per-channel efficiency to where we approximated the expectation value with a sum over a large number n i of simulated events {x k } i , where i labels the different decay channels.
Control channels provide important constraints on how well the efficiency is estimated.They are typically selected with topology and kinematics similar to the signal decay mode in order to maximise the phase space overlap between channels.The result of the measurements obtained on such control channels can then be compared to known reference values, e.g.existing precise measurements from other experiments or clean SM predictions.If a good agreement is found, a certain level of confidence can be ascribed to the estimation of the efficiency, at least for what concerns the kinematic regions populated by the control channels.The requirement that measurements performed on the control channels must be compatible with a certain reference can be formulated as which reduces to once we restrict the attention to the sole role of the efficiency in the measurement.Here are the values that bound the efficiency for the control measurements to pass scrutiny.
The goal of this paper is to find regions of the kinematic space ⃗ x where a mismodelling of the efficiency can have a significant impact in the signal measurement while the effect on the control channels remains within the constraints of Eq. 5. Given these constraints, the intuition is that this problem can be thought as a classification task between the signal and control channels using the kinematic variables provided in the space ⃗ x.

DL Advocate algorithm
The algorithm is formed of two main parts, as shown schematically in Fig. 1.The first is a fully connected neural network which resembles a multi-classification algorithm.The inputs are a set of features x whereas the output h j is a classification score for each decay hypothesis.The last layer of the NN is normalised with a softmax activation function which enforce h j (x) ≥ 0 and j h j (x) = 1.Details on the technical implementation of the neural network are given in App. A.
The second part of the algorithm consists of a linear combination of the NN output, also referred to as linear programming (LP), which defines the final per-event weight Combining this with Eq. 3 and moving to a vectorial notation we can express the total per-channel efficiency as or, in an even more compact form, where we have introduced the H matrix which is defined as Here, H is a quadratic M ×M matrix where M is the total number of decay channels.From the previous equations it is evident the role of the coefficients α j which relate the NN classification response to the different decay-channel efficiencies.The meaning of the α j coefficients can be easily understood in the ideal case of a perfectly discriminating network.In such a scenario H takes the form of the identity matrix and α j can be individually chosen to satisfy all possible combination of efficiencies, i.e. we can choose α (j|j∈C) = 1 for all control channels C in order to perfectly satisfy their efficiency constraints, and arbitrarily move α (j|j=s) for the signal channel s to get any possible values for the signal efficiency e s .In general, however, the H matrix will have non-diagonal terms that correlate the efficiency of the different decay channels.The measurements carried out on the control channels, therefore, provide non trivial constraints on the signal efficiency.

Optimisation procedure
The goal of the algorithm is to find the solution that maximises possible shifts in the signal efficiency while maintaining the control measurements within their allowed range, as defined in Eqs. 4 and 5.This is achieved with an iterative procedure:

LP
Figure 1: Schematic view of the DL Advocate algorithm.The algorithm is formed by two parts: the output of a neural network (NN) is passed to a linear programming (LP) solver which returns the final weight for each event.
i) the NN is pretrained as a simple classifier, i.e. minimising the cross-entropy loss between channels; ii) for a given set of NN parameters θ the matrix H is determined and the optimal values of the coefficients ⃗ α are calculated; iii) with the obtained values of ⃗ α, the NN parameters are updated to improve the current solution; with step ii) and iii) repeated for 1000 iterations.In each iteration the value of e s is recorded and the solution that manifests the largest bias in the signal efficiency is returned at the end of the procedure.More details on the individual steps ii) and iii) are given in the next subsections.In general, mismodelling of the efficiency can result in both over and underestimation of the total signal efficiency.In the following, we focus on the minimisation of the signal efficiency; oppositely, in order to get the maximum allowed positive shift it will be sufficient to target −e s instead of e s in the minimisation process below.

Determination of the coefficients α j
In order to find the values of ⃗ α that allow the largest deviation in e s while satisfying the constraints from the control channels we have to solve the following system of linear equations This linear programming (LP) problem can be solved numerically and the minimisation of e s is performed with the SciPy python package with the use of the scipy.optimize.linprogfunction [16].

Update of the neural network
The values of ⃗ α obtained in the previous step correspond to the optimal result for a fixed set of NN parameters, however, these can be modified to further improve the overall solution.The neural network weights θ are then updated based on the following loss function with driving the minimisation of e s with respect to the NN parameters θ and is a regulariser term which is added to avoid det(H) = 0 and it keeps H invertible during the optimisation process.This term is found to help the stability of the process since the determination of ⃗ α obtained in the previous step implicitly requires H matrix inversion (⃗ α = H −1 ⃗ e) which can be numerically unstable.The gradient of the loss ℓ(θ) can then be calculated using the chain rule, i.e. with where eq.16 follows from eq. 13 as shown in [17], while ∂θ is computed with standard backpropagation techniques.Finally, the NN parameters θ are updated based on the overall loss function improvement θ → θ − η ∂ℓ ∂θ , where η is the learning rate set to 0.0001, and the previous step is repeated.The overall optimisation procedure is schematised in Alg.A.1.

Smoothening of the output weighting function
The solution obtained following the algorithm description presented in the previous section represents the largest possible variation of e s that keeps the control-channel measurements within their limits.However, the resulting weighting function w(x) can display extreme trends, such as fast-changing values, which can be mitigated by adding the following regulariser to the loss function where (•) + denotes a positive cut, i.e. (y) + = max(y, 0), x k is a dataset sampled randomly from the domain ⃗ x and p is a tunable parameter indicating the turn on of this gradient penalty (GP) term and is set to 0.5 in the rest of the paper.

A concrete example: a branching fraction measurement
In this section, we illustrate the use of the DL Advocate method with the example of a branching fraction measurement.Branching fractions are typically measured as a ratios of a given signal decay channel with respect to a normalisation channel where B is the branching fraction, N is the observed yield and e is the detector efficiency.It is evident from Eq. 18 that a problem in the efficiency estimation would unavoidably lead to an incorrect determination of the branching ratio.
For this example, we consider the branching fraction measurement of a hypothetical decay of the type P → V C, where V → AB is a hypothetical intermediate resonance decaying into two given particles A and B, and C is a companion final-state particle.Branching ratios are typically normalised relative to decays with similar topology, which we denote here with P → XC channel, with X → AB, where the branching ratios P → XC and X → AB are assumed to be well known.For instance the Particle Data Group (PDG) lists several branching fractions for various systems, some of which have uncertainties as low as 3% [18].In addition to the normalisation itself, crosschecks involving other control channels with well known branching fractions are often performed during experimental analyses.In the following, we consider the existence of a second control channel denoted with P → Y C with Y → AB.
The advantage of the existence of a second control channel stands in the possibility to build ratio of branching fractions, i.e. the same procedure developed for the signal is employed on the control channel by replacing in Eq. 18 the signal efficiency and observed yield with the corresponding control channel counterparts.Ratios of branching fractions, in fact, are typically characterised by smaller uncertainties compared to single branching fractions, since common sources of uncertainties are cancelled out in the ratio.For instance, the PDG reports ratios of branching fractions which are known with a precision at percent level [19].
In conclusion, in our toy example we assume the absolute branching ratio of the normalisation channel to be known with a precision of 3%, the ratio of branching ratios between the two control channels to be known with a precision of 1% and we train the DL Advocate to place an upper bound on the systematic associated to the measurement of signal branching ratio while obeying these constraints.

Simulated dataset
In order to explore this method in a concrete setting, we generate simulated events based on a typical measurement performed by the LHCb experiment [20], as many heavy-hadron decays are measured by LHCb and it is therefore a natural testbed for our approach.In order to generate the representitive simulation we employ a fast simulation based on the the RapidSim package [21], which simulates decays of heavy hadrons with parameterised momentum smearing and representitive production kinematics for high energy collisions.Basic acceptance criteria are applied such as requiring each final state particle to be within 2 < η < 5, which is a crude estimate of the LHCb acceptance, and to have a minimum transverse momentum of 300 MeV/c 2 .Finally, final state radiation effects are simulated with PHOTOS [22], which play a minor role in the analysis.We use the decay of a B + meson into a kaon and a pair of oppositly charged muons.This decay is chosen as a proxy for a generic three-body decay as several control channels exist such as decays involving the J/ψ and ψ(2S) mesons [23] which make it a plausible test case for the method.As there is no detailed particle identification response in our analysis, we use this decay as a representation of the general three-body P → ABC particle decay introduced above, where the two muons, represented here as A and B, originate from an intermediate resonance of variable mass, m V , and the kaon represents particle C. One hundred thousand events are generated for each considered value of m V .
While the generated samples do not take into account reconstruction effects specific to the LHCb detector, they provide an excellent template to demonstrate the proposed method on a simplified example.

Analysis setup
In order to determine how a potential mismodelling of the efficiency could affect the branching fraction measurement of P → V C decays while keeping unchanged the value of the crosscheck provided by the control channel P → Y C, we need to consider all possible differences among the considered decay channels.Due to the different resonance masses m R , with R ∈ {V, X, Y }, signal and control modes are characterised by different kinematic distributions.In the following we use the normalised resonance mass mR = m R /m P as a more generalisable kinematic variable.This resonance mass mR can be expressed as m2 the mass of the resonance is unequivocally determined by the momentum of the two final state particles AB and their opening angle.As an example, we take X and Y particles to have similar normalised masses, i.e. mX ≃ 0.6 and mY ≃ 0.7, which results in similar kinematic distributions, while mV is allowed to span all the kinematically allowed range, with the result that the further mV is from mX,Y the bigger the difference in their kinematic distributions will be (see Fig. 2).
For the example illustrated in this paper, we will therefore focus on the study of the momentum, transverse momentum and opening angle, or combination of those.In high-energy-physics experiments, the efficiency to reconstruct and select a given particle is typically dependent on its (transverse) momentum, which makes particularly important -and potentially prone to hidden systematics -to have good control of the detector efficiency as function of those variables.
In the following, we will therefore train the DL Advocate to evaluate what is the maximum impact that a hypothetical mismodelling of the efficiency can have on the determination of the P → V C branching ratio under different values of mV .It is also important to stress that the goal of the algorithm shall not be to reconstruct mV , which is what occurs if the two momenta and the opening angle of the AB particles are given to the network, but rather to explore possible undetected patterns in the efficiency response of the detector which may lie hidden in a subset of such kinematic variables.
Three different sets of input variables are therefore considered for this study: , the two transverse momenta of particles A and B; • x = {max p, α AB }, the maximum momentum of the two particles A and B and their opening angle; • x = {max p T , α AB }, the maximum transverse momentum of the two particles A and B and their opening angle; The DL Advocate algorithm is then trained following the methodology described in Sec. 2 with the constraints imposed by the normalisation and control channels taken into account as discussed in Sec.2.2.1.In particular, the constraints takes the form where e i should be interpreted as relative variations of the efficiency with respect to the hypothesis of perfectly modelled efficiency (as defined in Eq. 3) and the former equation assumes perfect knowledge of the production of the parent particle P .Finally, in order to evaluate the dependence of the maximum allowed bias on mV , the training is repeated for different values of the normalised resonant mass ranging from 0.1 to 0.9.

Training and results
Figure 3 shows the evolution of e s during the training process for the different values of mV and for the selected input features x = {p A T , p B T }.A good learning curve is observed for all cases, with an almost-optimal solution obtained after less than 500 iterations.A very similar pattern is also seen for the other pairs of features used.
Figure 4 shows the resulting weighting functions obtained from the training of the network for the three considered sets of variables and for different values of mV .For cases where mV < mX , the obtained solution shows a clear split of the 2D plane, with a large fraction of signal events that receive a weight close to zero.This is due to the larger separation in the kinematic space between signal and control channels, which allows to find a solution that strongly affects the former while keeping unchanged the other two.On the other hand, for signal masses between or above the X and Y particles, the strong overlap in the kinematic  distributions leaves very little to no room for efficiency mismodelling to be able to modify the signal branching ratio while satisfying the constraints from the control channels.
We can now quantitatively define the observed bias in each decay mode as the deviation of its integrated efficiency with respect to the hypothesis of perfect efficiency modelling, i.e.
(1 − e i ).The maximum allowed shift on the determination of the signal efficiency, and hence the signal branching ratio, obtained in the different tested configurations is illustrated in Fig. 5.We can draw the following conclusions: • as expected, the lower the normalised signal mass mV is, the bigger the bias is allowed to be; this is due to the large separation in the feature space between signal and control channels, which allows a mismodelling of the efficiency that only affects the signal decay channel; • vice versa, it is nearly impossible for signal with mass similar to the ones of the normalisation/control channels to suffer from uncontrolled systematic effects; • in all cases, the set of variables that allows the largest bias is given by the combination {max p, α AB }, which is natural being the one with the largest correlation with the resonant mass; • as expected, the gradient penalty term applied to the loss function as described in Sec.2.3 reduces the overall allowed systematic effect.
We note that, in most of the cases the obtained solution presents region of the parameter space where the mismodelling of the efficiency is maximal (i.e. an efficiency very close to zero and/or infinity).In addition, despite the gradient penalty term added to the loss function as in Eq. 17, rather steep transitions between well-modelled and badly-modelled regions are still visible in Fig. 4. Nevertheless, the effectiveness of this term is clearly visible from the comparison of the solution obtained with and without it as shown in Fig. 5 (right) and a further tuning of this parameter goes beyond the scope of this paper.no GP GP, p = 0.5 Figure 5: Left: maximum allowed bias obtained on P → V C branching ratio by running the DL Advocate algorithm with the three studied sets of features at different masses.All cases have been trained with a gradient penalty term set to 0.5.Right: maximum allowed bias obtained using {max p, α AB } as input features with and without the gradient penalty term introduced in Eq. 17.

Implications on differential control-channel measurements
In the toy study presented above, we limited the role of the control channels to the sole contraint provided by their integrated branching fractions.In real analyses, however, the use of simulated datasets undergoes a rigorous validation process, which once again makes use of well-known and largely produced calibration channels.One of the additional crosschecks which can be naturally performed on the control channels consists in looking for possible biases as function of the relevant kinematic variables.A systematic uncertainty can in fact escape the bound imposed by the total branching fraction measurement while appearing in its differential distribution.This is exactly the case studied in the previous section.Figure 6 shows the bias expected for the different obtained solutions as function of the minimum transverse momentum and opening angle of the considered control channel.The strong visible trend demonstrates that, assuming a sufficient statistical precision, the systematic uncertainty postulated by the DL Advocate algorithm in the previous section would have been revealed by analysing the consistency of the control-channel differential branching fraction measurement.For example, in Ref. [23], which measures the branching fraction of B → K ( * ) µ + µ − decays, the biases shown in Figs. 5 and 6 would have been detected by the experimental crosschecks performed in the analysis thanks to the large datasets collected for the B → J/ψK ( * ) normalisation channels, which are of the order of hundreds of thousands of events.
One of the limitation of the standard analysis procedure, which consists in manually validating the simulation by looking at different relevant variables, is that it is limited to the statistical power and kinematic distribution of the control channels.Therefore, a systematic effect may still escape this validation procedure if it is limited to a region poorly populated by the control samples or it appears with complex correlations among different or unexpected sets of variables.On the opposite, the DL Advocate algorithm presented in the previous section can be extended to include any arbitrary set of constraints, such as the differential control channel measurements discussed above, and provide a solution that would pass all the standard crosschecks of this kind of analyses.A complete implementation of all the available constraints goes beyond the scope of this paper, however, we can expect them to play a significant role in reducing the room for possible undetected hidden systematic uncertainties.
Finally, beyond all internal crosschecks that can be developed within a given analysis, one of the strength of the proposed method is that one can systematically explore all the repercussions that a solution found by the DL Advocate may have on other observables and/or decay channels.For example, one can check the impact that a certain solution obtained investigating systematic uncertainties on the signal branching fraction has on the mismodelling of the angular distributions of the same decay or, vice versa, how measuring the angular distribution of the decay can further constrain the impact of hidden systematic uncertainties on the branching fraction determination.

A Reinforcement Learning approach
As already mentioned, the method proposed in this paper can only reach its full potential when applied to low-level detector information.The use of detector-related quantities, such as hits positions, energy clusters, magnetic fields, etc., introduces several additional complications to the problem, since it requires an accurate description of the detector and, most importantly, an iterative refinement of the simulation, i.e. the interaction of the particles with the detector has to be re-evaluated for every considered modification.On the other hand, the inclusion of low-level quantities would provide a general tool that is applicable to any measurement that relies on simulation, enabling a systematic evaluation of all possible mismodellings of the detector response.While a complete solution to this problem is out of the scope of this paper, we illustrate with a simplified example the use of Reinforcement Learning (RL) to potentially tackle this difficult task.
Reinforcement learning [24,25] recently achieved impressive results in many domains of applied research, such as robotics [26], self-driving cars [27], gaming [28].Coming to high energy physics, it has been mainly suggested for jets reconstruction [29,30] and on-line control system for accelerator machines [31].In this section we discuss the possibility to train a RL agent to play the role of the devil's advocate (RL Advocate).The goal of the algorithm is unchanged, i.e. trying to find possible mismodelling of the efficiencies that can affect a certain set of observed measurements; however, we need to introduce some new concepts in order to formalise the problem within a RL approach.
Reinforcement learning algorithms are designed to train an agent via continuous interaction with an external environment.At each time step t, the agent makes an observation of the environment's state, S t , and, based on such observation, it undertakes a certain action A t .In turn, as a consequence of this action, the agent will find itself in a new state S t+1 , while receiving from the environment a numerical reward R t+1 .Figure 7 illustrates the described process.Finally, the decision-making of the agent, also called policy, π, is trained to maximise the expected total reward over the long run.

Environment's setup
Identically to the classifier-based approach discussed in Sec. 2, also in this case we intend to describe possible mismodelling of the efficiency by introducing a per-event weighting function w(x) which can depend on a certain set of input features x.However, instead of training a neural network to determine the mismodelling map, we define a parametric expression of the weighting function itself, w(η, x), where η is the set of parameters that describes the detector response which needs to be determined.This approach has some advantages, e.g.we avoid the risk of having sharp (unphysical) changes of efficiencies, as well as disadvantages, e.g. the requirement of a parametric expression in the formulation of the problem, first, looses generality compared to the NN output (given the enormous number of parameters that characterises a NN allowing it to be able to approximate any function in its domain), and second, it definitely requires some physics intuition.
The goal of the RL agent can therefore be formulated as finding the values of the parameters η that best satisfies the agreement with the experimental measurements.We can then define: • the state, s ≡ η (renamed for convenience), the list of the parameters used to describe the weighting function; • the possible actions, which consist of increasing or decreasing each of the η parameter by a discrete quantity; • the reward system, with N meas the number of considered measurements and 2 where, recalling the notation introduced in Eq. 1, M i = F(η) are the values of the different measurements evaluated at each time step based on the evolving values of the η parameters, while µ i and σ i are the corresponding central values and uncertainties measured by the experiment.
The scheme of Eq.21 is designed to return a negative reward (proportional to χ 2 ) when the agent is far from finding a good solution that accommodates all the considered measurements (large χ 2 values).On the other hand, the agent is encouraged with an increasingly positive reward once it reaches the condition χ 2 /N meas < 3. Finally, in the event the agent finds a good compatibility between the obtained solution and the experimental measurements, which is defined in Eq.21 as χ 2 /N meas ≤ 0.1, a large positive reward is assigned and the episode is terminated.Ultimately, since χ 2 can take on very large values, a scaling factor of 0.01 is added to improve the convergence of the algorithm.Each episode is run for a maximum of 200 time steps, while the size of each parameter's update is adjusted to decrease from 0.1 to 0.01, according to the returned χ 2 /N meas value, i.e. the closer the algorithm is to the best solution the smaller the parameter's update will be.
We implemented the described environment within the RLlib python library [32], which offers a great variety of Reinforcement Learning algorithms ready for use.In the following studies, we will employ deep Q-networks (DQN) [33,34,35], which are designed to learn the action-value function, or Q-value, defined as where γ is a discount factor that weights future rewards compared to present ones, whose default value is set to 0.99.In general, the Q-value quantifies how good is to take action a while being in state s; learning the Q-value function is the target of the DQN training procedure.Once a good approximation of the Q-values is achieved, it will be sufficient to always select the action with the highest Q-value, namely following a greedy policy, in order to find the optimal solution.However, better training performances are typically achieved by following the so-called ε-greedy policy, where a random action is selected a fraction ε of the times.This behaviour allows more exploration for new -and potentially better -actions and it is found to help the convergency of the training process.In the following, the value of ε is chosen to decrease exponentially from 1.0 to 0.1 with a decay length of 10 4 steps.

RL Advocate applied to
This section illustrates the use of Reinforcement Learning to play the devil's advocate with a second concrete example inspired by a set of measurements that shows some tension with respect to the Standard Model prediction.In principle, the same study could be conducted with the linear programming method presented in Sec. 2 (after an appropriate modification of the loss definition) as well as RL be applied to the branching fraction example of Sec. 3. In both cases similar results and conclusions have to be expected given the simple nature of the examples under study and agreement between the simple optimisation and RL results shown later.In the following, we present the efficacy of the RL Advocate applied on a physics example of high interest for the community, i.e. the P ′ 5 anomaly [36].

The
Rare decays proceeding via b → sℓℓ transitions are a sensitive probe of physics Beyond the Standard Model.In particular, potentially yet-undiscovered particles may contribute to the decay process and modify, among other properties, the angular distribution of the final state particles [37,38,36,39].Precision measurements of these angular observables are therefore a fundamental test of the Standard Model.
In recent years, different experiments [40,41,42,43] measured a deviation in one of the angular observables of B 0 → K * 0 µ + µ − decays, namely P ′ 5 , in a region of q 2 between 4 and 8 GeV 2 , where q 2 is defined as the di-muon invariant mass squared.In the following, we will use the example of the LHCb measurement of P ′ 5 in the [4.0, 6.0] GeV 2 q 2 bin [40] which is found to be higher than the Standard Model value by a factor of 30-40%, depending on the considered theoretical predictions [44,45,46,47].
In the following, we will run the RL Advocate to test whether an uncontrolled modification of the efficiency can cause a shift in the P ′ 5 observable capable of explaining the observed discrepancy.

Determination of P
′ 5 and choice of the weighing function Three decay angles are necessary to describe the angular distributions of B 0 → K * 0 µ + µ − decays, namely ⃗ Ω = {θ K , θ ℓ , ϕ}, defined as in Ref. [48].Starting from this angular definition, one can define a basis of angular functions, whose coefficients -the angular observablesdepend on the underlying physics model [49].The P ′ 5 observable is defined as a combination of the angular coefficient S 5 and the fraction of longitudinal polarisation of the K * 0 meson In the case of a pure signal sample, F L , S 5 and consequently P ′ 5 can be easily extracted using the method of moments [48], i.e. averaging the value of the relevant angular functions f j over the collected sample, which gives F L = 2 − 5 2 M 1s and S 5 = 5 2 M 5 , where the relevant angular moments are defined as with f 1s = sin 2θ K and f 5 = sin 2θ K sin θ ℓ cos ϕ.
The loss of efficiency due to imperfect experimental detection can be taken into account by introducing a series of weights and modifying the determination of the moment M j to M 1s(5) = 1 where w i represents a per-event weighting function.In general the weight w i encodes the distorsion occurred to the signal angular distributions due to the finite detector efficiency, which is studied with high accuracy by the different experiments [40,41,42,43].In the following, however, we will uncouple all known detector effects, i.e. as in the previous sections we assume perfect estimated efficiency, and we entirely identify the weight w i with the mismodelling of the efficiency under investigation.Finally, for this study we employ a fast simulation of one million events of B 0 → K * 0 µ + µ − decays, with K * 0 → K + π − , obtained with the RapidSim package [21] run under the same conditions described in Sec.3.1.
Before entering the core of the problem, it is interesting to test whether the solution obtained in the previous section can have any sort of impact on the angular observables P ′ 5 .We note that none of the solutions found in Sec.(and the angular observables in general) does not depend on the total signal efficiency, as it was for the branching fraction, but it can only be affected by variations of the efficiency in the angular space.Therefore, in order to create a shift in P ′ 5 , we need a mismodelling of the efficiency that goes beyond the simple dependency on the total kinematic of the event.One possibility would be to introduce a detector response that is different for the two muons.In particular, one can consider a muon efficiency that depends on the relative charge between the muon and the kaon.Such scenario can in principle occur if there are hidden correlations between the hadronic system and the muons in the reconstruction of the decay.In the ideal case, one may want the agent to reach this conclusion by itself and autonomously take the initiative to split the efficiency weighting function by muon charge.While this possibility may appear exciting, it introduces an enormous amount of complications in the definition of the state-action space, e.g. a variable number of available actions and/or variable dimensionality of the state, and it is left to future work.
In the following, we will therefore program the agent to have a mismodelling function which is different for the two muons and we define the total per-event weight where w + is the mismodelling function associated to the muon with the same charge of the kaon, e.g.µ +(−) for a K +(−) in the final state, while w − the one associated to the oppositely charged muon, e.g.µ −(+) for a K +(−) in the final state.
Concerning the choice of the functional form for w + (w − ), we limit the study reported in this example to a simple linear dependence on the muon transverse momentum, i.e.
where k + and k − are the two parameters to be determined by the RL Advocate and µ SS (µ OS ) indicates the same-sign (opposite-sign) muon with respect to the kaon.In addition, in order to avoid unphysical negative values of the efficiency, a lower bound of 0.1 is applied to the weights w + and w − , i.e. w ± = max(0.1,w ± ).In principle the mis-modelling function can be of any form.However, we assume this simple parameterisation as a proof-of-principle of the proposed method for ease of comparison with simple optimisation for validation purposes.

Training and results
We train the RL Advocate under the configuration described above with two different targets (I) First, we only focus on signal B 0 → K * 0 µ + µ − decays and try to see if we can find a mismodelling of the efficiency that could describe a shift in P ′ 5 of the order of what it is seen in the experiment; (II) second, we introduce the control channel B 0 → J/ψK * 0 , with J/ψ → µ + µ − , and check if a systematic bias in the determination of P ′ 5 is still possible once we impose the constraints derived from the precise measurement of the angular observables in this control channel.Similarly to the previous example, in fact, the resonant mode provides an important validation of the efficiency estimation.
In concrete terms, the difference between the two training configurations stands in the composition of the reward system defined in Eq. 21.In the first case, only one measurement is considered and the χ 2 entering in the reward is simply where J/ψ is assumed to be 0.0015, which is the typical uncertainty associated to the control channel measurement [50,51]  Table 1 shows the results obtained by running the RL Advocate for 3000 episodes.We note that, in this simplified example, finding the best solution for {k + , k − } is equivalent to find the minimum of the χ 2 .Therefore, we can validate the result obtained by the trained RL agent with the one derived by a simple χ 2 minimisation, also shown in Tab. 1 for completeness.We can draw the following conclusions • in case-I, the solution {k + = 0.04, k − = −1.09}found by the RL Advocate can result in a positive shift in the value of P ′ 5 of about 0.24, which corresponds to approximately 34 % of its Standard Model prediction.This shift is of the order of the deviation observed by the LHCb experiment.
• the solutions found by the RL Advocate and the χ 2 minimisation provide very similar results.By definition, the RL solution cannot achieve values of χ 2 lower than 0.1, since episodes are considered to be successfully terminated when the χ 2 reaches the 0.1 threshold.The threshold of 0.1 is chosen as it is significantly smaller than one standard deviation and indeed decreasing this threshold to smaller values has a neglgible effect on the results.In general, it is sufficient to have a χ 2 /d.o.f. of the order of 1 to claim a good statistical description of a given series of measurements, however, due to the extremely simple example considered here, with only one input measurement, a threshold of 0.1 is chosen in order to get a solution that is as close as possible to the true minimum.
• the solutions obtained for case-I (trained only with the signal) would result in a shift on the P ′ 5 measurement for the control mode that is almost as large as the one on the signal, up to ∆P ′ 5 J/ψ ≃ 0.15, which is more than one order of magnitude larger than the uncertainty on the control channel measurement.
• when explicitly including the constraints from the control channel (case-II) we find that it is impossible for a mismodelling of the efficiency to cause a shift in the value of P ′ 5 for B 0 → K * 0 µ + µ − decays while keeping unchanged the measurement in the control channel, with the largest allowed variation of the order of 0.01.
Table 1: Solutions obtained by running the RL Advocate on the P ′ 5 example.The first two columns refer to the case where only the B 0 → K * 0 µ + µ − measurement in q 2 ∈ [4.0, 6.0] GeV 2 is considered, with the first column (SciPy minimisation) showing the values obtained by running an analytic χ 2 minimisation with the BFGS method from the SciPy python package [16], while the second column (RL solution I) report the solution found by the RL Advocate algorithm.The third column (RL solution II) shows the best solution obtained in presence of the constraint imposed by the control channel measurement.

Discussion
The two methods presented above demonstrate the ability to account for potentially hidden systematic uncertainties in measurements that may show a discrepancy between the values observed in the experiments and those predicted by the theory.While in this paper we employ simplified assumptions, which are necessary due to the impossibility to access realistic detector simulation and all the crosschecks performed during physics analyses that are not made public by the experiments, we set the ground for a novel methodology to access potentially hidden systematic uncertainties in high-energy physics measurements.A final and more accurate implementation of the proposed method is therefore left to the individual experiments.
The first method relies on a combination of gradient descent and optimisation techniques and its application is illustrated with the example of a branching fraction measurement.This method can be extended to incorporate additional assumptions on the mismodelling of the efficiency which can be derived from physics considerations.As an example, we demonstrated how to impose a smoothening of the mismodelling function by adding an appropriate penalty term to the loss function.These coarse assumptions can be avoided if low-level detector information is used rather than the high-level features used for this paper.
In principle, the first method can be extended to account for latent theoretical or detector parameters.Those parameters can become available within the simulation with the help of automatic differentiation [52].For example, one can explore cases where the measurement uncertainty depends on the detector's alignment or some coupling coefficient.However, most simulation packages are written in a way that doesn't support such techniques.This is precisely the motivation for studying RL-based approach, which can deal with non-differentiable function optimisations.The downside of this approach is that it is computationally more expensive compared to a classifier-based method.However, the flexibility afforded might be crucial in certain cases, particularly when dealing with effects difficult to parametrise.We illustrated the use of Reinforcement Learning with a simple example applied to the measurement of the angular observable P ′ 5 in B 0 → K * 0 µ + µ − decays.Despite the fact that the limited complexity of the considered example makes the examined case solvable by more standard approaches, we find that formulating the problem in terms of Reinforcement Learning can open a new avenue of Machine Learning applications in experimental particle physics.
In addition, throughout this paper we have only considered systematic effects from mismodelling of the efficiency as this is the simplest case to incorporate into the DL Advocate methodology.However, this can be extended to include several other types of systematic effects, such as signal resolution and backgrounds mismodellings or even theoretical uncertainties.Low-level information would again be very useful in order to generalise this approach to these issues.
Finally, with the growth of sensitivity of measuring devices and an inherent increase in the sensitivity to measured events, the degree of the uncertainty introduced during the measurement process becomes more and more eminent.Thus techniques helping estimating the boundaries of the observables that would confidently distinguish alternative physics hypotheses become an imminent research tool.Another reason pushing towards the development of such methods is that it becomes innately expensive to support a twin experiment to confirm expected discoveries, like CMS and ATLAS did for the discovery of the Higgs boson.The possible development of the Future Circular Collider [53] would be rather difficult to match by any other research facility.Nevertheless, all the results and measurements should be scrutinised from the perspective of possibly unknown systematic uncertainties, which the devil's advocate approach is capable of.An interesting venue for exploring and extending this method would be the measurement of W mass, which has recently been measured by the CDF collaboration [11] to be significantly different from previous measurements [12,13,14,54] and the Standard Model prediction [15].This would require a large amount of numerical detail to be publicly released by experiments to have meaningful results such as auxillary inputs assumed and the exact variations that are applied in the measurements.

Summary
In summary, we have introduced a method to place quantitative bounds for hidden systematic effects using machine learning.The philosophy behind the approach is to play the devil's advocate by reversing the measurement process and assuming the Standard Model hypothesis, such that systematic nuisance parameters are determined by the measurements themselves.
To demonstrate the method, we have applied the proposed approach to a hypothetical branching fraction measurement to see the maximum allowed bias due to possible mismodelling of the detector efficiency.We find that the bias depends significantly on the kinematic overlap between the signal and the control channels that are used as crosschecks in the analysis.
In addition to the nominal approach, a reinforcement learning technique is also employed.We explore the potentiality of this alternative route by applying it to the measurement of the P ′ 5 angular observable in B 0 → K * 0 µ + µ − decays.We find that in the tested minimal scenario, when including all the crosschecks of the analysis there is no room for hidden systematic uncertainties associated to the modelling of the detector efficiency to explain the deviation observed in data.This result is based on the use of fast simulation, a robust and final statement can only be achieved by considering all possible detector effects which can only be accessed with a full simulation of the detector.The development of a reinforcement learning solution is an essential complement to the results of the paper, but such an approach can have the highest potential when generalised to low-level quantities.The extension of the presented method to all possible aspects directly related to the detector, in fact, will be fundamental to demonstrate the robustness of any future discovery claim.

Figure 2 :
Figure 2: Distribution of events in the kinematic plane max(p A , p B ), α AB for two choices of signal mass mV = 0.1 and 0.2, as well as for the control channels.

Figure 3 :
Figure 3: Evolution of the signal efficiency e s for the different values of mV when trained with the set of features x = {p A T , p B T }.

Figure 4 :
Figure 4: Resulting weighting maps describing the efficiency mismodelling as function of the input variables for the three sets of considered features {max p, α AB } (top), {max p T , α AB } (middle) and {p A T , p B T } (bottom) for mV = {0.1,0.2, 0.3, 0.4, 0.5} (from left to right).Values of mV ≥ 0.6 are not shown since no significant deviations from unity are visible in the entire feature plane.

Figure 6 :
Figure 6: Expected relative bias on differential branching fraction measurement for the P → XC control channel as function of the minimum transverse momentum (left) and opening angle (right).The algorithm has been trained with a signal of mV = 0.3, while the result for the three tested configurations is shown with different colours.
3.3 can cause a shift in P ′ 5 , this is because P ′ 5

5 are
the central value and uncertainty measured by the experiment, while P ′ 5 (k + , k − ) is the modified value of P ′ 5 based on the variation of the mismodelling parameters k + , k − .On the other hand, in order to include the constraint derived from the control channel, it is sufficient to add to the χ 2 of Eq. 29 the term corresponding to the control channel measurement χ 2 case−II = P ′ 5 (k + , k − ) − P

and P ′ 5 J 5 obtained
/ψ (k + , k − ) is the alteration of the control channel value of P ′ by running the RL Advocate.