Matrix Element Regression with Deep Neural Networks -- breaking the CPU barrier

The Matrix Element Method (MEM) is a powerful method to extract information from measured events at collider experiments. Compared to multivariate techniques built on large sets of experimental data, the MEM does not rely on an examples-based learning phase but directly exploits our knowledge of the physics processes. This comes at a price, both in term of complexity and computing time since the required multi-dimensional integral of a rapidly varying function needs to be evaluated for every event and physics process considered. This can be mitigated by optimizing the integration, as is done in the MoMEMta package, but the computing time remains a concern, and often makes the use of the MEM in full-scale analysis unpractical or impossible. We investigate in this paper the use of a Deep Neural Network (DNN) built by regression of the MEM integral as an ansatz for analysis, especially in the search for new physics.


Introduction
In the near future the LHC will begin its third run of data collection with the aim of doubling the luminosity collected so far during the two previous runs. While Run-1 allowed the discovery of the Higgs boson, no new physics has yet emerged from Run-2 and chances are faint to get big surprises in the existing dataset, despite the many analyses that are still ongoing. It is becoming obvious that the expected increase in luminosity will not yield a strong enough gain in sensitivity. Tiny signals hidden behind large backgrounds would not benefit much from additional statistics and would call for an improvement in the analysis techniques to benefit fully from the information contained in each event.
Neural networks and boosted decisions trees, along with other less popular machine learning techniques, are used to learn this information directly from simulations or more rarely from data. The main difficulty of the machine learning methods is to make sure that they actually learned physical information and not some artifacts in the training data. Special care must also be taken to prevent overfitting but with enough training data one can achieve a better efficiency than with traditional techniques.
On the contrary, the Matrix Element Method (MEM) uses our knowledge of the Standard Model (SM) by means of the Lagrangian to compute the compatibility of experimental events with a hypothetical process. The underlying knowledge of the physics process and detector response makes its internal dynamic more straightforward to interpret than for machine learning techniques. Since there is no training step, the MEM can also be used when the number of events in the dataset in consideration is very small, a situation in which other methods struggle.
The MEM originates from the Tevatron experiments DØ and CDF for top quark measurement in tt production [1,2,3,4,5,6,7] and has become common in particle physics analyses. Recent examples at the LHC are the searches and measurements of the ttH [8,9,10,11,12,13,14] and single top [15] production processes, as well as studies of spin correlation in tt production [16]. However this technique suffers from complexity and an expensive computation time. In order to obtain the probability for a given process, the matrix element must be convolved with the parton distribution functions, the transfer function of the detector and be integrated over the whole phase-space. This integral is high dimensional and its integrand has a nontrivial shape with many sharp peaks, the details will be presented in the next section. Even modern implementations of the method (e.g. MoMEMta [17]) that use efficient integration strategies need more CPU time to evaluate this integral than is required by many applications of the method.
The aim of this paper is to show how a Deep Neural Network (DNN) can be used to approximate the result of the matrix element integration. This makes it possible to use the MEM for search for new physics, parameter scans, etc. The probability provided by MoMEMta can be seen as an untraceable function -meaning there is no closed or computationally affordable form -of the final state 4-momenta. Any function with reasonable assumptions can be approximated by a neural network given a large enough width -which can be broken down into several layers -and although nothing guarantees the result of the MEM fulfills all these assumptions it was enough to motivate this study.
This method is potentially much faster that the straight evaluation of the matrix element by integration. A DNN needs a simulated sample and several hours of computing time for training, but evaluating it afterwards is much faster: typically at least a few order of magnitude less computing time than for the classical MEM integration, which takes a few seconds per events even for the easiest processes. This is illustrated in Figure 1for a few of the benchmark processes from Ref. [17]. Figure 1: Computation time as a function of the number of events for few processes with MoMEMta and using the proposed DNN approach. For the DNN it was assumed that the training and evaluation times were 10 h and 150 µs per event respectively. Time spent on producing the training sample is not taken into account.

The Matrix Element Method in a nutshell
The purpose of the MEM is to compute P (x|α), the probability to observe an experimental event given the theoretical hypothesis α. In this context x refers to the 4-momenta of an arbitrary number of particles observed in the final state. We will distinguish these experimentally observed particles x from the parton level particles y produced at the interaction point before the hadronization and their detection. α can refer to any set of parameters (e.g. the mass of a resonance) or to different models.
For hadron colliders, the likelihood of a hard scattering producing a partonic final state y is proportional to the differential cross section defined as where q 1 and q 2 are the initial state parton momentum fractions and s is the center-of-mass energy. dΦ(y) is the n-body phase space of the final state y, while |M(q 1 , q 2 , y)| 2 denotes the matrix element for the given process α (including the summation over spin and colors) usually computed numerically at leading order by packages such as MG5_aMC@NLO [18].
The propagation of the parton-level 4-momenta y to the experimentally observed ones x includes the parton distribution functions (PDF) f a (q) (for each parton q of flavor a), efficiency (y) to reconstruct and select the hadronic state y and transfer function T (x|y) normalized with respect to x. The latter parameterizes the parton shower, the hadronization and the detector response (whose resolution is limited and produces a smearing of the observed particles momentas).
The probability P (x|α) results from a convolution of the differential cross section with the transfer function and a sum over initial states: where σ vis α stands for the visible cross-section and is there to make sure that the probability is normalized. It is often computed a posteriori with σ vis α = σ α , where σ α = dσ α (y) is the total cross-section and is the average selection efficiency. In practice the integration and the computation of this factor are separated which is why we will omit it in the following from Equation 2 and define the MEM weights as W (x|α) = σ vis α × P (x|α). In addition the weights can span several orders of magnitude which is why most of the time we will use the event information defined as I α = − log 10 (P (x|α)) or I α = − log 10 (W (x|α)) which only differs by a constant term.
The transfer function includes various complicated processes and several assumptions are usually made to simplify its integration. The detection and measurement of each particle in the final state is independent, at first order, which allows to factorise the transfer function for the different particles. This argument can be pushed even further by factorizing the different components of the measured 4-momentum. We therefore write the transfer function as where the index i refers to the final-state particles. In most cases, the resolutions in η and φ are very narrow and are parameterized as delta functions. On the contrary, the energy resolution depends on the nature of the particles and is usually parameterized as a Gaussian inspired from simulations -narrow for leptons and broader for jets -or a custom two-dimensional histogram filled by simulated events. Note that these assumptions can break down when two objects have small anglular separation which would require specific care to not impair the convergence and accuracy of the integration.
The high-dimension integration in Equation 2 requires the use of numerical integrators such as Vegas [19]. These tools implement adaptive Monte Carlo techniques, the basic principle of which is to randomly generate points evaluated with the function that one wishes to integrate. With enough points, a relatively close approximation of the integral can be obtained along with its uncertainty. However this method becomes extremely expensive in high-dimension space: while flat regions of the phase-space only need a few points in order to get a good approximation of their integrals, regions where the function fluctuates a lot need to be well covered.
Adaptive Monte Carlo techniques are iterative processes designed to populate the phase-space heterogeneously in order to decrease the integral variance. Even though they perform better than the uniform sampling, they do not scale easily with the dimensionality and the factorization assumption on which some are based -e.g. importance sampling in Vegas -makes them especially suboptimal if the integrand presents peaks that depend on several integration variables. In our case sharp peaks can arise from the propagators in the matrix element or in narrow transfer function. The latter can already be mapped to integration variables in the classic parameterization of the phase-space where the Dirac function ensures the momentum conservation. Propagator enhancements then need to be addressed by inverting the Breit-Wigner resonances and the delta functions need to be integrated out. In addition this parameterization includes invisible particles, i.e. particles that did not leave a trace in the detector. Neutrinos, particles outside the acceptance and initial-state partons should be taken into account and the large volume that they represent will heavily impact the integration -unless the kinematic constrains are used to remove these degrees of freedom.

Fitting the MEM with DNN
Computing the weight of an event with MoMEMta can take from a few seconds to several minutes depending on the complexity of the process in question. This has to be repeated for each hypothesis α and for each event and quickly becomes prohibitive. In a real-life analysis with several hypotheses and sometimes additional model parameters, it often makes the implementation of the method challenging. In a nutshell, the approach proposed in this study uses MoMEMta to process simulated samples and produce event information I under different hypotheses. The result is then used together with MoMEMta inputs -the 4-momentas of each visible particle -to train a DNN. As shown in what follows, the resulting network can be used instead of MoMEMta on a larger set of events and for different values of the model parameters.
The inputs of MoMEMta are the 4-momenta of all the observed particles as well as the missing transverse energy (only its P T and φ angle). These are the inputs that we want to provide to the DNN. However, depending on the longitudinal difference in momentum between the initial partons that collide in the detector, the particles produced might be boosted in one or the other direction. The network will still be able to learn the function at Equation 2 but will also have to learn about the longitudinal boost itself, which hinders its ability to describe the interesting part of the matrix element.
Using as inputs the P T , η and φ angles for each of the particles improves a lot the situation, since the ∆η between two particles is in good approximation independent of the longitudinal boost. Furthermore, the detector has a cylindrical symmetry and we do not want the network to learn about an arbitrary reference in φ. Any relative quantity defined on the angles could in principle be used, for example the relative ∆φ angle with an arbitrary selected particle. This parameterization has shown to yield better results than the raw 4-momenta without loss of generality. While in the integration of the MEM the ordering of the particles is important because all permutations of indistinguishable particles -e.g. two jets originating from a quark and an antiquark -need to be considered, there is no notion of distance in the inputs of a fully connected network and it does not have to be taken into account. The outputs also need some adjustments before the actual learning can take place. Their very small values that span several orders or magnitude -from 10 −15 to 10 −30 -make them not really suited for a regression, which is why we will regress on the event information I = − log 10 (weight) instead.
In some cases, the computation does not converge before reaching the maximum number of iterations. In these cases the weights have non-physical infinitesimal values with even smaller uncertainties, referred to as invalid. These weights are logically not included in the learning process, but the DNN can be evaluated on these events as on any other. In order to probe both the behavior of the MEM and the DNN for these events, some of the invalid weights can be recomputed in MoMEMta with more sampling points and iterations and evaluated with the DNN trained on valid weights.
An inherent quality of the DNN approach is the ability to interpolate on inputs that were not seen during the training. In the classical method, parameter scans require to generate weights at different parameter values and eventualy to perform an interpolation. But the cost grows with the granularity and the dimensionality of the parameter space, which can be prohibitive. The advantage of the DNN is that no integration with MoMEMta is necessary anymore once the model is built, even for new events and new parameter values -as long as one stays clear of the extrapolation regime where DNN are know to be untrustworthy.
Note that in practice the MEM probabilities -and by extension the MEM information I -are rarely used directly in an analysis and are combined in an application-dependent procedure. A simple comparison of their values is thus not a sufficient criterion to state that the method we propose can be used without losses.
We have used the Keras [20] library interfacing Tensorflow [21] to train the DNNs. The datasets were separated into three sets : one for the training (∼ 70%), one for the hyperparameter scans used in model selection (∼ 10%) and one for the performance evaluation of the selected model (∼ 20%). All the plots in the paper contain events in the last set. All this is done to detect any overfitting of the network, i.e. the loss of generalization to unknown data because statistical fluctuations of the training data are learnt in addition to, or instead of, the general features of its underlying distribution. The hyperparameter scans were conducted with the Talos [22] package.

Proof of concept : the llbb topology
As a proof of concept we will apply the method we propose in this paper to several processes producing two opposite sign leptons and two jets initiated by b quarks (bjets) as detected particles. This topology is interesting because the main contributions to this topolgy -Drell-Yan Z → l + l − production with additional jets, and top quark pair production tt with leptonic decays of the W bosons from both top quarks -are very dissimilar in the way they are treated in the integration. The computation for the former is rather straightforward as no missing particles are produced while the latter contains undetected neutrinos whose degrees of freedom need to be accounted for. We will then consider the resonnant H → Z(→ l + l − )A(→ bb) process that arises in the context of Two Higgs Doublet Models (2HDM), and has been studied by the ATLAS [23] and CMS [24] collaborations. The multiple resonances in this process present an interesting challenge from the integration point of view and the unknown masses of the H and A bosons will illustrate the power of our method when the parameter space is multi-dimensional -which is precisely where the classical integration is impractical. A summary of the number of events for which the weights have been computed is given in Table 1. Only 500K events for the tt process have been used in the training to not unbalance the network. The H → ZA samples and weights are split in 23 mass configurations up to 1 TeV for both m A and m H . The event information I α will be evaluated for every event -tt, Drell-Yan and H → ZA -23 times with different mass parameters for the H → ZA case alone. The number of invalid weights and how many were recomputed with more iterations are also given in the table. The non-convergence indicates that it is very difficult for MoMEMta to associate an event with the process the weights is computed for, likely because the event is too incompatible with the hypothesis. The changes of variables introduced in the computation are an attempt to optimize the axes of the integration to the kinematics of the process in question. If the event is pathological or comes from another process then the integrator might not generate points in the large value regions. Therefore more iterations or more points are needed and the integration can reach the threshold on the number of steps before convergence. Sometimes the threshold might be too hard an the convergence was very close, this is probably the case for the invalid Drell-Yan weights given their relatively small number. In addition the fact that all of them converged at the recomputation step with more time and points tends to support this explanation. On the other hand, it is possible than the phase-space regions to be populated for the event are very far in the tails of the sampling distribution for the process under investigation. This is probably the case for a portion of the tt weights that did not converge even with more points, especially for H → ZA events with invalid weights that are mostly at high M H and M A (two thirds of the them are for cases where both masses are at the TeV scale). This is because the kinematics of the leptons and bjets are incompatible with a 173 GeV precursor, and thus make the task of MoMEMta more complicated. This does not happen for the Drell-Yan weights because the kinematic range of the products is more flexible. A study of these invalid weights is presented in a dedicated section.
We made sure the events were weighted proportionally to the sample size between the three categories so that they all have the same importance in the training of the DNN. The goal for the H → ZA weights is to include M H and M A in the inputs of the DNN and provide the Information I as target. This parametric DNN [25] would be very interesting for parameter scans (for example in a maximum likelihood context) that become prohibitive with the classical approach. This DNN is therefore capable to provide Information for parameter values it had not seen during the training and the interpolation that must be performed apart from the integration in the classical approach is now embedded in our method.
In the next section we will detail the computation process and regression results of each type of weights. The summary of computation time and DNN topologies are at Table 2.

Drell-Yan weights
Drell-Yan weights are the easiest to compute since the topology of the process is quite flexible in the range of allowed kinematics for the products. This makes the task of MoMEMta relatively easy when the correct changes of variables is applied. In practice the evaluation of the Drell-Yan weights takes on average a few seconds per event, some rare events can take a few tens of seconds in the tail of the distribution.
The I DY distributions are given in Figure 2 for the three types of sample. The agreement is very good between the weights from the MEM computed with MoMEMta and the ones from the DNN. The best model selected during the parameters scan is the one with six layers of 200 neurons, relu [26] and selu [27] activation functions for the hidden and output layers respectively, the optimizer for the gradient descent was Adam [28]. Dropout [29] and L2 [30] regularization technique did not improve the efficiency of the training. We emphasize that these events have never been seen by the DNN during the training or in the model evaluation in order to detect overfitting. As expected the Drell-Yan events have higher weights than tt and H → ZA ones because they have more compatibility with the Drell-Yan hypothesis.

tt weights
Due to the more complicated topology of the tt process -with fully leptonic decay, this will be implicit from now on -from the narrow top resonance, the tt weights are more intricate to compute. However taking advantage of the Breit-Wigner resonances in a change of variables, the computation time can be taken to a reasonable level (about 3.2 times slower than for the Drell-Yan weights).
The corresponding I tt distributions are shown in Figure 3. The contrast in the weight distribution between the Drell-Yan and tt samples is less obvious but a longer tail can be observed for the Drell-Yan events. The double peak of the H → ZA case comes from the different mass configurations M H and M A that constitute this sample. High (pseudo)scalar masses lead to low weights while low masses are more consistent with the tt hypothesis. Overall the agreement between the classically computed weights and the ones from the DNN is good. The best model is close to the one for the Drell-Yan weights: it contains eight layers of 500 neurons and a small L2 regularization factor, probably to counter the overfitting of such a deep network.

H → ZA → llbb weights
The case of the H → ZA hypothesis is more complicated due to its hypothetical nature that makes this process dependent on unknown parameters. In our case we only varied two parameters, the masses M H and M A , while other have been fixed. We have focused on 23 configurations, both for the event generation and for the MEM computation. While some integration tricks were beneficial for the other hypotheses, mostly by using the delta function for the momentum conservation in Equation 2 to remove some degrees of freedom -such as the bjets momentum magnitude for the Drell-Yan process or replace the integration over the invisible particles by integrating over the resonances in the tt process -the absence of invisible particles and the multiple resonances of the H → ZA process prevent these  tricks to be profitable . This has a heavy impact on the computation time, about 50000 CPU days for all the samples (on average about ten minutes per event per weight). Figure 4 shows the different contributions of the weight distributions from the event generated with different resonant masses. As expected, the event information is lower (hence the probability is higher) when the masses used for the generation match the hypothesis used for the weight calculation, although the dependence on the visible cross-section is not taken into account here and might have a small effect. This can be seen on the left part of Figure 4 where the low mass events have higher weights. On the contrary, the high mass events have the highest weights in the right part of the figure, as they match more closely the topology of the high mass weights. Interestingly enough, events with same M A but different M H compared to the values included in the weight tend to have slightly higher probabilities.
From a pure technical point of view the best model for the H → ZA case is intrinsically similar to that of Drell-yan and tt. It consists of 8 layers of 300 neurons with relu and selu activation functions for the hidden and output layers respectively, trained with a small L2 factor. It is however conceptually radically different. In addition to the particles inputs, it was also given the mass parameters for each H → ZA weight provided as target. It has been trained on each event 23 times, for different M H , M A and target weight. Not only has the DNN learned about the relation between the kinematics and the weight, but also the dependence on the process parameters. The comparison between the weights from MoMEMta and the DNN is shown in Figure 5 for a specific mass configuration (M H = 800 GeV, M A = 400 GeV).
As expected, the tt events have small H → ZA weights when the mass parameters are high. This is the reciprocal of the fact that high mass signals have low tt weights. Drell-Yan events also depict the same behavior since they rarely produce particles with high momenta, which is typically the case for H → ZA with high mass resonances. Notice also that the higher weights for Drell-Yan events occur when the difference in mass hypothesis M H − M A is large. In that case, the Z boson acquires a large boost not commonly observed in Drell-Yan events. In our specific case the parameter space is two-dimensional. To test the interpolation capabilities of the network, a new set of weights was computed with parameters M H = 600 GeV and M A = 250 GeV (never seen during the training) on a small subset of the initial samples (1K events per H → ZA sample, 5K events for the Drell-Yan and tt samples). For reference, the Delaunay technique -a piecewise-linear interpolation -was employed to obtain the weights at these parameters values from closeby computed points. This is compared to the DNN applied to these events without retraining in Figure 6. Both method perform equally well, demonstrating that the DNN is properly interpolating the parameter space from observed samples. Note that while the Delaunay technique is relatively fast, the main bottleneck is that it requires some points to start from, which means that each event still need to be computed for several mass points with MoMEMta. In addition the granularity will still scale exponentially with the parameter space dimension. On the contrary, there is no need to use MoMEMta anymore once the DNN is trained and while the two methods give the same result, the DNN can be orders of magnitude faster -especially in multi-dimensional parameter space.

Applications and studies
While the weight distributions presented so far are a good indicator to evaluate the regression, they do not bring information about the event-by-event agreement. It is also difficult to evaluate if the residual difference is physically relevant. In the following, we will look at typical applications of the MEM in analysis, with the Information used as an input of the MVA discriminant, or interpreted as a likelihood. This will allow also to better understand the status of the invalid weights and the sensitivity to systematic uncertainties.

Discriminant
A very simple discriminant between two hypotheses α and β can be defined as The discriminant can be close to one or zero depending on which hypothesis α or β prevails (respectively). For illustration, α and β can be taken to be respectively the tt and Drell-Yan processes. As an evaluation criterion for this discriminant we have used the Receiver Operating Characteristic (ROC) curve. Although it does not impact the ROC curve, the shape of the discriminant will be impacted by the factor γ in the denominator, we have arbitrarily taken here γ = 1.
The ROC curves obtained with the weights coming from both the integration in the MEM and from the DNN are shown in Figure 7. The weights produced by the DNN actually provide a slightly better discriminant than the ones from MoMEMta. The difference can be traced to outliers present in the MoMEMta calculation, while the DNN behavior is smoother by nature and has fewer of them. Figure 7: ROC curve of the discriminant. The x axis represents the probability for an event to be classified in the correct process (tt or Drell-Yan) while the y-axis represents the misidentification probability. The AUC score is the area under curve.
The discriminant in Equation 6 is limited to a classification between two hypotheses which is a bit restrictive. In addition its simple definition might not make use of the full information encapsulated in the MEM weights. A discriminant for higher dimension parameter space could be generalized but it is not guaranteed to be optimal. In this paper we decided to follow a different path by using a classifier based on the MEM weights and leave to it the determination of an optimal decision function. A natural choice here is the use of a classifying DNN with three output nodes whose inputs are the weights for the three processes under study. The DNN is trained to maximize the probability of correct identification using a binary cross-entropy loss function. There are two ways one can define the classifiers inputs based on the parametric definition of the H → ZA weights. A global classifier is used to find an excess in the whole mass plane, regardless of its location. In the spirit of the analysis in Reference [24] and anticipating the search for a specific resonance, a parametric classifier is given the knowledge of the mass plane point of interest and can therefore be used to find an excess at a given place. On the one hand the global classifier is less sensitive because the excess needs to be  large to be noticeable on the whole plane, on the other hand there is no need to correct for the look-elsewhere effect which would be the case for the parametric classifier that is evaluated at several locations.
The inputs of the global classifier are the Drell-Yan, tt and the 23 H → ZA weights. As we have no knowledge of the actual value ot the masses during the training on simulated events we need a good enough coverage of the parameter space. At least one input should provide sensitivity to a given hypothesis. The classification probabilities are given in Figure 8 and the corresponding ROC curves are compared in Figure 10a using both the weights from MoMEMta and the regressive DNNs.
The parametric classifier also takes as inputs the Drell-Yan and tt weights but only one H → ZA weight with the corresponding M A and M H parameters. For H → ZA events, the actual parameter value is used, while for Drell-Yan and tt events they can either be attributed a random parameter point -in the same proportions as in H → ZA events -or repeated for every parameter point found in the H → ZA events. The latter was used to artificially increase the statistics. The associated ROC curves are shown in Figure 10b, averaging over all the mass points. The dependence on the performance as a function of these mass points is illustrated by the AUC score in Figure 9. Naturally, the best performance is achieved away from the regions heavily populated by other processes.
As a comparison, a simple classifier ROC curve with only the Drell-Yan and tt weights as inputs is shown in Figure 10c. Even though this simple classifier is suboptimal for H → ZA, it reaches reasonable performance. Additionally the Drell-Yan and tt classification improves when provided with the H → ZA weight information. The MEM weights can provide discriminating power even for processes other that the one used in its computation.
All the classifiers are trained with weights from MoMEMta and the ROC curves shown in Figure 10 are evaluated with weights from both methods. The regression errors introduced when using the regressive DNNs are propagated through the classifiers but the loss in performance is negligible. Only in the global classifier can the MEM and DNN curves be distinguished due to the residual differences already highlighted in Figure 5 that add up for all the H → ZA weights inputs.

Invalid weights
As discussed earlier, the MoMEMta integration may fail and result in "invalid weights". It is nevertheless trivial to evaluate the DNN for the corresponding events. The result can be compared to the output of MoMEMta when the integration is made to converge by increasing the number of sampling points, as done in Figures 11 and 12, respectively for Drell-Yan and tt weights. For both cases it is obvious that the DNN does not agree with MoMEMta for these events with invalid weights. In the case of Drell-Yan, the DNN seems to deliver values similar to what is obtained for normal events, while MoMEMta returns small probabilities even after allowing more iterations. The picture is quite different for the tt case, where the network returns consistently smaller weights (even though these events were never seen during training). The question of what is happening with these events and whether we can trust MoMEMta with these very small weights remains open at this stage.
Applying the discriminant from Equation 6 to compare the invalid weights computed with MoMEMta and the DNN, much better performance is obtained with the DNN inputs than with the MoMEMta inputs ( Figure 13). This may suggest that the DNN provides a more reliable information than MoMEMta for these events, but more studies are needed to get a conclusive answer.

Effect of the systematics
Already heavy in terms of computing in its simplest form, the MEM becomes even more demanding when the effect of systematic uncertainties has to be evaluated. Indeed, the effect of uncertainties affecting the event kinematics cannot be propagated without recomputing the matrix element integral, with relatively few opportunities to optimize this calculation. While the DNN ansatz proposed here alleviates the impact of this additional integration, it is important to verify that the regression performed during the training phase is robust against these systematic effects too.
As an example, we will look at the jet energy scale (JES), which is potentially among the most dangerous effects, due to the rather poor resolution of jets compared to leptons in hadron collider experiments. We will not consider the impact of the jet energy resolution, which is mostly covered by the transfer functions. To emulate a potential JES we have applied an upward scaling of the jet energy by 10% -which is an extreme case -for each of the bjets on a subset of the events and computed the corresponding new weights both with MoMEMta and with the existing DNN parameterization (thus without retraining).
The comparison of the weights when the JES shift is applied or not is shown in Figure 14a. The regression error itself is not significantly impacted, as can be seen from the DNN bias and resolution in Table 3. This shows that the DNN is able to properly handle these modified events.
In addition, the discriminant from Section 4.4.1 has been evaluated using both MoMEMta and DNN inputs for nominal and modified events. The associated ROC curves are shown in Figure 14b. This specific discriminant appears to be robust against variations of the jet energy scale, both in its traditional implementation and when using the DNN ansatz as input.
where the product is over n measured events. This likelihood will peak around the parameter α which can be any measurable physics quantity such as a mass or a coupling and in the context of the H → ZA hypothesis will represent (M A ,M H ). It is expected that events generated from this process will produce a likelihood peaked in the two dimensional parameter space with a width roughly equal to the experimental resolution of the invariant masses m jj and m lljj used as estimators of the parameters if n = 1 and will improve when more events are taken into account.
The log-likelihood on simulated events defined as where the geometric mean of the likelihood is used to evaluate the resolution that would be obtained for one measured H → ZA event with no background. In that expression, − log(W (x i |α)) is the output of the DNN. Had it been computed with MoMEMta, each event would have had to be integrated for several values of α = (M A , M H ) with a granularity fine enough to allow the fit of a potential peak. In this particular case the computation time will grow exponentially with the parameter space dimentionality. In our two-dimensional case it is already a major pitfall.
The DNN on the other hand can evaluate any event unseen during training on a grid of parameter points with the same event inputs. The grid can be made arbitrarily fine due to the non-linear interpolation property of the DNN for a very low cost in computation time. The number of evaluations still scales exponentially with the parameter space dimensionality but evaluating the DNN on batches of event allows to take advantage of parallelization to break the exponential dependence.
The second term of Equation 8 is important and must be evaluated separately. In our specific case the definition given at Section 2 translates to The acceptance is measured on simulation and the theoretical cross-section must be multiplied by the branching ratios of the particular channel being looked at. The production cross section and branching ratio A → bb are very model dependent and are not taken into account. The branching ratio of H → ZA is mostly kinematic dependent and was kept, however a model dependent effect can be seen when M H = 2 × M A , when the H → AA process becomes relevant. The resulting likelihood is presented in Figure 15.

Direct application to a real-life analysis
As an example of a real-life study that tackles the llbb final states, we will look closer to the CMS H → ZA → llbb analysis [24]. The strategy of this study is to exploit the kinematics of the process H → Z(→ l + l − )A(→ bb) (where l − = e − or µ − ) to reconstruct the masses of both H and A bosons using the two-and four-body invariant masses, m jj and m lljj and to define a signal region using these quantities. These distributions are positively correlated and an elliptic signal region has been chosen. The sizes and tilt angles depend on the kinematics and the masses themselves. Hence the parameterization that this analysis used is based on one-dimensional Gaussian fits of the m jj and m lljj distributions to obtain the reconstructed center and the diagonalization of the covariance matrix of both distributions yields the axes and tilt angle ( Figure 17). The fraction of signal and background events in different ellipses sizes is binned to be used in a maximum likelihood procedure. The use of ellipses in this kind of searches makes it very well optimized and hard to improve without loss of generality. The MEM is not expected to surpass the standard approach for such an analysis, but should be able to approach the published performances.
Throughout our paper so far, only 23 points were used in the DNN training. As in the CMS paper, our models will be applied on a larger (> 200) set of H → ZA samples to cover finely the mass plane. Since no retraining takes place, we do not have to worry about overfitting issues.
Every signal and background event has been processed through the regressive networks of Sections 4.1, 4.2 and 4.3 in order to produce the weights needed by the two classifiers (global and parametric) that were then applied for each mass configuration. The ROC curve has then been computed for each classifier either using all preselected events, or using only events contained in the elliptic region defined in the mass plane for a given mass hypothesis and a give size parameter ρ. To do so, the script provided by the collaboration [31] has been used. Results are presented in Figure 18 for two representative mass points, and compared to the performance of the elliptic selection alone.  classifiers, and the combination of these methods. Each ellipse is denoted by a marker and the events that pass its selection are then used to compute the ROC curve with both classifiers.
As expected, the global classifier brings no improvement to the ellipse method -especially at low masses. In some cases the combined ROC curve shows a potential gain by taking a larger ellipse combined with the classifier, as can be seen on the right plot. However this improvement is mostly located in the high purity region which is not the one aimed in the CMS analysis. The ellipse method is very well optimized in the search for a resonance while the global classifier searches for an excess in the whole mass plane. As already mentioned, the latter will require a larger excess to detect something but will not be as heavily affected by the look-elsewhere effect.
While better, the parametric network alone will only outperform the standard approach at very high masses when the ellipses are ill-defined. However, the combination of both becomes interesting for lower masses. In some cases the background contamination can be reduced by almost one order of magnitude with very small loss in signal efficiency. This effect is visible for M H as low as 200 GeV and increases towards the boosted and high mass regions.
While the simple approach followed in this work only has the potential to marginally improve the CMS H → ZA analysis, the DNN ansatz still opens a wide range of possibilities previously out-of-reach of the MEM. Apart from their training time, which have lasted from a few hours to a single day on CPU, evaluating a weight or a probability is a very fast process : about 150µs with large enough batches, with small variations depending on the depth of the network. This must be compared to the computation time of MoMEMta (Table 4) and the number of times it would have been called to produce Figure 18 ( Table 5). The weight computations for the global (parametric) classifier would have taken about 1450 (3050) CPU years, which is more than prohibitive even with a large farm of CPU. Using the DNN to produce the weights requires in total less than 10 hours, where must be added the time of I/O data streams, RAM allocation, data repetition and processing -all of which even tend to become dominant compared to the pure weight production. These weights must be fed to the classifiers, looped through for each ellipse and the ROC curves must be computed. In the end, with the DNN, the production of Figure 18 on a cluster of CPUs with a few hundred nodes took less than a day. The pure weight computation has been reduced by six orders of magnitude.

Conclusion
In this paper we presented a method where the integral of the Matrix Element Method is regressed by a Deep Neural Network in order to speed up the computations involved in the MEM. From the few representative processes studied in this paper, we conclude that a DNN can be trained that closely reproduces the results of the direct numerical integration of the matrix element using dedicated tools like MoMEMta. This regression with the DNN introduces inevitable inaccuracies in the weights that nonetheless do not have a significantly impact on the performance in the studied applications.
Faster weight calculations open a wide range of possibilities : study of systematics, likelihood scans, parameters scans and in general enables the use of the MEM for a new wider class of physics analyses, including the search for new physics.