Demonstration of background rejection using deep convolutional neural networks in the NEXT experiment

Convolutional neural networks (CNNs) are widely used state-of-the-art computer vision tools that are becoming increasingly popular in high-energy physics. In this paper, we attempt to understand the potential of CNNs for event classification in the NEXT experiment, which will search for neutrinoless double-beta decay in 136Xe. To do so, we demonstrate the usage of CNNs for the identification of electron-positron pair production events, which exhibit a topology similar to that of a neutrinoless double-beta decay event. These events were produced in the NEXT-White high-pressure xenon TPC using 2.6 MeV gamma rays from a 228Th calibration source. We train a network on Monte Carlo-simulated events and show that, by applying on-the-fly data augmentation, the network can be made robust against differences between simulation and data. The use of CNNs offers significant improvement in signal efficiency and background rejection when compared to previous non-CNN-based analyses.


Introduction
Machine learning techniques have recently captured the interest of researchers in various scientific fields, including particle physics, and are now being employed in search of improved solutions to a variety of problems. In this study, we show that deep convolutional neural networks (CNNs) trained on Monte Carlo simulation can be used to classify, to a high degree of accuracy, events containing particular topologies of ionization tracks acquired from a high-pressure xenon (HPXe) time projection chamber (TPC). As CNNs trained on simulation are known to be difficult to apply directly to data due to the challenges associated with producing a Monte Carlo that perfectly matches experiment, we also present methods for extending the domain of application of a CNN trained on simulated events to include real events. We claim that our use of these methods in adapting CNNs to the experimental domain and verifying their performance is novel to the use of CNNs in the field.
Event classification is of critical importance in experiments searching for rare physics, as the successful rejection of background events can lead to significant improvements in overall sensitivity. The NEXT (Neutrino Experiment with a Xenon TPC) experiment is searching for neutrinoless double-beta decay (0νββ) in 136Xe at the Canfranc underground laboratory in Spain. In the ongoing first phase of the experiment, the 5 kg-scale TPC NEXT-White [1] has demonstrated excellent energy resolution [2] and the ability to reconstruct high-energy (O(2) MeV) ionization tracks and distinguish between the topological signatures of two-electron and one-electron tracks [3]. It has also been used to perform a detailed measurement of the background distribution and is expected to be capable of measuring the 2νββ mode in 136Xe with 3.5σ sensitivity after 1 year of data-taking [4]. The next phase of the experiment, the 100 kg-scale detector NEXT-100, will search for the 0νββ mode at Qββ, around 2.5 MeV. New techniques such as CNNs, which analyze the topology of an event near Qββ and aim to eliminate background events, are becoming more relevant and essential to reaching the best possible sensitivity.
Machine learning techniques have seen many recent applications in physics [5]. In neutrino physics in particular, CNNs have been applied to particle identification in sampling calorimeters in the NOvA experiment [6]. The MicroBooNE experiment has also employed CNNs for event classification and localization [7] and track segmentation [8] in liquid argon TPCs. IceCube has applied graph neural networks to perform neutrino event classification [9], and Daya Bay identified antineutrino events in gadolinium-doped liquid scintillator detectors using CNNs and convolutional autoencoders [10]. Experiments searching for 0νββ decay have also employed CNNs: EXO has studied the use of CNNs to extract event energy and position from raw waveform information in a liquid xenon TPC [11], and PandaX-III has performed simulation studies demonstrating the use of CNNs for background rejection in a HPXe gas TPC with a micromegas-based readout [12]. Further simulation studies in HPXe TPCs with a charge readout scheme ("Topmetal") allowing for detailed 3D track reconstruction have also shown the potential of CNNs for background rejection in 0νββ searches [13]. NEXT has also presented an initial simulation study [14] of the use of CNNs for background rejection. In this study we show that CNNs can be applied to real NEXT data, using electron-positron pair production to generate events with a two-electron "ββ-like" topology and studying how the energy distribution of such events changes when varying an acceptance cut on the classification prediction of a CNN.
The paper is organized as follows: section 2 describes the topological signature of a signal event. In section 3 the data acquisition and reconstruction are explained. A description of the CNN and training procedure, as well as the evaluation on MC and data, is given in section 4. Finally, conclusions are drawn in section 5.

Topological signature
In a fully-contained 0νββ event recorded by a HPXe TPC, two energetic electrons produce ionization tracks emanating from a common vertex. Though the fraction of the energy Qββ carried by each individual electron may differ event-by-event, the general pattern observed is similar for the majority of events, and consists of an extended track capped on both ends by two "blobs", or regions of relatively higher ionization density. These regions are present due to the increase in stopping power experienced by electrons in xenon gas as they slow to lower energies. They provide a distinct signature for 0νββ decay, as measured tracks with similar energy produced by single electrons, for example from photoelectric interactions of background gamma radiation, contain only one such "blob". The use of this signature, illustrated in Fig. 1, in performing background rejection is an essential part of the NEXT approach to maximizing sensitivity to 0νββ decay.
In order to demonstrate this approach experimentally, a reliable source of events with a similar topological signature is necessary. Electron-positron pair production by high-energy gammas, followed by the subsequent escape from the active volume of the two 511 keV gamma rays produced in positron annihilation ("double-escape"), leaves a two-blob track formed by the electron and positron emitted from a common vertex, similar to the track that would be left by a 0νββ event. In this study, we use gamma rays of energy 2614.5 keV from 208Tl (provided by a 228Th calibration source, see Fig. 2) and observe the events in the double-escape peak at 1592 keV. This peak lies on top of an exponential background of single-electron tracks from Compton scattering of the calibration gamma rays and other background radiation. Experimentally, then, we have a sample containing 0νββ-like events and background-like events. By evaluating these events with a Monte-Carlo-trained neural network and studying the resulting distribution of accepted events, we can demonstrate, using real data acquired with the NEXT-White TPC, the potential performance of such a network when employed in a 0νββ search. These results can be compared to a similar, non-CNN-based analysis published in [3].
Data acquisition and analysis

The NEXT-White TPC
The NEXT-White TPC measures both the primary scintillation and ionization produced by a charged particle traversing its active volume of high-pressure xenon gas. The main detector components are housed in a cylindrical stainless steel pressure vessel lined with copper shielding and include two planes of photosensors, one at each end, and several semi-transparent wire meshes to which voltages are applied, defining key regions of the detector (see Fig. 2). The two planes of photosensors are organized into an energy plane, containing 12 PMTs (photomultiplier tubes, Hamamatsu model R11410-10) behind the cathode, and a tracking plane containing a grid of 1792 SiPMs (silicon photomultipliers, SensL series-C, spaced at a 10 mm pitch) behind the anode. These sensors observe the scintillation produced in the active volume of the detector by ionizing radiation, including primary scintillation produced by excitations of the xenon atoms during the creation of the ionization track and secondary scintillation produced by electroluminescence (EL) of the ionization electrons. Note that in practice only the PMTs observe a consistently measurable primary scintillation signal, while EL is observed by both the PMTs and the SiPMs. EL occurs after the electrons of the ionization track are drifted through the active region by an electric field (of order 400 V/cm) created by application of high voltage to the cathode (-30 kV) and gate (-7.6 kV) meshes and arrive at the EL gap, a narrow (6 mm) region defined by the gate mesh and a grounded quartz plate on which a conductive indium tin oxide (ITO) coating has been deposited. The large voltage drop over the narrow gap between the gate and the grounded plate creates an electric field high enough to accelerate the electrons to energies sufficient to excite the xenon without producing further ionization, allowing for better energy resolution compared to charge-avalanche detectors [16]. The subsequent decay of these excitations leads to EL scintillation, yielding of order 500-1000 photons per electron traversing the EL gap. These photons, produced just in front of the tracking plane, cast a pattern of light on the SiPMs which can be used to reconstruct the (x, y) location of the ionization. The PMTs located in the energy plane on the opposite side of the detector see a more uniform distribution of light, including EL photons that have undergone a number of reflections in the detector, and record a greater total number of photons for a more precise measurement of the energy. The time difference between the observation of the primary scintillation (called S1) and secondary EL scintillation (called S2) gives the distance drifted by the ionization electrons before arriving at the EL region, corresponding to the z location at which this ionization was produced.

Event reconstruction
The data used in this study consisted of events with total energy near 1.6 MeV, including electron-positron events produced in pair production interactions from a 2.6 MeV gamma ray (see section 2) and background events, mostly due to Compton scattering of the same 2.6 MeV gamma rays. The acquired signals for each event consisted of 12 PMT waveforms sampled at 25 ns intervals and 1792 SiPM waveforms sampled at 1 µs intervals for a total duration per read-out greater than the TPC maximum drift time (approximately 500 µs). The ADC counts per unit time in each waveform were converted to photoelectrons per unit time via conversion factors established by periodic LED calibration, a standard procedure in NEXT-White operation. The calibrations were performed by driving LEDs installed inside the vessel with short pulses and measuring the integrated ADC counts corresponding to a single photoelectron (pe).
The analysis of the acquired data was similar to that of [3]. The 12 PMT waveforms were summed, weighted by their calibrated gains, to produce a single waveform in which scintillation pulses were identified and classified as S1 or S2 according to their shape and location within the waveform. Events containing a single S1 pulse and at least one S2 pulse were selected, and for these events, the S2 information was used to reconstruct the ionization track. To do this, the S2 information was integrated into time bins of width 2 µs in both the PMTs and SiPMs. Note that to eliminate dark noise, SiPM samples with less than 1 pe were not included in the integration.
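The time-binning step above can be illustrated with a minimal sketch. It assumes a single SiPM waveform already calibrated to pe and sampled at 1 µs, so that a 2 µs bin spans two samples; the function name `rebin_s2` and the example waveform are hypothetical and are not taken from the NEXT software.

```python
def rebin_s2(samples_pe, samples_per_bin=2, min_pe=1.0):
    """Integrate a 1 µs-sampled SiPM waveform (in pe) into coarser time bins.

    Samples below `min_pe` are treated as dark noise and left out of the sum.
    """
    bins = []
    for start in range(0, len(samples_pe), samples_per_bin):
        chunk = samples_pe[start:start + samples_per_bin]
        bins.append(sum(s for s in chunk if s >= min_pe))
    return bins


# Hypothetical six-sample waveform (pe per 1 µs sample).
waveform = [0.4, 2.0, 3.5, 0.9, 1.0, 0.2]
print(rebin_s2(waveform))
```

Applying the noise cut per sample before integrating, rather than per bin afterwards, follows the wording of the text ("SiPM samples with less than 1 pe were not included in the integration").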
For each time bin, one or more energy depositions ("hits") were reconstructed, and the pattern of signals observed on the SiPMs was used to determine the number of hits for a specific time bin and their corresponding (x, y) coordinates. A hit was assigned to the location of each SiPM with an observed signal greater than a given threshold, and the total energy measured by the PMTs in that time bin was redistributed among the hits according to their relative SiPM signals.
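As a rough illustration of this redistribution, the sketch below builds the hits of one time bin from a dictionary of SiPM charges. The function name, the data layout, and the threshold value used in the example are assumptions made for illustration only.

```python
def hits_for_time_bin(sipm_signals, pmt_energy, threshold_pe):
    """Build hits for one time bin.

    sipm_signals: dict mapping SiPM position (x_mm, y_mm) to charge in pe.
    pmt_energy:   total energy measured by the PMTs in this time bin.
    Returns a list of (x_mm, y_mm, energy) tuples, one per SiPM above threshold,
    with the PMT energy shared in proportion to the SiPM charges.
    """
    above = {xy: q for xy, q in sipm_signals.items() if q > threshold_pe}
    total_charge = sum(above.values())
    if total_charge == 0:
        return []  # no SiPM passed the threshold in this bin
    return [(x, y, pmt_energy * q / total_charge) for (x, y), q in above.items()]


# Hypothetical time bin: three SiPMs fired, one falls below threshold.
signals = {(0, 0): 30.0, (10, 0): 10.0, (20, 0): 2.0}
print(hits_for_time_bin(signals, pmt_energy=100.0, threshold_pe=5.0))
```

By construction the hit energies in a bin sum to the PMT energy of that bin, so the tracking plane contributes only positions and relative weights.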
The energy of each hit as measured by the PMTs was then corrected, hit-by-hit, by two multiplicative factors, one accounting for geometric variations in the light response in the EL plane and the other for electron attachment due to a finite electron lifetime in the gas. These correction factors were mapped out over the active volume by simultaneously acquiring events from decays of 83mKr, which was injected into the xenon gas and provided uniformly distributed point-like depositions of energy 41.5 keV [17]. The z-coordinate of each hit in the time bin was obtained from the time difference between S1 and S2 pulses, assuming an electron drift velocity of 0.91 mm/µs, as extracted from an analysis of the 83mKr events. A residual dependence of the event energy on the length of the event along the z-axis is observed, and a linear correction is performed to model this effect, which is not observed in simulation and remains to be fully understood. For details on this "axial length" effect, see [2].
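The z reconstruction and the two multiplicative corrections can be sketched as follows. The drift velocity is the value quoted above; the exponential form of the lifetime correction, the lifetime value, and the function names are assumptions for illustration, and the geometric factor would in practice come from the krypton map, which is not modeled here.

```python
import math

DRIFT_VELOCITY_MM_PER_US = 0.91  # value quoted in the text


def z_from_drift(t_s1_us, t_s2_us, v_drift=DRIFT_VELOCITY_MM_PER_US):
    """Drift distance (mm) from the S1-to-S2 time difference (µs)."""
    return v_drift * (t_s2_us - t_s1_us)


def corrected_energy(e_raw, geo_factor, drift_time_us, lifetime_us):
    """Apply the two multiplicative corrections to one hit.

    geo_factor: geometric correction at the hit's (x, y), taken here as given.
    The attachment loss is assumed to follow exp(-t / lifetime), so the
    correction multiplies by exp(+t / lifetime).
    """
    return e_raw * geo_factor * math.exp(drift_time_us / lifetime_us)


# Hypothetical hit: 200 µs drift, 5 ms electron lifetime, 2% geometric correction.
print(z_from_drift(100.0, 300.0))
print(corrected_energy(10.0, 1.02, 200.0, 5000.0))
```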
The detector volume surrounding the reconstructed hits was then partitioned into 3D voxels of size 10x10x5 mm³, and the energy of all hits that fell within each voxel was integrated. The X and Y dimensions of the individual voxels were chosen based on the 1 cm SiPM pitch, while the Z dimension was chosen to account for most of the longitudinal diffusion (1σ spread at maximum drift length is ∼2 mm). The final voxelized track could then be considered in the neural-network-based topological analysis (see Fig. 3).
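A minimal sketch of the voxelization follows, storing only occupied voxels in a dictionary. The voxel size is the one quoted above; the choice of origin (for example a corner of the bounding box around the hits) and the function name are assumptions.

```python
import math
from collections import defaultdict

VOXEL_SIZE_MM = (10.0, 10.0, 5.0)  # voxel dimensions quoted in the text


def voxelize(hits, origin=(0.0, 0.0, 0.0), voxel_size=VOXEL_SIZE_MM):
    """Sum hit energies into voxels.

    hits: iterable of (x_mm, y_mm, z_mm, energy).
    Returns {(ix, iy, iz): summed energy} containing only occupied voxels.
    """
    voxels = defaultdict(float)
    for x, y, z, energy in hits:
        index = tuple(
            math.floor((coord - start) / size)
            for coord, start, size in zip((x, y, z), origin, voxel_size)
        )
        voxels[index] += energy
    return dict(voxels)


# Hypothetical hits: the first two share a voxel, the third falls in its neighbour.
hits = [(1.0, 2.0, 1.0, 5.0), (9.0, 9.0, 4.9, 3.0), (11.0, 2.0, 1.0, 2.0)]
print(voxelize(hits))
```

Keeping only occupied voxels mirrors the sparse representation that suits tracks occupying a small part of the detector volume.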

Data preparation
To generate the events used in training the neural network, a full Monte Carlo (MC) of the detector, including the pressure vessel, internal copper shielding, and sensor planes, was constructed using Nexus [18], a simulation package for NEXT based on GEANT4 [19] (version geant4.10.02.p01). The 208Tl decay from the calibration source and the resulting interactions of the decay products were simulated by GEANT4, up to and including the production of the ionization track. Events in the energy range of 1.4-1.8 MeV were selected, and the subsequent electron drift, diffusion, electroluminescence, photon detection, and electronic readout processes were simulated outside of GEANT4 to produce for each event a set of sensor waveforms corresponding to those acquired in NEXT-White. The analysis of data waveforms described in section 3.2 could then be applied to these MC waveforms to produce voxelized tracks (see Fig. 3). MC events that were fully contained in the active detector volume were used in the training set. To ensure the classification was done only based on the track topology, the energy of each voxel was scaled by the total event energy (the sum of voxel intensities for a given event was normalized to 1) such that the training data did not contain event energy information. Those events containing an electron and a positron registered in the MC true information, with no additional energy deposited by the two 511 keV gamma rays produced upon annihilation of the positron (i.e. a true "double-escape"), were tagged as "signal" events and all others were tagged as "background".
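The normalization and labeling rules described above can be sketched as below. The voxel dictionary layout and the three MC-truth arguments of `is_signal` are hypothetical stand-ins for whatever the simulation output actually provides.

```python
def normalize_voxels(voxels):
    """Scale voxel energies so they sum to 1, removing total-energy information."""
    total = sum(voxels.values())
    if total <= 0:
        raise ValueError("event has no deposited energy")
    return {index: energy / total for index, energy in voxels.items()}


def is_signal(has_electron, has_positron, annihilation_gamma_deposit_kev):
    """Label a true double-escape event from (hypothetical) MC-truth flags.

    An event is signal if it contains an electron and a positron and neither
    511 keV annihilation gamma deposited energy in the active volume.
    """
    return has_electron and has_positron and annihilation_gamma_deposit_kev == 0.0


event = {(0, 0, 0): 3.0, (1, 0, 0): 1.0}
print(normalize_voxels(event))
print(is_signal(True, True, 0.0), is_signal(True, True, 511.0))
```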
In [3], an additional single-track selection cut is made, and for a fair comparison with this previous result we also apply the same cut (obtained from the standard track reconstruction, for details see [3]) on test data only, for both MC and experimental data. As a reference, inside the peak energy range, the efficiency of the single-track cut was ∼0.9 for signal events and ∼0.7 for background events. For signal events, additional tracks appear either from physical processes such as bremsstrahlung, or from artificial splitting of the track due to imperfect reconstruction. The energy distribution of MC events with labeled signal events used for testing after the fiducial and single-track selection cuts is given in Fig. 4.
When applying a network trained on events from one domain (MC) to events from a different domain (data), the performance will depend on the similarity between those two domains. Known differences between MC and data in high-level variables, such as track length, have been observed in a previous topological analysis [3], calling into question the performance of the MC-trained CNN when applied to data. In this study, the classification task is focused on the double-escape peak, which is clearly visible and for which we understand the underlying physical process (pair production). In this case, we could attempt to design our network to obtain optimal classification results on the acquired data and in the energy range of interest (a method to evaluate the network on double-escape data is explained in section 4.4), but we could not argue that the same procedure would work in a 0νββ search for which we do not have a confirmed understanding of the underlying physics, nor would it be justified to make predictions on the same events used in optimizing the network.
Therefore, we develop a general paradigm (as described in section 4.3) that could be applied at 0νββ energies and, in evaluating the performance of the network on the data domain, uses events outside the energy range within which we intend to make predictions. Namely, before applying the CNN to the peak itself, we evaluate the performance on the peak sidebands (see Fig. 4), where the sample composition is known, and we expect the CNN predictions to be similar in data and MC. The underlying assumption is that the domain shift between MC and data is not correlated with the type of event, i.e. we expect that if a network is robust to MC/data differences on sidebands, it will be robust to MC/data differences in the peak region as well. In [3] it was shown that the track length difference between data and MC is consistent across a wide energy range, giving us confidence that the differences are indeed coming from the detector simulation and reconstruction (which should have the same effect on both signal and background events), rather than incorrectly simulated physical processes, justifying the sidebands-testing approach.

Network architecture
In this study we embedded our network architecture within the Submanifold Sparse Convolutional Networks (SCN) framework [20], implemented in PyTorch. SCN is highly suitable for sparse input data, making the linear algebra far more efficient than with non-sparse techniques. Further, in SCN the convolution rules allow only nonzero voxels in initial layers to hold nonzero output, thus conserving input sparsity. SCN is appropriate for our detector, in which the large majority of voxels have zero charge. Such networks have already been used in high-energy physics analysis [21], and the main advantage of these types of network is that they occupy less memory and allow for larger input volumes and/or larger batch sizes. All of the results shown here were obtained using this framework, but we obtained similar results using the standard implementation of dense convolutions in Keras/TensorFlow.
We employed a residual [22] 3D CNN in performing the topological classification task. The network architecture is summarized in Fig. 5. The network consisted of two initial convolutional layers and a set of pre-activated ResNet block layers [23], followed by two consecutive dense layers with a dropout layer before each. The input dimensions were 40x40x110, with each input element corresponding to one voxel, therefore covering a volume of 40x40x55 cm³, essentially the entire active volume of the detector. The output was a 2-element probability vector.

Training procedure
A total of about 500k simulated fiducial events were used as a training set, of which 200k were signal events, and an additional ∼30k events were used as a validation sample with similar signal proportion. A batch size of 1024 was chosen, and binary cross entropy, weighted according to the signal/background ratio of the entire data set, was used as the loss function. To avoid overfitting, L2 weight regularization and dropout were employed, as well as on-the-fly data augmentation [24], including translations, dilation or "zooming" (scaling all 3 axes independently), flipping in x and y, and varying SiPM charge cuts as detailed in Fig. 6. We note that the augmentation procedures used here are explicitly designed to be "label preserving" in that they do not change the single- or double-blob nature of events, but do reduce the significance of differences between data and simulation.
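Two of the augmentations listed above, flips in x and y and random translations, can be sketched on a sparse voxel dictionary as follows. This is an illustrative sketch only: dilation and the varying SiPM charge cut are left out, the 50% flip probability is an assumption, and the function expects a non-empty event whose voxel indices already lie inside `shape`.

```python
import random


def augment(voxels, shape=(40, 40, 110), rng=random):
    """Return a randomly flipped and translated copy of a voxelized event.

    voxels: non-empty dict {(ix, iy, iz): value} with indices inside `shape`.
    rng:    object providing random() and randint(), e.g. random.Random(seed).
    The track is kept inside the input volume and voxel values are unchanged.
    """
    flip_x = rng.random() < 0.5
    flip_y = rng.random() < 0.5
    coords = [
        (
            shape[0] - 1 - ix if flip_x else ix,
            shape[1] - 1 - iy if flip_y else iy,
            iz,
        )
        for ix, iy, iz in voxels
    ]
    # Draw one shift per axis from the range that keeps every voxel in bounds.
    shift = []
    for axis in range(3):
        lowest = min(c[axis] for c in coords)
        highest = max(c[axis] for c in coords)
        shift.append(rng.randint(-lowest, shape[axis] - 1 - highest))
    return {
        (x + shift[0], y + shift[1], z + shift[2]): value
        for (x, y, z), value in zip(coords, voxels.values())
    }


event = {(5, 6, 7): 0.25, (6, 6, 7): 0.75}
print(augment(event, rng=random.Random(1)))
```

Both operations are rigid transformations of the track, so they preserve the number of blobs and hence the label.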
As noted in section 4.1, since CNNs are highly nonlinear models, their application outside the training domain cannot be assumed to be reliable, and before applying the network to events in the peak we compare extracted "features" of MC and data events on the sidebands. It is common to consider convolutional layers as feature extractors (each one extracting higher-level features), and consecutive dense layers as a classifier. We chose the first flattened layer as a representative feature vector and applied a two-sample test, i.e. a test to determine whether independent random samples of R^d-valued random vectors are drawn from the same underlying distribution, for which we chose energy test statistics [25,26]. The energy distance between two sets A, B is given by
$$E(A, B) = \frac{2}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \lVert x_i - y_j \rVert - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lVert x_i - x_j \rVert - \frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \lVert y_i - y_j \rVert \tag{4.1}$$
where x_i and y_j are the n and m samples drawn from A and B, respectively. In [26], it was proven that this quantity is non-negative and equal to zero only if x_i and y_j are identically distributed. The p-value, or probability of observing an equal or more extreme value than the measured value, for rejecting the null hypothesis (in this case, that the samples come from the same distribution) can be calculated via the permutation test [27]. Namely, the nominal energy distance is computed, and the x_i and y_j are then divided into many (1000 in our case) possible arrangements of two groups of size n and m. The energy distance is computed again for each of these arrangements, each of which corresponds to one permutation. The p-value is given by the fraction of permutations in which the energy distance was larger than the nominal one.
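The two-sample test described above can be sketched in a few lines. The code below uses the standard sample form of the energy distance from the statistics literature with Euclidean norms; the function names, the fixed seed, and the toy one-dimensional feature vectors are illustrative assumptions, and the brute-force pairwise sums scale quadratically with sample size.

```python
import math
import random


def mean_pairwise(a, b):
    """Mean Euclidean distance over all pairs (x, y) with x in a and y in b."""
    return sum(math.dist(x, y) for x in a for y in b) / (len(a) * len(b))


def energy_distance(a, b):
    """Sample energy distance between two collections of feature vectors."""
    return 2.0 * mean_pairwise(a, b) - mean_pairwise(a, a) - mean_pairwise(b, b)


def permutation_p_value(a, b, n_perm=1000, seed=0):
    """Fraction of random regroupings whose energy distance exceeds the nominal one."""
    rng = random.Random(seed)
    nominal = energy_distance(a, b)
    pooled = list(a) + list(b)
    n = len(a)
    larger = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if energy_distance(pooled[:n], pooled[n:]) > nominal:
            larger += 1
    return larger / n_perm


# Toy feature vectors: two clearly separated samples.
mc = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
data = [(100.0,), (101.0,), (102.0,), (103.0,), (104.0,)]
print(energy_distance(mc, data), permutation_p_value(mc, data))
```

A small p-value means the nominal distance is rarely exceeded by chance, which argues against the two feature samples sharing one distribution; a large p-value gives no evidence of a difference.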
The training and validation losses, which are measures of disagreement between the CNN predictions and true labels, are given in Fig. 7 for the networks trained with and without data augmentation. The overfitting apparent in the case of training without augmentation is prompt and is manifested in the divergence of the training and validation losses, meaning that the network is beginning to memorize the training dataset and is not generalizing well. In Fig. 8 we show that the data augmentation also reduces the data/MC features distribution distance (eq. 4.1), giving us more confidence that the performance on data will be similar to the performance on MC. As the distances are always calculated on MC and data events directly (without applying any data augmentation transformations), this technique does not directly correct MC but rather makes the model more robust to the data/MC differences. The final model is chosen by varying regularization parameters and selecting the training iteration step that gives minimal classification loss on the MC validation sample, ensuring that the corresponding p-value of energy test statistics is not smaller than 5%.

Evaluation on data
In an ideal test of the trained network, we would have a data sample of only e+e− events at the energy of interest acquired from our detector, and another sample of single-electron events at the same energy. However, as we will always have background events, in particular due to Compton scattering of the high-energy gamma rays used in producing the e+e− events with the topology of interest, an exactly-labeled test set of detector data is impossible. Therefore we make an assumption about the characteristics of the energy spectrum near the energy of interest and attempt to extract the number of signal and background events present, following the procedure explained in [3].
First, we select only fiducial events passing a single-track cut as explained in section 4.1. Note that the single-track cut was not applied to the training set, but we do apply it to the test set to allow for exact comparison with the previous analysis. We then assume that the signal events produce a Gaussian peak (as indeed would be the case for events occurring at a precise energy), and that the background, consisting of Compton electrons, in the region of the peak can be characterized by an exponential distribution. The peak energy region is fixed to 1.570-1.615 MeV (as in [3]), a region that contains more than 99.5% of the Gaussian peak for both data and MC. Then, we apply an unbinned fit of the sum of two curves (Gaussian + exponential) to the full energy spectrum in the larger energy range 1.45-1.75 MeV in order to keep the fits stable, obtaining the parameters defining the two curves. Integrating over the fitted Gaussian and exponential curves in the peak energy range gives us the estimate of the initial number of signal events s0 (from the Gaussian) and the initial number of background events b0 (from the exponential). This procedure is then repeated using the spectra obtained from events with network classification greater than a varying threshold, in each case obtaining the number of accepted signal events s and accepted background events b.
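Once fit parameters are available, the counts in the peak window follow from integrating each component. The sketch below assumes one possible parametrization that the text does not specify: a Gaussian with mean `mu` and width `sigma` normalized over the whole real line, and an exponential with slope `lam` normalized over the fit range. The fit itself is not shown, and the parameter values in the usage lines are hypothetical.

```python
import math

PEAK_WINDOW_MEV = (1.570, 1.615)  # peak region quoted in the text
FIT_RANGE_MEV = (1.45, 1.75)      # fit range quoted in the text


def gaussian_fraction(mu, sigma, lo, hi):
    """Fraction of a unit-normalized Gaussian lying between lo and hi."""
    def cdf(energy):
        return 0.5 * (1.0 + math.erf((energy - mu) / (sigma * math.sqrt(2.0))))
    return cdf(hi) - cdf(lo)


def exponential_fraction(lam, lo, hi, fit_range=FIT_RANGE_MEV):
    """Fraction of exp(-lam * E), normalized over fit_range, between lo and hi.

    Requires lam != 0, otherwise the normalization vanishes.
    """
    norm = math.exp(-lam * fit_range[0]) - math.exp(-lam * fit_range[1])
    return (math.exp(-lam * lo) - math.exp(-lam * hi)) / norm


def peak_counts(n_sig, n_bkg, mu, sigma, lam, window=PEAK_WINDOW_MEV):
    """Estimated signal and background counts inside the peak window."""
    s0 = n_sig * gaussian_fraction(mu, sigma, *window)
    b0 = n_bkg * exponential_fraction(lam, *window)
    return s0, b0


# Hypothetical fit result: 1000 signal and 5000 background events in the fit range.
print(peak_counts(n_sig=1000, n_bkg=5000, mu=1.592, sigma=0.005, lam=3.0))
```

Repeating the same integration on the spectrum that survives each classification threshold yields the accepted counts s and b.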
Figure 9 illustrates this fit procedure for three different threshold values on the CNN prediction output using data from NEXT-White and a set of Monte Carlo simulated events which were not present in the training set. Varying the classification threshold traces out a curve in the space of signal acceptance s/s0 vs. background rejection 1 − b/b0 (see Fig. 10). To obtain optimal sensitivity in a 0νββ search, one must maximize the ratio of accepted signal to the square root of the accepted background [15], and therefore we also construct the figure of merit F = s/√b for the various classification thresholds. We show for comparison the non-CNN-based result obtained in [3]. In Monte Carlo, we find a maximum figure of merit of F = 2.20 with signal acceptance s/s0 = 0.70 and background rejection 1 − b/b0 = 0.90. In data, fixing the CNN cut to the one giving the best Monte Carlo figure of merit, we find F = 2.21, with signal acceptance s/s0 = 0.65 and background rejection 1 − b/b0 = 0.91.
We note that in Fig. 10 there is excellent agreement between data and simulation when comparing the signal efficiency in this analysis at a fixed background rejection, but there is still a minor disagreement between the figure of merit for simulation and data as a function of prediction threshold. Several reasons account for this disagreement. First, the data-augmentation technique extends the domain of applicability of the neural networks trained solely on simulated data, but it does not account for all possible differences between the data and Monte Carlo events. For example, any effect that would redistribute the energy along the track is not covered by the transformations we employ in data augmentation. We anticipate that many of the effects contributing to data/simulation disagreement, such as the axial length effect mentioned in section 3.2, will be understood and resolved in the future, further reducing these minor residual differences. A smaller EL TPC built from the original hardware of the NEXT-DEMO prototype [28,29] is currently operational and will provide data that can be used to study these effects in more detail.
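The arithmetic connecting acceptance, rejection, and figure of merit can be checked with a short sketch. It assumes that F is evaluated on the fractions s/s0 and b/b0 rather than on raw counts; that reading is an assumption, adopted here because it gives values close to those quoted, and the function name is hypothetical.

```python
import math


def figure_of_merit(signal_acceptance, background_rejection):
    """F = (s/s0) / sqrt(b/b0), with b/b0 = 1 - background rejection."""
    accepted_background = 1.0 - background_rejection
    if accepted_background <= 0.0:
        raise ValueError("no background accepted; figure of merit is undefined")
    return signal_acceptance / math.sqrt(accepted_background)


# Rounded acceptance and rejection values quoted for Monte Carlo and for data.
print(round(figure_of_merit(0.70, 0.90), 2))
print(round(figure_of_merit(0.65, 0.91), 2))
```

With the rounded inputs this gives about 2.21 and 2.17, near the quoted F = 2.20 and F = 2.21; the residual differences are of the size expected from rounding the acceptance and rejection to two digits.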
Second, error in the fit procedure could account for some of the differences in the figure of merit plot. Namely, modeling the energy distribution as the sum of a Gaussian signal and exponential background does not adequately account for the signal-like, double-escape events that originate from a slightly lower energy gamma. These gammas Compton scatter before entering the detector, and their double-escape energy forms a continuum of signal-like events below the 1.59 MeV peak energy, as seen in Fig. 4. For low prediction threshold cuts, while the background acceptance is still high, these events are a minor effect, but as the threshold cut increases they become a larger portion of the left sideband during the fit procedure, leading to an underestimated signal efficiency when compared to the efficiency calculation on simulation obtained using the true underlying event type (the mismatch of red points and the continuous red line in Fig. 10). A different ratio of signal inside the left sidebands between data and MC could lead to a different figure of merit.

Conclusions
We have demonstrated the first data-based evaluation of track classification in HPXe TPCs with neural networks. The results confirm the potential of the method demonstrated in previous simulation-based studies and show that neural networks trained using a detailed Monte Carlo can be employed to make predictions on real data. The present results show that the background contamination can be reduced to approximately 10% while maintaining a signal efficiency of about 65%. In fact, these results are likely to be conservative, as this demonstration was performed at an energy of 1592 keV, while at the same pressure, tracks with energy Qββ are longer, and therefore their topological features should be more pronounced.
Furthermore, we have shown that, with the application of appropriate domain regularization techniques to the training set, our model performs similarly on detector data and simulation in the extraction of the signal events of interest.

Figure 1: Energy depositions from trajectories in a Monte Carlo simulation of a 0νββ event, showing its distinct two-electron topological signature (left) compared with that of a single-electron event (right) of the same energy (figure from [15]).

Figure 2: Schematic of the NEXT-White TPC, showing the positioning of the calibration sources (137Cs and 228Th) present during data acquisition for this study (figure derived from [2]).

Figure 3: Reconstructed hits (left) and voxels (right) of a background Monte Carlo event. The volume within a tight bounding box encompassing the reconstructed hits is divided into 10x10x5 mm³ voxels to produce the voxelized track.

Figure 4: Left: Energy distribution of all MC events (dashed line histogram) and of chosen signal events (solid histogram). Right: Energy distribution of experimental data events showing selected sideband events. The sidebands are 100 keV in width, with each band starting 45 keV from either side of the double-escape peak. The same procedure is also used to select the sidebands in MC.

Figure 5: a) Summary of the neural network architecture used in this analysis, with b) details of each ResNetBlock architecture.

Figure 6: Example of on-the-fly data augmentation used during training on a selected signal event, projected on three planes for easier visualization.

Figure 7: Training and validation losses without (left) and with (right) the application of data augmentation to the training set. Overfitting in the left-hand plot is visible only after 1000 iterations. As the augmentation procedure is only relevant to the training phase, it was not applied to the validation set. The ability of the network to make correct predictions is improved for events unaltered by data augmentation, which explains why the loss is higher for the training set than for the validation set in the right-hand plot.

Figure 8: Energy distance between data and MC features during the training on the left sideband (left) and right sideband (right) for training with and without the augmentation.The corresponding p-value for the chosen model with augmentation at the chosen iteration step was ∼ 0.1 (0.2) for the left (right) sideband.