Separation of track- and shower-like energy deposits in ProtoDUNE-SP using a convolutional neural network

Liquid argon time projection chamber detector technology provides high spatial and calorimetric resolutions on the charged particles traversing liquid argon. As a result, the technology has been used in a number of recent neutrino experiments, and is the technology of choice for the Deep Underground Neutrino Experiment (DUNE). In order to perform high precision measurements of neutrinos in the detector, final state particles need to be effectively identified, and their energy accurately reconstructed. This article proposes an algorithm based on a convolutional neural network to perform the classification of energy deposits and reconstructed particles as track-like or arising from electromagnetic cascades. Results from testing the algorithm on data from ProtoDUNE-SP, a prototype of the DUNE far detector, are presented. The network identifies track- and shower-like particles, as well as Michel electrons, with high efficiency. The performance of the algorithm is consistent between data and simulation.

Technology (BIST), Barcelona, Spain the date of receipt and acceptance should be inserted later Abstract Liquid argon time projection chamber detector technology provides high spatial and calorimetric resolutions on the charged particles traversing liquid argon. As a result, the technology has been used in a number of recent neutrino experiments, and is the technology of choice for the Deep Underground Neutrino Experiment (DUNE). In order to perform high precision measurements of neutrinos in the detector, final state particles need to be effectively identified, and their energy accurately reconstructed. This article proposes an algorithm based on a convolutional neural network to perform the classification of energy deposits and reconstructed particles as track-like or arising from electromagnetic cascades. Results from testing the algorithm on experimental data from ProtoDUNE-SP, a prototype of the DUNE far detector, are presented. The network identifies trackand shower-like particles, as well as Michel electrons, with high efficiency. The performance of the algorithm is consistent between experimental data and simulation.

Introduction
The ProtoDUNE single phase detector (ProtoDUNE-SP) [1,2] is a prototype liquid argon time projection chamber (LArTPC) for the Deep Underground Neutrino Experiment (DUNE) far detector [3,4]. ProtoDUNE-SP is known as a single phase detector as it is operated entirely within liquid phase argon. The detector readout mechanism consists of six Anode Plane Assemblies (APAs), each containing three wire readout planes at angles of ±36 • and 0 • to the vertical, where the readout planes are denoted U, V and W, respectively. The U and V views are the induction views, meaning that charge is induced on the wires by drifting electrons, and the W-view wires collect the drifting electrons. The wires in each readout plane are spaced with approximately 5 mm pitch and are read out at a rate of 2 MHz. A full description of the detector is given in Ref. [2]. ProtoDUNE-SP collected data from a positively-charged-particle beam at CERN [5,6] in autumn 2018, including charged pions, charged kaons, protons, muons and positrons recorded with momenta in the range from 0.3 to 7.0 GeV/c. Additionally, since ProtoDUNE-SP is located on the Earth's surface, it is subject to a large flux of cosmic ray muons.
The particle interactions can be easily and naturally visualised as three twodimensional images (one for each readout view) in the wire number and time parameter space. Each pixel in the image represents the measured charge from a reconstructed energy deposition, called a hit, on a given wire at a given time. A major challenge in the automated reconstruction of particle interactions in LArT-PCs is identifying whether energy deposits originate from track-like (linear, such as protons, charged kaons, charged pions, and muons) or shower-like (locally dense, such as electrons and photons) structures. An example of a 7 GeV/c charged pion interaction is given in Fig. 1, where the π + enters the detector and interacts (just after wire 200 and at time tick 4500) producing a number of track-and showerlike particles. In order to classify the interaction type of the π + , for example as charge-exchange or inelastic scattering, the particles emitted from the interaction vertex must be identified. The classification of reconstructed particles as tracklike or shower-like will also be important in DUNE for the correct classification of neutrino interactions in the far detector. The identification of Michel electrons helps to distinguish between µ + and µ − . It can also be used to identify stopping charged pions whose energy can be fully reconstructed. In this article, we propose and demonstrate the use of a convolutional neural network (CNN) to classify hits as either belonging to track-like or shower-like structures [7]. Furthermore, a Michel electron score is given to each hit to help identify Michel electrons and positrons, produced in the decay of muons and antimuons, respectively [8] 1 . These hit-level classifications can be used alongside pattern recognition based reconstruction algorithms such as Pandora [9,10] to refine the track or shower classification of reconstructed particles. The performance of the Pandora reconstruction on ProtoDUNE-SP simulated and experimental data is described in detail in Ref. [11]. Convolutional neural networks have been successfully used in neutrino physics for event classification [12,13,14], particle identification [15,16] and reconstruction [17]. The fine-grain images obtained from LArTPC detectors makes CNNs a natural choice for such tasks. This algorithm is novel in that it aims to classify the hits based on a small local neighbourhood as opposed to a semantic segmentation approach that uses a much larger image containing a large part (or all) of the detector. The algorithm was designed in this way to minimise the memory usage and computational processing time, allowing it to run quickly on standard computing node CPUs where there is no access to powerful GPUs.

The convolutional neural network
Convolutional neural networks extract features from images by applying a series of filters that are learned during the training process [18,19]. The number of filters and the number of convolutional layers varies for each specific use case; they are determined by the class of problem the network is trying to solve, and the computer hardware available for training and evaluating the network. In this case, a GPU was available for the training of the network, but the inference is performed on computing cluster CPUs (where GPUs are typically not available) as a part of the ProtoDUNE-SP reconstruction chain. As a result, only simple architectures containing few convolutional layers were considered, constrained by the desired evaluation time on the CPUs. For inference tasks within the ProtoDUNE-SP event reconstruction workflow, a C++ interface was added to the LArSoft framework [20]. Recent attempts to introduce GPU acceleration into the workflow mentioned above show promising reductions in processing time [21].
The input to the network is a small region of the entire detector image known as a patch. For each reconstructed hit object, the wire number w and peak time t are extracted, and a 48 × 48 pixel image, centred on (w, t), is created and the value of each pixel corresponds to the detected charge on a given wire at a given time. The wire dimension of the image corresponds to 48 wires with one wire per pixel. The time data are downsampled by averaging over six time samples, such that the spatial dimensions of the pixels match the 5 mm wire pitch in both directions. Therefore, each image represents around 24 × 24 cm 2 of wire data. Figure 2 shows the hits from one APA in a simulated ProtoDUNE-SP event and the three zoomed regions give example 48 × 48 pixel patches in the track, shower and Michel categories. Detector effects such as the ones introduced by space charge [1,22] are included in the simulation. The images from the three wire planes are independently evaluated. This paper only reports on results from the collection plane, which has the highest signal-to-noise ratio.
The architecture for this hit-tagging CNN is shown in Fig. 3. A single convolutional layer containing 48 5 × 5 pixel filters is used to extract feature maps from the input image, which are then flattened and passed to two dense layers that use them to classify the images. Two dropout 2 layers are used for regularisation [23]. The output of the network is split into two branches. The first branch returns the scores for track, shower, or empty (TSE) classification, which can be interpreted as probabilities as they are constrained to sum to one by a softmax [24] activation function. The second returns the probability for a Michel electron classification, with a sigmoid [24] activation function. The output of the network is split in this way due to the overlap of the shower and Michel electron classes. The total loss function is a weighted sum of the two branches, with the weights derived from the relative size of the training samples in each branch.

Training details
For the purposes of training a true classification must be attached to each of the patches. In addition to track, shower and Michel electron patches, empty patches are also created where the central pixel contains no energy deposit. Approximately 30 million images were prepared in total using approximately 500 simulated events: ∼15 million in the track sample, ∼11 million in the shower sample, ∼3 million in the empty sample, and ∼1 million in the Michel electron sample. The CNN was trained with TensorFlow [25] through its Keras [26] interface, and performance metrics, such as the losses, purity and efficiency, were monitored throughout training using TensorBoard [27]. Before training, the data set was split into training, test, and validation sets in the ratio 80:10:10. The performance metrics were monitored throughout training with the training and validation sets, and again after training with the test set. Figure 4 shows the evolution of the training and validation losses throughout the training. Due to the large number of training images and relative simplicity of the task, the losses fall sharply within the first epoch, which is not visible in the plots. The training and validation losses  In this case, the CNN processes 256 images in parallel. Each image is a 48 × 48 pixel patch of the calibrated detector readout. A single convolutional layer, with 48 filters of size 5 × 5, is used to extract features from the images. These are processed by two dense layers containing 128 and 32 neurons respectively, before being split into two branches which provide the trackshower-empty (TSE) and Michel outputs. The dimensions of the data after each operation are given next to the black arrows.
begin to diverge after the first few epochs suggesting there is some over-fitting, but the network generalises well when considering the similar performance of the algorithm on the test and validation sets. To further ensure generalisation, an early stopping algorithm was used, which focused on the loss in the TSE branch [28]. The final weights for the network were taken from a checkpoint at the end of the fifth epoch 3 since the validation loss in the TSE branch starts to plateau on the fifth epoch.

Performance
The performance of the hit tagging was evaluated with reconstructed events from ProtoDUNE-SP simulation. A 48 × 48 pixel image is created around each reconstructed hit, which is then classified by the network and the classification compared with the truth label. Note that by definition this method ensures that no processing is performed on empty images. Figure 5 shows the shower score distributions for true shower hits and all other hits, and a strong separation is seen between the distributions with a score close to one corresponding to a hit that is highly likely to come from a shower. The small peak in the other hits distribution close to one comes from delta-ray electrons overlapping with the cosmic-ray muon that produced them. The classification threshold can be set on a case by case basis, for the initial validation of the network on the ProtoDUNE-SP data it was optimised based on the F1 score, which is given by: where the purity is defined as the fraction of correctly classified shower hits in the sample of all selected shower hits, and the efficiency as the fraction of all true shower hits that were selected as shower hits.   Figure 6 demonstrates the performance of the network in terms of the true positive and false positive rates. In this case, the true positive rate is the fraction of true shower hits that have been correctly classified as shower hits, and the false positive rate is the fraction of other hits incorrectly classified as shower hits. The receiver operating characteristic (ROC) curve is shown, which shows the true positive rate against the false positive rate as the selection threshold on the shower classifier output is varied. ROC curves are shown for simulation with the space charge effect (SCE, red) and without (blue). The close agreement between the curves suggests that the CNN results are robust against changes in the SCE model. track-like objects, or more fragmented due to photon emission and subsequent Compton scattering to produce additional electrons. Furthermore, some delta-ray electrons and components of electromagnetic showers can produce signatures in the detector that are similar to Michel electrons. Therefore, while both distributions are strongly peaked at the expected values, with Michel electrons close to one and other hits close to zero, there are also sub-leading peaks of hits that are not correctly classified. Due to this, and the significantly smaller sample of Michel electron hits, the network is not able to achieve a good performance in terms of the F1 metric. However, when combined with simple clustering, a high purity sample of Michel electron events can be selected, as will be discussed in Sec. 3.

Results from experimental data and simulation
It is important that the CNN is robust against potential differences between experimental data and simulation, and hence the performance has been compared between experimental data and simulation for several particle species. Hits are tagged in the three different readout views and reconstructed particles from Pandora are assigned a score between 0 and 1 that is the average of the shower classifier score from the CNN from all of the 2D hits in the collection view. Each hit is weighted by the hit charge when calculating the average shower score. A score close to one means that it is highly probable that the particle is shower-like, and a low score means the particle is very likely to be track-like. Data from ProtoDUNE-SP runs 5387 and 5809 taken in the H4-VLE test beam at CERN with 1 GeV/c beam momentum were used for the initial qualitative validation of the CNN performance on ProtoDUNE-SP data. These runs contain cosmic rays and particles from the charged particle beam. Run 5809 was taken with the inclusive beam trigger giving a dataset primarily consisting of beam positrons. Run 5387 was taken with a trigger that vetoed positrons, which resulted in a sample primarily consisting of beam π + 's, µ + 's and protons. Figure 8 shows an example of the CNN shower scores of reconstructed particles in a ProtoDUNE-SP event as a visual cross-check of the CNN performance. As expected, the cosmic-ray muon and pion tracks in the event have low shower scores, while the photon shower from the charged particle beam interaction is given a high score. In addition, delta ray electrons, which are emitted along the muon tracks, are associated with showers and therefore receive a high CNN shower score. The latest ProtoDUNE-SP Monte Carlo (MC) simulation sample was used to compare with experimental data. This is a new MC simulation with improved modelling of detector response, which is completely independent of the previous MC simulation that was used to train the CNN.
The following sections report the performance of the CNN classification at the hit level and the particle level for cosmic rays and charged particles from the test beam. In order to classify the hits, a threshold of 0.72 was applied to the shower classifier output of the CNN, with hits exceeding the threshold being classified as shower hits. This threshold was selected by choosing the value with the largest F1 score in Fig. 5. For particle-level classification, a different threshold of 0.81 is applied to the average shower score to classify particles, where the threshold was chosen to maximise the product of the selection efficiencies of all four types of charged beam particles. The CNN shower score of each hit in reconstructed particles for the same 7 GeV/c π + interaction shown in Fig. 1. This diagram shows the location of reconstructed hits in wire-time coordinates, and the hits are coloured based on the CNN shower score. Red hits are track-like, and blue hits are shower-like. A number of cosmic muon tracks can be seen, along with tracks and showers produced by the pion interaction. The small shower-like patches along the muon tracks are delta-ray electrons.

Cosmic-ray muons
The CNN was first tested on a sample of cosmic-ray muons 4 in order to validate its performance before it was used to classify beam particles (see Sec. 3.2). A sample of cosmic-ray muons was selected from simulation and experimental data (run 5387), where cosmic-ray muon candidates were selected using the following criteria: the particle was reconstructed by Pandora as a track the track was at least 1 m in length the track started and ended at least 50 cm from the front face of the detector (to veto beam particles) -the track was directed at least 15 • away from the vertical (to veto tracks that only deposited energy on a small number of collection plane wires).
All of the hits associated to the selected tracks were labelled as true cosmic-ray muon hits. The hits from any other particles associated with the cosmic-ray muon candidate, such as delta-ray and Michel electrons, were not included to avoid contaminating the hit selection. Firstly, the hit-level classification was studied. Figure 9 shows the CNN shower output score for cosmic-ray hits in experimental data (black) and simulation (red), and demonstrates the high level of agreement between the two samples. The peak in the score distribution close to one can be attributed to hits from the numerous delta-ray electrons produced by high energy muons, such as those shown previously in Fig. 8. The results of the hit-level classification, obtained by measuring the fraction of hits below a threshold of 0.72, are given in Table 1.  The CNN shower classifier scores for cosmic-ray muon hits from experimental data (black) and simulation (red). The error bars on the data are statistical. Figure 10 shows the particle-level comparison of the average CNN shower score for the cosmic-ray muons in experimental data and simulation. As expected, the distributions are peaked close to zero, and the experimental data distribution is slightly shifted compared to the simulation. However, when applying the threshold of 0.81 to classify the cosmic rays as track-like, both the performance and agreement between experimental data and simulation is excellent, as shown in Table 1. Table 1: The fraction of correctly classified cosmic-ray muon hits and particles using the CNN measured for experimental data and simulation. The errors represent the statistical uncertainties calculated using the Clopper-Pearson method [29].

Charged particle test beam
In the case of particles originating from the charged particle beam in the experimental data samples, the beam instrumentation [1] can be used to provide an effective truth source to which the results from the CNN can be compared. For simulation samples we use the truth information to get the primary beam particle species information. This allows the shower score distributions from the CNN to be compared between experimental data and simulation for different particle species. The reconstructed particles with angles inconsistent with the beam direction and that arrive out-of-time with the beam can be assumed to be cosmic muons. Note that at 1 GeV/c beam momentum, π + and µ + are indistinguishable using the beam instrumentation information. A 1 GeV/c µ + is expected to stop in the middle of the detector around z = 380 cm, where the z axis is horizontal. A 1 GeV/c π + will most likely interact with the argon nucleus before stopping because of the relatively short interaction length (∼100 cm). We identify an event as a pion if the reconstructed track end z position is less than 100 cm and as a muon if the end z position is greater than 300 cm for events identified by the beam instrumentation as either pions or muons. We require the number of collection plane hits in the reconstructed shower should be greater than 200 for the positron candidate events in order to remove events with an incompletely reconstructed shower. This cut is not applied to the other three particle species. Table 2 shows the numbers of events after the beam quality and number of hits cuts for beam pions, muons, protons and positrons and the purity of the selected samples based on the truth information in the simulation. Figure 11 shows the distribution of shower classifier score for each individual hit in the beam pions, muons, protons, and positrons. The data in all of the beam particle distributions are normalised by the number of triggered beam particles of the given flavour after the beam quality and number of hits cuts. There is a reasonable agreement between the experimental data and simulation in terms of the shower score distributions for each particle species. To quantify the efficiency to select track-like and shower-like hits, Table 3 lists the fraction of individual hits selected into the appropriate category for each sample in experimental data and simulation for a selection threshold of 0.72. The difference between the selected fraction in each case is an estimate of the systematic uncertainty associated with hit-by-hit selection. The class used for the selection in each sample is also given in Table 3. The fractional difference between experimental data and simulation varies based on the particles species, and falls in the range of 1-2%.  Figure 12 shows the distribution of the average shower classifier scores over all the hits in the reconstructed pion, proton, and positron particles. This average shower classifier score is what analysers normally use to identify a reconstructed particle as a track-like or shower-like particle. The distributions in each category are normalised to unit area. The experimental data and simulation distributions are in a reasonable agreement. There is a long tail in the average shower classifier score distribution for both the beam pions and protons. This tail is caused by the spatial distortion introduced by the SCE and is largely suppressed if we make the distributions using simulation sample without simulating SCE. There is a shift in the average shower classifier score for beam positrons between experimental data and simulation. There are slightly more hits in experimental data than in simulation for reconstructed positron events, making the experimental data hits more shower-like. It can be seen that the score distribution for the beam muons is more strongly peaked towards low scores than for cosmic-ray muons, shown in Fig. 10, because they are significantly lower in energy and hence produce fewer delta rays. Table 4 lists the fraction of reconstructed particles selected into the appropriate category for each sample in experimental data and simulation for a selection threshold of 0.81. The fractional difference between experimental data and simulation is within 1% for all particles species.

Michel electrons
To validate the performance of the CNN Michel score calculation, we examine the Michel score of hits around the muon and pion track end point. Hits around the muon end points are most likely from the Michel electron which are expected to have a high Michel score. We define a window of 30 wires × 200 ticks (approximately 15 × 16 cm 2 ) centred around the reconstructed track end point projected on the collection plane to select daughter hits. Hits from the secondary particles produced by the pion interaction are expected to have a low Michel score as shown in Fig. 13(a). The Michel hits from the muon decay are expected to have a high Michel score as shown in Fig. 13(b). Hits on the primary beam track or on another track that is longer than 25 cm are excluded to remove the contributions from primary beam particle and cosmic ray muons. Figures 14 and 15 show the hit-level and particle level comparison of the CNN Michel score over daughter hits in the reconstructed pion and muon particles.   The results of the hit-level and event-level classification, obtained using a threshold of 0.19, are given in Tables 5 and 6, respectively. The threshold is chosen to maximise the product of selection efficiencies of pions and muons. We are able to select 73% of the µ + events while rejecting 90-92% of the π + events  using the average Michel score. The fractional difference between experimental data and simulation falls in the range of 1-2%. Efficient identification of Michel electrons provides crucial information on particle identification and kinematic reconstruction. It allows the separation between µ + and µ − because 70% of the µ − s are captured while most of the µ + s decay into Michel electrons. It also allows the identification of stopping π + which goes through the decay chain π + → µ + → e + . The momentum of those stopping pions can be reconstructed either through track range or using calorimetric information, which can be used to reconstruct the full kinematics of the final state particles.

Conclusion
In this paper, we described an effective hit tagging algorithm for track, shower, and Michel electron hit classification based on a convolutional neural network, using a small patch approach. This algorithm is shown to give good agreement in selection efficiencies, of around 1-2%, between experimental data and simulation for cosmic rays and 1 GeV/c test-beam interactions for a hit-by-hit event selection. When combined with the full event reconstruction (which includes a BDT-based classifier) and applied to the hits of each reconstructed particle, the CNN refines the track and shower classification to produce highly efficient selections that agree within 1% between experimental data and simulation. Additionally, this network also provides a method to select Michel electrons, which helps with the particle identification and kinematic reconstruction. This algorithm is being widely used within ongoing ProtoDUNE-SP data analyses, including pion cross-section analyses and detector calibrations.