Improved calorimetric particle identification in NA62 using machine learning techniques

Measurement of the ultra-rare $K^+\to\pi^+\nu\bar\nu$ decay at the NA62 experiment at CERN requires high-performance particle identification to distinguish muons from pions. Calorimetric identification currently in use, based on a boosted decision tree algorithm, achieves a muon misidentification probability of $1.2\times 10^{-5}$ for a pion identification efficiency of 75% in the momentum range of 15-40 GeV/$c$. In this work, calorimetric identification performance is improved by developing an algorithm based on a convolutional neural network classifier augmented by a filter. Muon misidentification probability is reduced by a factor of six with respect to the current value for a fixed pion-identification efficiency of 75%. Alternatively, pion identification efficiency is improved from 72% to 91% for a fixed muon misidentification probability of $10^{-5}$.

Machine learning (ML) methods developed primarily for image recognition can be applied to particle identification (PID) at NA62.Convolutional neural networks (CNN), a type of deep neural network, have been successfully used for classification tasks commonly encountered in particle physics [3,4,5].In this work, a calorimetric PID algorithm based on ML models is designed for the NA62 experiment.Two ML models are considered: LightGBM [6], which is an implementation of a gradient boosting machine (GBM) [7], and a CNN-based model.Performance of the algorithm is compared to that of the boosted decision tree (BDT) implemented using TMVA GradientBoost [8], currently used by NA62.

NA6beam and detector
The layout of the NA62 detector [9] is shown schematically in Fig. 1.An unseparated secondary beam of π + (70%), protons (23%), and K + (6%) is created by directing 400 GeV/c protons extracted from the CERN SPS onto a beryllium target in spills of 3 s effective duration.The central beam momentum is 75 GeV/c, and the momentum spread is 1% (rms).Beam kaons are tagged with 70 ps time resolution by a differential Cherenkov counter (KTAG) using a nitrogen gas radiator at 1.75 bar pressure contained in a 5 m long vessel.Beam particle positions, momenta and times are measured by a silicon pixel spectrometer consisting of three stations (GTK1, 2, 3) and four dipole magnets.A 1.2 m thick steel collimator (COL) is placed upstream of GTK3 to absorb hadrons produced in upstream K + decays.Inelastic interactions of beam particles in the GTK are detected by an array of scintillator hodoscopes (CHANTI).The beam is delivered into a vacuum tank evacuated to a pressure of 10 −6 mbar, which contains a 75 m long fiducial decay volume (FV) starting 2.6 m downstream of GTK3.Downstream of the FV, undecayed beam particles continue their path in vacuum.Momenta of charged particles produced in K + decays in the FV are measured by a magnetic spectrometer (STRAW) located in the vacuum tank downstream of the FV.The spectrometer consists of four tracking chambers made of straw tubes, and a dipole magnet located between the the second and third chambers that provides a horizontal momentum kick of 270 MeV/c.The spectrometer momentum resolution is σ p /p = (0.30⊕0.005•p)%,where the momentum p is expressed in GeV/c.
A ring-imaging Cherenkov detector (RICH), consisting of a 17.5 m long vessel filled with neon at atmospheric pressure (with a Cherenkov threshold for pions of 12.5 GeV/c), is used for the identification of charged particles and for time measurement with 70 ps precision.Two scintillator hodoscopes (CHOD) located downstream of the RICH provide trigger signals and time measurements with 200 ps precision.
A quasi-homogeneous liquid-krypton (LKr) electromagnetic calorimeter of thickness 27 radiation lengths is used for particle identification and photon detection.The calorimeter has an active volume of 7 m 3 , is segmented in the transverse plane into 13248 projective cells of 2 × 2 cm 2 , and provides an energy resolution σ E /E = (4.8/√ E ⊕ 11/E ⊕ 0.9)%, where E is expressed in GeV.To achieve hermetic acceptance for photons emitted in the FV by K + decays at angles up to 50 mrad to the beam axis, the LKr calorimeter is supplemented by annular lead glass detectors (LAV) installed in 12 positions inside and downstream of the vacuum tank, and two lead/scintillator sampling calorimeters (IRC, SAC) located close to the beam axis.
A steel/scintillator hadronic sampling calorimeter is formed of two modules (MUV1, 2) with transverse dimensions of approximately 2.6 × 2.6 m 2 .The MUV1 (MUV2) module consists A two-level trigger system [10,11] is employed to reduce the event rate from 10 MHz to about 100 kHz.Multiple trigger lines with different downscaling factors are operated concurrently.

GBM and CNN algorithms
Supervised ML algorithms are used to construct a function (called model or classifier), f (x i ; θ), mapping a set of inputs x i (called examples) consisting of a number of attributes (called features) to a set of outputs y i (called labels).The set of model parameters, θ, is chosen to minimise a loss function (also called objective function), L = i (y i , f (x i ; θ)), where the function (y, f (x; θ)) is selected depending on the problem to solve [12,13].
GBM is a generic family of algorithms used to build strongly predictive classifiers as linear combinations of weak classifiers (such as decision trees).The classifier, f (x i ), is constructed iteratively, starting from an initial decision tree.At each step m, residuals are computed using the loss function obtained at the previous step as −∂ (y i , f m−1 (x i )) /∂f m−1 , and a regression tree fitting the residuals is constructed and added to the linear combination.This procedure is repeated until a stopping condition is met [7].
CNN algorithms have been developed to address computer vision tasks such as translationinvariant image classification, and can recognize complex geometric features at multiple scales.CNNs consist of consecutive layers, the output of each layer being the input to the next one.Each layer is represented by a function g(x), where x is an input tensor.Common layer types include convolutional, batch normalisation and pooling layers [12,13].It has been observed [14] that increasing the number of layers leads to degradation of the training performance; to mitigate this issue, the ResNet architecture [14] introduces groups of consecutive layers called the residual blocks.Instead of learning the parameters of the underlying function g(x) directly, the block learns the parameters of the residual function g r (x), encoding the difference between the input and output of the layer.Many of these blocks are typically stacked.

Data samples
Two independent datasets are employed: a training and validation dataset is used to train the models and adjust the hyperparameters (i.e. the parameters not derived via training), while an independent test dataset is used to characterise the performance of the optimal model with a frozen set of hyperparameters.The training and validation (test) dataset is based on 2.5% (1.5%) of the data sample collected by the NA62 experiment in 2016-2018.The training and validation dataset contains events collected with a non-downscaled K + → π + ν ν trigger line [1] based on RICH and CHOD signals in the absence of in-time signals, and a control trigger line based on the CHOD signal downscaled by a factor of 400.The test dataset contains events collected with the control trigger line.This provides sufficient statistics for training, ensuring that the test dataset is free from trigger-induced bias.
High-purity pion (π + ), muon (µ + ) and positron (e + ) track samples in the momentum range 15-50 GeV/c are obtained from K + → π + π 0 , K + → µ + ν and K + → π 0 e + ν decay candidates selected without calorimetric particle identification, following the procedure used earlier for the existing NA62 BDT algorithm.Isolated STRAW tracks with associated in-time KTAG, CHOD, RICH and LKr signals and a single associated GTK track are considered.The decay vertex, reconstructed as the point of closest approach of the STRAW and GTK tracks, is required to be located in the FV and within the beam envelope.For the K + → π + π 0 and K + → π 0 e + ν decays, a prompt π 0 → γγ decay is reconstructed by measuring the photons in the LKr calorimeter.Further selection criteria are based on photon veto conditions, particle identification in the RICH, and missing mass squared, m 2 miss = (P K − P tr ) 2 , where P K and P tr are the K + and decay track four-momenta, respectively.A label (π + , µ + or e + ) is assigned to each selected track.
The following features are encoded as matrices and vectors: track momentum and impact position in the LKr front plane; presence of a matching MUV3 signal; energy deposits in a matrix of 22 × 22 LKr calorimeter cells centered around the track impact position (these matrices are sparse due to the LKr Molière radius of 4.7 cm); total LKr energy deposit associated to the track; energy deposits in the horizontal and vertical views in all MUV1 channels (2 × 44 channels in each view) and all MUV2 channels (2 × 22 channels in each view).Energy deposits within 10 ns of the track time are considered; the energy deposit values in every channel or cell are divided by the track momentum.
The GBM algorithms require input information in the form of a single vector for each track.Therefore the LKr, MUV1 and MUV2 energy deposit matrices are transformed into vectors; the LKr matrix is cropped to a size of 18 × 18 to exploit its sparsity.The results are concatenated into a 588-element feature vector containing the complete information about the track.
The CNN processing aims to exploit the correlations between the LKr, MUV1 and MUV2 signals.Most CNN architectures require an arbitrary number of matrices of identical dimensions (known in the ML context as channels) as input.Therefore the LKr, MUV1 and MUV2 matrices are transformed to a fixed size of 44 × 44 by replicating the element values (though the LKr and MUV matrices correspond to different transverse areas).The resulting matrices are illustrated for two events in Fig. 2. The inputs are finally arranged into a 5 × 44 × 44 input tensor.

Particle identification algorithm
The flow of the developed PID algorithm is illustrated in Fig. 3.The algorithm returns the probabilities p π , p µ , p e of the track to be classified as π + , µ + , e + , respectively.Tracks with LKr energy deposit to momentum ratio E/p < 0.95 are considered, which reduces the e + contamination in the π + and µ + samples.Tracks with matching MUV3 signals, or identified as minimum-ionising particles (MIPs) by a dedicated filter, are assigned p µ = 1.For the remaining tracks, a ML model (either a GBM or a CNN algorithm) is employed to evaluate the probabilities p π , p µ , p e .The models are trained for tracks in the momentum range 15-40 GeV/c.Two MIP filter algorithms, which use the same input information as the ML models, are considered.The filter A identifies MIPs as tracks with at most four geometrically associated in-time signals in LKr calorimeter cells, at most two hits in each MUV1 view, and at most one hit in each MUV2 view.The filter B additionally exploits the fact that muons typically produce narrow calorimeter signal patterns.Both ML models have been tested with each of the two filters; the corresponding setups are referred to as GBM/A, GBM/B, CNN/A and CNN/B.
The GBM implementation uses the LightGBM framework, which is efficient for large datasets and high-dimensional feature spaces.A CNN architecture of the ResNet-18 type [14] is used.To handle the sparsely populated input matrices (Fig. 2), the 3 × 3 maximum pooling layer after the first batch normalisation layer is removed.The network is further simplified by removing the last two residual blocks that include four 512-input convolution layers.The output of the last of the remaining residual blocks is down-sampled via a global average pooling layer, which is followed by a fully connected layer producing the final output subsequently normalised using a softmax function to obtain the probabilities p π , p µ and p e .The resulting NA62ResNet architecture is shown in Fig. 4.  Data augmentation is used to improve the robustness and generalisation of the CNN model: for each epoch of the training, the five input matrices are mirror-imaged with respect to the horizontal axis for 50% of the tracks in the training sample chosen at random.The weights and the bias terms of the fully connected layer are initialised with random numbers distributed uniformly in the (−1/ √ N , 1/ √ N ) range, where N = 256 is the number of input features.The convolution layers are initialised using the He scheme [15], and the cross entropy loss function is used.The network parameters are optimised following the Adam method [16].To detect potential overfitting, the validation loss is monitored during training.The PyTorch framework [17] is used to build and train the model.The architecture is integrated into the NA62 software framework.
The numbers of π + , µ + and e + tracks in the training and validation dataset passing the MIP filter A and used for training (Fig. 3) are 6.9 × 10 6 , 1.0 × 10 7 and 7.8 × 10 4 , respectively.During the model training, 25% of the dataset is randomly set aside to form a validation sample, and the rest is allocated to the training sample.The cross entropy loss function [13] is used for training of both GBM and CNN models.

Performance of the algorithm
The results presented in this section are obtained using the test dataset.Background contamination in the π + and µ + samples in the dataset is found using Monte Carlo simulations to be below 10 −5 and 10 −6 , respectively.The p π distributions for π + and µ + tracks obtained with the CNN/A setup for tracks in the momentum range of 15-40 GeV/c are shown in Fig. 5

(left).
A condition p π > p 0 is used for π + identification, where p 0 is chosen for each setup to obtain a π + identification efficiency of 75%.The corresponding µ + misidentification probabilities, ε 75 µπ , measured for each setup using a subset of data are displayed in Fig. 5 (right).The CNN/A setup is found to provide the lowest ε 75 µπ value, and is used for further analysis.The results reported below are obtained with the CNN/A setup.
The measured ε 75 µπ value as a function of track momentum is shown in Fig. 6 (left).Strong µ + suppression is achieved also in the momentum range 40-50 GeV/c not used for training.The receiver operating characteristic (ROC) curve in the momentum range 15-40 GeV/c is shown

Figure 1 :
Figure 1: Schematic side view of the NA62 detector used in 2018.
by the STRAW track momentum

Figure 2 :Figure 3 :
Figure 2: CNN input matrices of 44 × 44 size constructed for a µ + track (top row) and a π + track (bottom row) from the training dataset.The LKr matrix is centred at the track impact point in the LKr front plane.In the MUV matrices related to the horizontal (vertical) view, the vertical (horizontal) coordinate is centered at the track impact point in the LKr front plane.

Figure 4 :
Figure 4: The NA62ResNet architecture.Residual blocks (ResBlock) are stacked in pairs.The stride S [13] is a parameter of the filter.The output tensor dimensions at each step are indicated.