Deep machine learning for the PANDA software trigger

Deep machine learning methods have been studied for the software trigger of the future PANDA experiment at FAIR, using Monte Carlo simulated data from the GEANT-based detector simulation framework PandaRoot. Ten physics channels that cover the main physics topics, including electromagnetic, exotic, charmonium, open charm, and baryonic reaction channels, have been investigated at four different antiproton beam momenta. Different classification concepts and network architectures have been studied. Finally, a residual convolutional neural network with four residual blocks in a binary classification scheme was chosen for its extensibility, performance, and stability. The presented work is a feasibility study of a completely software-based trigger system. Compared to a conventional selection method, the deep machine learning approach achieved a significant efficiency gain of up to 200%, while keeping the background reduction factor at the required level of 1/1000. Furthermore, it is shown that the use of additional input variables can improve the data quality for subsequent analysis. This study shows that the PANDA software trigger can benefit greatly from deep machine learning methods.


Motivation
The antiProton ANnihilation at DArmstadt experiment (PANDA) at the Facility for Antiproton and Ion Research (FAIR) [1] aims to explore hadron physics by means of antiproton-induced reactions [2,3]. A wide array of physics topics and open questions is in the scope of PANDA, including strangeness physics, charm and exotics, nucleon structure, and hadrons in nuclei [4,5]. Some examples are the investigation of the charmonium spectrum, hyperon spectroscopy, and electromagnetic form factors, as well as the search for exotic states, e.g. hybrids, glueballs and hypernuclei. At centre-of-mass energies between 2 and 5.5 GeV the total pp cross section ranges between 50 and 100 mb [6], while the signals of interest have cross sections between a couple of microbarns and a few nanobarns or even picobarns. Thus, with signal cross sections many orders of magnitude smaller than the total proton-antiproton cross section, extracting the physics of interest is challenging in terms of background suppression.
With an expected average reaction rate of about 20 MHz and an average event size of 10-20 kB, the full data rate will be roughly 200 GB/s or more. Because only a tiny fraction of events contains physics processes of interest, it would be inefficient to store all the data. In order to identify the interesting events in an online environment, a sophisticated trigger system is necessary. The PANDA trigger system is foreseen to reduce the rate of events to be stored by an approximate factor of 1000 to 20 kHz, reducing the stored data to 200 MB/s, which still amounts to about 1 PB of stored data per year.
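As a quick sanity check, the quoted rates follow from back-of-envelope arithmetic. The sketch below uses the rate and event-size figures from the text; the yearly figure depends on the assumed running time, which the text does not specify.

```python
# Back-of-envelope check of the PANDA data rates quoted above.
RATE_HZ = 20e6          # average reaction rate, 20 MHz
EVENT_SIZE_B = 10e3     # lower bound of the 10-20 kB average event size
REDUCTION = 1000        # required trigger reduction factor

raw_rate = RATE_HZ * EVENT_SIZE_B         # bytes/s before the trigger: 200 GB/s
stored_rate = raw_rate / REDUCTION        # bytes/s after the trigger:  200 MB/s
per_year = stored_rate * 31_536_000       # bytes stored in one year of
                                          # CONTINUOUS running (~6.3 PB);
                                          # the ~1 PB/year quoted in the text
                                          # thus implies a realistic duty cycle
```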
Other experiments such as E835 [7,8], ATLAS [9], BaBar [10], or BESIII [11] have different approaches to trigger and perform event selection. Detector coincidences combined with clean background conditions, e.g. in e+e− collisions, allow for more or less straightforward triggering decisions. In PANDA, the kinematic similarity of signal and background reactions, combined with a wide physics interest and the high interaction rate,
puts the key challenge on the trigger system, and consequently the selections have to be made on full event candidates with high-level information. Similar to experiments like LHCb [13,14], ALICE [15] or CMS [12], PANDA will use a Software Trigger system based on fully reconstructed event information. To cope with these challenges, high-level properties need to be obtained from the reaction data for proper signal-background separation, and deep machine learning methods are studied for the PANDA Software Trigger to improve performance compared to a conventional cut-and-count method.

Experiment Environment
FAIR is an international accelerator facility under construction in Darmstadt, Germany [1,3]. It expands the existing accelerator complex on a large scale, as shown in Figure 1. An antiproton beam will be prepared in a cascade of accelerators, making use of almost the whole facility. The synchrotron High-Energy Storage Ring (HESR) [16] will store, cool and accelerate the antiprotons, which will have momenta in the range between 1.5 GeV/c and 15 GeV/c, corresponding to centre-of-mass energies in the antiproton-proton reactions at PANDA between about 2 and 5.5 GeV. With a luminosity of up to L = 2 × 10^32 cm^-2 s^-1 in the "high luminosity mode" and a beam momentum resolution of dp/p < 2 × 10^-5 in the "high resolution mode", HESR is an essential component for PANDA that allows for precise measurements.

Fig. 2: Schematic overview of the complete PANDA online trigger system.
The PANDA experiment, shown in Figure 1, consists of two spectrometers, one barrel-shaped surrounding the interaction point, the other covering the forward direction, both providing measurements for precise charged particle tracking and particle identification as well as electromagnetic calorimetry.

Software Trigger in PANDA
The PANDA online trigger system will consist of online reconstruction, event building, and the Software Trigger, as illustrated in the schematic in Figure 2. First, the online reconstruction of neutral and charged particles will be performed as completely as possible using fast algorithms, and the Particle IDentification (PID) information will be assigned to the corresponding tracks. This will be combined with the event reconstruction process, possibly in an iterative manner, to provide the online event candidate with reconstructed final-state particle information to the Software Trigger module. The event candidate will then be processed in the Software Trigger module by the selection algorithms, and event candidates will be tagged when they are consistent with a signal signature. Finally, an event candidate will be written to the data storage if any of the trigger algorithms accepts it as a signal event.

Physics channels
The PANDA experiment has a rich physics program, so the composition of active trigger signatures will be adapted depending on the current physics aim [4,5]. Therefore, the PANDA Software Trigger system needs to identify many types of physics reactions and offer high flexibility in configuration.
For this study a total of ten channels is considered to verify the feasibility of the Software Trigger, inspired by the physics motivation given in the PANDA Physics Book [4]. These ten physics channels have simple and clear decay modes in their respective fields, spanning a number of event topologies which may occur in the experimental runs, while covering the main physics topics, namely exotic hadrons, charmonium, open charm, and baryon states, including the resonances φ, η_c, J/ψ, D^0, D^+, D_s^+, Λ, and Λ_c. The physics topics and reaction channels, together with the corresponding codes and trigger signatures, are listed in Table 1. Signal events are generated with EvtGen [17] using a simple phase space decay model (PHSP) as well as the VSS (decay of a vector to two scalar particles) and VLL (decay of a vector to two charged leptons) models for the decays of φ and J/ψ, respectively. More complex decay patterns, e.g. in the Dalitz decay channels, are omitted in order to populate the phase space more evenly for data quality studies. Generic inelastic background reactions are generated by the Dual Parton Model (DPM) [18], which models the various production cross sections in antiproton-proton reactions. These background events have to be rejected effectively by the triggering algorithm.
For each reaction type, a specific selection procedure, the trigger line, is defined. To determine the performance, the individual trigger lines are investigated, as well as the complete trigger system with the ten reaction types tagged simultaneously.
A trigger line includes the reconstruction of a certain composite resonance/particle candidate and the classification of whether this composite candidate originates from signal or background events. A comprehensive list of input quantities (Table 2) is computed based on reconstructed properties, momenta, angles, and particle identification probabilities of the composite candidates as well as of the final-state particles involved. Furthermore, event-specific variables such as minimal and maximal momenta, momentum sums, event planarity and sphericity, thrust magnitude, Fox-Wolfram moments [19] and multiplicities are deduced to further support the trigger decision.
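Two of the event shape quantities mentioned above can be sketched with their standard definitions: the sphericity built from the eigenvalues of the normalised momentum tensor, and the reduced Fox-Wolfram moments H_l/H_0. This is an illustrative sketch with our own function names, not PandaRoot code; it assumes all particles carry nonzero momentum.

```python
import numpy as np
from numpy.polynomial.legendre import legval

def sphericity(p):
    """Event sphericity from an (N, 3) array of particle momenta:
    S = 3/2 (lambda_2 + lambda_3) of the normalised momentum tensor."""
    tensor = p.T @ p / (p ** 2).sum()
    lam = np.sort(np.linalg.eigvalsh(tensor))  # ascending eigenvalues
    return 1.5 * (lam[0] + lam[1])

def fox_wolfram(p, lmax=4):
    """Reduced Fox-Wolfram moments H_l / H_0 for l = 1..lmax,
    H_l = sum_ij |p_i||p_j| P_l(cos theta_ij)."""
    mag = np.linalg.norm(p, axis=1)
    cos = np.clip((p @ p.T) / np.outer(mag, mag), -1.0, 1.0)
    weights = np.outer(mag, mag)
    H = []
    for l in range(lmax + 1):
        coeff = np.zeros(l + 1)
        coeff[l] = 1.0                         # select the Legendre polynomial P_l
        H.append((weights * legval(cos, coeff)).sum())
    return [h / H[0] for h in H[1:]]
```

For a perfectly back-to-back two-particle event the sphericity is 0 and the odd reduced moments vanish, which provides a simple cross-check.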
One trigger line consists of four parts:
- Reconstruction of composite candidates by final state combinatorics
- Preselection with a broad cut on the invariant mass of the candidate of ±10σ around its peak position
- Calculation of the necessary observables (Table 2)
- Selection by algorithm (the focus of this work)
The selection target is determined by the required background suppression factor. In this study, the overall required suppression factor is s = 1/1000, i.e. the number of background events is suppressed by 99.9% with all active trigger lines acting simultaneously. Applying n_trig trigger lines simultaneously, the background suppression for each trigger line is required to satisfy s_i = s/n_trig. In turn, because different trigger lines can accept the same background event, the actual total background suppression factor s_all can be slightly better, such that s_all ≤ Σ_i s_i = s.
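This suppression bookkeeping can be checked numerically. The toy below assumes, purely for illustration, that the trigger lines accept background events independently of each other; the event count and seed are arbitrary.

```python
import numpy as np

n_trig, s = 9, 1e-3          # nine active lines, overall suppression 1/1000
s_i = s / n_trig             # required suppression per trigger line

rng = np.random.default_rng(1)
# independent accept decisions of each line for 1e6 background events
accepts = rng.random((1_000_000, n_trig)) < s_i
s_all = accepts.any(axis=1).mean()   # an event is kept if ANY line fires

# The union rate never exceeds the sum of the per-line rates,
# so s_all <= s holds by construction; overlaps only help.
```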

Data preparation
The data for the physics channels and background are obtained by Monte-Carlo simulations with the PandaRoot software [20]. This simulation and reconstruction software framework is based on the publicly available software package FairRoot [21] v18.2.0 and the external package collection FairSoft in version "jun19p1".
EvtGen [17] provides the particles for propagation through the detector volumes with the GEANT 3 transport modeller [22], followed by detailed detector simulation, digitisation and reconstruction. Per particle candidate, tracking, calorimetry and PID information, such as particle energy loss, time of flight, Cherenkov angle, and EMC energy deposit, is provided. All events are self-contained, and no event-time-related effects, such as event mixing or incomplete events, are considered. In Table 3, the number of simulated events is shown for the chosen channels where the centre-of-mass energy permits the reaction. The Monte-Carlo truth information, matched to the reconstructed particle candidates, is also provided to distinguish combinatorial background from signal events for the training stage.

Methods
Our figure of merit for comparing different approaches to the triggering process will be the triggering efficiency, defined as the ratio of the number of triggered events (N_trig) to the number of events passing the detector reconstruction, combinatorics and a 10σ preselection mass window (N_rec),

    ε = N_trig / N_rec,

with σ being the width/resolution of the individual reconstructed resonance. The triggering efficiency can be defined for two cases. The individual efficiency focuses on the selection performance of each trigger line, serving one particular physics channel. This is a useful quantity for optimising the triggering algorithms in question. For the complete trigger setup, the total efficiencies will be affected by cross-tagging. Here, events can be rejected by their intended trigger line but accepted accidentally by one or more of the other trigger lines. This is an unintentional but fortunate effect. The total triggering efficiency is the practically more relevant measure for the experiment.
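In code, the definition is a one-liner; this minimal sketch (function name is ours) only adds a guard against an empty denominator.

```python
def trigger_efficiency(n_trig: int, n_rec: int) -> float:
    """Triggering efficiency eps = N_trig / N_rec, where N_rec counts events
    passing reconstruction, combinatorics and the 10-sigma preselection
    mass window (a minimal sketch of the definition above)."""
    if n_rec == 0:
        raise ValueError("no reconstructed events in the denominator")
    return n_trig / n_rec
```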
Since we require a fixed background reduction for each neural network (trigger line) individually, the resulting signal trigger efficiencies already represent a comparable measure of network performance. As an additional quality measure we provide the integral (AUC, area under curve) of the receiver operating characteristic (ROC), which is more common in the machine learning context, while not being of large practical value for this study.

Cut-and-count method
The conventional benchmark trigger scheme follows a cut-and-count approach, employing an optimised set of trigger-line-specific one-dimensional cuts on the measured observable distributions.
In order to identify this set of criteria, all input observables are evaluated individually for each trigger line. We integrate the distributions of both signal and background events for each of the n observables, in both directions, up to a threshold value retaining 90% of the signal events. This threshold defines the cut to be applied. From these 2n possible cuts, we select and apply the one leading to the largest suppression of background events. This procedure is iterated until the total background suppression matches the required factor. The last criterion in the set is adapted such that the background requirement is exactly met, so the corresponding signal efficiency can be larger than 90%.
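The greedy procedure above can be sketched as follows. This is an illustrative reimplementation, not the PANDA code: the adaptation of the final criterion is omitted, and tie-breaking is left to iteration order.

```python
import numpy as np

def greedy_cuts(signal, background, target=1e-3, keep=0.90):
    """Iterative one-dimensional cut selection (cut-and-count sketch).
    signal, background: (N, n_obs) arrays of observables.
    Returns the chosen cuts and the achieved background fraction."""
    sig, bkg = np.asarray(signal), np.asarray(background)
    n0 = len(bkg)
    cuts = []
    while len(bkg) > 0 and len(bkg) / n0 > target:
        best = None
        for j in range(sig.shape[1]):
            # thresholds retaining `keep` of the remaining signal from
            # either side of the distribution -> 2n candidate cuts
            lo, hi = np.quantile(sig[:, j], [1.0 - keep, keep])
            for sel in ((sig[:, j] >= lo, bkg[:, j] >= lo, (j, ">=", lo)),
                        (sig[:, j] <= hi, bkg[:, j] <= hi, (j, "<=", hi))):
                frac = sel[1].mean()      # background survival of this cut
                if best is None or frac < best[0]:
                    best = (frac, *sel)
        frac, mask_sig, mask_bkg, cut = best
        if frac >= 1.0:                   # no cut reduces background further
            break
        cuts.append(cut)
        sig, bkg = sig[mask_sig], bkg[mask_bkg]
    return cuts, len(bkg) / n0
```

On toy data with one discriminating observable and one noise observable, the procedure repeatedly cuts on the discriminating one until the target suppression is reached.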
The complete trigger configuration is given by 30 sets of one-dimensional selection criteria, one for each trigger line at each accessible centre-of-mass energy.

Deep learning methods
In order to let a neural network decide whether an event belongs to one of the signal channels or to background, the data has to be presented in a sensible choice of observables. The same set of input variables (Table 2) used for the cut-and-count approach is used for direct comparison. This set of variables is then extended by additional event shape observables to determine the optimal performance achievable using all available quantities. The neural network forms its answer as a single number, or a set of numbers in the case of multi-class classification. Performing a selection cut on this output determines the rejection rate of background events as well as the efficiency of signal event acceptance.
The PyTorch framework [23] is chosen to provide the underlying functionality to build the neural networks, which are trained and evaluated with the prepared simulation data, roughly in a 1:1 ratio between the training and testing data sets.

Neural network optimisation
Neural network setups are quite diverse and mostly tailored to the problem at hand. For the Software Trigger, several choices are made based on the performance, especially the signal efficiency and background reduction. Individual networks are optimised by:
- Network depth [24]
- Width or convolutional kernel size of each layer
- Choice of the optimiser, learning rate and related items, e.g. momentum [25]
- Choice of the activation function [26,27,28,29,30,31]
- Weight decay generalisation [32]
- Weight initialisation [33].

Multi-class and binary approaches
The first question investigated is whether a single network serving all trigger lines simultaneously performs better than a separate network for each physics channel. Multi-class classification would mean fewer but bigger networks to be trained. Binary classification, in contrast, allows for easier adaptation and modification of the list of channels. The multi-class approach reduces the number of training experiments required for each trigger setup to find the optimal network size, at the cost of longer training times due to the increased complexity.
A set of Dense NNs [24] (DNN) is used to study this issue exemplarily for one setting at √s = 4.5 GeV by means of a Bayesian approach for the best triggering efficiency [34]. The results are shown in Table 4, where the individual trigger efficiencies serve as the figure of merit. Here, only signal candidates that match the Monte-Carlo truth are under investigation, eliminating combinatorial effects. In those channels where the triggering efficiency is not close to 100%, the binary classification outperforms the multi-class approach by up to a factor of two.
Therefore, we choose the binary classification approach with one neural network per trigger line and energy point.

NN type selection
In order to identify the optimal network architecture and meta-configuration, seven different types of networks are studied, listed in Table 5: a dense neural network (DNN), a convolutional neural network (CNN) [35], both with and without residual blocks [36,37,38], a CNN with bottleneck residual blocks [39], as well as 1D and 2D Long Short-Term Memory networks [40].
Some of the trigger lines perform reasonably well and show stable results under all kinds of network configurations due to particularities of the decay kinematics. For example, a J/ψ decaying into two leptons will leave two strongly correlated high-momentum tracks in the detector, which is significantly different from the average multi-pion background event and consequently always leads to high selection quality. For a meaningful optimisation, these "simple" cases are ignored, and the networks are optimised in depth and layer size for a high triggering efficiency on the three more challenging channels Etac, Dch and Lamc. For instance, the depth is optimised by obtaining the trigger performance as a function of the number of layers/blocks used, and ten runs are carried out for each NN framework of a certain depth. The performance is evaluated by the individual efficiencies as well as a combined efficiency (the "normalised trigger performance" ε̂), defined as the geometric mean of the efficiencies of the channels Etac, Dch and Lamc.
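The combined figure ε̂ is, in a minimal sketch (function name ours):

```python
import math

def normalised_performance(efficiencies):
    """Normalised trigger performance: the geometric mean of the
    individual efficiencies of the chosen channels (here Etac, Dch, Lamc)."""
    return math.prod(efficiencies) ** (1.0 / len(efficiencies))
```

Unlike the arithmetic mean, the geometric mean penalises a setup in which one of the three channels collapses to a very low efficiency.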
The absolute value of ε̂ is the primary measure of how well a network type performs. The standard deviation of the individual efficiencies from a series of training runs with varying random initialisation of the internal weights is considered a measure of the training stability. Furthermore, the networks are investigated for their robustness under changes of the network size, in terms of the number of layers or blocks, which may be required in a future scenario for more complex triggering setups.
In Figure 3, we present exemplarily four network types (DNN, CNN, CNNRes and LSTM1D) and their individual efficiencies for just the channel Dch. For each network size, ten networks have been trained and evaluated, providing a median and standard deviation from the individual results. The straightforward approach would be a DNN (Figure 3, top left); however, its triggering efficiency and training stability are inferior, as is visible from the large spread of the results. For a NN with increasing depth, a significant degradation of performance is in general expected because of shattered gradients [38]. This can be seen most clearly in the case of the CNN, with decreasing performance for a larger number of layers (Figure 3, top right). Using a CNN with residual blocks (Figure 3, bottom left) mitigates this behaviour. The LSTM model also looks stable but shows a worse generalisation than CNNRes with regard to the different physics channels (Figure 3, bottom right). For the selection of the best architecture, the combined efficiencies ε̂, shown in Figure 4, are compared. Here all considered network types are presented, containing the information of a total of ≈ 2200 network training and evaluation runs, three channels times ten networks per marker. We identify the CNNRes type network (orange) as optimal for the given purpose, based on its triggering efficiency as well as its training stability and robustness to varying size.

Chosen Network Architecture
The final choice to evaluate the performance of a neural network approach for the PANDA Software Trigger is the CNNRes, a convolutional neural network with four residual blocks. It is "deep" enough to ensure that the NN has enough capacity, while remaining relatively stable when more blocks are added, which may be considered within a production triggering environment, e.g. one featuring many more trigger lines.
In Figure 5, the details of the network are presented. In the first stage, the 127 input channels are mapped to an 11×11 matrix via a fully connected linear layer of width 121. Here, the network has the possibility to sort any linear combination of observables next to each other, which will enhance the performance of the image-recognition-type architecture that follows. Then two convolutional layers extend the dimensionality to 16×11×11, allowing for a detailed feature extraction performed by four residual blocks, i.e. pairs of convolutional layers with a residual path. Three convolutional and two linear layers then reduce the dimensions and perform the classification for the output. Between each layer, batch normalisation [41] is performed.
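A hedged PyTorch sketch of the described architecture follows. The stated elements (127 inputs, a 121-wide linear layer reshaped to 11×11, two convolutions to 16×11×11, four residual blocks of paired convolutions, three convolutional plus two linear reduction layers, batch normalisation between layers) come from the text; kernel sizes, intermediate channel counts and activation functions are our assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Pair of 3x3 convolutions with an identity shortcut."""
    def __init__(self, ch=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.body(x) + x)

class CNNRes(nn.Module):
    """Sketch: 127 observables -> 121-wide linear layer -> 11x11 'image'
    -> two convolutions to 16 channels -> four residual blocks ->
    convolutional/linear reduction -> binary output."""
    def __init__(self, n_in=127):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(n_in, 121),
                                   nn.BatchNorm1d(121), nn.ReLU())
        self.expand = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
            nn.Conv2d(8, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(),
        )
        self.blocks = nn.Sequential(*[ResBlock(16) for _ in range(4)])
        self.reduce = nn.Sequential(          # 11x11 -> 9x9 -> 7x7 -> 5x5
            nn.Conv2d(16, 8, 3), nn.BatchNorm2d(8), nn.ReLU(),
            nn.Conv2d(8, 4, 3), nn.BatchNorm2d(4), nn.ReLU(),
            nn.Conv2d(4, 2, 3), nn.BatchNorm2d(2), nn.ReLU(),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(2 * 5 * 5, 16),
                                  nn.ReLU(), nn.Linear(16, 1))

    def forward(self, x):
        x = self.embed(x).view(-1, 1, 11, 11)  # sortable observable 'image'
        x = self.reduce(self.blocks(self.expand(x)))
        return torch.sigmoid(self.head(x)).squeeze(1)
```

A forward pass on a batch of 127-dimensional feature vectors yields one signal probability per event.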

Individual trigger performance
The performance of the neural network approach is measured by the individual trigger efficiency, determined per channel with only the corresponding single trigger line active. The cuts on the network output are tuned to achieve a background suppression of 1/1000 in total, with an equal fraction of background contributed by each trigger channel at a given centre-of-mass energy. For example, at the energy of √s = 4.5 GeV, with nine of the ten channels being energetically accessible, the targeted background suppression per channel is 1/9000. Another commonly established measure of the network capabilities are the ROC curves and, more condensed, the corresponding AUCs. Both the individual triggering efficiencies and the network AUCs are summarised in Table 6.
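Tuning the output cut to a per-line background budget amounts to taking a quantile of the background score distribution. This is a sketch (function name ours) and assumes larger network outputs are more signal-like:

```python
import numpy as np

def tune_threshold(background_scores, suppression=1.0 / 9000):
    """Cut on the network output such that only the requested fraction of
    background events lies above it; the per-line target quoted above
    at sqrt(s) = 4.5 GeV with nine accessible channels is 1/9000."""
    return np.quantile(background_scores, 1.0 - suppression)
```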
In all cases, triggering with the aid of the neural networks improves the efficiencies compared to the cut-and-count approach. This improvement is of course expected, mainly because possible correlations between observables are exploited by the new algorithm. Figure 6 shows the relative efficiency gain, where the bars in light colours represent the increase based on the same set of input variables, and the dark coloured bars the performance gain using the extended set of variables. It reaches almost up to 200% for some of the channels, corresponding to about a factor of three in performance. The actual benefit strongly depends on the channel in question. For example, the channel ee already has a very good triggering efficiency from the cut-and-count approach, leaving no room for improvement. In the case of the two J/ψ channels J2e and J2mu, the triggering efficiency drops with increasing beam momentum (see Table 6), which could be almost completely recovered by the new approach. For the open charm, Etac and Lamc channels, the triggering efficiency roughly doubled.
Adding more observables describing the overall event topology leads in many cases to a substantial gain in triggering efficiency. Hence, in a future software trigger setup it would be desirable to provide these event shape observables, even if they are computationally expensive in an online environment. Based on the AUC values, all networks show good (0.8 - 0.9) to excellent (> 0.9) classification performance.

Simultaneous trigger performance
In a realistic setup, all active trigger lines are able to trigger events simultaneously. Therefore, it can happen that a signal event of a certain type missed by its dedicated trigger line is accidentally tagged by another one. This cross-tagging by simultaneous triggering has the potential to further improve the triggering efficiency, while at the same time the background level stays within the requirements.
Figure 7 quantifies this effect in the current scenario, showing the increase in the triggering efficiencies, also presented in Table 6, which are calculated as the fraction of triggered events among events that passed the reconstruction in the trigger line designed for the channel. The light, medium and dark coloured bars correspond to the conventional cut-and-count benchmark algorithm, the neural network approach, and the neural network approach based on the extended variable set, respectively. In some cases this adds a substantial increase in triggering efficiency. For example, the channel Etac at 5.5 GeV gains a factor of two in total triggering efficiency through cross-tagging in this particular setup. It is important to note that these "accidental" gains depend strongly on the actual composition of trigger lines, the centre-of-mass energy, and the actual trained networks.
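Cross-tagging amounts to a logical OR over the active trigger lines. A minimal sketch, with hypothetical line names and boolean accept masks over the events of one signal channel:

```python
import numpy as np

def efficiencies_with_cross_tagging(accept_masks, dedicated):
    """Individual vs. total triggering efficiency for one signal channel.
    accept_masks: dict mapping trigger-line name to a boolean accept array
    over that channel's reconstructed events (names/shapes illustrative).
    An event is triggered if ANY active line accepts it."""
    individual = accept_masks[dedicated].mean()
    total = np.stack(list(accept_masks.values())).any(axis=0).mean()
    return individual, total   # total >= individual by construction
```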

Computing performance
The bulk of the training effort goes into finding the optimal solution for each network architecture, which is time consuming and requires intensive usage of GPU and CPU resources. A dedicated server with one Intel® Xeon® W-2135 CPU at 3.70 GHz and one NVIDIA® 3090 GPU is used, equivalent in performance to 179 cores of the AMD® EPYC® 7551 32-core processors on the computing cluster at GSI. Processing times are on the order of two days for a single trigger line.

Fig. 6: Individual triggering efficiency gains of the neural networks compared to the conventional approach, based on the same set of input observables (light colours), and on the extended set of observables (dark colours).

Fig. 7: Cross-tagging efficiency gain as an effect of simultaneous triggering of multiple reactions, for the conventional benchmark approach (light colours), the NN approach with the same set of observables (medium colours), and the NN approach with the extended set of observables (dark colours).

Data quality and choice of observables
The input features and observables have varying impact on the trigger decision. Some features are correlated, and some are not relevant for the classification problem at hand, which differs between the channels and even between the data sets of the same channel at different beam energies. Ranking the observable importance can guide the choice of observables and thus may reduce the required computations in the preceding online processing.
Furthermore, it is important to maintain a certain quality of the data during the selection process, which in turn means that the underlying physics constrains the choice of observables. The Dch channel (D+ → K− π+ π+), for example, shows a significant enhancement in the invariant mass distribution of background events around the signal peak position (blue histogram in Figure 8, left) when using both the energy of the D+ candidate and the invariant mass of the recoil system as input for the NN. This effect originates from the specific kinematics of the two-body reaction pp → D+ D− considered here. However, the goal is to trigger on D+ particles from many different reactions inclusively. Dropping these observables from the NN input removes the undesirable peaking background (Figure 8, right) at the cost of some signal efficiency. We then observe a smooth distribution (blue histogram) in the signal region, similar to the rejected background (black histogram).
When analysing complex decay patterns and mixtures of resonances, it is important to cover the phase space without steep drops or holes in the triggering efficiency distributions introduced by the triggering algorithm. Complex fitting algorithms, such as a partial-wave analysis [42], benefit most from a flat triggering efficiency distribution, but they require at least a smooth dependence on any kinematic observable.
For example, a three-body decay such as the channel η_c → K_s K− π+ can be studied in the Dalitz plot [43] representation. We find that the relative efficiency of the neural network trigger introduces a cut-off in one corner of the Dalitz plot (Figure 9, left) when using the observables of the cut-and-count approach. It could be that the network identified this corner as particularly occupied by background. Introducing the event shape observables (Table 2) as additional input, the training is able to produce a significantly flatter triggering efficiency (Figure 9, right), which is beneficial for a later Dalitz plot analysis.
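The flatness check can be sketched as a binned relative-efficiency map in Dalitz coordinates; the binning and normalisation below are our illustrative choices.

```python
import numpy as np

def relative_efficiency_map(x, y, accepted, bins=10):
    """Trigger-efficiency map in Dalitz coordinates (x, y = squared invariant
    masses per event), divided by the overall efficiency so that a value of
    1.0 everywhere corresponds to a perfectly flat trigger."""
    total, xe, ye = np.histogram2d(x, y, bins=bins)
    passed, _, _ = np.histogram2d(x[accepted], y[accepted], bins=[xe, ye])
    with np.errstate(invalid="ignore", divide="ignore"):
        eff = passed / total              # NaN where a bin is empty
    return eff / accepted.mean()
```

For a trigger that accepts events uniformly at random, the map is consistent with 1.0 in every populated bin, which serves as a simple self-test.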

Summary and outlook
Based on a set of ten channels with typical event topologies within the reach of PANDA physics, investigated at four antiproton beam energies, it is demonstrated that PANDA will greatly benefit from a neural network supported software trigger system.
One result is that binary classification networks outperform a multi-class classification approach. As an additional advantage, binary classification makes it much easier to extend the system with a larger set of simultaneous trigger lines and improves the computational scalability. Of the seven network architectures, the convolutional network with four residual blocks showed the best results for the three channels with the lowest triggering efficiency, while producing stable results under different training attempts and size changes.
In all cases the network approach performs better than the cut-and-count method, and it improves further when more observables are added. In the comparison, the triggering efficiencies show gains of up to 200%, and all networks show good to excellent classification performance.
Data quality is an important topic which deserves to be studied in more detail. It would be desirable to feed back information, e.g. on the flatness of the triggering efficiency in Dalitz plot coordinates, to the neural networks during the training process. The same holds for the background flatness in critical distributions, such as the candidate invariant mass, in order to avoid peaking structures.
For the study presented here, the background suppression requirement is set equally among the physics channels. However, this approach neglects the differences in reconstruction and the varying background levels of the different channels. Finding an approach that achieves the best background suppression requirement for each trigger line, while maintaining the desired overall background reduction, is a complex challenge: enhancing the triggering efficiency of one channel will reduce the triggering efficiency of another, which has to be carefully balanced in the future experiment.

Fig. 1: Overview of the FAIR facility (top). The PANDA experimental setup (bottom) with the initial detectors (in black) and staged upgrades (in red) [3] (and references therein).

Fig. 3: Individual trigger efficiencies (blue dots) for the channel Dch at 5.5 GeV/c for four of the seven NN types with varying depth parameters, as well as the median with standard deviation (red markers and bars).

Fig. 4: Normalised triggering performance ε̂ as a function of the number of layers/blocks used in the seven network types, combined for the three channels Etac, Dch and Lamc at 5.5 GeV/c.

Fig. 5: Schematic view of the chosen network architecture.

Fig. 8: Candidate mass distributions of accepted and rejected signal (red and green, respectively) and background (blue and black, respectively) for the channel Dch at 4.5 GeV/c. Left: all kinematic input observables; right: without the energy of the candidate and the invariant mass of the recoiling system. The histograms are normalised to the same integral.

Fig. 9: Trigger efficiency deviation from the mean value in Dalitz coordinates for the channel Etac at 3.8 GeV/c, without (left) and with (right) the additional observables.

Table 2: Summarised list of input observables.
Event shape:
- Aplanarity of the event
- Sphericity of the event
- Magnitude of the event thrust vector
- Reduced Fox-Wolfram moments 1-4

Table 3: Numbers of simulated events of each physics channel at each energy, given in millions of events. The dashes mark cases where the reaction is energetically not possible.

Table 4: Comparison of the individual trigger efficiencies of truth-matched events between multi-class and binary classification based on a DNN.

Table 5: List and description of the NNs investigated for the PANDA Software Trigger in this note.

Table 6: Collection of individual trigger efficiency results for a target 1/1000 background suppression, the AUC values of the ROC curves, as well as the simultaneous trigger efficiency gains for all accessible channels.