Characteristics of Graphics Processing Units
Developed for the graphics processing pipeline, GPUs excel at data-parallel tasks under the SIMT paradigm [13]. An algorithm executed on the GPU is called a kernel. Every kernel is launched with many threads on the GPU, all executing the same instructions on different parts of the data in parallel and independently of one another. These threads are grouped into blocks within a grid, as illustrated in Fig. 2. Threads within one block share a common memory and can be synchronized, while threads from different blocks cannot communicate directly. The threads are mapped onto the thousands of cores available on modern GPUs for processing.
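The following minimal kernel (hypothetical names, not Allen code) illustrates this execution model: a grid of blocks is launched, and each thread derives the data element it works on from its block and thread indices.

```cuda
#include <cuda_runtime.h>

// Each thread scales one element of the input array: all threads execute the
// same instruction stream, but on different data (SIMT).
__global__ void scale_kernel(const float* input, float* output, float factor, int n)
{
  const int i = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
  if (i < n) output[i] = factor * input[i];
}

int main()
{
  const int n = 1 << 20;
  float *d_in, *d_out;
  cudaMalloc(&d_in, n * sizeof(float));
  cudaMalloc(&d_out, n * sizeof(float));

  // Launch configuration: a grid of blocks, each containing 256 threads.
  const int threads_per_block = 256;
  const int blocks = (n + threads_per_block - 1) / threads_per_block;
  scale_kernel<<<blocks, threads_per_block>>>(d_in, d_out, 2.f, n);
  cudaDeviceSynchronize();

  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}
```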
Typically, a GPU is connected to its CPU host server via a PCIe connection, which sets a limit on the bandwidth between the GPU and the CPU: 16 lanes of PCIe 3.0 and PCIe 4.0 provide 128 Gbit/s and 256 Gbit/s, respectively. From these parameters we conclude that 500 GPUs are able to consume the 40 Tbit/s data rate of the upgraded LHCb detector. The total memory of a modern GPU is on the order of hundreds of Gbit (tens of GB). Consequently, 500 GPUs should also be able to process the full HLT1 sequence, provided that enough data processing tasks fit into GPU memory at the same time and that the tasks can be parallelized sufficiently to exploit the teraflops theoretically available on the GPU.
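Using round numbers, the budget behind this statement can be checked explicitly:

\[
\frac{40\ \mathrm{Tbit/s}}{500\ \mathrm{GPUs}} \;=\; 80\ \mathrm{Gbit/s} \;=\; 10\ \mathrm{GB/s}\ \text{per GPU} \;<\; 128\ \mathrm{Gbit/s}\ \text{(16 lanes of PCIe 3.0)},
\]

so each GPU must absorb about 10 GB/s, well within a single PCIe 3.0 x16 link, while a batch of several thousand events occupies only a small fraction of the tens of GB of memory available on a modern GPU.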
The Allen Concept
In our proposal, a farm of GPUs processes the full data stream, as shown in Fig. 3, which can be compared to the baseline x86-only architecture of Fig. 1. Every GPU receives complete events from an event building unit and handles several thousand events at once. The raw detector data are copied to the GPU, the full HLT1 sequence is processed there, and only the selection decisions and the objects used for the selections, such as tracks and primary vertices, are copied back to the CPU (a minimal sketch of this data flow is given after the list below). This approach is motivated by the following considerations:
- LHCb raw events have an average size of 100 kB. When copying raw data to the GPU, the PCIe connection between the CPU and the GPU poses no limitation to the system, even when several thousand events are processed in parallel.
- Since single events are rather small, several thousand events are required to make full use of the compute power of modern GPUs.
- As the full algorithm sequence is processed on the GPU, no copies between the CPU and the GPU are required, apart from the raw input and selection output, and quantities needed to define the grid sizes of individual kernels.
- Inter-GPU communication is not required because events are independent from one another and small enough in memory footprint to be processed on a single GPU.
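The host-side sketch below illustrates this data flow; the kernel, buffer names and launch configuration are placeholders and not the actual Allen sequence, which is driven by its own scheduler and memory manager.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Placeholder for the Allen sequence: in reality a scheduler runs a
// configurable chain of decoding, reconstruction and selection kernels.
__global__ void run_hlt1_sequence(const char* raw_events, const unsigned* offsets,
                                  unsigned n_events, char* decisions)
{
  (void) raw_events; // unused in this placeholder
  const unsigned event = blockIdx.x;
  if (event < n_events) decisions[event] = (offsets[event + 1] > offsets[event]) ? 1 : 0;
}

void process_batch(const std::vector<char>& raw, const std::vector<unsigned>& offsets,
                   std::vector<char>& decisions)
{
  const unsigned n_events = offsets.size() - 1;
  char* d_raw;
  unsigned* d_offsets;
  char* d_decisions;
  cudaMalloc(&d_raw, raw.size());
  cudaMalloc(&d_offsets, offsets.size() * sizeof(unsigned));
  cudaMalloc(&d_decisions, n_events);

  // 1. Copy the raw detector banks of a few thousand events to the GPU.
  cudaMemcpy(d_raw, raw.data(), raw.size(), cudaMemcpyHostToDevice);
  cudaMemcpy(d_offsets, offsets.data(), offsets.size() * sizeof(unsigned),
             cudaMemcpyHostToDevice);

  // 2. Run the full HLT1 sequence on the GPU (here a single placeholder kernel).
  run_hlt1_sequence<<<n_events, 1>>>(d_raw, d_offsets, n_events, d_decisions);

  // 3. Copy back only the selection decisions (and, in Allen, the selected objects).
  decisions.resize(n_events);
  cudaMemcpy(decisions.data(), d_decisions, n_events, cudaMemcpyDeviceToHost);

  cudaFree(d_raw);
  cudaFree(d_offsets);
  cudaFree(d_decisions);
}
```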
The project is implemented in CUDA, Nvidia’s API for programming its GPUs [14]. Allen includes a custom scheduler and GPU memory manager, which will be described in a companion publication.
Main Algorithms of the First Trigger Stage
A schematic of the upgraded LHCb forward spectrometer is shown in Fig. 4. The information from the tracking detectors and the muon system is required for HLT1 decisions, as described in Sect. 1. The tracking system consists of the vertex detector (Velo) [15] and the upstream tracker (UT) [16] before the magnet, and tracking stations behind the magnet made of scintillating fibres (SciFi) [16]. The measurements from the muon detector are used to perform muon identification. The LHCb coordinate system is such that z is along the beamline, y vertical and x horizontal. The dipole magnet bends charged-particle trajectories along x. Figure 4 indicates the magnitude of the y-component of the magnetic field, which extends into the UT and SciFi regions. As a consequence, tracks in the Velo detector form straight lines, while those in the UT and SciFi detectors are slightly bent.
The following recurrent tasks are performed at various stages of the HLT1 sequence:
- Decoding the raw input into coordinates in the LHCb global coordinate system.
- Clustering of measurements caused by the passage of the same particle into single coordinates (“hits”), depending on the detector type.
- Finding combinations of hits originating from the same particle trajectory (pattern recognition).
- Describing the track candidates from the pattern recognition step with a track model (track fitting).
- Reconstructing primary and secondary vertices from the fitted tracks (vertex finding).
Figure 5 shows the full HLT1 sequence. In most cases, a single event is assigned to one block, while intra-event parallelism is mapped to the threads within that block. This ensures that communication is possible among threads processing the same event. Typically, the raw input is segmented by readout unit (for example a module of the vertex detector), so the decoding can naturally be parallelized over the readout units. During the pattern recognition step, many combinations of hits are tested, and these combinations are processed in parallel. The track fit is applied to every track and is therefore parallelizable across tracks. Similarly, extrapolating tracks from one subdetector to the next is executed in parallel for all tracks. Finally, combinations of tracks are built when finding vertices, and these can also be treated in parallel.
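A schematic kernel following this mapping (a hypothetical decoding task; the data layout and names are illustrative, not Allen's) assigns one event per block and lets the threads of the block parallelize over the readout units of that event:

```cuda
// One block per event, threads parallelize over readout units within the event.
// "raw" holds the concatenated banks of all events; "bank_offsets" is a flat
// array of n_events * n_banks_per_event + 1 entries marking where each bank starts.
__global__ void decode_kernel(const unsigned* bank_offsets, unsigned n_banks_per_event,
                              const char* raw, float* hits_x)
{
  const unsigned event = blockIdx.x;
  // Strided loop: works for any number of banks, not just blockDim.x of them.
  for (unsigned bank = threadIdx.x; bank < n_banks_per_event; bank += blockDim.x) {
    const unsigned begin = bank_offsets[event * n_banks_per_event + bank];
    const unsigned end = bank_offsets[event * n_banks_per_event + bank + 1];
    for (unsigned i = begin; i < end; ++i) {
      hits_x[i] = static_cast<float>(raw[i]); // placeholder for real decoding
    }
  }
  __syncthreads(); // threads of the same event can synchronize before the next step
}
```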
Initially, events are preselected by a Global Event Cut (GEC) based on the size of the UT and SciFi raw data, removing the 10% busiest events. This selection is not essential for the viability of the proposed GPU architecture. It is also performed in the baseline x86 processing [7], because the reconstruction of very busy events is less efficient and their additional physics value to LHCb is not proportionate to the computing cost of reconstructing them. The subsequent elements of the HLT1 sequence are now described in turn.
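In practice, the GEC is a per-event threshold on the summed UT and SciFi raw bank sizes; a minimal sketch, assuming a threshold value tuned offline so that roughly the busiest 10% of events are rejected, is:

```cuda
// One thread per event: flag events whose UT + SciFi raw payload exceeds a
// threshold chosen offline such that about 10% of events are rejected.
__global__ void global_event_cut(const unsigned* ut_sizes, const unsigned* scifi_sizes,
                                 unsigned n_events, unsigned max_size, bool* keep_event)
{
  const unsigned event = blockIdx.x * blockDim.x + threadIdx.x;
  if (event < n_events) {
    keep_event[event] = (ut_sizes[event] + scifi_sizes[event]) < max_size;
  }
}
```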
Velo Detector
The Velo detector consists of 26 planes of silicon pixel sensors placed around the interaction region. Its main purpose is to reconstruct the pp collision points (primary vertices or PVs) and to provide seed tracks to be further propagated through the other LHCb detectors. The Velo track reconstruction is fully described in an earlier publication [17] and is summarized here for convenience.
The reconstruction begins by grouping measurements caused by the passage of a particle within each silicon plane into clusters, an example of a more general process known as connected component labeling. Allen uses a clustering algorithm employing bit masks, which searches for clusters locally in small regions. Every region can be treated independently, allowing for parallel processing.
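The sketch below illustrates the bit-mask idea on an 8×8 pixel region packed into a 64-bit word (an illustrative region size and encoding, not the actual Allen implementation): starting from a seed pixel, the cluster grows by repeatedly OR-ing in the neighbours of the pixels already collected, and each region can be processed by an independent thread.

```cuda
#include <cstdint>

// Neighbour mask of all pixels set in m, for an 8x8 region stored row-major
// (bit index = row * 8 + column). Column masks prevent wrap-around between rows.
__host__ __device__ inline uint64_t neighbours(uint64_t m)
{
  const uint64_t not_col0 = 0xFEFEFEFEFEFEFEFEULL; // clears column 0 after << 1
  const uint64_t not_col7 = 0x7F7F7F7F7F7F7F7FULL; // clears column 7 after >> 1
  const uint64_t h = ((m << 1) & not_col0) | ((m >> 1) & not_col7); // left/right
  const uint64_t v = (m << 8) | (m >> 8);                           // up/down
  const uint64_t d = (h << 8) | (h >> 8);                           // diagonals
  return h | v | d;
}

// Starting from a seed pixel, collect all pixels of "region" connected to it.
__host__ __device__ inline uint64_t grow_cluster(uint64_t region, uint64_t seed)
{
  uint64_t cluster = seed & region;
  uint64_t previous = 0;
  while (cluster != previous) {
    previous = cluster;
    cluster |= neighbours(cluster) & region;
  }
  return cluster;
}
```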
Straight-line tracks are reconstructed by first forming seeds of three hits from consecutive layers (“triplets”), and then extending these to the other layers in parallel. We exploit the fact that prompt particles produced in pp collisions traverse the detector in lines of constant \(\phi\) angle (within a cylindrical coordinate system where the cylinder axis coincides with the LHC beamline) and sort hits on every layer by \(\phi\) for fast look-up when combining hits to tracks.
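The payoff of the \(\phi\)-sorted hit arrays is that the candidate hits on a neighbouring layer can be located with a binary search instead of a scan over the whole layer; a simplified sketch (hypothetical data layout, wrap-around at \(\pm\pi\) ignored for brevity) is:

```cuda
// Hits of one Velo layer, sorted by phi. Returns the index of the first hit
// with phi >= value (lower bound).
__host__ __device__ unsigned lower_bound_phi(const float* phi, unsigned n, float value)
{
  unsigned low = 0, high = n;
  while (low < high) {
    const unsigned mid = (low + high) / 2;
    if (phi[mid] < value) low = mid + 1;
    else high = mid;
  }
  return low;
}

// Contiguous window [first, last) of hits compatible with an extrapolated
// track at phi "expected", within a tolerance "window".
__host__ __device__ void phi_window(const float* phi, unsigned n, float expected,
                                    float window, unsigned& first, unsigned& last)
{
  first = lower_bound_phi(phi, n, expected - window);
  last = lower_bound_phi(phi, n, expected + window);
}
```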
Velo tracks are fitted with a simple Kalman filter [18] assuming that the x- and y-components are independent from one another and assigning a constant average transverse momentum of 400 MeV to all tracks for the noise contribution from multiple scattering.
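A minimal sketch of one predict-and-update step of such a filter for the x-projection (the y-projection is treated identically) is shown below; in Allen the multiple-scattering noise is derived from the assumed 400 MeV transverse momentum, whereas here it enters simply as a parameter.

```cuda
struct KalmanState {
  float x, tx;            // position and slope at the current z
  float cxx, cxtx, ctxtx; // symmetric 2x2 covariance matrix
  float z;
};

// Propagate the straight-line state to the z of the next measurement and
// update it with the measured x position. "noise" is the multiple-scattering
// contribution added to the slope variance.
__host__ __device__ void kalman_step(KalmanState& s, float z_meas, float x_meas,
                                     float meas_variance, float noise)
{
  // Prediction: x' = x + tx * dz, covariance transported accordingly.
  const float dz = z_meas - s.z;
  const float px = s.x + s.tx * dz;
  const float pcxx = s.cxx + 2.f * dz * s.cxtx + dz * dz * s.ctxtx;
  const float pcxtx = s.cxtx + dz * s.ctxtx;
  const float pctxtx = s.ctxtx + noise;

  // Update: Kalman gain from the predicted covariance and the measurement variance.
  const float r = x_meas - px;          // residual
  const float R = meas_variance + pcxx; // residual covariance
  const float kx = pcxx / R;
  const float ktx = pcxtx / R;
  s.x = px + kx * r;
  s.tx = s.tx + ktx * r;
  s.cxx = (1.f - kx) * pcxx;
  s.cxtx = (1.f - kx) * pcxtx;
  s.ctxtx = pctxtx - ktx * pcxtx;
  s.z = z_meas;
}
```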
Finally, we search for PVs in a histogram of the point of closest approach of tracks to the beamline, where a cluster indicates a PV candidate. We refrain from a one-to-one mapping between a track and a vertex, which would introduce dependencies between the fitting of individual vertex candidates and would require sequential processing. Instead, every track is assigned to every vertex based on a weight, so that all candidates can be fitted in parallel.
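The histogramming step can be sketched as follows (hypothetical binning and data layout): one thread per track computes the z position at which the track passes closest to the beamline and increments the corresponding bin atomically; peaks in the resulting histogram seed the PV candidates.

```cuda
// One thread per track: histogram the z of the point of closest approach to
// the beamline (taken here as the z-axis, x = y = 0).
__global__ void fill_pv_histogram(const float* x0, const float* y0, const float* tx,
                                  const float* ty, unsigned n_tracks, unsigned* histo,
                                  unsigned n_bins, float z_min, float bin_width)
{
  const unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n_tracks) return;
  // Straight line (x0 + tx*z, y0 + ty*z): minimize the distance to the z-axis.
  const float denom = tx[i] * tx[i] + ty[i] * ty[i];
  const float z_closest = (denom > 0.f) ? -(x0[i] * tx[i] + y0[i] * ty[i]) / denom : 0.f;
  const int bin = static_cast<int>((z_closest - z_min) / bin_width);
  if (bin >= 0 && bin < static_cast<int>(n_bins)) atomicAdd(&histo[bin], 1u);
}
```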
UT Detector
The UT detector is made of four layers of silicon strip detectors: the strips of the two outer layers are aligned vertically, while those of the two inner layers are tilted by \(+5^{\circ }\) and \(-5^{\circ }\) around the z-axis, respectively. Since more than 75% of the hits consist of only one fired strip, no clustering is performed in this subdetector. The UT hits are decoded into regions based on their x-coordinate. Every region is then sorted by the y-coordinate. This allows for a fast look-up of hits around the position of an extrapolated Velo track. Velo tracks are extrapolated to the UT detector assuming a minimum momentum of 3 GeV, which limits the maximum bending allowed between the Velo and UT detectors. There is no requirement on the transverse momentum. Subsequently, UT hits are assigned to Velo tracks and the track momentum is determined from the bending between the Velo and UT fitted straight-line track segments, with a resolution of about 20%. The UT decoding and tracking algorithms are described in more detail in Ref. [19].
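Schematically, for small slopes the momentum follows from the change in the bending-plane slope between the two straight-line segments and the field integral traversed between them,

\[
p \;\approx\; \frac{0.3\,|q| \int B_{y}\,\mathrm{d}z}{\left| t_{x}^{\mathrm{UT}} - t_{x}^{\mathrm{Velo}} \right|},
\]

with p in GeV, B in T and z in m (a first-order sketch of the "\(p_{\mathrm{T}}\) kick" method; the exact relation used in Allen may differ). The 3 GeV cut-off therefore translates directly into a maximum allowed slope difference and hence into the size of the extrapolation window.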
SciFi Detector
The SciFi detector consists of three stations with four layers of scintillating fibres each, where the four layers of every station are in x–u–v–x configuration. The u- and v-layers are tilted by \(+5^{\circ }\) and \(-5^{\circ }\), respectively, while the x-layers are vertical. The clustering of the SciFi hits and sorting along x is performed on the readout board; therefore, sorted clusters are obtained directly when decoding.
Tracks passing through both the Velo and UT detectors are extrapolated to the SciFi detector using a parameterization based on the track direction and the momentum estimate obtained after the UT tracking. This avoids loading the large magnetic field map into GPU memory. A search window defined by the UT track properties and a maximum number of allowed hits is determined for every UT track and every SciFi layer.
The hit efficiency of the scintillating fibres is 98–99%; therefore, several seeds are allowed per UT track, so that the track reconstruction efficiency is not limited by requiring hits from specific layers. Seeds are formed by combining triplets of hits, one from within the search window of an x-layer in each of the three SciFi stations. The curvature of tracks inside the SciFi region due to the residual magnetic field tails from the LHCb dipole is taken into account when selecting the best seeds. Only the seeds with the lowest \(\chi ^2\) relative to a parameterized description of the track within the SciFi volume are then extended by adding hits from the remaining x-layers, using the same track description. Since only the information of three hits is used for the \(\chi ^2\), its discriminating power is limited. Therefore, multiple track seeds are processed per UT track.
The magnetic field inside the SciFi detector can be expressed as \(B_{y}(z) = B_{0}+B_{1}\cdot z\), and it is found that, to first order, \(\frac{B_{1}}{B_{0}}\) is a constant. Using this parameterization, tracks are projected onto the remaining x- and u/v-layers, and the hits that deviate least from the reference trajectory, within a track-dependent acceptance, are added. Only the u/v-layers provide information on the track motion in the y–z plane. Thus, once all hits have been added, the small curvature in the y–z plane is also accounted for in the track model.
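One way to see the consequence of this field model (a sketch; the exact constants are part of the Allen parameterization): for small slopes the equation of motion in the bending plane gives

\[
\frac{\mathrm{d}^{2}x}{\mathrm{d}z^{2}} \;\propto\; \frac{B_{0}+B_{1}z}{p}
\quad\Longrightarrow\quad
x(z) \;=\; a + b\,z + c\,z^{2}\left(1 + d\,z\right), \qquad d \;\propto\; \frac{B_{1}}{B_{0}},
\]

i.e. integrating twice yields a cubic trajectory whose cubic-to-quadratic ratio is fixed by the constant \(B_{1}/B_{0}\), so a single additional parameter describes the residual curvature of all tracks in the SciFi region.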
Finally, a least-squares fit is performed both in x and in y. Every track is assigned a weight based on the normalized x-fit \(\chi ^2\), y-fit \(\chi ^2\), and the number of hits on the track. Only the best track is accepted per UT track, reducing the number of fake tracks as much as possible.
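Such a weighted straight-line least-squares fit can be sketched as follows (the actual Allen fit uses the full SciFi track model described above; the names and interface are illustrative):

```cuda
// Weighted straight-line fit x(z) = x0 + tx * z from n hits.
// Returns false if the system is degenerate. chi2 is filled with the sum of
// weighted squared residuals, which enters the per-track quality weight.
__host__ __device__ bool fit_line(const float* z, const float* x, const float* w,
                                  unsigned n, float& x0, float& tx, float& chi2)
{
  float s = 0.f, sz = 0.f, szz = 0.f, sx = 0.f, sxz = 0.f;
  for (unsigned i = 0; i < n; ++i) {
    s += w[i];
    sz += w[i] * z[i];
    szz += w[i] * z[i] * z[i];
    sx += w[i] * x[i];
    sxz += w[i] * x[i] * z[i];
  }
  const float det = s * szz - sz * sz; // determinant of the normal equations
  if (det == 0.f) return false;
  tx = (s * sxz - sz * sx) / det;
  x0 = (szz * sx - sz * sxz) / det;
  chi2 = 0.f;
  for (unsigned i = 0; i < n; ++i) {
    const float r = x[i] - (x0 + tx * z[i]);
    chi2 += w[i] * r * r;
  }
  return true;
}
```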
Muon Detector
The muon system [20] consists of four stations of multiwire proportional chambers interleaved with iron walls. Every station is divided into four regions with chambers of different granularity. Hits are read out with pads and strips, and strips from the same station can overlap to give a more accurate position measurement. During the decoding of the muon measurements, such crossing strips are combined into a single hit. For muon identification, the “isMuon” algorithm described in Ref. [21] is employed: tracks are extrapolated from the SciFi detector to the muon stations and muon hits are matched to a track within a region defined by the track properties. Depending on the track momentum, hits in different numbers of stations are required for a track to be tagged as a muon.
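The momentum-dependent station requirement can be summarized as in the following sketch, where the station flags indicate whether a compatible hit was found inside the track's search region in that station; the momentum thresholds follow the criterion of Ref. [21].

```cuda
// Simplified isMuon decision: m2..m5 indicate whether a muon hit compatible
// with the extrapolated track was found in stations M2..M5 (p in MeV).
__host__ __device__ bool is_muon(float p, bool m2, bool m3, bool m4, bool m5)
{
  if (p < 3000.f) return false;                   // below threshold: not identified
  if (p < 6000.f) return m2 && m3;                // 3-6 GeV
  if (p < 10000.f) return m2 && m3 && (m4 || m5); // 6-10 GeV
  return m2 && m3 && m4 && m5;                    // above 10 GeV
}
```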
Kalman Filter
A Kalman filter is applied to all tracks to improve the impact parameter resolution, where the impact parameter (IP) is the distance of closest approach between a track and a PV. The nominal LHCb Kalman filter uses a Runge-Kutta extrapolator to propagate track states between measurements and a detailed detector description to determine the noise due to multiple scattering. In order to increase throughput and limit memory overhead, these costly calculations are replaced with parameterizations. Two versions of the parameterized Kalman filter are implemented in Allen: one which takes into account the whole detector, and one which fits only the Velo track segment but uses the momentum estimated from the full track passing through the Velo, UT and SciFi detectors. Since the impact parameter is mainly determined by the measurements closest to the interaction region, the Velo-only Kalman filter is used in the HLT1 sequence. This results in a significant computing speedup compared to applying the full Kalman filter.
Selections
Given the momentum, impact parameter and position information from the track fit, as well as the muon identification, selections are applied to single tracks and two-track vertices, similarly to the HLT1 selections used in Run 2 of LHCb [22,23,24]. Secondary vertices are fitted in parallel from combinations of two tracks each, providing a momentum and mass estimate for the hypothetical decaying particle; the pion mass hypothesis is assigned to all tracks except those identified as muons, which are assigned the muon mass. The following five selection algorithms, which cover the majority of the LHCb physics programme and which are similar to lines accounting for about 95% of the HLT1 trigger rate in Run 2 [22], are implemented in Allen (a sketch of the corresponding requirements is given after the list):
- 1-Track: A single displaced track with \(p_{\mathrm{T}} > 1\) GeV.
- 2-Track: A two-track vertex with significant displacement and \(p_{\mathrm{T}} > 700\) MeV for both tracks.
- High-\(p_{\mathrm{T}}\) muon: A single muon with \(p_{\mathrm{T}} > 10\) GeV for electroweak physics.
- Displaced dimuon: A displaced dimuon vertex with \(p_{\mathrm{T}} > 500\) MeV for both tracks.
- High-mass dimuon: A dimuon vertex with mass near or larger than the \(J/\varPsi\) mass and \(p_{\mathrm{T}} > 750\) MeV for both tracks.
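These selections can be expressed as simple per-candidate predicates, as in the following sketch; the \(p_{\mathrm{T}}\) and momentum thresholds are those quoted in the list, while the displacement, vertex-quality and mass cuts are schematic placeholders rather than the values used in Allen.

```cuda
struct TrackCandidate {
  float pt;      // transverse momentum [MeV]
  float ip_chi2; // impact parameter significance w.r.t. the best PV
  bool is_muon;  // isMuon decision
};

struct TwoTrackVertex {
  TrackCandidate t1, t2;
  float mass;    // invariant mass [MeV] under the assigned mass hypotheses
  float fd_chi2; // flight-distance significance of the vertex (displacement)
};

// Placeholder displacement cuts (ip_chi2, fd_chi2) and mass cut are illustrative.
__host__ __device__ bool one_track_line(const TrackCandidate& t)
{
  return t.pt > 1000.f && t.ip_chi2 > 16.f;
}

__host__ __device__ bool two_track_line(const TwoTrackVertex& v)
{
  return v.t1.pt > 700.f && v.t2.pt > 700.f && v.fd_chi2 > 25.f;
}

__host__ __device__ bool high_pt_muon_line(const TrackCandidate& t)
{
  return t.is_muon && t.pt > 10000.f;
}

__host__ __device__ bool displaced_dimuon_line(const TwoTrackVertex& v)
{
  return v.t1.is_muon && v.t2.is_muon && v.t1.pt > 500.f && v.t2.pt > 500.f &&
         v.fd_chi2 > 25.f;
}

__host__ __device__ bool high_mass_dimuon_line(const TwoTrackVertex& v)
{
  return v.t1.is_muon && v.t2.is_muon && v.t1.pt > 750.f && v.t2.pt > 750.f &&
         v.mass > 2700.f; // placeholder for "near or above the J/psi mass"
}
```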