Introduction

With ever-increasing data acquisition rates and detector complexity, the experimental particle physics program is reaching the exascale in terms of the data produced. The high-luminosity phase of the Large Hadron Collider (LHC) will produce about 150 times more data than its first run [1]; hence, a proportional increase in computing requirements is expected. All steps in the data processing chain are expected to cope with the increased throughput, under the assumption of a flat computing budget.

Particle transport simulation is an essential component in all phases of a particle physics experiment, from detector design to data analysis. Its main role is to predict the detector response to traversing particles, a complex task involving a large number of physics and geometry models. Among the most used particle transport libraries in high energy physics (HEP) are Geant4 [2], Fluka [3], and Geant3 [4]. Simulation is one of the most computationally demanding applications in HEP, utilizing more than half of the distributed computing resources of the LHC. The increasing demand for simulated data samples can be satisfied in part with approximate, so-called fast simulation techniques, but accelerating the detailed simulation process remains essential for increasing simulation throughput.

The ambitious experiment upgrades are occurring in a context where computing technology is rapidly evolving. Since the historical approach to improving CPUs, namely increasing the clock speed and shrinking the transistors, is now limited by quantum leakage, industry is exploring alternative solutions for the next technological breakthrough. The main hardware manufacturers now favor parallel (or vector) processing units as well as heterogeneous hardware solutions with accelerators such as GPUs, FPGAs, and ASICs, facilitating a performance boost for many domain-specific applications. Most HEP applications are not optimized for Single Instruction Multiple Data (SIMD) parallelism or coprocessors and therefore do not make efficient use of these new resources.

The SIMD model utilizes specialized CPU vector registers to execute the same sequence of instructions in parallel on multiple data elements. The Single Instruction Multiple Threads (SIMT) model follows the same concept, but the common code (kernel) is executed by multiple synchronous threads. The main practical difference between the two models is the length of the data vector: short in the case of SIMD, usually found on CPUs, and much longer in the case of SIMT, usually found on GPUs. Also, SIMD requires all the data to be processed to be packed into a single register, while each SIMT thread processes its data in its own, separate register. Vectorized applications are easier to port to coprocessors that implement the SIMT model.

The benefits of SIMD and SIMT have been demonstrated for applications featuring massive data parallelism, such as linear algebra and graphics. However, bringing these vectorization techniques to complex code with significant branching presents a different type of challenge. Particle transport simulation has many features hostile to SIMD, including sparse memory access into large data structures, deep conditional branching, and long algorithmic chains and deep function call stacks per data unit (a track, representing a particle state) with poor code locality.

The GeantV simulation R&D project [5] aimed to exploit modern CPU vector units by re-engineering the simulation workflow implemented in Geant4 [2] and the associated data structures. The goal was to enhance instruction locality by regrouping data (tracks) according to the tasks to be executed, rather than executing a sequence of tasks for the same track. The advantage of such an approach, besides the temporal locality, is that it enables new forms of data parallelism that were inaccessible before, such as SIMD and SIMT. Other computational workflows in HEP, such as reconstruction or physics analysis, could benefit from the same optimizations and it is expected that the lessons learned from the GeantV R&D can be applied to these areas.

The target of the GeantV prototype was to speed up particle transport simulation applications by a factor of 2–5 on modern CPUs [5], compared to Geant4 in similar conditions. Gains from SIMD and better instruction cache locality were foreseen, along with code and algorithm refactoring. To support multi- and many-core platforms, thread parallelism was supported starting with the very early versions of the prototype. Another design requirement of this study was to ensure portability to various hardware architectures. This entails keeping a single code base while preserving the ability to migrate the data model to a device-friendly representation.

Concepts and Architecture

Particle transport simulation is peculiar in terms of workflow and data access patterns. In most HEP event processing applications, the data lifetime is rather short: data is filtered and processed to produce results or derived quantities that are consumed by subsequent tasks. It is common for the same data to be used as constant input by several algorithms, but it is less common for that data to be recursively changed while being processed. The latter is the case for simulation, which follows the life cycle of a track, representing a particle traveling through the detector. The track is the central data object used by most of the transport algorithms: geometry computations, propagation in electric or magnetic fields, or physics processes affecting the associated particle. From a computational perspective, the track represents a state taken as input and modified subsequently by a sequence of tasks, collaborating to perform a step that moves it from one point to another. There is a design choice in the ordering of individual steps. In the traditional design, simulation engines perform consecutive steps on a single track until its transport is complete. To enhance code locality, one can choose an alternative approach, grouping tracks undergoing similar stepping tasks (e.g. the same physics model actions). This requires deep changes in track handling and step ordering compared to the classical approach, and it is the basic direction taken by GeantV.

An important feature of simulation that drives the application design is unpredictability: particle physics is stochastic by nature, implying that the next physics process affecting a particle has to be chosen according to probability distribution functions. One cannot generate a sequence of processes in advance, because their probabilities depend on the material properties at the current geometry location and on the kinematic properties of the current track. Hence, the scalar (per track) data flow consists of a sequence of tasks in which each task depends on the outcome of the previous one and cannot be known a priori.

The most convenient concept for handling the multitude of alternative algorithms is run-time polymorphism through virtual functions. Moreover, the large diversity and complexity of physics and geometry algorithms typically generate deep simulation call stacks and expensive branching logic, with a corresponding loss of computational efficiency.

The main GeantV concept is to change the focus from being data-centric to being algorithm-centric, making simulation SIMD- and SIMT-friendly. Instead of following a workflow from a track’s perspective, static processing stages are defined that handle track populations being processed by each stage. This change of viewpoint helps to enhance spatial and temporal instruction locality, at the price of using more memory and likely worse data caching. Bundling more work together also enables more fine-grained parallelism and favors deployment on heterogeneous computing resources.

Another important exploration in the context of simulation is parallelism. Multi-threading is an important lever for making use of the full processing power of modern CPUs. Even though most HEP workflows are embarrassingly parallel over input data (such as individual LHC collision events), most of our applications are memory-bound, and simulation is not an exception. Event-level parallelism has been used in production for several years in Geant4, with very good overall scaling in multi-threaded mode and a rather small memory overhead per additional thread. The limitation is that, while multi-threading allows effective use of many-core CPUs, it does not by itself increase the throughput per thread.

Vectorization is one of the throughput-increasing acceleration techniques and becomes beneficial when the code produces a large percentage of SIMD instructions. Although compiler authors are striving to provide solutions for automatic vectorization, in practice there are only a few kinds of problems for which auto-vectorization works out of the box. Auto-vectorization is more likely to be successful within confined data loops with reduced branching complexity and without any dependence on the input data. Since, in simulation, relatively few algorithms have natural internal loops, there are only limited benefits from auto-vectorization. GeantV explores percolating track data into low-level algorithms, aiming to loop over this data internally. This approach requires being able to schedule reasonable data populations for each vectorized algorithm.

In this approach, data first needs to be accumulated into per-algorithm containers (“baskets” in GeantV jargon), before being processed. The algorithms need to expose a new interface to handle an input basket and provide implementations that handle the basket data in a vectorizable manner. Note that the tracks coming from a single event may not suffice to fill baskets efficiently, given the complex branching of simulation code and the sheer variety of physics processes needed. One framework prerequisite is, therefore, to be able to mix tracks belonging to many concurrent events in the same processing unit.

Moving one level below, the requested track data has to be gathered and copied into the vector registers. For this to happen, the data are copied into arrays, each entry corresponding to the data of one track. In this scenario, the algorithm can be expressed as an easily vectorized loop over C-like arrays. Scattering the algorithm output data back to the original tracks completes the procedure and allows the processed tracks to be dispatched to subsequent algorithms. This scheme requires a data transformation layer on top of each algorithm, as shown in Fig. 1.

Fig. 1

An algorithm-centric view of the operations performed for updating the track state during a single step for the scalar and vector cases. Tracks are handled by pointer (Track*) rather than by value due to the need to reshuffle per algorithm

During this study, available vectorization techniques were thoroughly investigated in terms of programmability, performance, and portability. The techniques evaluated include auto-vectorization, compiler pragmas, SIMD libraries, and compiler intrinsics. The conclusion was that the higher the control over vectorization performance, the lower the portability and programmability. Assembly code or intrinsics are both difficult to write and maintain. On the other hand, auto-vectorization and compiler pragmas do not guarantee vectorization as an outcome, and this is an effect that worsens with increasing algorithm complexity. Our preferred choice was to use SIMD libraries offering a high-level approach to vectorization via SIMD types and higher-level constructs, while keeping the complexity at a reasonable level and leveraging the portability of the library. It was decided to decouple as much as possible the implementation of algorithms from the concrete SIMD libraries, leading to the creation of VecCore [6], an abstraction layer on top of SIMD types and interfaces, supporting both scalar and vector backends (such as Vc [7], and UME::SIMD [8]). The scalar backend supports SIMT as well.

Software Design

GeantV transforms the scalar workflow into a vector one. Instead of handling one track at a time, algorithms can operate on baskets of tracks. Once a basket is injected into the algorithm, the vectorization problem is reduced to transforming all scalar operations on track data into vector operations on basket data. To generate efficient SIMD instructions and to load data quickly into SIMD registers, the basket data needs to be transformed from an array of track structures (AOS) into a structure of arrays of track data (SOA). This copying operation is only necessary for the part of the track data needed by the algorithm.
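As an illustration of this transformation, the sketch below gathers only the track members needed by a hypothetical safety computation into an SOA, runs a contiguous loop over it, and scatters the results back to the tracks; the Track and BasketSOA types are simplified stand-ins, not the actual GeantV classes.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Track {                          // simplified AOS element (track state)
      double x, y, z, dx, dy, dz, p;        // position, direction, momentum, ...
      double safety;                        // output member updated by the algorithm
    };

    struct BasketSOA {                      // SOA holding only what the algorithm needs
      std::vector<double> x, y, z;
    };

    void ComputeSafetyToOrigin(std::vector<Track *> &basket) {
      const std::size_t n = basket.size();
      BasketSOA soa;
      soa.x.resize(n); soa.y.resize(n); soa.z.resize(n);
      for (std::size_t i = 0; i < n; ++i) {        // gather: AOS -> SOA
        soa.x[i] = basket[i]->x;
        soa.y[i] = basket[i]->y;
        soa.z[i] = basket[i]->z;
      }
      std::vector<double> out(n);
      for (std::size_t i = 0; i < n; ++i)          // contiguous, SIMD-friendly loop
        out[i] = std::sqrt(soa.x[i] * soa.x[i] + soa.y[i] * soa.y[i] + soa.z[i] * soa.z[i]);
      for (std::size_t i = 0; i < n; ++i)          // scatter: results back to the tracks
        basket[i]->safety = out[i];
    }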

The workflow is orchestrated by a central run manager. This coordinates the work of several components, among them the event generator, the geometry and physics managers, and the user application. The main event loop can be controlled by either the GeantV application or the user framework. Primary tracks, defining the original input collision event, are either generated internally or injected by the user, and are buffered by an event server. The track-stepping loop is re-entrant and executed concurrently in several threads. Each thread takes and processes tracks from the event server. Once all tracks from a given event are transported, another event is generated or imported. The scheduler respects the constraint not to exceed the maximum number of events in flight set by the user.

GeantV Scheduler

The scheduler’s main task is to gather data efficiently in baskets for all the components, in order to improve vectorization. Also, the multi-threaded approach needs to have good scaling to make efficient use of all available cores. During the study, several different approaches to achieve both of these goals were tested, resulting in several versions of the scheduler.

The first version of scheduling was mostly geometry-centric. It tried to benefit from the observation, illustrated in Fig. 2, that a large fraction of track steps is performed in a relatively small number of important detector volumes/materials (volume locality). At the least, geometry calculations could be vectorized for such baskets. The model had a central work queue that handled baskets containing tracks located in the same geometry volume. Dedicated transport threads concurrently picked baskets from the queue and transported the contained tracks to the next boundary. Whenever a track entered a new volume, it was copied into a pending basket for that volume. The worker thread that filled a given basket beyond a threshold was then responsible for dispatching it to the work queue and replacing it with a recycled empty basket. A garbage collector thread was responsible for pushing partially filled baskets to the work queue whenever the queue started to be depleted. Merging produced hits and storing them to the output file was managed by a special I/O thread.

Fig. 2

Geometry volume locality is observed in most detector simulations. This example illustrates the sorted number of simulated steps per volume in an ATLAS simulation from 2011. For this particular simulation, 50% of the steps are executed in 50 out of 7100 logical volumes

This first approach focused on demonstrating track-level parallelism based on geometry locality, although vectorized algorithms for baskets were not available at the start of the project. This was an extremely useful step for understanding the differences and peculiarities of the basket-based track workflow compared to the single-track approach. However, the model had scaling issues due to high contention on specific baskets and frequent flushes done by the garbage collector during the event tails.

A second version of the scheduler introduced support for explicit SIMD vectorization. The basket contained a track SOA with aligned arrays ready to be copied into the vector registers. Track data was copied in and out of the SOA, as tracks were passing from one basket to another. A simplified tabulated physics model was available in this version and, since it was not vectorized, the scheduler was still dealing only with geometry-local baskets. The prototype complexity increased and several tunable parameters were introduced in an attempt to implement an adaptive behavior, optimizing the performance of different setups and in different simulation regimes. Gathering and scattering data into the SOA baskets introduced new overheads due to extra memory operations, plus extra bottlenecks in the concurrent approach. To minimize the cost of memory operations, awareness of non-uniform memory access (NUMA) was introduced for handling basket data, leading to improvements of up to \(10\%\) of the simulation time.

Fig. 3

The final version of the GeantV scheduler, accommodating both scalar and vector processing flows

The final version of the GeantV scheduler is shown in Fig. 3. Track data is described as a POD structure and pre-allocated in contiguous memory blocks. Each thread takes pointers to primary tracks from an event server, storing them in an input buffer that plays the role of a particle stack. The stepping loop is implemented as a sequence of stages, each implementing a specific part of the processing required to make a single step for a population of tracks. Pointers to tracks tagged to execute a given stage are accumulated in that stage's input basket, processed by the stage algorithms, then dispatched to the input basket of the next stage. This implements a stepping pipeline for track populations. The scheduler takes bunches of track pointers (last generations first) and copies them into the input basket of the first stage, triggering the pipeline execution.

Each stage basket is dispatched internally to handlers of specific processing tasks. For example, the propagation stage dispatches all neutral tracks to a linear propagator and the charged ones to a field propagator. The handlers of vectorized algorithms first accumulate (basketize) enough tracks to make the algorithm execution efficient. Only the members of the track structure needed by the algorithm are then gathered into an SOA before being processed, and the results are scattered back to the original track pointers. Scalar algorithms use the stage basket track pointers directly, without having to gather/scatter data, so scalar and vector workflows can coexist. The last stage in the stepping pipeline implements the final stepping actions and calls the user application for scoring (tallying hits in sensitive detectors), before completing the cycle by copying the surviving tracks back to the prioritized particle stack.

The scheduler's role is to push tracks into the stepping pipeline until the initial track population is exhausted, then refill it from the event server. Globally, it also has to balance the workload among concurrent threads and enforce policies that optimize the overall workflow. In addition to fixing many of the issues identified in the previous versions, such as contention in multi-threaded mode and memory behavior, this version introduced a generic basketizing model, matching the availability of more vectorized algorithms beyond the geometry ones. The new framework significantly improved the basketizing efficiency, while also accommodating scalar and vector processing flows and switching from one to the other depending on the workflow conditions.

Scalar and Vector Workflows

To support both scalar and vector workflows in the same framework, a common interface class called handler was introduced to wrap all simulation algorithms in a common tasking system. The algorithm needs to implement the appropriate scalar and vector interfaces, taking as input either a single track pointer or a vector (basket) of tracks. The vector method acts as a dispatcher for the SIMD version of the algorithm. It has to first gather the needed data from the container of tracks and copy it into a custom SIMD data structure. For example, geometry navigation requires only the track position and direction, while magnetic field propagation also needs the charge, momentum, and energy. The SIMD structure is then passed to the vectorized algorithm. The newly produced track state variables are then scattered back to the original track pointers. To feed such handlers in a workflow, tracks executing the same algorithm need to be gathered in SIMD baskets before being handed to the vector interface. If vectorization of a given algorithm is not implemented or is inefficient, the scalar interface can be invoked directly, giving a scalar pipeline for this algorithm.
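A structural sketch of such a handler is given below; the class and method names mirror the description above, but the types are illustrative rather than the actual GeantV interfaces.

    #include <vector>

    struct Track;                            // GeantV track state (a POD in the prototype)
    struct TaskData;                         // per-thread state, see the concurrency model
    struct Basket { std::vector<Track *> tracks; };

    class Handler {
    public:
      virtual ~Handler() = default;

      // scalar interface: process a single track
      virtual void DoIt(Track *track, Basket &output, TaskData *td) = 0;

      // vector interface: gather the needed track members into an SOA, call the
      // SIMD kernel, scatter the results back, then fill the output basket.
      // The default shown here falls back to a scalar loop when no vector
      // kernel is available or efficient.
      virtual void DoIt(Basket &input, Basket &output, TaskData *td) {
        for (Track *t : input.tracks) DoIt(t, output, td);
      }
    };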

Algorithms of the same type are grouped into simulation stages. The simulation stages refer to specific operations that have to be executed in a pipelined manner to perform a single step that moves a particle from one position to the next. The sequence of stages executed per step by baskets of tracks can be followed in Fig. 4. At the beginning of the step, a PreStep stage initializes the track flags and separates killed tracks, handing them to a final SteppingActions stage to be accounted for and scored. The remaining tracks enter the ComputeIntLen stage, which samples the cross-sections of the physics processes and proposes an interaction length. Subsequently, a GeomQuery stage computes the geometry stepping limits in the current volume, and a PrePropagation stage uses the proposed step to determine in advance whether multiple scattering will affect the current step. The actual track propagation is performed during the PropagationStage, having one handler for neutral and one for charged particles. The multiple scattering deflection is added after the propagation in a PostPropagation stage, and any continuous processes are subsequently applied by the AlongStepAction stage. For steps limited by physics processes, a PostStepActions stage is executed, followed by the final SteppingActions stage, which accounts for stopped tracks and executes user actions. Every stage has an input basket per thread, used to execute the stage either in scalar mode, by looping over the contained tracks, or in vector mode, by passing the full basket to the vector interface.

Fig. 4

The sequence of stepping stages for baskets of particles in the GeantV prototype. Stages are connected in a predefined sequence similar to the stepping approach in Geant4, but there are also shortcuts or back connections that allow dumping stopped particles or repeating some stages. Each stage provides scalar and, in most cases, vector handlers for the stage algorithms

The workflow is executed in the following manner. Each thread collects a set of primary tracks in a special buffer, called StackBuffer, which emulates the functionality of a typical track stack (also used in Geant4). Secondary tracks of a higher generation are also pushed into this buffer and prioritized over their ancestors. The workload manager copies only the highest-generation tracks into the basket of the first stage, then executes it. Once processed, the tracks are copied to the input basket of the second stage, and so on. Each stage has one or more follow-ups, so most particles get pushed along the stepping pipeline, but some particles may loop between stages before being able to complete the step. As an example, charged particle propagation requires repeated queries to the geometry before finally crossing the volume boundary. The stepping loop simply executes the stages one after another, multiple times, until the stage baskets are empty. It then takes a new bunch of tracks from the StackBuffer. During this loop, some tracks typically end up in unscheduled SIMD baskets, but a subsequent loop can fill these SIMD baskets and flush them back into the pipeline.

Concurrency Model

The GeantV prototype implements parallelism at the track level. It supports an internal mode where the workload is parallelized among threads managed by the GeantV scheduler. It also supports an external mode implemented as a call to a re-entrant task transporting an event set, where the parallelism is controlled by the framework that makes the call.

Primary tracks produced by an event generator are stored in a concurrent event server and delivered to worker threads in bunches of customizable size. The track data storage itself is pre-allocated to avoid dynamic memory management and partitioned per NUMA domain; only pointers to tracks are delivered via the event server interfaces, as shown in Fig. 5. Once a thread picks up a set of primary tracks, it becomes the only user of each track in the set for the given step. Due to this design, there is no synchronization needed when changing the state of a track. Threads handle tracks in their own buffers; however, they share a single set of SIMD baskets per NUMA domain, so a thread may steal the tracks accumulated in these baskets by other threads. Even in scalar mode, when the SIMD baskets are empty, threads can steal tracks from each other as a means of work balancing during the processing tail at the end of events.

Fig. 5

Schematic view of parallelism for the GeantV prototype. Multiple transport tasks, each handling its own task data, share a set of propagator objects, one per NUMA domain. Each propagator shares a set of simulation stages with their SIMD baskets, but each transport task handles its own stack buffer of tracks. The system threads are scheduled by the run manager and pick up a free task data object from a concurrent queue. Each transport task preloads tracks from a concurrent event server and executes the stepping loop

The concurrency model was designed to minimize synchronization needs and to reduce contention in the concurrent services, while sharing track data to increase basket populations. The thread-specific state data needed by the different methods cooperating in track propagation is aggregated in dedicated objects (called TaskData), one per thread. A TaskData object is passed as an argument to the stepping loop method executed by a given thread, becoming visible to all the callees requiring it. This approach avoids the need to synchronize concurrent write operations on state data.
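The sketch below illustrates the idea; the member names are invented. Every worker thread owns one TaskData instance, and the stepping loop and all its callees receive it explicitly, so mutable state is never shared between threads.

    #include <random>
    #include <vector>

    struct TaskData {                  // one instance per worker thread
      int threadId = 0;
      std::mt19937_64 rng;             // thread-private random number generator state
      std::vector<double> scratch;     // geometry/physics scratch buffers
      // per-thread stage baskets, stack buffer, scoring data, ...
    };

    void SteppingLoop(TaskData *td) {
      td->scratch.clear();             // safe without locks: td is never shared
      // ... take track pointers from the event server and run the stage pipeline,
      //     passing td to every stage and handler that needs thread-local state ...
    }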

To maximize the basket population, vectorized handlers have a common SOA basket shared between threads. This was a requirement for enhancing the vector population, but it comes at the cost of increased contention and loss of data locality. To improve this, thread-local copies of the SIMD basket are created for the handlers with the largest population of tracks, such as those for field propagation and multiple scattering. For these, the track population in a single thread is enough to fill them, without workflow perturbation or loss of basket population. This allowed a large reduction of contention in the multi-threaded basket mode.

An important feature for fine-grained workflows is load balancing. The GeantV workflow is naturally balanced by the event server, which acts as a concurrent queue. The main problem that occurs is the depletion of the stack buffers of each thread when most of the remaining particles reside in SIMD baskets whose population is too small to execute efficiently in vectorized mode. Such a regime becomes blocking when the number of events in flight has already reached the maximum specified by the user, so the scheduler enters a so-called flush mode. All SIMD baskets are simply flushed and the scalar DoIt methods are executed by the first thread triggering this mode. Flushed particles are gathered in the stage baskets of this thread, which feeds this thread but further depletes the other threads that were already starving. This imbalance is compensated by a round-robin track-sharing mechanism, which allows threads to feed not only from the event server, but also from the shared buffers of other threads. To preempt the depletion regime, threads always share a small fraction of their own track populations, but consume these themselves if no other thread does. This weak sharing keeps contention low in the normal regime. Sharing becomes dominant during event tails and is also more important when running with many threads.

The externally-driven concurrency mode is the so-called external loop mode. In this mode, no internal threads are launched. The run manager provides an entry point that is called by a user-defined thread and that takes a set of events coming from the user framework. This will subsequently book a GeantV worker to perform the stepping loop, and will notify the calling framework via a callback. An example is provided, performing a simplified simulation of the CMS detector steered by a toy CMSSW [9] framework, mimicking the features of the full multi-threaded software framework of an LHC experiment. The GeantV simulation can be wrapped in a TBB (Threading Building Blocks, Intel®) [10] task and executed in a complex workflow, as described in Sect. 5.2.

Implementation

This section describes the core components of GeantV libraries and modules: VecCore, VecGeom, VecMath, propagation in a magnetic field and electromagnetic (EM) physics. Auxiliary modules, such as I/O and user interfaces, are briefly summarized as well.

Vector Libraries: VecCore

Portable and efficient vectorization is a significant challenge in large-scale software projects such as GeantV. The VecCore library [6] was created to address the lack of portability of SIMD code and the unreliable performance obtained when relying solely on auto-vectorization by the compiler. VecCore allows developers to write generic computational kernels and algorithms using abstract types that can be dispatched to different backend implementations, such as the Vc [7] and UME::SIMD [8] libraries, CUDA, and a plain scalar backend. VecCore provides an architecture-agnostic API, illustrated in Fig. 6, that covers the essential parts of the SIMD instruction set. These include performing arithmetic in vector mode, computing basic mathematical functions, operating on elements of a SIMD vector, and performing gather and scatter, load and store, and masking operations. Code written using VecCore can be annotated for running on GPUs with CUDA, and is portable across ARM®, PowerPC®, and Intel® architectures, provided it does not rely on features specific to a particular backend (e.g. using CUDA-specific variables such as thread and block indices, or calling external library functions that may be available only on the CPU).

Fig. 6

Illustration of VecCore API operations

VecCore is used to implement vectorized geometry primitives in VecGeom (described in Sect. 3.2), and vectorized physics models in GeantV. A brief discussion of VecCore with code samples can be found in Ref. [6], and examples of VecCore usage within VecGeom and GeantV appear in the following sections as well.
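To give a flavor of this style, the kernel below is written once against a generic value type and uses VecCore mask operations for the per-lane branch; the calls shown (vecCore::Mask_v, vecCore::MaskedAssign, vecCore::math::Sqrt) follow the API described in Ref. [6], but the example itself is an illustrative sketch, not code taken from VecGeom or GeantV.

    #include <VecCore/VecCore>

    // Safety (absolute distance) to a sphere of given radius centered at the
    // origin. Instantiated with double, the kernel runs in scalar mode;
    // instantiated with a SIMD backend type, it processes several tracks per call.
    template <typename Real_v>
    Real_v SafetyToSphere(Real_v x, Real_v y, Real_v z, Real_v radius) {
      Real_v r = vecCore::math::Sqrt(x * x + y * y + z * z);
      Real_v safety = radius - r;                          // positive when inside
      vecCore::Mask_v<Real_v> outside = safety < Real_v(0.0);
      vecCore::MaskedAssign(safety, outside, r - radius);  // per-lane branch
      return safety;
    }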

Geometry Description: VecGeom

Introduction

Detector simulation relies on the availability of methods to describe and construct the detector layout in terms of elementary geometry primitives, as well as interfaces that allow the determination of positions and distances with respect to the constructed layout. Well-known examples of such geometry modelers are the Geant4 geometry module and the ROOT TGeo library [11]. Both enable users to build detectors out of hierarchical descriptions of (constructive) solids and their containment within each other.

The vectorized geometry package, named VecGeom, was chosen as one of the first areas in which to study the optimal usage of SIMD and SIMT paradigms for passing vector data between algorithms, which is one of the main targets of GeantV. From this point of view, the primary development focus was implementing algorithms capable of operating on elements of baskets in parallel. This entails geometry primitives, such as a simple box, that offer kernels to calculate distances for a group of tracks in one function call, in addition to the normal case where only one track is handled.

Below, as an example, the signature of a typical geometry primitive is followed by the corresponding signature of the new vector/basket interface:

figure a
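A plausible reconstruction of the two signatures (argument types follow the spirit of the VecGeom sources and may differ in detail) is:

    // single-particle (scalar) interface
    double DistanceToIn(Vector3D<double> const &position,
                        Vector3D<double> const &direction,
                        double step_max) const;

    // vector/basket interface: one call answers the same query for a whole
    // SOA of track positions and directions
    void DistanceToIn(SOA3D<double> const &positions,
                      SOA3D<double> const &directions,
                      double const *step_max,
                      double *distances) const;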

Moreover, data structures and algorithms in VecGeom are laid out to enable efficient operation in heavily multi-threaded frameworks. For instance, a clear separation of state and services enables frequent track or context switches in the navigation module. This module is responsible for predicting where a track will go in the geometry hierarchy along its straight-line path. Multi-platform usage was targeted since the beginning: the same code base is intended to compile and run on CPUs as well as GPU accelerators.

Besides these primary goals, the development of VecGeom was guided and influenced by other requirements and circumstances. The first is to continue offering traditional interfaces operating on the single-particle (scalar) level. This ensures backward compatibility with the Geant4 or TGeo systems and is, in any case, needed to treat particles that have not been put into a basket. Secondly, another geometry project called USolids [12], funded by the EU AIDA project, was already in place, aiming to review and modernize the algorithms of Geant4 and TGeo and to unify the geometry code base. The VecGeom project joined forces with the USolids project for better use of available resources. As a consequence, VecGeom was factored out into a standalone repository, with the potential to evolve independently of GeantV. VecGeom therefore serves GeantV, with a basketized treatment and GPU support, while also making the modernized code available to clients in traditional scenarios using the single-particle interface.

Fig. 7

The code organization of VecGeom that motivates VecCore

The multitude of use cases and APIs to support (scalar, basketized, CPU, GPU) poses the risk of code duplication. In order to reduce this, VecGeom started with an approach, adopted by most GeantV modules, in which standalone (static) templated algorithmic kernels are instantiated multiple times with different types and specializations behind the public interfaces. This development architecture is visualized in Fig. 7, where the typical use cases are depicted as functional chains of algorithms (scalar, vector, GPU), all implemented in terms of the same kernel templates. In order to make this happen, the kernels are written in such a way that they can be instantiated with native C++ types as well as with SIMD vector types (as offered by vectorization libraries such as Vc). Furthermore, all constructs used have the proper annotation to compile on the GPU (using CUDA). VecCore, prototyped within the VecGeom effort, provides the abstractions needed to write these generic kernels.
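A minimal sketch of this pattern is shown below, with invented names: a single templated kernel is instantiated behind both public interfaces. In VecGeom the basket interface instantiates the same kernel with a SIMD vector type (via VecCore), and with the proper annotations the kernel also compiles for CUDA; here the basket loop is kept scalar for brevity.

    // one templated algorithmic kernel, written once
    template <typename Real_t>
    Real_t DistanceSquaredKernel(Real_t x, Real_t y, Real_t z) {
      return x * x + y * y + z * z;          // same source for scalar, SIMD, and GPU types
    }

    // scalar public interface
    inline double DistanceSquared(double x, double y, double z) {
      return DistanceSquaredKernel<double>(x, y, z);
    }

    // basket public interface over SOA input arrays
    inline void DistanceSquared(double const *x, double const *y, double const *z,
                                double *out, int n) {
      for (int i = 0; i < n; ++i)
        out[i] = DistanceSquaredKernel<double>(x[i], y[i], z[i]);
    }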

Using this development approach, VecGeom has evolved into a geometry library that, for single-particle queries, offers features similar to the classical Geant4 geometry module or TGeo for transport simulation. On top of this, these algorithms are also made available for basket queries or for execution on CUDA GPUs. In particular, all major geometry primitives have been implemented, and hierarchical detectors can be constructed from them via composition and placement. To solve the complex geometry tasks typically needed in particle detector simulation, such as determining the minimum distance from a particle to any material boundary or computing the intersection points with the next object along a particle's straight-line path, VecGeom offers navigator classes that operate on top of these primitives.

Today VecGeom's objective is to be a high-performance library for these tasks in general. A lesson learned during development was that it is worth taking a less rigid approach in order to achieve good performance and to benefit from SIMD instructions. In particular, VecGeom targets both basketized (or horizontal) vectorization and inner-loop (vertical) vectorization, depending on the complexity of the algorithm. A simple box primitive is an example of the former, and a complicated tessellated shape is an example of the latter. The best SIMD performance for a box is obtained with the use of baskets, yet a SIMD speedup for the tessellated solid is available even in scalar/single-particle mode and does not require basket input. However, processing baskets can still be beneficial due to positive cache effects.

VecGeom has been discussed and presented in various publications [13,14,15]. The following sections briefly review a few important results for specific aspects of VecGeom.

The Performance of Geometry Primitives

Fig. 8

Performance examples of VecGeom at the shape level for an elementary solid primitive, a tube segment. This demonstrates the performance improvement of important functions for the one-particle interface (1) (better algorithms), as well as the additional SIMD acceleration for the basket interface (2), automatically obtained by instantiating the same underlying kernel with Vc vector types

Fig. 9

Performance speedups of more complex primitives (polycone, extruded solid, polyhedron) for scalar interfaces, compared with Geant4 and TGeo. Speedups are averages over all such solids found in the ALICE detector. In these complex cases, no additional basket SIMD acceleration is feasible

Geometry primitives (or solids) are, in addition to affine transformations, the basic building blocks of complex detectors. The range goes from simple structures such as boxes, tubes, and cones, to more complex entities such as polycones, polyhedrons, and tessellated solids (see, e.g., the GDML reference manual [16] for a description). In general, VecGeom offers improved performance of the solid algorithms with respect to previous implementations in Geant4 and TGeo and even with respect to USolids [12]. In most cases, the improvement is due to better algorithms, often as a natural consequence of the effort to restructure towards SIMD-friendly code. In the case of simpler geometry primitives, the implementations provide real SIMD acceleration for basketized usage. Figure 8 exemplifies this for a tube segment, where the SIMD acceleration was found to be a factor of 2 or better with the AVX instruction set (maximum of 4 vector lanes) via the use of VecCore and Vc.

For the more complex solids, some performance improvements for the scalar interface are shown in Fig. 9. In these cases, an additional SIMD acceleration for the basket interface is not feasible due to divergent code paths taken by different particles in a basket. However, as mentioned, the vector units can often still be utilized by vectorizing inner loops or inner computations. This technique is used heavily in the tessellated solid, polyhedra, and multi-union cases [14], and it contributed to the excellent performance gain compared to previous implementations.

The Performance of Navigation Algorithms

Table 1 The time (in s) to navigate a batch of test particles in selected volumes of the CMS detector, and speedup factors for selected methods
Table 2 The time (in s) to process all test rays for a list of complex detector volumes, and speedup factors with respect to Geant4

Apart from solid primitives, VecGeom offers navigation algorithms for solving geometry problems such as distance calculations between particles and geometry boundaries in composite geometry scenes made up of many primitives. These navigation algorithms are the primary point of contact or interface between the geometry and the simulation engine. The algorithmic chain in Fig. 7 is a simplified example of a typical navigation algorithm flow. This chain contains transformations of global particle coordinates to the frame of reference of the volume in which the particle is currently situated, and performs distance queries to the solids embedded in this volume.

Just as with the geometry primitives, complexity defines the performance scenario for SIMD acceleration.

  1.

    Simple geometry limit: In this case the current volume contains only a few (simple) solids, e.g. in simple showering modules in calorimeters.

    In this limit, SIMD-accelerated geometry navigation of a basket is feasible for GeantV because most of the algorithmic chain can process baskets efficiently. Table 1 gives a few benchmark performance values that show the gain from using baskets and SIMD for simple volumes. In the generic case, the gain is rather modest because the SIMD throughput is limited by some non-vectorizable parts. The typical non-vectorizing operation is particle relocation after crossing boundaries, which becomes more expensive with increasing complexity, due to divergence of the location at the end of the step and the need to access a larger amount of non-local transformation matrix and 3D solid data. However, a process [15] was developed that can auto-generate code implementing specialized navigators that take into account the specific properties of the geometry. This generated code can reduce the non-vectorizable parts significantly. This increases the gain from baskets and SIMD, but is also beneficial in its own right to improve the performance of the scalar interface. The drawback of specialized navigators is that they require a generation workflow to be run before simulation, and so far they have not been extensively tested within GeantV. This explains, in part, why the overall gain from baskets in the current version of GeantV did not materialize for geometry navigation. Navigation specialization requires the analysis of all possible geometry state transitions for tracks crossing any placement of a given volume (possibly replicated) to its neighbours. Cached in the form of compiled code, the method performs global to local conversions between states. If the number of transitions is small (e.g. in the case of well-packed touching neighbour volumes), the number of crossing candidates to check can be much smaller than in the generic case, so the search can be accelerated. For complex structures, the number of combinations can lead to very large libraries with inefficient instruction caching. The navigation specialization approach is nevertheless very promising and will continue to be optimized in the context of future VecGeom developments.

  2.

    Complex geometry limit: In this case, the current volume contains many solids, which typically occurs for container volumes inside which many other modules are placed.

    In this limit, due to the large number of geometry objects to test, acceleration structures are typically used to reduce the complexity of hit-detection or ray-tracing from O(N) to O(log N), where N is the number of geometry objects present in the volume. This, in turn, makes it difficult to achieve a coherent instruction flow for all particles in a basket and to avoid branching. However, as for the tessellated solid, the navigation algorithms can benefit from SIMD acceleration via internal vectorization. In Ref. [15], a particular regular tree data structure, based on bounding boxes, was proposed, which can be traversed with a SIMD speedup. The VecGeom implementations for tessellated solids and for navigation are based on the same data structure. Table 2 shows a comparison of the performance of the navigation algorithms in given complex volumes using Geant4, TGeo, or VecGeom. The benefit of the SIMD speedup is highlighted by the additional gain when switching from SSE4 to AVX2 instructions on the x86_64 architecture. This benefit is also available in non-basketized modes via internal vectorization.

    There are many possible layouts of acceleration structures with SIMD support. This gives room for further improvement by selectively choosing the best possible acceleration structures for any given geometry volume. In this respect, VecGeom is ready to interface with kernels available from industrial ray-tracing libraries, such as Intel® Embree [17, 18], which has SIMD support.

VecMath

VecMath is a library that collects general-purpose mathematical utilities with SIMD and SIMT (GPUs) support based on VecCore. Templated fast math operations, pseudorandom number generators, and specific types (such as Lorentz vectors) were initially extracted from GeantV, then developed and extended within VecMath. The library is being extended to support vector operations for 2D and 3D vectors, and general-purpose vectorized algorithms. VecMath is intended as a core mathematical library, free of external dependencies other than VecCore and usable by vector-aware software stacks.

Fast Math

The Math.h header in the VecMath library contains templated implementations for FastSinCos, FastLog, FastExp, and FastPow functions. The functions can take either scalar or SIMD types as arguments. While the scalar specializations redirect to the corresponding Vdt [19] implementations, the SIMD specialization is currently implemented based on Vc types.
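A minimal usage sketch is shown below; only the function name FastExp comes from the description above, while the enclosing helper and the scalar stand-in are invented so that the example is self-contained.

    #include <cmath>

    // stand-in for the scalar FastExp specialization (which in VecMath
    // redirects to Vdt); defined here only to keep the sketch self-contained
    inline double FastExp(double x) { return std::exp(x); }

    // a kernel usable with a scalar double or, given the corresponding VecMath
    // overload, with a SIMD vector type: exp(-mu * t) attenuation of a beam
    template <typename R>
    R AttenuationFactor(R thickness, R mu) {
      return FastExp(-mu * thickness);
    }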

Pseudorandom Number Generation

The VecRNG class of VecMath provides parallel pseudorandom number generator (pRNG) implementations for both SIMD and SIMT (GPU) workflows via architecture-independent common kernels, using backends provided by VecCore. Several state-of-the-art RNG algorithms are implemented as kernels supporting parallel generation of random numbers in scalar, vector, and CUDA workflows. For the first phase of implementation, the following representative generators from major classes of pRNG were selected: MRG32k3a [20], Random123 [21], and MIXMAX [22]. These generators meet strict quality requirements, belonging to families of generators that have been examined in depth [23] or that have evidence from ergodic theory of exceptional decorrelation properties [22]. All pass stringent statistical test suites such as DIEHARD [24] and the BigCrush battery of TestU01 [25]. In addition, constraints on the state size and on performance were considered: (1) a very long period (\(\ge 2^{200}\)) obtained from a small (in-memory) state, (2) fast implementations and repeatability of the sequence on the same hardware configuration, and (3) efficient ways of splitting the sequence into long disjoint streams.

The design choice for the class hierarchy was the exclusive use of static polymorphism, motivated by performance considerations. Every concrete generator inherits, through the CRTP (curiously recurring template pattern), from the VecRNG base class, which defines mandatory methods and common interfaces. VecRNG is implemented exclusively in header files and provides a minimal set of member methods. This approach allows more flexibility in the higher-level interfaces for specific computing applications, while minimizing the compilation-time overhead.

The essential methods of the VecRNG interface are Uniform<Backend>() and Uniform<Backend>(State_t& s), which generate backend-typed, double-precision i.i.d. (independent and identically distributed) uniform numbers in [0,1), updating the internal fState or the given state s, respectively. The State_t type is defined in each concrete generator and provided to the base class through RNG_traits. One of the associated requirements for each generator in VecRNG is to provide an efficient skip-ahead algorithm, \(s_{n+p} = f_p(s_n)\) (advancing a state \(s_n\) by \(p\) steps, where \(p\) is the unit of the stream length or an arbitrary number), in order to assign multiple disjoint streams to different tasks. For example, the mandatory method Initialize(long n) moves the random state to the beginning of the given \(n\)th stream. Each generator supports both scalar and vector backends with a common kernel. Random123 has an extremely efficient stream assignment without any additional cost, since the key serves as the stream index, while MRG32k3a uses transition matrices (\(A\)), recursively evaluating \((A^{s}~\mathrm{mod}~m)\) using the binary decomposition of \(s\). The vector backend uses N (SIMD length) consecutive substreams and also supports the scalar return type, which corresponds to the first lane of the vector return type. Besides the Uniform method, some commonly used random probability distribution functions are also provided.
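The skeleton below illustrates the CRTP layout and the Uniform<Backend>() entry points; the VecRNG and RNG_traits names follow the description above, while the toy generator, its state, and the backend stand-in are invented for illustration (the real generators advance one substream per SIMD lane).

    #include <cstdint>

    template <typename DerivedT> struct RNG_traits;     // provides State_t per generator

    template <typename DerivedT>
    class VecRNG {                                      // CRTP base: no virtual calls
    public:
      using State_t = typename RNG_traits<DerivedT>::State_t;

      template <typename BackendT>
      typename BackendT::Double_v Uniform() {           // uses the internal state
        return static_cast<DerivedT *>(this)->template Kernel<BackendT>(fState);
      }

      template <typename BackendT>
      typename BackendT::Double_v Uniform(State_t &s) { // uses an external state
        return static_cast<DerivedT *>(this)->template Kernel<BackendT>(s);
      }

    protected:
      State_t fState{};
    };

    struct ScalarBackend { using Double_v = double; };  // stand-in backend type

    class ToyGenerator;                                 // hypothetical generator
    template <> struct RNG_traits<ToyGenerator> {
      struct State_t { std::uint64_t s = 1; };
    };

    class ToyGenerator : public VecRNG<ToyGenerator> {
    public:
      template <typename BackendT>
      typename BackendT::Double_v Kernel(State_t &state) {
        // trivial LCG, for illustration only
        state.s = state.s * 6364136223846793005ULL + 1442695040888963407ULL;
        return typename BackendT::Double_v(double(state.s >> 11) * 0x1p-53);
      }
    };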

GeantV Tracking and Navigation

GeantV implements basketized vectorization of geometry navigation queries following the workflow described in Sect. 2.1.2. Geometry “baskets” are passed to a top-level navigation API, then dispatched to VecGeom to benefit from its vectorization features as described in Sect. 3.2.3. The geometry queries are integrated into the stepping procedure in a special GeomQuery stage, providing a large number of handlers, one per logical volume in the user geometry. Each query for computing the distance to the next boundary and safety within the current volume can be executed in either scalar or vector mode. The efficiency in the vectorized case depends strongly on the volume shape and number of daughters. The track position and direction data are internally gathered into SOA data structures by VecGeom and dispatched to the 3D solid algorithms, updating navigation states held by the GeantV track structure. Even in the scalar calling sequence, VecGeom vectorizes the calls to the internal navigation optimizer.

An initial attempt to basketize and call the vector DoIt method for all volumes in a complex geometry such as CMS proved to be inefficient. The main reason was not VecGeom vectorization inefficiency, but the impact of SIMD basketizing on the GeantV workflow. The CMS geometry has O(4K) logical volumes, and \(\sim 80\%\) of the steps are performed in only \(\sim 10\%\) of them. In a GeantV vector flow scenario, after an initial propagation out of the central vertex volumes, most tracks become isolated in SIMD baskets belonging to many different, less important volume handlers. The workflow enters starvation mode and has to force frequent flushes of these baskets, executing the geometry queries in scalar mode. The effect worsens for complex geometry setups, typical of detectors at the LHC. In simple setups, composed of just a few geometry volumes, this scenario does not occur, and basketization gains are evident when the geometry code takes a sizeable fraction of the execution time.

A dynamic basketizing feature is implemented to alleviate this effect. Initially all volume basketizers are switched on, but the frequency of flushes versus vectorized executions is monitored and triggers scalar mode for inefficient baskets. Depending on the tuned efficiency threshold, the prototype ends up disabling most volume basketizers, keeping only about 5% active. Because some computation-intensive shapes (such as polycones and polyhedra) are not vectorized in VecGeom in multi-particle mode, the overall vectorization efficiency is rather poor, and it is reduced further by gather/scatter overheads.
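Conceptually, the switch can be pictured as in the short sketch below; the counters and threshold are invented for illustration. A volume whose basket is flushed too often, relative to full vectorized dispatches, is demoted to scalar mode.

    struct VolumeBasketizer {
      long nVector = 0;       // baskets dispatched full, processed in vector mode
      long nFlush  = 0;       // baskets flushed partially filled, processed in scalar mode
      bool active  = true;    // basketizing enabled for this logical volume

      void Update(double minVectorFraction) {
        const long total = nVector + nFlush;
        if (total > 100 && double(nVector) / double(total) < minVectorFraction)
          active = false;     // too many flushes: fall back to scalar geometry queries
      }
    };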

Related to the geometry, GeantV uses a per-step strategy for the boundary crossing algorithm in a magnetic field that differs from the one in Geant4. The algorithm first estimates, using a small-angle approximation, the deviation of a particle moving along the helix arc for a given step, compared to a straight-line propagation with the same step. Constraining this bending error to be less than an acceptable tolerance gives the maximum allowed step in the magnetic field, \(\delta _{\mathrm {field}}\). The geometry navigation interface is queried for the distance to the next boundary along a straight line, \(\delta _{\mathrm {boundary}}\), as well as the isotropic safe distance within the current volume, \(\delta _{\mathrm {safe}}\). The propagation step is first constrained by the minimum between the physics step limit, \(\delta _{\mathrm {physics}}\), and the next boundary limit, \(\delta _{\mathrm {boundary}}\). The second constraint is the maximum between the magnetic field limitation and the safety. This allows particles with small momenta to ignore the bending limit while they remain within the safety zone, and particles with large momenta, for which the deflection is small, to ignore nearby volumes and travel forward much farther. In practice, this allows larger steps to be taken near volume boundaries without the risk of accidental crossing. Finally, the step to be taken is the minimum of the geometry and physics limits, within the field/safety constraint. This step is passed to the integrator algorithm to move the track to the new position. If it was the geometry that limited the step, the geometry is queried for a possible final relocation after the propagation; otherwise, the algorithm is repeated until either the physics or the geometry distance limit is reached. Note that after the track kinematics are updated, tracks limited by geometry that have not completed crossing into the next volume are copied back into the geometry stage basket and considered in a subsequent execution of this stage. The tracks having reached the next boundary are forwarded to the PostPropagation stage, as shown in Fig. 4.
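In compact form, the step selection described above reads

\(\delta _{\mathrm {step}} = \min \big (\min (\delta _{\mathrm {physics}},\ \delta _{\mathrm {boundary}}),\ \max (\delta _{\mathrm {field}},\ \delta _{\mathrm {safe}})\big )\)

where a step limited by \(\delta _{\mathrm {boundary}}\) triggers the relocation query after the propagation.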

EM Physics Models and Vectorization

The ultimate goal of the GeantV R&D project is to exploit the possible computational benefits of applying vectorization techniques to HEP detector simulation code. From the physics modeling point of view, the most intensively used and computationally demanding part of these simulations is the description of electromagnetic (EM) interactions of \(e^-\), \(e^+\), and \(\gamma \) particles with matter. This is what motivated the choice of EM shower simulation code to demonstrate the possible computational benefits of applying track-level vectorization.

Geant4 provides a unique variety of EM physics models to describe particle interactions with matter [26], from the eV to PeV energy range, with different levels of physics accuracy. Each application area can find a suitable set of models with the appropriate balance between the accuracy of the physics description and the corresponding computational complexity. Moreover, Geant4 provides a pre-defined collection of EM physics models and processes for different application areas in the form of EM physics constructors [27]. Among these, the so-called EM standard physics constructor (sometimes called EM Opt0) is recommended by the developers for HEP detector simulations.

A corresponding set of EM models has been provided for the GeantV transport engine, together with the appropriate physics simulation framework. The accuracy of each GeantV model implementation was carefully tested through individual, model-level tests by comparing the computed final states and integrated quantities (e.g. cross-sections, stopping power) to those produced by the corresponding Geant4 version of the given model. Moreover, several simulation applications have been developed to test and verify the GeantV EM shower simulation accuracy, including both a general, simplified sampling calorimeter and a complete CMS detector setup. In all cases, the GeantV simulation results, measured using quantities such as energy deposit distributions in a given part of the detector, number of charged and neutral particle steps, secondary particles, etc., agreed with the corresponding Geant4 simulation results to within 0.1% (see more in Sect. 4).

The final-state generation, or interaction description, parts of these models are the most computationally demanding subset of the physics simulation. At the same time, they provide the physics code that is most suitable for track-level vectorization. The final-state generation usually includes the generation of stochastic variables, such as energy transfer, scattering, or ejection angles, from their probability distributions, determined by the corresponding energy or angular differential cross sections (DCS) of the underlying physics interaction. The probability density functions (PDF) are proportional to the DCS, which is usually a complex function and very often available only in numerical form. This implies that the analytical inverse of the corresponding cumulative distribution function (CDF) is unknown. For this reason, different numerical techniques have to be used to generate samples of the stochastic variables needed to determine the final state of a primary particle that underwent a physics interaction.

The composition–rejection method, for example, is one of the most extensively used methods in particle transport simulation codes to generate random samples according to a given PDF. However, it is not very suitable for vectorization, being based on an unpredictable number of loop executions that depends on the outcome of the random variable. This implies that, if this algorithm were vectorized over primary tracks, the different vector lanes would reach their exit conditions at different, non-deterministic moments, reducing the number of active lanes and eventually cancelling any potential computational gain. For this reason, special care was taken to find new sampling algorithms more suitable for vectorization, and also to implement solutions that could provide the maximum possible benefit of track-level vectorization, even when applied to existing and well-known sampling techniques.

As an outcome of this R&D phase, two methods were implemented and tested in the GeantV EM physics library. The first, known in GeantV jargon as “sampling tables”, combines pre-computed sampling tables with an alias method, made possible by introducing a discrete random variable as an intermediate step. The second, known as “lane-refilling rejection”, or simply “rejection”, takes advantage of vectorization even in the presence of non-deterministic sampling techniques. More details about the implementation of the EM physics models and their vectorization can be found in Ref. [28].
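To make the constant-cost property of table-based sampling concrete, the sketch below shows a plain (scalar) C++ alias table for a discrete random variable, the intermediate step mentioned above. It is only an illustration of the technique, not the GeantV code: the class name, the table construction, and the use of the standard-library uniform distribution are assumptions of this sketch. Each sample costs one table lookup and two uniform variates, with no data-dependent loop, which is what makes the approach friendly to lane-by-lane vectorization.

// Illustrative sketch (not the GeantV implementation): Walker's alias method
// for sampling a discrete random variable with constant, branch-light cost.
#include <cstddef>
#include <queue>
#include <random>
#include <vector>

struct AliasTable {
  std::vector<double> prob;        // acceptance probability per bin
  std::vector<std::size_t> alias;  // alternate bin taken if rejected

  explicit AliasTable(std::vector<double> p) {
    const std::size_t n = p.size();
    prob.resize(n);
    alias.resize(n);
    double sum = 0;
    for (double x : p) sum += x;
    for (double& x : p) x *= n / sum;                   // rescale so the mean is 1
    std::queue<std::size_t> small, large;
    for (std::size_t i = 0; i < n; ++i) (p[i] < 1.0 ? small : large).push(i);
    while (!small.empty() && !large.empty()) {
      std::size_t s = small.front(); small.pop();
      std::size_t l = large.front(); large.pop();
      prob[s] = p[s];
      alias[s] = l;
      p[l] -= 1.0 - p[s];                               // give the excess to the small bin
      (p[l] < 1.0 ? small : large).push(l);
    }
    while (!large.empty()) { prob[large.front()] = 1.0; large.pop(); }
    while (!small.empty()) { prob[small.front()] = 1.0; small.pop(); }
  }

  // One uniform variate selects the bin, a second accepts it or takes the alias:
  // fixed work per sample, no data-dependent loop.
  template <class RNG>
  std::size_t Sample(RNG& rng) const {
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    const std::size_t k = std::size_t(u01(rng) * prob.size());
    return (u01(rng) < prob[k]) ? k : alias[k];
  }
};

In the actual “sampling tables” method the discrete variable selects a bin of the tabulated DCS and the continuous value of the stochastic variable is then obtained within that bin; that part is omitted here.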

Using the above-mentioned unit tests to analyze the performance of the vectorized EM models compared to their (optimized) scalar versions, excellent vectorization gains of 1.5–3\(\times \) and 2–4\(\times \) were achieved on Haswell and Skylake architectures, respectively. The instruction set used on both architectures was AVX2, since AVX512 was not supported by the Vc backend. Figure 10 shows the speedup of the final state generation of different electromagnetic physics models obtained with SIMD vectorization for the two methods.

As a result of these developments, the physics simulation part of the GeantV R&D project provides the possibility of exploiting the benefits offered by applying track-level vectorization on a complete EM shower simulation suitable for HEP detector simulation applications.

Fig. 10

Speedup of the final state generation of different electromagnetic physics models obtained with SIMD vectorization in case of different sampling algorithms. The results were obtained by using Google Benchmarks [29] on an Intel® Haswell Core™ i7-6700HQ, 2.6 GHz, with the Vc backend and AVX2 instruction set, processing 256 tracks. “Scal” and “Vec” refer to scalar and vector implementations, while “Table” and “Rej” refer to the sampling table and rejection methods, respectively

A relatively wide range of performance variation in the algorithms and their vectorization gains is observed. This is because each of the EM physics models under study translates into a final-state sampling algorithm with unique computational characteristics, more favorable to one sampling technique or the other. In addition, while the sampling-table-based final state generation has a constant run time under any external conditions, the efficiency of a given rejection algorithm can change significantly with the primary particle energy. This is illustrated in Fig. 11, which shows the relative speedup of these two techniques applied to the Bethe–Heitler \(e^-/e^+\) pair production model, as a function of the primary \(\gamma \) energy. The two algorithms perform similarly at lower energies, while the rejection algorithm becomes \({\sim }35\%\) faster at higher \(\gamma \) energies simply because of the smaller rejection rate.

Fig. 11

Speedup of the rejection-based final state sampling compared to the sampling table-based one in case of the Bethe–Heitler \(e^-/e^+\) pair production model

The results shown indicate that the GeantV vectorized EM physics library has to be tuned to select the most efficient algorithm for each physics process, depending on the conditions. The complexity of the underlying DCS, the target material composition, and the primary particle energy are examples of conditions that can heavily affect the physics performance. The GeantV physics framework has been designed to take all of these considerations into account and to allow the most efficient final-state generation algorithm to be chosen, depending on the primary particle energy or detector region. This makes it possible to obtain the maximum possible performance, while keeping the memory consumption of the algorithms under control, even in the most complex HEP detector simulation applications.

Magnetic Field Integration

The integration of the equations of motion for a charged particle in a non-uniform pure magnetic field (or an electromagnetic field) accounts for about 15–20% of the CPU time of a typical HEP particle transport simulation. This integration is typically performed using the family of Runge–Kutta methods, which involves the generation of multiple intermediate states (x, p) and the evaluation of the field and the corresponding equation of motion. Many floating-point operations are carried out for each step of each track, providing substantial work for each initial data point, but without any expensive functions such as logarithms or trigonometric functions. In GeantV, the input to the field propagation stage is a basket of tracks. Each track has a requested step length for integration (see Sect. 3.4), obtained from the physical step size, the distance to the nearest boundary, and the curvature of the track (to avoid missing boundaries).

The tracks’ positions are typically scattered throughout the detector. The integration of separate tracks is carried out in separate vector lanes in order to create the most portable code and to make the best potential use of vector units with different widths. The vectorization of this part of a particle transport simulation has an important requirement: all steps of charged particles must be integrated, as long as the step can affect the deposition of energy or other measured quantities.

The lower level parts of the integration can be fully vectorized, because the operations proceed in lockstep, synchronously over the lanes of a vector with each lane corresponding to a different track:

  • the evaluation of the EM field at each track’s current or predicted location, either using interpolation (as in our benchmark example) or other methods such as the evaluation of a function;

  • the evaluation of the ‘force’ part of the equation of motion using the Lorentz equation;

  • and a single step of a Runge–Kutta algorithm, which provides an estimate of the end state of a track (position, momentum) and the error in this estimate.

The top level of Runge–Kutta integration involves checking whether the estimated error conforms to the required accuracy and checking if a successful step finishes the required integration interval. If the integration goes on, it must also calculate the size of the next integration step. Different actions are required depending on whether a step succeeded or not. In case a step was not successful, integration must continue for those tracks. A lane with a finished track, or one that reached the maximum allowed number of integration substeps, must be refilled from the potentially remaining pool of tracks in the basket.
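The following scalar sketch illustrates the per-lane control logic just described: accepting or rejecting an embedded Runge–Kutta step based on its error estimate and proposing the next step size. The safety factor, clamping limits, and error exponents are conventional choices for a 4(5) pair and are assumptions of this sketch, not the GeantV implementation; in the vectorized version each lane applies this logic to its own track, and finished lanes are refilled from the basket as described above.

// Illustrative scalar sketch of the per-lane step control logic
// (not the GeantV implementation): accept or reject an embedded
// Runge-Kutta step from its error estimate and propose the next step size.
#include <algorithm>
#include <cmath>

struct StepResult {
  double errRel;   // estimated relative error of the step (from the embedded RK pair)
};

// Returns true if the step is accepted; hNext is the proposed next step size.
// eps is the requested relative accuracy; a 4(5) embedded pair is assumed.
bool ControlStep(const StepResult& r, double h, double eps, double& hNext) {
  const double safety = 0.9, minShrink = 0.1, maxGrow = 5.0;
  const bool accepted = (r.errRel <= eps);
  // err ~ h^5 for a 4(5) pair, so rescale with exponent 1/5 when growing
  // and 1/4 when shrinking after a rejection (conventional choice).
  const double power = accepted ? 0.2 : 0.25;
  double factor = safety * std::pow(eps / std::max(r.errRel, 1e-30), power);
  factor = std::clamp(factor, minShrink, maxGrow);
  hNext = h * factor;
  return accepted;
}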

Since all charged particles are involved, there is a large population of particles undergoing integration steps during a simulation. Larger baskets to accumulate work for field integration were introduced; their size can be configured separately from the general basket size. With larger baskets, the fraction of lanes doing useful work increases substantially, approaching unity, as shown in Table 3.

Table 3 Fraction of ‘inactive’ lanes, in which the integration has already finished, for different values of the basket size
Fig. 12

Memory size of a simulation versus the basket size for field propagation. Simulations with different numbers of simultaneous events are shown. The number of primaries per event is also varied, using 16 (default) and 8. All are single-threaded simulations of 100 events with 10 GeV primary electrons in the CMS setup. The simulations were run on a MacBook Pro 2016 with 16 GB RAM running MacOS 10.13.6

Unfortunately, increasing the size of the buffer for field propagation has an additional effect, which counteracts the increase in efficiency from the reduction of idle lanes. It increases the number of simultaneous tracks in flight, as larger baskets accumulate more tracks. This in turn increases the memory usage, which is proportional to the number of tracks in flight. The effect can be seen clearly in Fig. 12, where a linear relation between the buffer size for field propagation and the memory usage is evident. It is possible, however, to reduce memory usage by starting fewer simultaneous events, as seen in the additional measurements with 1, 8, 16, or 32 simultaneous events.

Input and Output (I/O)

Input

Simulation input consists of particles to be transported through the detector. These can be either realistic collision events produced by Monte Carlo event generators or single particles (similar to a test beam) used to study a particular response. The use of an interface (the so-called event record) makes the generation and simulation steps independent, as shown schematically in Fig. 13. For the GeantV transport engine, it is irrelevant how the ‘primaries’ (input particles) are produced. The simulation threads concurrently process particles from the input.

The interface to the HepMC3 [30] event record has been implemented (the HepMCGenerators class). This interface can read both HepMC3 ASCII data and ROOT files containing serialized objects. The different types of input files are recognized by their file extensions. The interface selects the stable (outgoing) particles from the provided event and passes them to the transport engine. It can also apply optional cuts, for instance on \(\eta \) (pseudorapidity), \(\phi \) (azimuthal angle), or momentum.
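As an illustration of this kind of input interface, the sketch below reads a HepMC3 ASCII file, keeps the stable (status 1) particles, and applies a pseudorapidity cut computed from the momentum components. The function name and the cut value are placeholders; the actual HepMCGenerators class may differ in structure and in how it recognizes ROOT versus ASCII input.

// Minimal sketch (placeholder names, not the GeantV HepMCGenerators class):
// read a HepMC3 ASCII event file, keep stable particles, and apply an
// illustrative |eta| cut before handing them to the transport engine.
#include <cmath>
#include <string>
#include <vector>
#include "HepMC3/GenEvent.h"
#include "HepMC3/GenParticle.h"
#include "HepMC3/ReaderAscii.h"

std::vector<HepMC3::ConstGenParticlePtr>
ReadPrimaries(const std::string& fileName, double etaMax) {
  std::vector<HepMC3::ConstGenParticlePtr> primaries;
  HepMC3::ReaderAscii reader(fileName);
  while (!reader.failed()) {
    HepMC3::GenEvent evt;
    if (!reader.read_event(evt) || reader.failed()) break;
    for (const auto& p : evt.particles()) {
      if (p->status() != 1) continue;                 // keep stable (outgoing) particles
      const auto& mom = p->momentum();
      const double pt = std::hypot(mom.px(), mom.py());
      if (pt <= 0.0) continue;
      const double eta = std::asinh(mom.pz() / pt);   // pseudorapidity from the momentum
      if (std::abs(eta) < etaMax) primaries.push_back(p);
    }
  }
  reader.close();
  return primaries;
}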

Fig. 13

A diagram showing how the HepMC event record is used as the GeantV input format

Output

The detector simulation produces hits, containing the energy deposition and timing information in the sensitive parts of the detector, which are the output of the program. In the case of GeantV, these hits are produced concurrently by all the simulation threads and need to be recorded properly. Thread-safe queues have been implemented to handle the asynchronous generation of hits from several threads simultaneously. The GeantFactory machinery takes care of grouping the hits into so-called HitBlocks and putting them in the queues. Two approaches were tried for saving the hits into a file. In the first, all the hits produced by the different threads were given to one ‘output thread’ for serialization. This approach turned out not to scale properly and became a bottleneck. The problem was solved by the second approach, in which the serialization was performed by each transport thread and the ‘output thread’ was only responsible for the actual writing of the data to the file. This approach did not visibly affect the memory consumption. The implementation is based on the TBufferMerger class provided by ROOT [11]. Each transport thread fills its ROOT TTree objects in parallel, and the TBufferMerger merges these TTrees and saves them into the file on disk, as shown in Fig. 14.
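The sketch below shows the parallel-fill pattern on which this output scheme is based, using ROOT's TBufferMerger directly. The hit payload, branch, and file names are placeholders rather than the GeantV hit classes, and the worker threads stand in for the transport threads.

// Sketch of the parallel-fill pattern with ROOT's TBufferMerger: each worker
// fills its own TTree connected to a per-thread buffer-merger file; the
// merger writes the combined TTree to disk. The "hit" payload is a placeholder.
#include <thread>
#include <vector>
#include "ROOT/TBufferMerger.hxx"
#include "TROOT.h"
#include "TTree.h"

void WriteHitsInParallel(unsigned nWorkers, unsigned nHitsPerWorker) {
  ROOT::EnableThreadSafety();                       // required for concurrent ROOT I/O
  ROOT::TBufferMerger merger("hits.root");

  auto work = [&](unsigned worker) {
    auto file = merger.GetFile();                   // per-thread TBufferMergerFile
    TTree tree("hits", "simulated hits");
    float edep = 0.f;
    tree.Branch("edep", &edep);
    for (unsigned i = 0; i < nHitsPerWorker; ++i) {
      edep = 0.001f * (worker + 1) * i;             // placeholder "hit" content
      tree.Fill();
    }
    file->Write();                                  // serializes locally, queues for merging
  };

  std::vector<std::thread> threads;
  for (unsigned w = 0; w < nWorkers; ++w) threads.emplace_back(work, w);
  for (auto& t : threads) t.join();
}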

Fig. 14

A diagram of the output architecture, which is based on ROOT’s TBufferMerger

This architecture has been profiled and shows very good scaling behavior, as seen in Fig. 15. In particular, it solves the bottleneck problem of the ‘single thread serialization’ approach. More details on the usage of the multithreaded output are provided in Sects. 3.8.1 and 5.2.

Fig. 15

The I/O performance compared to the ‘single thread serialization’ approach. These tests used an Intel® Xeon® CPU E5-2630 v3 @ 2.40 GHz (Haswell), 2 × 8 cores, HT = 2 (16 native threads, 32 in hyperthreading mode); disk: SSD, 430 MB/s non-cached write speed (measured with: dd if=/dev/zero of=/tmp/testfile bs=1G count=1 oflag=direct)

MC Truth

In addition to the hits, users may be interested in saving the kinematic output, the so-called MC truth information. This consists of the generated particles (or at least some of them) that produced those hits. The handling of MC truth is quite detector dependent and no general solution exists. The algorithms selecting which particles to store, how to keep the connections between them, and how to associate hits to them are not straightforward and usually require trade-offs between the completeness of the stored event information and its size. In general, it is not desirable to store all the particles, as this would only waste disk space without providing any useful information. For instance, delta rays (low-energy secondary electrons), low-energy gamma showers, etc. are typically not stored. It is best to store all the particles needed to understand the given event and to associate them with the output hits. In all cases, it is necessary to set the particles’ connections in order to form consistent event trees.

In addition to all of the above points, multi-threading and concurrency further increase the complexity, because the order in which particles are processed is non-deterministic. There are situations where, depending on the processor load, the processing of a ‘daughter’ particle may be completed before the ‘mother’ particle’s propagation ends. The events therefore need to be reassembled from the products generated by the different threads once parallel processing is finished.

Following the idea that there is no perfect or complete strategy for handling MC truth, users must be able to decide which particles to store. The GeantV particle transport, therefore, must provide the possibility of flagging particles as ‘to be stored’ according to user-defined rules and, at the same time, ensure that the stored event has consistent mother–daughter links, as well as the correct hit associations. Taking all these requirements into account, MC truth handling was implemented using an architecture that is lightly coupled to the transport engine, with minimal interaction with the transport threads, while providing the flexibility to implement custom particle history handlers. In this design, shown in Fig. 16, the interface provided by the MC truth manager (MCTruthMgr class) receives concurrent notifications from the transport threads about the addition of new primary or secondary particles, the ending of particles, or the completion of events. It then delegates the processing of the particles’ history to a concrete MC truth implementation. In other words, the implementation is composed of the MCTruthMgr interface, the underlying infrastructure for the particles’ history (with a light-weight, transient, intermediate event record), and the user code that implements the decision-making (filtering) algorithm as well as the conversion to the user’s event format. As a proof of principle, an example using HepMC3 as the MC truth output format is provided. It demonstrates how to implement a simple filtering algorithm based on the particles’ energy, allowing a consistent particle history to be serialized into an output file.
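The sketch below illustrates the kind of user-side filtering logic that such an implementation can delegate to: keeping primaries and energetic secondaries while preserving the mother chain of every kept particle so that the stored tree remains consistent. The class and method names, the energy criterion, and the integer-id bookkeeping are all hypothetical and do not reproduce the GeantV MC truth interface.

// Hypothetical sketch of a user-side MC truth filter (names and bookkeeping
// are illustrative, not the GeantV API): keep primaries and any secondary
// above an energy threshold, and always keep the ancestors of kept particles
// so that mother-daughter links stay consistent.
#include <unordered_map>
#include <unordered_set>

struct TruthParticle {
  int id = -1;        // unique particle index within the event
  int motherId = -1;  // -1 for primaries
  double energy = 0.; // kinetic energy (placeholder unit)
};

class SimpleTruthFilter {
public:
  explicit SimpleTruthFilter(double eMin) : fEMin(eMin) {}

  void AddParticle(const TruthParticle& p) { fAll[p.id] = p; }

  // Decide which particles to serialize once the event is complete.
  std::unordered_set<int> SelectForOutput() const {
    std::unordered_set<int> keep;
    for (const auto& [id, p] : fAll) {
      if (p.motherId >= 0 && p.energy <= fEMin) continue;  // filter criterion
      int cur = id;
      while (cur >= 0 && keep.insert(cur).second) {         // also keep all ancestors
        auto it = fAll.find(cur);
        cur = (it != fAll.end()) ? it->second.motherId : -1;
      }
    }
    return keep;
  }

private:
  double fEMin;
  std::unordered_map<int, TruthParticle> fAll;
};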

Fig. 16

An example MC truth-handling architecture based on the GeantV MCTruthMgr class, with the HepMC event record as the user’s output format

User Interface

GeantV provides ‘user actions’ (similar to those in the Geant4 toolkit) that allow users to control the program flow at the level of run, event, track and step. Concurrent containers allow users to accumulate different kinds of information and merge the information from the different threads at the end of the run. Scoring is done using dedicated stepping actions in which information from the sensitive volumes is accessible.

Scoring Interfaces

The GeantV prototype implements specific scoring interfaces that aim to facilitate handling user-defined data structures for mixed concurrent events. The concurrency aspect is handled by having multiple copies of the scoring data structures attached to GeantV task data objects. Each running worker thread picks up a different task data object and percolates it as an argument to the user scoring interfaces. Since the maximum number of events transported concurrently, \(N_{\mathrm {max}}\), is limited, the user scoring data structures have to be indexed in arrays of the same size, \(N_{\mathrm {max}}\). The user class must have a trivial default constructor and copy constructor, and has to implement methods to merge and clear the data for a given event slot. Users are provided with an interface to attach custom data types to each task, usable subsequently in their application for scoring in a thread-safe way. A detailed example using this pattern can be found in the CMSFullApp example in the GeantV Git repository [31]. Another example is presented in Sect. 5.2.
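A minimal sketch of such a user scoring structure is shown below, assuming a hypothetical fixed maximum number of event slots; the type and method names are placeholders and do not reproduce the GeantV interface, but the structure follows the pattern described above: one data block per event slot, with per-slot merge and clear operations.

// Illustrative per-task user scoring structure (placeholder names, not the
// GeantV API): one data block per event slot, merged across per-worker copies
// at the end of an event and cleared so that the slot can be reused.
#include <array>
#include <cstddef>

constexpr std::size_t kMaxEventSlots = 16;   // placeholder for N_max

struct CaloScore {
  double edep;             // accumulated energy deposit
  unsigned long nSteps;    // number of charged steps
};

struct UserScoringData {   // trivially default-constructible and copyable
  std::array<CaloScore, kMaxEventSlots> slots;   // Clear() each slot before first use

  // Called from the stepping action of the worker owning this copy.
  void Score(std::size_t slot, double edep) {
    slots[slot].edep += edep;
    ++slots[slot].nSteps;
  }

  // Merge the given event slot of another (per-worker) copy into this one.
  void Merge(std::size_t slot, const UserScoringData& other) {
    slots[slot].edep += other.slots[slot].edep;
    slots[slot].nSteps += other.slots[slot].nSteps;
  }

  // Clear the slot once the event has been fully processed and written out.
  void Clear(std::size_t slot) { slots[slot] = CaloScore{}; }
};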

GeantV Applications and Physics Validations

The complexity of detector simulation software requires rigorous testing and continuous monitoring during development in order to ensure code correctness and to keep simulation precision and computing performance under control. Several tests and applications, with different levels of complexity, have been developed in order to meet these needs.

Particle transport simulation is composed of several individual components, including the geometry modeler, material description, physics models and processes, held together by the simulation framework. The framework is used to set up a flexible modeling environment, including a generic computation workflow controlled by high-level manager objects. As a consequence, the individual building blocks are accessed through their interfaces and provide their functionalities through the framework. Checking the correctness of individual components is a pre-requisite for ensuring the above-mentioned quality criteria. Subsequently, executing complete simulation applications that exploit and exercise the whole framework is an essential final step in this testing procedure.

Model-level tests allow the verification of the responses of individual physics models by directly calling their interface methods. This makes it possible to test in an isolated way the correctness of the integrated quantities (e.g. atomic cross section, stopping power, etc.) and differential quantities (e.g. energy or angular distribution of the final-state particles) that the physics models provide during the simulation. The production of such model-level tests was enforced as part of the physics model development procedure. The results were verified by comparing with the theoretical expectations and the corresponding Geant4 tests. To test and validate the overall simulation framework, including its building blocks, complete simulation applications were developed, along with the corresponding Geant4 applications, if these were not already available.

An application with a simple setup (TestEm5), with a configurable particle gun and a configurable target, was developed as a first-level test application. In spite of its relative simplicity, this application can produce several integrated quantities (e.g. mean energy deposit, track length, number of steps, backscattering and transmission coefficients) and differential quantities (e.g. angular and energy distributions of transmitted/backscattered particles). The primary particle and target properties can be modified easily. This application was an ideal tool for initial testing, validation, and debugging during the development of the physics framework.

The second application developed was a generic, simplified sampling calorimeter simulation (TestEm3), similar to that used for monthly validation by the Geant4 electromagnetic (EM) physics developers. With its intermediate complexity, this application was used for verification of the simulation by comparing several differential and integrated quantities to those provided by the corresponding Geant4 simulation. As an example of such a differential quantity, the mean energy deposit in the calorimeter by electrons generated with 10 GeV energy as a function of the layer number (proportional to the depth) is shown in Fig. 17 and compared to the corresponding Geant4 (version 10.4.patch03) simulation results. Integrated results, such as the mean energy deposit, track lengths in the absorber and gap materials, or the mean number of secondary particles and simulation steps obtained from the same simulation setup, are summarised in Tables 4 and 5. All measured quantities demonstrate agreement within the per mil level compared to the corresponding values obtained from Geant4.

Finally, a simulation application using the complete CMS detector (FullCMS) was developed in order to validate and verify the correctness and robustness of the overall framework when reaching the complexity of an LHC experiment. While a similar level of agreement as mentioned above was found between the GeantV and the corresponding Geant4 simulation results, this application was mainly used for performance analysis and profiling.

Fig. 17

Mean energy deposit by E\(_0\) = 10 GeV e\(^-\) in a simplified sampling calorimeter as a function of the layer number simulated by GeantV and Geant4 (10.4.patch03). The calorimeter is 50 layers of 2.3 mm lead and 5.7 mm liquid argon. A 4 T transverse magnetic field was applied and a 0.7 mm secondary production threshold was used

Table 4 Detailed results of the simulation from irradiating a simplified sampling calorimeter (50 layers of 2.3 mm Pb + 5.7 mm LAr; cut = 0.7 mm) with \(3\times 10^5\) electrons generated with energy 10 GeV in a 4 T magnetic field: mean and RMS values for the energy deposit (\(E_{\mathrm {dep}}\)) and the charged particle track length (\(L_{\mathrm {tr}}\))
Table 5 Detailed results of the simulation from irradiating a simplified sampling calorimeter (50 layers of 2.3 mm Pb + 5.7 mm LAr; cut = 0.7 mm) with \(3\times 10^5\) electrons generated with energy 10 GeV in a 4 T magnetic field: mean number of secondary \(e^-\), \(e^+\) and \(\gamma \) particles, as well as the mean number of steps made by charged and neutral particles

Usability Aspects

Reproducibility

Due to the stochastic nature of particle physics processes, detector simulation results are influenced by the generated random number sequence. Different sequences will generally produce slightly different, but statistically compatible, results. Such sequences are pseudo-random: controlled and reproducible based on an initial ‘seed’ (the next generated number is fully determined by the current generator state). Reproducibility is an important requirement for HEP detector simulation: simulations with the same initial configuration (primary particles, pRNG choice, and seed) must give the same results. Even in the case of non-sequential processing, the simulation must be repeatable when starting from the same initial configuration of the pRNG engine. This must hold true even if different choices are made during a run, e.g. using vector kernels for a set of physics processes of selected tracks. A key practical reason is the need to reproduce and debug problems that occur during the simulation of a particular event or initial particle. In general, the reliability of a simulation that cannot be exactly repeated is more difficult to assess.

In GeantV, baskets of tracks undergoing the same interaction are accumulated to enable computations based on vectors of track properties, with the goal of performing the bulk of the computational work using these vectors. The remaining tail of tracks is treated with sequential (non-vectorized) code, but using the same algorithms as the vector code. Multi-threading is used to gather larger populations of tracks with similar properties and enable wider vectors, targeting more efficient use of the vector code and higher performance. Due to the indeterminate order of execution in multi-threading, the track content of the baskets is not the same from run to run. In addition, a different set of remaining ‘unbasketized’ tracks is processed in scalar mode in each run, particularly during the final ramping-down phase of the simulation. To be reproducible, a particular algorithm must obtain the same pRNG output value (variate) for a track, whether the track is processed as part of a vector in a basket of tracks (‘vector’ mode) or as a single track using the non-vectorized code (‘scalar’ mode).

To obtain the same results for a track’s physics interactions (or other operations), the same sequence of output values of a pRNG is needed. This is achieved by associating a single pRNG state with each track. Whenever a new track is created, either as a primary particle or in a process, a new state of the pRNG must be generated deterministically and associated with the track. This idea, called ‘pseudo-random’ trees, was first proposed in the 1980s in a particle transport application [32]. A first implementation was also created using linear congruential generators. Applications in other parallel and branching computations have been proposed since then. The recent review in Ref. [33] has an overview and an evaluation of the proposed methods. One such method for constructing seeds, called the pedigree method, was developed by Leiserson et al. [34] to exploit deterministic parallel computations written in Cilk. This method was implemented in particle transport simulation [35] using Geant4 [26] as a test-bed.

The approach adopted for GeantV relies on two ingredients: first, an initial seed for the scalar mode, or a set of seeds for the vector mode, is assigned; second, a unique sub-stream index is determined for each track. The stream index of a primary track consists of high-order bits set from the event number and low-order bits set from the track index. For secondary (daughter) tracks, the stream index is generated in a collision-resistant way using the current state of the pRNG carried by the (mother) track that undergoes the interaction.
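The sketch below illustrates one possible realization of this scheme. The exact bit split, the per-secondary offset, and the use of a SplitMix64-style mixing function are assumptions of the sketch, not the GeantV code; the intent is only to show how a primary stream index can be built from the event number and track index, and a secondary one derived from the mother track's pRNG state in a collision-resistant way.

// Illustrative sketch of a sub-stream index scheme (assumptions, not the
// GeantV code): primaries derive their stream from the event number and
// track index, secondaries from the current state of the mother's pRNG.
#include <cstdint>

// Primary track: event number in the high-order bits, track index in the low-order bits.
inline std::uint64_t PrimaryStreamIndex(std::uint32_t eventNumber,
                                        std::uint32_t trackIndex) {
  return (std::uint64_t(eventNumber) << 32) | trackIndex;
}

// Secondary track: derive a new index from the mother's current pRNG state
// using a mixing function (SplitMix64 finalizer) to keep collisions unlikely.
inline std::uint64_t SecondaryStreamIndex(std::uint64_t motherRngState,
                                          std::uint32_t secondaryIndex) {
  std::uint64_t z = motherRngState + 0x9E3779B97F4A7C15ull * (secondaryIndex + 1);
  z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull;
  z = (z ^ (z >> 27)) * 0x94D049BB133111EBull;
  return z ^ (z >> 31);
}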

To enable reproducibility of the vector code for physics processes, a vector pRNG must be created in order to generate the output in each vector lane of the pRNG corresponding to each track index in the basket. In GeantV, this role is played by an instance of a proxy class, acting as a vector pRNG. The proxy provides all the expected outputs in each vector lane (as individual pRNGs would behave for each track) and advances the state of each track’s pRNG accordingly. The first implementation of a proxy class gathers the contents of the scalar pRNGs into an instance of the corresponding VecRng class (e.g. gathering MRG32k3a<double> into MRG32k3a<Double_v>). The proxy instance is reusable, by explicitly attaching and detaching the set of track pRNG states.

Reproducibility was tested using a limited set of GeantV physics processes, including bremsstrahlung, ionisation, and Compton scattering, which together form a self-contained \(e^{-}\)–\(\gamma \) cascade. The pRNG used in the tests is ThreeFry from the Random123 package [21]. This counter-based generator was chosen because its stream is easily split into separate sequences, its state size is moderate (128 bits), and its initialization from a seed is trivial.

The numbers of tracks and steps in a simulated dataset were measured using several different settings. 1000 events were simulated, each comprising ten 10 GeV electrons impinging on a 50-layer lead and liquid-argon calorimeter. The simulation was run using either 1 or 4 threads and in one of two modes: the default mode, in which a per-thread state of one serial pRNG and one vector pRNG is used in each thread, and the ‘reproducible’ mode, in which the method described above is used. Using the value (\(N_0\)) for the ‘reproducible’ mode run with 1 thread as a baseline, the ratios of the numbers (N) of tracks and steps are shown in Fig. 18.

Fig. 18

The ratio of the total number of tracks (steps) of the default (non-reproducible) mode normalized to the reproducible configuration, from which the total number of tracks (steps) is \(N_0\). 10 GeV \(e^{-}\) tracks are tested with scalar (Seq) and vector (Vec) configurations with 1 thread (1T) and 4 threads (4T)

It is verified that the reproducible mode maintains a constant number of tracks and steps for all tested configurations, as required. In addition, those numbers differ from the default single-threaded results by 0.2–0.6% for each of the scalar (Seq-1T) and vector (Vec-1T) modes. The default single-threaded mode is reproducible by construction, provided that a pRNG with a fixed seed (for scalar) or seeds (for vector) behaves identically on the same hardware, but its sequence is different from that of the reproducible mode, which maintains reproducibility for all configurations with the setup described earlier. In the non-reproducible multi-threaded modes, the number of tracks and steps fluctuates from trial to trial as expected, with averages (and variances) within one sigma of the values of the reproducible mode.

Reproducibility introduces an overhead in the simulation from copying and assigning pRNG states during the simulation workflow, from gathering scalar states into a SIMD vector state or joining and splitting states in the proxy approach, and from synchronizing the index of the states in the output (Random123 specific). Figure 19 shows an example of the CPU overhead, defined as the ratio of the time to execute in reproducible mode to the time to execute in default mode. The method tested gathers scalar pRNG states into a vector state and then splits the states back. Another approach, using the join–split method, shows a similar (2–5%) performance degradation for the reproducible mode.

Fig. 19

The overhead of reproducibility in simulation (CPU) time for the strategy that gathers scalar states into a vector state and splits them back, for different configurations. 10 GeV \(e^{-}\) tracks are simulated with scalar (Seq) and vector (Vec) configurations with 1 thread (1T) and 4 threads (4T)

Alternative proxy-based implementations are under development, including one that avoids the cost of copying the data. This is most interesting for the cases in which the average number of variates required is small and/or the pRNG state is large.

Experiment Framework Integration

The Compact Muon Solenoid (CMS) experiment uses a custom, fully-featured, multi-threaded software framework called CMSSW [9, 36,37,38]. This software framework is used to produce billions of simulated events every year, employing the Geant4 simulation toolkit. In addition, CMSSW handles various other components including event generation, detector geometry, magnetic field, and scoring. The last component includes the creation of simulated hits that are used as input for custom electronics simulations.

The most important test of GeantV with CMSSW was to demonstrate the compatibility of the threading models. In production, CMSSW uses event-level parallelism with Geant4. This approach isolates each event within a single thread; by avoiding communication between threads, the thread-safety of the application is easier to guarantee. In contrast, GeantV may process tracks from multiple events together in multiple threads. The GeantV approach was first tested in a simplified multi-threaded framework that uses the same Intel® TBB task-based processing as CMSSW. This test was successful and led to the development of the external loop mode for GeantV (Sect. 2.1.3), in order to allow the experiment’s software framework to control the distribution of tasks to threads.

Subsequently, the GeantV engine was fully integrated into CMSSW. To allow a more efficient exchange of tasks between CMSSW and GeantV, a new CMSSW framework feature called ExternalWork is employed [39]. With ExternalWork, the actions of the CMSSW module that runs GeantV are split into two steps: acquire and produce. In the acquire step, the input event data are obtained and sent to GeantV. The acquire step is non-blocking, so it returns control of the thread after spawning a task for GeantV to process the event. Once GeantV has finished processing the event, it executes a callback function, which adds the produce step to the TBB task queue. In the produce step, the CMSSW output products are created and placed in memory. This is depicted in Fig. 20. Without the use of asynchronous callbacks, the framework could be blocked if an event is loaded in one thread but then finishes processing in a different thread, after GeantV basketizes the event’s tracks together with those of other events. In the future, it may be possible to decouple the loading of event data from the spawning of tasks by making the external loop mode more sophisticated, which could further increase the efficiency of this kind of parallel processing.
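A schematic skeleton of such a module is sketched below, based on CMSSW's ExternalWork facility. The GeantV-facing call and the product handling are placeholders, and the exact framework signatures may differ between CMSSW versions; this is not the actual module of Ref. [41].

// Schematic skeleton of the acquire/produce split described above.
// runGeantVAsync and the product handling are placeholders, not the
// interface of the actual CMSSW-GeantV module.
#include <exception>
#include "FWCore/Concurrency/interface/WaitingTaskWithArenaHolder.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/stream/EDProducer.h"

class GeantVProducer : public edm::stream::EDProducer<edm::ExternalWork> {
public:
  void acquire(edm::Event const& event, edm::EventSetup const&,
               edm::WaitingTaskWithArenaHolder holder) override {
    // Non-blocking: hand the primaries to GeantV together with a callback.
    // When GeantV finishes the event, the callback releases the holder,
    // which schedules produce() as a new TBB task.
    runGeantVAsync(event, [holder]() mutable {
      holder.doneWaiting(std::exception_ptr{});
    });
  }

  void produce(edm::Event& event, edm::EventSetup const&) override {
    // Runs only after the callback fired: collect the hits and put them
    // into the CMSSW event (product type omitted in this sketch).
    // event.put(std::move(hits));
  }

private:
  template <typename Callback>
  void runGeantVAsync(edm::Event const&, Callback&&) { /* placeholder */ }
};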

Fig. 20

The ExternalWork feature in CMSSW, showing the communication between the experiment software framework and GeantV

To demonstrate the full compatibility of GeantV with the experiment software framework and the steps necessary to reuse Geant4-based applications with the new transport engine, all of the additional components mentioned above are important. It is straightforward to convert generated events, stored in the HepMC format, into native GeantV input. The CMS detector geometry can be converted into a TGeo representation, which is automatically recognized by GeantV and can be navigated by VecGeom. For simplicity, a constant magnetic field of \(3.8\mathrm {T}\) is used.

The scoring code required significantly more effort to adapt. There are two approaches to scoring in Geant4: sensitive detectors or watchers. Sensitive detectors are classes assigned to sensitive volumes, whose methods are automatically called when tracks traverse those volumes. This approach is the most efficient, because the volume name does not have to be checked. In contrast, watchers check every volume before deciding if they should execute their methods and record hit data. The second approach was chosen for the compatibility demonstration because the first approach is not available in GeantV and, in addition, as currently implemented in CMSSW, the first approach has more dependencies on Geant4 classes.

The full suite of scoring classes for the CMS detector comprises roughly 10,000 lines of code. A simplified scoring class that handles the electromagnetic and hadronic calorimeters was used as a demonstrator. These detectors were chosen because their scoring algorithms are relatively complex, relying on many Geant4 objects and interfaces, and because they are sensitive to the electromagnetic physics processes that have been vectorized in GeantV. The goal was to be able to use the exact same scoring code with both Geant4 and GeantV, to avoid regressions or increased maintenance burdens. Both the objects and interfaces differ between Geant4 and GeantV, so the demonstrator class is turned into a class template, where the template parameter is a traits class that collects all relevant objects and aliases them to common names. To address the differences in interfaces, specialized wrapper classes, with consistent methods, are provided for both Geant4 and GeantV. These wrapper classes handle the event, step, and geometry volume objects. This approach, using template wrappers and traits classes, has several benefits. It allows complete reuse of the scoring code with virtually no changes in the implementation, and it has no impact on performance, because the templates are evaluated at compile time.
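The pattern can be illustrated with the following self-contained sketch, in which the wrapper and traits types are invented stand-ins for the real Geant4 and GeantV step wrappers; only the structure, a traits class selecting the wrapper types consumed by a single scoring class template, reflects the approach described above.

// Illustration of the traits-plus-wrapper pattern (invented names, not the
// CMSSW demonstrator). Hypothetical wrappers expose a common interface; in
// the real code they wrap G4Step and GeantV track-step objects respectively.
#include <string>
#include <utility>

struct G4StepWrapper {
  double edep;
  std::string volume;
  double energyDeposit() const { return edep; }
  const std::string& volumeName() const { return volume; }
};

struct GVStepWrapper {
  double edep;
  std::string volume;
  double energyDeposit() const { return edep; }
  const std::string& volumeName() const { return volume; }
};

struct Geant4Traits { using Step = G4StepWrapper; };
struct GeantVTraits { using Step = GVStepWrapper; };

// One scoring implementation, reused for both engines and resolved at
// compile time, so the indirection has no run-time cost.
template <typename Traits>
class CaloScorer {
public:
  explicit CaloScorer(std::string sensitiveName)
      : fSensitiveName(std::move(sensitiveName)) {}
  void ProcessStep(const typename Traits::Step& step) {
    if (step.volumeName() == fSensitiveName) fEdep += step.energyDeposit();
  }
  double Edep() const { return fEdep; }

private:
  std::string fSensitiveName;   // placeholder sensitive-volume name
  double fEdep = 0.;
};

// Usage: CaloScorer<Geant4Traits> and CaloScorer<GeantVTraits> share the same code.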

However, there is another element to scoring in GeantV: thread-safety. In Geant4, as mentioned, each event is isolated within a thread, so having one instance of each scoring class per thread suffices. In GeantV, because tracks from multiple events are processed in multiple threads, steps for a given event may occur in different threads simultaneously. Rather than disrupting the complex scoring code by trying to make the existing classes accept input from multiple threads without causing data races, the chosen approach creates one instance of the scoring class per thread, per event. When a given event finishes processing in GeantV, the per-thread scoring classes dedicated to that event merge their output into a cache associated with the event, which is also accessible to CMSSW. This aggregation process is supported by the TaskData construct in GeantV, as depicted in Fig. 21 and also discussed in Sects. 3.4 and 3.8.1. The duplication of scoring class instances can increase the memory usage; however, this is mitigated by sharing read-only members, such as maps of detector volumes, between instances of the class.

Fig. 21

The process of aggregating scoring information from events being processed in multiple threads

With all of these elements in place, equivalent simulations can be run in CMSSW using Geant4 and GeantV. This allows testing of both physics and computing performance, which is reported in Sect. 6.6. The CMSSW module and supporting code that demonstrate the integration of GeantV can be found in Ref. [41].

Performance Results

In this section, an investigation of various performance aspects of the GeantV prototype is presented. Both a simplified example of a sampling calorimeter and a complex application that uses the CMS geometry and complete EM physics were used as benchmarks.

Several configurations of the core GeantV engine were tested to highlight the contribution of different components to the observed overall performance. Two key aspects of performance were examined: the intrinsic performance of the GeantV applications as measured by various performance counters, and comparisons between different configurations of the GeantV applications and the equivalent applications running in Geant4. Finally, the performance from the user’s perspective was examined by modifying the CMS simulation application to accommodate the GeantV engine with scoring and comparing it with the existing, similarly configured Geant4-based application.

Simulation performance depends on multiple parameters. The most important is the complexity of the application itself. Changing production cuts, tracking cuts, or tracking precision can result in orders-of-magnitude differences in the number of simulated particles and steps. The benefits of running an application in the GeantV framework can therefore vary significantly when different cuts are applied. The geometry complexity and the magnetic field settings are other application-dependent parameters that can greatly affect the CPU profile.

Another dimension to explore is the performance dependence on the hardware architecture (CPU, vector architecture, cache layout), and on the compiler and optimization flags. Measurements on several different systems were performed, although the coverage is far from complete. This analysis gives insight into how the application manages scarcity or abundance of various resources and consequently highlights areas where performance is good, as well as areas to improve.

Finally, different configurations of GeantV were explored: varying the basket size and the size of the event cache, and switching basketization off to emulate single track transport. This provides insight into the performance of individual components or scheduling features.

Global Performance

A set of global performance metrics to compare the GeantV examples with the equivalent Geant4 ones was selected: total execution wall-clock time, instructions per cycle (IPC) and floating-point operations per cycle (FPC), computational intensity (floating-point operations per memory operation, FMO), and the fraction of vector instructions (single and double precision). Cache misses at different levels, as well as TLB (translation lookaside buffer) misses, were also evaluated. These global counters were measured using the default GeantV settings, varying only the complexity of the application. Different platforms with different vector architectures and CPU cache configurations were tested.

For the performance benchmark results and comparisons described in this section, equivalent standalone CMS applications of GeantV and Geant4 were used, unless otherwise stated. These utilised the 2018 CMS GDML detector description and a magnetic field map interpolating a grid of field values extracted from CMSSW. The default scheduling mode used for GeantV is an optimized, vectorized mode in which basketization and vectorization are turned on for all sub-modules except geometry; other modes are mentioned explicitly wherever appropriate. GCC version 8.3.0 was used with the default optimization level (-O3) and build type (Release). An input of 1000 events of sixteen 10 GeV electrons was simulated on dedicated machines, with no other running processes, where the measurement uncertainty was less than 0.5%.

Table 6 shows the characteristics of the hardware platforms used for the tests. The results of the CPU benchmark of GeantV (version beta) compared to Geant4 (version 10.5) with the CMS detector are given in Table 7. The hardware platforms considered are Intel® E2620 (Sandy Bridge), Intel® E2680 (Broadwell), and AMD® 6128 (Opteron). As shown in the “Speedup” column of Table 7, the relative CPU performance ratio of Geant4 to GeantV varies widely across platforms. The impact of different configurations of the magnetic field on the relative speedup is marginal on the same hardware platform; the results from the Intel® E2620 are shown in Table 8 as an example.

Table 6 Properties of the hardware platforms used for performance tests: CPU (GHz), Memory (GB) and L3 cache size (MB)
Table 7 Performance comparison between GeantV and Geant4: the average CPU time in seconds per event for simulating sixteen 10 GeV electrons propagating through the CMS detector and the magnetic field
Table 8 Performance comparison between GeantV and Geant4: the impact of different configurations of the magnetic field on Intel E2620 (Sandy Bridge)

It turns out that the Geant4 performance is more sensitive to the size of the cache memory and fluctuates more than that of GeantV. An extended performance benchmark study on various hardware platforms and with different compilers is available in Appendix 9.

A variety of performance metrics (hardware counters) is provided by PAPI (performance application programming interface) [42]. A combination of PAPI counters provides useful information about code performance, such as floating-point operations, instructions per cycle, cache behavior, memory access patterns, and so on. For example, floating-point operations per cycle are a good measure of CPU utilization, while instructions per cycle quantify how well the pipeline is kept busy with minimal stalls. To collect profiling information along with hardware counters, Open\(\mid \)Speedshop [43] was employed as the primary profiler. This program provides an integrated toolkit and analysis framework for various performance experiments and measurements. Table 9 shows the IPC of GeantV compared to Geant4, indicating that GeantV executes relatively more instructions per cycle. Since the number of instructions completed is approximately proportional to the total number of floating-point operations, the FPC of GeantV with respect to Geant4 follows the same pattern.

Table 9 The IPC, instructions (INS) per cycle (CYC), of GeantV compared to Geant4

Another important performance metric is FMO, which quantifies data locality or computational intensity. Table 10 shows the FMO of GeantV compared to Geant4, which implies that GeantV has better data locality than Geant4 on the platforms tested, even with widely varying cache sizes and policies. The reason that the FMO on the Intel® E2680 (Broadwell) is much larger (better) than on the other architectures may be its relatively large memory and L3 cache, shown in Table 6, which leads to a smaller number of load and store instructions. Nevertheless, the FMO is relatively small for both Geant4 and GeantV, which indicates that a typical HEP detector simulation is a memory-bound application.

Table 10 Floating-point instructions per memory operation (FMO) in terms of floating point operations (FO) over the sum of load instructions (LD) and store instructions (SR)

Nonetheless, the resulting speedup and the platform dependency are not driven by a specific set of functions or libraries. For example, the percentage of the total CPU time spent in each Geant4 library, shown in Table 11, is very comparable across hardware platforms. This is also true for GeantV, as shown in Table 12, which indicates that the relative speedup is a global effect spread over all the code.

Table 11 The percentage of CPU time spent in each Geant4 library for simulating sixteen 10 GeV electrons propagating through the CMS detector
Table 12 The percentage of CPU time spent in each GeantV library for simulating sixteen 10 GeV electrons propagating through the CMS detector

To understand the underlying cause of the overall performance difference between Geant4 and GeantV, instruction and data cache misses at different levels were also studied. Tables 13 and 14 show instruction and data cache misses in L1 and L2, respectively. The GeantV application shows far fewer instruction cache misses in L1, which is attributed to the fact that GeantV has a much simpler code structure and consists of smaller libraries.

Table 13 L1 cache misses in 1 billion hardware counters between Geant4 (G4) and GeantV (GV)
Table 14 L2 cache misses in 1 billion hardware counters between Geant4 (G4) and GeantV (GV)

Most modern hardware systems have a TLB that serves as the cache for the page tables mapping addresses between virtual and physical memory. Table 15 shows both instruction and data TLB misses for GeantV compared to Geant4. In general, GeantV leads to substantially fewer TLB misses than Geant4. However, the total cost of TLB misses turns out to be a relatively small fraction of the total elapsed time; for example, the 330 million TLB misses on the Intel® E2620 cost about one second.

Table 15 TLB misses in 1 million hardware counters between Geant4 (G4) and GeantV (GV)

Scheduler Performance

The performance of the GeantV workload scheduler was evaluated to quantify the impact of different parameters: the number of events in flight, the basket size, and the scalar (emulated single-track) mode versus the basketized mode. The observed vectorization gains per component were measured. The performance impact of the cool-down phase, when basketization is less efficient, was also evaluated.

The main task of the GeantV scheduler is to maximize the amount of work executed via SIMD baskets rather than in scalar mode. Handling many baskets being filled concurrently during stepping becomes challenging when all the available tracks to be transported are exhausted and the limit on the number of events in flight is reached. To avoid a stall, the scheduler has to dispatch some of the partially filled baskets in scalar mode, allowing work to continue at the cost of reducing the basket (and therefore vectorization) efficiency. The efficiency drop increases with the complexity of the simulation stage, so, for example, basketizing all geometry volumes in a complex setup becomes prohibitively expensive. For this reason, the scheduler needs to keep active only a limited set of baskets, ideally those that are most “popular” in terms of the fraction of tracks processed by the associated algorithms. For example, the basket associated with vectorized field propagation is the most popular, since it handles all charged particles. The same applies to multiple scattering and to some physics processes.

For the geometry, it is difficult to predict which volumes handle most of the steps. A dynamic basketizing strategy was therefore adopted, switching a given basket from scalar to vector execution mode based on a “track popularity” score. The simulation starts with no active geometry basket, executing navigation in scalar mode, and a given basket is activated only when the number of tracks it handles reaches a high watermark. This strategy gradually activates basketization for the volumes with the highest particle multiplicity (i.e. the central barrel for collider experiment setups). As the transported tracks are exhausted and baskets have to be processed in scalar mode, their popularity is demoted until it reaches a low watermark, which triggers their deactivation. This strategy makes it possible to maintain a rather constant active (i.e. not stalled in SIMD baskets) track population while maximizing the SIMD flow via the popular baskets.
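The following sketch illustrates the watermark bookkeeping described above for a single geometry volume. The thresholds, the demotion rule, and the class name are placeholders rather than the GeantV scheduler code; only the hysteresis between a high activation watermark and a low deactivation watermark reflects the actual strategy.

// Illustrative sketch of popularity-based basket activation with hysteresis
// (thresholds and names are placeholders, not the GeantV scheduler code).
#include <cstdint>

class VolumeBasketizer {
public:
  // Called for every track stepping in this volume.
  void RegisterTrack() { ++fPopularity; }

  // Called when a partially filled basket has to be dispatched in scalar
  // mode, which demotes the volume's popularity.
  void Demote(std::uint64_t penalty = 1) {
    fPopularity = (fPopularity > penalty) ? fPopularity - penalty : 0;
  }

  // Decide the dispatch mode: activate vector dispatch above the high
  // watermark, fall back to scalar below the low watermark (hysteresis).
  bool UseVectorMode() {
    if (!fActive && fPopularity >= kHighWatermark) fActive = true;
    if (fActive && fPopularity <= kLowWatermark) fActive = false;
    return fActive;
  }

private:
  static constexpr std::uint64_t kHighWatermark = 1024;  // placeholder thresholds
  static constexpr std::uint64_t kLowWatermark  = 64;
  std::uint64_t fPopularity = 0;
  bool fActive = false;
};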

In the multi-threaded mode, SIMD baskets are filled concurrently in order to maximize the available population per category. The drawback is that multi-threading consumes the reserve of buffered tracks much faster, which forces more frequent scalar executions if the number of event slots is kept constant. Figure 22 shows the fraction of total tracks processed in SIMD mode for the main simulation stages as a function of the number of threads. While very popular stages such as field propagation maintain a rather constant, high SIMD dispatch efficiency, unpopular baskets are depleted much faster as the number of threads increases. For example, geometry basketization becomes very inefficient for more than a few tens of active geometry baskets. This behavior calls for replicating SIMD baskets per thread rather than sharing them, which gives very good results for the propagation and multiple scattering stages; unfortunately, for geometry this strategy does not bring any improvement. Besides the scheduler’s efficiency in dispatching baskets, the global performance is highly impacted by the intrinsic efficiency of the basketizing procedure, which involves gather and scatter actions as well as concurrent access. As presented in detail in Ref. [44], the main conclusion is that the basketizing dynamics strongly depends on the complexity of the workflow and on state parameters, such as the number of tracks in flight, the particle production budget, or the percentage of completion of a given event.

Fig. 22

Scheduling efficiency for SIMD baskets of different categories depending on the number of threads. CMS benchmark shooting 50 events with 100 GeV electrons, using 16 events in flight

The GeantV scheduler has an option to run in single-track mode, which emulates Geant4-style sequential tracking. Table 16 compares the performance of the GeantV single-track mode to the default (basketized) mode, showing, apart from marginal variations between platforms, that the impact of the GeantV scheduler or of the data locality from basketization is not the primary source of the performance difference between Geant4 and GeantV in scalar mode. Note that the computing performance depends on the basket sizes for the magnetic field, for physics, and for the multiple scattering process, and may need to be optimized for each hardware platform separately. The default number of tracks per basket used for these comparisons was 16.

Table 16 The relative CPU performance of the GeantV single track mode, GV-strk, which emulates Geant4-style tracking, compared with the default GeantV basketized mode, GV-bskt

Profiling Analysis

This section presents detailed profiling information that illustrates the relative performance of different components (geometry, physics, and magnetic field propagation) along with the major hotspots. Results are compared with Geant4 to understand which components exhibit different performance features. These profiles are also presented for different configurations of the CMS application with varied cuts and for the simplified calorimeter example. Tables 17, 18 and 19 list the top 10 functions of GeantV for different configurations, ranked by exclusive CPU time, while Table 20 lists the top functions of the Geant4 application. In general, there are no unexpected hotspots or bottlenecks, which indicates that both the GeantV and Geant4 applications are already reasonably modular and granular. On the other hand, there are noticeable differences between the scalar and the vector mode of GeantV. For example, the CPU fraction of CMSmagField::EstimateFieldValues is significantly reduced in the vector mode, as it is efficiently vectorized. The overhead from the geometry basketization is largely due to the extra track handling, such as Handler::AddTrack and Handler::Flush, which appear in Table 18 but not in Table 19. It is also worth noting that Spline::GetValueAt in GeantV takes significantly less time than its equivalent function in Geant4, G4PhysicsVector::Value, which is the top CPU function in recent versions of Geant4.

Table 17 Top 10 functions in GeantV scalar mode
Table 18 Top 10 functions in GeantV vector mode
Table 19 Top 10 functions in GeantV vector mode, except geometry (i.e., using the scalar mode for geometry)
Table 20 Top 10 functions in Geant4

Vectorization Performance

The performance of basketization is compared by switching it on or off for each component. This requires a mode that groups tracks into baskets but dispatches them in scalar mode; with this mode, the overheads of basketization can be evaluated.

Table 21 shows the fraction of vector instructions in each module of GeantV and the relative CPU gain introduced by vectorization with respect to the scalar mode for each of the enabled vectorization options. The CPU gain is relatively small, even though the fraction of vector instructions is significant. This is due to several factors, including the basketization overheads (10–25%, as shown in Appendix 9) and the inefficiency of gather/scatter and mask operations in the vectorized code. In addition, the poor vector performance of the geometry is not due to a lack of vectorization, but to the execution of the sequential algorithms used in navigation. Note that the sizable fraction of vector instructions in scalar mode (15.67%) comes from both compiler auto-vectorization and VecGeom internal vectorization.

Table 21 Vector instructions, the fraction of vector instructions (PAPI_DP_VEC)/(PAPI_DP_OPS), and the relative gain in CPU usage by vectorization of a specific module with respect to the scalar mode

Concurrency Performance

This section presents the multi-threaded performance of the GeantV applications compared to the Geant4 equivalent. This includes a discussion of the scalability features and the pros and cons of track-level parallelism versus event-level parallelism.

The strong scaling behavior of the GeantV prototype is shown in Fig. 23. The efficiency loss of about 25% when filling 16 physical cores is not ideal. It is caused both by extra memory contention for shared track basketizers, and by a decrease in basket efficiency as the number of threads increases.

Fig. 23

Strong scaling versus the number of threads for the CMS example benchmark on a dual-socket Xeon® CPU E5-2630 v3 @ 2.40 GHz with 8 cores per socket. The simulation was performed separately for the scalar and basketized workflows

Figure 24 shows the memory usage versus the number of threads, in a configuration keeping all basketizers active. As a general remark, the memory footprint is largely dominated by the number of tracks in flight. Increasing the number of threads requires more tracks for load balancing, but the memory can be kept under control at the price of lowering the basket efficiency. A particular inverse-slope effect is observed for small numbers of threads, more accentuated for large event buffers. In this regime, more track data needs to be allocated for the same number of tracks when fewer threads are used. The main reason for this “abnormal” behavior is the track reuse policy, combined with the fact that basketization “steals” tracks from the workflow. The single-threaded mode tends to release tracks from baskets later than the multi-threaded one, which increases the fragmentation of the track memory blocks. This effect also degrades the computing efficiency, and reflects the complexity of the scheduling mechanism in basketized mode, which is subject to further optimization.

Fig. 24

Maximum resident memory as a function of the number of threads, for different sizes of the event buffer in the fully basketized mode. Measurements are done with a fixed number of 100 events in the CMS example benchmark on a dual-socket Xeon® CPU E5-2630 v3 @ 2.40 GHz with 8 cores per socket

Performance in an Experiment Framework

The performance of GeantV is analyzed after integration into the CMS simulation framework as discussed in Sect. 5.2. In order to compare with Geant4, it is necessary to configure the application as similarly as possible to the options available in GeantV. These settings include an EM-only physics list that supports the same models that have been vectorized in GeantV, as well as the same production cuts and other cuts. The same magnetic field integrator and stepper are used. The CMS detector geometry corresponds to the version operated in 2018. CMS has also introduced several optimizations to improve the CPU performance of the Geant4-based simulation, including Russian roulette and shower libraries [45, 46]; the optimizations that are not compatible with GeantV are disabled. There is good agreement in the physical output quantities from equivalent GeantV and Geant4 runs in the CMS software, validating the performance comparisons [47].

The tests are conducted using 500 generated events with two electrons, each at \(E = 50\,\mathrm {GeV}\), with random directions in \(\eta \) and \(\phi \). A constant magnetic field of \(B = 3.8\,\mathrm {T}\) is used. The CMSSW tests compare single-threaded to multi-threaded performance, as multi-threaded jobs are necessary for efficient use of the WLCG resources. To ensure a constant workload, the number of events per thread is kept constant in each test by reusing the initial 500 generated events. Unused threads are kept busy to simulate production conditions with all cores in use. The CPU and memory usage of the main program are estimated with ROOT file output disabled, as the overhead from output is the same for Geant4 and GeantV, and therefore irrelevant. In the GeantV tests, vectorized algorithms are enabled for multiple scattering and magnetic field propagation. Both the basketized and single track modes of GeantV operation are tested.

Several different Intel® machines were used for the tests, with different cache sizes and other parameters. Table 22 summarizes the results. The table also includes results from the GeantV built-in standalone CMS test with similar settings, in order to characterize the performance observed in the full CMSSW framework. There is virtually no difference in performance between the basketized and single track modes. In all cases, the single-thread speedup in CMSSW exceeds the single-thread speedup from the standalone test. This is likely due to the additional instructions executed when running inside the CMSSW framework: the smaller instruction footprint of GeantV plays an even more important role in this case, as it leaves more room in the CPU instruction cache for CMSSW code. More pronounced speedups, along with more pronounced differences between CMSSW and the standalone test, are seen on machines with smaller caches, supporting this explanation. Unfortunately, the speedup declines as the number of threads is increased, because GeantV does not scale as well as Geant4 with multiple threads. As expected, GeantV uses more memory than Geant4; for both programs, the memory usage increases linearly with the number of threads. Figures 25 and 26 depict the scaling behavior of throughput and memory for Geant4 and GeantV as the number of threads increases, using the E5-2683 v3 machine.

Table 22 CMSSW test results with different machines, both single-threaded and multi-threaded
Fig. 25

Top: throughput in events per second for Geant4 (G4, black), GeantV (GV, blue), and GeantV single track mode (GVst, purple). Middle: throughput ratio for Geant4/GeantV (blue) and Geant4/GeantV single track mode (purple). Bottom: speedup calculated as throughput(N threads)/throughput(1 thread). The E5-2683 v3 CPU was used for these tests

Fig. 26

Top: RSS memory in megabytes for Geant4 (G4, black), GeantV (GV, blue), and GeantV single track mode (GVst, purple). Bottom: RSS memory ratio for Geant4/GeantV (blue) and Geant4/GeantV single track mode (purple). The E5-2683 v3 CPU was used for these tests

Lessons Learned

The GeantV R&D project performed an in-depth investigation of alternative particle transport scheduling models for simulation. While the main objective was to achieve significant speedups from vectorization and improved locality, several other direct or derived studies produced important results and conclusions. These are briefly discussed in the following subsections.

Vectorization Model, Basketization and Parallelism

In order to take advantage of code that executes using vector instructions, similar tracks need to be gathered together, which introduces several significant challenges.

One challenge is the trade-off between memory usage and efficiency. For example, it was necessary to restrict the number of shapes for which tracks were collected into shape-specific baskets, in order to avoid an explosion in memory use from both the number of baskets and the number of tracks in flight needed to fill them. However, this means that only the selected shapes can benefit from vectorization.

Similarly, in order to fit within the user memory budget, the number of events in flight must be limited. In practice, this means that when the number of tracks in flight for a given event starts to ramp down significantly, the event should be closed out so that a new one can start, avoiding starvation. Closing out an event is an expensive operation whose cost grows with the total number of baskets: it requires finding all outstanding baskets that still have at least one track belonging to the event and processing them to completion in scalar mode, since they have not reached the threshold to run in vector mode. All of these factors, plus the lack of vectorized navigation, mean that the gains from vectorization of the geometry stage are marginal at best, even though the VecGeom primitives are among the best vectorized code.
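The basketization policy described above can be summarized by a small sketch. The class and function names below (Track, Basketizer, processVector, processScalar) are hypothetical stand-ins rather than the GeantV interfaces: tracks are collected per selected shape, a basket that reaches the vector threshold is dispatched to the vectorized algorithm, and at event close-out the remaining under-filled baskets are flushed through the scalar path.

#include <cstddef>
#include <unordered_map>
#include <vector>

struct Track { int eventId; /* position, direction, energy, ... */ };

// Stand-ins for the vectorized (SIMD) and scalar algorithms of a given stage.
void processVector(int /*shapeId*/, const std::vector<Track> & /*tracks*/) {}
void processScalar(int /*shapeId*/, const Track & /*track*/) {}

class Basketizer {
public:
  explicit Basketizer(std::size_t vectorThreshold) : fThreshold(vectorThreshold) {}

  // Collect a track into the basket of its shape; dispatch the basket to the
  // vector path once it reaches the threshold.
  void addTrack(int shapeId, const Track &track) {
    std::vector<Track> &basket = fBaskets[shapeId];
    basket.push_back(track);
    if (basket.size() >= fThreshold) {
      processVector(shapeId, basket);
      basket.clear();
    }
  }

  // Event close-out: every basket still holding tracks of this event is by
  // definition below the vector threshold, so those tracks are completed in
  // scalar mode. The cost grows with the total number of baskets.
  void closeEvent(int eventId) {
    for (auto &entry : fBaskets) {
      std::vector<Track> &basket = entry.second;
      for (std::size_t i = 0; i < basket.size();) {
        if (basket[i].eventId == eventId) {
          processScalar(entry.first, basket[i]);
          basket[i] = basket.back();   // remove without preserving order
          basket.pop_back();
        } else {
          ++i;
        }
      }
    }
  }

private:
  std::size_t fThreshold;                               // e.g. the SIMD width
  std::unordered_map<int, std::vector<Track>> fBaskets; // one basket per shape
};

int main() {
  Basketizer b(8);
  for (int i = 0; i < 20; ++i) b.addTrack(/*shapeId=*/i % 3, Track{/*eventId=*/0});
  b.closeEvent(0);   // flush the leftovers in scalar mode
  return 0;
}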

When introducing multiple threads, load balancing becomes yet another challenge: synchronization points (however small they may be) are needed to determine how much of the work can and should be shared between threads, and to do the actual sharing. In order to improve scaling, it was necessary to reduce the amount of sharing between threads several times, at the expense of vector efficiency. An early version copied track data from one thread to another thread’s input stack essentially every time a track left a volume to enter a volume of a different type. In a later version, this happened only for tracks in overflowing baskets when a different thread was idle (had run out of local tracks to process). Even in this limited case, the cost was noticeable, in particular during event tails. To reduce contention on shared resources, the notion of a group of threads pinned to a NUMA domain was introduced. Each group of threads is essentially independent of the others, with tracks always staying within a single group. This arrangement also helps to reduce the amount of memory traffic across NUMA domains.
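The sharing policy converged upon can be illustrated by the sketch below, which uses hypothetical names (Basket, NumaGroup, maybeDonate, tryTakeShared) rather than the actual scheduler interfaces: work is donated only when a basket overflows and some thread of the same NUMA-pinned group is idle, and donated work never leaves the group.

#include <atomic>
#include <deque>
#include <mutex>
#include <utility>

struct Basket { /* tracks of the same shape/stage */ };

// One instance per group of threads pinned to the same NUMA domain; tracks
// never migrate between groups.
struct NumaGroup {
  std::atomic<int> idleThreads{0};  // maintained by the workers of the group
  std::mutex mtx;
  std::deque<Basket> sharedQueue;   // work donated by busy threads
};

// Called by a worker whose local basket overflows. Returns true if the basket
// was handed over; otherwise the caller keeps processing it locally.
bool maybeDonate(NumaGroup &group, Basket &&overflow) {
  if (group.idleThreads.load(std::memory_order_relaxed) == 0)
    return false;                                // nobody to help: stay local
  std::lock_guard<std::mutex> lock(group.mtx);
  group.sharedQueue.push_back(std::move(overflow));
  return true;
}

// Called by a worker that has run out of local tracks (after incrementing
// idleThreads). Returns false if no shared work is available in the group.
bool tryTakeShared(NumaGroup &group, Basket &out) {
  std::lock_guard<std::mutex> lock(group.mtx);
  if (group.sharedQueue.empty()) return false;
  out = std::move(group.sharedQueue.front());
  group.sharedQueue.pop_front();
  return true;
}

int main() {
  NumaGroup group;
  group.idleThreads = 1;                 // pretend one worker of the group is idle
  maybeDonate(group, Basket{});          // donation succeeds within the group
  Basket b;
  return tryTakeShared(group, b) ? 0 : 1;
}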

At first, it was assumed that there would be a benefit from gathering the memory fetches as closely as possible. Even though improved CPU data-cache usage was observed in the implementation that passed tracks from basket to basket by copying the data, the real bottleneck, and the major user of CPU time, was memcpy itself. The gain from avoiding memcpy was measured to be very large, even though it resulted in a much less efficient access pattern for filling the vector registers, due to a more fragmented data layout in memory. On the other hand, the cost of filling the vector registers from this fragmented memory layout is noticeable, and it is in fact one of the major problems blocking any efficiency gain from vectorization.
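To make the trade-off concrete, the sketch below shows the second approach: instead of memcpy-ing whole tracks into a contiguous basket, the SIMD lanes are filled by gathering individual fields from tracks scattered in memory. The track layout, lane count, and kernel are illustrative assumptions, not the GeantV data structures.

#include <cstddef>

struct Track { double x, y, z, step; /* ... many more fields ... */ };

constexpr std::size_t kLanes = 4;   // e.g. double-precision AVX2 width

// Placeholder for a vectorized kernel working on small structure-of-arrays
// buffers; a compiler can vectorize this simple loop.
void vectorKernel(const double (&x)[kLanes], const double (&y)[kLanes],
                  const double (&z)[kLanes], double (&step)[kLanes]) {
  for (std::size_t i = 0; i < kLanes; ++i)
    step[i] = x[i] * x[i] + y[i] * y[i] + z[i] * z[i];
}

// Fill one SIMD-width worth of inputs directly from tracks referenced only by
// pointer. The gather touches non-contiguous cache lines, which is the cost
// discussed in the text, but no per-track memcpy is performed.
void processChunk(Track *const tracks[kLanes]) {
  double x[kLanes], y[kLanes], z[kLanes], step[kLanes];
  for (std::size_t lane = 0; lane < kLanes; ++lane) {   // gather
    x[lane] = tracks[lane]->x;
    y[lane] = tracks[lane]->y;
    z[lane] = tracks[lane]->z;
  }
  vectorKernel(x, y, z, step);
  for (std::size_t lane = 0; lane < kLanes; ++lane)     // scatter the result back
    tracks[lane]->step = step[lane];
}

int main() {
  Track t[kLanes] = {{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {1, 1, 1, 0}};
  Track *ptrs[kLanes] = {&t[0], &t[1], &t[2], &t[3]};   // scattered in general
  processChunk(ptrs);
  return 0;
}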

Geometry

The VecGeom library was one of the first software components put in place for GeantV, and it paved the way for the project as a whole. Its programming model was generalized into VecCore, a framework for abstracting vector operations across different processor architectures, which demonstrated that SIMD acceleration can be achieved in a portable manner. VecCore has been used in GeantV to write code that, with a single implementation, can be instantiated to support either scalar or vector inputs. This model was used not only to improve the run-time performance of several algorithms with vector inputs, but also to improve the run-time performance of some algorithms even when the input is scalar; which type of gain is more relevant depends strongly on the detector being simulated. It was also shown that automatic code generation and specialization of the algorithms, tailored to the target geometry, can lead to significant further performance improvements.
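A minimal sketch of this “single implementation, scalar or vector backend” model is given below with a toy safety-to-sphere function. The Blend helper mimics the masked assignment provided by backend libraries such as VecCore, whose exact API is not reproduced here; only the scalar (double) instantiation is exercised.

#include <cmath>
#include <cstdio>

// Masked selection for the scalar backend; a SIMD backend would supply an
// overload operating lane by lane on a mask type.
inline double Blend(bool mask, double thenVal, double elseVal) {
  return mask ? thenVal : elseVal;
}

// Distance from a point to the surface of a sphere of given radius centered at
// the origin, written once for any backend type Real_v that provides the usual
// arithmetic, comparisons producing masks, sqrt, and Blend().
template <typename Real_v>
Real_v SafetyToSphere(Real_v x, Real_v y, Real_v z, Real_v radius) {
  using std::sqrt;
  const Real_v rho = sqrt(x * x + y * y + z * z);  // distance to the center
  const auto outside = rho > radius;               // per-lane mask
  return Blend(outside, rho - radius, radius - rho);
}

int main() {
  // Scalar instantiation; with a backend library the same template would be
  // instantiated with a multi-lane Real_v type without any code change.
  std::printf("safety = %.3f\n", SafetyToSphere(1.0, 2.0, 2.0, 2.0));
  return 0;
}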

Physics

Since the modeling of electromagnetic (EM) interactions of \(e^-\), \(e^+\), and \(\gamma \) particles with matter is the most intensively used and most computationally demanding part of most high-energy physics detector simulations, the EM physics processes were selected for vectorization. Beyond vectorization, the existing code was first reviewed, overhauled, and improved, based on an exhaustive review of the relevant literature. This resulted in optimized versions of the code that already brought significant performance improvements. Using model-level tests to compare the vectorized EM models to their (optimized) scalar versions, excellent vectorization gains were achieved: 1.5–3\(\times \) and 2–4\(\times \) on Haswell and Skylake (AVX2) architectures, respectively. Unfortunately, these synthetic, model-level gains are not visible when the models are integrated into a full application that has to deal with the entire range of models, particle types, and energies in a stochastic manner.

Magnetic Field

The integration of the equations of motion of a charged particle in a non-uniform pure magnetic field (or an electromagnetic field) accounts for about 15–20% of the CPU time of a HEP application. After reengineering and vectorizing the implementation, the ratio of the Geant4 run-time to the GeantV run-time in the benchmark example improved from a factor of 1.88 in fully scalar mode to a factor of 2.12 when the integration of the equations of motion was executed in vector mode.
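As an indication of the kind of code that was templated on the backend type, the sketch below shows one textbook fourth-order Runge–Kutta step for a charged particle in a static magnetic field. The state layout, the constant kappa (standing in for charge over momentum), and the step size are illustrative assumptions, not the GeantV or Geant4 stepper interfaces.

#include <array>
#include <cstdio>

template <typename Real_v>
using Vec3 = std::array<Real_v, 3>;

// State of the track: position and unit direction, parametrized by path length.
template <typename Real_v>
struct State { Vec3<Real_v> pos, dir; };

// Right-hand side of the equations of motion: d(pos)/ds = dir and
// d(dir)/ds = kappa * dir x B, with kappa proportional to charge/momentum.
template <typename Real_v>
State<Real_v> Derivative(const State<Real_v> &y, const Vec3<Real_v> &B, Real_v kappa) {
  State<Real_v> d;
  d.pos = y.dir;
  d.dir = {kappa * (y.dir[1] * B[2] - y.dir[2] * B[1]),
           kappa * (y.dir[2] * B[0] - y.dir[0] * B[2]),
           kappa * (y.dir[0] * B[1] - y.dir[1] * B[0])};
  return d;
}

// y + a * d, component by component.
template <typename Real_v>
State<Real_v> Axpy(const State<Real_v> &y, Real_v a, const State<Real_v> &d) {
  State<Real_v> r;
  for (int i = 0; i < 3; ++i) {
    r.pos[i] = y.pos[i] + a * d.pos[i];
    r.dir[i] = y.dir[i] + a * d.dir[i];
  }
  return r;
}

// One classical RK4 step of length h; the same code serves a scalar double or,
// with a SIMD backend, several tracks at once (one per lane).
template <typename Real_v>
State<Real_v> RK4Step(const State<Real_v> &y, const Vec3<Real_v> &B,
                      Real_v kappa, Real_v h) {
  const State<Real_v> k1 = Derivative(y, B, kappa);
  const State<Real_v> k2 = Derivative(Axpy(y, Real_v(0.5) * h, k1), B, kappa);
  const State<Real_v> k3 = Derivative(Axpy(y, Real_v(0.5) * h, k2), B, kappa);
  const State<Real_v> k4 = Derivative(Axpy(y, h, k3), B, kappa);
  const Real_v w = h * Real_v(1.0 / 6.0);
  State<Real_v> out;
  for (int i = 0; i < 3; ++i) {
    out.pos[i] = y.pos[i] + w * (k1.pos[i] + Real_v(2) * (k2.pos[i] + k3.pos[i]) + k4.pos[i]);
    out.dir[i] = y.dir[i] + w * (k1.dir[i] + Real_v(2) * (k2.dir[i] + k3.dir[i]) + k4.dir[i]);
  }
  return out;
}

int main() {
  State<double> y{{{0.0, 0.0, 0.0}}, {{1.0, 0.0, 0.0}}};  // start along +x
  const Vec3<double> B{{0.0, 0.0, 3.8}};                  // solenoid-like field
  for (int i = 0; i < 100; ++i)
    y = RK4Step(y, B, /*kappa=*/-0.1, /*h=*/0.01);        // illustrative values
  std::printf("pos = (%.4f, %.4f, %.4f)\n", y.pos[0], y.pos[1], y.pos[2]);
  return 0;
}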

Interfacing with User Task-Parallel Frameworks

Concerning the ability to integrate the GeantV prototype in the experiments’ simulation frameworks, two essential questions are whether the run-time performance gains are reproduced when the simulation is executed within the experiment framework, and how much effort is needed to replace the simulation engine with the new implementation. GeantV and CMS software developers worked closely together to explore this integration. During this co-development effort, there were several iterations on some of the fundamental features of the internal scheduler and its interfaces in order to ease the integration effort and improve run-time efficiency. One outcome of this collaboration is the demonstration that integrating the GeantV prototype within an experiment framework is relatively straightforward. The other major result is that the run-time performance gain seen in the standalone example is also seen in the integrated example, and is even slightly better.

Summary and Conclusion

The GeantV R&D project has reached its conclusion after several years of development and study, undertaken in the context of an international collaboration with the participation of the LHC experiments and under the umbrella of the HEP Software Foundation (HSF). Its main objective, demonstrating the speedup achievable with a novel approach based on parallel particle transport, has been realized with the delivery of a prototype that simulates full electromagnetic showers in a realistic and complex calorimeter. It has been shown that the performance gain from the vectorization of the individual software components is largely lost in the process of reshuffling the particles for the vector operations. On the other hand, it has also been observed that significant improvements in the performance of the simulation software can be obtained by better exploitation of data and code locality, as well as through more compact code based on modern programming idioms. These findings are informing the direction of future improvements of the Geant4 toolkit, including the investigation of architectural revisions.

Furthermore, the GeantV project has delivered the modular software packages VecGeom, VecCore, and VecMath, which are having a significant impact in different software areas within high energy physics. Those packages have already gone through all the phases of development, validation, and integration. They are now used in production by toolkits like Geant4 and ROOT and are delivering noticeable gains in performance.

In summary, the GeantV R&D project has contributed a set of useful libraries to the HEP software community, as well as valuable knowledge which has been used to inform further development of detector simulation toolkits. Future lines of work include modernization of the Geant4 simulation toolkit code, R&D for efficient utilization of accelerators in modern hardware platforms, and investigation of fast simulation techniques that promise to provide the necessary speed and physics fidelity needed for a larger fraction of use cases in future HEP experiments.