1 Introduction

Many measurements at particle colliders can only be made with the help of precise Standard Model predictions, which are typically derived using fixed-order perturbation theory at next-to-leading order (NLO) or next-to-next-to-leading order (NNLO) in the strong and/or electroweak coupling. Unitarity-based techniques and improvements in tensor reduction during the past two decades have enabled the computation of many new one-loop matrix elements, often using fully numerical techniques [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]. The algorithmic appeal and comparative simplicity of the novel approaches have also led to the partial automation of the computation of one-loop matrix elements in arbitrary theories, including effective field theories that encapsulate the phenomenology of a broad range of extensions of the Standard Model [20, 21]. With this “NLO revolution”, precision phenomenology has entered a new era.

It has become clear, however, that the fully numerical computation of one-loop matrix elements is not without its drawbacks, the most relevant being its relatively high computational cost. While the best methods exhibit good scaling with the number of final-state particles and are the only means to perform very high multiplicity calculations, it is prudent to resort to known analytic results whenever they are available and computational resources are scarce. The problem has become pressing because the computing power on the Worldwide LHC Computing Grid (WLCG) is projected to fall short of the demand by at least a factor of two in the high-luminosity phase of the Large Hadron Collider (LHC) [22,23,24,25]. Moreover, most techniques for fully differential NNLO calculations rely on the fast and numerically stable evaluation of one-loop results in infrared-singular regions of phase space, further increasing the demand for efficient one-loop computations [26, 27].

In this letter, we report on an extension of the well-known NLO parton-level program [27,28,29,30], which allows the one-loop matrix elements in to be accessed using the Binoth Les Houches Accord (BLHA) [31, 32] via a direct C++ interface. This is in the same spirit as the BLHA interface to the library [1], which gives access to analytic matrix elements for \(V+\) jet(s), \(\gamma \gamma \) (+jet) and di-(tri-)jet production. We have constructed the new interface for the most relevant Standard-Model processes available in , representing a selection of \(2 \rightarrow n\) processes with \(n \le 4\). As a proof of generality, we have implemented it in the [33] and [34] event generation frameworks. We test the newly developed methods in both a stand-alone setup and a typical setup of the event generator, and summarize the speed gains in comparison to automated one-loop programs.

Table 1 Processes available in the Standard Model

2 Available processes

The Standard Model processes currently available through the one-loop interface are listed in Table 1, with additional processes available in the Higgs effective theory shown in Table 2. All processes are implemented in a crossing-invariant fashion. As well as processes available in the most recent version of the code (v10.0), the interface also allows access to previously unreleased matrix elements for \(pp\rightarrow \gamma j j\) [38] and di-jet production. Further processes listed in the manual [39] may be included upon request.

In assembling the interface we have modified the original routines such that, as far as possible, overhead associated with the calculation of all partonic channels – as required for the normal operation of the code – is avoided, and only the specific channel that is requested is computed. Additionally, all matrix elements are calculated using the complex-mass scheme [40, 41] and a non-diagonal form of the CKM matrix may be specified in the interface. In general, effects due to loops containing a massive top quark are fully taken into account, with the additional requirement that the width of the top quark is set to zero. The intent is that the interface can therefore be used as a direct replacement for a numerical one-loop provider (OLP). We have checked, on a point-by-point basis, that the one-loop matrix elements returned by the interface agree perfectly with those provided by 2, 2 and 5. A brief overview of the structure of the interface is given in Appendix B.

Table 2 Processes available in the Higgs EFT

3 Timing benchmarks

Fig. 1 CPU time ratio of , , and 5 to at the level of loop matrix elements

To gauge the efficiency gains compared to automated one-loop providers, we compare the evaluation time in against 2, 2, and 5 using their default settings. In particular, we neither tune nor deactivate their stability systems. The tests are conducted in three stages. First, we test the CPU time needed for the evaluation of loop matrix elements at single phase space points; in a second stage, we test the speedup in the calculation of Born-plus-virtual contributions of NLO calculations using realistic setups; lastly, we compare the CPU time of the different OLPs in a realistic multi-jet merged calculation. In all cases, we estimate the dependence on the computing hardware by running all tests on a total of four different CPUs, namely

  • Intel® Xeon® E5-2650 v2 (2.60 GHz, 20 MB)

  • Intel® Xeon® Gold 6150 (2.70 GHz, 24.75 MB)

  • Intel® Xeon® Platinum 8260 (2.40 GHz, 35.75 MB)

  • Intel® Xeon Phi™ 7210 (1.30 GHz, 32 MB)

For the timing tests at matrix-element level, we use stand-alone interfaces to the respective tools and sample phase-space points flatly using the algorithm [66]. We do not include the time needed for phase-space point generation in our results, and we evaluate a factor of 10 more phase-space points in in order to obtain more accurate timing measurements at low final-state multiplicity. The main programs and scripts we used for this set of tests are publicly available. The results are collected in Fig. 1, where we show all distinct partonic configurations that contribute to the processes listed in Tables 1 and 2. We use the average across the different CPUs as the central value, while the error bars range from the minimal to the maximal value. The interface to typically evaluates matrix elements a factor of 10–100 faster than the numerical one-loop providers, although for a handful of (low-multiplicity) cases this factor can be in the 1,000–10,000 range.
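The way the central values and error bars in Fig. 1 are built can be sketched in a few lines of Python. This is only an illustration: the function name `summarize_ratio` and the sample timings are hypothetical, and the sketch assumes that the ratio is formed per CPU before averaging, which the text does not state explicitly.

```python
from statistics import mean


def summarize_ratio(olp_times, ref_times):
    """Summarize per-CPU timing ratios as (central, low, high).

    olp_times and ref_times hold one timing per CPU, in the same order.
    Assumption: the ratio is formed on each CPU first and then averaged.
    """
    if not olp_times or len(olp_times) != len(ref_times):
        raise ValueError("need one timing per CPU for both providers")
    ratios = [t_olp / t_ref for t_olp, t_ref in zip(olp_times, ref_times)]
    # Central value: average over CPUs; error bar: minimum to maximum.
    return mean(ratios), min(ratios), max(ratios)


# Hypothetical timings (seconds per phase-space point) on four CPUs.
olp = [10.0, 7.5, 15.0, 5.0]
ref = [0.5, 0.25, 0.5, 0.25]
central, low, high = summarize_ratio(olp, ref)
print(f"ratio = {central:.1f} (range {low:.1f} to {high:.1f})")
```

Averaging the raw times first and then taking a single ratio would be an equally plausible reading; the two choices differ only when the CPUs have very different speeds.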

We perform a second set of tests, using the event generator [33, 67], its existing OLP interfaces to 2 and 2 [68], and a dedicated interface to 5. With these interfaces we test the speedup in the calculation of the Born-like contributions to a typical NLO computation for the LHC at \(\sqrt{s}=14\) TeV, involving the loop matrix elements in Tables 1 and 2. The scale choices and phase-space cuts used in these calculations are listed in Appendix A. Figure 2 shows the respective timing ratios. The Born-like contributions to the NLO cross section consist of the Born, integrated subtraction terms, collinear mass factorization counterterms and virtual corrections (BVI), and their timing is dominated by the loop matrix elements if at least one parton is present in the final state at Born level. As a consequence, the large gains observed in Fig. 1 persist in this setup. The usage of speeds up the calculation by a large factor compared to the automated OLPs, with the exception of very simple processes, such as \(pp\rightarrow \ell {\bar{\ell }}\), \(pp\rightarrow h\), etc., where the overhead from process management and integration in Sherpa dominates. To assess this overhead we also compute the timing ratios after subtracting the time that the Sherpa computation would take without a loop matrix element. The corresponding results are shown in a lighter shade and confirm that the Sherpa overhead is significant at low multiplicity and becomes irrelevant at higher multiplicity.
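The overhead subtraction described above amounts to removing the same loop-free baseline from both timings before forming the ratio. A minimal sketch, with hypothetical names and numbers chosen purely for illustration:

```python
def overhead_subtracted_ratio(t_olp, t_ref, t_no_loop):
    """Timing ratio after removing the generator overhead.

    t_olp     : BVI time using an automated one-loop provider
    t_ref     : BVI time using the analytic matrix elements
    t_no_loop : time of the same run without any loop matrix element
    All names are illustrative; none is taken from an actual code base.
    """
    numerator = t_olp - t_no_loop
    denominator = t_ref - t_no_loop
    if denominator <= 0.0:
        raise ValueError("overhead exceeds the reference timing")
    return numerator / denominator


# Hypothetical low-multiplicity case: overhead hides most of the gain.
raw = 12.0 / 3.0
corrected = overhead_subtracted_ratio(12.0, 3.0, 2.0)
print(raw, corrected)
```

When the baseline is a large fraction of the reference timing, as in this example, the raw ratio understates the speedup of the loop evaluation itself; at high multiplicity the baseline is negligible and the two ratios coincide.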

Fig. 2 CPU time ratio of 2, , and 5 to at the level of Born-like contributions to the NLO cross section (BVI)

In the final set of tests we investigate a typical use case in the context of parton-level event generation for LHC experiments. We use the event generator in a multi-jet merging setup for \(pp\rightarrow W\)+jets and \(pp\rightarrow Z\)+jets [69] at \(\sqrt{s}=8\) TeV, with a jet separation cut of \(Q_{\mathrm{cut}}=20\) GeV, and a maximum number of five final-state jets at the matrix-element level. Up to two-jet final states are computed at NLO accuracy. In this use case, the gains observed in Figs. 1 and 2 will be greatly diminished, because the timing is dominated by the event generation efficiency for the highest multiplicity tree-level matrix elements [70] and influenced by particle-level event generation as well as the clustering algorithm needed for multi-jet merging. We make use of the efficiency improvements described in Ref. [73], in particular neglecting color and spin correlations in the S-MC@NLO matching procedure [74]. We do not include underlying event simulation or hadronization. The results in Table 3 still show a fairly substantial speedup when using . We point out that a higher gain could be achieved by also making use of ’s implementation of analytic matrix elements for real-emission corrections and Catani-Seymour dipole terms.

Table 3 CPU time ratios in an NLO multi-jet merged setup using
Table 4 Comparison of integration times using and

We close this section with a direct comparison of the CPU time needed for the calculation of Drell-Yan processes with one and two jets using and , up to a target precision on the integration of 0.1% (one jet) or 0.3% (two jets). The center-of-mass energy is \(\sqrt{s}=14\) TeV, and the scale choices and cuts are listed in Appendix A. The results are shown in Table 4. As might be expected when comparing a dedicated parton-level code with a general-purpose particle-level generator, is substantially faster than for the evaluation of all contributions to the NLO calculation. These results indicate a few avenues for further improvements of general-purpose event generators. With the efficient evaluation of virtual contributions in hand, attention should now turn to the calculation of real-radiation configurations, which represent the bottleneck for both and . In the simplest cases with up to 5 partons, the real radiation and dipole counterterms could be evaluated using analytic rather than numerical matrix elements, by a suitable extension of the interface we have presented here. In addition, the form of the phase-space generation may be improved for Born-like phase-space integrals. Table 4 lists the number of phase-space points before cuts that are required to achieve the target accuracy. We find that uses fewer than half of the points needed by in the Born-like phase-space integrals, while uses fewer points than in the real-emission type integrals but at a much higher computational cost. This confirms that ’s event generation is indeed impaired by the slow evaluation of real-emission type matrix elements, and by the factorial scaling of the diagram-based phase-space integration technique [75, 76] used in its calculations.

4 Numerical stability

As alluded to above, the numerical stability of one-loop amplitudes is of vital importance for both NLO and NNLO calculations, where the latter case necessitates a stable evaluation in single-unresolved phase-space regions. Here we wish to limit the discussion to this case and estimate the accuracy that can be expected from the one-loop amplitudes with an additional parton with respect to the Born multiplicity, i.e., those processes that correspond to the real-virtual contribution in an NNLO calculation. To this end, we generate trajectories into the singular limits according to dipole kinematics, rescaling the Catani-Seymour variables [77] of an initially hard configuration as

$$\begin{aligned} y_{ijk} \rightarrow {\left\{ \begin{array}{ll} \lambda \frac{s}{Q^2} &{} \mathrm {final-final} \\ \frac{1}{1+\frac{Q^2(1-y_{ijk})}{\lambda s}} &{} \mathrm {final-initial} \\ -z_i\frac{\lambda }{1-z_i\lambda } &{} \mathrm {initial-initial} \end{array}\right. } \, , \quad z_{i} \rightarrow z_{i} \end{aligned}$$
(1)

in the collinear limit, and

$$\begin{aligned} y_{ijk}&\rightarrow {\left\{ \begin{array}{ll} \frac{C(1-z_{i})}{z_{i}} &{} \mathrm {final-final} \\ \mathrm {sign}(y_{ijk})C\frac{1-z_{i}}{z_{i}} &{} \mathrm {initial-initial} \end{array}\right. } \, , \nonumber \\ \quad z_{i}&\rightarrow {\left\{ \begin{array}{ll} 1-2\frac{\lambda }{1+C} &{} \mathrm {final-final} \\ \frac{1}{1+2\frac{\lambda ^2-\lambda \sqrt{1+\lambda ^2}}{1+C}} &{} \mathrm {initial-initial} \end{array}\right. } \, , \\ C&= \frac{y_{ijk}z_{i}}{1-z_{i}} \nonumber \end{aligned}$$
(2)

in the soft limit. To assess the stability of the loop-amplitude evaluation, we calculate the number of exact digits as

$$\begin{aligned} N_\text {sd} = -\log _{10}\left( \frac{\vert V-V'\vert }{V}\right) \, , \end{aligned}$$
(3)

where V and \(V'\) denote the finite parts of the one-loop amplitude evaluated on two phase-space points that are rotated with respect to each other.
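Eq. 3 translates directly into code. The sketch below is illustrative only: the function name `stable_digits` is hypothetical, the relative difference is normalized to \(|V|\) rather than \(V\) so that the logarithm stays defined for negative finite parts (an assumption on our side), and the result is capped at 16 digits, following the convention used in this section for results that agree within machine precision.

```python
import math


def stable_digits(v, v_rotated, max_digits=16.0):
    """Number of stable digits between two evaluations, following Eq. 3.

    v and v_rotated are the finite parts of the one-loop amplitude on a
    phase-space point and on its rotated copy.
    Assumption: normalize to |v| so that negative values are handled.
    """
    if v == v_rotated:
        # Perfect agreement within machine precision.
        return max_digits
    if v == 0.0:
        raise ValueError("relative accuracy is undefined for v = 0")
    relative_difference = abs(v - v_rotated) / abs(v)
    return min(max_digits, -math.log10(relative_difference))


print(stable_digits(1.0, 1.0))    # identical results are set to 16 digits
print(stable_digits(1.0, 1.001))  # roughly three stable digits
```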

Fig. 3 Test of the numerical accuracy of standard-model three-parton one-loop amplitudes in the soft and collinear limits

Fig. 4 Test of the numerical accuracy of standard-model four-parton one-loop amplitudes in the soft and collinear limits

Fig. 5 Test of the numerical accuracy of HEFT three- and four-parton one-loop amplitudes in the soft and collinear limits. Note that the accuracy is set to 16 digits in the \(h_0+j\) case, where the two results agree perfectly

We consider crossings of the processes listed in Tables 1 and 2 such that only final-state singularities are considered. We have validated that the numerical accuracy is generally worse when approaching final-state singularities, so that we deem this simplification sufficient. For each singular limit, we generate \(10^4\) hard phase-space points with \(\sqrt{s}=1~\mathrm{TeV}\) using and, depending on the singular limit of interest, rescale the momenta according to Eq. 1 or Eq. 2 with \(\lambda \in \{10^{-3}, 10^{-6}, 10^{-9} \}\). The results are collected in Figs. 3, 4 and 5, where each point corresponds to the average numerical accuracy according to Eq. 3 and the solid error bars indicate the \(25\%\) quantiles of the median. The lighter-shaded error bands span from the worst to the best result in each run. In cases where the two results agree perfectly within machine precision, we set the number of stable digits to 16.
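For the final-final case, which is the one relevant here, the rescalings of Eqs. 1 and 2 can be sketched as follows. The function names are hypothetical, and the sketch rests on one interpretation that should be checked against the original derivation: in Eq. 2 the new \(y_{ijk}\) is evaluated with the rescaled \(z_i\), since inserting the original \(z_i\) would simply return \(y_{ijk}\) unchanged. Reconstructing momenta from the rescaled variables is not covered.

```python
def collinear_rescale_ff(y, z, lam, s, q2):
    """Final-final collinear rescaling of Eq. 1: y -> lam * s / Q^2, z fixed."""
    if not 0.0 < z < 1.0 or lam <= 0.0:
        raise ValueError("need 0 < z < 1 and lam > 0")
    return lam * s / q2, z


def soft_rescale_ff(y, z, lam):
    """Final-final soft rescaling of Eq. 2.

    C is built from the hard configuration. Assumption: the new y uses
    the rescaled z, so that y -> 0 and z -> 1 as lam -> 0.
    """
    if not 0.0 < z < 1.0 or lam <= 0.0:
        raise ValueError("need 0 < z < 1 and lam > 0")
    c = y * z / (1.0 - z)
    z_new = 1.0 - 2.0 * lam / (1.0 + c)
    y_new = c * (1.0 - z_new) / z_new
    return y_new, z_new


# Walk a hypothetical hard configuration into the soft limit.
for lam in (1e-3, 1e-6, 1e-9):
    y_soft, z_soft = soft_rescale_ff(0.25, 0.5, lam)
    print(f"lambda = {lam:.0e}: y = {y_soft:.3e}, 1 - z = {1.0 - z_soft:.3e}")
```

With this reading, both \(y_{ijk}\) and \(1-z_i\) vanish linearly in \(\lambda\) along the soft trajectory, while the collinear trajectory sends only \(y_{ijk}\) to zero at fixed \(z_i\).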

For all processes of interest, the numerical evaluation is sufficiently stable even in the deep infrared regions. Although not shown in the figures, we have checked that the stability of our interface is comparable to that of the other OLPs shown in Figs. 1 and 2 using appropriate settings.

5 Conclusions

We have presented a novel C++ interface to the well-known parton-level Monte Carlo generator, giving access to its extensive library of analytical one-loop amplitudes. The interface is generic and not tied to any specific Monte Carlo event generation tool. As a proof of its generality, we have implemented the interface in both the and event generators. The interface will become public with version 3.0.0, and the interface is foreseen to become public in a future release of the 8.3 series. It should be straightforward to adapt our code to the needs of other event generators.

We expect the interface to be valuable in two respects. First, for many of the processes considered here the speedup over other OLPs is substantial; accessing these matrix elements via this interface rather than an automated tool will therefore provide an immediate acceleration of event generation for many processes of high phenomenological interest. Second, the speed comparisons presented here highlight processes that are particularly computationally intensive for automated tools. Further improvements to the efficiency of these codes may be possible, with potential gains across a wider range of processes.

The structure of the interface allows for simple extensions. Further one-loop matrix elements in , implemented either currently or in the future, may become accessible in a straightforward manner. In the same spirit, the interface could also be extended to provide tree-level or two-loop matrix elements included in as the need arises. Further extensions to the interface, for instance to provide finer control over the one-loop matrix elements via the selection of helicities or color configurations, would also be possible.

Given that we have interfaced three popular automated OLPs within the generator-agnostic structure of the new interface, it is natural to envision the future development of a hybrid program that makes use of the fastest matrix element library for each process. Thinking further ahead, it may be worthwhile to reconsider a streamlined event generation framework, combining different (dedicated) parton-level and particle-level tools. This idea has been pursued with ThePEG [79], but so far rarely deployed. Apart from obvious efficiency improvements through the use of dedicated tools for different applications, such a framework enables previously unavailable methods for systematics studies. In view of both the faster integration in over and the magnitude of uncertainties pertaining to theoretical modeling of collider observables, this is becoming an increasingly important avenue for future work.

We want to close by highlighting that only a relatively small number of analytical amplitudes has to be known in order to cover a wide range of physical processes. When judiciously assembled, many parts of the calculations can be recycled in a process-independent way, with only charge and coupling factors being process-specific. Compared to other efforts to increase the efficiency of event generators, swapping automated for analytical matrix elements is straightforward and simple. Analytical matrix element libraries provide a so-far little explored path towards higher-efficiency event generation for the (high-luminosity) LHC and future colliders.