On-the-fly reduction of open loops

Building on the open-loop algorithm we introduce a new method for the automated construction of one-loop amplitudes and their reduction to scalar integrals. The key idea is that the factorisation of one-loop integrands in a product of loop segments makes it possible to perform various operations on-the-fly while constructing the integrand. Reducing the integrand on-the-fly, after each segment multiplication, the construction of loop diagrams and their reduction are unified in a single numerical recursion. In this way we entirely avoid objects with high tensor rank, thereby reducing the complexity of the calculations in a drastic way. Thanks to the on-the-fly approach, which is applied also to helicity summation and for the merging of different diagrams, the speed of the original open-loop algorithm can be further augmented in a very significant way. Moreover, addressing spurious singularities of the employed reduction identities by means of simple expansions in rank-two Gram determinants, we achieve a remarkably high level of numerical stability. These features of the new algorithm, which will be made publicly available in a forthcoming release of the OpenLoops program, are particularly attractive for NLO multi-leg and NNLO real-virtual calculations.


Introduction
The continuous improvement of statistics and experimental systematics at the Large Hadron Collider (LHC) permits to challenge the Standard Model of particle physics at steadily increasing levels of energy and precision. In this context, the uncertainty of theoretical predictions starts playing a critical role in many areas of the physics program of the LHC, providing strong motivation for developing new techniques that make it possible to push theoretical calculations towards more complex processes and higher perturbative orders.
In the last decade, the advent of new powerful methods for the calculation of one-loop scattering amplitudes [1][2][3][4][5][6][7][8][9] has opened the door to the automation of next-to-leading order (NLO) calculations. Nowadays, one-loop calculations are supported by a number of highly automated tools [10][11][12][13][14][15][16][17][18][19][20] that provide the key to achieve NLO precision in the context of multipurpose Monte Carlo generators [21][22][23][24][25][26][27]. This recent progress has enabled NLO calculations for a huge number of processes and has extended their reach up to multi-particle final states of unprecedented complexity [28][29][30][31][32]. Nevertheless, in various cases the technical limitations of one-loop generators still represent a serious bottleneck or even a show stopper. These issues can be encountered in processes with many final-state particles and for kinematic configurations with two or more widely separated scales. An important example is given by the real-virtual contributions to next-to-next-to leading order (NNLO) calculations, which require very fast and highly stable one-loop amplitudes in deeply infrared regions of phase space.
Motivated by these considerations, in this paper we introduce a new method that leads to very significant efficiency and stability improvements in the construction of one-loop amplitudes. This new method builds on OpenLoops [9,16], a fully automated framework for the automated generation of scattering amplitudes in the Standard Model. The original implementation of the open-loop approach [9,16] supports NLO QCD [31,[33][34][35][36][37][38][39] as well as NLO EW [40][41][42][43][44] calculations and is interfaced to various multi-purpose Monte Carlo tools. The OpenLoops program is also part of Matrix [45] and has already been applied to several NNLO calculations [46][47][48][49][50][51][52][53][54][55]. The essence of the open-loop method [9] consists of a numerical recursion that generates cut-open loop diagrams, called open loops, by multiplying, one after the other, the various building blocks that are connected through loop propagators. More precisely, the construction of N -point loop integrands is organised through the factorisation of N loop segments, which consist each of a loop propagator and a corresponding external subtree. Segment multiplications are implemented through process-independent numerical routines that correspond to the Feynman rules of the model at hand. This type of recursion was first proposed in the context of off-shell recurrence relations for colour-ordered gluon-scattering amplitudes [8]. Thanks to a tensorial representation that retains the loop-momentum dependence of all building blocks, this approach can be used in combination with reduction techniques based on tensor integrals [4] or with the OPP reduction method [5], resulting in both cases in very fast computer code [9].
The new method presented in this paper exploits the factorised structure of the openloop representation in a completely new way. The key idea is that certain operations, which are usually done when all building blocks of Feynman diagrams have been assembled, can be anticipated and performed on-the-fly during the construction of the diagrams. Exploiting the factorised structure of the integrands, this on-the-fly approach permits to perform various types of operations at a much lower level of complexity, thereby boosting their efficiency. As we will show, it can be exploited in order to factorise helicity summations as well as the sums over different Feynman diagrams that share the same one-loop topology. Moreover, based on the integrand reduction method by del Aguila and Pittau [2], we will introduce an on-the-fly technique for the reduction of open loops. In this way, we will promote OpenLoops to an algorithm that combines the construction and the reduction of loop amplitudes in a unified numerical recursion. A notable feature of this approach is that it permits to avoid high-rank objects at any stage of the calculations. More precisely, tensor integrals are always kept at rank two or lower, thereby reducing the computational complexity in a dramatic way.
The on-the-fly technique leads to very significant improvements of CPU efficiency. For what concerns numerical stability, in order to avoid severe instabilities that result from squared inverse Gram determinants in the reduction identities of [2], we present a method that isolates such instabilities in certain triangle topologies and circumvents them via analytic expansions in the limit of small Gram determinants. In this way we obtain the first integrand-reduction algorithm that is essentially free from Gram-determinant instabilities. The achieved level of stability in double precision is competitive with the most sophisticated tools on the market [19] and with public implementations of OPP reduction in quadruple precision.
The paper is organized as follows. In Section 2 we review the original open-loop method. The on-the-fly approach is introduced in Section 3 for the case of helicity sums and for the merging of topologically equivalent open loops. In Section 4 the on-the-fly approach is generalised to the reduction of open loops. Details on the employed integrand-reduction identities and our treatment of Gram-determinant instabilities are discussed in Section 5. The entire algorithm and its implementation are outlined in Section 6, where we also present technical studies on the CPU performance and numerical stability. Our conclusions are presented in Section 7, and Appendix A deals with low-rank integrals that remain to be solved at the end of the on-the-fly recursion.

The open-loop method
In this section we review the original open-loop method [9], which is implemented in the publicly available OpenLoops 1 program [16]. At variance with the original publication [9], here we refine various aspects of the notation and we adopt a particular perspective that sets the stage for the new methods introduced in Sections 3-5. These new techniques are going to become publicly available in the OpenLoops 2 release.

Helicity and colour bookkeeping
The task carried out by the open-loop algorithm is the calculation of the tree-level and one-loop contributions to the scattering probability density, or the squared one-loop contribution for loop-induced processes. The polarised matrix elements M 0 (h) and M 1 (h) should be understood as generic tree and one-loop amplitudes, in the sense that the techniques presented in this paper are applicable to any renormalisable theory, including the QCD and electroweak sectors of the Standard Model, as well as BSM theories. The sums in (1)-(2) run over all helicity and colour degrees of freedom of the scattering particles. While colour indices are kept implicit, the helicity dependence is characterised by a single index h, which corresponds to the global helicity configuration of the event, as described below.
Scattering amplitudes are computed as sums of Feynman diagrams, where Ω tree and Ω 1−loop stand for the sets of tree and one-loop diagrams. Each tree and one-loop diagram can be factored into a colour factor C(I) and a colour-stripped diagram amplitude, 1 which appear in the calculation of the scattering probabilities (1)- (2). This task must be addressed only once per process. It is handled by algebraically reducing all C(I) to a standard basis {C i } and relating the terms (5) to the interference matrix [9,56] In OpenLoops we use a basis where all colour factors are expressed through products and traces of the SU(3) generators T a ij in the fundamental representation. For the bookkeeping of external momenta and helicities in a process with N p scattering particles we introduce the set of particle indices To characterise the helicity configurations of individual particles we use labels for fermions with helicity s = −1/2, 1/2 1, 2, 3 for gauge bosons with s = −1, 0, 1 0 for scalars with s = 0 The configuration λ i = 0 will also be used to characterise unpolarised particles, i.e. fermions or gauge bosons whose helicity is still unassigned or has already been summed over. Since a particle can have up to four different helicity states, it is convenient to adopt a helicity numbering scheme based on the labelsh which correspond to a quaternary number with λ i ∈ {0, 1, 2, 3} as i th -last digit and all other digits equal to zero. In this way, the helicity configurations (λ 1 , . . . , λ Np ) of the full event can be uniquely identified with the label which corresponds to a quaternary number of N p digits, each of which describes the helicity of a particular external particle. Let us also introduce the single-particle helicity spaces, where we do not include unpolarised states. Finally, the global helicity space for the full set of scattering particles, H h, is defined as where the product is understood as A ⊗ B = {a + b|a ∈ A, b ∈ B}.

Tree amplitudes
At tree level, each colour-stripped Feynman diagram is constructed by contracting two so-called subtrees, which arise by cutting the diagram in two pieces in correspondence of an internal propagator, A generic subtree w a corresponds to the part of a certain Feynman diagram that connects an internal off-shell line with outgoing momentum k a to a subset of external particles, In the numbering scheme (9)-(10), the helicity configurations of a subtree are labeled Figure 1: Diagrammatic representation of a subtree and its numerical construction through the recurrence relation (18). The outgoing momentum k a and the spin or Lorentz index σ a are associated with the off-shell internal line, which is shown explicitly, while the on-shell external particles with helicity h a are implicitly understood. and the corresponding helicity space, H a h a , is defined as Subtrees are represented as complex n-tuples, w σa a (k a , h a ), where σ a is the spinor or Lorentz index of the cut line. With this notation, the contraction (13) takes the form where k b = −k a , h = h a + h b , and summation over repeated indices is implicitly understood. The propagator associated with the cut line is included only in the subtree w a and not in w b .
Subtrees are constructed by means of a numerical recursion that starts from the external wave functions and recursively merges subtrees by attaching their off-shell lines to the vertices that occur in the various tree diagrams. A recursion step for the case of a generic three-particle vertex is depicted 2 in Fig. 1. Its algebraic form reads where the tensor X σa σ b σc describes the vertex that connects w b and w c to w a , as well as the numerator of the propagator that connects to w a . The related denominator, (k 2 a − m 2 a ), appears explicitly in (18). The momentum of the resulting subtree is k a = k b + k c and its helicity is h a = h b + h c . Each recursion step must be repeated for all independent helicity configuration h a ∈ H a = H b ⊗ H c . The corresponding recursion step for quartic vertices reads The recursion ends when all off-shell propagators that have been cut in the beginning can be reconnected, such as to obtain the colour-stripped amplitudes (17) for the full set of tree diagrams.
Note that (18)- (19) are analogous to Berends-Giele recurrence relations for off-shell currents [58]. However, while each subtree corresponds to a single topology, off-shell currents incorporate all possible subtrees associated with a certain internal line. The inefficiency due to the usage of individual subtrees is compensated, especially at one-loop level, by the optimisation opportunities that result from the colour-factorisation identities (4) and from the fact that each subtree can occur in multiple Feynman diagrams at tree and loop level.

One-loop amplitudes
The amplitude of a colour-stripped N -point one-loop diagram, I N , has the general form Symbols carrying a bar denote quantities in D = 4−2ε dimensions, and for the loop momentum q we adopt the decompositionq = q +q, where q andq denote its four-dimensional and (D − 4)-dimensional parts, respectively. The denominators readD where k j is the external momentum flowing into the loop at the j th loop vertex. Internal momenta are chosen such that p 0 = p N = 0, i.e. the momentum flowing through theD 0 =D N propagator isq. The one-loop diagram I N in (20) can be regarded as a sequence of loop segments, where the segment S i consists of a subtree w i that involves a certain set E i of external particles and is connected to the i th loop vertex, v i , and to the adjacent loop propagator associated with D i . Segments associated to a quartic vertex involve two subtrees, w i 1 and w i 2 . The helicity configurations of the whole diagram are related to the ones of individual segments, The trace in (20) stands for the full contraction of the spinor and Lorentz indices of propagators and vertices along the loop. In general, the numeratorN (q) consists of a 4-dimensional part N (q) and an ε-dependent remnant N (q), The terms that result from N (q) are known as rational terms of type R 2 and can be reconstructed separately as counterterms using appropriate Feynman rules [59][60][61][62]. Thus, the full amplitude can be decomposed asĀ and in the following we focus on the nontrivial part which stems from the four-dimensional part of the numerator but involves the D-dimensional denominators (22).
In the open-loop approach, loop diagrams are cut-open in correspondence of theD 0 propagator, in the sense that the loop numerator is constructed as a tensor, where β 0 and β N are the spinor or Lorentz indices associated with the cut propagator. We use the Feynman gauge, which means that the numerator of the gluon propagator is simply −ig β α .
is determined, we take its trace, where summation over repeated indices is implicitly understood.
A key feature of the open-loop approach is that, similarly to the product of loop denom-inatorsD 0 · · ·D N −1 in (27), the loop numerator (28) is factored into a product of segments, Here and in the following, the matrix structure is implicitly understood, i.e. (30) should be interpreted as Segments involving a triple vertex have the generic form where corresponds to the interaction term in (18) and embodies the q-dependent contributions of the loop vertex v i and of the numerator of the adjacent D i propagator. In renormalisable theories, each segment S i (q, h i ) is a q-polynomial of rank R ≤ 1. In the SM, the structure of three-point vertices is while four-point vertices have rank zero, The numerator (30) is built as a sequence of N segment multiplications, and we refer to such a multiplication as the dressing of a segment. In the following, we will represent the state of the numerator after k dressing steps as, where S k+1 , . . . , S N are the still undressed segments, and is a q-polynomial of rank R ≤ k that incorporates the k dressed segments. The symbolĥ k and its counterpartȟ k = h −ĥ k denote, respectively, the helicity configurations of the dressed and undressed parts of a diagram with k dressed segments and N − k undressed ones. They are defined through where h is the global helicity state. The corresponding helicity spaces,Ĥ k andȞ k , are defined by The q-dependent polynomials (36) are denoted open loops, and this notion implicitly includes also the corresponding undressed segments S k+1 , . . . , S N and loop denominatorsD 0 , . . . ,D N . A graphical representation of a generic open loop with k dressed segments is depicted in Fig. 2.
The dressing of open loops is implemented through a numerical recursion whereĥ k =ĥ k−1 + h k . This operation needs to be performed separately for all relevant helicity configurationsĥ k ∈Ĥ k =Ĥ k−1 ⊗ H k and iterated for k = 1, . . . , N . The initial condition is whereĥ 0 ∈Ĥ 0 = {0}, and the identity operator is understood as [1 1] β β = δ β β . In order to capture the full q-dependence of open-loop polynomials we use the tensorial representation and numerical operations are always performed at the level of the tensor coefficients N k; µ 1 ...µr (I N ,ĥ k ). In particular, the explicit form of a step of the dressing recursion (39) is for a three-point vertex as defined in (33). For an efficient implementation the µ 1 . . . µ r indices shall be symmetrised throughout.

Parent-child relations and cutting rule
The original open-loop algorithm can be boosted by using parts of pre-computed (N − 1)-point diagrams as a starting point for the construction of more involved N -point diagrams. This approach is based on so-called parent-child relations, which connect open loops of type and that start with identical segments S 1 , . . . , S k . Since open loops are colour-stripped, i.e. they do not depend on the different colour factors of the loop diagrams it is clear that the dressed parts of (43) and (44) remain identical up to step k of the recursion, i.e.
This allows one to construct the more involved N -point parent diagram (43) using a building block of the simpler (N − 1)-point child diagram (44). Such relations can be applied for any k with 2 ≤ k ≤ N − 2, and the maximum gain in efficiency is obtained when k = N − 2, so that only the last two segments of the parent diagram remain to be dressed.
The availability of child diagrams of type (44) is an obvious prerequisite for the applicability of the parent-child approach, and in QCD most one-loop diagrams turn out to be the parent of a corresponding child. Moreover, the correspondence between the first k dressed segments in (43) and (44) requires an appropriate cutting rule, i.e. a prescription that determines the cut propagator and the dressing direction in a similar way as in (43)- (44).
To this end, for each segment S i with external particles E i = {α i1 , . . . , α in i } we introduce a binary weight defined as the sum of the weights 2 α−1 for each particle α, i.e.
For example, F (S i ) = 2 0 +2 1 +2 3 = 11 for a segment connected to the external legs E i = {1, 2, 4}. For the merging of subtrees S i and S j into a single segment S i ⊕ S j with external legs E i ∪ E j , the weight function obeys the useful distributive property This implies that merged segments always outweigh the original segments. Based on this feature, for N -point diagrams we adopt the cutting rule [9] The fact that the first segment is identified as the one with lowest weight guarantees its stability with respect to the merging of S k+1 ⊕ S k+2 in (43)- (44), while (50) guarantees the stability of the dressing direction for all configurations with k ≥ 2. In this way, the parent-child approach permits to recycle the longest possible open loops.
Note that relations of type (46) can be exploited also for diagrams that involve the same number N of loop propagators and identical dressed segments, but different undressed ones.

Helicity treatment and reduction to scalar integrals
In the following we discuss the operations that are required in order to determine the contribution of a loop diagram I N to the scattering probability density (1) Instead of proceeding via a direct construction of the one-loop amplitude A 1 (I N , h) defined in (27), we start with the associated colour structure C(I N ) defined in (4), and we proceed by building the colour-summed interference with the Born amplitude, combining it with the trace of the colour-stripped loop numerator, and performing helicity sums, Here we use h = 0 for the configuration where all particles are unpolarised, in the sense that their helicities have been summed over. The above operations are performed at the level of q-coefficients in the tensorial representation (41), i.e. in practice we compute After the summation over colours and helicities it is possible to combine all diagrams with the same one-loop topology, i.e. diagrams of type with all possible combinations of segments, that have the same external legs E i and loop propagatorsD i but different external subtrees w α i i and/or loop vertices v α i i . To filter out combinations of segments that are not allowed by the Feynman rules we introduce the tensor In this way, the full set of topologically equivalent one-loop diagrams can be defined as and their sum yields The contribution of the diagrams (58) to the scattering probability (1) reads In OpenLoops 1, the calculation of the coefficients (59) is entirely based on the open-loop approach, but the reduction of the loop integrals (60) to scalar integrals, as well as the numerical evaluation of the latter, are performed by means of external libraries.
By default, OpenLoops 1 adopts the tensor representation on the rhs of (60) and computes the relevant tensor integrals with the Collier library [19], which implements the reduction techniques of [4,63] and the scalar integrals of [64]. One of the powerful features of Collier lies in sophisticated analytic expansions [4] that avoid dangerous numerical instabilities in phase space regions with small Gram determinants.
Alternatively, the reduction to scalar integrals is performed with Cuttools [10], and scalar integrals are computed with OneLoop [65]. The Cuttools program implements the OPP reduction method [5], which is based on double-, triple-and quadruple-cuts of the integrand on the lhs of (1). This requires a large number of evaluations of V(Ω N , q, 0), and the high efficiency of the open-loop representation, V(Ω N , q, 0) = R r=0 V µ 1 ...µr (Ω N , 0) q µ 1 · · · q µr , results in a dramatic boost of the OPP method.
Another key feature behind the high speed of the open-loop method is the fact that, irrespectively of the reduction method, the time-consuming reduction to scalar integrals is performed after summing over colour and helicity degrees of freedom.

Summing helicities and diagrams on-the-fly
In this section we introduce a new technique that makes it possible to sum helicities and to merge different one-loop diagrams on-the-fly, i.e. after each step of the open-loop dressing recursion. Besides boosting the open-loop algorithm in a significant way, this approach is also a key aspect of the on-the-fly reduction technique introduced in the Section 4.

On-the-fly helicity summation
In the original formulation of the open-loop method, helicity sums (53) are performed at the end of the dressing recursion. This implies that the k th dressing step (39) needs to be performed for all helicity configurations of the dressed segments S 1 , . . . , S k . This feature, combined with the fact that the number of relevant helicity states and the cost of a single dressing step scale exponentially with k, result in a very rapid growth of the CPU cost of dressing operations in the course of the open-loop recursion. To avoid this negative trend, in this section we introduce a method that exploits the factorisation properties of the open-loop representation (30) in a way that makes it possible to perform helicity sums on-the-fly, after the dressing of each new segment.
The idea, sketched in Fig. 3, is that, upon taking the interference of open loops with the Born amplitude, it is possible to sum over the helicities of all dressed segments, irrespectively of the presence of still undressed segments. To introduce the technical aspects of this approach, let us rewrite the interference (53), between the colour-summed Born term U 0 (I N , h) and the one-loop numerator, as To take advantage of the factorisation of loop segments on the rhs of (62), we then postpone the trace operation and generalise (62) by defining where the interference with U 0 (I N , h) is restricted to the first k dressed segments of the open loop, and the corresponding helicities,ĥ k = h 1 + · · · + h k , are summed over. As a result, the first k segments become effectively unpolarised, and (63) depends only on the helicities of the remaining N − k undressed segments,ȟ k = h k+1 + · · · + h N . Due to this dependence, which is induced by the fact that the Born term (51) depends on the helicities h =ĥ k +ȟ k of all external particles (37), the sums over h 1 , . . . , h k in (63) cannot be entirely factorised. However they can be cast in the nested form which highlights the fact that each segment becomes effectively unpolarised after its dressing.
In practice, in analogy with the standard open-loop recursion (39), the helicity-summed open loops (63) are constructed with the recurrence relation where the helicities h k ∈ H k of the dressed segment are summed on-the-fly. To this end, the dressing operation needs to be performed for allȟ i.e. a fully undressed open loop is given by the interference of its colour structure with the Born amplitude, whose helicity states h live in the global helicity space H. At each dressing step, helicity degrees of freedom are reduced by a factor equal to the number of helicity states of the dressed segment, i.e. by factor two for each external fermion or massless vector boson and a factor three for each external massive vector boson in the segment.
At the end of the recursion, when all N segments are dressed, no helicity dependence is left over (ȟ N ≡ 0), and the unpolarised loop numerator (53) is obtained by taking the trace The recursion (65) is understood as matrix multiplication, in the tensor representation which leads to the same tensorial recursion as in (42).
In summary, performing helicity sums on-the-fly leads to a decreasing number of helicity degrees of freedom when the number k of dressed segments increases. In this way, the effect of the growing CPU cost of dressing operations at large k can be strongly attenuated. The price to pay is that the parent-child approach (43)- (46) is not applicable anymore, due to the fact that (66) incorporates the colour structure C(I N ) of the whole one-loop diagram. However, as we will see in Section 3.2, the parent-child relations can be replaced by a similarly efficient method based on the merging of topologically equivalent one-loop diagrams. Finally, let us note that the recursion (65)- (66) is not applicable to squared one-loop amplitudes. For this case we still rely on the original open-loop algorithm.

On-the-fly merging of topologically equivalent open loops
The key idea behind the recursion (65)- (66) is that, taking the interference between the Born amplitude and the one-loop colour structure C(I N ) as initial condition makes it possible to anticipate operations that are usually performed after completion of the construction of a oneloop diagram. In particular, such operations become applicable on-the-fly after the dressing of individual loop segments. This technique will be denoted as on-the-fly approach, and its applicability goes well beyond helicity sums.
As sketched in Fig. 4, the on-the-fly technique can be extended to the double sums over helicity states and topologically equivalent loop diagrams in (59). The idea is that, rather than being constructed one by one, the topologically equivalent diagrams defined in (55)-(58), can be merged in a recursive way by summing over the various subtrees S α i i as soon as they get dressed. To this end, let us define subsets of diagrams, Ω k N ⊂ Ω N , that share the same undressed where the tensor δ, defined in (57), filters out one-loop diagrams that are not allowed by the Feynman rules. By construction, all diagrams in the set (71) must undergo identical future dressing steps, which can be performed only once after merging the first k segments. This operation can be organised in a very similar way as helicity summations in Section 3.1. Technically, taking as a starting point the nested helicity sums in (64), it is sufficient to generalise the loop segments and the Born term (66) by replacing and to extend the summation over the helicities h 1 , . . . , h k of the dressed segments to the corresponding "diagrammatic" degrees of freedom α 1 , . . . , α k . This leads to the identity which defines an open-loop object with fixed undressed segments {S α k+1 k+1 , . . . S α N N } and helicitieš h k = h k+1 + · · · + h N that incorporates all possible chains of dressed segments {S α 1 1 , . . . , S α k k } forming a valid Feynman diagram, summed over the corresponding helicities h 1 , . . . , h k .
Note that the dependence of (73) on the helicities h k+1 , . . . , h N and diagrammatic indices α k+1 , . . . , α N of the undressed segments is due to the fact that the Born term defined in (66)  and (72) retains the helicity dependence of the Born amplitude as well as the tensor (57) and the colour structure of the whole one-loop diagram.
In analogy with (39) and (65), the open-loop objects (73) can be constructed with the recurrence relation where helicity sums and diagram merging are performed on-the-fly. An explicit example of an on-the-fly merging step is illustrated in Fig. 5. Similarly as for (65), the recursion (74) is implemented in the form of tensorial relations (42). The relevant initial conditions at k = 0 are where each fully undressed contribution corresponds to an individual diagram, with helicities h that live in the full helicity space H. Let us point out that, thanks to the absorption of the colour factors C(I α 1 ...α N N ) in the Born interference term (75), in (74) it is possible to merge parts of diagrams that carry different colour structures in a single object, while respecting the exact colour dependence.
After N recursion steps one obtains a single open-loop object V N (Ω N N , q,ȟ N ), which merges the full set of topologically equivalent diagrams (Ω N N ≡ Ω N ) and is entirely unpolarised (ȟ N ≡ 0). At this stage, taking the trace that closes the loop one arrives at which is equivalent to (59).
As demonstrated in Section 6.2, performing helicity sums and merging diagrams on-the-fly yields a very significant efficiency improvement with respect to the original open-loop algorithm. More precisely, if helicity sums are performed at the end of the recursion as in (53), the merging approach and the parent-child relations (46) permit to achieve a similar speedup factor of the order of two. However, contrary to the parent-child technique, the on-the-the-fly approach is applicable both to diagram merging and helicity sums. This leads to a further speed-up factor that can vary from two to three, depending on the process.
As we will see in Section 4, the on-the-fly approach will be a crucial ingredient in order to arrive at a new efficient algorithm that combines the operations of open-loop dressing and tensor reduction at the level of each individual step of the open-loop recursion.

On-the-fly reduction of open loops
In the original version of the OpenLoops program, the construction of integrand numerators and the reduction to scalar integrals are performed independently of one another using different tools. Open-loop numerators of N -point diagrams are constructed by recursively dressing N segments as described in Section 2.3. Each step of the recursion can increase the tensor rank by one, and, upon symmetrisation of all q µ 1 . . . q µr monomials with r ≤ R, open-loop polynomials of rank R involve R+4 4 independent tensor coefficients. Thus their complexity grows exponentially with the number of recursion steps. For instance, open loops with R = 6 and R = 7 involve, respectively, 210 and 330 components, while only 5 components are present for R = 1. As illustrated in the left plot of Fig. 6, in the original open-loop algorithm tensorial complexity keeps growing until the maximum rank R ≤ N is reached at the end of the dressing recursion. At this stage, upon summation of helicity configurations and loop diagrams with equivalent one-loop topology, tensor integrals are reduced to scalar integrals using external libraries, such as Collier [66] or Cuttools [10], as described in Section 2.5.
Dealing with intermediate results with a large number of tensor components requires a considerable amount of computing power, both for the reduction of high-rank objects and at the level of the tensorial structure of the open-loop recursion (42), which needs to be performed for each relevant helicity configuration and each [. . . ] β k β 0 component. These operations can be significantly accelerated by means of the techniques introduced in Section 3. Nevertheless, they remain the most CPU intensive aspect of multi-particle one-loop calculations in OpenLoops.
Motivated by the above considerations, in this section we introduce a new approach that avoids the appearance of high-rank objects at any stage of the calculation. This is achieved by extending the on-the-fly approach introduced in Section 3 to the reduction of open loops. In this way, interleaving the operations of open-loop dressing and tensor reduction, we build a single recursive algorithm, where each increase of tensorial rank caused by a dressing step is compensated by an integrand-reduction step. As illustrated in the right plot of Fig. 6, the onthe-fly reduction approach avoids the appearance of any intermediate object with rank higher than two. Besides the CPU cost needed for the processing of high-rank objects, this alleviates also possible memory issues due to their storage.

On-the-fly integrand reduction
For the on-the-fly reduction of open-loop polynomials we are going to use the method of [2], which permits to reduce rank-two monomials of the loop momentum through identities of the form The rank-one polynomial on the rhs is a linear combination of four loop denominators, D 0 , . . . , D 3 , and the corresponding tensor coefficients, A µν j and B µν j,λ , depend only on the three external momenta p 1 , p 2 , p 3 . The coefficients of loop denominators are labeled with indices j = 0, . . . , 3, while j = −1 is used for the constant parts. Their explicit expressions are presented in Section 5.2.
The identity (78) provides an exact reconstruction of q µ q ν in terms of four-dimensional loop denominators, but can be easily generalised to D-dimensional denominators by replacing Note thatq 2 contributions resulting from the terms B µν j,λ D j with j = 0, 1, 2, 3 must cancel among each other in (78) since they generate rank-three terms of type q λq2 that are not consistent with the rank-two structure on the lhs. Thus the substitutions (79) generate only an extra term −q 2 A µν 0 on the rhs of (78).
) correspond, respectively, to the rank-two polynomial on the lhs of (80) and its reduced rank-one counterparts on the rhs. The red crosses indicate the pinching of theD j propagators in the V k (Ω k N [j]) terms with j = 0, . . . , 3. SinceD 0 =D N , in our graphical representation theD 0 denominator is located on the N th segment.
The integrand reduction (78) holds at the integrand level, irrespectively of the presence of extra loop denominators D 4 , . . . , D N −1 or additional q-dependent factors that may multiply the q µ q ν monomial. These properties, in combination with the factorisation of open loops into segments, make it possible to apply the reduction (78) at any intermediate stage of the recursion (74). This on-the-fly reduction approach is illustrated in Fig. 7, and the corresponding reduction identities for N -point integrands at step k of the dressing recursion have the form where Ω k N [j] for 0 ≤ j ≤ 3 denote the (N − 1)-point subtopologies that arise from Ω k N by pinching theD j propagator, while terms with j = −1 on the rhs correspond to the original topology, Ω k N [−1] = Ω k N . Note that the denominatorsD j can be pinched irrespectively of whether the related S j (q) segments are already dressed or not. In (80) we adopt the approach of Section 3.2, where open-loop polynomials incorporate the colour-summed interference with the Born amplitude as well as helicity sums and merging of all dressed segments. However, for simplicity, the bookkeeping of helicities, merged diagrams, and [. . . ] β k β 0 indices is kept implicit.
The partially dressed open loops on the lhs and rhs of (80) have the general form where four-dimensional loop-momentum components are accompanied byq 2 terms that arise from (78)-(79). As discussed in Section 4.4, only a small fraction of theq 2 -dependent terms can lead to non-vanishing contributions at the end of the recursion. Thus, in order to avoid the proliferation of tensor coefficients, allq 2 terms that are expected to vanish are identified and discarded in advance at each dressing and reduction step.
In general, the relation (80) allows one to reduce any polynomial ,q). But, in practice, the reduction (80) can be interleaved with the open-loop dressing recursion in a way that the tensor rank never exceeds two. For R = 2 the coefficients of the rank-one open-loop polynomials that arise from the reduction (80) read The transformations (80) Rank-two open loops with only N = 3 loop propagators can be reduced to rank one in a very similar way [2]. The relevant identities (see Section 5.3) have the same structure as (78) but involve only three reconstructed propagators,D 0 ,D 1 andD 2 . Moreover, they do not hold at the integrand level, but only upon integration over the loop momentum. The tensors A µν j and B µν j,λ for the case N = 3 depend only on p 1 and p 2 . They are obtained from the ones for N = 4 by simply setting to zero the terms involving p 3 (see Section 5.3).
The on-the-fly reduction formula for N = 3 has the form where V k (Ω k 3 ,q) is an open loop of rank R = 2 that results from a certain number k ≥ 2 of dressing steps and k − 2 or less reduction steps. Possible undressed segments are denoted as S rem (q), and (83) is valid only for terms S rem (q) of rank zero or one. In general this allows for only N rem ≤ 1 extra segments, i.e.
and these two cases are sufficient to cover all relevant N = 3 topologies and pinched subtopologies in renormalisable theories. 3 The relations between the rank-two polynomial V k (Ω k 3 ,q) and its reduced counterparts V k (Ω k 3 [j],q) have the same form as in (82).
Although the above three-point reduction holds only at the integral level, the fact that the term S rem (q) can be factorised makes it possible to apply (83) as soon as the dressed open loop V k (Ω k 3 ,q) has reached rank-two, independently of the remaining part of the numerator. In summary, exploiting the fact that open loops are factorised into segments, the identities (78) and (83) can be applied on-the-fly during the dressing recursion, while an arbitrary number of segments is still undressed. 4 Thus, dressing and reduction steps can be interleaved in a way that the increase of tensorial rank resulting from dressing is promptly compensated through an on-the-fly reduction. More precisely, on-the-fly reduction steps are applied to diagrams and pinched sub-diagrams with N ≥ 3 at those stages of the recursion where the next dressing step would generate a rank-three object. 5 The reduced rank-one objects with N and N − 1 loop propagators are further dressed and possibly reduced until one arrives at fully dressed open loops of rank R ≤ 2 for all two-and higher-point contributions. An this stage, the open-loop matrix structure can be eliminated by taking the trace (77), and all rank-two objects with N ≥ 3 can be reduced to rank one with a final reduction step of type (78) or (83).
After the above dressing and on-the-fly reduction steps, the following types of integrals remain to be reduced: (i) integrals with N ≥ 5 loop propagators and rank R = 1, 0; (ii) integrals with N = 4, 3 loop propagators and rank R = 1; (85) (iii) integrals with N = 2 loop propagators and rank R = 2, 1.
For their reduction to scalar integrals with N ≤ 4 we use a combination of integral reduction and OPP reduction identities as described in Appendix A.

Merging pinched topologies
Since it allows one to keep the tensor rank low at any stage of the calculations, the on-the-fly reduction approach has the potential to accelerate one-loop calculations in a significant way. However, some aspects of the on-the-fly reduction approach could lead to a dramatic increase of the computational cost. First, the fact that the reduction is performed when the loop is still open, i.e. before taking the trace (77), implies that the entries of the [. . . ] β k β 0 matrix have to be processed as 16 independent objects. Second, the reduction has to be performed for all not yet summed helicity configurations of the undressed segments. Third, each reduction step (80) generates four pinched topologies that need to be processed as independent contributions in subsequent dressing and reduction steps.
Due to the proliferation of pinched subtopologies, the naive iteration of on-the-fly reduction steps would lead to a dramatic increase of the CPU cost. Fortunately, this can be avoided by means of the merging technique introduced in Section 3.2, which makes it possible to absorb pinched N -point open loops into unpinched (N − 1)-point open loops, in such a way that they do not need to be processed as separate objects. As explained in the following, the merging of pinched subtopologies requires a different implementation depending on whether the pinch belongs to the dressed part of the open loop or not, as well as for the special case of aD 0 pinch.

Pinching a dressed propagator
Let us consider the on-the-fly reduction of an N -point open loop with k dressed segments, focusing on contributions that result from the pinching of aD j denominator with j < k, Here, consistently with the notation of Section 3.2, we have restored the indices α i of the various undressed segments, while helicity indices are kept implicit. The pinched propagator in (86) is entirely dressed and merged, in the sense that both adjacent segments, S j and S j+1 , are dressed and merged. Thus, for what concerns all future dressing and reduction steps, apart from the disappearance of theD j denominator, the pinch has no effect. Consequently, the above contribution can be absorbed into The pinched open loop (87) can be merged with corresponding unpinched open loops to form a single (N − 1)-point object with k − 1 dressed segments. The corresponding formula reads A diagrammatic representation of this identity is given in Fig. 8. The resulting object, for which we introduce the symbol V which can directly connect theD j−1 andD j+1 propagators to subtrees involving the external legs attached toŵ j andŵ j+1 (see Fig. 8). In QCD, this turns out to be the case for most pinched configurations.
Moreover, pinched N -point configurations of the form (87) can also be merged with other pinched higher-point diagrams that get the relevant pinches in past or future reduction steps. Thus, pinched objects of type (87) will always be denoted as absorbable, irrespective of whether corresponding unpinched (N − 1)-point Feynman diagrams exist or not.
The merging procedure (89) needs to be applied after any on-the-fly reduction step that creates new pinched objects. Thus merging operations have to be interleaved with iterated dressing and reduction steps. Since pinched N -point objects are absorbed into lower-point objects, the algorithm should start with the dressing and on-the-fly reduction of open loops with N = N max , and continue towards lower N . The merging operation (89) starts at stage N − 1 = N max − 1 and is applied after every dressing step, together with the merging of unpinched open loops (see Section 3.2). Note that, due to the iterative nature of the algorithm, the term V α k ...α N −1 k on the rhs of (89) can be the result of multiple pinching and merging steps.

Pinching an undressed propagator
The possibility to absorb pinched open loops as in Fig. 8 is based on the fact that all future dressing and reduction operations can be performed only once at the level of a merged object. Thus, the undressed segments of pinched and unpinched open loops should be identical. However, the segments connected to the D j−1 and D j propagators are different for pinched and unpinched terms. In the pinched case there are two separate segments, which involve w j and w j+1 and require two subsequent dressing steps. In contrast, unpinched open loops require a single dressing step, since w j and w j+1 are combined in a single segment.
It is thus clear that pinched open loops can be absorbed only after the dressing of the segments that lie on the two sides of the pinch. If this is not the case, i.e. when aD j pinch is applied to an open loop with k ≤ j dressed segments, its absorption becomes possible only after (k − j + 1) additional dressing steps, which result into where S k+1 . . . S j+1 are dressed. Note that in (90) we also sum over all possible α i , and use indices α j+1 , . . . , α N −1 with shifted labels for the undressed segments S j+2 , . . . , S N on the right side of the pinch.
As illustrated in Fig. 9, the dressing operation (90) combined with the relabeling (88) brings the pinched open loops in a configuration that can be absorbed with the merging formula (89). However, the absorption of undressed pinched propagators is more involved than the simplified picture of Fig. 9. Pinched open loops can require more than one dressing step to become absorbable, in which case, in general, also new reduction steps are needed in order to keep the tensor rank below three. Such new reductions generate additional pinches, and their iteration can lead to a proliferation of multi-pinched configurations. Thus, to avoid dramatic inefficiencies, it is crucial to minimise the number of required dressing steps by keeping pinchable propagators as close as possible to the dressed part of the numerator. This is why we choose to perform the on-the-fly reduction using the denominators D 0 , . . . ,D 3 . 6 In this way, when the first two segments are dressed and the first reduction step is applied (see Fig. 7), the various pinched propagators are located at most one left-dressing step (D 0 ) and two right-dressing steps (D 3 ) away from the dressed part.
In order to identify pinches that cannot be directly absorbed and to anticipate how they propagate through the recursion, let us consider generic open-loop configurations before the creation of a new pinch through a reduction step. At this stage the rank R must be equal to two. We first consider the very first reduction step, which can occur after k ≥ R = 2 dressing steps, and we focus on aD 2 pinch in the case where only k = 2 segments are dressed. In this case, an on-the-fly reduction step and a subsequent dressing step yield i.e. theD 2 pinch can be brought in the standard form (87) and can thus be absorbed into unpinched contributions. The same considerations apply also toD 1 andD 0 pinches (see Section 4.2.3) and can be generalised to any step of the recursion, since the structure of the dressed parts on the lhs and rhs of (91) is the same.
AlsoD 3 pinches can be absorbed in a similar way in case there are at least three dressed segments before the reduction step. Otherwise, when only two segments are dressed, the combination of an on-the-fly reduction step with a subsequent dressing step leads to where the pinch is applied on the last dressed segment. Unless another dressing step can be applied before reaching rank three, this kind of pinch cannot be absorbed without a further reduction and dressing step. Dressing one more segment allows one to absorb the original D 3 pinch as well as newD 0 ,D 1 , andD 2 pinches that arise from the new reduction step. However, the new reduction leads again to a configuration with a D 3 pinch on the last dressed segment, Again, the dressed parts on the lhs and rhs of (93) have the same structure, which implies that suchD 3 -pinched configurations are stable with respect to further reduction steps. Thus, open loops with multiple non-absorbable pinches do not occur, and the only type of configuration with a single non-absorbable pinch is the one in (93).

Pinching theD 0 propagator
Finally, we consider open loops with k dressed segments and a pinchedD 0 denominator, For convenience, here and in the following we putD 0 at the end of the chain of denominators. Similarly as for the cases discussed in Section 4.2.2, the above pinched configuration becomes absorbable only when the segments connected to theD 0 propagator, i.e. S N (q) and S 1 (q), are dressed. However, this happens only at the end of the standard dressing recursion.
In order to anticipate the absorption of the most advanced pinch one could replace the denominatorsD 0 , . . . ,D 3 used for the on-the-fly reduction byD 1 , . . . ,D 4 , which lie all directly on the right side of the cut. However, the absorption of eachD 4 pinch would require up to three extra dressing steps and two related reduction steps, resulting in the creation of multiple new pinches. This problem can be circumvented by observing that theD 0 propagator lies only one step away from the dressed part of the open loop, if one reverts the dressing direction. Therefore, as illustrated in Fig. 10, the pinchedD 0 propagator can be entirely dressed by means of a single left-dressing step. This operation results in where the left multiplication of the N th segment should be understood as Technically, as indicated on the rhs, this operation can be easily implemented through a standard right-dressing step upon transposition of the input matrices and back-transposition of the result. 7 As usual, before merging with unpinched open loops, the propagators and undressed segments that lie on the right side of the pinch need to be brought back in standard form. In case of aD 0 pinch, all remaining N − 1 denominators and segments preserve their relative position along the open loop. Thus, onlyD N −1 needs to be relabeled, since it assumes the role of the new D 0 . Moreover, the standard form D 0 = q 2 − m 2 0 requires a loop momentum shift q → q − p N −1 for the entire open loop. The corresponding reparametrisations for the various denominators and segments read Apart from the loop momentum shift, q → q − p N −1 , this formula is entirely analogous to (89). Let us note that, since the shift is applied to a single term, the identity (98) holds only upon loop momentum integration. Nevertheless, as far as the correctness of final results at integral level is concerned, it can be safely applied at the integrand level.
As demonstrated in Section 6.2, using the on-the fly reduction with pinch absorption in combination with the on-the-fly techniques of Section 3 results in a very fast and numerically stable one-loop algorithm. In particular, as compared to the original version of OpenLoops, we find very significant improvements, both in terms of speed and numerical stability. Actually, using only the new techniques of Section 3 without on-the-fly reduction yields even higher CPU efficiency. However, as we will see, the moderate extra CPU cost that results from the on-the-fly reduction approach is counterbalanced by a very significant gain in numerical stability, which implies a reduced usage of quadruple precision for exceptional phase space points.

Cutting rule
Here we have applied our standard labeling scheme, where the cut is located between S N and S 1 , i.e. on theD 0 propagator, while the dressing recursion starts with S 1 and is directed towards S 2 . This configuration will be referred to as a S N /S 1 ordered cut. Since the labeling scheme is a consequence of the position of the cut, and not vice versa, we have to define a cutting rule that selects S N /S 1 out of all possible S i /S j cuts.
The cutting rule should enable the merging of pinched subtopologies that arise from (99) by pinching certain propagatorsD j , i.e. by combining S j and S j+1 in a single segment S j ⊕ S j+1 . To this end, unless the cut propagatorD 0 is pinched, the cutting rule should guarantee that the position of the cut and its direction remain unchanged after a pinch. More explicitly, the desired cut configurations after aD j pinch with j > 0 are In this way, as required for the merging operations described in Sections 4.2.1-4.2.2, the dressing of pinched and unpinched objects always starts and ends with segments that contain the external legs attached to S 1 and S N , respectively. In the case of aD 0 pinch, where the original cut propagator disappears, in order to enable the merging of left-dressed pinched subtopologies described in Section 4.2.3, the cut should be moved to the left of S N ⊕ S 1 , so that In order to ensure, at least in part, that pinched topologies are cut as in (100)-(103), we replace the original cutting rule (49)-(50) by the new prescriptions where the weights F (S a ) are defined in (47). The key property of the above cutting rule is the pinch-invariance of the selection rule, which determines the first segment S 1 . In fact, if this condition is realised for (99), then it is guaranteed to hold also for all pinched configurations (100)-(103). For j = 0, 1 this is an obvious consequence of the fact that F (S 1 ⊕ S a ) > F (S 1 ) for any a = 1. In the other cases, the fact that S 1 remains the first subtree in spite of the appearance of a new pinched subtree S j ⊕ S j+1 with weight F (S j ) + F (S j+1 ), is guaranteed by This inequality is an automatic consequence of (104) and of the binary nature of the weights (47). This is easily understood by observing that, due to 2 Np−1 = 1 +

Np−1 α=1
2 α−1 , the last external particle (α = N p ) outweighs the ensemble of all other particles. Therefore the last external particle must belong to the leading-weight subtree S 1 , which implies that F (S 1 ) ≥ 2 Np−1 and leads to (106).
Unfortunately, the direction rule (105) is not sufficient in order to preserve the direction of the cut. For instance, in case of aD 1 pinch, the desired cut (100) requires F (S N ) > F (S 3 ), which does not automatically follow from F (S N ) > F (S 2 ). More generally, apart from the case of a D 3 pinch, where the second and last subtree do not change, there is no guarantee that the condition (105) preserves the direction of the cut.
Thus the above cutting rule does not allow one to absorb all pinched open loops. Nevertheless, as demonstrated in Section 6.2, it is sufficient to obtain a very fast on-the-fly reduction algorithm. Moreover, it should be possible to further increase CPU efficiency, either by means of an improved cutting rule or by inverting the dressing direction after certain pinches.

Rational terms of type R 1
As discussed in Section 4.1, each reduction step of type (80) and (83) generates termsD i −D i = q 2 that account for the mismatch between the D-dimensional and four-dimensional parts of loop denominators. The resulting tensor integrals with (D − 4)-dimensionalq 2 terms in the numerator can give rise to finite terms. As is well known [2,59], these so-called rational terms of type R 1 can arise only in the presence of 1/(D − 4) poles of ultraviolet type. Thus, vanishing integrals of type R 1 can be easily identified by means of the simple power counting criterion if s ≥ 1 and r + 2s + 4 < 2N. (107) In a renormalisable theory, where each loop segment increases the rank at most by one, r + 2s ≤ N and all integrals with N ≥ 5 and s ≥ 1 vanish. Thus, the only non-vanishing integrals of type R 1 that remain at the end of the on-the-fly reductions of Section 4.1 are [2] d Dqq d Dqq d Dqq The power counting criterion (107) can be exploited in a way that makes it possible to discard irrelevant terms of type R 1 at any intermediate step of the open-loop recursion. To this end, for each number k of dressing steps we anticipate the maximum rank R max k of the segments S k+1 (q), . . . , S N (q) that remain to be dressed. Given this information, it is clear that monomials of type q µ 1 q µ 2 . . . q µrq2s in the dressed open loop cannot give rise to terms of D-dimensional rank higher than r + 2s + R max k at the end of the recursion. Thus, we can anticipate that, (112) The systematic application of this condition allows one to filter out a very large number ofq 2 terms, thereby improving the efficiency of the algorithm.
Note also that the unpinched contributions V k (Ω k N [−1]) in the reduction identities (82) involve terms that reduce r + 2s − 2N , and thus the degree of ultraviolet divergence, by one and two. Depending on the values of N and R max k , this can result in vanishing R 1 contributions that can also be immediately discarded. For instance, in the reduction of d D q q µ 1 q µ 2q 2 D 0D1D2D3 all unpinched contributions apart from those of type (111) can be neglected.

Reduction identities and numerical stability
This section deals with the reduction method of [2], which provides the basis of the on-the-fly reduction approach of Section 4.1. In Sections 5.1-5.3 we outline the derivation of the tensor coefficients A µν j and B µν j,λ in the reduction identities (78) and (83). In doing so we set the stage for Section 5.4, where we discuss numerical instability problems and present a systematic approach for their solution.

The reduction basis
The reduction identities of [2] are based on a decomposition of the four-dimensional loop momentum, in a basis l 1 , . . . , l 4 , formed by massless momenta in two orthogonal planes, This reduction basis is defined in terms of the external momenta p 1 , p 2 , which enter the propagators D 1 ,D 2 . The basis momenta l 1,2 are chosen in the plane spanned by p 1 and p 2 , while l 4,3 lie in the plane orthogonal to p 1 ,p 2 and are defined as where u,v are massless spinors. This definition of l 3,4 implies l * 3 = e iχ l 4 , where χ is twice the phase difference between the u and v spinors. The normalization of the basis is chosen such that and the α 1,2 coefficients in (115) read 8 where ∆ is related to the rank-two Gram determinant ∆ 12 = det(p i · p j ) via The Gram determinant is related to the normalisation factor γ via and these two parameters play a critical role for the stability of the reduction. In fact, in the limit of vanishing Gram determinant, (117) and (120) imply that ( 8 The sign of the square root is chosen such that ± √ ∆ = sign(p 1 · p 2 ) √ ∆. This guarantees that the limits lim which implies that all light-like basis momenta l i become parallel to each other 9 leading to severe numerical instabilities in the decomposition (113).
Note that in [2] the basis momenta l i contain an additional normalisation factor 10 which diverges like 1/ √ ∆ when ∆ → 0. As a consequence, in [2] numerical instabilities are in part visible as factors β in the reduction formulas and in part hidden in the definition of the basis vectors. Instead, the basis momenta defined in (115)-(116) are stable in the ∆ → 0 limit. Thus, in the reduction formulas presented in Section 5.2 and 5.3, instabilities related to the Gram determinant (119) are fully manifest in the form of inverse powers of the parameter γ ∝ ∆. More precisely, for p 2 1 = 0 and p 2 2 = 0, Gram-determinant instabilities arise also from α 2 = ±p 2 2 /(2 √ ∆). However, the parametrisation adopted in Section 5.3 ensures that α 2 is always regular.

On-the-fly box reduction
In the following we discuss the reduction identity (78), which can be rewritten in a slightly more compact form as with D −1 = 1. Since q µ q ν is reconstructed in terms of D 0 , D 1 , D 2 , D 3 , we denote (123) as box reduction identity, although it is applicable to any integrand with N ≥ 4 loop propagators. The starting point for its derivation is given by the decomposition (113). Since the basis momenta l 1,2 and l 3,4 lie in mutually orthogonal planes, it is natural to split the loop momentum into corresponding components, with q µ = c 1 l µ 1 + c 2 l µ 2 and q µ ⊥ = c 3 l µ 3 + c 4 l µ 4 . The respective c i coefficients can be easily related to scalar products (q · l i ) using (114) and (117). This leads to, 11 The q µ component can be directly reduced to rank zero by reconstructing the scalar products (q · l 1,2 ) in terms of D 0 , D 1 , D 2 using 9 As a consequence of l µ 1,2 ∈ R and (l 1 · l 2 ) = E 1 E 2 (1 − cos(θ 12 )) → 0, the first two basis vectors become parallel to each other, i.e. l µ 1,2 → ξ 1,2 η µ with ξ 1,2 ∈ R and η µ ∈ R. As for the C-valued basis vectors l 3,4 , using l * This yields with In order to obtain an identity that reduces also q ⊥ to a linear combination of D 0 , . . . , D 3 one has to move to rank two by squaring (124) in a way that does not generate q µ q ν terms. To this end one can write Applying (127) to the rhs of (129) reduces q µ q ν and q µ q ν ⊥ to rank one, such that only withl 3,4 = l 4,3 , remains to be reduced. This is achieved by means of the relations [2] ( where the quadratic terms (q · l i )(q · l j ) with i, j = 3, 4 are reconstructed in terms of D 0 and D 3 using also the external momentum p 3 .
Combining (127)-(132) leads to the reduction identity (123) with where we have introduced The relations between the A µν j and B µν j,λ tensors in the first two lines of (133) follow from the requirement that terms of rank different from two vanish on the rhs of (123). Note also that the tensor L µν 34 can be rewritten in terms of l 1 , l 2 and g µν as

On-the-fly triangle reduction
The identity (83), which reconstructs q µ q ν in terms of D 0 , D 1 , D 2 at the integral level, will be denoted as triangle reduction. Its derivation is based on the observation that the only terms that involve D 3 and p 3 in Section 5.2, i.e. the squared scalar products (q · l 3 ) 2 and (q · l 4 ) 2 in (132), do not contribute in three-point integrals of rank R ≤ 3. More precisely [2], As a consequence, the derivations of Section 5.2 are also applicable to three-point functions at the integral level upon replacing (130) by In this way one arrives at the reduction identities where S(q) = S + S ρ q ρ is an arbitrary rank-one polynomial, and the tensors A µν j and B µν j,λ are obtained from (133) through the trivial replacements

Treatment of Gram-determinant instabilities
As pointed out in Section 5.1, when the rank-two Gram determinant ∆ 12 tends to zero the reduction basis (115)-(116) becomes degenerate. This leads to spurious singularities that manifest themselves as factors γ −k ∝ ∆ −k 12 in the reduction identities. In practice, the residues of ∆ −k 12 poles are suppressed at O(∆ k 12 ) as a result of subtle numerical cancellations between various contributions. Thus, for ∆ 12 → 0 the results of the reduction are finite but suffer from severe numerical instabilities. As can be seen from (133), spurious singularities reach the maximum power k = 2, i.e. each reduction step results in a numerical instability that scales quadratically in the inverse Gram determinant ∆ 12 .
The reduction (133) involves also spurious singularities related to the rank-three Gram determinant, ∆ 123 , which enter through the terms (l 3 · p 3 ) −1 ∝ |∆ 123 | −1/2 [2]. However, as compared to the ∆ 123 → 0 case, ∆ 12 → 0 instabilities are statistically more likely and are also more enhanced due to their ∆ −2 12 scaling behaviour. In fact, studying high statistics samples for various representative processes, we have found that the numerical instabilities of the on-the-fly reductions of Sections 5.2-5.3 are very strongly correlated to the parameter γ ∝ ∆ 12 . Therefore, as we will see in the following, avoiding ∆ 12 → 0 spurious singularities in a systematic way makes it possible to reach excellent numerical stability.

Box reduction
In the case of the on-the-fly box reduction (123), poles in ∆ 12 arise only through the factors 1/γ 2 and 1/γ in (133), which are a direct consequence of the choice of the momenta p 1 , p 2 for the construction of the reduction basis (115)-(116). Since rank-two Gram-determinant instabilities depend on only two of the three available momenta, they can be easily avoided by constructing the basis with p 1 , p 3 or p 2 , p 3 instead of p 1 , p 2 , depending on the values of the respective Gram determinants ∆ 13 , ∆ 23 , ∆ 12 .
In practice, in order to avoid small rank-two Gram determinants we perform the box reduction upon a permutation, which orders loop denominators with related momenta and masses in such a way that The scales Q 2 ij , which render the above ratios dimensionless, are defined as the largest element of the respective Gram matrices, i.e.
Note that the permutation (140) can be applied without changing the order of the corresponding segments S i (q), i.e. without any modification of the open-loop dressing recursion. Moreover, the choice of the optimal permutation (140)-(141) can be done in a fully flexible way at runtime and locally in individual reduction steps, depending on the kinematics of the actual phase space point. In practice (140) is applied only to compute the reduction basis and the coefficients (133), which are then converted back to the original ordering.
Avoiding a spurious ∆ 12 → 0 singularity with (140)-(141) does not guarantee its disappearance in future reduction steps. In fact, all reduced contributions where D 1 and D 2 remain unpinched will still involve the same small Gram determinant. However, the permutation trick (140) can be iterated as long as N ≥ 4 loop denominators are available. In this way, rank-two Gram-determinant instabilities can be isolated in triangle contributions, which arise only at later steps of the open-loop recursion for loop diagrams with N > 3.

Triangle reduction
For the case of triangle topologies one can show that, excluding soft and collinear regions, vanishing ∆ 12 Gram determinants can arise only from the t-channel topology depicted in Fig. 11, where the triangle exhibits two space-like external momenta, p 1 and p 2 , and a time-like external momentum, p 2 − p 1 . Since the Gram determinant vanishes when p 2 1 → p 2 2 , we adopt the parametrisation Figure 11: Triangle t-channel (sub)topology that gives rise to ∆ 12 → 0 numerical instabilities when (p 2 − p 1 ) 2 = 0 and p 2 1 → p 2 2 .
where p 1 and p 2 can be ordered such that δ > 0. The parameters ∆ and γ are related to δ via i.e. the ∆ → 0 limit corresponds to δ → 0.
In order to avoid numerical instabilities for δ 1, we perform a systematic δ-expansion of the reduction formulas (133) with the substitutions (139) for the case of triangles. To obtain regular Taylor series including terms up to order δ m , the coefficients of the various factors δ −k with k ≤ 4 in (133) are expanded up to order δ k+m . For a complete cancellation of 1/δ poles, the expansions need to be applied to all needed on-the-fly reduction steps (138), as well as to the final reduction of rank-one triangles (166), rank-one bubbles (167) and rank-two bubbles (168) to scalar integrals, and to the scalar integrals themselves. 12 To expand the relevant C 0 and B 0 functions we use LiteRed [67], and the final results can be expressed in terms of B 0 functions with external masses equal to −p 2 or zero.
For the case of triangles with massless internal lines, m 0 = m 1 = m 2 = 0, the original scalar integrals are B 0 (0, 0, 0), B 0 (−p 2 , 0, 0), B 0 (−p 2 (1 + δ), 0, 0), and C 0 (−p 2 , −p 2 (1 + δ), 0, 0, 0). Expanding the tensor integrals up to rank r = 3 and including terms up to order δ 2 we obtain with only two independent scalar integrals, and for i, j, k = 1, 2. Here the sums are restricted to inequivalent permutations, e.g. p µν Expansions in δ are applied to exceptional phase space points and specific triangles with δ < δ thr , and we have chosen the threshold δ thr = 10 −3 . Since the expansions reduce triangles of rank r to rank zero in a single step, in order to be able to apply them in a fully flexible way, the on-the-fly reduction of triangles (83) has to be postponed at the end of the open-loop recursion, where triangles contributions have reached the maximum rank. This means that, in addition to the contributions listed in (85), we also generate N = 3 terms with R = 3. After taking the trace (77), depending on the actual value of δ, the reduction of triangles is done either using the expansion formulas or the on-the-fly reduction (83) followed by Passarino-Veltman reduction steps (166)-(168).

Implementation and performance
This section summarises the key structure of the new algorithm, outlines some aspects of its implementation, and presents technical performance studies.

Structure and implementation of the new algorithm
Similarly as for the original open-loop method, given a certain scattering process the algorithm starts with the generation of all tree and one-loop diagrams. This is done in symbolic form, including only topological information and particle content. One-loop diagrams are colour stripped and cut-open, and the interference of their colour structure with the Born amplitude is taken as initial condition for the open-loop recursion.

2.
For each open loop with n < N and rank R determine the rank R next that would be reached by performing the subsequent dressing step.
2.a If R next = 3 or if n = N and R = 2 avoid the dressing step and perform an on-the-fly reduction, which generates unpinched and pinched terms with R = 1. The unpinched terms remain in the (N, n) group. As for the pinched terms, if the adjacent segments of a pinched propagator are dressed they can be reduced to a single effective segment, thereby turning (N, n) → (N − 1, n − 1). Otherwise, contributions with an undressed pinch stay in the (N, n) group.
2.b If n < N and R next ≤ 2 perform a dressing step, which turns (N, n) → (N, n + 1). If this step dresses a pinched propagator that was previously undressed, the corresponding segments can be reduced to a single effective segment turning (N, n + 1) → (N − 1, n). The above on-the-fly algorithm has been implemented in the framework of the original OpenLoops program [16], which consists of a computer-algebraic code generator written in Mathematica and a numerical part written in Fortran 90. Given an arbitrary Standard Model process, the Mathematica generator simulates the full chain of recursion steps in symbolic form and translates it into Fortran 90 code for the calculation of the actual scattering amplitude. The only external tools that need to be interfaced to the new OpenLoops program are Feynarts [68], for the generation of tree and one-loop diagrams, and Collier [19], for the calculation of scalar integrals. All other aspects of the open-loop method are directly implemented as process-independent Mathematica and Fortran routines. This includes the management of colour algebra, the kernels of the dressing recursion at tree and one-loop level, the on-the-fly and integral reductions, the helicity bookkeeping system, R 2 rational terms, UV counterterms, and several other aspects.
The entire program is fully automated, the new on-the-fly methods are implemented and widely tested at NLO QCD and they will soon be extended to NLO EW. These methods will be made publicly available with the upcoming release of OpenLoops 2. Similarly as for Open-Loops 1, numerical routines generated with the new one-the-fly techniques will be accessible through an automated download and installation system and the standard OpenLoops interfaces to a variety of public Monte Carlo programs. In addition to the on-the-fly approach, OpenLoops 2 will support also the original open-loop method, which requires additional thirdparty tools such as Collier or Cuttools [10] and OneLOop [65] for the reduction to scalar integrals.

Technical performance
In this section we study the technical performance of the new algorithm. Similarly as in [9], we present speed and stability benchmarks for the one-loop QCD corrections to four families of partonic processes, (a) uū → tt + n g, (c) ud → W + g + n g, with n = 0, 1, 2 additional gluons. Top quarks and W bosons are not decayed, and sums over the colour and helicity degrees of freedom of the external particles are included throughout.
Benchmarks obtained with the new open-loop algorithm, denoted as OpenLoops 2, are based on Collier for the calculation of scalar integrals. In order to highlight the effect of the new on-the-fly methods of Sections 3-4, we also consider variants of the OpenLoops program where these new methods are not used. Specifically, as detailed in Table 1, we restrict the onthe-fly approach to helicity sums and diagram merging, using Collier for tensor reduction. This approach is denoted as OpenLoops 2+Collier. Alternatively, we apply the original open-loop method in combination with tensor integrals (denoted as OpenLoops 1+Collier) or OPP reduction (denoted as OpenLoops 1+Cuttools). The OpenLoops 1+Cuttools mode relies on OneLOop for the scalar integrals and is used also to generate benchmarks in quadruple precision.
By default, OpenLoops calculations are monitored through a built-in stability system that estimates the level of instability of one-loop results and automatically triggers re-evaluations in double or quadruple precision for critical phase space points. In the following, in order to avoid any bias in the comparisons, the stability system is switched off. In this way, oneloop amplitudes are computed only once per phase space point in double precision. Unless uu → tt + n g gg → tt + n g ud → W + g + n g uu → W + W − + n g  Table 1). stated otherwise, Gram-determinant expansions are always kept active, both in the on-the-fly reduction of OpenLoops2 and in Collier.

Speed benchmarks
To illustrate the speed of the new algorithm, in Fig. 13 we plot runtimes per phase space point for the calculation of the one-loop scattering probability (1). For the processes listed in (151), runtimes on a single Intel i7-4790K core with gfortran-4.8.5 vary from the order of 0.1 ms to about 1 s. In this range, similarly as in [9] we observe an approximate linear scaling with the number of one-loop diagrams. As compared to OpenLoops 1+Collier, the new algorithm with on-the-fly reduction is up to a factor two faster for multi-particle processes. Depending on the process, using OpenLoops 2+Collier, i.e. restricting the on-the-fly approach to helicity sums plus diagram merging and reducing tensor integrals with Collier, can result in a further significant speedup. However, the moderate slowdown caused by the on-the-fly reduction can be counterbalanced by the improved numerical stability, which implies a reduced need of reevaluations in quadruple precision (see Section 6.2.2).

Stability benchmarks
In this section we study the numerical stability of the new open-loop algorithm. To this end, one-loop scattering probability densities computed in double precision (W DP 1−loop ) are compared against benchmarks in quadruple precision (W QP 1−loop ). More precisely, defining the relative difference between two results as we estimate the instability of double-precision results as 13 This quantity can be regarded, up to a minus sign, as the number of correct digits of the double-precision evaluation.
To estimate the intrinsic accuracy of quad-precision benchmarks, computed using Open-Loops 1+Cuttools and OneLOop, we use a so-called rescaling test [9], where scattering amplitudes are computed with rescaled masses and momenta and scaled back according to their mass dimensionality. Thus for a given phase space point the accuracy of the quadruple precision benchmarks is assessed as where W QP 1−loop and W QP,R 1−loop are the original and rescaled quad-precision evaluations. This quantity represents the finite resolution of the instability estimate (153). As we will see, quadprecision benchmarks can become more unstable than double-precision results obtained with OpenLoops 2. In this case, the instability estimate (153) yields A DP ∼ A QP but should be interpreted as A DP < A QP .
To assess the stability of OpenLoops 2, for each process in (151) we have studied a sample of 10 6 homogeneously distributed phase space points at √ s = 1 TeV. To exclude soft and collinear regions we have required p i,T > 50 GeV and ∆R ij > 0.5 for all massless final-state QCD partons. Figure 14 illustrates the effect of Gram-determinant instabilities and the goodness of the solutions introduced in Section 5.4 in the case of gg → ttgg. For this challenging multiparticle process, using OpenLoops 2 without any special treatment of Gram determinants in the one-the-fly reductions we observe an extremely high level of numerical instability in double precision. The probabilities to obtain one-loop results with less than four or zero correct digits are around 10 −1 and 10 −2 , respectively, and the tail of the stability distribution extends up to a level of instability of ten orders of magnitude and more. Applying the permutation trick (140) results in a dramatic improvement. The probability of finding points with only few correct digits goes down by three orders of magnitude, and the maximum level of instability is around 10 2 . Switching on the ∆-expansions of Section 5.4.2 leads to a further very drastic reduction of the probability of finding results with less than 3-4 correct digits. In this range, we observe an overlap with the tail of the quad-precision distribution. As discussed above, this indicates that OpenLoops 2 in double precision is more stable than the quad-precision benchmarks, and its estimated instability represents only an upper bound. Most likely, the tail of the true OpenLoops 2 stability distribution ends at 10 −3 .
To gain more insights into the origin of numerical instabilities in the on-the-fly reduction of OpenLoops 2, let us investigate the correlation between the instability (153) and rank-two Gram determinants ∆ in the gg → ttgg sample of Fig. 14. More precisely, in Fig. 15 we consider the minimal value of the dimensionless parameter ∆ ij /Q 4 ij in the event, where Q 2 ij is the largest |p i · p j | in the corresponding Gram matrix (see Section 5.4.1). As demonstrated by the left plot in Fig. 15, the instability of the entire scattering amplitude features a remarkably strong correlation with rank-two Gram determinants over twenty orders of magnitude. Moreover we observe a quadratic or faster scaling in Q 4 /∆, consistent with the form of the γ −2 ∼ ∆ −2 poles in (133). The middle plot shows the effect of the permutation trick (140), which avoids the smallest Gram determinant of the event in all reductions with N ≥ 4 loop denominators, thereby reducing to 0.1 permil the probability of having less than four correct digits. Finally, in the right plot we see that points with less than 3-4 correct digits disappear completely when Gram-determinant expansions are switched on. 14 As one can clearly recognise in the right plot, the threshold for the activation of Gram-determinant expansions corresponds to (Q 4 /∆) 2 ∼ δ −4 thr = 10 12 . Finally, in Figs. 16-17 we compare the stability of OpenLoops 2 against Open-Loops 1+Collier and OpenLoops 1+Cuttools for the 2 → 3 and 2 → 4 processes in (151). In OpenLoops 2 the stability improvements of Section 5.4 are applied throughout. The results of OpenLoops 1+Cuttools feature the highest instability tails for all considered processes. The probability of finding less than four correct digits can exceed 10 −3 in 2 → 3 and 10 −2 in 2 → 4 processes, while the fraction of fully unstable points with A ≥ 0 can reach 10 −3 in 2 → 4 processes. Switching to OpenLoops 1+Collier we find that, depending on the process, the probability of finding only a few correct digits goes down by one to three orders of magnitude, while in eight samples of 10 6 points we do not find a single result with A > 0. 15 Using OpenLoops 2 can lead to a further significant stability improvement. This is especially evident for 2 → 3 processes, where the stability of the on-the-fly reduction in Open-Loops 2 is remarkably close to the quad-precision benchmarks and even superior than quad precision for the case of ttg production. When quad precision is sufficiently accurate to resolve the instabilities of OpenLoops 2 we observe improvements of one-two orders of magnitude with respect to OpenLoops 1+Collier. In the case of 2 → 4 processes, depending on the process and the considered number of digits, OpenLoops 2 can perform somewhat better or slightly worse than OpenLoops 1+Collier, like in the case of ud → W + ggg or uū → W + W − gg, respectively. However, both approaches guarantee excellent numerical stability.

Conclusions and Outlook
We have presented a new approach for the automated calculation of scattering amplitudes at one loop. This new technique is based on the open-loop approach, where cut-open loop integrands are factored into a product of loop-momentum dependent segments that are combined through recursive tensorial multiplications.
The key idea behind the new method is that various operations, which are typically done at the level of full Feynman diagrams or amplitudes, can be performed on-the-fly during the open-loop recursion, i.e. after the multiplication of each loop segment. Since it exploits the factorised structure of open loops in a systematic way, this on-the-fly approach can reduce the complexity of certain operations in a very significant way.
We have first applied the on-the-fly method to helicity summations and to the merging of topologically equivalent open loops finding speed-up factors of up to two or three as compared to the original open-loop algorithm. Moreover, using the integrand reduction method by del Aguila and Pittau, we have introduced an on-the-fly technique for the reduction of open loops. With this approach, the construction of loop amplitudes and their reduction are interleaved step by step within a single numerical recursion. In this way, objects with tensor rank higher than two are avoided throughout, and the complexity of the calculations is reduced in a very drastic way. The proliferation of pinched subtopologies that emerge from the reduction is avoided by absorbing them on-the-fly into topologically equivalent open loops.
The employed integrand reduction method suffers from severe numerical instabilities that are dominated by kinematic regions with small rank-two Gram determinants ∆ and scale like 1/∆ 2 . In the reduction of N -point objects with N ≥ 4, we have shown that ∆-instabilities can easily be avoided through appropriate permutations of the loop denominators. In this way we were able to isolate ∆-instabilities in triangle topologies with a particular kinematic configuration and to cure them by means of analytic expansions in ∆. This approach is the first example of an integrand reduction algorithm that is essentially free from Gram-determinant instabilities. The level of stability that is achieved in double precision is competitive with public implementations of OPP reduction in quadruple precision.
The new algorithm is fully automated and validated at NLO QCD and be extended to electroweak interactions. It will become publicly available in the upcoming release of Open-Loops 2. Its technical features can be especially beneficial in NLO calculations for challenging multi-particle processes. Moreover, in view of its excellent numerical stability, the new algorithm is very attractive for the calculation of real-virtual contributions at NNLO.

A Reduction of the remaining tensor integrals
After the on-the-fly reduction described in Section 4.1 we are left with a few types of low-rank integrals (85) that need to be reduced to the standard basis of one-loop scalar integrals. The relevant reductions are performed upon taking the trace (77) as described in the following. In this appendix we restrict ourselves to tensor integrals with four-dimensional q µ terms in the numerator, while contributions with additionalq 2 terms in the numerator are described in Section 4.4.
A.1 Five-and higher-point integrals with R = 0, 1 Integrals with N ≥ 5 loop propagators and a numerator of rank R ≤ 1, can be reduced to a linear combination of scalar boxes, To determine the box coefficients we use the OPP reduction formula [5] where q ± 0 are the solutions of the quadruple-cut conditionsD i 0 =D i 1 =D i 2 =D i 3 = 0, and are the corresponding residues.
For the special case d 0123 , where p i 0 = p 0 = 0, the explicit solutions of the quadruple cut equations read where l i are the basis momenta defined in (115)-(116). The x 1,2 coefficients are given by with d i = m 2 i − p 2 i . The terms α 1,2 and γ are defined in Section 5, and the remaining coefficients in (159) are given by Besides the standard reduction formulas for B 1 , B 11 , and B 00 , for p 2 = 0 we have implemented the special cases (173)