Two-loop tensor integral coefficients in OpenLoops

We present a new and fully general algorithm for the automated construction of the integrands of two-loop scattering amplitudes. This is achieved through a generalisation of the open-loops method to two loops. The core of the algorithm consists of a numerical recursion, where the various building blocks of two-loop diagrams are connected to each other through process-independent operations that depend only on the Feynman rules of the model at hand. This recursion is implemented in terms of tensor coefficients that encode the polynomial dependence of loop numerators on the two independent loop momenta. The resulting coefficients are ready to be combined with corresponding tensor integrals to form scattering probability densities at two loops. To optimise CPU efficiency we have compared several algorithmic options identifying one that outperforms naive solutions by two orders of magnitude. This new algorithm is implemented in the OpenLoops framework in a fully automated way for two-loop QED and QCD corrections to any Standard Model process. The technical performance is discussed in detail for several $2\to2$ and $2\to 3$ processes with up to order $10^5$ two-loop diagrams. We find that the CPU cost scales linearly with the number of two-loop diagrams and is comparable to the cost of corresponding real-virtual ingredients in a NNLO calculation. This new algorithm constitutes a key building block for the construction of an automated generator of scattering amplitudes at two loops.


Introduction
Precise theory predictions based on the Standard Model (SM) play a crucial role for the success of the LHC physics program. In particular, in the absence of striking signals of new physics, the availability of high theory precision for the widest possible range of scattering processes is a key prerequisite in order to maximise the sensitivity of LHC measurements to small effects of physics beyond the SM. Predictions at next-to-leading order (NLO) in perturbation theory can be obtained through multi-purpose Monte Carlo tools [1][2][3][4], and the hard scattering amplitudes at the core of these calculations are computed with automated numerical tools [5][6][7][8][9][10][11][12][13][14][15][16][17] at tree and one-loop level.
While specialised NNLO Monte Carlo tools [63][64][65] have been developed, a fully general and automated NNLO tool, including in particular the two-loop scattering amplitudes, is not yet within reach. In light of the high-precision requirements of the LHC and future colliders, such a NNLO tool for arbitrary 2 → 2 and 2 → 3 SM processes is highly desirable. With this objective in mind, in this paper we present a new algorithm that provides an important building block for the automated construction of two-loop amplitudes.
This new algorithm represents the extension of the well-established OpenLoops technique [10] from one to two loops. In the OpenLoops approach, one-loop scattering amplitudes are constructed as sums of Feynman diagrams. Individual one-loop diagrams Γ have the formM 1,Γ = C 1,Γ dq 1N (q 1 ) D(q 1 ) , (1.1) where C 1,Γ is a colour factor, and the bar indicates quantities that are defined in D dimensions in order to regularise divergences in loop integrals [66]. The denominator D(q 1 ) is a product of propagator denominators depending on the loop momentumq 1 . In OpenLoops the numeratorN (q 1 ) is decomposed into loop-momentum tensors, and the tensor coefficientsN µ 1 ...µr are constructed through a numerical recursion [10,67,68] based on process-independent operations, which depend only on the Feynman rules of the model at hand. This construction is carried out in four space-time dimensions, and the contribution of the missing numerator parts of order (D − 4) are reconstructed via insertion of process-independent rational counterterms [69][70][71][72] into tree diagrams. Scattering amplitudes are obtained by combining the tensor coefficients with the associated tensor integrals which can be reduced to scalar integrals using external tools [5,14]. Alternatively, the construction of tensor coefficients and the reduction of tensor integrals can be combined in a single numerical recursion using the on-the-fly reduction method [68]. Besides all required scattering amplitudes for NLO calculations, the public OpenLoops program provides also some of the building blocks of NNLO calculations, namely squared one-loop amplitudes as well as Born one-loop interferences for processes with one unresolved emission, which enter the real-virtual parts of NNLO cross sections.
In this paper we present a new algorithm for the construction of the interference of twoloop and Born amplitudes, which enters the virtual-virtual parts of NNLO calculations. Also in this case, full scattering processes are handled as sums over individual Feynman diagrams. The amplitude of a generic two-loop diagram Γ has the form , (1.4) where C 2,Γ is a colour factor, and the denominator D(q 1 ,q 2 ) embodies all propagator denominators depending on the two independent loop momentaq 1 ,q 2 . Similarly as in (1.2), for the numerator we apply a tensor decomposition in the two loop momenta, (1.5) The main challenges in the development of a fully automated two-loop tool are the efficient construction of the tensor coefficientsN µ 1 ...µr, ν 1 ...νs , the reduction of the associated two-loop tensor integrals to a set of master integrals, and the evaluation of these master integrals. At higher loop levels, a fully automated tensor-integral library in the style of the existing one-loop tools [5,8,14] is not yet available. However, concerning the integral reduction, the tools and techniques based on integration-by-parts relations [73] have been greatly advanced in the recent years [74][75][76][77][78]. Furthermore, differential equations methods [79,80] for the computation of Feynman integrals have greatly expanded the range of available two-loop master integrals (see e.g. [81][82][83][84][85][86][87]).
The new algorithm presented in this paper deals with the construction of the two-loop tensor coefficientsN µ 1 ...µr, ν 1 ...νs in (1.5). This is achieved through a process-independent numerical recursion, which includes also the effect of the interference with the Born amplitude as well as the summation over all colour and helicity degrees of freedom. This algorithm has been implemented in a fully automated way within the OpenLoops framework for two-loop QCD and QED corrections to any SM process. Due to the generality of the algorithm it can easily be extended to other models. Similarly as in the one-loop case, all tensor coefficients are constructed in four space-time dimensions, and the contributions associated with the O(D − 4) parts of loop numerators can be reconstructed by means of two-loop rational counterterms [88][89][90]. More precisely, renormalised two-loop amplitudes in D dimensions can be obtained from two-loop amplitudes with four-dimensional loop numerators together with insertions of one-loop and two-loop counterterms of rational and UV kind into one-loop and tree amplitudes. This approach is fully established for rational terms of UV origin, while we expect that rational terms of infrared (IR) origin cancel in IR-subtracted amplitudes. This cancellation mechanism is well understood at one loop [91] and is presently under investigation at the two-loop level. In this case, we expect that all rational terms of IR origin can be cancelled by means of a simple and process-independent modification of the well-established procedures for the subtraction of two-loop IR divergences.
The paper is organised as follows. In Section 2 we outline general aspects of the construction of scattering amplitudes and cross sections at NLO and NNLO. In this context we discuss rational counterterms, as well as the bookkeeping of colour and helicity degrees of freedom. In Section 3 we review the OpenLoops method for the construction of tree and one-loop amplitudes. This sets the stage for the discussion of the new two-loop algorithm. Section 4 deals with reducible two-loop diagrams, which can be decomposed into one-loop subdiagrams. For this kind of two-loop diagrams we present an algorithm that exploits and extends various features of the OpenLoops method at one loop. In Section 5 we present the main novelty of this paper, a fully general numerical algorithm for the construction of the tensor coefficients of irreducible two-loop diagrams. We first analyse the structure of a generic two-loop algorithm, and then present the most efficient algorithm that we have identified through a systematic CPU cost analysis. The implementation of this new algorithm in the OpenLoops framework and its technical performance are discussed in Section 6, where we show that the CPU cost of two-loop tensor coefficients scales linearly in the number of Feynman diagrams and is comparable to the cost of the real-virtual building blocks in a NNLO calculation. Our conclusions are presented in Section 7.
2 Scattering amplitudes and partonic cross sections up to NNLO Partonic cross sections are derived from scattering amplitudes, which in perturbation theory are computed up to a fixed loop order, The bar marks quantities in D = 4 − 2ε dimensions throughout this paper. The L-loop matrix elementM L is computed as the sum over all L-loop Feynman diagrams Γ of the process at hand,M where the argument h corresponds to the helicity configuration of all external particles as described in Section 2.3. Loop integrals can exhibit UV divergences, which are cancelled through the renormalisation procedure. We denote a renormalised amplitude as where the renormalisation of the fields, couplings and masses in an L-loop amplitude is implemented through counterterm insertions into lower-loop amplitudes. In a renormalisable model, the finite set of UV counterterms is computed once and for all in the chosen renormalisation scheme. Differential cross sections are obtained from the colour-and helicity-summed scattering probability density where the initial-state average and symmetry factors for identical final-state particles are encoded in Here S in is the set of initial-state particles, and P out the set of final-state particle and antiparticle types 2 , while N hel,i and N col,i are the number of helicity and colour states of particle i, and n p the number of identical final-state particles of type p. with the L-loop squared and the Born L-loop interference terms For loop-induced processes W 11 is the leading order contribution. Finite partonic cross sections require, in addition to the UV renormalisation, the inclusion of real-emission contributions in order to cancel final-state collinear and soft divergences, as well as the factorisation of initial-state collinear singularities, which are absorbed into the parton distribution functions. For the amplitudes of scattering processes with X additional unresolved partons, in analogy with (2.1) and (2.6), we use the notation  [92][93][94][95][96][97].
The program OpenLoops 2 [17] supports the calculation of tree-level and one-loop amplitudes, i.e. the contributions (2.7) for L = 0, 1 and (2.8) for L = 1, and hence also the various NLO and NNLO contributions in (2.11) and (2.12), for any process. This is achieved through automated algorithms that feature high CPU efficiency and numerical stability. In particular, thanks to targeted analytic expansions and a hybrid-precision approach [17,68], OpenLoops 2 guarantees a numerically stable evaluation of the realvirtual NNLO contributions W (1) 01 in the full phase space, including the regions where the unresolved radiation becomes highly soft or collinear. This high degree of stability in the IR regions was demonstrated through successful applications of OpenLoops 2 to stateof-the-art NNLO calculations based on local subtraction methods, such as in the recent NNLO calculations of pp → 3 jets [48,98].
In this paper, we present a new algorithm for the efficient numerical computation of the Born two-loop interference W 02 , defined in (2.8), at the level of tensor-integral coefficients.

Dimensional regularisation and rational terms
In the OpenLoops framework, UV and IR divergences are regularised in the 't Hooft-Veltman scheme [66], where external wave functions and momenta are four-dimensional, while loop momenta, metric tensors and Dirac γ-matrices inside the loops live in D = 4−2ε dimensions. The integration measure as well as the denominators of loop integrands are kept in D dimensions throughout, while loop-integrand numerators are split into a fourdimensional part and a remainder of O(ε). Contributions stemming from the former part are referred to as D n = 4 dimensional, where D n denotes the dimensionality of the loop numerator, and are calculated with numerical algorithms where the loop momenta, metric tensors and γ-matrices in the numerator are handled in four dimensions. The remaining (D − 4)-dimensional numerator parts give rise to additional rational contributions, which originate from their interplay with the divergences of loop integrals and can be reconstructed through process-independent counterterms.
At one loop, such rational terms originate only from UV poles [91], and the corresponding counterterms are known for the full SM and for a variety of other models [69][70][71][72]. Using rational counterterms, renormalised one-loop amplitudes in D dimensions can be obtained from quantities in D n = 4 dimensions through the formula Here and in the following, loop amplitudesM L carrying a bar are fully D-dimensional as introduced in (2.1), while amplitudes M L without a bar are in D n = 4 dimensions, i.e. they are computed with four-dimensional integrand numerators and D-dimensional denominators. The first term on the rhs of (2.17) is the unrenormalised amplitude in D n = 4, and M (CT) 0,1 stands for the tree-level amplitude with all relevant insertions of UV and rational one-loop counterterms.
At two loops, as recently shown in [88,89], renormalised amplitudes in D dimensions can be obtained from amplitudes in D n = 4 dimensions through a general formula of the form Here the first term on the rhs is the unrenormalised two-loop amplitude in D n = 4, while each of the three additional contributions embodies standard counterterms for the subtraction of UV divergences in combination with rational counterterms for the reconstruction of the contributions of the (D − 4)-dimensional parts of loop numerators. The term M (CT) 1,1 (h) denotes the one-loop amplitude with all relevant one-loop counterterm insertions, while M (CT) 0,2 (h) and M (CT) 0,1,1 (h) correspond, respectively, to the tree-level amplitudes with single two-loop and double one-loop counterterm insertions. Similarly as for the related UV counterterms, also two-loop rational counterterms of UV origin are process-independent [88]. The explicit expressions for all two-loop rational counterterms of UV origin in QED and for QCD corrections to the full SM have been derived in [88][89][90].
Two-loop rational terms originating from the interplay of (D −4)-dimensional numerator parts and IR divergences are currently under investigation. As anticipated in the introduction, we expect that -at the level of IR-subtracted two-loop amplitudes -all rational terms of IR origin can be cancelled by means of a simple and process-independent modification of the well-established procedures for the subtraction of two-loop IR divergences.

Scattering amplitudes and colour factors
In the original OpenLoops algorithm, as well as in the new algorithm, L-loop matrix elements M L are computed as sums of all Feynman diagrams Γ of the scattering process, The contribution from each diagram is factorised into a colour factor C L,Γ and a colourstripped helicity amplitude A L,Γ (h), Since each quartic gluon vertex gives rise to three independent colour structures, Feynman diagrams Γ that involve n q quartic gluon vertices are decomposed into 21) and, in the following, each colour-factorised contribution Γ j is handled as a different Feynman diagram.
The colour structures C L,Γ are algebraically reduced to a standard colour basis {C i } (see [17] for details), and Born amplitudes are cast in the form Colour-summed interferences are built by means of the colour-interference matrix For example, the LO probability density is computed as The contribution of an L-loop diagram Γ with L ≥ 1 to (2.8) is computed as where the colour-stripped loop amplitude A L,Γ (h) is factorised from the colour-Born interference term This colour treatment is implemented in the public OpenLoops code [17] as well as in the new two-loop algorithm presented in this paper.

Helicity bookkeeping
In this section we define the helicity labels used in OpenLoops at all loop orders, following the notation of [68]. For the bookkeeping of external momenta and helicities in a process with N scattering particles we introduce the set of particle indices To characterise the helicity configurations of individual particles p ∈ E we use the labels The configuration λ p = 0 is also used to characterise unpolarised particles, i.e. fermions or gauge bosons whose helicity is still unassigned at a certain stage of the calculation or has already been summed over. We use a helicity numbering scheme based on the labels which correspond to a quaternary number with λ p ∈ {0, 1, 2, 3} as p th -last digit and all other digits equal to zero. This scheme allows us to derive the helicity labels of any set of particles as the sum of the helicity labels of its disjoint subsets, and hence as the sum of the helicity labels of all its particles. A set of particles E a = {p a 1 , . . . , p an } has the helicity label, In particular, the global helicity label of the scattering process is For two disjoint sets of particles E b , E c the helicity label of the combined set (2.33)

Tree and one-loop amplitudes in OpenLoops
In this section we review the algorithm for the construction of tree and one-loop amplitudes that is implemented in the public OpenLoops program. This sets the stage for the discussion of the new two-loop algorithm.

Tree-level amplitudes
At tree level, the colour-stripped amplitude of a Feynman diagram Γ is decomposed into two subtrees connected by a certain off-shell propagator, 3 Here k a = −k b and σ a , σ b are the off-shell momenta and Lorentz/spinor indices 4 of the subtrees, while h a , h b denote the helicity labels of the subsets of external on-shell particles connected to the respective subtrees, and corresponds to the global helicity of all scattering particles. The subtree w a is bounded by external wave functions and the off-shell propagator, at which the diagram was cut, while w b is either a single external wave function or is bounded by a set of external wave functions and a vertex that connects it to w a . The relevant subtrees are generated in a recursive way starting from the external wave functions, and connecting them to other external subtrees through operations of the form leading to subtrees with an increasing number of external particles. The tensor X σa σ b σc corresponds to the triple vertex that connects w b , w c together with the numerator of the offshell propagator with momentum k a = k b + k c . Each step is performed for all independent helicity configurations h b , h c , and h a = h b + h c . For quartic vertices the relation The tree recursion is implemented in such a way that the subtrees contributing to multiple Feynman diagrams are computed only once.

One-loop amplitudes
The colour-stripped amplitude of a one-loop diagram Γ is given by 5 Here the loop numerator N (1) (q 1 , h) and its fundamental building blocks, i.e. the loop momentum q i , Dirac matrices γ µ and metric tensors g µν , are all evaluated in four dimensions. For such four-dimensional quantities and amplitudes computed in D n = 4 dimensions we consistently use symbols without a bar. 6 The trace in (3.5) represents the contraction of Lorentz/spinor indices along the loop. The scalar denominators are kept in D dimensions. They depend on the D-dimensional loop momentumq 1 , the internal mass m 1a , and where k 1b is the external momentum entering the loop between D b . For later convenience we apply a label (i) to various q i -dependent building blocks of a loop integrand, where q i is either a loop momentum or a linear combination of independent loop momenta. At one-loop level, i = 1, since there is only a single loop momentum. For the integration measure in loop-momentum space we define the shorthand where µ is the scale of dimensional regularisation. In OpenLoops the loop numerator is constructed through a numerical recursion that exploits its factorisation into loop segments. A loop segment consists of a loop propagator, one adjacent triple or quartic vertex, and the one or two external subtrees connected to this loop vertex. In the case of a triple vertex the loop segment has the form where an external subtree w In renormalisable theories, a segment can be written as a rank-one polynomial in the loop momentum with coefficients Y and Z. The indices β (i) a are, depending on the particle type in the loop propagator, Lorentz indices (gauge bosons) or spinor indices (fermions) with β (i) a = 1, . . . , 4. For scalar particles (ghosts, scalars) the index has a fixed value β (i) a = 1. Segments associated with a quartic vertex involve two subtrees, w (i) a 1 and w (i) a 2 , and are of rank zero in q 1 , At one loop, the numerator is computed by cut-opening the loop at one propagator, which results in a tree-like object consisting of a product of N 1 loop segments, a describes the helicity configuration of the a-th subtree, a ⊆ E is the corresponding set of external particles. The loop numerator is constructed through recursive matrix multiplications, which are applied for n = 1, . . . , N 1 , starting from the initial condition N 0 = 1 1. The label describes the helicity configuration of the external legs entering the first n segments, and h n . The operations (3.13) are referred to as dressing steps, and the partially dressed numerator is called an open loop. In this schematic representation, blue and grey blobs correspond to the dressed loop segments and those that remain to be dressed, respectively. Each open loop is a polynomial in q 1 , with rank R ≤ n, and the dressing recursion (3.13) is implemented at the level of the tensor coefficients N (1) n; µ 1 ...µr . The explicit form of a dressing step for a segment (3.9) with a three-point vertex is For an efficient implementation the µ 1 . . . µ r indices are symmetrised. In the final step of the dressing algorithm, the trace is taken over the indices and for the amplitude of the colour-stripped Feynman diagram (3.5) we obtain The tensor integrals on the rhs can be reduced with external libraries such as Collier [14]. Alternatively, they can be reduced with the on-the-fly method of [68], where dressing steps are interleaved with reduction steps in such a way that the tensor rank remains low at all stages of the calculation. The on-the-fly reduction is the default method in OpenLoops 2 [17].

Born-loop interference
As pointed out in [68], for the efficient construction of the helicity-and colour-summed Born-loop interference defined in (2.26) it is convenient to absorb the colour-Born interference factor (2.27) into the loop numerator. In this approach, instead of the original helicity-dependent numerator we construct the helicity summed quantity with the colour-Born interference where, again, the label Γ is kept implicit. Exploiting the factorisation of U(q 1 ) into U 0 (h) and loop segments, one can write where helicity sums are partially factorised at the level of individual segments. More precisely, each time the numerator is dressed with a new segment S n ) can be summed, and the subsequent dressing steps depend only on the helicities of the yet undressed segments. This way of constructing the loop numerator (3.20) corresponds to a recursive dressing algorithm where the initial condition (3.21) is used, and the helicity labeľ corresponds to the helicity configuration of all undressed segments. To keep track of the polynomial dependence of the partially dressed numerators on the loop momentum q 1 , a tensorial representation similar to (3.16) is used, 25) and the dressing recursion is implemented at the level of tensor coefficients in the same form as in (3.17), but with an additional sum over h (1) n as in (3.23). This on-the-fly helicity summation approach guarantees the smallest possible number of helicity configurations in the last dressing steps, where the rank in q 1 , and hence the number of tensor coefficients and required matrix multiplications, is highest.
At the end of the dressing recursion, i.e. after step n = N 1 , in analogy with (3.18) we take the trace N 1 = 0 since no undressed segments are left. Finally, combining the tensor coefficients with the associated tensor integrals, one arrives at which corresponds to the one-loop virtual scattering probability density as defined in (2.26). The reduction of the tensor integrals can be performed with external libraries, such as Collier [14], or with the on-the-fly reduction method of [68]. Figure 1. Categorisation of two-loop topologies into reducible (Red) and irreducible (Irred) ones. Reducible two-loop topologies are further split into two subcategories, where two one-loop subdiagrams are either connected to a bridge through two vertices (Red2) or are attached to each other through a common quartic vertex (Red1). Similarly as in Section 3 the blue blobs denote tree structures that are connected to internal and external lines (the latter are not shown). In the Red2 topology the blue blob labelled P can be connected only to the two visible internal lines or also to additional external lines.

Reducible two-loop integrands
In this and the following section we present a new automated algorithm that extends the construction of the tensor coefficients (3.26) to two loops. Specifically, we will focus on the construction of tensor coefficients for unrenormalised two-loop amplitudes in D n = 4 dimensions, i.e. with four-dimensional numerators, while all other ingredients of renormalised two-loop amplitudes (2.18) can be obtained from one-loop and tree amplitudes with UV and rational counterterm insertions. All amplitudes are split into individual Feynman diagrams, and their colour structures are factorised as described in (2.19)-(2.21). Thus we will concentrate on the construction of tensor coefficients for colour-stripped unrenormalised two-loop diagrams. At two loops we categorise Feynman diagrams into reducible and irreducible ones as illustrated in Fig. 1. This categorisation is based on the structure of loop chains. Such chains consist of loop propagators that are linked to each other and to external subtrees through connecting vertices. More precisely, a loop chain C i contains the propagators that depend only on a certain loop momentum q i when all external momenta are set to zero.
Two-loop diagrams that involve only two loop chains C 1 , C 2 , with two independent loop momenta q 1 , q 2 , are categorised as reducible. This category is further split into two-loop diagrams of type Red2 and Red1 depending on how the two chains are connected to each other. As indicated in Fig. 1, in reducible two-loop diagrams of type Red2 the chains C 1 and C 2 are connected through a tree structure P , which is referred to as bridge, while the type Red1 corresponds to the case where the two chains are connected through a single quartic vertex V 4 . Since reducible two-loop diagrams factorise into two one-loop subdiagrams, for their calculation one can exploit various functions of the one-loop OpenLoops framework. An algorithm for the construction of reducible two-loop integrands of type Red2 and Red1 is presented in Sections 4.1-4.2.
Two-loop diagrams that involve three loop chains C 1 , C 2 , C 3 are categorised as irreducible. In this case the three chains are connected to each other by two vertices V 0 , V 1 , and of the three loop momenta q 1 , q 2 , q 3 only two are linearly independent. The loop momenta can always be chosen in such a way that q 1 + q 2 + q 3 = 0. An algorithm for the construction of irreducible two-loop integrands is presented in Section 5.
In general, the algorithms presented in Sections 4.1-4.2 and 5 compute tensor coefficients that corresponds to the helicity and colour-summed interference (2.26) of the full Born amplitude with the integrand of individual two-loop diagrams.

Reducible two-loop integrands of class Red2
A reducible two-loop diagram of class Red2 factorises into two chains C i and a bridge P as depicted in Fig. 1. The loop chains C i each consist of N i loop propagators (i = 1, 2), and the external subtrees w  .9) and (3.10). The bridge P is a tree structure connected to each of the two chains with a corresponding vertex, which we call the bridge vertex of that chain. In general, the bridge is also connected to a subset of the external particles of the scattering process. The colour-stripped amplitude of a reducible diagram Γ has the generic form with the denominator chains where p ia and m ia are the external momenta and internal masses along the chain C i . The Lorentz/spinor indices α 1 , α 2 connect the bridge P α 1 α 2 to the chains C 1 , C 2 . The external subtrees w (i) a depicted in (4.1) play the two-fold role of single subtrees or pairs of subtrees connected to the loop chains via triple and quartic vertices, respectively. The global helicity configuration is given by with the chain helicities 4) and the bridge helicity h (B) , which corresponds to the helicity configuration of all external particles connected to P . Reducible two-loop diagrams can be efficiently constructed by the following algorithm, which uses and extends functions of the one-loop OpenLoops algorithm. Ordering and cutting rule: Chain C 1 is chosen to be the chain with less loop propagators. 7 As illustrated in Fig. 2, the construction of the loop numerator is organised by cut-opening the two one-loop subdiagrams between the bridge vertices and the D Step 1 -Construction of chain C 1 : The numerator of the chain C 1 factorises into loop segments, where the last segment carries the Lorentz/spinor index α 1 that connects it to the bridge, while the indices connecting the segments to each other are kept implicit.
The recursive construction of (4.5) starts from the segment S The first chain is constructed starting from the initial condition N (1) where we use the partial chain helicities defined in (3.14) withĥ n . Similarly as in (3.13)-(3.18) the recursion is implemented in terms of tensor coefficients and, taking 7 In case there is more than one chain with the minimum number of propagators, then the one with the minimum number of helicity d.o.f. is chosen to be C1. If in turn there is more then one chain with the minimum number of propagators and helicity d.o.f., then C1 is selected based on the particle content of the external subtrees that are connected to the various chains. the trace after the last step, yields the coefficients Contracting all coefficients with corresponding one-loop tensor integrals results into the closed first loop which serves as the starting point for the bridge construction. Choosing C 1 to be the shorter chain, i.e. N 1 ≤ N 2 , ensures that the number of helicity states h (1) , for which the tensor coefficients need to be computed and contracted with a tensor integral, is minimal.
Step 1 is performed for all diagrams, recycling partially or entirely constructed chains wherever possible.
Step 2 -Bridge construction: The bridge involves various segments S (B) a with a = 0, . . . , N B as depicted in Fig. 2. As a starting point for its construction, the segment S (B) 0 , which consists solely of the first bridge propagator, is attached to the one-loop subdiagram (4.8), Here the global helicity h (1) of the subdiagram (4.8) has been renamed asĥ n . Note that, in contrast to loop segments, these tree segments contain also the associated propagator denominators, which depend solely on external momenta and internal masses. The label h (B) n corresponds to the helicity configuration of the external particles in w (B) n . These subtrees are recursively attached to (4.9) through dressing steps for n = 1, . . . , N B , which are implemented in the OpenLoops tree-level algorithm. The global helicity configuration for this partially dressed bridge contracted with the first oneloop subdiagram corresponds tô Recycling opportunities are systematically exploited whenever partially dressed bridges connected to the first loop occur in multiple diagrams.
The final result of the recursion (4.10) corresponds to the first two building blocks on the rhs of (4.1), i.e.
Step 3 -Construction of chain C 2 : The second chain is constructed starting from the loop segment associated with the propagator denominator D 1 . In this loop segment we include, as an effective external subtree, the full bridge-loop combination (4.12). If the bridge and the chain C 2 are connected by a triple vertex, according to (3.9) the first loop segment has the form with h 1 =ĥ (B) . Here Y and Z embody the connecting triple vertex together with the propagator numerator associated with D 1 . For a quartic vertex, similarly as in (3.10) we have where w (2) 1 denotes an additional subtree that is connected to the quartic vertex, and h 1 . As a result of (4.14)-(4.15) the index α 2 is saturated at the beginning of the construction of the chain C 2 , which renders the subsequent operations more efficient. The subsequent loop segments S (4.16) The dressing steps for this chain are carried out at the level of the Born-loop interference using the one-loop algorithm of Section 3.3. To this end, using as initial condition the colour-Born interference the various segments are recursively combined via for n = 1, . . . , N 2 . In this way helicity states are efficiently summed on-the-fly, anď corresponds to the helicity configuration of the segments that are still undressed. Note thatȟ (2) 0 = h . The dressing recursion is again implemented in terms of tensor coefficients, and after step n = N 2 one arrives at the two-loop scattering probability density along the same lines as in (3.26)-(3.27), i.e. by taking the trace whereȟ (2) N 2 = 0, and combining the tensor coefficients with the associated tensor integrals. This leads to which corresponds to the two-loop scattering probability density as defined in (2.26).

Reducible two-loop integrands of class Red1
The above algorithm can be easily extended to reducible diagrams of type Red1. In this case the one-loop chains C 1 , C 2 are connected by a single four-gluon vertex V 4 (see Fig. 1). The various colour-stripped amplitudes that result form the splitting (2.21) of V 4 and any other quartic vertices are handled as independent diagrams. For each of them the first one-loop subdiagram is constructed as in Step 1 of Section 4.1. The only exception is that the open index α 1 , which connects the chain C 1 to the bridge, is replaced by two indices β 1 . The latter correspond to the Lorentz indices that result form cut-opening the two propagators of the chain C 2 that are connected to V 4 . In practice, the quartic vertex V 4 is included in the last segment of the chain C 1 , and the first loop (4.8) assumes the form Since topologies of type Red1 feature a trivial bridge (no bridge segments, Step 2 can be by-passed. In Step 3 the starting point, i.e. the first loop segment of chain C 2 , is simply given by where h (1) , and the factor −i originates form the numerator −ig ββ of the gluon propagator associated with D 1 . The rest of Step 3 can be implemented as in Section 4.1.

Irreducible two-loop integrands
The colour-stripped amplitude A 2,Γ of an irreducible two-loop diagram Γ has the form with the denominator chains where p ia and m ia are the external momenta and internal masses along the chain C i . The three chains are connected by two vertices V 0 , V 1 , and each connecting vertex V a can either be a three-point vertex, as depicted in (5.1), or a quartic vertex. In the latter case V a is attached to a an external subtree w (V ) a , which is not shown in (5.1). Along each chain C i flows a single loop momentum q i in the direction from V 0 to V 1 , with the boundary condition q 3 = −(q 1 + q 2 ). If V 0 is a triple vertex the external momenta associated with the propagators D connected to V 0 . The numerator of (5.1) factorizes into three numerator chains and the two vertices V 0 , V 1 , each connecting all three chains, where the tensors V a (q 1 , q 2 , q 3 , h (V ) a ) represent the connecting vertices V a . In case of quartic vertices they also embody the related external subtree w where h (i) denotes the helicity configuration of the chain C i and is simply given by the sum of the corresponding segment helicities, as defined in (3.12), Each chain numerator in (5.3) factorises into loop segments The segment S The multiplications in (5.6) should be understood as matrix multiplications, For simplicity we will suppress the Lorentz/spinor indices β (i) a wherever possible. Ultimately, we are interested in the construction of the scattering probability density (2.8), and in the following we focus on the contribution of a single unrenormalised twoloop diagram as defined in (2.26). To account for the effect of the colour-Born interference, similarly as in the one-loop relations (3.20) and (3.21) we introduce and, as discussed in the following subsections, we combine it with the two-loop numerator in order to obtain a single helicity-summed object We note that possible symmetry factors can be included in the two-loop numerator N or, alternatively, in the associated colour factor C 2,Γ .

Generic structure of a recursive two-loop algorithm
Thanks to its factorised structure, the colour and helicity summed two-loop numerator (5.9) can be constructed through a numerical recursion of the form which starts withÛ 0 = 1 1 and terminates, after a certain number of steps N r , witĥ U Nr (q 1 , q 2 ) = U(q 1 , q 2 ). We refer to the operations (5.10) as dressing steps. The building blocks K n that are attached in each step are either the colour-Born interference U 0 , a connecting vertex, individual loop segments, or full chain numerators, i.e.
The chain numerators N (i) can in turn be constructed through a similar recursion from their loop segments. To this end we define the partially dressed chains whereĥ (i) . In order to control the polynomial dependence on the loop momenta in an efficient way, similarly as in the one-loop case, the dressing steps (5.10) and all relevant building blocks are implemented at the level of tensor coefficients.
For instance, for the partially dressed chains (5.12) we use the tensorial representation where a i = 0 for a = 0, a i = N i − 1 for a = 1, and h (V ) a = 0. As observed above, triple vertices are helicity-independent, while the coefficientsŶ a andẐ ia,ν encode the linear dependence on the loop momenta. Quartic vertices are independent of the loop momenta and have the form The dependence of the two-loop numerator (5.9) on the two independent loop momenta q 1 , q 2 is encoded in the tensorial representation and its contribution to the two-loop scattering probability density is Re U µ 1 ···µr,ν 1 ···νs I µ 1 ···µr,ν 1 ···νs , with the two-loop tensor integrals I µ 1 ···µr,ν 1 ···νs = dq 1 dq 2 q µ 1 1 · · · q µr 1 q ν 1 2 · · · q νs Table 1. Number of independent tensor structures q µ1 1 . . . q µr 1 q ν1 2 . . . q νs 2 symmetrised in the q 1 and q 2 indices up to maximal ranks R 1,2 in q 1,2 , i.e. for r ≤ R 1 and s ≤ R 2 . 0  1  2  3  0  1  5  15  35  1  5  25  75  175  2  15  75  225  525  3  35 175 525 1225  4  70 350 1050 2450  5 126 630 1890 4410 The CPU efficiency and the memory footprint of the dressing recursion (5.10) depend in a critical way on the tensorial and helicity structure of the various building blocks, which depend in turn on the order in which they are attached to each other. In general, the most relevant aspects for the efficiency of the algorithm are the number of independent tensor coefficients, and the structure of the Feynman rules involved in the chain segments S (i) k and in the connecting vertices V a . At step n of the construction, the structure of the first term on the rhs of (5.10) iŝ Here the number of independent tensor structures q µ 1 1 . . . q νs 2 grows exponentially with the rank in q 1 and q 2 as illustrated in Table 1, and the tensorial rank increases during the construction. In renormalisable theories, attaching a loop segment S (i) k or a connecting vertex V a augments the rank in a loop momentum q i by either 0 or 1. In addition, the number of tensor coefficients in (5.20) increases by a factor 4 for each one of the open Lorentz/spinor indices β 1 , . . . , β N l . Such indices are associated with the various loop chains, and their number is typically N l = 0, 2 or 3 in the phases of the construction that have a significant CPU cost. Finally, the number of tensor coefficients is also proportional to the number of independent helicity configurations h n−1 in (5.20). Analogous considerations hold for the term K n on the rhs of (5.10).
In general, we observe that the construction of a single chain has the same complexity as for a one-loop chain, due to the dependence on a single loop momentum and two open indices, while the steps involving the two-loop vertices V a are much more expensive, since they involve three indices and they typically depend on two loop momenta.
Based on these qualitative considerations we have selected from all possible algorithms of the form (5.10) the most promising candidates. The algorithm presented in the following section has been identified as the most efficient candidate based on a CPU cost simulation for a wide range of QED and QCD Feynman diagrams. In this simulation the CPU cost was approximated by the number of numerical multiplications, which represent the most expensive operations.
It is worth mentioning that the algorithm presented in Section 5.2 is roughly two orders of magnitude faster than a naive implementation, in which all three chains are first constructed independently of each other and are then contracted with the vertices V 0 and V 1 . The inefficiency of this naive approach is due to the fact that the expensive V 0 and V 1 contractions are carried out when the number of active helicities, Lorentz/spinor indices and tensor rank in the loop momenta are all maximal.

Integrand of a single irreducible two-loop diagram
In this section we present a dressing algorithm of type (5.10) for the construction of the integrand of a single irreducible two-loop diagram of the form (5.1) and its interference with the Born amplitude. The organisation of the full set of irreducible two-loop diagrams is discussed in Section 5.3.
As a starting point, given an irreducible two-loop diagram, its chains and connecting vertices are ordered in a well-defined way that will correspond to the sequence of dressing steps.
Ordering rules: The ordering is implemented by assigning the labels V 0 , V 1 and C 1 , C 2 , C 3 to specific vertices and chains as follows.
I) The three chains are ordered by their number of segments in such a way that N 1 ≥ N 2 ≥ N 3 . In case of equality, the number of helicity configurations along the chain is used as a criterion. In case of equality there, the minimal external particle index on the chain is used, as defined in (2.28).
II) The role of V 0 and V 1 is assigned to the connecting vertices according to the following ordered set of criteria: (i) A single three-gluon vertex or a ghost-gluon-vertex with a rank increment in q 1 or q 3 becomes V 0 . 8 (ii) A single two-loop four-gluon vertex becomes V 1 .
(iii) Use the external-particle indices, as defined in (2.28), and the position of these particles along the chains: Particle 1 is required to be closer to V 0 , in case of equality particle 2, etc.
These ordering rules fix the direction in which the various chains are constructed, namely from vertex V 0 to V 1 as depicted in (5.1). Moreover they determine the order and the way in which the three chains and the two connecting vertices are constructed and attached to each other. For this reason, the ordering rules affect the efficiency of the dressing algorithm in a significant way. The following algorithm has been optimised by testing a wide range of ordering rules, using for example different hierarchies of the criteria in I) and/or interchanging the roles of the three chains, e.g. making C 3 the chain with the most helicity configurations. Furthermore the vertex rules have been optimised for each combination of QED and QCD vertices as V 0,1 . The reasoning behind the above ordering rules will be explained together with the steps of the algorithm for which they are relevant. This algorithm consists of two main parts.

Part 1: Construction of chain C 3
The shortest chain, i.e. C 3 , is fully dressed through the recursion steps with the initial condition N n . Similarly as in (3.13)-(3.17), the above recursion is implemented in terms of tensor coefficients, and the result i.e. the full integrand is built by starting from the chain C 1 , which is subsequently connected to the vertex V 1 and the chain C 3 . Finally, the vertex V 0 and the chain C 2 are attached.
Step 1 -Construction of chain C 1 : The goal of this step is the construction of the numerator of the longest chain C 1 interfered with the colour-Born factor U 0 , defined in (5.8) combined with the colour-Born interference factor (5.8) and summed over the relevant helicities, i.e.
Here h (1) and h correspond, respectively, to the helicity configurations of the full chain C 1 and of the whole process. Upon summation over h (1) , the quantity on the lhs depends only onȟ (1) for n = 0, . . . , N 1 − 1. In the n-th dressing step, the helicity d.o.f. h (1) n of the n-th segment are summed on-the-fly with the technique of Section 3.3. As a result, U (1) n depends only on the helicity state of the undressed part of the two-loop diagram, whereĥ (1) n is defined in (5.13) and corresponds to the dressed part of C 1 . Each time that a new segment is attached, the number of remaining helicity configurations decreases aš h n . The initial condition for the recursion (5.27) is  (1) . Since C 1 is the longest chain, a large number of helicities is already summed during this part of the algorithm, and a large number of segments is constructed with dependence on a single loop momentum.
Step 2 -Connecting the V 1 vertex and chain C 3 : In this twofold step the chain C 1 interfered with U 0 is connected to V 1 and to the previously constructed numerator of C 3 , and C 1 interfered with U 0 , .  Step 3 -Connecting the V 0 vertex: In this step V 0 is connected to the previously constructed object (5.31), which consists of the numerators of C 1 , V 1 and C 3 interfered with U 0 , . (5.32) Here the number of open indices is reduced to two. For quartic vertices V 0 the associated helicity d.o.f. are summed on the fly, and the output depends only on the helicity h (2) of the still undressed chain C 2 . Also this step is implemented using a tensorial representation, but now we switch to the independent loop momenta q 1 , q 2 by replacing q 3 = −(q 1 + q 2 ). For a generic polynomial with maximum ranks r 1,3 in q 1,3 , the replacement q 3 = −(q 1 + q 2 ) results into a polynomial with maximum ranks R 1 = r 1 +r 3 and R 2 (r) = 0, . . . , r 3 in q 1 and q 2 . In our implementation of (5.32), by default this replacement is applied only upon contraction of U with V 0 and summation over h (V ) 0 . In some cases it can be more efficient to apply q 3 = −(q 1 + q 2 ) to U (13) 1 already at the end of step 2. In QCD this is the case if the vertex V 1 in step 2 is a triple gluon vertex or a ghost vertex that does not increase the rank in q 3 .
For an efficient implementation of the entire algorithm it is important to note that-as explained in the following-the CPU cost of step 3 is much lower as compared to step 2. For example, let us consider step 3 for the case of a triple-gluon vertex, where the prefactor with the coupling constant has been suppressed and can be restored as an overall factor at the end of the construction. When this triple vertex is attached to the tensor U where the tensor-coefficient indices n µ and m ν correspond to the powers in q µ 1 and q ν 3 , respectively. In this symmetrised representation, multiplying A(q 1 , q 3 ) by a component of q 1 or q 3 amounts to a trivial reshuffling of the tensor coefficients. For instance, tens additional multiplications. In total, thanks to additional optimisations, our implementation of step 2 for a triple-gluon vertex involves a number of multiplications slightly above 700 × N tens . Given the potentially large number of helicity and tensorial components, this number can range from order 10 5 to 10 7 for nontrivial two-loop diagrams.
In general, step 2 is the most expensive operation of the entire algorithm, and its exact cost depends on the nature of the connecting vertex V 1 . For this reason, the CPU efficiency of the entire algorithm depends in a critical way on the ordering rule II, which assigns the roles of V 1 and V 0 to the actual connecting vertices of two-loop diagrams. Regarding point (i) of the ordering rule II, complex vertices are preferably chosen to be V 0 . In QCD the triple-gluon vertex is by far the most CPU expensive one, and as a consequence it is only assigned the role of V 1 if both connecting vertices V 0,1 are of this type. The rules for ghostgluon vertices are guided by which rank of a loop momentum q i is increased, and hence at which point in the algorithm the replacement q 3 = −(q 1 + q 2 ) is performed. Regarding point (ii), a quartic vertex is preferred to be inserted as V 1 , since in this way the on-the-fly summation of the helicities h (V ) 1 of its external subtree can be performed at an earlier stage. Finally we note that when both V 0 and V 1 are fermion-gauge-boson vertices, whose tensorial structure consists of γ-matrices containing only a few non-vanishing components, the most efficient implementation available in our framework is to combine V 1 and V 0 into a single tensor Γ(V 1 , V 0 ) and to merge steps 2 and 3 in a single operation, where the index sums run only over the non-zero components. This is particularly efficient in pure QED calculations, where all vertices are proportional to γ µ .
Step 4 -Connecting C 2 : The construction of the two-loop numerator is completed by attaching the chain C 2 . The various segments of C 2 are connected through a sequence of dressing steps for n = 0, . . . , N 2 − 1. At step n the helicity d.o.f. h (2) n of the n-th segment are summed on the fly, and the result depends only on the helicities of the still undressed segments, The initial condition for the recursion is given by the outcome of step 3, i.e.
10 (q 1 , q 2 , h (2) ) , (5.43) whereh (2) −1 = h (2) corresponds to the helicity configuration of the full chain C 2 . The dressing steps (5.41) are similar to the ones for the construction of C 1 , since the segments S (2) n depend on a single loop momentum q 2 . However, the dressing of chain C 2 features an higher complexity due to the dependence of U (123) n on two loop momenta, which leads to a larger number of tensor components.
During the last step the loop is closed by contracting the index β (2) N 2 between vertex V 1 and chain C 2 , U(q 1 , q 2 ) = Tr U (123)  Figure 3. Example of the interdependence of the tensorial ranks. The coloured area shows the ranks (r, s) in (q 1 , q 2 ) that result from the q 3 = −(q 1 + q 2 ) replacement applied to a polynomial with ranks r 1 , r 2 , r 3 ≤ 2 in q 1 , q 2 , q 3 . In this case the maximum rank in q 2 obeys R 2 (r) ≤ min(4, 6 − r). Note that the excluded region with r, s ≤ 4 and s + r > 6 corresponds to the highest-rank configurations, which involve a large number of tensor coefficients.

Efficient implementation of steps 2-4:
The steps 2 to 4 involve polynomials of the form (5.33) and (5.34), with a large number of tensor coefficients. For their efficient implementation, besides using symmetrised representations like (5.36), we systematically exploit the correlation of tensorial ranks as explained in the following. Let us consider polynomials in the two independent loop momenta q 1 and q 2 , which consist of various monomials of rank (r, s) in (q 1 , q 2 ). As a result of the particular form of the q 1 , q 2 dependence of the involved Feynman rules, the maximum ranks R 1 = max(r) and R 2 = max(s) are not independent. For instance, attaching a triple-gluon vertex can increase the rank (r, s) of a monomial to (r + 1, s) or (r, s + 1) but not to (r + 1, s + 1). In general, as a result of such correlations, for the subset of tensor coefficients with fixed rank r in q 1 , the maximum rank in q 2 depends on r, i.e. R 2 = R 2 (r), and vice versa. As discussed in (5.34), this kind of correlation arises also from the q 3 = −(q 1 + q 2 ) replacement. An example of the corresponding R 2 (r) dependence is shown in Fig. 3. In our implementation, for each single step of the algorithm the values of R 1 and R 2 (r) are determined based on the actual q 1 , q 2 dependence of the involved Feynman rules, and all operations are restricted to the tensor components with r ≤ R 1 and s ≤ R 2 (r). In particular, we systematically avoid memory allocation for irrelevant tensor components. As a result the overall memory usage is typically reduced by more than a factor two as compared to an implementation with independent R 1 and R 2 .

Two-loop tensor coefficients for scattering amplitudes
For a set of Feynman diagrams, in particular for all irreducible two-loop diagrams of a full scattering amplitude, the algorithm is structured in such a way that the overall number of dressing steps is minimised. This is achieved by identifying all equivalent dressing steps that are required in multiple diagrams and performing such steps only once. To this end, all diagrams are ordered as discussed in Section 5.2, and the steps described above for a single diagram are performed for the full set of diagrams in the following way. Part 1: The chains C 3 of all diagrams are grouped according to their number of segments N 3 , and the various groups are dressed starting from the lowest N 3 , i.e. the chains with only a propagator segment S (3) 0 . All fully or partially dressed chains that have already been constructed are stored and can be recycled whenever they appear as part of another chain. Since the various chains are ordered such that N 3 ≤ N 2 ≤ N 1 , all chains C 3 involve only a few segments, 9 which require a rather small number of dressing equations.
Part 2: For the main algorithm, where the full two-loop amplitudes are constructed, we employ an extension of the on-the-fly diagram merging procedure, first introduced in [68] for one-loop amplitudes. This approach exploits the factorisation of the numerators of Feynman diagrams into segments. Consider for instance two diagrams with the same loop topology, i.e. the same scalar propagator denominators, which have the same undressed segments after a certain number n of recursion steps, e.g.
Here U A,n and U B,n can differ due, for instance, to a different external subtree in the n-th segment of the chain C 1 . Since ultimately the sum of all two-loop diagrams interfered with the full Born amplitude is needed, the partial numerators U (1) To this end, the full set of two-loop diagrams is constructed segment by segment, i.e. each of the steps 1 to 4 in Section 5.2 is performed on all diagrams before continuing with the next step. For example, the first dressing step U 1 . After each dressing step along chain C 1 and C 2 the algorithm recognises and performs all possible merging operations before proceeding with the next segment. After the final dressing and merging steps all diagrams with the same topology, and hence corresponding to a single type of tensor integral, are summed into a single object.
The last criterion in ordering rule I, see Section 5.2, namely the ordering of chains by external particle indices in case of N i = N j and the same number of helicities, is designed in such a way that on-the-fly merging opportunities during the construction of sets of diagrams are maximised. In general, merging steps can be performed for any number of diagrams, independently of their ranks. The overall CPU efficiency benefit of the on-the-fly merging is 10 to 20% for most processes.
Alternatively, to the segment-by-segment approach, the full set of recursion steps 1 to 4 can be performed diagram by diagram, without merging. This makes it possible to free the memory of previously computed partially dressed diagrams, thereby minimising memory usage. Since memory is usually not the bottleneck of the calculation, we chose the segment-by-segment approach as default.

Technical performance of the two-loop algorithm
The two-loop algorithms presented in Sections 4-5 have been implemented in the Open-Loops framework in a fully automated way. The present implementation supports QED and QCD corrections to any SM process, and the full generality of our method allows for extensions to any other model. This section is devoted to the validation and the technical performance of our implementation. In particular, we discuss the numerical stability, CPU efficiency and memory footprint for a variety of 2 → 2 and 2 → 3 processes at two loops in QED and QCD. Since the construction of reducible two-loop diagrams is largely based on well-established one-loop techniques, in the following we restrict ourselves to the contributions stemming from irreducible two-loop diagrams, which are constructed with the algorithm of Section 5.

Validation with pseudotree test
For the validation of the two-loop algorithm we have implemented a so-called pseudotree test, which was employed to validate every new numerical routine, e.g. all possible two-loop vertices V 0,1 , in single diagrams, as well as the full algorithm for a wide range of processes.
The idea of the pseudotree test is to cut-open L-loop Feynman diagrams by "cutting" L loop lines, such as to obtain tree diagrams that can be evaluated with well established treelevel algorithms. In the case of irreducible two-loop diagrams, as illustrated in Fig. 4, we cut the loop lines associated with D In the pseudotree test the cut-open diagrams of Fig. 4 are constructed in two different ways, as detailed in the following.
First, the colour-and helicity-summed interference of the cut-open two-loop diagram with the Born amplitude is computed using the well-tested tree-level algorithm (t) in OpenLoops. This calculation is carried out at fixed external momenta q 1 , q 2 , and is implemented in double precision (DP) and quadruple precision (QP), where the latter is intended as a benchmark. The Born two-loop interferences computed in this way are denoted W 0 . The resulting loop numerator U (q 1 , q 2 ), defined in (5.9), is obtained in the tensorial representation (5.17), which encodes the full q 1 , q 2 dependence in the form of tensor coefficients. Including all denominators, we obtain which is evaluated at fixed q 1 , q 2 and compared against the result obtained with the tree algorithm. The Born two-loop interference computed in this way is denoted W where p=DP,QP. This test was applied to a wide range of single Feynman diagrams and full SM processes, considering two-loop QED or QCD corrections.
In particular, we have investigated the 2 → 2 and 2 → 3 processes listed in Table 2, which involve all possible incarnations of the QED and QCD Feynman rules in the various steps of the algorithm of Section 5. The double-precision comparison A (t) DP showed excellent agreement for all tested diagrams and processes with typical accuracy at the level of 10 −15 .

Numerical stability
In order to demonstrate the numerical stability of the two-loop algorithm we have assessed the relative deviation between numerical evaluations of (6.2) in DP and QP,    Table 2. Processes considered for the validation of the two-loop algorithm and for the assessment of its CPU efficiency and memory footprint (see Sections 6.3-6.4). The massive and massless states that enter closed fermion loops are listed in the third and fourth columns. Light quarks are implemented as one or two first-generation fields. In processes with a single internal quark field u, light-fermion loops are multiplied with the number of light quarks N l in order to restore all active massless quark flavours. If the doublet (u, d) is used, light-fermion loops are multiplied with the number of light generations N l /2. corrections process type massless fermions massive fermions process as a benchmark by confirming that the pseudotree test (6.3) in QP yields a typical accuracy at the level of 10 −30 and always much better than 10 −17 , which is the limit accuracy in DP.
The numerical accuracy (6.4) was tested for selected processes using samples of 10 5 uniformly distributed phase-space points. Results for a 2 → 2 and a 2 → 3 process are reported in Fig. 5. In both cases the bulk of the points has a relative uncertainty of 10 −16 to 10 −14 , and the upper bounds for the relative uncertainties are of order 10 −12 and 10 −11 for the 2 → 2 and the 2 → 3 process, respectively.
In conclusion, the two-loop algorithm for the construction of tensor coefficients is fully implemented and validated for QED and QCD corrections. Its numerical stability is similarly good as for the tree-level and one-loop OpenLoops algorithm. This is an important prerequisite for a full two-loop calculation, where the dominant numerical instabilities are expected to arise from the reduction of tensor integrals, and should not be further enhanced by the calculation of tensor coefficients.

CPU efficiency
In this section we assess the CPU cost of the two-loop algorithm of Section 5 and-in order to judge its impact in the context of realistic applications-we compare it to the cost of the related real-virtual contributions to a full NNLO calculation.
The CPU cost of the algorithm of Section 5 is influenced by several aspects, such as the number of Feynman diagrams, the presence of massive propagators, the number of external helicity configurations (2 for fermions, 3 for W and 1 for H bosons) and the complexity of the vertices, in particular the connecting vertices V 0,1 , as well as the maximum tensor ranks. The overall impact of these different aspects is illustrated in Fig. 6, where we present the total CPU cost for a wide range of processes with two-loop QED and QCD corrections. Again we consider the 2 → 2 and 2 → 3 processes listed in Table 2, which involve all relevant building blocks of our algorithm in different combinations. Colour and helicity sums are included throughout, and all two-loop timings in Fig. 6 correspond to the segment-by-segment approach.
In the upper frame of Fig. 6 we present the virtual-virtual timings per phase-space point, t VV , for the construction of the two-loop tensor coefficients. Plotting these timings versus the number of irreducible two-loop diagrams N diags we observe clusters of processes of similar number of diagrams, which also have similar timings. The simplest 2 → 2 processes, such as e + e − → e + e − , take around 10 ms/point. Next there is a group of more complex 2 → 2 and simple 2 → 3 processes, which require about 100 ms/point, the fastest one being uū → ttH at 65 ms/point. The next group consists of more complex 2 → 3 processes, the fastest one being dd → uūg at 261 ms/point, as well as the most expensive 2 → 2 process, gg → gg. The most time consuming processes we studied are gg → ttH, which takes 2.5 s/point, and gg → ggg, which takes 9.2 s/point. The timings plotted in Fig. 6 are presented in Appendix A.
In general, we observe that the two-loop runtimes grow only linearly with the number of diagrams, i.e. the ratio t VV /N diags is approximately a process-independent constant. On     Figure 6. Runtimes per phase-space point for the calculation of the two-loop tensor coefficients on a single core Intel(R) Core(TM) i7-6600U @ 2.6 GHz with 16GB RAM. The timings correspond to the QED and QCD processes in Table 2 and are plotted versus the number N diags of irreducible two-loop diagrams. The upper frame displays the absolute two-loop timings t VV in ms. The middle frame shows the ratio t VV /t full RV , where t full RV is the time for computing the full one-loop probability density W (1) 01 (including tensor integrals) for a related process with one extra gluon or photon which enters the real-virtual part of the same NNLO calculation. The lower frame shows the ratio t VV /t RV , where t RV is the cost of computing the one-loop tensor coefficients of W (1) 01 without tensor integrals. Colour and helicity sums are included throughout, and the latter are always carried out on-the-fly. All timings have been taken in DP using the GNU Fortran 7.5.0 compiler and averaging over 1000 points. the employed CPU we find t VV ≈ N diags × 150 µs . (6.5) We also observe that adding an extra external gluon (photon) to a process in QCD (QED) increases the number of diagrams as well as the computation time t VV by approximately one order of magnitude, e.g. in e + e − → e + e − (+γ) and gg → tt(+g). These processindependent scaling features are comparable to the behaviour of one-loop calculations in OpenLoops.
In the lower frames of Fig. 6 we show the ratio of virtual-virtual and real-virtual timings for partonic subprocesses that contribute to the same NNLO calculation. More precisely, the virtual-virtual time t VV , which corresponds to the two-loop tensor coefficients for a given 2 → N process, is compared to the real-virtual time t RV corresponding to the oneloop correction to the associated 2 → N + 1 process with one extra gluon or photon. Helicity d.o.f are always summed one-the-fly, both at one and two loops.
The real-virtual timings t full RV in the middle frame correspond to the cost of the full realvirtual density W (1) 01 in (2.11), i.e. in t full RV we include also the cost of the required one-loop tensor integrals. 10 In this way we gain an idea of the share of the two-loop tensor coefficient construction in a full NNLO calculation. As shown in Fig. 6, in spite of the fact that t VV and t full RV vary by more than three orders of magnitude, their ratio amounts to and is approximately constant. In other words, independently of the type of process and its particle multiplicity, the cost of two-loop tensor coefficients per phase-space point is only a few times higher as compared to the related real-virtual amplitudes. The lower frame of Fig. 6 shows the same ratio for real-virtual timings t RV that do not include the contribution of tensor integrals. This provides a more direct comparison of the cost of one-loop and two-loop NNLO building blocks at the level of tensor coefficients. For this ratio we find and again we observe an approximately process-independent behaviour. This is related to the fact that the number of two-loop diagrams for a given process usually lies between the number of one-loop diagrams for the corresponding process with one and two extra partons (gluon or photons). We also note that the total tensor rank in the loop momenta is the same at two-loop level and at one-loop level with two extra partons, and comparing the cost of the corresponding tensor coefficients we find that t VV is a factor 3 to 8 less expensive wrt the one-loop coefficients for processes with two extra partons (see Appendix A). Considering the much higher complexity of two-loop diagrams as compared to one-loop diagrams, these are very promising findings. Finally, let us compare the efficiency of the diagram-by-diagram and segment-by-segment approaches discussed in Section 5.3. The former uses less memory and tends to be slower, while the latter requires more memory and tends to be faster thanks to the on-the-fly merging of diagrams. For simple processes the two approaches feature similar timings, their ratio being 1 ± 0.15, and in most cases the segment-by-segment approach being faster. For more complex processes, however, the segment-by-segment approach outperforms the diagram-by-diagram approach by a factor 1.15 to 1.35, with the highest ratio of 1.35 for gg → ggg. This behaviour is expected, since the efficiency gain due the on-the-fly merging Table 3. Memory required by the full set of tensor coefficients for the processes of Table 2. The two central columns correspond to the construction of two-loop coefficients in the segmentby-segment and diagram-by-diagram approaches (see Section 5.3). The last two columns report the memory footprint of the one-loop coefficients for the real-virtual contributions to the same NNLO calculation, as well as the memory footprint of the full real-virtual calculation, including the reduction and evaluation of the tensor integrals with Collier. For these the public OpenLoops 2 program with on-the-fly helicity summation was used. is larger for more complex processes with a higher number of Feynman diagrams. As the default option for our two-loop algorithm the segment-by-segment approach is the clear choice.
In conclusion, for the same number of phase-space points the CPU cost of two-loop tensor coefficients is of the same order as the cost of the related real-virtual parts of a NNLO calculation. However, the number of required points for the integration of the twoloop contributions is typically much lower. In the light of these observations, the presented two-loop timings are quite promising. In practice we expect that the cost of two-loop tensor coefficients should play a subleading role in the CPU budget of complete 2 → 3 NNLO calculations.

Memory Usage
The construction of two-loop tensor coefficients with the algorithm of Section 5 can require an important amount of memory. This is quantified in Table 3 for the full set of processes listed in Table 2. In addition to the memory footprint of the segment-by-segment and diagram-by-diagram approaches at two loops, for comparison we also report the cor-responding memory usage for real-virtual contributions, i.e. for the one-loop corrections to the related processes with one extra parton. At two loops only the memory for the storage of tensor coefficients is considered, while at one loop we present both the memory consumption for the construction of the tensor coefficients and for the full calculation.
We observe that the memory requirements for the two-loop tensor coefficients in the diagram-by-diagram and segment-by-segment approaches are, respectively, a factor 1-3 and 2-13 higher wrt the real-virtual case. Including the one-loop tensor integrals, the diagramby-diagram and segment-by-segment require, respectively, a factor 0.3-2.3 and 0.8-9.4 of the memory for the full real-virtual calculation. The segment-by-segment approach uses 2 to 6 times more memory than the diagram-by-diagram approach for the construction of the tensor coefficients. Thus the latter may be preferable when memory consumption plays a critical role. However, this is never the case for the considered processes, and both two-loop approaches are well within the scope of current computing environments.

Conclusion
We have presented a new and fully general algorithm for the efficient construction of twoloop integrands. To this end we have designed and implemented a generalisation of the open-loops method to two loops. In this approach, the polynomial dependence of two-loop integrand numerators on the loop momenta q 1 , q 2 is encoded in a set of tensor coefficients that are constructed through a numerical recursion. The interference with the Born amplitude as well as helicity and colour sums are systematically included, and the resulting tensor coefficients are ready to be combined with the relevant two-loop tensor integrals to form scattering probability densities at two loops.
For reducible two-loop diagrams, which factorise into one-loop subdiagrams, we have presented an algorithm that exploits and extends various aspects of the pre-existing Open-Loops program at one loop, while for irreducible two-loop diagrams we have developed a largely new and more sophisticated algorithm, which represents the main novelty of this paper.
Irreducible two-loop diagrams involve three loop chains that depend, respectively, on the loop momenta q 1 , q 2 , q 3 = −q 1 − q 2 , and are attached to each other through two connecting vertices. Each loop chain corresponds to a sequence of so-called loop segments, each consisting of a loop propagator attached to one or two external subtrees through a vertex. The numerator of an irreducible two-loop diagram is constructed through a numerical recursion where the various loop segments and connecting vertices are attached to each other one after the other. The single steps of the recursion correspond to processindependent operations that depend only on the Feynman rules of the model at hand, and the entire recursion is implemented in terms of tensor coefficients. Due to the large number of tensorial structures and helicity degrees of freedom, the most expensive individual steps can require of the order of 10 6 multiplications per two-loop diagram. Thanks to a systematic comparison of several algorithmic options and a detailed cost analysis we have identified an efficient algorithm, which turned out to outperform naive approaches by two orders of magnitude. This high efficiency is achieved through an optimal ordering of the construction steps, the factorisation of colour structures, the on-the-fly summation of helicity degrees of freedom, and various other tricks.
The algorithm has been implemented in a fully automated way in the OpenLoops framework, and this first implementation supports two-loop QED and QCD corrections to any scattering process within the Standard Model. Its correctness and numerical stability have been verified by means of a so-called pseudo-tree test in combination with quadrupleprecision benchmarks. In order to assess the efficiency of the new algorithm we have studied a set of 2 → 2 and 2 → 3 processes with numbers of two-loop diagrams ranging from order 10 2 to 10 5 per process. We have observed that the cost of the two-loop tensor coefficients per phase-space point for an entire process is approximately proportional to the number of (irreducible) two-loop diagrams. On a single Intel i7 core this cost is around 150 µs per two-loop diagram, and for typical 2 → 3 partonic processes at two loops in QCD it is of the order of one second. Finally we have shown that the cost of the two-loop tensor coefficients that enter the virtual-virtual part of a NNLO calculation is comparable to the one of the one-loop ingredients that enter the corresponding real-virtual part.
In conclusion, the presented algorithm provides a key building block for the construction of an automated generator of scattering amplitudes at two loops, and the observed technical performance guarantees its applicability to arbitrary 2 → 2 and 2 → 3 processes.

A CPU efficiency measurements
In Table 4 we report the explicit timings that have been discussed in Section 6.3 and summarised in Fig. 6. The first column corresponds to the full list of hard scattering processes defined in Table 2. The second and third columns indicate, respectively, the number of irreducible two-loop diagrams, N diags , and the CPU time t VV per phase space point for the construction of two-loop tensor coefficients, as defined in Section 6.3. This time t VV corresponds the default variant of the two-loop algorithm, i.e. the segment-bysegment approach (see Section 5.3). The fourth column shows the ratio of t VV to the corresponding time t dbd VV required by the diagram-by-diagram approach. The next two columns give the ratios of t VV to the timings for the one-loop tensor coefficients (t RV ) or full one-loop amplitude including tensor integrals (t full RV ) for the corresponding process with one additional parton (gluons in QCD and photons in QED). The last two columns give the ratios of t VV and the timings for the one-loop tensor coefficients (t RRV ) or full one-loop calculation (t full RRV ) for the corresponding process with two extra partons. All time measurements have been carried out as detailed in the caption of Fig. 6.