Integration-by-parts reductions of Feynman integrals using Singular and GPI-Space

We introduce an algebro-geometrically motivated integration-by-parts (IBP) reduction method for multi-loop and multi-scale Feynman integrals, using a framework for massively parallel computations in computer algebra. This framework combines the computer algebra system Singular with the workflow management system GPI-Space, which are being developed at the TU Kaiserslautern and the Fraunhofer Institute for Industrial Mathematics (ITWM), respectively. In our approach, the IBP relations are first trimmed by modern tools from computational algebraic geometry and then solved by sparse linear algebra and our new interpolation method. Modelled in terms of Petri nets, these steps are efficiently automated and automatically parallelized by GPI-Space. We demonstrate the potential of our method on the nontrivial example of reducing two-loop five-point nonplanar double-pentagon integrals. We also use GPI-Space to convert the basis of IBP reductions, and discuss the possible simplification of master-integral coefficients in a uniformly transcendental basis.

Frequently, when computing scattering amplitudes, IBP reduction is a crucial bottleneck step. It is a fundamental tool both for the reduction of integrals to master integrals (MIs) and for computing the master integrals themselves using the differential equation method. IBP relations (IBPs) are derived from integrating a total derivative [34],
$$0 = \int \prod_{j=1}^{L} \mathrm{d}^D l_j \, \frac{\partial}{\partial l_i^\mu} \left( \frac{v_i^\mu}{D_1^{a_1} \cdots D_k^{a_k}} \right),$$
where the $v_i^\mu$ are polynomials in the loop momenta $l_i$, the $D_i$ are the inverse propagators, and $D$ is the spacetime dimension.
The standard approach to obtain IBP reductions, by which we are able to express an integral as a linear combination of a finite number of MIs, is to generate sufficiently many IBP relations and then use the Laporta algorithm [35] to solve the associated linear system. The algorithm works by imposing an ordering on the different integrals and solving recursively. There exist multiple public and private implementations of this approach [32, 36-41], which usually generate a large linear system to be solved.
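To make the Laporta idea concrete, the following toy Python sketch eliminates the most complicated integral from each relation and back-substitutes, so that the surviving labels play the role of master integrals. It is purely illustrative: the dictionary representation, the integral labels, and the ordering function are our own assumptions, not part of any public implementation.

```python
from fractions import Fraction

def laporta_solve(relations, order):
    """Toy Laporta-style elimination: each relation is a dict
    {integral_label: coefficient}; 'order' ranks integrals, with larger
    values meaning 'more complicated'. Returns rules expressing every
    reducible integral in terms of the remaining (master) integrals."""
    rules = {}
    for rel in relations:
        rel = {g: Fraction(c) for g, c in rel.items()}
        # replace already-reduced integrals using the known rules
        while True:
            hit = next((g for g in rel if g in rules), None)
            if hit is None:
                break
            c = rel.pop(hit)
            for g, a in rules[hit].items():
                rel[g] = rel.get(g, Fraction(0)) + c * a
        rel = {g: c for g, c in rel.items() if c}
        if not rel:
            continue  # relation was linearly dependent
        # solve the relation for its most complicated integral
        lead = max(rel, key=order)
        c = rel.pop(lead)
        rules[lead] = {g: -a / c for g, a in rel.items()}

    def expand(expr):
        # back-substitute until only master integrals remain
        out = {}
        for g, c in expr.items():
            terms = expand(rules[g]) if g in rules else {g: Fraction(1)}
            for g2, c2 in terms.items():
                out[g2] = out.get(g2, Fraction(0)) + c * c2
        return {g: c for g, c in out.items() if c}

    return {lead: expand(rhs) for lead, rhs in rules.items()}
```

For instance, the relations $I_3 + 2I_2 - I_1 = 0$ and $I_2 - I_1 = 0$, ordered by index, yield $I_2 = I_1$ and $I_3 = -I_1$, with $I_1$ as the single master.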
In the case of a system of IBPs which does not have double propagators [42-44], however, we obtain a much smaller linear system. The IBPs without double propagators are physically related to dual conformal symmetries [45]. A significant simplification can be made by using unitarity methods: by considering a spanning set of generalized cuts, it is possible to reduce the size of the IBP system. This requires prior knowledge of a basis of MIs. Such a basis can be obtained by running the Laporta algorithm with constant kinematics, or by using specialized programs such as Mint [46] or Azurite [47]. (Note that the dimension of a basis of integrals can also be obtained by studying the parametric annihilators [48].) There is also the important technique [49] of simultaneously nullifying all master integrals except one, which often makes large-scale linear reductions feasible.
Besides the advances in purely analytical methods in recent years, there has been a lot of work towards numerical implementations of the generation of IBPs. The idea is to use either integer values or finite-field values for the kinematical invariants [30, 31, 38], depending on the difficulty of the problem, and then to run the same reduction several times for reconstruction. This method has been very successful in tackling difficult problems. Furthermore, it is possible to numerically generate and reduce the IBP relations and, skipping the IBP coefficient reconstruction, directly carry out an amplitude reconstruction (for examples, see [9, 10, 13, 50]). In this paper, we in particular present our own implementation of a semi-numeric rational interpolation method; see Appendix A for more details.
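A basic building block of such finite-field methods is recovering a rational coefficient from its image modulo a prime. The following self-contained Python sketch implements the standard extended-Euclidean (Wang-type) rational reconstruction; it illustrates the general technique and is not code from the implementations cited above.

```python
from math import gcd, isqrt

def rational_reconstruct(a, p):
    """Recover n/d with |n|, d <= sqrt(p/2) from its modular image
    a = n * d^{-1} mod p (Wang's rational reconstruction)."""
    r0, t0 = p, 0
    r1, t1 = a % p, 1
    bound = isqrt(p // 2)
    # run the extended Euclidean algorithm until the remainder is small
    while r1 > bound:
        q = r0 // r1
        r0, r1 = r1, r0 - q * r1
        t0, t1 = t1, t0 - q * t1
    n, d = (r1, t1) if t1 > 0 else (-r1, -t1)
    if d == 0 or d > bound or gcd(n, d) != 1:
        raise ValueError("no reconstruction within the bound")
    return n, d
```

For example, with $p = 101$ the image of $3/7$ is $3 \cdot 7^{-1} \equiv 87 \pmod{101}$, and `rational_reconstruct(87, 101)` returns `(3, 7)`. In practice one works with much larger primes (or several primes combined by the Chinese remainder theorem) so that the true coefficients fall below the bound.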
Furthermore, new approaches have been developed recently to obtain the reduction directly, without generating IBP relations from total derivatives. In [51], the direct solution method was presented to derive recurrence relations for Feynman integrals and solve them analytically with arbitrary numerator degree. One very promising development is based on the intersection theory of differential forms in the Baikov representation [52-54]. This approach calculates the master-integral coefficients from intersection numbers. There is also a very intuitive approach to reduce Feynman integrals by considering the η expansion of the Feynman prescription [55-57]. Using this approach, the scaling of the reduction computation depends only linearly on the number of master integrals. Furthermore, it is possible to determine two-loop planar diagram IBP coefficients directly from the Baikov representation [58].
In this paper, we present our new powerful IBP reduction method, which is based on:

1. Computational algebraic geometry. We apply the module intersection method from [59, 60], modified by using a suitably chosen degree bound for the Gröbner basis computation, to efficiently generate a small IBP system without double propagators (or IBPs with a given bound on the propagator exponents).
2. A modern framework for massively parallel computations in computer algebra, which combines the computer algebra system Singular [61] with the workflow management system GPI-Space [62]. We have completely automated our approach, and our algorithms run automatically in parallel on high-performance computing clusters. In this way, IBP results can be obtained in an efficient, reliable and scalable way. Our implementation can automatically determine the minimal number of points needed for interpolating the IBP coefficients; it can identify possible "bad" points, add more points if necessary, and interpolate the final result.
We demonstrate the power of our method by reducing the two-loop five-point nonplanar double pentagon diagram analytically, up to numerator degree 4. This is a nontrivial test, since the diagram has a complicated topology and there are five symbolic Mandelstam variables as well as the spacetime variable D.
Furthermore, we begin to look at the possible simplification of IBP coefficients by converting the master-integral basis. In this paper, we test the conversion to a "dlog" basis [63], a special case of the canonical basis [19]. We find that for the double pentagon diagram above, the byte size of the IBP coefficients on disk reduces significantly, from ∼ 2.0G in the Laporta basis to ∼ 0.48G in the dlog basis, that is, by 76%. The master-integral basis conversion is also automated in the Singular-GPI-Space framework.
Our paper is structured as follows. In Section 2, we present the general background on how to generate simple and trimmed IBP systems using computational algebraic geometry and finite-field methods, as well as the improvement on the algorithm in [59]. In Section 3, we give a short overview of how we use Singular in conjunction with GPI-Space. In Section 4, we describe how to model our algorithm in the Singular-GPI-Space framework, and discuss timings and scaling of the algorithm, focusing on the double pentagon diagram. This, in particular, demonstrates the potential of the Singular-GPI-Space framework for applications in high-energy physics. In Section 5, we review the algorithmic computation of a dlog basis, which has uniform transcendental weight, and we comment on how to convert coefficients from the Laporta basis to the dlog basis. In Section 6, we study the working example of our implementation, the double pentagon graph, in detail. We discuss the analytic IBP reduction and the conversion of IBP coefficients to the dlog basis. Finally, we present a summary and conclusion of this paper.
The results of our IBP reductions can be downloaded from the link below. For the convenience of the reader, we present the IBP coefficients in the dlog basis with the full scale dependence: https://www.dropbox.com/s/dnkr6h5t3vik2r0/IBPmatrix_dlog_basis_scaled.tar.gz We encourage researchers in the high-energy community to send us IBP reduction problems (mailto: alessandro.georgoudis@physics.uu.se) for cutting-edge precision calculations and the further sharpening of our new reduction method.

The module intersection method reloaded
In this section, we present a refined version of the approach of using module intersections to trim IBP systems. For a detailed account of the module intersection IBP reduction method, we refer to [59].

Module intersection
The Feynman integrals under consideration are labeled as
$$I(n_1, \ldots, n_m) = \int \prod_{j=1}^{L} \mathrm{d}^D l_j \, \frac{1}{D_1^{n_1} \cdots D_m^{n_m}},$$
where $L$ is the loop order and the $l_i$'s are the loop momenta. We have $E$ independent external vectors, which we label as $p_1, \ldots, p_E$. We assume that the Feynman integrals have been reduced on the integrand level, and set $m = LE + L(L+1)/2$, which equals the number of scalar products in the configuration.
For us it is convenient to use the Baikov representation [46, 64] for IBP reductions,
$$I(n_1, \ldots, n_m) = C^L_E\, U^{(E-D+1)/2} \int_{\Omega} \mathrm{d}z_1 \cdots \mathrm{d}z_m\, \frac{P(z)^{(D-L-E-1)/2}}{z_1^{n_1} \cdots z_m^{n_m}}.$$
Here, $P$ is the Baikov polynomial, which can be written as a Gram determinant,
$$P(z) = \det G(l_1, \ldots, l_L, p_1, \ldots, p_E), \qquad G(q_1, \ldots, q_n)_{ij} = q_i \cdot q_j,$$
with the scalar products rewritten in terms of the variables $z_1, \ldots, z_m$. Moreover, $U = \det G(p_1, \ldots, p_E)$ is the Gram determinant of the external momenta, and $C^L_E$ is a constant prefactor containing a constant Jacobian $J$ from the change of variables. The factors $U$ and $C^L_E$ are irrelevant for the IBP relations. As in [20, 44, 65], the IBP relations in the Baikov representation are of the type
$$0 = \int_{\Omega} \mathrm{d}z_1 \cdots \mathrm{d}z_m \sum_{i=1}^{m} \frac{\partial}{\partial z_i} \left( a_i(z)\, \frac{P(z)^{(D-L-E-1)/2}}{z_1^{n_1} \cdots z_m^{n_m}} \right), \qquad (2.5)$$
where each $a_i(z)$ is a polynomial in the variables $z_1, \ldots, z_m$. Note that $P$ vanishes on the boundary of the Baikov integration domain, so this form of IBP identities does not have surface terms. Suppose we wish to reduce an integral family with $n_j \le 0$, $j = \kappa+1, \ldots, m$, for some $\kappa$. That is, we face integrals with the inverse-propagator product $1/(D_1 \cdots D_\kappa)$ and the sub-topology integrals. We use the idea of restricting to IBP systems without double propagators [42], choosing suitable $a_i(z)$ to prevent the appearance of double propagators in (2.5). In the Baikov representation, we also need to avoid total derivatives with dimension shifts [20, 44]. These constraints translate into syzygy equations of the following type:
$$\sum_{i=1}^{m} a_i(z)\, \frac{\partial P(z)}{\partial z_i} + b(z)\, P(z) = 0, \qquad (2.6)$$
$$a_i(z) = b_i(z)\, z_i, \qquad i = 1, \ldots, \kappa, \qquad (2.7)$$
where $b(z)$ and the $b_i(z)$ are also polynomials in the $z_i$. Relation (2.6) avoids dimension shifts of the integrals, while (2.7) ensures that there is no double propagator for $D_i$ if the initial index $n_i = 1$ in (2.5). The goal is to find such polynomials $a_i(z)$, $b(z)$, and $b_i(z)$. Since we require polynomial solutions, this is not a linear algebra problem, but a computational algebraic geometry problem. We use the module intersection method from [59, 66] to solve (2.6) and (2.7) simultaneously. Note that the analytic generators of all solutions of (2.6) can be directly written down via either the canonical IBP vector method [20] or the Gram matrix Laplace expansion method [60]. The relations in (2.7) can be trivially expressed as a module membership condition. Hence, without any algorithmic computation, we know the individual solutions
for (2.6) and (2.7), respectively. These form polynomial submodules $M_1$ and $M_2$ of $R^m$ over the polynomial ring $R = \mathbb{Q}(c_1, \ldots, c_k)[z_1, \ldots, z_m]$ (where the variables $c_1, \ldots, c_k$ collect the Mandelstam variables and the mass parameters). The task is then to compute the module intersection
$$M_1 \cap M_2. \qquad (2.8)$$
This module intersection can be obtained by computing a module Gröbner basis in a particular ordering [59]. One decisive strategy is the localization technique, which allows us to compute $M_1 \cap M_2$ over the polynomial ring $\mathbb{Q}[c_1, \ldots, c_k, z_1, \ldots, z_m]$. In this manner, we treat the kinematic variables in the same way as the Baikov variables. This greatly speeds up the intersection computation for multi-scale problems, but results in a redundant generating system. The latter can be trimmed further by importing the result back to $R^m$ and removing redundant generators by checking the leading monomials. This is powered by Singular's command simplify. Once $M_1 \cap M_2$ is obtained, we know all simultaneous solutions of (2.6) and (2.7), and can use (2.5) to get IBPs without double propagators. We emphasize that, although (2.6) and (2.7) were originally designed for IBPs without double propagators, their solutions can also be used to simplify IBP systems with double or multiple propagators. Using these solutions $a_i(z)$, the resulting IBP system does not introduce integrals with higher powers of propagators, which also greatly decreases the size of the IBP system.
Frequently, instead of computing IBPs directly, we compute IBPs on spanning cuts and assemble the full IBPs afterwards. This amounts to setting some of the $z_i$ to zero in (2.6) and (2.7). For details on IBPs on cuts using the Baikov representation, we refer to [59].
Compared to the approach in [59], we present the following new features of the module intersection method in this paper:

• When we compute the intersection $M_1 \cap M_2$, instead of finding a full generating system, we heuristically impose a polynomial degree bound in the computation. Then we reduce the resulting IBPs over finite fields to test if we already have all the IBP relations needed. If the IBP relations are insufficient, we increase the degree bound and repeat the computation. This approach speeds up the intersection computation dramatically in many cases. In practice, we use the option degbound in the computer algebra system Singular to set the degree bound.
• In the approach of [59], the module intersection was only computed for the top sector, which, for the hexagon-box diagram, turned out to be sufficient for reducing integrals to a master-integral basis. In this paper, however, we compute the module intersection for the top sector and also for all subsectors. This approach may, in general, generate more IBP relations. Via the linear algebra trimming discussed in the next subsection, it eventually gives a block-triangular linear system and makes the linear reduction easier.

Linear reduction
For the simplified IBP system arising from the module-intersection method, we use our own linear reduction algorithm. The steps are:

1. Trim the linear system in two stages: (a) Set all the kinematic variables to integer values, and use linear algebra over a finite field to find the independent IBP relations. (b) Again over a finite field, carry out the reduction. From the intermediate steps, determine a sufficient subset of IBP relations for reducing the target integrals. These operations are powered by the finite-field computation tool SpaSM [67].
2. Remove the overlap between two different cuts and simplify the linear system: if two cuts have a common master integral, use the idea from [49] to set the master integral to zero in the IBP system of one of the two cuts. This later dramatically simplifies the IBP reduction for that cut.
3. For the linear system simplified by the first two steps, we use our own Singular reduced row echelon form (RREF) code over function fields to reduce the target integrals to master integrals. Our code applies both row and column swaps to find the optimal pivots. Note that column swaps change the set of master integrals. After the RREF computation, we convert the new master integrals back to the original master integrals. We have observed that this approach is in general much faster than fixing the column ordering and directly reducing the target integrals to the original master integrals.
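The row reduction with free pivot search used in steps 1 and 3 can be sketched as follows. This Python toy works over a prime field for simplicity (step 3 actually works over function fields in Singular); it allows both row and column swaps when searching for pivots, and returns the column permutation so that the implied change of basis can be undone afterwards.

```python
def rref_gfp(mat, p):
    """Reduced row echelon form over GF(p) with free pivot search:
    both row and column swaps are allowed; the returned permutation
    perm maps current column positions to original column indices."""
    m = [[x % p for x in row] for row in mat]
    nrows, ncols = len(m), len(m[0])
    perm = list(range(ncols))
    r = 0
    for c in range(ncols):
        if r == nrows:
            break
        # search the remaining submatrix for any nonzero pivot
        found = next(((i, j) for j in range(c, ncols)
                      for i in range(r, nrows) if m[i][j]), None)
        if found is None:
            break
        i, j = found
        m[r], m[i] = m[i], m[r]                      # row swap
        if j != c:                                   # column swap
            for row in m:
                row[c], row[j] = row[j], row[c]
            perm[c], perm[j] = perm[j], perm[c]
        inv = pow(m[r][c], p - 2, p)                 # inverse via Fermat's little theorem
        m[r] = [x * inv % p for x in m[r]]
        for i2 in range(nrows):
            if i2 != r and m[i2][c]:
                f = m[i2][c]
                m[i2] = [(a - f * b) % p for a, b in zip(m[i2], m[r])]
        r += 1
    return m, perm
```

A real implementation would, of course, pick pivots by a cost heuristic rather than taking the first nonzero entry, but the bookkeeping of the column permutation is the same.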
For difficult IBP reduction computations, we use a "semi-numeric" approach, which sets several, but usually not all, of the kinematic variables to numeric values (that is, to constant integers) for the reduction computation. Without loss of generality, for the kinematic variables $(c_1, \ldots, c_k)$, we set
$$c_i = a_i, \qquad i = 1, \ldots, k_1,$$
for some $k_1 < k$ and some $a_i \in \mathbb{Z}$.
The actual degrees of the coefficients in these variables can be determined by a univariate analytic computation (that is, we set all but one of the $c_i$ to constant values). For example, we may pick the dimension $D$ and all parameters $c_i$ except $c_1$ as random integers, and then carry out the reduction. This computation is much easier than the actual IBP reduction with fully analytic parameters. From the reduction, we determine the degree of $c_1$ in the final IBP reduction coefficients. Proceeding similarly for each $i$, we find the degree of each $c_i$. This determines the minimal number of semi-numeric points for the subsequent interpolation step. (See [31] for an alternative way of finding the degree of each parameter in a rational function.) After accumulating enough points, we collect the semi-numeric reduction results and interpolate to get the final IBP reduction coefficients. To do this, we first run step 3 above for a semi-numeric set of parameters, find the optimal pivots, and record the row/column swap history as a trace of our computation. For all other numeric values, we use the same trace to ensure a relatively uniform running time of the computation.
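As a simplified illustration of such a univariate degree probe, suppose the coefficient of interest is, after clearing a known denominator, a polynomial in a single variable that we can evaluate exactly; then its degree can be detected by finite differencing, since the $(d+1)$-th differences of a degree-$d$ polynomial vanish identically. This is only a sketch under that assumption; the actual implementation handles full rational functions (see Appendix A).

```python
from fractions import Fraction

def poly_degree(f, max_deg):
    """Degree of a black-box polynomial f with exact rational values,
    detected by repeated finite differencing; returns -1 for the zero
    polynomial, and raises if the degree exceeds max_deg."""
    vals = [Fraction(f(n)) for n in range(max_deg + 2)]
    for k in range(max_deg + 2):
        if all(v == 0 for v in vals):
            return k - 1  # k differencing steps killed the sequence
        vals = [b - a for a, b in zip(vals, vals[1:])]
    raise ValueError("degree exceeds max_deg")
```

For example, `poly_degree(lambda x: x**2 + 1, 4)` returns `2`. Exact (rational or finite-field) evaluation is essential here; with floating-point samples the differences would not vanish cleanly.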
In practice, we use our rational function interpolation algorithm described in Appendix A. We perform a reduction computation with a carefully chosen semi-numeric reference point and $c_1, \ldots, c_{k_1}$ symbolic. Using the reference-point result, we convert the rational function interpolation problem into individual polynomial interpolation problems for the numerators and denominators. With this approach, the number of "semi-numeric" computations is
$$\prod_{i=1}^{k_1} (d_i + 1),$$
where the $d_i$, for $1 \le i \le k_1$, are the maximal degrees of the $c_i$ in the numerator and denominator polynomials in the RREF matrix. This algorithm is also implemented in Singular.
For the semi-numeric reduction and interpolation, we need to parallelize our computations in an efficient way. Furthermore, with semi-numeric points, we may encounter some bad points in the reduction or interpolation. In order to make use of massively parallel computations in an efficient way, and to automate the workflow for the replacement of bad points, we use the modern workflow management system GPI-Space in conjunction with the computer algebra system Singular. We discuss the ideas behind this approach in the subsequent section.

Massively parallel computations using Singular and GPI-Space
Large-scale calculations, such as row reductions of IBP identities for Feynman diagrams relevant to current research in high-energy physics, are only feasible using parallel computing on high-performance clusters. The computer algebra methods applied in this context require modeling algorithms which rely on sub-computations with time and memory requirements that are difficult to predict. This is due, for example, to the behaviour of Buchberger's algorithm for finding Gröbner bases: although this algorithm performs well in many practical examples of interest, its worst-case complexity is doubly exponential in the number of variables [68]. Nevertheless, it turned out recently [69, 70] that massively parallel methods, which have been a standard tool in numerical simulation for many years, can also be applied successfully in symbolic computation. Proposing the general use of massively parallel methods in computer algebra, we describe our ongoing effort in this direction, which is based on connecting the computer algebra system Singular for polynomial calculations with the workflow management system GPI-Space. The latter consists of a scheduler distributing the actual computations to workers in the cluster, a virtual memory layer to facilitate communication between the workers, and a workflow management system which relies on modeling algorithms in terms of Petri nets.
In its basic form, a Petri net is a directed bipartite graph with two kinds of nodes, places and transitions: while a place can hold a number of indistinguishable (structure-less) tokens, a transition may fire if each of its input places contains at least one token (we then say that the transition is enabled). When fired, a transition consumes one token from each input place and puts one token on each output place. See Figure 1 for an enabled transition and its firing, and Figure 2 for a transition which is not enabled. In the figures, places are shown as circles, transitions as rectangles, and tokens as black dots. The execution of a Petri net is non-deterministic: at each step, a single randomly chosen enabled transition fires. We have observed that the randomized reformulation of deterministic algorithms in computer algebra in terms of Petri nets can lead to a more consistent and predictable behavior throughout the course of the computation.
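This firing semantics is easy to state in code. The following toy Python model (for illustration only; GPI-Space itself is not implemented this way) captures places, transitions, enabledness, and the non-deterministic execution loop.

```python
import random

class PetriNet:
    """Toy Petri net: marking maps each place to its token count; a
    transition is a pair (input places, output places)."""

    def __init__(self, marking, transitions):
        self.marking = dict(marking)
        self.transitions = dict(transitions)

    def enabled(self):
        # a transition is enabled if every input place holds a token
        return [t for t, (ins, _) in self.transitions.items()
                if all(self.marking[p] >= 1 for p in ins)]

    def fire(self, name):
        ins, outs = self.transitions[name]
        for p in ins:
            self.marking[p] -= 1   # consume one token per input place
        for p in outs:
            self.marking[p] += 1   # produce one token per output place

    def run(self, rng=random):
        # non-deterministic execution: fire random enabled transitions
        # until no transition is enabled any more
        while (choices := self.enabled()):
            self.fire(rng.choice(choices))
        return self.marking
```

For instance, a chain of two transitions f: a → b and g: b → c, started with two tokens on a, always terminates with both tokens on c, regardless of the random firing order.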

In our approach, we model the coarse-grained structure of an algorithm in terms of a Petri net. The transitions call procedures from the C-library version of Singular to do the actual computations. The result of this setup is a flexible framework for massively parallel computations in computational algebraic geometry (similar setups are possible using C-libraries of computer algebra systems aiming at possibly different application areas). Our framework has, for example, already been used to implement a non-singularity test for algebraic varieties [69, 71], the computation of combinatorial objects in geometric invariant theory [72], and the computation of tropical varieties associated to algebraic varieties [73].
For efficient use in practical programming, the basic concept of a Petri net has to be extended. Here, GPI-Space provides multiple additional features:

• Modeling complex algorithms just by the use of structure-less tokens is not very efficient. In GPI-Space, tokens can have a data type and hold actual data. In fact, it is often more efficient if the tokens just hold a reference to a storage place for the data (in memory or in the file system). Using the shared-memory subsystem of GPI-Space or the powerful file systems of modern high-performance clusters, computations can then scale far beyond the limitations of a single machine.
• The firing of a transition may be subject to conditions which have to be fulfilled by the input tokens.
• Transitions in practice involve computations which take time. The properties of Petri nets allow us to execute different enabled transitions at the same time (task parallelism) and to execute multiple instances of the same transition in parallel, provided the input places hold multiple tokens (data parallelism). In Figure 3, the transitions $f_1$ and $f_2$ can fire in parallel, and, if the input place of $f_i$ holds multiple tokens, then $f_i$ can fire in multiple instances. We have observed that some algorithms in computer algebra scale in a superlinear way when implemented in parallel as a Petri net. The reason is that, at run time, such algorithms can automatically determine, from a given set of paths, a path which leads to the solution in the fastest possible way (see [69, Section 6.2]).
In the next section, we illustrate the use of the Singular-GPI-Space framework for applications in high-energy physics by modeling our IBP reduction algorithm.

Parallel matrix reduction as a Petri net
In this section, we first describe how to model the parallel IBP reduction algorithm in terms of a Petri net. Focusing on the cut {1, 3, 4, 5} of the two-loop five-point nonplanar double pentagon diagram, we then discuss timings and scaling of the algorithm to indicate the practical use and significant potential of the Singular-GPI-Space framework for algorithmic problems in high-energy physics.

General structure of the algorithm
Our approach consists of a massively parallel execution of row reductions over function fields, where a number of parameters have been replaced by integers, followed by a parallel interpolation step to reconstruct the dependency on these parameters.
So the task is to find the reduced row echelon form $M_{\mathrm{red}}$ of a large linear system of equations, given as a matrix $M$ over the rational function field $\mathbb{Q}(c_1, \ldots, c_k)$. Since applying Gaussian elimination directly is not feasible, we instead proceed by substituting, say, the first $r$ parameters by the coordinates of a point $a \in \mathbb{Z}^r$, and then by computing the reduction $(M|_{c_1 \to a_1, \ldots, c_r \to a_r})_{\mathrm{red}}$.
We refer to Section 2.2 above for details on how we handle this reduction step. To determine the number of interpolation points required to reconstruct the dependency on $c_1, \ldots, c_r$, we find bounds for the degrees of the numerators and denominators for each parameter by doing a univariate row reduction (that is, all but one of the parameters are set to numeric values).
After the reduction, we check that the resulting matrix is equal to the desired result $M_{\mathrm{red}}|_{c_1 \to a_1, \ldots, c_r \to a_r}$ by normalizing it relative to a previously computed reference matrix with $c_{r+1}, \ldots, c_k$ constant, and performing degree checks using the exact degrees obtained from the univariate calculations. These steps are described in more detail in Appendix A. The final result $M_{\mathrm{red}}$ is then found by iteratively combining the reduced matrices via univariate interpolation (see again Appendix A). Let $d_1, \ldots, d_r$ be degree bounds for the entries of $M_{\mathrm{red}}$ in the parameters $c_1, \ldots, c_r$, respectively. To obtain $M_{\mathrm{red}}$ by interpolation, we need $d_1 + 1$ of the matrices $M_{\mathrm{red}}|_{c_1 \to a_1}$ over $\mathbb{Q}(c_2, \ldots, c_k)$, for pairwise distinct $a_1 \in \mathbb{Z}$. Similarly, to obtain any one of these matrices, we need $d_2 + 1$ matrices over $\mathbb{Q}(c_3, \ldots, c_k)$. Continuing inductively, this process ends with matrices defined over $\mathbb{Q}(c_{r+1}, \ldots, c_k)$, which are then computed by reduction with $c_1, \ldots, c_r$ numeric. This tree-like dependency structure is depicted in Figure 4.
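The entry-wise univariate interpolation used in this combination step can be illustrated as follows. The sketch uses plain Lagrange interpolation with exact rational arithmetic; it is a simplified stand-in for the scheme of Appendix A, applied to one matrix entry at a time.

```python
from fractions import Fraction

def lagrange_poly(points):
    """Exact coefficients (constant term first) of the unique polynomial
    of degree <= d through the d+1 given (x, y) pairs."""
    n = len(points)
    coeffs = [Fraction(0)] * n
    for i, (xi, yi) in enumerate(points):
        # build the basis polynomial prod_{j != i} (x - x_j) and its scale
        basis = [Fraction(1)]
        denom = Fraction(1)
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            new = [Fraction(0)] * (len(basis) + 1)
            for k, b in enumerate(basis):
                new[k] -= xj * b      # constant part of (x - x_j) * b x^k
                new[k + 1] += b       # shifted part
            basis = new
            denom *= xi - xj
        scale = Fraction(yi) / denom
        for k, b in enumerate(basis):
            coeffs[k] += scale * b
    return coeffs
```

For example, the values 1, 2, 5 at the points 0, 1, 2 reproduce the polynomial $1 + c^2$, i.e. the coefficient list `[1, 0, 1]`. In the tree of Figure 4, the "values" at the sample points are themselves matrices over the smaller function field, and the interpolation is applied entry by entry.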

Managing the interpolation
We model the current status of the interpolation process in a tree-like data structure corresponding to that from Figure 4, with references to the reduction results at the leaves, and references to the interpolation results at the other nodes. Within GPI-Space, reductions and interpolations are executed according to this data structure. The tree is generated as soon as the degree bounds $d_1, \ldots, d_r$ are known, and it is extended if the algorithm requires additional data points due to the occurrence of bad interpolation points.

Description of the Petri net
Figure 5 depicts the Petri net that implements the complete reduction algorithm. Going beyond the standard syntax introduced in Section 3, dashed arrows stand for read-only access, that is, the data in the respective places is not consumed. The dotted arrows illustrate read and write access to the interpolation tree described in Section 4.2. A transition can be annotated by conditions, which indicate that the transition can only fire by consuming tokens for which the conditions evaluate to true. In the following, we describe the individual structures of the net: Input token: The net is initialized with a single token on the place I, which holds references to the following input data: • The input linear relations, which are given as a matrix $M$ over the rational function field $\mathbb{Q}(c_1, \ldots, c_k)$.
• The vector of indices of the parameters which will be interpolated (in the following we assume that these indices are 1, . . ., r).
• The vector of indices of the target variables.
• Optionally: A precomputed trace for the reduction step (consistent with the targets).
In the Petri net, the trace is referred to as I.trace (we use the usual dot-notation for sub-data structures). Note that the trace fixes the variables corresponding to the master integrals. Transition trace: If the token on I does not contain a trace, then trace is enabled; it computes a trace for the linear reduction (see Section 2.2) and returns a copy of I with the trace included.

Transition copy: If the token on I already contains a trace, then copy is enabled and simply passes the token on I through.
Transition init: This transition takes the input token, which was produced by either trace or copy, and pushes it onto $I_t$. This way, the input data on $I_t$ is guaranteed to contain trace data. It additionally enables the transitions degree and reference.
Transition reference: This transition generates a random substitution point $q = (q_{r+1}, \ldots, q_k)$ with values for all parameters which will not be interpolated, substitutes the $q_i$ for the $c_i$, and runs the row reduction step (see Section 2.2), that is, it computes
$$M_{\mathrm{ref}} = (M|_{c_{r+1} \to q_{r+1}, \ldots, c_k \to q_k})_{\mathrm{red}}.$$
The transition then stores the actual result in the file system and produces an output token which contains both a reference to the result and the point $q$. The stored data will be used later in the normalization step of the interpolation (see above).
Transition degree: This transition generates a substitution point $p^{(j)} \in \mathbb{Z}^{\{1, \ldots, j-1, j+1, \ldots, k\}}$ for each $1 \le j \le k$, yielding a matrix over the field $\mathbb{Q}(c_j)$. After applying the row reduction, the resulting matrix $M^{(j)}_{\mathrm{red}}$ can be used to determine degree bounds for the numerator and denominator of each entry of the final result $M_{\mathrm{red}}$ as polynomials in $c_j$.
For $j \le r$, we need a global degree bound to determine the number of interpolation points. We thus take the maximum of all numerator and denominator degrees of the entries of $M^{(j)}_{\mathrm{red}}$, and store these maxima as a vector in $\mathbb{N}_0^{\{1, \ldots, r\}}$, which is put on the place $d_v$.
If $j > r$, two integer matrices are produced, which store the degrees of the numerators and denominators of each entry of $M^{(j)}_{\mathrm{red}}$, respectively. This information will be used later to filter out bad interpolation points, that is, points at which polynomial cancellation occurs (see Appendix A). The result is stored in the file system and a token with a reference to the result is put on the place $d_m$.
Note that degree is in fact modeled by a sub-Petri net which, in a hierarchical manner, behaves as a transition. In practice, we actually compute multiple matrices $M^{(j)}$ per parameter to reduce the probability of a bad point producing wrong degree bounds.
Transition points: This transition takes the degree data in $d_v$ and initializes the interpolation tree described in Section 4.2 and depicted in Figure 4. This, in turn, produces the corresponding set of interpolation points, which are put as separate tokens on the place p.
Transition reduce: This transition consumes an interpolation point from the place p, substitutes it into the input relations, and runs the row reduction. The resulting matrix, together with its interpolation point, is put on the place m. Since reduce performs parameter substitutions in rational function expressions, the computation may fail due to division by zero. If this happens, m.valid is set to false; otherwise, it is set to true.
Transition replace failure: An input token for which m.valid is false is consumed by the transition replace failure, which marks the respective interpolation point as failed in the interpolation tree. If necessary, the interpolation tree is extended by additional interpolation points, which are also put on the place p.
Transition normalize: An input token for which m.valid is true is consumed by the transition normalize. This transition reads $M_{\mathrm{ref}}$ and multiplies the input matrix referenced by m with a suitable constant factor. It also compares the entries with the degree matrices in $d_m$ to identify bad interpolation points. The result is put on the place n. If the corresponding point was bad, n.valid is set to false; otherwise, to true.
Transition replace invalid: For an input token for which n.valid is false, this transition generates new interpolation points in a fashion similar to replace failure.
Transition store normalized: For an input token for which n.valid is true, this transition marks the corresponding interpolation point as successful in the external storage.
If enough interpolation points for a given parameter have been marked as successful, the storage produces a token on the place i, which triggers the respective interpolation. If the point $(p_1, \ldots, p_r)$ triggers the interpolation (which will then use further points of the form $(p_1, \ldots, p_{r-1}, p_r')$), the result of the interpolation will be associated to the point $(p_1, \ldots, p_{r-1})$ in the interpolation tree. If there are not yet enough interpolation points, the transition produces a token which only contains i.valid with value false.
Transition discard: This transition discards tokens with i.valid equal to false.
Transition interpolate: Tokens with i.valid equal to true are consumed by this transition, which then retrieves the references to the input data for the interpolation from the interpolation tree, loads the respective data from the file system, and executes the interpolation. If (in the above notation) the token holds $(p_1, \ldots, p_{r-1})$, then the row-reduced matrices corresponding to $(d_v)_r + 1$ points are retrieved from the storage. Note that, due to the tree structure of the interpolation tree, all these points have their first $r-1$ coordinates equal to $(p_1, \ldots, p_{r-1})$. The interpolation is then performed entry-wise, as explained in Appendix A.
Transition store interpolated: This transition marks the current point (p_1, ..., p_{r-1}) in the interpolation tree as processed. If r > 1, just like in store normalized, the transition produces an interpolation token for the next parameter. If r = 1, we have arrived at the final result, and a token with i.valid equal to false is produced, which will then be discarded.
The Petri net contains additional infrastructure (not described here) which terminates the execution once no tokens exist any more on the places i and p.

Parallel timings
To illustrate the efficiency of our approach, we consider the cut {1, 3, 4, 5} of the double pentagon diagram (see Section 6 for a discussion of all possible cuts). Choosing this particular cut, which is less complex than others, our computations finish even when only a small number of cores is involved. This is necessary to analyze the scaling of our algorithm. Table 1 gives timings and efficiency for the cut {1, 3, 4, 5}; we use the same algorithm for all core counts, and the single-core run serves as a reference.
Apart from the running time T(n) of the algorithm on a total of n cores, we also give the speedup S(n) = T(1)/T(n) and the efficiency E(n) = T(1)/(n T(n)), which measure how "well" the algorithm parallelizes with increasing core counts. Note that the single-core timing is somewhat special: as experiments have shown, the performance per core decreases with the number of cores used on a given node. This effect has been investigated in [69] (see in particular [69, Figure 5]). Thus, for the analysis of the expected run-time below, we instead consider the relative speedup and efficiency with respect to the 15-core timing. This in particular amounts to the assumption that the 15-core speedup is 15.
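As a small illustration, the speedup and efficiency relative to a chosen reference core count can be computed as follows; the timing values below are hypothetical placeholders, not the measured numbers from Table 1.

```python
def metrics(timings, ref=1):
    """Speedup S(n) = ref * T(ref) / T(n) and efficiency E(n) = S(n) / n,
    relative to a reference core count. With ref=15 this encodes the
    assumption that the 15-core speedup is exactly 15."""
    t_ref = timings[ref]
    result = {}
    for n in sorted(timings):
        s = ref * t_ref / timings[n]
        result[n] = (s, s / n)
    return result

# hypothetical timings in seconds, for illustration only
timings = {15: 9000.0, 30: 4520.0, 60: 2290.0}
for n, (s, e) in metrics(timings, ref=15).items():
    print(f"{n:3d} cores: speedup {s:6.2f}, efficiency {e:5.3f}")
```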
The saw-tooth shape of the efficiency graph in Figure 6 (and the corresponding behavior in the timing and speedup graphs) is due to the fact that the number of reductions to execute is usually not divisible by the number of cores utilized. Since in our test problem approximately 450 reductions are required to enable the final interpolation, the running time of the full algorithm on n cores scales roughly like ⌈450/n⌉. This effect can be avoided by a more fine-grained structuring of the problem (for instance, by interpolating more parameters). Note, however, that increasing the number of processes in this way leads to more overhead via inter-process communication and disk accesses. Thus, dividing the algorithm into very small parts may in fact slow down the overall computation.
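The ceiling effect behind the saw-tooth can be modelled in a few lines; the task count of 450 is taken from the text, while the per-task time is an arbitrary unit.

```python
import math

def wave_runtime(n_cores, n_tasks=450, t_task=1.0):
    """Running time when n_tasks equally long reductions are scheduled on
    n_cores: the tasks execute in ceil(n_tasks / n_cores) sequential waves."""
    return math.ceil(n_tasks / n_cores) * t_task

# efficiency is 1 when n_cores divides 450 and dips just above such counts
for n in (90, 91, 150, 151, 225, 226):
    waves = math.ceil(450 / n)
    efficiency = (450 / n) / waves  # ideal time / modelled time
    print(f"{n:3d} cores: {waves} waves, efficiency {efficiency:.3f}")
```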
Figure 6 also depicts the ideal expected runtime, speedup and efficiency. These ideal graphs stem from the simple assumption, called Amdahl's law, that an algorithm can be divided into a part that is ideally parallelizable and a part which is not parallelizable at all. Denoting the parallelizable fraction by f, the expected runtime T_ideal(n) on n cores is not T(1)/n, but rather T_ideal(n) = T(1) ((1 − f) + f/n), which yields the ideal speedup S_ideal(n) = 1/((1 − f) + f/n) and efficiency E_ideal(n) = S_ideal(n)/n. Using the experimental values for 15 and 30 cores, we arrive at a value f ≈ 0.999748; that is, only 0.025% of the algorithm is not parallelizable.
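The fraction f can be extracted from two measured timings by solving Amdahl's law for f; the timings below are hypothetical stand-ins for the 15- and 30-core measurements.

```python
def amdahl_fraction(n1, t1, n2, t2):
    """Solve T(n) = T0 * ((1 - f) + f / n) for the parallelizable
    fraction f, given timings at two core counts. With r = t1 / t2:
        f = (r - 1) / (r * (1 - 1/n2) - (1 - 1/n1))."""
    r = t1 / t2
    return (r - 1) / (r * (1 - 1 / n2) - (1 - 1 / n1))

def ideal_speedup(n, f):
    """Amdahl's ideal speedup S_ideal(n) = 1 / ((1 - f) + f / n)."""
    return 1 / ((1 - f) + f / n)

# hypothetical timings (seconds) at 15 and 30 cores, for illustration
f = amdahl_fraction(15, 9000.0, 30, 4517.0)
print(f"f = {f:.6f}, ideal 60-core speedup = {ideal_speedup(60, f):.1f}")
```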
As we can see, the ideal curves give a fairly tight bound on the actual timings, at least in the cases where the core count is properly aligned to the number of reductions.This indicates that our approach for parallelization not only provides an automatic and fast solution to a tedious and complicated task, but stays highly efficient even when used with a large amount of computing power.

IBP conversion between different integral bases
It is well known that the size of the IBP coefficients may vary significantly between different choices of the master integral basis. We prefer the IBP reduction to a uniformly transcendental (UT) basis as introduced in [18], for several reasons: a) The differential equations satisfied by a UT integral basis have a particularly simple form [18], which allows the integrals to be solved analytically in terms of polylogarithms. There is also evidence that a UT basis is more convenient to evaluate numerically. So the IBP reduction to a UT basis greatly simplifies the amplitude computations that follow it. b) We observe that, in the case of the double pentagon, the IBP coefficients in a UT basis are significantly simpler than those in a traditional Laporta basis. This makes the IBP relations easier to use.
In practice, we consider special forms of UT bases, the so-called dlog bases, which will be introduced in the next subsection.

Dlog bases and the dlog algorithm
We say that a Feynman integral is a dlog integral if its integrand R(x_1, ..., x_n) dx_1 ∧ · · · ∧ dx_n, with R(x_1, ..., x_n) a rational function in x_1, ..., x_n, can be expressed in dlog form [74], that is, if it can be written as a linear combination of terms of the type c_i dlog f_{i,1} ∧ · · · ∧ dlog f_{i,n} (5.2) with rational functions f_{i,j} in x_1, ..., x_n. This is only possible if the integrand has at most simple poles, including at points at infinity. For example, neither dx/x^2 nor dx admits a dlog form, because of the double pole at zero and at infinity, respectively.
The coefficients c_i in equation (5.2) are called leading singularities [75]. For Feynman integrals that are not of the elliptic type, they are in general algebraic functions of the external variables. By choosing an appropriate parametrization of the external variables, the leading singularities typically become rational functions. This is, in particular, true for the two-loop five-point integrals discussed in the next section. The leading singularities can also be understood as integrals over the original integrand where the integration contour is localized around the poles of the integrand. Leading singularities and the integrals over the real contour share analytic properties. Integrals whose leading singularities are just constant numbers are therefore particularly useful, most importantly because they fulfill differential equations in the canonical form [18]. This implies that they have uniform transcendental weight, which means that if the integral is expanded in ε, the parameter of dimensional regularization, the coefficients have homogeneous transcendental weight, and the weight increases by one with each order in ε.
Next, we recall from [63] how to transform a given integrand into dlog form, in case this is possible. Given an integrand in n integration variables, we choose, if possible, one variable x that is linear in all denominator factors and perform a partial fraction decomposition, treating all other variables as constants. In this way, we obtain a sum of integrands of the form dx/(x − a) ∧ Ω, where Ω is an (n − 1)-form independent of x, and a is a polynomial that may depend on the other integration variables. Then we iterate this procedure, taking Ω as our new integrand, until no integration variables are left. If in any intermediate step a pole of degree two or higher is encountered, the integrand does not admit a dlog form. There are cases where no variable exists that is linear in all denominator factors. One way to proceed in such a case is to make a variable transformation such that at least one of the new variables is linear in all denominator factors.

The algorithmic approach of this section was used in [76] and [77] to construct a complete basis of dlog master integrals with constant leading singularities for all two-loop five-point integral families. The denominator structure for each integral family is given by the propagators. To construct the dlog integrals, we make a general numerator ansatz: we write the numerator as a linear combination of terms that are products of inverse propagators and irreducible scalar products, each multiplied by a free parameter. By applying the algorithm to this general integrand, we can determine values of the free parameters such that the integrand has a dlog form and constant leading singularities. In this way, we obtain a set of dlog integrals that form a basis of dlog master integrals.
In general, the dlog algorithm can only be applied in an integer dimension, which we choose to be four. The loop momenta are conveniently parametrized using spinor-helicity variables as in [74]. Although this parametrization can be very useful, it reaches its limitations as soon as the numerator has terms that vanish in dimension four but are non-zero in generic dimension D. In such cases, an extended approach using the Baikov parametrization, as in [77], can be applied.

IBP reduction with a dlog basis
Given a dlog basis, we discuss the IBP reduction in two settings: 1. When the IBP coefficients are needed in both the Laporta basis and the dlog basis, we first compute the reduction to the Laporta basis I with our module intersection and GPI-Space reduction algorithm, F = A I, (5.4) where F is the list of target integrals as a column vector and A is the matrix of reduction coefficients. Then we reduce the dlog basis Ĩ to the Laporta basis I, Ĩ = T I. (5.5) Note that since the dlog basis construction has a restriction on the numerator degree, this reduction is usually easy. Terms exceeding the allowed numerator degree have double poles at infinity, as can be seen by inverting the loop momenta k_i^µ → k_i^µ / k_i^2. Using our Singular RREF code with a good pivot strategy, we can analytically find the inverse T^(-1). The matrix product A T^(-1) then contains the coefficients of the IBP reduction to the dlog basis.
We remark that the product A T^(-1) can be difficult to calculate even if T^(-1) has a relatively small size. Instead of computing the product directly, we again use the semi-numerical approach: we set several of the kinematic variables to integer values, compute the product several times, and then use our interpolation program to obtain the fully analytic matrix product A T^(-1). This is again implemented in our Singular-GPI-Space framework.
2. When the IBP coefficients are needed only in a dlog basis, we apply our semi-numerical reduction method to obtain a set of numeric IBP coefficients in the Laporta basis. Instead of interpolating these coefficients, we use the semi-numeric points to interpolate the product A T^(-1) directly, never calculating the analytic form of A.
In the next section, we illustrate our approach by considering a non-trivial example, the two-loop five-point nonplanar double pentagon diagram. This includes the IBP generation via the module intersection method, the massively parallel reduction of the IBP system, and the basis conversion.
The two-loop five-point nonplanar double pentagon example

In this section, we illustrate our IBP reduction method by applying it to a nontrivial example, the two-loop five-point nonplanar double pentagon. Note that a symbolic UT basis for this example was found in [5, 11]. Furthermore, UT bases in terms of polylogarithmic functions for the double pentagon and other two-loop five-point nonplanar massless integral families were analytically calculated in [77].
For the diagram in Figure 7, we chose a labeling for the propagators in terms of the loop momenta l_i and the external momenta p_i, with p_{i···j} = p_i + · · · + p_j. The first 8 propagators represent the topology, and the last three are the irreducible scalar products. This is a complicated integral family for IBP reduction, due to the number of independent scalars, which are s_12, s_23, s_34, s_45, s_15 and the spacetime dimension D, and due to the nonplanar topology with two pentagons inside. We demonstrate our method by reducing the 26 integrals listed in Figure 8 to a master integral basis in the fashion of Laporta. Furthermore, we convert the IBP coefficients to the coefficients of a dlog basis given in [77]. In this basis change, we observe a significant reduction of the coefficient size.

Module intersection with cuts
First, we use Azurite [47] to find an integral basis. Without considering symmetries, there are 113 irreducible integrals; with symmetries, there are 108 master integrals. Note that due to the number of master integrals, this IBP reduction is significantly more complicated than the reduction of the hexagon-box diagram in [59], which has only 73 master integrals.
With the degbound option in Singular, it is easy to generate all the module intersections. For this integral family, choosing the degree bound 5 and using one core for each cut, it takes less than 5 minutes in total to solve all the module intersection problems analytically. Later on, by finite-field methods, we find that with this choice of degree bound, we obtain sufficiently many IBPs for our problem.
After generating the IBPs, we use the two-step trimming process described in Section 2.2 to select the IBPs necessary for our targets. This computation uses finite-field methods and is powered by the package SpaSM.
We compute the module intersections analytically. For the purpose of linear reduction, we further introduce the dimensionless variables defined in equation (6.3).

IBP reduction
We apply our reduction method via Singular and GPI-Space to reduce the linear systems in Table 2. We use a semi-numeric approach, choosing c_4, c_5 and the spacetime dimension D to be symbolic, and computing the linear reduction with integer-valued c_2 and c_3. By a linear reduction with c_2 (respectively c_3) symbolic and all other parameters numeric, we easily determine the maximal degree of c_2 (respectively c_3) in the reduced IBP relations. The degrees are listed in Table 2 as d_2 and d_3, respectively. From this information, we get the minimal number (d_2 + 1) × (d_3 + 1) of semi-numeric computations needed to interpolate the analytic reduction result. For example, for the cut {1, 5, 7}, we need to run the semi-numeric computation at least 506 times. Of course, the cuts exhibit different running times when performing the reductions: for instance, cut {1, 3, 4, 5}, which we already considered as an example in Section 4.4, is the easiest in terms of running time, taking only about 11 minutes on 384 CPU cores. In contrast, the cut {3, 4, 8} is much more complex: its reduction took 12 hours and 21 minutes using 384 cores.
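The degree determination can be illustrated on a toy coefficient; the rational expression below is our own example, and in the real computation the coefficient is only known through numeric reduction runs.

```python
import sympy as sp

c2, c3 = sp.symbols('c2 c3')
# toy stand-in for a reduced IBP coefficient (in practice unknown analytically)
coeff = c2**3 * c3 + 5 * c2 * c3**2

# probe the maximal degree in c2: fix c3 numerically, sample enough points
# (8 samples comfortably exceed the degree here), and interpolate
samples = [(v, coeff.subs({c2: v, c3: 2})) for v in range(8)]
d2 = sp.degree(sp.interpolate(samples, c2), c2)
print(d2)  # 3

# the same probe with the roles swapped gives d3; interpolating the full
# analytic result then needs at least (d2 + 1) * (d3 + 1) semi-numeric runs
```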

IBP coefficient conversion to a dlog basis
In this subsection, we discuss converting the IBP coefficients for the Laporta basis to the IBP coefficients of the dlog basis found in [77].
For this conversion, we again use the semi-numeric approach, taking integer-valued c_2, c_3 and symbolic c_4, c_5 and D, converting the coefficients and then interpolating. It is easy to determine that the coefficients in the dlog basis have the following maximal degrees for c_2 and c_3, respectively: d_2 = 20, d_3 = 20. (6.9) By comparison with Table 2, where d_2 can be as high as 35, we find that the maximal degree drops. For the basis conversion, we carry out a semi-numeric matrix multiplication with subsequent interpolation using Singular and GPI-Space.
After the computation, we see that the IBP reduction coefficients of Figure 8 in this dlog basis have a size of 480 MB on disk, a significant 76% reduction of the IBP coefficient size compared to the Laporta basis. On the other hand, if only the IBP reduction coefficients in the dlog basis are needed, we can skip the interpolation of the Laporta-basis IBP coefficients and directly convert the intermediate numerical results to dlog-basis IBP coefficients. Because of the drop in the maximal degree, this shortcut reduces the required number of semi-numeric computations.
For convenience, we also provide the IBP coefficients in the dlog basis with the s_12 dependence recovered. All these analytic results can be obtained via the links presented in the introduction of this paper. Note that all files provided under the links contain 26 × 108 matrices. For each matrix, the entry in the i-th row and j-th column is the IBP coefficient of the i-th target integral in Figure 8 with respect to the j-th master integral. The Laporta basis and the dlog basis are included in the auxiliary files of this paper.

Summary
In this paper, we present our new IBP reduction method, which is based on computational algebraic geometry, powered by the computer algebra system Singular in conjunction with the workflow management system GPI-Space. Our method is suitable for large-scale IBP reduction problems with complicated Feynman diagrams and multiple variables. We demonstrate its power by the analytic IBP reduction of the two-loop five-point nonplanar double pentagon. The computational result has been cross-checked numerically using state-of-the-art IBP programs.
Our method is flexible and can be adapted to various scenarios: 1. Modern methods for amplitude computation often calculate the IBP relations numerically or semi-numerically in order to interpolate the amplitude coefficient under consideration directly, instead of interpolating the analytic IBP relations. Our method can efficiently compute the reduced numeric or semi-numeric IBP relations and hence fits this purpose perfectly.
2. Our module intersection method can also be used for integrals with doubled or higher-power propagators, since this way of generating IBPs avoids the increase of propagator exponents and significantly reduces the size of the IBP system.
3. Although our method is currently based on semi-numerical parallelization with integer-valued numerics, it can clearly be extended to finite-field linear reduction if necessary.
4. More generally, our linear reduction parallelization method can be used for computational problems other than IBP reduction. For example, in recent years it was found that the Bethe ansatz equations of integrable spin chains can be analytically computed by algebraic geometry methods [78, 79]. Often, this involves large-scale linear algebra computations with symbolic parameters, and our parallelization via the Singular-GPI-Space framework can greatly speed up the computation. We also expect that our reduction method can be used more generally for Gröbner basis computations with parameters.

Figure 1. An enabled transition and its firing.

Figure 2. A transition which is not enabled.

Figure 3. Task and data parallelism in a Petri net.

Figure 4. The structure of the interpolation tree.

Figure 7. We depict the two-loop five-point nonplanar double pentagon diagram, writing z_i for the Baikov variables, which are equal to the inverse propagators. In particular, z_1 = l_1^2 and z_4 = l_2^2. We also draw the 11 spanning cuts of this integral family. These correspond to the non-collapsible master integrals, before using symmetries.

Figure 8. Integrals up to numerator degree 4 without double propagators for the non-planar double pentagon diagram.

Table 1. Timings for different numbers of cores. All timings are in seconds, taken on the high performance compute cluster at the Fraunhofer Institute for Industrial Mathematics (ITWM). Each compute node provides two Xeon E5-2670 processors, which amounts to 16 cores running at a base clock speed of 2.6 GHz. Each node has 64 GB of memory. For all runs with more than 15 cores, we ran 15 compute jobs on each node plus one job for interfacing with the storage system. Since the storage jobs use negligible computation time, we omit them from the CPU core count when determining speedup and efficiency.
Following [49], we set c_2 ≡ s_23/s_12, c_3 ≡ s_34/s_12, c_4 ≡ s_45/s_12, c_5 ≡ s_15/s_12 (6.3) to dehomogenize the IBP relations and speed up the computation. The s_12 dependence can be recovered in the final step. The resulting IBPs are summarized in Table 2. Note that for the cut {1, 6, 8}, there are 1203 independent relations and 1205 integrals after applying the idea of [49] to set most master integrals supported on the cut {1, 6, 8} to zero. As a result, we only have to compute two master integral coefficients.

Table 2. The IBP relations generated on each cut by the module intersection method. We used finite-field methods to pick linearly independent and necessary IBP relations to reduce all target integrals. The size is the output file size on disk before reduction. The numbers d_2 and d_3 are the maximal degrees of c_2 and c_3, respectively, in the reduced IBP relations.