Fast computation of MadGraph amplitudes on graphics processing unit (GPU)

Continuing our previous studies on QED and QCD processes, we use the graphics processing unit (GPU) for fast calculations of helicity amplitudes for general Standard Model (SM) processes. Additional HEGET codes to handle all SM interactions are introduced, as well as the program MG2CUDA that converts arbitrary MadGraph generated HELAS amplitudes (FORTRAN) into HEGET codes in CUDA. We test all the codes by comparing amplitudes and cross sections for multi-jet processes at the LHC associated with production of single and double weak bosons, a top-quark pair, Higgs boson plus a weak boson or a top-quark pair, and multiple Higgs bosons via weak-boson fusion, where all the heavy particles are allowed to decay into light quarks and leptons with full spin correlations. All the helicity amplitudes computed by HEGET are found to agree with those computed by HELAS within the expected numerical accuracy, and the cross sections obtained by gBASES, a GPU version of the Monte Carlo integration program, agree with those obtained by BASES (FORTRAN), as well as those obtained by MadGraph. The performance of GPU was over a factor of 10 faster than CPU for all processes except those with the highest number of jets.


Introduction
The start-up of the CERN Large Hadron Collider (LHC) opens a new discovery era of high energy particle physics.With proton beams colliding at unprecedented energy, it provides us with great opportunities to discover the Higgs boson and new physics beyond the Standard Model (SM), with typical signals involving multiple high p T γ's, jets, W 's and a junichi.kanzaki@kek.jpb qliphy@gmail.comc nokamura@iuhw.ac.jp d tstelzer@illinois.eduZ's.Reliable searches for these signatures require a good understanding of all SM background processes, which is usually done by simulations performed by Monte Carlo (MC) event generators, such as MadGraph [1,2,4].However, the complicated event topology expected for some new physics signals makes their background simulation time consuming, and it is important to increase the computational speed for simulations in the LHC data analysis.
In previous studies [5,6], the GPU (Graphics Processing Unit) has been used to realize economical and powerful parallel computations of cross sections by introducing a C-language version of the HELAS [7] codes, HEGET (HELAS Evaluation with GPU Enhanced Technology).HEGET is based on the software development system CUDA [8] introduced by NVIDIA [9].For pure QED processes, qq → nγ, with n = 2 to 8, the calculations ran 40-150 times faster on the GPU than on the CPU [5].For pure QCD processes, gg → ng with n up to 4, qq → ng and qq → qq+(n−2)g with n up to 5, 60-100 times better performance was achieved on the GPU [6].In this paper, we extend these exploratory studies to cover general SM processes, opening the way to perform the complete matrix element computation of MadGraph on the GPU.The complexity of the calculations is increased due to new interaction types and complicated event topologies expected in background simulations for various new physics scenarios.We introduce additional HEGET functions to cover all of the SM particles and their interactions, and a phase space parameterization suited for GPU computations.In order to test all of the new functions and the efficiency of the GPU computation in semi-realistic background simulations, we systematically study multi-jet processes associated with the production of SM heavy particles(s), followed by its (their) decay into final states including light quarks and leptons, including full spin correlations.In particular, we report numerical results on the following processes: WW /W Z/ZZ + n-jets (n ≤ 3), (1b) HP/HZ + n-jets (n ≤ 3), (1d) For the processes (1a) to (1e), we examine all of the major subprocesses at the LHC, while for the multiple Higgs production (1f), we study only the weak-boson fusion (WBF) subprocesses to test the Higgs self interactions.We present numerical results for the cross sections of processes in eq. ( 1) computed by using the GPU version of the Monte Carlo integration program, gBASES [10], with the new HEGET functions in the amplitude calculations, and compare the results with those obtained running two different programs on the CPU, the FORTRAN version BASES [11] programs with HELAS subroutines and the latest version of MadGraph (ver.5)[1].We also compare the performance of two versions of the BASES program, one on the GPU and the other on the CPU.
The paper is structured as follows.In section 2, we present the cross section formulae for general SM production processes at the LHC, and list all of the subprocesses we study in this report.In section 3, we briefly describe a new phase space parameterization for efficient GPU computation.In section 4, we introduce new HEGET functions for all SM particles and their interactions.In section 5, we introduce the software used to generate CUDA codes with HEGET functions from FORTRAN amplitude programs with HELAS subroutines obtained by MadGraph.In section 6, we review the computing environment, basic parameters of the GPU and CPU machines used in this analysis.Section 7 gives numerical results of computations of cross sections and comparisons of performance of GPU and CPU programs.Section 8 summarizes our findings.Appendix A explains in more detail our phase space parameterization introduced in section 3. Appendix B explains our method for generating random numbers on GPU.Appendix C lists all the new HEGET codes introduced in this paper.

Physics processes
In order to test not only the validity and efficiency of our GPU computation but also its robustness, we examine a series of multi-jet production processes in association with the SM heavy particle(s) (W , Z, t and H), followed by their decays into light quarks and leptons, that can be backgrounds for discoveries in many new physics scenarios.In this section, we list all the subprocesses we study in this paper and give the definition of multi-jet cross sections that are calculated both on the CPU and on the GPU at later sections.
At the LHC with a collision energy of √ s, the cross section for general production processes in the SM can be expressed as where D a/p and D b/p are the parton distribution functions (PDF's), Q is the factorization scale, x a and x b are the momentum fractions of the partons a and b, respectively, in the right-and left-moving protons, √ ŝ is the subprocess center of mass energy, and d σ ( ŝ) gives the differential cross section for the 2 → n subprocess a(p a , λ a , c a ) + b(p b , λ b , c b ) The subprocesses cross section can be computed in the leading order as where is the invariant n-body phase space, λ i are the helicities of the initial and final partons, n a and n b are the color degree of freedom of the initial partons, a and b, respectively.M c i λ i are the Helicity amplitudes for the process (3), which can be generated automatically by MadGraph and expressed as where the summation is over all the Feynman diagrams.The subscripts λ i stand for a given combination of helicities (0 for Higgs bosons, ±1/2 for quarks and leptons, ±1 for photons and gluons), and the subscripts c i correspond to a set of color indices (none for colorless particles, 1, 2, 3 for flowing-IN quarks, 1, 2, 3 for flowing-OUT quarks, and 1 to 8 for gluons).Details on color summation in MadGraph can be found in Ref. [6] .
In this paper, we are interested in general SM processes, typically involving the production of heavy resonances with decays, associated with multiple jet production, which often appear as major SM background for new physics signals.The limitation in the number of extra jets, n or n−k in eq.(1), is primarily due to limitations in the amount of memory available to the GPU as reported in previous studies [5,6].
Correlated decays of the heavy particles into the following channels are calculated: We do not consider τ decays in this report for brevity.The same selection cuts are imposed for = e, µ, τ, so that the Higgs boson production cross sections listed in this report can be used as a starting point for realistic simulations.A Higgs boson mass of 125 GeV and its branching fraction into τ + τ − , 0.0405, are used throughout this report.
For definiteness, we impose the following final state cuts at the parton level.For jets, we require the same conditions as in Refs.[5,6]: Where η i and p Ti are the pseudo-rapidity and the transverse momentum of the i-th jet, respectively, in the pp collisions rest frame along the right-moving (p z = |p|) proton momentum direction, and p Ti j is the relative transverse momentum [12] between the jets i and j defined by Here ∆ R i j measures the boost-invariant angular separation between the i and j jets, and ∆ η i j and ∆ φ i j are defined as differences of pseudorapidities and azimuthal angles of the two jets.For b jets from t decay, we require Note that we do not require them to be isolated from other jets via (8c).For charged leptons, we require and for simplicity we treat τ the same as e and µ.Like b jets from t decays, we do not impose isolation cuts for leptons, since performing realistic simulations is not the purpose of this study.We use the set CTEQ6L1 [13] parton distribution functions (PDF) for all processes.

Single W production
The following four types of W production subprocesses are studied in this paper: The subprocess (12a) starts with the leading order α 0 s , the subprocess (12b) starts with the next order α 1 s , and those of (12c) and (12d) start with the α 2 s order.These subprocesses give the dominant contributions in pp collisions.The corresponding W − production cross sections are smaller in pp collisions since an incoming u-quark in the subprocesses (12a) to (12c) should be replaced by a d-quark, whose PDF is significantly softer than the up-quark PDF in the proton.

Single Z production
Similarly, the following subprocesses are studied for Z production: As in the case of W + +n-jets, we examine all of the dominant contributions up to 4-jets.It should be noted that the down quark contribution to the Z+jets cross section is less suppressed than the W − +jets case, since the Z-boson couples stronger to the down-quarks than the up-quarks (cf.

WW production
For the W boson pair production, we study the following subprocesses: The subprocess (14a) starts with the leading order α 0 s , the subprocess (14b) starts with the next order α 1 s , and those of (14c) to (14e) start with the α 2 s order.The subprocesses (14d) are included as the dominant same sign W -pair production mechanism in pp collisions.

W + Z production
Similarly, the following subprocesses are studied for W + Z production: As in the WW+n-jets case, we consider all of the dominant W + Z production subprocesses up to 3 associated jets.Note again that the down quark contribution to the W − Z+jets cross section can be significant because of the large Z coupling to the d-quarks.

ZZ production
The following ZZ production subprocesses are also studied: All of the dominant ZZ production subprocesses up to 3 associated jets are studied.Note, however, that the downquark contribution to the qg collision subprocess (16b) has the coupling enhancement factor of (Γ (Z → d d)/Γ (Z → u ū)) 2 ≈ 1.7.Although we study only 4 lepton final states, ZZ+jets processes can be backgrounds for new physics signals with a Z boson plus jets and large missing E T .

tt production
For t t production, we consider the following subprocesses: The subprocess (17a) starts with the leading order α 2 s , the subprocess (17b) starts with the next order α 3 s , and those of (17c) and (17d) start with the α 4 s order.Again, only those subprocesses which give dominant contributions in pp collisions are studied in each order.

W boson associated Higgs production
As for the associate production of the Higgs boson and the W , we consider the following subprocesses: All of the subprocesses (18a) to (18d) are obtained by replacing one gluon in the W +jets subprocesses (12a) to (12d) by H, respectively.The HW − +jets subprocesses corresponding to (18a) to (18c) are suppressed since an incoming u-quark would be replaced by a softer d-quark in the proton.

Z boson associated Higgs production
Likewise, the following subprocesses are studied for HZ production: All of the four subprocesses are obtained from the corresponding Z+jets subprocesses (13a) to (13d), by replacing one final gluon by H. Again, the down quark contributions to the subprocesses (19a) to (19c) are less suppressed than those to the HW + production processes (21a) to (21c) due to the large Z coupling to the d-quark.

Top quark associated Higgs production
For the Ht t production process, the following subprocesses are examined in this paper: All of the subprocesses (20a) to (20d) are obtained, respectively, from the t t+jets subprocesses (17a) to (17d) by replacing one gluon in the final state by a Higgs boson.

Higgs boson production via weak-boson fusion
In all of the above subprocesses we consider only QCD interactions for production of jets and top quarks, while the weak interactions contribute only in W , Z, H productions and in decays.For Higgs+jets processes, we study weakboson fusion (WBF) subprocesses which can be identified at the LHC for various decay modes [14,15,16].WW fusion contributes to the subprocesses (21a), ZZ fusion contributes to the subprocesses (21b), both contribute to the subprocesses (21c) and (21d): The subprocesses (21a) and (21b) start with the leading order α 0 s , those in (21c) start with the next order α1 s , the subprocess (21d) occurs at the α 2 s order.We do not consider gg fusion process, because its simulation requires non-renormalizable vertices [1] which are absent in the minimal set of HELAS codes [7].

Multiple Higgs bosons production via weak-boson fusion
The following multiple Higgs boson production processes are studied in this paper as a test of cubic and quartic vertex functions among scalar and vector bosons: The quartic scalar boson vertex appears only in the subprocesses (22c) and (22d).

Algorithm for phase space generation
In this section, we briefly introduce our phase space parametrization, designed for efficient GPU computing.Details are given in Appendix A.
Optimizing code to run efficiently on the GPU requires several special considerations.In addition to the careful use of memory mentioned earlier, one needs to consider that each "batch" of calculations processed on the GPU must undergo identical operations.That has particular consequences when one considers generating momentum in phase space that satisfy the appropriate cuts.The efficiency of generating momenta that pass cuts is not particularly important on the CPU, since one can simply repeatedly generate momenta until a set is found that pass the cuts.The structure of the GPU however does not allow such flexibility.If the generation program is running on multiple processors, each processor has one opportunity to generate a valid phase space point before moving forward to calculate the amplitude.So if the efficiency for generating a point in phase space that passes cuts is only 10% then you loose a factor of 10 in computing speed.
In previous studies [5,6] for pure QED and QCD processes at hadron colliders: the following phase space parameterization has been used including the integration over the initial parton momentum fractions (see Appendix A for details): With this parameterization, when all the final particles are massless partons, we can generate phase space points that satisfy the rapidity constraints, |η cut | < η cut , for all partons (i=1 to n) and the p T constraints, p Ti > p cut T , for n−1 partons (i = 1 to n−1).Only those phase space points which violate the conditions, x a < 1, x b < 1 or p Tn > p cut T , will be rejected.Studies in Refs.[5,6] find that, for example, 78%, 58%, 42% and 29% of generated phase space points satisfy the final state cuts 1 , for 2-5 photon productions at the LHC, when we choose y i (i = 1 to n), ln p 2 Ti and φ i (i = 1 to n−1) as integration variables.
We adopt this parameterization of the generalized phase space in this report, which is extended to account for the production and decay of several Breit-Wigner resonances.In particular, for process with n-jet plus m-resonance production when the resonance R k decays into a final state f k , we parameterize the generalized phase space as follows: Here dΦ(R k → f k ) is the invariant phase space for the R k → f k decay when the invariant mass of the final state f k is √ s k , and dΣ n is the n-body generalized phase space for n − m massless particles and m massive particles of masses √ s k (k = 1 to m).Integration over the invariant mass squared s k is made efficient by using the standard Breit-Wigner formalism and the transverse momenta are generated as Finally, we note that s-channel splitting of massless partons can be accommodated by using the same parameterization eq. ( 26), where R k in eq. ( 26) is a virtual parton of mass √ s k and f k is a set of partons.Instead of the Breit-Wigner parameterization (27), we simply use ln s k as integration variables.

HEGET functions
In this section, we explain the new HEGET functions for computing the helicity amplitudes for arbitrary SM processes.All of the HEGET functions that appear in this report are listed in Appendix C as List 3 to List 30.

Wave functions
In the previous works [5,6], the wave functions for massless particles, which are named ixxxxk, oxxxxk and vxxxxk (k = 0, 1, 2), have been introduced for fermions and vector bosons.In this report, we introduce wave functions for massive spin 0, 1/2 and 1 particles, which can also be used for massless particles by setting the mass parameters to zero.The naming scheme for HEGET functions follows that of HELAS subroutines: the HEGET (HELAS) function names start with i(I) and o(O) for flow-IN and flow-OUT fermions, respectively, v(V) for vector boson wave functions, and s(S) for scalar boson wave functions.All of the HEGET functions for external lines are summarized in Table 1.
respectively, just the same as in HELAS subroutines, where nSF = +1 for particles, and nSF = −1 for anti-particles.The helicities are given by nHEL/2 = ±1/2.The last 2 components2 of the fi [6] and fo [6] give the 4-momentum of the fermions along the fermion-number flow, Like the fermion wave functions, the first four components of the output complex array vc [6] in the HEGET function vxxxxx gives the wave function where nSV = +1 for final states, and nSV = −1 for initial states.The helicity nHEL takes ±1 and 0 for vector bosons, while nHEL=±1 for massless vector bosons.The last 2 components of vc [6] give the flowing-out 4-momentum, The vector boson wave function vxxxxx is listed in C.2.2.The function sxxxxx computes the wave function of a scalar boson, and is listed in C.2.3.Since the scalar boson does not have any Lorentz indices, the first component of the output sc [3] is simply unity, 1 + i0.The last 2 components give the flowing-out 4-momentum: where nSS = +1(−1) when the scalar boson is in the final (initial) state.

Vertex functions
There are 8 types of renormalizable vertices among spin 0, 1/2 and 1 particles in the SM, as listed in Table 2.Here the capital letters F, V and S stand for spin 1/2, 1 and 0 particles respectively.As in the HELAS subroutines, we introduce two types of HEGET functions for each vertex, those which give an amplitude (a complex number) as output and those which give an off-shell wave function as output.HEGET function names for off-shell wave functions of a fermion start with f, those for a vector boson start with j, and those for a scalar boson start with h.The corresponding HELAS subroutine names are also shown in Table 2.

FFV: fermion-fermion-vector vertex
The HEGET functions for the FFV vertex are defined by the Lagrangian, following the HELAS convention [7], where the boson names are defined by their flowing-out (final state) quantum number.For instance, if F 1 = up and F 2 = down in eq. ( 35), then V should be W − (V * in eq. ( 35) is W + ).The amplitude function iovxxx, the off-shell fermion wave functions fvixxx and fvoxxx, and the off-shell vector current jioxxx obtained from the FFV Lagrangian ( 35 For the qqg vertex of QCD, we adopt the Lagrangian where g s (= √ 4πα s ) is the strong coupling constant and T a i j (a=1 to 8) are the SU(3) generators in the fundamental representation, i and j are the indices of the 3 and 3 representations, respectively.When calculating the amplitude of the qqg vertex, we set for the couplings of eq. ( 35) in HEGET functions.Correspondingly, the color amplitude should read where (vertex) represents the output of iovxxx or that of any other off-shell wave functions for the FFV vertex.The color factor of −T a i j is then processed algebraically in integrand functions of the BASES program.

FFS: fermion-fermion-scalar vertex
The FFS vertex of the HEGET functions are defined by the Lagrangian following the HELAS convention [7].

VVV: three vector vertex
The HELAS subroutine for the VVV vertex functions are defined by the following Yang-Mills Lagrangian with a real coupling gc, where the vector boson triple products are anti-Hermitian.The amplitude of the VVV vertex is calculated by the HEGET function vvvxxx and the offshell vector current is computed by jvvxxx, which are listed C.5.1 and C.5.2, respectively.For the electroweak gauge bosons, the coupling gc is chosen as Here the three vector bosons (V 1 , V 2 , V 3 ) should be chosen as the cyclic permutations of (W − , W + , Z/A), because of the HELAS convention which uses the flowing out quantum number for boson names: see Table 16 in C.5 for details.Note (V 1µ ) * = V 2µ in the HELAS Lagrangian (40) The ggg vertex in the QCD Lagrangian can be expressed as where now the vector boson triple products are Hermitian; We can still use the same vertex functions vvvxxx and jvvxxx with the real coupling gc = g s for ggg (43) and by denoting the corresponding amplitude as where (vertex) gives the output of the HEGET functions vvvxxx or jvvxxx.The associated factor of i f abc is then treated as the color factor in MadGraph [2].

VVVV: four vector vertex
There are two types of VVVV vertex in the SM Lagrangian, one is for SU(2) and the other for SU (3).As in the case for the VVV vertex, the SU(2) vector bosons are expressed in terms of The contact four-point vector boson vertex for the SU(2) weak interaction is given by the Lagrangian Two distinct subroutines wwwwxx and w3w3xx (see Table 2) were introduced in HELAS, because there was an attempt to improve their numerical accuracy by combining the weak boson exchange contribution to the four-point vertices [7].The attempt has not been successful and we have no motivation to keep the original HELAS strategy.Instead, we introduce only one set of HEGET functions for the interactions among four distinct vector bosons The corresponding HEGET functions are named as ggggxx for the amplitude and jgggxx for the off-shell currents.
Because the SU(2) weak boson vertices (46) always have two identical bosons (or two channels for the vertex of W + W − γZ) we further introduce the HEGET functions wwwwxx and jwwwxx which sum over the two contributions internally.The functions are listed in C.6.1 and C.6.2, respectively, and the input fields and corresponding couplings are given in Table 17.
As for the QCD quartic gluon coupling we can use the HEGET functions ggggxx and jgggxx for the Lagrangian (47) to obtain the amplitudes and the offshell gluon currents respectively.For instance, by using the four-vector vertex amplitude of Γ (g 2 ;V 1 ,V 2 ,V 3 ,V 4 ) for the Lagrangian (47), we can express the amplitude of g a 1 , g b 2 , g c 3 , g d 4 as follows: The off-shell gluon currents are obtained similarly by calling the HEGET function three times as explained in [5,6] and repeated in C.6 for completeness.

VVS: vector-vector-scalar vertex
The only VVS interaction of the SM appears in the Higgs Lagrangian in the form where the coupling gc is real and proportional to the Higgs vacuum expectation value (v.e.v.).The HEGET functions for the amplitude vvsxxx, the off-shell vector current jvsxxx, and the off-shell scalar current hvsxxx are given for a general complex gc (in GeV units), distinct complex vector bosons (V 1 and V 2 ), and a complex scalar field (S), and are listed in C.7.1, C.7.2 and C.7.3, respectively.In the SM, only WWH and ZZH couplings appear.

VVSS: vector-vector-scalar-scalar vertex
The HEGET functions for the VVSS vertex are obtained from the Lagrangian The amplitude function vvssxx, the off-shell vector current jvssxx, and the off-shell scalar current hvvsxx are listed in C.8.1, C.8.2 and C.8.3, respectively, for a complex gc, distinct complex vector bosons (V 1 and V 2 ), and for distinct complex scalars (S 3 and S 4 ).In the SM, only WWHH and ZZHH couplings appear, and the coupling gc is real and proportional to the squares of the electroweak gauge couplings.

SSS: three scalar vertex
In C.9, we show the HEGET functions for the SSS vertex, which are obtained from the Lagrangian The HEGET functions for the amplitude sssxxx and the off-shell scalar current hssxxx are given for a complex gc (in GeV units) and for distinct complex scalars (S 1 , S 2 , S 3 ), and are listed in C.9.1 and C.9.2 respectively.In the SM, the coupling gc is real and proportional to the Higgs v.e.v. and only the H 3 coupling appears.

SSSS: four scalar vertex
The Lagrangian gives the HEGET functions for the SSSS vertex: ssssxx for the amplitude and hsssxx for the off-shell scalar current, which are listed in C.10.1 and C.10.2, respectively.Here again the HEGET functions are given for a complex gc, and for the four distinct complex scalar bosons (S 1 , S 2 , S 3 , S 4 ).
In the SM, only the H 4 coupling exists whose coupling gc is real and proportional to the ratio of the Higgs boson mass and the v.e.v.

Generation of CUDA functions for Monte Carlo integration
For the Monte Carlo integration of cross sections of the physics processes on the GPU, all integrand functions have to be coded using CUDA [8].In order to prepare these amplitude functions efficiently we develop an automatic conversion program, MG2CUDA.As input this program takes the FORTRAN amplitude subroutine, matrix.fgenerated by MadGraph (ver.4) [2], analyzes the source code and generates all CUDA functions necessary for the Monte Carlo integration on GPU.MG2CUDA also optimizes generated CUDA codes for execution on the GPU by reducing unnecessary variables and dividing long amplitude functions into a set of smaller functions as necessary.
In the following subsections, the major functions of MG2CUDA are briefly described.

Generation of HEGET function calls from HELAS subroutines
MG2CUDA converts calling sequences of HELAS subroutines in matrix.f to those of HEGET functions in the integrand function of gBASES.All HEGET functions for the GPU are designed to have a one-to-one correspondence to HELAS subroutines with the same name, and their arguments have the same order and the same variables types.Hence HELAS subroutine calls are directly converted to HEGET function calls.

Decoding of initial and final state information
MG2CUDA decodes the physics process information, species of initial and final particles, the number of graphs and the number of color bases, written into matrix.fby MadGraph and adopts an appropriate phase space program and prepare header files to store process information and some constants.

Division of a long amplitude program
As the number of external particles increases, the number of Feynman diagrams contributing to the subprocess grows factorially and the amplitude program generated by Mad-Graph becomes very long.Due to the current limitation of the CUDA compiler, a very long amplitude program cannot be compiled [5,6] by the CUDA compiler.MG2CUDA divides such a long amplitude function program into smaller functions which are successively called in the integrand function of gBASES.
Among the processes listed in eq. ( 1), several processes with the maximum number of jets require such decomposition into smaller pieces by MG2CUDA.Those processes are denoted explicitly in the tables and plots in Section 7 by an asterisk.

Decomposition of a color matrix multiplication
Compared to CPU, the memory resources of GPU are quite limited.Hence, if calculations on GPU require a large amount of data, the data must reside on slower memory (global memory), and the access to the data becomes a cause of the degradation of performance of GPU programs.When the number of independent color bases of a physics process becomes large, the data size necessary for the multiplication of color factors also becomes large.For example, the number of color bases for u ū → t t +ggg (17a) is 144, and the data size and the total number of multiplications in the color matrix multiplication becomes the order of (144) 2 ∼ 20000.In order to avoid degradation of performance of GPU, MG2CUDA decomposes arguments of the color matrix into a set of independent color factors and combines multiplications which have the same factors.This significantly reduces the number of color factors and the total number of multiplications [6].For the u ū → t t +ggg case, the number of independent factors is only 51 and the reduced number of multiplications becomes ∼ 3600.These color factors are stored in the readonly (e.g.constant) memory which GPU can access more quickly than the global memory.

Reduction of the number of temporary wave functions
During the computation of amplitudes, temporary variables of wave functions are necessary to keep intermediate particle information.MG2CUDA analyzes the use of these temporary variables and recycles variables which are not used any more in the latter part of the program.This greatly relaxes the memory resource requirement.Again for the u ū → t t +ggg case, the number of variables used for wave functions in original matrix.f is 1607, and it becomes only 83 after recycling temporary variables.

Computing environment
In this section, we introduce our computing environment used for all computations presented in this paper.

Computations on GPU
We used a Tesla C2075 GPU processor board produced by NVIDIA [9] to compute cross sections of the physics processes listed in eqs.( 12)-( 22).The Tesla C2075 has 448 processors (CUDA cores) in one GPU chip, which delivers up to 515 GFLOPS of double-precision peak performance.Other parameters of the board are listed in Table 3.The Tesla C2075 is controlled by a Linux PC with Fedora 14 (64bit) operating system.CUDA codes executed on the GPU are developed on the host PC with the CUDA 4.2 [8] software development kit.
For the computation of cross sections we use the Monte Carlo integration program, BASES [11].The GPU version of BASES, gBASES, has originally been developed in single precision [10].In this paper, however, we use the newly developed double precision version of gBASES for all GPU computations throughout this report.

CPU environment
As references of cross section computations and also for purposes of comparisons of process time, we use the BASES  As another reference of cross sections we also use the latest version of MadGraph (ver.5)[1] which has been released in 2011.All numerical results appear in Sec.7 as "MadGraph" are obtained by this new version of MadGraph.During computations it shows good performance in the execution time and gives us stable results.

Results
In this section, we present numerical results of the computations of the total cross sections and a comparison of total process time of gBASES programs on the GPU for the SM processes listed in Sec. 2. As references for the cross sections and process time, we also present results obtained by BASES and MadGraph on a CPU.Recent improvements in double precision calculations on the GPU allows us to per-form all computations in this paper at double precision accuracy.
In addition, in the previous papers [5,6], a simple program composed of a single event loop without any optimizations for the computation of cross sections was used both on the GPU and the CPU and their process time for a single event loop was compared.In this paper, we compare the total execution time of BASES programs running on the GPU and on a CPU with the same integration parameters.Since the total execution time of gBASES includes both process time on the GPU and the CPU, the comparison gives a more practical index of the gain in the total process time by using the GPU.
For all three programs, gBASES, BASES and MadGraph, we use the same final state cuts, PDF's and the model parameters as explained in Sec. 2. All results for the various SM processes are obtained for the LHC at √ s = 14 TeV, and summarized in Tables 5-15, and Figs.1-11.Some general comments are in order here.
First, we test all of the HEGET functions listed in Appendix C against the HELAS subroutines [7] and the amplitude subroutines for MadGraph (ver.5) by comparing the helicity amplitudes of all subprocess listed in Sec. 2. We generally find agreement within the accuracy of double precision computation, except for multiple Higgs production processes via weak boson fusion, eqs.(22c) and (22d).For these processes, the MadGraph amplitude codes give significantly smaller amplitudes.We identify the cause of the discrepancy as subtle gauge theory cancellation among weak boson exchange amplitudes.After modifying both the HELAS and HEGET codes to respect the tree-level gauge invariance strictly, by replacing all m 2 V in the weak boson propagators and in the vertices 3 , and by setting as a default, we find agreement for all the amplitudes.Except for the triple Higgs boson production processes of eqs.(22c) and (22d), these modifications on the code do not give significant difference in the amplitudes.
In Tables 5-15, the results of the process time ratio with the divided amplitude functions are denoted by an asterisk.In the plots of Figures 1-11 they are indicated by open circles.By comparing the numbers with and without asterisks in the Tables, and also by comparing the heights of the blobs with and without circles in the Figures for the same number of jets, we can clearly observe the loss of efficiency in the GPU computation when the amplitude function is so long that its division into smaller pieces is needed.For example, among the Z+4-jets processes only the amplitude function of the process, uu → Z+uu+gg (13c), has to be divided.It is clearly seen from Table 6 and Fig. 2 that the GPU gain over the CPU is significantly lower (∼ 5) for this process, as compared to the other Z+4-jets processes.A similar trend is observed for all other processes with divided amplitude program.
From Tables 5-15, we find that the results obtained by gBASES with HEGET functions agree with those by the BASES programs with HELAS within the statistics of generated number of events.On the other hand we observe some deviations of MadGraph results from BASES results as the number of jets in the final state, n, increases.It amounts to about 5-7% level for processes with the maximum number of jets.These deviations may be attributed to the difference of the phase space generation part for multi-jet productions, and require further studies.The program gSPRING [17] which generates events on the GPU by making use of the grid information of variables optimized by gBASES [10] is being developed, and more detailed comparison between MadGraph and its GPU version will be reported elsewhere.
In the following subsections, let us briefly summarize our findings for each subprocesses as listed in Sec. 2.

Single W production
The results of the total cross section and the process time for W + +n-jet production processes of eq. ( 12) followed by W + → + ν ( = e, µ) decays at the LHC with √ s = 14 TeV are presented in Table 5 and the ratio of process time between GPU and CPU is shown in Fig. 1.All the n-jets are required to satisfy the conditions eq. (8) while leptons ( = e, µ) from W decays satisfy the cuts eq.(11).The QCD coupling is fixed as α s (20 GeV) LO = 0.171 and the CTEQ6L1 parton distribution functions [13] are evaluated at the factorization scale of Q = p cut T,jet = 20 GeV, except for n = 0 for which the factorization scale is chosen as Q = m W .
As for the integrated results for the W + +n-jet processes presented in Table 5, we find that the GPU results obtained by gBASES with the HEGET functions agree well with the corresponding CPU results from the other programs, especially with those obtained by BASES with HELAS, within the statistics of generated number of events.For n = 3 or 4 cases, some discrepancies between BASES and MadGraph results are found.They may be due to the difference in the phase space generation, as mentioned above.
Their performance on GPU was significantly better than that on CPU, as clearly shown by the total process time ratios of BASES program on CPU over that on GPU with the same integration parameters listed in Table 5.In Fig. 1, we show this ratio as a function of the number of jets in the final state, n.The ratio gradually decreases from ∼ 100 at n = 0 as n grows, but it still exceeds 10 even for n = 4. Compared with other processes, the gain for the process, uu →W + ud+gluons, is small.That is simply because these processes have more Feynman diagrams and a larger color bases than the other processes with the same number of jets.8) and ( 11), the parton distributions of CTEQ6L1 [13] at the factorization scale of Q= p cut T,jet =20 GeV and the fixed QCD coupling at α s (20 GeV) LO = 0.171 are used, except for n = 0 for which the factorization scale is chosen as Q = m W .Similarly, results for Z + n-jet production processes of eq. ( 13) with Z → + − ( = e, µ) are presented in Fig. 2 and Table 6.All of the selection cuts and the SM parameters are the same as in the previous subsection.The factorization scale is chosen as Q=m Z for n=0, while Q= p cut T,jet =20 GeV and α s = α s (p cut T,jet ) LO = 0.171 for n ≥ 1.

Single Z production
As in the case of W + +n-jet processes, the integrated cross sections obtained by gBASES with HEGET and by BASES with HELAS as well as those from MadGraph are more or less consistent.Small discrepancies may be attributed to differences in phase space parameterization or in its optimization processes of integrations.The ratios of total process time of BASES between CPU and GPU are given in Fig. 2 and show very similar behavior to those in Fig. 1 for W + +n-jet processes.A factor over 100 is obtained for n=0.It decreases as n, but still exceeds 10 for n = 4, except for uu → Z +uu+gg.The gains for uu → Z +uu+gluons (13c) are smaller than the other processes with the same number of jets, because of their larger number of graphs and color bases.The amplitude program for uu → Z +uu+gg cannot be compiled as a single GPU function call, and it has to be divided into smaller function calls of GPU programs.The gain for this process, which is denoted with an asterisk in Table 6 and with an open circle in Fig. 2, is only a factor of ∼ 5. Results for WW +n-jet production processes of eq. ( 14) are presented in Fig. 3 and Table 7.Here both W 's decay leptonically, with W + (W − ) → + ν ( − ν ) with ( = e, µ).In addition to the W + W − production processes, we also give results for W + W + production processes of eq.(14d).The factorization scale is chosen as Q = m W for n = 0, while Q = p cut T,jet = 20 GeV and α s = α s (p cut T,jet ) LO = 0.171 for n ≥ 1.The GPU gain over the CPU computation is 50 for n = 0 and 1, where it becomes ∼ 12 for n = 3 except for uu → W + W − uu+gluons and uu → W + W + dd +gluons.Smaller gains of ∼ 6 are observed for these two types, due to the Table 5 Total cross sections and BASES process time ratios for W + +n-jet production with W + → + ν ( = e, µ) at the LHC ( √ s = 14 TeV).Event selection cuts are given by eqs.( 8) and (11), the parton distributions of CTEQ6L1 [13] at the factorization scale of Q = p cut T,jet = 20 GeV and the fixed QCD coupling at α s (20 GeV) LO = 0.171 are used, except for n = 0 for which the factorization scale is chosen as Q = m W . 19.0 uu → W + + udgg 7.398 ± 0.009 7.389 ± 0.006 6.917 ± 0.007 ×10 4  10.6 gg → W + + d ūgg 3.662 ± 0.001 3.664 ± 0.001 3.502 ± 0.004 ×10 4  11.9

Number of Jets in Final
Table 6 Total cross sections and BASES process time ratios for Z + n-jet production with Z → + − ( = e, µ) at the LHC ( √ s = 14 TeV).Event selection cuts, PDF and α s are the same as in Table 5, except for n = 0 for which the factorization scale is Q = m Z .19.5 uu → Z + uugg 5.250 ± 0.005 5.259 ± 0.004 4.964 ± 0.005 ×10 3  5.1* gg → Z + u ūgg 3.038 ± 0.002 3.038 ± 0.001 2.901 ± 0.003 ×10 3  11.6 large size of the programs which have been divided into several smaller pieces, as indicated by the asterisk besides the number in the right most column of Table 7.The trends are similar to those observed for qq → V qqgg subprocesses for V = W or Z, which has been discussed in subsections 7.1 and 7.2, respectively.

W + Z production
Results for W + Z + n-jet production processes of eq. ( 15), with W + → + ν , Z → + − ( = e, µ), are presented in Fig. 4 and Table 8.The final state selection cuts for jets and leptons are given by eqs.( 8) and (11) Here again the factorization scale is chosen as Q=m W for n=0, while Q= p cut T,jet =20 GeV and α s = α s (p cut T,jet ) LO = 0.171 for n ≥ 1.All of the cross sections in Table 8 are consistent between GPU and CPU within the statistical error of Monte Carlo integrations, and the performance improvement of GPU over CPU again depends on n.The improvement for uu → W + Zud+gluon is also the smallest among those of W + Z+3jets processes, because the amplitude function has to be divided into smaller pieces as indicated by the asterisk besides the factor of 8.2.

ZZ production
Results for ZZ+n-jet production processes of eq. ( 16) where both Z bosons decay as Z → + − ( = e, µ) are presented in Fig. 5 and Table 9.All of the selection cuts and the SM parameters are the same as in the previous subprocesses.The factorization scale is chosen as Q = m Z for n = 0, while Q = p cut T,jet = 20 GeV and α s = α s (p cut T,jet ) LO = 0.171 for n ≥ 1.All of the cross sections in Table 9 are consistent between the GPU and the CPU computations, and the GPU gain over CPU for the total process time of BASES shown in Fig. 5 ranges from ∼ 70 for n = 0 to ∼ 16 for n = 3 except for uu → ZZuu+gluon whose gain is 8.7 with asterisk in Table 9, just as in the case of the other qq → VV qqg processes whose amplitude functions are too large to be executed as a single function.

tt production
Results for tt+n-jet production processes of eq. ( 17) are presented in Fig. 6 and Table 10, where both t and t decay semileptonically as t → b + ν and t → b − ν ( = e, µ).Here the factorization scale is chosen as Q=m tt for uu→tt (n=0) and  8), ( 10) and ( 11) and the parton distributions of CTEQ6L1 [13] at the factorization scale of Q = p cut T,jet = 20 GeV is used, except for n = 0 for which the factorization scale is chosen as Q = m t The strong coupling constants are set as α 2+n s = α s (m t ) 3.9* ug → tt + ugg 9.267 ± 0.003 9.251 ± 0.008 8.758 ± 0.010 ×10 3  5.2* uu → tt + uug 6.760 ± 0.041 6.792 ± 0.005 6.462 ± 0.009 GeV for all the processes with jet productions (n ≥ 1).The strong coupling constants are set as for the others.The numerical values are α s (2m t ) LO = 0.108 at the uu→tt threshold, α s (m t ) LO =0.108 and α s (20 GeV) LO = 0.171.
All of the cross sections in Table 10 are consistent between the GPU (HEGET) and the CPU (BASES) computations within the statistical uncertainties of the Monte Carlo integrations.
The GPU gain over CPU on the total process time of BASES shows very similar dependence on n as previous results.It starts from ∼ 90 for n = 0 and drops to ∼ 5 for n = 4.
For all tt +3-jet production processes, the amplitude functions have to be divided into smaller pieces in order to be processed by the CUDA compiler.The main cause of the long amplitudes for these processes is the proliferation of color the factor bases which has been observed for all of the QCD 5 jet production process in ref. [6].

WZ + jets
Fig. 4 Ratio of BASES process time (CPU/GPU) for W + Z + n-jet production with W + → + ν , Z → + − ( = e, µ) in pp collisions at √ s = 14 TeV.Event selection cuts, PDF and α s are the same as in Fig. 1.
the cross section where the leptons and hadrons from τdecays are in the central detector region, |η|<2.5 in eq.(11a).
The factorization scale of the parton distribution functions is set at Q = m HW for n = 0, and at Q = p cut T,jet = 20 GeV for all the jet production processes (n ≥ 1).The strong coupling is fixed at α s (20 GeV) LO = 0.171.
All of the cross sections in Table 11 are consistent between the GPU and the CPU computations within the statistical uncertainty of the Monte Carlo integrations.
Again, the GPU gain over CPU of the total process time of BASES shows an n dependence similar to those for W + n-jets processes.This is because the HW production in the SM can be regarded as H emission from virtual W , in the lowest order of the electroweak couplings, and hence the number of Feynman diagrams are exactly the same between the HW + +n-jets processes in Table 11, and the corresponding W + +n-jets processes in Table 5.This comparison confirms our observation that the GPU gain over CPU is limited mainly by the size of the amplitude function, at least in our application.

Z boson associated Higgs production
Results for HZ+n-jet production with Z → + − ( = e, µ) and H → τ + τ − for m H = 125 GeV are presented in Fig. 8 and Table 12.The factorization scale of the parton distribution functions is set at Q = m HZ for n = 0, and at Q = p cut T,jet = 20 GeV for all the jet production processes (n ≥ 1).The strong coupling is fixed at α s (20 GeV) LO = 0.171.Both the cross sections shown in Table 12 and the GPU gain over CPU shown in Fig. 8 are similar to those found for the HW + Table 12 Total cross sections and total process time for HZ + n-jet production with Z → + − ( = e, µ) and H → τ + τ − at the LHC ( √ s = 14 TeV), for m H = 125 GeV and Br (H → τ + τ − ) = 0.0405.Event selection cuts are given by eqs.(8), (10) and (11) and the parton distributions of CTEQ6L1 [13] at the factorization scale of Q = p cut T,jet = 20 GeV is used, except for n = 0 for which the factorization scale is chosen as Q = m HZ .The strong coupling is fixed at α s (20 GeV) LO = 0.171.

+ jets t t
Fig. 6 Ratio of BASES process time (CPU/GPU) for tt + n-jet production with t → b + ν and t → b − ν ( = e, µ) for m t = 175 GeV and Br (t → b + ν ) = 0.216 in pp collisions at √ s = 14 TeV.Event selection cuts are given by eqs.(8), (10) and (11) and the parton distributions of CTEQ6L1 [13] at the factorization scale of Q = p cut T,jet = 20 GeV is used, except for n = 0 for which the factorization scale is chosen as Q = m t .The strong coupling constants are set as α 2+n n-jet processes in the previous section.These gains are also consistent with those for Z +n-jets.As in the above case, all the Feynman diagrams for the ZH +n-jets processes obtained from those of the Z +n-jets process by allowing the external Z boson to split into Z+H.The size of the amplitude functions are hence essentially the same between the two corresponding processes.Accordingly, the gain factors in Table 12 for the HZ+n-jets processes are almost the same as those of the corresponding Z+n-jets processes in Table 6.8), (10) and (11) and the parton distributions of CTEQ6L1 [13] at the factorization scale of Q = p cut T,jet = 20 GeV is used, except for n=0 for which the factorization scale is chosen as Q=m HW .The strong coupling is fixed at α s (20 GeV) LO =0.171.

Top quark associated Higgs production
Results for Htt + n-jet production with t → b + ν and t → b − ν ( = e, µ) and H → τ + τ − are presented in Fig. 9 and Table 13.Both t and t decay semi-leptonically with Br (t → b + ν )=0.216, and H → τ + τ − decay with Br (H → τ + τ − )= 0.0405 are assumed, and τ ± are treated as = e or µ, satisfying |η τ | < 2.5 and p Tτ > 20 GeV in eqs.(11), ignoring further τ decays.The factorization scale is chosen as Q = m Htt for uu → Htt (n = 0) and Q = m t for gg → Htt (n = 0), while Q = p cut T,jet = 20 GeV for all the processes with jet productions (n ≥ 1).The strong coupling constants are set as α 2+n s = α s (m Htt ) 2  LO α s (p cut T,jet ) n LO for uu → Htt process, and  8), (10) and (11) and the parton distributions of CTEQ6L1 [13] at the factorization scale of Q = p cut T,jet = 20 GeV is used, except for n = 0 for which the factorization scale is chosen as Q = m Htt for uu → Htt and Q = m t for gg → Htt.8), (10) and (11) and the parton distributions of CTEQ6L1 [13] at the factorization scale of Q = p cut T,jet = 20 GeV is used, except for n=0 for which the factorization scale is chosen as Q=m HZ .The strong coupling is fixed at α s (20 GeV) LO =0.171.
LO for all the others.Numerical values are α s (m t ) LO = 0.108 and α s (20 GeV) LO = 0.171.All of the cross sections in Table 13 obtained by the GPU and the CPU programs as well as those by MadGraph are consistent within the statistical uncertainty of the Monte Carlo integration.The GPU gain over CPU for the total process time of BASES decreases from ∼70 to ∼20 as n grows from n=0 to n=2 except for the n=2 processes gg → Httgg and uu → Httuu for which the factor is reduced to ∼ 10 or less.The amplitude functions for these two processes are too long for the CUDA compiler to process, and they are divided into smaller pieces for execution.8), (10) and (11) and the parton distributions of CTEQ6L1 [13] at the factorization scale of Q = p cut T,jet = 20 GeV is used, except for n = 0 for which the factorization scale is chosen as Q = m Htt for uu → Htt and Q = m t for gg → Htt.The strong coupling constants are set as α 2+n s =α s (m Htt )

Higgs boson production via weak-boson fusion
Results for Higgs boson plus n-jet production via weakboson fusion followed by H → τ + τ − decay are presented in Fig. 10 and Table 14.The factorization scale of the parton distribution functions is set at Q = m W /2 for ud → Hud, Q = m Z /2 for uu → Huu, and Q = p cut T,jet = 20 GeV for all the other processes with n ≥ 3 jets.The above choice of factorization scales are motivated by the observation that the peak positions of the distribution of the transverse momentum of the two jets in the processes ud → Hud and uu → Huu are Table 14 Total cross sections and BASES process time ratios for Higgs boson plus n-jet production via weak-boson fusion with H → τ + τ − at the LHC ( √ s = 14 TeV).Event selection cuts are given by eqs.( 8), (10) and (11) and the parton distributions of CTEQ6L1 [13] at the factorization scale of Q= p cut T,jet =20 GeV is used, except for n≤2 for which the factorization scale is chosen as Q=m W /2 for ud →Hud, Q=m Z /2 for uu→Huu.8), (10) and (11) and the parton distributions of CTEQ6L1 [13] at the factorization scale of Q = p cut T,jet = 20 GeV is used, except for n ≤ 2 for which the factorization scale is chosen as Q = m W /2 for ud → Hud, Q = m Z /2 for uu → Huu.The strong coupling is fixed at α s (20 GeV) LO = 0.171.found to be < p T >∼ 40 GeV and 45 GeV, respectively.The strong coupling is fixed at α s (20 GeV) LO = 0.171.8), (10) and (11) and the parton distributions of CTEQ6L1 [13] at the factorization scale of Q = p cut T,jet = 20 GeV is used, except for n ≤ 2 for which the factorization scale is chosen as Q = m W /2 for ud → HH(H)ud, Q = m Z /2 for uu → HH(H)u.The strong coupling is fixed at α s (20 GeV) LO = 0.171.

Multiple Higgs bosons production via weak boson fusion
Results for multiple Higgs boson plus n-jet production via weak boson fusion followed by H → τ + τ − decay are presented in Fig. 11 and Table 15.The factorization scale of the parton distribution functions is set at Q = m W /2 for ud → HH(H)ud, Q = m Z /2 for uu → HH(H)u, and Q = p cut T,jet = 20 GeV for all processes.The strong coupling is fixed at α s (20 GeV) LO = 0.171.
All of the cross sections in Table 14 obtained by the GPU and the CPU BASES programs as well as those by Mad-Graph are consistent within the statistical uncertainty of the Monte Carlo integration/event generation.For double Higgs It should be noted here again that we first encountered significant discrepancy of more than 100% level for HHH production processes.This is because subtle gauge theory cancellation among weak boson scattering amplitudes W * W * , Z * Z * → HZ * (Z * → HH) can be violated significantly when the m V (V = W, Z) appearing in the propagators and in the W vertices do not satisfy the tree level relations of the SM.The exact agreements among all the programs have been obtained after we replace all m V 's in the couplings and and in the propagators as and by imposing Here GeV and Γ Z = 2.4952 GeV are used as inputs.

Summary
We have shown the results of our attempt to extend the HEGET function package originally developed for the computations of QED and QCD processes [5,6] on the GPU to all Standard Model (SM) processes.Additional HEGET functions to compute all SM amplitudes are introduced, and are listed in Appendix C. We have tested all of the functions by comparing amplitudes and cross sections of multi-jet processes associated with the production of single and double weak bosons, a top-quark pair, Higgs boson plus a weak boson or a tt pair, as well as multiple Higgs bosons produced via weak-boson fusion, where the heavy particles (W, Z,t,t, H) are allowed to decay into leptons with full spin correlations.Based on the GPU version of the Monte Carlo integration program, gBASES [10], we compute cross sections for all of these processes at the LHC target energy of √ s = 14 TeV.The program, MG2CUDA, has been developed to convert arbitrary MadGraph generated HELAS amplitude subroutines written in FORTRAN into HEGET codes in CUDA for the general purpose computing on GPU.As references for the computation of cross sections on GPU, we have performed two types of computations on CPU.One is based on the FORTRAN version of the BASES program whose performance is also compared with gBASES.Another comparison is performed with respect to the new version of Mad-Graph [1].We find that our Monte Carlo integration program with the new HEGET functions run quite fast in the GPU environment of a TESLA C2075, as compared with the FORTRAN version of BASES programs for the same physics processes with the same integration parameters.
Our achievements and findings may be summarized as follows.
-A set of new HEGET functions in CUDA has been developed for the general computation of amplitudes and cross sections for the SM processes.-For processes with larger number of jets in the final state, we find that total cross sections obtained with MadGraph is somewhat smaller than those computed with BASES programs.The difference amounts to about 5-10% for the processes with large number of jets in the final state.They might be due to the difference in the phase space generation part of the program and require further studies.
-For processes with a simple final state, i.e. the number of jets in the finals state, n, is equal to 0, the gain of the process time of the total BASES program of GPU over those of CPU becomes nearly 100 and gradually deceases as n becomes large.-We use double precision version of gBASES contrary to our previous papers [5,6].The double precision version of gBASES shows not only a good performance in process time but also good stability for various physics processes with wide range of integration parameters.-Due to the limitation of the CUDA compiler, a long CUDA function cannot be compiled.We have included the new mechanism to gBASES to handle successive CUDA functions calls in order to handle a long amplitude program as a set of small CUDA functions.The integrated results also agree very well with those obtained by BASES in FORTRAN.The performance of these divided programs is somewhat lower compare with un-divided programs.We have two functions to compute external fermions.One is for "flowing-In" fermions and the other is for "flowing-Out" fermions.The spinor wave function with a generic 3momentum p for "flowing-In" fermion is named ixxxxx (List 3), and for "flowing-Out" fermion is named oxxxxx (List 4).The argument of ixxxxx and oxxxxx as:

Fig. 1
Fig.1Ratio of BASES process time (CPU/GPU) for W + +n-jet production with W + → + ν l ( = e, µ) in pp collisions at √ s = 14 TeV.Event selection cuts are given by eqs.(8) and (11), the parton distributions of CTEQ6L1[13] at the factorization scale of Q= p cut T,jet =20 GeV and the fixed QCD coupling at α s (20 GeV) LO = 0.171 are used, except for n = 0 for which the factorization scale is chosen as Q = m W .

Fig. 2
Fig. 2 Ratio of BASES process time (CPU/GPU) for Z + n-jet production with Z → + − ( = e, µ) in pp collisions at √ s = 14 TeV.Event selection cuts, PDF and α s are the same as in Fig. 1, except for n = 0 for which the factorization scale is Q = m Z .

Fig. 5
Fig. 5 Ratio of BASES process time (CPU/GPU) for ZZ +n-jet production with Z → + − ( = e, µ) in pp collisions at √ s = 14 TeV.Event selection cuts, PDF and α s are the same as in Fig. 2.

Fig. 9
Fig. 9 Ratio of BASES process time (CPU/GPU) for Htt + n-jet production with t → b + ν and t → b − ν ( = e, µ) for m t = 175 GeV and Br (t → b + ν ) = 0.216 and with H → τ + τ − in pp collisions at √ s = 14 TeV for m H = 125 GeV and Br (H → τ + τ − ) = 0.0405.Event selection cuts are given by eqs.(8),(10) and(11) and the parton distributions of CTEQ6L1[13] at the factorization scale of Q = p cut T,jet = 20 GeV is used, except for n = 0 for which the factorization scale is chosen as Q = m Htt for uu → Htt and Q = m t for gg → Htt.The strong coupling constants are set as α 2+n s =α s (m Htt ) 2 LO α s (p cut T,jet ) n LO for uu → Htt process, and α 2+n

Fig. 10
Fig. 10 Ratio of BASES process time (CPU/GPU) for Higgs boson plus n-jet production via weak-boson fusion with H → τ + τ − in pp collisions at √ s = 14 TeV for m H = 125 GeV and Br (H → τ + τ − ) = 0.0405.Event selection cuts are given by eqs.(8),(10) and(11) and the parton distributions of CTEQ6L1[13] at the factorization scale of Q = p cut T,jet = 20 GeV is used, except for n ≤ 2 for which the factorization scale is chosen as Q = m W /2 for ud → Hud, Q = m Z /2 for uu → Huu.The strong coupling is fixed at α s (20 GeV) LO = 0.171.

Fig. 11
Fig. 11 Ratio of BASES process time (CPU/GPU) for multiple Higgs boson plus n-jet production via weak boson fusion with H → τ + τ − in pp collisions at √ s = 14 TeV for m H = 125 GeV and Br (H → τ + τ − ) = 0.0405.Event selection cuts are given by eqs.(8),(10) and(11) and the parton distributions of CTEQ6L1[13] at the factorization scale of Q = p cut T,jet = 20 GeV is used, except for n ≤ 2 for which the factorization scale is chosen as Q = m W /2 for ud → HH(H)ud, Q = m Z /2 for uu → HH(H)u.The strong coupling is fixed at α s (20 GeV) LO = 0.171.

Table 1
HEGET functions for external lines

Table 2
HEGET functions for vertices

Table 3
Parameters of Tesla C2075 and CUDA tools [11]ram in FORTRAN on the CPU[11].The measurement of total execution time is performed on the Linux PC with Fedora 13 (64 bit) operating system.The hardware parameters and the version of the software used for the execution of the FORTRAN BASES programs are summarized in Table 4.

Table 8
Total cross sections and BASES process time ratios for W + Z + n-jet production with W + → + ν , Z → + − ( = e, µ) at the LHC ( √ s = 14 TeV).Event selection cuts, PDF and α s are the same as in Table5.

Table 10
Total cross sections and BASES process time ratios for tt + n-jet production with t → b + ν and t → b − ν ( = e, µ) for m t = 175 GeV and Br (t → b + ν ) = 0.216 at the LHC ( √ s = 14 TeV).Event selection cuts are given by eqs.( 2 LO α s (p cut T,jet ) n LO with α s (m t ) LO = 0.108 and α s (20 GeV) LO = 0.171.

Table 15
[13]l cross sections and BASES process time ratios for multiple Higgs boson plus n-jet production via weak boson fusion with H → τ + τ − at the LHC ( √ s = 14 TeV).Event selection cuts are given by eqs.(8),(10)and(11)and the parton distributions of CTEQ6L1[13]at the factorization scale of Q = p cut T,jet = 20 GeV is used, except for n ≤ 2 for which the factorization scale is chosen as Q = m W /2 for ud → HH(H)ud, Q = m Z /2 for uu → HH(H)u.The strong coupling is fixed at α s (20 GeV) LO = 0.171.HH + ud + g 3.964 ± 0.003 3.945 ± 0.002 3.696 ± 0.004 ×10−447.6 uu → HH + uu + g 6.521 ± 0.006 6.505 ± 0.017 6.169 ± 0.006 ×10−546.4bosonproductionprocesses the GPU gain over CPU for the total process time of BASES is ∼ 70 for n = 2 and ∼ 50 for n = 3.For triple Higgs boson production processes the GPU gain over CPU for the total process time of BASES is ∼ 50 for n = 2.We study triple Higgs boson production in weak boson fusion to test the 4 scalar boson coupling function, hsssxx.cu(List30 in Appendix C), even though the total cross section obtained by Br (H → τ + τ − ) is less than 0.01fb in Table15.

2
, (C.27)where the boson names are given by the flowing out quantum numbers.The same functions, vvvxxx for amplitudes and jvvxxx for off-shell currents are used to compute both SU(2) vertices, WW γ and WW Z, and the QCD triple-gluon vertex.Gauge boson names and the coupling (color factors) for each vertex are summarized in Table16.C.6.3:ggggxxThefunction ggggxx (List 19) computes the amplitude from 4 gluon wave functions ga, gb, gc and gd, each with the color index a, b, c and d, respectively, when the associated color factor is f abe f cde , whether the gluons are on-shell or off-shell.The function has the arguments: cmplx vertex amplitude of the VVVV vertex with the color factor f abe f cde .The color amplitudes are then expressed as f abe f cde (v1 v1 v1) + f ace f dbe (v2 v2 v2) + f ade f bce (v3 v3 v3) .

Table 17
Possible sets of inputs for wwwwxx and jwwwxx where all the boson names give the flowing-Out quantum number.