Color singlet production at NNLO in MCFM

We present the implementation of several color-singlet final-state processes at Next-to-Next-to Leading Order (NNLO) accuracy in QCD to the publicly available parton-level Monte Carlo program MCFM. Specifically we discuss the processes $pp\rightarrow H$, $pp\rightarrow Z$, $pp\rightarrow W$, $pp\rightarrow HZ$, $pp\rightarrow HW$ and $pp\rightarrow\gamma\gamma$. Decays of the unstable bosons are fully included, resulting in a flexible fully differential Monte Carlo code. The NNLO corrections have been calculated using the non-local $N$-jettiness subtraction approach. Special attention is given to the numerical aspects of running MCFM for these processes at this order. We pay particular attention to the systematic uncertainties due to the power corrections induced by the $N$-jettiness regularization scheme and the evaluation time needed to run the hybrid openMP/MPI version of MCFM at NNLO on multi-processor systems.


calculated u
ing the non-local N -jettiness subtraction approach.Special attention is given to the numerical aspects of running MCFM for these processes at this order.We pay particular attention to the systematic uncertainties due to the power corrections induced by the N -jettiness regularization scheme and the evaluation time needed to run the hybrid openMP/MPI version of MCFM at NNLO on multi-processor systems.1 Version 8.0 of MCFM can be downloaded from the mcfm.fnal.govwebsite.

Introduction

The second run of the LHC (Run II) which is currently underway, will result in the accumulation of an unprecedented amount of high-quality data in a new high energy regime.In tandem with the well-understood and carefully calibrated detectors, this will lead to experimental uncertainties that are at the level of a few percent or smaller for many of the most important processes.These include various Higgs boson production channels, as well as standard candle processes such as vector boson production.Studies of diboson production will allow for stringent tests of the Electroweak sector of the Standard Model (SM) and constraints on possible new physics scenarios.In order to make best use of the precise experimental observations it is crucial to have access to accurate theoretical calculations of the same quantities.At the LHC this requires the calculation of QCD corrections to inclusive and differential cross sections at increasingly higher order.For the most efficient comparison between theoretical predictions and experimental data it is extremely beneficial for theoretical results to be released in the form of a public code, allowing users full flexibility in obtaining theoretical predictions relevant for their analysis.

While calculations at Next-to-Leading Order (NLO) in the strong coupling constant are by now quite standard, only about 20 processes have been calculated through to Nextto-Next-to-Leading Order (NNLO).Recent publications on these processes are shown in Table .1.All such calculations require a means by which to regulate the soft a d collinear radiation that appears in the calculation of the higher-order contributions.At NLO local subtraction schemes, such as FKS [40] or Catani-Seymour dipole subtraction [41], are typically preferred.In these local subtraction formalisms, the singular unresolved infra-red limits are cancelled point-wise by local counterterms.These local counterterms, after analytic integration over the unresolved partons, are added to the virtual corrections yielding a finite result.

The construction of a local subtraction scheme for a NNLO calculation is a daunting task, given the complexities of the multiple infrared limits and differing dimensionality of phase space for the component parts.However, progress has been made, with significant advances over the last decade.The first local subtraction scheme used at NNLO was the sector decomposition approach presented in ref. [42].This scheme separates the overlapping singularities by using a plus-prescription to isolate the singular contributions, thereby avoiding any analytic integrations over regions of phase space.The antenna subtraction method was extended to NNLO in Refs.[43,44], and has been used to obtain predictions for 2 → 2 processes in which both final state particles are colored [39].Antenna subtraction resembles the NLO subtraction formalisms in that the doubly unresolved limits are cancelled point-by-point in phase space by count rterms which require analytic integration to cancel infrared poles in the real-virtual and double virtual phase spaces.Finally in Refs.[45][46][47] the sector decomposition approach was generalized to arbitrary processes.By partitioning the phase space into appropriate sectors in which each singularity can be made manifest, and then performing a Laurent series expansion to extract the poles.This method has been applied to various processes at the LHC [5,7,34,38].

In addition to the local subtraction schemes discussed above, there is an alternate form of regulation, which is inherently non-local.Indeed one of the first NLO regularization techniques developed was one such method, phase space slicing, introduced in Refs.[48][49][50].In these methods a parameter is used to separate the resolved and unresolved phase spaces.The resolved region of phase space corresponds to a calculation of the process
H + 0 jet [1-4]
H + 1 jet [5][6][7][8][9] Higgs WBF [10] H → b b [11, 12] W + 0 jet [13,14] Z/γ * + 0 jet [4,14,15] W + 1 jet [16] Z + 1 jet [17][18][19][20]] ZH [21,22] W H [22,23] W Z [24] ZZ [25][26][27] W W [28][29][30] W + γ, Z + γ [31] γγ [32,33] t t [34,35] single top [36] top decay [37,38] dijets [39] Table 1.Publications on processes evaluated differentially at NNLO.

with one additional final state parton, and if a suitable resolution parameter is chosen, the unresolved region can be directly calculated.At NLO non-local methods have generally fallen out of favor.This is due to the large cancellation between the resolved and unresolved contributions at small values of the resolution parameter, which can induce large Monte Carlo uncertainties.However non-local subtraction schemes have made a resurgence for NNLO calculations.Although they have the disadvantages discussed above they also have several advantages which make them attractive for NNLO calculations.First, they are conceptually simple to implement.Once a suitable resolution parameter is selected, the s ngly unresolved part of the calculation can be obtained with existing NLO event generators, such as MCFM [51][52][53].Second, with recent advances in computing, the drawback associated with the large numerical cancellations can be mitigated by running with a large number of computer cores.Finally by using a resolution parameter motivated by a physical factorization theorem, the approximations inherent in the method can be systematically improved, e.g. by analytic calculations of power-suppressed contributions [54].

The first non-local subtraction developed for NNLO calculation was the so-called q T subtraction method [3].This method uses the transverse momentum of the final state color neutral particle, q T , as the cut variable.For q T < q cut T the factorization theorem of Collins, Soper and Sterman [55], can be used to compute the cross section, while above the cutoff the NLO calculation of the color-singlet plus jet can be utilized.An obvious drawback is that it is only applicable to color neutral final states.Inspired by a factorization formula [56] from Soft Collinear Effective Field Theory (SCET) [57][58][59][60][61] the first steps towards extending these ideas to calculations containing colored final states were taken in a calculation of top-quark decay at NNLO [37].However, no initial collinear singularities appear in this calculation.A powerful generalization of this idea applicable to general initial and final states was introduced in [4,16].It is obtained by replacing the q T variable with the event shape N -jettiness variable [62].Below the N -jettiness (τ N ) cutoff, SCET provides the relevant factorization theorem [62].For the below-cut region the necessary SCET ingredients, corresponding to the final state and initial state collinear radiation functions are already known, and are represented by the two-loop jet-functions of Ref. [63,64] and the two-loop beam-function of Ref. [65,66].The corresponding two-loop soft functions ar also known for zero-jettiness [67,68] and for general N -jettiness [69].The first process calculated at NNLO using this method was pp → W +jet [16], followed by calculations of the pp → Higgs+jet [8] and pp → Z+jet [18] processes, and by detailed phenomenological studies of these processes at this order [70][71][72].pp → H and pp → Z were also calculated using this method [4].Additi nal processes of phenomenological interest, pp → V H [22] and pp → γγ [33] have been calculated using the same approach.

As mentioned before, an important advantage of the N -jettiness subtraction method is that it meshes well with the existing NLO calculations, such as those included in MCFM.Included in MCFM are the NLO corrections to W + n jet production, Z + n jets production [73], Higgs +n jets production [74,75] for (n = 0, 1, 2), making the implementation of W , Z, H + 0,1 jet at NNLO possible.

The recent advances in NNLO technologies allows for the exciting possibility of releasing a public code capable of computing many 2 → 2 processes at NNLO accuracy.This paper presents a first step in this journey by summarizing the implementation of the N -jettiness subtrac ion procedure in MCFM, and presenting a detailed breakdown of the method for the processes released in the initial version of the NNLO code.An important consideration in making the code public is computational speed.In Ref. [53] MCFM was upgraded to use a parallel version of the VEGAS adaptive integration method using openMP.For NNLO calculations, this was expanded by using a hybrid openMP/MPI version of MCFM for use on computing clusters to facilitate the numerical NNLO calculations of Ref. [18].

Using the hybrid version of MCFM we can calculate NNLO distributions efficiently within a reasonable timescale.In summary, this paper describes the implementation of the N -jettiness subtraction procedure in MCFM and presents results for the processes available in MCFM v8.0.Specifically these processes are pp → H, W, Z, V H, γγ.Where present the decays of unstable pa

icles are included, allowing for
fully flexible MC code.In section 2 we will give a schematic overview of the non-local N -jettiness subtraction scheme.Section 3 will detail the calculational set-up and sections 4 and 5 will look at the N -jettiness subtraction at NLO and NNLO dominant power corrections is presented in Section 6.The more numerical aspects are studied in section 7. Finally in section 8 the main results are summarized.


SCET Based Non-Local Subtraction

A collision of partons a and b with momentum fractions x a,b , originating from the incoming beam protons with momenta p a,b , produces a final state includi is defined as
T N (p j ) = min i=a,b,1,...,N 2 q i • p j Q i ,(2.1)
where for notational simplicity we have set q a,b = p a,b .We denote the jet or beam energy by E i .Q i is a measure of the jet/beam hardness.In our numerical results we set this equal to twice the jet/beam energy, Q i = 2E i [62].We can now define the event jettiness, or N -jettiness, as the sum over all the M final state parton jettiness values
T N = M k=1 T N (p k ) = M k=1 min i=a,b,1,...,N 2 q i • p k Q i . (2.2)
For Leading Order (LO) events we have {p } = {q i } and the event jettiness is zero.Beyond LO (M > N ), only in the soft/collinear limit will the event jettiness necessarily go to zero.Therefore the event N -jettiness can be used in a non-local subtraction approach where we can isolate the doubly unresolved region of the phase space by demanding T N < T cut N .In this paper we restrict ourselves to color singlet final state events.We can herefore use the event shape variable T 0 to regulate the initial state radiation.

By demanding T 0 < T cut 0 one isolates the doubly unresolved regions of phase space.The matrix elements in the soft/collinear approximation can be analytically integrated over this region and added to the virtual cut 0 → 0 this will result in the correct results for the cross section.

To obtain the analytic soft/collinear expressions we use all-orders resummation results which rely heavily on the machinery of soft-collinear effective theory (SCET) [57][58][59][60][61].The all-orders resummation of the T 0 event-shape variable in the limit T 0 → 0 was constructed in Ref. [62]:
dσ dT 0 = ab dx

e part
ns involved in the scattering.The initial state momenta p a,b are given by the momenta fractions x a,b , while Φ B denotes the Born-level color singlet phase space p a p b → p singlet .The composite Θ(p singlet ) denotes any phase space restrictions on the color-singlet phase space.The soft/collinear function ∆ ab is given by
d∆ ab dT 0 = B a ⊗ B b ⊗ S ab ≡ dt Ba dt B b dt S δ (T 0 − t Ba − t B b − t S ) B a (t Ba , x a , µ) B b (t B b , x b , µ) ab (t S , µ) .(2.4)
A summary of the various components which appear in these expressions is given below:

• The hard function H encodes the effect of hard virtual corrections.At leading order in the α s -expansion it reduces to the leading-order partonic cross section.At higher orders it also contains the finite contributions of the pure virtual corrections, renormaliz nd the scale choice.

• The beam function B a contains the effects of initial-state collinear radiation.It depends on t Ba , the contribution of initial-state collinear radi tion to T 0 .The beam function is non-perturbative; however, up to corrections suppressed by Λ QCD /t B , it can be written as a convolution of perturbative matching coefficients and the usual parton density functions,
f i/H , B a (t Ba , x, µ) = i 1 x dξ ξ I ai (t Ba , x/ξ, µ)f i/H (ξ),(2.5)
where we have suppressed the scale dependence of the parton density functions, and i runs over all partons.The two-loop beam functions have been computed in Refs.[65,66].

• The soft function S collects the jettiness contributions of soft radiation.It depends on t S , the contribution of soft radiation to T 0 .The expansion of the soft function for zero-jettiness up to two-loop order can be found in Refs.[67,68].

The delta function appearing in Eq. (2.4) combines the contribution of each type of radiation to produce the measured value of T 0 .The factorization formula is correct up to power corrections, indicated by the ellipsis in Eq. (2.3).These power corrections can in p , where Q denotes the hard scale in the process (for the zero-jet processes considered here, Q is of the order of the invariant mass of the final state).Integrating Eq. (2.

O crosssection.q
is the overall invariant mass squared of the vector boson and the Higgs boson.
Process µ R µ F Cross-section to NNLO Reference gg → H M H M H 12.937 × (1 + 1.28 + 0.77) pb ggh@nnlo [76] Z 2M Z M Z /2 44.303 × (1 + 0.22 + 0.05) nb ZWMS [77] W + 2M W M W /2 81.561 × (1 + 0.23 + 0.06) nb ZWMS [77] ZH q 2 2 0.68255 × (1 + 0.16 + 0.10) pb vh@nnlo [78, 79] W + H + W − H q 2 q 2 1.2593 × (1 + 0.16 + 0.02) pb vh@nnlo

Process Overview

For all of the studies performed in this paper we perform calculations for the LHC operating at a center-of-mass energy of √ s = 13 TeV.The parameters that are used throughout this pa PDF set (MSTW8nn) that corresponds to α s (M Z ) = 0.11707.

An overview of the processes that will be studied in detail in this paper is shown in Table 3 1 .As well as detailing the default choice of renormalization and factorization scales (µ R and µ F ), this table also shows the corresponding cross-section up to NNLO.The NNLO cross-sections are written in the form,
σ N N LO = σ LO × 1 + ∆σ N LO σ LO + ∆σ N N LO σ LO ,(3.1)
so that, for instance, the corresponding NLO result is obtained by simply omitting the final term in this equation.The cross-sections have been obtained by running the readilyavailable public codes referenced in the final column of Table 3.We now describe the calculational setup that we use for these processes, which corresponds to the default behavior of the

ove codes.This behaviour has been matc
ed in the MCFM code and, in order to establish the equivalence of the parameters for MCFM and the other publicly available codes, we compare results up to NLO in Table 4.The agreement is excellent for all oduce the same results as the other codes when computing the NLO and NNLO predictions using the N -jettiness subtraction method.


Higgs production through gluon fusion

We H a G a µν G µν a . (3.2)
where the sum is over the color degrees of freedom of the gluon.At the order required in this paper, the coefficient C(m 2 t , µ 2 ) is given in the MS scheme by [81,82],
C(m 2 t , µ 2 ) = α S 6πv 1 + α s 4π (5C A − 3C F ) (3.3) + α s 4π 2 27 2 C 2 F + 11 ln m 2 t µ 2 − 100 3 C F C A − 7 ln m 2 t µ 2 − 1063 36 C 2 A − 4 3 C F T F − 5 6 C A T F − 8 ln m 2 t µ 2 + 5 C F T F

f − 47 9 C A T F n
.
Here v is the vacuum expectation value of the Higgs field, v = 246 GeV.The only remaining m t -dependence at this order is the one shown in the O(α 2 s ) contribution to the coefficient of the effective operator.

The validation cross-section for this process is obtained using ggh@nnlo [76].As can be seen from Table 3, at 13 TeV the higher-order corrections to this cross-section are quite large.


W and Z production

To establish the correct values of the higher-order cross sections for W + and Z production we use the program ZWMS [77].For the sake of illustration we have chosen to perform the comparison for only one charge of the W -boson.We note that for the canonical scale choice µ R = µ F = M V (where V = W, Z) the NNLO corrections are ery small.Although this is ultimately an advantage in terms of the accuracy required for phenomenological applications, it prohibits a careful study of the behaviour of the N -jettiness calculation.To enhance the size of the NNLO correction we therefore use an asymmetric choice, µ R = 2M V , µ F = M /2.This results in NNLO corrections of approximately 5% relative to the LO cross-section at 13 TeV (c.f.Table 3).

Note that by default MCFM incl

es the decay of the vector bosons, Z/γ * to a lepton
pair.For comparison with the rate for production of on-shell Z-bosons, we remove the (small) contribution mediated by a virtual photon and divide out the overall branching ratio of the Z-boson to leptons.Comparison of ethod with the codes used for cross-checking in this paper.


Associated Higgs production: W ± H and ZH processes

To establish target NLO and NNLO values for the total cross section for W ± H and ZH production we use the program vh@nnlo [78,79].In order to facilitate an easy comparison with this program, we use the scale choice µ R = µ F = q 2 ≡ (p V + p H ) 2 , with V = W ± or V = Z as appropriate.We also sum over both charges of the W boson, i.e. we include both + and W − contributions -which can differ substantially at a pp collider such as the LHC -in all of the results below.For the diagrams in which the Higgs boson couples directly to a top quark loop we work in the effective theory, valid in the large m t limit given by Eq. (3.2).A detailed phenomenological study of the NNLO implementation of these processes in MCFM has been presented in Ref. [22].

For both W ± H and ZH processes the correction originating from diagrams with the Higgs boson coupling to a top quark loop is approximately 1.5%.The ZH process also includes a substantial finite co

onent due to gg → ZH
loops at NNLO.The NNLO corrections that correspond to simple dressings of the LO diagrams are very small, of order 1%, for both W ± H and ZH production.The net effect of all these contributions is shown in Table 3, where the NNLO corrections to the ZH process are at the level of 10%.In contrast, the total NNLO correction to the W ± H process is about 2% of the LO result.


Diphoton production

NNLO predictions for the diphot

process, obtained using MCFM, h
ve been presented in ref. [33].Therein the results have been validated using the same procedure as we will adopt later; we do not repeat that analysis here.However, we will later on summarize the size of the power corrections and timing results for this process.5. NLO corrections to the processes computed in this paper using the N -jettiness method.


N -Jettiness subtraction at NLO

Although the calculation of NLO corrections for the processes considered here is straightforward, a detailed examination of the corresponding N -jettiness subtraction calculation is extremely useful.It provides a stringent check of the accuracy of this approach, namely a direct probe of the size of the power corrections that have been neglected in Eq. ( 2.


3).

This can be tested with exquisite accuracy, due to the relative simplicity of the calculation compared to the corresponding exercise at NNLO.This comparison can also illuminate the limitations of this approach when moving beyond an inclusive calculation, by using MCFM to compare calculations of differential distributions at NLO.The calculation of NLO corrections using the N -jettiness subtraction method is straightforward in MCFM.The below-cut contribution is easily computed, while the above-cut contribution corresponds to a LO calculation of the process that contains an additional parton.In order to avoid numerical instability in calculations using MCFM, previous versions of the code have applied a small cutoff on all invariant masses present in the problem, √ s ij > cuto

.In this version this has
been changed so as to enforce a small cutoff on the partonic jettiness of every parton present in a given calculation, T N (p j ) > cutoff.Since the above-cut region involves a standard LO calculation, for which there are no numerical instabilities, we are able to choose a value for this cutoff close to the limit o double precision, cutoff = 10 −12 GeV.


Inclusive cross-sections

The benchmark cross-sections that form the basis for this comparison can be extracted from Table 4 and, for convenience, have been summarized in Table 5.As is well-known, the NLO corrections to Higgs production through gluon fusion are very large, while all of the other processes receive corrections of order 20%.

A comparison of the N -jettiness calculations of these coefficients, with the results shown in Table 5, is shown in Fig. 1.The ratio of the calculations is shown as a function of T cut 0 , for a range of suitable values of T cut 0 .The approach of the N -jettiness calculation to the correct result as T cut 0 → 0 is clear for each process.However the manner in which the correct result is reached varies considerably.For instance, Higgs production through gluon fusion approaches the correct result from above, while the other processes approach it from below.The approach is much slower for W + and Z production than for any of the Higgs production processes, with percent level accuracy only reached for T cut 0 0.01 GeV.

Figure 1.The ratio of the NLO correction calculated using N -jettiness subtraction as implemented in MCFM to the standard MCFM subtraction result (as presented in Table 5).The t 0 in GeV.The comparison is performed for gg → H, Z, W + , ZH and W ± H production and the lines represent fits to the individual points using the form given in Eq. (4.1).

The approach of the N -jettiness subtraction result to the correct answer is determined by the behaviour of power corrections that are not accounted for at present.A the NLO level after integration over the final-state phase space this can be modeled by the following functional form,
∆σ N LO jettiness (T cut 0 ) = ∆σ N LO + c × T cut 0 Q × log T cut 0 Q , (4.1)
where Q is the appropriate scale for the process at hand and c is an unknown constan

For single boson production
is taken to be the mass of the produced particle (M H , M W or M Z ) while for the associated production processes we use  5 is no larger than one per mille for all processes.The results of a study of the dominant NLO power correction, obtained using an analytic calculation [54], will be discussed in Section 6.
Q = M W + M H and Q = M Z + M H .
Since the speed of the approach to the correct result is qualitatively much worse for W and Z production it is instructive to examine the processes in more detail in order to uncover the origin of the difference.To that end we now turn to a comparison of more differential results.


Rapidity distributions at NLO

The simplest distribution to tudy is the rapidity of the produced system, which is intimately related to the momentum fractions carried by the incident partons.We will compare the prediction for the NLO contribution to this distribution (i.e.corresponding to ∆σ N LO ) computed using dipole subtraction and jettiness subtraction with T cut 0 = 0.01 GeV and T cut 0 = 0.04 GeV.The difference between the true result and the jettiness calculation for T cut 0 = 0.04 GeV is about 0.4% for gg → H, 1.5% for Z production and 0.3% for ZH.These processes are sufficient to illustrate the issue, since W + and W ± H production show very similar behaviour to the Z and ZH processes respectively.

Results are shown in Fig. 2. The agreement of the jettiness calculations with the normal MCFM result is excellent overall, particularly for central production |y|

However there is evidence
for an increase in the size of the power corrections at larger absolute rapidities2 .The reason for the qualitative difference in the behaviour is thus two-fold.First, the onset of power corrections with increased rapidity occurs sooner for Z production.Second, and critically, the shape of the rapidity distribution is much broader for Z production so that the effect of the high-rapidity tails is more apparent in the inclusive rates presented in the previous section.It suggests that a restriction to more central rapidities would decrease the effect of power corrections and speed the convergence to the correct result.


Cross-sections under cuts

As an explicit demonstration of this behavior we will contrast the effect of the power corrections on the inclusive cross-section with the behavior under a more realistic set of experimental cuts.Rather than cutting directly on the rapidity of the W or Z boson, we instead apply a minimal set of cuts on the W and Z boson decay products that might be applied in an experimental nalysis.We consider a Z boson decay to an electron-positron pair and demand that both leptons be observed in the central region, |y(e ± )| < 2.5.For the W + boson case we consider the decay into an positron and neutrino, imposing a rapidity constraint on the charged lepton |y(e + )| < 2.5 and a minimum missing transverse energy (MET) of 30 GeV.Note that the application of these cuts means that a comparison with the code ZWMS can no longer be made.Although DYNNLO [14] or FEWZ [83,84] could be used to provide a reference cross-section under these cuts we do not pursue that

re.Instead we simply normalize to
the (fitted) asymptotic result.

The results of this study are shown in Fig. 3.As anticipated, the effect of the cuts is to significantly decrease the T cut 0 -dependence of the cross-section.For instance, rather than a difference of approximately 1% with the asymptotic result for T cut 0 = 0.02 GeV in the inclusive case, the fiducial cross-section differs by a few per mille or less for the same value of T cut 0 .The inability to restrict the rapidity of the unobserved neutrino in the case of W + → e + ν production, compared to Z → e − e + , leads to a slightly slower approach to the correct result.in GeV.The comparison is performed for Z (top) and W + production (bottom) and for both the inclusive case and for a minimal set of fiducial cuts (detailed in the text).The lines represent fits to the ind

idual points using the fo
m given in Eq. (4.1).


N -Jettiness subtraction at NNLO

At NNLO, the N -jettiness subtraction method involves an above-cut contribution that corresponds to a NLO calculation of the process containing an additional parton.In contrast to the previous order, this results in genuine numerical instabilities that primarily arise from the cancellation of subtraction terms in the real radiation contribution.As a result we must use a larger value of the safety-cutoff parameter, namely cutoff = 10 −8 GeV.This is appropriate for computations in double precision, such as the ones presented in this paper.Although we do not include any quadruple precision results here, we note that this cut may be relaxed significantly in that case.We note the caveat that the running time of the code increases significa tly in quadruple precision, by about an order of magnitude.


Inclusive cross-sections

The expected NNLO cross-sections in the inclusive case, obtained using the already-available public codes listed previously, are shown in Table 6.The corrections to the gg → H process are again large at this order, while all of the other processes have corrections in the 2-10% range.Of these other processes ZH production has the largest correctio ulation of the NNLO coefficients by jettiness subtraction are compared with results from the literature in Fig. 4. Note that all of the plots use a common scale for the ordinates, which display the ratio, except for the one representing the gg → H calculation, for which the power corrections are much smaller.It is clear from this figure that there is a slower approach to the asymptotic result than at NLO, but that excellent agreement is still obtained for smaller values of T cut 0 .The relatively poorer approach to the true result is expected from the behaviour of the power corrections at NNLO, whose leading two terms can be modeled after integration over the final-state phase space as
∆σ N N LO jettiness (T

ut 0 ) = ∆σ N N LO + c 3 × T cu
0 Q × log 3 T cut 0 Q + c 2 × T cut 0 Q × log 2 T cut 0 Q ,
(5.1) where Q is the appropriate scale as before and c 2,3 are unknown constants.Also shown in Fig. 4 are fits of the results to Eq. ( 5.1), with the values of ∆σ N N LO and c 2,3 determined in the fit.The subleading term is only important in the case of the gg → H process, in order to capture the observed turn-over for larger values of T cut 0 .For gg → H, ZH and W ± H production the fit value of ∆σ N N LO differs from the known result given in Table 6 by less than one per mille.For the Z and W + processes the agreement is not as good, at the level of approximately 4%.Again, the dominant NNLO power correction can be calculated analytically from first principles [54] and its impact will be shown in Section 6.


Rapidity distributions at NNLO

Given the effect of the power corrections on the rapidity distribution at NLO, we expect to see a similar pattern at NNLO.We compare predictions for T cut 0 = 0.01 GeV and T cut 0 = 0.004 GeV.For the gg → H and ZH processes that we study here, the predictions for T cut 0 = 0.004 GeV should be a good proxy for the exact distribution given the small deviations from the inclusive cross-section to which they correspond (around 0.8% for both).For Z production, this value of T cut 0 yields a total cross-section that differs by 10% from the known result.To obtain an actual phenomenological result one must run with a lower T cut 0 .Nevertheless it is sufficient to demonstrate the pattern of the power corrections.

The dependence on T cut 0 of the NNLO contributions to the rapidity distributions is llustrated in Fig. 5.As observed at NLO, all three distributions are much less sensitive to the choice of T cut 0 in the central region than at large rapidities.The quality of the independence from T cut 0 deteriorates substantially for |y| 2. However, even in the central region, the Z process is far more affected by the choice of T cut 0 than the other two calculations.In the more forward regions, which still contribute to the cross-section at an appreciable level, the T cut 0 dependence rises to the level of a few tens of percent.For this reason it is  6, as a function of the N -jettiness resolution parameter T cut 0 (in GeV).The comparison is performed for gg → H, Z, W + , ZH and W ± H production and the lines represent fits to the individual points using the form given in Eq. (5.1).

crucial to apply the basic fiducial cuts introduced earlier in order to obtain a percent level agreement with the NNLO coefficient.In contrast, for phenomenology it is sufficient to study the effect of the value of T cut 0 not on the effect of the NNLO correction itself, but on the total prediction at that level of accuracy.In that case the smallness of the N

O coefficient in the case
f Z production is an advantage as it suppresses the relative size of the power corrections in the total.On the other hand the gg → H process, which receives a very large correction at NNLO, is more easily subject to power corrections.In order to provide a full NNLO prediction for the rapidity distributions discussed in this section we sum the results of a standard MCFM calculation at NLO and a computation of only the NNLO correction using jettiness subtraction.The resulting distributions are shown in Fig. 6.The gg → H and Z production processes differ by a couple of percent in the tails of the distribution, for these two values of T cut 0 , but are otherwise in excellent agreement.The dependence on T cut 0 is even smaller for the case of ZH production.


Cross-sections under cuts

Although the W and Z production cases are the most sensitive to T cut 0 at NNLO, at this order both ZH and W ± H production also display a non-negligible dependence on T cut 0 .We therefore consider all four processes in this section.For W and Z production we apply the same cuts as before.For the other processes we consider the final states W ± (→ e ± ν)H(→ γγ) and Z(→ e + e − )H(→ b b) but do not apply any cuts to the Higgs boson decay products in either case.In this way the results remain valid for any decay channel of the Higgs boson.The W ± and Z decay products are subject to the same cuts as in the corresponding inclusive W and Z production processes.

The results of this study are shown in Fig. 7.For the W and Z cases, the i

rovement is dramatic; for T cut 0 = 0.02 Ge
the difference from the asymptotic result improves from approximately 35% in the inclusive case to 8% under cuts.A similar level of improvement


Process

σ LO,f id ∆σ N N LO,f id ∆σ N N LO,f id /σ applies in the case of W production.For ZH production the gain is less pronounced due to the fact that only the Z decay products are restricted in rapidity, which results in a less stringent constraint on the combined ZH system.Nevertheless, the agreement with the asymptotic result improves by about a factor of two relative to the case of no cuts.The asymptotic value of each NNLO N -jettiness calculation, together with the LO cross-sections under the fiducial cuts used in this study, are shown in Table 7.


Analytic Power Corrections at NLO and NNLO

As is clear from the discussions in the earlier sections it is important t understand the T cut 0 dependence.The choice of this cut is a balance between achievable statistical uncertainties and the uncertainty due to the missing power corrections of Eq. (4.1) and Eq.(5.1).It would therefore be highly beneficial if one could calculate some of the power corrections analytically.This will both speed up the code and increase

e accurac
.A preliminary calculation of the dominant power correction for NLO and NNLO as a function of the rapidity of the Higgs boson has been performed [54].The results for the total gg → H cross section are shown in Fig. 8.The full result for the leading power correction (in GeV).The comparison is performed for Z, W + , ZH and W ± H production and for both the inclusive case and for a minimal set of fiducial cuts (detailed in the text).The lines represent its to the individual points using the form given in Eq. (5.1).

to Drell-Yan like processes will be discussed in a separate publication [54] and included in a future version of MCFM.

As can be seen from Fig. 8 the effect of including this term into MCFM is substantial and one can choose the T cut 0 an order of magnitude larger and still obtain about the same uncertainty due to the new subleading power corrections.This will have a large impact on the ultimate achievable precision of the jettiness method as implemented in MCFM.


Numerics

In this section we discuss the numerical performance of MCFM.As an illustration we will run the hybrid openMP/MPI version of MCFM on a modest sized cluster.This cluster consists of 24 nodes, each node having of a motherboard with two Intel X5650 chips (2.67 GHz) using an unified memory.Each of the Intel chips has 6 cores, resulting in a total of 24 × 2 × 6 = 288 computing cores for the cluster.The nodes are connected using InfiniBand NFS mounts.

We will use 4×100, 000+10×1, 000, 000 VEGAS events in the remainder of this section.It is straightforward to scale the results obtained for this particular cluster to other cluster configurations.Speci ically we examine two important performance issues.First, we will look at the time required to calculate the cross section as a function of the number of cores used.Second, we will look at the obtained statistical precision due to the Monte Carlo integration as a function of the T cut 0 parameter.For all the runs in this section we use, in addition to the input parameters of Table 2, a collision energy of 14 TeV and an inclusive anti-k T jet algorithm with a cone size of 0.4.We apply, where applicable, the following cut on the transverse momenta of the final state objects p JET T > 20 GeV, p l ± T > 25 GeV, p MISS T > 40 GeV, p γ 1 T > 40 GeV and p γ 2 T > 20 GeV.The rapidity of all final state objects is required to be less than 2.5 and we require a separation between the observable final state objects of ∆R > 0.4.When a Z-boson is produced we apply the additional cut on the di-lepton invariant mass of 40 GeV with no separation requirement between the two charged leptons.

Calculating cross sections at higher order requires a significant amount of computing power.In Ref. [53] several of us extended MCFM to use openMP by modifying VEGAS in such a manner that it distributes the event generation and evaluation over the computing cores of a single node/motherboard.By using multiple computing cores openMP makes the evaluation of NLO cross sections on desktops efficient, while still using a single VEGAS grid for the optimization of the numerical integration.For a timely evaluation of cross sections at NNLO it is desirable to use a cluster combining many processors.As the processors in a cluster do not share the same physical memory one has to use MPI.We extend VEGAS to use MPI to distribute the event generation and evaluation further over all processors, while openMP still distributes the events per processor over its computing cores.Again a single VEGAS grid is used to optimize the numerical integration.It is important to use openMP to distribute the events on a single processor as it keeps only one version of shared variables, while MPI would keep a separate copy of those variables for each MPI process thereby using the limited cache memory in an inefficient manner.This is particularly important as MCFM use large shared arrays such as for example the VEGAS grid, PDF grids, histograms, etc. which are common for all computing cores.It i therefore beneficial to maintain a hybrid openMP/MPI version of VEGAS, especially given the continuing increase of the number of cores per processor.

There are two limits which come into play when executing parallel code.The first limit is the memory bound limit.Here the evaluation time is determined by memory transfers and not by computations.In this limit the evaluation time will not scale well when adding more computing cores and improving the scaling behavior will be difficult, necessitating a better management of cache memory by the openMP code and/or more efficient message passing by the MPI code.In the other limit the evaluation time is determined by the computations and time used for memory management is negligible.In this limit the execution time will scale perfectly with the number of processors, i.e. doubling the number of processors will half the execution time.These limits are important in order to understand the scaling behavior seen in MCFM.

The scaling of the computing time with the number of processing cores for the process pp → W + → l + ν is given in Table 8, with a visible representation in Figure 9.We have the option to run one MPI process per node and let openMP distribute the events over the 12 cores of the two processors (indicated by the 1 × 12 column).This in gen ral is not a preferred mode of operating because the cache memory is divided over the two processors, requiring openMP to make sure the two cache memories are synchronized, unduly invoking a memory management overhead on the time needed for the evaluation.Alternatively, by running two MPI jobs per node openMP is used on a single processor thereby optimizing the cache usage and minimizing memory management overhead (2 × 6 column).This is clearly demonstrated in Table 8.For example using all 288 processors on the cluster we can use of 48 MPI jobs (2 MPI jobs per node) to evaluate the NNLO cross section in 328 seconds, or use 24 MPI jobs (1 MPI job per node) requiring 358 seconds to evaluate the NNLO cross section.The time difference is due to the fractured cache memory caused by  forcing openMP to use two processors in the case of running with 24 MPI jobs.Therefore in the remainder of this sectio 2 MPI jobs per node, allowing openMP to operate on a single processor.

Especially at LO, and to some extent at NLO, the computation effort to evaluate this process is minimal, making the execution time operate close to the memory bound limit.This behavior is exhibited in
+ → l + ν, pp → Z → l + l − , pp → H → γγ, pp → H + W + → γγ + l + ν, pp → H + Z → γγ + l + l −
and pp → γγ cross sections at NNLO using a given number of MPI processes for the node openMP texture of one MPI job per 6-core processor (2 × 6).

longer improves when using more than 12 MPI jobs (and for NLO more than 18 MPI jobs).At NNLO using more cores still improves the evaluation time, s a consequence of the need to evaluate a large number of the more computational intensive double parton bremsstrahlung events.It is worth noting that the execution time on a single processor using openMP executes in under 3 hours making the evaluation of this process on desktops very feasible.Running on the full cluster using the 48 processors results in an execution time of less than 4 minutes.This means one can easily increase the number of events and lower the tau cut value to obtain better statistics.

Next we look at all the new NNLO processes added to MCFM in Table 9 and Figure 10 where the time in seconds is given as a function of the number of MPI jobs (= number of processors) used.As can be seen, the processes scale well all the way up to the 288 processors.Some indication of a less than perfect scaling can be seen in the simplest of the NNLO processes pp → H → γγ when we get to a high number of processors indicating there is some memory overhead.All other processors still are computing dominated which will allow easy speed-up by invoking even more processors.The most complicated NNLO process pp → γγ takes just under 11 minutes to evaluate using 48 processors.Therefore obtaining higher statistics is rather easy.This process would still only take a bit less than 8 hours on a single processor desktop.

Finally, the statistical integration error btained for the inclusive cross section given the cuts using the 10,000,000 VEGAS events as a function of the T cut 0 is given in Table 10.The evaluation times are given for using 8 processors.As can be seen there is a small dependence of the evaluation time on the choice of the T cut 0 .As we choose the T 0 cut smaller the Monte Carlo becomes more "efficient" because it will generate more soft/collinear events.That is, less events will be rejected by the cuts hence the evaluation time will grow.

As can be seen from table 10, the acquired statistical uncertainty is quite process dependent.However the value of the T cut 0 will also determine the systematic error due to   0.2% (1847) 0.08% (3427) 0.5 0.04% (1266) 0.04% (2356) 0.04% (1176) 0.1% (1150) 0.09% (1768) 0.07% (3376) Table 10.The relative statistical precision (in percentages) on the pp → W + → l + ν, pp → Z → l + l − , pp → H → γγ, pp → H + W + → γγ + l + ν pp → H + Z → γγ + l + l − and pp → γγ cross sections at NNLO as a function of T cut 0 (in GeV) using 4 × 2 × 6 cores.Also given in brackets is the evaluation time (in seconds).the power corrections.Looking at Table 11 we see the required value of T cut 0 to reduce the power corrections to a 1% or a 0.2% level. 3First, focussing on the 1% uncertainty we see that in all cases statistical error obtained with the 10,000,000 events is smaller than 1%.The worst case is the inclusive W + production with a statistical uncertainty of 0.7%.For all other cases the statistical error is more than on order of magnitude smaller.To achieve a systematic error of about 0.2% we see that we need to reduce the statistical uncertainty significantly in order to be smaller than the systematic error.Th reduction for some (in GeV) required to perform the NNLO N -jettiness calculation to a given accuracy, for the processes studied in this paper.At larger T cut 0 the accuracy deteriorates because of increased power corrections.

processes is about an order of magnitude, requiring of the order of 100 times more events.T

s means
hat an overall uncertainty of order 1% is easily obtainable using a desktop, however going to the per-mille level will require a modest computer cluster such as the one used for the numerical results in this section.


Summary

In this paper we detailed the performance of the first NNLO version of MCFM.Using the non-local N -jettiness subtraction method, we included the NNLO corrections for six final states: pp → W ± , pp → Z/γ * , pp → H, pp → W ± H, pp → ZH and pp → γγ.For each process decays of the unstable vector bosons are included where appropriate.The method was checked at NLO against existing calculations and excellent agreement was found.At NNLO the dependence on the jettiness cut was studied in great detail and some guidelines on the choice of the jettiness cut were given for all NNLO processes added to MCFM.

Another addition to MCFM is the ability to run in a hybrid openMP/MPI mode, enabling the Monte Carlo to use clusters efficiently while still maintaining a single VEGAS grid.The evaluation time for NNLO inclusive cross sections with an overall precision better than 1% on a single 8-core processor using openMP ranges from 3 hours to 8 hours depending on the specific process.It was shown all processes scaled well on a multi-processor cluster us ng MPI in addition to openMP, giving an evaluation time of 5 minutes to 11 minutes on a 48 8-core processor cluster.A small cluster of 50+ cores will give good statistics for distributions at NNLO in a short tim