The Continuum and Leading Twist Limits of Parton Distribution Functions in Lattice QCD

In this study, we present continuum limit results for the unpolarized parton distribution function of the nucleon computed in lattice QCD. This is the first continuum limit obtained with the pseudo-PDF approach, which uses Short Distance Factorization to factorize lattice-calculable matrix elements. Our findings are compared with the pertinent phenomenological determinations. Inter alia, we employ the summation Generalized Eigenvalue Problem (sGEVP) technique in order to optimize our control over the excited state contamination, which can be one of the most serious systematic errors in this type of calculation. A crucial novel ingredient of our analysis is the use of Jacobi polynomials to parameterize both the leading twist distribution and the nuisance terms that characterize and remove lattice spacing and higher twist contaminations. This method can be extended in future studies to remove other systematic errors.


Introduction
Ever since the pioneering deep inelastic lepton-proton scattering (DIS) experiments at SLAC, which yielded the first evidence for proton substructure, the drive to understand the fundamental constituents of the nucleon has led to a wealth of theoretical and experimental results. On the theoretical side, the QCD factorization theorem allows for a separation of hadronic cross sections into perturbative, process-dependent partonic cross sections and nonperturbative, process-independent parton distribution functions (PDFs). Thus, in order to decipher the information coming from the Large Hadron Collider (LHC) experiments, and to capitalize maximally on the potential of the upcoming Electron-Ion Collider (EIC), it is of vital importance to determine the PDFs accurately. The calculation of the momentum distribution carried by the bound quarks and gluons within the proton is a nonperturbative problem which, due to the lightcone nature of the PDFs, was elusive to lattice QCD calculations until very recently. Consequently, progress was made through global fits of experimental data or through modelling. PDFs have enormous phenomenological value: they constitute a limiting uncertainty for the characterization of the Higgs boson in terms of its couplings, they are the dominant systematic for precision Standard Model (SM) measurements, such as the W boson mass, and they are among the biggest uncertainties for beyond-the-SM heavy particle production. A precise knowledge of the PDFs can therefore help to rule out a broad class of BSM models. For a comprehensive review we refer the reader to [1].
A series of lattice methodologies have been developed to circumvent the issues stemming from the lightcone nature of PDFs, and in recent years there has been a new dawn in the ab-initio determination of lightcone parton distributions via numerical lattice simulations. The first such approach [2][3][4] was to directly calculate the hadronic tensor and factorize it in a manner similar to that adopted when interpreting DIS data. This direct approach can probe not only the DIS regime, but also the resonance and shallow inelastic regimes, which makes it unique amongst the lattice approaches. A related approach [5] was proposed to calculate distribution amplitudes, in which a scalar quark is inserted between the two currents instead. The most widely adopted approach, and the one most responsible for today's vast amount of activity in lattice calculations of PDFs, is the Large Momentum Effective Theory (LaMET) method [6]. In this approach, a matrix element of an operator with a spacelike separation is Fourier transformed with respect to the separation length to obtain a so-called quasi-distribution. A factorization theorem applies when the matrix element has sufficiently large external momentum, hence the name LaMET. The next approach [7][8][9][10], called OPE-without-OPE, is to calculate either the hadronic tensor or the matrix elements with spacelike separations used in later approaches. These matrix elements can be used to determine moments of the PDF through the OPE, and the PDF can then be determined from those moments. Attempts to reconstruct the PDF from the limited number of moments accessible from local matrix elements had already met with some success [11]. In the OPE-without-OPE approach, far more moments are accessible than through the calculation of local matrix elements, and the PDF extraction can be systematically improved as more moments are obtained.
Finally, the approach adopted in this study, originally proposed in [12] and now usually called Short Distance Factorization (SDF), is to calculate matrix elements of operators with a spacelike separation, typically named pseudo-distributions or lattice cross sections (LCS) in this context. The Operator Product Expansion (OPE) is used to relate these matrix elements to lightcone matrix elements through a factorization theorem at small spacelike separation. Originally, SDF was proposed to determine meson distribution amplitudes [12], but the method was later independently reinvented in [13,14], motivated by LaMET, for calculations of the PDFs. Since this approach uses the same type of matrix elements as LaMET, the two methods are intimately related in their factorization theorems, but they provide two distinct limits for approaching the lightcone distributions, with different power corrections. The power corrections for SDF are ordered by matrix elements of operators with distinct twist t and are proportional to (z^2)^{t-2}. The power corrections of LaMET come from matrix elements with mixed twist and are proportional to (p_3^{-2})^n.
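Schematically, and suppressing Wilson coefficients and renormalization scales, the two expansions can be contrasted as follows (a sketch only; the coefficient functions B_t and c_n are placeholders, not results quoted from the text):

```latex
\text{SDF:}\qquad \mathfrak{M}(\nu, z^2) \;=\; \mathcal{M}_{\mathrm{lt}}(\nu)
   \;+\; \sum_{t>2} B_t(\nu)\, (z^2)^{t-2}
   \;\xrightarrow{\;z^2 \to 0\;}\; \mathcal{M}_{\mathrm{lt}}(\nu) ,
\qquad
\text{LaMET:}\qquad \tilde{q}(x, p_3) \;=\; q(x)
   \;+\; \sum_{n\ge 1} c_n(x)\, \left(p_3^{-2}\right)^{n}
   \;\xrightarrow{\;p_3 \to \infty\;}\; q(x) .
```

The two limits, short distances for SDF and large momenta for LaMET, are thus complementary routes to the same lightcone distribution.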
All these frameworks that target the ab-initio study of parton physics via lattice calculations in Euclidean space have incited feverish activity in the lattice community.
Reviews of all the aforementioned lattice approaches to the extraction of lightcone PDFs can be found in [115][116][117].
The lattice regulator itself induces artifacts which contaminate any quantity one wishes to compute. Ultimately, a continuum limit must be properly taken in order to remove this regulator dependence and obtain the final result. The most widely used lattice actions and some observables are O(a) improved, but this is not the case for the fermion bilocal matrix element that is the starting point of the quasi- and pseudo-distributions. Moreover, no Symanzik program has yet been developed for this quantity, and consequently the cut-off effects are of O(a). This issue could in principle translate into sizeable cut-off effects and render the approach to the continuum limit quite delicate [107,108]. In this paper, we study the continuum limit of the nucleon unpolarized parton distributions, employing the SDF method of Ioffe time distributions and three lattice ensembles with lattice spacings ranging from 0.0749 fm to 0.0483 fm. The finest lattice is finer than those used in the continuum limits of previous LaMET studies [107,108]. These ensembles allow us to gain a firm understanding of the size of lattice artifacts encountered in such studies and to study the extrapolation to the continuum limit. In general, there are different ways to take the continuum limit, and the procedure is complicated by the two physical scales in a lattice PDF calculation: the spacelike separation and the momentum of the hadron. In previous continuum limit studies [107,108], the ensembles were chosen such that one of the scales, the momentum, is the same across the ensembles. In this manuscript, we advocate a method which has not been studied before in the literature. It allows any ensemble to be used even if the separation and momentum scales cannot be exactly matched. Moreover, it allows the use of data from all available momenta and separations, within the limitations of the assumptions made for the LaMET or SDF factorization.
The manuscript is organized as follows. In Sec. 2, we review the SDF approach as applied to the matrix elements which define quark PDFs and pseudo-PDFs. In Sec. 3, we define the models used to describe the leading twist PDF and the nuisance terms which control the systematic errors. In Sec. 4, we describe the lattice QCD methodologies used for the extraction of matrix elements. In Sec. 5, we discuss how Bayes' theorem is used to determine the most probable model parameters and how the fits to the PDF and nuisance terms are employed. In Sec. 6, we present the results of fitting PDFs to a range of models and discuss how the modification of the nuisance terms, as well as the Bayesian priors, affects the final results. Finally, in Sec. 7, we summarize our findings and discuss future studies.

Ioffe time pseudo-distributions
As described in [118], parton distributions can be described in terms of a boost invariant matrix element called the Ioffe time distribution (ITD), whose Fourier transform gives the standard PDFs. The ITD, up to a factor of 2P^α, is a special case of the generic Lorentz covariant matrix element in Eq. (2.1), which was first studied at length in [15] prior to proposals to factorize it [6,13], where W(z;0) denotes a straight Wilson line of length z and λ_3 is a flavor Pauli matrix projecting onto the flavor non-singlet distribution, which is easier to handle in lattice QCD. This matrix element has a Lorentz decomposition in which ν = p·z is here called the Ioffe time [118]. For timelike separations, it equals the product of the hadron mass and the time t (i.e., mt) in the hadron's rest frame; this time t is what is typically referred to as the Ioffe time in analyses of DIS [119]. The ITD is given by the special case of lightlike separation z = (0, z^-, 0_T) and α = +, where, in lightcone coordinates, the + and − directions are defined by the direction of the hadron's momentum. The Fourier transform of the ITD with respect to ν gives the PDF, where x is the Fourier-conjugate variable to ν. Due to this lightlike separation, the ITD cannot be directly calculated in lattice QCD, where the calculation takes place in Euclidean space, which only allows spacelike separations. Following the framework of Short Distance Factorization (SDF) [12], the Lorentz invariant function M, which will be called the Ioffe time pseudo-distribution (pseudo-ITD) [13], can be related to the ITD through a factorization relationship. This term can be isolated from the purely higher twist distribution N by the choice z = (0, 0, z_3, 0), p = (0, 0, p_3, E), and α = 4 in Euclidean Cartesian notation.
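For orientation, a sketch of the matrix element and its decomposition referenced above, following the standard pseudo-distribution literature (normalization conventions vary between references):

```latex
M^{\alpha}(p, z) \;=\; \langle p |\, \bar{\psi}(z)\, \gamma^{\alpha}\, \tfrac{\lambda_3}{2}\, W(z;0)\, \psi(0) \,| p \rangle
  \;=\; 2 p^{\alpha}\, \mathcal{M}(\nu, z^2) \;+\; 2 z^{\alpha}\, \mathcal{N}(\nu, z^2),
  \qquad \nu = p \cdot z,
\qquad
q(x) \;=\; \frac{1}{2\pi} \int_{-\infty}^{\infty} d\nu\; e^{-i x \nu}\, \mathcal{M}(\nu, 0).
```

The term proportional to p^α carries the Ioffe time distribution, while the term proportional to z^α is purely higher twist; the kinematic choice described above isolates the former.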
The OPE of the pseudo-ITD is given in terms of a_n, the Mellin moments of the PDF, c_n, perturbatively calculable Wilson coefficients, and O(z^2) higher twist and target mass corrections. It is important to note that we use an unconventional definition of the moments a_n. This sum can be rearranged into a more standard convolutional form, in which the kernel C is the inverse Mellin transform of the Wilson coefficients c_n. The dependence of this kernel, C, on µ^2 z^2 is precisely the Dokshitzer-Gribov-Lipatov-Altarelli-Parisi (DGLAP) scale evolution. There is also a scale-independent contribution which depends on the choice of MS-bar to renormalize the ITD and on whatever renormalization prescription is chosen for the pseudo-ITD. The kernel in this convolution has been calculated to O(α_s) for multiple renormalization schemes [45][46][47], and recently an O(α_s^2) calculation has appeared in the literature [95]. For this study only the O(α_s) kernel is used, since the precision of the data is below the expected size of the O(α_s^2) terms. In future work, the O(α_s^2) kernel will be used to improve the theoretical accuracy of the factorization procedure. Instead of utilizing a typical lattice renormalization scheme, such as a regularization independent momentum subtraction (RI-MOM) scheme, it is useful to consider the renormalization group invariant (RGI) quantity called the reduced pseudo-ITD [13]. The matrix elements involving the local vector current, M^0(0, 0) and M^0(p, 0), and the pseudo-ITD, M^0(p, z) and M^0(0, z), are all multiplicatively renormalizable [38]; the latter depend on the separation z only through the divergences related to the Wilson line [38,120,121]. Within this double ratio the renormalization constants all cancel explicitly and nonperturbatively, in a way independent of the renormalization scheme, making the reduced pseudo-ITD an RGI quantity [9].
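Concretely, the double ratio defining the reduced pseudo-ITD (Eq. (2.6) in the text) takes the form (a sketch; the Lorentz index is suppressed):

```latex
\mathfrak{M}(\nu, z^2) \;\equiv\;
\frac{\,M(p, z)\,/\,M(p, 0)\,}{\,M(0, z)\,/\,M(0, 0)\,}\, .
```

The z-dependent Wilson line divergences cancel between the p ≠ 0 and p = 0 matrix elements, and the overall current renormalization cancels against the z = 0 matrix elements, which is why no renormalization constants remain in the ratio.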
The matching of this object to the MS-bar ITD lacks the scheme dependent systematic errors which have been observed in calculations of the related quasi-PDF quantities [37,68], which so far have always used different variants of RI-MOM schemes. Additionally, this object has dramatically reduced higher twist errors compared to the RI-MOM renormalized matrix elements, as originally suggested in [13] and directly observed in [100]. Besides avoiding the aforementioned complications, the ratio is free of the pathologies of fixed gauge renormalization, as well as of the undesirable systematic effects that plague any RI-MOM type of calculation. The higher twist, lattice spacing, finite volume, and unphysical pion mass systematic errors are all reduced, as has been observed in [74,75,86,100]. Finally, this particular choice of ratio cancels correlated fluctuations between the numerator and denominator for small momenta and small separations, leading to a measurable improvement of the signal-to-noise ratio of the pertinent matrix element. This may prove crucial in studying statistically noisier cases, such as the pion quark PDF, the gluon PDF, and quark disconnected matrix elements, which we wish to address in the future. A related ratio, utilizing matrix elements with non-zero momenta only, has been described in [100]; it leads to a more complicated matching relationship connecting the RGI ratio and the PDF. Since three of the four matrix elements in Eq. (2.6) defining the reduced pseudo-ITD, after renormalization in an RGI scheme, are equal to unity up to lattice spacing and higher twist errors, they do not modify the matching relationship in Eq. (2.5). We separate the scale dependent DGLAP contribution and the scale independent, scheme dependent contribution to the kernel C up to O(α_s), where C_F is the fundamental Casimir of SU(3) and γ_E is the Euler-Mascheroni constant [45][46][47].
As can be seen in Eq. (2.3), the (reduced) pseudo-ITD can be directly related to the moments of the PDF without going through the ITD itself. A Taylor expansion of the (reduced) pseudo-ITD with respect to the Ioffe time can be written in terms of m_n(z^2), called the pseudo-moments, which are the Mellin moments of the pseudo-PDF, the Fourier transform with respect to ν of the (reduced) pseudo-ITD. The pseudo-moments have a multiplicative matching relationship to the MS-bar PDF moments, where c_n are the Mellin moments of the matching kernel C. These matching coefficients have been calculated to O(α_s) for the reduced pseudo-PDF, where γ_n are the well known anomalous dimensions of the moments of the PDF and the scheme dependent term is given in Eq. (2.13). One could easily fit the Ioffe-time dependence of the reduced pseudo-ITD for each separation z^2 independently to obtain the pseudo-moments. By studying the z^2 dependence of the pseudo-moments, as well as of the resulting MS-bar PDF moments, one can try to estimate the size of systematic errors. Deviations of MS-bar PDF moments in the low z^2 data could signal large lattice spacing errors, and deviations in the large z^2 data could signal large higher twist errors. A wide region of z^2 where the resulting MS-bar PDF moments are independent of z^2 would indicate a window where the data are free of these systematic effects. It is only within such a window that an ITD derived from the reduced pseudo-ITD can be trusted without other methods of removing systematic errors.
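The Taylor expansion in moments can be made concrete with a toy numerical example: below, the real part of a model ITD is computed both directly, as a cosine transform of a toy PDF, and via the truncated moment expansion Re M(ν) ≈ Σ_k (−1)^k ν^{2k} a_{2k}/(2k)!. The PDF shape and parameter values are hypothetical, chosen only for illustration; this sketch works with exact moments rather than lattice data.

```python
import math

import numpy as np
from scipy.integrate import quad
from scipy.special import beta

# Toy valence-like PDF q(x) = x^a (1-x)^b / B(a+1, b+1)  (hypothetical shape)
a, b = 0.5, 3.0
norm = beta(a + 1, b + 1)

def q(x):
    return x**a * (1 - x)**b / norm

def itd_direct(nu):
    # Re M(nu) = int_0^1 cos(nu x) q(x) dx
    return quad(lambda x: np.cos(nu * x) * q(x), 0, 1)[0]

def moment(n):
    # Mellin moments a_n = int_0^1 x^n q(x) dx, analytic for this shape
    return beta(a + 1 + n, b + 1) / norm

def itd_series(nu, terms=20):
    # Truncated Taylor expansion of Re M(nu) in the even Mellin moments
    return sum((-1)**k * nu**(2 * k) / math.factorial(2 * k) * moment(2 * k)
               for k in range(terms))
```

At moderate ν the truncated sum reproduces the direct transform to high accuracy; conversely, a limited range of ν, as on the lattice, limits how many moments carry usable information.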
Instead of simultaneously handling the real and imaginary components of the complex M, it is helpful to separate the CP even and odd contributions, which are related to q^- = q - q̄ and q^+ = q + q̄ respectively, where f is defined on the interval [-1,1] while q, q̄, q^-, and q^+ are defined on [0,1]. These PDFs can be individually extracted from the real and imaginary components separately. The components are factorized as in Eq. (2.14), in terms of the matching kernels K_R and K_I. Use of these matching kernels, which factorize directly to the PDF, removes the need for the intermediate determination of the MS-bar ITD. Unfortunately, they prove to be complicated functions whose direct numerical evaluation is inefficient when incorporated into the analysis of the matrix elements computed from lattice QCD. In Sec. 3.2, we adopt a power series approximation to the convolution integrals in which these kernel functions appear, which allows for efficient computations within the available range of Ioffe time. With a sufficient number of terms, this power series approximates the convolution integrals to numerical precision.

Determination of the continuum limit PDF and nuisance parameters
The continuum limit is a critical step in any precision lattice calculation. In this study, we take advantage of the symmetries of the reduced pseudo-ITD to parameterize the lattice spacing corrections to the continuum limit, as well as the higher twist effects. The continuum PDF is also parameterized, and a simultaneous analysis of all three ensembles yields the continuum limit PDF with higher twist contamination removed. This method of adding "nuisance parameters" to parameterize the systematic errors of experimental cross sections is also used in phenomenological extractions of PDFs. Such a combined analysis can also be applied to results obtained with different pion masses, lattice spacings, matrix elements, and even lattice actions, given appropriate parameterizations of those effects. Ultimately, one can imagine taking all published lattice matrix elements and analyzing them within this approach, given sufficiently novel nuisance parameterizations, just as a global phenomenological fit is performed using experimental data with vastly different systematic errors. In order to minimize the dependence on the modeling of nuisance effects, in this study only higher twist and lattice spacing errors are considered, for data with the same physical quark mass and lattice action. Future work will study the extension of this method to include other effects. It is important to note that the coefficients of the lattice spacing errors can be functions of the Ioffe time. Previous parameterizations of lattice spacing errors for parton observables have only used simple dependences on the Ioffe time, which all diverge as ν → ∞. In [80,100], the Ioffe time dependence of lattice spacing errors was equivalent to (a^2/z^2)ν^2, which is the simplest possible dependence. In [75], correction terms were modeled at fixed lattice spacing and given a higher order polynomial dependence.
In this work, the treatment of the Ioffe-time dependence is refined by employing functions whose large Ioffe time behavior is finite, as it physically should be.
In the recent works [92,107], the Ioffe-time dependence is taken into account by fitting the lattice spacing dependence at fixed LaMET scale p_3, or at least requiring the p_3's to be sufficiently close in physical units, and extrapolating the data, or an interpolation of them, in z. In pseudo-PDF studies such as this work, matrix elements originating from many different SDF scales z^2 are utilized simultaneously. Studying a wide range of scales gives access to systematic errors arising from lattice spacing effects or higher twist contributions: at short distances, where higher twist effects are suppressed, lattice spacing errors arise, while at large distances higher twist effects may dominate. A window in z^2 where both systematic errors are suppressed can therefore be identified in order to extract the universal leading twist matrix element. In order to perform the simple extrapolations of [92,107], the gauge ensembles must have specifically tuned lattice spacings and volumes so that the same scales, p_3 or z^2, occur in each ensemble. In most sets of ensembles, only a few momenta or values of z^2 coincide in physical units, since tuning lattice spacings to be integer multiples of each other is difficult. An analysis of 8 values of z^2 per ensemble, as is done in this study, would not be feasible without an overwhelming computational cost in gauge ensemble generation.
Besides the continuum limit, there is a significant complication due to the integral relation between the matrix element and the PDF. The inversion of this integral relation is an ill-posed problem when only a limited range of Ioffe times is available, as in a lattice QCD calculation. An infinite number of solutions that fit the matrix elements exist, so procedures must be chosen to select the best class of solutions. Many classes of such procedures have been proposed for lattice QCD calculations of PDFs [34,39,66,67,102]. The most popular amongst phenomenological global analyses of PDFs are parametric solutions, in which a functional form for the PDF is written down based on a set of parameters whose values are tuned to represent the data. This style of solution is used in this study. Many methods also exist which do not rely on parameterizing the unknown functions. Though they lack functional forms, non-parametric solutions also have their own uncontrolled systematic errors and are not fundamentally better than parametric solutions. To arrive at a proper systematic error analysis of the resulting PDF, the systematic errors of any of these procedures must be tested, typically by comparing results from several approaches. In this study, the parameterizations of the data are varied in order to study the parametrization dependence and assess the associated systematic errors. In future work, more diverse parameterizations can be used to obtain a better estimate of the variance due to model choices.
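The ill-posed nature of the inversion is easy to demonstrate numerically: two PDFs that differ by a rapidly oscillating piece are essentially indistinguishable over a limited range of Ioffe time. The functions and numbers below are hypothetical, purely for illustration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import beta

# Baseline toy PDF, and a second PDF differing by a high-frequency mode
a, b = 0.5, 3.0
norm = beta(a + 1, b + 1)

def q1(x):
    return x**a * (1 - x)**b / norm

def q2(x):
    # Large pointwise distortion (amplitude 0.5) that barely changes the ITD
    return q1(x) + 0.5 * np.sin(20 * np.pi * x)

def re_itd(q, nu):
    # Real part of the ITD: cosine transform of the PDF on [0, 1]
    return quad(lambda x: np.cos(nu * x) * q(x), 0, 1, limit=200)[0]

# Maximum difference in the ITD over a lattice-like range of Ioffe time,
# versus the maximum pointwise difference between the two PDFs
nus = np.linspace(0.0, 10.0, 41)
itd_diff = max(abs(re_itd(q1, nu) - re_itd(q2, nu)) for nu in nus)
pdf_diff = max(abs(q1(x) - q2(x)) for x in np.linspace(0.01, 0.99, 999))
```

Here `itd_diff` stays at the few-percent level while `pdf_diff` is 0.5: any fit restricted to ν ≲ 10 cannot distinguish the two PDFs without further assumptions, which is precisely the role played by the choice of parameterization.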

Separating continuum PDFs from systematic errors
The CP symmetry implies that the real and imaginary parts of the reduced pseudo-ITD are, respectively, even and odd functions of the Ioffe time. With an O(a) improved lattice action, the lattice spacing errors related to the momentum p must come in through the momentum transfer. This feature is known from the improvement of the local vector current [122], the z = 0 case, where the local vector current mixes with the divergence of the tensor current. The operators discussed in [51] also demonstrate these features when considering the hadronic matrix elements in question. These momentum transfer effects are necessary for studies of Generalized Parton Distributions, but not for the PDF. There is also potential z^2 dependence in the lattice spacing coefficient functions, P_n and R_n. Those effects, which can come from logarithmic perturbative corrections, higher twist contributions, or target mass corrections, are additionally suppressed by α_s, Λ_QCD^2 z^2, or m^2 z^2 respectively, on top of the suppression by a/|z| and aΛ_QCD. These z^2 dependencies are neglected here.
The reduced pseudo-ITD is related to the ITD through a convolution with a Wilson coefficient function. Ultimately, the ITD is not the goal of this study, but rather its Fourier transform, the PDF. We adopt an approach analogous to [73,90,100], where the intermediate ITD is not required and a parameterization of the PDF is related directly to the reduced pseudo-ITD. Unlike [73,90,100], the PDF is related to the leading twist reduced pseudo-ITD through its moments. The higher twist power corrections are added as nuisance terms, similarly to the lattice spacing terms. The functional form is given in terms of the leading twist continuum limit reduced pseudo-ITD, M_lt, and the higher twist distributions B_n. In principle, the higher twist distributions could have non-trivial z^2 dependence. As with the lattice spacing terms, these effects, which come from perturbative corrections and target mass effects, are additionally suppressed by powers of α_s or m^2 z^2 respectively, and are neglected in the remainder of this study. In principle, there exist higher twist power corrections and lattice spacing errors of all orders; with these errors sufficiently under control, only the leading contributions are significant. We therefore make the approximation P_n = R_n = B_n = 0 for n > 1.

Parameterization of unknown functions
Extracting PDFs from matrix elements using a functional form to parametrize them may induce unwanted model dependence. Therefore, a careful study of such parametrization-dependent systematic errors is required. For that purpose, the functional forms used should be varied in order to understand how particular choices affect the final result. In previous lattice PDF studies [34,65,74,75,80,86,87,110], the chosen functional forms are similar to those used in phenomenological analyses of PDFs [123][124][125][126]. Progress has also been made on the application of neural networks to parameterize the PDF [67,99,127]. In this work, all of the unknown functions, q^-(x), q^+(x), P_1(ν), R_1(ν), and B_1(ν), are parameterized using Jacobi polynomials.
The Jacobi polynomials, j_n^{(α,β)}(z), are defined on the interval [−1, 1], where they satisfy an orthogonality relation with respect to the weight (1−z)^α(1+z)^β. For our purposes it is convenient to map them onto the interval [0, 1]; the shifted polynomials, which are referred to as Jacobi polynomials from now on, then satisfy an orthogonality relation with respect to the weight x^α(1−x)^β. One thing to note is that there exists a formula relating Jacobi polynomials with different values of the weight parameters α and β, whose coefficients ĉ_n^m(α, α'; β, β') are analytically known. Finally, it can be shown that the coefficients of the Jacobi polynomials satisfy an orthogonality relationship involving the beta function B(a, b). Since the Jacobi polynomials form a complete basis of functions on the interval [0, 1], the PDFs can be expanded in them for any α and β. The choice of these parameters does, however, affect the convergence of the coefficients d_n^{±(α,β)}. In practice, one needs to truncate the series, introducing some model dependence which can be easily controlled. The control of the truncation can be improved by fitting for the optimal values of α and β at the given order of truncation; in other words, the rate of convergence of the series can be optimized by tuning α and β. One way to understand why tuning α and β can improve the convergence is to realize that phenomenological considerations tell us that the Jacobi weight is a good approximation to the shape of the PDF. Therefore, if α and β are tuned to roughly match the shape of the PDF, the Jacobi polynomials only need to approximate a smooth, slowly varying function with small coefficients. Using Eq. 3.9, we can easily convert an expansion of the PDF in terms of (α, β) Jacobi polynomials to one in (α', β') Jacobi polynomials. The transformation of the expansion coefficients is linear, and if the series is truncated at order N, the linear transformation involves only coefficients up to that order.
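A minimal numerical sketch of this parameterization, using SciPy's Jacobi polynomials mapped to [0, 1]; the weight convention x^α(1−x)^β and the parameter values below are assumptions for illustration, not the paper's fitted values:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import jacobi

alpha, beta_ = 0.5, 3.0  # illustrative weight parameters; in the paper these are tuned

def j(n, x):
    # Jacobi polynomial shifted to [0, 1]; the (beta_, alpha) parameter order
    # makes the orthogonality weight x^alpha (1 - x)^beta_ (convention assumption)
    return jacobi(n, beta_, alpha)(2.0 * x - 1.0)

def h(n):
    # Normalization h_n = int_0^1 x^alpha (1-x)^beta_ j_n(x)^2 dx
    return quad(lambda x: x**alpha * (1 - x)**beta_ * j(n, x)**2, 0, 1)[0]

# Toy PDF: the weight times a degree-1 polynomial, so two terms suffice exactly
def q(x):
    return x**alpha * (1 - x)**beta_ * (1.0 + 0.3 * x)

# Expansion coefficients d_n = (1/h_n) int_0^1 j_n(x) q(x) dx
d = [quad(lambda x: j(n, x) * q(x), 0, 1)[0] / h(n) for n in range(4)]

def q_truncated(x, order=2):
    # Truncated reconstruction q(x) ~ x^alpha (1-x)^beta_ sum_n d_n j_n(x)
    return x**alpha * (1 - x)**beta_ * sum(d[n] * j(n, x) for n in range(order))
```

Because the toy PDF is the weight times a low-degree polynomial, d_2 and d_3 vanish to numerical precision: this is exactly the mechanism described above, where tuning α and β to the PDF's shape concentrates the information in the first few coefficients.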
Finally, there also exists a linear transformation connecting these coefficients and the Mellin moments of the PDF, where a_n^± = ∫_0^1 dx x^n q^±(x), so this parameterization can be thought of as another way to parameterize the PDF by a set of its moments.
To determine the relationship between the reduced pseudo-ITD and the parameters of the PDF, the matching kernels K_{R,I} are expanded in terms of Jacobi polynomials. It can be shown that the kernels can be written in this form, and numerically the sum over k can be performed to a sufficiently high order (k ∼ 30) to achieve convergence to double precision accuracy in the relevant range of Ioffe time. Given this expansion, the leading twist reduced pseudo-ITD can be written as truncated sums. Similarly, the nuisance parameters can be introduced in x space, and the unknown functions B_1, P_1, Q_1, and R_1 are constructed from similar sums.
which are the leading O(α_s^0) order of σ_n and η_n. Unlike a parameterization polynomial in Ioffe time, these functional forms are better behaved in the large Ioffe time regime. One does not expect the nuisance terms to grow indefinitely with Ioffe time, but instead to eventually fall to zero as the ITD does. In [63], a calculation using renormalon methods showed that the ratio of the ITD and the leading power correction plateaus as ν grows, indicating that the higher twist contribution eventually decays to zero at large Ioffe time. Similarly, the size of the lattice spacing error is not expected to grow without bound in ν: for fixed z^2, it is expected to ultimately go to zero as ν increases.
Not only do the σ and η functions have better large Ioffe time behavior, but they also dominate only in a given region of Ioffe time, ordered by n. Figs. 1 and 2 show the functions over a range of n. As can be seen, these functions have a peak region and fall to zero as the Ioffe time increases, albeit slowly, and the peaks are ordered by n. Since our data exist within a limited range of Ioffe time, the terms whose peaks lie beyond this region do not contribute significantly. Moreover, since the (pseudo-)ITD is believed to decay towards zero without any large values at larger Ioffe times, the parameters with larger n will be small as well. This expected convergence of the series, together with the known shape of the σ and η functions in the available range of Ioffe time, can be used as a natural guide for when to truncate the series without significant risk of losing vital information on the PDF's structure.
Figures 1 and 2 show the peaked structures of the σ_n and η_n functions, respectively, for α = −0.5 and β = 3 over a range of n. For the NLO contribution, z = 4 × 0.065 fm, µ = 2 GeV, and α_s = 0.3 were chosen as a typical example which will be used in this study.

This final functional form is capable of removing the lattice spacing and higher twist dependencies which spoil the leading twist reduced pseudo-ITD. By testing various functional forms, the model dependent systematic error can be studied. Using the Akaike information criterion (AIC), a weighted average of these models produces a final continuum limit PDF with the model dependence smoothed out in a statistically meaningful way, especially when sufficiently many distinct models are used. In future studies, more adventurous parameterizations of the PDF and the nuisance parameters, such as neural networks, can be included alongside these fits in the weighted average.
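The AIC-weighted model average mentioned above can be sketched as follows; the χ² values, parameter counts, and fit results below are hypothetical placeholders, not numbers from this analysis.

```python
import numpy as np

def aic_weights(chi2, n_params):
    # AIC_i = chi2_i + 2 k_i (up to a model-independent constant);
    # relative weights w_i ∝ exp(-ΔAIC_i / 2)
    aic = np.asarray(chi2, float) + 2.0 * np.asarray(n_params, float)
    w = np.exp(-0.5 * (aic - aic.min()))
    return w / w.sum()

def model_average(central, stat_err, weights):
    # Weighted mean; variance = weighted statistical variance + model spread
    central, stat_err, weights = map(np.asarray, (central, stat_err, weights))
    mean = np.sum(weights * central)
    var = np.sum(weights * stat_err**2) + np.sum(weights * (central - mean)**2)
    return mean, np.sqrt(var)

# Hypothetical fits of some observable from three different truncations
chi2 = [41.0, 38.5, 37.9]      # placeholder goodness-of-fit values
n_params = [3, 5, 7]           # placeholder parameter counts
vals = [0.215, 0.221, 0.220]   # placeholder central values
errs = [0.004, 0.006, 0.009]   # placeholder statistical errors

w = aic_weights(chi2, n_params)
avg, err = model_average(vals, errs, w)
```

The model-spread term in the variance ensures that disagreement between parameterizations is propagated into the final uncertainty rather than hidden by picking a single "best" fit.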

Lattice QCD calculation
This study utilizes three ensembles of configurations with decreasing lattice spacing. These ensembles have two flavors of dynamical Wilson clover fermions and a pion mass around 440 MeV. The specific parameters of these ensembles are given in Table 1. The lattice spacings of the configurations are 0.0749, 0.0652, and 0.0483 fm. The finer two ensembles were generated by the CLS effort [128], while the coarsest was generated by the authors for this study. These ensembles allow for a controlled continuum limit extrapolation, which is a necessary step for precision calculations of PDFs. Moreover, the finest lattice spacing employed in this study is roughly half of that in our previous studies, allowing us to reach much higher momenta and smaller separations.
The nucleon interpolating fields are constructed with Gaussian smearing [129] and momentum smearing [130]. The source field is always smeared, while both unsmeared and smeared sink fields are used. These scenarios are referred to as "SP" (standing for smeared-point) and "SS" (standing for smeared-smeared) respectively. For both scenarios, three values of the momentum smearing parameter ζ are used. To implement the momentum smearing, prior to the Gaussian smearing step, the gauge links are multiplied by a momentum-dependent phase in order to smear only in the direction parallel to the momentum. The smearing parameters are chosen to increase the overlap with the ground state, and thereby the signal-to-noise ratio, for correlation functions over a range of momenta.
The matrix elements are calculated using the summation Generalized Eigenvalue Problem (sGEVP) technique [131] to have optimal control over the excited state contamination, as described in Sec. 4.1. Summation techniques have proven to be extremely powerful in controlling excited state errors [132] and have been used in a number of lattice calculations of PDFs [34,60,68,[73][74][75]133]. These methods reduce the excited state contamination to O(e^{−∆T}), compared to O(e^{−∆T/2}) for typical ratio methods. They are necessary for efficient calculations, especially for future work at physical pion masses where ∆ is smaller, making excited states persist to larger T/a and consequently increasing the computational cost needed to achieve equivalent statistical precision. To match the precision of a summation method calculation with N measurements, a ratio method calculation can be estimated to require N^2 measurements.
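The different contamination scalings of the two approaches can be illustrated with a toy two-state model. The sketch below (all energies, overlaps, and matrix elements are hypothetical stand-in values) compares the midpoint ratio estimator, with O(e^{−∆T/2}) contamination, to the T-difference of the summed ratio, with O(T e^{−∆T}) contamination:

```python
import numpy as np

# two-state toy model of nucleon correlators (hypothetical overlaps/energies)
E0, E1 = 1.0, 1.6              # ground and excited energies; gap Delta = 0.6
Z0, Z1 = 1.0, 0.7              # interpolating-field overlaps
M00, M01, M11 = 2.0, 0.4, 0.9  # matrix elements <0|O|0>, <0|O|1>, <1|O|1>

def C2(T):
    return Z0**2 * np.exp(-E0*T) + Z1**2 * np.exp(-E1*T)

def C3(t, T):
    return (Z0**2 * M00 * np.exp(-E0*T)
            + Z0*Z1*M01 * (np.exp(-E0*(T-t) - E1*t) + np.exp(-E1*(T-t) - E0*t))
            + Z1**2 * M11 * np.exp(-E1*T))

def ratio_method(T):
    # operator at the midpoint: contamination O(e^{-Delta T/2})
    return C3(T/2, T) / C2(T)

def summed_ratio(T):
    return sum(C3(t, T) for t in range(1, T)) / C2(T)

def summation_method(T):
    # effective matrix element from the T-difference of the summed ratio:
    # contamination O(T e^{-Delta T})
    return summed_ratio(T + 1) - summed_ratio(T)

for T in (6, 10, 14):
    print(T, abs(ratio_method(T) - M00), abs(summation_method(T) - M00))
```

At each T, the summation-method error is markedly smaller, and it shrinks roughly twice as fast with increasing T, which is the origin of the N versus N^2 cost estimate quoted above.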

sGEVP matrix element extraction
Excited state contamination is a problem which can interfere with the lattice calculation of any matrix element. Large Euclidean times are required to isolate the ground state from this exponentially decaying contamination. This requirement is hampered by the exponentially decaying signal-to-noise ratio of the correlators as the Euclidean time is increased. Particularly for physical pion mass calculations with high momentum hadrons, a very large number of samples is necessary for precision calculations at large Euclidean times when using the typical ratio method for calculating matrix elements. The sGEVP method is a combination of two techniques which dramatically improves the situation. Summation methods drastically increase the rate of decay of the excited state contributions. The GEVP method can be used to create an optimal operator which overlaps with the ground state and is orthogonal to the lowest lying excited states. The combination of these two methods allows for significant control over excited state contamination in a computationally efficient manner.
In a summation technique, one extracts the matrix element from the large Euclidean time behavior of a ratio whose excited state contamination decays as O(e^{−∆T}), where ∆ = E_1 − E_0 is the gap between the ground state and the first excited state; this feature is critical for efficient high momenta calculations. When considering the exponentially decaying signal-to-noise ratio of the correlators, this improvement in excited state contamination means that if the summation method requires N measurements to obtain a desired precision, more traditional methods would require N^2 measurements for a point with equivalent excited state contamination. This advantage may also be critical for efficient calculations of pion quark PDFs, gluon PDFs, and quark disconnected contributions, which are notoriously noisier than the connected quark operators used here. Attempts to increase the overlap of the interpolating field with the ground state, and ideally also lower the overlap with low-lying excited states, have generated a number of smearing procedures, including the Gaussian and momentum smearing techniques used in this study. An approach orthogonal and complementary to these methods is the GEVP technique. One considers a matrix of correlators, C(T), built from a basis of interpolating fields which overlap with the desired state. Then one solves the GEVP equation

C(T) v_n(T, t_0) = λ_n(T, t_0) C(t_0) v_n(T, t_0),

where λ_n and v_n are the n-th generalized eigenvalues and eigenvectors. With a sufficiently well chosen basis, the generalized eigenvectors of this matrix will overlap with the individual states and be largely orthogonal to the others. This allows one to choose the linear combination of interpolating fields which overlaps with the desired state, be it the ground state or an excited state. These optimized operators can then be used to form the three point correlation function. This improved overlap allows for a decreased minimum Euclidean time in the matrix element fit, improving the efficiency of the calculation.
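The GEVP step can be sketched with a toy correlator matrix, using scipy's generalized symmetric eigensolver (the spectrum and overlap factors below are invented for illustration):

```python
import numpy as np
from scipy.linalg import eigh

# toy spectrum and overlap factors Z[i, n] = <0| O_i |n> (hypothetical values)
E = np.array([1.0, 1.6, 2.1])
Z = np.array([[1.0, 0.8, 0.3],
              [0.5, 1.0, 0.6]])

def C(T):
    # 2x2 correlator matrix C_ij(T) = sum_n Z_in Z_jn exp(-E_n T)
    return (Z * np.exp(-E * T)) @ Z.T

t0, T = 2, 5
# generalized eigenvalue problem C(T) v_n = lambda_n C(t0) v_n
lam = eigh(C(T), C(t0), eigvals_only=True)[::-1]  # reorder so n=0 is the ground state

# effective energies from the principal correlators lambda_n ~ exp(-E_n (T - t0))
E_eff = -np.log(lam) / (T - t0)
print(E_eff)  # approximates the two lowest energies
```

Even with only two operators, the generalized eigenvectors approximately isolate the two lowest states; the optimized combination built from v_0 is what enters the sGEVP ratio below.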
This approach has been used very successfully in identifying multiple energy levels for hadron spectroscopy [134][135][136][137][138] and in the determination of matrix elements [139,140].
The combination of the summation and GEVP methods [131] is a powerful technique for controlling the excited state contamination in a matrix element calculation. The effective matrix element is given by the T-difference of the ratio

(v_0, K(T) v_0) / (v_0, C(T) v_0),

where K(T) is the sum over operator insertion time of the three point correlation matrix. This method has the combined advantages of the increased exponential decay and the reduced overlap of excited state contamination. In Fig. 3, the summation technique with a single operator is compared to the sGEVP. The correlators with smeared source and sink interpolating fields already had relatively small excited state contamination in the summation technique, and the sGEVP results do not significantly change those data. On the other hand, the correlators with smeared source and point sink interpolating fields show a large reduction in excited state contamination: the plateau region is reached significantly earlier when using the sGEVP technique. This example demonstrates how even a minimal application of the sGEVP, using only three local nucleon operators, can provide some control over the excited state contamination.
Other applications of the GEVP method [137,139,140] have used many more operators which are specifically selected to overlap with more excited states than these local operators. Our future applications to PDFs will utilize a larger basis of operators for an even more substantial effect. The sGEVP method will be crucial for calculations at the physical pion mass at high momenta, where the correlation function may only be precise in a limited range of Euclidean time. Fig. 4 displays the nucleon energy as a function of the nucleon momentum. These energies are extracted by fitting the time dependence of the principal correlator λ_0(T, t_0). These energy levels are evidently consistent with the continuum dispersion relation within the range of available momenta. Since the momentum smearing parameter was not tuned specifically for each momentum state, the errors do not increase monotonically with momentum, as they would if either the same smearing were used for all momenta or the momentum smearing parameter were optimized for each momentum state.

Fitting matrix elements
The sGEVP is applied to each scenario of smearing parameters individually. It is likely that modifying the operators by only changing smearing parameters will not drastically change their overlap with the ground and excited states. This means combining them within the sGEVP will have little effect. This feature can be seen in Fig. 3, where the effective matrix elements with different ζ are largely consistent within errors. With the same overlaps they cannot significantly improve the cancellation of higher state effects. Instead, combinations of these six smearing scenarios are simultaneously fit to obtain a common matrix element and an excited state mass. When the signal-to-noise ratio for some of the smearing scenarios is poor, they are excluded from the fit, for example large ζ at small p or vice versa.
There exists a systematic error from the particular choices of the maximum and minimum values of T used within the fits for the matrix elements. The maximum value was chosen based upon the statistical noise of the correlation functions at those times: when the noise was sufficiently large that the fit result was not significantly affected, the maximum value was set. The minimum value was chosen to minimize the χ^2/d.o.f. of the fit. The change in the central values when fitting with the minimum time decreased by a single time slice is used to estimate the systematic error from the choice of minimum time. The square of this systematic error is added to the diagonal of the covariance matrix for the remainder of the analysis. The majority of the data points do not see a dramatic increase in error, but some do, highlighting the importance of this analysis.
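This inflation of the covariance can be sketched in a few lines (the matrix elements and errors below are toy values, not the data of this study):

```python
import numpy as np

# hypothetical matrix elements from the chosen fit window and from a window
# with t_min lowered by one time slice
M_default = np.array([0.98, 0.91, 0.80, 0.64])
M_shifted = np.array([0.99, 0.90, 0.77, 0.60])
cov = np.diag([1e-4, 2e-4, 6e-4, 2e-3])  # statistical covariance (toy values)

# the shift of the central values estimates the t_min systematic; its square is
# added to the diagonal of the covariance for the remainder of the analysis
sys_err = M_shifted - M_default
cov_total = cov + np.diag(sys_err**2)
print(np.sqrt(np.diag(cov_total)))
```

Only the diagonal is inflated, so points whose central value is stable under the window change keep essentially their statistical error, while unstable points are de-weighted in subsequent fits.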

Fits with Bayesian Priors
In order to determine the PDF from our lattice matrix elements, we create a model to describe our data in terms of the PDF and various systematic errors, as described in Sec. 3. Let M^L(ν, z^2) be the lattice matrix elements, and let M(ν, z^2, θ) be the matrix element from our model, which depends on a set of parameters θ. These parameters are the exponents α and β and the linear coefficients of the Jacobi series for the PDF and the nuisance terms.
We attempt to determine the most likely values of the unknown parameters θ, given our lattice matrix elements M^L and some prior information I, by using Bayes' theorem, which states

P[θ|M^L, I] = P[M^L|θ] P[θ|I] / P[M^L|I].

Here P[θ|M^L, I] is the posterior distribution, which describes the probability that a given set of parameters are the true parameters given a set of data and prior information. P[M^L|θ] is the likelihood, the probability distribution of the data given a set of model parameters. P[θ|I] is the prior distribution, which describes the probability distribution of a set of parameters given some previously held information about them. Finally, P[M^L|I] is the marginal likelihood or evidence, which describes the probability of the data given the previously held information. Since the evidence does not depend on the model parameters, it is ultimately an unnecessary normalization for finding the most likely parameters. Due to the central limit theorem, the probability distribution of the lattice matrix elements is taken to be a multivariate Gaussian, P[M^L|θ] ∝ e^{−χ^2/2}, with

χ^2 = Σ_{k,l} [M^L_k − M(ν_k, z^2_k, θ)] C^{−1}_{kl} [M^L_l − M(ν_l, z^2_l, θ)],

where the indices k, l run over all our matrix elements and C_{kl} is the covariance matrix of the N samples (denoted as M^{L,i}_k) of the matrix elements M^L_k. In the absence of any prior information, finding the most probable set of model parameters is done by minimizing χ^2.
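In this notation, the quantity actually minimized is the correlated χ^2 plus the prior penalties. A minimal sketch, with a hypothetical two-parameter linear model standing in for the Jacobi-series parameterization and toy data, is:

```python
import numpy as np
from scipy.optimize import minimize

def chi2(theta, model, M_L, cov_inv, nu, z2):
    """Correlated chi^2 between lattice matrix elements and the model."""
    r = M_L - model(nu, z2, theta)
    return r @ cov_inv @ r

def L2(theta, model, M_L, cov_inv, nu, z2, prior_mean, prior_width):
    """-2 log posterior (up to a constant): chi^2 plus Gaussian prior penalties."""
    penalty = np.sum(((theta - prior_mean) / prior_width) ** 2)
    return chi2(theta, model, M_L, cov_inv, nu, z2) + penalty

# toy setup: a linear model in Ioffe time with Gaussian noise
rng = np.random.default_rng(1)
nu = np.linspace(0.5, 6.0, 10)
z2 = None  # unused by this toy model
true = np.array([1.0, -0.12])
model = lambda nu, z2, th: th[0] + th[1] * nu
M_L = model(nu, z2, true) + 0.01 * rng.standard_normal(nu.size)
cov_inv = np.eye(nu.size) / 0.01**2

# wide priors centered at zero barely pull the fit
fit = minimize(L2, x0=np.zeros(2),
               args=(model, M_L, cov_inv, nu, z2, np.zeros(2), np.full(2, 10.0)))
print(fit.x)  # close to the true parameters
```

With wide priors the posterior maximum reduces to the χ^2 minimum; tightening the prior widths pulls the parameters toward the prior means, which is the behavior studied in Sec. 6.2.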
The prior distributions are chosen to encode some expectations or requirements on the fit parameters. A simple example of how this could be done is by setting bounds on a fit: if one desires a model parameter θ_i to be limited to the range [a, b], then the prior distribution is given by P[θ_i|I] ∝ Θ(θ_i − a) Θ(b − θ_i), where Θ is the Heaviside step function. The PDF is known to be dominated by the leading behavior x^α (1 − x)^β, and the other terms should be small corrections to this. Therefore we give the PDF model parameters d_n^{±}(α, β) priors which are normal distributions, with mean d_0 and width σ_d. In Sec. 6, we use normal distributions centered about 0, but vary the widths in order to study their effects. Similarly for the nuisance terms, we expect their parameters to be small corrections to the dominant PDF and use normal distributions for them, whose widths are smaller than those of the PDF parameters. The mean and width of these are given by c_0 and σ_c.
The prior distributions for α and β could also be normal distributions, but they have other restrictions. First, α and β must be greater than -1. This is an explicit restriction from the definition of the Jacobi polynomials, but it also has a physical interpretation: if α or β were less than or equal to -1, then the integral of the PDF, which is related to the total number of quarks in the proton, would diverge. Furthermore, we do not expect β < 0, since we expect the parton distribution function to vanish at x = 1. In order to enforce the restrictions α > −1 and β > 0, their prior distributions are taken to be shifted log-normal distributions,

P[x|I] ∝ (x − x_0)^{−1} exp[ −(log(x − x_0) − µ)^2 / (2σ^2) ] Θ(x − x_0),

where µ is the mean and σ^2 the variance of the distribution of log(x − x_0), and x_0 is the lower bound of the log-normal distribution. The mean µ_x and variance σ^2_x of the variable x are related to the log-normal parameters by

µ_x = x_0 + e^{µ + σ^2/2},   σ^2_x = (e^{σ^2} − 1) e^{2µ + σ^2}.

The most likely parameters of the model are found by maximizing the posterior distribution. This is performed by minimizing the negative log of the posterior distribution, L^2 = −2 log(P[θ|M^L, I]) + C, where C is the normalization of the posterior, which is independent of the model parameters. This is a relatively simple task because, apart from α and β, all other parameters of the model enter linearly, and therefore the minimization with respect to any of these parameters can be done analytically at fixed α and β. Subsequently, a minimization of L^2, which is now a function only of α and β, can be performed with a non-linear minimizer. As a consequence, one can in principle easily minimize L^2 with a large number of Jacobi polynomial terms, as the non-linear minimization is always two dimensional. This well known technique is called Variable Projection (VarPro) [141].
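The VarPro idea can be sketched as follows. In this toy version (the basis functions and all parameter values are hypothetical stand-ins for the σ_n and η_n terms, and the priors are omitted for brevity), the linear coefficients are solved analytically by least squares at each fixed (α, β), so the outer non-linear search is always two dimensional:

```python
import numpy as np
from scipy.optimize import minimize

# toy basis playing the role of the Jacobi-series terms: each basis function
# depends non-linearly on (alpha, beta) but enters the model linearly
def basis(nu, alpha, beta, N=3):
    return np.stack([nu**(n + alpha) * np.exp(-beta * nu / 5) for n in range(N)],
                    axis=1)

def varpro_objective(ab, nu, data, w):
    """Profiled chi^2: linear coefficients minimized analytically at fixed (alpha, beta)."""
    A = basis(nu, *ab) * w[:, None]           # weighted design matrix
    y = data * w
    c, *_ = np.linalg.lstsq(A, y, rcond=None)  # analytic inner minimization
    r = y - A @ c
    return r @ r

rng = np.random.default_rng(2)
nu = np.linspace(0.2, 8.0, 40)
true_ab, true_c = (0.5, 2.0), np.array([1.0, -0.3, 0.02])
data = basis(nu, *true_ab) @ true_c + 1e-3 * rng.standard_normal(nu.size)
w = np.full(nu.size, 1e3)  # inverse errors (uncorrelated, for simplicity)

# the outer minimization is two dimensional regardless of how many
# linear coefficients the model carries
fit = minimize(varpro_objective, x0=(1.0, 1.0), args=(nu, data, w),
               method='Nelder-Mead')
print(fit.x)
```

Adding more linear terms enlarges only the least-squares solve inside the objective, not the dimension of the non-linear search, which is what makes large Jacobi truncations cheap.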

PDF results
As discussed before, the PDF is related to the lattice matrix elements through a convolution integral. Extracting the PDF from the lattice matrix elements therefore involves the solution of an inverse problem, and the resulting PDF depends on the method used to solve it, introducing a new systematic error that requires careful study. The statistical error of a single choice of solution, such as the discrete Fourier transform, the Backus-Gilbert method, or a fit to a particular model PDF, may significantly underestimate the true uncertainty on the PDF. This feature can clearly be seen in the few studies which have compared alternative methods or varied models [80,87,107]. In this spirit, we study many, though rather interrelated, models which vary both the number of parameters and the prior distributions. The prior distributions, however, have to be chosen carefully to reflect our prior knowledge accurately.
In the following analysis, the PDF scale is taken to be µ = 2 GeV, corresponding to the two-flavor MS-bar coupling α_s(µ = 2 GeV) = 0.245, with Λ_QCD = 268 MeV.

Inclusion of nuisance terms
There are no ab initio estimates of the sizes of the lattice spacing nuisance terms P 1 and R 1 . In principle, one could calculate matrix elements of the operators discussed in [51] and develop a Symanzik improvement style program. In [63], an estimate of the size of the higher twist effects was made using a method based upon the fact that renormalon effects must cancel with the higher twist term. They demonstrated that the reduced pseudo-ITD had strikingly smaller higher twist effects than the pseudo-ITD, as expected. In the range of Ioffe time for this calculation, the improvement is at least a factor of 5. For the middle of this Ioffe time range it is closer to a factor of 10. We anticipate that the lattice spacing nuisance terms for the reduced pseudo-ITD will also be smaller than the pseudo-ITD due to the same cancellation within the ratio.
As discussed above, in this work we use a parameterization of these unknown functions and study their effect on the fits of the PDF. First, it is important to understand which nuisance terms are more necessary than others. A common way of approaching this is to iteratively add the terms and observe the effect on the L^2/d.o.f. In order to study this effect, every combination of the leading twist PDF and the nuisance terms is fit to the data.
For simplicity, in this test the continuum leading twist term has two Jacobi polynomials for the PDF and one Jacobi polynomial for each of the possible nuisance terms. The L^2/d.o.f. decreases significantly when P_1 is included in the fit, for both the real and the imaginary component. This effect is anticipated, since the small z data, which are most sensitive to P_1 because they are affected more by lattice spacing errors, are generally more precise than the large z data at any given momentum. This precision, in combination with the expected lattice spacing errors, gives a larger impact on the χ^2 and therefore on L^2. This feature, along with the ability to use small momentum data which are exponentially more precise than large momentum data, shows an advantage of the SDF approach over LaMET. The limitations of SDF require that the more precise low z data be used, whereas the limitations of LaMET require the noisier large p data. To reach the same precision for those points, orders of magnitude greater computational resources would be required.
The effects of B_1 and R_1 are less clear. The improvement in L^2/d.o.f. is only modest. The higher twist contribution is most sensitive to the largest z data, which are statistically noisier and therefore affect the χ^2 less. Deviations caused by neglecting B_1 may therefore not be expected to generate large contributions to the χ^2. Unfortunately, no such argument can be made for the R_1 terms, which are agnostic to z. As we show in Sec. 6.2, the data are not sensitive to this term at all. A final thing to note is that the close values of L^2 and χ^2 imply that the data, not the prior distributions, provide the dominant contribution to L^2. As seen in Sec. 6.2, when the priors are restrictive, the difference between χ^2 and L^2 is noticeably larger. Fig. 7 shows the PDFs which result from these fits. The shape of q^± changes substantially in the large x region when the P_1 term is added, consistent with the fact that the P_1 term affects mostly the small ν range, which in turn controls the large x region of the PDF.

Figure 7. The results of fitting with various nuisance terms included: Q only; Q and B_1; Q and R_1; Q and P_1; Q, B_1 and R_1; Q, B_1 and P_1; Q, R_1 and P_1; and all terms.

Effects of prior distributions
In this section, we consider a set of prior distributions which can be studied in detail while fixing the number of parameters. The prior distributions of this model are modified in order to study the stability of the final results. The correlations between the resulting parameters, as well as a comparison of their fluctuations to the prior distributions, can be used to identify which terms are being controlled by the data and which by the priors. These terms can then be modified or removed in order to test their relevance. The models used in this study are described in Tab. 3.

Table 4. The L^2 and χ^2 of the models used to determine the continuum PDF given in Tab. 3.
The model designated "default" serves as a baseline for this study. It is the same model considered in the previous section with all nuisance terms included. The models dubbed "wide" and "thin" are used to study the effect of the widths of the prior distributions. The model named "limited" is used to study the case where only the nuisance term P_1, which decreased the L^2/d.o.f. most significantly in Sec. 6.1, is included. As can be seen in Fig. 8, the model "default" qualitatively reproduces many of the known features of the PDF. The L^2 and χ^2 of this fit are given in Tab. 4. Due to the limited extent in Ioffe time, the low x behavior is not well resolved, allowing for solutions which converge and diverge as x → 0 for the q^+ and q^- distributions respectively. In a previous study of mock data [66], we found that values of the maximum Ioffe time significantly larger than what is currently achievable in lattice QCD are required to resolve the region x < 0.2 accurately. The sizes of the nuisance terms are also smaller than the dominant leading twist component until ν becomes large. The contribution from R_1 is completely consistent with 0. On the other hand, B_1 and P_1 are not consistent with zero over a range of Ioffe times.
The values of the model parameters are given in Tabs. 5 and 6. Fig. 9 shows a normalized correlation matrix between the various model parameters, whose labels correspond to the numbers given in the tables. As can be seen, the coefficient for R_1 is largely uncorrelated with the rest of the model parameters. Its central value and error are also very similar to those of the prior distribution. These features imply that it is being constrained not by the data, but by the prior distribution only. The higher twist and O(a/z) lattice spacing parameters, b_n and p_n, are noticeably correlated with each other.

Table 5. The values of the parameters from fitting the real component to the model "default". The ID numbers correspond to the labels in Fig. 9.

Table 6. The values of the parameters from fitting the imaginary component to the model "default". The ID numbers correspond to the labels in Fig. 9.

As expected, "thin" and "wide" gave results with smaller and wider statistical errors than "default", respectively. For the higher twist terms, the "wide" and "thin" results seem to deviate slightly from the "default" model, compared to the other terms, but with little effect on the resulting PDF.

Varying the number of parameters
In order to study the model dependence, the numbers of parameters for the PDF and for each of the nuisance terms are all varied. The prior distributions are set to those of the model "default" of Sec. 6.2. Changing the number of parameters varies the flexibility of the model. Based upon the sizes of the σ_n(ν) and η_n(ν) functions, the terms with the lowest n will dominate the result; terms with n > 5 appear to have an entirely negligible effect in the given range of Ioffe time. For this study, the maximum n is 3 for any given term. The number of polynomials in each of the nuisance terms is allowed to vary from 0 up to the number of polynomials in the ITD term. The models are labeled with four numbers corresponding to (N_±, N_{R/I,b}, N_{R/I,r}, N_{R/I,p}).
The results for the PDFs are shown in Figs. 12-17. When the P_1 nuisance terms are included, the PDFs are largely consistent. Some of the q^+ distributions begin to have significant differences in the region x ≤ 0.4. These discrepancies are to be expected from a fit with a limited range in Ioffe time, as the study in [66] showed. When P_1 is included, the q^- distribution consistently diverges at x = 0, but q^+ may converge or diverge. Without the P_1 term, the PDFs differ not only from the PDFs of fits with P_1, but also amongst themselves.

Figure 10. The results of fitting to the models of Tab. 3. The PDF is given in the upper left, B_1(ν) in the upper right, R_1(ν) in the lower left, and P_1(ν) in the lower right.
The nuisance terms are shown in Figs. 18-27. The O(a/z) term P_1 appears to grow to a peak around ν ∼ 4 and then either plateaus or goes to zero. The location of this peak is expected from σ_1^{(α,β)} and η_1^{(α,β)} when N_{R/I,p} = 1, but is robust even for N_{R/I,p} = 2, 3. This peak or plateau is negative for the real component and positive for the imaginary component. The R_1 term is only ever non-zero when there is no P_1 term, implying that in those models it attempts to compensate for the absence of P_1.
For the majority of the models, the higher twist term B_1 appears to grow to a peak around ν ∼ 4 and then either plateaus or shrinks slightly. Similarly to P_1, the location of this peak is expected from the shape of σ_1^{(α,β)}.

Figure 11. The results of fitting to the models of Tab. 3. The PDF is given in the upper left, B_1(ν) in the upper right, R_1(ν) in the lower left, and P_1(ν) in the lower right.
The naive expectation for the higher twist contribution was that the coefficient of Λ^2_QCD z^2 would be of order 1. This result implies that even larger z^2 could be used safely. One may worry about the convergence of the higher twist sum and the size of the neglected twist-6 and higher terms. Since those are not included in the fit, their effects are being accumulated, if imperfectly, into B_1. For the sizes of z^2 used here, and given that B_1 is small, the twist-6 contributions to the reduced matrix element must also be small.

Model weighted averages
There exist a number of methods for combining results from different models into an average. In this study, we utilize the Akaike Information Criterion (AIC) [142] for this goal. The AIC is given by a_i = 2k_i + 2L^2_i, where k_i is the number of parameters of the i-th model and L^2_i is the negative log of its posterior probability distribution. It should be noted that L^2_i differs from the L^2 used previously by the proper normalization of the likelihood and the prior distributions. If there were no prior distributions and the same data were used in every fit, this normalization would only be an irrelevant constant. Since our models use different numbers of parameters, each with their own prior, the total normalization factor differs between models and must be taken into account. When the number of data points n is relatively small, it is common to use the corrected AIC, called AICc [143], A_i = a_i + 2k_i(k_i+1)/(n − k_i − 1), which approaches the AIC when n becomes large. We adopt the AICc for our analysis.
The AICc can be used to create weights for averaging results from different models. The weights can be interpreted as the relative likelihood of a given model compared to the rest. The average value over N models is given by

⟨x⟩ = Σ_i w_i x_i,   with   w_i = e^{−A_i/2} / Σ_{j=1}^{N} e^{−A_j/2},

where x_i is the part of the i-th model which describes the observable x, e.g. the PDF or various model parameters. This weighted average combines knowledge of the likelihood of a given model with a penalty that discourages overfitting. Ultimately, this procedure can be improved by adding models which are less related to each other than those obtained by simply varying the number of terms. If sufficiently many distinct models are used, possible biases from the choice of model can be averaged away through this AICc-weighted average. For example, one could include fits using neural networks, performed for lattice PDFs in [67,99], which would likely have model dependent biases distinct from those of the Jacobi polynomial fits. Unfortunately, in this preliminary study, which only uses Jacobi polynomial based models, the systematic errors may not be sufficiently distinct.
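The AICc-weighted average can be sketched in a few lines (the L^2 values, parameter counts, and per-model observables below are hypothetical):

```python
import numpy as np

def aicc_weights(L2, k, n):
    """Akaike weights from the corrected AIC, A_i = 2k_i + 2L2_i + 2k_i(k_i+1)/(n-k_i-1)."""
    L2, k = np.asarray(L2, float), np.asarray(k, float)
    A = 2*k + 2*L2 + 2*k*(k + 1) / (n - k - 1)
    w = np.exp(-(A - A.min()) / 2)  # subtract the minimum for numerical stability
    return w / w.sum()

# hypothetical models: negative log posteriors and parameter counts, n = 48 data
L2_vals = [30.1, 24.3, 23.9, 23.8]
k_vals = [4, 6, 8, 10]
w = aicc_weights(L2_vals, k_vals, n=48)

x_models = np.array([0.40, 0.35, 0.34, 0.36])  # e.g. a PDF value at fixed x per model
x_avg = np.dot(w, x_models)                    # the AICc-weighted model average
print(w, x_avg)
```

Note how a model whose fit quality barely improves is penalized for its extra parameters: its AICc rises and its weight falls, which is exactly the overfitting guard described above.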
Based upon our studies of the models, we must select which ones to include in the model average. The models without a P_1 term do differ from the rest, which could be a reason to exclude them. However, since their L^2 is so much higher than the rest, they are exponentially suppressed in the AICc average. We therefore include them in the average anyway, in order not to bias ourselves by their discrepancy. The model average of the PDFs is shown in Fig. 28. These PDFs share many of the features shown in Figs. 12-17.

Conclusions
In this work we have studied the continuum limit extrapolation of the nucleon PDF computed on the lattice via the method of Short Distance Factorization and Ioffe time pseudo-distributions. We have employed three lattice ensembles, all with lattice spacings below 0.08 fm, and this work constitutes a significant improvement over our first study [74] of discretization errors in lattice computations of Ioffe time distributions, which used rather coarse lattices of 0.127 fm and 0.091 fm. The lattice spacings of the three ensembles employed in the current study are at the limit of what current lattice methodologies can achieve without encountering issues related to the freezing of topology and critical slowing down. Taking the continuum limit at the physical pion mass with Wilson type quarks is currently not possible, since the generation of at least three ensembles at the physical point with Wilson type fermions is, for the time being, a numerically formidable task. We therefore consider ensembles at a heavier pion mass of 440 MeV. A continuum extrapolation at the physical pion mass will be performed in future studies, once the computational resources render such an endeavor feasible. Additionally, it is well known that contamination from excited states can severely plague the results of a lattice computation, especially as one approaches the physical pion mass. In this work, we employ the sGEVP method in order to achieve better control over the excited states. We also stressed the necessity of dealing explicitly with the various systematic errors that are unavoidable in any lattice calculation. We adopted a Bayesian approach in which we build explicit models of the PDF and of the associated systematic errors in order to describe our data. These models contain unknown parameters whose most likely values are determined, given some prior information, using Bayes' theorem.
Another novelty of this work is the use of Jacobi polynomials as a way of tackling the unavoidable inverse problem that one encounters when trying to obtain the Bjorken-x PDF from Ioffe time lattice data. Up to now, we had mainly been using parameterizations based on polynomials of √x or neural networks. We advocate that the Jacobi polynomials constitute a very versatile and flexible approach that mitigates the model dependence introduced by the inversion.

Acknowledgments
This work benefited from computing resources [146]. It also benefited from access to the Joliot-Curie supercomputer of the TGCC (CEA) in France as part of a "grand challenge" project (project id: gch413) awarded by GENCI (Grand Equipement National de Calcul Intensif). We would also like to thank the Texas Advanced Computing Center (TACC) at the University of Texas at Austin for providing HPC resources on Frontera [147] that have contributed to the results in this paper. We acknowledge the facilities of the USQCD Collaboration, used in part for this research, which are funded by the Office of Science of the U.S. Department of Energy under contract #DE-AC05-00OR22725. The software libraries used on these machines were Chroma [149], QUDA [150,151], QDP-JIT [152] and QPhiX [153,154], developed with support from the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research and Office of Nuclear Physics, Scientific Discovery through Advanced Computing (SciDAC) program, and the U.S. Department of Energy Exascale Computing Project.