The Linear Template Fit

The estimation of parameters from data is a common problem in many areas of the physical sciences, and frequently used algorithms rely on sets of simulated data which are fit to data. In this article, an analytic solution for simulation-based parameter estimation problems is presented. The matrix formalism, termed the Linear Template Fit, calculates the best estimators for the parameters of interest. It combines a linear regression with the method of least squares. The algorithm uses only predictions calculated for a few values of the parameters of interest, which have been made available prior to its execution. The Linear Template Fit is particularly suited for performance critical applications and parameter estimation problems with computationally intense simulations, which are otherwise often limited in their usability for statistical inference. Equations for error propagation are discussed in detail and are given in closed analytic form. For the solution of problems with a nonlinear dependence on the parameters of interest, the Quadratic Template Fit is introduced. As an example application, a determination of the strong coupling constant from inclusive jet cross section data at the CERN Large Hadron Collider is studied and compared with previously published results.


Introduction
The interpretation of collected research data is a frequent and important step of the data analysis procedure in many scientific fields. A common task in this interpretation is the estimation of parameters using predictions obtained from analytic calculations or simulation programs, usually referred to as fitting. Frequently used fitting algorithms rely on numerical methods and utilize iterative function minimization algorithms [1][2][3][4][5][6][7]. The availability of more computational power and the development of improved algorithms such as machine learning techniques [8,9] have led to more comprehensive statistical methods that can be employed for the estimation of parameters [10][11][12][13]. At the same time, simulation programs have become more complex and more computationally demanding, placing constraints on the inference algorithms. For example, in the field of particle physics, such simulations provide fully differential predictions in perturbative Quantum Chromodynamics and the Electroweak Theory (see Refs. [14][15][16][17][18][19][20][21][22] for some recent advances) and may take several hundreds up to a few thousand years on a single, modern CPU core. Sometimes, the simulation of the physical process is followed by a detailed simulation of the experimental apparatus in order to provide a synthetic data as close as possible to the recorded raw data. At the experiments at the CERN Large Hadron Collider (LHC), the simulated data include the physics of the proton-proton collision [23][24][25] and the simulation of processes that take place subsequently in the detector material [26]. This results in the simulation of about 100 million electronic channels [27], which are processed similarly to real data [28]. Consequently, these simulated data cannot be used in iterative optimization algorithms because of their computational cost. In addition, these predictions can be provided only for a few selected values of the theory parameters of interest.
Besides the restrictions associated with computationally demanding simulations, there exist other reasons why less complex simulations cannot be used by certain optimization algorithms. Numerical instabilities or statistical fluctuations in the predictions can result in fit instabilities, input quantities for the predictions are sometimes only available for selected discrete values of the parameters of interest (for example, parameterizations of the parton distribution functions of the proton (PDFs) which are only available for a few values of the strong coupling constant α s [29,30]), or simply because there are technical limitations when interfacing a simulation program to the inference algorithm and the parameter(s) of interest cannot be made explicit to the minimization algorithm. A further frequent limitation for simulation-based inference is related to the intellectual property of the simulation program, where the computer code is not publicly available, but the obtained predictions are available for a given set of reference values. In short, in many cases, the inference algorithm can only use predictions that have been provided previously since a recomputation by the inference algorithm is not possible.
Several variants of simulation-based template fits, or template-like fits, are nowadays used in high energy physics [13,[31][32][33][34][35][36][37][38][39][40][41][42][43][44]. These make use of polynomial interpolation or regression between the reference points or apply related techniques [45][46][47]. Some algorithms use un-binned quantities, while others make use of summary statistics. Furthermore, different algorithms to find the best estimator are employed, for example, numerical methods or iterative minimization algorithms are used to find the extremum of a likelihood or χ 2 -function. In general, typical iterative minimization algorithms can be considered to be template fits, since every iteration generates a new template at a new reference point, which are then used to find the extremum of the likelihood function.
In this article, the equations of the Linear Template Fit (LTF) are derived, which can be written in a closed matrix equation form. The Linear Template Fit provides an analytic solution for the determination of the best estimator in a least-squares problem under the assumption that the predictions are available only for a finite set of values of the parameters of interest. The equations are obtained from a two-step marginalization of the underlying statistical model, which assumes normal-distributed uncertainties. The first step provides a linearized, but continuous, estimate of the prediction 1 , and the second step provides the best estimator of the fitting problem. The Linear Template Fit is suited for a wide range of parameter estimation problems, where the input data can be cross section measurements, event counts, or other summary statistics like histograms.
This article is structured as follows. After a brief review of the method of template fits in Section 2, the equation of the Linear Template Fit is derived in Section 3 for a univariate problem. While the Linear Template Fit turns out to be a simple matrix equation, to the author's knowledge it has not been published in its closed form before. The emphasis of this article is on variations of the Linear Template Fit, its applicability, and the propagation of uncertainties. The multivariate Linear Template Fit is discussed in Section 5, and the Linear Template Fit with relative uncertainties is presented in Section 7. A detailed discussion on error treatment and propagation is given in Section 8. A (detector) response matrix is inserted into the Linear Template Fit in Section 9. Several considerations for the applicability of the Linear Template Fit are provided in Section 10, as well as a simple algorithm for cross-checks and the relation to other algorithms. While the prerequisite of the Linear Template Fit is the linearity of the prediction in the parameters of interest, potential non-linear effects are estimated and discussed in Section 11. There, the Quadratic Template Fit is introduced, an algorithm with fast convergence using second-degree polynomials for the parameter dependence of the model. Three toy examples are discussed in Sections 4, 6 and 13, and are also remarked upon occasionally in between. A comprehensive and real example application of the Linear Template Fit is given in Section 14, where the value of the strong coupling constant α s (m Z ) is determined from inclusive jet cross section data obtained by the CMS experiment at the LHC. The best estimators are compared with previously published results, obtained with other inference algorithms. Section 15 provides a summary. Additional details are collected in the Appendix, where a table can be found of the notation adopted in this article.

Template fits
The objective function of a multivariate optimization problem assuming normal-distributed random variables is a likelihood function calculated as the joint probability distribution of Gaussians where d i are the i data values, σ 2 i the variances, and λ i is the value that is dependent on the model parameters of interest α. Gaussian probability distributions are often appropriate assumptions for real data due to the central limit theorem [48,49]. For numerical computation and optimization algorithms it is convenient to rewrite L in terms of a least-squares equation using χ 2 = −2 log L and omitting constant terms. In matrix notation, and using a covariance matrix, the objective function becomes 2 χ 2 (α) = (d − λ(α)) T V −1 (d − λ(α)) .
A common problem that physicists often face at this point is that the model λ(α) may be computationally intense, time-consuming to calculate, or not be available in a compatible implementation. Hence, the objective function χ 2 cannot be evaluated repeatedly and for arbitrary values of α, as it would be required in gradient methods. In contrast, it is often the case that the model is available for a set of j model parametersα (j) . These predictions y (j) := λ(α (j) ) are denoted as templates for given reference valuesα (j) in the following. Consequently, a template fit exploits only previously available information and can be considered as a two-step algorithm: 1. In a first step, a (continuous) representation of the model in the parameter α needs to be derived from the templates. This can be done by interpolation, regression or any other approximation of the true model with some reasonable function y(α,θ), like where y(α,θ) is dependent on α and several parameters θ. The values θ are then determined from the templates, for example with a least-squares method and a suitable χ 2 , similar toθ = arg min θ χ 2 (y (0) , . . . , y (j) ; θ) .
Such a procedure can be done for each point i separately, or for all points simultaneously.
2. In a second step, which is often decoupled from the first, the best estimators of the parameter of interest are determined using the approximated true model from the first step. It can be expressed asα = arg min where the objective function could be a least-squares expression, like the one of eq. (2), but using the approximated model y(α,θ) from the first step, like Such a two-step algorithm appears to be cumbersome, and in the following the equation of the Linear Template Fit is introduced, which combines the two steps into a single equation, under certain assumptions.

The basic method of the Linear Template Fit
The basic methodology of the Linear Template Fit is introduced, and in order to simplify the discussion a univariate model λ(α) is considered, where the parameter of interest is denoted as α. It is assumed that the model is available for several valuesα (j) of the parameter of interest, and these j predictions are denoted as the templates, λ(α (j) ). These templates will be confronted with the vector of the data d in order to obtain the best estimatorα.
We consider a basic optimization problem based on Gaussian probability distributions, and the objective function is written in terms of a least-squares equation In a first step, the model λ(α) is approximated linearly from the template distributions, and in every entry i, λ i (α) is approximated through a linear function y i (α; θ 0 , θ 1 ) like The best estimators of the function parameters,θ, are obtained from the templates by linear regression ("straight line fit," see e.g. Refs. [51][52][53]), whose formalism also follows the statistical concepts that were briefly outlined above. Each of the templates is representative of a certain value of α, which will be denoted in the following as reference pointsα j , and all reference values form the j-vectorα. Next, an extended Vandermonde design matrix [54], the regression matrix, is constructed from a unit column vector andα: From the method of least squares the best-estimators of the polynomial parameters for the ith entry are [55] where the matrix M + (i) is a g-inverse of a least squares problem [54,56], y (j),i denotes the ith entry of the jth template vector, and W (i) is an inverse covariance matrix which represents uncertainties in the templates, e.g. W (i),j,j = σ −2 y (j),i . However, for the purpose of the Linear Template Fit, an important simplification is obtained since the special case of an unweighted linear regression is applicable to a good approximation. The (inverse) covariance matrix W (i) in this problem has two features: it is a diagonal matrix, since the templates are generated independently, and secondly, it has, to a very good approximation, equally sized diagonal elements, W (i),j,j ≈ W (i),j+1,j+1 . This simplification is commonly well applicable, for example, if the model is obtained from an exact calculation or if Monte Carlo event generators were employed and the templates all have a similar statistical precision. Consequently, the size of the uncertainties in the templates σ y (j),i factorizes from W (i) and subsequently cancels in the calculation of M + (i) , eq. (11). Thus, an unweighted polynomial regression is applicable and the g-inverse simplifies to a left-side Moore-Penrose pseudoinverse matrix [57][58][59], with M as defined in eq. (10). It is observed that the matrix M + is universal, so it is independent on the quantities y (j),i and W (i) , but is calculated only from the reference pointsα. Consequently, it is equally applicable in every bin i.
We further introduce the template matrix Y , a i × j matrix, which is constructed from the column-vectors of the template distributions, and can be written as Hence, substituting the best estimatorsθ (eq. (11)) into the linearized model y(α,θ) (eq. (9)), and usingm andm (eq. (13)) and Y (eq. (14)), the model can be expressed as a matrix equation 3 : It is seen that the polynomial parameters θ are no longer explicit in this expression. Consequently, the objective function (eq. (8)) becomes a linear least-squares expression where W is the inverse covariance matrix, W = V −1 .
In the second step of the derivation of the Linear Template Fit the best estimatorα is determined. Due to the linearity of χ 2 in α, the best-fit value of α is defined by the stationary point of χ 2 [49,55,60], and the best linear unbiased estimator of the parameter of interest iŝ When introducing the g-inverse of least squares, a 1 × i matrix, the master formula of the Linear Template Fit is found to bê 3 Note that eq. (15) represents already a very useful (linearly approximated) representation of the simulation, since it transforms a discrete representation of the model into a continuous function. This equation may become useful for several other purposes, or optimization algorithms, whenever only templates are available, but a (quickly calculable) continuous function is required. Higher-degree polynomials beyond the linear approximation can also be used for that purpose, so λ T (α) α 0 . . . α n M + Y T , using a n-degree regression matrix M (eq. (81)).
Since the distribution of the data is assumed to be normal-distributed, from the linearity of the least squares, it follows that the estimates are also normal-distributed. From error propagation, the variance of the best estimator is Given the case that the approximation in eq. (15) holds, the estimatorα represents a best and unbiased estimator according to the Gauß-Markov-theorem [49,55], and also the variances are the smallest among all possible estimators. Since d does not enter the calculation of the variances, eq. (20), the uncertainties are equivalent to the expected uncertainties from the Asimov [61] data λ(α).
When using the right expression of eq. (13), it can be directly seen that the best estimator is indeed obtained from a single matrix equation using only the template quantities (Y ,α) and the data details (d, W ) and the Linear Template Fit can alternatively be written aŝ In contrast, when the (commonly well justified) approximation of the unweighted regression in each bin is not applicable (see left side of eq. (12)), then the bin-dependent regression matrices M + (i) remain explicit. Such a case may be present when a Monte Carlo event generator is used for the generation of the templates and the statistical uncertainties have to be considered, and one sample was generated with higher statistical precision than the others, for instance, since it also serves for further purposes like unfolding. However, when using such a weighted regression, the best estimator may still be expressed using eq. (17), but the elements of the two vectors Ym and Ym would need to be calculated separately and become where the vectors x (i) denote the ith row of the template matrix, x (i) = Y T i , and the matrices M + (i) may include uncertainties in the templates x (i) through W (i) (see eq. (11), right). This case, however, will not be considered any further in that article.
In the following sections, an example application of the Linear Template Fit is presented, and subsequently, generalizations of the Linear Template Fit are discussed, and formulae for error propagation are presented.

Example 1: the univariate Linear Template Fit
An example application is constructed from a random number generator within the ROOT analysis framework [62]. The physics model is a normal-distribution with a standard deviation of 6.0. The mean value, denoted as α, is subject to the inference. Similarly to particle physics, a counting experiment is simulated, and the (pseudo-)data are generated from 500 events at a true mean value of α = 170. 2   169. Some measurement distortions are simulated by using a standard deviation of 6.2 for the pseudo-data. Limited acceptance of the experimental setup is simulated by considering only one side of the normal distribution, and altogether 14 "bins" with unity width are "measured". Let us assume that from a previous measurement it is known that the physics parameter of interest has a value of about α ≈ 170.5 ± 1.0. Therefore, seven templates in a range fromα = 169.0 to 172.0 in steps of 0.5 are generated, and each template is generated using 40,000 random numbers. From eq. (20) it is then found that the pseudo-data will be able to determine the value of α approximately with an uncertainty of ±0.5, and subsequently, additional templates at values oḟ α = 169.0, 169.5, 170.5, 171 and 171.5 are generated.
The Linear Template Fit using these seven templates is illustrated in figure 1. The Linear template Fit from eq. (19) reports a value ofα = 170.03±0.41, which is fully consistent within the statistical uncertainty with the generated true value of 170.2, although a somewhat different standard deviation was used. In figure 1 (right), the result from a hypothesized weighted linear regression is shown (cf. eq. (23)), but it is almost indistinguishable from the unweighted linear regression as is used in the Linear Template Fit. This is because the generation of the individual templates always employs the same methodology, as already argued above. Furthermore, the estimated best model is displayed in figure 1 (left), which is defined from eq. (15) when substituting the best estimator,ŷ = y(α). This results in χ 2 = 5.6 for 14 data points and one free parameter. Other seeds for the random number generator of course result in different values. It should be noted that this example is purposefully constructed with large (statistical) uncertainties in order to obtain a visually clearer presentation in the figures, although in some bins the assumption of normaldistributed random variables is then a rather poor approximation to the Poisson distribution.

The multivariate Linear Template Fit
Phenomenological models often depend on multiple parameters, and thus it is a common task to determine multiple parameters at a time. Such k parameters of interest are referred to as α 1 , . . . , α k , or simply α, and the best estimators are denoted asα 1 , . . . ,α k , orα. The linear representation of the model is a hyperplane y i (α) in each of the i data points, defined as where a constant θ 0 and the first-degree parameters θ (1,k) are considered. Since higher-degree terms or interference terms are not included, the fit parameters need to be (sufficiently) independent or they have been made orthogonal by applying a variable transformation beforehand.
In the multivariate case, each template y (j) is representative of a reference pointα (j) in the k-dimensional space. The regressor matrix M is constructed as a j × (1 + k) design matrix: As in the univariate case, the pseudoinverse M + is calculated from eq. (12), and also the same considerations for the justification of the unweighted multiple regression are applicable. Instead of a single vectorm, there is now a vector for each regression parameter θ k . Therefore, the j × k matrixM is introduced, which is defined by decomposing M + like Hence, the best estimator for the linearized multivariate model y(α; θ) becomes where the θ-parameters are again no longer explicit.
When using the linear approximation of the model, λ(α) ≈ŷ(α), the χ 2 function (eq. (2)) for the multivariate Linear Template Fit becomes The last equation introduces an equivalent expression in terms of nuisance parameters [63]. The factorsˆ l are related to uncertainties with full bin-to-bin correlations when writing the covariance matrix as where the sum l runs over all uncertainties with full bin-to-bin correlations and the vectors s l denote the individual systematic uncertainties (also called shifts), while the matrix V includes all other uncertainty components. It is common practice that the systematic shifts are calculated from relative uncertainties and multiplied with the measured data. Implications of this practice are discussed in section 7 and 8.7, and care must be taken that the result does not become biased [64][65][66][67][68][69]; a common technique to avoid that bias is discussed in section 8.7.
Equation (30) is again a linear least squares expression and the best linear unbiased estimators for the parametersα and the nuisance parametersˆ are obtained from the stationary point. Hence, the best estimators from the multivariate Linear Template Fit becomê whereâ was introduced and the shorthand notations F (a g-inverse of least squares) and D are calculated as using W = V −1 . The matrix with the fully bin-to-bin correlated uncertainties S is composed from the column-vectors of the systematic shifts s (l) as and the symbol YM S denotes a composed i × (k + l) matrix from YM and S, while 0 k and 1 l denote a k × k or l × l zero or unit-matrix, respectively. The matrix D is commonly a fullranked symmetric matrix and thus invertible. A brief discussion about a more efficient numerical calculation of D −1 is given in D. If all uncertainty components are represented through a single covariance matrix V , cf. left side of eq. (31), the multivariate Linear Template Fit simplifies tô 6 Example 2: the multivariate Linear Template Fit In Example 1, only the mean value of the model (which is a normal-distribution) was determined, although the pseudo-data had a slightly different standard deviation than the model. In the following example, a multivariate Linear Template Fit is performed and the same pseudo-data as in Example 1 are used, but both values of the model will be determined: the mean value and the standard deviation of the Gaussian. Therefore, templates are generated for some selected values for the mean (between values of 169.5 and 171) and the standard deviation (between values of 5.8 and 6.4) of a Gaussian, and again 40,000 events are used for each template. The four "extreme" variations of the two parameters are omitted. The multivariate Linear Template Fit using all these templates is illustrated in figure 2.
The best estimators from the multivariate Linear Template Fit (eq. (35)) are 169.91 ± 0.46 for the mean and 6.28 ± 0.38 for the standard deviation. Their correlation coefficient is found to be −0.3. Within the statistical uncertainties, both values are in very good agreement with the simulated input values of 170.2 and 6.2, respectively. Thus, this example illustrates the application of the multivariate Linear Template Fit with two free parameters, where a visual representation of the linearized model is still possible. However, any number of free parameters is in principle possible, and for an n-parameter fit, the minimum number of linearly independent templates is just n + 1. Estimated

The Linear Template Fit with relative uncertainties
In this section we present the equations for the Linear Template Fit when the estimators obey a log-normal distribution. This could be the case for data when the determination of a variable is affected by a number of multiplicative factors that are subject to uncertainties. Consequently, the variable follows a log-normal distribution due to the central limit theorem [48,49]. Also, when the value of an observed variable is a random proportion of the previous observation, it follows a log-normal distribution [49]. An example would be the measurement of the electron energy in a calorimeter which is affected by a number of fractional energy losses and corrections [10]. Another example would be measurements that are dominated by systematic multiplicative uncertainties: a prominent multiplicative error is due to uncertainties of the luminosity measurement in particle collider experiments which results in a relative uncertainty. Contrary to additive errors, multiplicative errors cannot change the sign of the variable, and a positively defined observable always remains positive, which is an important prerequisite for several physical quantities such as cross sections. A comparison of the log-normal, the normal, and the Poisson probability distribution function for two selected values of their mean value are displayed in figure 3.
The likelihood of a set of i measurements is a joint function of log-normal probability distribution functions When writing µ i = log m i [70,71], it can be directly seen from the residuals that the ratios d i /m i enter the likelihood calculation instead of their differences. The variance ς denotes a relative quantity, thus it represents relative uncertainties. In fact, the log-normal distribution is  15) is also displayed, which is otherwise not used when log-normal-distributed estimators are considered. equivalent to considering random variables with normal-distributed relative uncertainties. For small relative errors, ς i 20 %, which is a common case in particle physics, the log-normal, the normal, and the Poisson distributions become similar. For larger relative errors, the log-normal distribution provides some approximation of the Poisson distribution, and for practical purposes one benefits from its strictly positiveness, in contrast to the often used normal distribution.
Similarly to the case for normal-distributed data, we choose as a first-order approximation of the model in every bin and the best estimators for the θ parameters are again determined from the templates; consequently, the best estimator of the approximated model is The term log(Y ) denotes a coefficient-wise application of the logarithm to the elements of the template matrix Y , and it is used to match the logarithmized data (log d i ) in the likelihood. Eq. (38) is an equally valid first-order approximation of the model as compared with the default linear approximation in eq. (24).
As an illustration, the linearized model from eq. (38) is displayed in figure 4 for some selected bins of Example 1 (section 4) and compared with the approximations of eq. (27). The two are reasonably similar in that example, since the templates are present in the vicinity of the best estimator and in a range where a linear approximation is valid. Although the parameter dependence in Example 1 is in fact truly linear, from the generated statistical precision of the templates, the templates cannot discriminate between these two best model estimators.
When using log-normal distributed variables, which corresponds to relative uncertainties, the objective function χ 2 of the Linear Template Fit becomes The covariance matrix V = W −1 comprises normal-distributed relative uncertainties, and also the uncertainties with bin-correlations s enter the calculation through their relative values. The notation log d denotes a coefficient-wise application of the logarithm. A common relation between relative and absolute uncertainties would be and although this equation does not strictly translate between normal and log-normal uncertainties, it is often used in practical applications. Note that we use a Roman font s for relative uncertainties, and italic s for absolute uncertainties.
This χ 2 function is again a linear least-squares problem, and the best estimators for α and in the Linear Template Fit areâ where the matrix F is defined in a similar way as eq. (33), but from relative uncertainties in V and s (S) and employing the substitution YM → log(Y )M . Due to the treatment of systematic uncertainties as relative uncertainties, the Linear Template Fit with the log-normal probability distribution functions may be preferred in practical applications over the Linear Template Fit with absolute uncertainties.
When repeating Example 1 (section 4) with relative uncertainties, the best estimator for the mean value is found according to eq. (41) to beα = 170.40 ± 0.46, which is in good agreement to the generated value of 170.2. Some details are further displayed in figure 4.
When considering Example 2 (section 6) under the assumption of log-normal distributed pseudo-data, the best estimators are found to be 170.18 ± 0.47 for the mean and 6.33 ± 0.39 for the standard deviation, with a correlation coefficient of −0.3. When considering further a normalisation uncertainty of the size of 10 %, so a relative uncertainty, the best estimators for the mean and variance become α 0 = 170.20 ± 0.76 and α 1 = 6.33 ± 0.49, respectively. Their correlation coefficient is −0.6, and the nuisance parameter of the normalisation uncertainty isˆ = −0.02 ± 0.5.

Errors and error propagation
In many applications and fields, like particle physics, it is of great importance to provide detailed uncertainties together with the fit parameters and to propagate uncertainties from all relevant input quantities. Due to its analytic and closed form, the Linear Template Fit provides unique opportunities for detailed error propagation and analysis, that is otherwise not easily possible in other inference algorithms. For example, every single uncertainty component can be propagated separately and analytically to the fit results. In the following, the equations for error propagation are discussed, based on the equation of the Linear Template Fit in eq. (32).

Statistical uncertainties and uncertainties without bin-to-bin correlations
Statistical or epistemic uncertainties in the data are represented by the covariance matrix V. The matrix V may include further uncertainty components with or without bin-to-bin correlations, like systematic or aleatoric uncertainties. From linear error propagation [60], the covariance matrix is propagated to the best estimatorsâ asVâ using F as defined in eq. (33). Due to the Gauß-Markov theorem the estimatorsâ for the given objective function χ 2 , eqs. (30) and (39), have the least variance among all possible estimators.
If V is a sum of multiple individual matrices, like V = V stat. + V uncorr. + V corr. + . . ., each covariance matrix can be propagated separately using equation (42). By doing so, the contribution from each uncertainty component to the full uncertainty is determined separately.

Systematic uncertainties or uncertainties with bin-to-bin correlations
Systematic uncertainties associated with the input data are often represented as uncertainties with full bin-to-bin correlations and are represented either as systematic shifts, s (l) (eq. (34)), or as covariance matrices, V corr l = s (l) s T (l) . Using linear error propagation, these are propagated to the best estimatorsâ as σ Alternatively, eq. (42) can be employed. For the Linear Template Fit with relative uncertainties, one literally has the same equation but using relative values s (l) (compare eq. (40)).
In Example 2, where a normalisation uncertainty of 10 % was considered the uncertainty breakdown for the two parameters would be σα 0 = ±0.56 (stat.) ± 0.51 (norm.) and When systematic uncertainties are included in the Linear Template Fit, their corresponding nuisance parameters l are constrained by data, and the obtained reduction of the systematic uncertainty may be used for further analyses. In applications where the fitted parameters are dominated by systematic uncertainties in the input data, it is good practice to constrain the associated systematic shifts by adding additional measurements to the fitting problem [31,35,72,73]. This practice provides additional constraints on the related parameters, l , and smaller experimental uncertainties can be obtained. The uncertainty in the nuisance parameter becomes smaller than unity.

External uncertainties
The generalized transfer matrix F provides the straightforward opportunity for linear error propagation; see eqs. (42) or (43). While these equations are predominantly employed to propagate the uncertainties that are included in the Linear Template Fit, equivalent equations are also applicable to propagate further uncertainties that are not used in the fit, but which would still need to be propagated to the fit result. For example, these may be uncertainties in the underlying theory prediction which cannot be constrained by the data. An actual case for such an uncertainty is given below for the non-perturbative correction factors (cf. Fig. 10).

Unconstrained uncertainties with full bin-to-bin correlations
A special case of an uncertainty with full bin-to-bin correlations may be considered through an unconstrained systematic uncertainty. These are defined by including the respective systematic shift s (l) into the Linear Template Fit, but in contrast to common constrained systematic uncertainties, these have a zero value (instead of unity) in the respective diagonal element of the l × l matrix 1 l in eq. (33). As a consequence, the Linear Template Fit can exploit the degree of freedom of that systematic shift, but the data does not provide any constraint on that uncertainty source. A possible application would be fits to data, where the normalization is unknown and is a free, unconstrained, systematic uncertainty.

Uncertainties in the templates
The template distributions Y may be associated with uncertainties from various sources. These can be statistical uncertainties in the elements Y ij or systematic uncertainties with correlations among all entries of Y . One may think about Y being determined from two distinct models, or templates that were generated with different parameters, and the resulting difference in Y may be considered as a systematic uncertainty.
The uncertainties inâ due to uncertainties in Y are obtained from linear error propagation. The uncertainty in a single element of Y is denoted as σ Y ij , and using the shorthand notation D from eq. (33), the partial derivative of eq. (32) is where 1 (ij) denotes an i × j zero matrix with only the element (i, j) being unity, and 0 il denotes an i × l zero-matrix. It should be noted, that the usage of the matrices 1 (ij) and 0 il in numerical calculations is discouraged since a naive matrix multiplication becomes numerically inefficient, but that it remains instructive to write the equations in that extensive format. The resulting size of the uncertainty of the best estimators due to σ Y ij becomes Hence, uncertainties without entry-to-entry correlations, such as statistical uncertainties, are propagated to the estimated parameters as where the square root and the exponentiation are performed coefficient-wise, while uncertainties that have full entry-to-entry correlations are propagated as In Example 1 and 2 (sections 4 and 6), the templates are each generated with 40 000 events and independently from each other. Hence, all their entries have statistical uncertainties. Using eq. (46), the statistical uncertainties from Y under gaussian approximation are ±0.025 in Example 1, and ±0.042 and ±0.060 for the two parameters in Example 2.
The equations for error propagation of the Linear Template Fit with relative uncertainties (section 7) are altogether similar to the presented equations and do not need to be repeated. However, since relative uncertainties are considered, the partial derivatives for error propagation in eqs. (46) and (47) become Example 2 with log-normal uncertainties would report statistical uncertainties from Y of ±0.052 and ±0.066 for the two fit parameters.
Note that the uncertainties in Y are not considered within the fit, but propagated separately, since it is assumed that these are negligibly small, and independent of the model parameter α, and cannot be constrained by the data.

Full uncertainty
The covariance matrix for the best estimators, including all uncertainty components that are considered in the Linear Template Fit, are calculated from eqs. (42) and (43): Further uncertainties, which are not explicitly considered in the fit, can be propagated to the result by using eqs. (42) and (43) or also eqs. (46) and (47).

Error (re-)scaling
When multiple data points are considered in the fit and they are correlated through some (systematic) uncertainties, the best-fit estimators may become biased [64][65][66][67][68][69]. This is mainly because uncertainties with bin-to-bin correlations are commonly valid as relative uncertainties, or so-called multiplicative uncertainties, such as normalization uncertainties, but those are included in the fit with absolute values (i.e., as variances). Since the data are subject to random noise from statistical fluctuations, or simply because of the different size of the values, for example when considering different data sets, the result may become biased since smaller values are effectively preferred by the fit. While such a bias is largely reduced or even absent from first principles when working with relative uncertainties, it may still be useful to discuss possible solutions to this problem and several solutions are discussed in the literature (see e.g. Ref. [69]).
A common solution is to consider (multiplicative) uncertainties with bin-to-bin correlations through their relative values in the χ 2 computation, and their relative values are multiplied coefficient-wise with the prediction(s) rather than the data values [68]. However, since the minimization of χ 2 is then no longer linear in the fit parameters, there is no closed analytic solution for the best estimators. However, an approximately equivalent result is obtained when calculating the covariance matrices and systematic shifts (cf. eq. (31)) from theoretical predictions [66], and the coefficients of the shifts are recalculated according to and analogously for covariance matrices. Various options for the predictions y for this error rescaling may be considered, and the prediction y could be chosen to be • One of the template distributions, y (j) ; • The prediction of the linearized model with ad-hoc selected parameters α,ŷ(α); • The best estimator of the linear model,ŷ(â); or • The Linear Template Fit may be iteratively repeated with the best estimator,ŷ(â).
The rather compact equations of the Linear Template Fit are well suited for an iterative algorithm. Such procedures are equivalent to an alternative formulation of the χ 2 quantity, but the convergence of such iterative algorithms should be critically assessed.

Linear Template Fit with a detector response matrix
Many measurements need to be corrected for detector effects like resolution or acceptance in order to compare the measurement with theoretical predictions. This is referred to as unfolding (for reviews see e.g. [10,[74][75][76][77][78][79]) and is commonly performed by first simulating the detector response and representing it in terms of a response matrix A, where y denotes the measured detector-level distribution and x the "true" underlying distribution. However, it is not straightforward to apply the inverse response matrix to the distribution y, since the unfolding problem represents an ill-posed inverse problem, or the matrix A needs to be a square matrix, and thus more sophisticated unfolding algorithms need to be employed to determine the "true" distribution x [80][81][82][83][84][85]. When performing a parameter determination, however, it is equivalent whether using the unfolded distribution x together with templates of the predictions, or alternatively using the detector-level measurement y and applying the response matrix to the templates instead. These two procedures are supposed to be equivalent if the unfolding problem and algorithm are unbiased, and are referred to as folding, up-folding, or forward folding. Because the simulation of the detector effects and the determination of the response matrix A may be computationally expensive, as is the case for complex particle physics experiments, it may be reasonable to determine the matrix A only once and with high precision, and then apply it to the templates.
The inclusion of the detector effects in the Linear Template Fit is straightforward and is realized by the simple substitution in eqs. (32) and (33), where X represents the template matrix without detector effects at the "truth" or "particle" level. When using relative uncertainties, the substitution becomes and the full expression of the Linear Template Fit becomes Since Y and AX are i × j matrices, the index t for the number of entries ("bins") on the truth level was introduced. Thus, A is an i × t matrix and X has dimension t × j, and in order to obtain a higher resolution, it may be reasonable to consider a non-square response matrix A, i.e., i = t. It is important that the detector response A is required to be independent from the physics parameter of interest.
The response matrix A is typically associated with uncertainties, and these may be entry-toentry correlations (like systematic or "model" uncertainties) or without such correlations (like statistical uncertainties). Note, that the response matrix is commonly normalized and therefore the statistical uncertainties are without entry-to-entry correlations only approximately.
Uncertainties of the best estimators that arise from uncertainties in A can be obtained from linear error propagation. After applying eq. (52), the partial derivative of eq. (32) is Consequently, uncertainties in the response matrix A without or with entry-to-entry correlations are propagated to the fitted results as respectively, where σ A it denotes the size of the uncertainty in the element A it , and the power and square-root operations are applied coefficient-wise. When using a response matrix, the uncertainties in the template matrix X (cf. section 8.5) are propagated using Note that for the Linear Template Fit with relative uncertainties, it is reasonable to consider uncertainties in A it as absolute uncertainties, rather than relative, because the elements A it represent already relative quantities and may be interpreted as probabilities. Therefore, one employs the partial derivative ∂â/∂A ij for error propagation, instead of ∂â/∂ log A ij , although elsewhere relative uncertainties are considered. This partial derivative is straightforward to derive. The propagation of uncertainties in the templates X tj , however, becomes somewhat more complicated when using a response matrix A, since one has to calculate the partial derivative ∂â/∂ log X tj for error propagation of the relative uncertainties σ X tj . 4

Considerations, validations and cross checks
As in any optimization problem, the result of the Linear Template Fit needs to be validated and cross checked. Due to its close similarity to the iterative Gauß-Newton minimization algorithm of nonlinear least squares [5,50,51,86,87], the Linear Template Fit is expected to result in an unbiased and optimal result, in particular if the templates provide a sufficiently accurate approximation of the model and its Jacobi matrix at the best estimator. In fact, when assuming that the model λ(α) could be used with continuous values of α for inference, the best estimator from the Linear Template Fit becomes equivalent to the best estimator that would be obtained when using λ(α) directly, if two approximations are exactly fulfilled, 5 λ(α) ≈ŷ(α) = Ym + YMα and (58) The first equation requires that the model is correctly represented by the interpolated templates at the best estimator, which is needed for the value of the best estimator, and the second equation demands that the first derivatives are correctly approximated, which is a requirement for the error propagation and minimization. These equations taken together represent just a linear approximation of λ(α) atα, but using the templates only. However, since for the motivation of the Linear Template Fit it was assumed that λ(α) cannot be used for inference itself, these two approximations also cannot be validated directly. In the following, some possible studies for the validation of the results and the practical application of the Linear Template Fit are discussed.

Template range
If the model is nonlinear in the parameters α, an unbiased result may only be obtained if the best estimatorα lies within the interval of the reference values, This is because the templates provide the approximation of the model and its first derivatives, but an extrapolation beyond the template range is inappropriate for a non-linear model. If eq. (60) does not hold, additional templates at further reference points should be generated and the Linear Template Fit should be repeated with an improved selection of reference values. For similar reasons, the templates should be in the close vicinity of the best estimator, such that the linear approximations are justified. As a rule of thumb, the distance between the reference points should not greatly exceed the size of the variance of the best estimatoṙ For similar reasons, the best-estimator should ideally be found in the center of the template range, rather than at its boundary,â −α (0) ≈α (jmax) −â .
However, for rather linear parameter dependencies these requirements may be more relaxed.
The choice of the reference values may be related to the size of the uncertainties of the best estimator. If one considers the uncertainties of the best estimator to be normal distributed, 6 as a consequence, the model also needs to exhibit a sufficiently linear behavior at least within a few σ. As a rule of thumb, one could demand that the reference values are approximately within 3 standard deviations:α 10.2 The value ofχ 2 , its uncertainty and the partial χ 2 The value of χ 2 at the minimum,χ 2 , is a meaningful quantity to assess the agreement between the data and the model. Since the calculation of χ 2 is not explicit in the Linear Template Fit, its value needs to be calculated separately, andχ 2 is calculated from data and the templates at a using the somewhat simplified equation Since the data and the templates are associated with uncertainties, the value ofχ 2 consequently may also be considered as uncertain. The uncertainty inχ 2 is obtained from linear error propagation of the input uncertainties and calculated as where the i-vector ξ is introduced and whose coefficients represent the partial derivatives ∂χ 2 ∂d i , which are calculated as ξ i = 2W i (d −ŷ(â)), and W i denotes the ith row of the matrix W . Also, the uncertainties in the templates Y may be propagated toχ 2 with similar equations. The interpretation of σχ2 is not trivial, but it may well be that this quantity has interesting features and may serve as an additional quality parameter in the future.
The value of χ 2 is inherently obtained from a sum of various sources of uncertainties. It is instructive to calculate the contribution from any uncertainty source toχ 2 separately using [92] where V is any of the covariance matrices, see eq. (31). It is obvious thatχ 2 = χ 2 . This may become particularly useful when multiple measurements are considered in the fit, and thus their individual contributions toχ 2 , i.e. the partial-χ 2 's, may be calculated by summing the respectivê χ 2 values.

Interpretation of the χ 2 j values as a cross check
An obviously simple validation of the result from the Linear Template Fit is performed by comparingχ 2 with the χ 2 values calculated from the individual templates, defined as whether absolute or relative normal distributed uncertainties are considered.
An alternative best estimator, denoted in the following asα, is obtained from the interpretation of all of the χ 2 j values, and serves as a validation of the best estimator from the Linear Template Fit (α). Based on the assumption that the χ 2 distribution χ 2 (α) follows a second-degree polynomial, the so-called χ 2 -parabola, its minimum is interpreted as best estimator. For a univariate problem, when introducing with χ 2 j from eq. (67), the best estimator of the model parameter is given by the stationary point of the polynomialα where the best estimators of the (three) polynomial parameters areθ = (C T C) −1 Cχ 2 . An illustration for Example 1 is shown in figure 5. The uncertainty inα is given by the criterion χ 2 min + 1 [10] and hence The equations for the multivariate case are straightforward and not repeated here. This alternative best estimator,α, as well as its variance σα and the minimum χ 2 , should be sufficiently consistent with the results from the Linear Template Fit: and may serve as a valuable cross check. As a quick and practical visual test, if the templates line up on the parabolic fit (as seen in figure 5), the problem is essentially linear and the Linear Template Fit will provide an unbiased and optimal best estimator. Note, that for the calculation of eq. (69) at least three templates are required, whereas the minimum number of templates for the univariate Linear Template Fit is only two.

Power law factor for the linear regression
In the Linear Template Fit the functionsŷ(α) are (multi-dimensional) linear functions, see eqs. (15), (27), and (38). Although the use of higher-degree polynomials or multivariate polynomials with interference terms could in principle be considered (see next section 11), in such cases the optimization problem cannot be solved analytically and no closed matrix expression for the best estimators is found. However, an analytic solution is still obtained when only power law factors are introduced, and in a univariate Linear Template Fit the model approximation becomes with γ > 0 being a real number. The matrix M , eq. (10), then has elements M j,0 = 1 and M j,1 =α γ j . In the multivariate case, different power law factors for each fit parameter can be introduced, and one writes γ k and has M j,(k+1) =α γ k (j),k . The best estimators of the multivariate Linear Template Fit, eq. (32), then becomê This solution can immediately be seen by a variable substitution of α γ in eqs. (32) and (72). The application of the power law factor(s) γ when working with relative uncertainties (see section 7) is straightforward, since all equations remain linear in α k . Similarly, the equations for the error propagation (see section 8) can easily be adopted with that variable substitution. For instance, the elements of the systematic shifts (see eq (43)) become and similarly for other uncertainty components or covariance matrices. The power law factor γ may be of practical use in order to validate the assumptions of the linearity of the regression analysis, or to improve the fit, such as when it is known that the model is proportional to α 2 . This would be the case in determinations of the strong coupling constant from n-jet production cross sections at the LHC, which are proportional to α n s (m Z ) at the leading order in perturbative Quantum Chromodynamics (QCD).

Finite differences
In a use case where a larger number of templates are available, various choices of subsets of these templates can become input to the Linear Template Fit, and some of such potential choices have already been suggested in the previous paragraphs. In the case where only two (or three) templates are provided as input to the Linear Template Fit, the selected templates may have a special meaning, because the templates can also be considered as a numerical first derivative through finite differences. Such numerically calculated derivatives are the essence of many gradient descent optimization methods [5,49,50,86,87], and it is instructive to study the relation.
As an example, let us consider a univariate problem where the objective function χ 2 is given by eq. (8). Close to the minimum, χ 2 is nearly quadratic, and the Newton step, ∆α = −(χ 2 ) /(χ 2 ) , exhibits a good convergence [5,50,51,87]. The first and second derivatives at an expansion point where the second derivatives λ are zero or neglected in the latter equation. Hence, one obtains the estimator from the Newton step aŝ and the calculation of the first derivative could be done numerically using a finite difference with nonzero step size h The analogous Linear Template Fit makes use of the templates at α 0 and α 0 + h and defines the following matrices From the Linear Template Fit, eq. (17), when resolvingm, one obtains the best estimator and since Ym = λ (α 0 ), it can be directly seen that it is equivalent to the estimatorα (Newton) from the Newton step using a numerical finite difference.

Nonlinear model approximation
The Linear Template Fit can be regarded such that linear functions are fitted to every row i of the template matrix Y . 7 In the following, nonlinear functions are considered for the model λ i (α) in each "bin" i. For a simplified discussion, the formuluae for only a univariate model are sometimes discussed, while the multivariate case is commonly straightforward. Also, the concepts of the γ factor, relative uncertainties, or "systematic shifts" s l can equally be considered in a multivariate formulation.
The model can be approximated from the templates in every bin i with an nth degree polynomial function The best estimators for the polynomial parameters are obtained from regression analysis, where y (j),i is the ith value of the jth template, and the g-inverse is For reasons outlined in section 3 an unweighted regression is considered. Using this model approximation in the objective function of the fitting problem, eq. (2), the χ 2 expression for a univariate model becomes of order O(2n) in α, where the vectors m were used as the representation of M + according to It is obvious that for n = 1, this is the Linear Template Fit, and the generalizations to the multivariate fit (eq. (30)), and with relative uncertainties (eq. (39)), are straightforward. Note that for different degrees n, the vectors m differ unless the equations are not rewritten for orthogonal polynomials [93].
Since the first derivative of χ 2 O(2n) is nonlinear in α, no analytic solution of the optimization problem is found. Therefore, in the following, two iterative algorithms to find the minimum of the objective function χ 2 O(2n) are discussed, which refer to the Gauß-Newton and the Newton algorithm (see also Refs. [5,50,51,87]). In both cases, if the starting value is identified with the best estimator from the Linear Template Fit, α =α, the step size of the next iterative step can be interpreted as convergence criterion or uncertainty of the Linear Template Fit. Reasonable values for n are n = 2, 3, or at most 4 (since M becomes quickly ill-conditioned for large n, e.g. Refs. [94,95] and therein), while for n = 1, consistent results with the Linear Template Fit are found.

Linearization of the nonlinear approxmationŷ(α)
An iterative minimization algorithm based on first derivatives is defined through linearly approximating the (nonlinear) functionŷ(α), eq. (80). For the purpose of validating the result from the Linear Template Fit, we considerα as the expansion point, although this could be any other value as well. Hence, the linear approximation ofŷ(α) in every bin i is The last term defines the shorthand notationsm andm. From comparison with eq. (15), it is seen that an alternative best estimator of the template fit (see eq. (19)) is obtained aŝ The differenceα −α is supposed to become small for a valid application of the Linear Template Fit, and one should find |α −α| σα .
If eq. (88) holds for a particular application, then nonlinear effects are negligible, and the best estimator of the Linear Template Fit is insensitive to nonlinear effects in the model λ. If eq. (88) does not hold, then it should be studied if the nonlinearity is present in the vicinity ofα or if it is rather introduced by a long-range nonlinear behavior of λ(α). In the latter case, the range of the templates should be reduced such that one has abouṫ α max −α min 2σα (89) and the Linear Template Fit should be repeated. Note that if the templates are subject to statistical fluctuations and if only a few templates are provided, then a nonlinear approximation is more sensitive to those fluctuations than linear approximations, and |α −α| may become larger. In such an application σ Ŷ α should also be considered as a relevant source of uncertainty (see eqs. (46) and (47)).
In Example 1, when using a second-degree polynomial, a value |α −α| = 0.02 is found, which is much smaller than σα = ±0.41 and of similar size than the statistical uncertainties from Y (see section 8.5). When working with log-normal uncertainties, a value |α −α| = 0.05 is found. Although the model is highly nonlinear in log(y), the assumptions of the linear template fit are well justified since the templates are in the close vicinity of the best estimator. In Example 2, for a second degree approximation values of |α 0 −α 0 | = 0.03 and |α 1 −α 1 | = 0.14 are found. When further considering interference terms, see Appendix C, the values are 0.01 and 0.11, respectively All these values are smaller than the respective statistical uncertainties in the data, although the values suggest that for α 1 (the variance of the gaussian model) a finer grid for the reference values may be preferred, for a more accurate linear approximation.
Through linear error propagation, one can define an uncertainty inα as Note that the sum runs over j in the first and over i in the second expression. The second expression makes use of σ Y m := Y σm and σ Y m := Y σm (92) and it has the advantage that the summands represent a useful quantifier for every bin i. For a sufficiently linear problem, both the linear sum (eq. (91)) and the quadratic sum should be sufficiently small. In particular, one should find σ (m) α ≈α −α for reasonable nth degree polynomials, i.e. n=2, and also, according to eq. (88),

Minimization of χ 2 O(2n)
Whenα is obtained from the Linear Template Fit, the objective function χ 2 O(2n) , eq. (84), is expected to exhibit a minimum in the close vicinity ofα. This provides good motivation for the application of the Newton optimization atα, which is based on second derivatives of χ 2 O(2n) , so a parabolic approximation. The Newton distance to the minimum is given by ∂α α=α (94) and in the multivariate case by [5,55] ∆α = −H −1 g α=α .
The gradient vector g and the Hesse matrix H are defined by the first and second derivatives of χ 2 O(2n) , respectively. Both quantities can be calculated directly at an expansion point α 0 , which was identified in the above equations withα, since χ 2 O(2n) is just a known second-degree polynomial and dependent on M , Y , d, S, and W . The value of ∆α can be interpreted as the expected distance to the minimum (EDM) based on the nonlinear model approximation. 8 For practical purposes and to validate the result of the Linear Template Fit one should require that its value be smaller than the uncertainty from the fit |∆α| σα .
The most reasonable degree for this purpose is n = 2 (which is exploited in the following in greater detail), and one expects that ∆α will be negligibly small. This observation can also be interpreted such that the result from the Linear Template Fit and an alternative result from a minimization of χ 2 O(2n) are consistent. If the reference points are in the vicinity ofα, this is to be expected because of the underlying statistical model of (log-)normal distributed estimators. In contrast, if ∆α becomes large, it may well be that in such a case not even the reported uncertainties in the best estimator, σα, can be considered to be normal distributed.

The Quadratic Template Fit
Despite all the considerations and cross-checks as discussed in the previous sections, in some problems the linearized model may simply be an inaccurate approximation of the true model (cf. eq. (27)). Reasons could be that the templates cannot be generated in a sufficiently close vicinity around the best estimator, perhaps for technical reasons, or in multivariate problems some parameters are highly nonlinear, a dimension is poorly constrained and the reference points span a large range, or interference terms between the parameters are non-negligible. In such cases, the quantitiesα −α, σ (m) α , or ∆α (eqs. (88), (91) and (95)) become non-negligible when they are calculated for second-degree polynomials n = 2.
An improved approximation of the model is obtained when using a second order approximation, which may also include interference terms. The equations are summarized in Appendix C (eqs. (103)- (105)). Using such a second-degree polynomial modelŷ(α) that includes terms O(α k ), O(α 2 k ), and O(α k 1 α k 2 ) in the optimization problem results in a nonlinear least-squares problem. Therefore, an iterative algorithm is required to obtain the best estimator.
A robust algorithm, denoted as Quadratic Template Fit, to obtain the best estimator and to perform a full error propagation is defined as 1. The Linear Template Fit is performed to obtain a first estimatorα (0) , for example using eq. (32); 2. The Newton algorithm [5,50,51,87], eq. (95), is employed with a few m iterations in order to obtain improved best estimatorsα (m) =α (m−1) +∆α (m) and for which the Hesse matrix H is directly analytically calculable; 3. The best estimator and the error calculation are obtained using the linearized approximations form andM (see eqs. (86) or (109)) in the equations of the Linear Template Fit.
The first step provides the starting point for the Newton algorithm, and since it is obtained with the Linear Template Fit it is already close to the minimum. The second step provides a fast and stable convergence since the starting point is already in the vicinity of the minimum, the Hesse matrix is commonly positive definite and is analytically exactly calculable, 9 , as well as the gradient vector, and the Newton method has good convergence for nearly quadratic functions [5,50,51], which is the case for χ 2 O(4) . Although the algorithm is expected to converge quickly after a very few iterations, several stopping criteria are thinkable, for example when EDM |∆α| or ∆α (m) ≈ ∆α (m−1) . The last step provides an even more improved best estimator, but mainly gives access to the equations of the error propagation. Note that in the first step, the true model λ(α) is linearly approximated, and in the latter steps, by quadratic functions.
When applying the Quadratic Template Fit to Example 2, and using m = 2 Newton-iterations, the best estimator for the mean is found to be 169.90 ± 0.44 and for the variance it is 6.23 ± 0.32. Both values are very close to the best estimators from the Linear Template Fit, see section 6. The EDM ∆α andα −α are found to be smaller than 10 −4 for both parameters, and σ (m) α is equally small.

Example 3: the Quadratic Template Fit
As an application of the Quadratic Template Fit, an example is constructed where the model is nonlinear and the templates purposefully span a (too) large range. Therefore, the criteria for a valid application of the Linear Template Fit as discussed in section 10 are mainly not fulfilled.
As in Example 1 and 2 (sections 4 and 6), the model is considered to be a normal distribution, and the same pseudo-data as in the previous examples are also used. In contrast to Examples 1 and 2, only the variance of the normal distribution shall be determined. Six templates are generated similarly to Examples 1 and 2, with variances between 3 and 8, and each template is generated using 4 · 10 6 events. The pseudo-data and the templates are displayed in figure 6. Recall that the pseudo-data were generated with a variance of 6.2 (due to statistical fluctuations, the true variance is about 6.3).
In this example, the Linear Template Fit results in a best estimator of 6.61 ± 0.32, and is thus not quite consistent with the expectation. The EDM is comparably large, with a value of −0.27 , and σ  (93) are not well fulfilled. In figure 6 (right), the parameter dependence in some selected bins is displayed, and it is clearly seen that the templates are not described by a linear function. It is also observed from the χ 2 values of the individual templates that these cannot be described by a parabola as would be the case for a linear problem, as displayed in figure 7.
The Quadratic Template Fit, in contrast, results in a best estimator of 6.36 ± 0.40 and is thus in excellent agreement with the expectation. This is because the algorithm employs a second-degree polynomial in each bin to represent the model. This provides an adequate representation of the parameter dependence even of a nonlinear model, as also seen in figure 6 (right). A linearization at the value of the best estimator is used only for linear error propagation, and appears to be appropriate given the size of the final uncertainties, ±0.40.    14 Validation study: determination of the strong coupling constant from inclusive jet cross section data As an example application of the Linear Template Fit, the strong coupling constant in Quantum Chromodynamics (QCD), α s (m Z ), is determined from a measurement of inclusive jet cross sec-  tions in proton-proton collisions at a center-of-mass energy of √ s = 7 TeV [96]. The data were taken with the CMS detector at the LHC at CERN. Inclusive jets were measured as a function of the transverse momentum p T in five regions of absolute rapidity |y|. Altogether 133 data points are available, and 24 different sources of uncertainties are associated with the data.
The model is obtained from a calculation in next-to-leading order perturbative QCD (NLO pQCD) using the program nlojet++ [97,98] that was interfaced to the tool fastNLO [99][100][101]. The latter makes it possible to provide templates for any value of α s (m Z ) and different parameterizations of parton distribution functions of the proton (PDFs) without rerunning the calculation of the pQCD matrix elements. Multiplicative non-perturbative (NP) [102] and electroweak corrections [103] are applied to the NLO predictions.
The value of α s (m Z ) was already determined earlier from these data and in NLO accuracy in two independent analyses [104,105]. The two analyses differ in the assumption of the statistical model, the inference algorithm, and the uncertainties considered in the fit. This provides a comprehensive testing ground to compare the results from the Linear Template Fit.
In Ref. [104], the value of α s (m Z ) is determined from the minimum of a χ 2 -parabola, where the χ 2 is derived from normal distributed probability distribution functions. In order to avoid a bias in that procedure [64,68,69], some uncertainty components are treated as multiplicative [104], which means that the covariance matrix is calculated from relative uncertainties that are multiplied with theory predictions. We chose predictions obtained with α s (m Z ) = 0.116 for that purpose, and the total covariance matrix in χ 2 is calculated as described in Ref. [104].
The templates are generated in the range 0.112 ≤ α s (m Z ) ≤ 0.121 in steps of 0.001 using the MSTW PDF set [106]. In this methodology, the model predictions are not available for continuous values of α s (m Z ), since the PDFs als exhibit an α s dependence, but are provided only in a limited range and for discrete α s (m Z ) values [106]. Therefore, this kind of α s (m Z ) inference is a typical use case for the Linear Template Fit The templates and the linear model are displayed for some selected bins in figure 8.
The results from the Linear Template Fit (eq. (32)), the Quadratic Template Fit (section 12), and the parabolic fit (section 10.3) are compared with the published results from CMS in table 1. An illustration of the χ 2 values is displayed in figure 9 (left). The best estimator for α s (m Z ), its uncertainties, and the value of χ 2 are in good agreement among each other, and with the published CMS values [104], and differences are only in the rounding digit. It is also found that the total "fit" uncertainty is reproduced, which comprises the experimental and the PDF , and the last column shows the χ 2 /n dof values. The difference between the Linear and the Quadratic Template Fit is only 0.000017, but appears larger due to a rounding effect. Differences to the CMS result may also be due to numerical limitations of the input data, or how the PDF uncertainties are symmetrized. The total uncertainties that are considered in the fit sum to ±0.0018 in all cases.  = ±0.0018. The small differences in the individual uncertainty components between Ref. [104] and the Linear Template Fit are likely because the error breakdown in Ref. [104] is only approximate, whereas the Linear Template Fit provides an analytic error propagation at the value of the best estimator. In fact, all 24 uncertainty components of the data, the 20 symmetrized PDF uncertainties, and the NP uncertainties are propagated separately (cf. section 8) and are displayed in figure 10. It is observed that the largest individual experimental uncertainty source is the luminosity uncertainty and the statistical uncertainty. The nuisance parameters are quite similar to those from (yet another) fit in Ref. [104].  Figure 10: Left: nuisance parameters of the systematic uncertainties in the α s (m Z )-fit to CMS inclusive jet cross section data. Details on these uncertainties are found in Refs. [96,104]. The gray value(s) are not included in the χ 2 computation but are propagated separately. Middle: size of the individual uncertainty components to the total uncertainty of the best estimator, α s (m Z ), in the Linear Template Fit to CMS inclusive jet cross section data. The color indicates the sign of the uncertainty, which, however, is of relevance only for multivariate fits. Right: partial χ 2 from the individual uncertainty components.
The uncertainty component with the largest contribution to the total uncertainty stems from the PDFs and the luminosity uncertainty. The statistical and uncorrelated uncertainties have the largest contribution toχ 2 . The NP uncertainties are not included in the nominal χ 2 computation and are propagated separately. The EDM of the Linear Template Fit is 7 · 10 −5 and thus smaller than the rounding digit of the result.  In Ref. [105], as an intermediate result, the value of α s (m Z ) was determined from CMS inclusive jet cross sections as well. In contrast to Ref. [104], the estimators are assumed to follow a log-normal distribution, and some theoretical uncertainties are further considered in the χ 2 expression, which was then minimized with Minuit [4]. The NNPDF3.0 PDF sets [107] are used for the NLO predictions and provide a covariance matrix with PDF uncertainties. This inference provides a useful testing ground for the Linear Template Fit with relative uncertainties (sec-tion 7). Templates are generated in the range 0.112 ≤ α s (m Z ) ≤ 0.121, and the same uncertainty components as in Ref. [105] are considered. Some selected bins are displayed in figure 11. The results from the Linear Template Fit (eq. (41)), the Quadratic Template Fit, and the parabolic fit are compared with the published results in table 2, and an illustration of χ 2 values, which here are calculated from normal-distributed relative uncertainties (eq. (39)), is displayed in figure 9 (right). The best estimator, the uncertainties, and the value of χ 2 are in good agreement among Ref. [105] 0.1144 ± 0.0022 (exp) ± 0.0014 (pdf) ± 0.0001 (NP) 0.81 each other and with the result from Ref. [105]. The results differ only in the rounding digit, and different sizes of the uncertainty breakdown (into (exp), (PDF), and (NP) uncertainties) may be explained since the error breakdown in Ref. [105] is only approximate, but the total uncertainty is consistently δα s = ±0.0026.
It is worth noting that due to the application of the factorization theorem [108], both the hard coefficients and the PDFs exhibit an α s sensitivity. Consequently, in the two approaches discussed above, not only is the value of α s (m Z ) determined, but the PDFs are determined simultaneously as well [109], which is realized through the inclusion of the PDF uncertainties in the Linear Template Fit, and which is commonly referred to as PDF profiling. This technique avoids a possible bias that is present when the PDFs are kept fixed [110], and it is particularly valid since the PDFs and the value of α s (m Z ) are only weakly correlated in PDF determinations, and thus their correlations are negligible [111]. The different size of the uncertainties in tables 1 and 2 is then a consequence of how the PDFs are considered in the fit. Since the PDFs themselves are determined from data as well, the ansatz from CMS also exploits, to some extent, the α s sensitivity of these further data by using the α s dependent PDF, whereas the other ansatz from Ref. [105] exploits only the jet data and their sensitivity to α s (m Z ) in the hard coefficients.
Consequently, these examples represent a comprehensive application of the Linear Template Fit, where in addition to determining the value of α s (m Z ), a PDF profiling is also performed, as well as a simultaneous constraint of the systematic uncertainties. Such constraints are obtained when the uncertainties in the nuisance parameters become smaller than unity. A PDF determination from these data alone, instead of a PDF profiling of an existent PDF, would be possible by considering the PDF uncertainties as unconstrained uncertainties in the Linear Template Fit, as described in section 8.4. Such a fit exploits the PDF parameter space of the PDF set that is used to derive the PDF uncertainties. For the given example of CMS inclusive jet cross sections, however, this is not possible, since these data do not have sufficient constraints on the separate PDF flavors.

Summary and conclusions
In this article, the equations of the Linear Template Fit were presented. The Linear Template Fit provides an analytic expression for a maximum likelihood estimator in a uni-or multivariate parameter estimation problem. The underlying statistical model is constructed from normal or log-normal probability distribution functions and refers to a maximum likelihood or a minimum χ 2 method. For parameter estimation with the Linear Template Fit, the model needs to be provided at a few reference values in the parameter(s) of interest: the templates. The multivariate Linear Template Fit can be written aŝ where Y is the template matrix, d is the data vector (random vector), W is the inverse covariance matrix, andm andM are calculated from the reference points (eq. (26)).
The equation of the Linear Template Fit is derived using two key arguments: (i) in the vicinity of the best estimator the model is linear, and (ii) the templates for different reference values are all generated in the same manner. The first precondition is commonly justified if the templates are provided within a reasonably small range around the (to be expected) best estimator. From the second, it follows that the bin-wise polynomial regression is unweighted, and thus the identical regression matrix is applicable in all bins. Several quantities to validate the applicability of the Linear Template Fit in a particular application were discussed.
If the linearity of the model is not justified, the Quadratic Template Fit is a suitable alternative algorithm for parameter estimation. It considers second-degree polynomials for the parameter dependence of the model and employs the quickly converging exact Newton minimization algorithm. For error propagation, the equations of the Linear Template Fit are applicable.
Both the Linear and the Quadratic Template Fit implicitly transform the discrete representation of the model (the templates) into analytic continuous functions using polynomial regression. These expressions themselves may also become useful in several other applications where templates are available but continuous functions would be needed.
The equations for error propagation of different sources of uncertainties were presented. The analytic nature of the Linear Template Fit allows the straightforward propagation of each uncertainty component separately, which provides additional insights into the parameter estimation analysis.
As an example application, previously published results on the determination of the strong coupling constant α s (m Z ) from inclusive jet cross section data taken at the LHC were repeated. The Linear and the Quadratic Template Fits reproduce these previously published results within the rounding digit.
In summary, key features of the Linear and the Quadratic Template Fit are its profound statistical model based on normal-or log-normal distributed probability distribution functions, its simple formulae, stable results, low computational demands, full analytic error propagation, and its simple applicability. It is believed that these template fits may become useful for statistical inference in a large variety of problems, such as performance-critical applications, and also in several fields outside of high energy particle physics.

A Notation
Letter ( Uncertainties with full bin-to-bin correlations (shifts)

S
Matrix of uncertainties s, eq. (34) The (physics)model with free parameter(s) α y(α) Best estimator of the linearized model λ(α) ≈ŷ(α) ≡ y(α;θ) l , Nuisance parameter(s) α,α k ,α The best estimator(s) of the model parameter (an MLE) l ,ˆ The best estimator(s) of the nuisance parameter(s) a A k+l vector for the best estimator and nuisance parametersâ = α α j ,α j,k ,α Reference values of the (jth) templates (for a multivariate Linear Template Fit the vectorα becomes a j × k matrix ) A template distribution represented as an i vector (or t vector), E.g., histograms of an observable i.e. a model prediction for the reference valueα y (j) = λ(α (j) ) value(s) to match the data distribution in the fit Index of the reference points; number of templates and reference points k Index of the fit parameter for multivariate Linear Template Fit; number of fit parameters l Index for uncertainties with full bin-to-bin correlations; number of systematic uncertainties t Number of entries (bins) on truth level, if a response matrix A is used n Degree of a polynomial function, or nth order expansion Table 3: Summary of the notation used. Capital letters denote a matrix, and small bold letters denote a column vector. The letters i, j, k, l, and t are indices. Throughout this article, a vector notation is used and, for instance, the vector d denotes a set of (random) variables, so d = {d i }. The same letters for indices and for the maximum value of an index are used. A subscript index denotes the entry of a vector (or matrix), and a bracketed index denotes one vector out of several others. Some letters have multiple meanings, if e.g. normal-or log-normal-distributed random variables are considered, or if nonlinear effects are discussed, while the meaning always becomes clear from the context. Occasionally, a single entry of a vector will be denoted as a bin, like it would be a histogram entry. The dot notation, "α", denotes the reference values and looks similar to a "point" of a grid, but it should not be mistakenly understood as a derivative.

B Pre-fitting the template distributions
It is tempting to fit the distributions of the templates with some well-motivated function, and then use the best estimators of the function parameters as input parameters to a (linear) template fit for the determination of a model parameter. In this common practice, it is often believed that the fitted function may better exploit the sensitivity of the shape of the template distribution to the parameter of interest and thus enhance the quality of the results.
Let us consider to fit the template distribution with a polynomial of degree n. Unless there is no particular motivation for any particular other n+1 parameter function, e.g. from the underlying physics model, a polynomial is an equally valid function as any other n+1 parameter function for that purpose. In matrix notation, the polynomial is expressed using whereż i may denote the observable value of the ith bin (bin center) of the template distribution. The best estimators for the polynomial parameters from a distribution y (j) are thuŝ θ = (P T W (j) P ) −1 W (j) P y (j) := B (j) y (j) (98) and W (j) = V −1 (j) is the inverse covariance matrix of the uncertainties of the template values. Since in the Linear Template Fit the templates are sufficiently similar (cf. section 10.1, anḋ α (j+1) ∼α (j) + ), it is appropriate to assume that the uncertainties are (approximately) equivalent for all templates and the uncertainties of the data scale by a constant factor c 2 only, probably because of a different amount of statistics. Therefore, the g-inverse B of the nth degree polynomial fit (eq. (98)) is equivalent for any of the templates and also for the data distribution, B (j) = B. Thus, the Linear Template Fit when using the n+1 polynomial function parameters becomes which after re-sorting becomes equivalent to the χ 2 equation of the Linear Template Fit but using W = P (P T WP ) −1 WP T W P (P T WP ) −1 WP .
It is therefore observed that a preceded fit of a function to the templates results only in a change of the Hesse matrix W → W . This observation is valid for all pre-fits that behave linearly, like B, and if eq. (99) holds.

C Nonlinear approximation of a multivariate model
In section 11, the nonlinear approximation of a univariate model using an nth degree regression was described; see eq. (81). When applying a first-order Taylor expansion to it, a linearized model is obtained, and from the equations of the Linear Template Fit, a best estimator for the fitting problem is found. A similar ansatz for a nonlinear model approximation of a multivariate model is straightforward. However, in the nonlinear approximation of a multivariate model, it is also of interest to consider interference terms (∼ α k 1 α k 2 ), and this will be discussed in the following. It will be shown that the subsequent linear approximation can again become input to a linear least-squares problem similar to the Linear Template Fit.
For reasons outlined in the main text, an unweighted regression is applicable, and thus the regression matrix is equivalent for every bin i. When substituting theθ estimators into eq. (103), one obtains the approximated model in every bin, and thus the full model as a second-order functionŷ(α).
Next, a first-order Taylor expansion is applied to the modelŷ(α). The expansion point is identified 10 with the best estimator of the Linear Template Fitα. For that purpose it is useful to refer to the columns of the transposed g-inverse matrix with vectors (m and multiplem) like M + T =: mm [1] The resulting terms from the first-order Taylor expansion are organized in the k × j-matrix Hence, the estimated model, similar to eq. (86), is found aŝ This model,ỹ(α) ≈ λ(α), can be used together with the equations of the linear template fit to calculate the best estimator, since it is linear in all parameters of interest.

D Calculation of the template fit matrix F
The calculation of the template fit matrix F (eq. (33)) involves the calculation of M + as well as the inversions of V and D. These calculations shall be briefly discussed in the following.
The calculation of M + includes the calculation and subsequent inversion of the sum of squares and cross products (SSCP) matrix of an extended Vandermonde, (M T M ) −1 , which is widely studied in literature (see e.g. Ref. [54]) and typically does not involve numerical problems for n = 1 (or 2).
The inverse covariance matrix W = V −1 is commonly calculable since V includes uncertainties without bin-to-bin correlations and it is thus positive definite. In fact, it needs to be invertible for a valid problem, or may even be a diagonal matrix. An efficient algorithm to prove that V is positive definite is a Cholesky decomposition, and when successful, the inverse is almost already calculated [86].
The inversion of the (k + l) × (k + l) matrix D may be more challenging, but it can be expressed using three sub-matrices as where the right side introduces some shorthand notations (for D, compare also eq. (33)). The inverse of D is then given in two variants by [114]: This requires one to calculate the inverse of only two matrices, L and D, or K and Σ, which are It is seen that these either can be inverted similarly to weighted least-squares matrices [53,115,116], or are just sums of two matrices, and the Sherman-Morrison-Woodbury formula can be applied [117,118] (see also Ref. [54]).