Prediction in regression models with continuous observations

We consider the problem of predicting values of a random process or field satisfying a linear model $y(x)=\theta^\top f(x) + \varepsilon(x)$, where errors $\varepsilon(x)$ are correlated. This is a common problem in kriging, where the case of discrete observations is standard. By focussing on the case of continuous observations, we derive expressions for the best linear unbiased predictors and their mean squared error. Our results are also applicable in the case where the derivatives of the process $y$ are available, and either a response or one of its derivatives need to be predicted. The theoretical results are illustrated by several examples in particular for the popular Mat\'{e}rn $3/2$ kernel.


Introduction
A common problem, which occurs in many different areas, most notably geostatistics (Ripley, 1991;Cressie, 1993), computer experiments (Sacks et al., 1989;Stein, 1999;Santner et al., 2003;Leatherman et al., 2017) and machine learning (Rasmussen and Williams, 2006), is to predict the response y(t 0 ) at a point t 0 ∈ R d from given responses y(t 1 ), . . ., y(t N ) at points t 1 , . . ., t N ∈ R d , where t 0 = t i for all i = 1, . . ., N .Making the prediction assuming that responses are observations of a random field is called kriging (Stein, 1999).In classical kriging, it is assumed that y is a random field of the form where f (t) ∈ R m is a vector of known regression functions, θ ∈ R m is a vector of unknown parameters and is a random field with zero mean and existing covariance kernel, say K(t, s) = E[ (t) (s)].The components of the vector-function f (t) are assumed to be linearly independent on the set of points where the observations have been made.It is well-known, see e.g.Sacks et al. (1989), that in the case of discrete observations the best linear unbiased predictor (BLUP) of y(t 0 ) has the form ŷ(t 0 ) = f (t 0 ) θBLUE + K t 0 Σ −1 (Y − X θBLUE ), (1.2) where Σ = K(t i , t j ) N i,j=1 is an N×N -matrix, K t 0 = K(t 0 , t 1 ), . . ., K(t 0 , t N ) is a vector in R N , X = (f (t 1 ), . . ., f (t N )) is an N×m-matrix, Y = (y(t 1 ), . . ., y(t N )) ∈ R N is a vector of observations and θBLUE = (X Σ −1 X) −1 X Σ −1 Y is the best linear unbiased estimator (BLUE) of θ.The BLUP satisfies the unbiased condition E[ŷ(t 0 )] = E[y(t 0 )] and minimizes the mean squared error MSE(ỹ(t 0 )) = E (y(t 0 ) − ỹ(t 0 )) 2 in the class of all linear unbiased predictors ỹ(t 0 ); its mean squared error is In the present paper, we generalize the predictor (1.2) to the case of continuous observations of the response including possibly derivatives and prediction of derivatives and weighted averages of y(t).We shall separately consider the cases where the observation region is an interval or a product set (in particular, square).An important observation concerning construction of the BLUPs at different points is the fact that there is a considerable common part related to the use of the same BLUE.This could lead to significant computational savings relative to independent construction of the BLUPs.This observation extend to the cases when the observations are taken in R d and when derivatives are also used for predictions.
The remaining part of this paper is organized as follows.In Section 2 we consider the BLUPs when we observe the process or field only.In Section 3 we study the BLUPs for either process values or one of its derivatives when we observe the process (or field) with derivatives.In Section 4 we provide proofs of the main results and in an Appendix we give more illustrating examples of the BLUPs for particular kernels.
2 Prediction without derivatives

Prediction at a point
Assume T ⊂ R d and consider prediction at a point t 0 ∈ T for a response given by the model (1.1),where the observations for all t ∈ T are available.
The vector-function f : T → R m is assumed to contain functions which are bounded, integrable, smooth enough and linearly independent on T ; the covariance kernel K(t, s) = E[ (t) (s)] is assumed strictly positive definite.
A general linear predictor of y(t 0 ) can be defined as where Q is a signed measure defined on the Borel field of T .This predictor The mean squared error (MSE) of ŷQ (t 0 ) is given by The best linear unbiased predictor (BLUP) ŷQ * (t 0 ) of y(t 0 ) minimizes the mean squared error MSE(ŷ Q (t 0 )) in the set of all linear unbiased predictors.
The corresponding signed measure Q * will be called BLUP measure throughout this paper.Unlike the case of discrete observations, the BLUP measure does not have to exist for continuous observations.
Assumption A.
(1) The best linear unbiased estimator (BLUE) θBLUE = T y(t)G(dt) exists in the model (1.1),where G(dt) is some signed vector-measure on T , (2) There exists a signed measure ζ t 0 (dt) which satisfies the equation (2.1) Assumption A will be discussed in Section 2.2 below.We continue with a general statement establishing the existence and explicit form of the BLUP.
Theorem 2.1.If Assumption A holds then the BLUP measure Q * exists and is given by where the signed measure ζ t 0 (dt) satisfies (2.1) and c = f The MSE of the corresponding BLUP ŷQ * (t 0 ) is given by where This theorem is a particular case of a more general Theorem 2.2, which considers the problem of predicting an integral of the response.A few examples illustrating applications of Theorem 2.1 for particular kernels are given in the Appendix.We can interpret the construction of the BLUP at t 0 in model (1.1) as the following two-stage algorithm.At stage 1, we use the BLUE θBLUE = T y(t)G(dt) for estimating θ.At stage 2, we compute the BLUP in the model ỹ which is a model with new error process and no trend.Straightforwardly, the covariance function of the process ỹ(t) is calculated as It then follows from Theorem 2.1 applied to the new model that the signed measure Q * (dt) satisfies the equation From (2.3) in the new model, we obtain an alternative representation for the MSE of the BLUP ŷQ * (t 0 ); that is,

Validity of Assumption A
If T is a discrete set then Assumption A is satisfied for any strictly positive definite covariance kernel.
In general, the main part of Assumption A is the existence of the BLUE of the parameter θ, which has been clarified by Dette et al. (2019).According to their Theorem 2.2, the BLUE of θ exists if and only if there exists a signed vector-measure G = (G 1 , . . ., G m ) on T , such that the m×m-matrix T f (t)G (dt) is the identity matrix and holds for all s ∈ T and some m × m-matrix D. In this case, θ BLU E = T Y (t)G(dt) and D is the covariance matrix of θ BLU E ; this matrix does not have to be non-degenerate.Let H K be the reproducing kernel Hilbert space (RKHS) associated with kernel K.If the function K(t 0 , s) belongs to H K , then the second part of Assumption A is also satisfied; that is, there exists a measure ζ t 0 (dt) satisfying the equation (2.1).This follows from results of Parzen (1961).Note that the function K(t 0 , s) does not automatically belong to H K since in general t 0 / ∈ T .If all components of f belong to H K then Assumption A holds and the matrix D in (2.4) is non-degenerate; see Dette et al. (2019) and Parzen (1961).
If the matrix D in Theorem 2.1 is non-degenerate then this theorem can be reformulated in the following form which is practically more convenient as there is no unbiasedness condition to check.
Proposition 2.1.Assume that there exists a signed measure ζ t 0 (dt) satisfying (2.1) and a signed vector-measure ζ(dt) satisfying equation (2.5) Explicit forms of the BLUP for some kernels are given in the Appendix.

Matching expressions in the case of discrete observations
Let us show that in the case of discrete observations the form of the BLUP of Proposition 2.1 coincides with the standard form (1.2).Assume that T is finite, say, T = {t 1 , . . ., t N }.In this case, equation (2.5) has the form Σζ = X, where ζ is and N×m-matrix.Since the kernel K is strictly positive definite, this gives ζ = Σ −1 X, and we also obtain Expanding the expression for Q * we obtain The classical form of the BLUP is given by (1.2), which can be written as and coincides with (2.6).

Predicting an average with respect to a measure
Assume that we have a realization of a random field (1.1) observed for all t ∈ T ⊂ R d .Consider the prediction problem of Z = S y(t)ν(dt), where ν(dt) is some (signed) measure on the Borel field of R d with support S.
Assume that S \ T = ∅ (otherwise, if S ⊆ T , the problem is trivial as we observe the full trajectory {y(t) | t ∈ T }).We interpret Z as a weighted average of the true process values on S. The general linear predictor can be defined as where Q is a signed measure on the Borel field of T .The estimator ẐQ is unbiased if and only if (2.8) The BLUP signed measure among all signed measure Q satisfying the unbiasedness condition (2.8).Assumption A and Theorem 2.1 generalize to the following.
Assumption A .The BLUE θBLUE exists and there exists a signed measure ζ ν (dt) which satisfies the equation (2.9) Theorem 2.2.Suppose that Assumption A holds and let D be the covariance matrix of θBLUE = T y(t)G(dt).Then the BLUP measure exists and is given by where ζ ν (dt) is the signed measure satisfying (2.9) and c The proof of Theorem 2.2 is given in Section 4 and contains the proof of Theorem 2.1 as a special case.Note also that the BLUP ẐQ * is simply the average (with respect to the measure ν) of the BLUPs at points s ∈ S.

Location scale model on a product set
In this section we consider the location scale model and assume that the kernel K of the random field ε(t) is given by We also assume that the set T ⊂ R 2 is a product-set of the form T = T 1 × T 2 , where T 1 and T 2 are Borel subsets of R (in particular, these sets could be discrete or continuous).The kernel K of the product form (2.12) is called separable; such kernels are frequently used in modelling of spatial-temporal structures because they offer enormous computational benefits, including rapid fitting and simple extensions of many techniques from time series and classical geostatistics [see Gneiting et al. (2007) or Fuentes (2006) among many others].Assume that Assumption A holds for two one-dimensional models Note that equation (2.9) can be rewritten as A solution of the above equation has the form Finally, the BLUP at the point T where The measure G(dt) is the BLUE measure and does not depend on T 1 , T 2 .On the other hand, the measure ζ T (dt) and constant c do depend on Example 2.1.Consider the case of T = [0, 1] 2 and the exponential kernel In view of (Dette et al., 2019, Sect 3.4), It follows from (Dette et al., 2019, Sect 3.4) that this equation is satisfied by the measure Similar formulas can be obtained for 0 < T 1 < 1 and T 1 ≥ 1.
In Table 1 we show the square root of the MSE of the BLUP for the equidistant design supported at points (i/(N − 1), j/(N − 1)), i, j = 0, 1, . . ., N − 1.We can see that the MSE for the design with N = 4 is already rather close to the MSE for the design with large N and the design with continuous observations.In Figure 1 we show the plot of the square root of the MSE as a function of a prediction point for points (T 1 , T 2 ) ∈ [0.5, 2] × [0.5, 2].As the design is symmetric with respect to the point (0.5, 0.5), the plot of the MSE is also symmetric with respect to this point.Consequently only the upper quadrant is depicted in the figure.We observe that the MSE tends to zero when the prediction point tends to one of design points and the MSE is almost constant if the prediction point is far enough from the observation domain.
Remark 2.1.The results of this section can be easily generalized to the case of d > 2 variables and, moreover, to the model y , where f (i) are some functions on T i ; i = 1, . . ., d.

Prediction with derivatives
In this section we consider prediction problems, where the trajectory y in model (1.1) is differentiable (in the mean-square sense) and derivatives of the process (or field) y are available.In Section 3.1 we discuss the discrete case of a once-differentiable process and in Section 3.2 we consider the general case of a q times differentiable (in the mean-square sense) process y satisfying the model (1.1).For the process y to be q times differentiable, the covariance kernel K and vector-function f in (1.1) have to be q times differentiable, which is one of the assumptions in Section 3.2.In Section 3.3 we consider the prediction problem for the location scale model on a two-dimensional product set in the case where the kernel K of the random field ε has the product form (2.12).The results of this section can be easily generalized to the case of d > 2 variables.

Discrete case
Consider the model (1.1),where the kernel K and vector-function f are differentiable and one can observe the process y and its derivative at N different points t 1 , . . ., t N ∈ R. In this case, the BLUP of y(t 0 ) has the form where Y 2N = (y(t 1 ), . . ., y(t N ), y (t 1 ), . . ., y are N ×N -matrices, For more general cases of prediction of processes and fields with derivatives observed at a finite number of points, see (Morris et al., 1993;Näther and Šimák, 2003).

Continuous observations on an interval
Consider the continuous-time model (1.1),where the error process has a q times differentiable covariance kernel K(t, s).We also assume that the vector-function f is q times differentiable and therefore the response y is q times differentiable as well.Suppose we observe realization y(t) = y (0) (t) for t ∈ T 0 ⊂ R and assume that observations of the derivatives y (i) (t) are also available for all t ∈ T i , where T i ⊂ R; i = 1, . . ., q.The sets T i (i = 0, 1, . . ., q) do not have to be the same; some of these sets (but not all) can even be empty.If at least one of the sets T i contains an interval then we speak of a problem with continuous observations.Consider the problem of prediction of y (p) (t 0 ), the p-th derivative of y at a point t 0 ∈ T p , where 0 ≤ p ≤ q.
Assumption A .
(1) The best linear unbiased estimator (BLUE) θBLUE = G(dt)Y(t) exists in the model (1.1),where G(dt) is some signed m×(q + 1)-matrix measure (that is, the j-th column of G(dt) is a signed vector measure defined on T j ); (2) There exists a signed vector-measure ζ p,t 0 (dt) (of size q+1) which satisfies the equation where K(t, s) = ∂ j K(t,s) ∂s j q j=0 is a (q + 1)-dimensional vector.The problem of existence and construction of the BLUE in the continuous model with derivatives is discussed in (Dette et al., 2019).A general statement establishing the existence and explicit form of the BLUP is as follows.The proof is given in Section 4.
Theorem 3.1.If Assumption A holds, then the BLUP measure Q * exists and is given by where the signed measure ζ p,t 0 (dt) satisfies (3.3) and The MSE of the BLUP ŷp,Q * (t 0 ) is given by where is the covariance matrix of θBLUE = G(dt)Y(t).
Example 3.1.As a particular case of prediction in the model (1.1), in this example we consider the problem of predicting a value of a process (so that p = 0) with Matérn 3/2 covariance kernel K(t, s) = (1 + λ|t − s|)e −λ|t−s| ; this kernel is once differentiable and is very popular in practice, see e.g.(Rasmussen and Williams, 2006).We assume that the vector-function f in the model (1.1) is 4 times differentiable and that the process y and its derivative y are observed on an interval [A, B] (so that T 0 = T 1 = [A, B] in the general statements).As shown in (Dette et al., 2019), for this kernel the BLUE measure G(dt) can be expressed in terms of the signed matrix-measure where Then using (Dette et al., 2019, Sect. 3.4) we obtain ζ 0,t 0 (dt) = (ζ 0,t 0 ,0 (dt), ζ 0,t 0 ,1 (dt)) with where for t 0 > B we have z t 0 ,A = 0, z t 0 ,1,A = 0, z t 0 (t) = 0, We also obtain the matrix defined in (Dette et al., 2019, Lem.2.1) from the condition of unbiasedness.
If D, the covariance matrix of the BLUE is non-degenerate, then D = C −1 .
In the present case, The BLUE-defining measure G(dt) is expressed through the measures ζ(dt) and the matrix C by G(dt) = C −1 ζ(dt).The BLUP measure for process prediction is given by For the location scale model with f (t) = 1, we obtain ) and, therefore, a BLUP measure for this model is given by Therefore, the corresponding BLUP is given Table 2 gives values of the square root of the MSE of the BLUP in the location scale model at the point t 0 = 2 for three families of designs, where [A, B] = [0, 1].We observe that observations of derivatives inside the interval do not bring any improvement to the BLUP which can be explained by the fact that the weights of the continuous BLUP at derivatives at points in the interior of the interval [A, B] are 0.

Location scale model on a product set
Similarly to Section 2.5, we consider the location scale model (2.11) defined on the product set T = T 1 ×T 2 (where T 1 and T 2 are Borel sets in R) with the kernel K of the random field ε having the product form (2.12).The results of this section (as of Section 2.5) can be easily generalized to the case of d > 2 variables.Assume that Assumption A with q = 1 is satisfied for two one-dimensional models (2.13).For this assumption to hold, the process {y(t 1 , t 2 ) | (t 1 , t 2 ) ∈ T } has to be once differentiable with respect to t 1 and t 2 .Let the measures G 0,i (du) and G 1,i (du) define the BLUE in the univariate models (2.13); i = 1, 2. In this case, results of (Dette et al., 2019) imply that the BLUE of θ in the model (2.11) has the form θ = T Y (t) G(dt), where and Assume we want to predict y(T ) at a point T = (T 1 , T 2 ) / ∈ T .The analogue of the equation (2.9) is given by where Observing the product-form of expressions, we directly obtain that a solution of (3.5) has the form where measures ζ 0,T i (dt) and ζ 1,T i (dt) for i = 1, 2 satisfy the equation Finally, the BLUP at the point The MSE of the BLUP is given by where D is the variance of the BLUE.
Example 3.2.Consider a location scale model on a square [0, 1] 2 with a product covariance Matérn 3/2 kernel, that is where Define the measures In view of (Dette et al., 2019, Sect. 3.4), defines a BLUE in the model y(u) = θ + ε(u) with u ∈ [0, 1] and covariance kernel (3.6).Additionally, from (Dette et al., 2019, Sect. 3.4) we have We now investigate the performance of five discrete designs: (i) the design ξ N 2 ,0,0,0 , where we observe process y on an N ×N grid; (ii) the design ξ N 2 ,4,4,4 , where we observe process y on an N ×N grid and additionally derivatives ∂y ∂t 1 , ∂y ∂t 2 , ∂ 2 y ∂t 1 ∂t 2 at 4 corners of [0, 1] 2 ; (iii) the design ξ N 2 ,N 2 ,N 2 ,0 , where we observe process y and derivatives ∂y ∂t 1 , ∂y ∂t 2 on an N ×N grid; (iv) the design ξ N 2 ,N 2 ,N 2 ,0 , where we observe process y on an N×N grid and derivatives ∂y ∂t 1 , ∂y ∂t 2 , ∂ 2 y ∂t 1 ∂t 2 at 4N − 4 equidistant points on the boundary of [0, 1] 2 ; (v) the design ξ N 2 ,N 2 ,N 2 ,N 2 , where we observe process y and derivatives ∂y ∂t 1 , ∂y ∂t 2 , ∂ 2 y ∂t 1 ∂t 2 at N ×N equidistant points on an N ×N grid.The results are depicted in Table 3, which shows the square root of the MSE of predictions at the point (2, 2) and (0.5, 2) for different sample sizes.For any given N ≥ 2, the MSE for prediction outside the square [0, 1] 2 for the designs ξ N 2 ,N 2 ,N 2 ,N 2 and ξ N 2 ,4N −4,4N −4,4N −4 are exactly the same.This is related to the fact that the BLUP weights associated with all derivatives at interior points in [0, 1] 2 of the designs ξ N 2 ,N 2 ,N 2 ,N 2 are all 0. This means that for optimal prediction of y(t 0 ) at a point t 0 outside the observation region one needs the design guaranteeing the optimal BLUE plus the observations of y(t) and y (t) at points t closest to t 0 .Note that the results of (Dette et al., 2019, Sect. 3.4) imply that the continuous optimal design for the BLUE does not use values of any derivatives of the process (or field for the product-covariance model) in the interior of T .The observation above is consistent with our other numerical experience which have shown that the BLUP at a point t 0 ∈ (0, 1) × (0, 1) constructed from the design ξ N 2 ,N 2 ,N 2 ,N 2 has vanishing weights at all derivatives of interior points of [0, 1] 2 with five exceptions: the center 0 and the four points which are closest to t 0 in the L ∞ (Manhattan) metric.To start, we proof the following lemma.
Lemma 4.1.The mean squared error [relative to the true process value] of any unbiased estimator ẐQ = T y(t)Q(dt) is given by Let us now prove the main result.We will show that MSE( ẐQ ) ≥ MSE( ẐQ * ), where ẐQ is any linear unbiased estimator of the from (2.7) and ẐQ * is defined by the measure (2.10).Define R(dt) = Q(dt) − Q * (dt).From the condition of unbiasedness for Q(dt) and Q * (dt), we have T f (t)R(dt) = 0 m×1 .We obtain where the inequality follows from nonnegative definiteness of the covariance kernel and the last equality follows from the unbiasedness condition f (t)R(dt) = 0.
4.2 Proof of Theorem 3.1 For simplicity, assume p = 0; the case p > 0 can be dealt with analogously.First, we derive the following lemma.

Appendix: more examples of predicting process values
In the appendix, we give further examples of prediction of values of specific random processes y(t), which follows the model (1.1) and observed for all t ∈ T = [A, B].In Section 5.1, we illustrate application of Proposition 2.1 and in Section 5.2 we give an example of application of Theorem 3.1.In the example of Section 5.2 we consider the integrated Brownian motion process, which is a once differentiable random process, and we assume that in addition to values of y(t), the values of the derivative of y(t) are also available.As in the main body of the paper, the components of the vector-function f (t) in (1.1) are assumed to be smooth enough (for all formulas to make sense) and linearly independent on T .

General Markovian process
Consider the prediction of the random process (1.1) with T = [A, B] and the Markovian kernel K(t, s) = u(t)v(s) for t ≤ s, where u(•) and v(•) are twice differentiable positive functions such that q(t) = u(t)/v(t) is monotonically increasing.As shown in (Dette et al., 2019, Sect. 2.6), a solution of the equation , where ψ denotes a derivative of a function ψ, the vector-function h(•) is defined by h(t) = f (t)/v(t).
Then we obtain ζ t 0 (dt) = z 0A δ A (dt) + z 0B δ B (dt) with The BLUP measure is given by In Table 4 we give values of the square root of the MSE of the BLUP at the point t 0 = 2 for the N -point equidistant design in the location scale model on the interval [0, 1] and the OU kernel with λ = 2. From this table, we can see that one does not need many points to get almost optimal prediction: indeed, the MSE for designs with N ≥ 4 is very close to the MSE for the continuous design.Similar results have been observed for other points t 0 and other Markovian kernels.

Prediction when the error process is integrated Brownian motion
Consider the prediction of the random process (1.1) with T = [A, B], the 4 times differentiable vector of regression functions f (t) and the kernel of the integrated Brownian motion defined by K(t, s) = min(t, s) 2 (3 max(t, s) − min(t, s))/6.

Figure 1 :
Figure 1: The square root of the MSE of the BLUP for the N × N -point equidistant design with N = 3 (left) and N = 4 (right) and the exponential kernel with λ = 2.

Table 1 :
The square root of the MSE of the BLUP at several points for the N×N -point equidistant design in the location scale model on the square [0, 1] 2 and the exponential kernel with λ = 2.In the case N = ∞ we provide the MSE for continuous observations.

Table 2 :
The square root of the MSE of the BLUP at the point t 0 = 2 for different designs.(i) the design ξ N,0 observing the process at N -point equidistant points, (ii) the design ξ N,2 observing the process at N -point equidistant points and the derivative at two boundary points.(iii) the design ξ N,N observing the process and derivative at N -point equidistant points.The model is the location scale model on the interval [0, 1] and the covariance kernel of the error process is given by the Matérn 3/2 kernel with λ = 2.For continuous observations the square root of the BLUB is given by √ MSE = 0.9985569896.

Table 3 :
The square root of the MSE of the BLUP at the point (2, 2) (upper part) and the point (0.5, 2) (lower part) for several designs in the location scale model on the square [0, 1] 2 with Matérn 3/2 product-kernel (λ = 2).

Table 4 :
The square root of the MSE of the BLUP at the point t 0 = 2 for the N -point equidistant design in the location scale model on the interval [0, 1] and the OU kernel with λ = 2; for the continuous design √ MSE = 1.164262.