
14.1 Principles of Functional Linear Regression Analyses

The general functional linear regression model with scalar response (Y) and one functional covariate (x(·)) is defined by

$$ Y=\mu +{\int}_0^Tx(t)\beta (t) dt+\epsilon, $$
(14.1)

where x(t) and β(t) are a centered functional covariate and the functional regression coefficient, respectively, and ϵ is a random error often assumed to have a normal distribution with mean 0 and variance σ2. Functional regression replaces the linear predictor of a standard linear regression model with the integral of the product of a coefficient function β(t) and the centered covariate x(t), which corresponds to a continuous underlying process.

Determining the infinite-dimensional coefficient function β(t) from a finite number of observations of model (14.1) is a very difficult task. Indeed, it is almost always possible to find a function β(t) that satisfies the model with an error equal to 0, and there is an infinite number of such functions that give the same predictions (Ramsay et al. 2009). There are several procedures to deal with this problem (Cardot and Sarda 2006); one of them is based on basis expansion (Fourier, B-splines, etc.) and will be adopted and described here.

A basis expansion solution is obtained by first representing the beta coefficient function β(t) as

$$ \beta (t)={\sum}_{l=1}^{L_1}{\beta}_l{\phi}_l(t), $$
(14.2)

where ϕl(·), l = 1, …, L1, is a collection of functions corresponding to the first L1 elements of a basis for a function space, and βl are constants that depend on the function to be represented (Ramsay et al. 2009). Then, by assuming this form for β(t), model (14.1) can be expressed as

$$ Y=\mu +{\sum}_{l=1}^{L_1}{\beta}_l{\int}_0^Tx(t){\phi}_l(t) dt+\epsilon =\mu +{\boldsymbol{x}}^{\mathrm{T}}{\boldsymbol{\beta}}_0+\epsilon $$
$$ ={\boldsymbol{x}}^{\ast \mathrm{T}}\boldsymbol{\beta} +\epsilon, $$
(14.3)

where x∗ = [1, xT]T, \( \boldsymbol{x}={\left[{x}_1,\dots, {x}_{L_1}\right]}^{\mathrm{T}} \), \( {x}_l={\int}_0^Tx(t){\phi}_l(t) dt \), l = 1, …, L1, \( {\boldsymbol{\beta}}_0={\left[{\beta}_1,\dots, {\beta}_{L_1}\right]}^{\mathrm{T}} \), and \( \boldsymbol{\beta} ={\left[\mu, {\boldsymbol{\beta}}_0^{\mathrm{T}}\right]}^{\mathrm{T}} \). So, if yi, i = 1, …, n, are independent observations of model (14.1), corresponding to covariate functions xi(·), i = 1, …, n, a basis expansion solution for the beta coefficient function is obtained by estimating the parameters involved in model (14.3) and then substituting \( \hat{\boldsymbol{\beta}}={\left[\hat{\mu},{\hat{\beta}}_1,\dots, {\hat{\beta}}_{L_1}\right]}^{\mathrm{T}} \) in (14.2) to obtain a basis-based estimate of β(t):

$$ \hat{\beta}(t)={\sum}_{l=1}^{L_1}{\hat{\beta}}_l{\phi}_l(t). $$

When smoothing of the coefficient function is desired, one way to take more control of this is by using a roughness penalty, which, combined with a high-dimensional basis, reduces the risk of missing some important features or of capturing extraneous ones (Ramsay et al. 2009). However, sometimes we can obtain good results without resorting to this, if the number of basis functions is smaller than the number of individuals in the sample (Ramsay et al. 2009).

Assuming that the random errors are independently and identically distributed normal random variables with mean 0 and variance σ2, \( {\epsilon}_1,\dots, {\epsilon}_n\sim iid\ N\left(0,{\sigma}^2\right) \), then \( {Y}_i=\mu +{\sum}_{l=1}^{L_1}{x}_{il}{\beta}_l+{\epsilon}_i\sim N\left(\mu +{\sum}_{l=1}^{L_1}{x}_{il}{\beta}_l,{\sigma}^2\right) \), with \( {x}_{il}={\int}_0^T{x}_i(t){\phi}_l(t) dt \), i = 1, …, n, l = 1, …, L1. So, the maximum likelihood estimates of the parameters β and σ2 are given by

$$ \hat{\boldsymbol{\beta}}={\left({\boldsymbol{X}}^{\ast \mathrm{T}}{\boldsymbol{X}}^{\ast}\right)}^{-\mathbf{1}}{\boldsymbol{X}}^{\ast \mathrm{T}}\boldsymbol{y} $$
(14.4)
$$ {\hat{\sigma}}^2=\frac{1}{n}{\left(\boldsymbol{y}-{\boldsymbol{X}}^{\ast}\hat{\boldsymbol{\beta}}\right)}^{\mathrm{T}}\left(\boldsymbol{y}-{\boldsymbol{X}}^{\ast}\hat{\boldsymbol{\beta}}\right), $$
(14.5)

where X∗ = [1n X], X = [x1, …, xn]T, X∗ is assumed to be of full column rank (n > L1), 1n is a vector of dimension n × 1 with all its entries equal to 1, and \( {\boldsymbol{x}}_i={\left[{x}_{i1},\dots, {x}_{iL_1}\right]}^{\mathrm{T}},i=1,\dots, n. \)
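As a quick numerical check, (14.4) and (14.5) can be computed directly in R. In the following minimal sketch, Xs and y are illustrative names for an already constructed matrix X∗ = [1n X] and the corresponding response vector:

beta_hat = solve(t(Xs)%*%Xs)%*%t(Xs)%*%y      #Equation (14.4)
res = y - Xs%*%beta_hat
sigma2_hat = sum(res^2)/length(y)             #Equation (14.5), the ML estimate
#lm(y ~ 0 + Xs) returns the same beta_hat (its sigma estimate uses n - L1 - 1)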

However, in practice, the functional covariate is unknown and not continuously observed. Usually, it is only measured at a finite number of points t1 < t2 < … < tm in time or in another domain. So, to complete the solution described before, the usual approach is also to assume that each covariate function can be represented as a linear combination of a set of basis functions (ψo(·), o = 1, …, L2):

$$ {x}_i(t)={\sum}_{o=1}^{L_2}{c}_{io}{\psi}_o(t), $$
(14.6)

where cio, o = 1, …, L2, are constants to be determined for each observation, i = 1, …, n. Usually, these are determined by least squares; assuming that all curves were observed at the same time points, they can be computed as

$$ {\hat{\boldsymbol{c}}}_i={\left[{\hat{c}}_{i1},\dots, {\hat{c}}_{i{L}_2}\right]}^{\mathrm{T}}={\left({\boldsymbol{\Psi}}^{\mathrm{T}}\boldsymbol{\Psi} \right)}^{-1}{\boldsymbol{\Psi}}^{\mathrm{T}}{\boldsymbol{x}}_{\boldsymbol{i}}\left(\boldsymbol{t}\right), $$
(14.7)

where Ψ is a matrix of dimension m × L2 given by

$$ \boldsymbol{\Psi} =\left[\begin{array}{ccc}{\psi}_1\left({t}_1\right)& \cdots & {\psi}_{L_2}\left({t}_1\right)\\ {}{\psi}_1\left({t}_2\right)& \cdots & {\psi}_{L_2}\left({t}_2\right)\\ {}\vdots & \ddots & \vdots \\ {}{\psi}_1\left({t}_m\right)& \cdots & {\psi}_{L_2}\left({t}_m\right)\end{array}\right] $$
(14.8)

and xi(t) = [xi(t1), …, xi(tm)]T is the vector with the actual values where the covariate curve of individual i was observed. With this, the elements of \( {\boldsymbol{x}}_i={\left[{x}_{i1},\dots, {x}_{i{L}_1}\right]}^{\mathrm{T}} \), \( {x}_{il}={\int}_0^T{x}_i(t){\phi}_l(t) dt, \) can be re-expressed as \( {x}_{il}={\int}_0^T{x}_i(t){\phi}_l(t) dt={\sum}_{o=1}^{L_2}{\hat{c}}_{io}{\int}_0^T{\psi}_o(t){\phi}_l(t) dt={\boldsymbol{x}}_l^{\ast \ast \mathrm{T}}{\hat{\boldsymbol{c}}}_i \), with \( {\boldsymbol{x}}_l^{\ast \ast }={\left[{x}_{l1}^{\ast },\dots, {x}_{lL_2}^{\ast}\right]}^{\mathrm{T}} \) and \( {x}_{lo}^{\ast }={\int}_0^T{\phi}_l(t){\psi}_o(t) dt \), o = 1, …, L2 and l = 1, …, L1. From here, xi can be expressed as

$$ {\boldsymbol{x}}_i=\left[\begin{array}{c}{\boldsymbol{x}}_1^{\ast \ast \mathrm{T}}{\hat{\boldsymbol{c}}}_i\\ {}\vdots \\ {}{\boldsymbol{x}}_{L_1}^{\ast \ast \mathrm{T}}{\hat{\boldsymbol{c}}}_i\end{array}\right]=\left[\begin{array}{c}{\boldsymbol{x}}_1^{\ast \ast \mathrm{T}}\\ {}\vdots \\ {}{\boldsymbol{x}}_{L_1}^{\ast \ast \mathrm{T}}\end{array}\right]{\hat{\boldsymbol{c}}}_i=\left[\begin{array}{ccc}{\int}_0^T{\phi}_1(t){\psi}_1(t) dt& \cdots & {\int}_0^T{\phi}_1(t){\psi}_{L_2}(t) dt\\ {}\vdots & \ddots & \vdots \\ {}{\int}_0^T{\phi}_{L_1}(t){\psi}_1(t) dt& \cdots & {\int}_0^T{\phi}_{L_1}(t){\psi}_{L_2}(t) dt\end{array}\right]{\hat{\boldsymbol{c}}}_i=\boldsymbol{Q}{\hat{\boldsymbol{c}}}_i, $$

where

$$ \boldsymbol{Q}=\left[\begin{array}{ccc}{\int}_0^T{\phi}_1(t){\psi}_1(t) dt& \cdots & {\int}_0^T{\phi}_1(t){\psi}_{L_2}(t) dt\\ {}\vdots & \ddots & \vdots \\ {}{\int}_0^T{\phi}_{L_1}(t){\psi}_1(t) dt& \cdots & {\int}_0^T{\phi}_{L_1}(t){\psi}_{L_2}(t) dt\end{array}\right]. $$

Now, the matrix X∗ can be computed as

$$ {\boldsymbol{X}}^{\ast }=\left[{\mathbf{1}}_n\kern0.5em \boldsymbol{X}\right], $$
(14.9)

where

$$ \boldsymbol{X}=\left[\begin{array}{c}{\boldsymbol{x}}_1^{\mathrm{T}}\\ {}\vdots \\ {}{\boldsymbol{x}}_n^{\mathrm{T}}\end{array}\right]=\left[\begin{array}{c}{\hat{\boldsymbol{c}}}_1^{\mathrm{T}}{\boldsymbol{Q}}^{\mathrm{T}}\\ {}\vdots \\ {}{\hat{\boldsymbol{c}}}_n^{\mathrm{T}}{\boldsymbol{Q}}^{\mathrm{T}}\end{array}\right]=\left[\begin{array}{c}{\hat{\boldsymbol{c}}}_1^{\mathrm{T}}\\ {}\vdots \\ {}{\hat{\boldsymbol{c}}}_n^{\mathrm{T}}\end{array}\right]{\boldsymbol{Q}}^{\mathrm{T}}=\left[\begin{array}{c}{\boldsymbol{x}}_1{\left(\boldsymbol{t}\right)}^{\mathrm{T}}\boldsymbol{\Psi} {\left({\boldsymbol{\Psi}}^{\mathrm{T}}\boldsymbol{\Psi} \right)}^{-1}\\ {}\vdots \\ {}{\boldsymbol{x}}_n{\left(\boldsymbol{t}\right)}^{\mathrm{T}}\boldsymbol{\Psi} {\left({\boldsymbol{\Psi}}^{\mathrm{T}}\boldsymbol{\Psi} \right)}^{-1}\end{array}\right]{\boldsymbol{Q}}^{\mathrm{T}}={\boldsymbol{X}}^{\ast \ast}\boldsymbol{\Psi} {\left({\boldsymbol{\Psi}}^{\mathrm{T}}\boldsymbol{\Psi} \right)}^{-1}{\boldsymbol{Q}}^{\mathrm{T}} $$

with \( {\boldsymbol{X}}^{\ast \ast}={\left[{\boldsymbol{x}}_1\left(\boldsymbol{t}\right)\kern0.5em \cdots \kern0.5em {\boldsymbol{x}}_n\left(\boldsymbol{t}\right)\right]}^{\mathrm{T}}. \) Finally, the complete practical solution for the parameter estimates is obtained with (14.4) and (14.5), but with X∗ computed as in (14.9).
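The whole pipeline just described can be illustrated with a small simulated sketch; the simulated data, the basis sizes, and the object names below are illustrative assumptions and not part of the examples presented later in this chapter:

library(fda)
set.seed(10)
n = 50; m = 60; tj = seq(0, 1, length = m)
#Simulated covariate curves observed at the m time points (rows = individuals)
X_obs = t(sapply(1:n, function(i) sin(2*pi*tj)*rnorm(1) + cos(2*pi*tj)*rnorm(1)))
#True coefficient function and responses generated from model (14.1)
beta_t = function(t) 2*sin(pi*t)
y = 1 + apply(X_obs, 1, function(x) mean(x*beta_t(tj))) + rnorm(n, 0, 0.1)
#Basis for the curves (psi, L2 functions) and for beta(t) (phi, L1 functions)
Psi = create.bspline.basis(c(0,1), nbasis = 9)
Phi = create.bspline.basis(c(0,1), nbasis = 5)
C_hat = t(smooth.basis(tj, t(X_obs), Psi)$fd$coefs)  #n x L2 matrix of c-hat_i
Q = inprod(Phi, Psi)                                 #L1 x L2 matrix of integrals
X = C_hat%*%t(Q)                                     #rows are x_i^T = (Q c-hat_i)^T
fit = lm(y ~ X)                                      #estimates of mu and beta_1,...,beta_L1
beta_hat_t = eval.basis(tj, Phi)%*%coef(fit)[-1]     #estimated beta(t) on a grid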

There are several proposals for choosing the “optimal” number of basis functions (L1) to represent the coefficient function β(·). One of them is the Bayesian information criterion (Górecki et al. 2018), which is defined as follows:

$$ \mathrm{BIC}=-2\ell \left(\hat{\boldsymbol{\beta}},{\hat{\sigma}}^2;\boldsymbol{y}\right)+\left({L}_1+1\right)\log (n), $$

where \( \ell \left(\hat{\boldsymbol{\beta}},{\hat{\sigma}}^2;\boldsymbol{y}\right) \) is the log-likelihood evaluated at the maximum likelihood estimates of the parameters β and σ2. This is a compromise between the fit of the model (first term) and its complexity (second term, the number of parameters in the model). In general, the model with the lowest BIC is preferred. In particular, with this criterion, the “optimal” number of basis functions is the one that gives the lowest BIC.
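Continuing the simulated sketch given above (C_hat, Psi, y, and n come from there), the BIC search over L1 might look like this:

L1_grid = 4:10
BICv = sapply(L1_grid, function(L1){
  Phi = create.bspline.basis(c(0,1), nbasis = L1)
  X = C_hat%*%t(inprod(Phi, Psi))            #n x L1 matrix with entries x_il
  fit = lm(y ~ X)
  sigma2 = mean(residuals(fit)^2)            #ML estimate of sigma^2
  ll = sum(dnorm(y, fitted(fit), sqrt(sigma2), log = TRUE))
  -2*ll + (L1 + 1)*log(n)
})
L1_opt = L1_grid[which.min(BICv)]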

When smoothing is required in the curve to be estimated, one way to control it is through the introduction of a penalty term, as will be described later. However, sometimes good results can be obtained without the need for this, as long as the number of basis functions relative to the amount of data is kept small (Ramsay et al. 2009).

To choose the “optimal” number of basis functions (L2) to represent the functional covariate data, we can also use the BIC criterion. To do this, consider that each curve is observed with error under the following model:

$$ {x}_i\left({t}_j\right)={\sum}_{o=1}^{L_2}{c}_{io}{\psi}_o\left({t}_j\right)+{\epsilon}_{ij}, $$

where, for each i = 1, …, n, the ϵij, j = 1, …, m, are independent random variables with distribution \( N\left(0,{\sigma}_x^2\right) \). Then the likelihood of the parameters to be estimated (\( {\boldsymbol{c}}_i={\left({c}_{i1},\dots, {c}_{i{L}_2}\right)}^{\mathrm{T}} \) and \( {\sigma}_x^2 \)) is given by

$$ {\displaystyle \begin{array}{c}L\left({\boldsymbol{c}}_i,{\sigma}_x^2;{\boldsymbol{x}}_i\left(\boldsymbol{t}\right)\right)=\prod \limits_{j=1}^m{f}_{{x}_i\left({t}_j\right)}\left({x}_i\left({t}_j\right)\right)\\ {}={\left(\frac{1}{2{\pi \sigma}_x^2}\right)}^{m/2}\exp \left[-\frac{1}{2{\sigma}_x^2}\sum \limits_{j=1}^m{\left({x}_i\left({t}_j\right)-\sum \limits_{o=1}^{L_2}{c}_{io}{\psi}_o\left({t}_j\right)\right)}^2\right]. \end{array}} $$

From this, the maximum likelihood estimates of ci and \( {\sigma}_x^2 \) are \( {\hat{\boldsymbol{c}}}_i={\left[{\hat{c}}_{i1},\dots, {\hat{c}}_{i{L}_2}\right]}^{\mathrm{T}}={\left({\boldsymbol{\Psi}}^{\mathrm{T}}\boldsymbol{\Psi} \right)}^{-1}{\boldsymbol{\Psi}}^{\mathrm{T}}{\boldsymbol{x}}_{\boldsymbol{i}}\left(\boldsymbol{t}\right) \) and \( {\hat{\sigma}}_x^{2}=\frac{1}{m}{\sum}_{j=1}^m{\left({x}_i\left({t}_j\right)-{\sum}_{o=1}^{L_2}{\hat{c}}_{io}{\psi}_o\left({t}_j\right)\right)}^2 \), respectively. So, before fitting the regression model, the “optimal” value of L2 used to represent each curve in the sample can be chosen as the one with the smallest BIC value in the corresponding model:

$$ \mathrm{BIC}=-2\log \left[L\left({\hat{\boldsymbol{c}}}_i,{\hat{\sigma}}_x^{\mathbf{2}};{\boldsymbol{x}}_i\left(\boldsymbol{t}\right)\right)\right]+\left({L}_2+1\right)\log (m). $$

A global value of L2 can be adopted, as suggested by Górecki et al. (2018), by taking the mode of the “optimal” values obtained across all the represented curves. Note that the maximum likelihood estimate of ci is the same as the least squares estimate mentioned above.

An alternative to the BIC approach is to choose the “optimal” number of basis functions by estimating the predictive ability obtained with different values of L2 and selecting the value with the best predictive performance (Ruppert et al. 2003). One way to do this is by using leave-one-out cross-validation (LOOCV) with the mean squared error of prediction as the criterion to measure predictive performance:

$$ \mathrm{C}{\mathrm{V}}_1\left({L}_2\right)={\sum}_{j=1}^m{\left(x\left({t}_j\right)-{\hat{x}}_{-j}\left({t}_j\right)\right)}^2, $$

where \( {\hat{x}}_{-j}\left({t}_j\right) \) is the predicted value of point j, x(tj), obtained by representing the function with L2 basis functions fitted without this point, that is,

$$ {\hat{x}}_{-j}\left({t}_j\right)={\sum}_{o=1}^{L_2}{\hat{c}}_{-j,o}{\psi}_o\left({t}_j\right), $$

where \( {\hat{\boldsymbol{c}}}_{-j}={\left[{\hat{c}}_{-j,1},\dots, {\hat{c}}_{-j,{L}_2}\right]}^{\mathrm{T}}={\left({\boldsymbol{\Psi}}_{-j}^{\mathrm{T}}{\boldsymbol{\Psi}}_{-j}\right)}^{-1}{\boldsymbol{\Psi}}_{-j}^{\mathrm{T}}{\boldsymbol{x}}_{-j} \), Ψ−j is a matrix of dimension (m − 1) × L2, like the basis design matrix defined in (14.8) over the L2 basis functions but without row j, and x−j is the vector that contains the observed values of the latent function, x(t), but with the value at position j removed. For a specific basis, the optimal number of basis functions is the one with the lowest value of CV1(L2).
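A direct (though not the most efficient) implementation of CV1(L2) for a single observed curve could look as follows; here tj and x_t are illustrative names for the observation times and the observed values x(t1), …, x(tm), and a B-spline basis is assumed:

library(fda)
CV1_f = function(tj, x_t, L2){
  Psi = create.bspline.basis(range(tj), nbasis = L2)
  EPsi = eval.basis(tj, Psi)                 #m x L2 basis design matrix as in (14.8)
  sum(sapply(seq_along(tj), function(j){
    c_mj = solve(t(EPsi[-j,])%*%EPsi[-j,])%*%t(EPsi[-j,])%*%x_t[-j]
    (x_t[j] - EPsi[j,]%*%c_mj)^2             #squared leave-one-out prediction error
  }))
}
#Example of use: L2_grid = 5:12
#L2_opt = L2_grid[which.min(sapply(L2_grid, CV1_f, tj = tj, x_t = x_t))]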

14.2 Basis Functions

A basis is a set of functions (ϕl, l = 1, 2, 3, …) such that “any” function x(t) can be approximated as closely as required by means of a linear combination of L2 of these functions:

$$ x(t)={\sum}_{l=1}^{L_2}{c}_l{\phi}_l(t), $$

where cl are coefficients that determine the function.

In general, to represent data in functions by means of basis functions, you need to

(a) Choose suitable basis functions (polynomial basis, Fourier basis, B-spline basis, etc.).

(b) Determine the number of basis functions to consider (L2).

(c) Estimate the coefficients cl, l = 1, …, L2.

The degree of smoothness of the function x(t) depends on the value of L2 that is chosen (a small value of L2 produces more smoothing of the curves), and the optimal value of L2 can be selected using the Bayesian information criterion (BIC) (Górecki et al. 2018) or cross-validation, as described before.

14.2.1 Fourier Basis

The Fourier basis is often used for periodic or near-periodic data and is often useful for expanding functions with weak local characteristics and with an approximately constant curvature. It is not appropriate for data with discontinuities in the function or in low order derivatives (Ramsay et al. 2009).

The Fourier basis is created by the following functions:

$$ {\phi}_1(t)=\frac{1}{\sqrt{P}},\kern0.5em {\phi}_2(t)=\frac{1}{\sqrt{P/2}}\sin \left(\omega t\right),\kern0.5em {\phi}_3(t)=\frac{1}{\sqrt{P/2}}\cos \left(\omega t\right),\kern0.5em {\phi}_4(t)=\frac{1}{\sqrt{P/2}}\sin \left(2\omega t\right), $$
$$ {\phi}_5(t)=\frac{1}{\sqrt{P/2}}\cos \left(2\omega t\right),\kern0.5em {\phi}_6(t)=\frac{1}{\sqrt{P/2}}\sin \left(3\omega t\right),\kern0.5em {\phi}_7(t)=\frac{1}{\sqrt{P/2}}\cos \left(3\omega t\right),\dots, $$

where ω is related to the period P by ω = 2π/P; in practical applications, P is often taken as the range of the t values over which the data are observed (Ramsay et al. 2009).

The graph of the first five of these functions on the interval (0, 8) with period 4 is given in Fig. 14.1. The vertical dotted lines mark the ends of the subintervals corresponding to the period of each function. This figure can be reproduced with the following R code:

library(fda)
BF = create.fourier.basis(rangeval=c(0,8),nbasis=5,period=4)
plot(BF,xlab='t',ylab=expression(phi[i](t)),ylim=c(-.7,.9),lty=1:5,lwd=1)
abline(v=seq(4,8,4),lty=3,lwd=3)
legend('topright',paste('i = ',1:5,sep=''),lty=1:5,col=1:5,
       bty='n',lwd=1,ncol=3)

Fig. 14.1 Graph of the first five elements of the Fourier basis with period 4 on interval (0, 8)

14.2.2 B-Spline Basis

A B-spline basis is typically used for nonperiodic data where the underlying function is locally smooth. The coefficients of a B-spline basis can be calculated quickly; they form a very flexible system because very good approximations of functions can be obtained even with a relatively small number of basis functions.

A B-spline is a type of spline, that is, a piecewise-polynomial continuous function with a specified number of continuous derivatives on an interval. Specifically, a spline of order q + 1 with interior knots Tj, j = 1, …, K (usually placed to take into account the data change points) on the observation interval [0, T] = (0, T1] ∪ (T1, T2] ∪ … ∪ (TK, T] is a continuous function sq that is a polynomial of degree q on each subinterval (Tj − 1, Tj] and has continuous derivatives up to order q − 1 at each interior knot, that is, the dth derivative of sq, \( {s}_q^{(d)}\left({T}_j\right) \), is continuous at each knot, for d = 1, …, q − 1 (Quarteroni et al. 2000; Hastie et al. 2009).

Indeed, a B-spline basis of order q + 1 is a basis for the space of splines of order q + 1 on a given sequence of knots; that is, any spline function of order q + 1 can be represented as a linear combination of B-splines. Moreover, unlike other bases, such as the truncated power basis, it is very attractive numerically (Quarteroni et al. 2000; Hastie et al. 2009).

Once the order (q + 1) and the interior knots Tj, j = 1, …, K, of a spline have been chosen, because we need K + 1 polynomials (one for each of the K + 1 subintervals), each with q + 1 coefficients, and there are Kq constraints (continuity of the spline and of its derivatives up to order q − 1 at each interior knot), the number of basis functions is given by

$$ L=\left(q+1\right)\left(K+1\right)- Kq=q+K+1=\mathrm{order}+\mathrm{number}\ \mathrm{of}\ \mathrm{interior}\ \mathrm{knots}. $$

In practice, the positions of the knots can be chosen according to the data change points or by placing them at appropriate percentiles of the observed time points (Fig. 14.2). For example, in R, if we want a B-spline basis of order 4 on the interval (0, 12) with three specific interior knots (T1 = 3, T2 = 6, T3 = 9), this can be defined, plotted, and evaluated by applying the following code:

Fig. 14.2 Graph of the seven B-spline basis functions of order 4 on interval (0, 12) with interior knots T1 = 3, T2 = 6, and T3 = 9

Order = 4; breaks = seq(0,12,3)
BS = create.bspline.basis(rangeval=c(0,12),norder=Order,breaks=breaks)
#Number of basis functions: 4 + 3
BS$nbasis
#Graphic of the seven basis functions
plot(BS,xlab='t')
#Evaluation of these seven basis functions in tv
tv = seq(0,12,length=100)
EBS = eval.basis(tv, basisobj=BS)
head(EBS)
matplot(tv,EBS,add=TRUE,type='o',pch='+')

Alternatively, if we want a B-spline basis of a given order (q + 1) with a specific number of basis functions (L2), in R it can be obtained similarly (Fig. 14.3). For example, a B-spline basis of order 4 with six basis functions will contain two (6 − 4) equally spaced interior knots and can be obtained with the following code:

Fig. 14.3 Graph of the six B-spline basis functions of order 4 on interval (0, 12)

Order = 4; nbasis = 6
BS = create.bspline.basis(rangeval=c(0,12),norder=Order,nbasis=nbasis)
#Graphic of the six basis functions
plot(BS,xlab='t')
#Evaluation of these six basis functions in tv
tv = seq(0,12,length=100)
EBS = eval.basis(tv, basisobj=BS)
head(EBS)
matplot(tv,EBS,add=TRUE,type='o',pch='+')

A B-spline basis of the same order and the same number of basis functions (Fig. 14.4), but with specific positions of the interior knots (T1 = 2 and T2 = 7), can be obtained by adding the argument breaks = c(0,2,7,12) to the function create.bspline.basis:

breaks = c(0,2,7,12)
Order = 4; nbasis = 6
BS = create.bspline.basis(rangeval=c(0,12),norder=Order,nbasis=nbasis,
                          breaks=breaks)
#Graphic of the six basis functions
plot(BS,xlab='t')

Fig. 14.4 Graph of the six B-spline basis functions of order 4 on interval (0, 12) with specific interior knots: T1 = 2 and T2 = 7

14.3 Illustrative Examples

Example 14.1

To illustrate how to use Fourier and B-spline bases to represent functions and to get a better picture of their performance, suppose that we only have 30 evaluations (xv) of an unknown function at equispaced points (tv) in the interval (0, 12):

tv = seq(0,12,length = 30)

xv = c(0.9, 0.924, 0.9461, 0.9658, 0.983, 0.9971, 1.008, 1.0152, 1.0187, 1.0181, 1.0133, 1.0042, 0.9906, 0.9727, 0.9504, 0.9237, 0.8928, 0.8579, 0.8192, 0.777, 0.7314, 0.683, 0.632, 0.5788, 0.5238, 0.4675, 0.4103, 0.3526, 0.2949, 0.2376).

The graphical representation of this information is given in Fig. 14.5, together with three representations of it obtained with Fourier bases of 5, 21, and 29 basis functions, with the period taken as the range of the observation domain. From this we can see that a poor representation was obtained with five basis functions, while with the other numbers of basis functions (21 and 29), almost identical and reasonable representations were obtained, except at the boundaries of the interval; this is related to the Gibbs phenomenon, a spurious oscillation at the interval boundaries of the Fourier series expansion of a nonperiodic function (Shizgal and Jung 2003).

Fig. 14.5 Graphical representation of 30 evaluations (points) of a function at 30 equispaced time points in interval (0, 12), and its representation using 5 (in black), 21 (in red), and 29 (in green) Fourier basis functions
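The R code shown later in this example covers the B-spline representations; a sketch along the same lines that could reproduce the Fourier representations of Fig. 14.5 (taking the period as the range of tv) is the following:

#Sketch for the Fourier representations in Fig. 14.5
plot(tv,xv,type='p',xlab='t',ylab='x(t)')
nbasisv = c(5,21,29)
for(i in 1:3)
{
  BF = create.fourier.basis(rangeval=c(0,12),nbasis=nbasisv[i],
                            period=diff(range(tv)))
  EBF = eval.basis(tv, basisobj=BF)
  cv = solve(t(EBF)%*%EBF)%*%t(EBF)%*%xv
  lines(tv,EBF%*%cv,col=i)
}
legend('topright',paste(nbasisv,'Fourier basis functions'),lty=1,col=1:3,bty='n')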

Figure 14.6 shows the same data but now together with representations of the latent function obtained by using B-splines of order 2, 3, 4, and 5, all with two interior knots (T1 = 4 and T2 = 8), to which correspond 2 + 2 = 4, 2 + 3 = 5, 2 + 4 = 6, and 2 + 5 = 7 basis functions, respectively. From this figure we can observe that the representation of order 3 is satisfactory, and this almost coincides with the representations obtained with orders 4 and 5.

#R code for B-spline representation
plot(tv,xv,type='p',xlab='t',ylab='x(t)')
breaks = seq(0,12,length=4)
Orderv = 2:5
for(i in 1:4)
{
  #A B-spline with two interior knots
  Order = Orderv[i]
  Degree = Order-1
  #No of basis functions = Order + length(breaks)-2
  nB = Order + length(breaks)-2
  BBS = create.bspline.basis(rangeval=c(0,12),norder=Order,breaks=breaks)
  EBBS = eval.basis(tv, basisobj=BBS)
  cv = solve(t(EBBS)%*%EBBS)%*%t(EBBS)%*%xv
  xv_p = EBBS%*%cv
  lines(tv,xv_p,col=i)
}
legend('topright',c('2 order B-spline (Linear)','3 order B-spline (Quadratic)',
       '4 order B-spline (Cubic)','5 order B-spline'),
       lty=c(1,1,1,1),col=1:4,bty='n')

Fig. 14.6 Graphical representation of 30 evaluations (points) of a function at 30 equispaced time points in interval (0, 12), and its representation using B-splines of order 2 (in black), 3 (in red), 4 (in green), and 5 (in blue), each with two interior knots at T1 = 4 and T2 = 8

Now, by using 3, 5, 7, and 10 B-spline basis functions of orders 2, 3, 4, and 5, respectively, the resulting representations are shown in Fig. 14.7. From this we can see that with five B-spline basis functions of order 3 (only two equally spaced interior knots were required), the representation starts to be satisfactory, which shows the flexibility of the B-spline basis.

Fig. 14.7 Graphical representation of 30 evaluations (points) of a function at 30 equispaced time points in interval (0, 12), and B-spline representations using 3, 5, 7, and 10 basis functions of orders 2, 3, 4, and 5, respectively

R code to reproduce Fig. 14.7

plot(tv,xv,type='p',xlab='t',ylab='x(t)')
Orderv = 2:5
#No of basis functions = Order + length(breaks)-2
nBv = c(3,5,7,10)
#Interior knots
K = nBv - Orderv
for(i in 1:4)
{
  BBS = create.bspline.basis(rangeval=c(0,12),nbasis=nBv[i],norder=Orderv[i])
  EBBS = eval.basis(tv, basisobj=BBS)
  cv = solve(t(EBBS)%*%EBBS)%*%t(EBBS)%*%xv
  xv_p = EBBS%*%cv
  lines(tv,xv_p,col=i)
}
legend('topright',paste(nBv,' basis - order ',Orderv,sep=''),
       lty=c(1,1,1,1),col=1:4,bty='n')

In general, more flexible curves can be obtained by increasing the order or the number of knots in the B-spline. However, overfitting and an increase in the variance can occur if the number of knots is increased, while an inflexible function with more bias may result by decreasing the number of knots (Perperoglou et al. 2019).

Example 14.2

Now, to illustrate how to use the BIC and LOOCV as criteria to choose the number of basis functions in a Fourier or B-spline (for a chosen order) representation, we revisit the data points of Example 14.1, now perturbed by random Gaussian noise (see Fig. 14.8).

Fig. 14.8 Graphical representation of the perturbed 30 data points of Example 14.1 and Fourier and B-spline representations using 5, 6, and 7 basis functions (L2). The left panel is the optimal representation obtained with the BIC and the right panel corresponds to the optimal representation obtained with the LOOCV

For these data, Fig. 14.9 shows the value of the BIC corresponding to different numbers of basis functions in the Fourier and B-spline representations. In both cases, the lowest value of this criterion was obtained with five basis functions, and in all cases the BIC was lower for the B-spline. The representation of the data points with this optimal number of basis functions is also shown in Fig. 14.8, where we can visually judge that the B-spline gives the better representation, because it resembles the non-perturbed latent function of Example 14.1. Similar results were obtained with the LOOCV strategy: 7 and 6 basis functions were required for Fourier and B-spline, respectively, and the implied representations can also be seen in Fig. 14.8. These figures can be reproduced with the following R code:

rm(list=ls())
#------
#Example 14.2: Perturbed data points of Example 14.1
#------
library(fda)
#Time points where the function was observed
tv = seq(0,12,length=30)
#Values of the function in 30 points
xv = c(0.9, 0.924, 0.9461, 0.9658, 0.983, 0.9971, 1.008, 1.0152, 1.0187,
       1.0181, 1.0133, 1.0042, 0.9906, 0.9727, 0.9504, 0.9237, 0.8928,
       0.8579, 0.8192, 0.777, 0.7314, 0.683, 0.632, 0.5788, 0.5238,
       0.4675, 0.4103, 0.3526, 0.2949, 0.2376)
set.seed(1)
xv = xv + rnorm(length(xv),0,0.10*mean(xv))
#plot(tv,xv,type='p',xlab='t',ylab='x(t)')

#Fourier
m = length(xv)
nbFv = seq(5,m-1,2)
BICFv = rep(0,length(nbFv))
AICFv = BICFv
for(l in 1:length(nbFv))
{
  BF = create.fourier.basis(rangeval=c(0,12),nbasis=nbFv[l],period=diff(range(tv)))
  EBF = eval.basis(tv, basisobj=BF)
  cv = solve(t(EBF)%*%EBF)%*%t(EBF)%*%xv
  xv_p = EBF%*%cv
  sigma2 = mean((xv-xv_p)^2)
  ll = sum(dnorm(xv,xv_p,sqrt(sigma2),log = TRUE))
  BICFv[l] = -2*ll+(dim(EBF)[2]+1)*log(m)
  AICFv[l] = -2*ll+2*(dim(EBF)[2]+1)
}

#B-spline
Order = 4
#No of basis functions = Order + length(breaks)-2
nbBSv = (Order+1):(m-4)
BICBSv = rep(0,length(nbBSv))
for(l in 1:length(nbBSv))
{
  BBS = create.bspline.basis(rangeval=c(0,12),norder=Order,nbasis=nbBSv[l])
  EBBS = eval.basis(tv, basisobj=BBS)
  cv = solve(t(EBBS)%*%EBBS)%*%t(EBBS)%*%xv
  muv = EBBS%*%cv
  sigma2 = mean((xv-muv)^2)
  ll = sum(dnorm(xv,muv,sqrt(sigma2),log = TRUE))
  BICBSv[l] = -2*ll+(nbBSv[l]+1)*log(m)
}

#Behavior of the BIC for different models obtained using various
#numbers of Fourier or B-spline basis functions
par(mfrow=c(1,2))
plot(nbFv,BICFv,xlab='Number of basis functions (L2)',ylab='BIC',
     main='Fourier basis')
plot(nbBSv,BICBSv,xlab='Number of basis functions (L2)',ylab='BIC',
     main='B-spline basis')

#Fourier and B-spline representations with the optimal number of
#basis functions as chosen by BIC
#Fourier
nboF_BIC = nbFv[which.min(BICFv)]
nboF_BIC
plot(tv,xv,type='p',xlab='t',ylab='x(t)',main='') #Optimal representation using BIC
BF = create.fourier.basis(rangeval=c(0,12),nbasis=nboF_BIC,
                          period=diff(range(tv)))
EBF = eval.basis(tv, basisobj=BF)
cv = solve(t(EBF)%*%EBF)%*%t(EBF)%*%xv
xv_p = EBF%*%cv
lines(tv,xv_p,col=1)
#B-spline
nboB_BIC = nbBSv[which.min(BICBSv)]
nboB_BIC
BBS = create.bspline.basis(rangeval=c(0,12),norder=Order,nbasis=nboB_BIC)
EBBS = eval.basis(tv, basisobj=BBS)
cv = solve(t(EBBS)%*%EBBS)%*%t(EBBS)%*%xv
xv_p = EBBS%*%cv
lines(tv,xv_p,col=2)
legend('topright',paste(c('Fourier (','B-Spline ('),c(nboF_BIC,nboB_BIC),
       c(')',')'),sep=''),col=1:2,lty=rep(1,2),bty='n')

#Choosing the optimal number of basis functions using LOOCV (PRESS)
PRESS_f<-function(A)
{
  Res = residuals(A)
  sum((Res/(1-hatvalues(A)))^2)
}
#Fourier basis
CVFv = nbFv
for(l in 1:length(nbFv))
{
  BF = create.fourier.basis(rangeval=c(0,12),nbasis=nbFv[l],period=diff(range(tv)))
  EBF = eval.basis(tv, basisobj=BF)
  A = lm(xv~0+EBF)
  CVFv[l] = PRESS_f(A)
}
plot(nbFv,CVFv,xlab='No. of basis functions',ylab='PRESS')
nboF = nbFv[which.min(CVFv)]
nboF
#B-spline basis
CVBSv = nbBSv
for(l in 1:length(nbBSv))
{
  BBS = create.bspline.basis(rangeval=c(0,12),norder=Order,nbasis=nbBSv[l])
  EBBS = eval.basis(tv, basisobj=BBS)
  A = lm(xv~0+EBBS)
  CVBSv[l] = PRESS_f(A)
}
plot(nbBSv[1:5],CVBSv[1:5],xlab='No. of basis functions',ylab='PRESS')
nboBS = nbBSv[which.min(CVBSv)]
nboBS

#Fourier and B-spline representations with the optimal number of
#basis functions as chosen by BIC
par(mfrow=c(1,2))
plot(tv,xv,type='p',xlab='t',ylab='x(t)',main='BIC')
#Fourier
BF = create.fourier.basis(rangeval=c(0,12),nbasis=nboF_BIC,
                          period=diff(range(tv)))
EBF = eval.basis(tv, basisobj=BF)
cv = solve(t(EBF)%*%EBF)%*%t(EBF)%*%xv
xv_p = EBF%*%cv
lines(tv,xv_p,col=1)
#B-spline
BBS = create.bspline.basis(rangeval=c(0,12),norder=Order,nbasis=nboB_BIC)
EBBS = eval.basis(tv, basisobj=BBS)
cv = solve(t(EBBS)%*%EBBS)%*%t(EBBS)%*%xv
xv_p = EBBS%*%cv
lines(tv,xv_p,col=2)
legend('topright',paste(c('Fourier (','B-Spline ('),c(5,5),
       c(')',')'),sep=''),col=1:2,lty=rep(1,2),bty='n')

#Optimal representations with LOOCV
plot(tv,xv,type='p',xlab='t',ylab='x(t)',main='LOOCV')
#Fourier
BF = create.fourier.basis(rangeval=c(0,12),nbasis=nboF,
                          period=diff(range(tv)))
EBF = eval.basis(tv, basisobj=BF)
cv = solve(t(EBF)%*%EBF)%*%t(EBF)%*%xv
xv_p = EBF%*%cv
lines(tv,xv_p,col=1)
#B-spline
BBS = create.bspline.basis(rangeval=c(0,12),norder=Order,nbasis=nboBS)
EBBS = eval.basis(tv, basisobj=BBS)
cv = solve(t(EBBS)%*%EBBS)%*%t(EBBS)%*%xv
xv_p = EBBS%*%cv
lines(tv,xv_p,col=2)
legend('topright',paste(c('Fourier (','B-Spline ('),c(nboF,nboBS),
       c(')',')'),sep=''),col=1:2,lty=rep(1,2),bty='n')

Fig. 14.9 Behavior of the BIC for different numbers of basis functions (L2) in Fourier and B-spline representations, for the perturbed data set of Example 14.1

Example 14.3

Now we will consider the prediction of wheat grain yield (tons/ha) using hyper-spectral image data. For this example, we consider part of the data used in Montesinos-López et al. (2017a, b): 20 lines and three environments. For each individual plant, the reflectance of its leaves, x(tj), was measured at m = 250 wavelengths (from 392 to 851 nm) and at different stages of its growth; the information used here corresponds to one of these stages.

Figure 14.10 shows the measured reflectance corresponding to the 60 observations, where the color of each curve indicates the environment to which it belongs. The Fourier and B-spline representations of all these curves are shown in Fig. 14.11, where the numbers of basis functions used were 29 and 16, respectively; these are the medians of the numbers of basis functions most frequently selected by the BIC across all the curves (29 for Fourier, and 12, 16, and 73 for B-spline) (see Appendix 1 for the R code to reproduce these results).

Fig. 14.10 Reflectance of leaves of 60 individuals measured at 250 wavelengths (392, …, 851 nm). The colors indicate the environment where the individuals were measured

Fig. 14.11 Fourier and B-spline representations with the “optimal” number of basis functions obtained with the BIC across all the curves

The observed values versus the predicted values of the response, corresponding to the Fourier and B-spline representations of the covariate, are shown in Fig. 14.12; in both cases, L1 = 21 basis functions were used to represent the beta coefficient function β(t). The fitted model appears to give almost the same results with both representations.

Fig. 14.12 Observed versus fitted values for two functional regression models: the left panel was obtained using 29 Fourier basis functions for the covariate function and 21 Fourier basis functions for β(t); the right panel was obtained using 16 B-spline basis functions for the covariate function and 21 B-spline basis functions for β(t)

To let the data speak for themselves about a reasonable value for L1 to represent β(t), the BIC was used in both representations. For the Fourier case, L1 = 11 was the optimal value, while L1 = 14 was the optimal value for the B-spline basis (see Appendix 1 for the R code). The predicted and residual values obtained with these optimal representations are shown in Fig. 14.13, from which it is difficult to choose the best one because they gave almost the same results.

Fig. 14.13 Observed versus predicted values and normal Q-Q plots of the residuals obtained with the Fourier (top) and B-spline (bottom) representations of the covariates (L2 = 29 Fourier and L2 = 16 B-spline basis functions) and of the beta coefficient function (L1 = 11 Fourier and L1 = 14 B-spline basis functions)

Because the B-spline appears to give a better covariate representation (see Fig. 14.11) and predictions similar to those given by the Fourier basis (Fig. 14.13), we can make a more informed decision by comparing the prediction accuracy of the response under both representations. To this end, we used ten random partitions; in each one, 20% of the total data set was used to measure prediction accuracy and the rest was used to fit (train) the model. The results are shown in Table 14.1: the Fourier basis provided a lower MSE in 6 out of 10 partitions and also a lower average MSE across all partitions, so it is considered the best option. The same conclusion was reached by comparing the BIC values of the corresponding Fourier (149.8048) and B-spline (159.5854) representations.

Table 14.1 Mean square error (MSE) of prediction for 10 random partitions for Fourier and B-spline representations

14.4 Functional Regression with a Smoothed Coefficient Function

As mentioned earlier, one way to take more control of the smoothness of the estimated coefficient function β(t) is by introducing a regularization term:

$$ \mathrm{SS}{\mathrm{E}}_{\lambda}\left(\boldsymbol{\beta} \right)={\sum}_{i=1}^n{\left({y}_i-\mu -{\sum}_{l=1}^{L_1}{x}_{il}{\beta}_l\right)}^2+\lambda {J}_{\beta }, $$
(14.10)

where Jβ is the penalty term and λ is a smoothing parameter that represents a compromise between the fit of the model to the data (first term) and the smoothness of the function β(·) (second term). When λ = 0, the problem reduces to least squares (or maximum likelihood under normal errors) with no penalty, and as λ increases, roughness is penalized more and more heavily, to the extent that β(t) can become constant.

Often the penalty term Jβ is based on the integrated squared derivative of order p (Usset et al. 2016):

$$ {J}_{\beta }={\int}_0^T{\left[\frac{d^p}{d{t}^p}\beta (t)\right]}^2 dt, $$
(14.11)

where \( \frac{d^p}{d{t}^p}\beta (t) \) is the derivative of order p of the function β(t). With the representation (14.2) of β(t), Jβ can be expressed as

$$ {J}_{\beta }={\boldsymbol{\beta}}^{\mathrm{T}}\boldsymbol{P}\boldsymbol{\beta }, $$

where P is a square matrix with entries \( {P}_{ij}={\int}_0^T{\phi}_i^{(p)}(t){\phi}_j^{(p)}(t) dt \), i, j = 1, …, L1, and \( {\phi}_i^{(p)}(t) \) is the derivative of order p of ϕi(t). Typical choices of p are 1 and 2.

A smoothed solution for the function β(t) can be obtained by minimizing (14.10) with respect to the parameters βl, l = 1, …, L1. However, because this solution depends on the smoothing parameter, this parameter needs to be determined. For this reason, as in the Ridge and Lasso regression models described in earlier chapters, a cross-validation method is adopted first, and a Bayesian approach will be described later.
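For a fixed value of λ, the minimizer of (14.10) has a ridge-type closed form. As a sketch, writing X∗ = [1n X] as in (14.9) and P0 for P augmented with a leading row and column of zeros so that the intercept is not penalized (notation introduced here only for compactness),

$$ {\hat{\boldsymbol{\beta}}}_{\lambda }={\left({\boldsymbol{X}}^{\ast \mathrm{T}}{\boldsymbol{X}}^{\ast }+\lambda {\boldsymbol{P}}_0\right)}^{-1}{\boldsymbol{X}}^{\ast \mathrm{T}}\boldsymbol{y}, $$

which reduces to (14.4) when λ = 0.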

Under the penalty term (14.11), the penalized sum of squared errors (14.10) can be written as

$$ \mathrm{SS}{\mathrm{E}}_{\lambda}\left(\boldsymbol{\beta} \right)={\left|\left|\boldsymbol{y}-{\mathbf{1}}_n\mu -{\boldsymbol{X}}^{\ast }{\boldsymbol{\beta}}^{\ast}\right|\right|}^2+\lambda {\boldsymbol{\beta}}^{\ast \mathrm{T}}\boldsymbol{D}\ {\boldsymbol{\beta}}^{\ast}= SS{E}_{\lambda}\left({\boldsymbol{\beta}}^{\ast}\right), $$
(14.12)

where X∗ = XΓ, β∗ = ΓTβ, and P = ΓDΓT is the spectral decomposition of the penalty matrix P. Note that when the matrix P is not of full rank, the penalty term in (14.12) reduces to \( \lambda {\boldsymbol{\beta}}^{\ast \mathrm{T}}\boldsymbol{D}\ {\boldsymbol{\beta}}^{\ast}=\lambda {\boldsymbol{\beta}}_1^{\ast \mathrm{T}}{\boldsymbol{D}}_1\ {\boldsymbol{\beta}}_1^{\ast} \), where D1 is D without the rows and columns corresponding to the zero eigenvalues of P and \( {\boldsymbol{\beta}}_1^{\ast } \) contains the corresponding entries of β∗. So, the corresponding smoothed solution of β(t) can be obtained as

$$ \hat{\beta}(t)={\sum}_{l=1}^{L_1}{\hat{\beta}}_l{\phi}_l(t), $$

where \( \hat{\boldsymbol{\beta}}=\boldsymbol{\Gamma} {\hat{\boldsymbol{\beta}}}^{\ast } \) and \( {\hat{\boldsymbol{\beta}}}^{\ast } \) is the solution of (14.12), which also can be obtained with the glmnet R package.

Example 14.4

To exemplify the penalized estimation of the functional regression (14.10) with penalty (14.11), here we revisit the data used in Example 14.3. To compare its prediction accuracy with that of the non-penalized functional regression described in the previous section, 100 random partitions were used; in each one, 80% of the data set was used to train the model and the rest to evaluate the prediction performance. When training the model, an inner five-fold cross-validation was used to choose the optimal tuning parameter (λ) and estimate the βl coefficients. This was done using Fourier and B-spline bases to represent the beta function, in both cases with the same numbers of basis functions as in Example 14.3. In both cases, the penalty matrix (14.11) and the elements of its spectral decomposition (14.12) can be computed in R as

#Penalty matrix of the derivative of order p
P_mat = eval.penalty(basisobj=Phi,Lfdobj=p)
#Spectral decomposition of P_mat
ei_P = eigen(P_mat)
gamma = ei_P$vectors   #Gamma
dv = ei_P$values       #Eigenvalues of P_mat, the elements of the diagonal of D
dv = ifelse(dv<1e-10,0,dv)

where Phi is a basis created in R (Fourier or B-spline) and p is a nonnegative integer giving the order of the derivative used in the penalty matrix. Once the penalty matrix has been computed, the training of the model in (14.12) can be done in R as

Xa = X_F%*%gamma
A_PFR = cv.glmnet(x=Xa, y=y, alpha=0, nfolds=k, penalty.factor=dv,
                  standardize=FALSE, maxit=1e6)

where Xa is X∗ = XΓ, y is the vector with the corresponding values of the response variable, k is an integer specifying the inner k-fold cross-validation used to train the model and choose the “optimal” value of the smoothing parameter, dv contains the eigenvalues of the penalty matrix, used to assign different penalties to the \( {\beta}_j^{\ast } \) as required in (14.12), and standardize = FALSE indicates that the columns of X∗ should not be standardized. See Appendix 2 for the complete R code used to obtain the results of this example.

When using the Fourier representation and first derivative penalization, Fig. 14.14 shows, for the 100 random partitions, the MSE obtained with the penalized functional regression (PFR) against the MSE corresponding to the non-penalized functional regression (FR): in 78 out of 100 partitions, the PFR resulted in a lower MSE (0.7465 vs. 0.5545 on average). Furthermore, in the partitions where the FR was better, the average MSE of the PFR was 35.48% greater (0.4051 vs. 0.5561), while in those where the PFR was better, the average MSE obtained with the FR was 51.55% greater (0.8428 vs. 0.5561).

Fig. 14.14 Mean square error (MSE) of prediction for penalized functional regression (PFR) versus MSE of functional regression (FR). Shown in green are the cases where the MSE obtained with the PFR is less than that of the FR

Now, for the B-spline representation and first derivative penalization, the corresponding results are also shown in Fig. 14.14. In this case, in 55 out of 100 partitions, the PFR resulted in a lower MSE (0.5822 vs. 0.5630 on average). Furthermore, in the partitions where the FR was better, the average MSE of the PFR was 27.18% greater (0.4887 vs. 0.6216), while in those where the PFR was better, the average MSE obtained with the FR was 27.87% greater (0.6587 vs. 0.5151). So, for the Fourier basis representation, the results obtained when penalizing the estimation of the beta function β(·) differ from those obtained without penalization, while with the B-spline representation the difference is negligible. This is perhaps because of the natural smoothness of the B-spline relative to the Fourier basis.

For second derivative penalization using the Fourier representation, in 74 out of 100 partitions the MSE of the PFR was again lower than that of the FR, and also on average (0.7465 vs. 0.5754). Indeed, in the cases where the PFR was worse, its average MSE was 34.87% greater than the corresponding FR value, and in the cases where the FR was worse, its average MSE was 51.78% greater.

For the B-spline basis with second derivative penalization, on average the FR was better than the PFR (0.5822 vs. 0.6009), although it gave a smaller MSE in only 50 out of 100 partitions. Additionally, in the cases where the FR was better, the average MSE of the PFR was 34.48% greater than that of the FR, while in the other half of the cases, the average MSE of the FR was only 24.25% greater than the average MSE of the PFR.

14.5 Bayesian Estimation of the Functional Regression

Similar to what was done in Chaps. 3 and 6, the penalized sum of squared errors solution in (14.12) with penalty term (14.11) coincides with the mean/mode of the posterior distribution of β in the multiple linear regression \( {y}_i=\mu +{\sum}_{l=1}^{L_1}{x}_{il}{\beta}_l+{\epsilon}_i \) (y = 1nμ + Xβ + ϵ) with prior distribution \( \boldsymbol{\beta} \sim N\left(\mathbf{0},{\sigma}_{\beta}^2{\boldsymbol{P}}^{-1}\right) \), where \( {\sigma}_{\beta}^2={\sigma}^2/\lambda \). So, from here, a Bayesian formulation (PBFR) for the smoothed solution of the coefficient function β(t) in the functional regression model (14.1) can be completed by assuming the following priors for the rest of the parameters: \( {\sigma}^2\sim {\chi}_{v,S}^{-2} \) and \( {\sigma}_{\beta}^2\sim {\chi}_{v_{\beta },\kern0.5em {S}_{\beta}}^{-2} \), where \( {\chi}_{v,S}^{-2} \) denotes a scaled inverse Chi-squared distribution with shape parameter v and scale parameter S. When P is not of full rank, only a small change is needed in the prior distribution of the β coefficients: for example, in the Fourier basis, the first element is the constant function, so the entries in the first row and first column of P are equal to 0 and the corresponding coefficient is left unpenalized.
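A one-step sketch of why this equivalence holds: under these normality assumptions, minus twice the log-posterior of β is, up to additive constants,

$$ \frac{1}{\sigma^2}{\left|\left|\boldsymbol{y}-{\mathbf{1}}_n\mu -\boldsymbol{X}\boldsymbol{\beta} \right|\right|}^2+\frac{1}{\sigma_{\beta}^2}{\boldsymbol{\beta}}^{\mathrm{T}}\boldsymbol{P}\boldsymbol{\beta}, $$

so the posterior mode minimizes the same criterion as (14.10) with penalty (14.11) and \( \lambda ={\sigma}^2/{\sigma}_{\beta}^2 \).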

Similarly, this Bayesian formulation can be expressed as y = 1nμ + X∗β∗ + ϵ, with the same prior distributions, except that now \( {\boldsymbol{\beta}}^{\ast}\sim N\left(\mathbf{0},{\sigma}_{\beta}^2{\boldsymbol{I}}_{L_1}\right) \) and X∗ = XΓD−1/2, where \( {\boldsymbol{D}}^{-\frac{1}{2}} \) is the inverse of the square root matrix of D. This is equivalent to the Bayesian Ridge Regression (BRR) described in Chap. 6, so it can be implemented with the BGLR R package.

Example 14.5

Here we continue with the data set of Example 14.3, but now, in another 100 random partitions, we add the Bayesian predictions to explore the prediction performance. Part of the code for implementing this model is given next; the complete code is given in Appendix 3.

#Number of Fourier and B-spline basis functions to represent the
#covariate function
nbFo = 29
nbBSo = 16

#Functional regression with Fourier and B-splines
#Computing X for the Fourier representation with L1 = 21 Fourier basis
#functions for the beta function
L1 = 21
P = diff(range(Wv))
Psi_F = create.fourier.basis(range(Wv), nbasis = nbFo, period = P)
SF_mat = smooth.basisPar(argvals = Wv, y = t(X_W), fdobj = Psi_F,
                         lambda = 0, Lfdobj = 0)
Phi_F = create.fourier.basis(range(Wv), nbasis = L1, period = P)
Q = inprod(Phi_F,Psi_F)
c_mat = t(SF_mat$fd$coefs)
X_F = c_mat%*%t(Q)
dim(X_F)

#Computing X for the B-spline representation with L1 = 21 B-spline basis
#functions for the beta function
L1 = 21
Psi_BS = create.bspline.basis(range(Wv), nbasis = nbBSo)
SBS_mat = smooth.basisPar(argvals = Wv, y = t(X_W), fdobj = Psi_BS,
                          lambda = 0, Lfdobj = 0)
Phi_BS = create.bspline.basis(range(Wv), nbasis = L1, norder = 4)
Q = inprod(Phi_BS,Psi_BS)
c_mat = t(SBS_mat$fd$coefs)
X_BS = c_mat%*%t(Q)
dim(X_BS)

#Smoothed estimation using the Fourier representation for the functional
#covariate and L1 = 21 Fourier basis functions for the beta function
par(mfrow=c(1,2))
library(glmnet)
#Penalization with the first derivative
P_mat_F = eval.penalty(Phi_F,Lfdobj=1)
ei_P = eigen(P_mat_F)
gamma = ei_P$vectors
dv = ei_P$values
dv = ifelse(dv<1e-10,0,dv)
X_Fa = X_F%*%gamma

#Grid of lambda values obtained by varying the proportion of variance
#explained by the functional predictor
lamb_FR_f<-function(Xa,dv,K=100,li=1e-1,ls=1-1e-12)
{
  Pos = which(dv<1e-10)
  D_inv = diag(1/dv[-Pos])
  PEV = seq(li,ls,length=K)
  Xa = Xa[,-Pos]
  lambv = (1-PEV)/PEV*mean(diag(Xa%*%D_inv%*%t(Xa)))
  lambv = exp(seq(min(log(lambv)),max(log(lambv)),length=K))
  sort(lambv,decreasing = TRUE)
}
lambda = lamb_FR_f(X_Fa,dv,K=1e2)
y = yv
library(BGLR)

#Linear predictor specification in BGLR for the PBFR model
Pos = which(dv>0)
#Design matrix for the non-penalized columns of X^a
X_Fa1 = X_Fa[,-Pos]
#Design matrix for the penalized columns of X^a
X_Fa2 = X_Fa[,Pos]%*%diag(1/sqrt(dv[Pos]))
ETA = list(list(X=X_Fa1,model='FIXED'),list(X=X_Fa2,model='BRR'))
#Linear predictor specification in BGLR for the BFR model
ETA_NP = list(list(X=X_F[,1],model='FIXED'),list(X=X_F[,-1],model='BRR'))

#Random cross-validation
RP = 100
set.seed(1)
MSEP_df = data.frame(RP=1:RP,MSEP_PFR=NA,MSEP_FR=NA,lamb_o=NA)
for(p in 1:RP)
{
  Pos_tst = sample(n,0.20*n)
  X_F_tr = X_F[-Pos_tst,]; n_tr = dim(X_F_tr)[1]
  y_tr = y[-Pos_tst]; y_tst = y[Pos_tst]
  #FR
  dat_df = data.frame(y=y,X=X_F)
  A_F = lm(y~.,data=dat_df[-Pos_tst,])
  yp_tst = predict(A_F,newdata = dat_df[Pos_tst,])
  MSEP_df$MSEP_FR[p] = mean((y_tst-yp_tst)^2)
  #PFR with first derivative penalty
  A_PFR = cv.glmnet(x=X_Fa[-Pos_tst,],y=y[-Pos_tst],alpha=0,
                    nfolds=5,lambda=lambda/n_tr,penalty.factor=dv,
                    standardize=FALSE,maxit=1e6)
  MSEP_df$lamb_o[p] = A_PFR$lambda.min
  yp_tst = predict(A_PFR,newx=X_Fa[Pos_tst,],s="lambda.min")[,1]
  MSEP_df$MSEP_PFR[p] = mean((y_tst-yp_tst)^2)
  #BGLR: PBFR
  y_NA = y
  y_NA[Pos_tst] = NA
  A_BGLR = BGLR(y_NA,ETA=ETA,nIter=1e4,burnIn=1e3,verbose=FALSE)
  yp_tst = A_BGLR$yHat[Pos_tst]
  MSEP_df$MSEP_BGLR[p] = mean((y_tst-yp_tst)^2)
  #BGLR: BFR
  A_BGLR = BGLR(y_NA,ETA=ETA_NP,nIter=1e4,burnIn=1e3,verbose=FALSE)
  yp_tst = A_BGLR$yHat[Pos_tst]
  MSEP_df$MSEP_BGLR_NP[p] = mean((y_tst-yp_tst)^2)
  cat('Partition = ', p,'\n')
}
MSEP_df

With the Fourier basis and first derivative penalization, the average MSE (SD) across the 100 random partitions was 0.7744 (0.3412), 0.6338 (0.2871), 0.8585 (0.2929), and 0.8092 (0.2848) for the functional regression (FR), penalized functional regression (PFR), penalized Bayesian functional regression (PBFR), and Bayesian functional regression (BFR, \( \boldsymbol{\beta} \sim N\left(\mathbf{0},{\sigma}_{\beta}^2{\boldsymbol{I}}_{L_1}\right) \)), respectively. In 75, 81, and 77 out of the 100 random partitions, the MSE of the PFR was lower than that obtained with the FR, PBFR, and BFR, respectively. And in only 5 out of 100 cases was the MSE of the PBFR lower than that of the BFR, which makes the penalty term in the Bayesian estimation unimportant and indeed harmful here (see Appendix 3 for the R code used).

With the B-spline basis and first derivative penalization, the PFR (0.5718 (0.2669)) was also better on average than the FR (0.6012 (0.2681)), the PBFR (0.8475 (0.3333)), and the BFR (0.7982 (0.3112)). Here, in 62 out of 100 random partitions, the MSE of the PFR was less than the MSE of the FR, while in 84 and 80 out of the 100 random partitions it was better than the PBFR and BFR, respectively. Again, taking the penalty term into account in the Bayesian prediction was not very important, because in only 8 out of the 100 random partitions was the MSE of the PBFR less than the MSE corresponding to the BFR (see Appendix 3 for the R code used).

When using the penalty matrix based on second derivatives, in each case (Fourier and B-spline), the results were similar.

The Bayesian formulation can be extended easily to take into account the effects of other factors. For example, in Example 14.5, the effects of the environment can be added as

$$ \boldsymbol{y}={\mathbf{1}}_n\mu +{\boldsymbol{X}}_E{\boldsymbol{\beta}}_E+\boldsymbol{X}\boldsymbol{\beta } +\boldsymbol{\epsilon}, $$
(14.13)

where XE is the design matrix of the environments and βE is the vector with the environment effects; the Bayesian formulation can be completed by assuming a prior distribution for βE. As described in Chap. 6, the BGLR package offers several options for this: FIXED, BRR, BayesA, BayesB, BayesC, and BL. In the next example, the first one is used.

Example 14.6

This is a continuation of Example 14.5 used to illustrate the performance when adding environmental information to the prediction task by using the Bayesian formulation (14.13). The resulting models were evaluated with 100 random partitions with both Fourier and B-spline basis and first derivative penalization.

The key code for implementing this example is given below.

For the Fourier basis:

#Matrix design of environment
X_E = model.matrix(~0+Env,data=dat_F)[,-1]
#Linear predictor to PBFR + Env effect
ETA = list(list(X=X_E,model='FIXED'),list(X=X_Fa1,model='FIXED'),
           list(X=X_Fa2,model='BRR'))
#Linear predictor to BFR + Env effect
ETA_NP = list(list(X=X_E,model='FIXED'),list(X=X_F[,1],model='FIXED'),
              list(X=X_F[,-1],model='BRR'))

For the B-spline basis:

#Matrix design of environment
X_E = model.matrix(~0+Env,data=dat_F)[,-1]
#Linear predictor to PBFR + Env effect
ETA = list(list(X=X_E,model='FIXED'),list(X=X_Fa1,model='FIXED'),
           list(X=X_Fa2,model='BRR'))
#Linear predictor to BFR + Env effect
ETA_NP = list(list(X=X_E,model='FIXED'),list(X=X_BS,model='BRR'))

#Random cross-validation
RP = 100
set.seed(1)
MSEP_df = data.frame(RP=1:RP,MSEP_PFR=NA,MSEP_FR=NA,lamb_o=NA)
for(p in 1:RP)
{
  Pos_tst = sample(n,0.20*n)
  X_F_tr = X_F[-Pos_tst,]; n_tr = dim(X_F_tr)[1]
  y_tr = y[-Pos_tst]; y_tst = y[Pos_tst]
  #FR
  dat_df = data.frame(y=y,X=X_F)
  A_F = lm(y~.,data=dat_df[-Pos_tst,])
  yp_tst = predict(A_F,newdata = dat_df[Pos_tst,])
  MSEP_df$MSEP_FR[p] = mean((y_tst-yp_tst)^2)
  #PFR with first derivative
  A_PFR = cv.glmnet(x=X_Fa[-Pos_tst,],y=y[-Pos_tst],alpha=0,
                    nfolds=5,lambda=lambda/n_tr,penalty.factor=dv,
                    standardize=FALSE,maxit=1e6)
  MSEP_df$lamb_o[p] = A_PFR$lambda.min
  yp_tst = predict(A_PFR,newx=X_Fa[Pos_tst,],s="lambda.min")[,1]
  MSEP_df$MSEP_PFR[p] = mean((y_tst-yp_tst)^2)
  #BGLR
  y_NA = y
  y_NA[Pos_tst] = NA
  A_BGLR = BGLR(y_NA,ETA=ETA,nIter=1e4,burnIn=1e3,verbose=FALSE)
  yp_tst = A_BGLR$yHat[Pos_tst]
  MSEP_df$MSEP_BGLR[p] = mean((y_tst-yp_tst)^2)
  A_BGLR = BGLR(y_NA,ETA=ETA_NP,nIter=1e4,burnIn=1e3,verbose=FALSE)
  yp_tst = A_BGLR$yHat[Pos_tst]
  MSEP_df$MSEP_BGLR_NP[p] = mean((y_tst-yp_tst)^2)
  cat('Partition = ', p,'\n')
}

The results are shown in Table 14.2, where the third and fourth rows correspond to the average (Mean) and standard deviation (SD) of the MSE when using the Fourier basis (L2 = 29 for the covariate representation and L1 = 21 for the beta function β(·)) in the four fitted models (PFR, FR, PBFR, and BFR, with the environment effects added to the predictor of the model). Compared with the classical functional regression models (PFR and FR), both Bayesian models with environment effects (PBFR and BFR) performed better, but again the Bayesian model without the penalization matrix was the better of the two (0.4382 vs. 0.4580) (see Appendix 4 for the whole code).

Table 14.2 Mean square error (MSE) of prediction for 100 random partitions for Fourier and B-spline representations, where the PFR and FR are classic functional regression models used in Example 14.4, and PBFR and BFR are model (14.13) with prior \( \boldsymbol{\beta} \sim N\left(\mathbf{0},{\sigma}_{\beta}^2{\boldsymbol{P}}^{-1}\right) \) (with penalization matrix based on the first derivative) and \( \boldsymbol{\beta} \sim N\left(\mathbf{0},{\sigma}_{\beta}^2{\boldsymbol{I}}_{L_1}\right) \) (without penalization), respectively, for the functional term

For the B-spline basis, similar results were obtained, but again, the difference between the PFR and FR was not as important as observed before (see Appendix 4). So, in both cases (Fourier and B-spline), adding the environment information to the model improved the prediction performance. In general, extra information can be added and explored easily with the BGLR package to determine its importance for the prediction task of interest.

Further information can be easily added to the model without many complications. See Chap. 6 for more detailed information on how to do this with the BGLR R package.

Example 14.7

This example extends Example 14.6 by adding the interaction of the environment with the hyper-spectral data (the functional covariate) to the predictor of the model. The corresponding term is given by

$$ {\int}_0^Tx(t){\beta}_e(t) dt, $$

where βe(t) is the coefficient function of the functional term that represents the interaction between the eth environment and the reflectance at wavelength t (in general, the functional covariate measured at time t), allowing the effect of reflectance to vary across environments (see Montesinos-López et al. 2017b). Assuming that there are ne observations in environment e, e = 1, …, I, and representing the coefficient function βe(t) in terms of the same basis functions used for β(t), \( {\beta}_e(t)={\sum}_{l=1}^{L_{1e}}{\beta}_{el}{\phi}_l(t) \), the re-expressed model is given by

$$ \boldsymbol{y}={\mathbf{1}}_n\mu +{\boldsymbol{X}}_E{\boldsymbol{\beta}}_E+\boldsymbol{X}\boldsymbol{\beta } +{\boldsymbol{X}}_{EF}{\boldsymbol{\beta}}_{EF}+\boldsymbol{\epsilon}, $$
(14.14)

where 1n is a vector of dimension n × 1 with all its entries equal to 1, \( \boldsymbol{X}={\left[\ {\boldsymbol{x}}_1,\dots, {\boldsymbol{x}}_n\right]}^{\mathrm{T}} \) with \( {\boldsymbol{x}}_i={\left[{x}_{i1},\dots, {x}_{iL_1}\right]}^{\mathrm{T}},i=1,\dots, n={n}_1+\cdots +{n}_I, \) as defined in (14.4) and (14.5), XE is the design matrix of the environments and βE is the vector of environment effects (see 14.13), and XEF is the design matrix of the environment–reflectance interaction effects, with βEF the corresponding vector of interaction effects, given by

$$ {\boldsymbol{X}}_{EF}=\left[\begin{array}{cccc}{\boldsymbol{x}}_1^{\mathrm{T}}& \mathbf{0}& \cdots & \mathbf{0}\\ {}\vdots & \vdots & \vdots & \vdots \\ {}{\boldsymbol{x}}_{n_1}^{\mathrm{T}}& \mathbf{0}& \cdots & \mathbf{0}\\ {}\mathbf{0}& {\boldsymbol{x}}_{n_1+1}^{\mathrm{T}}& \vdots & \mathbf{0}\\ {}\mathbf{\vdots}& \vdots & \vdots & \vdots \\ {}\mathbf{0}& {\boldsymbol{x}}_{n_1+{n}_2}^{\mathrm{T}}& \cdots & \mathbf{0}\\ {}\vdots & \vdots & \vdots & \vdots \\ {}\mathbf{0}& \mathbf{0}& \mathbf{0}& {\boldsymbol{x}}_{n-{n}_I+1}^{\mathrm{T}}\\ {}\vdots & \vdots & \vdots & \vdots \\ {}\mathbf{0}& \mathbf{0}& \mathbf{0}& {\boldsymbol{x}}_n^{\mathrm{T}}\end{array}\right]\ \mathrm{and}\ {\boldsymbol{\beta}}_{EF}=\left[\begin{array}{c}{\boldsymbol{\beta}}_1\\ {}{\boldsymbol{\beta}}_2\\ {}\mathbf{\vdots}\\ {}{\boldsymbol{\beta}}_I\end{array}\right]. $$
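A hedged R sketch of how XEF can be assembled from the matrix X (whose rows are the \( {\boldsymbol{x}}_i^{\mathrm{T}} \)) and a factor Env giving the environment of each observation is shown below; both names are illustrative and the construction does not require the observations to be sorted by environment:

#Sketch: interaction design matrix X_EF from X (n x L1) and the factor Env
build_X_EF = function(X, Env){
  Env = as.factor(Env)
  Z = model.matrix(~0+Env)                 #n x I indicator matrix of environments
  #Column (e-1)*L1 + l of the result equals Z[,e]*X[,l]
  do.call(cbind, lapply(1:ncol(Z), function(e) Z[,e]*X))
}
#Example of use (with the objects of Example 14.6): X_EF = build_X_EF(X_F, dat_F$Env)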

This model was also evaluated with 100 random partitions with both Fourier and B-spline bases, with (PBFR (14.14)) and without (BFR (14.14)) first derivative penalization. The results are shown in Table 14.3, together with the prediction performance of the model with no interaction term (model 14.13). We can observe that by adding the interaction of the environment with the functional covariate, both Bayesian models (PBFR and BFR) achieved an average reduction of about 35% in the MSE (PBFR (14.13) vs. PBFR (14.14) and BFR (14.13) vs. BFR (14.14)). Again, the Bayesian model without the penalization matrix was better (0.2955 vs. 0.2899) with the Fourier basis, while with the B-spline basis the Bayesian model with the penalization matrix was better (0.2814 vs. 0.2818).

Table 14.3 Mean squared error of prediction (MSE) for 100 random partitions for Fourier and B-spline representations, where PBFR (14.13) and BFR (14.13) are the MSE of model (14.13) with and without the penalization matrix based on the first derivative in the functional term (Xβ), and PBFR (14.14) and BFR (14.14) are for model (14.14) with and without the penalization matrix based on the first derivative in the functional terms (Xβ, XEFβEF)

Finally, in this chapter we presented the basic theory of functional regression and provided examples to illustrate this methodology for genomic prediction using the glmnet and BGLR packages. The examples show in detail how to implement functional regression analysis in a straightforward way by taking advantage of existing software for genomic selection. The examples also use small data sets so that users can run them on their own computers and understand the implementation process well.