1 Introduction

In univariate supervised learning we are given a set of training input and output pairs \(\{\varvec{x}_i,t_i \}_{i=1}^N\) that are used to learn a function f that maps from the D-dimensional input vector, \(\varvec{x}_i= (x_{i1}, \dots , x_{iD})^\top \), to the corresponding target \( t_i\). Supervised learning can be applied in both classification and regression, and its objective is to use the learned function to predict outputs for new input data. When solving nonlinear regression problems the true, but in general unknown, function is assumed to be represented by a weighted sum of M basis functions \(\{ \varvec{\phi }_j \}_{j=1}^M\):

$$\begin{aligned} f(\varvec{x}_i) = w_0 + \sum _{j=1}^M w_j \phi _j (\varvec{x}_i), \end{aligned}$$
(1)

where \(\varvec{w} = (w_0,w_1,\dots ,w_M)^\top \) is the \((M+1)\)-dimensional vector of weights to be learned. The basis functions are in general nonlinear and pre-specified. A frequent choice is to use kernel functions, \(K(\cdot , \cdot )\), centered at each of the observations, such that \(\phi _j(\varvec{x}) = K(\varvec{x}, \varvec{x}_j)\) and the number of basis functions M is equal to the sample size N. It is common to assume that the vector of observed targets \(\varvec{t}= (t_1, \dots , t_N)^\top \) contains samples from the model in Eq. (1) with additive noise, such that we obtain the relationship:

$$\begin{aligned} \varvec{t} = \varvec{\Phi }\varvec{w} + \varvec{\epsilon }, \qquad \varvec{\epsilon }\sim \mathcal {N}(\varvec{0}, \sigma ^2 \varvec{I}_N), \end{aligned}$$
(2)

where \(\varvec{\Phi }\) is the \(N \times (N + 1)\) design matrix. Each row i of \(\varvec{\Phi }\) contains a leading 1 for the intercept \(w_0\), followed by the kernel functions evaluated between \(\varvec{x}_i\) and each training input, \((1, K(\varvec{x}_i, \varvec{x}_1), K(\varvec{x}_i, \varvec{x}_2), \ldots , K(\varvec{x}_i, \varvec{x}_N))\). In sparse methods most of the weights in \(\varvec{w}\) are set to zero during estimation, so that only a subset of the training data, containing the most important information about the data structure, is used for prediction. Sparsity is an important property: it reduces the computational time when working with large datasets, and it can also prevent overfitting because it controls the complexity of the model.

Sparse Bayesian Learning (SBL) is a supervised learning framework that obtains sparse representations of models by applying an automatic relevance determination (ARD) prior (MacKay 1992b). SBL has been applied to many different research areas, and has become very popular in image recognition (Liu et al. 2018; Zhang et al. 2019) and compressive sensing (Choi et al. 2017; Calvetti et al. 2019; Shekaramiz et al. 2019; Djelouat et al. 2022). The classical SBL method is the Relevance Vector Machine (RVM) introduced by Tipping (2001). The RVM uses a prior that is a zero mean Gaussian for each \(w_i\) with an individual precision parameter \(\alpha _i\). This prior structure corresponds to one type of ARD prior that has been shown to promote sparsity (Tipping 2001). The selected training vectors \(\varvec{x}_i\) that correspond to the non-zero weights \(w_i\) are called relevance vectors. Although the original RVM has been shown to give good predictions, it suffers from high computational cost for large datasets because the complexity is of order \(O(N^3)\) (Tipping 2001). The method is initiated with all the training samples included, and prunes irrelevant samples in each iteration. In order to overcome this problem, a new version of the RVM was proposed by Tipping and Faul (2003). The improved method, called the “fast RVM”, starts with an “empty” model, and in each iteration a basis function may be added to, deleted from or re-estimated in the model. Due to the improvement in computational cost, the fast RVM has become a popular method applied to many different problems (see, e.g., Kiaee et al. 2015; 2018; 2018).

In the RVM, the basis function \(\phi _j\) is formed from a kernel, and while many different kernels have been used with success, the most common choice is the Gaussian kernel:

$$\begin{aligned} \phi _j(\varvec{x}_i) =K(\varvec{x}_i,\varvec{x}_j) = \textrm{exp} \Bigg (-\gamma \sum _{d=1}^D {(x_{id}-x_{jd})^2}\Bigg ). \end{aligned}$$
(3)

This kernel includes an additional parameter, \(\gamma \), that is not estimated in the RVM, but has to be set or tuned by, e.g., cross validation (CV). The value of \(\gamma \) affects how many weights are chosen (non-zero) by the RVM during the estimation process, and the results can be sensitive to the choice of \(\gamma \).
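To make the construction concrete, the following is a minimal Python sketch of how the design matrix in Eq. (2) could be assembled from the Gaussian kernel in Eq. (3). The function names and the toy inputs are illustrative assumptions and are not taken from the implementation used in this paper.

```python
import math

def gaussian_kernel(x, z, gamma):
    """Gaussian kernel of Eq. (3): exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def design_matrix(X, gamma):
    """Return the N x (N + 1) design matrix; column 0 is the intercept."""
    return [[1.0] + [gaussian_kernel(xi, xj, gamma) for xj in X] for xi in X]

# Toy inputs (hypothetical): N = 3 observations with D = 2 features.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
Phi = design_matrix(X, gamma=0.5)
```

A larger \(\gamma \) makes the kernel more local, so the off-diagonal kernel entries of the matrix shrink towards zero, which is one way to see why the number of selected relevance vectors depends on this parameter.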

The original RVM is not applicable to feature selection, but other methods inspired by the RVM have been developed for this purpose (Lapedriza et al. 2008; Mohsenzadeh et al. 2016). The recently developed Probabilistic Feature Selection and Classification Vector Machine (\({\textrm{PFCVM}}_{\text {LP}}\)) by Jiang et al. (2019) is able to simultaneously select the relevant samples and the relevant features. Feature selection in \({\textrm{PFCVM}}_{\text {LP}}\) is achieved by modifying Eq. (3) to include individual kernel parameters \(\gamma _d\) for the features, resulting in the kernel function

$$\begin{aligned} \phi _j(\varvec{x}_i)= K(\varvec{x}_i,\varvec{x}_j) = \textrm{exp} \Bigg (- \sum _{d=1}^D \gamma _d (x_{id}-x_{jd})^2\Bigg ). \end{aligned}$$
(4)

We will use the notation \(\varvec{\Phi }_{\Gamma }\) for the corresponding kernel matrix, where \(\Gamma = (\gamma _1, \dots ,\gamma _D)^\top \). When \(\gamma _d\) is estimated to be zero, the corresponding feature does not contribute to the model. The kernel parameters are restricted to be nonnegative; hence, a left truncated Gaussian ARD prior with individual precision parameters is the conventional choice. The \({\textrm{PFCVM}}_{\text {LP}}\) method has the advantage that the kernel parameters are estimated internally in the model, but the method has so far only been derived for Gaussian or polynomial kernels, and only in a classification setting.
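The effect of a zero kernel parameter can be illustrated with a small Python sketch of Eq. (4). The function name and the numerical values are hypothetical and serve only to show that a feature with \(\gamma _d = 0\) has no influence on the kernel value.

```python
import math

def ard_gaussian_kernel(x, z, gammas):
    """Kernel of Eq. (4): one nonnegative scale gamma_d per feature."""
    if any(g < 0 for g in gammas):
        raise ValueError("kernel parameters must be nonnegative")
    return math.exp(-sum(g * (a - b) ** 2 for g, a, b in zip(gammas, x, z)))

# Hypothetical estimate in which the second feature is switched off.
gammas = [0.5, 0.0]
k_a = ard_gaussian_kernel([1.0, 3.0], [0.0, -7.0], gammas)
k_b = ard_gaussian_kernel([1.0, 100.0], [0.0, 0.0], gammas)
# k_a and k_b coincide: only the first feature enters the exponent.
```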

The \({\textrm{PFCVM}}_{\text {LP}}\) is one example of a method that is inspired by the structure of the original RVM and uses ARD priors to achieve sparse solutions. Other examples are the Smooth Relevance Vector Machine by Schmolck and Everson (2007) and the Relevance Sample-Feature Machine by Mohsenzadeh et al. (2013). Each of these models involves smaller or larger changes to the RVM model in order to apply it to different problems. However, even minor changes to the original RVM model require tedious and careful derivations of the new estimates for the posterior distribution of \(\varvec{w}\), estimates of the hyperparameters, and specific computer implementations. In this work we take a different approach and demonstrate how sparse Bayesian methods can be implemented by taking advantage of the Template Model Builder (TMB) (Kristensen et al. 2016). TMB evaluates numerically (to machine precision) the first and second order derivatives of the objective function by Automatic Differentiation (Griewank and Walther 2008). This feature is used to provide a Laplace approximation of the marginal likelihood, where, in an RVM context, the weights \(\varvec{w}\) have been integrated out. The main advantage of implementing the Bayesian framework in TMB is that changes to the methods can be made with little effort. TMB requires as input from the user only the joint likelihood of the weights \(\varvec{w}\) and the data, while the Laplace approximation is calculated automatically. No derivatives have to be calculated manually, and there is no need to derive tailored algorithms for different extensions. In addition, we can use TMB to estimate parameters outside the original model, with little effort at least with respect to model formulation. Such parameters, for example the kernel parameter \(\gamma \) in the RVM, can be computationally expensive to tune using, e.g., CV.

While TMB has been used with success in many different settings, this is to our knowledge the first time it has been used to implement sparse Bayesian methods, and specifically RVM type methods. Because of the large number of hyperparameters in such methods, a novel active set algorithm is derived to speed up calculations. To demonstrate how TMB can be applied in practice to sparse Bayesian learning, we show how it can implement the RVM method and the \({\textrm{PFCVM}}_{\text {LP}}\) method. To demonstrate the flexibility, we also extend the \({\textrm{PFCVM}}_{\text {LP}}\) method to a regression setting using TMB, and present a novel extension of the original RVM that also includes feature selection.

The remainder of this paper is organized as follows. Section 2 shows how the Bayesian hierarchical framework fits into the estimation procedure of TMB. Section 3 presents the active set algorithm that significantly speeds up the calculations. Section 4 presents the models used in the numerical results of Sect. 5, and a discussion is provided in Sect. 6. Finally, concluding remarks are given in Sect. 7.

2 Sparse Bayesian hierarchical models using TMB

Bayesian models naturally have a hierarchical structure, and this is also the case for sparse Bayesian models (Tipping and Faul 2003; Huang et al. 2017). Assume the model structure from Eq. (1), where the parameters of interest are the weights in \(\varvec{w}\). These will be given a prior distribution, \(p(\varvec{w} | \varvec{\alpha })\), where \(\varvec{\alpha }\) contains the parameters of this prior distribution. The parameters in the vector \(\varvec{\alpha }\) are called hyperparameters, which indicates that they are parameters one level below \(\varvec{w}\). In sparse hierarchical models, each element in \(\varvec{w}\) is associated with a separate hyperparameter in \(\varvec{\alpha }\). A common choice is to use a zero mean Gaussian ARD prior on every element of \(\varvec{w}\):

$$\begin{aligned} p({\varvec{w}}| \varvec{\alpha }) = \prod _{i=0}^{N}\mathcal {N}({w_i} |0, \alpha _i^{-1}), \end{aligned}$$
(5)

where each \(w_i\) has a separate precision parameter \(\alpha _i\). Note however that the active set algorithm in Sect. 3 can easily be adjusted to handle other types of ARD priors.

In addition to the hyperparameters associated with each weight, we define a set of additional hyperparameters, named \(\varvec{\tau }\), that are possibly given a prior distribution, \(p(\varvec{\tau }| \varvec{\xi })\). In this work we use the convention that the weights are called parameters, while all remaining components are called hyperparameters.

The likelihood of the data given the parameters, \(p(\varvec{t} | \varvec{w}, \varvec{\tau })\), will in general only depend on the hyperparameters \(\varvec{\alpha }\) and \(\varvec{\xi }\) through the prior distributions of \(\varvec{w}\) and \(\varvec{\tau }\). This results in the hierarchical structure of the full model:

$$\begin{aligned} \varvec{t} | \varvec{w},\varvec{\tau }&\sim p(\varvec{t} | \varvec{w}, \varvec{\tau }), \end{aligned}$$
(6)
$$\begin{aligned} \varvec{w}|\varvec{\alpha }&\sim p(\varvec{w}| \varvec{\alpha }),\end{aligned}$$
(7)
$$\begin{aligned} \varvec{\tau }|\varvec{\xi }&\sim p(\varvec{\tau }|\varvec{\xi }). \end{aligned}$$
(8)

The hyperparameters are in general unknown, but can be estimated by maximizing the penalized marginal likelihood:

$$\begin{aligned} p( \varvec{t}| \varvec{\alpha },\varvec{\tau })p(\varvec{\tau }| \varvec{\xi }) = \int p(\varvec{t} | \varvec{w},\varvec{\tau }) p(\varvec{w} | \varvec{\alpha }) {\text {d}}\!\varvec{w} \times p(\varvec{\tau }| \varvec{\xi }). \end{aligned}$$
(9)

In the remainder of this paper, we will refer to Eq. (9) as the marginal likelihood. Many methods, such as the RVM, do not include the hyperprior \(p(\varvec{\tau }| \varvec{\xi })\), in which case Eq. (9) reduces to the marginal likelihood \(p(\varvec{t} | \varvec{\alpha }, \varvec{\tau })\). However, even when the hyperprior \(p(\varvec{\tau }| \varvec{\xi })\) is included, this penalized marginal likelihood will be treated in the same way as the marginal likelihood.

The integral in Eq. (9) is obtained from the joint likelihood of the data \(\varvec{t}\) and the weights \(\varvec{w}\):

$$\begin{aligned} p(\varvec{w},\varvec{t} |\varvec{\alpha }, \varvec{\tau }) = p(\varvec{t} | \varvec{w},\varvec{\tau }) p(\varvec{w} | \varvec{\alpha }), \end{aligned}$$
(10)

by marginalization over \(\varvec{w}\). Finding the hyperparameters \(\varvec{\tau }\), \(\varvec{\alpha }\) and \(\varvec{\xi }\) by maximizing the marginal likelihood in Eq. (9) is equivalent to a framework known as empirical Bayes (MacKay 1992a; Maritz and Lwin 2018) or type II maximum likelihood (Tipping 2001; Berger 2013).

The posterior distribution for the parameters in \(\varvec{w}\) can in some cases be calculated analytically using Bayes’ theorem. This requires that the prior distribution in Eq. (7) is a conjugate prior for the likelihood in Eq. (6). However, we are interested in the general case where an analytical solution may not be available, and we instead approximate the posterior distribution for \(\varvec{w}\) by using the Laplace approximation. This amounts to approximating the posterior \(p(\varvec{w}| \varvec{t}, \varvec{\tau }, \varvec{\alpha }) \propto p(\varvec{w}, \varvec{t} | \varvec{\alpha }, \varvec{\tau })\) by a Gaussian distribution with mean:

$$\begin{aligned} \varvec{\hat{w}} = \mathop {\mathrm {arg\,max}}\limits _{\varvec{w}} \log ( p(\varvec{t} | \varvec{w}, \varvec{\tau }) p(\varvec{w} | \varvec{\alpha })), \end{aligned}$$
(11)

i.e. the value of \(\varvec{w}\) that maximizes Eq. (10). Further, to obtain the covariance matrix in the Laplace approximation, we define the Hessian of the logarithm of Eq. (10):

$$\begin{aligned} H(\varvec{w}) = \frac{\partial ^2}{\partial \varvec{w}^2} \log \left( p(\varvec{t} | \varvec{w}, \varvec{\tau }) p(\varvec{w} | \varvec{\alpha })\right) . \end{aligned}$$
(12)

Then, the covariance matrix \(\hat{\varvec{\Sigma }}\) in the Laplace approximation follows from standard asymptotic theory:

$$\begin{aligned} \hat{\varvec{\Sigma }} = -H(\hat{\varvec{w}})^{-1}. \end{aligned}$$
(13)

We have now laid the foundation for everything needed to implement sparse Bayesian methods with the help of TMB. To apply the Laplace approximation, the logarithm of the hyperprior (8) and the joint likelihood (10) must be defined as a C++ function. From this, TMB can be used to calculate the Laplace approximation of the marginal likelihood as:

$$\begin{aligned} p(\varvec{t}|\varvec{\tau },\varvec{\alpha })p({\varvec{ \tau }}|\varvec{\xi })&= \int p(\varvec{t} | \varvec{w},\varvec{\tau }) p(\varvec{w} | \varvec{\alpha }) {\text {d}}\!\varvec{w} \times p(\varvec{\tau }| \varvec{\xi })\nonumber \\&\approx p(\varvec{t}|\varvec{\hat{w}},\varvec{\tau }) p({\varvec{\hat{w}}}|\varvec{\alpha })(2\pi )^{(N+1)/2} |\hat{\varvec{\Sigma }} | ^{1/2} p(\varvec{\tau }| \varvec{\xi }), \end{aligned}$$
(14)

where \(|\hat{\varvec{\Sigma }} |\) denotes the determinant of \(\hat{\varvec{\Sigma }} \). In order to evaluate the Laplace approximation given by Eq. (14), TMB uses second-order automatic differentiation and a Newton method to find \(\hat{\varvec{w}}\) and \(\hat{\varvec{\Sigma }}\) from Eqs. (11) and (13).
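As a sanity check of the Laplace approximation, consider the simplest case of a single weight w with a Gaussian likelihood and a Gaussian prior. The log joint likelihood is then quadratic in w, so the approximation coincides with the exact marginal likelihood. The Python sketch below illustrates this under those assumptions; the function names and toy data are hypothetical, the mode is available in closed form so no Newton iterations are needed, and the code is unrelated to TMB's internal implementation.

```python
import math

def laplace_log_marginal(t, phi, sigma2, alpha):
    """Laplace approximation of log p(t | alpha, sigma2) for one weight w."""
    n = len(t)
    s_pp = sum(p * p for p in phi)
    s_pt = sum(p * y for p, y in zip(phi, t))
    neg_hess = s_pp / sigma2 + alpha          # -H(w), constant in w here
    w_hat = (s_pt / sigma2) / neg_hess        # mode, Eq. (11)
    cov = 1.0 / neg_hess                      # Eq. (13)
    rss = sum((y - p * w_hat) ** 2 for p, y in zip(phi, t))
    log_lik = -0.5 * n * math.log(2 * math.pi * sigma2) - 0.5 * rss / sigma2
    log_prior = 0.5 * math.log(alpha / (2 * math.pi)) - 0.5 * alpha * w_hat ** 2
    # Log of the Laplace formula with dim(w) = 1 and no hyperprior.
    return log_lik + log_prior + 0.5 * math.log(2 * math.pi) + 0.5 * math.log(cov)

def exact_log_marginal(t, phi, sigma2, alpha):
    """Exact log N(t | 0, sigma2 * I + phi phi^T / alpha) via rank-one identities."""
    n = len(t)
    s_pp = sum(p * p for p in phi)
    s_pt = sum(p * y for p, y in zip(phi, t))
    s_tt = sum(y * y for y in t)
    log_det = n * math.log(sigma2) + math.log(1.0 + s_pp / (alpha * sigma2))
    quad = s_tt / sigma2 - s_pt ** 2 / (sigma2 ** 2 * (alpha + s_pp / sigma2))
    return -0.5 * (n * math.log(2 * math.pi) + log_det + quad)

# Toy data (hypothetical values).
t = [0.5, -1.0, 2.0]
phi = [1.0, 0.3, 0.8]
approx = laplace_log_marginal(t, phi, sigma2=0.7, alpha=2.0)
exact = exact_log_marginal(t, phi, sigma2=0.7, alpha=2.0)
```

For non-Gaussian likelihoods, such as the Bernoulli likelihood used for classification, the two quantities no longer agree exactly and the mode has to be found iteratively.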

To estimate the hyperparameters, Eq. (14) is maximized with respect to the hyperparameters \(\varvec{\alpha }\), \(\varvec{\tau }\), and \(\varvec{\xi }\). In sparse Bayesian models with a prior given by Eq. (5), the hyperparameters \(\alpha _i\) that approach infinity will force the corresponding weights \(w_i\) to zero. Thus, in the optimization of Eq. (14) we assign an upper bound to all hyperparameters. The weights corresponding to the \(\alpha _i\)’s that reach this bound in the optimization are interpreted as zero in the model. Because TMB evaluates both the function value and the gradient of the Laplace approximation of the marginal likelihood, any standard constrained maximization routine can be used.

TMB is implemented as an R package (R Core Team 2022), which by linking to the C++ code makes the Laplace approximation (14) available to the user as an R function. TMB is designed to handle a large number of latent variables (weights). Note, however, that it can only handle a small to moderate number of hyperparameters (Kristensen et al. 2016). The practical limitation is that the computation of the gradient of the Laplace approximation (14) consumes an extensive amount of computer memory. This poses a problem for sparse Bayesian models, because the dimension of \(\varvec{\alpha }\) is the same as that of \(\varvec{w}\), which is high dimensional. To overcome this problem, the sparsity of the models can be exploited. In Sect. 3 we will present an algorithm that at any time only works with a small subset of \(\varvec{w}\) (and \(\varvec{\alpha }\)).

3 Active set algorithm

In the TMB literature “sparsity” refers to the Hessian (12) being a sparse matrix. This form of sparsity results from the multiplicative structure of the likelihood, and is computationally beneficial. However, this notion of sparsity differs from what is meant by sparsity in the SBL framework, where it means that most of the elements of \(\varvec{w}\) are zero. TMB does not have any built-in functionality for exploiting the latter type of sparsity, and we thus have to devise a special algorithm.

To avoid estimating excessively many parameters simultaneously, we develop an adaptive algorithm, specified in Algorithm 1, for updating a small subset of the weight parameters. This approach can lead to a substantial decrease in computational time when N is large, since a large N implies many hyperparameters. The algorithm has similarities with stepwise regression (Ament and Gomes 2021) and the fast RVM (Tipping and Faul 2003), which was introduced to improve the computational speed of the original RVM (Tipping 2001). Instead of starting the optimization procedure with all weights included, the fast RVM starts with an empty model and adds, prunes or re-estimates weights at each iteration. Inspired by this procedure, we provide an algorithm that updates only a subset of the components in \(\varvec{w}\), along with the corresponding components of \(\varvec{\alpha }\), in each iteration. This subset, denoted by \(\mathcal A \), is referred to as the “active set” and will converge towards a sparse subset of the complete set of observations.

Initialization

The algorithm is initialized with \(\log \alpha _i=5\) for all i, which corresponds to the upper bound \(\alpha _i = \exp (5)\) used in our implementation. The remaining hyperparameters are initialized to 1. The active set is initialized with B elements. These B initial elements in \(\mathcal A\) can be chosen randomly, but we have found that it is better to maximize the joint likelihood (10) with respect to \(\varvec{w}\), keeping the hyperparameters fixed at their initial values, and then add the indices of the B largest absolute weights \(|w_i|\) to the active set. The initial values of the remaining weights are set to zero. This corresponds to steps 1 to 4 in Algorithm 1.

Find optimal parameters

Each iteration starts with updating the hyperparameters (step 5 in Algorithm 1). To learn the hyperparameters, the Laplace approximation of the marginal likelihood (14) is maximized with respect to the hyperparameters \(\varvec{\tau }\), \(\varvec{\xi }\) and \(\alpha _i,\ i\in \mathcal A\), while the \(\alpha _i\)’s not in the active set are fixed to the upper bound. For each evaluation of the marginal likelihood an inner optimization must be done to calculate \(\hat{\varvec{w}}\) and \(\hat{\varvec{\Sigma }}\) from Eqs. (11) and (13). Note that the \(\hat{w}_i\) that do not correspond to indices in the active set \(\mathcal A\) are fixed to zero in the inner optimization.

In our implementation, we use TMB to calculate the Laplace approximation of the marginal likelihood (14). The advantage of using TMB is that we only have to specify the negative logarithm of the joint likelihood in Eq. (10), and TMB will automatically apply the Laplace approximation to the provided function. Further, TMB sets up and solves the inner optimization with respect to \({\varvec{w}}\) automatically. Finally, TMB calculates the derivatives of the marginal likelihood that are used in the outer optimization with respect to \(\varvec{\alpha }\), \(\varvec{\tau }\) and \(\varvec{\xi }\).

Pruning \(\mathcal A\)

After the optimization, we may remove indices from the active set (step 6 in Algorithm 1). The indices corresponding to the \(\alpha _i\)’s that reach the upper bound in the optimization are removed from the active set.

Enlarging \(\mathcal A\)

After elements have been pruned from \(\mathcal A\), the joint likelihood (10) is maximized with respect to all weight parameters, \(\varvec{w}\) (step 7 in Algorithm 1). The values of the hyperparameters are fixed to the final values found in the optimization of the marginal likelihood. Likewise, the final values of the weights in the active set are used as an initial guess in the optimization, while the initial guess for the weights not in the active set is zero. This procedure is equivalent to how the active set is initialized at the beginning of the algorithm. The indices of the weights whose absolute values exceed a threshold, \(|w_i|>\delta \), and that are not already in the active set, are added to the active set (step 8 in Algorithm 1). The algorithm is considered to have converged when the changes in the marginal likelihood (14) are lower than a threshold, set to \(10^{-6}\) for the results presented in Sect. 5.

Algorithm 1 Active set algorithm
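The bookkeeping of the active set described above can be sketched in a few lines of Python. This is only an illustration of the selection, pruning and enlarging rules: the function names, the numerical tolerance `tol` and the toy numbers are assumptions, and the optimizations of the marginal and joint likelihoods that would produce `log_alpha` and `w_full` in practice are not shown.

```python
def initial_active_set(w, B):
    """Indices of the B largest |w_i| (used when initializing the active set)."""
    order = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)
    return set(order[:B])

def prune(active, log_alpha, log_alpha_max=5.0, tol=1e-8):
    """Drop indices whose log(alpha_i) has reached the upper bound (step 6)."""
    return {i for i in active if log_alpha[i] < log_alpha_max - tol}

def enlarge(active, w_full, delta=0.5):
    """Add indices with |w_i| > delta that are not yet active (step 8)."""
    return active | {i for i, wi in enumerate(w_full) if abs(wi) > delta}

# Toy numbers (hypothetical) standing in for the optimizer output.
w_init = [0.1, -2.0, 0.0, 0.7, -0.05]
A0 = initial_active_set(w_init, B=2)        # indices 1 and 3
log_alpha = [5.0, 0.3, 5.0, 5.0, 5.0]       # index 3 hit the bound
A1 = prune(A0, log_alpha)                   # index 1 remains
w_full = [0.9, -1.8, 0.0, 0.2, -0.6]        # joint maximization over all w
A2 = enlarge(A1, w_full)                    # indices 0 and 4 are added
```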

4 SBL methods implemented in TMB

This section applies the general theory from Sect. 2 to the specific methods used in the numerical results of Sect. 5. The SBL methods presented here are the RVM and the dimension reduction method \({\textrm{PFCVM}}_{\text {LP}}\). Although there are certain differences between the two methods, the main estimation procedure is similar. They both use the empirical Bayes approach to optimize the hyperparameters. We also show how the framework of TMB facilitates extension and development of methods, by adapting the RVM to feature selection in a method we call \({\textrm{KRVM}}_\text {TMB}\). Normally, adjusting the original RVM algorithm requires recalculating the components corresponding to the posterior for \(\varvec{w}\) and the marginal likelihood. The advantage of implementing these methods in TMB is that only small adjustments to the implementation are needed, because TMB only requires the structure of the joint likelihood of the parameters and the data as input.

4.1 \({\textrm{RVM}}_\text {TMB}\)

Whenever it is necessary to distinguish the fast RVM method from the one implemented using TMB, we add a subscript TMB when we refer to the method implemented in TMB, and no subscript when we refer to the original method. The \({\textrm{RVM}}_\text {TMB}\) will be benchmarked against the fast RVM in the following section, and whenever we refer to RVM we mean the fast version. Both \({\textrm{RVM}}_\text {TMB}\) and RVM use the same likelihood function and the same prior on the weights, and the main difference between the two methods lies in how the hyperparameters and weights are found. In addition, \({\textrm{RVM}}_\text {TMB}\) adds the kernel parameter \(\gamma \) as a hyperparameter, while RVM tunes it with cross-validation. For details on how the fast RVM is implemented, see Tipping and Faul (2003).

4.1.1 Regression

Given the training data \(\{\varvec{x}_i,t_i \}_{i=1}^N\), the \({\textrm{RVM}}_\text {TMB}\) method assumes that the vector of observed targets, \(\varvec{t}= (t_1, \dots , t_N)^\top \), is normally distributed and can be expressed as in Eq. (2). The \({\textrm{RVM}}_\text {TMB}\) method has two hyperparameters in addition to \(\varvec{\alpha }\): one representing the variance \(\sigma ^2\) and one representing the kernel parameter \(\gamma \) in Eq. (3). These correspond to the elements in \(\varvec{\tau }\) from Eq. (6). They are both given flat prior distributions, and the hyperprior analogous to Eq. (8) is therefore omitted. This gives the likelihood function of the training data:

$$\begin{aligned} p(\varvec{t} |\varvec{w},\sigma ^2,\gamma ) = \mathcal {N}(\varvec{t} |\varvec{\Phi }_{\gamma }\varvec{w}, \sigma ^2 \varvec{I}_N), \end{aligned}$$
(15)

with the basis matrix \(\varvec{\Phi }_\gamma \) defined by the Gaussian kernel in Eq. (3). Note that we use the notation \(\varvec{\Phi }_\gamma \) for the kernel in the \({\textrm{RVM}}_{\text {TMB}}\) since this method includes estimation of \(\gamma \). The \({\textrm{RVM}}_\text {TMB}\) uses the zero mean Gaussian ARD prior given in Eq. (5). Thus the joint log likelihood that must be specified to implement the \({\textrm{RVM}}_\text {TMB}\) method using TMB is:

$$\begin{aligned} \log p(\varvec{w}, \varvec{t} | \varvec{\alpha }, \sigma ^2,\gamma ) = \log \left( \mathcal {N}(\varvec{t} |\varvec{\Phi }_\gamma \varvec{w}, \sigma ^2 \varvec{I}_N)\right) \nonumber \\ +\sum _{i=0}^{N} \log \mathcal {N}({w_i} |0, \alpha _i^{-1}). \end{aligned}$$
(16)

Sparsity is obtained during estimation by adaptively pruning weights, \(w_i\), that have an estimated variance close to zero (or equivalently when the precision parameter \(\alpha _i \) approaches infinity).
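For illustration, the negative of the joint log likelihood in Eq. (16), which is the quantity a user would have to specify, can be written as a short Python function. This is a sketch under the assumption that the kernel matrix has already been computed for a given \(\gamma \); the function names are hypothetical, and it is not the C++ template used with TMB in this paper.

```python
import math

def log_normal(x, mean, var):
    """Log density of a univariate Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def neg_joint_log_lik_regression(w, t, Phi, alpha, sigma2):
    """Negative of Eq. (16); Phi is N x (N + 1), w and alpha have length N + 1."""
    fitted = [sum(p * wj for p, wj in zip(row, w)) for row in Phi]
    log_lik = sum(log_normal(ti, fi, sigma2) for ti, fi in zip(t, fitted))
    # ARD prior of Eq. (5): weight w_j has variance 1 / alpha_j.
    log_prior = sum(log_normal(wj, 0.0, 1.0 / aj) for wj, aj in zip(w, alpha))
    return -(log_lik + log_prior)

# Smallest possible toy case (hypothetical): N = 1, so Phi is 1 x 2.
value = neg_joint_log_lik_regression(
    w=[0.0, 0.0], t=[0.0], Phi=[[1.0, 1.0]], alpha=[1.0, 1.0], sigma2=1.0
)
```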

4.1.2 Classification

The \({\textrm{RVM}}_\text {TMB}\) for two-category classification problems has a setup similar to that for regression, but the likelihood function (15) must be replaced by a Bernoulli likelihood

$$\begin{aligned} p(\varvec{t} |\varvec{w}, \gamma ) = \prod _{i=1}^N \Big \{S \big ( (\varvec{\Phi }_\gamma \varvec{w})_i \big ) \Big \}^{t_i}\Big \{1-S \big ( (\varvec{\Phi }_\gamma \varvec{w})_i \big )\Big \}^{1-t_i}, \nonumber \\ \end{aligned}$$
(17)

where the sigmoid function \(S(x) =(1+\exp (-x))^{-1}\) is used to obtain a mapping from \((-\infty ,\infty )\) to \((0, 1)\). The training targets \(t_i\) take values 0 or 1, and \((\varvec{\Phi }_\gamma \varvec{w})_i\) is the i’th element of the vector \(\varvec{\Phi }_\gamma \varvec{w}\). This results in the joint log likelihood function:

$$\begin{aligned}&\log p(\varvec{w}, \varvec{t} | \varvec{\alpha }, \gamma ) = \sum _{i=1}^N \Big [ {t_i} \log S \big ( (\varvec{\Phi }_\gamma \varvec{w})_i \big )\nonumber \\&\qquad \qquad + (1-t_i) \log \Big \{1-S \big ( (\varvec{\Phi }_\gamma \varvec{w})_i \big )\Big \} \Big ]\nonumber \\&\qquad \qquad + \sum _{j=0}^{N} \log \mathcal {N}\left( {w_j} |0, \alpha _j^{-1} \right) . \end{aligned}$$
(18)
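A minimal Python sketch of Eq. (18) is given below. It assumes a precomputed kernel matrix and uses the identity \(\log \{1-S(a)\} = \log S(-a)\) together with a numerically stable evaluation of \(\log S\); the function names are hypothetical and the code is meant as an illustration rather than as the implementation used for the results.

```python
import math

def log_sigmoid(a):
    """Numerically stable log S(a) with S(a) = 1 / (1 + exp(-a))."""
    return -math.log1p(math.exp(-a)) if a >= 0 else a - math.log1p(math.exp(a))

def joint_log_lik_classification(w, t, Phi, alpha):
    """Eq. (18): Bernoulli log likelihood plus Gaussian ARD log prior."""
    total = 0.0
    for ti, row in zip(t, Phi):
        a = sum(p * wj for p, wj in zip(row, w))
        # log(1 - S(a)) equals log S(-a).
        total += ti * log_sigmoid(a) + (1 - ti) * log_sigmoid(-a)
    for wj, aj in zip(w, alpha):
        total += 0.5 * math.log(aj / (2 * math.pi)) - 0.5 * aj * wj ** 2
    return total

# Toy case (hypothetical): N = 2 observations, all weights zero.
value = joint_log_lik_classification(
    w=[0.0, 0.0, 0.0],
    t=[1, 0],
    Phi=[[1.0, 1.0, 0.5], [1.0, 0.5, 1.0]],
    alpha=[1.0, 1.0, 1.0],
)
```

With all weights at zero every predicted probability is 0.5, so the Bernoulli part contributes \(2\log 0.5\) in this toy case.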

4.2 \({\textrm{KRVM}}_\text {TMB}\)

The original RVM is not primarily devised for feature selection, but Tipping (2001) includes an example where the Gaussian kernel is applied with two individual kernel parameters for a dataset with two features. We refer to this method as the Kernel Relevance Vector Machine (KRVM).

The joint log likelihood of the \({\textrm{KRVM}}_\text {TMB}\) is identical to Eq. (16) (regression) and Eq. (18) (classification), except for the kernel matrix. The kernel matrix for the \({\textrm{KRVM}}_\text {TMB}\) is \(\varvec{\Phi }_{ \Gamma }\), where \(\Gamma = (\gamma _1, \dots , \gamma _D)^\top \), which is obtained from the kernel function in Eq. (4). The kernel parameters in \(\Gamma \) and the variance \(\sigma ^2\) correspond to the elements in \(\varvec{\tau }\) in Eq. (6) from the general formulation. As for the \({\textrm{RVM}}_\text {TMB}\), these are given flat prior distributions, and the hyperprior is therefore omitted. To find the kernel parameters in \(\Gamma \), Tipping (2001) proposes a sequential iterative scheme that switches between maximizing the marginal likelihood with respect to \(\Gamma \) and finding \(\varvec{\alpha }\) from the analytical expression of the RVM method, but postulates that better optimization methods can be found. In our approach, the kernel parameters and the hyperparameters on the weights are optimized simultaneously. While Tipping (2001) claims that such a joint nonlinear optimization is prohibitively slow, we will demonstrate that, by taking advantage of automatic differentiation and the Laplace approximation in TMB, we obtain running times comparable to those of the fast RVM method.

Table 1 The classification and regression datasets, their dimension, evaluation metric and source

4.3 \({\textrm{PFCVM}}_\text {TMB}\)

Similar to \({\textrm{KRVM}}_\text {TMB}\), the \({\textrm{PFCVM}}_{\text {LP}}\) method by Jiang et al. (2019) obtains feature selection by including an individual kernel parameter for each feature dimension. As for the RVM method, we use \({\textrm{PFCVM}}_\text {TMB}\) to refer to the implementation using TMB and \({\textrm{PFCVM}}_{\text {LP}}\) to refer to the original method presented in Jiang et al. (2019).

The only difference between \({\textrm{PFCVM}}_\text {TMB}\) and \({\textrm{KRVM}}_\text {TMB}\) is that a left truncated Gaussian prior is assigned to the kernel parameters \( \Gamma \) in \({\textrm{PFCVM}}_\text {TMB}\):

$$\begin{aligned} p(\gamma _d|\beta _d)= {\left\{ \begin{array}{ll} 2\mathcal {N}(\gamma _d| 0, \beta _d^{-1}) \qquad & \textrm{if} \ \gamma _d \ge 0, \\ 0 \qquad & \textrm{otherwise} \end{array}\right. }, \quad d\in \{1,\ldots , D\}. \nonumber \\ \end{aligned}$$
(19)

The kernel parameters \(\gamma _1, \dots , \gamma _D\) and hyperparameters \(\beta _1, \dots , \beta _D\) correspond to the elements in the vectors \(\varvec{\tau }\) and \(\varvec{\xi }\), respectively, in Eq. (8). The \({\textrm{PFCVM}}_\text {TMB}\) uses the same prior for the weights as the other methods, which is the Gaussian prior from Eq. (5). Note that the \({\textrm{PFCVM}}_\text {LP}\) also uses a left truncated Gaussian prior for the weights. We instead assign a regular Gaussian prior, for two reasons: the Laplace approximation requires that the weights are Gaussian in order to obtain a good approximation, and we will extend the \({\textrm{PFCVM}}_\text {LP}\) method to regression, where it is natural that negative weights are allowed.
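The log density of the left truncated Gaussian prior in Eq. (19), which enters the objective as a penalty on each kernel parameter, can be sketched in Python as follows. The function name is hypothetical and the code only illustrates the formula.

```python
import math

def log_left_truncated_normal(gamma, beta):
    """Log density of Eq. (19): 2 * N(gamma | 0, 1/beta) on gamma >= 0."""
    if gamma < 0:
        return -math.inf
    log_gauss = 0.5 * math.log(beta / (2 * math.pi)) - 0.5 * beta * gamma ** 2
    # The factor 2 renormalizes the half of the Gaussian that is kept.
    return math.log(2.0) + log_gauss

# Toy evaluation (hypothetical values): a large precision beta_d penalizes
# a nonzero gamma_d more strongly than a small one.
penalty_small_beta = log_left_truncated_normal(1.0, beta=0.1)
penalty_large_beta = log_left_truncated_normal(1.0, beta=50.0)
```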

4.3.1 Classification with feature selection

The \({\textrm{PFCVM}}_{\text {LP}}\) method is developed to solve classification problems, and the sigmoid function is used to map the predicted outputs, \(\varvec{\Phi }_{ \Gamma }\varvec{w},\) to [0, 1]. When combined with a Bernoulli distribution a likelihood identical to Eq. (17) is obtained, except for the kernel matrix, \(\varvec{\Phi }_{ \Gamma }\), that is calculated from Eq. (4).

Combining (19) with (5) we obtain the log likelihood used in the \({\textrm{PFCVM}}_\text {TMB}\) method:

$$\begin{aligned} \log \left( p(\varvec{w}, \varvec{t} | \varvec{\alpha },\Gamma ) p(\Gamma | \varvec{\beta }) \right)&=\log (p(\varvec{w}, \varvec{t} | \varvec{\alpha },\Gamma )) \\&\quad + \sum _{d=1}^D \log \mathcal {N}_t({\gamma _d} |0, \beta _d^{-1}) , \end{aligned}$$

where \(\mathcal {N}_t\) denotes a left truncated Gaussian and \(\log (p(\varvec{w}, \varvec{t} | \varvec{\alpha },\Gamma ))\) is given by Eq. (18) when replacing \(\varvec{\Phi }_{ \gamma }\) with \(\varvec{\Phi }_{ \Gamma }\).

4.3.2 Regression with feature selection

Although the original \({\textrm{PFCVM}}_{\text {LP}}\) method is only developed to handle classification tasks, it is easily extended to solve regression problems by using TMB. The only modification needed is to adjust the likelihood for the targets. Thus, for regression, the likelihood is given by Eq. (15), with \(\varvec{\Phi }_{\gamma }\) replaced by \(\varvec{\Phi }_{\Gamma }\). The kernel structure and the priors for \(\varvec{w}\) and \( \Gamma \) are the same as for classification. The method as described in Jiang et al. (2019), on the other hand, would require major modifications in order to be adapted to a regression setting. The same applies to the Matlab code that underlies Jiang et al. (2019), which in our opinion illustrates the advantages of using TMB.

5 Results

To test the performance of the different methods implemented using TMB (see Table 2), we compare them with the original implementations on the regression and classification datasets from James et al. (2013) available from the source package ISLR in R. We also include some of the commonly used datasets from the University of California Irvine (UCI) machine learning repository. The source packages for all the datasets in R are listed in Table 1 as well as the performance measurements. Each dataset is split randomly into a training set (\(70\%\)) and a test set (\(30\% \)). All algorithms train on the same training set, and report the mean squared error (MSE) or the prediction accuracy over the test set. This is repeated for 100 different splits and the mean and standard error are calculated across these 100 simulation replicates.
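The evaluation protocol can be sketched in Python as follows. The sketch assumes that the standard error is the sample standard deviation of the per-split scores divided by the square root of the number of splits; the function names, the random seed and the placeholder model (which simply predicts the training mean) are hypothetical and stand in for the actual methods.

```python
import math
import random

def repeated_holdout(n, score_fn, n_splits=100, train_frac=0.7, seed=0):
    """Mean and standard error of score_fn over random train/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        cut = int(round(train_frac * n))
        scores.append(score_fn(idx[:cut], idx[cut:]))
    mean = sum(scores) / n_splits
    var = sum((s - mean) ** 2 for s in scores) / (n_splits - 1)
    return mean, math.sqrt(var / n_splits)

# Toy targets and a placeholder "model" that predicts the training mean.
y = [float(i % 5) for i in range(50)]

def mean_predictor_mse(train, test):
    m = sum(y[i] for i in train) / len(train)
    return sum((y[i] - m) ** 2 for i in test) / len(test)

mean_mse, se = repeated_holdout(len(y), mean_predictor_mse)
```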

Table 2 A comparison of the RVM-based methods that are implemented in TMB

Because TMB evaluates both the Laplace approximation and the corresponding gradient function, any standard optimization routine in R can be used to maximise the marginal likelihood (14) with respect to the hyperparameters. In all numerical results we use the quasi-Newton method implemented in the nlminb routine. The C++ code for \({\textrm{RVM}}_{\text {TMB}}\) is provided in the Supplementary material (Sect. A).

The initial size of the active set is set to \(B=20\) for all datasets and methods implemented using TMB. The upper bound of the hyperparameters \(\varvec{\alpha }, \varvec{\tau }\) and \(\varvec{\xi }\) is set to \(\nu =\exp (5)\) in the outer optimization of the marginal likelihood, and the threshold parameter for including new elements in the active set is set to \(\delta =0.5\). The results for the RVM method were obtained by using tenfold Cross Validation (CV) for the kernel parameter \(\gamma \) in Eq. (3) on the values \(\{ 0.001, 0.01, 0.1, 0.5,1 \}\). The RVM and \({\textrm{PFCVM}}_{\text {LP}}\) methods are run using MATLAB while the \({\textrm{RVM}}_\text {TMB}\), \({\textrm{PFCVM}}_\text {TMB}\) and \({\textrm{KRVM}}_\text {TMB}\) methods are run in R.

5.1 Prediction

The average prediction results are shown in Table 3. When comparing RVM and \({\textrm{RVM}}_{\text {TMB}}\) we see that the RVM method obtains higher prediction accuracy on the classification data and a slightly lower mean MSE for most of the regression data. On average, the RVM method therefore gives somewhat better overall predictions than the \({\textrm{RVM}}_\text {TMB}\) method on the tested datasets.

Further, the \({\textrm{PFCVM}}_{\text {TMB}}\) method gives average prediction results that are mostly within the standard error of the \({\textrm{PFCVM}}_{\text {LP}}\) method, but the \({\textrm{PFCVM}}_{\text {LP}}\) method has a much higher standard error than the other methods on several datasets. The \({\textrm{PFCVM}}_{\text {TMB}}\) method gives a slightly higher prediction accuracy on the test data in Table 3, except for the Musk and Sonar datasets. We cannot compare the results on the regression datasets, since the original \({\textrm{PFCVM}}_{\text {LP}}\) method is not designed to solve regression tasks. However, we notice that the \({\textrm{PFCVM}}_{\text {TMB}}\) method is on par with the RVM method.

The last column in Table 3 shows the results for the \({\textrm{KRVM}}_\text {TMB}\) method. This method produces the lowest average MSE for the Boston and Hitters datasets and the highest prediction accuracy for the Musk dataset. It also shows good performance for the remaining datasets and small variations in prediction between different splits of the datasets.

Table 3 Average prediction accuracy (classification) and MSE (regression) values across 100 random splits of the full datasets into training and test, for the datasets in Table 1

5.2 Sample size and dimension reduction

Since sparsity is an important property of the RVM-based methods, it is important to investigate to what extent this property is retained when the methods are implemented using TMB. The RVM and \({\textrm{RVM}}_{\text {TMB}}\) are only sparse with respect to the samples, while the rest of the methods should also be sparse with respect to the features.

Table 4 Average model size (number of relevance vectors times number of selected features) across 100 repetitions for the datasets in Table 1

From Table 4 we see that the RVM on average selects fewer relevance vectors than the \({\textrm{RVM}}_{\text {TMB}}\), but the RVM method has a greater variance in the number of relevance vectors. When comparing the \({\textrm{PFCVM}}_{\text {TMB}}\) with the \({\textrm{PFCVM}}_{\text {LP}}\) we see that the \({\textrm{PFCVM}}_{\text {TMB}}\) is sparser with respect to the samples (i.e. selects fewer relevance vectors) while the \({\textrm{PFCVM}}_{\text {LP}}\) is sparser with respect to the features. The \({\textrm{PFCVM}}_{\text {TMB}}\) is not as sparse as the RVM method, but interestingly sparser than the \({\textrm{RVM}}_{\text {TMB}}\) method. Finally, we observe that the \({\textrm{KRVM}}_\text {TMB}\) also performs feature selection, but not as aggressively as the \({\textrm{PFCVM}}_{\text {TMB}}\) and \({\textrm{PFCVM}}_{\text {LP}}\). As expected, the half-normal prior on the kernel parameter \(\Gamma \) enforces a model that is sparser with respect to features than the flat prior in the \({\textrm{KRVM}}_{\text {TMB}}\) method.

The corresponding standard errors are also listed in Table 4. In the cases where a method picks all features in all 100 repetitions the corresponding standard error is zero. This is obviously the case for the \({\textrm{RVM}}_{\text {TMB}}\) and RVM that do not perform feature selection, but it also happens for a few datasets for the \({\textrm{PFCVM}}_{\text {TMB}}\) and the \({\textrm{KRVM}}_\text {TMB}\) methods.

5.3 Computation time

In order to compare the run times of the different methods, we have used the same splits of the datasets as described in Sect. 5.1, and have timed the first 10 splits. Computations of the methods that are implemented in TMB are done in R version 4.0.3 on an HP Elitedesk 800 G3 computer running 64-bit Windows 10, utilizing only a single core for comparability of algorithms. The original RVM by Tipping (2001) and the \({\textrm{PFCVM}}_{\text {LP}}\) method by Jiang et al. (2019) are both implemented in MATLAB. These methods were run on the same computer using MATLAB R2019a, also using a single core for comparability of the computational time.

Table 5 The average running time for different methods used on the data in Table 1

From the running times on the classification datasets reported in Table 5 we observe that the RVM method is significantly slower than the other methods for all datasets. The longer running times for the RVM method are mainly due to the computational overhead of the CV done for the kernel parameter \(\gamma \). The other methods have comparable running times, except for the Musk and Sonar datasets where the \({\textrm{PFCVM}}_\text {LP}\) method outperforms the methods implemented in TMB. This is presumably due to the high number of features for these datasets. While the \({\textrm{PFCVM}}_\text {TMB}\) and \({\textrm{KRVM}}_\text {TMB}\) methods do perform feature selection, they do not take advantage of the feature sparsity to speed up calculations, as is done for the \(\varvec{\alpha }\) hyperparameters.

The active set algorithm was introduced in Sect. 3 because estimating the full set of \(w_i\)’s, along with the corresponding \(\alpha _i\)’s, simultaneously in TMB requires longer computational time. All results have so far used the active set algorithm, but in this section we test how much the running times actually improve. Table 6 shows the computation times and the prediction accuracy for the Liver dataset, where we have included the case where the full \(\varvec{w}\) and \(\varvec{\alpha }\) are estimated in TMB, denoted as \({\textrm{RVM}}_{\text {TMB}^*}\). The results are computed by using the first 10 splits of the dataset. Table 6 shows that not using the active set algorithm takes significantly longer than both the RVM and \({\textrm{RVM}}_\text {TMB}\) methods. However, \({\textrm{RVM}}_{\text {TMB}^*}\) does obtain a better prediction accuracy than \({\textrm{RVM}}_\text {TMB}\).

Table 6 Average computational time in seconds and average prediction accuracy for the “Liver” dataset across 10 random splits

5.4 Waveform dataset

In order to test the robustness of the \({\textrm{PFCVM}}_{\text {LP}}\), Jiang et al. (2019) include an example where the method is applied to the Waveform dataset (Dua and Graff 2017). This dataset consists of 5000 samples of three different wave classes. For each sample, there are 40 continuous features. The first 21 features are relevant for classifying the correct wave class, whereas the latter 19 are irrelevant noise with mean zero and variance 1. The inclusion of the 19 noise features increases the difficulty of the classification problem. The ideal algorithm should select the first 21 features, while pruning the remaining noise features. In this section we test whether the feature selection methods using TMB (\({\textrm{PFCVM}}_{\text {TMB}}\) and \({\textrm{KRVM}}_{\text {TMB}}\)) are able to select the relevant features and discard the noise features. We use the same setup as the example in Jiang et al. (2019), which selects the data from wave 1 and wave 2 to obtain a two-class classification problem. The data from wave 1 and wave 2 consist of 3345 samples in total. From this dataset, we generate 100 distinct randomly selected training and test sets. Each training set is of size 400 and includes 200 training samples for each wave class. The corresponding test set includes the remaining 2945 samples.
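The construction of one such balanced training set can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, and since the number of samples in each of the two wave classes is not given in the text, any labels passed to the function in a trial run are placeholders.

```python
import random


def stratified_split(labels, n_per_class, rng):
    """Draw n_per_class training indices from each class; the rest form the test set."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train = []
    for y in sorted(by_class):
        # Sampling without replacement within each class keeps the training set balanced.
        train.extend(rng.sample(by_class[y], n_per_class))
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test
```

With 3345 labelled samples and `n_per_class=200`, the function returns 400 training indices and 2945 test indices; repeating the call with different seeds gives the repeated random splits.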

Figure 1 shows the selection frequency for each feature based on the results from the 100 datasets. The figure shows that \({\textrm{KRVM}}_{\text {TMB}}\) performs better than \({\textrm{PFCVM}}_{\text {TMB}}\) on the first 21 relevant features. The selection frequency of \({\textrm{KRVM}}_{\text {TMB}}\) is greater than 0.5 for 13 of the 21 relevant features, whereas the \({\textrm{PFCVM}}_{\text {TMB}}\) selects 8 of the 21 relevant features more than half of the time. However, for the remaining 19 noise features, it is the \({\textrm{PFCVM}}_{\text {TMB}}\) that gives the best result. The selection frequencies of the noise features are all below 0.2 for the \({\textrm{PFCVM}}_{\text {TMB}}\), while for the \({\textrm{KRVM}}_{\text {TMB}}\) they are all above 0.2 but below 0.5. When comparing the result in Fig. 1 with the results presented in Figure 3 in Jiang et al. (2019) we notice that the \({\textrm{PFCVM}}_{\text {TMB}}\) produces results similar to those of the \({\textrm{PFCVM}}_{\text {LP}}\). The \({\textrm{KRVM}}_{\text {TMB}}\) is less restrictive and has an overall higher selection frequency, which makes it less robust to the noise features included in this dataset.
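The per-feature frequencies discussed above can be computed from the selected feature indices of each run, as in the following minimal sketch. The function names and the representation of each run as a list of selected feature indices are assumptions made for illustration.

```python
def selection_frequency(selected_per_run, n_features):
    """Fraction of runs in which each feature index was selected."""
    counts = [0] * n_features
    for selected in selected_per_run:
        for d in set(selected):  # count each feature at most once per run
            counts[d] += 1
    return [c / len(selected_per_run) for c in counts]


def frequently_selected(frequencies, threshold=0.5):
    """Indices of features whose selection frequency exceeds the threshold."""
    return [d for d, f in enumerate(frequencies) if f > threshold]
```

Applied to the 100 training sets, `frequently_selected` with a threshold of 0.5 would list the features selected more than half of the time.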

Fig. 1

The selection frequency of each feature index of the Waveform dataset. The first 21 features contain information about the true signal, while the remaining features are irrelevant noise

5.5 Credible and prediction intervals

The estimated regression curve \(\hat{f}(x)\) is obtained by inserting estimated weights and hyperparameters into (1). The empirical Bayes framework suggests a way of calculating the variance of \(\hat{f}(x)-f(x)\) using a version of the delta method which accounts for both weight and hyperparameter uncertainty (Zheng and Cadigan 2021). This approach is implemented in TMB, and from a user perspective the standard deviations of latent variables, hyperparameters, and derived parameters like f(x) are available via the function sdreport. In this subsection, we use this functionality to construct pointwise credible intervals for f(x). By adding \({\hat{\sigma }}^2\) to the variance of \(\hat{f}(x)-f(x)\) we similarly obtain pointwise prediction intervals for new observations. TMB optionally allows the user to ignore hyperparameter uncertainty, and we apply this to both types of intervals. It should be mentioned that the presence of the prior \(p(\varvec{\tau }|\varvec{\xi })\) in (8) complicates the interpretation of the standard deviations returned by sdreport, but in the example below the prior on \(\varvec{\tau }\) is absent, so the problem does not arise.
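Given a point estimate and its standard deviation, the two types of intervals can be sketched as below. This is a minimal illustration, assuming symmetric Gaussian intervals of the form estimate plus or minus a normal quantile times the standard deviation; the function names are hypothetical and the inputs stand in for values that would be read from the sdreport output.

```python
import math
from statistics import NormalDist


def credible_interval(f_hat, sd_f, level=0.95):
    """Pointwise interval for f(x) from the standard deviation of f_hat(x) - f(x)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return f_hat - z * sd_f, f_hat + z * sd_f


def prediction_interval(f_hat, sd_f, sigma_hat, level=0.95):
    """Pointwise interval for a new observation: sigma_hat^2 is added to the variance."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    sd_pred = math.sqrt(sd_f ** 2 + sigma_hat ** 2)
    return f_hat - z * sd_pred, f_hat + z * sd_pred


def coverage_rate(targets, intervals):
    """Fraction of observed targets that fall inside their interval."""
    inside = sum(1 for t, (lo, hi) in zip(targets, intervals) if lo <= t <= hi)
    return inside / len(targets)
```

Because the noise variance enters only the prediction interval, the prediction interval is never narrower than the credible interval at the same point.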

As our first illustration of the sdreport function we use the Auto dataset (Table 1). A single covariate, vehicle mass (called “weight” in the original dataset), is used to explain the target variable, miles per gallon (mpg). We focus on a single covariate in order to obtain a simple visual interpretation of the estimated regression curve with credible and prediction intervals. To investigate how the model behaves when extrapolating outside the training dataset, we use the 300 lightest vehicles as the training data, and the 92 heaviest vehicles as test data. The resulting \({\textrm{RVM}}_\text {TMB}\) regression curve, and associated uncertainty intervals, are shown in Fig. 2. We see that credible intervals are wider when hyperparameter uncertainty is taken into account (left versus right panel). Further, the prediction intervals are much wider than the credible intervals, showing that prediction uncertainty is dominated by the error variance \(\sigma ^2\). As expected, credible intervals get wider in the extrapolation region (right-hand side of each panel). We note that the active set algorithm has selected only four relevance vectors (shown in the figure), located in pairs at the very left and right boundary of the training dataset, respectively. It is seen from the right-hand panel of the figure that the uncertainty arising from weight estimation is at its lowest midway (\(x\approx -0.5\)) between the two pairs of relevance vectors. When moving into the extrapolation region, the effect of the relevance vectors vanishes, and both the point estimate (solid line) and the uncertainty become dominated by the intercept term \(w_0\).

Fig. 2

Fitted \({\textrm{RVM}}_\text {TMB}\) regression curve (black), pointwise 95 % credible intervals (blue region) and 95 % prediction intervals (light blue region) for the Auto dataset. In the left panel both the uncertainty in the weights and hyperparameters is accounted for, while the right panel includes only the weight uncertainty. The black dots are the training data, while the red dots are additional test data. The four circled black dots are the relevance vectors (RV). The coverage rates for the prediction intervals are \(94.1 \%\) and \(100 \% \) for the training and test data, respectively (left and right side)

In a second example we illustrate estimation uncertainty in a classification problem, again using \({\textrm{RVM}}_\text {TMB}\). For this purpose we use the Diabetes dataset (Table 1), and select two covariates: “Pressure” \((x_1)\) and “Insulin” \((x_2)\) as predictors for diabetes status. The dataset is split randomly into a training set (70 %) and a test set (30 %). With this setup, sdreport provides the standard deviation of \(\hat{f}(x_1,x_2)-f(x_1,x_2)\), which can be used to construct pointwise 95 % credible intervals for \(f(x_1,x_2)\). We denote the lower and upper limits of these intervals by \(\hat{f}_L(x_1,x_2)\) and \(\hat{f}_U(x_1,x_2)\), respectively. While the classification rule is the zero-contour of \(\hat{f}(x_1,x_2)\), we take the boundaries of the “uncertainty region” to be the zero-contours of \(\hat{f}_L(x_1,x_2)\) and \(\hat{f}_U(x_1,x_2)\), respectively. The results are illustrated in Fig. 3. It is seen that a substantial part of the test data (right-hand panel) falls within the uncertainty region, and might hence have been classified differently had a different training dataset been used.
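Membership of the uncertainty region can be checked pointwise, as in the following minimal sketch. It assumes symmetric Gaussian credible limits around the estimate and a coding of the two classes as 1 and 0; both the function name and this coding are illustrative assumptions.

```python
from statistics import NormalDist


def classify_with_uncertainty(f_hat, sd_f, level=0.95):
    """Return (label, uncertain) for one input point.

    The label follows the sign of f_hat. The point lies in the uncertainty
    region when zero is strictly between the lower and upper credible limits.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    f_lower = f_hat - z * sd_f
    f_upper = f_hat + z * sd_f
    label = 1 if f_hat > 0.0 else 0
    uncertain = f_lower < 0.0 < f_upper
    return label, uncertain
```

A point with a small estimate relative to its standard deviation, for example an estimate of 0.3 with standard deviation 0.5, is assigned a label but flagged as uncertain.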

Fig. 3

Diabetes status as a function of “Pressure” and “Insulin”, for training data (left) and test data (right), with highlighted relevance vectors (RV). The solid line represents the \({\textrm{RVM}}_\text {TMB}\) classification boundary, with background color showing the classification rule. The dashed lines represent boundaries of the uncertainty region

6 Discussion

When using TMB to fit the RVM-based methods, we obtain estimates of both the hyperparameters and the weights as outputs. The suggested procedure is a nested two-step optimization procedure, which is also referred to as the inner and outer optimization. The inner optimization evaluates the Laplace approximation of the marginal likelihood, where estimates of the latent variables (weights) are found. The outer optimization maximises the Laplace approximation of the marginal likelihood with respect to the hyperparameters. The bottleneck when implementing the RVM-type methods in TMB is that the outer optimization is slow when the model includes a large number of hyperparameters. This will be the case for the sparse RVM-based models where each weight is associated with a separate hyperparameter. The suggested solution to this problem, as described in Sect. 3, is to use an active set algorithm, such that only a small subset of the weights and corresponding hyperparameters are evaluated simultaneously. The active set algorithm resembles the structure of the fast RVM method that only estimates a subset of the weights at each iteration. The difference between the fast RVM and \({\textrm{RVM}}_\text {TMB}\) lies in the estimation procedure of the weights and hyperparameters, and in how weights are added to the active set. While the fast RVM model uses a specially tailored method, which works excellently for the RVM model, our approach is designed to work on general sparse learning methods.

We have shown that the active set algorithm reduces the computational time by orders of magnitude compared to using the full set of weights as input to TMB. Although the active set algorithm is not strictly required in order to fit sparse RVM-type methods using TMB, the presented framework is in practice too slow without it. The cost of the speedup is a somewhat lower prediction accuracy, as seen in Table 6. For the datasets we have tested, the \({\textrm{RVM}}_\text {TMB}\) method is in fact faster than the RVM method, but this is mainly because of the computational overhead of the CV of the kernel parameter in the fast RVM. While the active set method is efficient, it does make the implementation of the sparse models slightly more complex. Instead of only having to define the negative joint likelihood as a C++ template and a few lines of code in R, the active set method must be implemented as an outer layer (loop) in R. While the obtained speedup is significant, the computational time may still be an issue for large datasets. The active set algorithm could probably be made even faster if it were included inside the TMB procedure, but this requires modifications of the TMB package.

The active set algorithm is designed to improve the computational times when the number of hyperparameters associated with the weights is large; however, it does not take advantage of the possible sparsity of additional hyperparameters. As an example, the \({\textrm{PFCVM}}_{\text {LP}}\) model adds a hyperparameter for each feature in the model with the aim of performing feature selection. When the number of features in the dataset is large this results in a large number of hyperparameters that slows down the estimation procedure. This can be seen in the running times of Table 5, where the TMB methods have running times comparable to those of the \({\textrm{PFCVM}}_{\text {LP}}\) when the number of features is small, but significantly longer running times for the datasets with many features. The active set algorithm can presumably be extended to also include the kernel parameters in the \({\textrm{PFCVM}}_{\text {LP}}\); however, this would restrict the prior on the kernel parameters to be Gaussian in order for the Laplace approximation to be a good approximation. Changing the inner optimization of the negative log likelihood to a constrained Newton solver would allow us to use a half-normal distribution as a prior, as is done in the \({\textrm{PFCVM}}_{\text {LP}}\), but this is not supported by the default Newton solver in TMB.

The number of initial elements in the active set, B, and the value of the threshold parameter, \(\delta \), do affect the prediction accuracy and the computational time of the active set method. A small value of \(\delta \) will cause many hyperparameters to be included in the active set and updated in each iteration, thus, making the estimation slower. This is also the case if the value of B is high, so that many hyperparameters are included in the first iteration. A higher value of \(\delta \) will improve the computational time, but the prediction accuracy may suffer as it can cause significant weights to not be included in the model. Hence, for some datasets one may have to find a balance between the computational time and the prediction accuracy. However, we have found that the values \(B=20\), \(\delta =0.5\) give good results for most datasets.

It is interesting to note that in a regression setting, the Laplace approximation of the marginal likelihood is exact for the choice of priors in the \({\textrm{RVM}}_{\text {TMB}}\) method; thus, it reduces to the analytical expressions derived in the RVM method. One might therefore expect the two methods to give identical predictions; however, there are several reasons why this is not the case. As discussed above, the active set algorithm described in Algorithm 1 differs from how the fast RVM updates the weights. In addition, the \({\textrm{RVM}}_{\text {TMB}}\) estimates the kernel parameter \(\gamma \) by maximizing the marginal likelihood, while the RVM method uses CV.
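The exactness of the Laplace approximation for a Gaussian likelihood with a Gaussian prior can be checked numerically on a toy model with a single weight. The sketch below is an illustration only, not the \({\textrm{RVM}}_{\text {TMB}}\) implementation: it assumes one observation \(t = \phi w + \epsilon \) with \(\epsilon \sim \mathcal {N}(0, \sigma ^2)\) and prior \(w \sim \mathcal {N}(0, \alpha ^{-1})\), and all function names are hypothetical.

```python
import math


def neg_joint_log_lik(w, t, phi, sigma2, alpha):
    """Negative log of p(t | w) p(w) for the one-weight toy model."""
    nll_t = 0.5 * math.log(2.0 * math.pi * sigma2) + (t - phi * w) ** 2 / (2.0 * sigma2)
    nll_w = 0.5 * math.log(2.0 * math.pi / alpha) + 0.5 * alpha * w ** 2
    return nll_t + nll_w


def laplace_log_marginal(t, phi, sigma2, alpha):
    """Laplace approximation of log p(t): expand around the mode of the joint in w."""
    hessian = phi ** 2 / sigma2 + alpha      # second derivative of the negative joint log likelihood
    w_mode = (phi * t / sigma2) / hessian    # inner optimum, available in closed form here
    return (-neg_joint_log_lik(w_mode, t, phi, sigma2, alpha)
            + 0.5 * math.log(2.0 * math.pi) - 0.5 * math.log(hessian))


def exact_log_marginal(t, phi, sigma2, alpha):
    """Exact log p(t): t is Gaussian with mean 0 and variance sigma2 + phi^2 / alpha."""
    var = sigma2 + phi ** 2 / alpha
    return -0.5 * math.log(2.0 * math.pi * var) - t ** 2 / (2.0 * var)
```

Since the negative joint log likelihood is quadratic in the weight, the second-order expansion underlying the Laplace approximation has no remainder, and the two functions agree up to floating-point error for any choice of inputs.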

The presented framework and numerical results of this paper rely heavily on TMB; however, the methodology is not restricted to this package. TMB is used for two reasons. First, it allows for a convenient way to calculate the Laplace approximation of the marginal likelihood. This includes what is called the inner optimization of the negative log likelihood function to find the estimates of the weights. Second, we make use of the automatic differentiation of TMB to calculate the gradient of the marginal likelihood with respect to the hyperparameters, which is necessary in order to use gradient-based optimization routines in the outer optimization of the marginal likelihood. Thus, TMB can be replaced by any package (or separate implementation) that can apply the Laplace approximation and give the gradient of the joint likelihood.

In the RVM by Tipping (2001) the posterior distribution of the weights does not take into account the uncertainty of the hyperparameters. Nor is this uncertainty included in the predictive distribution. The variance of the predictive distribution in RVM consists of two variance components: the estimated noise from the data and the uncertainty in the prediction of the weights. To the best of our knowledge, no other methods include the uncertainty of the hyperparameters, which, as we show in Sect. 5.5, may be of importance.

7 Concluding remarks

This paper presents a new active set algorithm that can be used to implement sparse Bayesian methods efficiently by using TMB. The main advantage of implementing the methods by using TMB is that new methods can easily be tested in this framework without the need for tedious mathematical derivation of derivatives or tailored algorithms. The automatic differentiation provided by TMB allows us to maximize the marginal likelihood with respect to all hyperparameters, eliminating the need for e.g., cross validation of kernel parameters. A naive implementation of sparse Bayesian methods using TMB is inherently slow because of the large number of hyperparameters; however, we show that by using the active set algorithm, we can obtain results similar to those of the highly specialized analytical solutions of the same models, both in terms of prediction accuracy and running times.