Efficient cross-validation for kernelized least-squares regression with sparse basis expansions
Abstract
We propose an efficient algorithm for calculating hold-out and cross-validation (CV) estimates for sparse regularized least-squares predictors. Holding out H data points with our method requires O(min(H^{2}n,Hn^{2})) time, provided that a predictor with n basis vectors has already been trained. In addition to holding out training examples, some of the basis vectors used to train the sparse regularized least-squares predictor on the whole training set can also be removed from the basis vector set used in the hold-out computation. In our experiments, we demonstrate the speed improvements provided by our algorithm in practice, and we empirically show the benefits of removing some of the basis vectors during the CV rounds.
Keywords
Hold-out · Cross-validation · Regularized least-squares · Least-squares support vector machine · Kernel methods · Sparse basis expansions

1 Introduction
The training time of RLS, which scales as O(m^{3}) in the worst case although one can often get closer to quadratic complexity with, for example, conjugate gradient methods (Shewchuk 1994; Suykens et al. 2002), may be prohibitive if the number of training instances, m, is large. Moreover, a large number of nonzero coefficients in the expansion (2) causes slow prediction speed.^{2} Consequently, several approaches alleviating the computational burden have been developed during recent years. The so-called subset of regressors approach enforces sparsity on the expansion (2), meaning that only a subset of size n≪m of the coefficients a_{i} are allowed to be nonzero, while the whole training set is still used in the training process.^{3} By doing this, one can immediately decrease the training time down to O(mn^{2}) (Poggio and Girosi 1990; Rifkin et al. 2003). One of the simplest and fastest approaches for selecting the set of basis vectors, that is, the training examples whose coefficients in the expansion (2) are nonzero, is random selection. While many smarter methods for selecting the basis vectors have been developed (see e.g. Smola and Bartlett 2001; Vincent and Bengio 2002), random selection has been shown both empirically and theoretically to work as well as most of the more sophisticated methods, unless an extra amount of computational resources is sacrificed for the selection process (Rifkin et al. 2003; Kumar et al. 2009). In this paper, we mainly focus on the random selection of basis vectors, since it substantially simplifies our considerations of both the computational complexity of the training algorithms and the estimation of prediction performance with cross-validation. In the rest of the paper, we refer to the above-described approach of training RLS with randomly selected basis vectors simply as sparse RLS.
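For concreteness, the subset-of-regressors training step described above can be sketched in a few lines of NumPy. The function and variable names here are ours, and the linear system shown is the standard subset-of-regressors solution rather than a transcription of the paper's own equations:

```python
import numpy as np

def train_sparse_rls(K_Bm, K_BB, y, lam):
    """Subset-of-regressors RLS (a sketch; naming is ours, not the paper's).

    K_Bm : (n, m) kernel evaluations between the n basis vectors and all
           m training examples, K_Bm[i, j] = k(x_{b_i}, x_j).
    K_BB : (n, n) kernel matrix between the basis vectors.
    y    : (m,) output vector.
    lam  : regularization parameter lambda.

    Solves (K_Bm K_Bm^T + lam * K_BB) a = K_Bm y; forming the n x n
    system matrix dominates the cost at O(m n^2), matching the training
    complexity cited above.
    """
    G = K_Bm @ K_Bm.T + lam * K_BB
    return np.linalg.solve(G, K_Bm @ y)
```

Only the n coefficients in the returned vector are nonzero in the expansion (2), so prediction needs kernel evaluations against the basis vectors alone.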
Another approach for alleviating the computational burden is to make the selection of hyperparameters and the evaluation of performance more efficient. Here, hold-out techniques and particularly cross-validation (CV) are among the most commonly used methods, and thus fast N-fold CV methods have also been introduced for RLS and its variations (Pelckmans et al. 2006; Pahikkala et al. 2006a; An et al. 2007; De Brabanter et al. 2010; Airola et al. 2011). They generalize the classical leave-one-out CV (LOOCV) short-cut (Wahba 1990; Green and Silverman 1994) computable in O(m^{2}) time, and some variations also enable the selection of the RLS regularization parameter efficiently without a separate re-training for each parameter value (Pelckmans et al. 2005, 2006; Rifkin and Lippert 2007). As related work, we also mention fast CV approaches developed for support vector machine classifiers (Cauwenberghs and Poggio 2001) and regressors (Karasuyama et al. 2009). While the CV methods for RLS usually rely on computational short-cuts based on matrix algebra, the CV methods for SVMs require solving a series of small optimization problems, whose size and number are data dependent.
One can combine the aforementioned speed improvement approaches and address the question of performing hold-out efficiently with sparse RLS. In the case of RLS regression, the sparse algorithm can also be considered as the standard algorithm with a certain type of modified kernel function (Quiñonero-Candela and Rasmussen 2005). Consequently, a straightforward way to expedite the computations is to use the sparse algorithm for training and the most efficient hold-out algorithms for standard RLS (Pahikkala et al. 2006a; An et al. 2007) for hyperparameter selection and performance evaluation. For holding out \(|\mathcal{H}|\) instances, this results in a computational complexity of \(O(|\mathcal{H}|^{2}m)\). However, as mentioned above, the computational complexity of LOOCV with the previously known methods is O(m^{2}), which is more expensive than the training process of sparse RLS if m>n^{2}. This motivates us to improve the efficiency by developing a sparse RLS specific method for hyperparameter selection and performance evaluation.
Recently, Cawley and Talbot (2004) proposed such a LOOCV algorithm. Its computational complexity of only O(mn^{2}) makes it much more practical than the LOOCV algorithm of standard RLS used together with the above-mentioned modified kernel function, because it is only as expensive as the training process of sparse RLS. The hold-out algorithm we propose in this paper
- allows holding out several training examples simultaneously, enabling the use of CV methods other than LOOCV,
- enables the removal of basis vectors if they belong to the hold-out set, which is, for some learning tasks, a necessary property in order to avoid severely biased CV results, and
- is computationally more efficient: holding out \(|\mathcal{H}|\) training examples requires \(O(\min(|\mathcal{H}|^{2}n,|\mathcal{H}| n^{2}))\) time.
Since in N-fold CV the average size of the hold-out sets is \(|\mathcal{H}|=m/N\), the overall complexity of N-fold CV is \(O(\min(m|\mathcal{H}| n,mn^{2}))\). That is, the required time is at most the same as the time spent for learning a sparse RLS with the whole training set.
As a special case, we get the complexity of LOOCV: because there are m hold-out sets of size 1, our algorithm requires only O(mn) time. The respective complexity for the previously proposed LOOCV algorithm for sparse RLS is O(mn^{2}) (Cawley and Talbot 2004).
One may ask why a LOOCV algorithm with the above time complexity is worth developing if one must in any case spend O(mn^{2}) time initializing the sparse RLS predictor before computing LOOCV. We provide the two most apparent motivations for this.
First, it is well-known and straightforward to see that sparse RLS can be simultaneously trained to predict multiple outputs almost at the cost of learning to predict only one output. A typical example of this is the use of RLS for one-versus-all multi-class classification (Rifkin and Klautau 2004). For other types of multi-output learning settings, see Suykens and Vandewalle (1999b), Suykens et al. (2002), for example. The complexity of training a predictor for v outputs is O(mn(n+v)), that is, the number of outputs starts to dominate the training complexity only if there are more outputs than basis vectors. Now, we may want to compute LOOCV for each output separately. The computational cost of this is O(mnv), which is dominated by the training complexity.
As a second motivation, we consider selecting the value of the regularization parameter with CV. If our algorithm is used for LOOCV or N-fold CV with small hold-out sets, it can be combined with the simultaneous training of the sparse RLS with several values of the regularization parameter in a way that performs the parameter selection as efficiently as training only one instance of the sparse RLS (see e.g. Pahikkala et al. 2006a; Rifkin and Lippert 2007). For example, if the number of regularization parameter value candidates is c, the overall computational time of training a sparse RLS predictor with the optimal value found by LOOCV is O(mn^{2}+cmn)=O(mn(n+c)), where the mn^{2}-term is the computational cost of the initial training and cmn is the cost of running LOOCV c times. Now, if c<n, this reduces to O(mn^{2}), the complexity of training a sparse RLS predictor once. This speed improvement is also empirically demonstrated in our experiments.
Finally, we motivate the abilities to hold out several training examples simultaneously and to remove the effect of basis vectors belonging to the hold-out set as follows. A classical motivation is that while LOOCV is known to be an almost unbiased estimator of the learning performance, it is known to suffer from a larger variance than, say, N-fold CV (see e.g. Kohavi 1995). Another, more practically oriented, motivation is that there exist many learning problems where the assumption of the training set consisting of independently and identically distributed training examples does not hold, and this must be taken into account when designing the CV experiments in order to avoid severely biased CV results. Using large hold-out sets, in turn, raises the question of how to deal with basis vectors in the hold-out set, as it is not realistic to assume that predictions are made for data points that are at the same time among the basis vectors used in training. Indeed, in our experiments, we show that not removing the basis vectors belonging to the hold-out set may also cause a serious bias in the CV results. Thus, in these types of learning tasks, our approach for efficiently removing basis vectors becomes indispensable.
The rest of the paper is organized as follows: In Sect. 2, we recall the concepts of supervised learning and discuss the issues of performance evaluation with hold-out and CV estimates. In Sect. 3, we formalize the RLS method. Section 4 considers the sparse RLS. Section 5 presents our algorithm for calculating a hold-out performance and, by that, different types of CV performance estimates for the sparse RLS. Section 6 describes the empirical part of our study and it contains both the evaluation setting and results. Section 7 concludes the paper.
2 Preliminaries
We first recall the concept of supervised learning. A supervised learner is a machine that is taught with a set of training data points with preferred output variables to perform a specific task. By a task, we mean the prediction of an output variable for an unseen data point. Formally, let \(X=(x_{1},\ldots,x_{m})\in(\mathcal{X}^{m})^{\mathrm{T}}\) be a sequence of inputs and y∈ℝ^{m} a vector of outputs, where \(\mathcal{X}\), called the input space, is the set of possible inputs. Here, \((\mathcal{X}^{m})^{\mathrm{T}}\) denotes the set of row vectors containing m elements belonging to the set \(\mathcal{X}\), while ℝ^{m} denotes the set of real valued column vectors of size m. Further, let S=(X,y) be a training set of m training instances. Unless stated otherwise, we assume that S is independently and identically distributed (i.i.d.) and drawn from an unknown probability distribution \(\mathbb{D}\) over \(\mathcal{X}\times\mathbb{R}\). Notice that while we call S a training set, it is actually an ordered sequence of data points.
As discussed for example by Dietterich (1998) and Schiavo and Hand (2000), the expected performances of predictor (4) and learning algorithm (5) correspond to two different statistical questions of interest. The former corresponds to the question of how well we expect a certain trained predictor to perform with future data points. This is of interest in the majority of real-world problems. The latter measures the quality of the learning algorithm itself in solving the learning task under consideration. That is, we assume a certain learning algorithm and train it with a data set of a given size. This results in a predictor, and (5) indicates how well, on average, it generalizes to new data points. In addition, (5) can be useful in real-world problems having a moving target, such as the task of detecting junk mail, for example.
In practice, we can almost never directly access the probability distribution \(\mathbb{D}\) to calculate (4) or (5); instead, we are limited to estimating them. One such estimate is obtained from CV. Depending on the algorithm, it may be possible to access ℙ, but it may still be difficult to calculate the expectation over it for computational reasons. Therefore, in order to measure (5), one has to rely on estimates obtained from a finite set of labeled data drawn from \(\mathbb{D}\) and a finite set of random elements drawn from ℙ.
Let us first consider ways to compute estimates (5) with CV. Given a training set S of m examples, let \(\mathcal{I}=\{1,\ldots,m\}\) denote an index set in which the indices refer to the instances in the training set. If the algorithm is trained with a randomly chosen r∈ℙ and with the training set of all but one example indexed by \(\mathcal{I}\), and the performance is measured with the example which is not used in training, we get an estimate of (5) which is unbiased for training sets of size m−1. In addition to the unbiasedness, we would also prefer the estimate to have a small variance. This is achieved straightforwardly by averaging over several experiments of the above type. According to the central limit theorem, the average of such i.i.d. experiments approaches a normally distributed variable, whose variance decreases as the number of experiments increases.
However, in order for the experiments to be independent, we would have to sample a completely new training set, test instance, and r for each of them. This is not possible in real-world situations because of limited computational resources and scarce data. Therefore, we usually have to permute our labeled data between training and testing when creating the sequence of experiments. This has the drawback that, while we can ensure the unbiasedness of the estimates of (5), the experiments are not completely independent, which complicates the theory and mathematics concerning the behavior of such estimates.
As an example, let us consider N-fold CV, in which the training set is partitioned into N disjoint subsets of size H=m/N called CV folds. In N-fold CV, each example in the training set is used for testing exactly once, and thus the number of experiments in the sequence (6) is t=m. Moreover, \(i_{j}\in\mathcal{H}_{h}\) if and only if \(\mathcal{H}_{h}=\mathcal{H}_{j}\), and \(\mathcal{H}_{h}\cap \mathcal{H}_{j}=\emptyset\) if and only if \(i_{j}\notin\mathcal{H}_{h}\); that is, each CV fold is used as the set \(\mathcal{H}_{j}\) in H of the m experiments, and each example associated with the CV fold is used once as the test example in one of these H experiments. In the case of randomized algorithms, we often also require that r_{h}=r_{j} whenever \(\mathcal {H}_{h}=\mathcal{H}_{j}\) in order to save training time: the training then has to be repeated only N times, once for each CV fold, and the same predictor can be used for several experiments in the sequence (6). Note that the notation also covers other types of CV approaches such as, say, N-fold CV repeated several times with different fold partitions (see e.g. Kohavi 1995). In repeated N-fold CV, a single data point can be used as a test example several times together with different sets \(\mathcal{H}_{j}\), that is, for the jth and hth experiments, we may have i_{j}=i_{h} while \(\mathcal{H}_{j}\neq\mathcal{H}_{h}\).
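The index bookkeeping of the experiment sequence (6) under N-fold CV can be illustrated with a short sketch (our own helper; the names are hypothetical):

```python
import random

def nfold_experiments(m, n_folds, seed=0):
    """Enumerate the experiment sequence (6) for N-fold CV (a sketch).

    Returns a list of (test_index, holdout_fold) pairs: every example
    appears exactly once as the test index i_j, paired with the fold
    H_j that contains it, so the t = m experiments share only N
    distinct hold-out sets, as described in the text.
    """
    rng = random.Random(seed)
    idx = list(range(m))
    rng.shuffle(idx)
    # Partition the shuffled indices into N roughly equal folds.
    folds = [idx[f::n_folds] for f in range(n_folds)]
    return [(i, tuple(fold)) for fold in folds for i in fold]
```

Repeated N-fold CV would simply concatenate the output of several calls with different seeds, giving experiments where i_j = i_h while H_j ≠ H_h.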
Finally, let us briefly consider how to obtain estimates for the expected performance of the predictor f. This task is difficult without having a separate test set not used in the training phase (Dietterich 1998). We could use similar hold-out estimates as those used for estimating the expected performance of the learning algorithm \(\mathcal{A}\) but they would be biased: holding out a subset of training examples means that the obtained predictor is not the same as the one, whose performance we aim to measure. Nevertheless, estimates based on hold-out are useful tools also for this purpose, since the bias is not necessarily severe, especially if the amount of training data is large. Still, care must be taken when designing the experiments as we show in Sect. 6.
3 Regularization framework
Next, we consider the hypothesis space \(\mathcal{F}\). For this purpose we define so-called kernel functions. Let \(\mathcal{X}\) denote the input space, which can be any set, and \(\mathcal{P}\) denote the feature vector space. For any mapping \(\varPhi:\mathcal{X}\rightarrow \mathcal{P}\), the inner product k(x,x′)=〈Φ(x),Φ(x′)〉 of the mapped data points is called a kernel function. We define the symmetric kernel matrix K∈ℝ^{m×m}, where ℝ^{m×m} denotes the set of real matrices of type m×m, and the entries of the kernel matrix are given by K_{i,j}=k(x_{i},x_{j}). For simplicity, we assume that K is strictly positive definite. This can be ensured, for example, by performing a small diagonal shift.
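As an illustration, a kernel matrix with the small diagonal shift mentioned above might be computed as follows, here with the Gaussian kernel that is also used in the experiments of Sect. 6 (the function name and default shift value are ours):

```python
import numpy as np

def gaussian_kernel_matrix(X, gamma, shift=1e-7):
    """Kernel matrix K_ij = k(x_i, x_j) for the Gaussian RBF kernel,
    with a small diagonal shift to keep K strictly positive definite,
    as assumed in the text. A sketch using NumPy.
    """
    sq = np.sum(X**2, axis=1)
    # Pairwise squared Euclidean distances via the expansion
    # ||x - x'||^2 = ||x||^2 + ||x'||^2 - 2 <x, x'>.
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    K = np.exp(-gamma * d2)
    return K + shift * np.eye(X.shape[0])
```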
4 The sparse regularized least-squares
The computational complexity of training an RLS learner, O(m^{3}), may be prohibitive if the number of training instances is large. However, several authors have considered sparse versions of RLS in which only a part of the training instances, often called the basis vectors, have a nonzero coefficient in (2). This means that when the training is complete, the rest of the training instances are no longer needed for predicting the output variables of new data points. Another advantage of sparse RLS is that its training complexity is only O(mn^{2}), where n is the number of basis vectors. Further, as we will show below, there are efficient algorithms for CV and for the selection of the regularization parameter for sparse RLS that are analogous to the ones for standard RLS.
As discussed above, we focus in this paper mainly on the random selection of basis vectors. We assume a uniform sampling of the n basis vectors among the training set of size m. That is, training sparse RLS can be considered as randomized learning (3), where the random element r determines the set of basis vectors.
Before continuing, we introduce some notation. Let \(\mathcal {M}_{\varXi \times\varPsi}\) denote the set of matrices whose rows and columns are indexed by the index sets Ξ and Ψ, respectively. With any matrix \(\mathbf{M}\in\mathcal{M}_{\varXi\times\varPsi}\) and index set ϒ⊆Ξ, we use the subscript ϒ so that a matrix \(\mathbf{M}_{\varUpsilon}\in\mathcal{M}_{\varUpsilon\times \varPsi}\) contains only the rows of M that are indexed by ϒ. For \(\mathbf{M}\in\mathcal{M}_{\varXi\times\varPsi}\), we also use \(\mathbf{M}_{\varUpsilon\varOmega} \in\mathcal{M}_{\varUpsilon \times \varOmega}\) to denote a matrix that contains only the rows and the columns that are indexed by any index sets ϒ⊆Ξ and Ω⊆Ψ, respectively.
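In NumPy terms, the row and row-column submatrix notation above corresponds to plain integer-array indexing; a small illustration (the index sets are arbitrary):

```python
import numpy as np

M = np.arange(30.0).reshape(5, 6)   # rows indexed by Xi, columns by Psi
upsilon = [0, 2, 4]                  # row index set Upsilon, a subset of Xi
omega = [1, 3]                       # column index set Omega, a subset of Psi

M_upsilon = M[upsilon, :]                     # M_Upsilon: selected rows only
M_upsilon_omega = M[np.ix_(upsilon, omega)]   # M_{Upsilon Omega}: rows and columns
```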
Finally, let us calculate \(\mathbf{Q}^{\mathrm{T}}\mathbf{K}_{\mathcal{B}}\mathbf{y}\) (in O(mn) time) and store it in memory. After this, (12) can be computed for different values of the regularization parameter from \(\mathbf{a}=\mathbf{Q}\widetilde{\boldsymbol{\varLambda}}\mathbf{Q}^{\mathrm{T}}\mathbf{K}_{\mathcal{B}}\mathbf{y}\) with a complexity of O(n^{2}). This is because the multiplication of the shifted and inverted eigenvalues \(\widetilde{\boldsymbol{\varLambda}}\) with \(\mathbf{Q}^{\mathrm{T}}\mathbf{K}_{\mathcal{B}}\mathbf{y}\) can be performed in O(n) time and the multiplication of the resulting vector from the left by Q can be performed in O(n^{2}) time.
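This caching scheme can be sketched as follows. We assume the eigendecomposition QΛQᵀ of the n×n matrix appearing in (12), and that the shifted, inverted eigenvalues take the common form 1/(λᵢ+λ); the exact shift should be read off from (12) itself:

```python
import numpy as np

def coefficients_for_lambdas(Q, evals, z, lambdas):
    """Re-solve sparse RLS for many regularization values (a sketch).

    Q, evals : eigenvectors and eigenvalues of the n x n matrix whose
               eigendecomposition is used in (12).
    z        : the cached vector Q^T K_B y, computed once in O(mn).
    For each lam, a = Q diag(1/(evals + lam)) z costs only O(n^2):
    the elementwise division is O(n) and the product with Q is O(n^2).
    """
    return [Q @ (z / (evals + lam)) for lam in lambdas]
```

With G = QΛQᵀ, each returned vector equals (G + λI)⁻¹ b for b = K_B y, so the cached decomposition replaces one O(n³) solve per candidate λ.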
5 Fast computation of hold-out error
Proposition 1
Proof
The above proposition concerns the case in which \(\mathcal{E}\neq\emptyset\), that is, the basis vectors that belong to the hold-out set are removed from the basis vector set, and hence the predictions for the hold-out examples are performed with a predictor trained with the training examples indexed by \(\overline{\mathcal{H}}\) and with the basis vectors indexed by \(\mathcal{L}\). Next, we consider an approach for the case in which \(\mathcal{E}=\emptyset\). Note that this approach can also be used if we do not intend to remove the basis vectors indexed by \(\mathcal{E}\), that is, if the predictions for the hold-out examples are to be performed with a predictor trained with the training examples indexed by \(\overline{\mathcal{H}}\) and with the basis vectors indexed by \(\mathcal{B}\), even if \(\mathcal{E}\neq\emptyset\). This is accomplished by setting the index set \(\mathcal{E}\) to be empty. We formulate the approach as a corollary to the above proposition.
Corollary 2
Proposition 1 also holds if \(\mathcal{E}=\emptyset\).
Proof
If \(\mathcal{E}= \emptyset\), then \(\mathbf{J}=\mathbf{U}\widetilde{\boldsymbol{\varLambda}}\mathbf {U}^{\mathrm{T}}\), \(\mathbf{r}=\mathbf{U}\widetilde{\boldsymbol{\varLambda}}\mathbf{z}\), \(\mathbf{U}=((\mathbf{K}_{\mathcal{B}})^{\mathrm{T}}\mathbf {Q})_{\mathcal{H}}\), and \(\mathbf{z}=\mathbf{Q}^{\mathrm{T}}\mathbf{K}_{\mathcal{B}}\mathbf{y}-(\mathbf{Q}^{\mathrm{T}}\mathbf{K}_{\mathcal{B}})_{\mathcal{B}\mathcal{H}}\mathbf {y}_{\mathcal{H}}\). The corollary can be proved the same way as Proposition 1 except the use of the above matrices simplifies the computation of \(\mathbf{K}_{\mathcal{H}\mathcal{L}}\mathbf {W}\mathbf{K}_{\mathcal{L}\mathcal{H}}\) and \(\mathbf{K}_{\mathcal {H}\mathcal{L}}\mathbf{W}\mathbf{K}_{\mathcal{L}\overline{\mathcal{H}}}\mathbf {y}_{\overline{\mathcal{H}}}\). □
We further note that the calculation of matrices (20)–(25) requires O(mn^{2}) time, and hence the training of the sparse RLS as in Proposition 1 is computationally as efficient as the training in the ordinary way. Here it is, of course, again presupposed that the set of basis vectors used in the hold-out computation is a subset of the set of basis vectors used in computing the matrices.
We next consider different estimators for the expected performance of the learning algorithm (5) that can be constructed with our efficient hold-out algorithm. In order to minimize the variance of the performance estimator obtained by averaging over the sequence of experiments (6), we should average over as many experiments as we can afford with our computational resources, while also minimizing the covariance (8) between the experiments. Analogously to the deterministic learning algorithms, the covariance between two experiments usually increases if the overlap between the training sets increases. Similarly, the covariance is larger if the test instance is the same in the two experiments than if the test instance is different.
The randomized part of our learning algorithm is determined by the random selection of the set of the basis vectors. Intuitively, the covariance between two experiments increases if the overlap between the sets of basis vectors increases. Therefore, we should preferably have experiments, where sets of basis vectors overlap with each other as little as possible. However, this counters the efficiency requirement, because training with different sets of basis vectors requires a lot of computational resources. Nevertheless, via our efficient hold-out method, we can vary the set of basis vectors by holding out different subsets of the original basis vector set in different experiments. These effects are investigated more in detail in our experiments in Sect. 6.
Since in N-fold CV the training set is partitioned into N parts of approximately equal size, the number of training instances in each fold is \(|\mathcal{H}|\approx m/N\). Then, CV is performed by using each fold as a hold-out set at a time and calculating the corresponding hold-out label predictions. According to Proposition 1, the computational complexity of each CV round is \(O(\min(|\mathcal{H}|^{2}n,|\mathcal{H}|n^{2}))\), and hence we get the following corollary:
Corollary 3
The overall computational complexity of the N-fold CV is \(O(\min(m|\mathcal{H}| n, mn^{2}))\). Further, the computational complexity of LOOCV is O(mn) if combined with the training process of the sparse RLS predictor.
Thus, we observe that the computational complexity of N-fold CV is at most the training time of a sparse RLS predictor. The method is less complex for smaller hold-out sets (especially so for the extreme case of LOOCV) and it can be used, for example, to select the value of the regularization parameter λ efficiently from its candidate values. We formalize this in the following result:
Corollary 4
Let c be the number of candidate values for the regularization parameter λ. The computational complexity of calculating the N-fold CV output for all different candidate values is \(O(c\min(m|\mathcal{H}| n, mn^{2}))\). Further, the analogous computational complexity for LOOCV is O(cmn).
Thus, if \(c|\mathcal{H}|\leq n\), the selection of the regularization parameter via CV is not computationally more complex than the training process of sparse RLS.
The above considerations can be generalized for RLS that can be simultaneously trained to predict multiple outputs. This is achieved by using, instead of the label vector y, a label matrix having v columns, where v is the number of outputs. It is straightforward to see from (12) that the complexity of training a sparse RLS predictor for v outputs is O(mn(n+v)). Now, we may want to compute CV for each output separately. This gives us the following corollary:
Corollary 5
The computational complexity of calculating the N-fold CV output for all of the v outputs is \(O(v\min(m|\mathcal{H}| n, mn^{2}))\). Further, the analogous computational complexity for LOOCV is O(vmn).
Again, if \(v|\mathcal{H}|\leq n\), computing CV for v outputs is dominated by the training complexity.
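A sketch of the multi-output training discussed above, reusing the same n×n system across all v outputs (the naming is ours, and the system matrix is the standard subset-of-regressors one rather than a transcription of (12)):

```python
import numpy as np

def train_sparse_rls_multi(K_Bm, K_BB, Y, lam):
    """Sparse RLS with v outputs at once (a sketch).

    Y is (m, v): one column per output. The n x n system matrix is
    formed once in O(m n^2) and factorized once; each additional
    output only adds a right-hand side, giving O(mn(n + v)) overall,
    as in the setting of Corollary 5.
    """
    G = K_Bm @ K_Bm.T + lam * K_BB       # shared across all outputs
    return np.linalg.solve(G, K_Bm @ Y)  # (n, v) coefficient matrix
```

This is why a one-versus-all multi-class classifier with v classes costs little more than a single regressor as long as v is smaller than n.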
6 Experiments
In Sect. 6.1, we measure the speed of the proposed hold-out approach in selection of hyperparameters with LOOCV. In Sect. 6.2, we compare three unbiased estimators of the expected performance of the sparse RLS learning algorithm and confirm our hypothesis of obtaining a lower variance for the performance estimate by varying the set of basis vectors in each CV round. Section 6.3 deals with experiments using artificially generated data. In Sect. 6.3.1, we repeat the experiments of Sect. 6.2 with a larger number of basis vectors. The bias on CV results caused by greedy selection of basis vectors is measured in Sect. 6.3.2. In Sect. 6.4, we consider the issues related to measuring the expected performance of a sparse RLS predictor.
6.1 Speed comparisons
To test the computational speed of the proposed CV method in practice, we make an experimental speed comparison between it and the fastest previously proposed CV approach for sparse RLS, the O(mn^{2}) time LOOCV algorithm proposed by Cawley and Talbot (2004). We only test LOOCV without removing the basis vectors, because the baseline method is defined only for that setting. To implement the algorithms, we use NumPy, a computationally efficient scientific computing library for the Python programming language. As a test platform, we use a single core of an AMD Phenom II X6 1090T processor. The speed comparisons are done with an artificially generated data set of 5000 examples, and the number of basis vectors is varied over 500, 1000, 1500, 2000, and 2500.
The running times in CPU seconds of the baseline (Method 1) and the proposed algorithm (Method 2) with different numbers of basis vectors. The first two rows contain the time spent in combined training and LOOCV. The third row presents the LOOCV time of the proposed algorithm after an RLS learner has been trained. The last two rows contain the overall time spent in cross-validated selection of the regularization parameter λ for the two methods
| | n=500 | n=1000 | n=1500 | n=2000 | n=2500 |
|---|---|---|---|---|---|
| RLS+LOOCV Method 1 | 0.707 | 2.482 | 5.435 | 9.625 | 15.891 |
| RLS+LOOCV Method 2 | 1.287 | 5.119 | 11.813 | 22.558 | 38.589 |
| LOOCV Method 2 | 0.007 | 0.014 | 0.022 | 0.029 | 0.036 |
| MS Method 1 | 14.163 | 49.652 | 108.732 | 192.529 | 317.873 |
| MS Method 2 | 1.428 | 5.402 | 12.258 | 23.139 | 39.326 |
To conclude, the proposed approach to LOOCV requires some extra computational resources for filling the caches needed for computing LOOCV and performing hyperparameter selection. However, after the caches have been constructed once, the hyperparameter selection can become orders of magnitude faster than without the caches if the regularization parameter is searched over a large grid.
6.2 Variance in estimates of expected performance of the learning algorithm
Here, we consider using N-fold CV for measuring the expected performance of the sparse RLS learning algorithm. We select the basis vectors randomly via uniform sampling from the training set. This adds some extra variance to the performance estimates on top of the variability caused by the training set and unseen test examples. Taking the variability caused by the basis vector set into account can be accomplished, for example, by averaging over many hold-out experiments for which the set of basis vectors is randomly reselected. However, this requires the sparse RLS to be completely retrained for each CV round, which may be prohibitive in practice, especially if the number of CV rounds is large, as in the case of LOOCV. With our hold-out method, taking the variability into account can still be achieved to some extent by holding out a subset of the basis vectors in each CV round.
For example, let us assume that we aim to use ten-fold CV to estimate the prediction performance of the sparse RLS algorithm with n randomly selected basis vectors. We start by training a sparse RLS predictor with 11n/10 randomly selected basis vectors. Then, we hold out n/10 of the basis vectors and one tenth of the non-basis vectors in each round. Consequently, this CV estimate provides an approximation of the standard CV estimate in which one tenth of the basis vectors, rather than the whole set, is changed in each round. Compared to selecting the basis vectors randomly and separately for each CV round, this approach retains the computational efficiency, because there is no need to train RLS again with new basis vectors. Note also that since the original set of 11n/10 basis vectors is randomly selected from the training set, the set of n basis vectors not belonging to a hold-out set can be considered to be randomly selected from the training examples not in the hold-out set. Thus, each hold-out experiment provides an unbiased estimate of the prediction performance of the sparse RLS algorithm with n randomly selected basis vectors, and hence the average of the ten hold-out experiments is also unbiased.
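The fold construction of this scheme can be sketched as follows (our own bookkeeping; the names are hypothetical):

```python
import random

def folds_with_basis_holdout(m, n, n_folds=10, seed=0):
    """Fold partition for the scheme above (a sketch).

    A sparse RLS predictor is trained once with n + n//n_folds basis
    vectors; each fold then holds out n//n_folds of them plus roughly
    1/n_folds of the non-basis examples, so every CV round effectively
    uses n basis vectors, as described in the text.
    """
    rng = random.Random(seed)
    idx = list(range(m))
    rng.shuffle(idx)
    n_train_basis = n + n // n_folds
    basis, rest = idx[:n_train_basis], idx[n_train_basis:]
    # Spread both basis and non-basis indices evenly over the folds.
    folds = [basis[f::n_folds] + rest[f::n_folds] for f in range(n_folds)]
    return basis, folds
```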
We test the variability caused by the basis vector selection in four regression tasks: Ailerons, Elevators, Pole Telecomm and Pumadyn.^{4} Ailerons contains 7,154 instances and 40 features per instance. The respective numbers are 8,752 and 18 for Elevators; 5,000 and 48 for Pole Telecomm; and 4,499 and 32 for Pumadyn. In a pre-processing step, we scale the values to be regressed in the Ailerons and Elevators tasks by a factor of 1,000, because they are very small.
The kernel matrix K is formed using a Gaussian radial basis function kernel \(k(x,x')=\exp(-\gamma\|x-x'\|^{2})\), where γ∈ℝ is a positive constant determining the width of the kernel. We select a suitable value of γ in a preliminary experiment for each data set (Ailerons γ=2^{−18}, Elevators γ=2^{−19}, Pole Telecomm γ=2^{−12}, Pumadyn γ=2^{−21}) and use this value in the actual experiments. We ensure the positive definiteness of K by shifting it diagonally with 10^{−7}I, where the identity matrix I has the same order as K. In addition, we found the shifting to reduce numerical errors. The tested domain for the regularization parameter λ is {2^{−15},…,2^{4}} in all four tasks.
- 1.
Approach 1: We select thirty basis vectors randomly from the training set. Then, the training set is divided randomly into ten folds so that each fold contains three basis vectors. Finally, the sparse RLS regressor is trained and CV error is computed using the fold partition.
- 2.
Approach 2: We randomly select 27 basis vectors and divide the training set randomly into ten folds so that no basis vector is included in any fold.
- 3.
Approach 3: We divide the training set randomly into ten folds. The CV estimate is obtained by selecting 27 basis vectors randomly among the examples not in the hold-out set, separately for each CV round.
In all three approaches, the hold-out computations in the CV rounds correspond to the situation in which a sparse RLS regressor trained with 27 basis vectors is performing predictions for the hold-out instances. This also holds for Approach 1, where the predictor is originally trained with 30 basis vectors but three of them are held out during each CV round. The sizes of the hold-out sets are approximately m/10 in the first and third approaches and (m−27)/10 in the second. This difference is negligible when m is large, as it is in all our experiments.
The main difference between the three approaches is that three basis vectors are switched in each CV round in Approach 1, the basis vectors are the same in each CV round in Approach 2, and the sets of basis vectors may be completely different between the CV rounds in Approach 3. Thus, each of the three approaches is an unbiased estimator of the expected performance of the sparse RLS learning algorithm used with 27 randomly selected basis vectors. However, the variance of the estimates is likely to differ between the three approaches, because different types of dependencies are present between the CV rounds.
With the 100 repetitions, we estimate the variability of CV estimates caused by the selection of the basis vectors together with the fold partition in the training set. Note that these are not the whole variances of the three estimates. In order to measure the complete variability of the estimates, we should have new data sets drawn from the underlying distribution for each of the one hundred repetitions. Nevertheless, these experiments are sufficient for our purposes, since we are especially interested in the variance caused by the random selection of basis vectors.
We use the mean squared error (MSE) as a performance evaluation measure, and compute the variance estimate with the following formula: \(\frac{1}{r-1}\sum_{i=1}^{r}(\mathit{MSE}^{(i)}-\mu)^{2}\), where r is the number of repetitions, \(\mathit{MSE}^{(i)}\) is the MSE obtained from the ith repetition, and μ is the mean CV error estimated from the sample of repetitions.
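This is the standard unbiased sample variance (divisor r−1), which a short sketch makes explicit (our own code):

```python
import numpy as np

def cv_variance(mse_values):
    """Unbiased sample variance of repeated CV error estimates:
    (1 / (r - 1)) * sum_i (MSE_i - mean)^2."""
    mse = np.asarray(mse_values, dtype=float)
    r = mse.size
    mu = mse.mean()
    return np.sum((mse - mu) ** 2) / (r - 1)
```

This coincides with `np.var(mse_values, ddof=1)`.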
To conclude, in order to decrease the variance of the CV estimate of the expected performance of the sparse RLS algorithm, one can change the set of basis vectors between the CV rounds. However, changing the set completely in each round may be computationally too expensive, because it corresponds to completely retraining the predictor in each round. Nevertheless, the variance can be decreased to some extent by changing a small subset of the basis vectors per CV round, as is done in Approach 1, and this requires no more computational resources than training the predictor only once.
6.3 Experiments with simulated data
6.3.1 Experiments using a larger number of basis vectors
To study the behavior of the three approaches considered in Sect. 6.2 in a task where the selection of the regularization parameter plays a more important role than in the previous experiments, we perform a similar test on a regression task with artificially generated data. By generating a new data set from scratch for each repetition of the experiment, we also obtain a more comprehensive view of the variance caused by sampling the training data from the underlying distribution.
From the behavior of the mean, one can easily observe that the learning process overfits with small values of the regularization parameter, underfits with large values, and that the optimal values are in the middle, namely around 2^{5}. However, unlike in the experiments with the real-world data, Approach 3 does not seem to exhibit smaller variance than Approach 1 for all values of the regularization parameter. The large variance of Approach 2 can be explained by the fact that, since the basis vectors are never included in the hold-out sets and the ratio between the number of basis vectors and the size of the training set is about 1/10, the hold-out sets are about one tenth smaller than those in the other approaches. Otherwise, we conclude that the variation of the set of basis vectors does not play as important a role in the sinusoid experiment as it does in the four experiments with real-world data sets, and hence the effect depends strongly on the application. Thus, it may be possible to decrease the variance of the CV results somewhat by varying the set of basis vectors between CV rounds, but in most cases a fixed set of basis vectors works well enough.
6.3.2 Bias caused by basis vector selection outside CV loop
In many cases, it makes sense to use a clever heuristic for selecting the basis vectors rather than completely random selection. This can increase the prediction performance considerably, or the same prediction performance can be achieved with a much smaller number of basis vectors than with random selection. The downside is that such heuristics are often computationally more complex than random selection (Rifkin et al. 2003; Kumar et al. 2009).
From the results, we observe that, as expected, the performances of Approaches 1 and 2 with a small number of basis vectors are optimistically biased compared to that of Approach 3. With a large number of basis vectors, Approach 1 no longer seems to suffer from the bias, while Approach 2 still does. Thus, the removal of basis vectors during CV rounds seems to counter the bias at least in some cases. In these experiments, the bias does not seem particularly serious if the CV results are used, for example, for parameter selection. Nevertheless, as observed from the results, the size of the bias can depend heavily on the experimental setting, and we have no good means of predicting it in advance. To conclude, the proposed CV short-cuts should be used with caution together with basis vector selection, but the results may still be useful as approximate performance estimates if one needs to save the computational resources required for re-selecting the basis vectors in each CV round, especially when LOOCV is used.
6.4 Bias caused by basis vectors in hold-out set
In order to use CV for measuring the expected performance of a fixed sparse RLS predictor learned from a certain training set and a certain set of basis vectors, the predictor learned without the hold-out set should be as close as possible to the fixed predictor under consideration. In most practical cases, the smaller the hold-out sets are, the smaller the change in the predictor is. LOOCV is usually a good choice, because its hold-out sets are as small as possible. Moreover, the original set of basis vectors should be used during CV, since changing the set may change the predictor even more than holding out some training examples.
- (i)
We can remove the hold-out example from the basis vector set. This may change the predictor too much which, in turn, may increase the bias of the performance estimate.
- (ii)
If the training data set is large—as it is in cases where sparse algorithms are needed—it is reasonable to skip the training examples that are also basis vectors in LOOCV in order to avoid biased results. This may cause a slight increase in the variance of the performance estimate, since the whole data set is not used in CV but the increase is usually tolerable because of the small size of the basis vector set compared to the training set size.
- (iii)
The example can be kept in the basis vector set, while its effect is removed from the squared error. This may lead to a bias, because the data points for which the prediction is to be made do not usually belong to the set of basis vectors.
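As a concrete baseline, options (i) and (iii) above can be realized by naive retraining, which also makes explicit what the subset-of-regressors solution looks like. The sketch below is our own code with hypothetical names, used only to illustrate the options; the point of the paper is that the same hold-out quantity can be obtained without retraining.

```python
import numpy as np

def sparse_rls_fit(K_mn, K_nn, y, lam):
    """Subset-of-regressors RLS: minimize ||K_mn a - y||^2 + lam a' K_nn a,
    giving a = (K_nm K_mn + lam K_nn)^{-1} K_nm y."""
    A = K_mn.T @ K_mn + lam * K_nn
    return np.linalg.solve(A, K_mn.T @ y)

def hold_out_mse(K, y, basis, hold, lam, drop_basis_in_hold=True):
    """Naive retraining-based hold-out for sparse RLS.

    drop_basis_in_hold=True removes held-out examples from the basis
    vector set (option i); False keeps the basis set fixed while the
    squared loss is not evaluated on held-out points (option iii)."""
    train = np.setdiff1d(np.arange(len(y)), hold)
    b = np.setdiff1d(basis, hold) if drop_basis_in_hold else np.asarray(basis)
    a = sparse_rls_fit(K[np.ix_(train, b)], K[np.ix_(b, b)], y[train], lam)
    preds = K[np.ix_(hold, b)] @ a
    return np.mean((preds - y[hold]) ** 2)
```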
As an example of a task where the training set contains certain dependency structures, we consider the following regression problem. Given a sentence taken from a free text document, an automatic parser is employed to generate a set of alternative parses of the sentence. Some of the generated parses describe the syntactic structure of the sentence more correctly than others. To reflect this correctness, a regressor is used to predict a score for each parse. The regressor is learned from a training set, which is in turn constructed from a set of sentences and the parses extracted from them. Due to the feature representation of the parses, two parses originating from the same sentence almost always have a larger mutual similarity than two parses originating from different sentences. Hence, the data set consisting of the parses is heavily clustered according to the sentences the parses were generated from, and this clustered structure has a strong effect on the performance estimates obtained by CV: the data instances in the same cluster as a hold-out instance have a dominant effect on its predicted output. This does not, however, model real-world use, because a regressor is usually not trained with parses originating from the same sentence as the new parse whose score is to be predicted. The problem can be solved by performing CV on the sentence level, so that all the parses generated from a sentence are always either all in the training set or all in the test set.
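The sentence-level fold construction described above can be sketched as follows (our own code with a hypothetical function name): each sentence's parses are kept together and whole sentences are assigned to folds.

```python
import random
from collections import defaultdict

def sentence_level_folds(sentence_ids, n_folds=10, seed=0):
    """Assign whole sentences (groups of parses) to CV folds, so that
    the parses of one sentence never straddle the train/hold-out
    boundary.  sentence_ids[i] is the sentence of training example i."""
    groups = defaultdict(list)
    for idx, sid in enumerate(sentence_ids):
        groups[sid].append(idx)
    sids = list(groups)
    random.Random(seed).shuffle(sids)
    folds = [[] for _ in range(n_folds)]
    for k, sid in enumerate(sids):  # round-robin over shuffled sentences
        folds[k % n_folds].extend(groups[sid])
    return folds
```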
With sparse RLS, however, performing CV on the sentence level is still not the whole story. Consider the case in which the aim is to use CV to estimate the prediction performance of the regressor trained with the whole available data set. Moreover, consider a hold-out set consisting of all the parses associated with a single sentence, and suppose that one of the examples in the hold-out set belongs to the set of basis vectors. The question still left open is whether the hold-out examples should also be removed from the set of basis vectors. On one hand, removing an example from the set of basis vectors in the hold-out computation works against the aim of measuring the performance of the regressor trained with the whole data set, because removing a basis vector may change the regressor even more than removing ordinary training examples. On the other hand, keeping a basis vector that is very similar to the hold-out examples may bias the results, because it does not model the situation at the time the predictions are actually made.
To study these issues in practice, we use a data set having altogether 2,354 instances: The total number of sentences is 501 and approximately five parse candidates are generated for each sentence. The feature representation of the instances was generated using the method presented in Pahikkala et al. (2006b). This feature representation is sparse and contains tens of thousands of different features. We form the kernel matrix K by using a linear kernel and ensure its positive definiteness by a diagonal shift of 10^{−7}I. In addition, we have a separate test set of 600 sentences with approximately twenty parses per sentence.
- 1.
The examples in the hold-out set are completely removed from the training set and from the set of basis vectors in each CV round.
- 2.
The examples in the hold-out set are removed from the training set except the example that belongs to the set of basis vectors, that is, the basis vector is preserved and the square loss is evaluated on it in the training phase.
- 3.
The examples in the hold-out set are removed from the training set but the set of basis vectors is preserved, that is, the basis vector in the hold-out set remains a basis vector but the square loss is not evaluated on it in the training phase.
Table 2: The average absolute values of the kernel evaluations, coefficients, and their products during CV. The symbols x and v denote basis vectors inside and outside the hold-out set in a CV round, a_{x} and a_{v} denote their coefficients, and z denotes a data point in the hold-out set associated with x. The measurements are done for both unnormalized and normalized kernels and with low and high regularization
| | k(x,z) | k(v,z) | a_{x} | a_{v} | a_{x}k(x,z) | a_{v}k(v,z) |
|---|---|---|---|---|---|---|
| Unnormalized, λ=2^{0} | 250,628 | 72,231 | 1.738 | 0.915 | 701.0 | 105.7 |
| Normalized, λ=2^{−10} | 450.9 | 145.6 | 486.4 | 191.9 | 439.6 | 55.97 |
| Unnormalized, λ=2^{10} | 250,628 | 72,231 | 0.058 | 0.233 | 22.53 | 30.16 |
| Normalized, λ=2^{0} | 450.9 | 145.6 | 35.42 | 89.79 | 31.92 | 26.0 |
Second, we consider the absolute values of the learned dual coefficients of the basis vectors during CV. The values are measured using small values of the regularization parameter, namely λ=2^{0} and λ=2^{−10} for the unnormalized and normalized kernels, respectively, for which the bias is large (see Table 2). On average over all CV rounds, the absolute value of the coefficient of the basis vector belonging to the hold-out set is about two to three times larger than that of a basis vector not in the hold-out set. With low regularization, RLS is close to the ordinary least-squares method, which is invariant to the scaling of the kernel evaluations (Frank and Friedman 1993); hence the held-out basis vector gets a larger coefficient than the others, as it is dissimilar to all training examples not in the hold-out set, because all the similar data points are held out. This causes no problems at prediction time, because the data points for which predictions are to be made are not associated with the basis vectors. It does occur in the hold-out prediction, however, since all the hold-out data points are associated with exactly the basis vector that is most dissimilar to the data points used in training. Indeed, when we measure the products a_{x}k(x,z), where x is a basis vector, a_{x} is its coefficient in a CV round, and z is a data point in the hold-out set of that round, we observe that the absolute values of the products involving the basis vector in the hold-out set are, on average, about seven times larger than those involving the other basis vectors.
From the performance curves in Fig. 8, we see that the adverse effect disappears if the regularization is increased. To examine this in more detail, we made the same measurements with λ=2^{10} and λ=2^{0} for the unnormalized and normalized kernels, respectively (see Table 2). We observe that, with stronger regularization, RLS is no longer invariant to the scaling of the kernel values (Frank and Friedman 1993), and the absolute values of the coefficients of the basis vectors belonging to the hold-out set are, on average, smaller than those of the basis vectors not in the hold-out set. This, in turn, counters the effect of the kernel values in the products a_{x}k(x,z), whose average absolute values are in this case very similar for basis vectors inside and outside the hold-out set.
We conclude that having one of the hold-out examples in the set of basis vectors can also cause a serious bias in the regression results, because the circumstances in the hold-out computations do not correspond well enough to those at prediction time. Thus, the experiments clearly indicate that the ability to hold out basis vectors from training is necessary in the considered task.
7 Conclusion
In this paper, we presented a hold-out algorithm for sparse RLS that improves both result reliability and computational efficiency. Improvements in result reliability are achieved by the capability to hold out basis vectors; that is, some of the basis vectors used to train the sparse RLS predictor with the whole training set can be removed from the basis vector set used in the hold-out computation. In our experiments, we demonstrated that our algorithm is considerably faster in hyperparameter selection with leave-one-out cross-validation than the baseline approach without the proposed computational short-cuts. We empirically studied the effect of holding out basis vectors in each CV round on the variance of the CV estimate and found that it indeed lowers the variance in certain cases. We also gave empirical evidence of the necessity to hold out basis vectors in order to avoid seriously biased CV estimates. Further, we empirically measured the effect of greedily selecting the basis vectors outside the CV loop and, as expected, confirmed the risk of optimistic bias in the CV results.
To summarize the computational efficiency, holding out \(|\mathcal{H}|\) training examples with our algorithm requires \(O(\min(|\mathcal{H}|^{2}n,|\mathcal{H}|n^{2}))\) time, if a sparse RLS predictor has already been trained with a set of m training instances and n basis vectors. This, in turn, enables the efficient computation of cross-validation (CV) estimates for sparse RLS: the complexity of an N-fold CV estimate becomes \(O(\min(m|\mathcal{H}|n,mn^{2}))\), where m, n and N are the numbers of training instances, basis vectors and CV folds, respectively. In particular, in the case of LOOCV, the algorithm has complexity O(mn). Because sparse RLS can be trained in O(mn^{2}) time for several different values of the regularization parameter in parallel, the fast CV algorithm can be used to efficiently select the optimal parameter value.
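The kind of shortcut underlying these complexities can be illustrated on plain (non-sparse) RLS, for which the classical leave-one-out identity yields all m hold-out residuals from a single training run; the algorithm presented in this paper generalizes this idea to the subset-of-regressors setting with basis-vector hold-out. A minimal sketch of the classical identity (our own code, not the paper's algorithm):

```python
import numpy as np

def rls_loo_residuals(K, y, lam):
    """Classical LOOCV shortcut for (non-sparse) RLS: with hat matrix
    H = K (K + lam I)^{-1}, the exact leave-one-out residual of point i
    is (y_i - f(x_i)) / (1 - H_ii), so one training run suffices."""
    m = len(y)
    H = K @ np.linalg.inv(K + lam * np.eye(m))
    f = H @ y  # in-sample predictions
    return (y - f) / (1.0 - np.diag(H))
```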
Footnotes
- 1.
We note that in some formulations of the problem, such as the least-squares support vector machine (Suykens and Vandewalle 1999a), the hypotheses also contain a bias term that is not necessarily regularized. In this paper we omit the bias for simplicity and note that the effect of a regularized bias can be obtained by adding a positive constant to the kernel values.
- 2.
Some kernel-based learning algorithms, such as the support vector machines for classification and regression (Vapnik 1995), achieve sparse coefficient vectors due to the nature of the loss function they employ, but this is not the case with the squared loss we consider in this paper.
- 3.
See also (Williams and Seeger 2001) for related approaches based on the Nyström approximation of the kernel matrix. We also note that the optimal solution with only n nonzero coefficients does not necessarily have a representation as in formula (2), because the subset of regressors approach cannot straightforwardly resort to the representer theorem.
- 4.
Downloaded from http://www.liaad.up.pt/~ltorgo/Regression/DataSets.html [cited 2010 October].
Acknowledgements
This work has been supported by the Academy of Finland (grants 134020 and 136653). NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program. We express our gratitude to Dr. Wray Buntine at NICTA for his helpful comments.
References
- Airola, A., Pahikkala, T., & Salakoski, T. (2011). On learning and cross-validation with decomposed Nyström approximation of kernel matrix. Neural Processing Letters, 33(1), 17–30.
- An, S., Liu, W., & Venkatesh, S. (2007). Fast cross-validation algorithms for least squares support vector machine and kernel ridge regression. Pattern Recognition, 40(8), 2154–2162.
- Cauwenberghs, G., & Poggio, T. (2001). Incremental and decremental support vector machine learning. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems (Vol. 13, pp. 409–415). Cambridge: MIT Press.
- Cawley, G. C., & Talbot, N. L. C. (2004). Fast exact leave-one-out cross-validation of sparse least-squares support vector machines. Neural Networks, 17(10), 1467–1475.
- De Brabanter, K., De Brabanter, J., Suykens, J., & De Moor, B. (2010). Optimized fixed-size kernel models for large data sets. Computational Statistics & Data Analysis, 54(6), 1484–1504.
- Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895–1923.
- Elisseeff, A., Evgeniou, T., & Pontil, M. (2005). Stability of randomized learning algorithms. Journal of Machine Learning Research, 6, 55–79.
- Frank, I. E., & Friedman, J. H. (1993). A statistical view of some chemometrics regression tools. Technometrics, 35(2), 109–135.
- Golub, G. H., & Van Loan, C. (1989). Matrix computations (2nd ed.). Baltimore: Johns Hopkins University Press.
- Green, P., & Silverman, B. (1994). Nonparametric regression and generalized linear models: a roughness penalty approach. London: Chapman & Hall.
- Horn, R., & Johnson, C. (1985). Matrix analysis. Cambridge: Cambridge University Press.
- Karasuyama, M., Takeuchi, I., & Nakano, R. (2009). Efficient leave-m-out cross-validation of support vector regression by generalizing decremental algorithm. New Generation Computing, 27, 307–318.
- Kohavi, R. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In C. Mellish (Ed.), Proceedings of the fourteenth international joint conference on artificial intelligence (Vol. 2, pp. 1137–1143). San Mateo: Morgan Kaufmann.
- Kumar, S., Mohri, M., & Talwalkar, A. (2009). Sampling techniques for the Nyström method. In D. van Dyk & M. Welling (Eds.), JMLR workshop and conference proceedings: Vol. 5. Proceedings of the 12th international conference on artificial intelligence and statistics (pp. 304–311).
- Nadeau, C., & Bengio, Y. (2003). Inference for the generalization error. Machine Learning, 52(3), 239–281.
- Pahikkala, T., Boberg, J., & Salakoski, T. (2006a). Fast n-fold cross-validation for regularized least-squares. In T. Honkela, T. Raiko, J. Kortela, & H. Valpola (Eds.), Proceedings of the ninth Scandinavian conference on artificial intelligence (SCAI 2006) (pp. 83–90). Espoo: Helsinki University of Technology.
- Pahikkala, T., Tsivtsivadze, E., Boberg, J., & Salakoski, T. (2006b). Graph kernels versus graph representations: a case study in parse ranking. In T. Gärtner, G. C. Garriga, & T. Meinl (Eds.), Proceedings of the ECML/PKDD’06 workshop on mining and learning with graphs (pp. 181–188).
- Pahikkala, T., Pyysalo, S., Boberg, J., Järvinen, J., & Salakoski, T. (2009a). Matrix representations, linear transformations, and kernels for disambiguation in natural language. Machine Learning, 74(2), 133–158.
- Pahikkala, T., Suominen, H., Boberg, J., & Salakoski, T. (2009b). Efficient hold-out for subset of regressors. In M. Kolehmainen, P. Toivanen, & B. Beliczynski (Eds.), Proceedings of the 9th international conference on adaptive and natural computing algorithms (pp. 350–359). Berlin: Springer.
- Pahikkala, T., Tsivtsivadze, E., Airola, A., Järvinen, J., & Boberg, J. (2009c). An efficient algorithm for learning to rank from preference graphs. Machine Learning, 75(1), 129–165.
- Pelckmans, K., De Brabanter, J., Suykens, J., & De Moor, B. (2005). The differogram: non-parametric noise variance estimation and its use for model selection. Neurocomputing, 69(1–3), 100–122.
- Pelckmans, K., Suykens, J., & De Moor, B. (2006). Additive regularization trade-off: fusion of training and validation levels in kernel methods. Machine Learning, 62, 217–252.
- Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9), 1481–1497.
- Poggio, T., & Smale, S. (2003). The mathematics of learning: dealing with data. Notices of the American Mathematical Society, 50(5), 537–544.
- Quiñonero-Candela, J., & Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6, 1939–1959.
- Rasmussen, C. E., & Williams, C. K. I. (2005). Gaussian processes for machine learning (adaptive computation and machine learning). Cambridge: MIT Press.
- Rifkin, R., & Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5, 101–141.
- Rifkin, R., & Lippert, R. (2007). Notes on regularized least squares (Technical Report MIT-CSAIL-TR-2007-025). Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.
- Rifkin, R., Yeo, G., & Poggio, T. (2003). Regularized least-squares classification. In J. Suykens, G. Horvath, S. Basu, C. Micchelli, & J. Vandewalle (Eds.), NATO science series III: Computer and system sciences: Vol. 190. Advances in learning theory: methods, model and applications (Chap. 7, pp. 131–154). Amsterdam: IOS Press.
- Saunders, C., Gammerman, A., & Vovk, V. (1998). Ridge regression learning algorithm in dual variables. In J. W. Shavlik (Ed.), Proceedings of the fifteenth international conference on machine learning (pp. 515–521). San Mateo: Morgan Kaufmann.
- Schiavo, R. A., & Hand, D. J. (2000). Ten more years of error rate research. International Statistical Review, 68(3), 295–310.
- Schölkopf, B., Herbrich, R., & Smola, A. (2001). A generalized representer theorem. In D. Helmbold & R. Williamson (Eds.), Proceedings of the 14th annual conference on computational learning theory (pp. 416–426). Berlin: Springer.
- Shewchuk, J. R. (1994). An introduction to the conjugate gradient method without the agonizing pain (Technical report). Carnegie Mellon University, Pittsburgh, PA, USA.
- Smola, A., & Bartlett, P. (2001). Sparse greedy Gaussian process regression. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems (Vol. 13, pp. 619–625). Cambridge: MIT Press.
- Suykens, J., & Vandewalle, J. (1999a). Least squares support vector machine classifiers. Neural Processing Letters, 9(3), 293–300.
- Suykens, J., & Vandewalle, J. (1999b). Multiclass least squares support vector machines. In International joint conference on neural networks (IJCNN’99) (Vol. 2, pp. 900–903). New York: IEEE.
- Suykens, J., Van Gestel, T., De Brabanter, J., De Moor, B., & Vandewalle, J. (2002). Least squares support vector machines. Singapore: World Scientific.
- Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.
- Vincent, P., & Bengio, Y. (2002). Kernel matching pursuit. Machine Learning, 48, 165–187.
- Wahba, G. (1990). Spline models for observational data. CBMS-NSF regional conference series in applied mathematics: Vol. 59. Philadelphia: SIAM.
- Williams, C. K. I., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems (Vol. 13, pp. 682–688). Cambridge: MIT Press.