# Kernel Matrix Completion for Learning Nearly Consensus Support Vector Machines

## Abstract

When feature measurements are stored in a distributed fashion, such as in sensor networks, learning a support vector machine (SVM) with a full kernel built by accessing all features can be costly due to the required communication. If we build an individual SVM for each subset of features stored locally, the SVMs may behave quite differently, being unable to capture global trends. However, it is possible to make the individual SVMs behave nearly the same, using a simple yet effective idea we propose in this paper. Our approach makes use of two kernel matrices in each node of a network: a local kernel matrix built with only locally stored features, and an estimate of remote information (about “local” kernels stored in the other nodes). Using matrix completion, the remote information is fully recovered with high probability from a small set of sampled entries. Due to symmetric construction, each node is equipped with nearly identical kernel matrices, and therefore SVMs trained individually on these matrices are expected to reach good consensus. Experiments showed that such SVMs, trained with relatively small numbers of sampled remote kernel entries, achieve prediction performance competitive with full models.

## Keywords

Support vector machines · Kernel methods · Matrix completion · Multiple kernel learning · Distributed features

## 1 Introduction

Training support vector machines (SVMs) [2] in distributed environments has been an interesting topic in machine learning research. Considering distributed storage of data, the topic can be divided into two broad categories depending on whether examples or features are distributed. When examples are distributed (in this case, features are usually not assumed to be distributed), an SVM can be trained using a distributed optimization algorithm [3] at the cost of potentially substantial communication, or individual SVMs can be trained locally with extra constraints that reduce their disparity to the other SVM models in a network [7], requiring less information exchange. Alternatively, local SVMs can be trained completely independently on data partitions and then combined to produce a more stable and accurate model than the individual ones [6, 10].

On the other hand, when features are distributed (but examples are not) and they are not agglomerated for analysis, learning machinery has to deal with a set of partitioned feature spaces, figuring out how to accurately predict global trends over all feature spaces. Although this scenario has potential use for emerging applications such as sensor networks [14], it has not been studied much in machine learning research because of its inherent difficulties. This paper focuses on this scenario and proposes a simple but effective method for training SVMs on distributed features, with a major difference to the existing methods [12, 20]: no central coordination is involved in learning. Part of this paper has been published in a conference [11]; this paper extends the previous one with updated results on projection error and matrix completion, and new discussions on classification error bounds of using approximate kernels.

In the following, we first introduce two decompositions of kernel matrices that enable us to approximate an original full-feature kernel matrix with separated kernels corresponding to local and remote features. Then we discuss the idea of matrix completion, a new observation about support vectors, and generalization error bounds of our method. Our description focuses on SVM classifiers; however, it can be generalized to other kernel-based methods. We denote the Euclidean norm of vectors by \(\Vert \cdot \Vert \) and the cardinality of a finite set *A* by |*A*| throughout the paper.

## 2 Decomposition of Kernel Matrices

Consider local feature storages represented as nodes \(n=1,2,\dots ,N\) in a network, where each node stores its features in a vector \(\mathbf{x}_i[n]\) of length \(p_n\). Here *i* is an index for examples, \(i=1,2,\dots ,m\), and we assume that all nodes observe the same examples (but through different sets of features) and their labels \(y_1,\dots ,y_m\). For simplicity, we allow for communication between any pair of nodes. Then the collection of all features can be written as a single vector \(\mathbf{x}_i = (\mathbf{x}_i[1]^T, \mathbf{x}_i[2]^T, \dots , \mathbf{x}_i[N]^T)^T\) of length \(p:=\sum _{n=1}^N p_n\) (this vector is never created in our method).

### 2.1 Support Vector Machines

An SVM classifier is obtained by solving the dual problem

\(\min _{\alpha \in \mathfrak {R}^m} \; \tfrac{1}{2} \alpha ^T \mathbf{Q}\alpha - \mathbf{1}^T \alpha \quad \text {s.t.} \quad \mathbf{y}^T \alpha = 0, \;\; 0 \le \alpha _i \le C, \; i=1,\dots ,m, \qquad (1)\)

where \(\alpha \) is a vector of dual variables of length *m*, \(\mathbf{1}:=(1,1,\dots ,1)^T\), \(\mathbf{y}:=(y_1,y_2,\dots ,y_m)^T\), and \(C>0\) is a given constant. The \(m\times m\) matrix \(\mathbf{Q}\) is a scaled kernel matrix, that is, \(\mathbf{Q}: = \mathbf{Y}\mathbf{K}\mathbf{Y}\) for a positive semidefinite kernel matrix \(\mathbf{K}\), where \(\mathbf{Y}:=\text {diag}(\mathbf{y})\) is the diagonal matrix with labels from \(\mathbf{y}\). A typical SVM is built with \(\mathbf{K}_{ij} = \langle \phi (\mathbf{x}_i), \phi (\mathbf{x}_j) \rangle \), where \(\mathbf{x}_i\) and \(\mathbf{x}_j\) are “full” feature vectors and \(\phi : \mathfrak {R}^p \rightarrow \mathcal {H}\) is a map from the space of feature vectors to a Hilbert space [18].

### 2.2 Schur and MKL Kernels

Our goal is to build an SVM for each local feature storage node in a network, as if we have accessed all features but without explicitly doing so. This becomes possible by the observations we introduce here, about how to decompose the original kernel matrices into parts corresponding to local and remote features.

Consider the Gaussian kernel built with all features, \(\mathbf{K}_{ij} = \exp (-\gamma \Vert \mathbf{x}_i - \mathbf{x}_j \Vert ^2)\) for a parameter \(\gamma >0\). Since \(\Vert \mathbf{x}_i - \mathbf{x}_j \Vert ^2 = \sum _{n=1}^N \Vert \mathbf{x}_i[n] - \mathbf{x}_j[n] \Vert ^2\), this kernel factorizes elementwise into

\(\mathbf{K}= \mathbf{S}_n \circ \mathbf{S}_{-n}, \quad [\mathbf{S}_n]_{ij} := \exp (-\gamma \Vert \mathbf{x}_i[n] - \mathbf{x}_j[n] \Vert ^2), \quad \mathbf{S}_{-n} := \mathbf{S}_1 \circ \cdots \circ \mathbf{S}_{n-1} \circ \mathbf{S}_{n+1} \circ \cdots \circ \mathbf{S}_N, \qquad (2)\)

that is, into a local kernel matrix \(\mathbf{S}_n\) built with only the features in the node *n*, and another matrix \(\mathbf{S}_{-n}\) that captures information about features not in the node *n*. Since this kernel matrix is written as an elementwise product of matrices, known as the Schur (or Hadamard) product, we rename this kernel the Schur kernel to distinguish it from local Gaussian kernels such as \(\mathbf{S}_n\).

Alternatively, we can use a kernel in the style of multiple kernel learning (MKL) [9], the average of local Gaussian kernels,

\(\mathbf{M}= \frac{1}{N} \sum _{n=1}^N \mathbf{M}_n, \quad [\mathbf{M}_n]_{ij} := \exp (-\gamma _n \Vert \mathbf{x}_i[n] - \mathbf{x}_j[n] \Vert ^2), \qquad (3)\)

which involves *N* positive kernel parameters \(\gamma _1, \gamma _2,\dots ,\gamma _N\), rather than a single parameter \(\gamma >0\). Despite this potential inconvenience, MKL kernels have the advantage that we can reduce the size of kernel matrices and therefore speed up the process of matrix completion, as discussed later in Sect. 3.3.
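As a sanity check on the two constructions: the Schur factorization is an exact identity for Gaussian kernels, while the MKL kernel is a separate construction with its own parameters. The following pure-Python sketch (all feature values and parameters are illustrative) verifies this for a single kernel entry:

```python
import math

def gauss(u, v, gamma):
    """Gaussian kernel value exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

# Two full feature vectors, split across N = 2 nodes (first 2 and last 2 features).
xi, xj = [1.0, 0.5, 2.0, 1.5], [0.5, 1.0, 1.0, 2.0]
blocks_i = [xi[:2], xi[2:]]
blocks_j = [xj[:2], xj[2:]]
gamma = 0.1

# Schur decomposition: the full Gaussian kernel equals the elementwise
# (Schur) product of per-node Gaussian kernels with the same gamma.
full = gauss(xi, xj, gamma)
schur = 1.0
for bi, bj in zip(blocks_i, blocks_j):
    schur *= gauss(bi, bj, gamma)

# MKL-style kernel: the average of per-node Gaussian kernels, each with
# its own parameter gamma_n (an approximation of the full kernel, not an identity).
gammas = [0.2, 0.2]  # illustrative gamma_n values
mkl = sum(gauss(bi, bj, g) for bi, bj, g in zip(blocks_i, blocks_j, gammas)) / 2
```

The product structure is what lets each node hold one factor locally and estimate the other.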

## 3 Recovery of Unseen Kernel Elements

In our model, each node *n* creates a matrix \(\mathbf{S}_n\) or \(\mathbf{M}_n\), depending on the type of kernel it uses (Schur or MKL), from only locally stored features. Also, each node *n* creates an empty matrix for \(\mathbf{S}_{-n}\) or \(\mathbf{M}_{-n}\), in which each element is just a product (Schur) or a sum (MKL) of the entries of the “local” kernel matrices stored in the other nodes, and only a few of these entries are observed by the node *n* via uniform random sampling.

### 3.1 Sampling

The key elements of our sampling strategy are that (i) the sample size should be minimized to reduce communication cost, and at the same time (ii) the sample size should be large enough to guarantee an accurate recovery of unseen entries. As we see later, the theory of matrix completion makes it possible to achieve both goals, surprisingly enough, with simple uniform random sampling.

For each node *n*, we denote the index pair set of observed entries of remote kernel matrices (“local” in the other nodes) by \(\varOmega _n \subset \{1,\dots ,m\} \times \{1,\dots ,m\}\), where the pairs (*i*, *j*) are chosen uniformly at random^{1}. Then, all the other nodes \(n'\) transfer the entries of their local kernel matrices corresponding to \(\varOmega _n\). This requires that the sample index pair set \(\varOmega _n\) be known to all nodes^{2}, which can be done once by exchanging such information when a peer recognizes other peers in a network. Alternatively, we can fix all sets \(\varOmega _n\), \(n=1,2,\dots ,N\), to be the same, so that no exchange is necessary if the set is pre-determined.
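A sketch of such sampling, with the symmetry requirement of footnote 1 enforced explicitly (the function name and sizes are illustrative):

```python
import random

def sample_omega(m, num_pairs, seed=0):
    """Sample index pairs uniformly at random from an m x m grid and
    symmetrize the set, so (i, j) in Omega implies (j, i) in Omega."""
    rng = random.Random(seed)
    omega = set()
    while len(omega) < num_pairs:
        i, j = rng.randrange(m), rng.randrange(m)
        omega.add((i, j))
        omega.add((j, i))  # keep the sampled set symmetric
    return omega

omega = sample_omega(m=100, num_pairs=200)
```

Since only index pairs (not kernel values) are exchanged to agree on \(\varOmega _n\), this step is cheap; fixing the seed across nodes would even make the exchange unnecessary, as noted above.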

Given \(\varOmega _n\), the communication cost of this type of transfer is \({\mathcal O}((N-1)|\varOmega _n|)\) if a node *n* is connected to all the other nodes. As discussed later, matrix completion requires only a relatively small \(|\varOmega _n|\), namely \({\mathcal O}(m \log ^6 m)\), to guarantee a perfect recovery with high probability.

Once a node *n* receives the information, it stores the entries in the corresponding storage, for each \((i,j) \in \varOmega _n\) (left: Schur, right: MKL):

\([\mathbf{S}_{-n}]_{ij} = \prod _{n' \ne n} [\mathbf{S}_{n'}]_{ij}, \qquad [\mathbf{M}_{-n}]_{ij} = \sum _{n' \ne n} [\mathbf{M}_{n'}]_{ij}.\)
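For illustration, a sketch of this combination step; the `received` values and node indices below are hypothetical:

```python
# Entries received from the other nodes for sampled positions Omega_n:
# received[n'] maps (i, j) -> the (i, j) entry of node n's remote peer n'.
received = {
    2: {(0, 1): 0.9, (1, 0): 0.9, (3, 4): 0.5, (4, 3): 0.5},
    3: {(0, 1): 0.8, (1, 0): 0.8, (3, 4): 0.7, (4, 3): 0.7},
}

def assemble_remote(received, kind):
    """Combine the other nodes' local kernel entries elementwise:
    a product for the Schur kernel, a sum for the MKL kernel."""
    omega = next(iter(received.values())).keys()
    combined = {}
    for ij in omega:
        vals = [entries[ij] for entries in received.values()]
        if kind == "schur":
            out = 1.0
            for v in vals:
                out *= v
        else:  # "mkl"
            out = sum(vals)
        combined[ij] = out
    return combined

s_remote = assemble_remote(received, "schur")  # product of the peers' entries
m_remote = assemble_remote(received, "mkl")    # sum of the peers' entries
```

Symmetric sampling keeps the stored entries symmetric as well, which matters for the projection step discussed below.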

### 3.2 Low-Rank Matrix Completion with Kernel Constraints

To recover the full matrix \(\mathbf{S}_{-n}\) or \(\mathbf{M}_{-n}\) from the few observed entries indexed by \(\varOmega _n\), we use low-rank matrix completion [17]. Some modifications are required, however, to deal with the constraint that \(\mathbf{S}_{-n}\) or \(\mathbf{M}_{-n}\) should be a valid kernel matrix. In particular, \(\mathbf{S}_{-n}\) and \(\frac{1}{N-1}\mathbf{M}_{-n}\) must be symmetric and positive semidefinite matrices to satisfy Mercer’s theorem [18].

Writing \(\mathbf{D}\) for the matrix to be recovered, a standard formulation of matrix completion solves

\(\min _{\mathbf{X}\in \mathfrak {R}^{m\times m}} \Vert \mathbf{X}\Vert _* \quad \text {s.t.} \quad \mathbf{X}_{ij} = \mathbf{D}_{ij}, \; (i,j) \in \varOmega _n,\)

where \(\Vert \mathbf{X}\Vert _*\) is the *nuclear norm* of \(\mathbf{X}\), which is the summation of the singular values \(\sigma _k(\mathbf{X})\) of \(\mathbf{X}\) and penalizes the rank of \(\mathbf{X}\) in effect. When the rank of \(\mathbf{X}\) is *r*, the nuclear norm simplifies to the expression [16, 17]

\(\Vert \mathbf{X}\Vert _* = \min _{\mathbf{L}, \mathbf{R}\,:\, \mathbf{X}= \mathbf{L}\mathbf{R}^T} \tfrac{1}{2} \left( \Vert \mathbf{L}\Vert _F^2 + \Vert \mathbf{R}\Vert _F^2 \right) ,\)

where \(\Vert \mathbf{A}\Vert _F\) denotes the Frobenius norm of a matrix \(\mathbf{A}\). Using this property, the above optimization can be rewritten for rank-*r* matrix completion,

\(\min _{\mathbf{L}, \mathbf{R}\in \mathfrak {R}^{m\times r}} \sum _{(i,j) \in \varOmega _n} \left( \mathbf{D}_{ij} - [\mathbf{L}\mathbf{R}^T]_{ij} \right) ^2 + \frac{\bar{\mu }}{2} \left( \Vert \mathbf{L}\Vert _F^2 + \Vert \mathbf{R}\Vert _F^2 \right) , \qquad (4)\)

where \(\bar{\mu }>0\) is a regularization parameter and the first term measures the misfit to the observed value at each sampled pair (*i*, *j*) in \(\varOmega _n\).
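Our experiments use the jellyfish solver [17] for this problem; purely for illustration, the following sketch solves the factorized rank-*r* objective for \(r=1\) with alternating least squares on a tiny synthetic matrix (all values hypothetical):

```python
# Rank-1 matrix D = u v^T; two entries are hidden and then recovered.
u = [1.0, 2.0, 3.0, 4.0]
v = [1.0, 0.5, 2.0, 1.0]
D = [[ui * vj for vj in v] for ui in u]
m = 4
hidden = {(0, 1), (2, 3)}
omega = [(i, j) for i in range(m) for j in range(m) if (i, j) not in hidden]

mu = 1e-9           # small regularization weight
L = [1.0] * m       # factor vectors, X = L R^T with r = 1
R = [1.0] * m
for _ in range(50):  # alternating least squares: each subproblem is closed form
    for i in range(m):
        cols = [j for (a, j) in omega if a == i]
        L[i] = sum(D[i][j] * R[j] for j in cols) / (mu + sum(R[j] ** 2 for j in cols))
    for j in range(m):
        rows = [i for (i, b) in omega if b == j]
        R[j] = sum(D[i][j] * L[i] for i in rows) / (mu + sum(L[i] ** 2 for i in rows))

recovered = L[2] * R[3]   # estimate of the hidden entry D[2][3]
```

Because the objective is only observed on \(\varOmega _n\), the hidden entries are filled in purely through the low-rank structure, which is exactly what the recovery guarantees in Sect. 4 are about.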

**Projection to the Mercer Kernel Space.** The solution \(\mathbf{L}^*(\mathbf{R}^*)^T\) of the rank-*r* completion (4) is not necessarily a valid kernel matrix, so we project it onto the set of matrices satisfying Mercer’s conditions, symmetric positive semidefinite rank-*r* matrices in our case.
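A minimal sketch of such a projection, assuming a \(2\times 2\) matrix so that the eigendecomposition has a closed form (a numerical eigensolver would be used for realistic sizes):

```python
import math

def project_psd_2x2(X):
    """Project a 2x2 matrix onto symmetric PSD matrices:
    symmetrize, then clip negative eigenvalues to zero."""
    # Symmetrize: Z = (X + X^T) / 2.
    a = X[0][0]
    b = (X[0][1] + X[1][0]) / 2.0
    c = X[1][1]
    # Eigenvalues of [[a, b], [b, c]].
    mean, radius = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
    lam1, lam2 = mean + radius, mean - radius
    # Eigenvector for lam1 (handle the diagonal case b == 0 separately).
    if b == 0:
        e1 = (1.0, 0.0) if a >= c else (0.0, 1.0)
    else:
        norm = math.hypot(lam1 - c, b)
        e1 = ((lam1 - c) / norm, b / norm)
    e2 = (-e1[1], e1[0])  # orthogonal complement
    l1, l2 = max(lam1, 0.0), max(lam2, 0.0)
    # Reconstruct l1 * e1 e1^T + l2 * e2 e2^T.
    return [
        [l1 * e1[0] * e1[0] + l2 * e2[0] * e2[0],
         l1 * e1[0] * e1[1] + l2 * e2[0] * e2[1]],
        [l1 * e1[1] * e1[0] + l2 * e2[1] * e2[0],
         l1 * e1[1] * e1[1] + l2 * e2[1] * e2[1]],
    ]

# [[1, 2], [2, 1]] has eigenvalues 3 and -1; the projection clips the -1.
Z = project_psd_2x2([[1.0, 2.0], [2.0, 1.0]])
```

Clipping negative eigenvalues in the eigendecomposition is the standard Frobenius-norm projection onto the PSD cone.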

### **Lemma 1**

### *Proof*

The result follows from the definition of \(\mathbf{Z}^*\) and the triangle inequality.

After training SVMs, we apply the same technique to new test examples to build a test kernel matrix. This usually involves a smaller matrix completion problem corresponding to the support vectors and the test examples.

### 3.3 Reduction with MKL Kernels Using Support Vectors

The matrix completion optimization (4) for recovering full kernel matrices involves \(m^2\) variables, and therefore it requires more computational resources for larger *m*. It turns out that we can work with fewer than \(m^2\) variables, using a property we discovered for *support vectors* (SVs) in the case of the MKL kernel. Recall that a support vector is an example indexed by \(i \in \{1,2,\dots ,m\}\) for which the optimal solution \(\alpha ^*_i\) of the SVM problem (1) is strictly positive.

Let \(S^*\) denote the set of SVs of the SVM trained with the full-information kernel, and let \(S^*_n\) be the set of SVs of the local SVM trained at node *n*, \(n=1,2,\dots ,N\). The number of SVs is typically much smaller than *m*, and it is well known that an SVM predictor is fully determined by its SVs [18]. Therefore, if we can estimate \(S^*\) without too much cost, then we can focus matrix completion on the variables corresponding to \(S^*\), rather than considering all \(m^2\) variables. Our next theorem shows that an estimation of \(S^*\) is possible without solving the full-information problem, simply by the union of local SV sets (the proof is in the Appendix).

### **Theorem 1**

*For the MKL kernel, the SV set \(S^*\) of the full-information SVM satisfies \(S^* \subseteq \cup _{n=1}^N S^*_n\).*

Using the theorem, matrix completion (4) can be solved more efficiently with \(|\cup _{n} S_n^*|^2\) variables. In our experiments, this number was much smaller than \(m^2\): the ratio \(|\cup _{n} S_n^*|^2/m^2\) was in the range of 0.31 to 0.98, and in half of the cases the ratio was below 0.5.
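A small sketch of the resulting saving, with hypothetical local SV index sets:

```python
# Hypothetical SV index sets from N = 3 local SVMs over m = 10 examples.
m = 10
local_svs = [{0, 2, 5}, {2, 3, 5, 7}, {0, 5, 8}]

# The full-information SV set is contained in the union of the local sets,
# so matrix completion only needs variables for the union's index pairs.
union = set().union(*local_svs)
saving = len(union) ** 2 / m ** 2  # fraction of the original m^2 variables
```

Here the completion problem would shrink to 36 % of its original size; the ratios reported in our experiments were computed in the same way.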

## 4 Matrix Recovery and Classification Error Bounds

To decide the size of the sample index set \(\varOmega \), it is important to understand when a (perfect) recovery of unseen matrix elements is possible from only a few observed entries. This is also closely related to how well an SVM trained with a recovered kernel matrix will perform in classification, compared to the case of using full-information kernel matrices.

### 4.1 Conditions for Matrix Recovery

The technique of matrix completion guarantees the perfect recovery of a partially observed matrix with high probability, under a certain condition called the *strong incoherence property* [4].

Suppose that the matrix \(\mathbf{D}\) we want to recover has rank *r*, so that the reduced singular value decomposition of \(\mathbf{D}\) can be written as \(\mathbf{D}= \sum _{k=1}^{r} \sigma _k \mathbf{u}_k \mathbf{v}_k^T\), with singular values \(\sigma _k\) and singular vectors \(\mathbf{u}_k, \mathbf{v}_k\).

The next theorem states that when the rank *r* or the incoherence parameter \(\mu \) of a matrix \(\mathbf{D}\) we want to recover is small, exact recovery via matrix completion (4) is possible with high probability from only a small number of observed entries.

### **Theorem 2**

**(Candès and Plan [5]).** *Let \(\mathbf{D}\in \mathfrak {R}^{m\times m}\) be a matrix with rank \(r \in (0,m]\) and a strong incoherence parameter \(\mu >0\). If the number of observed entries from \(\mathbf{D}\) satisfies \(|\varOmega | \ge C \mu ^2 r m \log ^6 m\) for a numerical constant \(C>0\), then \(\mathbf{D}\) is recovered exactly by matrix completion with probability at least \(1 - m^{-3}\).*

### 4.2 Classification Error Bounds for Using Estimated Kernels

For consistent SVMs [19], the *error coefficient* \(\mathcal E_{\mathbf{K}}\) approaches the Bayes error in probability as \(m \rightarrow \infty \).

### **Theorem 3**

Here *C* is a parameter for the SVM, and \(\lambda _1(\mathbf{Q})\) is the smallest eigenvalue of \(\mathbf{Q}= \mathbf{Y}\mathbf{K}\mathbf{Y}\), \(\mathbf{Y}= \text {diag}(\mathbf{y})\).

Note that when observations are made without noise, \(\delta _1=0\) with probability at least \(1-m^{-3}\) by Theorem 2. Also, \(\delta _2\) is expected to be small due to our symmetric construction of \(\varOmega \), in which case \(\mathbf{L}^*\approx \mathbf{R}^*\). Moreover, we often specify \(C = C'/m\) for some \(C'>0\), and therefore if \(1/\lambda _1(\mathbf{Q}) = o(m)\), that is, \(\lim _{m\rightarrow \infty } \frac{o(m)}{m} =0\), then the term \(C/\lambda _1(\mathbf{Q})\) becomes small as *m* increases.

## 5 Experiments

For experiments, we used five benchmark data sets from the UCI machine learning repository [1], summarized in Table 1, and also their subset composed of 5000 training and 5000 test examples (denoted by 5k/5k) to study characteristics of algorithms under various circumstances.

We used the jellyfish solver^{3} [17] for matrix completion, and svmlight^{4} [8] for solving SVMs. Our implementation makes use of the union SV set theorem (Theorem 1) for the MKL approach to reduce kernel completion time, but not for Schur, since the theorem does not apply to that case.

Data sets and their training parameters. Different values of *C* were used for the full data sets (column *C*) and smaller 5*k*/5*k* sets (column *C* (5*k*/5*k*)).

| Name | *m* (train) | Test | *p* | *C* | *C* (5k/5k) | \(\gamma \) |
|---|---|---|---|---|---|---|
| ADULT | 40701 | 8141 | 124 | 10 | 10 | 0.001 |
| MNIST | 58100 | 11900 | 784 | 0.1 | 1162 | 0.01 |
| CCAT | 89702 | 11574 | 47237 | 100 | 156 | 1.0 |
| IJCNN | 113352 | 28339 | 22 | 1 | 2200 | 1.0 |
| COVTYPE | 464809 | 116203 | 54 | 10 | 10 | 1.0 |

For all experiments, we split the original input feature vectors into subvectors of almost equal lengths, one for each node, with \(N=3\) nodes for the 5k/5k sets and \(N=10\) nodes for the full data sets. The tuning parameters *C* and \(\gamma \) were determined by cross validation for the full sets, and the *C* values for the 5k/5k subsets were determined on independent validation subsets, both with svmlight. The results of svmlight were included for comparison with non-distributed SVM training. Following [12], the local Gaussian kernel parameters for MKL were adjusted to \(\gamma _n = \frac{p}{p_n}\gamma \approx N\gamma \) for a given \(\gamma \), so that \(\gamma _n \Vert \mathbf{x}_i[n]-\mathbf{x}_j[n]\Vert ^2\) has the same order of magnitude, \({\mathcal O}(\gamma p)\), as \(\gamma \Vert \mathbf{x}_i - \mathbf{x}_j\Vert ^2\).
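A sketch of this parameter adjustment, \(\gamma _n = (p/p_n)\,\gamma \); the split sizes below are one hypothetical near-equal partition:

```python
# Splitting p features over N nodes and scaling local Gaussian parameters
# gamma_n = (p / p_n) * gamma, so gamma_n * ||x_i[n] - x_j[n]||^2 keeps
# the same order of magnitude as gamma * ||x_i - x_j||^2.
def local_gammas(p_sizes, gamma):
    p = sum(p_sizes)
    return [p / p_n * gamma for p_n in p_sizes]

# Example: p = 124 features (as in ADULT) over N = 10 near-equal parts.
sizes = [13] * 4 + [12] * 6          # 4*13 + 6*12 = 124
gammas = local_gammas(sizes, 0.001)  # each close to N * gamma = 0.01
```

With equal splits this reduces to \(\gamma _n = N\gamma \) exactly; unequal splits get slightly different scalings per node.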

### 5.1 Characteristics of Kernel Matrices

The first set of experiments verifies how well the kernel matrices are suited to matrix completion. For this, we computed the two types of full kernel matrices, Schur (2) and MKL (3), accessing all features of the small 5k/5k subsets of the five UCI data sets (in this case the Schur kernel is simply the Gaussian kernel).

The important characteristics of a kernel matrix with respect to matrix completion are its rank (*r*) and its coherence parameters \(\mu _1\) and \(\mu _2\) defined in (6) and (7). When these values are small, Theorem 2 tells us that only a small number of observations is needed for perfect matrix completion with high probability.

| Name | Schur: Density | *r* | \(\mu _1\) | \(\mu _2\) | MKL: Density | *r* | \(\mu _1\) | \(\mu _2\) |
|---|---|---|---|---|---|---|---|---|
| ADULT | 1.0 | 789 | 25.7 | 24.4 | 1.0 | 222 | 12.0 | 5.5 |
| MNIST | 1.0 | 4782 | 68.1 | 68.5 | 1.0 | 4568 | 66.6 | 66.2 |
| CCAT | 1.0 | 4984 | 69.6 | 70.6 | 1.0 | 4982 | 69.6 | 70.6 |
| IJCNN | 1.0 | 1516 | 37.2 | 37.8 | 1.0 | 698 | 25.1 | 6.6 |
| COVTYPE | 1.0 | 1423 | 35.9 | 35.8 | 1.0 | 424 | 19.3 | 2.7 |

Test prediction performance on the full data sets (mean and standard deviation). Two sampling ratios (2 % and 10 %) were tried for our method. The svmlight results are from using classical Gaussian kernels with matching parameters. \(|\cup _n S_n^*|^2/m^2\) is the fraction of the reduced number of variables compared to \(m^2\).

| Name | \(|\cup _n S_n^*|^2/m^2\) | MKL (\(2\,\%\)) | MKL (\(10\,\%\)) | asset | svmlight |
|---|---|---|---|---|---|
| ADULT | 0.37 | 81.4\(\scriptstyle {\pm }\)1.00 | 84.2\(\scriptstyle {\pm }\)0.18 | 80.0\(\scriptstyle {\pm }\)0.02 | 84.9 |
| MNIST | 0.98 | 78.9\(\scriptstyle {\pm }\)1.69 | 87.0\(\scriptstyle {\pm }\)0.20 | 88.9\(\scriptstyle {\pm }\)0.39 | 98.9 |
| CCAT | 0.71 | 87.2\(\scriptstyle {\pm }\)1.00 | 92.0\(\scriptstyle {\pm }\)0.35 | 73.7\(\scriptstyle {\pm }\)1.00 | 95.8 |
| IJCNN | 0.31 | 96.0\(\scriptstyle {\pm }\)0.35 | 96.5\(\scriptstyle {\pm }\)0.23 | 90.9\(\scriptstyle {\pm }\)0.88 | 99.3 |

### 5.2 The Effect of Sampling Size

We varied the sampling ratio \(|\varOmega _n|/m^2\) on the 5k/5k subsets, so *m* is 5000 in this experiment. We compared the prediction performance of Schur and MKL to that of svmlight.

The bottom-right corner of Fig. 2 shows the concentration of the eigenvalue spectrum in the five kernel matrices. The height of each box represents the magnitude of the corresponding normalized eigenvalue, so that the height of a stack of boxes represents the proportion of the entire spectrum concentrated in the top 10 eigenvalues. The plot shows that \(90\,\%\) of the spectrum in ADULT is concentrated in the top 10 eigenvalues, indicating that its kernel matrix has a very small numerically effective rank. This gives one explanation of why our method performs as well as svmlight in the case of ADULT.

Comparing Schur to MKL, both showed similar prediction performance. However, the higher concentration of the eigenvalue spectrum of MKL indicated that it would make a good alternative to Schur, also considering the extra saving with MKL discussed in Sect. 3.3.

### 5.3 Performance on Full Data Sets

In the last experiment, we used the full data sets to compare our method to one of the most closely related approaches, asset [12]. Since asset admits only MKL-type kernels, we omitted the Schur kernel from the comparison. Among the several versions of asset in [12], we used the “Separate” version with central optimization. COVTYPE was excluded due to runtime issues with svmlight.

The results are in Table 3. The second column shows the ratio \(|\cup _n S_n^*|^2/m^2\); these numbers indicate the saving achieved by the union SV trick: for example, the matrix completion problem is reduced to \(37\,\%\) of its original size for ADULT. The saving was substantial for ADULT and IJCNN. In terms of prediction performance, we achieved test accuracy approaching that of svmlight (within \(1\,\%\) point (ADULT), \(3.8\,\%\) points (CCAT), and \(2.8\,\%\) points (IJCNN) on average) with a 10 % sampling ratio, except for the case of MNIST, where the gap was significantly larger (\(11.9\,\%\)): this result is consistent with the discussion in Sects. 5.1 and 5.2. Our method (with 10 % sampling) also outperformed asset (by \(4.2\,\%\), \(18.3\,\%\), and \(5.6\,\%\) on average for ADULT, CCAT, and IJCNN respectively), except for the case of MNIST, where asset was better by a small but not negligible margin (\(1.9\,\%\)). We conjecture that the approximation of the kernel mapping in asset fits particularly well for MNIST, but this remains to be investigated.

## 6 Conclusions

We have proposed a simple framework for learning nearly consensus SVMs for scenarios where features are stored in a distributed manner. Our method makes use of decompositions of kernels, together with kernel matrix completion to recover unobserved entries of remote kernel matrices. The resulting SVMs performed well with relatively small numbers of sampled entries, but under certain conditions. A newly discovered property of support vectors also helped us further reduce computation cost in matrix completion.

Several aspects of our method remain to be investigated further. First, different types of kernels may involve different types of decomposition, with new characteristics in terms of matrix completion. Second, although parameters of SVMs and kernels can be tuned using small aggregated data, it would be desirable to tune parameters locally, or to consider entirely parameter-free alternatives if possible. Also, despite the benefits of the MKL kernel, it requires more kernel parameters to be specified compared to the Schur kernel. Therefore when the budget for parameter tuning is limited, Schur would be preferred to MKL. Finally, it would be worthwhile to analyze the characteristics of the suggested algorithm in real communication systems to make it more practical, considering non-uniform communication cost on non-symmetric networks.

## Footnotes

- 1.
We require \(\varOmega _n\) to be symmetric, that is, if \((i,j) \in \varOmega _n\) then \((j,i)\in \varOmega _n\), and also the values corresponding to the entries are the same.

- 2.
Not actual values, but the positions to be sampled from.

- 3.
Available for download at http://hazy.cs.wisc.edu/hazy/victor/jellyfish/.

- 4.
Available at http://svmlight.joachims.org/.

## Notes

### Acknowledgements

The authors acknowledge the support of Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 “Providing Information by Resource-Constrained Analysis”, projects A1 and C1.

## References

- 1. Bache, K., Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml
- 2. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152 (1992)
- 3. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. **3**(1), 1–122 (2011)
- 4. Candès, E.J., Tao, T.: The power of convex relaxation: near-optimal matrix completion. IEEE Trans. Inf. Theor. **56**(5), 2053–2080 (2010)
- 5. Candès, E., Plan, Y.: Matrix completion with noise. Proc. IEEE **98**(6), 925–936 (2010)
- 6. Crammer, K., Dredze, M., Pereira, F.: Confidence-weighted linear classification for natural language processing. J. Mach. Learn. Res. **13**, 1891–1926 (2012)
- 7. Forero, P.A., Cano, A., Giannakis, G.B.: Consensus-based distributed support vector machines. J. Mach. Learn. Res. **11**, 1663–1707 (2010)
- 8. Joachims, T.: Making large-scale support vector machine learning practical. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods - Support Vector Learning, chap. 11, pp. 169–184. MIT Press, Cambridge (1999)
- 9. Lanckriet, G., Cristianini, N., Bartlett, P., El Ghaoui, L., Jordan, M.: Learning the kernel matrix with semidefinite programming. In: Proceedings of the 19th International Conference on Machine Learning (2002)
- 10. Lee, S., Bockermann, C.: Scalable stochastic gradient descent with improved confidence. In: Big Learning - Algorithms, Systems, and Tools for Learning at Scale, NIPS Workshop (2011)
- 11. Lee, S., Pölitz, C.: Kernel completion for learning consensus support vector machines in bandwidth-limited sensor networks. In: International Conference on Pattern Recognition Applications and Methods (2014)
- 12. Lee, S., Stolpe, M., Morik, K.: Separable approximate optimization of support vector machines for distributed sensing. In: De Bie, T., Cristianini, N., Flach, P.A. (eds.) ECML PKDD 2012, Part II. LNCS, vol. 7524, pp. 387–402. Springer, Heidelberg (2012)
- 13. Mangasarian, O.L., Musicant, D.R.: Lagrangian support vector machines. J. Mach. Learn. Res. **1**, 161–177 (2001)
- 14. Morik, K., Bhaduri, K., Kargupta, H.: Introduction to data mining for sustainability. Data Min. Knowl. Disc. **24**(2), 311–324 (2012)
- 15. Huang, L., Joseph, A.D., Nguyen, X.L.: Support vector machines, data reduction, and approximate kernel matrices. In: Goethals, B., Daelemans, W., Morik, K. (eds.) ECML PKDD 2008, Part II. LNCS (LNAI), vol. 5212, pp. 137–153. Springer, Heidelberg (2008)
- 16. Recht, B., Fazel, M., Parrilo, P.A.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. **52**(3), 471–501 (2010)
- 17. Recht, B., Ré, C.: Parallel stochastic gradient algorithms for large-scale matrix completion. Technical report, University of Wisconsin-Madison (2011)
- 18. Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge (2001)
- 19. Steinwart, I.: On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res. **2**, 67–93 (2002)
- 20. Stolpe, M., Bhaduri, K., Das, K., Morik, K.: Anomaly detection in vertically partitioned data by distributed core vector machines. In: Nijssen, S., Železný, F., Blockeel, H., Kersting, K. (eds.) ECML PKDD 2013, Part III. LNCS, vol. 8190, pp. 321–336. Springer, Heidelberg (2013)
- 21. Trefethen, L.N., Bau, D.: Numerical Linear Algebra. SIAM, Philadelphia (1997)