# Large margin classification with indefinite similarities

## Abstract

Classification with indefinite similarities has attracted attention in the machine learning community. This is partly due to the fact that many similarity functions that arise in practice are not symmetric positive semidefinite, i.e. the Mercer condition is not satisfied, or the Mercer condition is difficult to verify. Examples of such indefinite similarities in machine learning applications abound and include, for instance, the BLAST similarity score between protein sequences, human-judged similarities between concepts and words, and the tangent distance or the shape matching distance in computer vision. Nevertheless, previous works on classification with indefinite similarities are not fully satisfactory. They have either introduced sources of inconsistency in handling past and future examples using *kernel approximation*, settled for local-minimum solutions using *non-convex optimization*, or produced non-sparse solutions by *learning in Krein spaces*. Despite the large volume of research devoted to this subject lately, we demonstrate in this paper how an old idea, namely the 1-norm support vector machine (SVM) proposed more than 15 years ago, has several advantages over more recent work. In particular, the 1-norm SVM method is conceptually simpler, which makes it easier to implement and maintain. It is competitive with, if not superior to, all other methods in terms of predictive accuracy. Moreover, it produces solutions that are often sparser than those of more recent methods by several orders of magnitude. In addition, we provide various theoretical justifications by relating 1-norm SVM to well-established learning algorithms such as neural networks, SVM, and nearest neighbor classifiers. Finally, we conduct a thorough experimental evaluation, which reveals that the evidence in favor of 1-norm SVM is statistically significant.

## Keywords

Support vector machine · Indefinite kernels · Similarity-based classification · Supervised learning · Linear programming

## 1 Introduction

Classification is one of the most fundamental problems in machine learning and statistics. In its binary form, it refers to the task of predicting a binary label \(y\in \{-1,+1\}\) given an instance \(x\in \mathcal {X}\), based on a training set of data whose instances and labels are both known. Classification has many important applications covering a wide spectrum that includes engineering, e-commerce, and medicine. For example, the popular UCI Machine Learning Repository (Lichman 2013) contains about 240 datasets, two thirds of which are for classification problems such as predicting disease severity, recognizing handwritten characters, identifying material types, detecting unsolicited messages, and recognizing signs. Due to its immense popularity, many algorithms have been invented for classification including nearest neighbor classifiers, decision trees, artificial neural networks, and Bayesian methods, to name only a few.

One of the most successful binary classification algorithms in practice today is the support vector machine (SVM) algorithm, which was developed in the 1990s by Vapnik and colleagues (Boser et al. 1992; Cortes and Vapnik 1995; Vapnik 1999). It works by constructing a hyperplane in a high-dimensional (possibly infinite-dimensional) feature space that separates the positive from the negative instances. Most importantly, it seeks a separating hyperplane that classifies most training instances correctly with a *large margin*. SVM and its many variants were inspired by deep theoretical foundations that make use of the Vapnik-Chervonenkis (VC) dimension to establish the generalization ability of this family of classifiers (Burges 1998; Vapnik 1999). In informal terms, by seeking a large margin classifier, SVM tends to reduce its own risk of overfitting. We will return to this statement later in Sect. 4.

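In its dual form, the soft-margin SVM solves the quadratic program:

\[ \max_{\alpha \in \mathbb {R}^m} \;\; \mathbf 1^T\alpha - \frac{1}{2}\,\alpha ^T\, Y K Y\, \alpha \qquad \text {subject to} \qquad 0\le \alpha \le C\,\mathbf 1, \quad y^T\alpha = 0. \tag{1} \]

Here, \(y\in \{-1,+1\}^m\) is the vector of the *m* class labels, \(Y\) is the diagonal matrix whose diagonal is \(y\), *C* is a fixed tradeoff constant, \(\mathbf 1 = (1,1,\ldots ,1)^T\in \mathbb {R}^m\), while \(K\in \mathbb {R}^{m\times m}\) is the similarity (kernel) matrix. In the dual form in (1), the similarity matrix *K* has to be symmetric positive semidefinite, i.e. satisfy the Mercer condition, in order to guarantee convexity of the optimization problem and the existence of a reproducing kernel Hilbert space (RKHS). When *K* is positive semidefinite, the optimization problem in (1) can be solved quite efficiently and its optimal solution can be used to construct a large margin separating hyperplane in some implicit feature space. Such advantages are no longer guaranteed when *K* is indefinite.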
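On a finite training sample, the Mercer condition can be checked numerically by inspecting the eigenvalues of the similarity matrix. The snippet below is a minimal sketch of such a check; the helper name `is_psd` and the tolerance `tol` are illustrative choices rather than anything prescribed in this paper, and passing the test on one sample does not prove that the similarity function is PSD in general.

```python
import numpy as np

def is_psd(K, tol=1e-10):
    """Return True if K is symmetric and has no eigenvalue below -tol."""
    K = np.asarray(K, dtype=float)
    if not np.allclose(K, K.T):
        return False  # the Mercer condition requires a symmetric matrix
    # eigvalsh is meant for symmetric matrices and returns real eigenvalues
    return bool(np.linalg.eigvalsh(K).min() >= -tol)

# Gram matrix of the linear kernel: PSD by construction.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K_linear = X @ X.T

# A symmetric similarity matrix with eigenvalues -1 and 3: indefinite.
K_indef = np.array([[1.0, 2.0], [2.0, 1.0]])

print(is_psd(K_linear), is_psd(K_indef))
```

A small tolerance is used because round-off can make an exactly zero eigenvalue appear slightly negative.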

In real-life applications, however, many similarity functions exist that are either indefinite or for which the Mercer condition is difficult to verify. For example, one can incorporate the longest common subsequence in defining a distance between genetic sequences, use the BLAST similarity score between protein sequences, use set operations such as the union and/or intersection in defining a similarity between transactions, use human-judged similarities between concepts and words, use the symmetrized Kullback–Leibler divergence between probability distributions, use dynamic time warping (DTW) for time series, or use the tangent distance and the shape matching distance in computer vision (Chen et al. 2009a; Wu et al. 2005; Ying et al. 2009; Haasdonk 2005). Indefinite similarities are also frequently encountered in psychology, neuroscience, and economics (Graepel et al. 1999). Extending large-margin classification to indefinite similarities will have many important applications.

Because classification with indefinite similarities is a frequently encountered problem, many algorithms have been proposed in the literature to solve it. These include algorithms that are based on kernel approximation, non-convex optimization, and learning in Krein spaces. Other classical algorithms, such as nearest neighbor classifiers and the relevance vector machine (RVM), have also been shown to be useful for the task (Graepel et al. 1999; Tipping 2001; Loosli et al. 2013). Despite the large volume of research devoted to this subject, however, we demonstrate in this paper how an old idea, namely the 1-norm support vector machine (SVM) method proposed more than 15 years ago (Graepel et al. 1999; Zhu et al. 2004), has several advantages over more recent work. In particular, the 1-norm SVM method is conceptually simpler, which makes it easier to implement and maintain. It is also competitive with, if not superior to, all other methods in terms of predictive accuracy. Moreover, it produces solutions that are often sparser than those of more recent methods by several orders of magnitude.

There are several reasons why the 1-norm SVM method is competitive with more recent approaches in its predictive accuracy. Unlike many alternative methods that have been proposed in the literature, 1-norm SVM retains convexity of the optimization problem and treats both training and test examples consistently. It is closely connected to many well-established learning algorithms such as artificial neural networks, nearest neighbor classifiers, and SVM. As will be discussed in more detail in the sequel, these connections provide a formal justification for the use of 1-norm SVM when learning from indefinite similarities.

In the literature, 1-norm SVM is often used as an *embedded* feature selection method, where learning and feature selection are performed simultaneously (Bradley and Mangasarian 1998; Zhu et al. 2004; Fung and Mangasarian 2004; Zou 2007; Hilario and Kalousis 2008; Liu et al. 2010). It was studied in Zhu et al. (2004), where it was argued that 1-norm SVM has an advantage over the standard form of SVM in (1) when there are redundant noisy features. Although it was suggested as a viable method for classification with indefinite similarities more than 15 years ago (Graepel et al. 1999), it has remained less well known than more involved and less accurate approaches such as kernel approximation and non-convex optimization. In particular, 1-norm SVM is rarely used as a standard benchmark for the task (see for instance Chen et al. 2009b; Luss and d’Aspremont 2009; Ying et al. 2009; Chen and Ye 2008; Lin and Lin 2003).

The rest of the paper is structured as follows. First, we review the existing literature on learning with indefinite similarities. Second, we describe the 1-norm SVM method and show how it can be adapted to handle binary classification with indefinite similarities. We provide various motivations behind its formulation by relating 1-norm SVM to artificial neural networks, nearest neighbor classifiers, and SVM. We also show that 1-norm SVM can be interpreted as a method of minimizing an upper bound on the expected true risk (prediction error rate). After that, we present experimental results using both synthetic and real datasets, which validate the advantage of using 1-norm SVM in handling indefinite similarities over all other methods.

## 2 Previous work

Several methods have been proposed in the literature for learning with indefinite similarities. Some of these methods are old, such as non-convex optimization (Lin and Lin 2003), while others are more recent, such as the eigen-decomposition SVM (ESVM) proposed in 2013 (Loosli et al. 2013). More classical classification algorithms, such as nearest neighbor classifiers and the relevance vector machine (RVM), have also been shown to be useful for the task (Graepel et al. 1999; Tipping 2001; Loosli et al. 2013). Generally speaking, however, the most dominant methods can be grouped into three broad approaches: (1) kernel approximation, (2) non-convex optimization, and (3) learning in Krein spaces. We review each approach next.

### 2.1 Kernel approximation

The first approach for learning with indefinite similarities is the *kernel approximation* method. In this approach, not only does the learning algorithm look for a hypothesis that can correctly classify training instances with a large margin, but it also approximates the indefinite similarity matrix with a positive semidefinite (PSD) matrix so that the resulting optimization problem can be solved quite efficiently. That is, it is implicitly assumed that the advantage reaped from convexifying a non-convex optimization problem outweighs the information loss incurred by artificially altering (distorting) the similarity matrix.

Two of the earliest kernel approximation methods are the *denoise* and the *flip* methods. Both methods alter the eigenvalues of the similarity matrix of training examples so that it becomes PSD. The two methods differ, however, in how they alter those eigenvalues. On one hand, the *denoise* method sets all negative eigenvalues to zero. The motivation behind such approach is to assume that negative eigenvalues are caused by noise (Pekalska et al. 2001). On the other hand, the *flip* method flips the sign of the negative eigenvalues, hence the name. This method aims at retaining some of the information coded in those negative eigenvalues (Pekalska et al. 2001; Graepel et al. 1999). A third more involved kernel approximation method is to formulate a max-min optimization problem that both seeks support vectors as well as a PSD kernel that approximates the indefinite similarity matrix. The latter approach was introduced by Luss and d’Aspremont in 2007 with improvements in training time reported in the following years (Chen and Ye 2008; Luss and d’Aspremont 2009; Chen et al. 2009b).
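The two spectrum modifications described above can be sketched in a few lines. The function name `modify_spectrum` and the symmetrization step are assumptions made for illustration; the sketch only shows how the eigenvalues of the training similarity matrix are altered, not the full training procedure of any cited method.

```python
import numpy as np

def modify_spectrum(K, method="denoise"):
    """Make a symmetric similarity matrix PSD by altering its eigenvalues.

    method="denoise": set negative eigenvalues to zero.
    method="flip":    replace negative eigenvalues by their absolute values.
    """
    K = np.asarray(K, dtype=float)
    K = 0.5 * (K + K.T)               # assumption: symmetrize the input first
    eigvals, eigvecs = np.linalg.eigh(K)
    if method == "denoise":
        eigvals = np.maximum(eigvals, 0.0)
    elif method == "flip":
        eigvals = np.abs(eigvals)
    else:
        raise ValueError("method must be 'denoise' or 'flip'")
    # Reassemble V diag(eigvals) V^T
    return (eigvecs * eigvals) @ eigvecs.T

K = np.array([[1.0, 2.0], [2.0, 1.0]])   # eigenvalues -1 and 3
K_denoise = modify_spectrum(K, "denoise")
K_flip = modify_spectrum(K, "flip")
```

Both outputs are PSD, but neither equals the original matrix, which is exactly the distortion that the kernel approximation approach accepts in exchange for convexity.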

All of the kernel approximation methods above guarantee that the optimization problem remains convex during training. During prediction, however, the original indefinite similarity function is used. Hence, past and future examples are treated inconsistently. In addition, such methods are only useful when the similarity matrix is approximable by a PSD matrix. For other similarity functions, such as the sigmoid kernel that can occasionally yield a *negative semidefinite* matrix for certain values of its hyperparameters, the kernel approximation approach cannot be utilized. In fact, and as will be shown later in the evaluations in Sect. 5, the accuracy of some of these methods can become even *worse* than random guessing, especially when the similarity matrix is close to being negative semidefinite.

### 2.2 Non-convex optimization

The second approach for learning with indefinite similarities is *non-convex optimization*. In contrast to the previous kernel approximation approach, non-convex optimization implicitly assumes that treating training and test examples consistently by keeping the similarity function intact is more important than convexifying the problem. Because the optimization problem is non-convex, however, this approach can terminate at a local minimum.

In the literature, non-convex optimization with indefinite similarities for SVMs has received a fair amount of attention. On the theoretical side, Haasdonk interprets this approach as a method of minimizing the distance between reduced convex hulls in a pseudo-Euclidean space (Haasdonk 2005). On the practical side, SMO-type decomposition methods, which seek a stationary point, have been proposed for indefinite similarity functions such as the sigmoid kernel (Lin and Lin 2003). Nevertheless, because non-convex optimization can terminate at a stationary point that can be quite distant from the globally optimal solution, non-convex optimization does not guarantee learning (Chen et al. 2009a). In addition, this approach only works reasonably well if the similarity matrix is approximately positive semidefinite (PSD).

### 2.3 Learning in Krein spaces

The last major approach that has been proposed in the literature resolves many of the issues that are inherent in the kernel approximation and non-convex optimization methods. It proposes efficient learning algorithms that treat training and test examples consistently, and achieves a very high accuracy in practice. This fairly-recent approach is based on learning in Krein spaces, in which the similarity function is decomposed into the sum of one positive semidefinite kernel and one negative semidefinite kernel (Ong et al. 2004; Loosli et al. 2013). By taking such decomposition of similarity functions, learning in Krein spaces embraces the idea that the negative part of a similarity function contains viable information (Loosli et al. 2013).

Whereas learning in a Hilbert space can be formulated as a minimization problem, learning in a Krein space is formulated as a *stabilization* problem, where a saddle point of the objective function is found. One fairly recent algorithm that has been proposed to solve the stabilization problem is called eigen-decomposition SVM (ESVM) (Loosli et al. 2013). While this algorithm has been shown to outperform all previous methods, its primary drawback is that it does not produce sparse solutions; hence, the entire set of training examples is often needed during prediction.

As will be shown later in the evaluations in Sect. 5, 1-norm SVM and ESVM both outperform all other methods significantly, and their performance is strikingly similar despite the different approaches employed by the two algorithms. Nevertheless, 1-norm SVM has the added advantage of producing a solution that is often sparser than ESVM by many orders of magnitude. Hence, 1-norm SVM will prove to be the most effective method for classification with indefinite similarities.

## 3 The 1-norm support vector machine

The 1-norm support vector machine (SVM) method was proposed more than 15 years ago (Graepel et al. 1999). It has appeared under many guises, such as a special case of the *generalized SVM* (Mangasarian 1998) and a method of embedding similarities into features (Chen et al. 2009a; Pekalska et al. 2001). In the literature, however, it is most widely used for embedded feature selection (Bradley and Mangasarian 1998; Zhu et al. 2004; Fung and Mangasarian 2004; Zou 2007; Hilario and Kalousis 2008; Liu et al. 2010).

### 3.1 Description

Before we describe the 1-norm SVM method, we recall the binary classification setting. In this setting, we have an instance space \(\mathcal {X}\) and a target set \(\mathcal {Y}=\{+1,-1\}\). Every observation is a pair \((x,y)\in \mathcal {X}\times \mathcal {Y}\) drawn from some fixed unknown distribution \(\mathcal {D}\). A classifier \(h:\mathcal {X}\rightarrow \mathcal {Y}\) is a rule that maps each instance \(x\in \mathcal {X}\) to either the positive class or the negative class. Throughout this paper, such a classifier is assumed to be of the form \(h(x) = \mathbf{sign}(f(x))\) for some function \(f:\mathcal {X}\rightarrow \mathbb {R}\), where *f* is learned on the basis of a set of *m* training examples \(\{(x_i,\,y_i)\}_{i=1,\ldots ,m}\) drawn i.i.d. from \(\mathcal {D}\).

We focus on *similarity-based* classification, which is a generalization of nearest neighbor classifiers. Given a similarity function \(S:\mathcal {X}\times \mathcal {X}\rightarrow \mathbb {R}\), we can predict whether \(y=+1\) or \(y=-1\) for an instance *x* based on how similar *x* is to the fixed set of *m* training examples. A general approach of similarity-based classification is to use a decision rule of the form:

\[ f(x) = \lambda _0 + \sum _{i=1}^m \lambda _i\, y_i\, S(x,\,x_i). \tag{2} \]

Here, *x* is the instance whose label we would like to predict, whereas \(\{(x_i,\,y_i)\}_{i=1,\ldots ,m}\) is a training set of *m* observations drawn i.i.d. from some fixed unknown distribution \(\mathcal {D}\).

To interpret the decision rule in (2), we note that \(\lambda _0\) is a *biasing* term that is similar to the activation threshold in neural networks or the prior in Bayesian methods, while \(\lambda _i\) for \(i\ge 1\) quantifies how important the training example \((x_i,\,y_i)\) is to the classification rule. According to (2), if an instance *x* is “more” similar to the weighted set of negative training examples, then *f*(*x*) would be negative; otherwise, *f*(*x*) is positive. Different methods of learning the weights \(\lambda \) yield different learning algorithms.

In the standard SVM setting, the similarity function *S* is always chosen to be positive semidefinite, which implies that there exists a *feature mapping* \(\phi :\mathcal {X}\rightarrow \mathbb {H}\) for some Hilbert space \(\mathbb {H}\) endowed with an inner product \(\langle \cdot ,\,\cdot \rangle \) such that:

\[ S(x,\,x') = \langle \phi (x),\,\phi (x')\rangle \qquad \text {for all } x,\,x'\in \mathcal {X}. \]

The Hilbert space \(\mathbb {H}\) is called the *feature space* (Mohri et al. 2012). The existence of a feature space for a similarity function *S* is equivalent to the statement that *S* is positive semidefinite (Mohri et al. 2012). In such a case, *S* is often referred to as a *kernel*.

If the similarity function *S* is positive semidefinite (PSD), then similarity in the instance space \(\mathcal {X}\) can be interpreted differently in the feature space \(\mathbb {H}\). Specifically, instead of performing a similarity-based classification in the instance space \(\mathcal {X}\), one can seek a separating hyperplane with a large functional margin in the feature space \(\mathbb {H}\). This is precisely the approach employed by SVM. The *Representer Theorem* states that such approach yields a decision rule that is identical to the similarity-based classification rule in (2) (Schölkopf and Smola 2002; Mohri et al. 2012). Hence, SVM is, indeed, a similarity-based classification algorithm.

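The 1-norm SVM method applies to decision rules of the form \(f(x) = \lambda _0 + \sum _{j\ge 1} \lambda _j\, h_j(x)\). In *f*, the basis functions \(h_j:\mathcal {X}\rightarrow \mathbb {R}\) are fixed and the only variables to be optimized are \(\lambda _j\) for \(j\ge 0\). The weights are chosen by minimizing the \(\ell _1\)-regularized hinge loss:

\[ \min _{\lambda }\;\; \sum _{j\ge 1} |\lambda _j| + C \sum _{i=1}^m \max \big (0,\; 1 - y_i\, f(x_i)\big ), \]

where *C* is a tradeoff parameter between regularization and fitting. We will provide several motivations behind this formulation shortly.

For the similarity-based rule in (2), the basis functions are \(h_j(x) = y_j\, S(x,\,x_j)\) with \(\lambda _j\ge 0\), and the problem can be written as the linear program (LP):

\[ \min _{\lambda _0,\,\lambda ,\,\xi }\;\; \mathbf 1^T\lambda + C\, \mathbf 1^T\xi \qquad \text {subject to} \qquad Q\lambda + \lambda _0\, y \ge \mathbf 1 - \xi , \quad \lambda \ge 0, \quad \xi \ge 0. \tag{4} \]

Here, \(y\) is the vector of class labels, \(\xi \in \mathbb {R}^m\) is the vector of slack variables for the *m* training examples and \(Q\in \mathbb {R}^{m\times m}\) is given by:

\[ Q_{ij} = y_i\, y_j\, S(x_i,\,x_j). \]

The LP in (4) remains convex even when *Q* is not PSD because both the objective function and inequality constraints are linear in the optimization variables \((\lambda ,\,\xi )\). Once the LP is solved, we predict the label of a new instance *x* using the earlier classification rule in (2).

The training examples with \(\lambda _i>0\) play a role that is similar to the *support vectors* in SVM, and we will refer to them as support vectors here as well. As depicted in Fig. 1, each support vector is ‘carefully’ placed in the plane to guard a region dominated by its respective class. In practice, because the regularization term in the objective function minimizes the \(\ell _1\) norm of \(\lambda \), the vector \(\lambda \) tends to be sparse and the number of support vectors tends to be small.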

### 3.2 Computational complexity

Before we begin to analyze 1-norm SVM, we make a final remark on its computational complexity. The 1-norm SVM method is a linear program (LP), for which many efficient solvers currently exist. These include, for instance, the Gurobi solver (Gurobi Optimization 2012), the CPLEX solver (IBM 2015), and MATLAB’s built-in linprog command. When using *interior-point* methods, rigorous bounds have been established for the number of computations required to solve a linear program (Boyd and Vandenberghe 2004). In general, it can be shown that the 1-norm SVM method requires \(O(m^3)\) computations in the worst case.
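To make the LP concrete, here is a minimal sketch that hands a 1-norm SVM to an off-the-shelf solver. It assumes SciPy's `linprog` (not one of the solvers named above) and assumes the LP takes the form: minimize \(\sum_i \lambda_i + C\sum_i \xi_i\) subject to \(y_i f(x_i)\ge 1-\xi_i\), \(\lambda_i\ge 0\), \(\xi_i\ge 0\), with a free bias \(\lambda_0\) and \(f(x)=\lambda_0+\sum_i \lambda_i y_i S(x,x_i)\). The function names and the toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def fit_one_norm_svm(S_train, y, C=10.0):
    """Solve the assumed 1-norm SVM LP; returns (lam0, lam, xi)."""
    y = np.asarray(y, dtype=float)
    m = y.size
    # Q_ij = y_i * y_j * S(x_i, x_j); Q need not be PSD.
    Q = (y[:, None] * y[None, :]) * np.asarray(S_train, dtype=float)
    # Variable order: [lam0, lam_1..lam_m, xi_1..xi_m]
    cost = np.concatenate(([0.0], np.ones(m), C * np.ones(m)))
    # lam0*y + Q lam + xi >= 1  is rewritten as  -(...) <= -1
    A_ub = -np.hstack([y[:, None], Q, np.eye(m)])
    b_ub = -np.ones(m)
    bounds = [(None, None)] + [(0.0, None)] * (2 * m)  # lam0 is free
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[0], res.x[1:m + 1], res.x[m + 1:]

def decision_function(S_new, y, lam0, lam):
    """f(x) = lam0 + sum_i lam_i y_i S(x, x_i); rows of S_new are new instances."""
    return lam0 + np.asarray(S_new, dtype=float) @ (lam * np.asarray(y, dtype=float))

# Toy 1-D problem with a sigmoid-type similarity, indefinite on this sample.
x_train = np.array([-2.0, -1.0, 1.0, 2.0])
y_train = np.array([-1.0, -1.0, 1.0, 1.0])
S_train = np.tanh(np.outer(x_train, x_train))

lam0, lam, xi = fit_one_norm_svm(S_train, y_train, C=10.0)
scores = decision_function(S_train, y_train, lam0, lam)
```

The same similarity function is used at training and prediction time, so training and test examples are treated consistently; support vectors are the indices with `lam > 0`.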

## 4 Analysis

In this section, we provide several motivations for using the linear program (LP) in (4) to classify with indefinite similarities. First, we show that 1-norm SVM can be interpreted as a method of producing a decision boundary with a large similarity margin in the instance space \(\mathcal {X}\). Using theoretical bounds expressed in terms of the margin, large-margin classification is proven to be less susceptible to over-fitting. On a related note, we show that 1-norm SVM is a large-margin classifier because it can also be interpreted as an \(\ell _1\)-regularized linear SVM applied to the empirical kernel maps.

Next, we relate 1-norm SVM to nearest neighbor classifiers, which reveals that 1-NN classification is a special case of the 1-norm SVM. Nearest neighbor classifiers are some of the most important classification algorithms in practice today with provable performance bounds. For example, it has long been established that the true risk (test error rate) of 1-NN is asymptotically bounded from above by twice the Bayes risk (Cover and Hart 1967).

After that, we show how the 1-norm SVM can be interpreted as an approximate implementation of the structural risk minimization (SRM) induction principle (Vapnik 1999). This can be deduced by establishing the connection between 1-norm SVM and two-layer neural networks, which, in turn, justifies the use of \(\ell _1\) regularization. Similarly, we show that the objective function in (4) can be interpreted as a method of minimizing an upper bound on expected prediction error rate using the leave-one-out (LOO) error estimation method.

### 4.1 Large-margin classification

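The first interpretation is that 1-norm SVM is a large-margin classifier in the instance space \(\mathcal {X}\). Suppose we measure how similar an instance *x* is to a class *y* by a weighted sum of its similarities to the training examples of class *y*. In other words, we may write:

\[ S_y(x) = \sum _{i:\; y_i=y} \lambda _i\, S(x,\,x_i), \]

which quantifies the similarity between an instance *x* and a class \(y\in \{+1,-1\}\). Here, the weight \(\lambda _i\ge 0\) represents the *importance* of the training instance \(x_i\) to its own class \(y_i\). In addition, we can introduce an offset \(\lambda _0\) that quantifies prior preference. This offset plays a role that is similar to the *prior* in Bayesian methods, the activation threshold in neural networks, and the offset in SVM. Thus, we consider classification using the rule:

\[ f(x) = \lambda _0 + S_{+1}(x) - S_{-1}(x) = \lambda _0 + \sum _{i=1}^m \lambda _i\, y_i\, S(x,\,x_i). \tag{5} \]

We define the *similarity margin* \(M_i\) for the training example \((x_i,\,y_i)\) in the usual sense:

\[ M_i = y_i\, f(x_i). \]

This coincides with the functional margin of SVM when *S* is positive semidefinite (PSD). In general, the *i*th training example \((x_i,\,y_i)\) is classified correctly if and only if its margin is positive (i.e. \(M_i>0\)).

Large-margin classification seeks weights for which the training margins are at least some \(M>0\). Because the margins scale with the weights, the norm of \(\lambda \) has to be controlled in order to maximize *M* in absolute terms. Here, any norm \(||\cdot ||\) suffices but the 1-norm is preferred because it produces sparse solutions and because it gives better accuracy in practice.

If we let \(R_\mathcal {D}(h)\) denote the true risk of a hypothesis *h*, i.e. \(R_\mathcal {D}(h) = \mathbb {P}_{(x,y)\sim \mathcal {D}}\,[y\,h(x)<0]\), then several bounds of the following form can be established (Schapire et al. 1998; Mohri et al. 2012):

\[ R_\mathcal {D}(h) \;\le \; \hat{R}_M(h) + \Omega \big (M,\, m,\, C(\mathcal {H})\big ). \tag{6} \]

Here, the first term \(\hat{R}_M(h)\) is the fraction of training examples whose margin is less than *M*, while the second term is a generalization risk that depends on the margin *M*, the number of training examples *m*, and on some appropriate measure of complexity of the hypothesis class \(C(\mathcal {H})\). Most importantly, such results hold uniformly for all \(M\in (0,1)\). Because maximizing the margin on the training set for a fixed *m* and \(\mathcal {H}\) reduces *both* terms in the right-hand side simultaneously, large-margin classification, such as by using the 1-norm SVM, tends to perform well in practice.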

### 4.2 Empirical Kernel maps

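The 1-norm SVM method can also be interpreted as a linear SVM applied to the *empirical kernel map*. Given a training set \(\{(x_i,\,y_i)\}_{i=1,\ldots ,m}\), one can introduce the new mapping:

\[ \phi (x) = \big (y_1\, S(x,\,x_1),\; y_2\, S(x,\,x_2),\; \ldots ,\; y_m\, S(x,\,x_m)\big )^T \in \mathbb {R}^m. \]

This mapping is identical to the *empirical kernel map* (Schölkopf and Smola 2002) except for the presence of class labels. Essentially, such a mapping turns similarities into features (Chen et al. 2009a; Pekalska et al. 2001).

A regularized linear SVM applied to \(\phi \) solves:

\[ \min _{w,\,b}\;\; ||w|| + C \sum _{i=1}^m \max \big (0,\; 1 - y_i\,(w^T\phi (x_i) + b)\big ). \tag{7} \]

Here, *b* corresponds in our earlier notation to the bias term \(\lambda _0\) while \(w_i = \lambda _i\) for all \(i\ge 1\).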

With \(\ell _1\) regularization, the optimization problem in (7) becomes nearly identical to that of 1-norm SVM in (4). The only (minor) difference is the non-negativity constraint \(\lambda _i\ge 0\) for \(i\ge 1\), which we imposed in 1-norm SVM so that it would also behave like a nearest neighbor classification algorithm, as will be discussed later. Consequently, 1-norm SVM can be interpreted as a method of producing a separating hyperplane with a large functional margin applied to the empirical kernel map \(\phi \). Again, although it is possible to use any norm in the regularization term of the objective function in (7), such as \(\ell _2\) regularization, \(\ell _1\) regularization is preferred because it produces a sparse solution so that only a small subset of the training set is needed during prediction.

Applying linear classification to the empirical kernel map can be justified rigorously. In Balcan et al. (2008), it is shown that linear classification via the empirical kernel map, which is identical to our interpretation of the 1-norm SVM method discussed above, is guaranteed to have a small true risk (prediction error rate) as long as the similarity function is reasonable. Here, “reasonable” means that a high similarity exists between objects of the same class and a low similarity exists between objects of different classes. This guarantee on performance holds regardless of whether or not the similarity function is PSD (Balcan et al. 2008). Therefore, we expect 1-norm SVM to perform quite well for indefinite similarities.
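As a small illustration of how similarities become features, the sketch below builds a label-signed empirical kernel map and evaluates a linear rule on it. It assumes the map has one coordinate \(y_i\,S(x,x_i)\) per training example; the function name, the toy similarity, and the weights `w` and `b` are hypothetical values chosen only for the example.

```python
import numpy as np

def empirical_kernel_map(x, X_train, y_train, similarity):
    """Map x to the vector (y_1 S(x, x_1), ..., y_m S(x, x_m))."""
    return np.array([y_i * similarity(x, x_i)
                     for x_i, y_i in zip(X_train, y_train)])

# Toy similarity on scalars; no positive semidefiniteness is required here.
def sigmoid_similarity(a, b):
    return float(np.tanh(a * b))

X_train = [-1.0, 2.0]
y_train = [-1.0, 1.0]

phi = empirical_kernel_map(1.0, X_train, y_train, sigmoid_similarity)

# A linear rule on the mapped features, with illustrative weights:
# w plays the role of (lambda_1, ..., lambda_m) and b the role of lambda_0.
w = np.array([0.0, 1.5])
b = 0.0
score = float(w @ phi + b)
```

Because the map only evaluates the similarity against the training instances, any off-the-shelf linear classifier can be trained on the mapped features regardless of whether the similarity is PSD.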

### 4.3 Nearest neighbor classification

Nearest neighbor classifiers are some of the most important data mining algorithms in practice today (Wu 2008), and theoretical results for such algorithms have long been established. For instance, one well-known result shows that the true risk of 1-NN is asymptotically bounded from above by twice the Bayes risk (Cover and Hart 1967). Here, we show that 1-NN classification is a special case of the 1-norm SVM method.

### **Lemma 1**

Given a semi-metric \(d:\,\mathcal {X}\times \mathcal {X}\rightarrow [0,\infty )\), let \(S(x_i,\, x_j)=\psi (\gamma \cdot d(x_i,\,x_j)):\,\mathcal {X}\times \mathcal {X}\rightarrow (0,1]\) for some bandwidth \(\gamma >0\) be a radial monotone decreasing function of distance that satisfies \(\psi (0)=1\) and the property: \(\displaystyle z_1>z_2 \Rightarrow \lim _{\gamma \rightarrow \infty } \frac{\psi (\gamma \,z_1)}{\psi (\gamma z_2)} = 0 \). Also, let \(C=1+\epsilon \;\) for some \(\epsilon >0\) and fix \(\lambda _0=0\) (i.e. with no bias term). Then, the behavior of 1-norm SVM can be made arbitrarily close to 1-NN using a sufficiently large bandwidth \(\gamma \rightarrow \infty \).

### *Proof*

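Because *d* is a semi-metric on the instance space \(\mathcal {X}\), we have \(d(x_i,\,x_i) = 0\) and \(S(x_i,x_i)=1\). By setting \(\lambda _0=0\), the margin \(M_i\) for the training example \((x_i,\,y_i)\) reduces to:

\[ M_i = y_i\, f(x_i) = \lambda _i + \sum _{j\ne i} \lambda _j\, y_i\, y_j\, S(x_i,\,x_j). \]

For distinct training instances, \(S(x_i,\,x_j)\rightarrow 0\) as \(\gamma \rightarrow \infty \) whenever \(j\ne i\). Because *m* is assumed to be finite, \(M_i\rightarrow \lambda _i\) as \(\gamma \rightarrow \infty \). Thus, the 1-norm SVM optimization problem can be made arbitrarily close to the following LP:

\[ \min _{\lambda ,\,\xi }\;\; \sum _{i=1}^m \lambda _i + C \sum _{i=1}^m \xi _i \qquad \text {subject to} \qquad \lambda _i \ge 1-\xi _i, \quad \lambda _i\ge 0, \quad \xi _i\ge 0, \quad i=1,\ldots ,m, \]

whose optimal solution for \(C=1+\epsilon >1\) is \(\lambda _i=1\) and \(\xi _i=0\) for all *i*.

Now consider a new instance *x*, and let \(x_j\) be its nearest neighbor in the training set with respect to the semi-metric *d*. Then:

\[ f(x) = \sum _{i=1}^m y_i\, S(x,\,x_i) = S(x,\,x_j)\,\Big [\, y_j + \sum _{i\ne j} y_i\, \frac{S(x,\,x_i)}{S(x,\,x_j)} \Big ], \]

where every ratio in the sum vanishes as \(\gamma \rightarrow \infty \) provided the nearest neighbor is unique, so that \(\mathbf{sign}(f(x)) = y_j\) for a sufficiently large bandwidth.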

Aside from the extreme case in which 1-norm SVM reduces to 1-NN classification, the 1-norm SVM method can, in general, be interpreted as a *weighted* nearest neighbor classification algorithm as depicted in Fig. 1. In this figure, every support vector in 1-norm SVM exerts some influence in its vicinity in the instance space \(\mathcal {X}\), where the “amount” of influence is determined by its weight \(\lambda _i\). For a new instance *x*, the prediction rule in (2) becomes a weighted nearest neighbor rule, where the weight of a neighbor \(x_i\) is determined by the product of its influence \(\lambda _i\) and its similarity \(S(x,x_i)\) to the new instance *x*. Such interpretation of 1-norm SVM holds because \(\lambda _i\ge 0\) for all \(i\ge 1\). Later in Sect. 5, we will show that 1-norm SVM remains superior in its predictive accuracy over the *k*-NN classifier, despite the apparent similarity between the two methods.
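The limiting behavior in Lemma 1 can be observed numerically. The sketch below assumes unit weights \(\lambda_i=1\) and no bias (\(\lambda_0=0\)), takes \(\psi(z)=e^{-z}\) with the Euclidean distance (which satisfies the ratio condition of the lemma), and uses hypothetical toy data. With a small bandwidth the vote is dominated by the majority class, whereas with a large bandwidth it agrees with the nearest neighbor.

```python
import numpy as np

def similarity_vote(x, X_train, y_train, gamma):
    """f(x) = sum_i y_i * exp(-gamma * d(x, x_i)) with unit weights, no bias."""
    d = np.linalg.norm(X_train - x, axis=1)
    return float(np.sum(y_train * np.exp(-gamma * d)))

def one_nn_label(x, X_train, y_train):
    """Label of the training instance closest to x."""
    d = np.linalg.norm(X_train - x, axis=1)
    return float(y_train[np.argmin(d)])

# One positive example and three negative examples clustered together.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [1.2, 0.3], [0.9, -0.4]])
y_train = np.array([1.0, -1.0, -1.0, -1.0])
x_new = np.array([0.4, 0.0])   # closest to the positive example

vote_small_gamma = similarity_vote(x_new, X_train, y_train, gamma=0.1)
vote_large_gamma = similarity_vote(x_new, X_train, y_train, gamma=50.0)
nn_label = one_nn_label(x_new, X_train, y_train)
```

In floating point, an extremely large bandwidth makes every similarity underflow to zero, so the bandwidth here is kept moderate relative to the distances in the data.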

### 4.4 Neural networks

In addition to SVM and nearest neighbor classifiers, the 1-norm SVM method is closely connected to neural networks as well. This can be observed in the decision rule in (2) or (5), which can be interpreted as a neural network with one hidden layer and one output node as depicted in Fig. 2. The similarity functions \(S_j=S(\cdot , x_j)\) form the activation functions in the hidden nodes, whereas the bias term \(\lambda _0\) is the activation threshold at the output node. When similarity functions are radial, i.e. functions of distance, such neural networks are commonly referred to as *radial basis function* (RBF) networks.

The connection with neural networks leads to two key observations, which address the risks of *under-fitting* and *over-fitting* respectively. The first observation shows that 1-norm SVM is not susceptible to under-fitting because its decision rule has the capacity to approximate any function arbitrarily well if the similarity function is radial. The second observation shows that 1-norm SVM is not susceptible to over-fitting because its regularization term minimizes the “size” of the weights at the output layer of the neural network.

#### 4.4.1 Universal function approximation

The first key observation is related to function approximation. Ideally, we would like the decision rule in the neural network in Fig. 2 to be as close as possible to the optimal Bayes rule. In principle, therefore, classification is reduced to function approximation. It has been established that if the instance space is the Euclidean space \(\mathbb {R}^n\) and the similarity function is radial, then the RBF neural network of Fig. 2 with a fixed bandwidth is capable of *universal function approximation* under mild conditions (Park and Sandberg 1991). Consequently, if the training set is sufficiently large, such that 1-norm SVM can freely choose its support vectors in the instance space, then 1-norm SVM has the capacity to approximate the optimal decision rule arbitrarily well. Hence, its risk of under-fitting is limited.

#### 4.4.2 Size of the weights

*m* training examples [Theorem 1 in Bartlett (1997)]:

*C*(*M*) is a function of *M*. A similar bound can be obtained that holds uniformly for all \(0<M<1\) (Bartlett 1997). Contrasting the latter bound with the earlier margin-based bound in (6) suggests that minimizing \(||\lambda ||_1\) in the 1-norm SVM method plays a role that is similar to minimizing the complexity of the hypothesis space. A similar conclusion can be inferred more directly using Lagrange duality.

Because the activation functions at the hidden nodes in 1-norm SVM are not fixed, since they depend on the random choice of training examples, the bound in (9) does not hold for 1-norm SVM. Nevertheless, it provides an informal justification for the use of 1-norm SVM with indefinite kernels since it compares favorably with the objective function given in (4). In particular, the first term is the \(\ell _1\) regularization term, which is identical in both expressions. Moreover, the second term in (9) is related to the hinge loss, which is the second term of the objective function in 1-norm SVM. In 1-norm SVM, both terms are minimized.

### 4.5 Structural risk minimization

The connection between 1-norm SVM and neural networks reveals that 1-norm SVM can be interpreted as a method of striking a balance between under-fitting and over-fitting. A similar conclusion can be established using the *leave-one-out* (LOO) error estimation method. To do this, we begin with the following lemma.

### **Lemma 2**

*m* training examples, which is used to train the 1-norm SVM. Let \(\lambda ^\star \) and \(\xi ^\star \) be the optimal solutions of the LP in (4). Also, let \(e_{LOO}\) be the expected leave-one-out validation error rate on the same training set. Then:

*z*.

### *Proof*

*i*th training example was classified correctly and it will continue to be classified correctly if it is the only example removed from the training set. The latter statement holds because removing the *i*th example from the training set is equivalent to adding the new constraint \(\lambda _i=\xi _i=0\) to the LP formulation (4), and these are already the original optimal values of \(\lambda ^\star _i\) and \(\xi ^\star _i\). Because the new feasible region is a subset of the original feasible region and it contains the original optimal solution, the optimal solution remains unchanged. Hence:

### **Theorem 1**

*S*. Let \(R_\mathcal {D}(h_{S})\) be the true risk (prediction error rate) of the hypothesis \(h_{S}\). Then:

*m*.

### *Proof*

The tradeoff in Eq. (11) is analogous to the classical tradeoff in estimation between bias and variance (Hastie et al. 2001). On one hand, one can fit the training set perfectly, e.g. by using a radial similarity function with a sufficiently large bandwidth that effectively turns 1-norm SVM into a 1-NN classifier, but the number of support vectors is then at its largest, hence high variance. On the other hand, one can choose a very small number of support vectors, but this tends to increase the empirical risk (training error rate), hence high bias. In the 1-norm SVM formulation in (4), the cost function penalizes both training error (bias) and the number of support vectors (variance) simultaneously by penalizing the \(||\cdot ||_1\) norm of the slack variables \(\xi \) and the weights \(\lambda \). Because minimizing \(||\cdot ||_1\) promotes *sparsity* (Boyd and Vandenberghe 2004), Corollary 1 states that 1-norm SVM can be interpreted as a method of minimizing the expected true risk. Hence, it is an approximate implementation of the structural risk minimization (SRM) induction principle (Vapnik 1999).

## 5 Experiments and results

In the previous section, we described theoretically why 1-norm SVM was a viable tool for classification with indefinite similarities. In this section, we present experimental results of applying 1-norm SVM to synthetic and real-world classification problems, which demonstrate its effectiveness in handling indefinite similarity functions.^{6}

### 5.1 Synthetic datasets

*m*, with the Bayes risk for each classification problem indicated in the legend bar. As shown in Fig. 4, the test error rate approaches the optimal Bayes risk for sufficiently large training sets in all six classification problems. This test verifies that 1-norm SVM is capable of producing accurate decision boundaries for various complex mixtures of classes, which is consistent with the universal approximation property of the RBF similarity function discussed earlier in Sect. 4.4.

### 5.2 Real datasets

For real datasets, we compared the performance of 1-norm SVM against popular classification algorithms for both PSD and non-PSD similarity functions. We will first describe the datasets and test methodology, and discuss the test results afterwards.

#### 5.2.1 Datasets

- (A) **IMDB**: This is a graph-based dataset that contains movies released between 1996 and 2001 (Macskassy and Provost 2007). The binary class label identifies whether the opening weekend box-office receipts exceeded $2 million or not. An edge weight between two movies is the number of common production companies, actors, producers, or directors. In our implementation, all edge weights were normalized to fall in the range \([0,\,1]\), and the following similarity functions were used:
  - (a) *PSD*: The Jaccard index \(S_{i,j} = \frac{\sum _k\,\min \{w_{i,k},\,w_{j,k}\}}{\sum _k\,\max \{w_{i,k},\,w_{j,k}\}}\), where \(w_{i,k}\) is the edge weight.
  - (b) *Non-PSD*: Edge weight \(S_{i,j} = w_{i,j}\) and \(S_{i,i}=1\).
- (B) **Word-Sim-353**: This dataset contains human-judged similarities between English words (Finkelstein et al. 2002). All similarities are again normalized to fall in the range \([0,\,1]\) and self-similarity is set to unity. We grouped words into two categories: ‘living’ versus ‘non-living’, and used the two similarity functions specified earlier for the IMDB dataset. Examples of the ‘living’ class include *children*, *Maradona*, *brother*, *carnivore*, and *mammal*.
- (C) **Caltech-101**: This dataset contains images of various objects (Fei-Fei et al. 2004). We grouped images of ‘Big Cats’, ‘Winged Insects’, and ‘Flowers’ into three classes and trained three separate binary classifiers between every pair of classes. Each image was converted into a histogram using the two MATLAB commands rgb2gray and imhist, and Laplace normalization was used. This effectively represents the *i*th image by a probability distribution \(p_i\). We, then, used the following two similarity functions:
  - (a) *PSD*: The intersection (a.k.a. overlapping coefficient), which is given by \(S_{i,j} = \sum _k\,\min \{p_{i,k},\,p_{j,k}\}\).
  - (b) *Non-PSD*: We used \(S_{i,j} = \max \{0,\,1-0.1\times D(p_i\,||p_j)\}\), where \(D(p_i || p_j)\) is the symmetrized Kullback–Leibler divergence.^{7}
- (D) **Splice**: This is a biological sequence classification dataset (Noordewier et al. 1991) that was downloaded from the UCI repository (Lichman 2013). Each example is a 60-letter DNA sequence. We performed classification between the two classes EI and IE. The similarity functions used were:
  - (a) *PSD*: We used the implementation of string subsequence kernels given in Soman et al. (2009). Because string kernels can grow quite rapidly, we normalized using the cosine similarity: \(S_{i,j} = \frac{K_{i,j}}{\sqrt{K_{i,i}\cdot K_{j,j}}}\).
  - (b) *Non-PSD*: We used the longest-common-subsequence (LCS) between two strings. Because each string is 60 letters in length, we set \(S_{i,j} = LCS(x_i,\,x_j)/60\).
- (E) **CNAE-9**: This is a text classification dataset available at the UCI repository, where each text is represented using a bag of words. The dataset contains nine classes and we randomly selected five binary classification problems: 1-versus-5, 5-versus-4, 6-versus-8, 3-versus-9, and 2-versus-7.^{8} These are represented by P1 through P5 in Table 2 respectively. The two similarity functions are:
  - (a) *PSD*: The cosine similarity \(S_{i,j} = \frac{x_i^T\,x_j}{||x_i||\cdot ||x_j||}\), which is commonly used for text classification tasks (Chen et al. 2009a).
  - (b) *Non-PSD*: The second similarity function used is a variant to the first. Specifically, we have \(S_{i,j} = \frac{v^T\,v}{||x_i||\cdot ||x_j||}\), where \(v_k=\min \{x_{i,k},\,x_{j,k}\}\).
- (F) **Ionosphere, Australian, Breast Cancer, Haberman, and Diabetes**: These are five binary classification problems with numeric features available at the UCI repository. We used the following similarity functions:
  - (a) *PSD*: The RBF kernel \(S_{i,j} = e^{-\gamma \,||x_i-x_j||_2^2}\), which is considered the default similarity function for numeric attributes in popular SVM packages such as LIBSVM (Chang and Lin 2001).
  - (b) *Non-PSD*: The sigmoid kernel \(S_{i,j} = \tanh \{\gamma \cdot \,x_i^T\,x_j+r \}\), which is popular due to its origins in neural networks. To ensure that the kernel matrix is not PSD, we fixed \(r=-1\).^{9}
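Several of the similarity functions listed above are simple enough to sketch directly. The following Python snippet is an illustration only and is not the MATLAB code used in the experiments: the function names are hypothetical, and `min_eigenvalue` is a helper added here to check numerically whether a given similarity matrix is PSD.

```python
import numpy as np


def jaccard_similarity(W):
    """Weighted Jaccard index between the rows of a nonnegative weight matrix W."""
    W = np.asarray(W, dtype=float)
    mins = np.minimum(W[:, None, :], W[None, :, :]).sum(axis=2)
    maxs = np.maximum(W[:, None, :], W[None, :, :]).sum(axis=2)
    # Pairs whose weights are all zero get similarity 0 (a choice made for this sketch).
    return np.divide(mins, maxs, out=np.zeros_like(mins), where=maxs > 0)


def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ch == bj else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_similarity(a, b, length=60):
    """S = LCS(a, b) / length, with length = 60 as for the Splice sequences."""
    return lcs_length(a, b) / length


def min_cosine_variant(x_i, x_j):
    """Variant of cosine similarity with v_k = min(x_ik, x_jk) in the numerator."""
    v = np.minimum(x_i, x_j)
    return float(v @ v) / (np.linalg.norm(x_i) * np.linalg.norm(x_j))


def sigmoid_similarity(X, gamma, r=-1.0):
    """Sigmoid kernel tanh(gamma * <x_i, x_j> + r) on the rows of X."""
    X = np.asarray(X, dtype=float)
    return np.tanh(gamma * X @ X.T + r)


def min_eigenvalue(S):
    """Smallest eigenvalue of the symmetrized matrix; a negative value means S is not PSD."""
    S = np.asarray(S, dtype=float)
    return float(np.linalg.eigvalsh((S + S.T) / 2.0).min())
```

For example, `min_eigenvalue(sigmoid_similarity(X, gamma=1.0))` is negative whenever some row of `X` is the zero vector, because the corresponding diagonal entry is \(\tanh (-1)<0\).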

#### 5.2.2 Test methodology and results

When the similarity function is PSD, we compared the performance of 1-norm SVM with that of the standard form of SVM in (1). For each dataset, the value of the tradeoff constant *C* was selected using fivefold cross validation for \(C\in \{2,\,4,\,8,\,16,\,32\}\). When the RBF kernel is used, the bandwidth \(\gamma \) is also selected using fivefold cross validation in the grid \(\gamma \in \{2^{-15},\,2^{-14},\,\ldots , 2^{-1},\,1\}\). SVM was implemented using the LIBSVM library (Chang and Lin 2001), whereas 1-norm SVM was implemented using the Gurobi solver (Gurobi Optimization 2012). In all classification problems, we reported the average test error rate of five random training-to-test splits, with a training-to-test ratio of 4:1. The same split is always used in both SVM and 1-norm SVM.
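To make the 1-norm SVM training step concrete, the following is a minimal sketch of a linear program of this kind, written with SciPy's `linprog` as a stand-in for the Gurobi solver. It assumes a commonly used form of the problem, namely minimizing \(||\lambda ||_1 + C\sum _i \xi _i\) subject to \(y_i(\sum _j \lambda _j S_{i,j} + b)\ge 1-\xi _i\) and \(\xi _i\ge 0\); the exact LP in (4) is not reproduced in this section and may differ in details such as sign constraints on \(\lambda \). The function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def train_one_norm_svm(S, y, C=1.0):
    """Solve an assumed 1-norm SVM linear program on a similarity matrix.

    S : (m, m) array with S[i, j] = similarity(x_i, x_j); it may be indefinite.
    y : (m,) array of labels in {-1, +1}.
    Returns the weights lam (one per training example) and the bias b.
    """
    S = np.asarray(S, dtype=float)
    y = np.asarray(y, dtype=float)
    m = y.shape[0]
    Y = y[:, None]
    # Variable order: [lam_plus (m), lam_minus (m), b_plus, b_minus, xi (m)], all >= 0.
    # Splitting lam and b into positive and negative parts makes |lam| linear.
    cost = np.concatenate([np.ones(2 * m), np.zeros(2), C * np.ones(m)])
    # Margin constraints y_i * (S_i . lam + b) + xi_i >= 1, rewritten as A_ub @ z <= -1.
    A_ub = np.hstack([-Y * S, Y * S, -Y, Y, -np.eye(m)])
    b_ub = -np.ones(m)
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
    if not res.success:
        raise RuntimeError(res.message)
    lam = res.x[:m] - res.x[m:2 * m]
    b = res.x[2 * m] - res.x[2 * m + 1]
    return lam, b


def predict(S_test, lam, b):
    """Classify examples given their similarities to the m training examples."""
    scores = np.asarray(S_test, dtype=float) @ lam + b
    return np.where(scores >= 0, 1.0, -1.0)
```

Note that only the similarity matrix enters the program, so nothing in this sketch requires it to be PSD. Training examples whose entry in `lam` is zero play no role at prediction time.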

1. *Non-convex optimization*: This was implemented using the LIBSVM library with its -t 4 option. When the similarity matrix is non-PSD, the LIBSVM package seeks a stationary point using non-convex optimization (Lin and Lin 2003).
2. *Kernel approximation*: PSD kernel approximation was tested using the three methods discussed earlier in Sect. 2: (1) the *denoise* method, (2) the *flip* method, and (3) the indefinite SVM proposed by Luss and d’Aspremont (2009). The *denoise* and *flip* methods were implemented by supplying the modified (PSD) kernel matrix to LIBSVM using the -t 4 option. The indefinite SVM method was tested using the implementation available for download at the authors’ website.
3. *SVM in Krein spaces*: SVM in Krein spaces was implemented using the ESVM algorithm described in Loosli et al. (2013). ESVM comprises two main steps: (1) eigen-decomposition, and (2) SVM training. LIBSVM was used for the SVM training step.
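As an illustration of the *denoise* and *flip* approximations named in the list above, the sketch below assumes their usual spectral definitions: *denoise* clips negative eigenvalues to zero, and *flip* replaces each eigenvalue by its absolute value. Section 2 is not reproduced here, so these definitions and the function name `spectrum_transform` are assumptions of this example.

```python
import numpy as np


def spectrum_transform(S, mode):
    """Return a PSD approximation of a symmetric similarity matrix S.

    mode="denoise": negative eigenvalues are set to zero (assumed definition).
    mode="flip":    eigenvalues are replaced by their absolute values (assumed definition).
    """
    S = np.asarray(S, dtype=float)
    S = (S + S.T) / 2.0  # enforce symmetry before the eigen-decomposition
    eigvals, eigvecs = np.linalg.eigh(S)
    if mode == "denoise":
        new_vals = np.maximum(eigvals, 0.0)
    elif mode == "flip":
        new_vals = np.abs(eigvals)
    else:
        raise ValueError("mode must be 'denoise' or 'flip'")
    # Rebuild V diag(new_vals) V^T; broadcasting scales column j of V by new_vals[j].
    return (eigvecs * new_vals) @ eigvecs.T
```

The resulting matrix can then be passed to a kernel SVM as a precomputed kernel. Because the transformation is computed from the training matrix alone, similarities involving test examples are not modified in the same way unless extra steps are taken.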

*k*-NN to serve as a benchmark for similarity-based classification.^{10} In all methods, hyper-parameters were selected using cross validation and grid search, implemented separately for each individual method. Test results for PSD and non-PSD similarity functions are shown in Tables 1 and 2 respectively. Because the datasets are balanced, we used classification error rate as a measure of performance. All results reported here are based on the best selected hyper-parameters of these methods.

Table 1 Average test error rate results on 16 datasets using the positive semidefinite (PSD) similarity functions described in Sect. 5.2

Dataset (*m*) | 1-norm SVM (%) | SVM (%) | *k*-NN (%) |
---|---|---|---|
IMDB (1441) | 16.0 | | 21.2 |
Word-Sim-353 (437) | | 13.7 | 13.3 |
Caltech-101-P2 (368) | 26.0 | | 40.0 |
Caltech-101-P3 (379) | 19.8 | | 33.1 |
Caltech-101-P1 (387) | | | 38.7 |
Splice (1527) | 6.56 | | 10.3 |
CNAE-9-P1 (240) | 0.56 | | 0.42 |
CNAE-9-P5 (240) | 2.05 | | 1.67 |
CNAE-9-P3 (240) | 1.67 | | 2.50 |
CNAE-9-P2 (240) | | 1.22 | 1.25 |
CNAE-9-P4 (240) | 1.89 | | 1.67 |
Ionosphere (351) | 7.14 | | 14.6 |
Australian (690) | 16.6 | 16.9 | |
Breast Cancer (699) | | 3.51 | 4.75 |
Haberman (398) | 30.2 | 31.2 | |
Diabetes (768) | 28.9 | 27.9 | |

Table 2 Average test error rate results on 16 datasets using the indefinite (non-PSD) similarity functions described in Sect. 5.2

Dataset (*m*) | \(\beta ^{(1)}\) | 1-norm SVM (%) | *k*-NN (%) | NCO: SVM (%) | Kernel approximation: DEN (%) | Kernel approximation: FLIP (%) | Kernel approximation: ISVM (%) | Krein space: ESVM (%) |
---|---|---|---|---|---|---|---|---|
IMDB (1441) | 0.01 | 18.8 | 26.1 | 17.7 | | | 18.1 | 18.8 |
Word-Sim-353 (437) | 0.03 | | 14.9 | 15.6 | 15.2 | 15.4 | | 15.8 |
Caltech-P2 (368) | 0.05 | 22.1 | 27.7 | 44.0 | 62.0 | 41.6 | 37.5 | |
Caltech-P3 (379) | 0.08 | | 26.7 | 40.7 | 54.5 | 36.5 | 32.0 | 24.1 |
Caltech-P1 (387) | 0.09 | | 35.6 | 40.9 | 50.8 | 39.5 | 38.0 | 31.7 |
Splice (1527) | 0.10 | 5.70 | 7.49 | 5.74 | 5.97 | 6.95 | 5.64 | |
CNAE-9-P1 (240) | 0.32 | | 8.30 | 13.2 | 6.17 | 0.17 | | |
CNAE-9-P5 (240) | 0.33 | 3.72 | 6.25 | 10.9 | 6.61 | 1.94 | | 3.50 |
CNAE-9-P3 (240) | 0.34 | | 24.6 | 25.3 | 25.4 | 8.67 | 3.75 | 3.33 |
CNAE-9-P2 (240) | 0.34 | | 22.5 | 21.6 | 20.7 | 5.78 | 2.50 | 1.67 |
CNAE-9-P4 (240) | 0.34 | 2.56 | 4.17 | 16.3 | 9.67 | 2.17 | | 1.67 |
Ionosphere (351) | 1.0 | | 18.6 | 37.7 | 70.3 | 66.0 | 36.0 | |
Australian (690) | 1.0 | | 18.4 | 40.7 | 40.7 | 91.9 | 47.7 | 15.1 |
Breast Cancer (699) | 1.0 | 4.32 | | 32.9 | 32.9 | 96.4 | 30.1 | 5.76 |
Haberman (398) | 1.0 | 26.6 | | 26.6 | 40.1 | 27.2 | *\(^{(2)}\) | |
Diabetes (768) | 1.0 | | 33.2 | 33.3 | 70.3 | 54.2 | 35.8 | 25.9 |

### 5.3 Discussion

In this section, we review the test results when using PSD and non-PSD similarity functions for the 16 real datasets described earlier.

#### 5.3.1 Positive semidefinite similarities

We begin our discussion by looking into the test results for positive semidefinite (PSD) similarities. As shown in columns 2 and 3 of Table 1, when the similarity function is PSD, the performance of 1-norm SVM is comparable to that of SVM for all 16 datasets. When running statistical significance tests, we find no statistically significant evidence that one method outperforms the other at the 95 % confidence level. For example, the two-tailed Wilcoxon’s signed rank test (Demšar 2006) gives a value of \(p=0.155\). By contrast, both algorithms tend to outperform *k*-NN in classification accuracy. This verifies that 1-norm SVM is a viable algorithm for binary classification even when the similarity function is positive semidefinite (PSD). Such experimental evidence agrees with earlier conclusions (Zhu et al. 2004).

#### 5.3.2 Indefinite similarities

In contrast to the previous case, the use of indefinite similarity functions presents an entirely different picture. When comparing the test error rate of 1-norm SVM (shown in column 3 of Table 2) with the other methods (in columns 4-9), we find that 1-norm SVM and ESVM (i.e. learning in Krein spaces) outperform all other methods significantly in nearly all the datasets. Performance of ESVM, however, is very similar to that of 1-norm SVM, which is quite intriguing given the very different approaches employed by the two algorithms.

Table 3 The number of support vectors (SVs) used by 1-norm SVM and ESVM for the 16 classification problems with indefinite similarity functions

Datasets | No. of training examples | No. of SVs in 1-norm SVM | No. of SVs in ESVM |
---|---|---|---|
IMDB | 1441 | 630 | 1438 |
Word-Sim-353 | 437 | 26 | 436 |
Caltech-101-P2 | 368 | 39 | 386 |
Caltech-101-P3 | 379 | 47 | 379 |
Caltech-101-P1 | 387 | 35 | 386 |
Splice | 1527 | 224 | 1527 |
CNAE-9-P1 | 240 | 2 | 233 |
CNAE-9-P5 | 240 | 21 | 240 |
CNAE-9-P3 | 240 | 23 | 238 |
CNAE-9-P2 | 240 | 2 | 235 |
CNAE-9-P4 | 240 | 23 | 240 |
Ionosphere | 351 | 39 | 351 |
Australian | 690 | 7 | 690 |
Breast Cancer | 699 | 20 | 699 |
Haberman | 398 | 28 | 398 |
Diabetes | 768 | 13 | 768 |
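The sparsity gap in the table above can be summarized with a few lines of arithmetic. The snippet below is only an illustration: it copies the counts as printed and computes, for each dataset, the fraction of training examples kept as support vectors by each method and the ratio between the two SV counts. The names `TABLE_3` and `sparsity_summary` are hypothetical.

```python
# (dataset, training examples, SVs in 1-norm SVM, SVs in ESVM), copied as printed.
# Note: the Caltech-101-P2 row lists more ESVM SVs (386) than training examples (368);
# the value is kept as it appears in the table.
TABLE_3 = [
    ("IMDB", 1441, 630, 1438),
    ("Word-Sim-353", 437, 26, 436),
    ("Caltech-101-P2", 368, 39, 386),
    ("Caltech-101-P3", 379, 47, 379),
    ("Caltech-101-P1", 387, 35, 386),
    ("Splice", 1527, 224, 1527),
    ("CNAE-9-P1", 240, 2, 233),
    ("CNAE-9-P5", 240, 21, 240),
    ("CNAE-9-P3", 240, 23, 238),
    ("CNAE-9-P2", 240, 2, 235),
    ("CNAE-9-P4", 240, 23, 240),
    ("Ionosphere", 351, 39, 351),
    ("Australian", 690, 7, 690),
    ("Breast Cancer", 699, 20, 699),
    ("Haberman", 398, 28, 398),
    ("Diabetes", 768, 13, 768),
]


def sparsity_summary(rows):
    """Per-dataset SV fractions and the ESVM / 1-norm SVM ratio of SV counts."""
    summary = {}
    for name, m, sv_l1, sv_esvm in rows:
        summary[name] = {
            "frac_l1": sv_l1 / m,        # share of training examples kept by 1-norm SVM
            "frac_esvm": sv_esvm / m,    # share of training examples kept by ESVM
            "ratio": sv_esvm / sv_l1,    # how many times more SVs ESVM uses
        }
    return summary


summary = sparsity_summary(TABLE_3)
largest_gap = max(summary, key=lambda name: summary[name]["ratio"])
smallest_gap = min(summary, key=lambda name: summary[name]["ratio"])
```

With these counts, the largest gap occurs on CNAE-9-P2 (235 versus 2 support vectors) and the smallest on IMDB (1438 versus 630).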

In order to verify statistical significance at the 95 % confidence level, we used Holm’s step-down procedure for multiple comparisons applied to the two-tailed Wilcoxon’s signed rank test (Demšar 2006; Holm 1979). More specifically, each null hypothesis \(H_i\) asserts that 1-norm SVM and the *i*th alternative classifier have similar performance. When \(H_i\) is tested using the two-tailed Wilcoxon’s signed rank test, the resulting *p* values are shown in Table 4. Using a confidence level of 95 % in Holm’s step-down procedure, we find that the null hypothesis is rejected for non-convex optimization, *k*-NN, and all kernel approximation methods. This confirms that 1-norm SVM outperforms non-convex optimization and kernel approximation with statistically significant evidence.

However, there is no statistically significant evidence at the 95 % confidence level that 1-norm SVM outperforms ESVM in terms of predictive accuracy. Here, it is perhaps worth reiterating that the 1-norm SVM significantly outperforms ESVM in terms of sparsity of solutions as shown in Table 3. Therefore, the 1-norm SVM method achieves the highest predictive accuracy among all methods that learn with indefinite similarities, while also retaining sparsity of the support vector set.

Table 4 The second column lists the *p* values of the two-tailed Wilcoxon’s signed rank test in increasing order

Null hypothesis (\(H_i\)) | *p* value | Adjusted critical value |
---|---|---|
1-norm SVM versus SVM with non-convex optimization | 0.0003 | 0.0083 |
1-norm SVM versus denoise | 0.0008 | 0.0100 |
1-norm SVM versus *k*-NN | 0.0016 | 0.0125 |
1-norm SVM versus flip | 0.0052 | 0.0167 |
1-norm SVM versus indefinite SVM | 0.0107 | 0.0250 |
1-norm SVM versus ESVM | 0.0771 | 0.0500 |
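The adjusted critical values in the table above follow from Holm’s step-down rule: with \(k\) hypotheses sorted by increasing *p* value, the *i*th smallest *p* value is compared against \(\alpha /(k-i+1)\), and testing stops at the first hypothesis that is not rejected. The sketch below applies this rule to the six *p* values listed in the table with \(\alpha =0.05\); the function and variable names are hypothetical.

```python
def holm_step_down(p_values, alpha=0.05):
    """Holm's step-down procedure.

    Returns (critical, rejected), two lists aligned with the input order:
    the adjusted critical value for each hypothesis and whether it is rejected.
    """
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])  # indices by increasing p value
    critical = [0.0] * k
    rejected = [False] * k
    still_rejecting = True
    for rank, idx in enumerate(order):
        critical[idx] = alpha / (k - rank)  # alpha/k, alpha/(k-1), ..., alpha/1
        if still_rejecting and p_values[idx] <= critical[idx]:
            rejected[idx] = True
        else:
            # Once one hypothesis is retained, all later ones are retained as well.
            still_rejecting = False
    return critical, rejected


# p values of the two-tailed Wilcoxon's signed rank test, as listed in the table.
methods = ["non-convex optimization", "denoise", "k-NN", "flip", "indefinite SVM", "ESVM"]
p_values = [0.0003, 0.0008, 0.0016, 0.0052, 0.0107, 0.0771]

critical, rejected = holm_step_down(p_values, alpha=0.05)
for name, p, c, r in zip(methods, p_values, critical, rejected):
    print(f"{name:25s} p={p:.4f} critical={c:.4f} rejected={r}")
```

Under this rule the first five null hypotheses are rejected and the comparison with ESVM is not, which matches the conclusion drawn in the text.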

## 6 Conclusion

Extensive research effort has been devoted lately to classification with indefinite similarities. In this paper, we show theoretically and experimentally how the 1-norm support vector machine is a better method for handling indefinite similarities. The 1-norm SVM method formulates large-margin separation as a convex linear programming (LP) problem without requiring that the similarity function be positive semidefinite (PSD). It uses the indefinite similarity function directly without any transformation, and, hence, it always treats both training and test examples consistently. Furthermore, by relating 1-norm SVM to neural networks and to error bounds of the leave-one-out estimation method, 1-norm SVM can be interpreted as an approximate implementation of the structural risk minimization (SRM) induction principle. Hence, it is robust against the risks of under-fitting and over-fitting. Finally, 1-norm SVM achieves the highest accuracy among all previous methods for classification with indefinite similarities with statistically significant evidence, while also retaining sparsity of the support vector set.

## Footnotes

1.
2. To reiterate, the similarity function \(S:\mathcal {X}\times \mathcal {X}\rightarrow \mathbb {R}\) is determined by the application at hand, and not by the learning system. Therefore, we assume a similarity function is given, and do not address whether or not it is suitable for the learning task.
3. Sparse solutions are important for at least two reasons. First, only a small subset of the training set is needed during prediction, and hence prediction can be carried out quite efficiently. Second, minimizing the number of support vectors can be interpreted as a method of minimizing an upper bound on the expected test error rate [see for example Eq. (93) in Burges (1998) in the case of SVM and the discussion in Sect. 4.5 in the case of 1-norm SVM].
4. An alternative derivation is as follows. We can introduce the coefficient vector \(w=\lambda /M\) and fix \(||\lambda ||=1\) to avoid the issue of rescaling. Then, maximizing the margin *M* becomes equivalent to minimizing an appropriate norm of *w* such as the \(\ell _1\) norm. This leads to the same linear program that is used in 1-norm SVM.
5. Using Lagrange duality, it is well-known that adding \(\ell _p\) regularization on some optimization variable \(\lambda \) in the objective function is equivalent to setting some upper bound on \(||\lambda ||_p\) (see for instance Abu-Mostafa et al. 1997). This shows that minimizing \(||\lambda ||_1\) in the 1-norm SVM method indeed plays the role of minimizing the complexity of the hypothesis space.
6. The datasets and MATLAB implementation routines will be made available at: http://mine.kaust.edu.sa.
7. The reason for choosing 0.1 is that 95 % of pairwise distances are less than 10.
8. We performed a random permutation of the set of integers \(\{1,2,\ldots ,9\}\). Each pair of adjacent labels was used as a binary classification problem, where the 9th label is trained versus the 1st.
9.
10. Because ESVM was found to be competitive with the relevance vector machine (RVM) (Loosli et al. 2013), RVM was not included in our experiments.

## Notes

### Acknowledgments

We would like to express our gratitude to the anonymous reviewers who provided valuable feedback and suggestions that greatly improved the manuscript. We also thank the Saudi Arabian Oil Company (Saudi Aramco) and King Abdullah University of Science and Technology (KAUST) for supporting this research.

## References

- Abu-Mostafa, Y. S., Magdon-Ismail, M., & Lin, H. T. (2012). *Learning from data*. AMLBook.
- Balcan, M. F., Blum, A., & Srebro, N. (2008). A theory of learning with similarity functions. *Machine Learning*, *72*(1–2), 89–112.
- Bartlett, P. L. (1997). For valid generalization, the size of the weights is more important than the size. *Advances in Neural Information Processing Systems (NIPS)*, *9*, 134.
- Lichman, M. (2013). *UCI machine learning repository*. Irvine, CA: University of California, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml
- Boser, B. E., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In *Fifth annual workshop on computational learning theory* (pp. 144–152).
- Boyd, S., & Vandenberghe, L. (2004). *Convex optimization*. Cambridge: Cambridge university press.
- Bradley, P. S., & Mangasarian, O. L. (1998). Feature selection via concave minimization and support vector machines. In *ICML*.
- Burges, C. (1999). Geometry and invariance in kernel based methods. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), *Advances in kernel methods–support vector learning* (pp. 89–116). Cambridge, MA: MIT Press.
- Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. *Data Mining and Knowledge Discovery*, *2*, 121–167.
- Chang, C., & Lin, C. J. (2001). *LIBSVM: A library for support vector machines* (online). http://www.csie.ntu.edu.tw/cjlin/libsvm.
- Chen, J., & Ye, J. (2008). Training SVM with indefinite kernels. In *Proceedings of ICML* (pp. 136–143).
- Chen, Y., Garcia, E. K., Gupta, M. R., Rahimi, A., & Cazzanti, L. (2009a). Similarity-based classification: Concepts and algorithms. *JMLR*, *10*, 747–776.
- Chen, Y., Gupta, M. R., & Recht, B. (2009b). Learning kernels from indefinite similarities. In *Proceedings of ICML* (pp. 145–152).
- Cortes, C., & Vapnik, V. (1995). Support-vector networks. *Machine learning*, *20*(3), 273–297.
- Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. *IEEE Transactions on Information Theory*, *13*(1), 21–27.
- Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. *JMLR*, *7*, 1–30.
- Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In *IEEE CVPR: Workshop on generative-model based vision*.
- Finkelstein, L., Gabrilovich, E., Matias, Y., Rivlin, E., Solan, Z., Wolfman, G., et al. (2002). Placing search in context: The concept revisited. *ACM Transactions on Information Systems*, *20*(1), 116–131.
- Fung, G. M., & Mangasarian, O. L. (2004). A feature selection Newton method for support vector machine classification. *Computational Optimization and Applications*, *28*, 185–202.
- Graepel, T., Herbrich, R., Bollmann-Sdorra, P., & Obermayer, K. (1999). Classification on pairwise proximity data. In M. J. Kearns, S. A. Solla, & D. A. Cohn (Eds.), *Advances in NIPS* (pp. 438–444). MIT Press.
- Gurobi Optimization I. (2012). *Gurobi optimizer reference manual*. http://www.gurobi.com.
- Haasdonk, B. (2005). Feature space interpretation of svms with indefinite kernels. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, *27*(4), 482–492.
- Hastie, T., Tibshirani, R., & Friedman, J. (2001). *The elements of statistical learning: Data mining, inference, and prediction. Springer series in statistics* (2nd ed.). Springer.
- Hilario, M., & Kalousis, A. (2008). Approaches to dimensionality reduction in proteomic biomarker studies. *Briefings in Bioinformatics*, *9*(2), 102–118.
- Holm, S. (1979). A simple sequentially rejective multiple test procedure. *Scandinavian Journal of Statistics*, *6*(2), 65–70.
- IBM I. (2015). *Cplex optimizer*. http://www.ibm.com/software/commerce/optimization/cplex-optimizer/.
- Lin, H. T., & Lin, C. J. (2003). *A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods*. Tech. rep., Department of Computer Science, National Taiwan University. http://www.csie.ntu.edu.tw/cjlin/papers/tanh.pdf.
- Liu, H., Motoda, H., Setiono, R., & Zhao, Z. (2010). Feature selection: An ever evolving frontier in data mining. In *4th workshop on feature selection in data mining (FSDM 10), PAKDD* (pp. 4–13).
- Loosli, G., Ong, C. S., & Canu, S. (2013). *SVM in Krein spaces*. Tech. rep. http://hal.archives-ouvertes.fr/hal-00869658/.
- Luntz, A., & Brailovsky, V. (1969). On estimation of characters obtained in statistical procedure of recognition. *Technicheskaya Kibernetica*, *3*(6) (in Russian).
- Luss, R., & d’Aspremont, A. (2009). Support vector machine classification with indefinite kernels. *Mathematical Programming Computation*, *1*(2–3), 97–118.
- Macskassy, S. A., & Provost, F. (2007). Classification in networked data: A toolkit and a univariate case study. *JMLR*, *8*, 935–983.
- Mangasarian, O. L. (1998). *Generalized support vector machines*. Tech. Rep. Mathematical Programming Technical Report 98-14, University of Wisconsin.
- Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2012). *Foundations of machine learning*. Cambridge: MIT Press.
- Noordewier, M. O., Towell, G. G., & Shavlik, J. W. (1991). Training knowledge-based neural networks to recognize genes in dna sequences. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), *Advances in NIPS* (pp. 530–536). Morgan-Kaufmann.
- Ong, C. S., Mary, X., Canu, S., & Smola, A. J. (2004). Learning with non-positive kernels. In *ICML*.
- Park, J., & Sandberg, I. W. (1991). Universal approximation using radial-basis-function networks. *Neural Computation*, *3*(2), 246–257.
- Pekalska, E., Paclik, P., & Duin, R. P. (2001). A generalized kernel approach to dissimilarity-based classification. *JMLR*, *2*, 175–211.
- Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. *Annals of Statistics*, *26*(5), 1651–1686.
- Schölkopf, B., & Smola, A. J. (2002). *Learning with kernels: Support vector machines, regularization, optimization, and beyond*. Cambridge: MIT Press.
- Soman, K. P., Loganathan, R., & Ajay, V. (2009). *Machine Learning with SVM and other Kernel methods*. PHI Learning.
- Tipping, M. E. (2001). Sparse bayesian learning and the relevance vector machine. *JMLR*, *1*, 211–244.
- Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for support vector machines. *Neural Computation*, *12*(9), 2013–2036.
- Vapnik, V. N. (1999). An overview of statistical learning theory. *IEEE Transactions on Neural Networks*, *10*(5), 988–999.
- Wu, G., Zhang, Z., & Chang, E. Y. (2005). *An analysis of transformation on non-positive semidefinite similarity matrix for kernel machines*. Tech. rep., UCSB.
- Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., et al. (2008). Top 10 algorithms in data mining. *Knowledge and Information Systems*, *14*(1), 1–37.
- Ying, Y., Campbell, C., & Girolami, M. (2009). Analysis of SVM with indefinite kernels. *Advances in NIPS*, *22*, 2205–2213.
- Zhu, J., Rosset, S., Hastie, T., & Tibshirani, R. (2004). 1-norm support vector machines. *Advances in Neural Information Processing Systems (NIPS)*, *16*, 49–56.
- Zou, H. (2007). An improved 1-norm SVM for simultaneous classification and variable selection. In *Proceedings of the 11th international conference on artificial intelligence and statistics* (pp. 675–681).