# Unconfused ultraconservative multiclass algorithms


## Abstract

We tackle the problem of learning linear classifiers from noisy datasets in a multiclass setting. The two-class version of this problem was studied a few years ago where the proposed approaches to combat the noise revolve around a Perceptron learning scheme fed with peculiar examples computed through a weighted average of points from the noisy training set. We propose to build upon these approaches and we introduce a new algorithm called Unconfused Multiclass additive Algorithm (U MA) which may be seen as a generalization to the multiclass setting of the previous approaches. In order to characterize the noise we use the *confusion matrix* as a multiclass extension of the classification noise studied in the aforementioned literature. Theoretically well-founded, U MA furthermore displays very good empirical noise robustness, as evidenced by numerical simulations conducted on both synthetic and real data.

## Keywords

Multiclass classification, Perceptron, Noisy labels, Confusion Matrix, Ultraconservative algorithms

## 1 Introduction

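*Context* This paper deals with linear multiclass classification problems defined on an input space \({\mathcal {X}}\) (e.g., \({\mathcal {X}}={\mathbb {R}}^d\)) and a set of classes \(\mathcal{Q}\doteq \{1,\ldots ,Q\}\). In particular, we are interested in the robustness of *ultraconservative additive* algorithms (Crammer and Singer 2003) to label noise classification in the multiclass setting; in order to lighten notation, we will now refer to these algorithms as *ultraconservative algorithms*. We study whether it is possible to learn a linear predictor from a training set made of independent realizations of a pair \((X,Y)\) of random variables:

\[{\mathcal {S}}\doteq \{(\varvec{x}_i,y_i)\}_{i=1}^n,\]

where each \(y_i\in \mathcal{Q}\) is a possibly corrupted version of the *true label*, i.e. deterministically computed class, \(t(\varvec{x}_i)\in \mathcal{Q}\) associated with \(\varvec{x}_i\), according to some *concept* \(t\). The random noise process \(Y\) that corrupts the label to provide the \(y_i\)’s given the \(\varvec{x}_i\)’s is supposed uniform within each pair of classes, thus it is fully described by a *confusion matrix* \(C=(C_{pq})_{p,q}\in {\mathbb {R}}^{Q\times Q}\) so that

\[\forall \varvec{x}\in {\mathcal {X}},\quad \mathbb {P}_Y(Y=p\,|\,\varvec{x})=C_{p\,t(\varvec{x})}.\]

Our goal is to provide a learning procedure able to deal with the *confusion noise* present in the training set \({\mathcal {S}}\) to give rise to a classifier \(h\) with small risk

\[\mathbb {P}_{X\sim \mathcal {D}}(h(X)\ne t(X)),\]

\(\mathcal {D}\) being the distribution of the \(\varvec{x}_i\)’s. As we want to recover from the confusion noise, we use the term *unconfused* to characterize the procedures we propose.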

Ultraconservative learning procedures are online learning algorithms that output linear classifiers. They display nice theoretical properties regarding their convergence in the case of linearly separable datasets, provided a sufficient separation *margin* is guaranteed (as formalized in Assumption 1 below). In turn, these convergence-related properties yield generalization guarantees about the quality of the predictor learned. We build upon these nice convergence properties to show that ultraconservative algorithms are robust to a confusion noise process, provided that: i) \(C\) is invertible and can be accessed, ii) the original dataset \(\lbrace (\varvec{x}_i, t(\varvec{x}_i) ) \rbrace _{i=1}^n\) is linearly separable. This paper is essentially devoted to proving how/why ultraconservative multiclass algorithms are indeed robust to such situations. To some extent, the results provided in the present contribution may be viewed as a generalization of the contributions on learning binary perceptrons under misclassification noise (Blum et al. 1996; Bylander 1994).

Besides the theoretical questions raised by the learning setting considered, we may depict the following example of an actual learning scenario where learning from noisy data is relevant. This learning scenario will be further investigated from an empirical standpoint in the section devoted to numerical simulations (Sect. 4).

### *Example 1*

One situation where coping with mislabelled data is required arises in (partially supervised) scenarios where labelling data is very expensive. Imagine a task of text categorization from a training set \({\mathcal {S}}={\mathcal {S}}_{\ell }{{\mathrm{\cup }}}{\mathcal {S}}_u\), where \({\mathcal {S}}_{\ell }=\{(\varvec{x}_i,y_i)\}_{i=1}^n\) is a set of \(n\) labelled training examples and \({\mathcal {S}}_u=\{\varvec{x}_{n+i}\}_{i=1}^m\) is a set of \(m\) unlabelled vectors; in order to fall back to realistic training scenarios where more labelled data cannot be acquired, we may assume that \(n\ll m\). A possible three-stage strategy to learn a predictor is as follows: first, learn a predictor \(f_{\ell }\) on \({\mathcal {S}}_{\ell }\) and estimate its confusion error \(C\) *via* a cross-validation procedure (\(f_{\ell }\) is assumed to make mistakes evenly over the class regions); second, use the learned predictor to label all the data in \({\mathcal {S}}_u\) to produce the labelled training set \(\widehat{{\mathcal {S}}}=\{(\varvec{x}_{n+i},t_{n+i}:=f_{\ell }(\varvec{x}_{n+i}))\}_{i=1}^m\); and finally, learn a classifier \(f\) from \(\widehat{{\mathcal {S}}}\) *and* the confusion information \(C\).

This introductory example pertains to semi-supervised learning and this is only one possible learning scenario where the contribution we propose, U MA, might be of some use. Still, it is essential to understand right away that one key feature of U MA, which sets it apart from many contributions encountered in the realm of semi-supervised learning, is that we do provide theoretical bounds on the sample complexity and running time required by our algorithm to output an effective predictor.

The present paper is an extended version of Louche and Ralaivola (2013). Compared with the original paper, it provides a more detailed introduction of the tools used in the paper, a more thorough discussion on related work as well as more extensive numerical results (which confirm the relevance of our findings). A strategy to make use of kernels for nonlinear classification has also been added.

*Contributions* Our main contribution is to show that it is both practically and theoretically possible to learn a multiclass classifier on noisy data if some information on the noise process is available. We propose a way to generate new points for which the true class is known. Hence we can iteratively populate a new *unconfused* dataset to learn from. This allows us to handle a massive amount of mislabelled data with only a very slight loss of accuracy. We embed our method into ultraconservative algorithms and provide a thorough analysis of it, in which we show that the strong theoretical guarantees that characterize the family of ultraconservative algorithms carry over to the noisy scenario.

*Related work* Learning from mislabelled data in an iterative manner has a long-standing history in the machine learning community. The first contributions on this topic, based on the Perceptron algorithm (Minsky and Papert 1969), are those of Bylander (1994), Blum et al. (1996), Cohen (1997), which promoted the idea utilized here that a sample average may be used to construct update vectors relevant to a Perceptron learning procedure. These first contributions were focused on the binary classification case and, for Blum et al. (1996), Cohen (1997), tackled the specific problem of strong-polynomiality of the learning procedure in the *probably approximately correct* (PAC) framework (Kearns and Vazirani 1994). Later, Stempfel and Ralaivola (2007) proposed a binary learning procedure making it possible to learn a kernel Perceptron in a noisy setting; an interesting feature of this work is the recourse to random projections in order to lower the capacity of the class of kernel-based classifiers. Meanwhile, many advances were witnessed in the realm of online multiclass learning procedures. In particular, Crammer and Singer (2003) proposed families of learning procedures subsuming the Perceptron algorithm, dedicated to tackle multiclass prediction problems. A sibling family of algorithms, the passive-aggressive online learning algorithms (Crammer et al. 2006), inspired both by the previous family and the idea of minimizing instantaneous losses, were designed to tackle various problems, among which multiclass linear classification. Sometimes, learning with partially labelled data might be viewed as a problem of learning with corrupted data (if, for example, all the unlabelled data are randomly or arbitrarily labelled) and it makes sense to mention the works Kakade et al. (2008) and Ralaivola et al. (2011) as distant relatives to the present work.

*Organization of the paper* Section 2 formally states the setting we consider throughout this paper. Section 3 provides the details of our main contribution: the U MA algorithm and its detailed theoretical analysis. Section 4 presents numerical simulations that support the soundness of our approach.

## 2 Setting and problem

### 2.1 Noisy labels with underlying linear concept

The probabilistic setting we consider hinges on the existence of two components. On the one hand, we assume an unknown (but fixed) probability distribution \(\mathcal {D}\) on the *input space* \({\mathcal {X}}\doteq {\mathbb {R}}^d\). On the other hand, we also assume the existence of a deterministic labelling function \(t:{\mathcal {X}}\rightarrow \mathcal{Q}\), where \(\mathcal{Q}\doteq \{1,\ldots ,Q\}\), which associates a label \(t(\varvec{x})\) to any input example \(\varvec{x}\); in the *Probably Approximately Correct* (PAC) literature, \(t\) is sometimes referred to as a *concept* (Kearns and Vazirani 1994; Valiant 1984).

In the present paper, we focus on learning *linear classifiers*, defined as follows.

### **Definition 1**

(*Linear classifiers*) The *linear classifier* \(f_W:{\mathcal {X}}\rightarrow \mathcal{Q}\) is a classifier that is associated with a set of vectors \(W=[\varvec{w}_1\cdots \varvec{w}_Q]\in {\mathbb {R}}^{d\times Q}\), which predicts the label \(f_W(\varvec{x})\) of any vector \(\varvec{x}\in {\mathcal {X}}\) as

\[f_W(\varvec{x})=\mathop {\mathrm {argmax}}_{q\in \mathcal{Q}}\,\langle \varvec{w}_q,\varvec{x}\rangle . \qquad (1)\]
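As a minimal illustration of Definition 1, the sketch below predicts a label by picking the weight vector with the largest inner product, which is the reading of decision rule (1) suggested by Sect. 3.1. The function name `predict`, the storage of \(W\) as a list of \(Q\) weight vectors, the 0-based class indices and the tie-breaking towards the smallest index are all assumptions made for the example.

```python
def predict(W, x):
    """Return the class index q maximizing <w_q, x>.

    Ties go to the smallest index; this tie-breaking is an assumption.
    """
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in W]
    return max(range(len(W)), key=lambda q: scores[q])


# Three hypothetical weight vectors in R^2 (classes 0, 1, 2).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
label = predict(W, [0.2, 0.9])  # class 1 has the largest inner product
```
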

### **Definition 2**

(*Margin of a linear classifier*) Let \(c:{\mathcal {X}}\rightarrow \mathcal{Q}\) be some fixed concept. Let \(W=[\varvec{w}_1\cdots \varvec{w}_Q]\in {\mathbb {R}}^{d\times Q}\) be a set of \(Q\) weight vectors. Linear classifier \(f_W\) is said to have margin \(\theta >0\) with respect to \(c\) (and distribution \(\mathcal {D}\)) if the following holds:

\[\mathbb {P}_{X\sim \mathcal {D}}\left( \exists p\ne c(X):\langle \varvec{w}_{c(X)}-\varvec{w}_p,X\rangle \le \theta \right) =0.\]

Equipped with this definition, we shall consider that the following assumption of linear separability with margin \(\theta \) of concept \(t\) holds throughout.

### **Assumption 1**

(*Linear Separability of \(t\) with Margin* \(\theta \)) There exist \(\theta \ge 0\) and \(W^*=[\varvec{w}_1^*\cdots \varvec{w}_Q^*]\in {\mathbb {R}}^{d\times Q}\), with \(\Vert W^* \Vert _F^2 = 1\) (\(\Vert \cdot \Vert _F\) denotes the Frobenius norm) such that \(f_{W^*}\) has margin \(\theta \) with respect to the concept \(t\).

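The quality of a classifier \(f\) is measured by its *true risk* or *misclassification error* \(R_{\text {error}}(f)\), given by

\[R_{\text {error}}(f)\doteq \mathbb {P}_{X\sim \mathcal {D}}(f(X)\ne t(X)).\]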

### **Assumption 2**

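The noise process that corrupts the labels is fully described by a *known* confusion matrix \(C\) given by

\[\forall \varvec{x}\in {\mathcal {X}},\quad C_{p\,t(\varvec{x})}\doteq \mathbb {P}_Y(Y=p\,|\,X=\varvec{x}).\]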

Alternatively put, the noise process that corrupts the data is *uniform* within each class and its level does not depend on the precise location of \(\varvec{x}\) within the region that corresponds to class \(t(\varvec{x})\). The noise process \(Y\) is both a) aggressive, as it does not only apply, as we may expect, to regions close to the class boundaries between classes and b) regular, as the mislabelling rate is piecewise constant. Nonetheless, this setting can account for many real-world problems as numerous noisy phenomena can be summarized by a simple confusion matrix. Moreover it has been proved (Blum et al. 1996) that robustness to classification noise generalizes robustness to monotonic noise where, for each class, the noise rate is a monotonically decreasing function of the distance to the class boundaries.
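To make the noise model concrete, here is a small simulation sketch in which the noisy label depends only on the true class. It assumes the index convention \(C_{pq}=\mathbb {P}(Y=p\,|\,t(X)=q)\), so that each column of \(C\) sums to one; this convention, the function name `corrupt_labels`, the 0-based class indices and the example matrix are assumptions made for illustration.

```python
import random


def corrupt_labels(true_labels, C, rng):
    """Draw one noisy label per true label t, with P(Y = p | t) = C[p][t].

    The column convention for C is an assumption of this sketch.
    """
    Q = len(C)
    noisy = []
    for t in true_labels:
        column = [C[p][t] for p in range(Q)]  # law of Y given true class t
        noisy.append(rng.choices(range(Q), weights=column, k=1)[0])
    return noisy


# Hypothetical 3-class confusion matrix; every column sums to one.
C = [[0.8, 0.1, 0.0],
     [0.1, 0.9, 0.3],
     [0.1, 0.0, 0.7]]
rng = random.Random(0)
y = corrupt_labels([0, 1, 2, 2, 0], C, rng)
```

Because the sampling ignores the position of \(\varvec{x}\) inside its class region, the mislabelling rate is constant within each class, as described above.
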

### *Remark 1*

The confusion matrix \(C\) should not be mistaken with the matrix \(\tilde{C}\) of general term: \(\tilde{C}_{ij} \doteq \mathbb {P}_{X\sim \mathcal {D}_{X|Y=j}}(t(X) = i | Y = j)\) which is the class-conditional distribution of \(t(X)\) given \(Y\). The problem of learning from a noisy training set and \(\tilde{C}\) is a different problem than the one we aim to solve. In particular, \(\tilde{C}\) can be used to define cost-sensitive losses rather directly whereas doing so with \(C\) is far less obvious. Anyhow, this second problem of learning from \(\tilde{C}\) is far from trivial and very interesting, and it falls way beyond the scope of the present work.

Finally, we assume the following from here on:

### **Assumption 3**

\(C\) is invertible.

Note that this assumption is not as restrictive as it may appear. For instance, if we consider the learning setting depicted in Example 1 and implemented in the numerical simulations, then the confusion matrix obtained from the first predictor \(f_{\ell }\) is often strictly diagonally dominant, i.e. the magnitude of each diagonal entry is larger than the sum of the magnitudes of the other entries in its row, and \(C\) is therefore invertible. Generally speaking, the problems that we are interested in (i.e. problems where the true classes seem to be recoverable) tend to have an invertible confusion matrix. It is most likely that invertibility is merely a sufficient condition on \(C\) that allows us to establish learnability in the sequel. Identifying less stringent conditions on \(C\), or conditions termed in a different way (which would for instance be based on the condition number of \(C\)) for learnability to remain, is a research issue of its own that we leave for future investigations.
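The diagonal dominance argument can be checked mechanically. The sketch below tests strict row diagonal dominance, which is a sufficient (not necessary) condition for invertibility; the function name and the example matrices are hypothetical.

```python
def is_strictly_diagonally_dominant(C):
    """True if |C[r][r]| exceeds the sum of |C[r][c]| over c != r, for every row r."""
    return all(
        abs(row[r]) > sum(abs(v) for c, v in enumerate(row) if c != r)
        for r, row in enumerate(C)
    )


# A mildly noisy matrix passes the test and is therefore invertible.
mild = [[0.8, 0.1, 0.1],
        [0.2, 0.7, 0.1],
        [0.0, 0.3, 0.7]]
# A fully uninformative matrix fails it (and is indeed singular).
uninformative = [[0.5, 0.5],
                 [0.5, 0.5]]
```

A matrix that fails this test may still be invertible, so the check only confirms invertibility and never refutes it.
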

The setting we have just presented allows us to view \({\mathcal {S}}=\{(\varvec{x}_i,y_i)\}_{i=1}^n\) as the realization of a random sample \(\{(X_i,Y_i)\}_{i=1}^n\), where each pair \((X_i,Y_i)\) is an independent copy of the random pair \((X,Y)\) of law \(\mathcal {D}_{XY}\doteq \mathcal {D}_X\mathcal {D}_{Y|X}.\)

### 2.2 Problem: learning a linear classifier from noisy data

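The problem we address is that of learning a classifier \(f\) from \({\mathcal {S}}\) *and* \(C\) so that the error rate

\[R_{\text {error}}(f)=\mathbb {P}_{X\sim \mathcal {D}}(f(X)\ne t(X))\]

is as small as possible, even though the training data only consist of pairs with possibly corrupted labels.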

Building on Assumption 1, we may refine our learning objective by restricting ourselves to linear classifiers \(f_W\), for \(W=[\varvec{w}_1\cdots \varvec{w}_Q]\in {\mathbb {R}}^{d\times Q}\) (see Definition 1). Our goal is thus to learn a relevant matrix \(W\) from \({\mathcal {S}}\) *and* the confusion matrix \(C\). More precisely, we achieve risk minimization through classic additive methods and the core of this work is focused on computing noise-free update points such that the properties of said methods are unchanged.

## 3 U MA: unconfused ultraconservative multiclass algorithm

This section presents the main result of the paper, that is, the U MA procedure, which is a response to the problem posed above: U MA makes it possible to learn a multiclass linear predictor from \({\mathcal {S}}\) and the confusion information \(C\). In addition to the algorithm itself, this section provides theoretical results regarding the convergence and sample complexity of U MA.

As U MA is a generalization of the ultraconservative additive online algorithms proposed in Crammer and Singer (2003) to the case of noisy labels, we first and foremost recall the essential features of this family of algorithms. The rest of the section is then devoted to the presentation and analysis of U MA.

### 3.1 A brief reminder on ultraconservative additive algorithms

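Ultraconservative additive algorithms (Crammer and Singer 2003) are online procedures that output linear classifiers \(f_W\) as in Definition 1. Like the Perceptron, they are characterized by an *error set* and its simple update: when processing a training pair \((\varvec{x},y)\), they perform updates of the form

\[\varvec{w}_q\leftarrow \varvec{w}_q+\tau _q\varvec{x},\quad q=1,\ldots ,Q,\]

whenever the *error set* \(\mathcal{E}(\varvec{x},y)\) defined as

\[\mathcal{E}(\varvec{x},y)\doteq \left\{ r\in \mathcal{Q}\setminus \{y\}:\langle \varvec{w}_r,\varvec{x}\rangle -\langle \varvec{w}_y,\varvec{x}\rangle \ge 0\right\} \]

is not empty, the family \(\{\tau _q\}_{q\in \mathcal{Q}}\) of *step sizes* being required to fulfill

\[\tau _y=1,\qquad \tau _r\le 0\ \text {if } r\in \mathcal{E}(\varvec{x},y),\qquad \tau _r=0\ \text {otherwise},\qquad \sum _{r=1}^{Q}\tau _r=0.\]

The term *ultraconservative* refers to the fact that only those prototype vectors \(\varvec{w}_r\) which achieve a larger inner product \(\langle \varvec{w}_r,\varvec{x}\rangle \) than \(\langle \varvec{w}_y,\varvec{x}\rangle \), that is, the vectors that can entail a prediction mistake when decision rule (1) is applied, may be affected by the update procedure. The term *additive* conveys the fact that the updates consist in modifying the weight vectors \(\varvec{w}_r\)’s by adding a portion of \(\varvec{x}\) to them (which is to be opposed to multiplicative update schemes). Again, as we only consider these additive types of updates in what follows, it will have to be implicitly understood even when not explicitly mentioned.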
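The following sketch shows one additive ultraconservative step, assuming the standard error set of Crammer and Singer (2003), i.e. the classes \(r\ne y\) whose score is at least that of \(y\). The uniform step sizes (\(+1\) for \(y\), \(-1/|\mathcal{E}|\) for each class in the error set) are only one admissible choice, and the function name and data layout are assumptions of the example.

```python
def ultraconservative_step(W, x, y):
    """Perform one additive update on (x, y); return True if W was modified.

    Step sizes used here (one admissible choice, assumed for illustration):
    +1 for class y and -1/|E| for every class in the error set E, so they sum to 0.
    """
    def score(w):
        return sum(a * b for a, b in zip(w, x))

    s_y = score(W[y])
    error_set = [r for r in range(len(W)) if r != y and score(W[r]) >= s_y]
    if not error_set:
        return False  # no class competes with y: nothing is updated
    tau = {r: -1.0 / len(error_set) for r in error_set}
    tau[y] = 1.0
    for r, t in tau.items():
        W[r] = [w_i + t * x_i for w_i, x_i in zip(W[r], x)]
    return True


W = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
first = ultraconservative_step(W, [1.0, 2.0], 0)   # all scores tie: update
second = ultraconservative_step(W, [1.0, 2.0], 0)  # now correct: no update
```

Only the weight vectors in the error set and \(\varvec{w}_y\) are touched, which is exactly the ultraconservative property.
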

One of the main results regarding ultraconservative algorithms, which we extend to our learning scenario, is the following.

### **Theorem 1**

(Mistake bound for ultraconservative algorithms Crammer and Singer 2003) Suppose that concept \(t\) is in accordance with Assumption 1. The number of mistakes/updates made by one pass over \({\mathcal {S}}\) by any ultraconservative procedure is upper-bounded by \(2/\theta ^2\).

This result is essentially a generalization of the well-known Block–Novikoff theorem (Block 1962; Novikoff 1963), which establishes a mistake bound for the Perceptron algorithm (an ultraconservative algorithm itself).

### 3.2 Main result and high level justification

For the impatient reader, we may already leak some of the ingredients we use to prove the relevance of our procedure. Theorem 1, which shows the convergence of ultraconservative algorithms, rests on the analysis of the updates made when training examples are misclassified by the current classifier. The conveyed message is therefore that examples that are erred upon are central to the convergence analysis. It turns out that steps 4 through 7 of U MA (cf. Algorithm 2) construct a point \(\varvec{z}_{pq}\) that is, with high probability, erred upon. More precisely, the true class \(t(\varvec{z}_{pq})\) of \(\varvec{z}_{pq}\) is \(q\) and it is predicted to be of class \(p\) by the current classifier; at the same time, these update vectors are guaranteed to realize a positive margin condition with respect to \(W^*\): \(\langle \varvec{w}_q^*,\varvec{z}_{pq}\rangle >\langle \varvec{w}_k^*,\varvec{z}_{pq}\rangle \) for all \(k\ne q\). The ultraconservative feature of the algorithm is carried by step 8 and step 10, which make it possible to update any prototype vector \(\varvec{w}_r\) with \(r\ne q\) having an inner product \(\langle \varvec{w}_r,\mathbf{z}_{pq}\rangle \) with \(\mathbf{z}_{pq}\) larger than \(\langle \varvec{w}_q,\mathbf{z}_{pq}\rangle \) (which should be the largest if a correct prediction were made). The reason why we have results ‘with high probability’ is that the \(\varvec{z}_{pq}\)’s are sample-based estimates of update vectors known to be of class \(q\) but predicted as being of class \(p\), with \(p\ne q\); computing the accuracy of the sample estimates is one of the important exercises of what follows. A control on the accuracy makes it possible for us to then establish the convergence of the proposed algorithm.

### 3.3 With high probability, \(\mathbf{z}_{pq}\) is a mistake with positive margin

Here, we prove that the update vector \(\varvec{z}_{pq}\) given in step 7 is, with high probability, a point on which the current classifier errs.

### **Proposition 1**

### *Proof*

Intuitively, \(\mu _q^p\) must be seen as an example of class \(q\) which is erroneously predicted as being of class \(p\). Such an example is precisely what we are looking for to update the current classifier; as expectations cannot be computed, the estimate \(\varvec{z}_{pq}\) of \(\mu _q^p\) is used instead of \(\mu _q^p\).

### **Proposition 2**

- i) \(t(\mu _{q}^p)=q\), i.e. the ‘true’ class of \(\mu _{q}^p\) is \(q\);
- ii) and \(f_W(\mu _{q}^p)=p\); \(\mu _{q}^p\) is therefore misclassified by the current classifier \(f_W\).

### *Proof*

Equation (12) is obtained thanks to Assumption 1 combined with (10) and the linearity of the expectation. Equation (13) is obtained thanks to the definition (8) of \({\mathcal {A}}_{p}^{\alpha }\) (made of points that are predicted to be of class \(p\)) and the linearity of the expectation. \(\square \)

The attentive reader may notice that Proposition 2 or, equivalently, step 7, is precisely the reason for requiring \(C\) to be invertible, as the computation of \(\varvec{z}_{pq}\) hinges on the resolution of a system of equations based on \(C\).
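Since the algorithm listing is not reproduced here, the sketch below gives one possible reading of how \(\varvec{z}_{pq}\) could be obtained by solving a linear system in \(C\). It assumes that the class-wise averages \(\gamma _k^p\) of the points predicted as \(p\) with margin \(\alpha \) and carrying noisy label \(k\) satisfy \(\mathbb {E}[\gamma _k^p]=\sum _q C_{kq}\,\mu _q^p\), so that solving \(CZ=\Gamma ^p\) yields row \(q\) of \(Z\) as an estimate of \(\mu _q^p\). The helper names `solve` and `update_vectors`, the 0-based class indices and the index convention on \(C\) are assumptions.

```python
def solve(C, B):
    """Solve C Z = B by Gauss-Jordan elimination with partial pivoting.

    C is assumed invertible (Assumption 3); B has one row per class.
    """
    n = len(C)
    A = [[float(v) for v in C[i]] + [float(v) for v in B[i]] for i in range(n)]
    for col in range(n):
        pivot_row = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot_row] = A[pivot_row], A[col]
        pivot = A[col][col]
        A[col] = [v / pivot for v in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [row[n:] for row in A]


def update_vectors(X, y, W, C, p, alpha=0.0):
    """Return the rows z_pq (q = 0..Q-1) for a fixed predicted class p."""
    Q, d, n = len(C), len(X[0]), len(X)

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    gamma = [[0.0] * d for _ in range(Q)]
    for x_i, y_i in zip(X, y):
        s_p = dot(W[p], x_i)
        # Keep x_i only if it is predicted as p with margin at least alpha.
        if all(s_p - dot(W[k], x_i) >= alpha for k in range(Q) if k != p):
            gamma[y_i] = [g + v / n for g, v in zip(gamma[y_i], x_i)]
    return solve(C, gamma)  # row q is the estimate z_pq
```

With \(C=I\) the solve step leaves the averages unchanged, so the procedure reduces to averaging the points with label \(q\) that are predicted as \(p\).
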

### **Proposition 3**

### *Proof*

This last proposition essentially says that the update vectors \(\mathbf{z}_{pq}\) that we compute are, with high probability, erred upon and realize a margin condition \(\theta - \varepsilon \).

Note that \(\alpha \) is needed to cope with the imprecision incurred by the use of empirical estimates. Indeed, we can only approximate \(\langle \varvec{w}_p,\varvec{z}_{pq}\rangle - \langle \varvec{w}_k,\varvec{z}_{pq}\rangle \) in (15) up to a precision of \(\varepsilon \). Thus for the result to hold we need to have \(\langle \varvec{w}_p,\mu _{q}^p\rangle - \langle \varvec{w}_k,\mu _{q}^p\rangle \ge \varepsilon \) which is obtained from (13) when \(\alpha = \varepsilon \). In practice, this just says that the points used in the computation of \(\varvec{z}_{pq}\) are at a distance at least \(\alpha \) from any decision boundaries.

### *Remark 2*

It is important to understand that the parameter \(\alpha \) helps us derive sample complexity results by allowing us to retrieve a linearly separable training dataset with *positive* margin from the noisy dataset. The theoretical results we prove hold for any such \(\alpha >0\) parameter and the smaller this parameter, the larger the sample complexity, i.e., the harder it is for the algorithm to take advantage of a training sample that meets the sample complexity requirements. In other words, the smaller \(\alpha \), the less likely it is for U MA to succeed; yet, as shown in the experiments, where we use \(\alpha =0\), U MA continues to perform quite well.

### 3.4 Convergence and stopping criterion

We arrive at our main result, which provides both convergence and a stopping criterion.

### **Proposition 4**

Under Assumptions 1, 2 and 3 there exists a number \(n\), polynomial in \(d, 1/\theta , Q, 1/\delta \), such that if the training sample is of size at least \(n\), then, with high probability (\(1 - \delta \)), U MA makes at most \(\mathcal {O}(1/{\theta }^2)\) updates.

### *Proof*

Let \({\mathcal {S}}_{\varvec{z}_{}}\) be the set of all the update vectors \(\varvec{z}_{pq}\) generated during the execution of U MA and labeled with their *true* class \(q\). Observe that, in this context, U MA (Alg. 2) behaves like a regular ultraconservative algorithm run on \({\mathcal {S}}_{\varvec{z}_{}}\). Namely: a) lines 4 through 7 compute a new point in \({\mathcal {S}}_{\varvec{z}_{}}\), and b) lines 8 through 10 perform an ultraconservative update step.

From Proposition 3, we know that with high probability, \(W^*\) is a classifier with positive margin \(\theta - \varepsilon \) on \({\mathcal {S}}_{\varvec{z}_{}}\) and it follows from Theorem 1 that U MA does not make more than \(\mathcal {O}(1/{\theta }^2)\) mistakes on such a dataset.

Because, by construction, each element of \({\mathcal {S}}_{\varvec{z}_{}}\) is, with high probability, erred upon, we have \(\vert {\mathcal {S}}_{\varvec{z}_{}} \vert \in \mathcal {O}(1/{\theta }^2)\); that means that, with high probability, U MA does not make more than \(\mathcal {O}(1/{\theta }^2)\) updates.

All in all, after \(\mathcal {O}(1/{\theta }^2)\) updates, there is a high probability that we are not able to construct examples on which U MA makes a mistake or, equivalently, the conditional misclassification errors \(\mathbb {P}(f_{W}(X)=p|Y=q)\) are all small. \(\square \)

Even though U MA operates in a batch setting, it ‘internally’ simulates the execution of an online algorithm that encounters a new training point (\(\varvec{z}_{pq} \in {\mathcal {S}}_{\varvec{z}_{}}\)) at each time step. To see more precisely how U MA can be viewed as an online algorithm, it suffices to imagine it being run in a way where each vector update is made after a chunk of \(n\) (where \(n\) is as in Proposition 4) training data has been encountered and used to compute the next element of \({\mathcal {S}}_{\varvec{z}_{}}\). Repeating this process \(\mathcal {O}(1/{\theta }^2)\) times then guarantees convergence with high probability. Note that, in this scenario, U MA requires \(n' = \mathcal {O}(n/{\theta }^2)\) data to converge, which might be far more than the sample complexity exhibited in Proposition 4. Nonetheless, \(n'\) still remains polynomial in \(d\), \(1/{\theta }\), \(Q\) and \(1/{\delta }\). For more detail on this (online to batch conversion) approach, we refer the interested readers to Blum et al. (1996).

### 3.5 Selecting \(p\) and \(q\)

So far, the question of selecting good pairs of values \(p\) and \(q\) to perform updates has been left unanswered. Indeed, our results hold for *any* pair \((p,q)\) and convergence is guaranteed even when \(p\) and \(q\) are arbitrarily selected as long as \(\varvec{z}_{pq}\) is not \(\mathbf {0}\). Nonetheless, it is reasonable to use heuristics for selecting \(p\) and \(q\) with the hope that it might improve the practical convergence speed.

The second selection criterion is intended to normalize the number of errors with respect to the proportions of different classes and aims at being robust to imbalanced data. Our goal here is to provide a way to take into account the class distribution for the selection of \((p,q)\). Note that this might be a first step towards transforming U MA into an algorithm for minimizing the confusion risk, even though additional (and significant) work is required to provably provide U MA with this feature.

On a final note, we remark that \((p,q)_{\text {conf}}\) requires additional precautions when used: when \((p,q)_{\text {error}}\) is implemented, \(\varvec{z}_{pq}\) is guaranteed to be the update vector of maximum norm among all possible update vectors, whereas this no longer holds true when \((p,q)_{\text {conf}}\) is used and if \(\varvec{z}_{pq}\) is close to \(\mathbf {0}\) then there may exist another possibly more informative—from the standpoint of convergence speed—update vector \(\varvec{z}_{p'q'}\) for some \((p',q')\ne (p,q).\)
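The remark above implies that \((p,q)_{\text {error}}\) picks the pair whose update vector has maximum norm. The sketch below implements that rule; the optional `weights` argument is a purely hypothetical way of normalizing by class proportions in the spirit of \((p,q)_{\text {conf}}\), whose exact definition is not reproduced here, and the function name and the layout of `Z` are assumptions.

```python
import math


def select_pair(Z, weights=None):
    """Pick (p, q), p != q, maximizing ||Z[p][q]|| / weights[q].

    With weights=None this is the maximum-norm rule; the weighted variant
    is a hypothetical normalization, not the paper's exact criterion.
    """
    Q = len(Z)
    best, best_val = None, -1.0
    for p in range(Q):
        for q in range(Q):
            if p == q:
                continue
            norm = math.sqrt(sum(v * v for v in Z[p][q]))
            val = norm / (weights[q] if weights else 1.0)
            if val > best_val:
                best, best_val = (p, q), val
    return best


# Z[p][q] holds the update vector z_pq; diagonal entries are unused.
Z = [[None, [3.0, 4.0]],
     [[1.0, 0.0], None]]
pair = select_pair(Z)  # the vector of norm 5 wins
```

As noted above, a normalized criterion may select a vector close to \(\mathbf {0}\), so a practical implementation would likely fall back to another pair when the selected norm is negligible.
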

### 3.6 U MA and kernels

Thus far, we have only considered the situation where linear classifiers are learned. There are however many learning problems that cannot be handled effectively without going beyond linear classification. A popular strategy to deal with such a situation is obviously to make use of kernels (Schölkopf and Smola 2002). In this direction, there are (at least) two paths that can be taken. The first one is to revisit U MA and provide a kernelized algorithm based on a dual representation of the weight vectors, as is done with the kernel Perceptron (see Cristianini and Shawe-Taylor 2000) or its close cousins (see, e.g. Friess et al. 1998; Dekel et al. 2005; Freund and Schapire 1999). Doing so would entail the question of finding sparse expansions of the weight vectors with respect to the training data in order to contain the prediction time and to derive generalization guarantees based on such sparsity: this is an interesting and ambitious research program on its own. A second strategy, which we make use of in the numerical simulations, is simply to build upon the idea of Kernel Projection Machines (Blanchard and Zwald 2008; Takerkart and Ralaivola 2011): first, perform a Kernel Principal Component Analysis (shorthanded as kernel-PCA afterwards) with \(D\) principal axes, second, project the data onto the principal \(D\)-dimensional subspace and, finally, run U MA on the obtained data. The availability of numerous methods to efficiently extract the principal subspaces (or approximation thereof) (Bach and Jordan 2002; Drineas et al. 2006; Drineas and Mahoney 2005; Stempfel and Ralaivola 2007; Williams and Seeger 2000) makes this path a viable strategy to render U MA usable for nonlinearly separable concepts. This explains why we decided to use this strategy in the present paper.

## 4 Experiments

In this section, we present results from numerical simulations of our approach and we discuss different practical aspects of U MA. The ultraconservative step sizes retained are those corresponding to a regular Perceptron: \(\tau _p=-1\) and \(\tau _q=+1\), the other values of \(\tau _r\) being equal to 0.

Section 4.1 discusses robustness results, based on simulations conducted on synthetic data while Section 4.2 takes it a step further and evaluates our algorithm on real data, with a realistic noise process related to Example 1 (cf. Sect. 1).

We use the *confusion rate* as a performance measure, which is:

### 4.1 Toy dataset

We use a 10-class dataset with a total of roughly 1000 2-dimensional examples uniformly distributed according to \(\mathcal {U}\), which is the uniform distribution over the unit circle centered at the origin. Labelling is achieved according to (1) given a set of 10 weight vectors \(\varvec{w}_1,\ldots ,\varvec{w}_{10}\), which are also randomly generated according to \(\mathcal {U}\); all these weight vectors have therefore norm 1. A margin \(\theta = 0.025\) is enforced in the generated data by removing examples that are too close to the decision boundaries—practically, with this value of \(\theta \), the case where three classes are so close to each other that no training example from one of the classes remained after enforcing the margin never occurred.

The learned classifiers are tested against a dataset of 10,000 points that are distributed according to the training distribution. The results reported in the tables and graphics are averaged over 10 runs.

The noise is generated from the sole confusion matrix. This situation can be tough to handle and is rarely met with real data but we stick with it as it is a good example of a worst-case scenario.

*Robustness to noise* We first (Fig. 1a) evaluate the robustness to noise of U MA by running our algorithm with various confusion matrices. We uniformly draw a reference nonnegative square matrix \(M\); the rows of \(M\) are then normalized, i.e. each entry of \(M\) is divided by the sum of the elements of its row, so that \(M\) is a stochastic matrix. If \(M\) is not invertible it is rejected and we draw a new matrix until we have an invertible one. Then, we define \(N\) such that \(N = {(M - I)}/10\), where \(I\) is the identity matrix of order \(Q\); typically \(N\) has nonpositive diagonal entries and nonnegative off-diagonal coefficients. We use \(N\) to parametrize a family of confusion matrices whose dominant coefficients move from the diagonal to the off-diagonal part. Namely, we run U MA 20 times with confusion matrices \(C\in \{C_i\doteq \Omega (I + iN)\}_{i=1}^{20}\), where \(\Omega \) is a matrix operator which outputs a (row-)stochastic matrix: when applied on matrix \(A\), \(\Omega \) replaces the negative elements of \(A\) by zeros and it normalizes the rows of the obtained matrix; note that \(i = 10\) corresponds to the case where \(C= M\). Equivalently, one can think of \(C_i\) as the weighted average between \(I\) and \(\Omega (N)\) where \(I\) has a constant weight of 1 and \(\Omega (N)\) is weighted by \(i\). Note that, after some point, further increasing \(i\) has little effect on \(C_i\) as it eventually converges to \(\Omega (N)\). Figure 1a plots our results against the Frobenius norm of the diagonal-free confusion matrix \(C\), that is: \(\Vert {C- \mathrm {diag}(C)} \Vert _F\) where \(\mathrm {diag}(C)\) denotes the diagonal matrix with the same diagonal values as \(C\). For the sake of comparison, we have also run U MA with a fixed confusion matrix \(C= I\) on the same data. This amounts to running a Perceptron through the data multiple times and it allows us to have a baseline for measuring the improvement induced by the use of the confusion matrix.
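The construction of the family \(\{C_i\}\) can be sketched as follows. The helper names, the choice \(Q=4\) and the random seed are assumptions made for the example, and the rejection of non-invertible draws of \(M\) is omitted for brevity.

```python
import random


def row_normalize(A):
    """Divide every entry by the sum of its row."""
    return [[v / sum(row) for v in row] for row in A]


def omega(A):
    """Clip negative entries to zero, then normalize each row to sum to one."""
    return row_normalize([[max(v, 0.0) for v in row] for row in A])


def confusion_family(M, steps=20):
    """Return [C_1, ..., C_steps] with C_i = omega(I + i * N), N = (M - I) / 10."""
    Q = len(M)
    I = [[1.0 if r == c else 0.0 for c in range(Q)] for r in range(Q)]
    N = [[(M[r][c] - I[r][c]) / 10 for c in range(Q)] for r in range(Q)]
    return [
        omega([[I[r][c] + i * N[r][c] for c in range(Q)] for r in range(Q)])
        for i in range(1, steps + 1)
    ]


rng = random.Random(0)
Q = 4
M = row_normalize([[rng.random() for _ in range(Q)] for _ in range(Q)])
family = confusion_family(M)  # family[9] is C_10, which coincides with M
```

For \(i\le 10\) no entry of \(I+iN\) is negative, so \(\Omega \) only matters for the larger values of \(i\), where the diagonal may be clipped to zero.
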

*Robustness to the incorrect estimation of the confusion matrix* The second experiment (Fig. 1b) evaluates the robustness of U MA to the use of a confusion matrix that is not exactly the confusion matrix that describes the noise process corrupting the data; this will allow us to measure the extent to which a confusion matrix (inaccurately) estimated from the training data can be dealt with by U MA. Using the same notation as before, and the same idea of generating a random stochastic reference matrix \(M\), we proceed as follows: we use the given matrix \(M\) to corrupt the noise-free dataset and then, each confusion matrix from the family \(\{C_i\}_{i=1}^{20}\) is fed to U MA as if it were the confusion matrix governing the noise process. We introduce the notion of *approximation* factor \(\rho \) as \(\rho (i)\doteq 1-i/10\), so that \(\rho \) takes values in the set \(\{-1,-0.9,\ldots ,0.9\}\). As reference, the limit case where \(\rho = 1\)—that is, \(i = 0\)—corresponds to the case where U MA is fed with the identity matrix \(I\), effectively being oblivious of any noise in the training set. More generally, the values of \(C\) are being shifted away from the diagonal as \(\rho \) decreases, the equilibrium point being \(\rho = 0\) where \(C\) is equal to the *true* confusion matrix \(M\). Consequently, a positive (resp. negative) approximation factor means that the noise is underestimated (resp. overestimated), in the sense that the noise process described by \(C\) would corrupt a lower (resp. higher) fraction of labels from each class than the *true* noise process applied on the training set, and corresponding to \(M\). Figure 1b plots the confusion rate against this approximation factor.

In Fig. 1a we observe that U MA clearly provides improvement over the Perceptron algorithm for every noise level tested, as it achieves lower confusion rates. Nonetheless, its performance degrades as the noise level increases, going from a confusion rate of \(0.5\) for small noise levels (that is, when \(\Vert {C- \mathtt{diag}(C)} \Vert _F\) is small) to roughly \(2.25\) when the noise is the strongest. Comparatively, the Perceptron algorithm follows the same trend, but with higher confusion rates, ranging from \(1.7\) to \(2.75\).

The second simulation (Fig. 1b) points out that, in addition to being robust to the noise process itself, U MA is also robust to underestimated (approximation factor \(\rho > 0\)) noise levels, but not to overestimated (approximation factor \(\rho < 0\)) noise levels. Unsurprisingly, the best confusion rate corresponds to an approximation factor of 0, which means that U MA is using the true confusion matrix and can achieve a confusion rate as low as \(1.8\). There is a clear gap between positive and negative approximation factors, the former yielding confusion rates around \(2.6\) while the latter’s are slightly lower, around \(2.15\). From these observations, it is clear that the approximation factor has a major influence on the performance of U MA.

### 4.2 Real data

#### 4.2.1 Experimental protocol

1. Ask for a small number \(m\) of examples for each of the \(Q\) classes.
2. Learn a rough classifier^{1} \(g\) from these \(Q \times m\) points.
3. Estimate the confusion \(C\) of \(g\) on a small labelled subset \({\mathcal {S}}_{\text {conf}}\) of \({\mathcal {S}}\).
4. Predict the missing labels \(\varvec{y}\) of \({\mathcal {S}}\) using \(g\); thus, \(\varvec{y}\) is a sequence of noisy labels.
5. Learn the final classifier \(f_{\mathtt{UMA}}\) from \({\mathcal {S}}\), \(\varvec{y}\), \(C\) and measure its error rate.
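Step 3 of the protocol above can be illustrated as follows. This is a minimal sketch, not the authors' implementation: the function name `estimate_confusion` is hypothetical, and the orientation of the matrix (rows indexed by the true class, columns by the label predicted by \(g\), each row normalized to sum to 1) is an assumption chosen to match the row-stochastic matrices used in the synthetic experiments.

```python
def estimate_confusion(true_labels, predicted_labels, Q):
    """Estimate a row-stochastic confusion matrix from a labelled subset.

    Assumed convention: C[p][q] is the fraction of examples of true class p
    that the rough classifier g labels as q.
    """
    counts = [[0] * Q for _ in range(Q)]
    for t, y in zip(true_labels, predicted_labels):
        counts[t][y] += 1
    C = []
    for p in range(Q):
        total = sum(counts[p])
        if total == 0:
            # A class missing from the labelled subset leaves its row undefined.
            raise ValueError(f"class {p} is absent from the labelled subset")
        C.append([c / total for c in counts[p]])
    return C


# Toy usage: six labelled points, three classes (made-up labels).
true_conf = [0, 0, 1, 1, 1, 2]
pred_conf = [0, 1, 1, 1, 0, 2]
C_hat = estimate_confusion(true_conf, pred_conf, Q=3)
```

Raising an error on an empty row is a design choice of this sketch: a class that never appears in \({\mathcal {S}}_{\text {conf}}\) gives no information about how its labels are corrupted.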

In order to put our results into perspective, we compare them with results obtained from various algorithms. This allows us to give a precise idea of the benefits and limitations of U MA. Namely, we learn four additional classifiers: \(f_{\varvec{y}}\) is a regular Perceptron learned on \({\mathcal {S}}\) labelled with the noisy labels \(\varvec{y}\); \(f_{\text {conf}}\) and \(f_{\text {full}}\) are trained with the correctly labelled training sets \({\mathcal {S}}_{\text {conf}}\) and \({\mathcal {S}}\), respectively; and, lastly, \(f_{\mathtt{S3VM}}\) is a classifier produced by a multiclass semi-supervised *SVM* algorithm (S3VM, Bennett and Demiriz 1998) run on \({\mathcal {S}}\) where only the labels of \({\mathcal {S}}_{\text {conf}}\) are provided. The performances achieved by \(f_{\varvec{y}}\) and \(f_{\text {full}}\) provide bounds for the error rates of U MA: on the one hand, \(f_{\varvec{y}}\) corresponds to a worst-case situation, as we simply ignore the confusion matrix and use the regular Perceptron instead (arguably, U MA should perform better than this); on the other hand, \(f_{\text {full}}\) represents the best-case scenario for learning, when all the correct labels are available (the performance of \(f_{\text {full}}\) should always top that of U MA and of the other classifiers). The last two classifiers, \(f_{\text {conf}}\) and \(f_{\mathtt{S3VM}}\), provide us with objective comparison measures. They are learned from the same data as U MA but use them differently: \(f_{\text {conf}}\) is learned from the reduced training set \({\mathcal {S}}_{\text {conf}}\), whereas \(f_{\mathtt{S3VM}}\) is output by a semi-supervised learning strategy that infers both \(f_{\mathtt{S3VM}}\) and the missing labels of \({\mathcal {S}}\) while entirely ignoring the predictions \(\varvec{y}\) made by \(g\). Note that, according to the learning scenario we implement, we assume \(C\) to be estimated from raw data. This might not always be the case with real-world problems, and \(C\) might be easier and/or less expensive to get than raw data; for instance, it might be deduced from expert knowledge on the studied domain. In that case, \(f_{\text {conf}}\) and \(f_{\mathtt{S3VM}}\) may suffer from not taking full advantage of the accurate information about the confusion.
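For reference, the baseline \(f_{\varvec{y}}\) is a regular Perceptron. The sketch below shows the standard multiclass Perceptron update (on a mistake, add the example to the weight vector of the given label and subtract it from that of the predicted label). It is a generic textbook version under assumptions of our own (function names, no bias term, ties broken in favour of the lowest class index, toy data), not the U MA update and not the authors' code.

```python
def predict(W, x):
    """Return the index of the class whose weight vector scores x highest."""
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in W]
    return max(range(len(W)), key=scores.__getitem__)


def train_multiclass_perceptron(X, y, Q, epochs=10):
    """Standard additive multiclass Perceptron, one weight vector per class."""
    d = len(X[0])
    W = [[0.0] * d for _ in range(Q)]
    for _ in range(epochs):
        mistakes = 0
        for x, label in zip(X, y):
            pred = predict(W, x)
            if pred != label:
                # Promote the given label, demote the wrongly predicted one.
                for j in range(d):
                    W[label][j] += x[j]
                    W[pred][j] -= x[j]
                mistakes += 1
        if mistakes == 0:  # a full pass without error: stop early
            break
    return W


# Toy, linearly separable data with three classes (made up for illustration).
X = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0], [3.0, 1.0], [1.0, 3.0], [-1.0, -3.0]]
y = [0, 1, 2, 0, 1, 2]
W = train_multiclass_perceptron(X, y, Q=3)
```

When the labels passed as `y` are the noisy ones, this procedure has no way of accounting for the noise, which is why \(f_{\varvec{y}}\) serves as the worst-case reference.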

#### 4.2.2 Datasets

Our simulations are conducted on three different datasets, each with different features. For the sake of reproducibility, we used datasets that can be easily found on the *UCI Machine learning repository* (Bache and Lichman 2013). Moreover, these datasets correspond to tasks for which generating a complete, labelled training set is typically costly because of the necessity of human supervision, and is subject to classification noise. The datasets used and their main features are as follows.

*Optical recognition of handwritten digits* This well-known dataset is composed of \(8\times 8\) grey-level images of handwritten digits, ranging from 0 to 9. The dataset is composed of 3823 images of 64 features for training, and 1797 for the test phase. We set \(m\) to 10 for this dataset, which means that \(g\) is learned from 100 examples only. \({\mathcal {S}}_{\text {conf}}\) is a sampling of 5 % of \({\mathcal {S}}\). The classes are evenly distributed (see Fig. 2a). We handle the nonlinearity through the use of a Gaussian kernel-PCA (see Sect. 3.6) to project the data onto a feature space of dimension 640.

*Letter recognition* The Letter Recognition dataset is another well-known pattern recognition dataset. The images of the letters are summarized into a vector of 16 attributes, which correspond to various primitives computed on the raw data. With 20,000 examples, this dataset is much larger than the previous one. As for the handwritten digits dataset, the examples are evenly spread across the 26 classes (see Fig. 2b). We uniformly select 15,000 examples for training and the remaining 5000 are used for testing. We set \(m\) to 50 as it seems that smaller values do not yield usable confusion matrices. We again sample 5 % of the dataset to form \({\mathcal {S}}_{\text {conf}}\) and use, as before, a Gaussian Kernel-PCA to (nonlinearly) expand the dimension of the data to 1600.

*Reuters* The Reuters dataset is a nearly linearly separable document categorization dataset of more than 300,000 instances of nearly 47,000 features each. For size reasons we restrict ourselves to roughly 15,000 examples for training, and another 15,000 for testing. It turns out that some classes are so underrepresented that they are flooded by the noise process and/or do not appear in \({\mathcal {S}}_{\text {conf}}\), which may lead to a non-invertible confusion matrix. We therefore restrict the dataset to the nine largest classes. One might wonder whether doing so erases class imbalance. This is not the case as, even this way, the least represented class accounts for roughly 500 examples while this number reaches nearly 4000 for the most represented one (see Fig. 2c). Actually, these 9 classes represent more than 70 % of the dataset, reducing the training and test sets to approximately 11,000 examples each. We do not use any kernel for this dataset, the data being already nearly linearly separable. Also, we sample \({\mathcal {S}}_{\text {conf}}\) on 5 % of the training set and we set \(m = 20\).

#### 4.2.3 Results

Table 1 presents the misclassification error rates averaged over 10 runs. Keep in mind that we have not conducted a very thorough optimization of the hyper-parameters, as the point here is essentially to compare U MA with the other algorithms. Additionally, we report the error rates of \(f_{\mathtt{S3VM}}\) when trained on the kernelized data with all dimensions, that is, the kernelized data before we project them onto their \(D\) principal components. Because the projection step is indeed unnecessary with S3VM, this will give us insights on the error due to the Kernel-PCA step. Comparing the first and fifth columns of Table 1, it appears that U MA always induces a slight performance gain, i.e. a decrease of the misclassification rate, with respect to \(f_{\varvec{y}}\).

Table 1 Misclassification rates of different algorithms

Dataset | \(f_{\varvec{y}}\) | \(f_{\text {conf}}\) | \(f_{\text {full}}\) | \(f_{\mathtt{S3VM}}\) | U MA | \(f_{\mathtt{S3VM}}\) (no K-PCA) |
---|---|---|---|---|---|---|
Handwritten digits | \(0.25\) | \(0.21\) | \(0.04\) | \(0.15\) | \(0.16\) | \(0.07\) |
Letter recognition | \(0.35\) | \(0.36\) | \(0.23\) | \(0.49\) | \(0.33\) | \(0.18\) |
Reuters | \(0.30\) | \(0.17\) | \(0.01\) | \(0.22\) | \(0.21\) | \(0.22\) |

Comparing U MA and \(f_{\text {conf}}\) in Table 1 (fifth and second columns), we observe that U MA achieves lower misclassification rates on the Handwritten Digits and Letter Recognition datasets but a higher misclassification rate on Reuters. This is likely related to the strong class imbalance in the latter dataset. Indeed, some classes are overly represented, accounting for the vast majority of the whole dataset (see Fig. 2c). Because \({\mathcal {S}}_{\text {conf}}\) is uniformly sampled from the main dataset, \(f_{\text {conf}}\) is trained with a lot of examples from the overrepresented classes and is therefore very effective for these classes, in the sense that it achieves a low misclassification rate on them; this, in turn, induces a low global misclassification rate, as possibly high misclassification rates on underrepresented classes are offset by the small portion of the data these classes account for. On the other hand, because of this disparity in class representation, the slightest error in the confusion matrix, provided it involves one of these overrepresented classes, may lead to a significant increase of the misclassification rate. In this regard, U MA is strongly disadvantaged with respect to \(f_{\text {conf}}\) on the Reuters dataset, which likely explains the reported results.

Nonetheless, the disparities between U MA and \(f_{\text {conf}}\) deserve more attention. Indeed, the same data are being used by both algorithms, and one could expect more closeness in the results. To get a better insight on what is occurring, we have reported the evolution of the error rate of these two algorithms with respect to the sampling size of \({\mathcal {S}}_{\text {conf}}\) in Fig. 3. We can see that U MA is unaffected by the size of the sample, essentially ignoring the possible errors in the confusion matrix on small samples. This reinforces our previous results showing that U MA is robust to errors in the confusion matrix. On the other hand, with the addition of more samples, the refinement of the confusion matrix does not allow U MA to compete with the value of additional (correctly) labelled data and eventually, when the size of \({\mathcal {S}}_{\text {conf}}\) grows, \(f_{\text {conf}}\) performs better than U MA. This points towards the idea that the aggregated nature of the confusion matrix incurs some loss of relevant information for the classification task at hand, and that a more accurate estimate of the confusion matrix, as induced by, e.g., the use of larger \({\mathcal {S}}_{\text {conf}}\), may not compensate for the information provided by additional raw data.

Beyond this, it is important to recall that U MA never uses the labels of \({\mathcal {S}}_{\text {conf}}\) (those are only used to estimate the confusion matrix, not the classifier—refer to Sect. 4.2.1 for the detailed learning protocol). While refining the estimation of \(C\) is undoubtedly useful, a direction toward substantial performance gains should revolve around the combination of both this refined estimation of \(C\) *and* the use of the correctly labelled training set \({\mathcal {S}}_{\text {conf}}\). This is a research subject on its own that we leave for future work.

All in all, the reported results suggest preferring U MA over other available methods when the amount of labelled data is particularly small, in addition, obviously, to the motivating case of the present work where the training data are corrupted and the confusion matrix is known. Another interesting finding is that even a rough estimation of the confusion matrix is sufficient for U MA to behave well.

Finally, we investigate the impact of the selection strategy of \((p,q)\) on the convergence speed of U MA (see Sect. 3.5). We use three variations of U MA with different strategies for selecting \((p,q)\) (error, confusion, and random) and monitor each one along the learning process on the Reuters dataset. The error and confusion strategies are described in Sect. 3.5 and the random strategy simply selects \(p\) and \(q\) at random.

As one might expect, the confusion-based strategy performs better than the error-based strategy when the confusion rate is retained as a performance measure, while the converse holds when using the error rate. This observation motivates us to thoroughly study the confusion-based strategy in the near future, as being able to propose methods robust to class imbalance is a particularly interesting challenge of multiclass classification.

The plateau reached around the 30th iteration may be puzzling, since the studied dataset presents no positive margin and convergence is therefore not guaranteed. One possible explanation for this is to see the Reuters dataset as a linearly separable problem corrupted by the effect of a noise process, which we call the *intrinsic noise process*, that has structural features ‘compatible’ with the classification noise. By this, we mean that there must be features of the intrinsic noise such that, when additional classification noise is added, the resulting noise that characterizes the data is similar to a classification noise, or at least, to a noise that can be naturally handled by U MA. Finding out the family of noise processes that can be combined with the classification noise (or, more generally, the family of noise processes themselves) without hindering the effectiveness of U MA is one research direction that we aim to explore in the near future.

## 5 Conclusion

In this paper, we have proposed a new algorithm, U MA (for Unconfused Multiclass Additive algorithm), to cope with noisy training examples in multiclass linear problems. As its name indicates, it is a learning procedure that extends the (ultraconservative) additive multiclass algorithms proposed by Crammer and Singer (2003); to handle noisy datasets, it only requires the information about the confusion matrix that characterizes the mislabelling process. This is, to the best of our knowledge, the first time the confusion matrix is used as a way to handle noisy labels in multiclass problems.

One of the core ideas behind U MA, namely, the computation of the update vector \(\varvec{z}_{pq}\), is not tied to the additive update scheme. Thus, as long as the assumption of linear separability holds, the very same idea can be used to render a wide variety of algorithms robust to noise by iteratively generating a noise-free training set with the consecutive values of \(\varvec{z}_{pq}\). However, every computation of a new \(\varvec{z}_{pq}\) requires learning a new classifier to start with. This may eventually incur prohibitive computational costs when applied to batch methods (as opposed to online methods), which are designed to process the entirety of the dataset at once.^{2}

U MA takes advantage of the online scheme of additive algorithms and avoids this problem completely. Moreover, additive algorithms are designed to directly handle multiclass problems rather than having recourse to a bi-class mapping. The end results of this are tightened theoretical guarantees and a convergence rate that does not depend on \(Q\), the number of classes. Besides, U MA can be directly used with any additive algorithm, allowing noise to be handled with multiple methods without further computational burden.

While we provide a sample complexity analysis, it should be noted that a tighter bound can be derived with specific multiclass tools, such as the Natarajan dimension (see Daniely et al. 2011 for example), which allows one to better specify the expressiveness of a multiclass classifier. However, this is not the main focus of this paper and our results are based on simpler tools.

To complement this work, we want to investigate a way to properly tackle near-linear problems (such as Reuters). For now, the algorithm already does a very good job thanks to its noise robustness. However, more work has to be done to derive a proper way to handle cases where a perfect classifier does not exist. We think there are great avenues for interesting research in this domain with an algorithm like U MA and we are curious to see how the present work may carry over to more general problems.

## Footnotes

1. For the sake of self-containedness, we use U MA for this task (with \(C\) being the identity matrix). Recall that, when used this way, U MA acts as a regular Perceptron algorithm.
2. Nonetheless, from a purely theoretical point of view, U MA makes at most \(O(1/\theta ^2)\) mistakes (see Proposition 4) and computing \(\varvec{z}_{pq}\) can be done in \(O(n)\) time. Therefore, polynomial batch methods do not suffer much from this as their overall execution time is still polynomial.
3. Note that in some references the right-hand side of (24) might be viewed as a probability measure over \(m\) independent Rademacher variables.

## Notes

### Acknowledgments

The authors would like to thank the anonymous reviewers for their insightful and extremely valuable feedback on earlier versions of this paper. This work is partially supported by the Agence Nationale de la Recherche (ANR), project GRETA 12-BS02-004-01.

## References

- Bach, F. R., & Jordan, M. I. (2002). Kernel independent component analysis. *Journal of Machine Learning Research*, *3*, 1–48.
- Bache, K., & Lichman, M. (2013). UCI machine learning repository. http://archive.ics.uci.edu/ml
- Bennett, K. P., & Demiriz, A. (1998). Semi-supervised support vector machines. In *Advances in Neural Information Processing Systems, Vol. 11, Papers from Neural Information Processing Systems (NIPS) 1998* (pp. 368–374), Denver, CO, USA. http://papers.nips.cc/paper/1582-semi-supervised-support-vectormachines.
- Blanchard, G., & Zwald, L. (2008). Finite-dimensional projection for classification and statistical learning. *IEEE Transactions on Information Theory*, *54*(9), 4169–4182.
- Block, H. (1962). The perceptron: A model for brain functioning. *Reviews of Modern Physics*, *34*, 123–135.
- Blum, A., Frieze, A. M., Kannan, R., & Vempala, S. (1996). A polynomial-time algorithm for learning noisy linear threshold functions. In *Proceedings of 37th IEEE symposium on foundations of computer science* (pp. 330–338).
- Bylander, T. (1994). Learning linear threshold functions in the presence of classification noise. In *Proceedings of 7th annual workshop on computational learning theory* (pp. 340–347). New York, NY: ACM Press.
- Cohen, E. (1997). Learning noisy perceptrons by a perceptron in polynomial time. In *Proceedings of 38th IEEE symposium on foundations of computer science* (pp. 514–523).
- Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., & Singer, Y. (2006). Online passive-aggressive algorithms. *Journal of Machine Learning Research*, *7*, 551–585.
- Crammer, K., & Singer, Y. (2003). Ultraconservative online algorithms for multiclass problems. *Journal of Machine Learning Research*, *3*, 951–991.
- Cristianini, N., & Shawe-Taylor, J. (2000). *An introduction to support vector machines and other kernel-based learning methods*. Cambridge: Cambridge University Press.
- Daniely, A., Sabato, S., Ben-David, S., & Shalev-Shwartz, S. (2011). Multiclass learnability and the ERM principle. *Journal of Machine Learning Research Proceedings Track*, *19*, 207–232.
- Dekel, O., Shalev-Shwartz, S., & Singer, Y. (2005). The forgetron: A kernel-based perceptron on a fixed budget. In *Advances in Neural Information Processing Systems, Vol. 18, Papers from Neural Information Processing Systems (NIPS) 2005* (pp. 259–266), Vancouver, BC, Canada. http://papers.nips.cc/paper/2806-the-forgetron-a-kernel-basedperceptron-on-a-fixed-budget.
- Devroye, L., Györfi, L., & Lugosi, G. (1996). *A probabilistic theory of pattern recognition*. Berlin: Springer.
- Drineas, P., Kannan, R., & Mahoney, M. W. (2006). Fast Monte Carlo algorithms for matrices II: Computing a low rank approximation to a matrix. *SIAM Journal on Computing*, *36*(1), 158–183.
- Drineas, P., & Mahoney, M. W. (2005). On the Nyström method for approximating a Gram matrix for improved kernel-based learning. *Journal of Machine Learning Research*, *6*, 2153–2175.
- Freund, Y., & Schapire, R. E. (1999). Large margin classification using the perceptron algorithm. *Machine Learning*, *37*(3), 277–296.
- Friess, T., Cristianini, N., & Campbell, N. (1998). The kernel-adatron algorithm: A fast and simple learning procedure for support vector machines. In J. Shavlik (Ed.), *Machine learning: Proceedings of the 15th international conference*. Morgan Kaufmann Publishers.
- Kakade, S. M., Shalev-Shwartz, S., & Tewari, A. (2008). Efficient bandit algorithms for online multiclass prediction. In *Proceedings of the 25th international conference on machine learning, ICML ’08* (pp. 440–447). New York, NY: ACM.
- Kearns, M. J., & Vazirani, U. V. (1994). *An introduction to computational learning theory*. Cambridge: MIT Press.
- Louche, U., & Ralaivola, L. (2013). Unconfused ultraconservative multiclass algorithms. In *JMLR workshop & conference proceedings 29* (Proceedings of ACML 13) (pp. 309–324).
- Minsky, M., & Papert, S. (1969). *Perceptrons: An introduction to computational geometry*. Cambridge: MIT Press.
- Novikoff, A. (1963). On convergence proofs for perceptrons. In *Proceedings of the symposium on the mathematical theory of automata* (Vol. 12, pp. 615–622).
- Ralaivola, L. (2012). Confusion-based online learning and a passive-aggressive scheme. In *NIPS* (pp. 3293–3301).
- Ralaivola, L., Favre, B., Gotab, P., Bechet, F., & Damnati, G. (2011). Applying multiclass bandit algorithms to call-type classification. In *ASRU* (pp. 431–436).
- Schölkopf, B., & Smola, A. J. (2002). *Learning with kernels, support vector machines, regularization, optimization and beyond*. MIT University Press. http://www.learning-with-kernels.org
- Stempfel, G., & Ralaivola, L. (2007). Learning kernel perceptron on noisy data and random projections. In *Proceedings of algorithmic learning theory (ALT 07)*.
- Takerkart, S., & Ralaivola, L. (2011). MKPM: A multiclass extension to the kernel projection machine. In *CVPR* (pp. 2785–2791). http://dblp.uni-trier.de/db/conf/cvpr/cvpr2011.html#TakerkartR11
- Valiant, L. (1984). A theory of the learnable. *Communications of the ACM*, *27*, 1134–1142.
- Williams, C. K. I., & Seeger, M. (2000). Using the Nyström method to speed up kernel machines. In *Advances in Neural Information Processing Systems, Vol. 13, Papers from Neural Information Processing Systems (NIPS) 2000* (pp. 682–688), Denver, CO, USA. http://papers.nips.cc/paper/1866-using-the-nystrom-method-to-speed-upkernel-machines.