1 Introduction

Both online learning and active learning have been extensively studied in machine learning and data mining (Freund et al. 1997; McCallum and Nigam 1998; Balcan et al. 2006, 2007; Cesa-Bianchi and Lugosi 2006; Crammer et al. 2006; Castro and Nowak 2007; Zhao and Hoi 2010; Hoi et al. 2014). In a traditional online learning task (e.g., online classification), a learner is trained in a sequential manner to predict the class labels of a sequence of instances as accurately as possible. Specifically, at each round of a typical online learning task, the learner first receives an incoming instance and then makes a prediction of its class label. After that, it is assumed to always receive the true class label from an oracle, which can be used to measure the loss incurred by the learner’s prediction and to update the learner if necessary. In many real-world applications, especially those that mine sequentially arriving data (e.g., spam email filtering), acquiring the true class labels from an oracle is often time-consuming and costly due to the unavoidable interaction between the learner and the environment. This has motivated the recent study of Online Active Learning (Cesa-Bianchi et al. 2006; Dasgupta et al. 2009; Cesa-Bianchi and Lugosi 2006; Sculley 2007), which explores active learning strategies in an online learning setting so that the learner need not request the class labels of all incoming instances.

A pioneering and state-of-the-art technique for online active learning is known as the Label Efficient Perceptron (Cesa-Bianchi and Lugosi 2006), the Selective Sampling Perceptron (Cesa-Bianchi et al. 2006; Cavallanti et al. 2009), or Perceptron-based Active Learning (Dasgupta et al. 2009). In particular, consider an online classification task: when a learner receives an incoming instance \(\mathbf {x}_t\), it first makes a prediction \(\hat{y}_t=\mathrm {sign}(f(\mathbf {x}_t))\) where \(f(\mathbf {x}_t)=\mathbf {w}_t\cdot \mathbf {x}_t\), and then uses a stochastic approach to decide whether to query the class label. The query probability decreases as the prediction confidence (the magnitude of the margin) grows, i.e., it equals \(\delta /(\delta +|f(\mathbf {x}_t)|)\) where \(\delta \) is a positive smoothing constant. If no class label is queried, the learner makes no update; otherwise, it acquires the true label \(y_t\) from the environment and follows the regular Perceptron algorithm to update the model (i.e., the learner updates the model if and only if the instance is misclassified according to the true class label). We summarize the common framework of online active learning algorithms in Algorithm 1.

Algorithm 1: Framework of online active learning algorithms (figure a)
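The Perceptron-based online active learning scheme described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the `seed` argument, the convention that a zero margin is predicted as +1, and the use of NumPy are all assumptions made here.

```python
import numpy as np


def label_efficient_perceptron(X, y, delta=1.0, seed=0):
    """Sketch of Perceptron-based online active learning.

    X: (T, d) array of instances; y: (T,) array of labels in {-1, +1}.
    Returns the final weights, the number of label queries, and the
    number of prediction mistakes over the whole sequence.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n_queries = 0
    n_mistakes = 0
    for x_t, y_t in zip(X, y):
        f_t = float(w @ x_t)
        y_hat = 1 if f_t >= 0 else -1     # assumption: sign(0) is taken as +1
        n_mistakes += int(y_hat != y_t)   # counted for evaluation only
        # Query the label with probability delta / (delta + |f_t|).
        if rng.random() < delta / (delta + abs(f_t)):
            n_queries += 1
            if y_hat != y_t:              # Perceptron: update only on a mistake
                w = w + y_t * x_t
    return w, n_queries, n_mistakes
```

Note that a queried instance that is classified correctly leaves `w` unchanged, which is exactly the behaviour the text identifies as wasteful.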

In the above Perceptron-based active learning, if an incoming instance is predicted with low confidence by the current model, the learner is very likely to query its class label. However, if the instance turns out to be correctly classified according to the acquired true label, it is discarded and never used to update the learner, following the principle of the Perceptron algorithm. This is a critical limitation, since the effort spent on requesting such class labels is wasted. To overcome this limitation, we present a new scheme for online active learning, i.e., Passive-Aggressive Active (PAA) learning, which explores the principle of passive-aggressive learning (Crammer et al. 2006). It not only decides when the learner should make a query, but also attempts to fully exploit the potential of every queried instance for updating the classification model. To tackle different kinds of machine learning tasks, we propose several variants of the PAA algorithms, i.e., the PAA algorithm for online binary classification tasks, the Multi-class Passive-Aggressive Active (MPAA) learning algorithm for online multi-class classification tasks, and the Cost-Sensitive Passive-Aggressive Active learning (CSPAA) algorithm for online binary classification with extremely unbalanced data. We theoretically analyse the mistake bounds of the proposed algorithms and conduct extensive experiments to examine their empirical performance; the results show clear advantages of our algorithms over the baselines.

The rest of this paper is organized as follows. Section 2 reviews background and related work. Section 3 presents the proposed framework of Passive-Aggressive Active-learning (PAA) algorithms for online binary classification tasks and analyzes the mistake bounds of the proposed algorithms. Section 4 extends the proposed framework for online multi-class classification by presenting a family of multi-class PAA algorithms (MPAA). Section 5 extends the proposed learning framework to tackle cost-sensitive classification by presenting cost-sensitive PAA (CSPAA) algorithms. Section 6 discusses our empirical studies and the applications of our technique to large-scale real-world online learning tasks, and finally Sect. 7 concludes this work.

2 Related work

Our work is closely related to three major categories of machine learning studies: online learning, online active learning and cost-sensitive classification. Below we briefly review some representative related work in each category.

2.1 Online learning

Online learning represents a family of efficient and scalable machine learning algorithms (Hoi et al. 2014; Rosenblatt 1958; Crammer and Singer 2003; Cesa-Bianchi et al. 2004; Crammer et al. 2006; Zhao et al. 2011; Wang et al. 2012). Unlike conventional batch learning methods, which assume that all training instances are available prior to the learning task, online learning updates the predictive model sequentially, which is more appropriate for large-scale applications where training data often arrive sequentially. A variety of online learning methods have been proposed in the machine learning literature. A classical online learning method is the Perceptron algorithm (Rosenblatt 1958; Freund and Schapire 1999), which updates the model by adding a new example with some constant weight into the current set of support vectors when the example is misclassified. More recently, many online learning algorithms have been developed based on the criterion of maximum margin (Crammer and Singer 2003; Gentile 2001; Kivinen et al. 2001; Crammer et al. 2006; Li and Long 1999). One notable technique in this category is the online Passive-Aggressive (PA) learning method (Crammer et al. 2006), which updates the classification function when a new example is misclassified or its classification score does not exceed some predefined margin. PA algorithms have proved to be a very successful and popular online learning technique for many real-world applications. Finally, we note that a number of other online learning algorithms have been proposed recently, such as second-order online learning (Crammer et al. 2008, 2009; Wang et al. 2012), which make more accurate predictions and often converge faster than first-order algorithms.

Most of the above online learning methods are supervised and passive. One major weakness of such supervised passive online learning (Hoi et al. 2014) is the unrealistic assumption that the class label of every incoming instance is requested or made available at the end of every iteration. This limits the application of online learning techniques to many real-world tasks where class labels may not always be available or may be expensive to collect.

2.2 Online active learning

Online active learning algorithms have emerged to address the main problem of the conventional supervised online learning approach, i.e., its strong dependence on labeled data. Online active learning proceeds in iterations. At each iteration, one unlabelled instance is presented to the learner, and the learner needs to decide whether to query its label. If the label is queried, the learner can use the labelled instance to update the model; otherwise the model is kept unchanged.

Specifically, there are two kinds of settings for online active learning: the selective sampling setting (Cavallanti et al. 2009; Cesa-Bianchi et al. 2009; Dekel et al. 2010; Orabona and Cesa-Bianchi 2011) and the label efficient learning setting. They differ in two main aspects. Firstly, in the selective sampling setting the instances are drawn randomly from a fixed distribution, while in the label efficient setting the instances can be generated adversarially. Secondly, the label efficient model must make predictions on those instances whose labels are not requested, while the selective sampling models are concerned with the generalization error rather than the performance of the algorithm on the sequence of instances. Our work belongs to the second category, i.e., the label efficient setting. One of the most representative existing works in the label efficient learning setting is the Label Efficient Perceptron algorithm, where the probability of querying the label is decided by the classification confidence. Following a similar setting, many variants of this algorithm have been proposed, including the Adaptive Label Efficient Perceptron, the Label Efficient Second-Order Perceptron (Cesa-Bianchi et al. 2006), the Adaptive Label Efficient Second-Order Perceptron and Label Efficient Winnow (Cesa-Bianchi et al. 2006). Although extensively studied, the existing online active learning algorithms still suffer from a serious limitation: the effort of querying a correctly classified instance is wasted due to the adoption of the Perceptron update strategy.

In this work, we apply the popular PA algorithm to the online active learning task. Our approach enjoys two advantages. On the one hand, unlike the regular PA setting, which assumes that every class label will be revealed, our approach queries the class labels of only a limited number of incoming instances through active learning. On the other hand, our updating strategy fully exploits the potential of every queried instance and thus achieves superior performance compared to existing online active learning algorithms.

In addition, it is important to note that our work in online active learning is highly related to but different from active learning in data streams (Zhu et al. 2007, 2010; Žliobaitė et al. 2011). Both attempt to achieve the highest prediction accuracy while querying the fewest instance labels when data arrive sequentially. However, online active learning algorithms focus on the updating strategy when a new instance arrives and may discard past instances for efficient learning, whereas active learning in data streams must preserve the data distribution in order to detect potential concept drift (Zliobaite et al. 2014).

2.3 Cost-sensitive classification

Cost-sensitive classification has been extensively studied in data mining and machine learning (Liu and Zhou 2006; Zadrozny et al. 2003; Zhu and Wu 2006). To evaluate classifiers in this setting, researchers have proposed a variety of cost-sensitive metrics. Well-known examples include the weighted sum of sensitivity and specificity (Jiawei and Kamber 2001; Brodersen et al. 2010), and the weighted misclassification cost that takes cost into consideration when measuring classification performance (Elkan 2001; Akbani et al. 2004). As a special case, when the weights are both equal to 0.5, the weighted sum of sensitivity and specificity reduces to the well-known balanced accuracy (Brodersen et al. 2010). Although many algorithms have been proposed, most of them operate in the batch setting.
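As an illustration of the first metric mentioned above, the weighted sum of sensitivity and specificity can be computed as in the following sketch. The function name and the parameter `eta_p` (the weight on sensitivity, with `1 - eta_p` on specificity) are hypothetical names introduced here, and the sketch assumes that the two weights sum to one.

```python
def weighted_sum_sens_spec(y_true, y_pred, eta_p=0.5):
    """Weighted sum of sensitivity and specificity for labels in {-1, +1}.

    eta_p weights sensitivity and (1 - eta_p) weights specificity;
    eta_p = 0.5 gives the balanced accuracy.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == -1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == -1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == -1 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return eta_p * sensitivity + (1.0 - eta_p) * specificity
```

On an unbalanced sample, this measure weighs errors on the rare class more heavily than plain accuracy does, which is why it is preferred when one class dominates.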

Recently, a few works have attempted to address online cost-sensitive classification problems. Perceptron Algorithms with Uneven Margin (PAUM) (Li et al. 2002) is an extension of the Perceptron with Margin (PAM) algorithm (Krauth and Mézard 1987), where the classifier is updated whenever the classification margin is smaller than a predefined threshold. The PAUM algorithm achieves a cost-sensitive update by setting different thresholds for different class labels. Cost-Sensitive Passive-Aggressive (CPA) (Crammer et al. 2006) was proposed as a variant of the PA algorithms, where the loss function is the sum of a margin-based term and a constant depending on the mistake type. Although both of the above algorithms can achieve cost-sensitive updating, one main drawback is that they are not designed to optimize a cost-sensitive measure directly. More recently, algorithms such as Cost-Sensitive Online Learning (Wang et al. 2014) and Online AUC maximization (Zhao et al. 2011) have been proposed to address this drawback by directly solving an optimization problem that targets the cost-sensitive measure of interest. Our work follows the idea of cost-sensitive learning but extends it to the active learning setting, which greatly reduces the labor of labeling a huge number of instances. Finally, we note that this journal article is an extension of our previous conference papers (Lu et al. 2014; Zhao and Hoi 2013).

3 Passive-Aggressive Active learning

3.1 Problem formulation and background review

We first introduce the problem setting of a regular online binary classification task. Let \(\{(\mathbf {x}_t,y_t)|\ t=1,\ldots ,T\}\) be a sequence of input patterns for online learning, where each instance \(\mathbf {x}_t \in \mathbb {R}^d\) received at the tth trial is a d-dimensional vector and \(y_t\in \{-1,+1\}\) is its true class label. The goal of online binary classification is to learn a linear classifier \(\hat{y}_t=\mathrm {sign}(\mathbf {w}_t\cdot \mathbf {x}_t)\) where \(\mathbf {w}_t\in \mathbb {R}^d\) is the weight vector at the tth round.

For the Perceptron algorithm (Rosenblatt 1958; Freund and Schapire 1999), a learner first receives an incoming instance \(\mathbf {x}_t\) at the tth round; it then makes a prediction based on the current classifier \(\mathbf {w}_t\); finally the true class label \(y_t\) is disclosed. If the prediction is correct, i.e., \(\hat{y}_t=y_t\), no update is applied to the learner; otherwise, Perceptron updates the model with the misclassified instance \((\mathbf {x}_t,y_t)\):

$$\begin{aligned} \mathbf {w}_{t+1}\leftarrow \mathbf {w}_{t} + y_t\mathbf {x}_t \end{aligned}$$

Unlike Perceptron, which updates the model only when a misclassification occurs, the Passive-Aggressive (PA) algorithms (Crammer et al. 2006) make an update whenever the loss function \(\ell (\mathbf {w}_t;(\mathbf {x}_t,y_t))\) is nonzero, e.g., one can choose the hinge loss \( \ell (\mathbf {w}_t;(\mathbf {x}_t,y_t))=\max (0, 1-y_t\mathbf {w}_t\cdot \mathbf {x}_t)\). In particular, PA algorithms obtain the updated model \(\mathbf {w}_{t+1}\) by solving one of three variants of the optimization task:

$$\begin{aligned} \arg \min _{\mathbf {w}} F(\mathbf {w})=\left\{ \begin{array}{ll} \frac{1}{2}\Vert \mathbf {w}-\mathbf {w}_t\Vert ^2\ \text {s.t.}\ \ell _t(\mathbf {w};(\mathbf {x}_t,y_t))=0,&{}\quad \text {(PA)}\\ \frac{1}{2}\Vert \mathbf {w}-\mathbf {w}_t\Vert ^2+C\ell _t(\mathbf {w};(\mathbf {x}_t,y_t)),&{}\quad \text {(PA-I)}\\ \frac{1}{2}\Vert \mathbf {w}-\mathbf {w}_t\Vert ^2+C\ell _t(\mathbf {w};(\mathbf {x}_t,y_t))^2,&{}\quad \text {(PA-II)} \end{array}\right. \end{aligned}$$

where \(C>0\) is a penalty cost parameter. The closed-form solutions can be derived for the above optimizations, i.e., \(\mathbf {w}_{t+1}\leftarrow \mathbf {w}_t + \tau _t y_t\mathbf {x}_t\), where the stepsize \(\tau _t\) is computed respectively as follows:

$$\begin{aligned} \tau _t=\left\{ \begin{array}{ll} \ell _t(\mathbf {w}_t;(\mathbf {x}_t,y_t))/\Vert \mathbf {x}_t\Vert ^2,&{}\quad \text {(PA)}\\ \min (C, \ell _t(\mathbf {w}_t;(\mathbf {x}_t,y_t))/\Vert \mathbf {x}_t\Vert ^2),&{}\quad \text {(PA-I)}\\ \ell _t(\mathbf {w}_t;(\mathbf {x}_t,y_t))/(\Vert \mathbf {x}_t\Vert ^2+1/(2C)).&{}\quad \text {(PA-II)} \end{array} \right. \end{aligned}$$
(1)

Thus, PA algorithms are more aggressive in updating models than Perceptron.
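The closed-form step sizes in Eq. (1) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names are hypothetical, NumPy is assumed, and the instance `x` is assumed to be nonzero so that the division by its squared norm is well defined.

```python
import numpy as np


def hinge_loss(w, x, y):
    """Hinge loss max(0, 1 - y * (w . x))."""
    return max(0.0, 1.0 - y * float(w @ x))


def pa_step_size(loss, x, C=1.0, variant="PA"):
    """Step size tau_t of Eq. (1) for the three PA variants."""
    sq_norm = float(x @ x)  # assumes x is not the zero vector
    if variant == "PA":
        return loss / sq_norm
    if variant == "PA-I":
        return min(C, loss / sq_norm)
    if variant == "PA-II":
        return loss / (sq_norm + 1.0 / (2.0 * C))
    raise ValueError(f"unknown variant: {variant}")


def pa_update(w, x, y, C=1.0, variant="PA"):
    """One PA update: w + tau * y * x, applied whenever the loss is nonzero."""
    loss = hinge_loss(w, x, y)
    if loss == 0.0:
        return w  # passive step: the margin is already at least 1
    tau = pa_step_size(loss, x, C=C, variant=variant)
    return w + tau * y * x
```

For example, with `w = [0.1, 0]`, `x = [1, 0]` and `y = +1`, the instance is classified correctly but with margin 0.1, so Perceptron would leave `w` unchanged, whereas the hard-margin PA update moves `w` until the margin equals 1.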

3.2 Passive-Aggressive Active learning algorithms

In this section, we aim to develop new algorithms for online active learning. Unlike conventional online learning (Rosenblatt 1958) and pool-based active learning (McCallum and Nigam 1998; Tong and Koller 2002), the key challenges to an online active learning task are two-fold: (i) when a learner should query the class label of an incoming instance, and (ii) when the class label is queried and disclosed, how to exploit the labeled instance to update the learner in an effective way. We propose Passive-Aggressive Active (PAA) learning to tackle the above challenges. In particular, the PAA algorithms adopt a simple yet effective randomized rule to decide whether the label of an incoming instance should be queried, and employ state-of-the-art PA algorithms to exploit the labeled instance for updating the online learner.

In particular, for an incoming instance \(\mathbf {x}_t\) at the tth round, the PAA algorithm first computes its prediction margin, i.e.,

$$\begin{aligned} p_t=\mathbf {w}_t\cdot \mathbf {x}_t, \end{aligned}$$

by the current classifier, and then decides if the class label should be queried according to a Bernoulli random variable \(Z_t\in \{0,1\}\) with probability equal to

$$\begin{aligned} \delta /(\delta +|p_t|), \end{aligned}$$

where \(\delta \ge 1\) is a smoothing parameter. Such an approach is similar to the idea of margin-based active learning (Tong and Koller 2002; Balcan et al. 2007) and has been adopted in previous work (Cesa-Bianchi et al. 2006; Dasgupta et al. 2009). If the outcome is \(Z_t=0\), the class label is not queried and the learner is not updated; otherwise, the class label is queried and the outcome \(y_t\) is disclosed. Whenever the class label of an incoming instance is queried, the PAA algorithms make the best effort to exploit the potential of this example for updating the learner. Specifically, they adopt the PA algorithms to update the linear classification model \({\mathbf {w}}_{t+1}\) according to Eq. (1). Based on the different updating strategies, we name the hard margin algorithm PAA and the two soft margin algorithms PAA-I and PAA-II. This overcomes the limitation of the Perceptron-based active learning algorithm, which updates only on misclassified instances and wastes a large number of correctly classified instances with low prediction confidence that could be beneficial for improving the classifier. Finally, we summarize the detailed steps of the proposed PAA algorithms in Algorithm 2.

Algorithm 2: Passive-Aggressive Active learning (PAA) algorithms (figure b)
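The steps described in this section can be sketched as follows. This is a minimal illustration under stated assumptions rather than a reproduction of Algorithm 2: the function name, the `seed` argument, the convention that a zero margin is predicted as +1, the assumption of nonzero instances, and the use of NumPy are introduced here.

```python
import numpy as np


def paa(X, y, delta=1.0, C=1.0, variant="PAA", seed=0):
    """Sketch of Passive-Aggressive Active learning (PAA, PAA-I, PAA-II).

    X: (T, d) array of nonzero instances; y: (T,) labels in {-1, +1}.
    Returns the final weights, the number of label queries, and the
    number of prediction mistakes over the whole sequence.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n_queries = 0
    n_mistakes = 0
    for x_t, y_t in zip(X, y):
        p_t = float(w @ x_t)              # prediction margin
        y_hat = 1 if p_t >= 0 else -1     # assumption: sign(0) is taken as +1
        n_mistakes += int(y_hat != y_t)   # counted for evaluation only
        # Bernoulli draw Z_t with probability delta / (delta + |p_t|).
        if rng.random() >= delta / (delta + abs(p_t)):
            continue                      # Z_t = 0: no query, no update
        n_queries += 1
        loss = max(0.0, 1.0 - y_t * p_t)  # hinge loss on the queried instance
        if loss > 0.0:
            sq_norm = float(x_t @ x_t)
            if variant == "PAA":
                tau = loss / sq_norm
            elif variant == "PAA-I":
                tau = min(C, loss / sq_norm)
            elif variant == "PAA-II":
                tau = loss / (sq_norm + 1.0 / (2.0 * C))
            else:
                raise ValueError(f"unknown variant: {variant}")
            w = w + tau * y_t * x_t       # PA update according to Eq. (1)
    return w, n_queries, n_mistakes
```

Unlike the Perceptron-based scheme, a queried instance that is predicted correctly but with margin below 1 still triggers an update here.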

3.3 Analysis of mistake bounds for the PAA algorithms

In this section, we aim to theoretically analyze the mistake bounds of the proposed PAA algorithms. Before presenting the mistake bounds, we begin by presenting a technical lemma which would facilitate the proofs in this section. With this lemma, we could then derive the loss and mistake bounds for the three variants of PAA algorithm. For convenience, we introduce the following notation: \(\mathcal {M}=\{t|t\in [T], \hat{y}_t\not =y_t\}, \;\) and \(\mathcal {L}=\{t|t\in [T], \hat{y}_t=y_t, \ell _t(\mathbf {w}_t;(\mathbf {x}_t,y_t))>0\}\), where [T] denotes \(\{1,2,\ldots ,T\}\).

Lemma 1

Let \((\mathbf {x}_1, y_1), \ldots , (\mathbf {x}_T, y_T)\) be a sequence of input instances, where \(\mathbf {x}_t\in \mathbb {R}^d\) and \(y_t\in \{-1, +1\}\) for all t. Let \(\tau _t\) be the stepsize parameter for any of the three PAA variants as given in Eq. (1). Then, the following bound holds for any \(\mathbf {w}\in \mathbb {R}^d\) and any \(\alpha >0\)

$$\begin{aligned} \sum ^T_{t=1}Z_t 2\tau _t \big [L_t(\alpha -|p_t|)+ M_t(\alpha +|p_t|)\big ] \le \alpha ^2\Vert \mathbf {w}\Vert ^2+\sum ^T_{t=1}\tau _t^2\Vert \mathbf {x}_t\Vert ^2+\sum ^T_{t=1}2\alpha \tau _t\ell _t(\mathbf {w}), \end{aligned}$$

where \(M_t=\mathbb {I}_{(t\in \mathcal {M})}\), \(L_t=\mathbb {I}_{(t\in \mathcal {L})}\), \(\mathbb {I}\) is the indicator function.

The detailed proof of Lemma 1 can be found in “Appendix 1”.

Based on Lemma 1, we first derive the expected mistake bound for the PAA algorithm in the separable case. We assume there exists some \(\mathbf {w}\) such that \(y_t(\mathbf {w}\cdot \mathbf {x}_t)\ge 1\), \(\forall t\in [T]\).

Theorem 1

Let \((\mathbf {x}_1, y_1), \ldots , (\mathbf {x}_T, y_T)\) be a sequence of input instances, where \(\mathbf {x}_t\in \mathbb {R}^d\) and \(y_t\in \{-1, +1\}\) and \(\Vert \mathbf {x}_t\Vert \le R\) for all t. For any vector \(\mathbf {w}\) that satisfies \(\ell _t(\mathbf {w})=0\) for all t, assuming \(\delta \ge 1\), the expected number of mistakes made by the PAA algorithm on this sequence of examples is bounded by

$$\begin{aligned} \mathbb {E}\left[ \sum ^T_{t=1}M_t\right] \le \mathbb {E}\left[ \sum ^T_{t=1}M_t\ell _t(\mathbf {w}_t) \right] \le \frac{R^2}{4}\left( \delta +\frac{1}{\delta }+2\right) \Vert \mathbf {w}\Vert ^2. \end{aligned}$$

By setting \(\delta =1\), we can obtain the best upper bound as follows:

$$\begin{aligned} \mathbb {E}\left[ \sum ^T_{t=1}M_t\right] \le \mathbb {E}\left[ \sum ^T_{t=1}M_t\ell _t(\mathbf {w}_t)\right] \le R^2\Vert \mathbf {w}\Vert ^2. \end{aligned}$$
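To see why \(\delta =1\) gives the smallest value of the bound over the admissible range \(\delta \ge 1\), one can inspect the \(\delta \)-dependent factor. The shorthand \(g\) below is introduced here only for this check:

$$\begin{aligned} g(\delta )=\delta +\frac{1}{\delta }+2,\qquad g'(\delta )=1-\frac{1}{\delta ^2}\ge 0\quad \text {for } \delta \ge 1, \end{aligned}$$

so \(g\) is non-decreasing on \([1,\infty )\), and its minimum \(g(1)=4\) yields \(\frac{R^2}{4}g(1)\Vert \mathbf {w}\Vert ^2=R^2\Vert \mathbf {w}\Vert ^2\).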

Proof

Since \(\ell _t(\mathbf {w})=0\), \(\forall t\in [T]\), according to Lemma 1, we have

$$\begin{aligned} \sum ^T_{t=1}Z_t 2\tau _t\big [L_t(\alpha -|p_t|)+ M_t(\alpha +|p_t|)\big ]\le \alpha ^2\Vert \mathbf {w}\Vert ^2+\sum ^T_{t=1}\tau _t^2\Vert \mathbf {x}_t\Vert ^2. \end{aligned}$$

Further, the above inequality can be reformulated as:

$$\begin{aligned} \alpha ^2\Vert \mathbf {w}\Vert ^2\ge & {} \sum ^T_{t=1}Z_t 2\tau _t\left[ L_t\left( \alpha -|p_t|\right) + M_t\left( \alpha +|p_t|\right) \right] -\sum ^T_{t=1}\tau _t^2\Vert \mathbf {x}_t\Vert ^2\\= & {} \sum ^T_{t=1}Z_t 2\tau _t \left[ L_t\left( \alpha -|p_t|-\frac{\tau _t}{2}\Vert \mathbf {x}_t\Vert ^2\right) +M_t\left( \alpha +|p_t|-\frac{\tau _t}{2}\Vert \mathbf {x}_t\Vert ^2\right) \right] \\= & {} \sum ^T_{t=1}Z_t 2\tau _t \left[ L_t \left( \alpha -|p_t|-\frac{\ell _t(\mathbf {w}_t)}{2}\right) +M_t\left( \alpha +|p_t|-\frac{\ell _t(\mathbf {w}_t)}{2}\right) \right] \\= & {} \sum ^T_{t=1} Z_t 2\tau _t\left[ L_t\left( \alpha -|p_t|-\frac{1-y_t p_t}{2}\right) +M_t\left( \alpha +|p_t|-\frac{1-y_t p_t}{2}\right) \right] \\= & {} \sum ^T_{t=1}Z_t 2\tau _t\left[ L_t \left( \alpha -|p_t|-\frac{1-|p_t|}{2}\right) +M_t\left( \alpha +|p_t|-\frac{1+|p_t|}{2}\right) \right] \\= & {} \sum ^T_{t=1}L_t Z_t 2\tau _t\left( \alpha -\frac{1+|p_t|}{2}\right) +\sum ^T_{t=1}M_tZ_t 2\tau _t\left( \alpha -\frac{1-|p_t|}{2}\right) . \end{aligned}$$

Plugging \(\alpha =\frac{\delta +1}{2}\), \(\delta \ge 1\) into the above inequality results in

$$\begin{aligned} \left( \frac{1+\delta }{2}\right) ^2\Vert \mathbf {w}\Vert ^2\ge \sum ^T_{t=1}M_tZ_t \tau _t(\delta +|p_t|), \end{aligned}$$

since when \(L_t=1\), \(|p_t|\in [0,1)\), \((\alpha -\frac{1+|p_t|}{2})=\frac{\delta -|p_t|}{2}>0\), and \((\alpha -\frac{1-|p_t|}{2})=\frac{\delta +|p_t|}{2}\).

In addition, combining the fact \(\tau _t=\ell _t(\mathbf {w}_t)/\Vert \mathbf {x}_t\Vert ^2\ge \ell _t(\mathbf {w}_t)/R^2\) with the above inequality concludes:

$$\begin{aligned} \left( \frac{1+\delta }{2}\right) ^2\Vert \mathbf {w}\Vert ^2\ge \frac{1}{R^2}\sum ^T_{t=1}M_tZ_t \ell _t(\mathbf {w}_t)(\delta +|p_t|). \end{aligned}$$

Taking expectation with the above inequality results in

$$\begin{aligned} \frac{1}{R^2}\mathbb {E}\left[ \delta \sum ^T_{t=1}M_t\ell _t(\mathbf {w}_t) \right]= & {} \mathbb {E}\left[ \frac{1}{R^2}\sum ^T_{t=1}M_t\ell _t(\mathbf {w}_t) (\delta +|p_t|)\mathbb {E}Z_t\right] \\= & {} \mathbb {E}\left[ \frac{1}{R^2}\sum ^T_{t=1}M_t\ell _t(\mathbf {w}_t)Z_t (\delta +|p_t|)\right] \le \left( \frac{1+\delta }{2}\right) ^2\Vert \mathbf {w}\Vert ^2. \end{aligned}$$

\(\square \)

Remark

Since the above theorem holds for any \(\mathbf {w}\) such that \(\forall t, \ell _t(\mathbf {w})=0\), we obtain the tightest bound when \(\mathbf {w}=\mathbf {w}^*\), where

$$\begin{aligned} \begin{aligned} \mathbf {w}^*&=\arg \min _{\mathbf {w}}\Vert \mathbf {w}\Vert _2^2,\\&\text {s.t.}\ \min _{t\in [T]}y_t\mathbf {w}^\top \mathbf {x}_t\ge 1. \end{aligned} \end{aligned}$$

We can draw two observations as follows. When the norms of the instances vary greatly, i.e., \(\min _t ||\mathbf {x}_t||\ll R\), guaranteeing zero loss on the smallest \(\mathbf {x}_t\) requires the optimal \(||\mathbf {w}^*||_2^2\) to be extremely large, which indicates a loose theoretical bound. Thus a proper data pre-processing scheme, for example scaling all instance vectors to the same norm, will help improve performance. However, choosing between two different uniform scaling schemes, for example scaling to \(||\mathbf {x}_t||=R\) or to \(||\mathbf {x}_t||=cR\), \(c\in \mathbb {R}^+\), will not affect the performance. That is because when \(\mathbf {x}_t\) is changed to \(c\mathbf {x}_t\) (R to cR), \(\mathbf {w}^*\) is changed to \(\mathbf {w}^*/c\), so the bound \(R^2\Vert \mathbf {w}^*\Vert ^2\) in the theorem is unchanged.

The above mistake bound indicates that the expected number of mistakes is proportional to \(R^2\), where R is the upper bound on the instance norms, and inversely proportional to the squared margin \(1/\Vert \mathbf {w}\Vert ^2\), which is consistent with existing research (Crammer et al. 2006). One disadvantage of the above theorem is the linear separability assumption, since real-world datasets are usually not separable. To address this problem, we present the expected mistake bound for the PAA-I algorithm, which is suitable for the non-separable case.

Theorem 2

Let \((\mathbf {x}_1 , y_1 ), \ldots , (\mathbf {x}_T , y_T )\) be a sequence of examples where \(\mathbf {x}_t\in \mathbb {R}^d\) and \(y_t\in \{-1, +1\}\) and \(\Vert \mathbf {x}_t\Vert \le R\) for all t. Assuming \(\delta \ge 1\), for any vector \(\mathbf {w}\in \mathbb {R}^d\) , the expected number of prediction mistakes made by PAA-I on this sequence of examples is bounded from above by

$$\begin{aligned} \mathbb {E}\left[ \sum ^T_{t=1}M_t\right] \le \beta \left\{ \left( \frac{\delta +1}{2}\right) ^2\Vert \mathbf {w}\Vert ^2+(\delta +1) C\sum ^T_{t=1}\ell _t(\mathbf {w})\right\} , \end{aligned}$$

where \(\beta =\frac{1}{\delta }\max \{\frac{1}{C}, R^2\}\) and C is the aggressiveness parameter for PAA-I. Setting \(\delta =1\) leads to the following bound

$$\begin{aligned} \mathbb {E}\left[ \sum ^T_{t=1}M_t\right] \le \max \left\{ \frac{1}{C}, R^2\right\} \left\{ \Vert \mathbf {w}\Vert ^2+2 C\sum ^T_{t=1}\ell _t(\mathbf {w})\right\} . \end{aligned}$$

Setting \(\delta =\sqrt{1+\frac{4C\sum ^T_{t=1}\ell _t(\mathbf {w})}{\Vert \mathbf {w}\Vert ^2}}\) leads to the following bound

$$\begin{aligned} \begin{aligned}&\mathbb {E}\left[ \sum ^T_{t=1}M_t\right] \\ \le&\max \left\{ \frac{1}{C}, R^2\right\} \left\{ \frac{1}{2}\Vert \mathbf {w}\Vert ^2+C\sum ^T_{t=1}\ell _t(\mathbf {w})+\frac{1}{2}\Vert \mathbf {w}\Vert \sqrt{\Vert \mathbf {w}\Vert ^2+4C\sum ^T_{t=1}\ell _t(\mathbf {w})}\right\} . \end{aligned} \end{aligned}$$
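The last bound can be checked by a short calculation. Writing \(W=\Vert \mathbf {w}\Vert ^2\) and \(S=C\sum ^T_{t=1}\ell _t(\mathbf {w})\) (shorthand introduced here only for this check, with \(W>0\) assumed), the \(\delta \)-dependent part of the general bound is

$$\begin{aligned} g(\delta )=\frac{1}{\delta }\left\{ \left( \frac{\delta +1}{2}\right) ^2W+(\delta +1)S\right\} =\frac{W}{4}\delta +\left( \frac{W}{2}+S\right) +\frac{W+4S}{4\delta }. \end{aligned}$$

Setting \(g'(\delta )=\frac{W}{4}-\frac{W+4S}{4\delta ^2}=0\) gives \(\delta ^2=1+\frac{4S}{W}\), which satisfies \(\delta \ge 1\), and substituting back yields

$$\begin{aligned} g(\delta )=\frac{W}{2}+S+2\sqrt{\frac{W}{4}\cdot \frac{W+4S}{4}}=\frac{1}{2}\Vert \mathbf {w}\Vert ^2+C\sum ^T_{t=1}\ell _t(\mathbf {w})+\frac{1}{2}\Vert \mathbf {w}\Vert \sqrt{\Vert \mathbf {w}\Vert ^2+4C\sum ^T_{t=1}\ell _t(\mathbf {w})}, \end{aligned}$$

which matches the expression inside the braces above.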

Proof

According to Lemma 1, we have

$$\begin{aligned} \begin{aligned}&\alpha ^2\Vert \mathbf {w}\Vert ^2+\sum ^T_{t=1}2\alpha \tau _t\ell _t(\mathbf {w})\\ \ge&\sum ^T_{t=1}Z_t 2\tau _t \left[ L_t\left( \alpha -|p_t|-\frac{\tau _t}{2}\Vert \mathbf {x}_t\Vert ^2\right) +M_t\left( \alpha +|p_t|-\frac{\tau _t}{2}\Vert \mathbf {x}_t\Vert ^2\right) \right] \\ \ge&\sum ^T_{t=1}L_t Z_t 2\tau _t\left( \alpha -\frac{1+|p_t|}{2}\right) +\sum ^T_{t=1}M_tZ_t 2\tau _t\left( \alpha -\frac{1-|p_t|}{2}\right) . \end{aligned} \end{aligned}$$

Similar to Theorem 1, plugging \(\alpha =\frac{\delta +1}{2}\), \(\delta \ge 1\) into the above inequality will result in

$$\begin{aligned} \left( \frac{\delta +1}{2}\right) ^2\Vert \mathbf {w}\Vert ^2+\sum ^T_{t=1}(\delta +1)\tau _t\ell _t(\mathbf {w})\ge \sum ^T_{t=1}M_tZ_t \tau _t(\delta +|p_t|). \end{aligned}$$

Since \(\tau _t\ge \min \{C, \frac{1}{R^2}\}\) holds when \(M_t=1\), the above inequality implies:

$$\begin{aligned} \left( \frac{\delta +1}{2}\right) ^2\Vert \mathbf {w}\Vert ^2+\sum ^T_{t=1}(\delta +1)\tau _t\ell _t(\mathbf {w})\ge \min \left\{ C, \frac{1}{R^2}\right\} \sum ^T_{t=1}M_tZ_t (\delta +|p_t|). \end{aligned}$$

Taking the expectation of the above inequality, using \(\tau _t\le C\) on the left-hand side, and re-arranging the result concludes the proof. \(\square \)

This theorem shows that the number of expected mistakes is bounded by a weighted sum of the model complexity \(\Vert \mathbf {w}\Vert ^2\) and the cumulative loss \(\sum ^T_{t=1}\ell _t(\mathbf {w})\) suffered by it. Finally, we give the mistake bound of the PAA-II algorithm in the following theorem.

Theorem 3

Let \((\mathbf {x}_1 , y_1 ), \ldots , (\mathbf {x}_T , y_T )\) be a sequence of examples where \(\mathbf {x}_t\in \mathbb {R}^d\) and \(y_t\in \{-1, +1\}\) and \(\Vert \mathbf {x}_t\Vert \le R\) for all t. Then, for any vector \(\mathbf {w}\in \mathbb {R}^d\) , assuming \(\delta \ge 1\), the expected number of prediction mistakes made by PAA-II on this sequence of examples is bounded from above by,

$$\begin{aligned} \mathbb {E}\left[ \sum ^T_{t=1}M_t\right] \le \left\{ R^2+\frac{1}{2C}\right\} \frac{1}{\delta }\left\{ \left( \frac{\delta +1}{2}\right) ^2\Vert \mathbf {w}\Vert ^2+2C\left( \frac{\delta +1}{2}\right) ^2\sum ^T_{t=1}\ell _t(\mathbf {w})^2\right\} , \end{aligned}$$

where C is the aggressiveness parameter for PAA-II. By setting \(\delta =1\), we can further have

$$\begin{aligned} \mathbb {E}\left[ \sum ^T_{t=1}M_t\right] \le \left\{ R^2+\frac{1}{2C}\right\} \left\{ \Vert \mathbf {w}\Vert ^2+2C\sum ^T_{t=1}\ell _t(\mathbf {w})^2\right\} . \end{aligned}$$

Proof

Define

$$\begin{aligned} \mathcal {O}= & {} \alpha ^2\Vert \mathbf {w}\Vert ^2+\sum ^T_{t=1}\tau _t^2\Vert \mathbf {x}_t\Vert ^2+\sum ^T_{t=1}2\alpha \tau _t\ell _t(\mathbf {w}), \\ \mathcal {P}= & {} \sum ^T_{t=1}\alpha \left( \frac{\tau _t}{\sqrt{2C\alpha }}-\sqrt{2C\alpha }\ell _t(\mathbf {w})\right) ^2, \\ \mathcal {Q}= & {} \alpha ^2\Vert \mathbf {w}\Vert ^2+\sum ^T_{t=1}\tau _t^2\left( \Vert \mathbf {x}_t\Vert ^2+\frac{1}{2C}\right) +\sum ^T_{t=1}2C\alpha ^2\ell _t(\mathbf {w})^2, \end{aligned}$$

then it is easy to verify that \(\mathcal {O}\le \mathcal {O}+\mathcal {P}=\mathcal {Q}\).
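The identity \(\mathcal {O}+\mathcal {P}=\mathcal {Q}\) follows by expanding the square in \(\mathcal {P}\) term by term:

$$\begin{aligned} \alpha \left( \frac{\tau _t}{\sqrt{2C\alpha }}-\sqrt{2C\alpha }\ell _t(\mathbf {w})\right) ^2=\frac{\tau _t^2}{2C}-2\alpha \tau _t\ell _t(\mathbf {w})+2C\alpha ^2\ell _t(\mathbf {w})^2. \end{aligned}$$

Summing over t and adding \(\mathcal {O}\), the cross term \(-\sum ^T_{t=1}2\alpha \tau _t\ell _t(\mathbf {w})\) cancels the last term of \(\mathcal {O}\), and \(\frac{\tau _t^2}{2C}\) combines with \(\tau _t^2\Vert \mathbf {x}_t\Vert ^2\), which gives \(\mathcal {Q}\). Since \(\mathcal {P}\ge 0\), this also gives \(\mathcal {O}\le \mathcal {Q}\).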

Combining \(\mathcal {O}\le \mathcal {Q}\) with Lemma 1, we have the following

$$\begin{aligned} \sum ^T_{t=1}\big [L_tZ_t 2\tau _t(\alpha -|p_t|)+ M_tZ_t 2\tau _t(\alpha +|p_t|)\big ]\le \mathcal {Q}. \end{aligned}$$

Furthermore, the above formulation can be reformulated as:

$$\begin{aligned} \begin{aligned}&\alpha ^2\Vert \mathbf {w}\Vert ^2+\sum ^T_{t=1}2C\alpha ^2\ell _t(\mathbf {w})^2 \\&\quad \ge \sum ^T_{t=1}Z_t 2\tau _t\Big [L_t(\alpha -|p_t|)+ M_t(\alpha +|p_t|)\Big ]-\tau _t^2\left( \Vert \mathbf {x}_t\Vert ^2+\frac{1}{2C}\right) \\&\quad =\sum ^T_{t=1}L_t Z_t 2\tau _t\left( \alpha -\frac{1+|p_t|}{2}\right) +\sum ^T_{t=1}M_tZ_t 2\tau _t\left( \alpha -\frac{1-|p_t|}{2}\right) . \end{aligned} \end{aligned}$$

Similar to Theorem 1, plugging \(\alpha =\frac{\delta +1}{2}\), \(\delta \ge 1\) into the above inequality results in

$$\begin{aligned} \left( \frac{\delta +1}{2}\right) ^2\Vert \mathbf {w}\Vert ^2+\sum ^T_{t=1}2C\left( \frac{\delta +1}{2}\right) ^2\ell _t(\mathbf {w})^2\ge \sum ^T_{t=1}M_tZ_t \tau _t(\delta +|p_t|). \end{aligned}$$

Taking the expectation of the above inequality and using \(\tau _t\ge 1/\{R^2+\frac{1}{2C}\} \), which holds when \(M_t=1\), concludes the proof. \(\square \)

Remark

As proven in previous work (Cesa-Bianchi et al. 2006), the expected mistake bound for the Perceptron-based active learning algorithm can be expressed in our notation as:

$$\begin{aligned} \mathbb {E}\left[ \sum ^T_{t=1}M_t\right] \le \frac{(2\delta +R^2)^2}{8\delta }\Vert \mathbf {w}\Vert ^2+ \left( 1+\frac{R^2}{2\delta }\right) \sum ^T_{t=1}\ell _t(\mathbf {w}). \end{aligned}$$

By setting \(\delta =1\), they further have

$$\begin{aligned} \mathbb {E}\left[ \sum ^T_{t=1}M_t\right] \le \frac{(2+R^2)^2}{8}\Vert \mathbf {w}\Vert ^2+ \left( 1+\frac{R^2}{2}\right) \sum ^T_{t=1}\ell _t(\mathbf {w}). \end{aligned}$$

Generally speaking, the two bounds are comparable, and which one is tighter depends on the parameters. This is similar to the comparison between the PA bound (Cesa-Bianchi and Lugosi 2006; Crammer et al. 2006) and the Perceptron bound (Freund and Schapire 1999). However, the bound for Perceptron-based active learning contains a term of order \(R^4\Vert \mathbf {w}\Vert ^2\), which may make it inferior to ours.

One limitation of the above theorem is that the value of \(\delta \) must be at least 1, which may result in too many label requests. To address this issue, the following theorem removes this restriction by modifying the sampling probability.

Theorem 4

Let \((\mathbf {x}_1 , y_1 ), \ldots , (\mathbf {x}_T , y_T )\) be a sequence of examples where \(\mathbf {x}_t\in \mathbb {R}^d\) and \(y_t\in \{-1, +1\}\) and \(\Vert \mathbf {x}_t\Vert \le R\) for all t. Assume there exists a vector \(\mathbf {w}\) such that \(\ell _t(\mathbf {w})=0\) for all t. For the PAA algorithm, if the parameter of the Bernoulli distribution is changed to \(\delta /(\delta +1+|p_t|)\) with \(\delta \ge 0\), then its expected number of prediction mistakes on this sequence is bounded by

$$\begin{aligned} \mathbb {E}\left[ \sum ^T_{t=1}M_t \right] \le \mathbb {E}\left[ \sum ^T_{t=1}M_t\ell _t(\mathbf {w}_t)\right] \le R^2\left( \frac{\delta }{4}+\frac{1}{\delta }+1\right) \Vert \mathbf {w}\Vert ^2. \end{aligned}$$

Setting \(\delta =2\) yields the best upper bound

$$\begin{aligned} \mathbb {E}\left[ \sum ^T_{t=1}M_t\right] \le \mathbb {E}\left[ \sum ^T_{t=1}M_t\ell _t(\mathbf {w}_t)\right] \le 2 R^2\Vert \mathbf {w}\Vert ^2. \end{aligned}$$

Proof

As proven in Theorem 1,

$$\begin{aligned} \alpha ^2\Vert \mathbf {w}\Vert ^2+\sum ^T_{t=1}2\alpha \tau _t\ell _t(\mathbf {w}) \ge \sum ^T_{t=1}Z_t 2\tau _t \left[ L_t \left( \alpha -\frac{1+|p_t|}{2}\right) +M_t\left( \alpha -\frac{1-|p_t|}{2}\right) \right] . \end{aligned}$$

Plugging \(\alpha =\frac{\delta }{2}+1\), \(\delta \ge 0\) into the above inequality results in

$$\begin{aligned} \left( \frac{\delta }{2}+1\right) ^2\Vert \mathbf {w}\Vert ^2>\sum ^T_{t=1}M_tZ_t \tau _t(\delta +1+|p_t|), \end{aligned}$$

since, when \(L_t=1\), we have \(|p_t|\in [0,1)\) and hence \((\alpha -\frac{1+|p_t|}{2})=\frac{\delta +1-|p_t|}{2}>0\), while \((\alpha -\frac{1-|p_t|}{2})=\frac{\delta +1+|p_t|}{2}\). Taking expectations on both sides of the above inequality and using \(\tau _t\ge \ell _t(\mathbf {w}_t)/R^2\) concludes the proof. \(\square \)

Remark

Note that this theorem shows that, for the new sampling probability \(\frac{\delta }{\delta +1+|p_t|}\), the expected number of mistakes of the PAA algorithm is bounded. A similar property also holds for PAA-I and PAA-II: the corresponding bounds follow easily from the facts that \(\tau _t\ge \min \{C, \frac{1}{R^2}\}\) for PAA-I and \(\tau _t\ge 1/\{R^2+\frac{1}{2C}\} \) for PAA-II when \(M_t=1\). We omit these theorems since they are similar to Theorem 4.
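
To make the role of \(\delta \) in Theorem 4 concrete, the following minimal sketch evaluates the factor \(\frac{\delta }{4}+\frac{1}{\delta }+1\) that multiplies \(R^2\Vert \mathbf {w}\Vert ^2\) and the shifted sampling probability \(\delta /(\delta +1+|p_t|)\). The function names are hypothetical and chosen only for illustration.

```python
def theorem4_factor(delta):
    """Factor multiplying R^2 * ||w||^2 in the Theorem 4 bound (requires delta > 0)."""
    return delta / 4.0 + 1.0 / delta + 1.0


def shifted_query_prob(p_t, delta):
    """Sampling probability delta / (delta + 1 + |p_t|) used in Theorem 4."""
    return delta / (delta + 1.0 + abs(p_t))


if __name__ == "__main__":
    # The factor is smallest at delta = 2 and grows on either side of it.
    for delta in (0.5, 1.0, 2.0, 4.0):
        print(delta, theorem4_factor(delta), shifted_query_prob(0.3, delta))
```

Under this rule, even a zero-confidence prediction (\(p_t=0\)) is queried with probability \(\delta /(\delta +1)<1\), which is how the shifted rule reduces the number of requests.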

4 Extension to multi-class online classification

In this section, we will generalize the PAA algorithms to solve online multi-class classification tasks.

4.1 Problem formulation and background review

We first introduce the problem setting of multi-class classification. Let \(\{(\mathbf {x}_t,y_t)|\ t=1,\ldots ,T\}\) be a sequence of input patterns for online learning, where each instance \(\mathbf {x}_t \in \mathbb {R}^d\) received at the tth trial is a d-dimensional vector and \(y_t\in Y=\{1,\ldots ,k\}\) is its true class label. We adopt the multi-prototype model of Crammer et al. (2006). The classifier is made up of k weight vectors \(\mathbf {w}^r\in \mathbb {R}^d, r\in Y\), where each vector corresponds to one class label. During the prediction period of the tth iteration, the classifier first generates a sequence of k prediction scores, one for each class label:

$$\begin{aligned} (\mathbf {w}_t^1\cdot \mathbf {x}_t,\ldots ,\mathbf {w}_t^r\cdot \mathbf {x}_t,\ldots ,\mathbf {w}_t^k\cdot \mathbf {x}_t). \end{aligned}$$

Then, by comparing the above scores, it picks the class label with the largest score as the prediction,

$$\begin{aligned} \hat{y}_t=\arg \max _{r\in Y}\mathbf {w}_t^r\cdot \mathbf {x}_t. \end{aligned}$$
(2)

We further define \(s_t\) as the irrelevant class with the highest prediction score:

$$\begin{aligned} s_t=\mathop {\arg \max }_{r\in Y,r\ne y_t}\mathbf {w}_t^r\cdot \mathbf {x}_t. \end{aligned}$$

The margin with respect to the hypothesis in the tth iteration is defined to be the gap between the prediction score of class \(y_t\) and \(s_t\):

$$\begin{aligned} \gamma _t=\mathbf {w}_t^{y_t}\cdot \mathbf {x}_t-\mathbf {w}_t^{s_t}\cdot \mathbf {x}_t. \end{aligned}$$

Clearly, for a correct prediction, the margin satisfies \(\gamma _t > 0\). In the max-score multi-class Perceptron algorithm (Crammer and Singer 2003), the classifier is updated only when a prediction mistake occurs, i.e., \(\hat{y}_t\ne y_t\); in that case, Perceptron updates the model with the misclassified instance \((\mathbf {x}_t,y_t)\):

$$\begin{aligned} \begin{aligned}&\mathbf {w}_{t+1}^{y_t}\leftarrow \mathbf {w}_{t}^{y_t} + \mathbf {x}_t,\\&\mathbf {w}_{t+1}^{s_t}\leftarrow \mathbf {w}_{t}^{s_t} - \mathbf {x}_t. \end{aligned} \end{aligned}$$

Unlike Perceptron, which updates the model only when a misclassification occurs, the Multi-class Passive-Aggressive (MPA) algorithms (Crammer et al. 2006) also update the classifier when the prediction is correct but the margin is not large enough. Specifically, the MPA algorithms update the model whenever the hinge loss is nonzero, where the hinge loss is defined as

$$\begin{aligned} \ell _t(\overline{\mathbf {w}}_t)=\max (0, 1-\gamma _t), \end{aligned}$$

in which \(\overline{\mathbf {w}}_t\) denotes the set of all k weight vectors in the classifier.

If the hinge loss is positive, the multi-class PA algorithms update the model to \(\overline{\mathbf {w}}_{t+1}\) by solving one of the following three optimization problems:

$$\begin{aligned} \arg \min _{\overline{\mathbf {w}}} F(\overline{\mathbf {w}})=\left\{ \begin{array}{ll} \frac{1}{2}\sum _{r=1}^k\Vert \mathbf {w}^r-\mathbf {w}^r_t\Vert ^2\ \text {s.t.}\ \ell _t(\overline{\mathbf {w}};(\mathbf {x}_t,y_t))=0,&{}\quad \text {(MPA)}\\ \frac{1}{2}\sum _{r=1}^k\Vert \mathbf {w}^r-\mathbf {w}^r_t\Vert ^2+C\ell _t(\overline{\mathbf {w}};(\mathbf {x}_t,y_t)),&{}\quad \text {(MPA-I)}\\ \frac{1}{2}\sum _{r=1}^k\Vert \mathbf {w}^r-\mathbf {w}^r_t\Vert ^2+C\ell _t(\overline{\mathbf {w}};(\mathbf {x}_t,y_t))^2,&{}\quad \text {(MPA-II)} \end{array} \right. \end{aligned}$$

where \(C>0\) is a penalty cost parameter. Fortunately, the above optimization problems have closed-form solutions:

$$\begin{aligned} \begin{aligned}&\mathbf {w}_{t+1}^{y_t}\leftarrow \mathbf {w}_t^{y_t} + \tau _t \mathbf {x}_t,\\&\mathbf {w}_{t+1}^{s_t}\leftarrow \mathbf {w}_t^{s_t} - \tau _t \mathbf {x}_t, \end{aligned} \end{aligned}$$
(3)

where the stepsize \(\tau _t\) is computed respectively as follows:

$$\begin{aligned} \tau _t=\left\{ \begin{array}{ll} \ell _t(\overline{\mathbf {w}}_t;(\mathbf {x}_t,y_t))/(2\Vert \mathbf {x}_t\Vert ^2),&{}\quad \text {(MPA)}\\ \min (C, \ell _t(\overline{\mathbf {w}}_t;(\mathbf {x}_t,y_t))/(2\Vert \mathbf {x}_t\Vert ^2)),&{}\quad \text {(MPA-I)}\\ \ell _t(\overline{\mathbf {w}}_t;(\mathbf {x}_t,y_t))/(2\Vert \mathbf {x}_t\Vert ^2+1/(2C)).&{}\quad \text {(MPA-II)} \end{array} \right. \end{aligned}$$
(4)

These update rules imply that larger losses generally result in larger step sizes.
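
The update (3) with the step sizes (4) can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function names `dot` and `mpa_update` are hypothetical, classes are indexed from 0 instead of 1, ties in the arg max are broken by the lowest index, and \(\mathbf {x}_t\) is assumed to be nonzero.

```python
def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def mpa_update(W, x, y, C=1.0, variant="MPA-I"):
    """One multi-class PA step on a labelled example (x, y).

    W is a list of k weight vectors (lists of floats), modified in place.
    Returns the step size tau (0.0 when no update is made).
    """
    scores = [dot(w, x) for w in W]
    # s: the highest-scoring class other than the true class y.
    s = max((r for r in range(len(W)) if r != y), key=lambda r: scores[r])
    margin = scores[y] - scores[s]
    loss = max(0.0, 1.0 - margin)  # hinge loss
    if loss == 0.0:
        return 0.0  # passive: the margin is already large enough
    sq_norm = dot(x, x)
    if variant == "MPA":
        tau = loss / (2.0 * sq_norm)
    elif variant == "MPA-I":
        tau = min(C, loss / (2.0 * sq_norm))
    elif variant == "MPA-II":
        tau = loss / (2.0 * sq_norm + 1.0 / (2.0 * C))
    else:
        raise ValueError("unknown variant: " + variant)
    # Aggressive step: push the true class up and the competing class down.
    W[y] = [wi + tau * xi for wi, xi in zip(W[y], x)]
    W[s] = [wi - tau * xi for wi, xi in zip(W[s], x)]
    return tau
```

For the hard-margin variant, the step is just large enough that the updated margin between \(y_t\) and \(s_t\) equals 1; the two soft-margin variants cap or shrink this step through C.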

4.2 Multi-class Passive-Aggressive Active learning algorithms (MPAA)

In this section, we develop a group of new algorithms for online active multi-class classification tasks, termed the Multi-class Passive-Aggressive Active learning (MPAA) algorithms. First, a stochastic rule similar to, but different from, that of the PAA algorithms is adopted to decide whether to query the label of an instance; this rule is introduced in the following paragraphs. If a label is queried, the update rules of the MPAA algorithms simply follow those of the MPA algorithms introduced in the previous section.

We now introduce the randomized rule for querying labels. As in the PAA algorithms, the probability of querying a label, \(Pr(Z_t=1)\), should be inversely proportional to the margin of the model on the current instance (with a smoothing parameter \(\delta \)), which is regarded as a measure of the model's confidence. However, in the multi-class setting, the margin of the model on the current example is not available, since the label is not disclosed when the probability is computed.

To solve this problem, we introduce the label with the second largest prediction score:

$$\begin{aligned} \tilde{y}_t=\mathop {\arg \max }_{r\in Y,r\ne \hat{y}_t}\mathbf {w}_t^r\cdot \mathbf {x}_t, \end{aligned}$$

and propose a different confidence score

$$\begin{aligned} p_t=\mathbf {w}_t^{\hat{y}_t}\cdot \mathbf {x}_t-\mathbf {w}_t^{\tilde{y}_t}\cdot \mathbf {x}_t, \end{aligned}$$
(5)

which is the gap between the prediction scores of predicted label \(\hat{y}_t\) and the second label \(\tilde{y}_t\). Note that \(p_t\ge 0\) holds for all cases. The relation between the confidence value \(p_t\) and the margin \(\gamma _t\) has two cases: 1) if the prediction is correct, i.e., \(\hat{y}_t=y_t\) and \(s_t=\tilde{y}_t\), then \(p_t=\gamma _t\); 2) if the prediction is incorrect, then it is easy to check \(p_t\le |\gamma _t|\). Given this confidence value, the probability of querying a label is set as

$$\begin{aligned} Pr(Z_t=1)=\frac{\delta }{\delta +p_t}, \end{aligned}$$

where \(\delta >0\) is a smoothing parameter. Since \(p_t\le |\gamma _t|\), this probability is at least \(\frac{\delta }{\delta +|\gamma _t|}\), which will facilitate the theoretical analysis later.
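
The confidence score (5) and the resulting query decision can be sketched as below. The function name `mpaa_query` is hypothetical, classes are indexed from 0, at least two classes are assumed, and ties between scores are broken by the lowest index; the sketch needs only the prediction scores, not the true label.

```python
import random


def mpaa_query(scores, delta, rng=random):
    """Decide whether to query the label, given the k prediction scores.

    Returns (y_hat, p_t, prob, z): the predicted class, the confidence
    (gap between the two largest scores), the query probability, and the
    Bernoulli draw z in {0, 1}.
    """
    # Rank classes by score, largest first.
    order = sorted(range(len(scores)), key=lambda r: scores[r], reverse=True)
    y_hat, y_tilde = order[0], order[1]
    p_t = scores[y_hat] - scores[y_tilde]  # always >= 0
    prob = delta / (delta + p_t)
    z = 1 if rng.random() < prob else 0
    return y_hat, p_t, prob, z


if __name__ == "__main__":
    rng = random.Random(0)  # seeded generator for reproducibility
    print(mpaa_query([2.0, 0.5, 1.0], delta=1.0, rng=rng))
```

When the two largest scores coincide, \(p_t=0\) and the label is queried with probability 1, which matches the intuition that the least confident predictions are the most informative.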

Finally, we summarize the detailed steps of the proposed MPAA algorithms in Algorithm 3.

[Algorithm 3: the MPAA algorithms]

4.3 Analysis of mistake bounds for the MPAA algorithms

In this section, we aim to theoretically analyze the mistake bounds of the proposed MPAA algorithms.

Theorem 5

Let \((\mathbf {x}_1, y_1), \ldots , (\mathbf {x}_T, y_T)\) be a sequence of input instances, where \(\mathbf {x}_t\in \mathbb {R}^d\) and \(y_t\in Y\) and \(\Vert \mathbf {x}_t\Vert \le R\) for all t. For any classifier \(\overline{\mathbf {w}}\) such that \(\ell _t(\overline{\mathbf {w}})=0\) for all t, assuming \(\delta \le 1\) the expected number of mistakes made by the MPAA algorithm on this sequence of examples is bounded by

$$\begin{aligned} \mathbb {E}\left[ \sum ^T_{t=1}M_t\right] \le \mathbb {E}\left[ \sum ^T_{t=1}M_t\ell _t(\overline{\mathbf {w}}_t)\right] \le \frac{R^2}{2}\left( \delta +\frac{1}{\delta }+2\right) \sum _{r=1}^k\Vert \mathbf {w}^r\Vert ^2. \end{aligned}$$

By setting \(\delta =1\), we can obtain the best upper bound as follows:

$$\begin{aligned} \mathbb {E}\left[ \sum ^T_{t=1}M_t\right] \le \mathbb {E}\left[ \sum ^T_{t=1}M_t\ell _t(\overline{\mathbf {w}}_t) \right] \le 2R^2\sum _{r=1}^k\Vert \mathbf {w}^r\Vert ^2. \end{aligned}$$

The proof can be found in “Appendices 2 and 3”. Mistake bounds for the other variants of the MPAA algorithms can be obtained similarly; we omit them for conciseness.

5 Extension to cost-sensitive online classification

In this section, we further extend the PAA algorithms to deal with highly imbalanced binary classification tasks, where instead of simply maximizing the accuracy, we should further consider some cost-sensitive evaluation metrics.

5.1 Problem formulation and cost-sensitive classification review

For binary classification, the result of each prediction for an instance falls into one of four cases: (1) True Positive (TP) if \(\hat{y}_t = y_t = +1\); (2) False Positive (FP) if \(\hat{y}_t = +1\) and \(y_t = -1\); (3) True Negative (TN) if \(\hat{y}_t = y_t = -1\); and (4) False Negative (FN) if \(\hat{y}_t = -1\) and \(y_t = +1\). We now consider a sequence of training examples \((\mathbf {x}_1,y_1),\ldots , (\mathbf {x}_T,y_T)\) for online learning, and denote by \(\mathcal {M}\) the set of indexes of the trials on which a misclassification occurs:

$$\begin{aligned} \mathcal {M}=\{t|y_t\ne \mathrm {sign}(\mathbf {w}_t\cdot \mathbf {x}_t), \forall t\in [T]\}. \end{aligned}$$

Similarly, we denote by \(\mathcal {M}_p=\{t|t\in \mathcal {M}\text { and }y_t=+1\}\) the set of indexes for false negatives, and by \(\mathcal {M}_n=\{t|t\in \mathcal {M}\text { and } y_t=-1\}\) the set of indexes for false positives. We further introduce the notation \(M=|\mathcal {M}|\) for the number of mistakes, \(M_p=|\mathcal {M}_p|\) for the number of false negatives and \(M_n=|\mathcal {M}_n|\) for the number of false positives. Let \(T_p\) denote the number of positive instances and \(T_n\) the number of negative instances. We then have the following performance metrics: sensitivity is defined as the ratio between the number of true positives \(T_p-M_p\) and the number of positive examples; and specificity is defined as the ratio between the number of true negatives \(T_n-M_n\) and the number of negative examples. These can be summarized as:

$$\begin{aligned} sensitivity=\frac{T_p-M_p}{T_p},\quad specificity=\frac{T_n-M_n}{T_n}. \end{aligned}$$

Without loss of generality, we assume “positive” is the rare class, i.e., \(T_p\ll T_n\).

For traditional online learning, the performance is measured by the prediction accuracy (or mistake rate equivalently) over the sequence of examples. This is inappropriate for imbalanced data because a trivial learner that simply classifies any example as negative could achieve a quite high accuracy for a highly imbalanced dataset. Thus, we propose to study new online learning algorithms, which can optimize a more appropriate performance metric, such as the sum of weighted sensitivity and specificity, i.e.,

$$\begin{aligned} sum = \eta _p \times sensitivity + \eta _n \times specificity, \end{aligned}$$
(6)

where \(0 \le \eta _p,\eta _n \le 1\) and \(\eta _p+\eta _n=1\). When \(\eta _p=\eta _n=1/2\), sum is the well-known balanced accuracy, which is adopted as a metric in the existing studies for anomaly detection (Li and Tsang 2011). In general, the higher the sum value, the better the performance. Besides, another suitable metric is the total cost suffered by the algorithm, which is defined as:

$$\begin{aligned} cost = c_p \times M_p + c_n \times M_n, \end{aligned}$$
(7)

where \(M_p\) and \(M_n\) are the number of false negatives and false positives respectively, \(0 \le c_p,c_n \le 1\) are the cost parameters for positive and negative classes, respectively, and we assume \(c_p+c_n=1\). The lower the cost value, the better the classification performance.
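
The metrics above are straightforward to compute from a sequence of labels and predictions. The sketch below is illustrative only: the function name `cost_sensitive_metrics` is hypothetical, labels are assumed to be in \(\{-1,+1\}\), and both classes are assumed to occur at least once so that \(T_p\) and \(T_n\) are nonzero.

```python
def cost_sensitive_metrics(y_true, y_pred, eta_p=0.5, c_p=0.5):
    """Compute sensitivity, specificity, the weighted sum (6) and the cost (7).

    eta_n and c_n are derived from eta_p + eta_n = 1 and c_p + c_n = 1.
    """
    T_p = sum(1 for y in y_true if y == +1)
    T_n = sum(1 for y in y_true if y == -1)
    # M_p: false negatives (positives predicted wrongly).
    M_p = sum(1 for y, yh in zip(y_true, y_pred) if y == +1 and yh != y)
    # M_n: false positives (negatives predicted wrongly).
    M_n = sum(1 for y, yh in zip(y_true, y_pred) if y == -1 and yh != y)
    sensitivity = (T_p - M_p) / T_p
    specificity = (T_n - M_n) / T_n
    eta_n = 1.0 - eta_p
    c_n = 1.0 - c_p
    return {
        "accuracy": 1.0 - (M_p + M_n) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "sum": eta_p * sensitivity + eta_n * specificity,
        "cost": c_p * M_p + c_n * M_n,
    }


if __name__ == "__main__":
    # A trivial learner that always predicts the majority (negative) class.
    y_true = [+1, -1, -1, -1, -1]
    y_pred = [-1, -1, -1, -1, -1]
    print(cost_sensitive_metrics(y_true, y_pred))
```

On this toy sequence the trivial all-negative learner reaches high accuracy but only chance-level balanced accuracy, which is the motivation given above for using sum or cost instead of accuracy.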

5.2 Cost-sensitive Passive-Aggressive Active learning algorithm (CSPAA)

We now propose the CSPAA framework for cost-sensitive online binary classification tasks, which optimizes the two cost-sensitive measures introduced above. Before presenting our algorithms, we state an important proposition that motivates our solution. For simplicity, we assume \(\Vert \mathbf {x}_t\Vert =1\) in the rest of this section.

Proposition 1

For a cost-sensitive classification problem, the goal of maximizing the weighted sum in (6) or minimizing the weighted cost in (7) is equivalent to minimizing the following objective:

$$\begin{aligned} \sum _{y_t=+1}\rho \mathbb {I}_{(y_t\mathbf {w}\cdot \mathbf {x}_t<0)}+\sum _{y_t=-1}\mathbb {I}_{(y_t\mathbf {w}\cdot \mathbf {x}_t<0)}, \end{aligned}$$
(8)

where \(\rho = \frac{\eta _p T_n}{\eta _n T_p}\) for the maximization of the weighted sum, \(T_p\) and \(T_n\) are the number of positive examples and negative examples respectively, \(\rho = \frac{c_p}{c_n}\) for the minimization of the weighted misclassification cost, and \(\mathbb {I}_\pi \) is the indicator function that outputs 1 if the statement \(\pi \) holds and 0 otherwise.

The proof can be found in “Appendix 4”.

Proposition 1 gives the explicit objective function for optimization, but the indicator function is non-convex. To tackle this issue, we replace the indicator function by its convex surrogate, i.e., a modified hinge loss function:

$$\begin{aligned} \ell (\mathbf {w};(\mathbf {x},y))=\max (0,\rho *\mathbb {I}_{(y=1)}+\mathbb {I}_{(y=-1)}-y(\mathbf {w}\cdot \mathbf {x})). \end{aligned}$$
(9)

As a result, we can formulate the primal objective function as follows:

$$\begin{aligned} \mathcal {F}_p^b(\mathbf {w})= \frac{1}{2}\Vert \mathbf {w}\Vert ^2+C\sum ^T_{t=1}\ell _t(\mathbf {w}), \end{aligned}$$
(10)

where the regularization parameter \(C>0\), the loss function \(\ell _t(\mathbf {w})=\max (0, \rho _t-y_t(\mathbf {w}\cdot \mathbf {x}_t))\) and \(\rho _t=\rho *\mathbb {I}_{(y_t=1)}+\mathbb {I}_{(y_t=-1)}\). The idea of this formulation is somewhat similar to the biased formulation of batch SVM for learning with imbalanced datasets (Akbani et al. 2004).

To optimize the above objective (10) in an online fashion, we follow the passive-aggressive learning method (Crammer et al. 2006) and obtain a similar online optimization objective:

$$\begin{aligned} \mathbf {w}_{t+1} = \arg \min _{\mathbf {w}\in \mathbb {R}^d}\frac{1}{2}\Vert \mathbf {w}-\mathbf {w}_t\Vert ^2 + C\ell _t(\mathbf {w}), \end{aligned}$$

which enjoys the following closed-form solution:

$$\begin{aligned} \mathbf {w}_{t+1} \leftarrow \mathbf {w}_{t} + \tau _t y_t \mathbf {x}_t,\quad \text {where}\ \tau _t =\min (C,\ell _t(\mathbf {w}_t)). \end{aligned}$$
(11)

At the tth round, the CSPAA algorithm decides whether the class label should be queried according to the same margin-based Bernoulli random variable \(Z_t\in \{0,1\}\) as that used in the PAA algorithm. Finally, Algorithm 4 summarizes the details of the proposed CSPAA algorithms.

[Algorithm 4: the CSPAA algorithms]
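
One round of CSPAA, as described above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `cspaa_step` is hypothetical, \(\Vert \mathbf {x}_t\Vert =1\) is assumed as in the text, the query probability is taken to be \(\delta /(\delta +|p_t|)\) with \(p_t=\mathbf {w}_t\cdot \mathbf {x}_t\), and \(\mathrm {sign}(0)\) is mapped to \(+1\).

```python
import random


def cspaa_step(w, x, y, rho, C, delta, rng=random):
    """One CSPAA round. The label y is used only if it is queried.

    Returns (w_next, y_hat, queried).
    """
    p = sum(wi * xi for wi, xi in zip(w, x))
    y_hat = 1 if p >= 0 else -1
    # Margin-based Bernoulli query: low confidence -> high query probability.
    queried = rng.random() < delta / (delta + abs(p))
    if not queried:
        return list(w), y_hat, False
    # Class-dependent margin: rho for positives, 1 for negatives.
    rho_t = rho if y == 1 else 1.0
    loss = max(0.0, rho_t - y * p)  # modified hinge loss (9)
    tau = min(C, loss)              # step size (11), using ||x|| = 1
    w_next = [wi + tau * y * xi for wi, xi in zip(w, x)]
    return w_next, y_hat, True
```

Because positives must clear a margin of \(\rho \) rather than 1, a queried positive example triggers a larger update than a negative one with the same score, which biases the classifier toward the rare class.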

Remark

It is interesting to analyze the impact of the sampling factor \(\delta \). In general, the larger the value of \(\delta \), the more queries the online active learner issues. In particular, as \(\delta \rightarrow \infty \), the algorithm reduces to the extreme case of querying the class label of every instance. One can simply fix \(\delta \) to a constant that yields a suitable ratio of queries. An even better approach is to change \(\delta \) adaptively during the online learning process: we expect to query more examples at the beginning of the task in order to build a good classifier, and to gradually reduce the ratio of queries as the classifier becomes more accurate. To this end, we suggest a simple yet effective scheme that updates the parameter at the tth learning step as \(\delta _{t} \leftarrow \delta _{t-1}*\frac{t}{t+1}\). We will examine the impact of the sampling factor \(\delta \) in our experiments.
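
The adaptive rule \(\delta _{t} \leftarrow \delta _{t-1}*\frac{t}{t+1}\) telescopes to \(\delta _t=\delta _0/(t+1)\). The short sketch below illustrates this; the function name `delta_schedule` is hypothetical, and starting the recursion at \(t=1\) from an initial value \(\delta _0\) is an assumption, since the text does not fix the indexing.

```python
def delta_schedule(delta0, T):
    """Return [delta_1, ..., delta_T] under delta_t = delta_{t-1} * t / (t + 1)."""
    deltas = []
    delta = delta0
    for t in range(1, T + 1):
        delta = delta * t / (t + 1)
        deltas.append(delta)
    return deltas


if __name__ == "__main__":
    # Each value equals delta0 / (t + 1) up to floating-point rounding.
    for t, d in enumerate(delta_schedule(1.0, 5), start=1):
        print(t, d, 1.0 / (t + 1))
```

Under this reading the sampling factor decays like \(1/t\), so queries are concentrated in the early rounds.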

5.3 Theoretical bound analysis for the CSPAA algorithms

Below we give a theoretical analysis of the performance of the CSPAA algorithm on online binary cost-sensitive active learning tasks in terms of the two types of performance metrics. The proofs can be found in “Appendices 5, 6 and 7”.

Theorem 6

Let \((\mathbf {x}_1 , y_1 ), \ldots , (\mathbf {x}_T , y_T )\) be a sequence of examples where \(\mathbf {x}_t\in \mathbb {R}^d\) and \(y_t\in \{-1, +1\}\) and \(\Vert \mathbf {x}_t\Vert = 1\) for all t. Then, for any vector \(\mathbf {w}\in \mathbb {R}^d\) , the expected weighted number of prediction mistakes made by CSPAA on this sequence of examples is bounded as:

$$\begin{aligned} \mathbb {E}\left[ \sum ^T_{t=1}\rho _t M_t\right] \le \frac{1}{\delta }\left\{ \left( \frac{1+\delta }{2}\right) ^2\Vert \mathbf {w}\Vert ^2+\sum ^T_{t=1}(1+\delta )C\ell _t(\mathbf {w})\right\} , \end{aligned}$$

where \(C\ge \rho \) is the aggressiveness parameter for CSPAA.

Now our goal is to analyze the performance of the proposed algorithm in terms of the two metrics, sum and cost. We first consider the weighted sum of sensitivity and specificity, i.e.,

$$\begin{aligned} sum = \eta _p\times sensitivity +\eta _n\times specificity, \end{aligned}$$

where \(\eta _p +\eta _n=1\) and \(\eta _p \ge \eta _n> 0\). The following theorem gives the bound on the sum by the proposed CSPAA algorithm.

Theorem 7

Let \((\mathbf {x}_1,y_1),\ldots ,(\mathbf {x}_T,y_T)\) be a sequence of examples, where \(\mathbf {x}_t\in \mathbb {R}^d\), \(y_t\in \{-1,+1\}\) and \(\Vert \mathbf {x}_t\Vert =1\) for all t. By setting \(\rho =\frac{\eta _p T_n}{\eta _n T_p}\), and assuming \(C\ge \rho \), for any \(\mathbf {w}\in \mathbb {R}^d\), we have the following bound for the proposed CSPAA algorithm:

$$\begin{aligned} \mathbb {E}[sum] \ge 1-\frac{\eta _n}{T_n}\frac{1}{\delta }\left\{ \left( \frac{1+\delta }{2}\right) ^2\Vert \mathbf {w}\Vert ^2+\sum ^T_{t=1}(1+\delta )C\ell _t(\mathbf {w})\right\} . \end{aligned}$$

Furthermore, when \(\eta _p=\eta _n=1/2\), the balanced accuracy (BA) is bounded from below by

$$\begin{aligned} \mathbb {E}[BA]\ge 1-\frac{1}{2 T_n}\frac{1}{\delta }\left\{ \left( \frac{1+\delta }{2}\right) ^2\Vert \mathbf {w}\Vert ^2+\sum ^T_{t=1}(1+\delta )C\ell _t(\mathbf {w})\right\} . \end{aligned}$$

Remark

In the above, setting \(\delta =1\) leads to the following bound

$$\begin{aligned} \mathbb {E}[sum] \ge 1-\frac{\eta _n}{T_n} \left\{ \Vert \mathbf {w}\Vert ^2+2 C\sum ^T_{t=1}\ell _t(\mathbf {w})\right\} . \end{aligned}$$

Setting \(\delta =\sqrt{1+\frac{4C\sum ^T_{t=1}\ell _t(\mathbf {w})}{\Vert \mathbf {w}\Vert ^2}}\), which minimizes the right-hand term over \(\delta \), leads to the following bound

$$\begin{aligned} \mathbb {E}[sum] \ge 1-\frac{\eta _n}{T_n}\left\{ \frac{1}{2}\Vert \mathbf {w}\Vert ^2+C\sum ^T_{t=1}\ell _t(\mathbf {w})+\frac{1}{2}\Vert \mathbf {w}\Vert \sqrt{\Vert \mathbf {w}\Vert ^2+4C\sum ^T_{t=1}\ell _t(\mathbf {w})}\right\} . \end{aligned}$$
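
The two choices of \(\delta \) in this remark can be checked numerically. The sketch below evaluates the braced term \(\frac{1}{\delta }\{(\frac{1+\delta }{2})^2\Vert \mathbf {w}\Vert ^2+(1+\delta )C\sum _t\ell _t(\mathbf {w})\}\) shared by Theorems 6 to 8; the function names and the example values are hypothetical and serve only as illustration.

```python
import math


def bound_term(delta, w_sq, C, cum_loss):
    """(1/delta) * {((1 + delta)/2)^2 * ||w||^2 + (1 + delta) * C * sum_t loss_t(w)}."""
    return (((1.0 + delta) / 2.0) ** 2 * w_sq + (1.0 + delta) * C * cum_loss) / delta


def best_delta(w_sq, C, cum_loss):
    """The minimizing delta: sqrt(1 + 4 * C * sum_t loss_t(w) / ||w||^2)."""
    return math.sqrt(1.0 + 4.0 * C * cum_loss / w_sq)


def closed_form_minimum(w_sq, C, cum_loss):
    """||w||^2/2 + C*L + (||w||/2) * sqrt(||w||^2 + 4*C*L), with L the cumulative loss."""
    return (0.5 * w_sq + C * cum_loss
            + 0.5 * math.sqrt(w_sq) * math.sqrt(w_sq + 4.0 * C * cum_loss))


if __name__ == "__main__":
    w_sq, C, cum_loss = 4.0, 1.0, 3.0  # illustrative values only
    d = best_delta(w_sq, C, cum_loss)
    print(d, bound_term(d, w_sq, C, cum_loss), closed_form_minimum(w_sq, C, cum_loss))
    print(bound_term(1.0, w_sq, C, cum_loss))  # equals ||w||^2 + 2*C*L
```

Note that the optimal \(\delta \) depends on the comparator \(\mathbf {w}\) and its cumulative loss, which are not known in advance, so this choice mainly serves to show how tight the bound can be.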

In the above approach, the bias parameter \(\rho \) is set to \(\frac{\eta _p T_n}{\eta _n T_p}\), in which the ratio \(\frac{T_n}{T_p}\) may not be available in advance. To alleviate this issue, we consider another approach based on the cost-based performance metric. Specifically, we propose to set \(\rho =\frac{c_p}{c_n}\), where \(c_p\) and \(c_n\) are the predefined cost parameters of false negatives and false positives, respectively. We assume \(c_p+c_n=1\) and \(0 \le c_n \le c_p\), since we would prefer to improve the accuracy of predicting the rare positive examples. Under this setting, the following theorem gives the cumulative cost bound of the proposed CSPAA algorithm.

Theorem 8

Let \((\mathbf {x}_1,y_1),\ldots ,(\mathbf {x}_T,y_T)\) be a sequence of examples, where \(\mathbf {x}_t\in \mathbb {R}^d\), \(y_t\in \{-1,+1\}\) and \(\Vert \mathbf {x}_t\Vert =1\) for all t. By setting \(\rho =\frac{c_p}{c_n}\), and assuming \(C\ge \rho \), for any \(\mathbf {w}\in \mathbb {R}^d\), the overall cost made by the proposed CSPAA algorithm over this sequence of examples is bounded as follows:

$$\begin{aligned} \mathbb {E}[cost]\le c_n \frac{1}{\delta }\left\{ \left( \frac{1+\delta }{2}\right) ^2\Vert \mathbf {w}\Vert ^2+\sum ^T_{t=1}(1+\delta )C\ell _t(\mathbf {w})\right\} . \end{aligned}$$

Remark

Setting \(\delta =1\) for the above theorem leads to the following bound:

$$\begin{aligned} \mathbb {E}[cost] \le c_n \left\{ \Vert \mathbf {w}\Vert ^2+2 C\sum ^T_{t=1}\ell _t(\mathbf {w})\right\} . \end{aligned}$$

Setting \(\delta =\sqrt{1+\frac{4C\sum ^T_{t=1}\ell _t(\mathbf {w})}{\Vert \mathbf {w}\Vert ^2}}\) leads to the following bound:

$$\begin{aligned} \mathbb {E}[cost]\le c_n\left\{ \frac{1}{2}\Vert \mathbf {w}\Vert ^2+C\sum ^T_{t=1}\ell _t(\mathbf {w})+\frac{1}{2}\Vert \mathbf {w}\Vert \sqrt{\Vert \mathbf {w}\Vert ^2+4C\sum ^T_{t=1}\ell _t(\mathbf {w})}\right\} . \end{aligned}$$

6 Experimental results

In this section, we evaluate the empirical performance of the proposed family of Passive-Aggressive Active-learning algorithms for three types of online active learning tasks: binary classification, multi-class classification, and cost-sensitive classification tasks.

6.1 Evaluation of PAA algorithms for binary classification tasks

This section will evaluate the empirical performance of the proposed PAA algorithms on online binary classification tasks.

6.1.1 Compared algorithms and experimental testbed

We compare the proposed PAA algorithms with Perceptron-based active learning algorithms and with their random-query variants, listed as follows:

  • “RPE”: the Random Perceptron algorithm (Cesa-Bianchi and Lugosi 2006);

  • “RPA”: the Random Passive-Aggressive algorithms, including RPA, RPA-I, RPA-II. These algorithms adopt the same updating strategy as the proposed PAA algorithms, while the querying strategy is different. Instead of actively querying the class label, these algorithms utilize a uniform random sampling approach. The comparison between the RPA algorithms and the proposed PAA algorithms will validate the effectiveness of the active querying strategy.

  • “PEA”: the Perceptron-based Active learning algorithm (Cesa-Bianchi et al. 2006). This algorithm adopts the same active querying strategy as the proposed PAA algorithm, while the updating rule follows the Perceptron algorithm. The comparison between the proposed algorithms and the Perceptron-based algorithms will support our main motivation, namely to fully exploit the potential of every queried instance.

  • “SEL-ada”: the Selective Sampling Perceptron with Adaptive Parameter (Cesa-Bianchi et al. 2006). Unlike the PEA algorithm, where the smoothing parameter \(\delta \) is a constant for all iterations, the \(\delta \) in the SEL-ada algorithm is set in an online fashion, i.e., \(\delta _t\propto \sqrt{1+ErrCount_{t-1}}\).

  • “SEL-2nd”: the Selective Sampling Second-Order Perceptron algorithm (Cesa-Bianchi et al. 2006). This algorithm adopts the same active querying strategy as the proposed PAA algorithm, but the update strategy follows the Second-order Perceptron algorithm (Cesa-Bianchi et al. 2005). SEL-2nd achieved the best performance among all compared algorithms in Cesa-Bianchi et al. (2006).

  • “PAA”: the Passive-Aggressive Active learning algorithms, including PAA, PAA-I, PAA-II.

To examine the performance, we conduct extensive experiments on a variety of benchmark datasets from web machine learning repositories. Table 1 shows the details of the twelve binary-class datasets used in our experiments. All of these datasets can be downloaded from the LIBSVM website (footnote 1) and the UCI machine learning repository (footnote 2). These datasets were chosen fairly randomly in order to cover a range of dataset sizes.

Table 1 Summary of datasets used in binary online classification experiments

All the compared algorithms learn a linear classifier for the binary classification tasks. The penalty parameter C is searched over \(2^{[-5:5]}\) through cross validation for all the algorithms and datasets. The smoothing parameter \(\delta \) is varied over \(2^{[-10:10]}\) in order to examine different sampling rates. All the experiments were conducted over 20 runs with different random permutations of each dataset, and all the results are reported as averages over these 20 runs. As the performance metric, we select the F-measure, which is defined as \(F{\text {-}}measure=2\frac{Precision\times Recall}{Precision+Recall }\).
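
For reference, the F-measure and the parameter grids can be sketched as below. The helper name `f_measure` is hypothetical, labels are assumed to be in \(\{-1,+1\}\) with \(+1\) as the positive class, returning 0 when there are no true positives is a convention adopted here, and reading \(2^{[-5:5]}\) as integer exponents from \(-5\) to 5 is an interpretation of the notation.

```python
def f_measure(y_true, y_pred):
    """F-measure = 2 * precision * recall / (precision + recall)."""
    tp = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 1)
    fp = sum(1 for y, yh in zip(y_true, y_pred) if y == -1 and yh == 1)
    fn = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == -1)
    if tp == 0:
        return 0.0  # convention used here: no true positives gives F = 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


# Candidate grids, assuming integer exponents.
C_GRID = [2.0 ** k for k in range(-5, 6)]
DELTA_GRID = [2.0 ** k for k in range(-10, 11)]
```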

6.1.2 Evaluation on fixed ratio of queries

In the first experiment, we evaluate the performance of the proposed PAA algorithms on the online binary classification task with fixed query rates. We adjust \(\delta \) so that the percentage of queried instances is close to 10% and 20%, which allows all the algorithms to be compared on an equal footing. The results are shown in Tables 2 and 3. Several observations can be drawn from the results.

Table 2 Evaluation of the PAA algorithms against the other baselines (time in seconds)
Table 3 Evaluation of the PAA algorithms against the other baselines (time in seconds)

First of all, we observe that all the active learning algorithms outperform their corresponding random versions in terms of F-measure results, which validates the efficacy and advantage of the active learning strategies.

Second, we find that the two soft-margin PAA algorithms (i.e., PAA-I and PAA-II) achieve similar F-measure performance on most of the datasets, while the hard-margin PAA usually performs slightly worse. This may possibly be caused by overfitting on noisy training data, since PAA conducts a more aggressive update and is thus more sensitive to noise.

Third, under the same fraction of queried labels, the two soft-margin PAA algorithms always achieve significantly higher F-measure than all compared algorithms. This promising result indicates that PAA can effectively exploit the requested labels, especially those of instances that are correctly classified but with low confidence. This observation again supports our main motivation: the PAA algorithms are designed to address the key limitation of PEA, which spends effort on querying labels but may never use them for an effective update. Furthermore, the running times of the PAA and PEA algorithms are similar, and of the same order of magnitude as those of the randomized query algorithms, which validates the efficiency of the proposed methods.

Fig. 1 Evaluation of F-measure against the fraction of queried labels on all binary datasets. The plotted curves are averaged over 20 random permutations

Finally, we discuss the performance of the SEL-ada and SEL-2nd algorithms, which were proposed as improved variants of the PEA algorithm. Surprisingly, SEL-ada performs slightly worse than PEA on all the datasets. This observation is consistent with earlier results, although no explanation was provided by Cesa-Bianchi et al. (2006). We think that the inferior performance of SEL-ada may be caused by an ineffective setting of the smoothing parameter \(\delta \). When the aim is to maximize the accumulated accuracy over the whole learning process, querying labels earlier is more likely to result in better performance. However, increasing \(\delta \) with time t actually encourages later queries.

When analyzing the performance of SEL-2nd, we find that this algorithm usually outperforms PEA in terms of F-measure, which can be explained by its use of second-order information. However, it still suffers from the same drawback as PEA, i.e., wasting the queried labels when the prediction is correct but with low confidence. Thus its F-measure is still significantly lower than that of PAA-I and PAA-II. Besides, the \(O(d^2)\) time and space complexity of SEL-2nd limits its use in high-dimensional applications, as indicated by its time cost on the “w8a” dataset with 300 features.

6.1.3 Evaluation on varied ratio of queries

This experiment evaluates the performance of the proposed algorithms as the query rate of the different online learning algorithms varies; the results are shown in Fig. 1. Several observations can be drawn from the experimental results.

Similar to the previous experiments under fixed query rate, we find that the two soft margin PAA algorithms always achieve higher F-measure than all the other active learning algorithms, which validates the successful update strategy of our proposed algorithm. And all the active learning algorithms outperform their corresponding random versions, which demonstrates the advantage of the active querying rule.

In addition, we observe that the F-measure usually increases with the fraction of queried labels at the beginning, but saturates quickly once the fraction of queried labels exceeds some value. This result indicates that the proposed online active learning strategy can identify and exploit the most informative instances for updating the classifiers both effectively and efficiently.

Finally, it is interesting to see that on some datasets (e.g., a8a, magic04, svmguide1), the F-measures achieved by PAA, PEA, SEL-ada and SEL-2nd can decrease as the fraction of queried labels increases. This seems a little surprising, as we usually expect that the more labeled data are queried, the better the predictive performance. Note that this phenomenon only appears in the hard-margin algorithms (PAA, PEA, SEL-ada and SEL-2nd), which are not designed for noisy data, whereas the two soft-margin algorithms (PAA-I and PAA-II), which are robust to noise, tend to avoid such situations. Consequently, we suspect that this is mainly caused by overfitting on noisy training data.

6.1.4 Application to online text classification

In this section, we apply our proposed Passive-Aggressive Active learning algorithms to online text classification. Our experimental testbed consists of: (i) a subset of the Reuters Corpus Volume 1 (RCV1) (see footnote 3), which contains 4086 documents with 29,992 distinct words; (ii) the 20 Newsgroups datasets (see footnote 4), from which we extract “comp” versus “rec” and “rec” versus “sci” to form two binary classification tasks with a total of 8870 and 8928 documents, respectively. Each document is represented by a feature vector of 26,214 distinct words.

Fig. 2 Evaluation of F-measure against the fraction of queried labels for text classification applications

Fig. 3 Evaluation of F-measure against the fraction of queried labels for web applications

As discussed earlier, the SEL-ada algorithm mostly performs worse than PEA because of the ineffective setting of the \(\delta \) value. In addition, SEL-2nd is not applicable to high-dimensional applications due to its \(O(d^2)\) space and time complexity, and its performance is always worse than that of PAA-I and PAA-II. We therefore omit these two algorithms from the later experiments for conciseness.

The text classification results are shown in Fig. 2. We can see that the Passive-Aggressive based algorithms usually outperform the Perceptron based algorithms, which empirically shows the advantage of large-margin approaches for active learning. Among all methods, the PAA algorithms consistently perform better than the random querying methods and the Perceptron based active learning methods, which further validates the efficacy of our proposed approaches.

6.1.5 Application to web data classification

To further evaluate the PAA algorithms, we apply them to two web data classification tasks: (i) URL classification (Ma et al. 2009b), which contains 1,782,206 URLs with 3,231,961 features; and (ii) webspam classification (Wang et al. 2012), which has a total of 350,000 instances with 254 features. These two datasets can be downloaded from the LIBSVM website (see footnote 5). A similar phenomenon can be observed from the results, as shown in Fig. 3.

6.2 Evaluation of the MPAA algorithm in multi-class classification tasks

This section will evaluate the empirical performance of the proposed MPAA algorithms on online multi-class classification tasks.

6.2.1 Compared algorithms and experimental testbed

We compare the proposed MPAA algorithms with the Multi-class Perceptron Active learning algorithm, which adopts the same querying strategy as the MPAA algorithms but updates with the Perceptron rule. To demonstrate the advantage of our querying strategy, we also compare the active learning algorithms with their random variants. Note that for all Perceptron based algorithms, we choose the max-score variant, since it resembles the PA based algorithms in updating only two weight vectors in each iteration, which makes for a fair comparison. The compared algorithms are listed as follows:

  • “MRPE”: the Multi-class Random Perceptron algorithm, an extension of RPE (Cesa-Bianchi and Lugosi 2006) to the multi-class setting;

  • “MRPA”: the Multi-class Random Passive-Aggressive algorithms, including MRPA, MRPA-I, MRPA-II, which query labels uniformly at random;

  • “MPEA”: the Multi-class Perceptron-based Active learning algorithm, an extension of PEA (Cesa-Bianchi et al. 2006) to the multi-class setting;

  • “MPAA”: the Multi-class Passive-Aggressive Active learning algorithms, including MPAA, MPAA-I, MPAA-II.
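The max-score Perceptron variant mentioned above can be sketched as follows. This is a minimal illustration assuming the standard max-score multi-class Perceptron rule (on a mistake, add the instance to the true class's weight vector and subtract it from the highest-scoring wrong class's vector), with dense Python lists and hypothetical function names; it is not the implementation used in the experiments.

```python
def class_scores(W, x):
    """Score of each class: dot product between its weight vector and x."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in W]

def max_score_perceptron_update(W, x, y):
    """Predict the top-scoring class; on a mistake, update exactly two vectors.

    W is a list of per-class weight vectors (modified in place), x is the
    feature vector and y is the index of the true class.
    Returns the predicted class index.
    """
    scores = class_scores(W, x)
    predicted = max(range(len(W)), key=lambda r: scores[r])
    if predicted != y:
        # On a mistake, the prediction is the highest-scoring wrong class.
        W[y] = [w_i + x_i for w_i, x_i in zip(W[y], x)]
        W[predicted] = [w_i - x_i for w_i, x_i in zip(W[predicted], x)]
    return predicted

# Tiny usage example with 3 classes and 2 features.
W = [[0.0, 0.0] for _ in range(3)]
first = max_score_perceptron_update(W, [1.0, 2.0], 2)   # mistake, two updates
second = max_score_perceptron_update(W, [1.0, 2.0], 2)  # now correct, no update
```

Only the rows of the true class and the predicted class change, which mirrors the property that motivates the comparison with the PA based algorithms.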

To examine the performance, we conduct extensive experiments on a variety of benchmark datasets from web machine learning repositories. Table 4 shows the details of the 9 multi-class datasets used in our experiments. All of these datasets can be downloaded from the LIBSVM website (see footnote 6). These datasets were chosen fairly randomly in order to cover various dataset sizes.

Table 4 Details of multi-class classification datasets

The parameter settings mostly follow those in the previous binary classification experiments, except that we use online accuracy as the performance metric.

6.2.2 Performance evaluation

Next, we evaluate the performance of all the algorithms on online multi-class active learning tasks. Figure 4 summarizes the average performance of the eight different algorithms for online active learning on the 9 datasets. A phenomenon similar to that in the binary setting can be observed from the results in Fig. 4, which further demonstrates that our proposed algorithms are effective in dealing with online multi-class active learning tasks.

Fig. 4 Evaluation of accuracy against the fraction of queried labels on all multi-class datasets. The plotted curves are averaged over 20 random permutations

Fig. 5 Evaluation of accuracy against the fraction of queried labels for multi-class online text classification applications

6.2.3 Application to online text classification

In this section, we apply our proposed Multi-class Passive-Aggressive Active learning algorithms to online text classification. Our experimental testbed consists of: (i) the 20 Newsgroups dataset (see footnote 7), for which we use the 20-class version with a total of 15,935 documents, each represented by a feature vector of 62,061 distinct words; (ii) the Reuters Corpus Volume 1 (RCV1) (see footnote 8), which is a 53-class dataset containing 15,564 documents with 47,236 distinct words. The text classification results are shown in Fig. 5. A similar phenomenon can be observed from the results: the Passive-Aggressive based algorithms usually outperform the Perceptron based algorithms, and the PAA algorithms consistently perform better than the random querying methods and the Perceptron based active learning methods, which further validates the efficacy of our proposed approaches.

6.3 Evaluation of CSPAA algorithms for cost-sensitive classification tasks

This section evaluates the empirical performance of the proposed CSPAA algorithm. Specifically, we evaluate all the algorithms on online malicious URL detection (Ma et al. 2009a), which is a large-scale online learning task. Our experiments are designed to answer several open questions: (i) how does the class imbalance issue affect the predictive performance of online active learning? (ii) can the proposed online active learning approach significantly reduce the amount of labeled data while maintaining comparable performance? (iii) how efficient and scalable are the proposed learning algorithms for a web-scale application?

6.3.1 Experimental testbed

To examine the performance of the proposed CSPAA algorithms, we test them on a large-scale benchmark dataset for malicious URL detection tasks (Ma et al. 2009b), which can be downloaded from http://sysnet.ucsd.edu/projects/url/. The original data set was purposely created to be roughly class-balanced. In our experiment, we create two subsets by sampling from the original data set to approximate a more realistic distribution in which the number of normal URLs is significantly larger than the number of malicious URLs. Table 5 shows the data sets used in our experiment for online malicious URL detection, where \(T_p/T_n\) denotes the ratio between the number of positive (malicious) instances and the number of negative (normal) instances. A variety of features were extracted to represent the content of a URL, including both lexical features (such as hostnames, primary domain, path tokens, etc.) and host-based features (such as WHOIS info, IP prefix, AS number, geographic information, etc.). More details can be found in Ma et al. (2009b).
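The exact sampling procedure used to build the two imbalanced subsets is not spelled out here. As one plausible reading, the sketch below keeps every negative (normal) instance and draws a uniform random subset of the positive (malicious) instances so that \(T_p/T_n\) matches a target ratio. The function name, the \(\pm 1\) label encoding, and the toy label vector are assumptions for illustration only.

```python
import random

def subsample_imbalanced(labels, target_ratio, seed=0):
    """Return sorted indices of a subset whose T_p / T_n is about target_ratio.

    Assumes labels are +1 (positive) or -1 (negative). All negatives are kept,
    and positives are sampled uniformly at random without replacement.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == -1]
    n_pos = min(len(positives), round(target_ratio * len(negatives)))
    kept = rng.sample(positives, n_pos) + negatives
    return sorted(kept)

# Toy labels (hypothetical): 500 positives followed by 1000 negatives.
labels = [1] * 500 + [-1] * 1000
subset = subsample_imbalanced(labels, target_ratio=0.01)
t_p = sum(1 for i in subset if labels[i] == 1)
t_n = sum(1 for i in subset if labels[i] == -1)
print(f"T_p = {t_p}, T_n = {t_n}, T_p/T_n = {t_p / t_n:.3f}")
```

Sorting the kept indices preserves the original arrival order, which matters when the subset is later replayed as an online stream.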

Table 5 The data set of malicious URL detection

6.3.2 Compared algorithms and setup

We compare the proposed CSPAA algorithms against a variety of state-of-the-art algorithms as follows:

  • “PE”: the classical PErceptron algorithm (Rosenblatt 1958), which queries the label of every instance; this is impractical as it requires a huge amount of labeled data, but it serves as a yardstick to evaluate the efficacy of our algorithm;

  • “PA”: the regular Passive-Aggressive algorithm (Crammer et al. 2006), which also queries the class label of every instance; similarly, this is another yardstick for comparison;

  • “CW-diag”: the Confidence Weighted (CW) algorithm (Crammer et al. 2008), which also queries the label of every instance and exploits second-order information; we adopt the CW-diag version to make it feasible for high-dimensional data;

  • “PAUM”: the cost-sensitive Perceptron Algorithm with Uneven Margin (Li et al. 2002), which also queries the label of every instance;

  • “CPA”: the Cost-sensitive Passive-Aggressive algorithm based on prediction (Crammer et al. 2006), which also queries all labels;

  • “LEPE”: the Label Efficient Perceptron algorithm (Cesa-Bianchi et al. 2006), which actively queries labels for informative instances;

  • “CSRND”: a variant of the proposed CSPAA algorithm that queries the labels of incoming instances at random;

  • “CSPAA”: the proposed Cost-Sensitive Passive-Aggressive Active learning algorithm as shown in Algorithm 4.

To make a fair comparison, all the algorithms adopt the same setup. All the compared algorithms learn a linear classifier for the malicious URL detection task. In particular, for all the compared algorithms, we set the penalty parameter \(C=\rho =T_n/T_p\). For the proposed CSPAA\(_{sum}\) algorithm, we set \(\eta _p=\eta _n=1/2\) for all cases, while for CSPAA\(_{cos}\), we set \(c_p=T_n/T\) and \(c_n=T_p/T\). The smoothing parameter \(\delta \) for LEPE and CSPAA is varied over \(2^{[-10:2:10]}\), i.e., \(\delta \in \{2^{-10}, 2^{-8}, \ldots , 2^{10}\}\), in order to examine varied query ratios.
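The parameter choices above can be written out as a short sketch. The class counts used below are hypothetical (the actual values are those of Table 5), and the function name and dictionary keys are illustrative assumptions.

```python
def cost_sensitive_settings(t_p, t_n):
    """Derive the settings described above from the class counts.

    t_p: number of positive (malicious) instances
    t_n: number of negative (normal) instances
    """
    total = t_p + t_n
    return {
        "C": t_n / t_p,        # penalty parameter, C = rho = T_n / T_p
        "rho": t_n / t_p,
        "eta_p": 0.5,          # weights for the "sum" objective
        "eta_n": 0.5,
        "c_p": t_n / total,    # costs for the "cost" objective
        "c_n": t_p / total,
    }

# Grid for the smoothing parameter: 2^-10, 2^-8, ..., 2^10 (11 values).
delta_grid = [2.0 ** exponent for exponent in range(-10, 11, 2)]

# Hypothetical counts for illustration only.
settings = cost_sensitive_settings(t_p=100, t_n=9900)
print(settings)
print(delta_grid)
```

With these choices the rarer class receives the larger misclassification cost, since \(c_p\) grows as positives become scarcer.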

Table 6 “Sum” evaluation of cost-sensitive online classification for malicious URL detection
Table 7 “Cost” evaluation of cost-sensitive online classification for malicious URL detection

All the experiments were conducted over 5 random permutations of the dataset, and the reported results are averaged over these 5 runs. We evaluate the online classification performance by two key metrics: the weighted sum of sensitivity and specificity, and the weighted cost. We denote by CSPAA\(_{sum}\) the algorithm aiming to improve the weighted sum of sensitivity and specificity, and by CSPAA\(_{cos}\) the algorithm aiming to improve the overall cost. All experiments were run on a machine with a 2.3 GHz CPU.
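The two metrics can be illustrated with a small sketch. It assumes the usual definitions, which are not restated in this section: sum \(=\eta _p\times \) sensitivity \(+\,\eta _n\times \) specificity, and cost \(=c_p M_p + c_n M_n\), where \(M_p\) and \(M_n\) count the misclassified positive and negative instances. Labels are assumed to be \(\pm 1\), both classes are assumed to be present, and all function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fn, tn, fp) for labels encoded as +1 / -1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    return tp, fn, tn, fp

def accuracy(y_true, y_pred):
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fn + tn + fp)

def weighted_sum(y_true, y_pred, eta_p=0.5, eta_n=0.5):
    """eta_p * sensitivity + eta_n * specificity (higher is better)."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return eta_p * sensitivity + eta_n * specificity

def weighted_cost(y_true, y_pred, c_p, c_n):
    """c_p * (missed positives) + c_n * (false alarms) (lower is better)."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    return c_p * fn + c_n * fp

# Toy imbalanced stream: 2 positives, 98 negatives, and a trivial
# classifier that always predicts the negative class.
y_true = [1] * 2 + [-1] * 98
y_pred = [-1] * 100
print(accuracy(y_true, y_pred))                   # 0.98
print(weighted_sum(y_true, y_pred))               # 0.5
print(weighted_cost(y_true, y_pred, 0.98, 0.02))  # about 1.96
```

The trivial classifier reaches 98 % accuracy while its sum is no better than chance, which illustrates why accuracy alone can mislead on class-imbalanced data.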

6.3.3 Evaluation on fixed ratio of queries

The first experiment evaluates the performance under a fixed ratio of queries issued by the (active learning) algorithms. Table 6 shows the sum performance under a fixed query ratio of about 2 % for the URL1 dataset and 10 % for the URL2 dataset, and Table 7 summarizes the cost performance under similar query ratios.

Several observations can be drawn from the results. First, according to the classification accuracy (a misleading metric for cost-sensitive classification), both the PE and PA algorithms significantly outperform the other algorithms, whereas in terms of both the sum and cost measures they are considerably worse than their cost-sensitive variants (i.e., PAUM and CPA). This indicates the importance of taking the class imbalance issue into consideration for online malicious URL detection tasks. Second, when querying the same ratio of labeled data, CSPAA significantly outperforms the LEPE algorithm in terms of both sum and cost performance, which validates the effectiveness of the proposed cost-sensitive online updating strategy. Third, when querying the same ratio of labels, CSPAA significantly outperforms CSRND, which implies that the proposed querying strategy is able to actively select informative instances for label queries and is considerably better than random querying. Moreover, among all the approaches, the proposed CSPAA algorithm and the PAUM algorithm achieve the highest sum performance. However, CSPAA queries only an extremely small subset of labels, while PAUM needs to query the labels of all the incoming instances; labeling 1 million training instances is very expensive in a real-world application. We thus believe the proposed CSPAA algorithm is more practically attractive and suitable for a web-scale application.

Finally, we notice that the proposed CSPAA algorithm not only achieves the best sensitivity, but also achieves fairly good specificity, which is generally comparable to that of the other algorithms. This implies that the proposed CSPAA algorithm significantly improves the prediction accuracy on the rare class without sacrificing much prediction accuracy on the majority class. This promising observation again validates the effectiveness of the proposed CSPAA algorithm.

6.3.4 Evaluation on varied query ratios

This experiment evaluates the performance of the proposed algorithms under varied query ratios, in comparison with different online active learning algorithms. Figure 6 shows the online average sum performance and the online average cost performance under varied query ratios. Several observations can be drawn from the experimental results.

Fig. 6 Evaluation of the performance with respect to varied query ratios

First of all, among the four fully supervised online learning algorithms (PE, PA, PAUM, and CPA), the cost-sensitive algorithms (PAUM and CPA) generally outperform the cost-insensitive versions. This result validates the importance of studying the proposed cost-sensitive online learning methodology for class-imbalanced tasks.

Second, compared with the CSRND algorithm that randomly queries the labels, CSPAA consistently achieves a much higher sum and a much lower cost over all the ratios of queried labels, especially when the query ratio is relatively small. This promising result indicates that the querying strategy of the proposed CSPAA technique is able to effectively select informative instances for labeling from the sequentially arriving unlabeled data.

Third, compared with LEPE, CSPAA achieves a higher sum over all the ratios of queried labels, which implies that the proposed online updating strategy is able to effectively exploit the labeled data for improving the classifier. In addition, compared with PA, CSPAA with a query ratio equal to 1 (equivalent to querying the label of every instance) achieves a significantly higher sum performance, which shows that the biased penalty function does effectively optimize the objective metric of the weighted sum of sensitivity and specificity.

Finally, we notice that as the query ratio increases, the cost-sensitive classification performance of the proposed CSPAA algorithm generally improves. However, once the query ratio reaches about 1 %, the improvement tends to saturate, and the performance is very close to that of the same algorithm when it queries the label of every instance. This interesting observation indicates that the proposed learning strategy is able to attain potentially the best possible predictive performance using a small amount of labeled data (only 1 % or even less) out of the entire training data set, which can thus save a significant amount of labeling cost in a practical real-world application.

6.3.5 Evaluation on adaptive sampling factor

In the above experiments, the sampling factor \(\delta \) was simply fixed to a constant. This experiment aims to examine whether it is possible to further improve the proposed CSPAA approach using the adaptive sampling factor, denoted as “CSPAA(a)” for short (as discussed in the “remark” before Sect. 5.3). In this experiment, the initial value of \(\delta \) is set to an extremely large value, i.e., \(\delta _0 = 2^{14}\), and is updated adaptively using the proposed strategy in Sect. 5.3. To enable a fair comparison, we set appropriate parameters for the other algorithms (LEPE, CSRND and CSPAA) so that they sample a similar ratio of labeled data. Table 8 shows the experimental results for the URL2 dataset, where “CSPAA” adopts the constant sampling factor. In addition, we also show the results on the URL1 dataset under varied query ratios in Fig. 6.

Table 8 Evaluation of malicious URL detection performance on “URL2”

Some observations can be drawn from the results. First, the CSPAA(a) algorithm using the adaptive sampling factor significantly outperforms both CSRND using the random query strategy and CSPAA using a constant sampling factor under the same query ratio. Second, we found that by querying only 0.5 % of the entire 1 million instances, the proposed CSPAA(a) algorithm is able to achieve the best performance, which is almost the same as (statistically no different from, according to Student's t-test) that of the state-of-the-art cost-sensitive algorithm CPA, which has to query labels for all 1 million instances. This promising result shows that the proposed CSPAA technique is able to save a significant amount of labeling cost while maintaining state-of-the-art performance.

6.3.6 Evaluation on efficiency and scalability

Finally, we examine the time efficiency of the proposed algorithms. The “time” columns of Tables 6, 7 and 8 show the average time costs of the proposed CSPAA algorithms on the fixed query ratios. In addition to these tables, we also evaluate the scalability of the proposed algorithms, as shown in Fig. 7, which measures the online cumulative time cost of different algorithms over the number of received instances in the online malicious URL detection process.

Fig. 7 Evaluation of online cumulative time cost on the “URL2” dataset

From the results, we can see that all the proposed online learning algorithms are fairly efficient and scalable, typically taking about 20–30 s to run on the data set with 1 million instances on a single regular machine. Moreover, by examining the efficiency and scalability of the proposed CSPAA algorithms, we found that CSPAA is among the most efficient and scalable algorithms: it is at least as efficient as the other algorithms and even slightly faster than some of them. These encouraging results again validate the practical value of the proposed CSPAA algorithm for web-scale real-world applications.
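An online cumulative time curve of the kind plotted in Fig. 7 can be collected with a simple wrapper around the learner's per-instance step. The sketch below is an assumption about how such a measurement could be made, not a description of the authors' timing code; the Perceptron-style `toy_update` and the synthetic stream are hypothetical stand-ins for the actual algorithms and data.

```python
import time

def cumulative_time_curve(update, stream, checkpoint=1000):
    """Record (instances processed, elapsed seconds) every `checkpoint` steps."""
    curve = []
    start = time.perf_counter()
    for t, (x, y) in enumerate(stream, start=1):
        update(x, y)
        if t % checkpoint == 0:
            curve.append((t, time.perf_counter() - start))
    return curve

# Hypothetical stand-in learner: a plain Perceptron step on a dense vector.
weights = [0.0, 0.0, 0.0]

def toy_update(x, y):
    score = sum(w_i * x_i for w_i, x_i in zip(weights, x))
    if y * score <= 0:  # mistake (or zero margin): move weights toward y * x
        for i, x_i in enumerate(x):
            weights[i] += y * x_i

# Synthetic stream of 5000 (features, label) pairs, for illustration only.
stream = [([1.0, float(i % 3), -1.0], 1 if i % 2 else -1) for i in range(5000)]
curve = cumulative_time_curve(toy_update, stream)
for n_seen, seconds in curve:
    print(f"{n_seen:5d} instances  {seconds:.4f} s")
```

A roughly linear curve indicates a constant per-instance cost, which is what one would expect from a scalable online algorithm.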

7 Conclusions

This paper investigated online active learning techniques for addressing the open challenges of learning from sequentially arriving data in varied settings. The proposed online active learning technique not only overcomes the drawback of conventional supervised passive online learning algorithms, which have to query (or wait for) the class label of every incoming instance, but also addresses the limitation of the existing perceptron-based active learning algorithms, which often waste many queried labeled instances that are classified correctly but with low prediction confidence. Specifically, we have proposed a family of passive aggressive active (PAA) learning algorithms to tackle three different kinds of online predictive tasks: online binary classification, online multi-class classification, and cost-sensitive online classification. We theoretically analyzed the mistake bounds for the proposed PAA algorithms in the three different settings; these bounds are generally similar to those of the regular fully supervised Passive-Aggressive online learning algorithms when the class label of every incoming instance is requested. We conducted an extensive set of empirical studies, whose encouraging results showed that the proposed PAA algorithms significantly outperform the baseline approaches. For future work, we plan to address other challenges of online learning tasks, such as concept drift (Minku and Yao 2012).