1 Introduction

Online learning is a well-studied framework in machine learning, with both theoretical and practical appeal (Shalev-Shwartz, 2012). It has received widespread attention for large-scale and possibly streaming applications, owing to the efficiency and scalability it achieves by handling instances one by one. Conceptually, online learning for classification can be viewed as a sequential process involving a learner and its environment. At each round t, the learner first receives an instance \(\varvec{x}_t\), from which it is required to predict a class or label according to its current predictor \(\varvec{w}_t\). Once the learner has committed to its prediction, say \({{\hat{y}}}_t\), the environment reveals the true label \(y_t\), and the learner incurs a loss which assesses the discrepancy between the prediction \({{\hat{y}}}_t\) and the response \(y_t\). Before proceeding to the next round, the learner is allowed to choose a new predictor \(\varvec{w}_{t+1}\) in the hope of improving its predictive performance in subsequent rounds. Over the past decades, various online learning algorithms have been proposed, including online first-order (Crammer et al., 2006; Shalev-Shwartz et al., 2011; Zhai et al., 2019; Zinkevich, 2003) and second-order methods (Crammer et al., 2012; Hazan et al., 2007; Luo et al., 2016), online kernel (Lu et al., 2016a; Song et al., 2017) and multiple kernel methods (Hoi et al., 2013), online ensemble learning methods (Sun et al., 2016; Zhai et al., 2017), and so on.

According to the above protocol, online learning is a fully supervised learning process, in which the label of each incoming instance is provided by the environment. Although this protocol has been successful for handling large, fully labeled data streams, it is ill-suited for applications where labels are scarce or expensive to obtain. Consider, for example, the task of classifying web pages according to a set of predefined topics. Collecting and encoding web pages as vectorized instances is a fairly automated process, but assigning them a topic often requires time-consuming and costly human expertise. Similarly, in personalized anti-spam filtering, various techniques are known to encode incoming messages as feature vectors, but it is unreasonable to assume that a user will label every message as a “spam” or a “ham”. For such applications, a natural question arises: can we achieve strong online classification performance while using only a few labeled instances?

Online active learning has recently emerged as a promising approach for handling this issue. As usual, the learner starts each round t by making a prediction \({\hat{y}}_t\) for an incoming instance \(\varvec{x}_t\) using its model \(\varvec{w}_t\). But the key difference from passive online learning lies at the end of the round: in the active setting, the learner has to decide whether the true label \(y_t\) of the instance \(\varvec{x}_t\) should be supplied, or not. If \(y_t\) is queried, then the complete example \((\varvec{x}_t, y_t)\) is obtained and the learner uses the example to derive a new predictor \(\varvec{w}_{t+1}\). Otherwise, the current predictor \(\varvec{w}_t\) is left unchanged.

In the literature, there exists another line of active learning, namely, offline (or pool-based) active learning (Lughofer, 2017; Settles, 2009), which assumes that a pool of unlabeled instances is available before learning, and query decisions are made by evaluating the whole pool of unlabeled instances. Various labeling strategies have been developed in this offline scenario. Margin-based methods (Awasthi et al., 2015; Balcan and Long, 2013; Zhang, 2018) query instances which are close to the estimated decision boundary. Disagreement-based methods (Golovin et al., 2010; Hanneke, 2014; Tosh and Dasgupta, 2017) maintain a set of hypotheses that are consistent with the currently labeled instances, and query the unlabeled instances about which those hypotheses most disagree. Multi-criterion methods (Demir and Bruzzone, 2014; Du et al., 2017; Huang et al., 2014; Wang and Ye, 2015) combine multiple criteria for assessing the value of an unlabeled instance and query the most valuable instances.

Contrastingly, online active learning is better suited for large-scale and streaming data, handling instances one by one. In this protocol, the query strategy is applied to each incoming, unclassified instance. Based on this online active learning framework, several perceptron-based active learning algorithms that rely on a margin-based query strategy have been proposed (Cesa-Bianchi et al., 2006). At each round t, the learner draws a random variable \(Z_t \in \{0,1\}\) from a Bernoulli distribution with parameter \(b / (b + p_t)\), where \(p_t = |\varvec{w}_t ^\top \varvec{x}_t|\) is the prediction margin of the instance \(\varvec{x}_t\), and \(b > 0\) is a predefined hyperparameter used to control the probability of asking for the label \(y_t\) of \(\varvec{x}_t\). This label is revealed only when \(Z_t = 1\) and, in that case, the predictor \(\varvec{w}_t\) is updated according to a first-order or second-order perceptron rule. The margin-based label query was also advocated for the active versions of the Winnow algorithm (Cesa-Bianchi et al., 2006) and the Passive-Aggressive algorithm (Lu et al., 2016b). More recently, Hao et al. (2018) have proposed a new algorithm, called Second-order Online Active Learning (SOAL), which exploits both the prediction margin and the margin variance for asking label queries, and which updates the predictor using a variant of the Adaptive Regularization Of Weights method (Crammer et al., 2013).

In practice, some second-order online active learning methods have shown better performance than first-order methods (Cesa-Bianchi et al., 2006; Hao et al., 2018). To do so, these second-order methods maintain a correlation matrix and use it to update the online predictor. In the presence of high-dimensional data, however, maintaining and using a full correlation matrix is prohibitive in both time and space. Although Hao et al. (2018) recognized this problem and extended SOAL to use a diagonal correlation matrix, the empirical and theoretical analyses in that paper cover only the full-matrix version of SOAL, not the diagonal-matrix version. On the other hand, first-order methods are more efficient in time and space than second-order methods for handling high-dimensional data, but they may suffer from two critical limitations. First, their updating rules treat all feature dimensions equally and update each dimension with the same learning rate, which is deficient given that one feature may be seen hundreds of times while another may be seen only once. Second, their margin-based label query strategy ignores the feature-based discriminative information of instances. On this point, it is well known that infrequently occurring features are highly informative and discriminative (Crammer et al., 2012; Duchi et al., 2011) and should receive more attention when they occur. Therefore, during label querying, instances containing such infrequent features should be given a higher chance of being queried. In summary, existing research on effective and efficient online active learning for high-dimensional data remains insufficient.

Furthermore, most of the aforementioned methods are designed only for binary classification tasks, and how to generalize them to the multiclass scenario remains unknown. Indeed, to the best of our knowledge, there is only one research paper handling the multiclass problem in the online active learning setting. In (Lu et al., 2016b), Lu et al. extend their Passive-Aggressive Active learning algorithms (PAA) for binary classification to the multiclass setting and propose the Multiclass PAA (MPAA). MPAA uses the multi-prototype method (Crammer et al., 2006) together with the Multiclass Passive-Aggressive algorithms for constructing and updating the multiclass classifier online, and relies on a multiclass margin-based query strategy to query labels. In this strategy, a decision variable \(Z_t \in \{0, 1\}\) is drawn from a Bernoulli distribution with parameter \(b / (b + p_t)\), where \(p_t\) is a quantity used to approximate the true multiclass predictive margin. MPAA also suffers from two limitations. First, all feature dimensions are updated with the same learning rate. Second, its query strategy also ignores the feature-based discriminative information of instances.

In this study, we focus on novel online active classification methods that can handle high-dimensional data effectively and efficiently and that extend well to multiclass classification tasks. Our contributions are threefold:

  1.

    Two novel online active learning algorithms for binary classification are proposed, which use the adaptive subgradient methods (Duchi et al., 2011) to update the online learner when the labels of instances are revealed, and which exploit not only the margin-based predictive uncertainty of instances but also their feature-based discriminative information to identify critical instances to query. Our updating rules endow different feature dimensions with different learning rates by using a diagonal correlation matrix. Our label query strategy can discover instances that significantly improve the online predictive performance. In light of the above algorithmic design, the proposed methods can handle high-dimensional data effectively and efficiently. Both algorithms have been extended to the multiclass scenario.

  2.

    Expected mistake bounds for our proposed algorithms are provided and analyzed. The bounds reveal that when the label query ratio is larger than a certain value, our active learning algorithms are asymptotically comparable to the best fixed fully supervised classifier chosen in hindsight.

  3.

    An ablation study on six high-dimensional binary classification datasets shows the superiority of our label query strategy. Comparative experiments also indicate that, across a wide range of label query ratios, our algorithms outperform (in terms of online F1-measure) existing online active learning methods. Furthermore, experiments on six multiclass classification datasets show the advantage of our multiclass active learning algorithms.

The paper is organized as follows. Section 2 provides the notation used throughout this paper. Our proposed active learning algorithms for binary classification are presented and analyzed in Sect. 3. Further, both algorithms are extended to multiclass classification tasks in Sect. 4. Experimental comparisons and analyses are provided in Sect. 5. Finally, Sect. 6 concludes this study with some perspectives for further research.

2 Notation

For a positive integer T, let [T] denote the set \(\{1, 2, \cdots , T\}\). For an event E, we denote by \(\mathbbm {1}[E]\) the indicator function in \(\{0,1\}\) of E, namely, \(\mathbbm {1}[E] = 1\) if E happens and \(\mathbbm {1}[E] = 0\) otherwise. For a scalar a, we use \({{\,\mathrm{sgn}\,}}(a)\) to denote the sign in \(\{-1,+1\}\) of a. The ith element of a vector \(\varvec{x}_t\) is denoted by \(x_{t,i}\). We use \(\varvec{a}_{1:t} = [a_1,\cdots ,a_t]\) to denote the (row) vector representation of a scalar sequence \(\{a_i\}_{i=1}^t\). By extension, we use \(\varvec{G}_{1:t} = [\varvec{g}_1,\cdots ,\varvec{g}_t]\) to denote the \(d \times t\) matrix representation of a sequence \(\{\varvec{g}_i\}_{i=1}^t\) of (column) vectors in \(\mathbb {R}^d\), and we use \(\varvec{G}_{1:t, i}\) to denote the ith row of \(\varvec{G}_{1:t}\). The inner product of two vectors \(\varvec{w}\) and \(\varvec{v}\) is denoted by \(\varvec{w}^\top \varvec{v}\), and for any \(p \in [1, \infty ]\), we use \(\Vert \varvec{w}\Vert _p\) to denote the \(\ell _p\) norm of \(\varvec{w}\). For a vector \(\varvec{v}\), we denote by \(\mathrm {diag}(\varvec{v})\) the diagonal matrix with the elements of \(\varvec{v}\) on its diagonal, and we use \(\Vert \varvec{v} \Vert _{\varvec{A}}\) to denote the Mahalanobis norm of \(\varvec{v}\) with respect to a positive definite matrix \(\varvec{A}\), which is given by \(\sqrt{\varvec{v}^\top \varvec{A} \varvec{v}}\). Let \(\varvec{I}\) denote an identity matrix. The trace of a matrix \(\varvec{M}\) is denoted by \(\mathrm {tr}(\varvec{M})\). Given a closed convex set \({\mathcal {W}} \subseteq \mathbb {R}^d\), and a convex function \(f: \mathbb {R}^d \rightarrow \mathbb {R}\), the sub-differential set of f at the point \(\varvec{w} \in {\mathcal {W}}\) is denoted by \(\partial f(\varvec{w})\). When f is differentiable, we use \(\nabla f(\varvec{w})\) to denote its unique subgradient (the gradient) at \(\varvec{w}\).
We shall also exploit the following property.

Claim 1

(Duchi et al., 2011) Let \(\{a_t\}_{t=1}^T\) be an arbitrary sequence of scalars, and assume that \(\frac{0}{\sqrt{0}} = 0\). Then, \(\sum _{t=1}^T \frac{a_t^2}{\Vert \varvec{a}_{1:t} \Vert _2} \le 2 \Vert \varvec{a}_{1:T} \Vert _2\).
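As a quick sanity check, the inequality in this claim can be verified numerically. Below is a minimal sketch with an arbitrary sequence of our own choosing (the function name is ours, not from the paper):

```python
import math

def lhs_rhs(a):
    """Evaluate both sides of the claim for a scalar sequence a, with the convention 0/sqrt(0) = 0."""
    lhs, sq_sum = 0.0, 0.0
    for a_t in a:
        sq_sum += a_t ** 2            # ||a_{1:t}||_2^2
        norm = math.sqrt(sq_sum)      # ||a_{1:t}||_2
        if norm > 0:                  # skip the 0/sqrt(0) = 0 terms
            lhs += a_t ** 2 / norm
    rhs = 2 * math.sqrt(sq_sum)       # 2 ||a_{1:T}||_2
    return lhs, rhs

lhs, rhs = lhs_rhs([3.0, -1.0, 0.0, 2.5, 4.0])
assert lhs <= rhs
```

The inequality holds for any sequence, including sequences with leading zeros, thanks to the \(\frac{0}{\sqrt{0}} = 0\) convention.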

3 Online active learning for binary classification

3.1 Problem definition

We first focus on online active learning for binary classification. Let \((\varvec{x}_1, y_1), \cdots ,\) \((\varvec{x}_T, y_T)\) be a sequence of input examples, where \(\varvec{x}_t \in \mathbb {R}^d\) and \(y_t \in \{-1, +1\}\) for any \(t \in [T]\). It should be noted that the entire sequence of examples can be arbitrary, but is chosen beforehand. At each round t, the learner first observes an instance \(\varvec{x}_t \in \mathbb {R}^d\), and next predicts the label \({\hat{y}}_t = {{\,\mathrm{sgn}\,}}(\varvec{w}_t ^\top \varvec{x}_t)\) using its current model \(\varvec{w}_t \in \mathbb {R}^d\). Then, the learner is given the choice of querying the true label \(y_t\), or not. A variable \(Z_t \in \{0, 1\}\) is associated with the query decision. If \(Z_t = 1\), then \(y_t\) is queried and a loss \(f(\varvec{w}_t; (\varvec{x}_t, y_t))\) that measures the discrepancy between \({\hat{y}}_t\) and \(y_t\) is revealed. In light of this information, the learner can compute a new predictor \(\varvec{w}_{t+1} \in \mathbb {R}^d\). On the other hand, if \(Z_t = 0\), then \(y_t\) remains unknown and the learner simply sets \(\varvec{w}_{t+1} = \varvec{w}_t\).

In what follows, \(f_t(\varvec{w})\) is used as an abbreviation of \(f(\varvec{w}; (\varvec{x}_t, y_t))\). At each online round t, the hinge loss \(f_t (\varvec{w}_t) = \max \{0, 1- y_t \varvec{w}_t^\top \varvec{x}_t \}\) is used to measure the inaccuracy of the prediction. In order to evaluate the number of online prediction mistakes made by our learner, we introduce two new symbols:

$$\begin{aligned} M_t = \mathbbm {1}[y_t \varvec{w}_t^\top \varvec{x}_t< 0] = \mathbbm {1}[{\hat{y}}_t \ne y_t], \ L_t = \mathbbm {1}[0 \le y_t \varvec{w}_t ^\top \varvec{x}_t < 1] , \end{aligned}$$

where \(M_t\) indicates whether the learner has made a prediction mistake at round t, and \(L_t\) indicates whether the learner has made a correct prediction but without sufficient confidence.

The main goal of an online active learner is to achieve a predictive performance comparable to that of the corresponding fully supervised online learner, but using few label queries. Therefore, we compare the expected number of prediction mistakes made by our online learner, that is, \({\mathbb {E}} [\sum _{t=1}^T M_t]\), with the cumulative hinge loss of the best fully supervised classifier \(\varvec{w}^*\), taken with the benefit of hindsight. Specifically, \(\varvec{w}^* = {{\,\mathrm{argmin}\,}}_{\varvec{w} \in \mathbb {R}^d} \sum _{t=1}^T f_t(\varvec{w})\) and its cumulative loss is given by \(\sum _{t=1}^T f_t(\varvec{w}^*)\). Importantly, the prediction mistakes of our learner are evaluated on all rounds, including those where the true labels remain unknown.

3.2 Adaptive subgradient methods for binary classification

Adaptive subgradient methods (Duchi et al. 2011) are a family of online algorithms that exploit the historically observed subgradients to perform more informative learning and achieve asymptotically sub-linear regret. In this section, we introduce two specific implementations of adaptive subgradient methods that are efficient in time and space for high-dimensional data and that will be used for updating our active learner. One is based on dual averaging and the other is founded on mirror descent. Both methods are fully supervised and require querying each instance’s label. Both methods can endow each dimension of the predictor with an adaptive learning step-size. To this end, a diagonal matrix \(\varvec{H}_t\) is computed at each round t as:

$$\begin{aligned} \varvec{H}_t = \delta \varvec{I} + \mathrm {diag}(\varvec{s}_t) , \quad \text{ with } s_{t,i} = \Vert \varvec{G}_{1:t, i} \Vert _2 , \ \forall i \in [d] , \end{aligned}$$

where \(\delta > 0\) is a hyperparameter and \(\varvec{H}_t\) can be rewritten as

$$\begin{aligned} \varvec{H}_t = \delta \varvec{I} + \mathrm {diag}\left( \sum _{k=1}^t \varvec{g}_k \varvec{g}_k^\top \right) ^{\frac{1}{2}} = \delta \varvec{I} + \mathrm {diag}\left( \sum _{k=1}^t \mathbbm {1}[f_k(\varvec{w}_k) > 0] \varvec{x}_k \varvec{x}_k^\top \right) ^{\frac{1}{2}} . \end{aligned}$$

Informally, \(\varvec{H}_t\) is used to approximate the Hessian of the functions \(f_t(\varvec{w})\) (Duchi et al. 2011). Relying on \(\varvec{H}_t\), the updating rules for both methods at the end of round t are defined as follows.

Dual Averaging (DA) update: the new predictor is given by

$$\begin{aligned} \varvec{w}_{t+1} = {{\,\mathrm{argmin}\,}}_{\varvec{w} \in \mathbb {R}^d} \left\{ \eta \varvec{w}^{\top }\left( \sum _{k =1}^t \varvec{g}_k \right) + \frac{1}{2} \varvec{w}^{\top } \varvec{H}_t \varvec{w} \right\} \end{aligned}$$
(1)

Mirror Descent (MD) update: the new predictor is given by

$$\begin{aligned} \varvec{w}_{t+1} = {{\,\mathrm{argmin}\,}}_{\varvec{w} \in \mathbb {R}^d} \left\{ \eta \varvec{g}_t^\top \varvec{w} + \frac{1}{2} (\varvec{w} - \varvec{w}_t)^{\top } \varvec{H}_t (\varvec{w} - \varvec{w}_t) \right\} \end{aligned}$$
(2)

Here \(\eta\) is the step-size hyperparameter. The updating rules (1) and (2) both admit a closed-form solution. For (1), we get \(\varvec{w}_{t+1} = - \eta \varvec{H}_t^{-1} \sum _{k =1}^t \varvec{g}_k\) and for (2), we have \(\varvec{w}_{t+1} = \varvec{w}_t - \eta \varvec{H}_t^{-1} \varvec{g}_t\). Informally, the above two methods give frequently occurring features very low learning rates and infrequent features high learning rates (Duchi et al. 2011), which is achieved by using \(\varvec{H}_t\). So conceptually, the value of each diagonal element in \(\varvec{H}_t\) captures how frequently the feature in that dimension has been seen during the online learning process.
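For illustration, here is a minimal NumPy sketch of one round of these closed-form diagonal updates, storing the diagonal of \(\varvec{H}_t\) as \(\delta + s_{t,i}\) (the function names, toy dimension, and random data are ours, not from the paper):

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """Subgradient of the hinge loss f_t(w) = max(0, 1 - y * w.x)."""
    return -y * x if y * (w @ x) < 1 else np.zeros_like(x)

def da_update(g_sum, s, eta, delta):
    """Dual Averaging: w_{t+1} = -eta * H_t^{-1} * sum_k g_k, with H_t = delta*I + diag(s_t)."""
    return -eta * g_sum / (delta + s)

def md_update(w, g, s, eta, delta):
    """Mirror Descent: w_{t+1} = w_t - eta * H_t^{-1} * g_t."""
    return w - eta * g / (delta + s)

# one online round on random data
d, eta, delta = 5, 0.1, 1.0
rng = np.random.default_rng(0)
w, g_sum, s = np.zeros(d), np.zeros(d), np.zeros(d)
x, y = rng.standard_normal(d), 1
g = hinge_subgradient(w, x, y)
s = np.sqrt(s ** 2 + g ** 2)          # s_{t,i} = ||G_{1:t,i}||_2, updated incrementally
g_sum += g
w_da = da_update(g_sum, s, eta, delta)
w_md = md_update(w, g, s, eta, delta)
```

Note that on the very first round the two rules coincide, since \(\varvec{w}_1 = \varvec{0}\) and the gradient sum equals the single gradient.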

3.3 Novel online active learning methods for binary classification

In this section, we aim to develop novel online active learning algorithms. The core challenges for designing an online active learner include (a) label query strategy: how to identify critical instances to label so that the predictive performance of the online learner can be significantly improved, and (b) updating rule: how to effectively update the online learner when the true label of an incoming instance is revealed. Our proposed algorithms adopt a novel discrimination-based label query and the DA or MD updating rule to handle the above challenges.

Our label query strategy is motivated by the following idea. In various applications characterized by high-dimensional, yet sparse, data instances, infrequently occurring features are known to be highly discriminative (Crammer et al. 2012; Duchi et al. 2011). Instances containing such infrequent features are therefore important for improving the predictive performance of the online predictor, and hence, it is crucial to obtain their labels. On this point, recall that the usual margin-based query strategy is to draw a random variable \(Z_t \in \{0,1\}\) from a Bernoulli distribution with parameter \(b / (b + p_t)\), where \(p_t = |\varvec{w}_t ^\top \varvec{x}_t|\) and \(b > 0\) is a predefined hyperparameter. This strategy, advocated for example in (Cesa-Bianchi et al. 2006; Lu et al. 2016b; Zhao and Hoi 2013), does not take into account the feature-based discriminative information of instances, but only considers their predictive uncertainty.

Our query strategy takes full advantage of both aspects. Here, \(Z_t\) is drawn from a Bernoulli distribution with parameter \(b / (b + q_t \mathbbm {1}[q_t > 0])\) where

$$\begin{aligned} q_t = |{\hat{p}}_t | - \frac{\eta }{2} a_t v_t, \text{ with } {\hat{p}}_t = \varvec{w}_t^\top \varvec{x}_t, \ a_t \in [0,1] \text{ and } v_t = \varvec{x}_t^\top \varvec{H}_{t-1}^{-1} \varvec{x}_t . \end{aligned}$$

The value of \(a_t\) will be clarified in Remark 3 in Sect. 3.4. The matrix \(\varvec{H}_{t-1}\) is the diagonal matrix maintained by the two adaptive subgradient methods of the previous section. We later prove that this definition of \(q_t\) helps to reduce the upper bound on the number of online prediction mistakes made by our proposed active learning algorithms. Intuitively, \(|{\hat{p}}_t|\) assesses the uncertainty of classifying the instance \(\varvec{x}_t\), but this term is compensated by \(v_t\), which quantifies the feature-based discrimination of \(\varvec{x}_t\). Recall that a smaller value of the i-th diagonal element of \(\varvec{H}_{t-1}\) implies, to some extent, that the i-th feature occurs less frequently. Thus, the larger the value of \(v_t\), the more infrequent features \(\varvec{x}_t\) contains and the more important \(\varvec{x}_t\) is. According to this strategy, the labels of instances with small \(|{\hat{p}}_t|\) and large \(v_t\) are queried with high probability. Notably, when an instance exhibits a sufficiently high value of \(v_t\), i.e. \(\frac{\eta }{2}a_t v_t \ge |{\hat{p}}_t|\), its label is queried with certainty. If \(y_t\) is queried then, in light of this information, the new predictor \(\varvec{w}_{t+1}\) is computed according to (1) or (2). Otherwise, the predictor is kept unchanged.
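The query rule above can be sketched as follows, again storing the diagonal of \(\varvec{H}_{t-1}\) as \(\delta + s_{t-1,i}\) (the function name, default hyperparameter values, and toy data are our own assumptions for illustration):

```python
import numpy as np

def query_probability(w, x, s, a_t, eta=0.1, b=1.0, delta=1.0):
    """Bernoulli parameter b / (b + q_t * 1[q_t > 0]) of the discrimination-based query rule."""
    p_hat = w @ x                      # signed margin w_t^T x_t
    v = x @ (x / (delta + s))          # v_t = x^T H_{t-1}^{-1} x for diagonal H
    q = abs(p_hat) - 0.5 * eta * a_t * v
    return 1.0 if q <= 0 else b / (b + q)

# an instance built from never-seen features (s = 0) is queried with certainty
s = np.zeros(4)
w = np.zeros(4)
x = np.array([1.0, 0.0, 1.0, 0.0])
prob = query_probability(w, x, s, a_t=1.0)
assert prob == 1.0                     # |p_hat| = 0 <= (eta/2) * a_t * v_t, so Z_t = 1
```

In contrast, a confidently classified instance made of frequently seen features (large margin, large entries of \(\varvec{s}\)) receives a query probability strictly below one.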

We present the proposed Discrimination-based Active Dual Averaging (D-ADA) algorithm and the Discrimination-based Active Mirror Descent (D-AMD) algorithm in Algorithm 1, where “discrimination” refers to the combination of margin-based uncertainty and feature-based discriminative information. Both algorithms rely on the same query strategy (Lines 5-6), and differ only in the choice of the updating rule (Line 13 for D-ADA and Line 14 for D-AMD).

Algorithm 1: Discrimination-based Active Dual Averaging (D-ADA) and Discrimination-based Active Mirror Descent (D-AMD)

Our algorithms can be implemented efficiently. Indeed, using the facts that \(\varvec{s}_0 = \varvec{0}\) and \(s_{t,i} = \sqrt{s_{t-1, i}^2 + g_{t,i}^2}\) for \(i \in [d]\), the matrix \(\varvec{H}_{t}\) can be computed at round t in time proportional to \(d'\), the number of non-zero elements in \(\varvec{x}_t\), simply by using the vector \(\varvec{s}_{t-1}\) derived at round \(t - 1\) and the subgradient \(\varvec{g}_t\) obtained at round t. Since \(\varvec{H}_t\) is diagonal, its inverse can also be found in \(O(d')\) time. Therefore, the per-round time complexity of our algorithms is \(O(d')\), and the per-round space complexity is O(d).
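This incremental maintenance of \(\varvec{s}_t\) can be sketched by touching only the \(d'\) coordinates where the subgradient is non-zero. The index/value representation of the sparse subgradient below is our own choice for illustration:

```python
import numpy as np

def sparse_s_update(s, g_idx, g_val):
    """s_{t,i} = sqrt(s_{t-1,i}^2 + g_{t,i}^2), touching only the d' non-zero coordinates of g_t."""
    s[g_idx] = np.sqrt(s[g_idx] ** 2 + g_val ** 2)
    return s

s = np.zeros(10)
# a subgradient with two non-zero entries, at indices 1 and 7
s = sparse_s_update(s, np.array([1, 7]), np.array([3.0, 4.0]))
# a later subgradient touching only index 7
s = sparse_s_update(s, np.array([7]), np.array([3.0]))
```

All other coordinates of \(\varvec{s}\) stay untouched, which is what yields the \(O(d')\) per-round time complexity.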

3.4 Theoretical analysis for D-ADA and D-AMD

The next theorem provides expected mistake bounds for D-ADA and D-AMD, that is, upper bounds on \(\mathbb {E} \left [ \sum _{t=1}^T M_t \right ]\). In all results described below, expectations are taken with respect to the randomized query strategy, and \(\varvec{w}^* = {{\,\mathrm{argmin}\,}}_{\varvec{w} \in \mathbb {R}^d} \sum _{t=1}^T f_t(\varvec{w})\).

Theorem 1

If D-ADA and D-AMD are run with \(b \ge 2\), then the expected number of online prediction mistakes made by D-ADA for T rounds satisfies the inequality:

$$\begin{aligned}&\mathbb {E} \left [ \sum _{t=1}^T M_t \right ] \le {\mathbb {E}} \left [\sum _{t=1}^T Z_t f_t(\varvec{w}^*) \right ] + \frac{b A_1}{2 \eta } \mathrm {tr}({\mathbb {E}} [\varvec{H}_T]) - \frac{1}{b} \mathbb {E} \left [ \sum _{t: q_t\le 0} L_t \right ] \nonumber \\&\quad + \frac{\eta }{2 b} {\mathbb {E}}\left [\sum _{t: q_t \le 0} a_t \Vert \varvec{g}_t \Vert _{\varvec{H}_{t-1}^{-1}}^2 \right ] + \frac{\eta }{2 b} {\mathbb {E}}\left [\sum _{t=1}^T (1-a_t) \Vert \varvec{g}_{t} \Vert ^2_{\varvec{H}_{t-1}^{-1}} \right ] \end{aligned}$$
(3)

where \(A_1 = \Vert {\varvec{w}^*}\Vert _{\infty }^2\). For D-AMD, the following inequality holds:

$$\begin{aligned}&\mathbb {E} \left [ \sum _{t=1}^T M_t \right ] \le {\mathbb {E}} \left [\sum _{t=1}^T Z_t f_t(\varvec{w}^*) \right ] + \frac{A_2 + (b-1)^2 A_1}{\eta b} \mathrm {tr}(\mathbb {E} [\varvec{H}_T]) - \frac{1}{b} \mathbb {E} \left [ \sum _{t: q_t \le 0} L_t \right ] \nonumber \\&\quad \quad \quad+ \frac{\eta }{2 b} {\mathbb {E}} \left [ \sum _{t: q_t \le 0} a_t \Vert \varvec{g}_{t} \Vert ^2_{\varvec{H}_{t}^{-1}} \right ] + \frac{\eta }{2 b} {\mathbb {E}} \left [ \sum _{t=1}^T (1-a_t) \Vert \varvec{g}_{t} \Vert ^2_{\varvec{H}_{t}^{-1}} \right ] \end{aligned}$$
(4)

where \(A_2 = \max _{t \in [T]} \Vert \varvec{w}^* - \varvec{w}_t\Vert _{\infty }^2\) and \(A_1\) is defined as above.

Remark 1

Except for the term \({\mathbb {E}} \left [\sum _{t=1}^T Z_t f_t(\varvec{w}^*) \right ]\), the dominant components of our mistake bounds depend on the expected trace of the diagonal matrix \(\varvec{H}_{T}\). Indeed, for D-ADA, under the assumption that \(\delta \ge \max _t \Vert \varvec{g}_t\Vert _{\infty }\), we can get

$$\begin{aligned} \sum _{t =1}^T \Vert \varvec{g}_t \Vert _{\varvec{H}_{t-1}^{-1}}^2 \le \sum _{t =1}^T \varvec{g}_t ^\top \mathrm {diag}(\varvec{s}_t)^{-1} \varvec{g}_t = \sum _{t =1}^T \sum _{i=1}^d \frac{g_{t,i}^2}{\Vert \varvec{G}_{1:t, i}\Vert _2} \le 2 \sum _{i=1}^d \Vert \varvec{G}_{1: T, i}\Vert _2 \end{aligned}$$

where the last inequality is derived from Claim 1. By contrast, for D-AMD, without any assumptions about \(\delta\), we can also get \(\sum _{t=1}^T \Vert \varvec{g}_t\Vert ^2_{\varvec{H}_{t}^{-1}} \le 2 \sum _{i=1}^d \Vert \varvec{G}_{1: T, i}\Vert _2\). Therefore, for D-ADA, we have

$$\begin{aligned} {\mathbb {E}}\left [\sum _{t: q_t \le 0} a_t \Vert \varvec{g}_t \Vert _{\varvec{H}_{t-1}^{-1}}^2 \right ] + {\mathbb {E}}\left [\sum _{t=1}^T (1-a_t) \Vert \varvec{g}_{t} \Vert ^2_{\varvec{H}_{t-1}^{-1}} \right ] \le 2 {\mathbb {E}}\left [ \sum _{i=1}^d \Vert \varvec{G}_{1: T, i}\Vert _2 \right ] . \end{aligned}$$

Similarly, for D-AMD, the sum of the last two terms in (4) is also less than or equal to \(2 {\mathbb {E}}\left [ \sum _{i=1}^d \Vert \varvec{G}_{1: T, i}\Vert _2 \right ]\). The facts that \(\sum _{i=1}^d \Vert \varvec{G}_{1: T, i}\Vert _2 = \mathrm {tr}(\varvec{H}_T) - \delta d\) and \(\mathrm {tr}(\varvec{H}_{T})\) is sublinear (Duchi et al. 2011) imply that as T increases, our algorithms can converge to \(\varvec{w}^*\) when the query hyperparameter \(b \ge 2\).

Remark 2

The expected mistake bounds can reveal the theoretical motivation of our query rule. Taking D-ADA for example, if the query rule exploits only the margin-based uncertainty of instances, that is, taking \(a_t = 0, \forall t \in [T]\), the sum of the last two terms in (3) reaches its maximal value \(A = \frac{\eta }{2 b} {\mathbb {E}} [\sum _{t=1}^T \Vert \varvec{g}_{t} \Vert ^2_{\varvec{H}_{t-1}^{-1}}]\). However, if the query rule also takes full advantage of the feature-based discrimination of instances, namely, taking \(0 <a_t \le 1, \forall t \in [T]\), then the sum of the last two terms in (3) is \(B = \frac{\eta }{2 b} {\mathbb {E}}\left [\sum _{t: q_t \le 0} a_t \Vert \varvec{g}_t \Vert _{\varvec{H}_{t-1}^{-1}}^2 + \sum _{t=1}^T (1-a_t) \Vert \varvec{g}_{t} \Vert ^2_{\varvec{H}_{t-1}^{-1}} \right ]\). In view of the fact that \(A - B = C = \frac{\eta }{2 b} {\mathbb {E}} [\sum _{t: q_t > 0} a_t \Vert \varvec{g}_t \Vert _{\varvec{H}_{t-1}^{-1}}^2] \ge 0\), we can derive that using the feature-based discrimination of instances tends to produce a smaller expected mistake bound, since the non-negative term C is eliminated from the upper bound of the expected number of online prediction mistakes made by D-ADA.

Remark 3

Three cases of \(a_t\) are considered:

  • Case 1 If \(a_t =0, \forall t \in [T]\), our query strategy becomes the margin-based query strategy in which feature-based discrimination of instances is not utilized.

  • Case 2 If \(a_t =1, \forall t \in [T]\), the sums of the last two terms in (3) and (4) reach their minimal values \(\frac{\eta }{2 b} {\mathbb {E}} [\sum _{t: q_t \le 0} \Vert \varvec{g}_t \Vert _{\varvec{H}_{t-1}^{-1}}^2 ]\) and \(\frac{\eta }{2 b} {\mathbb {E}} [ \sum _{t: q_t \le 0} \Vert \varvec{g}_{t} \Vert ^2_{\varvec{H}_{t}^{-1}} ]\), respectively, which would be ideal when the number of online rounds for which \(q_t \le 0\), namely \(\sum _{t: q_t \le 0} 1\), is also small. But if \(\sum _{t:q_t \le 0} 1\) cannot be made small, the number of labels queried by our algorithms is at least \(\sum _{t:q_t \le 0} 1\).

  • Case 3 If \(a_t = 1 / \max \{1, \varvec{x}_t ^\top \varvec{x}_t \} \in (0, 1], \forall t \in [T]\), taking D-ADA for example, the sum of the last two terms in (3) lies between \(\frac{\eta }{2 b} {\mathbb {E}} [\sum _{t: q_t \le 0} \Vert \varvec{g}_t \Vert _{\varvec{H}_{t-1}^{-1}}^2]\) and \(\frac{\eta }{2 b} {\mathbb {E}} [\sum _{t=1}^T \Vert \varvec{g}_{t} \Vert ^2_{\varvec{H}_{t-1}^{-1}}]\), which tends to produce a larger bound than in Case 2, but a smaller bound than in Case 1. However, since \(a_t\) takes smaller values than in Case 2, the number of online rounds for which \(q_t \le 0\) can be reduced, so that smaller label query ratios than in Case 2 can be obtained.

Remark 4

The expected number of labels queried by our algorithms is \({\mathbb {E}} [ \sum _{t: q_t \le 0} 1 + \sum _{t: q_t > 0} \frac{b}{b + q_t}]\), where the value of \(q_t\) relies on \(a_t\). As can be seen, our algorithms query at least \({\mathbb {E}} [\sum _{t: q_t \le 0} 1]\) labels. Increasing the value of b triggers more label queries. Since \(q_t\) is data-dependent, we are unable to provide an upper bound on the number of queried labels.
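The expression above can be accumulated directly from a realized sequence of \(q_t\) values. A minimal sketch (the function name and the sample sequence are ours, for illustration only):

```python
def expected_queries(q_values, b=1.0):
    """E[#queries] = sum over rounds of 1 if q_t <= 0, else b / (b + q_t)."""
    return sum(1.0 if q <= 0 else b / (b + q) for q in q_values)

# rounds with q_t <= 0 contribute a certain query; larger b raises the
# query probability on the remaining rounds
qs = [-0.2, 0.5, 1.0, 0.0, 2.0]
low_b = expected_queries(qs, b=0.5)
high_b = expected_queries(qs, b=4.0)
assert low_b < high_b
```

The two rounds with \(q_t \le 0\) are always queried, matching the lower bound \({\mathbb {E}} [\sum _{t: q_t \le 0} 1]\) stated above.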

4 Extension to online multiclass classification

4.1 Problem setting

In this section, we extend D-ADA and D-AMD to multiclass classification tasks. To achieve this goal, both the updating rules and the query strategy need to be generalized to the multiclass setting. For the updating rules, we choose the multi-prototype method of (Crammer et al., 2006), since it makes the extension feasible and, more importantly, it contributes to good theoretical properties for our extended multiclass active learning methods. At each online round t, the method maintains a multiclass classifier \(\varvec{W}_t\) that consists of C class-specific predictors \(\varvec{w}_t^{(i)}\in \mathbb {R}^d, \forall i \in [C]\). For an incoming instance \(\varvec{x}_t\), the method predicts the label of \(\varvec{x}_t\) as \({\hat{y}}_t = {{\,\mathrm{argmax}\,}}_{i \in [C]} \{ (\varvec{w}_t^{(i)}) ^\top \varvec{x}_t \}\). Similarly to binary classification, a label query strategy is used to decide whether to query the true label \(y_t \in [C]\) of \(\varvec{x}_t\). Once \(y_t\) is queried, a loss \(f(\varvec{W}_t; (\varvec{x}_t, y_t))\) that measures the predictive inaccuracy of \(\varvec{W}_t\) on the example \((\varvec{x}_t, y_t)\) is incurred. Relying on this loss, \(\varvec{W}_t\) is updated to \(\varvec{W}_{t+1}\), which can be decomposed into updates of the class-specific predictors \(\varvec{w}_t^{(i)}\). If \(y_t\) is not queried, the current classifier \(\varvec{W}_t\) is kept unchanged.

In what follows, \(f(\varvec{W}; (\varvec{x}_t, y_t))\) is abbreviated as \(f_t (\varvec{W})\). The loss that we use at round t is the multiclass hinge loss \(f_t (\varvec{W}_t) = \max \left\{ 0, 1 + (\varvec{w}_t^{(r_t)}) ^\top \varvec{x}_t - (\varvec{w}_t^{(y_t)}) ^\top \varvec{x}_t \right\}\) where \(r_t = {{\,\mathrm{argmax}\,}}_{ i \in [C], i \ne y_t} (\varvec{w}_t^{(i)}) ^\top \varvec{x}_t\). In order to evaluate the number of online prediction mistakes made by our multiclass classifier, \(M_t\) and \(L_t\) are re-defined as

$$\begin{aligned} M_t = \mathbbm {1}[\varvec{x}_t ^\top \varvec{w}_t^{(y_t)}< \varvec{x}_t^\top \varvec{w}_t^{(r_t)}] = \mathbbm {1}[{\hat{y}}_t \ne y_t] ,\ L_t = \mathbbm {1}[0 \le \varvec{x}_t^\top \varvec{w}_t^{(y_t)} - \varvec{x}_t^\top \varvec{w}_t^{(r_t)} < 1]. \end{aligned}$$
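To make these definitions concrete, the loss and the two indicators can be computed as follows (`hinge_and_indicators` is a hypothetical helper name, not from the paper):

```python
import numpy as np

def hinge_and_indicators(W, x, y):
    """Return (f_t, M_t, L_t) for example (x, y) under classifier W (shape (C, d))."""
    scores = W @ x
    # r_t: the highest-scoring class other than the true label y_t
    r = max((i for i in range(W.shape[0]) if i != y), key=lambda i: scores[i])
    margin = scores[y] - scores[r]
    f = max(0.0, 1.0 - margin)     # multiclass hinge loss f_t(W)
    M = int(margin < 0)            # prediction mistake indicator
    L = int(0 <= margin < 1)       # correct prediction but with small margin
    return f, M, L
```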

Let \(\varvec{W}^* = [\varvec{w}_*^{(1)},\cdots ,\varvec{w}_*^{(C)}]\) be the best fully supervised multiclass classifier chosen in hindsight, that is, \(\varvec{W}^* = {{\,\mathrm{argmin}\,}}_{\varvec{W} \in \mathbb {R}^{d \times C}} \sum _{t=1}^T f_t(\varvec{W})\). As usual, we compare the expected number of prediction mistakes made by our learner, that is, \({\mathbb {E}} [\sum _{t=1}^T M_t]\), with the cumulative multiclass hinge loss of \(\varvec{W}^*\), given by \(\sum _{t=1}^T f_t(\varvec{W}^*)\).

4.2 Novel online active learning algorithms for multiclass classification

4.2.1 Multiclass updating rules

We use the dual averaging and mirror descent methods to update each class-specific predictor \(\varvec{w}_t^{(i)}\). At each online round t, both updating rules maintain C class-specific diagonal matrices \(\varvec{H}_t^{(i)}\) computed in the following way:

[Figure c: update rule for \(\varvec{H}_t^{(i)}\)]

where \(\varvec{g}_t^{(i)}\) is the partial derivative of \(f_t(\varvec{W})\) with respect to \(\varvec{w}^{(i)}\) at the point \(\varvec{W}_t\). If \(f_t (\varvec{W}_t) >0\), we have

$$\begin{aligned} \varvec{g}_t ^{(i)} = {\left\{ \begin{array}{ll} \varvec{x}_t, &{} \text{ if } i = r_t; \\ -\varvec{x}_t, &{} \text{ if } i = y_t; \\ \varvec{0}, &{} \text{ otherwise. } \end{array}\right. } \end{aligned}$$

Otherwise, it follows that \(\varvec{g}_t ^{(i)} = \varvec{0}\), \(\forall i \in [C]\). Based on the matrix \(\varvec{H}_t^{(i)}\), each new predictor \(\varvec{w}_{t+1}^{(i)}\) at the end of round t is defined as follows.

Multiclass Dual Averaging (M-DA) update:

$$\begin{aligned} \varvec{w}_{t+1}^{(i)} = {{\,\mathrm{argmin}\,}}_{\varvec{w} \in \mathbb {R}^d} \left\{ \eta \, \varvec{w}^{\top }\left( \sum _{k =1}^t \varvec{g}_k^{(i)} \right) + \frac{1}{2} \varvec{w}^{\top } \varvec{H}_t^{(i)} \varvec{w} \right\} , \forall i \in [C] \end{aligned}$$
(5)

Multiclass Mirror Descent (M-MD) update:

$$\begin{aligned} \varvec{w}_{t+1}^{(i)} = {{\,\mathrm{argmin}\,}}_{\varvec{w} \in \mathbb {R}^d} \left\{ \eta {\varvec{w}^\top \varvec{g}_t^{(i)}} + \frac{1}{2} (\varvec{w} - \varvec{w}_t^{(i)} )^{\top } \varvec{H}_t^{(i)} (\varvec{w} - \varvec{w}_t^{(i)} ) \right\} , \forall i \in [C] \end{aligned}$$
(6)

Here \(\eta\) is again a step-size hyperparameter. If \(f_t (\varvec{W}_t) >0\), (5) admits the closed-form solution \(\varvec{w}_{t+1}^{(i)} = - \eta \, (\varvec{H}_t^{(i)})^{-1} \sum _{k =1}^t \varvec{g}_k^{(i)}\) for \(i \in \{y_t, r_t\}\) and \(\varvec{w}_{t+1}^{(i)} = \varvec{w}_{t}^{(i)}\) for all \(i \notin \{y_t, r_t\}\), while (6) yields \(\varvec{w}_{t+1}^{(y_t)} = \varvec{w}_t^{(y_t)} + \eta \, (\varvec{H}_t^{(y_t)})^{-1} \varvec{x}_t\), \(\varvec{w}_{t+1}^{(r_t)} = \varvec{w}_t^{(r_t)} - \eta \, (\varvec{H}_t^{(r_t)})^{-1} \varvec{x}_t\), and \(\varvec{w}_{t+1}^{(i)} = \varvec{w}_{t}^{(i)}\) for all \(i \notin \{y_t, r_t\}\). If \(f_t (\varvec{W}_t) =0\), both updating rules keep \(\varvec{w}_{t+1}^{(i)} = \varvec{w}_{t}^{(i)}\) for all \(i \in [C]\).
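The closed-form M-DA update can be sketched as below. Since the update rule for \(\varvec{H}_t^{(i)}\) is given in a figure not reproduced here, the sketch assumes an AdaGrad-style diagonal accumulator \(\varvec{H}_t^{(i)} = \delta \varvec{I} + \mathrm {diag}(\sum _{k \le t} \varvec{g}_k^{(i)} \odot \varvec{g}_k^{(i)})^{1/2}\) in the spirit of Duchi et al. (2011); this is our assumption, not necessarily the paper's exact choice:

```python
import numpy as np

def mda_step(W, G_sum, S, x, y, eta=0.1, delta=1e-3):
    """One M-DA round on a queried example (x, y).

    W     : (C, d) predictors, updated in place.
    G_sum : (C, d) running gradient sums, one row per class.
    S     : (C, d) running sums of squared gradients (builds the assumed H_t^(i)).
    """
    scores = W @ x
    r = max((i for i in range(W.shape[0]) if i != y), key=lambda i: scores[i])
    if 1.0 + scores[r] - scores[y] > 0.0:       # f_t(W_t) > 0: nonzero gradients
        for i, g in ((r, x), (y, -x)):          # g_t^(r) = x, g_t^(y) = -x, else 0
            G_sum[i] += g
            S[i] += g * g
            H_diag = delta + np.sqrt(S[i])      # assumed diagonal of H_t^(i)
            W[i] = -eta * G_sum[i] / H_diag     # closed form of (5)
    return W
```

Note that classes outside \(\{y_t, r_t\}\) are untouched, which is exactly the observation motivating the query strategy of Sect. 4.2.2.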

From (5) and (6), each class-specific predictor \(\varvec{w}_{t+1}^{(i)}\) appears to be updated independently of the others, but in fact all class-specific predictors are updated simultaneously toward one global objective: for every \(i \in [C]\), \(\varvec{g}_t^{(i)}\) derives from the common loss \(f_t(\varvec{W}_t)\). Indeed, by a derivation similar to that in (Duchi et al. 2011), one can prove that the above two fully supervised multiclass classification methods achieve sublinear regret, which implies that they both asymptotically converge to the best classifier in hindsight, \(\varvec{W}^*\).

4.2.2 Multiclass query strategy

The multiclass margin-based query strategy in (Lu et al. 2016b) uses an approximate margin in place of the true margin to measure predictive uncertainty. Specifically, for an instance \(\varvec{x}_t\) with true label \(y_t\), the multiclass predictive margin is defined as \(m_t = (\varvec{w}_t^{({y}_t)})^\top \varvec{x}_t - \max _{i \in [C], i \ne {y}_t} (\varvec{w}_t^{(i)})^\top \varvec{x}_t\). Since \(y_t\) is unknown before label querying, \(m_t\) cannot be computed, so an approximate margin \(p_t\) is used in its place:

$$\begin{aligned} p_t = (\varvec{w}_t^{(\hat{y}_t)})^\top \varvec{x}_t - \max _{i \in [C], i \ne \hat{y}_t} (\varvec{w}_t^{(i)})^\top \varvec{x}_t . \end{aligned}$$
(7)

It satisfies \(p_t \le |m_t|\) for any \(t \in [T]\). The query strategy in (Lu et al. 2016b) then draws a random variable \(Z_t \in \{0,1\}\) from a Bernoulli distribution with parameter \(b / (b + p_t)\), where \(b > 0\) is still a scaling factor on \(p_t\). This strategy does not take into account the feature-based discrimination of \(\varvec{x}_t\).
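The strategy of Lu et al. (2016b) can be sketched as follows (the function name and RNG handling are our own):

```python
import numpy as np

def margin_query(W, x, b=1.0, rng=None):
    """Decide whether to query the label of x with probability b / (b + p_t)."""
    rng = rng or np.random.default_rng(0)   # fixed seed only for reproducibility
    scores = W @ x
    y_hat = int(np.argmax(scores))
    # approximated margin p_t of Eq. (7); nonnegative by construction
    p = scores[y_hat] - max(scores[i] for i in range(len(scores)) if i != y_hat)
    prob = b / (b + p)
    return bool(rng.random() < prob), p
```

Since \(p_t \ge 0\), the probability is at most 1, and instances predicted with large margins (high confidence) are queried less often.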

Our multiclass query strategy exploits both the margin-based uncertainty and the feature-based discrimination of instances. From the closed-form solutions of (5) and (6), even if \(y_t\) is queried at round t, both updating rules leave \(\varvec{w}_{t+1}^{(i)} = \varvec{w}_{t}^{(i)}\) for all \(i \notin \{y_t, r_t \}\). This means that the example \((\varvec{x}_t, y_t)\) cannot improve any class-specific predictor other than \(\varvec{w}_{t}^{(y_t)}\) and \(\varvec{w}_{t}^{(r_t)}\), so it is pointless to evaluate the feature-based discrimination of \(\varvec{x}_t\) for the other classes. In view of this, we evaluate the feature-based discrimination of \(\varvec{x}_t\) only for the two classes \(y_t\) and \(r_t\), defined as

$$\begin{aligned} \rho _t = \varvec{x}_t ^\top (\varvec{H}_{t-1}^{(y_t)})^{-1} \varvec{x}_t + \varvec{x}_t ^\top (\varvec{H}_{t-1}^{(r_t)})^{-1} \varvec{x}_t . \end{aligned}$$

The larger \(\rho _t\) is, the more infrequent features \(\varvec{x}_t\) contains for the classes \(y_t\) and \(r_t\). As before, \(y_t\) and \(r_t\) are unknown before label querying, so \(\rho _t\) cannot be computed; we use an approximate quantity \(v_t\) in its place:

$$\begin{aligned} v_t = {\varvec{x}_t ^\top (\varvec{H}_{t-1}^{({\hat{y}}_t)})^{-1} \varvec{x}_t + \max _{i \in [C], i \ne {\hat{y}}_t} \varvec{x}_t^\top (\varvec{H}_{t-1}^{(i)})^{-1} \varvec{x}_t} \end{aligned}$$
(8)

It is easy to observe that if \(y_t = {\hat{y}}_t\), then \(v_t \ge \rho _t\); if \(y_t \ne {\hat{y}}_t\), then \(r_t = {\hat{y}}_t\) and it still follows that \(v_t \ge \rho _t\).

In our multiclass query strategy, \(Z_t\) is drawn from a Bernoulli distribution with parameter \(b / (b + q_t \mathbbm {1}[q_t > 0])\) where

$$\begin{aligned} q_t = p_t - \frac{\eta }{2} a_t v_t \text{ with } a_t \in [0, 1] \end{aligned}$$
(9)

and \(p_t\) and \(v_t\) are defined in (7) and (8), respectively. Here \(a_t\) has the same definition and effect as in the binary classification setting. It is easy to check that \(q_t \le |m_t| - \frac{\eta }{2} a_t \rho _t\) for any \(t \in [T]\). If \(Z_t =1\), the M-DA or M-MD update is used to improve the multiclass classifier \(\varvec{W}_t\) to \(\varvec{W}_{t+1}\); otherwise, we set \(\varvec{g}_t^{(i)} = \varvec{0}\) for all \(i \in [C]\) and keep the classifier unchanged.
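Putting (7)–(9) together, the query probability of our strategy can be sketched as follows (the inverse diagonals of \(\varvec{H}_{t-1}^{(i)}\) are passed in precomputed; the naming is ours):

```python
import numpy as np

def query_probability(W, H_inv, x, eta=0.1, b=1.0):
    """Parameter of the Bernoulli draw for Z_t: b / (b + q_t * 1[q_t > 0]).

    H_inv : (C, d) array; row i holds the diagonal of (H_{t-1}^(i))^{-1}.
    """
    scores = W @ x
    y_hat = int(np.argmax(scores))
    others = [i for i in range(len(scores)) if i != y_hat]
    p = scores[y_hat] - max(scores[i] for i in others)                    # Eq. (7)
    v = x @ (H_inv[y_hat] * x) + max(x @ (H_inv[i] * x) for i in others)  # Eq. (8)
    a = 1.0 / max(1.0, float(x @ x))    # the a_t used by MD-ADA / MD-AMD
    q = p - 0.5 * eta * a * v           # Eq. (9)
    return b / (b + q) if q > 0.0 else 1.0
```

When \(q_t \le 0\), the instance is either uncertain or rich in infrequent features, and the label is queried with probability 1.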

Based on the above discussion, we present the Multiclass D-ADA (MD-ADA) algorithm and the Multiclass D-AMD (MD-AMD) algorithm in Algorithm 2.

[Figure d: Algorithm 2, MD-ADA and MD-AMD]

4.3 Theoretical analysis for MD-ADA and MD-AMD

Theorem 2

If MD-ADA and MD-AMD are run with \(b \ge 2\), then the expected number of online prediction mistakes made by MD-ADA for T rounds satisfies the following inequality:

$$\begin{aligned} \mathbb {E} \left [ \sum _{t=1}^T M_t \right ] \le {\mathbb {E}} \left [\sum _{t=1}^T Z_t f_t(\varvec{W}^*) \right ] + \frac{b A_1}{2 \eta } \sum _{i=1}^C \mathrm {tr}(\mathbb {E} [\varvec{H}_T^{(i)} ]) - \frac{1}{b} \mathbb {E} \left [ \sum _{t: q_t \le 0} L_t \right ] \\ + \frac{\eta }{2 b} {\mathbb {E}} \left [ \sum _{t: q_t \le 0} \sum _{i=1}^C a_t \Vert \varvec{g}_t^{(i)}\Vert ^2_{(\varvec{H}_{t-1}^{(i)})^{-1}} \right ] + \frac{\eta }{2 b} {\mathbb {E}} \left[ \sum _{t=1}^T \sum _{i=1}^C (1 - a_t) \Vert \varvec{g}_t^{(i)}\Vert ^2_{(\varvec{H}_{t-1}^{(i)})^{-1}} \right] \end{aligned}$$

where \(A_1 = \max _{i \in [C]} \Vert \varvec{w}_*^{(i)}\Vert _{\infty }^2\). For MD-AMD, the following inequality holds:

$$\begin{aligned} \mathbb {E} \left [ \sum _{t=1}^T M_t \right ] \le {\mathbb {E}} \left [\sum _{t=1}^T Z_t f_t(\varvec{W}^*) \right ] + \frac{A_2 + (b-1)^2 A_1}{\eta b} \sum _{i=1}^C \mathrm {tr}(\mathbb {E} [\varvec{H}_T^{(i)}]) - \frac{1}{b} \mathbb {E} \left [ \sum _{t: q_t \le 0} L_t \right ] \\ + \frac{\eta }{2 b} {\mathbb {E}} \left [ \sum _{t: q_t \le 0} \sum _{i=1}^C a_t \Vert \varvec{g}_t^{(i)}\Vert ^2_{(\varvec{H}_{t}^{(i)})^{-1}} \right ] + \frac{\eta }{2 b} {\mathbb {E}} \left [ \sum _{t=1}^T \sum _{i=1}^C (1 - a_t) \Vert \varvec{g}_t^{(i)}\Vert _{(\varvec{H}_{t}^{(i)})^{-1}}^2 \right ] \end{aligned}$$

where \(A_2 = \max _{i \in [C], t \in [T]} \Vert \varvec{w}_*^{(i)} - \varvec{w}_t^{(i)}\Vert _{\infty }^2\) and \(A_1\) is defined as above.

The proof is similar to that of Theorem 1, so we omit it here. The theorem reveals that our multiclass active learning algorithms converge to the best fixed fully supervised classifier \(\varvec{W}^*\) provided the query hyperparameter satisfies \(b \ge 2\).

5 Experiments

Two series of experiments have been conducted for evaluating the empirical performance of our proposed algorithms. The first series evaluates D-ADA and D-AMD for online binary classification tasks. The second series evaluates MD-ADA and MD-AMD for online multiclass classification tasks.

5.1 Evaluation of D-ADA and D-AMD for binary classification tasks

5.1.1 Binary classification datasets

We have randomly chosen six high-dimensional datasets for the experiments. On these datasets, maintaining a full correlation matrix for updating the classifier is infeasible, so a diagonal matrix must be used. The datasets are described in Table 1. Basehock and Pcmac are subsets extracted from 20Newsgroups (Footnote 1). Farm_ads was collected from text ads found on twelve websites dealing with farm animal related topics; the goal is to identify whether the content owner approves of the ad or not. Gisette is a handwritten digit recognition problem, where the task is to separate the digits '4' and '9'. Both farm_ads and gisette can be downloaded from the UCI repository. Spam_corpus (Katakis et al., 2009), collected from the anti-spam platform SpamAssassin, contains 9,324 emails, each encoded as a boolean bag-of-words vector; around 20% of these emails are spam. Url_day0, a subset of the URL dataset (Ma et al., 2009), contains all of Day 0's URLs, each represented by its lexical and host-based features. The task is to separate malicious URLs from benign ones; around 33% of these URLs are malicious.

Table 1 A summary of binary classification datasets in the experiments

5.1.2 Evaluation of our label query strategy

We perform an ablation study to demonstrate the benefit of our label query strategy. We compare the following two groups of algorithms:

  1. R-ADA, M-ADA, D-ADA, D-ADA-I: these algorithms use the same dual averaging updating rule, but different label query strategies.

  2. R-AMD, M-AMD, D-AMD, D-AMD-I: these algorithms use the same mirror descent updating rule, but different label query strategies.

Different query strategies are as follows:

  1. R-ADA and R-AMD use the random query strategy.

  2. M-ADA and M-AMD use the margin-based query strategy, which is equivalent to our query strategy with \(a_t = 0\), \(\forall t \in [T]\) in Algorithm 1.

  3. D-ADA and D-AMD use our query strategy with \(a_t = 1 / \max \{1, \varvec{x}_t ^\top \varvec{x}_t \}\), \(\forall t \in [T]\) in Algorithm 1.

  4. D-ADA-I and D-AMD-I also use our proposed query strategy, but with \(a_t = 1\), \(\forall t \in [T]\) in Algorithm 1.
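The three \(a_t\) schedules above can be written compactly as (a pure-Python sketch with a hypothetical helper name):

```python
def a_t(x, variant):
    """a_t schedules of the ablation: M-* (a_t = 0), D-* (scaled), D-*-I (a_t = 1)."""
    if variant == "M":        # margin-only: M-ADA / M-AMD
        return 0.0
    if variant == "D":        # D-ADA / D-AMD
        return 1.0 / max(1.0, sum(v * v for v in x))
    if variant == "D-I":      # D-ADA-I / D-AMD-I
        return 1.0
    raise ValueError(variant)
```

For an instance with squared norm 4, the "D" variant yields \(a_t = 0.25\), so feature-based discrimination is downweighted on datasets with large feature values.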

We evaluate the online F1-measure achieved by these algorithms at label query ratios in \(\{10^{-1}, 10^{-0.9}, \ldots , 10^{-0.1}\}\), where \(\text {F1-measure} = \frac{2 \cdot \text {precision} \cdot \text {recall}}{\text {precision} + \text {recall}}\). For each algorithm, hyperparameters are optimized by grid search with cross validation; during cross validation, only one pass over the training splits is allowed. Once the hyperparameters at each query ratio are determined, each algorithm is run 20 times, each time with a different random permutation of the examples in the dataset. The online F1-measure achieved by these active learners at different query ratios, averaged over the 20 runs, is reported in Figs. 1 and 2.
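For reference, the F1-measure over a stream of binary predictions can be computed as (labels in {0, 1}; the function name is ours):

```python
def f1_measure(y_true, y_pred, positive=1):
    """F1 = 2 * precision * recall / (precision + recall)."""
    tp = sum(t == p == positive for t, p in zip(y_true, y_pred))
    fp = sum(p == positive and t != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```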

Fig. 1: Comparison of these algorithms based on dual averaging update at various query ratios

Fig. 2: Comparison of these algorithms based on mirror descent update at various query ratios

From Fig. 1, we observe that R-ADA performs the worst, M-ADA the second worst, and D-ADA and D-ADA-I perform the best. This shows that using margin-based uncertainty is better than using nothing in the query strategy, while exploiting both margin-based predictive uncertainty and feature-based discrimination is more beneficial than using margin-based uncertainty alone. D-ADA-I sometimes cannot reach low query ratios, for example on the first three datasets. By using a smaller \(a_t\), D-ADA can reach lower query ratios, but mostly at the price of some performance degradation; thus, the performance of D-ADA is generally better than that of M-ADA but worse than that of D-ADA-I. On Gisette, D-ADA behaves similarly to M-ADA, since this dataset has very large feature values, which leads to a small value of \(a_t\); according to Algorithm 1, as \(a_t\) tends to zero, D-ADA degrades to M-ADA. A similar phenomenon can be observed in Fig. 2. These results corroborate that exploiting the feature-based discrimination of instances helps to identify critical instances for label queries and enhances predictive performance.

5.1.3 Comparison with existing algorithms

In this section, we have compared the following algorithms:

  • PAA-II (Lu et al., 2016b): Passive Aggressive Active learning.

  • SOP (Cesa-Bianchi et al., 2006): selective sampling Second-Order Perceptron.

  • SOAL (Hao et al., 2018): Second-order Online Active Learning.

  • D-ADA, D-AMD, D-ADA-I and D-AMD-I: as described in the previous section.

  • DA and MD: the fully supervised versions of D-ADA and D-AMD.

Notably, the diagonal-matrix versions of SOAL and SOP, which keep only the diagonal elements of the full correlation matrix, are used here. D-ADA and D-AMD are used on the first three datasets, while D-ADA-I and D-AMD-I are used on the remaining ones. As in the previous experiments, grid search with cross validation is used to optimize hyperparameters. Each algorithm is run 20 times on each dataset, and the online F1-measure achieved at different query ratios, averaged over the 20 runs, is reported in Fig. 3. Moreover, we also report in Table 2 the results at query ratios near \(10^{-1}\) and \(10^{-0.7}\).

Fig. 3: Online F1-measure achieved by each active learning algorithm at different query ratios

Table 2 Online F1-measure obtained at the label query ratio near \(10^{-1}\) and \(10^{-0.7}\)

The most telling observation from Fig. 3 is that our algorithms outperform all compared active learning algorithms across a wide range of label query ratios. Specifically, SOP performs the worst, PAA-II the second worst, followed by SOAL, which is still inferior to our algorithms. Moreover, D-ADA (D-ADA-I) sometimes outperforms D-AMD (D-AMD-I), and sometimes not. We also notice that our algorithms achieve F1-measure comparable to their fully supervised counterparts while using fewer label queries on these datasets. From Table 2, according to paired t-tests at the 95% confidence level, only on Farm_ads at the query ratio near 10% do our algorithms perform merely comparably to SOAL; in all remaining cases, our algorithms are significantly better than the other competitors. These experimental results demonstrate the superiority of our algorithms over the existing ones.

5.1.4 Sensitivity analysis

In this section, we analyze the sensitivity of the proposed algorithms to their hyperparameters. Specifically, we examine (a) with \(b = 1\) fixed, how the online F1-measure and query ratio vary with \(\delta\) and \(\eta\); (b) with \(\delta = 0.001\) fixed, how they vary with \(\eta\) and b; and (c) with \(\eta = 0.01\) fixed, how they vary with \(\delta\) and b. Due to space constraints, we only present the results for D-ADA on Basehock in Fig. 4, where colors encode the F1-measure or query ratio.

Fig. 4: Evaluation of the hyperparameter sensitivity for D-ADA on Basehock

From Fig. 4, we observe a common phenomenon: at many small query ratios, D-ADA obtains F1-measures comparable to or even better than those at large query ratios, which demonstrates its advantage. From Fig. 4a, d, c and f, we find that a large \(\delta\) often leads to low F1-measures, while a small \(\delta\) leads to high query ratios; this involves a tradeoff between F1-measure and query ratio. Once \(\delta\) is fixed, \(\eta\) should be neither too large nor too small, according to Fig. 4b. This observation is consistent with Theorem 1, since both too large and too small values of \(\eta\) lead to large mistake bounds, so the optimal \(\eta\) should be searched around 1. From Fig. 4b, e, c and f, we observe that once \(\delta\) and \(\eta\) are fixed, the minimal query ratio that D-ADA can attain is determined accordingly. Although the query ratio decreases with diminishing b, one can only obtain a query ratio above this minimum. In practice, we recommend first finding appropriate values of \(\delta\) and \(\eta\) by grid search, then tuning b to reach the desired query ratio.
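The recommended tuning procedure can be sketched as a two-stage search; `evaluate` stands in for a cross-validated run of the learner and is a hypothetical placeholder:

```python
from itertools import product

def two_stage_search(evaluate, deltas, etas, bs, target_ratio):
    """Stage 1: grid-search (delta, eta) for the best F1 at a reference b.
    Stage 2: shrink b until the query ratio falls below the target.

    evaluate(delta, eta, b) -> (f1, query_ratio)  # hypothetical learner run
    """
    b_ref = bs[-1]  # largest b as the reference point
    best_d, best_e = max(product(deltas, etas),
                         key=lambda de: evaluate(de[0], de[1], b_ref)[0])
    for b in sorted(bs):                        # smaller b => fewer queries
        _, ratio = evaluate(best_d, best_e, b)
        if ratio <= target_ratio:
            return best_d, best_e, b
    return best_d, best_e, max(bs)
```

If no b in the candidate set attains the target ratio, the search falls back to the largest b, reflecting the observation above that only query ratios above a certain minimum are attainable.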

5.1.5 Comparison with existing algorithms in a fixed parameter setting

In this section, we adopt a different parameter setting to repeat the comparative experiments of Sect. 5.1.3. Specifically, we fix all parameters except the query parameter b on each dataset, and adjust b to obtain different query ratios and observe the resulting F1-measure. This setting is deemed more suitable for stream-based learning. The fixed parameter values are given in Table 3; note that they are chosen coarsely, in order to observe how these algorithms behave in a practical rather than an ideal setting. SOP is excluded from this experiment since it performs the worst according to Sect. 5.1.3. The experimental results are displayed in Fig. 5.

Table 3 Fixed parameter values on each dataset
Fig. 5: Performance comparison in a fixed parameter setting

From Fig. 5, we observe that our proposed algorithms beat the other algorithms across a wide range of query ratios. Although our algorithms sometimes cannot reach very low query ratios, such as on Pcmac, the same holds for SOAL. Overall, the fixed parameter setting leads to conclusions consistent with those drawn in Sect. 5.1.3.

5.2 Evaluation of MD-ADA and MD-AMD for multiclass classification tasks

5.2.1 Multiclass classification datasets

Six multiclass datasets are chosen randomly for the experiments. These datasets are described in Table 4 and can be downloaded from the LIBSVM website (Footnote 2).

Table 4 A summary of multiclass classification datasets

5.2.2 Performance comparison

We have compared the following online multiclass active learning algorithms:

  • MPAA-II (Lu et al., 2016b): Multiclass Passive Aggressive Active learning which uses the MPA-II updating rule and the multiclass margin-based label query strategy.

  • MDA and MMD: the fully supervised versions of Algorithm 2, which query the labels of all incoming instances.

  • MR-ADA and MR-AMD: use our multiclass updating rules, but the random label query strategy.

  • MM-ADA and MM-AMD: use our multiclass updating rules, but the multiclass margin-based label query strategy. They are equivalent to Algorithm 2 that adopts \(a_t =0, \forall t \in [T]\).

  • MD-ADA and MD-AMD: Algorithm 2 with \(a_t =1/ \max \{1, \varvec{x}_t ^\top \varvec{x}_t\}, \forall t \in [T]\).

  • MD-ADA-I and MD-AMD-I: Algorithm 2 that adopts \(a_t =1, \forall t \in [T]\).

The experimental setting is similar to that for binary classification, except that online accuracy is used as the performance metric. Figs. 6 and 7 present the online accuracy achieved by these algorithms at different query ratios; note that the fully supervised MDA and MMD are drawn as horizontal lines (originally single points). To clearly measure the performance differences, we also report in Table 5 the results at query ratios near \(10^{-1}\) and \(10^{-0.7}\).

Fig. 6: Comparison of algorithms based on dual averaging update with existing methods

Fig. 7: Comparison of algorithms based on mirror descent update with existing methods

Table 5 Online accuracy obtained at the label query ratio near \(10^{-1}\) and \(10^{-0.7}\)

From Figs. 6 and 7, we observe that MD-ADA-I and MD-AMD-I cannot reach low query ratios on many datasets, but at the query ratios they can reach, they mostly perform the best. The following accuracy ordering holds on all datasets: MD-ADA-I \(\ge\) MD-ADA \(\ge\) MM-ADA > MR-ADA, and MD-AMD-I \(\ge\) MD-AMD \(\ge\) MM-AMD > MR-AMD. This again shows the importance of exploiting the feature-based discrimination of instances in the query strategy. MM-ADA outperforms MPAA-II on four datasets and MM-AMD outperforms MPAA-II on all six datasets, which shows that our second-order updating rules generally lead to better performance than the first-order rule of MPAA-II. From Table 5, we further observe that MD-AMD-I or MD-AMD significantly outperforms the other algorithms on all six datasets, according to paired t-tests at the 95% confidence level. We also find that with the same label query strategy, the M-MD update tends to yield better performance than the M-DA update on these datasets. In conclusion, our updating rules, working together with our query strategy, yield very promising results on multiclass tasks.

6 Conclusion

In this paper, two novel online active learning algorithms for binary classification, called D-ADA and D-AMD, have been proposed and analyzed. Both algorithms maintain a diagonal matrix that records the updating information of all dimensions, and exploit this matrix to endow different dimensions with adaptive learning rates. Specifically, D-ADA updates its predictor with the dual averaging method, while D-AMD uses mirror descent. To identify critical instances to label, D-ADA and D-AMD differ from the usual margin-based methods, which rely only on the predictive uncertainty of instances, by also taking full advantage of the feature-based discriminative information of instances. Further, D-ADA and D-AMD have been extended to the multiclass classification setting. Expected mistake bounds for the proposed algorithms are provided, showing that when the label query ratio exceeds a certain value, our active learning algorithms are asymptotically comparable to the best fixed fully supervised classifier chosen in hindsight. Experiments on six high-dimensional binary classification datasets corroborate the merits of our label query strategy and demonstrate that D-ADA and D-AMD outperform existing first-order and second-order active learning methods at various label query ratios. Experiments on six multiclass classification datasets likewise show the superiority of our multiclass active learning algorithms. In the future, it would be interesting to extend our methods to the multi-label classification and cost-sensitive settings.