Dynamic principal projection for cost-sensitive online multi-label classification


We study multi-label classification (MLC) with three important real-world issues: online updating, label space dimension reduction (LSDR), and cost-sensitivity. Current MLC algorithms have not been designed to address these three issues simultaneously. In this paper, we propose a novel algorithm, cost-sensitive dynamic principal projection (CS-DPP) that resolves all three issues. The foundation of CS-DPP is an online LSDR framework derived from a leading LSDR algorithm. In particular, CS-DPP is equipped with an efficient online dimension reducer motivated by matrix stochastic gradient, and establishes its theoretical backbone when coupled with a carefully-designed online regression learner. In addition, CS-DPP embeds the cost information into label weights to achieve cost-sensitivity along with theoretical guarantees. Experimental results verify that CS-DPP achieves better practical performance than current MLC algorithms across different evaluation criteria, and demonstrate the importance of resolving the three issues simultaneously.


The multi-label classification (MLC) problem allows each instance to be associated with a set of labels and reflects the nature of a wide spectrum of real-world applications (Chua et al. 2009; Bello et al. 2008; Elisseeff and Weston 2001). Traditional MLC algorithms mainly tackle the batch MLC problem, where the input data are presented in a batch (Read et al. 2011; Tsoumakas et al. 2010). Nevertheless, in many MLC applications such as e-mail categorization (Osojnik et al. 2017), multi-label examples arrive as a stream. Online analysis is therefore required as batch MLC algorithms may not meet the needs to make a prediction and update the predictor on the fly. The needs of such applications can be formalized as the online MLC (OMLC) problem.

The OMLC problem is generally more challenging than the batch one, and many mature algorithms for the batch problem have not yet been carefully extended to OMLC. Label space dimension reduction (LSDR) is a family of mature algorithms for the batch MLC problem (Chen and Lin 2012; Hsu et al. 2009; Lin et al. 2014; Tai and Lin 2012; Kapoor et al. 2012; Sun et al. 2011; Yu et al. 2014; Bi and Kwok 2013; Balasubramanian and Lebanon 2012; Bhatia et al. 2015). By viewing the label set of each instance as a high-dimensional label vector in a label space, LSDR encodes each label vector as a code vector in a lower-dimensional code space, and learns a predictor within the code space. An unseen instance is predicted by coupling the predictor with a decoder from the code space to the label space. For example, compressed sensing (CS) (Hsu et al. 2009) encodes with random projections, and decodes with sparse vector reconstruction; principal label space transformation (PLST) (Tai and Lin 2012) encodes by projecting to the key eigenvectors of the known label vectors obtained from principal component analysis (PCA), and decodes by reconstruction with the same eigenvectors. This low-dimensional encoding allows LSDR algorithms to exploit the key joint information between labels to be more robust to noise and be more effective on learning (Tai and Lin 2012). Nevertheless, to the best of our knowledge, all the LSDR algorithms mentioned above are designed only for the batch MLC problem.

Another family of MLC algorithms that have not been carefully extended for OMLC contains the cost-sensitive MLC algorithms. In particular, different MLC applications usually come with different evaluation criteria (costs) that reflect their realistic needs. It is important to design MLC algorithms that are cost-sensitive to systematically cope with different costs, because an MLC algorithm that targets one specific cost may not always perform well under other costs (Li and Lin 2014). Two representative cost-sensitive MLC algorithms are probabilistic classifier chain (PCC) (Dembczynski et al. 2010) and condensed filter tree (CFT) (Li and Lin 2014). PCC estimates the conditional probability with the classifier chain (CC) method (Read et al. 2011) and makes Bayes-optimal predictions with respect to the given cost; CFT decomposes the cost into instance weights when training the classifiers in CC. Both algorithms, again, target the batch MLC problem rather than the OMLC one.

From the discussions above, there is currently no algorithm that considers the three realistic needs of online updating, label space dimension reduction, and cost-sensitivity at the same time. The goal of this work is to study such algorithms. We first formalize the OMLC and cost-sensitive OMLC (CSOMLC) problems in Sect. 2 and discuss related work. We then extend LSDR for the OMLC problem and propose a novel online LSDR algorithm, dynamic principal projection (DPP), by connecting PLST with online PCA. In particular, we derive the DPP algorithm in Sect. 3 along with its theoretical guarantees, and resolve the issue of possible basis drifting caused by online PCA.

In Sect. 4, we further generalize DPP to cost-sensitive DPP (CS-DPP) to fully match the needs of CSOMLC with a theoretically-backed label-weighting scheme inspired by CFT. Extensive empirical studies demonstrate the strength of CS-DPP in addressing the three realistic needs in Sect. 5. In particular, we justify the necessity to consider LSDR, basis drifting and cost-sensitivity. The results show that CS-DPP significantly outperforms other OMLC competitors across different CSOMLC problems, which validates the robustness and effectiveness of CS-DPP, as concluded in Sect. 6.

Preliminaries and related work

For the MLC problem, we denote the feature vector of an instance as \(\mathbf {x} \in \mathbb {R}^d \) and its corresponding label vector as \(\mathbf {y} \in \mathcal {Y} \equiv \{+1,-1\}^K\), where \(\mathbf {y}[k] = +1\) iff the instance is associated with the k-th label out of a total of K possible labels. We take \(\mathbf {y}[k] \in \{+1,-1\}\) to conform with the common setting of online binary classification (Crammer et al. 2006), which is equivalent to another scheme, \(\mathbf {y}[k] \in \{1,0\}\), used in other MLC works (Li and Lin 2014; Read et al. 2011).

Traditional MLC methods consider the batch setting, where a training dataset \( \mathcal {D} = \{(\mathbf {x}_n, \mathbf {y}_n)\}_{n=1}^{N}\) is given at once, and the objective is to learn a classifier \(g:\mathbb {R}^d \rightarrow \{+1,-1\}^K\) from \(\mathcal {D}\) with the hope that \(\hat{\mathbf {y}} = g(\mathbf {x})\) accurately predicts the ground truth \(\mathbf {y}\) with respect to an unseen \(\mathbf {x}\). In this work, we focus on the OMLC setting, which assumes that instances \((\mathbf {x}_t, \mathbf {y}_t)\) arrive in sequence from a data stream. Whenever an \(\mathbf {x}_t\) arrives at iteration t, the OMLC algorithm is required to make a prediction \(\hat{\mathbf {y}}_t = g_t(\mathbf {x}_t)\) based on the current classifier \(g_t\) and feature vector \(\mathbf {x}_t\). The ground truth \(\mathbf {y}_t\) with respect to \(\mathbf {x}_t\) is then revealed, and the penalty of \(\hat{\mathbf {y}}_t\) is evaluated against \(\mathbf {y}_t\).

Many evaluation criteria for comparing \(\mathbf {y}\) and \(\hat{\mathbf {y}}\) have been considered in the literature to satisfy different application needs. A simple criterion (Tsoumakas et al. 2010) is the Hamming loss \(c_{\textsc {ham}}(\mathbf {y}, \hat{\mathbf {y}})= \frac{1}{K}\sum ^{K}_{k=1} \llbracket \mathbf {y}[k] \ne \hat{\mathbf {y}}[k] \rrbracket \). The Hamming loss separately considers each label during evaluation. There are other criteria that jointly evaluate all labels, such as the F1 loss (Tsoumakas et al. 2010)

$$\begin{aligned} c_{\textsc {f}}(\mathbf {y}, \hat{\mathbf {y}}) = 1 - 2 \dfrac{\sum \nolimits _{k=1}^K \llbracket \mathbf {y}[k]=+1 \text{ and } \hat{\mathbf {y}}[k] =+1\rrbracket }{\sum \nolimits _{k=1}^K \left( \llbracket \mathbf {y}[k]=+1\rrbracket + \llbracket \hat{\mathbf {y}}[k]=+1\rrbracket \right) }. \end{aligned}$$
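To make the two criteria concrete, the following is a minimal Python sketch of the Hamming loss and the F1 loss as defined above. The function names are illustrative, and returning 0 when neither vector contains a positive label is our assumption, since the formula leaves that case undefined.

```python
def hamming_loss(y, y_hat):
    """Fraction of labels on which y and y_hat (entries in {+1, -1}) disagree."""
    return sum(a != b for a, b in zip(y, y_hat)) / len(y)


def f1_loss(y, y_hat):
    """1 - 2 * (# common positives) / (# positives in y + # positives in y_hat)."""
    tp = sum(a == 1 and b == 1 for a, b in zip(y, y_hat))
    denom = sum(a == 1 for a in y) + sum(b == 1 for b in y_hat)
    if denom == 0:
        # Assumption: both vectors are all -1; the formula is undefined here.
        return 0.0
    return 1.0 - 2.0 * tp / denom


y = [+1, -1, +1, -1]
y_hat = [+1, +1, -1, -1]
print(hamming_loss(y, y_hat))  # 0.5 (two of four labels differ)
print(f1_loss(y, y_hat))       # 0.5 (one common positive, four positives in total)
```

The Hamming loss decomposes over labels, whereas the F1 loss depends on the whole label vector at once, which is why the latter cannot be handled by treating each label separately.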

In this work, we follow existing cost-sensitive MLC approaches (Li and Lin 2014) to extend OMLC to the cost-sensitive OMLC (CSOMLC) setting, which further takes the evaluation criterion as an additional input to the learning algorithm. We call the criterion a cost function and overload \(c:\{+1, -1\}^K\times \{+1,-1\}^K \rightarrow \mathbb {R}\) as its notation. The cost function evaluates the penalty of \(\hat{\mathbf {y}}\) against \(\mathbf {y}\) by \(c(\mathbf {y}, \hat{\mathbf {y}})\). We naturally assume that \(c(\cdot , \cdot )\) satisfies \(c(\mathbf {y}, \mathbf {y}) = 0\) and \(\max _{\hat{\mathbf {y}}} c(\mathbf {y}, \hat{\mathbf {y}}) \le 1\). The objective of a CSOMLC algorithm is to adaptively learn a classifier \(g_t:\mathbb {R}^d \rightarrow \{+1,-1\}^K\) based on not only the data stream but also the input cost function c such that the cumulative cost \(\sum _{t=1}^T c(\mathbf {y}_t, \hat{\mathbf {y}}_t)\) with respect to the input c, where \(\hat{\mathbf {y}}_t = g_t(\mathbf {x}_t)\), can be minimized.

Note that the cost function within the CSOMLC setting above corresponds to the example-based evaluation criteria for MLC, named because the prediction \(\hat{\mathbf {y}}_t\) of each example is evaluated against the ground truth \(\mathbf {y}_t\) independently. More sophisticated evaluation criteria such as micro-based and macro-based criteria (Tang et al. 2009; Mao et al. 2013) can also be found in the literature. The following equations highlight the difference between example-F1 (what our CSOMLC setting can handle), micro-F1 and macro-F1 when calculated on T predictions

$$\begin{aligned} \text{ Example-F1 } \text{ loss }= & {} 1 - \frac{2}{T} \sum _{t=1}^T \frac{\sum \nolimits _{k=1}^K \llbracket \mathbf {y}_t[k]=+1 \text{ and } \hat{\mathbf {y}}_t[k] =+1\rrbracket }{\sum \nolimits _{k=1}^K \left( \llbracket \mathbf {y}_t[k]=+1\rrbracket + \llbracket \hat{\mathbf {y}}_t[k]=+1\rrbracket \right) }{;}\\ \text{ Macro-F1 } \text{ loss }= & {} 1 - \frac{2}{K} \sum _{k=1}^K \frac{\sum \nolimits _{t=1}^T \llbracket \mathbf {y}_t[k]=+1 \text{ and } \hat{\mathbf {y}}_t[k] =+1\rrbracket }{\sum \nolimits _{t=1}^T \left( \llbracket \mathbf {y}_t[k]=+1\rrbracket + \llbracket \hat{\mathbf {y}}_t[k]=+1\rrbracket \right) }{;} \\ \text{ Micro-F1 } \text{ loss }= & {} 1 - 2 \frac{\sum \nolimits _{t=1}^T \sum _{k=1}^K \llbracket \mathbf {y}_t[k]=+1 \text{ and } \hat{\mathbf {y}}_t[k] =+1\rrbracket }{\sum \nolimits _{t=1}^T \sum _{k=1}^K \left( \llbracket \mathbf {y}_t[k]=+1\rrbracket + \llbracket \hat{\mathbf {y}}_t[k]=+1\rrbracket \right) }{.} \end{aligned}$$

In particular, the three criteria differ by the averaging process. Average example-F1 computes the harmonic mean of precision and recall (F1 score) per example and then computes the arithmetic mean over all examples; macro-F1 computes the harmonic mean of precision and recall per label and then computes the arithmetic mean over all labels; micro-F1 computes the harmonic mean of precision and recall over the set of all example-label predictions. The more sophisticated ones are known to be more difficult to optimize. Thus, similar to many existing cost-sensitive MLC algorithms for the batch setting (Li and Lin 2014), we consider only example-based criteria in this work, and leave the investigation of achieving cost-sensitivity for micro- and macro-based criteria to the future.

Several OMLC algorithms have been studied in the literature, including online binary relevance (Read et al. 2011), Bayesian OMLC framework (Zhang et al. 2010), and the multi-window approach using k nearest neighbors (Xioufis et al. 2011). However, none of them are cost-sensitive. That is, they cannot take the cost function into account to improve learning performance.

Cost-sensitive MLC algorithms have also been studied in the literature. Cost-sensitive RAkEL (Lo et al. 2011) and progressive RAkEL (Wu and Lin 2017) are two algorithms that generalize a famous batch MLC algorithm called RAkEL (Tsoumakas and Vlahavas 2007) to cost-sensitive learning. The former achieves cost-sensitivity for any weighted Hamming loss, and the latter achieves this for any cost function. Probabilistic classifier chain (PCC; Dembczynski et al. 2010) and condensed filter tree (CFT; Li and Lin 2014) are two other algorithms that generalize another famous batch MLC algorithm called classifier chain (CC; Read et al. 2011) to cost-sensitive learning. PCC estimates the conditional probability of the label vector via CC, and makes a Bayes-optimal prediction with respect to the cost function and the estimation. PCC in principle achieves cost-sensitivity for any cost function, but the prediction can be time-consuming unless an efficient Bayes inference rule is designed for the cost function [e.g. the F1 loss (Dembczynski et al. 2011)]. CFT embeds the cost information into CC by an \(O(K^2)\)-time step that re-weights the training instances for each classifier. All four algorithms above are designed for the batch cost-sensitive MLC problem, and it is not clear how they can be modified for the CSOMLC problem. CC-family algorithms typically suffer from the problem of ordering the labels properly to achieve decent performance. Some works have started to address the ordering problem for the original CC algorithm, such as the easy-to-hard paradigm (Liu et al. 2017), but whether those works can be well-coupled with CFT or PCC has yet to be studied.

Label space dimension reduction (LSDR) is another family of MLC algorithms. LSDR encodes each label vector as a code vector in the lower-dimensional code space, and learns a predictor from the feature vectors to the corresponding code vectors. The prediction of LSDR consists of the predictor followed by a decoder from the code space to the label space. For example, compressed sensing (CS; Hsu et al. 2009) uses random projection for encoding, takes a regressor as the predictor, and decodes by sparse vector reconstruction. Instead of using a random projection, principal label space transformation (PLST; Tai and Lin 2012) encodes the label vectors \(\{\mathbf {y}_n\}_{n=1}^N\) to their top principal components for the batch MLC problem. Some other LSDR algorithms, including conditional principal label space transformation (CPLST; Chen and Lin 2012), feature-aware implicit label space encoding (FaIE; Lin et al. 2014), canonical-correlation-analysis method (Sun et al. 2011), and low-rank empirical risk minimization for multi-label learning (Yu et al. 2014), jointly take the feature and the label vectors into account during encoding (Chen and Lin 2012; Lin et al. 2014; Sun et al. 2011; Yu et al. 2014) to further improve the performance.

The physical intuition behind LSDR algorithms is to capture the key joint information between labels before learning. By encoding to a more concise code space, LSDR algorithms enjoy the advantage of learning the predictor more effectively to improve the MLC performance. Moreover, compared with non-LSDR algorithms like RAkEL and CFT, LSDR algorithms are generally more efficient, which in turn makes them favorable candidates to be extended to online learning.

Motivated by the possible applications of online updating, the realistic needs of cost-sensitivity, and the potential effectiveness of label space dimension reduction, we take an initiative to study LSDR algorithms for the CSOMLC setting. In particular, we first adapt PLST to the OMLC setting in Sect. 3, and further generalize it to the CSOMLC setting in Sect. 4.

Dynamic principal projection

In this section, we first propose an online LSDR algorithm, dynamic principal projection (DPP), that optimizes the Hamming loss. DPP is motivated by the connection between PLST, which encodes the label vectors to their top principal components, and the rich literature of online PCA algorithms (Arora et al. 2013; Nie et al. 2016; Li et al. 2016). We shall first introduce the details of PLST. Then, we discuss the potential difficulties along with our solutions to advance PLST to our proposed DPP. To facilitate reading, the common notations that will be used in the coming sections are summarized in Table 1.

Table 1 Summary of common notations

Principal label space transformation

Given the dimension \(M \le K\) of the code space and a batch training dataset \(\mathcal {D} = \{(\mathbf {x}_n, \mathbf {y}_n)\}_{n=1}^N\), PLST, as a batch LSDR algorithm, encodes each \(\mathbf {y}_n \in \{+1, -1\}^K\) into a code vector \(\mathbf {z}_n = \mathbf {P}^* (\mathbf {y}_n - \mathbf {o})\), where \(\mathbf {o}\) is a fixed reference point for shifting \(\mathbf {y}_n\), and \(\mathbf {P}^*\) contains the top M eigenvectors of \(\sum _{n=1}^N (\mathbf {y}_n - \mathbf {o}) (\mathbf {y}_n - \mathbf {o})^\top \). While PLST works with any fixed \(\mathbf {o}\), it is worth noting that when \(\mathbf {o}\) is taken as \(\frac{1}{N} \sum _{n=1}^N \mathbf {y}_n\), the code vector \(\mathbf {z}_n\) contains the top M principal components of \(\mathbf {y}_n\). A multi-target regressor \(\mathbf {r}\) is then learned on \(\{(\mathbf {x}_n, \mathbf {z}_n)\}_{n=1}^N\), and the prediction of an unseen instance \(\mathbf {x}\) is made by

$$\begin{aligned} \hat{\mathbf {y}} = \mathrm {round}\left( (\mathbf {P}^*)^\top \mathbf {r}(\mathbf {x}) + \mathbf {o}\right) \end{aligned}$$

where \( \mathrm {round}(\mathbf {v}) = \bigl ( \mathrm {sign}(\mathbf {v}[1]), \ldots , \mathrm {sign}(\mathbf {v}[K]) \bigr )^\top \).
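The encoding, regression, and decoding steps of PLST can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `plst_fit` and `plst_predict` are hypothetical, ridge regression stands in for the unspecified multi-target regressor \(\mathbf {r}\), the regularization value is arbitrary, and \(\mathrm {sign}(0)\) is mapped to \(+1\).

```python
import numpy as np


def plst_fit(X, Y, M, lam=1.0):
    """Batch PLST sketch: PCA encoder on the labels plus a ridge regressor on the codes."""
    o = Y.mean(axis=0)                       # reference point: mean label vector
    Yc = Y - o                               # shifted label vectors y_n - o
    # Eigenvectors of sum_n (y_n - o)(y_n - o)^T; eigh returns ascending eigenvalues.
    _, vecs = np.linalg.eigh(Yc.T @ Yc)
    P = vecs[:, ::-1][:, :M].T               # M x K, rows are the top-M eigenvectors
    Z = Yc @ P.T                             # code vectors z_n = P (y_n - o), one per row
    d = X.shape[1]
    # Assumption: ridge regression plays the role of the multi-target regressor r.
    W = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ Z)   # d x M
    return P, o, W


def plst_predict(X, P, o, W):
    """Decode with round(P^T r(x) + o), where round is the entrywise sign."""
    V = (X @ W) @ P + o
    return np.where(V >= 0, 1, -1)           # assumption: sign(0) is mapped to +1


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = np.where(X @ rng.normal(size=(5, 8)) >= 0, 1, -1)   # synthetic labels in {+1, -1}
P, o, W = plst_fit(X, Y, M=3)
Y_hat = plst_predict(X, P, o, W)
print(P.shape, Y_hat.shape)
```

The rows of `P` are orthonormal, so decoding with its transpose is exactly the reconstruction "with the same eigenvectors" used for encoding.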

By projecting to the top principal components, PLST preserves the maximum amount of information within the observed label vectors. In addition, PLST is backed by the following theoretical guarantee.

Theorem 1

(Tai and Lin 2012) When making a prediction \(\hat{\mathbf {y}}\) from \(\mathbf {x}\) by \(\hat{\mathbf {y}} = \mathrm {round}\left( \mathbf {P}^\top \mathbf {r}(\mathbf {x}) + \mathbf {o}\right) \) with any left orthogonal matrix \(\mathbf {P}\), the Hamming loss

$$\begin{aligned} c_{\textsc {ham}}(\mathbf {y}, \hat{\mathbf {y}}) \le \frac{1}{K} (\underbrace{\Vert \mathbf {r}(\mathbf {x}) - \mathbf {z}\Vert ^2_2}_{\text {pred. error}} + \underbrace{\Vert (\mathbf {I} - \mathbf {P}^\top \mathbf {P})(\mathbf {y}')\Vert ^2_2}_{\text {reconstruction error}}) \end{aligned}$$

where \(\mathbf {z} \equiv \mathbf {P} \mathbf {y}'\) and \(\mathbf {y}' \equiv \mathbf {y} - \mathbf {o}\) with respect to any fixed reference point \(\mathbf {o}\).

Theorem 1 bounds the Hamming loss by the prediction and reconstruction errors. Based on the results of singular value decomposition, \(\mathbf {P}^*\) in PLST is the optimal solution for minimizing the total reconstruction error of the observed label vectors with respect to any fixed \(\mathbf {o}\), and the particular reference point \(\frac{1}{N} \sum _{n=1}^N \mathbf {y}_n\) minimizes the reconstruction error over all possible \(\mathbf {o}\). Then, by minimizing the prediction error with regressor \(\mathbf {r}\), PLST is able to minimize the Hamming loss approximately.
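The bound in Theorem 1 can also be checked numerically. The sketch below is a sanity check rather than a proof: it assumes \(\mathbf {o} = \mathbf {0}\), draws a random matrix with orthonormal rows as \(\mathbf {P}\), and simulates an arbitrary regressor output by perturbing the true code vector with Gaussian noise (all of these choices are ours).

```python
import numpy as np

rng = np.random.default_rng(1)
K, M = 6, 3
# Random M x K matrix with orthonormal rows (P P^T = I), obtained from a QR factorization.
P = np.linalg.qr(rng.normal(size=(K, M)))[0].T

worst_gap = np.inf
for _ in range(1000):
    y = rng.choice([-1.0, 1.0], size=K)          # o = 0, so y' = y
    z = P @ y                                    # true code vector
    r_x = z + rng.normal(size=M)                 # stand-in for an imperfect regressor output
    y_hat = np.where(P.T @ r_x >= 0, 1.0, -1.0)  # round(P^T r(x))
    ham = np.mean(y != y_hat)
    pred_err = np.sum((r_x - z) ** 2)
    recon_err = np.sum((y - P.T @ z) ** 2)       # ||(I - P^T P) y||^2
    worst_gap = min(worst_gap, (pred_err + recon_err) / K - ham)

print(worst_gap >= 0)   # the bound should never be violated
```

The check works because a wrongly decoded label forces the real-valued reconstruction to be at least 1 away from the true \(\pm 1\) entry, and the squared distance between \(\mathbf {y}\) and \(\mathbf {P}^\top \mathbf {r}(\mathbf {x})\) splits into the two error terms when the rows of \(\mathbf {P}\) are orthonormal.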

General online LSDR framework for DPP

The upper bound in Theorem 1 works for any regressor \(\mathbf {r}\) and any left orthogonal encoding matrix \(\mathbf {P}\). Based on the bound, we propose an online LSDR framework that approximately minimizes the Hamming loss with an online regressor \(\mathbf {r}_t\) and an online encoding matrix \(\mathbf {P}_t\) in each iteration t. Similar to PLST, the proposed framework works with any fixed reference point \(\mathbf {o}\). But for simplicity of illustration, we assume that \(\mathbf {o} = \mathbf {0}\) to remove \(\mathbf {o}\) from the derivations below. The steps of the framework are:


In each iteration t of the framework, an online prediction \(\hat{\mathbf {y}}_t\) is made with the updated \(\mathbf {r}_t\) and \(\mathbf {P}_t\). We take the online error function \(\ell ^{(t)}(\mathbf {r}, \mathbf {P})\) to be \(\Vert \mathbf {r}(\mathbf {x}_t) - \mathbf {P} \mathbf {y}_t\Vert _2^2 + \Vert (\mathbf {I} - \mathbf {P}^\top \mathbf {P})\mathbf {y}_t\Vert _2^2\), which upper bounds the Hamming loss \(c_{\textsc {ham}}(\mathbf {y}_t, \hat{\mathbf {y}}_t)\) of the online prediction. Then, by updating \(\mathbf {r}_t\) and \(\mathbf {P}_t\) with online learning algorithms that minimize the cumulative online error \(\sum _{t=1}^T \ell ^{(t)}(\mathbf {r}_t, \mathbf {P}_t)\), we can approximately minimize the cumulative Hamming loss.

The simple framework above transforms the OMLC problem to an online learning problem with an error function composed of two terms. Ideally, the online learning algorithm should update \(\mathbf {P}_t\) and \(\mathbf {r}_t\) to jointly minimize the total error from both terms. Optimizing the two terms jointly has been studied in batch LSDR algorithms like CPLST (Chen and Lin 2012), which is a successor of PLST (Tai and Lin 2012) that also operates with the upper bound in Theorem 1. Nevertheless, it is very challenging to extend CPLST to the online setting efficiently. In particular, a naïve online extension would require computing the hat matrix of the ridge regression part (from \(\mathbf {x}\) to \(\mathbf {z}\)) within CPLST in order to obtain \(\mathbf {P}_t\), and the hat matrix grows quadratically with the number of examples. That is, in an online setting, computing and storing the hat matrix needs at least \(\varOmega (T^2)\) complexity up to iteration T, which is practically infeasible.

Thus, we resort to PLST (Tai and Lin 2012), the predecessor of CPLST, to make an initial attempt towards tackling OMLC problems. PLST minimizes the two terms separately in the batch setting, and our proposed extension of PLST similarly contains two online learning algorithms, one for minimizing each term. That is, we further decompose the online learning problem to two sub-problems, one for minimizing the cumulative reconstruction error (by updating \(\mathbf {P}_t\)), and one for minimizing the cumulative prediction error (by updating \(\mathbf {r}_t\)). Designing efficient and effective algorithms for the two sub-problems turns out to be non-trivial, and will be discussed in Sects. 3.3 and 3.4.

Online minimization of reconstruction error

Next, we discuss the design of our first online learning algorithm to tackle the sub-problem of minimizing the cumulative reconstruction error \(\sum _{t=1}^T \Vert (\mathbf {I} - \mathbf {P}_t^\top \mathbf {P}_t)\mathbf {y}_t\Vert _2^2\), which corresponds to the second term (the reconstruction error) of the bound in Theorem 1. The goal is to generate a left-orthogonal matrix \(\mathbf {P}_t \in \mathbb {R}^{M \times K}\) in each iteration with a theoretical guarantee on the cumulative reconstruction error.

Our design is motivated by a simple but promising online PCA algorithm, matrix stochastic gradient (MSG) (Arora et al. 2013). MSG does not directly solve the sub-problem of our interest because the problem is non-convex over \(\mathbf {P}_t\). Instead, MSG substitutes \(\mathbf {P}_t^\top \mathbf {P}_t\) with a rank-M matrix \(\mathbf {U}_t \in \mathbb {R}^{K \times K}\) and rewrites the cumulative reconstruction error as \(\sum _{t=1}^T \mathbf {y}_t^\top (\mathbf {I} - \mathbf {U}_t) \mathbf {y}_t\). By further assuming that \(\Vert \mathbf {y}_t\Vert _2 \le 1\), MSG loosens the constraint of \(rank(\mathbf {U}_t) = M\) to \(tr(\mathbf {U}_t) = M\), and updates \(\mathbf {U}_t\) with online projected gradient descent upon receiving a new \(\mathbf {y}_t\) as

$$\begin{aligned} \begin{aligned} \mathbf {U}_{t+1} = \mathcal {P}_{tr}(\mathbf {U}_t + \eta \mathbf {y}_t \mathbf {y}_t^\top ) \end{aligned} \end{aligned}$$

where \(\eta \) is the learning rate and \(\mathcal {P}_{tr}(\cdot )\) is the projecting operator to a feasible \(\mathbf {U}\). The less-constrained \(\mathbf {U}_t\) in MSG carries the theoretical guarantee of minimizing the cumulative reconstruction error (subject to \(\mathbf {U}_t\)), but decomposing \(\mathbf {U}_t\) to a left-orthogonal \(\mathbf {P}_t \in \mathbb {R}^{M \times K}\) with theoretical guarantee on \(\mathbf {P}_t\) is not only non-trivial but also time-consuming.
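For illustration, one MSG update can be sketched as follows. The sketch assumes that the feasible set is \(\{\mathbf {U} : 0 \preceq \mathbf {U} \preceq \mathbf {I},\ tr(\mathbf {U}) = M\}\) and that \(\mathcal {P}_{tr}(\cdot )\) is the Frobenius-norm projection onto it, computed by shifting all eigenvalues by a common amount and clipping them to \([0, 1]\). The bisection search for the shift, the function names, and the learning rate are our choices, not details taken from the text.

```python
import numpy as np


def project_trace(U, M, iters=100):
    """Project a symmetric U onto {0 <= eig(U) <= 1, tr(U) = M} by shift-and-clip."""
    vals, vecs = np.linalg.eigh(U)
    # The clipped sum is 0 at shift lo and K at shift hi, and is monotone in between.
    lo, hi = -vals.max(), 1.0 - vals.min()
    for _ in range(iters):                   # bisection on the common shift s
        s = 0.5 * (lo + hi)
        if np.clip(vals + s, 0.0, 1.0).sum() > M:
            hi = s
        else:
            lo = s
    new_vals = np.clip(vals + 0.5 * (lo + hi), 0.0, 1.0)
    return (vecs * new_vals) @ vecs.T        # rebuild U from the projected eigenvalues


def msg_step(U, y, M, eta):
    """One MSG update: gradient step on y^T (I - U) y, followed by the projection."""
    return project_trace(U + eta * np.outer(y, y), M)


rng = np.random.default_rng(0)
K, M = 8, 3
U = (M / K) * np.eye(K)                      # feasible start: trace M, eigenvalues in [0, 1]
for _ in range(50):
    y = rng.choice([-1.0, 1.0], size=K) / np.sqrt(K)   # scaled so that ||y||_2 = 1
    U = msg_step(U, y, M, eta=0.1)
print(np.trace(U))
```

Each step needs a full eigendecomposition of a \(K \times K\) matrix, which illustrates why maintaining \(\mathbf {U}_t\) directly is costly when K is large.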

Capped MSG (Arora et al. 2013) is an extension of MSG with the hope of lightening the computational burden of decomposing \(\mathbf {U}_t\). In particular, Capped MSG introduces an additional (non-convex) constraint of \(rank(\mathbf {U}_t) \le M + 1\), and indirectly maintains the decomposition of \(\mathbf {U}_t\) as \((\mathbf {Q}_t, \sigma _t)\), with a left-orthogonal matrix \(\mathbf {Q}_t \in \mathbb {R}^{(M+1)\times K}\) and a vector of singular values \(\sigma _t \in \mathbb {R}^{M+1}\) such that \(\mathbf {U}_t = \mathbf {Q}_t^\top \text{ diag }(\sigma _t) \mathbf {Q}_t\). The decomposed \((\mathbf {Q}_t, \sigma _t)\) in Capped MSG enjoys the same theoretical guarantee of minimizing the reconstruction error as the \(\mathbf {U}_t\) in MSG, while the maintenance step of Capped MSG is more efficient than that of MSG. Nevertheless, because we want \(\mathbf {P}_t\) to be M by K while \(\mathbf {Q}_t\) is \((M+1)\) by K, the generated \(\mathbf {Q}_t\) in Capped MSG cannot be directly used to solve our sub-problem. A naïve idea is to generate \(\mathbf {P}_t\) by truncating the least important row of \(\mathbf {Q}_t\), but the naïve idea is no longer backed by the theoretical guarantee of Capped MSG.

Aiming to address the above difficulties, we propose an efficient and effective algorithm to stochastically generate \(\mathbf {P}_t\) from \((\mathbf {Q}_t, \sigma _t)\) maintained by Capped MSG in each iteration. To elaborate, let \(\mathbf {Q}_t^{-i}\) be \(\mathbf {Q}_t\) with its i-th row removed and \(\sigma _t[i]\) be the eigenvalue corresponding to the i-th row of \(\mathbf {Q}_t\). We generate \(\mathbf {P}_t\) by sampling from a discrete probability distribution \(\varGamma _t\), which consists of \(M+1\) events \(\{\mathbf {Q}_t^{-i}\}_{i=1}^{M+1}\) with the probability of \(\mathbf {Q}_t^{-i}\) being \(1-\sigma _t[i]\). As the projecting operator \(\mathcal {P}_{tr}(\cdot )\) ensures \(0 \le \sigma _t[i] \le 1\) for each \(\sigma _t[i]\), one can easily verify \(\varGamma _t\) to be a valid distribution with the additional fact that \(\sum _{i}\sigma _t[i] = tr(\mathbf {U}_t) = M\). The following lemma shows that the online encoding matrix generated by our simple stochastic algorithm is truly effective, and the proof can be found in Appendix A.1.

Lemma 2

Suppose \((\mathbf {Q}_t, \sigma _t)\) is obtained after an update of Capped MSG such that \(\mathbf {U}_t = \mathbf {Q}_t^\top \text{ diag }(\sigma _t) \mathbf {Q}_t\). If \(\varGamma _t\) is a discrete probability distribution over events \(\{\mathbf {Q}_t^{-i}\}_{i=1}^{M+1}\) with the probability of \(\mathbf {Q}_t^{-i}\) being \(1-\sigma _t[i]\), we have for any \(\mathbf {y}\)

$$\begin{aligned} \mathbb {E}_{\mathbf {P}_t \sim \varGamma _t} [\mathbf {y}^\top (\mathbf {I} - \mathbf {P}_t^\top \mathbf {P}_t) \mathbf {y} ] = \mathbf {y}^\top (\mathbf {I} - \mathbf {U}_t) \mathbf {y} \end{aligned}$$

Moreover, our sampling algorithm is highly efficient, with a time complexity of only \(\mathcal {O}(M)\). Note that an earlier work contains another algorithm of similar spirit (Nie et al. 2016). However, the time complexity of that algorithm is \(\mathcal {O}(K^2)\), which makes it less efficient than ours.
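The sampling step and the identity in Lemma 2 can be illustrated numerically. The sketch below uses a hypothetical state \((\mathbf {Q}_t, \sigma _t)\) rather than one produced by Capped MSG, and writes \(\mathbf {U}_t\) as \(\mathbf {Q}_t^\top \mathrm {diag}(\sigma _t) \mathbf {Q}_t\) so that the product is \(K \times K\) for \(\mathbf {Q}_t \in \mathbb {R}^{(M+1)\times K}\). The expectation is computed exactly by enumerating the \(M+1\) events instead of by Monte Carlo sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 7, 3

# Hypothetical state: Q has M+1 orthonormal rows; sigma lies in [0, 1] and sums to M.
Q = np.linalg.qr(rng.normal(size=(K, M + 1)))[0].T    # (M+1) x K
sigma = np.array([1.0, 0.9, 0.7, 0.4])
U = Q.T @ np.diag(sigma) @ Q                          # K x K with tr(U) = M

probs = 1.0 - sigma                                   # Gamma_t: P(remove row i) = 1 - sigma[i]


def sample_P(Q, probs, rng):
    """Draw P_t by removing one row of Q, chosen with probability 1 - sigma[i]."""
    i = rng.choice(len(probs), p=probs)
    return np.delete(Q, i, axis=0)                    # M x K, rows still orthonormal


# Exact expectation over the M+1 events versus the right-hand side of Lemma 2.
y = rng.normal(size=K)
I = np.eye(K)
lhs = 0.0
for i, p in enumerate(probs):
    P_i = np.delete(Q, i, axis=0)
    lhs += p * (y @ (I - P_i.T @ P_i) @ y)
rhs = y @ (I - U) @ y
print(np.isclose(lhs, rhs))
```

Drawing one index from \(M+1\) candidates is the only work done per iteration by `sample_P`, which is where the \(\mathcal {O}(M)\) sampling cost comes from (the row removal here copies the matrix only for clarity).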

To sum up, our online learning algorithm that minimizes the cumulative reconstruction error for DPP takes Capped MSG as its building block to maintain \(\mathbf {U}_t\) by \(\mathbf {Q}_t\) and \(\sigma _t\), and then samples the online encoding matrix \(\mathbf {P}_t\) from \(\varGamma _t\) derived from \((\mathbf {Q}_t, \sigma _t)\) in each iteration by our proposed sampling algorithm. Note that to fulfill the assumption of \(\Vert \mathbf {y}_t\Vert _2 \le 1\) required by Capped MSG, we apply a simple trick to scale each \(\mathbf {y}_t \in \{+1,-1\}^K\) with a factor of \(\frac{1}{\sqrt{K}}\). The predictions given by our online LSDR framework remain unchanged after the constant scaling due to the use of the \(\mathrm {round}(\cdot )\) operator.

Online minimization of prediction error

Next, we discuss another proposed online learning algorithm to solve the second sub-problem of minimizing the cumulative prediction error \(\sum _{t=1}^T \Vert \mathbf {r}_t(\mathbf {x}_t) - \mathbf {P}_t \mathbf {y}_t\Vert _2^2\), which corresponds to the first term (the prediction error) of the bound in Theorem 1. The proposed online learning algorithm is based on the well-known online ridge regression, and incorporates two different carefully designed techniques to remedy the negative effect caused by the variation of \(\mathbf {P}_t\) in each iteration.

The naïve online ridge regression parameterizes \(\mathbf {r}_t(\mathbf {x})\) to be an online linear regressor \(\mathbf {W}_t^\top \mathbf {x}\) with \(\mathbf {W}_t \in \mathbb {R}^{d \times M}\), and updates \(\mathbf {W}_t\) by

$$\begin{aligned} \mathbf {W}_t = \underset{\mathbf {W}}{\arg \min }\quad \frac{\lambda }{2} tr(\mathbf {W} \mathbf {W}^\top ) + \sum _{i=1}^{t-1} \Vert \mathbf {W}^\top \mathbf {x}_i - \mathbf {z}_i \Vert _2^2 \end{aligned}$$

where \(\mathbf {z}_i = \mathbf {P}_i \mathbf {y}_i\) is the code vector of \(\mathbf {y}_i\) regarding \(\mathbf {P}_i\), and \(\lambda \) is the regularization parameter. However, the naïve online ridge regression suffers from the drifting of the projection basis caused by varying the online encoding matrix \(\mathbf {P}_t\) as t advances. To elaborate, recall that the online regressor \(\mathbf {W}_t\) aims to predict \(\mathbf {z}_t = \mathbf {P}_t \mathbf {y}_t\) from \(\mathbf {x}_t\), where the code vector \(\mathbf {z}_t\) can essentially be viewed as the set of combination coefficients with respect to the projection basis formed by \(\mathbf {P}_t\). However, \(\mathbf {W}_t\) is learned from \(\{(\mathbf {x}_i, \mathbf {z}_i)\}_{i=1}^{t-1}\), where the learning target \(\{\mathbf {z}_i\}_{i=1}^{t-1}\) is mixed up with coefficients \(\mathbf {z}_i\) induced from different projection bases \(\mathbf {P}_i\). As a consequence, expecting \(\mathbf {W}_t^\top \mathbf {x}_t\) to give an accurate prediction of \(\mathbf {z}_t\) for any specific \(\mathbf {P}_t\) is unrealistic. As an extreme case, if \(\mathbf {P}_1 = \mathbf {P}_3 = \cdots = \mathbf {P}_{2\tau -1} = \mathbf {P}\) and \(\mathbf {P}_2 = \mathbf {P}_4 = \cdots = \mathbf {P}_{2\tau } = - \mathbf {P}\), the \(\mathbf {z}_i\)’s in the odd and even iterations are of totally opposite meanings although the projection matrices \(\mathbf {P}\) and \(-\mathbf {P}\) are mathematically equivalent in quality. The totally opposite meanings make it impossible for \(\mathbf {W}_t\) to predict \(\mathbf {z}_t\) accurately.

To remedy the problem of basis drifting, we propose two different techniques, principal basis correction (PBC) and principal basis transform (PBT), to improve online regressor \(\mathbf {W}_t\). Each of them enjoys different advantages.

Principal basis correction

The ideal solution to handle basis drifting is to “correct” the reference basis of each \(\mathbf {z}_i\) to be the latest \(\mathbf {P}_t\) used for prediction. More specifically, we want \(\mathbf {W}_t\) to be the ridge regression solution obtained from \(\{(\mathbf {x}_i, \mathbf {P}_t \mathbf {y}_i)\}_{i=1}^{t-1}\) instead of \(\{(\mathbf {x}_i, \mathbf {P}_i \mathbf {y}_i)\}_{i=1}^{t-1}\). Such a correction step ensures that the reference basis for generating the previous \(\mathbf {z}_i\)’s is the same as the basis that will be used to predict \(\mathbf{z}_t\) and decode \(\hat{\mathbf{y}}_t\) from \(\mathbf{z}_t\). Denote \(\mathbf {W}^{\text {PBC}}_t\) as the ridge regression solution of \(\{(\mathbf {x}_i, \mathbf {P}_t \mathbf {y}_i)\}_{i=1}^{t-1}\). The closed-form solution of \(\mathbf {W}^{\text {PBC}}_t\) is

$$\begin{aligned} \mathbf {W}^{\text {PBC}}_t = \underbrace{\left( \lambda \mathbf {I} + \sum _{i=1}^{t-1} \mathbf {x}_i \mathbf {x}_i^\top \right) ^{-1}}_{\mathbf {A}_{t}^{-1}}\underbrace{\left( \sum _{i=1}^{t-1} \mathbf {x}_i \mathbf {y}_i^\top \right) }_{\mathbf {B}_{t}} \mathbf {P}_t^\top \; {.} \end{aligned}$$

The part \(\mathbf {A}_t^{-1} \mathbf {B}_t\) is independent of the projection matrix \(\mathbf {P}_t\). Thus, by maintaining another d by K matrix

$$\begin{aligned} \mathbf {H}_t = \mathbf {A}_t^{-1} \mathbf {B}_t \end{aligned}$$

throughout the iterations, \(\mathbf {W}^{\text {PBC}}_t\) can be easily obtained by \(\mathbf {H}_t \mathbf {P}_t^\top \) for any \(\mathbf {P}_t\). The update of \(\mathbf {H}_t\) to \(\mathbf {H}_{t+1}\), on the other hand, requires the calculation of \(\mathbf {H}_{t+1} = (\mathbf {A}_t + \mathbf {x}_t \mathbf {x}_t^\top )^{-1}(\mathbf {B}_t + \mathbf {x}_t \mathbf {y}_t^\top )\), which at first glance has a time complexity of \(\mathcal {O}(d^3 + Kd^2)\). Fortunately, we can speed up the calculation by applying the Sherman-Morrison formula, which states that

$$\begin{aligned} (\mathbf {A}_t + \mathbf {x}_t \mathbf {x}_t^\top )^{-1}= & {} \left( \mathbf {A}_t^{-1} - \frac{\mathbf {A}_t^{-1} \mathbf {x}_t \mathbf {x}_t^\top \mathbf {A}_t^{-1}}{1 + \gamma }\right) \end{aligned}$$

with \(\gamma = \mathbf {x}_t^\top \mathbf {A}_t^{-1} \mathbf {x}_t\). Then, the calculation can be rewritten as

$$\begin{aligned} \mathbf {H}_{t+1}&= \left( \mathbf {A}_t^{-1} - \frac{\mathbf {A}_t^{-1} \mathbf {x}_t \mathbf {x}_t^\top \mathbf {A}_t^{-1}}{1 + \gamma }\right) \left( \mathbf {B}_t + \mathbf {x}_t \mathbf {y}_t^\top \right) \\&= \mathbf {A}_t^{-1} \mathbf {B}_t - \frac{\mathbf {A}_t^{-1} \mathbf {x}_t \mathbf {x}_t^\top \mathbf {A}_t^{-1} \mathbf {B}_t}{1 + \gamma } + \mathbf {A}_t^{-1} \mathbf {x}_t \mathbf {y}_t^\top - \frac{\mathbf {A}_t^{-1} \mathbf {x}_t \mathbf {x}_t^\top \mathbf {A}_t^{-1} \mathbf {x}_t \mathbf {y}_t^\top }{1 + \gamma } \\&= \mathbf {H}_t - \frac{\mathbf {A}_t^{-1}\mathbf {x}_t \tilde{\mathbf {y}}_t^\top }{1 + \gamma } + \mathbf {A}_t^{-1} \mathbf {x}_t \mathbf {y}_t^\top - \frac{\gamma \mathbf {A}_t^{-1} \mathbf {x}_t \mathbf {y}_t^\top }{1+\gamma }\\&= \mathbf {H}_{t} - \frac{\mathbf {A}_{t}^{-1} \mathbf {x}_{t} (\tilde{\mathbf {y}}_{t} - \mathbf {y}_{t})^\top }{1 + \gamma }, \\ \end{aligned}$$

where \(\tilde{\mathbf {y}}_t = \mathbf {H}_t^\top \mathbf {x}_t\). The third line follows from the fact that \(\mathbf {H}_t = \mathbf {A}_t^{-1} \mathbf {B}_t\). Thus, the d by K matrix \(\mathbf {H}_t\) can be efficiently updated online by

$$\begin{aligned} \mathbf {H}_{t+1} = \mathbf {H}_{t} - \frac{\mathbf {A}_{t}^{-1} \mathbf {x}_{t} (\tilde{\mathbf {y}}_{t} - \mathbf {y}_{t})^\top }{1 + \mathbf {x}_{t}^\top \mathbf {A}_{t}^{-1} \mathbf {x}_{{t}}} \end{aligned}$$

which requires only a time complexity of \(\mathcal {O}(d^2 + Kd)\).
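The rank-one update can be sketched in a few lines. The following is a minimal pure-Python illustration, not the authors' implementation: the function name `pbc_update`, the toy stream and the choice \(\lambda = 1\) are assumptions made for the example. It maintains \(\mathbf {A}_t^{-1}\) and \(\mathbf {H}_t\) jointly, using only matrix-vector products.

```python
def pbc_update(A_inv, H, x, y):
    """One Sherman-Morrison step for (A_t^{-1}, H_t) given a new example (x, y).

    A_inv: d x d inverse of lambda*I + sum_i x_i x_i^T (symmetric)
    H:     d x K matrix equal to A_inv times B, with B = sum_i x_i y_i^T
    """
    d, K = len(x), len(y)
    u = [sum(a * xi for a, xi in zip(row, x)) for row in A_inv]  # A_t^{-1} x_t
    gamma = sum(xi * ui for xi, ui in zip(x, u))                 # x_t^T A_t^{-1} x_t
    scale = 1.0 / (1.0 + gamma)
    # y_tilde = H_t^T x_t, the current prediction in the label space
    y_tilde = [sum(H[i][k] * x[i] for i in range(d)) for k in range(K)]
    new_H = [[H[i][k] - u[i] * (y_tilde[k] - y[k]) * scale for k in range(K)]
             for i in range(d)]
    # A_inv is symmetric, so A_inv x x^T A_inv equals the outer product u u^T
    new_A_inv = [[A_inv[i][j] - u[i] * u[j] * scale for j in range(d)]
                 for i in range(d)]
    return new_A_inv, new_H


# Toy stream (hypothetical data): d = 2 features, K = 3 labels, lambda = 1.
lam, d, K = 1.0, 2, 3
A_inv = [[(1.0 / lam if i == j else 0.0) for j in range(d)] for i in range(d)]
H = [[0.0] * K for _ in range(d)]
stream = [([0.5, 0.2], [1.0, -1.0, 1.0]),
          ([0.1, -0.7], [-1.0, -1.0, 1.0]),
          ([0.3, 0.3], [1.0, 1.0, -1.0])]
for x, y in stream:
    A_inv, H = pbc_update(A_inv, H, x, y)
```

Given any encoding matrix \(\mathbf {P}_t\), the regressor is then recovered as \(\mathbf {H}_t \mathbf {P}_t^\top \) without revisiting past examples.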

It is worth noting that \(\mathbf {H}_t\) actually stores the online ridge regression solution from \(\mathbf {x}\) to \(\mathbf {y}\). Based on the definition of \(\mathbf {H}_t\), we can then theoretically analyze the performance of our online ridge regression solution \(\mathbf {W}_t^{\textsc {PBC}}\) from \(\mathbf {x}\) to \(\mathbf {z}\) with respect to the error \(\ell ^{(t)}(\cdot ,\cdot )\) in our proposed online LSDR framework. Following the convention of online learning, we analyze the expected average regret \(\dfrac{{\mathcal {R}}}{T}\), defined as

$$\begin{aligned} \frac{\mathcal {R}}{T} = \frac{1}{T}\sum _{t=1}^T\mathbb {E}_{\mathbf {P}_t\sim \varGamma _t}[\ell ^{(t)}(\mathbf {W}_t^{\text {PBC}}, \mathbf {P}_t) - \ell ^{(t)}(\mathbf {W}_\#, \mathbf {P}^*)], \end{aligned}$$

for any given sequence of \(\{(\mathbf {P}_t, \varGamma _t)\}_{t=1}^T\), where each \(\mathbf {P}_t\) is sampled from the distribution \(\varGamma _t\). \((\mathbf {W}_\#, \mathbf {P}^*)\) here denotes the offline reference solution that is allowed to peek at the whole data stream \(\{(\mathbf {x}_t, \mathbf {y}_t)\}_{t=1}^T\). As our algorithm aims to minimize the online error function by a similar decomposition of sub-problems as PLST, we particularly consider \((\mathbf {W}_\#, \mathbf {P}^*)\) to be the solution of PLST when treating \(\{(\mathbf {x}_t, \mathbf {y}_t)\}_{t=1}^T\) as the input batch data. That is, \(\mathbf {P}^*\) is the minimizer of \(\sum _{t=1}^T \mathbf {y}_t^\top (\mathbf {I}-\mathbf {P}^\top \mathbf {P}) \mathbf {y}_t\), which corresponds to the second term of \(\ell ^{(t)}(\cdot ,\cdot )\), and \(\mathbf {W}_\#\) is the minimizer of \(\sum _{t=1}^T\Vert \mathbf {W}^\top \mathbf {x}_t - \mathbf {P}^* \mathbf {y}_t\Vert _2^2\), which corresponds to the first term of \(\ell ^{(t)}(\cdot ,\cdot )\) given \(\mathbf {P}^*\). It can be easily proved that \(\mathbf {W}_\# = \mathbf {H}^* (\mathbf {P}^*)^\top \) where \(\mathbf {H}^*\) is the optimal linear regression solution of \(\{(\mathbf {x}_t, \mathbf {y}_t)\}_{t=1}^T\). That is,

$$\begin{aligned} \mathbf {H}^* = \underset{\mathbf {H}}{\arg \min }\quad \sum _{t=1}^T \Vert \mathbf {H}^\top \mathbf {x}_t - \mathbf {y}_t\Vert _2^2 \; . \end{aligned}$$

With the expected average regret defined, we can prove its convergence by assuming the convergence of the subspace spanned by \(\mathbf {P}_t\) to the subspace spanned by \(\mathbf {P}^*\). The assumption generally holds when the M-th and \((M+1)\)-th eigenvalues of \(\sum _{t=1}^T (\mathbf {y}_t - \mathbf {o}) (\mathbf {y}_t - \mathbf {o})^\top \) are different, as the subspace spanned by \(\mathbf {P}^*\) to reach the minimum reconstruction error is consequently unique. In particular, define the expected subspace difference

$$\begin{aligned} \varDelta _t = \Vert \mathbb {E}_{\mathbf {P}_t \sim \varGamma _t}[\mathbf {P}_t^\top \mathbf {P}_t] - \left( \mathbf {P}^*\right) ^\top \mathbf {P}^*\Vert _2 \; {.} \end{aligned}$$

Theorem 3

With the definitions of \(\mathbf {H}_t\) in (7), \(\mathbf {H}^*\) in (9), \(\dfrac{\mathcal {R}}{T}\) in (8) and \(\varDelta _t\) in (10), assume that \(\Vert \mathbf {x}_t\Vert \le 1\), \(\Vert \mathbf {y}_t\Vert \le 1\) and \(\Vert \mathbf {H}_t^\top \mathbf {x}_t - \mathbf {y}_t\Vert _2^2 \le \epsilon \).

  1. For any given T, the expected cumulative regret \(\mathcal {R}\) is upper-bounded by

    $$\begin{aligned} (1+\epsilon ) \sum _{t=1}^T \varDelta _t + \frac{M}{2} \Vert \mathbf {H}^*\Vert _F^2 + 2 \epsilon Md \log \left( 1 + \frac{T}{d}\right) . \end{aligned}$$

  2. If \(\lim _{T \rightarrow \infty } \varDelta _T = 0\) and \(\Vert \mathbf {H}^*\Vert _F \le h^*\) across all iterations (see footnote 2), then \(\lim _{T \rightarrow \infty } \dfrac{\mathcal {R}}{T} = 0\).

The third assumption requires the residual errors of online ridge regression without projection to be bounded, which generally holds when there is some linear relationship between \(\mathbf {x}_t\) and \(\mathbf {y}_t\). The details of the proof of the theorem can be found in Appendix A.2. Theorem 3 guarantees the performance of PBC to be competitive with a reasonable offline baseline in the long run given the convergence of the subspace spanned by \(\mathbf {P}_t\). Such a guarantee makes the online linear regressor with PBC a solid option for DPP to tackle the sub-problem of minimizing cumulative prediction error.
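As a brief sketch of how the second claim of Theorem 3 relates to the first (our reading of the stated bound, not a substitute for the proof in Appendix A.2), dividing the bound by T and using \(\Vert \mathbf {H}^*\Vert _F \le h^*\) gives

$$\begin{aligned} \frac{\mathcal {R}}{T} \le (1+\epsilon )\, \frac{1}{T}\sum _{t=1}^T \varDelta _t + \frac{M (h^*)^2}{2T} + \frac{2 \epsilon Md}{T} \log \left( 1 + \frac{T}{d}\right) . \end{aligned}$$

The first term is the running average of a sequence that converges to zero and therefore also converges to zero, the second term decays as \(1/T\), and the third decays as \(\log (T)/T\). Hence the upper bound on the average regret vanishes as T grows.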

Principal basis transform

While PBC always gives the \(\mathbf {W}^{\text {PBC}}_t\) learned on the correct code vectors with respect to the basis formed by \(\mathbf {P}_t\), the time and space complexity of PBC depends on \(\varOmega (Kd)\) at the cost of maintaining \(\mathbf {H}_t \in \mathbb {R}^{d\times K}\). The \(\varOmega (Kd)\) dependency can make PBC computationally inefficient when both K and d are large.

To address the issue, we propose another technique, principal basis transform (PBT). Different from PBC, when a new online encoding matrix \(\mathbf {P}_{t+1}\) is presented, PBT aims at a direct basis transform of the online linear regressor from \(\mathbf {P}_t\) to \(\mathbf {P}_{t+1}\). To be more specific, PBT assumes the regressor \(\mathbf {W}_t^{\textsc {PBT}}\) to be the low-rank coefficient matrix of some unknown \(\mathbf {H}_t' \in \mathbb {R}^{d \times K}\) with reference projection basis formed by \(\mathbf {P}_t\), which can equivalently be described as \(\mathbf {W}_t^{\textsc {PBT}} = \mathbf {H}'_t \mathbf {P}_t^\top \). The goal of PBT is to update \(\mathbf {W}_t^{\textsc {PBT}}\) to \(\mathbf {W}_{t+1}^{\textsc {PBT}}\) with \((\mathbf {x}_t, \mathbf {y}_t)\) such that the reference projection basis of \(\mathbf {W}_{t+1}^{\textsc {PBT}}\) is now induced from \(\mathbf {P}_{t+1}\). PBT achieves the goal by a two-step procedure. The first step is to find the low-rank coefficient matrix \(\mathbf {W}_t'\) of \(\mathbf {H}_t'\) based on the new reference basis formed by \(\mathbf {P}_{t+1}\). However, as only the low-rank coefficient matrix \(\mathbf {W}_{t}^{\textsc {PBT}}\) rather than \(\mathbf {H}_t'\) itself is known, we approximate \(\mathbf {W}_t'\) by

$$\begin{aligned} \mathbf {W}_t' = \underset{\mathbf {W}}{\arg \min }\quad \Vert \mathbf {W} \mathbf {P}_{t+1} - \mathbf {W}_{t}^{\textsc {PBT}} \mathbf {P}_{t} \Vert _F^2 \; {.} \end{aligned}$$

Solving (11) analytically gives

$$\begin{aligned} \mathbf {W}_t' = \mathbf {W}_{t}^{\textsc {PBT}} \mathbf {P}_{t} \mathbf {P}_{t+1}^\top \; {.} \end{aligned}$$
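The closed form can be checked in one step, assuming (as for the PCA-based encoders considered here) that the rows of \(\mathbf {P}_{t+1}\) are orthonormal, i.e. \(\mathbf {P}_{t+1} \mathbf {P}_{t+1}^\top = \mathbf {I}\). Setting the gradient of the objective in (11) with respect to \(\mathbf {W}\) to zero gives

$$\begin{aligned} 2\left( \mathbf {W} \mathbf {P}_{t+1} - \mathbf {W}_{t}^{\textsc {PBT}} \mathbf {P}_{t}\right) \mathbf {P}_{t+1}^\top = \mathbf {0} \quad \Longrightarrow \quad \mathbf {W}\, \mathbf {P}_{t+1} \mathbf {P}_{t+1}^\top = \mathbf {W}_{t}^{\textsc {PBT}} \mathbf {P}_{t} \mathbf {P}_{t+1}^\top \quad \Longrightarrow \quad \mathbf {W} = \mathbf {W}_{t}^{\textsc {PBT}} \mathbf {P}_{t} \mathbf {P}_{t+1}^\top \; {.} \end{aligned}$$

Since the objective is convex in \(\mathbf {W}\), this stationary point is the minimizer.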

The second step is to update \(\mathbf {W}_t'\) with \((\mathbf {x}_t, \mathbf {y}_t)\) to obtain \(\mathbf {W}_{t+1}^{\textsc {PBT}}\) by

$$\begin{aligned} \mathbf {W}_{t+1}^{\textsc {PBT}} = \mathbf {W}_{t}' - \frac{\mathbf {A}_{t}^{-1} \mathbf {x}_{t} (\tilde{\mathbf {z}}' _{t} - \mathbf {P}_{t+1} \mathbf {y}_{t})^\top }{1 + \mathbf {x}_{t}^\top \mathbf {A}_{t}^{-1} \mathbf {x}_{{t}}} \end{aligned}$$

where \(\tilde{\mathbf {z}}'_t = \left( \mathbf {W}_t'\right) ^\top \mathbf {x}_t\). Equation (13) can be derived with a similar use of the Sherman-Morrison formula as that for (7) by replacing \((\tilde{\mathbf {y}}_t, \mathbf {y}_t)\) with \((\tilde{\mathbf {z}}'_t, \mathbf {P}_{t+1} \mathbf {y}_t)\) respectively. One can easily verify that \(\mathbf {W}_{t+1}^{\textsc {PBT}}\) obtained by (13) still keeps its reference basis as \(\mathbf {P}_{t+1}\).

Compared with PBC, PBT only has \(\varOmega (M^2(K+d))\) time complexity, which is particularly useful when \(M^2 \ll {\text {min}}(K,d)\). The appealing time complexity makes PBT a highly practical option for DPP to minimize the cumulative prediction error. The time and space complexity of the two variants of DPP are listed in Table 2.
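The two-step procedure of PBT can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions rather than the authors' implementation: the name `pbt_update`, the toy data and \(\lambda = 1\) are hypothetical, the encoding matrices are assumed to have orthonormal rows, and the regression target in the second step is \(\mathbf {P}_{t+1} \mathbf {y}_t\) as in the displayed update for \(\mathbf {W}_{t+1}^{\textsc {PBT}}\).

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def pbt_update(W, A_inv, P_t, P_next, x, y):
    """Return (W_next, A_inv_next) after a basis transform and a rank-one update.

    W: d x M regressor expressed in the basis P_t; P_t, P_next: M x K.
    """
    d, M = len(W), len(P_t)
    # Step 1: basis transform W' = W P_t P_next^T. Forming the M x M product
    # first costs O(M^2 K); applying it to W costs O(d M^2).
    R = matmul(P_t, transpose(P_next))
    W_prime = matmul(W, R)
    # Step 2: Sherman-Morrison style update toward the target P_next y.
    u = [sum(a * xi for a, xi in zip(row, x)) for row in A_inv]  # A_t^{-1} x_t
    gamma = sum(xi * ui for xi, ui in zip(x, u))
    scale = 1.0 / (1.0 + gamma)
    z_tilde = [sum(W_prime[i][m] * x[i] for i in range(d)) for m in range(M)]
    z_target = [sum(p * yk for p, yk in zip(row, y)) for row in P_next]
    W_next = [[W_prime[i][m] - u[i] * (z_tilde[m] - z_target[m]) * scale
               for m in range(M)] for i in range(d)]
    # A_inv is symmetric, so its rank-one correction is the outer product u u^T.
    A_inv_next = [[A_inv[i][j] - u[i] * u[j] * scale for j in range(d)]
                  for i in range(d)]
    return W_next, A_inv_next


# Toy run with a fixed basis (hypothetical data): d = 2, K = 3, M = 2, lambda = 1.
P = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
lam, d = 1.0, 2
W = [[0.0, 0.0] for _ in range(d)]
A_inv = [[(1.0 / lam if i == j else 0.0) for j in range(d)] for i in range(d)]
stream = [([0.5, 0.2], [1.0, -1.0, 1.0]), ([0.1, -0.7], [-1.0, -1.0, 1.0])]
for x, y in stream:
    W, A_inv = pbt_update(W, A_inv, P, P, x, y)
```

When the basis never changes, the transform in the first step is the identity and the procedure reduces to ordinary online ridge regression on the code vectors; when the basis flips from \(\mathbf {P}\) to \(-\mathbf {P}\), the first step flips the sign of the regressor accordingly.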

Table 2 Time and space complexity for two DPP variants

Generalization to cost-sensitive learning

In this section, we generalize DPP to cost-sensitive DPP (CS-DPP), which meets the requirement of CSOMLC. The key ingredient to the generalization is a carefully designed label-weighting scheme that transforms cost \(c(\mathbf {y}, \hat{\mathbf {y}})\) into the corresponding weighted Hamming loss. With the help of the label weighting scheme, we subsequently derive the optimization objective similar to Theorem 1 for general cost functions, which allows us to derive CS-DPP by reusing the building blocks of DPP.

We start with the details of our label-weighting scheme. To represent the cost with label weights, the scheme is based on a label-wise and order-dependent decomposition of \(c(\cdot ,\cdot )\), which is motivated by a similar concept in Li and Lin (2014). The scheme works as follows. Defining \(\hat{\mathbf {y}}_{\text {real}}^{(k)}\) and \(\hat{\mathbf {y}}_{\text {pred}}^{(k)}\) as

$$\begin{aligned} \hat{\mathbf {y}}_{\text {real}}^{(k)}[i] = \left\{ \begin{array}{ll} \mathbf {y}[i] &{} \text {if} \, i< k \\ \mathbf {y}[i] &{} \text {if} \, i = k \\ \hat{\mathbf {y}}[i] &{} \text {if} \, i> k \\ \end{array}\right. \quad \text {and} \quad \hat{\mathbf {y}}_{\text {pred}}^{(k)}[i] = \left\{ \begin{array}{ll} \mathbf {y}[i] &{} \text {if} \, i < k \\ -\mathbf {y}[i] &{} \text {if} \, i = k \\ \hat{\mathbf {y}}[i] &{} \text {if} \, i > k \\ \end{array}\right. \end{aligned}$$

we decompose \(c(\mathbf {y}, \hat{\mathbf {y}})\) into \(\delta ^{(1)},\ldots ,\delta ^{(K)}\) such that

$$\begin{aligned} \delta ^{(k)} = |c(\mathbf {y}, \hat{\mathbf {y}}_{\text {pred}}^{(k)}) - c(\mathbf {y}, \hat{\mathbf {y}}_{\text {real}}^{(k)})| \; {.} \end{aligned}$$

Recall that \(\mathbf {y}\) is the ground truth vector and \(\hat{\mathbf {y}}\) is the prediction vector from the algorithm. The two newly constructed vectors, \(\hat{\mathbf {y}}^{(k)}_{\text {real}}\) and \(\hat{\mathbf {y}}^{(k)}_{\text {pred}}\), can both be viewed as pseudo prediction vectors that are “better” than \(\hat{\mathbf {y}}\), as they are both perfectly correct up to the \((k-1)\)-th label. The two vectors only differ on the k-th prediction, which is correct for \(\hat{\mathbf {y}}^{(k)}_{\text {real}}\) and incorrect for \(\hat{\mathbf {y}}^{(k)}_{\text {pred}}\). The difference allows the term \(\delta ^{(k)}\) in (14) to quantify the price that the algorithm needs to pay if the k-th prediction is wrong. Then, the price \(\delta ^{(k)}\) can be viewed as an indicator of importance for predicting the k-th label correctly. Our label-weighting scheme follows such intuition by simply setting the weight of the k-th label as \(\delta ^{(k)}\). The label-weighting scheme with (14) is not only intuitive, but also enjoys a nice theoretical guarantee under a mild condition on \(c(\cdot ,\cdot )\), as shown in the following lemma.

Lemma 4

If \(c(\mathbf {y}, \hat{\mathbf {y}}^{(k)}_{\text {pred}}) - c(\mathbf {y}, \hat{\mathbf {y}}^{(k)}_{\text {real}}) \ge 0\) holds for any k, \(\mathbf {y}\) and \(\hat{\mathbf {y}}\), then for any given \(\mathbf {y}\) and \(\hat{\mathbf {y}}\), we have

$$\begin{aligned} c(\mathbf {y}, \hat{\mathbf {y}}) = \sum _{k=1}^K \delta ^{(k)} \llbracket \mathbf {y}[k] \ne \hat{\mathbf {y}}[k]\rrbracket \end{aligned}$$

The condition of the lemma, which generally holds for reasonable cost functions, simply says that for any label, a correct prediction should enjoy a lower cost than an incorrect prediction. The proof of the lemma can be found in Appendix A.3. Lemma 4 transforms \(c(\mathbf {y}, \hat{\mathbf {y}})\) into the corresponding weighted Hamming loss, and thus enables the optimization over general cost functions.
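The decomposition behind Lemma 4 can be illustrated numerically. The sketch below is a toy example, not the authors' code: the function names are hypothetical, labels are taken in \(\{-1, +1\}\), indices are 0-based (the text uses 1-based k), and the F1 loss is assigned cost 0 when both vectors contain no positive label, which is an assumed convention for the undefined 0/0 case.

```python
def f1_loss(y, y_hat):
    """F1 loss for label vectors in {-1, +1}; defined as 0 if no label is positive."""
    tp = sum(1 for a, b in zip(y, y_hat) if a == 1 and b == 1)
    denom = sum(1 for a in y if a == 1) + sum(1 for b in y_hat if b == 1)
    return 0.0 if denom == 0 else 1.0 - 2.0 * tp / denom


def label_weights(cost, y, y_hat):
    """Compute delta^(k) for every label from the order-dependent decomposition."""
    deltas = []
    for k in range(len(y)):
        # Correct up to and including label k, predicted labels afterwards.
        real = list(y[:k + 1]) + list(y_hat[k + 1:])
        # Same vector, but with label k deliberately wrong.
        pred = list(y[:k]) + [-y[k]] + list(y_hat[k + 1:])
        deltas.append(abs(cost(y, pred) - cost(y, real)))
    return deltas


y = [1, -1, 1, -1]
y_hat = [-1, -1, 1, 1]
deltas = label_weights(f1_loss, y, y_hat)
# Weighted Hamming loss: sum the weights of the labels predicted incorrectly.
weighted_hamming = sum(d for d, a, b in zip(deltas, y, y_hat) if a != b)
```

In this example the weighted Hamming loss recovers the F1 loss of the original prediction, as the lemma states; reordering the labels generally changes the individual weights.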

Next, we propose CS-DPP, which extends DPP based on our proposed label-weighting scheme. Define \(\mathbf {C}\) as

$$\begin{aligned} \mathbf {C} = \text{ diag }(\sqrt{\delta ^{(1)}},\ldots ,\sqrt{\delta ^{(K)}}) \; {.} \end{aligned}$$

With \(\mathbf {C}\), which carries the cost information, we establish a theorem similar to Theorem 1 to upper-bound \(c(\mathbf {y}, \hat{\mathbf {y}})\).

Theorem 5

When making a prediction \(\hat{\mathbf {y}}\) from \(\mathbf {x}\) by \(\hat{\mathbf {y}} = \mathrm {round}\left( \mathbf {P}^\top \mathbf {r}(\mathbf {x}) + \mathbf {o}\right) \) with any left orthogonal matrix \(\mathbf {P}\), if \(c(\cdot , \cdot )\) satisfies the condition of Lemma 4, the prediction cost

$$\begin{aligned} c(\mathbf {y}, \hat{\mathbf {y}}) \le \Vert \mathbf {r}(\mathbf {x}) - \mathbf {z}_{\mathbf {C}}\Vert ^2_2 + \Vert (\mathbf {I} - \mathbf {P}^\top \mathbf {P})(\mathbf {y}_{\mathbf {C}}')\Vert ^2_2 \end{aligned}$$

where \(\mathbf {z}_{\mathbf {C}} = \mathbf {P} (\mathbf {y}_{\mathbf {C}}')\) and \(\mathbf {y}_{\mathbf {C}}' = \mathbf {C} \mathbf {y} - \mathbf {o}\) with respect to any fixed reference point \(\mathbf {o}\).

Theorem 5 generalizes Theorem 1 to upper-bound the general cost \(c(\mathbf {y}, \hat{\mathbf {y}})\) instead of the original Hamming loss \(c_{\textsc {ham}}(\mathbf {y}, \hat{\mathbf {y}})\). With Theorem 5, extending DPP to CS-DPP is a straightforward task by reusing the online updating algorithms of DPP with \(\mathbf {y}_t\) replaced by \(\mathbf {C}_t \mathbf {y}_t\). The full details of CS-DPP using PBT are given in Algorithm 1, and we can easily write down similar steps for CS-DPP using PBC. Note that we simplify \(\mathbf {W}_t^{\textsc {PBT}}\) to \(\mathbf {W}_t\) in Algorithm 1 to make a cleaner presentation.
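One way to see why \(\mathbf {C}\) holds the square roots of the weights is the following small check, offered as an illustration rather than as part of the proof of Theorem 5. For labels in \(\{-1, +1\}\), each wrong label contributes \((\mathbf {y}[k] - \hat{\mathbf {y}}[k])^2 = 4\), so the squared distance between the scaled vectors \(\mathbf {C} \mathbf {y}\) and \(\mathbf {C} \hat{\mathbf {y}}\) equals four times the weighted Hamming loss. The weights and function names below are hypothetical.

```python
import math


def weighted_hamming(deltas, y, y_hat):
    """Sum of the weights delta^(k) over labels that are predicted incorrectly."""
    return sum(d for d, a, b in zip(deltas, y, y_hat) if a != b)


def scaled_sq_distance(deltas, y, y_hat):
    """One quarter of the squared distance between C y and C y_hat."""
    # C is diagonal, so C y is an element-wise scaling by sqrt(delta^(k)).
    cy = [math.sqrt(d) * a for d, a in zip(deltas, y)]
    cy_hat = [math.sqrt(d) * b for d, b in zip(deltas, y_hat)]
    return sum((a - b) ** 2 for a, b in zip(cy, cy_hat)) / 4.0


deltas = [0.3, 0.1, 0.3, 0.2]  # hypothetical label weights
y = [1, -1, 1, -1]
y_hat = [-1, -1, 1, 1]
lhs = weighted_hamming(deltas, y, y_hat)
rhs = scaled_sq_distance(deltas, y, y_hat)
```

Squared-error objectives on \(\mathbf {C} \mathbf {y}\) therefore weight each label by \(\delta ^{(k)}\), which is why the DPP updates can be reused unchanged on the scaled label vectors.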



To empirically evaluate the performance, and also to study the effectiveness and necessity of design components of CS-DPP, we conduct three sets of experiments: (1) necessity justification of online LSDR, (2) experiments on basis drifting, and (3) experiments on cost-sensitivity. Furthermore, recall that the label weighting scheme of CS-DPP depends on the label order. We therefore conduct an additional set of experiments to study how different label orders affect the performance of CS-DPP. To assist the readers in understanding the experiments, we list the full names and acronyms of the algorithms to be compared along with their key differences in Table 3. The details of the algorithms will be illustrated as needed.

Table 3 Algorithms being compared in the experiments

Experiments setup

We conduct our experiments on eleven real-world datasets (see footnote 3) downloaded from MULAN (Tsoumakas et al. 2011). Statistics of the datasets can be found in Table 4. In particular, datasets eurlex-eurovec and delicious are used only in the experiment to justify the necessity of online LSDR, and only 7500 sub-sampled instances are used on these two datasets to reduce the computational burden of the competitors in the experiment. In addition, only 50000 sub-sampled instances are used for nuswide because a competitor in the cost-sensitivity experiment is rather computationally inefficient.

Table 4 Statistics of datasets

Data streams are generated by permuting datasets into different random orders. We perform sub-sampling on eurlex-eurovec, delicious and nuswide after computing the permutation so that each stream contains a different set of original instances for the three datasets.

All LSDR algorithms, except for competitors run on delicious and eurlex-eurovec, are coupled with online ridge regression, and three different code space dimensions, \(M = 10\%\), \(25\%\), and \(50\%\) of K, are considered. For DPP we fix \(\lambda = 1\) and follow Arora et al. (2013) to use the time-decreasing learning rate \(\eta = \frac{2}{\sqrt{t}}\frac{M}{K}\); parameters of other algorithms will be elaborated along with their details in the corresponding section. For the two larger datasets delicious and eurlex-eurovec, we implement both DPP and O-BR using gradient descent instead of online ridge regression for calculating \(\mathbf {W}_t\), where O-BR is the competitor that will be elaborated in Sect. 5.2. In particular, for PBC of DPP we replace the update of the online ridge regressor (6) with online gradient descent, while for PBT we replace (13), the update after basis transform, with a gradient descent update as well. Note that even with online ridge regression replaced with gradient descent, the ability of DPP with PBT or PBC to handle the basis drifting problem remains unchanged. We use the time-decreasing step size \(\frac{1}{\sqrt{t}}\) for gradient descent on delicious, and \(\frac{0.001}{\sqrt{t}}\) on eurlex-eurovec.

We consider four different cost functions, Hamming loss, Normalized rank loss, F1 loss and Accuracy loss.

$$\begin{aligned}&c_{\textsc {ham}}(\mathbf {y}, \hat{\mathbf {y}}) = \frac{1}{K} \left( \sum \limits _{k=1}^K \llbracket \mathbf {y}[k]\ne \hat{\mathbf {y}}[k]\rrbracket \right) \\&c_{\textsc {nr}}(\mathbf {y}, \hat{\mathbf {y}}) = \mathop {\text{ average }}\limits _{\mathbf {y}[i] > \mathbf {y}[j]}\Bigl (\llbracket \hat{\mathbf {y}}[i]<\hat{\mathbf {y}}[j]\rrbracket + \frac{1}{2} \llbracket \hat{\mathbf {y}}[i]=\hat{\mathbf {y}}[j]\rrbracket \Bigr ) \\&c_{\textsc {f1}}(\mathbf {y}, \hat{\mathbf {y}}) = 1 - 2 \left( \sum \limits _{k=1}^K \llbracket \mathbf {y}[k]=+1 \text{ and } \hat{\mathbf {y}}[k] =+1\rrbracket \right) /\left( \sum \limits _{k=1}^K (\llbracket \mathbf {y}[k]=+1\rrbracket + \llbracket \hat{\mathbf {y}}[k]=+1\rrbracket )\right) \\&c_{\textsc {acc}}(\mathbf {y}, \hat{\mathbf {y}})=1-\left( \sum \limits _{k=1}^K \llbracket \mathbf {y}[k]=+1 \text{ and } \hat{\mathbf {y}}[k] =+1\rrbracket \right) / \left( \sum \limits _{k=1}^K \llbracket \mathbf {y}[k]=+1 \text{ or } \hat{\mathbf {y}}[k] =+1\rrbracket \right) \end{aligned}$$
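The four cost functions can be written down directly from their definitions. The sketch below assumes labels in \(\{-1, +1\}\) and uses hypothetical function names; returning 0 when a denominator is empty (no positive label in either vector, or no pair with \(\mathbf {y}[i] > \mathbf {y}[j]\)) is an assumed convention, since the formulas leave these cases undefined.

```python
def hamming_loss(y, y_hat):
    return sum(1 for a, b in zip(y, y_hat) if a != b) / len(y)


def normalized_rank_loss(y, y_hat):
    # Average over ordered pairs (i, j) where label i is relevant and j is not.
    pairs = [(i, j) for i in range(len(y)) for j in range(len(y)) if y[i] > y[j]]
    if not pairs:
        return 0.0  # assumed convention for an empty pair set
    total = 0.0
    for i, j in pairs:
        if y_hat[i] < y_hat[j]:
            total += 1.0   # pair ranked in the wrong order
        elif y_hat[i] == y_hat[j]:
            total += 0.5   # tie counts as half an error
    return total / len(pairs)


def f1_loss(y, y_hat):
    tp = sum(1 for a, b in zip(y, y_hat) if a == 1 and b == 1)
    denom = sum(1 for a in y if a == 1) + sum(1 for b in y_hat if b == 1)
    return 0.0 if denom == 0 else 1.0 - 2.0 * tp / denom


def accuracy_loss(y, y_hat):
    inter = sum(1 for a, b in zip(y, y_hat) if a == 1 and b == 1)
    union = sum(1 for a, b in zip(y, y_hat) if a == 1 or b == 1)
    return 0.0 if union == 0 else 1.0 - inter / union


y = [1, -1, 1, -1]
y_hat = [-1, -1, 1, 1]
costs = {
    "ham": hamming_loss(y, y_hat),
    "nr": normalized_rank_loss(y, y_hat),
    "f1": f1_loss(y, y_hat),
    "acc": accuracy_loss(y, y_hat),
}
```

For this toy pair, two of four labels are wrong, one of two relevant labels is recovered, and the predicted and true positive sets share one of three labels in their union.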

The performances of different algorithms are compared using the average cumulative cost \(\frac{1}{t}\sum _{i=1}^t c(\mathbf {y}_i, \hat{\mathbf {y}}_i)\) at each iteration t. We remark that a lower average cumulative cost implies better performance. We report the average results of each experiment after 15 repetitions.
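The evaluation measure is a running mean of the per-example costs. A minimal sketch, with a hypothetical function name and made-up cost values:

```python
def average_cumulative_costs(costs):
    """Return the average cumulative cost after each iteration t = 1, ..., T."""
    averages, running = [], 0.0
    for t, c in enumerate(costs, start=1):
        running += c
        averages.append(running / t)
    return averages


# Made-up per-example costs for four iterations.
curve = average_cumulative_costs([1.0, 0.0, 0.5, 0.5])
```

Plotting such a curve against t for each algorithm gives the kind of comparison reported in the figures, where a lower curve indicates better performance.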

Necessity of online LSDR

In this experiment, we aim to justify the necessity to address LSDR for OMLC problems. We demonstrate that the ability of LSDR to preserve the key joint correlations between labels can be helpful when facing (1) data with noisy labels or (2) data with a large possible set of labels, which are often encountered in real-world OMLC problems. We compare DPP with online Binary Relevance (O-BR), which is a naïve extension from binary relevance (Tsoumakas et al. 2010) with online ridge regressor. The only difference between DPP and O-BR is whether the algorithm incorporates LSDR.

We first compare DPP and O-BR on data with noisy labels. We generate a noisy data stream by randomly flipping each positive label \(\mathbf {y}[i] = 1\) to negative with probability \(p \in \{0.3, 0.5, 0.7\}\), which simulates the real-world scenario in which human annotators fail to tag existing labels. We plot the results of O-BR and DPP with \(M = 10\%\), \(25\%\) and \(50\%\) of K on datasets emotions and enron with respect to Hamming loss and F1 loss in Fig. 1, which contains error bars that represent the standard error of the average results. The standard errors are naturally larger when M is smaller or when t (number of iterations) is small, but in general for \(M \ge 25\% \cdot K\) and for \(t \ge 400\) the standard errors are small enough to justify the difference. The complete results are listed in Appendix B.1.
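The noise model described above can be sketched as follows. This is an illustrative helper, not the authors' script: the function name and the seed are hypothetical, and labels are assumed to lie in \(\{-1, +1\}\).

```python
import random


def flip_positive_labels(y, p, rng):
    """Flip each positive label to negative with probability p; negatives stay."""
    return [-1 if (label == 1 and rng.random() < p) else label for label in y]


rng = random.Random(0)  # fixed seed (arbitrary) for reproducibility
y = [1, -1, 1, 1, -1]
noisy = flip_positive_labels(y, 0.5, rng)
```

Only missing tags are simulated: a negative label is never turned into a positive one, so the noisy vector always has at most as many positive labels as the original.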

Fig. 1 DPP versus O-BR on noisy labels

Table 5 DPP versus O-BR on large datasets

The results from the first two rows of Fig. 1 show that DPP with \(M = 10\%\) of K performs competitively with, and even better than, O-BR as p increases on dataset emotions. The results from the last two rows of Fig. 1 show that DPP always performs better on enron. We can also observe from Fig. 1 that DPP with smaller M tends to perform better as p increases. The above results clearly demonstrate that DPP better resists the effect of noisy labels with its incorporation of LSDR as the noise level (p) increases. The observation that DPP with smaller M tends to perform better demonstrates that DPP is more robust to noise by preserving the key joint correlations between labels with LSDR.

Next, we demonstrate that LSDR is also helpful for handling data with a large label set. We compare O-BR with DPP that is coupled with either PBC or PBT on datasets delicious and eurlex-eurovec (see footnote 4). DPP uses \(M = 10\) for delicious and \(M = 25\) for eurlex-eurovec. We summarize the results and average run-time in Table 5. Table 5 indicates that DPP coupled with either PBT or PBC performs competitively with O-BR, while DPP with PBT enjoys significantly cheaper computational cost. The results demonstrate that DPP enjoys more effective and efficient learning for data with a large label set than O-BR, and also justify the advantage of PBT over PBC in terms of efficiency when K and d are large while M is relatively small, as previously highlighted in Sect. 3.

Fig. 2 PBC versus PBT versus none, \(M = 10\%\) of K

Experiments on basis drifting

To empirically justify the necessity of handling basis drifting, we compare variants of DPP that (a) incorporate PBC by (6), (b) incorporate PBT by (13), and (c) neglect basis drifting as in (5). We plot the results for Hamming loss with \(M=10\%\) of K in Fig. 2 on six datasets, and report the complete results in Appendix B.2. The results on all datasets in Fig. 2 show that DPP with either PBC or PBT significantly improves the performance over its variant that neglects the basis drifting, which clearly demonstrates the necessity to handle the drifting of the projection basis.

Further comparison of PBC and PBT based on Fig. 2 reveals that PBC in general performs slightly better than PBT, reflecting its advantage of exact projection basis correction. Nevertheless, as discussed in Sect. 5.2, PBT enjoys a nice computational speedup when K and d are large and M is relatively small, making PBT more suitable to handle data with a large label set.

Experiments on cost-sensitivity

To empirically justify the necessity of cost-sensitivity, we compare CS-DPP using PBT with DPP using PBT and other online LSDR algorithms. To the best of our knowledge, no online LSDR algorithm has yet been proposed in the literature. We therefore design two simple online LSDR algorithms, online compressed sensing (O-CS) and online random projection (O-RAND), to compare with CS-DPP. O-CS is a straightforward extension of CS (Hsu et al. 2009) with an online ridge regressor, and we follow Hsu et al. (2009) to determine the parameter of O-CS. O-RAND encodes using a random matrix \(\mathbf {P}_{R}\) and simply decodes with the corresponding pseudo-inverse \(\mathbf {P}_{R}^{\dagger }\).

We plot the results with respect to all evaluation criteria except for the Hamming loss with \(M=10\%\) of K in Fig. 3 on three datasets, and report the complete results in Appendix B.3. Note that the results for CS-DPP here are obtained by using the original label order from the dataset.

Fig. 3 CS-DPP versus others, \(M = 10\%\) of K

CS-DPP versus DPP

The results of Fig. 3 clearly indicate that CS-DPP performs significantly better than DPP on all evaluation criteria other than the Hamming loss, while CS-DPP reduces to DPP when \(c_{\textsc {ham}}(\cdot ,\cdot )\) is used as the cost function. These observations demonstrate that CS-DPP, by optimizing the given cost function instead of Hamming loss, indeed achieves cost-sensitivity and is superior to its cost-insensitive counterpart, DPP.

CS-DPP versus other online LSDR algorithms

As shown in Fig. 3, while DPP generally performs better than O-CS and O-RAND because of the advantage to preserve key label correlations rather than random ones, it can nevertheless be inferior on some datasets with respect to specific cost functions due to its cost-insensitivity. For example, DPP loses to O-RAND on dataset Corel5k with respect to the Normalized rank loss, as shown in the third row of Fig. 3. CS-DPP conquers the weakness of DPP with its cost-sensitivity, and significantly outperforms O-CS and O-RAND on all three datasets with respect to all three evaluation criteria, as demonstrated in Fig. 3. The superiority of CS-DPP justifies the necessity to take cost-sensitivity into account.

Experiment on effect of label order for CS-DPP

The goal of this experiment is to study how different label orders affect the performance of CS-DPP as our proposed label weighting scheme with (14) is label-order-dependent. To evaluate the impact of label orders, we run CS-DPP with 50 randomly generated label orders and \(M = 10\%\), \(25\%\) and \(50\%\) of K on each dataset. The permutation of each dataset is fixed to the original one given in Tsoumakas et al. (2011), which allows the variance of the performance to better indicate the effect of different orders.

Table 6 Results of CS-DPP on CAL500 with 50 random label orders
Table 7 Results of CS-DPP on yeast with 50 random label orders
Table 8 Results of CS-DPP on enron with 50 random label orders

We summarize the results of all four different cost functions with mean and standard deviation on datasets CAL500, yeast and enron in Tables 6, 7 and 8 respectively, and report the complete results in Appendix B.4. Note that the results of Hamming loss are unaffected by the order of labels, and the reported deviation is due to the randomness from \(\mathbf {P}_t\). From the results of Tables 6, 7 and 8, we see that the standard deviation is generally on a relatively small scale of \(10^{-3}\), indicating that the performance of CS-DPP is not that sensitive to the order of labels. Closer inspection of Table 7 reveals that the standard deviation of \(c_{\textsc {acc}}\) on yeast with \(M = 10\%\) of K (which is less than 2 in this case) is somewhat larger, but for sufficiently large M the label order does not seem to cause much variation.


We proposed a novel cost-sensitive online LSDR algorithm called cost-sensitive dynamic principal projection (CS-DPP). We established the foundation of CS-DPP with an online LSDR framework derived from PLST, and derived CS-DPP along with its theoretical guarantees on top of MSG. We successfully conquered the challenge of basis drifting using our carefully designed PBC and PBT. CS-DPP further achieves cost-sensitivity with theoretical guarantees based on our proposed label-weighting scheme. The empirical results demonstrate that CS-DPP significantly outperforms other OMLC algorithms on all evaluation criteria, which validates the robustness and superiority of CS-DPP. The necessity for CS-DPP to address LSDR, basis drifting and cost-sensitivity was also empirically justified.

For possible future work, an interesting direction is to design an online LSDR algorithm capable of capturing the key joint information between features and labels. As discussed, the concept of capturing such joint information has been investigated for batch MLC (Chen and Lin 2012; Lin et al. 2014; Yu et al. 2014), but it remains challenging for online MLC. Another direction is to apply OMLC algorithms as a fast approximate solver for large-scale batch data, and see how they compete with traditional batch algorithms. A third interesting direction, as mentioned in Sect. 2, is to design online learning algorithms that achieve cost-sensitivity for the more sophisticated micro- and macro-based criteria.


  1. The naming of the \(\mathrm {round}(\cdot )\) operator follows directly from the original paper of PLST (Tai and Lin 2012), which represents \(\mathbf {y} \in \{0, 1\}^K\) instead of \(\{-1, +1\}^K\). Our use of \(\mathrm {sign}\) is thus equivalent to the rounding steps used in the original PLST.

  2. The technicality of requiring \(\Vert \mathbf {H}^*\Vert _F\) to be bounded is because we defined regret (up to the T-th iteration) with respect to the optimal offline solution upon receiving T examples, and hence \(\mathbf {H}^*\) depends on T. Standard regret proof in online learning alternatively defines regret with respect to any fixed \(\mathbf {H}\). Our proof could also go through with the alternative definition, which changes \(\Vert \mathbf {H}^*\Vert _F\) to a constant \(\Vert \mathbf {H}\Vert _F\) (that is trivially bounded).

  3. \(\mathtt {CAL500}\), \(\mathtt {emotions}\), \(\mathtt {scene}\), \(\mathtt {yeast}\), \(\mathtt {enron}\), \(\mathtt {Corel5k}\), \(\mathtt {mediamill}\), \(\mathtt {nuswide}\), \(\mathtt {medical}\), \(\mathtt {delicious}\) and \(\mathtt {eurlex}\)-\(\mathtt {eurovec}\).

  4. delicious: \(d =500\), \(K =983\), eurlex-eurovec: \(d =5000\), \(K =3993\).


  1. Arora, R., Cotter, A., & Srebro, N. (2013). Stochastic optimization of PCA with capped MSG. In NIPS 2013 (pp. 1815–1823).

  2. Balasubramanian, K., & Lebanon, G. (2012). The landmark selection method for multiple output prediction. In ICML 2012.

  3. Bartlett, P. (2008). Online convex optimization: Ridge regression, adaptivity. https://people.eecs.berkeley.edu/~bartlett/courses/281b-sp08/24.pdf. Accessed 4 Nov 2017.

  4. Bello, J. P., Chew, E., & Turnbull, D. (2008). Multilabel classification of music into emotions. In ICMIR 2008 (pp. 325–330).

  5. Bhatia, K., Jain, H., Kar, P., Varma, M., & Jain, P. (2015). Sparse local embeddings for extreme multi-label classification. In NIPS 2015 (pp. 730–738).

  6. Bi, W., & Kwok, J. T. (2013). Efficient multi-label classification with many labels. In ICML 2013 (pp. 405–413).

  7. Chen, Y., & Lin, H. (2012). Feature-aware label space dimension reduction for multi-label classification. In NIPS 2012 (pp. 1538–1546).

  8. Chua, T., Tang, J., Hong, R., Li, H., Luo, Z., & Zheng, Y. (2009). NUS-WIDE: A real-world web image database from National University of Singapore. In CIVR 2009.

  9. Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., & Singer, Y. (2006). Online passive-aggressive algorithms. Journal of Machine Learning Research, 7, 551–585.

  10. Dembczynski, K., Cheng, W., & Hüllermeier, E. (2010). Bayes optimal multilabel classification via probabilistic classifier chains. In ICML 2010 (pp. 279–286).

  11. Dembczynski, K., Waegeman, W., Cheng, W., & Hüllermeier, E. (2011). An exact algorithm for F-measure maximization. In NIPS 2011 (pp. 1404–1412).

  12. Elisseeff, A., & Weston, J. (2001). A kernel method for multilabelled classification. In NIPS 2001.

  13. Hsu, D., Kakade, S., Langford, J., & Zhang, T. (2009). Multi-label prediction via compressed sensing. In NIPS 2009 (pp. 772–780).

  14. Kapoor, A., Viswanathan, R., & Jain, P. (2012). Multilabel classification using Bayesian compressed sensing. In NIPS 2012 (pp. 2654–2662).

  15. Li, C., & Lin, H. (2014). Condensed filter tree for cost-sensitive multi-label classification. In ICML 2014 (pp. 423–431).

  16. Li, C., Lin, H., & Lu, C. (2016). Rivalry of two families of algorithms for memory-restricted streaming PCA. In AISTATS 2016.

  17. Lin, Z., Ding, G., Hu, M., & Wang, J. (2014). Multi-label classification via feature-aware implicit label space encoding. In ICML 2014 (pp. 325–333).

  18. Liu, W., Tsang, I. W., & Müller, K. R. (2017). An easy-to-hard learning paradigm for multiple classes and multiple labels. Journal of Machine Learning Research, 18, 1–38.

  19. Lo, H., Wang, J., Wang, H., & Lin, S. (2011). Cost-sensitive multi-label learning for audio tag annotation and retrieval. IEEE Transactions on Multimedia, 13(3), 518–529.

  20. Mao, Q., Tsang, I. W. H., & Gao, S. (2013). Objective-guided image annotation. IEEE Transactions on Image Processing, 22, 1585–1597.

  21. Nie, J., Kotlowski, W., & Warmuth, M. K. (2016). Online PCA with optimal regrets. Journal of Machine Learning Research, 17, 194–200.

  22. Osojnik, A., Panov, P., & Džeroski, S. (2017). Multi-label classification via multi-target regression on data streams. Machine Learning, 106, 745–770.

  23. Read, J., Bifet, A., Holmes, G., & Pfahringer, B. (2011). Streaming multi-label classification. In Proceedings of the workshop on applications of pattern analysis (WAPA) 2011 (pp. 19–25).

  24. Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2011). Classifier chains for multi-label classification. Machine Learning, 85(3), 333–359.

  25. Sun, L., Ji, S., & Ye, J. (2011). Canonical correlation analysis for multilabel classification: A least-squares formulation, extensions, and analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 194–200.

  26. Tai, F., & Lin, H. (2012). Multilabel classification with principal label space transformation. Neural Computation, 24(9), 2508–2542.

  27. Tang, L., Rajan, S., & Narayanan, V. K. (2009). Large scale multi-label classification via metalabeler. In WWW 2009 (pp. 211–220).

  28. Tsoumakas, G., Katakis, I., & Vlahavas, I. P. (2010). Mining multi-label data. In O. Maimon & L. Rokach (Eds.), Data mining and knowledge discovery handbook (2nd ed., pp. 667–685). Boston, MA: Springer.

  29. Tsoumakas, G., & Vlahavas, I. P. (2007). Random k-labelsets: An ensemble method for multilabel classification. In ECML 2007 (pp. 406–417).

  30. Tsoumakas, G., Xioufis, E. S., Vilcek, J., & Vlahavas, I. P. (2011). MULAN: A java library for multi-label learning. Journal of Machine Learning Research, 12, 2411–2414.

  31. Wu, Y., & Lin, H. (2017). Progressive \(k\)-labelsets for cost-sensitive multi-label classification. Machine Learning, 106(5), 671–694.

  32. Xioufis, E. S., Spiliopoulou, M., Tsoumakas, G., & Vlahavas, I. P. (2011). Dealing with concept drift and class imbalance in multi-label stream classification. In IJCAI 2011 (pp. 1583–1588).

  33. Yu, H., Jain, P., Kar, P., & Dhillon, I. S. (2014). Large-scale multi-label learning with missing labels. In ICML 2014 (pp. 593–601).

  34. Zhang, X., Graepel, T., & Herbrich, R. (2010). Bayesian online learning for multi-label and multi-variate performance measures. In AISTATS 2010.

Author information



Corresponding author

Correspondence to Hsuan-Tien Lin.

Additional information

Editors: Jesse Davis, Elisa Fromont, Derek Greene, and Bjorn Bringmann.


Appendix A: Proof of lemmas and theorems

Appendix A.1: Proof of Lemma 2

Lemma 2

Suppose \((\mathbf {Q}_t, \sigma _t)\) is obtained after an update of Capped MSG such that \(\mathbf {U}_t = \mathbf {Q}_t \,\mathrm {diag}(\sigma _t)\, \mathbf {Q}_t^\top \). If \(\varGamma _t\) is a discrete probability distribution over the events \(\{\mathbf {Q}_t^{-i}\}_{i=1}^{M+1}\), where \(\mathbf {Q}_t^{-i}\) occurs with probability \(1-\sigma _t[i]\), then for any \(\mathbf {y}\) we have

$$\begin{aligned} \mathbb {E}_{\mathbf {P}_t \sim \varGamma _t} [\mathbf {y}^\top (\mathbf {I} - \mathbf {P}_t^\top \mathbf {P}_t) \mathbf {y} ] = \mathbf {y}^\top (\mathbf {I} - \mathbf {U}_t) \mathbf {y}. \end{aligned}$$


We first formally show that \(\varGamma _t\) is a well-defined probability distribution. By the definition of the projection operator of Capped MSG, we have \(0 \le \sigma _t[i] \le 1\) for each i, so every probability \(1 - \sigma _t[i]\) is non-negative. Moreover, since \(tr(\mathbf {U}_t) = \sum _{i=1}^{M+1}\sigma _t[i] = M\), we have \(\sum _{i=1}^{M+1} (1 - \sigma _t[i]) = M+1 - \sum _{i=1}^{M+1}\sigma _t[i] = 1\). \(\varGamma _t\) is therefore a well-defined probability distribution.

Then it suffices to show that \(\mathbb {E}_{\mathbf {P}_t\sim \varGamma _t}[\mathbf {P}_t^\top \mathbf {P}_t] = \mathbf {U}_t\) as

$$\begin{aligned} \mathbb {E}_{\mathbf {P}_t \sim \varGamma _t}[\mathbf {y}^\top (\mathbf {I} - \mathbf {P}_t^\top \mathbf {P}_t) \mathbf {y}] = \Vert \mathbf {y}\Vert _2^2 - \mathbf {y}^\top \mathbb {E}_{\mathbf {P}_t\sim \varGamma _t}[\mathbf {P}_t^\top \mathbf {P}_t] \mathbf {y} \end{aligned}$$

To see that \(\mathbb {E}_{\mathbf {P}_t\sim \varGamma _t}[\mathbf {P}_t^\top \mathbf {P}_t] = \mathbf {U}_t\), first notice that by orthogonality of rows of \(\mathbf {Q}_t\) we have \(\mathbf {U}_t = \sum _{j=1}^{M+1} \sigma _t[j] \mathbf {e}_j\mathbf {e}_j^\top \), where \(\mathbf {e}_j\) is the j-th row of \(\mathbf {Q}_t\). We then have

$$\begin{aligned} \begin{aligned} \mathbb {E}_{\mathbf {P}_t \sim \varGamma _t}[\mathbf {P}_t^\top \mathbf {P}_t]&= \sum _{i=1}^{M+1} (1 - \sigma _t[i]) \sum _{j=1}^{M+1}\llbracket i\ne j\rrbracket \mathbf {e}_j \mathbf {e}_j^\top \\&= \sum _{j=1}^{M+1} (\mathbf {e}_j \mathbf {e}_j^\top \sum _{i=1}^{M+1}\llbracket i \ne j\rrbracket (1 - \sigma _t[i])) \\&= \sum _{j=1}^{M+1} (\sigma _t[j] \mathbf {e}_j \mathbf {e}_j^\top )&(a) \\&= \mathbf {U}_t \end{aligned} \end{aligned}$$

where (a) is by \(\sum _{i=1}^{M+1} \sigma _t[i] = M\). \(\square \)
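As a numerical sanity check of Lemma 2, the following sketch builds a hypothetical Capped MSG state and compares both sides of the identity. It assumes NumPy, stores the M+1 orthonormal directions as columns of `Q` (a layout chosen only for this illustration), and uses made-up values for `sigma`, `K` and `M`. The helper name `expected_projection` is introduced here and is not part of the original algorithm.

```python
import numpy as np

def expected_projection(Q, sigma):
    """E[P^T P] when direction i of Q is dropped with probability 1 - sigma[i]."""
    K, m = Q.shape
    expectation = np.zeros((K, K))
    for i in range(m):
        keep = [j for j in range(m) if j != i]
        P = Q[:, keep].T  # M x K matrix whose rows are orthonormal
        expectation += (1.0 - sigma[i]) * (P.T @ P)
    return expectation

rng = np.random.default_rng(0)
K, M = 6, 3

# Hypothetical state: M + 1 orthonormal directions, sigma in [0, 1] summing to M.
Q, _ = np.linalg.qr(rng.standard_normal((K, M + 1)))
sigma = np.array([1.0, 0.9, 0.7, 0.4])
U = Q @ np.diag(sigma) @ Q.T

y = rng.standard_normal(K)
lhs = y @ (np.eye(K) - expected_projection(Q, sigma)) @ y  # expected residual
rhs = y @ (np.eye(K) - U) @ y                              # closed form of Lemma 2
print(lhs, rhs)
```

Under these assumptions the two printed values should agree up to floating-point error, because the probabilities \(1-\sigma _t[i]\) sum to one exactly when \(\sum _i \sigma _t[i] = M\).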

Appendix A.2: Proof of Theorem 3

Theorem 3

With the definitions of \(\mathbf {H}_t\) in (7), \(\mathbf {H}^*\) in (9), \(\dfrac{\mathcal {R}}{T}\) in (8) and \(\varDelta _t\) in (10), assume that \(\Vert \mathbf {x}_t\Vert \le 1\), \(\Vert \mathbf {y}_t\Vert \le 1\) and \(\Vert \mathbf {H}_t^\top \mathbf {x}_t - \mathbf {y}_t\Vert _2^2 \le \epsilon \). Then the following hold.

  1.

    For any given T, the expected cumulative regret \(\mathcal {R}\) is upper-bounded by

    $$\begin{aligned} (1+\epsilon ) \sum _{t=1}^T \varDelta _t + \frac{M}{2} \Vert \mathbf {H}^*\Vert _F^2 + 2 \epsilon M d \log \left( 1 + \frac{T}{d}\right) . \end{aligned}$$
  2.

    If \(\lim _{T \rightarrow \infty } \varDelta _T = 0\) and \(\Vert \mathbf {H}^*\Vert _F \le h^*\) across all iterations, \(\lim _{T \rightarrow \infty } \dfrac{\mathcal {R}}{T} = 0\).


We start by separating the definition of \(\mathcal {R}\) into two terms: one for how \(\mathbf {P}_t\) in MSG converges to \(\mathbf {P}^*\), and the other for how \(\mathbf {W}_t^{\text {PBC}}\) for \(\mathbf {P}_t\) in ridge regression differs from \(\mathbf {W}_\#\) for \(\mathbf {P}^*\). For simplicity, we denote \(\mathbf {W}_t^{\text {PBC}}\) by \(\mathbf {W}_t\). Then,

$$\begin{aligned} \begin{aligned} \mathcal {R} =&\sum _{t=1}^T\mathbb {E}_{\mathbf {P}_t\sim \varGamma _t}[\ell ^{(t)}(\mathbf {W}_t, \mathbf {P}_t) - \ell ^{(t)}(\mathbf {W}_\#, \mathbf {P}^*)]\\ =&\sum _{t=1}^T \mathbb {E}_{\mathbf {P}_t\sim \varGamma _t}[\Vert \mathbf {W}_t^\top \mathbf {x}_t - \mathbf {P}_t \mathbf {y}_t\Vert _2^2 + \Vert (\mathbf {I} - \mathbf {P}_t^\top \mathbf {P}_t)\mathbf {y}_t\Vert _2^2]\\&- \sum _{t=1}^T \mathbb {E}_{\mathbf {P}_t\sim \varGamma _t}[\Vert \mathbf {W}_\#^\top \mathbf {x}_t - \mathbf {P}^* \mathbf {y}_t\Vert _2^2 + \Vert (\mathbf {I} - (\mathbf {P}^*)^\top \mathbf {P}^*)\mathbf {y}_t\Vert _2^2]\\ =&\underbrace{\sum _{t=1}^T \left( \mathbb {E}_{\mathbf {P}_t\sim \varGamma _t}[\Vert (\mathbf {I} - \mathbf {P}_t^\top \mathbf {P}_t)\mathbf {y}_t\Vert _2^2] - \Vert (\mathbf {I} - (\mathbf {P}^*)^\top \mathbf {P}^*)\mathbf {y}_t\Vert _2^2\right) }_{\mathcal {R}_{\text {MSG}}}\\&+ \underbrace{\sum _{t=1}^T \left( \mathbb {E}_{\mathbf {P}_t\sim \varGamma _t}[\Vert \mathbf {W}_t^\top \mathbf {x}_t - \mathbf {P}_t \mathbf {y}_t\Vert _2^2] - \Vert \mathbf {W}_\#^\top \mathbf {x}_t - \mathbf {P}^* \mathbf {y}_t\Vert _2^2\right) }_{\mathcal {R}_{\text {ridge}}} \end{aligned} \end{aligned}$$

We can bound \(\mathcal {R}_{\text {MSG}}\) first. Let \(\mathbf {U}_t = \mathbb {E}_{\mathbf {P}_t \sim \varGamma _t}[\mathbf {P}_t^\top \mathbf {P}_t]\) and \(\mathbf {U}^* = \left( \mathbf {P}^*\right) ^\top \mathbf {P}^*\), by linearity of expectation,

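$$\begin{aligned} \mathcal {R}_{\text {MSG}}&= \sum _{t=1}^T \mathbf {y}_t^\top (\mathbf {U}^* - \mathbf {U}_t) \mathbf {y}_t \\&\le \sum _{t=1}^T \Vert \mathbf {U}_t - \mathbf {U}^*\Vert _2 \\&\le \sum _{t=1}^T \varDelta _t \end{aligned} \tag{17}$$

where (17) follows from the assumption \(\Vert \mathbf {y}_t\Vert _2 \le 1\) and the definition of the matrix 2-norm. The first equality uses \(\Vert (\mathbf {I} - \mathbf {P}^\top \mathbf {P})\mathbf {y}\Vert _2^2 = \mathbf {y}^\top (\mathbf {I} - \mathbf {P}^\top \mathbf {P})\mathbf {y}\), which holds because \(\mathbf {I} - \mathbf {P}^\top \mathbf {P}\) is a projection whenever \(\mathbf {P}\) has orthonormal rows.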
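The step from the quadratic form to the matrix 2-norm can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses NumPy, replaces \(\mathbf {U}_t\) and \(\mathbf {U}^*\) with two random rank-M projections (stand-ins, not the output of Capped MSG), and draws vectors with \(\Vert \mathbf {y}\Vert _2 \le 1\). The helper `random_projection` is a name introduced here.

```python
import numpy as np

def random_projection(rng, K, M):
    """Random K x K projection onto an M-dimensional subspace (stand-in matrix)."""
    Q, _ = np.linalg.qr(rng.standard_normal((K, M)))
    return Q @ Q.T

rng = np.random.default_rng(3)
K, M = 8, 3
U_t = random_projection(rng, K, M)      # stand-in for E[P_t^T P_t]
U_star = random_projection(rng, K, M)   # stand-in for (P*)^T P*

# Spectral norm (largest singular value) of the difference.
spectral = np.linalg.norm(U_t - U_star, 2)

gaps = []
for _ in range(500):
    y = rng.standard_normal(K)
    y /= max(1.0, np.linalg.norm(y))    # enforce ||y||_2 <= 1
    gaps.append(spectral - abs(y @ (U_t - U_star) @ y))

print(min(gaps))  # non-negative up to rounding if the bound holds
```

The check is stated with an absolute value, so it covers the quadratic form with either ordering of the two matrices.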

Next, we bound \(\mathcal {R}_{\text {ridge}}\). With the definitions of \(\mathbf {H}_t\) in (7) and \(\mathbf {H}^*\) in (9), \(\mathcal {R}_{\text {ridge}}\) can be further decomposed into

$$\begin{aligned} \begin{aligned} \mathcal {R}_{\text {ridge}} =&+ \underbrace{\sum _{t=1}^T \left( \mathbb {E}_{\mathbf {P}_t\sim \varGamma _t}[\Vert \mathbf {P}_t(\mathbf {H}_t^\top \mathbf {x}_t - \mathbf {y}_t)\Vert _2^2] - \Vert \mathbf {P}^*(\mathbf {H}_t^\top \mathbf {x}_t - \mathbf {y}_t)\Vert _2^2\right) }_{\mathcal {R}_1} \\&+ \underbrace{\sum _{t=1}^T \left( \Vert \mathbf {P}^*(\mathbf {H}_t^\top \mathbf {x}_t - \mathbf {y}_t)\Vert _2^2 - \Vert \mathbf {P}^*(\left( \mathbf {H}^*\right) ^\top \mathbf {x}_t - \mathbf {y}_t)\Vert _2^2\right) }_{\mathcal {R}_2}. \end{aligned} \end{aligned}$$

Bounding \(\mathcal {R}_1\) is very similar to bounding \(\mathcal {R}_{\text {MSG}}\). In particular,

$$\begin{aligned} \mathcal {R}_{1}&= \sum _{t=1}^T (\mathbf {H}_t^\top \mathbf {x}_t - \mathbf {y}_t)^\top (\mathbf {U}_t - \mathbf {U}^*) (\mathbf {H}_t^\top \mathbf {x}_t - \mathbf {y}_t) \\&\le \sum _{t=1}^T \epsilon \Vert \mathbf {U}_t - \mathbf {U}^*\Vert _2 \\&\le \sum _{t=1}^T \epsilon \varDelta _t \end{aligned} \tag{18}$$

where (18) follows from the assumption of \(\Vert \mathbf {H}_t^\top \mathbf {x}_t - \mathbf {y}_t\Vert _2^2 \le \epsilon \).

The term \(\mathcal {R}_2\) can be viewed as an online ridge regression process from \(\mathbf {x}\) to \(\mathbf {P}^* \mathbf {y}\), because it can be easily proved that \(\mathbf {H}_t (\mathbf {P}^*)^\top \) is the ridge regression solution after receiving \((\mathbf {x}_1, \mathbf {P}^* \mathbf {y}_1)\), \((\mathbf {x}_2, \mathbf {P}^* \mathbf {y}_2)\), \(\ldots \), \((\mathbf {x}_{t-1}, \mathbf {P}^* \mathbf {y}_{t-1})\). Also, as discussed in Sect. 3.4, \(\mathbf {W}_\# = \mathbf {H}^* (\mathbf {P}^*)^\top \) is the optimal linear regression solution of \(\{(\mathbf {x}_t, \mathbf {P}^* \mathbf {y}_t)\}_{t=1}^T\). The assumption of \(\Vert \mathbf {H}_t^\top \mathbf {x}_t - \mathbf {y}_t\Vert _2^2\le \epsilon \) implies that

$$\begin{aligned} \Vert \mathbf {P}^* \mathbf {H}_t^\top \mathbf {x}_t - \mathbf {P}^* \mathbf {y}_t\Vert _2^2 = (\mathbf {H}_t^\top \mathbf {x}_t - \mathbf {y}_t)^\top \mathbf {U}^* (\mathbf {H}_t^\top \mathbf {x}_t - \mathbf {y}_t) \le \epsilon \end{aligned}$$

as well. Similarly, the assumption of \(\Vert \mathbf {y}_t\Vert _2 \le 1\) implies that \(\Vert \mathbf {P}^* \mathbf {y}_t\Vert _2 \le 1\). Then, a standard ridge regression analysis (see, e.g., Bartlett 2008), which proves that \(\mathbf {A}_t = \lambda \mathbf {I} + \sum _{i=1}^{t-1} \mathbf {x}_i \mathbf {x}_i^\top \) grows linearly with t, leads to

$$\begin{aligned} \mathcal {R}_{2}&=\sum _{t=1}^T \left( \Vert \mathbf {P}^* (\mathbf {H}_t^\top \mathbf {x}_t - \mathbf {y}_t)\Vert _2^2 - \Vert \mathbf {P}^* (\left( \mathbf {H}^*\right) ^\top \mathbf {x}_t - \mathbf {y}_t)\Vert _2^2\right) \\&\le \frac{1}{2}\Vert \mathbf {H}^* (\mathbf {P}^*)^\top \Vert _F^2 + 2 \epsilon Md \log \left( 1 + \frac{T}{d}\right) \\&\le \frac{M}{2}\Vert \mathbf {H}^*\Vert _F^2 + 2 \epsilon Md \log \left( 1 + \frac{T}{d}\right) \end{aligned} \tag{19}$$

where (19) holds because \(\mathbf {W}_\# = \mathbf {H}^* (\mathbf {P}^*)^\top \), \(\Vert \mathbf {H}^* (\mathbf {P}^*)^\top \Vert _F \le \Vert \mathbf {H}^*\Vert _F \Vert \mathbf {P}^*\Vert _F\) and \(\Vert \mathbf {P}^*\Vert _F^2 = tr(\mathbf {U}^*) = M\).

Summing \(\mathcal {R}_{\text {MSG}}\), \(\mathcal {R}_1\) and \(\mathcal {R}_2\) results in

$$\begin{aligned} \mathcal {R} \le (1+\epsilon ) \sum _{t=1}^T \varDelta _t + \frac{M}{2}\Vert \mathbf {H}^*\Vert _F^2 + 2 \epsilon M d \log \left( 1 + \frac{T}{d}\right) , \end{aligned}$$

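which proves the first part of the theorem. The second part follows after dividing the bound by T. Since \(\varDelta _T \rightarrow 0\), the running average \(\frac{1}{T}\sum _{t=1}^T \varDelta _t\) also converges to 0, because the convergence of a sequence implies the convergence of its mean. The remaining two terms are \(O(1/T)\) and \(O(\log T / T)\) respectively when \(\Vert \mathbf {H}^*\Vert _F \le h^*\), so they vanish as well. \(\square \)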
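To make the second part of Theorem 3 concrete, the following sketch evaluates the upper bound from the first part and divides it by T. Everything numeric here is an assumption for illustration: the decay \(\varDelta _t = 1/\sqrt{t}\) is hypothetical (the theorem only requires \(\varDelta _t \rightarrow 0\)), and the values of `eps`, `M`, `d` and `h_star` are made up. The middle term uses `h_star` as the bound on \(\Vert \mathbf {H}^*\Vert _F\).

```python
import math

def regret_upper_bound(deltas, eps, M, d, h_star):
    """Right-hand side of the first part of Theorem 3, with ||H*||_F <= h_star."""
    T = len(deltas)
    msg_term = (1.0 + eps) * sum(deltas)
    norm_term = 0.5 * M * h_star ** 2
    log_term = 2.0 * eps * M * d * math.log(1.0 + T / d)
    return msg_term + norm_term + log_term

def average_bound(T, eps=0.5, M=5, d=10, h_star=1.0):
    # Hypothetical decay Delta_t = 1 / sqrt(t); any sequence tending to 0 would do.
    deltas = [1.0 / math.sqrt(t) for t in range(1, T + 1)]
    return regret_upper_bound(deltas, eps, M, d, h_star) / T

averages = [average_bound(T) for T in (10**2, 10**3, 10**4, 10**5)]
for T, avg in zip((10**2, 10**3, 10**4, 10**5), averages):
    print(T, avg)
```

With these assumed values the averaged bound shrinks as T grows, which is the behaviour the second part of the theorem describes.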

Appendix A.3: Proof of Lemma 4

Lemma 4

If \(c(\mathbf {y}, \hat{\mathbf {y}}^{(k)}_{\text {pred}}) - c(\mathbf {y}, \hat{\mathbf {y}}^{(k)}_{\text {real}}) \ge 0\) holds for any k, \(\mathbf {y}\) and \(\hat{\mathbf {y}}\), then for any given \(\mathbf {y}\) and \(\hat{\mathbf {y}}\) we have

$$\begin{aligned} c(\mathbf {y}, \hat{\mathbf {y}}) = \sum _{k=1}^K \delta ^{(k)} \llbracket \mathbf {y}[k] \ne \hat{\mathbf {y}}[k]\rrbracket \end{aligned} \tag{21}$$


Recall the definitions of \(\hat{\mathbf {y}}^{(k)}_{\mathrm{real}}\) and \(\hat{\mathbf {y}}^{(k)}_{\mathrm{pred}}\):

$$\begin{aligned} \hat{\mathbf {y}}_{\text {real}}^{(k)}[i] = \left\{ \begin{array}{ll} \mathbf {y}[i] &{} \text {if} \, i< k \\ {\mathbf {y}}[i] &{} \text {if} \, i = k \\ \hat{\mathbf {y}}[i] &{} \text {if} \, i> k \\ \end{array}\right. \quad \text {and} \quad \hat{\mathbf {y}}_{\text {pred}}^{(k)}[i] = \left\{ \begin{array}{ll} \mathbf {y}[i] &{} \text {if} \, i < k \\ -{\mathbf {y}}[i] &{} \text {if} \, i = k \\ \hat{\mathbf {y}}[i] &{} \text {if} \, i > k \\ \end{array}\right. \end{aligned}$$

and the definition of \(\delta ^{(k)}\) to be

$$\begin{aligned} \delta ^{(k)} = |c(\mathbf {y}, \hat{\mathbf {y}}_{\text {pred}}^{(k)}) - c(\mathbf {y}, \hat{\mathbf {y}}_{\text {real}}^{(k)})| \end{aligned}$$

Now let \(k_1< k_2< \cdots < k_L\) be the sequence of indices such that \(\mathbf {y}[k_i] \ne \hat{\mathbf {y}}[k_i]\) for every \(k_i\). If no such index exists, then \(\hat{\mathbf {y}} = \mathbf {y}\) and (21) holds trivially by \(c(\mathbf {y}, \mathbf {y}) = 0\). Otherwise, by the condition on c we have

$$\begin{aligned} \begin{aligned}&\sum _{k=1}^K \delta ^{(k)} \llbracket \mathbf {y}[k] \ne \hat{\mathbf {y}}[k]\rrbracket \\ =\;&\sum _{k=1}^K \left( c(\mathbf {y}, \hat{\mathbf {y}}_{\text {pred}}^{(k)}) - c(\mathbf {y}, \hat{\mathbf {y}}_{\text {real}}^{(k)})\right) \llbracket \mathbf {y}[k] \ne \hat{\mathbf {y}}[k]\rrbracket&(a)\\ =\;&\sum _{i=1}^L \left( c(\mathbf {y}, \hat{\mathbf {y}}_{\text {pred}}^{(k_i)}) - c(\mathbf {y}, \hat{\mathbf {y}}_{\text {real}}^{(k_i)})\right) \\ =\;&c(\mathbf {y}, \hat{\mathbf {y}}_{\text {pred}}^{(k_1)}) - c(\mathbf {y}, \hat{\mathbf {y}}_{\text {real}}^{(k_L)})&(b) \\ =\;&c(\mathbf {y}, \hat{\mathbf {y}})&(c) \\ \end{aligned} \end{aligned}$$

where (a) uses the condition of \(c(\cdot ,\cdot )\) to remove the absolute value function; (b) is from two possibilities of L: if \(L = 1\) then the equation trivially holds; if \(L > 1\) we use the observation that \(\hat{\mathbf {y}}_{\text {real}}^{(k_i)} = \hat{\mathbf {y}}_{\text {pred}}^{(k_{i+1})}\) where the observation is by realizing \(\mathbf {y}[j] = \hat{\mathbf {y}}[j]\) for any \(k_i< j < k_{i+1}\); (c) follows from the observation that \(\hat{\mathbf {y}}^{(k_1)}_{\text {pred}} = \hat{\mathbf {y}}\) and \(\hat{\mathbf {y}}^{(k_L)}_{\text {real}} = \mathbf {y}\) and \(c(\mathbf {y}, \mathbf {y}) = 0\). \(\square \)
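The telescoping argument can be replayed on a small example. The sketch below is a hedged illustration, not the authors' code: it uses a weighted Hamming cost with hypothetical weights `WEIGHTS` (a cost that satisfies the condition of Lemma 4), represents label vectors as tuples over \(\{-1, +1\}\), and indexes positions from 0 rather than 1. All function names are introduced here for illustration.

```python
from itertools import product

WEIGHTS = (0.5, 2.0, 1.0, 0.25)  # hypothetical per-label costs

def weighted_hamming(y, y_hat):
    """Example cost c(y, y_hat) that satisfies the condition of Lemma 4."""
    return sum(w for w, a, b in zip(WEIGHTS, y, y_hat) if a != b)

def y_real(y, y_hat, k):
    # Ground truth up to and including position k, prediction afterwards.
    return y[:k + 1] + y_hat[k + 1:]

def y_pred(y, y_hat, k):
    # Same as y_real, except that position k is forced to be wrong.
    return y[:k] + (-y[k],) + y_hat[k + 1:]

def delta(cost, y, y_hat, k):
    return abs(cost(y, y_pred(y, y_hat, k)) - cost(y, y_real(y, y_hat, k)))

def decomposed_cost(cost, y, y_hat):
    # Right-hand side of the lemma: sum delta^(k) over the mispredicted labels.
    return sum(delta(cost, y, y_hat, k)
               for k in range(len(y)) if y[k] != y_hat[k])

def max_decomposition_error(cost, K):
    # Exhaustively compare both sides over all pairs of label vectors.
    labels = list(product((-1, +1), repeat=K))
    return max(abs(cost(y, y_hat) - decomposed_cost(cost, y, y_hat))
               for y in labels for y_hat in labels)

example = decomposed_cost(weighted_hamming, (+1, -1, -1, +1), (-1, -1, +1, +1))
print(example, max_decomposition_error(weighted_hamming, len(WEIGHTS)))
```

In the example pair, the labels at positions 0 and 2 are mispredicted, so the decomposed cost is the sum of the two corresponding weights.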

Appendix A.4: Proof of Theorem 5

Theorem 5

When making a prediction \(\hat{\mathbf {y}}\) from \(\mathbf {x}\) by \(\hat{\mathbf {y}} = \mathrm {round}\left( \mathbf {P}^\top \mathbf {r}(\mathbf {x}) + \mathbf {o}\right) \) with any left orthogonal matrix \(\mathbf {P}\), if \(c(\cdot , \cdot )\) satisfies the condition of Lemma 4, the prediction cost satisfies

$$\begin{aligned} c(\mathbf {y}, \hat{\mathbf {y}}) \le \Vert \mathbf {r}(\mathbf {x}) - \mathbf {z}_{\mathbf {C}}\Vert ^2_2 + \Vert (\mathbf {I} - \mathbf {P}^\top \mathbf {P})(\mathbf {y}_{\mathbf {C}}')\Vert ^2_2 \end{aligned}$$

where \(\mathbf {z}_{\mathbf {C}} = \mathbf {P} (\mathbf {y}_{\mathbf {C}}')\) and \(\mathbf {y}_{\mathbf {C}}' = \mathbf {C} \mathbf {y} - \mathbf {o}\) with respect to any fixed reference point \(\mathbf {o}\).

Recall that \(\mathbf {C}\) is defined in the main text as

$$\begin{aligned} \mathbf {C} = \mathrm {diag}(\sqrt{\delta ^{(1)}},\ldots ,\sqrt{\delta ^{(K)}}) \end{aligned}$$

We first state and prove the following lemma before proceeding to the complete proof.

Lemma 6

Given the ground truth \(\mathbf {y}\), suppose that the binary-valued prediction \(\hat{\mathbf {y}} \in \{+1,-1\}^K\) is made by \(\mathrm {round}(\tilde{\mathbf {y}})\), where \(\tilde{\mathbf {y}} \in \mathbb {R}^K\) is the real-valued prediction. Then for any \(\mathbf {y}\), \(\hat{\mathbf {y}}\) and \(\tilde{\mathbf {y}}\), if c satisfies the condition in Lemma 4, we have

$$\begin{aligned} c(\mathbf {y}, \hat{\mathbf {y}}) \le \Vert \mathbf {C}\mathbf {y} - \tilde{\mathbf {y}}\Vert ^2 \end{aligned}$$


From Lemma 4 we have \(c(\mathbf {y}, \hat{\mathbf {y}}) = \sum _{k=1}^K \delta ^{(k)}\llbracket \mathbf {y}[k] \ne \hat{\mathbf {y}}[k]\rrbracket \). As \(\Vert \mathbf {C}\mathbf {y} - \tilde{\mathbf {y}}\Vert _2^2 = \sum _{k=1}^K (\sqrt{\delta ^{(k)}}\mathbf {y}[k] - \tilde{\mathbf {y}}[k])^2\), it suffices to show that for all k we have

$$\begin{aligned} \delta ^{(k)}\llbracket \mathbf {y}[k] \ne \hat{\mathbf {y}}[k]\rrbracket \le (\sqrt{\delta ^{(k)}}\mathbf {y}[k] - \tilde{\mathbf {y}}[k])^2 \end{aligned} \tag{24}$$

When \(\delta ^{(k)} = 0\), (24) holds trivially. When \(\delta ^{(k)} > 0\), we have

$$\begin{aligned} \begin{aligned}&\delta ^{(k)} \llbracket \mathbf {y}[k] \ne \hat{\mathbf {y}}[k]\rrbracket \\&\quad = \; \delta ^{(k)}(\llbracket \tilde{\mathbf {y}}[k] \ge 0\rrbracket \llbracket \mathbf {y}[k] = -1\rrbracket + \llbracket \tilde{\mathbf {y}}[k]< 0\rrbracket \llbracket \mathbf {y}[k] = +1\rrbracket ) \\&\quad = \; \delta ^{(k)}\left( \llbracket \frac{\tilde{\mathbf {y}}[k]}{\sqrt{\delta ^{(k)}}} \ge 0\rrbracket \llbracket \mathbf {y}[k] = -1\rrbracket + \llbracket \frac{\tilde{\mathbf {y}}[k]}{\sqrt{\delta ^{(k)}}} < 0\rrbracket \llbracket \mathbf {y}[k] = +1\rrbracket \right) \\&\quad \le \; \delta ^{(k)}\left( \left( \frac{\tilde{\mathbf {y}}[k]}{\sqrt{\delta ^{(k)}}} - \mathbf {y}[k]\right) ^2\llbracket \mathbf {y}[k] = -1\rrbracket + \left( \frac{\tilde{\mathbf {y}}[k]}{\sqrt{\delta ^{(k)}}} - \mathbf {y}[k]\right) ^2\llbracket \mathbf {y}[k] = +1\rrbracket \right) \\&\quad = \; \delta ^{(k)}\left( \frac{\tilde{\mathbf {y}}[k]}{\sqrt{\delta ^{(k)}}} - \mathbf {y}[k]\right) ^ 2 \\&\quad = \; \left( \sqrt{\delta ^{(k)}}\mathbf {y}[k] - \tilde{\mathbf {y}}[k]\right) ^ 2 \end{aligned} \end{aligned}$$

where the second equality uses the fact that \(\delta ^{(k)} > 0\). As \(\delta ^{(k)} \ge 0\) holds by its definition, (24) holds for every k. Summing (24) with respect to all k then completes the proof. \(\square \)
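The per-label inequality behind Lemma 6 can be probed numerically. The following sketch assumes NumPy, a weighted Hamming cost with hypothetical weights `delta` (including one zero weight to cover the trivial case), and the convention that rounding maps non-negative scores to +1. It samples random ground truths and real-valued predictions and records the smallest gap between the squared-error bound and the cost. The helper names are introduced for this illustration only.

```python
import numpy as np

def round_to_labels(y_tilde):
    """Map real-valued scores to {-1, +1}; non-negative scores become +1."""
    return np.where(y_tilde >= 0, 1.0, -1.0)

def weighted_hamming(y, y_hat, delta):
    """Cost of the form sum_k delta[k] * [y[k] != y_hat[k]]."""
    return float(np.sum(delta * (y != y_hat)))

rng = np.random.default_rng(1)
K = 5
delta = np.array([0.0, 0.5, 1.0, 2.0, 4.0])  # hypothetical label weights
C = np.diag(np.sqrt(delta))

worst_gap = np.inf
for _ in range(1000):
    y = rng.choice([-1.0, 1.0], size=K)          # ground truth
    y_tilde = rng.normal(scale=2.0, size=K)      # real-valued prediction
    cost = weighted_hamming(y, round_to_labels(y_tilde), delta)
    bound = float(np.sum((C @ y - y_tilde) ** 2))
    worst_gap = min(worst_gap, bound - cost)

print(worst_gap)  # should not be negative beyond rounding error
```

A non-negative `worst_gap` over the samples is consistent with the lemma, though a random search is of course not a proof.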

With Lemma 6 established, we now prove Theorem 5.

Proof of Theorem 5

Suppose that the given c satisfies the condition in Lemma 4, and let \(\tilde{\mathbf {y}} = \mathbf {P}^\top \mathbf {r}(\mathbf {x}) + \mathbf {o}\) and \(\hat{\mathbf {y}} = \mathrm {round}(\tilde{\mathbf {y}})\). Then for any \((\mathbf {x}, \mathbf {y})\) we have

$$\begin{aligned} \begin{aligned}&c(\mathbf {y}, \hat{\mathbf {y}}) \\&\quad \le \; \Vert \mathbf {C}\mathbf {y} - \tilde{\mathbf {y}}\Vert ^2_2&(a)\\&\quad = \; \Vert (\tilde{\mathbf {y}} - \mathbf {o} - \mathbf {P}^\top \mathbf {P} \mathbf {y}_{\mathbf {C}}') - (\mathbf {y}_{\mathbf {C}}' - \mathbf {P}^\top \mathbf {P} \mathbf {y}_{\mathbf {C}}') \Vert ^2_2 \\&\quad = \; \Vert \mathbf {P}^\top (\mathbf {r}(\mathbf {x}) - \mathbf {z}_{\mathbf {C}}) - (\mathbf {I}- \mathbf {P}^\top \mathbf {P})\mathbf {y}_{\mathbf {C}}' \Vert ^2_2 \\&\quad = \; \Vert \mathbf {P}^\top (\mathbf {r}(\mathbf {x}) - \mathbf {z}_{\mathbf {C}})\Vert ^2_2 + \Vert (\mathbf {I} - \mathbf {P}^\top \mathbf {P})\mathbf {y}_{\mathbf {C}}'\Vert ^2_2&(b) \\&\quad = \; \Vert \mathbf {r}(\mathbf {x}) - \mathbf {z}_{\mathbf {C}}\Vert ^2_2 + \Vert (\mathbf {I} - \mathbf {P}^\top \mathbf {P})\mathbf {y}_{\mathbf {C}}'\Vert ^2_2&(c) \\ \end{aligned} \end{aligned}$$

where we recall that \(\mathbf {y}_{\mathbf {C}}' = \mathbf {C} \mathbf {y} - \mathbf {o}\) and \(\mathbf {z}_{\mathbf {C}} = \mathbf {P}(\mathbf {y}_{\mathbf {C}}')\). (a) is from Lemma 6, while (b) and (c) follow from the orthonormal rows of \(\mathbf {P}\). \(\square \)

We note that the proof above closely follows the proof of Theorem 1 in Tai and Lin (2012). The key difference is Lemma 6, which handles the weighted Hamming loss.
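The decomposition used in the proof of Theorem 5 can be checked on random data. The sketch below is an illustration under stated assumptions: it uses NumPy, draws a random \(\mathbf {P}\) with orthonormal rows from a QR factorization, uses hypothetical label weights `delta` with a weighted Hamming cost, and replaces the regressor output \(\mathbf {r}(\mathbf {x})\) by a random vector `r`. None of these values come from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(2)
K, M = 6, 3

# P is M x K with orthonormal rows, so P @ P.T equals the M x M identity.
Q, _ = np.linalg.qr(rng.standard_normal((K, M)))
P = Q.T

delta = rng.uniform(0.1, 2.0, size=K)   # hypothetical label weights
C = np.diag(np.sqrt(delta))
o = rng.standard_normal(K)              # arbitrary fixed reference point
y = rng.choice([-1.0, 1.0], size=K)     # ground-truth label vector
r = rng.standard_normal(M)              # stand-in for the regressor output r(x)

y_c = C @ y - o                         # shifted, cost-weighted label vector
z_c = P @ y_c                           # its code vector
y_tilde = P.T @ r + o                   # real-valued prediction before rounding
y_hat = np.where(y_tilde >= 0, 1.0, -1.0)

cost = float(np.sum(delta * (y != y_hat)))              # weighted Hamming cost
direct = float(np.sum((C @ y - y_tilde) ** 2))          # bound from Lemma 6
regression_error = float(np.sum((r - z_c) ** 2))
reconstruction_error = float(np.sum(((np.eye(K) - P.T @ P) @ y_c) ** 2))

print(cost, direct, regression_error + reconstruction_error)
```

Under these assumptions the last two printed values should coincide up to rounding, which mirrors steps (b) and (c), and the cost should not exceed them.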

Appendix B: Complete results of experiments

Here we report the complete results of each experiment.

Appendix B.1: Necessity of online LSDR

We report the complete results of the comparison between O-BR and DPP with \(M = 10\%\), \(25\%\) and \(50\%\) of K in Tables 9, 10 and 11 with respect to all four evaluation criteria, where the best values (the lowest) are marked in bold.

Table 9 DPP versus O-BR on noisy data, \(p = 0.3\)
Table 10 DPP versus O-BR on noisy data, \(p = 0.5\)
Table 11 DPP versus O-BR on noisy data, \(p = 0.7\)

The results show that DPP outperforms O-BR as the value of p increases with respect to Hamming loss, F1 loss and Accuracy loss, demonstrating the robustness of DPP. On the other hand, the results for Normalized rank loss in Tables 9, 10 and 11 show that, while DPP cannot outperform O-BR on this specific criterion, DPP does start to perform competitively as the value of p increases. This observation again demonstrates that DPP suffers less from noisy labels than O-BR, owing to the incorporation of LSDR.

Appendix B.2: Experiments on basis drifting

The complete results of the comparison between DPP using (1) PBC, (2) PBT, and (3) nothing regarding Hamming loss can be found in Tables 12, 13 and 14, where the best values (the lowest) are marked in bold. To further understand the behavior of basis drifting and the effectiveness of PBC and PBT for CS-DPP, we also compare CS-DPP coupled with PBC/PBT/none on F1 loss, Accuracy loss and Normalized rank loss, and summarize the results in the same tables. From these results we can again draw the same conclusion as in Sect. 5.3. That is, CS-DPP with either PBT or PBC greatly outperforms CS-DPP that neglects the basis drifting, and CS-DPP with PBT performs competitively with CS-DPP with PBC.

Table 12 CS-DPP with PBC versus PBT versus none, \(M = 10\%\) of K
Table 13 CS-DPP with PBC versus PBT versus none, \(M = 25\%\) of K
Table 14 CS-DPP with PBC versus PBT versus none, \(M = 50\%\) of K

Appendix B.3: Experiments on cost-sensitivity

We report the complete results on all datasets with respect to all four cost functions in Tables 15, 16 and 17, where the best values (the lowest) are marked in bold. These complete results validate the conclusion in Sect. 5.4.

Table 15 CS-DPP versus others, \(M = 10\%\) of K
Table 16 CS-DPP versus others, \(M = 25\%\) of K
Table 17 CS-DPP versus others, \(M = 50\%\) of K
Table 18 Results of CS-DPP on 50 random label orders

Appendix B.4: Experiments on effect of label orders

The complete average results and the corresponding standard deviations of CS-DPP run on 50 random label orders are reported in Table 18. The standard deviations over the 50 random orders are generally on the scale of \(10^{-3}\), indicating that CS-DPP is relatively insensitive to changes in the label order. On the other hand, CS-DPP shows comparatively large deviations on several datasets for some cost functions, such as the Normalized rank loss on dataset emotions with \(M = 10\%\) of K. We attribute this to unstable interaction between the randomness of \(\mathbf {P}_t\) and different label orders, because the larger deviations are observed only when \(M = 10\%\) of K.

Cite this article

Chu, HM., Huang, KH. & Lin, HT. Dynamic principal projection for cost-sensitive online multi-label classification. Mach Learn 108, 1193–1230 (2019). https://doi.org/10.1007/s10994-018-5773-6


  • Multi-label classification
  • Cost-sensitive
  • Label space dimension reduction