On label dependence and loss minimization in multilabel classification
DOI: 10.1007/s10994-012-5285-8
 Cite this article as:
 Dembczyński, K., Waegeman, W., Cheng, W. et al. Mach Learn (2012) 88: 5. doi:10.1007/s10994-012-5285-8
Abstract
Most of the multilabel classification (MLC) methods proposed in recent years are intended to exploit, in one way or another, dependencies between the class labels. Compared with simple binary relevance learning as a baseline, any gain in performance is normally explained by the fact that the baseline ignores such dependencies. Without questioning the correctness of such studies, one has to admit that a blanket explanation of that kind hides many subtle details, and indeed, the underlying mechanisms and true reasons for the improvements reported in experimental studies are rarely laid bare. Rather than proposing yet another MLC algorithm, the aim of this paper is to elaborate more closely on the idea of exploiting label dependence, thereby contributing to a better understanding of MLC. Adopting a statistical perspective, we claim that two types of label dependence should be distinguished, namely conditional and marginal dependence. Subsequently, we present three scenarios in which the exploitation of one of these types of dependence may boost the predictive performance of a classifier. In this regard, a close connection with loss minimization is established, showing that the benefit of exploiting label dependence also depends on the type of loss to be minimized. Concrete theoretical results are presented for two representative loss functions, namely the Hamming loss and the subset 0/1 loss. In addition, we give an overview of state-of-the-art decomposition algorithms for MLC and we try to reveal the reasons for their effectiveness. Our conclusions are supported by carefully designed experiments on synthetic and benchmark data.
Keywords
Multilabel classification, Label dependence, Loss functions
1 Introduction
In contrast to conventional (single-label) classification, the setting of multilabel classification (MLC) allows an instance to belong to several classes simultaneously. At first sight, MLC problems can be solved in a quite straightforward way, namely through decomposition into several binary classification problems; one binary classifier is trained for each label and used to predict whether, for a given query instance, this label is present (relevant) or not. This approach is known as binary relevance (BR) learning.
However, BR has been criticized for ignoring important information hidden in the label space, namely information about the interdependencies between the labels. Since the presence or absence of the different class labels has to be predicted simultaneously, it is arguably important to exploit any such dependencies.

The notion of label dependence or “label correlation” is often used in a purely intuitive manner without giving a precise formal definition. Likewise, MLC methods are often ad hoc extensions of existing methods for multiclass classification.

Many studies report improvements on average, but without carefully investigating the conditions under which label dependencies are useful and when they are perhaps less important. Apart from properties of the data and the learner, for example, it is plausible that the type of performance measure is important in this regard.

The reasons for improvements are often not carefully distinguished. As the performance of a method depends on many factors, which are hard to isolate, it is not always clear that the improvements can be fully credited to the consideration of label dependence.

Moreover, a multitude of loss functions can be considered in MLC, and indeed, a large number of losses have already been proposed and are commonly applied as performance metrics in experimental studies. However, even though these loss functions are of a quite different nature, a concrete connection between the type of multilabel classifier used and the loss to be minimized is rarely established, implicitly giving the misleading impression that the same method can be optimal for different loss functions.
The aim of this paper is to elaborate on the issue of label dependence in more detail, thereby helping to gain a better understanding of the mechanisms behind MLC algorithms in general. Subsequent to a formal problem description in Sect. 2, we will propose a distinction between two different types of label dependence in MLC (Sect. 3). These two types will be referred to as conditional and marginal (unconditional) label dependence, respectively. While the former captures dependencies between labels conditional on a specific instance, the latter is a global type of dependence, independent of any concrete observation. In Sect. 4, we distinguish three different (though not necessarily disjoint) views on MLC. Roughly speaking, an MLC problem can either be seen as a set of interrelated binary classification problems or as a single multivariate prediction problem. Our discussion of this point will reveal a close interplay between label dependence and loss minimization. Theoretical results making this interplay more concrete are given in Sect. 5, where we analyze two specific but representative loss functions, namely the Hamming loss and the subset 0/1 loss. Furthermore, in Sect. 6, a selection of state-of-the-art MLC algorithms is revisited in light of exploiting label dependence and minimizing different losses. Using both synthetic and benchmark data, Sect. 7 presents several experimental results on carefully selected case studies, confirming the conclusions that were drawn earlier on the basis of theoretical considerations. We end with a final discussion about facts, pitfalls and open challenges on exploiting label dependencies in MLC problems. Let us remark that this paper combines material that we have recently published in three other papers (Dembczyński et al. 2010a; Dembczyński et al. 2010b; Dembczyński et al. 2010c).
However, this paper discusses in more detail the distinction between marginal and conditional dependence and introduces the three different views on MLC. The risk minimizers for multilabel loss functions were first discussed in Dembczyński et al. (2010a). The theoretical analysis of the two loss functions, the Hamming and the subset 0/1 loss, comes from Dembczyński et al. (2010c); the formal proofs of the theorems, however, have not yet been published. The paper also extends the discussion given in Dembczyński et al. (2010b) on different state-of-the-art MLC algorithms and contains new experimental results.
2 Multilabel classification
Let \(\mathcal{X}\) denote an instance space, and let \(\mathcal{L}= \{ \lambda_{1}, \lambda_{2}, \ldots, \lambda_{m}\}\) be a finite set of class labels. We assume that an instance \(\mathbf{x} \in\mathcal{X}\) is (non-deterministically) associated with a subset of labels \(L \in 2^{\mathcal{L}}\); this subset is often called the set of relevant labels, while the complement \(\mathcal{L} \setminus L\) is considered irrelevant for \(\mathbf{x}\). We identify a set \(L\) of relevant labels with a binary vector \(\mathbf{y}=(y_{1},y_{2},\ldots,y_{m})\), in which \(y_{i}=1 \Leftrightarrow \lambda_{i} \in L\). By \(\mathcal{Y} = \{0,1\}^{m}\) we denote the set of possible labellings.
Usually, the image of a classifier h is restricted to \(\mathcal{Y}\), which means that it assigns a predicted label subset to each instance \(\mathbf{x}\in\mathcal{X}\). However, for some loss functions that correspond to slightly different tasks like ranking or probability estimation, the prediction of a classifier is not limited to binary vectors.
3 Stochastic label dependence
Since MLC algorithms analyze multiple labels \(\mathbf{Y}=(Y_{1},Y_{2},\ldots,Y_{m})\) simultaneously, it is worthwhile to study any dependence between them. In this section, we analyze the stochastic dependence between labels and make a distinction between conditional and marginal dependence. As will be seen later on, this distinction is crucial for MLC learning algorithms.
3.1 Marginal and conditional label dependence
As mentioned previously, we distinguish two types of label dependence in MLC, namely conditional and marginal (unconditional) dependence. We start with a formal definition of the latter.
Definition 1
Conditional dependence, in turn, captures the dependence of the labels given a specific instance \(\mathbf{x}\in\mathcal{X}\).
Definition 2
Example 1
Consider a problem with two labels \(Y_{1}\) and \(Y_{2}\), both being independently generated through the same logistic model \(\mathbf{P}(Y_{i}=1 \mid \mathbf{x})=(1+\exp(-\phi f(\mathbf{x})))^{-1}\), where \(\phi\) controls the Bayes error rate. Thus, by definition, the two labels are conditionally independent, having joint distribution \(\mathbf{P}(\mathbf{Y} \mid \mathbf{x})=\mathbf{P}(Y_{1} \mid \mathbf{x}) \times \mathbf{P}(Y_{2} \mid \mathbf{x})\) given \(\mathbf{x}\). However, depending on the value of \(\phi\), we will have a stronger or weaker marginal dependence. For \(\phi\to\infty\) (Bayes error rate tends to 0), the marginal dependence increases toward an almost deterministic one (\(y_{1}=y_{2}\)).
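The effect described in Example 1 can be checked numerically. The following is a minimal simulation sketch, not part of the original study: it assumes a one-dimensional input drawn uniformly from [-1, 1] and takes f(x) = x, neither of which is specified in the text. The function names `sample_labels` and `correlation` are hypothetical.

```python
import math
import random


def sample_labels(phi, n, rng):
    """Draw n label pairs (y1, y2) that are conditionally independent given x."""
    pairs = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)              # assumed input distribution
        p = 1.0 / (1.0 + math.exp(-phi * x))    # logistic model with assumed f(x) = x
        y1 = 1 if rng.random() < p else 0
        y2 = 1 if rng.random() < p else 0       # same model, independent draw
        pairs.append((y1, y2))
    return pairs


def correlation(pairs):
    """Pearson correlation between the two label columns (marginal dependence)."""
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    cov = sum((a - m1) * (b - m2) for a, b in pairs) / n
    v1 = sum((a - m1) ** 2 for a, _ in pairs) / n
    v2 = sum((b - m2) ** 2 for _, b in pairs) / n
    return cov / math.sqrt(v1 * v2)


rng = random.Random(0)
corr_low = correlation(sample_labels(0.1, 20000, rng))    # noisy labels
corr_high = correlation(sample_labels(20.0, 20000, rng))  # nearly deterministic labels
print(corr_low, corr_high)
```

Under these assumptions the correlation should be close to zero for small ϕ and large for large ϕ, although the labels are conditionally independent in both cases.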
The next example shows that conditional dependence does not imply marginal dependence.
Example 2
3.2 Modeling label dependence
In general, the distribution of the noise terms can depend on \(\mathbf{x}\). Moreover, two noise terms \(\epsilon_{i}\) and \(\epsilon_{j}\) can depend on each other, just as the structural parts of the model, say \(h_{i}\) and \(h_{j}\), may share some similarities. Hence, there are two possible sources of label dependence: the structural part of the model \(\mathbf{h}(\cdot)\) and the stochastic part \(\boldsymbol{\epsilon}(\cdot)\).
Example 3
From this point of view, marginal dependence can be seen as a kind of (soft) constraint that a learning algorithm can exploit for the purpose of regularization. This way, it may indeed help to improve predictive accuracy, as will be shown in subsequent sections.
Proposition 1
Proof
When conditioning on a given input \(\mathbf{x}\), one can write \(Y_{i}=q(\epsilon_{i})\) with \(q\) a function. Independence of the error terms then implies independence of the labels. The reverse statement also holds because \(\mathbf{h}\) becomes a constant for a given \(\mathbf{x}\). □
A less general statement has been put forward in Dembczyński et al. (2010b), and independently in Zhang and Zhang (2010).
Let us also underline that conditional dependence may cause marginal dependence, because of (8). In other words, the similarity between the models is not the only source of the marginal dependence.
Briefly summarized, one will encounter conditional dependence between labels if dependencies are observed in the error terms of the model. On the other hand, the observation of label correlations in the training data will not necessarily imply any dependence between error terms. Label correlations only provide evidence for the existence of marginal dependence between labels, even though conditional dependence might be a cause of this dependence.
In the remainder of this paper, we will address the idea of exploiting label dependence in learning multilabel classifiers in more detail. We will claim that exploiting both types of dependence, marginal and conditional, can in principle improve the generalization performance, but the true benefit does also depend on the particular formulation of the problem. Furthermore, we will also argue that some of the existing algorithms are interpreted in a somewhat misleading way.
4 Three views on multilabel classification
1. The “individual label” view: How can we improve the predictive accuracy of a single label by using information about other labels? Moreover, what are the requirements for improvement? (This view is closely connected to transfer and multitask learning (Caruana 1997).)
2. The “joint label” view: What types of proper (non-decomposable) MLC loss functions are suitable for evaluating a multilabel prediction as a whole, and how can such loss functions be minimized?
3. The “joint distribution” view: Under what conditions is it reasonable (or even necessary) to estimate the joint conditional probability distribution over all label combinations?
4.1 Improving single label predictions
Let us first analyze the following question: Can we improve the predictive accuracy for a single label by using the information about other labels? In other words, the question is whether we can improve the binary relevance approach by exploiting relationships between labels. We will refer to this scenario as single label predictions.
Let us also mention that (16) is usually referred to as the macro-average, as the performance is averaged over single labels, thus attributing equal weights to the labels. In contrast, the micro-average, also commonly used in MLC, gives equal weights to all classifications as it is computed over all predictions simultaneously, for example, by first summing contingency matrices for all labels and then computing the performance over the resulting global contingency matrix (only one matrix). However, this kind of averaging does not fall into any of the views considered in this paper. In the next subsection, we discuss, in turn, losses that are decomposable over single instances.
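The difference between the two averaging schemes can be made concrete with a small sketch. The F-measure is used here purely as an illustrative performance measure; the text does not prescribe it, and the helper names (`contingency`, `f_measure`, `macro_f`, `micro_f`) and the toy data are hypothetical.

```python
def contingency(y_true, y_pred):
    """Return (tp, fp, fn) counts for a single label column."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def f_measure(tp, fp, fn):
    """F-measure from counts; returns 0.0 when undefined (an assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def per_label_counts(Y_true, Y_pred):
    """One contingency triple per label."""
    m = len(Y_true[0])
    return [contingency([y[i] for y in Y_true], [y[i] for y in Y_pred])
            for i in range(m)]


def macro_f(Y_true, Y_pred):
    """Average the per-label scores: every label gets equal weight."""
    counts = per_label_counts(Y_true, Y_pred)
    return sum(f_measure(*c) for c in counts) / len(counts)


def micro_f(Y_true, Y_pred):
    """Sum the contingency counts first, then score the single global matrix."""
    counts = per_label_counts(Y_true, Y_pred)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f_measure(tp, fp, fn)


Y_true = [(1, 0), (1, 0), (1, 1), (1, 0)]
Y_pred = [(1, 0), (1, 1), (1, 0), (1, 0)]
macro = macro_f(Y_true, Y_pred)
micro = micro_f(Y_true, Y_pred)
print(macro, micro)
```

In this toy data the frequent first label is predicted perfectly and the rare second label is always missed, so the macro-average (equal weight per label) is lower than the micro-average (equal weight per classification).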
Our discussion so far implies that the single label prediction problem can be solved on the basis of the marginal distributions \(\mathbf{P}(Y_{i} \mid \mathbf{x})\) alone. Hence, with a proper choice of base classifiers and parameters for estimating the marginal probabilities, there is in principle no need for modeling conditional dependence between the labels. This does not exclude the possibility of first modeling the conditional joint distribution (and thus conditional dependencies as well) and then performing a proper marginalization procedure. We discuss such an approach in Sect. 4.3. Here, in turn, we take a closer look at another possibility that relies on exploiting marginal dependence.
In Sect. 6, we will discuss some existing MLC algorithms that improve the performance measured in terms of label-wise decomposable loss functions by exploiting the similarities between the structural parts of the models. Here, let us only add that similarity between structural parts can also be seen as a specific type of background knowledge of the form \(h_{l}(\mathbf{x})=f(h_{k}(\mathbf{x}))\), i.e., knowledge about a functional dependence between the deterministic parts of the models for individual labels. Given a label-wise decomposable loss function, an improvement over BR can also be achieved by using any sort of prior knowledge about the marginal dependence between the labels.
4.2 Minimization of multilabel loss functions
In the framework of MLC, one can consider a multitude of loss functions. We have already discussed the group of losses that are decomposable over single labels, i.e., losses that can be represented as an average over labels. Here, we discuss loss functions that are not decomposable over single labels, but decomposable over single instances. Particularly, we focus on rank loss, F-measure loss, Jaccard distance, and subset 0/1 loss. We start our discussion with the rank loss by showing that this loss function is still closely related to single label predictions. Later, we will discuss the subset 0/1 loss, which is in turn closely related to the estimation of the joint probability distribution. The two remaining loss functions, F-measure loss and Jaccard distance, are more difficult to analyze, and there is no easy way to train a classifier minimizing them.
Theorem 1
As one of the most important consequences of the above result we note that, according to (18), a risk-minimizing prediction for the rank loss can be obtained from the marginal distributions \(\mathbf{P}(Y_{i} \mid \mathbf{x})\) (\(i=1,\ldots,m\)) alone. Thus, just like in the case of Hamming loss, it is in principle not necessary to know the joint label distribution \(\mathbf{P}(\mathbf{Y} \mid \mathbf{x})\) on \(\mathcal{Y}\), which means that risk-minimizing predictions can be made without any knowledge about the conditional dependency between labels. In other words, this result suggests that instead of minimizing the rank loss directly, one can simply use any approach for single label prediction that properly estimates the marginal probabilities.
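This consequence can be verified by brute force on a small example. The sketch below assumes the usual pairwise form of the (unnormalized) rank loss, which counts every pair with a relevant label scored below an irrelevant one and charges one half for ties; the toy joint distribution and all function names are hypothetical.

```python
from itertools import permutations


def rank_loss(y, scores):
    """Assumed pairwise rank loss: mis-ordered (relevant, irrelevant) pairs, ties count 1/2."""
    loss = 0.0
    m = len(y)
    for i in range(m):
        for j in range(m):
            if y[i] > y[j]:                      # label i relevant, label j irrelevant
                if scores[i] < scores[j]:
                    loss += 1.0
                elif scores[i] == scores[j]:
                    loss += 0.5
    return loss


def expected_loss(joint, scores):
    """Risk of a scoring vector under a conditional joint distribution P(y | x)."""
    return sum(p * rank_loss(y, scores) for y, p in joint.items())


def marginals(joint, m):
    """Marginal probabilities P(y_i = 1 | x) obtained from the joint distribution."""
    return [sum(p for y, p in joint.items() if y[i] == 1) for i in range(m)]


# Toy conditional distribution for one fixed x, with strongly dependent labels.
joint = {(1, 1, 0): 0.40, (0, 0, 1): 0.35, (1, 0, 0): 0.25}

risk_marginal = expected_loss(joint, marginals(joint, 3))
best_risk = min(expected_loss(joint, perm) for perm in permutations(range(3)))
print(risk_marginal, best_risk)
```

In this example, scoring the labels by their marginal probabilities attains the same risk as the best of all strict rankings, even though the labels are far from conditionally independent.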
In passing, we note that there is also a normalized variant of the rank loss, in which the number of mistakes is divided by the maximum number of possible mistakes on \(\mathbf{y}\), i.e., by the number of summands in (17); this number is given by \(r(m-r)/2\), with \(r= \sum_{i=1}^{m} y_{i}\) the number of relevant labels. Without going into detail, we mention that the above result cannot be extended to the normalized version of the rank loss. That is, knowing the marginal distributions \(\mathbf{P}(Y_{i} \mid \mathbf{x})\) is not enough to produce a risk minimizer in this case.
4.3 Conditional joint distribution estimation
Nevertheless, the estimation of the joint probability is a difficult task. In general one has to estimate \(2^{m}\) values for a given \(\mathbf{x}\), namely the probability degrees \(\mathbf{P}(\mathbf{y} \mid \mathbf{x})\) for all \(\mathbf{y} \in\mathcal{Y}\). In order to solve this problem efficiently, all methods for probability estimation can in principle be used. This includes parametric approaches based on Gaussian distributions or exponential families, reducing the problem to the estimation of a small number of parameters (Joe 2000). It also includes graphical models such as Bayesian networks (Jordan 1998), which factorize a high-dimensional distribution into the product of several lower-dimensional distributions. For example, departing from the product rule of probability (7), one can try to simplify a joint distribution by exploiting label independence whenever possible, ending up with (6) in the extreme case of conditional independence.

First, obtain estimates of the conditional marginal distributions for every label separately. This step could be considered as a probabilistic binary relevance approach.

Subsequently, estimate a copula on top of the marginal distributions to obtain the conditional joint distribution.
5 Theoretical insights into multilabel classification
In many MLC papers, a new learning algorithm is introduced without clearly stating the problem to be solved. Then, the algorithm is empirically tested with respect to a multitude of performance measures, but without precise information about which of these measures the algorithm actually intends to optimize. This may implicitly give the misleading impression that the same method can be optimal for several loss functions at the same time.
In this section, we provide theoretical evidence for the claim that our distinction between MLC problems, as proposed in the previous section, is indeed important. A classifier supposed to be good for solving one of those problems may perform poorly for another problem. In order to facilitate the analysis, we restrict ourselves to two loss functions, namely the Hamming and the subset 0/1 loss. The first one is representative of the single label scenario, while the second one is a typical multilabel loss function whose minimization calls for an estimation of the joint distribution. Our analysis proceeds from the simplifying assumption of an unconstrained hypothesis space, which allows us to consider the conditional distribution for a given x. As such, this theoretical analysis will differ from the experimental analysis reported in Sect. 7, where parametric hypothesis spaces are considered. Despite this conceptual difference, our theoretical and experimental results will be highly consistent. They both support the main claims of this paper concerning loss minimization and its relationship with label dependence. While the theoretical analysis mainly provides evidence on the population level, the empirical study also investigates the effect of estimation.
The main result of this section will show that, in general, the Hamming loss minimizer and the subset 0/1 loss minimizer will differ significantly. That is, the Hamming loss minimizer may be poor in terms of the subset 0/1 loss and vice versa. In some (not necessarily unrealistic) situations, however, the Hamming and subset 0/1 loss minimizers coincide, an observation that may explain some misleading results in recent MLC papers. The following proposition reveals two such situations.
Proposition 2
 (1) Labels \(Y_{1},\ldots,Y_{m}\) are conditionally independent, i.e., \(\mathbf{P}(\mathbf{Y} \mid \mathbf{x}) = \prod_{i=1}^{m} \mathbf{P}(Y_{i} \mid \mathbf{x})\).
 (2) The probability of the mode of the joint probability is greater than or equal to 0.5, i.e., \(\mathbf{P}(\mathbf{h}^{*}_{s}(\mathbf{x}) \mid \mathbf{x}) \ge 0.5\).
Proof
 (1) Since the joint probability of any combination of \(\mathbf{y}\) is given by the product of marginal probabilities, the highest value of this product is given by the highest values of the marginal probabilities. Thus, the joint mode is composed of the marginal modes.
 (2) If \(\mathbf{P}(\mathbf{h}^{*}_{s}(\mathbf{x}) \mid \mathbf{x}) \ge 0.5\), then \(\mathbf{P}(h^{*}_{s_{i}}(\mathbf{x}) \mid \mathbf{x}) \ge 0.5\), \(i=1,\ldots,m\), and from this it follows that \(h^{*}_{s_{i}}(\mathbf{x}) = h^{*}_{H_{i}}(\mathbf{x})\).
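To illustrate the second condition, the sketch below computes both risk minimizers from an explicit conditional joint distribution for a single x: the subset 0/1 minimizer as the joint mode and the Hamming minimizer as the vector of marginal modes. The two toy distributions and the function names are hypothetical, and ties at exactly 0.5 are resolved toward 0 as an arbitrary convention.

```python
def subset_minimizer(joint):
    """Joint mode: the most probable label combination."""
    return max(joint, key=joint.get)


def hamming_minimizer(joint):
    """Marginal modes: predict label i iff P(y_i = 1 | x) > 0.5."""
    m = len(next(iter(joint)))
    marg = [sum(p for y, p in joint.items() if y[i] == 1) for i in range(m)]
    return tuple(1 if q > 0.5 else 0 for q in marg)


# Mode probability below 0.5: the two minimizers may differ.
joint_a = {(0, 0): 0.4, (1, 0): 0.3, (1, 1): 0.3}
# Mode probability at least 0.5: condition (2) of Proposition 2 holds.
joint_b = {(0, 0): 0.6, (1, 0): 0.2, (1, 1): 0.2}

for joint in (joint_a, joint_b):
    print(subset_minimizer(joint), hamming_minimizer(joint))
```

In the first distribution the joint mode is (0, 0) while the marginal modes give (1, 0); in the second, where the mode carries probability 0.6, both predictions coincide.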
As a simple corollary of this proposition, we have the following.
Corollary 1
In the separable case (i.e., the joint conditional distribution is deterministic, \(\mathbf{P}(\mathbf{Y} \mid \mathbf{x})=[\![\mathbf{Y}=\mathbf{y}]\!]\), where \(\mathbf{y}\) is a binary vector of size \(m\)), the risk minimizers of the Hamming loss and subset 0/1 loss coincide.
Proof
If \(\mathbf{P}(\mathbf{Y} \mid \mathbf{x})=[\![\mathbf{Y}=\mathbf{y}]\!]\), then \(\mathbf{P}(\mathbf{Y} \mid \mathbf{x}) = \prod_{i=1}^{m} \mathbf{P}(Y_{i} \mid \mathbf{x})\). In this case, we also have \(\mathbf{P}(\mathbf{h}^{*}_{s}(\mathbf{x}) \mid \mathbf{x}) \ge 0.5\). Thus, the result follows from both (1) and (2) in Proposition 2. □
Moreover, one can claim that the two loss functions are related to each other because of the following simple bounds (the proof is given in the Appendix).
Proposition 3
Proposition 4
The second result concerns the highest value of the regret in terms of the Hamming loss for \(\mathbf{h}^{*}_{s}(\mathbf{X})\), the optimal strategy for the subset 0/1 loss (the proof is given in the Appendix).
Proposition 5
As we can see, the worst case regret is high for both loss functions, suggesting that a single classifier will not be able to perform equally well in terms of both functions. Instead, a classifier specifically tailored for the Hamming (subset 0/1) loss will indeed perform much better for this loss than a classifier trained to minimize the subset 0/1 (Hamming) loss.
6 MLC algorithms for exploiting label dependence
Recently, a number of learning algorithms for MLC have been proposed in the literature, mostly with the goal to improve predictive performance (in comparison to binary relevance learning), but sometimes also having other objectives in mind (e.g., reduction of time complexity (Hsu et al. 2009)). To achieve their goals, the algorithms typically seek to exploit dependencies between the labels. However, as mentioned before, concrete information about the type of dependency tackled or the loss function to be minimized is rarely given. In many cases, this is a cause of confusion and ill-designed experimental studies, in which inappropriate algorithms are used as baselines.
Tsoumakas and Katakis (2007) distinguish two categories of MLC algorithms, namely problem transformation methods (reduction) and algorithm adaptation methods (adaptation). Here, we focus on algorithms from the first group, mainly because they are simple and widely used in empirical studies. Thus, a proper interpretation of these algorithms is strongly desired.
We discuss reduction algorithms in light of our three views on MLC problems. We will start with a short description of the BR approach. Then, we will present algorithms tailored for single label predictions by exploiting the similarities between structural parts of the models. Next, we will discuss algorithms that take conditional label dependence into account and are hence tailored for other multilabel loss functions, like the subset 0/1 loss. Some of these algorithms are also able to estimate the joint distribution. To summarize the discussion on these algorithms, we present their main properties in a table. Let us, however, underline that this description concerns the basic settings of these algorithms given in the original papers. It may be possible to extend their functionality by altering their setup. At the end of this section, we give a short review of adaptation algorithms, but their detailed description is beyond the scope of this paper. We also briefly describe algorithms devoted to multilabel ranking problems.
6.1 Binary relevance
As we mentioned before, BR is the simplest approach to multilabel classification. It reduces the problem to binary classification, by training a separate binary classifier h _{ i }(⋅) for each label λ _{ i }. Learning is performed independently for each label, ignoring all other labels.
Obviously, BR does not take label dependence into account, neither conditional nor marginal. Indeed, as suggested by our theoretical results, BR is, in general, not able to yield risk-minimizing predictions for losses like subset 0/1, but it is well-tailored for Hamming loss minimization or, more generally, every loss whose risk minimizer can be expressed solely in terms of marginal distributions \(\mathbf{P}(Y_{i} \mid \mathbf{x})\) (\(i=1,\ldots,m\)). As confirmed by several experimental studies, this approach might be sufficient for getting good results in such cases. However, exploiting marginal dependencies may still be beneficial, especially for small-sized problems.
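The BR reduction itself is short enough to sketch in a few lines. The wrapper below trains one independent binary classifier per label column; the 1-nearest-neighbour base learner is only a hypothetical stand-in (any binary classifier with `fit` and `predict` methods could be plugged in), and the class names and toy data are assumptions made for illustration.

```python
class NearestNeighbor:
    """Hypothetical stand-in base learner: 1-nearest-neighbour binary classifier."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in self.X]
        return self.y[dists.index(min(dists))]


class BinaryRelevance:
    """Train one binary classifier per label, ignoring all other labels."""

    def __init__(self, make_learner):
        self.make_learner = make_learner   # factory returning a fresh base learner

    def fit(self, X, Y):
        m = len(Y[0])
        # Label i is learned from its own column of Y only.
        self.models = [self.make_learner().fit(X, [y[i] for y in Y])
                       for i in range(m)]
        return self

    def predict(self, x):
        return tuple(h.predict(x) for h in self.models)


X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
Y = [(0, 0), (0, 1), (1, 0), (1, 1)]
br = BinaryRelevance(NearestNeighbor).fit(X, Y)
pred = br.predict((0.9, 0.1))
print(pred)
```

Because each model sees only its own label column, no information about label co-occurrence can enter the prediction, which is the property discussed above.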
6.2 Single label predictions
Stacking.
Methods like Stacking (Godbole and Sarawagi 2004; Cheng and Hüllermeier 2009) directly follow the first scheme (25). They replace the original predictions, obtained by learning every label separately, by correcting them in light of information about the predictions of the other labels. This transformation of the initial prediction should be interpreted as a regularization procedure. Another possible interpretation is a feature expansion. This method can easily be used with any kind of binary classifier. It is not clear, in general, whether the meta-classifier \(\mathbf{b}\) should be trained on the BR predictions \(\mathbf{h}(\mathbf{x})\) alone or use the original features \(\mathbf{x}\) as additional inputs. Another question concerns the type of information provided by the BR predictions. One can use binary predictions, but also values of scoring functions or probabilities, if such outputs are delivered by the classifier.
Multivariate regression.
These methods can also be represented by the second scheme (26). First, \(\mathbf{y}\) is transformed to the canonical coordinate system \(\mathbf{y}'=\mathbf{T}\mathbf{y}\). Then, separate linear regression is performed to obtain estimates \(\tilde{\mathbf{y}}' = (\tilde{y}'_{1}, \tilde{y}'_{2}, \ldots, \tilde{y}'_{n})\). These estimates are further shrunk by the factor \(g_{ii}\), obtaining \(\hat{\mathbf{y}}' = \mathbf{G} \tilde{\mathbf{y}}'\). Finally, the prediction is transformed back to the original coordinate output space \(\hat{\mathbf{y}} = \mathbf{T}^{-1} \hat{\mathbf{y}}'\).
Kernel dependency estimation.
The above references rather originate from the statistics domain, but similar approaches have also been introduced in machine learning, like kernel dependency estimation (KDE) (Weston et al. 2002) and multi-output regularized feature projection (MORP) (Yu et al. 2006). We focus here on the former method. It consists of a three-step procedure. The first step conducts a kernel principal component analysis of the label space for deriving nonlinear combinations of the labels or for predicting structured outputs. Subsequently, the transformed labels (i.e., the principal components) are used in a simple multivariate regression method that does not have to care about label dependencies, knowing that the transformed labels are uncorrelated. In the last step, the predicted labels of test data are transformed back to the original label space. Since kernel PCA is used, this transformation is not straightforward, and the so-called pre-image problem has to be solved. Label-based regularization can be included in this approach as well, simply by using only the first \(r<m\) principal components in steps two and three, similar to regularization based on feature selection in methods like principal component regression (Hastie et al. 2007). The main difference between KDE and multivariate regression methods described above is the use of kernel PCA instead of CCA. Simplified KDE approaches based on PCA have been studied for multilabel classification in Tai and Lin (2010). Here, the main goal was to reduce the computational costs by using only the most important principal components.
Compressive sensing.
The idea behind compressive sensing used for MLC (Hsu et al. 2009) is quite different, but the resulting method shares a lot of similarities with the algorithms described above. The method assumes that the label sets can be compressed and that we can learn to predict the compressed labels instead. From this point of view, we can mainly improve the time complexity, since we solve a lower number of core problems. The compression of the label sets is possible only if the vectors \(\mathbf{y}\) are sparse. This method follows scheme (26) to some extent. The main difference is the interpretation of the matrix \(\mathbf{T}\). Here, we obtain \(\mathbf{y}'=\mathbf{T}\mathbf{y}\) by using a random matrix from an appropriate distribution (such as Gaussian, Bernoulli, or Hadamard) whose number of rows is much smaller than the length of \(\mathbf{y}\). This results in a new multivariate regression problem with a lower number of outputs. The prediction for a novel \(\mathbf{x}\) relies on computing the output of the regression problem \(\hat{\mathbf{y}}'\), and then on obtaining a sparse vector \(\hat{\mathbf{y}}\) such that \(\mathbf{T}\hat{\mathbf{y}}\) is closest to \(\hat{\mathbf{y}}'\) by solving an optimization problem, similar to KDE. In other words, there is no simple decoding from the compressed to the original label space, as was the case for multivariate regression methods.
6.3 Estimation of joint distribution and minimization of multilabel loss functions
Here, we describe some methods that seek to estimate the joint distribution \(\mathbf{P}(\mathbf{Y} \mid \mathbf{x})\). As explained in Sect. 4.3, knowledge about the joint distribution (or an estimation thereof) allows for an explicit derivation of the risk minimizer of any loss function. However, we also mentioned the high complexity of this approach.
Label Powerset (LP).
This approach reduces the MLC problem to multiclass classification, considering each label subset \(L \subseteq \mathcal{L}\) as a distinct meta-class (Tsoumakas and Katakis 2007; Tsoumakas and Vlahavas 2007). The number of these meta-classes may become as large as \(|2^{\mathcal{L}}| = 2^{m}\), although it is often reduced considerably by ignoring label combinations that never occur in the training data. Nevertheless, the large number of classes produced by this reduction is generally seen as the most important drawback of LP.
Since prediction of the most probable metaclass is equivalent to prediction of the mode of the joint label distribution, LP is tailored for the subset 0/1 loss. In the literature, however, it is often claimed to be the right approach to MLC in general, as it obviously takes the label dependence into account. This claim is arguably incorrect and does not discern between the two types of dependence, conditional and unconditional. In fact, LP takes the conditional dependence into account and usually fails for loss functions like Hamming.
Let us notice that LP can easily be extended to any other loss function, provided the underlying multiclass classifier \(f(\cdot)\) does not only provide a class prediction but also a reasonable estimate of the probability of all meta-classes (label combinations), i.e., \(f(\mathbf{x}) \approx \mathbf{P}(\mathbf{Y} \mid \mathbf{x})\). From this point of view, LP can be seen as a method for estimating the conditional joint distribution. Practically, however, the large number of meta-classes makes probability estimation an extremely difficult problem. In this regard, we also mention that most implementations of LP essentially ignore label combinations that are not present in the training set or, stated differently, tend to underestimate (set to 0) their probabilities.
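The LP transformation and its extension to another loss can be sketched as follows. The code builds the meta-classes from the label vectors observed in training and then decodes a vector of meta-class probabilities in two ways: as the joint mode (subset 0/1 loss) and by marginalization (Hamming loss). The probability vector `probs` stands in for the output of some multiclass probability estimator at one query instance; it and all function names are hypothetical.

```python
def build_metaclasses(Y):
    """Map each label vector seen in training to a meta-class index."""
    classes = sorted(set(Y))
    return {y: k for k, y in enumerate(classes)}, classes


def decode_subset(probs, classes):
    """Most probable meta-class, i.e. the estimated joint mode."""
    return classes[probs.index(max(probs))]


def decode_hamming(probs, classes):
    """Marginalize the meta-class probabilities and threshold each label at 0.5."""
    m = len(classes[0])
    marg = [sum(p for y, p in zip(classes, probs) if y[i] == 1) for i in range(m)]
    return tuple(1 if q > 0.5 else 0 for q in marg)


Y_train = [(0, 0), (1, 0), (1, 1), (1, 0)]
to_index, classes = build_metaclasses(Y_train)
targets = [to_index[y] for y in Y_train]      # multiclass targets for the base learner

# Hypothetical meta-class probabilities returned by a multiclass classifier for one x.
probs = [0.45, 0.25, 0.30]
print(classes, targets)
print(decode_subset(probs, classes), decode_hamming(probs, classes))
```

Note that the combination (0, 1) never occurs in `Y_train` and therefore has no meta-class, so its probability is implicitly set to 0, as remarked above.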
Several extensions of LP have been proposed in order to overcome its computational burden. The RAKEL algorithm (Tsoumakas and Vlahavas 2007) is an ensemble method that consists of several LP classifiers defined on randomly drawn subsets of labels. This method is parametrized by a number of base classifiers and the size of label subsets. A global prediction is obtained by combining the predictions of the ensemble members on the label subsets. Essentially, this is done by counting, for each label, how many times it is included in a predicted label subset. Despite its intuitive appeal and competitive performance, RAKEL is still not well understood from a theoretical point of view. For example, it is not clear what loss function it intends to minimize.
Probabilistic Classifier Chains (PCC).
Much more problematic, however, is doing inference from the given joint distribution. In fact, exact inference will again come down to using (27) in order to produce a probability degree for each label combination, and hence cause an exponential complexity. Since this approach is infeasible in general, approximate methods may have to be used. For example, a simple greedy approximation of the joint mode is obtained by successively choosing the most probable label according to each of the classifiers’ predictions. This approach, referred to as classifier chains (CC), has been introduced in Read et al. (2009), albeit without a probabilistic interpretation. Alternatively, one can exploit (27) to sample from it. Then, one can compute a response for a given loss function based on this sample. Such an approach has been used for the F-measure in Dembczyński et al. (2012).
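The difference between exact inference and the greedy approximation can be made concrete with a small sketch. It assumes that (27) is the product-rule factorization of the joint distribution and that each chain member is available as a function returning the conditional probability of relevance given the preceding labels; the list `cond` and its probability values are hypothetical.

```python
from itertools import product


def chain_prob(cond, y):
    """Joint probability of the label vector y under the product rule."""
    p = 1.0
    for i, yi in enumerate(y):
        p_relevant = cond[i](y[:i])  # P(y_i = 1 | x, y_1, ..., y_{i-1})
        p *= p_relevant if yi == 1 else 1.0 - p_relevant
    return p


def exact_mode(cond):
    """Exhaustive search over all 2^m label combinations (exponential cost)."""
    return max(product((0, 1), repeat=len(cond)), key=lambda y: chain_prob(cond, y))


def greedy_mode(cond):
    """Greedy approximation: fix each label to its most probable value in turn."""
    y = ()
    for f in cond:
        y += (int(f(y) > 0.5),)
    return y


# Hypothetical conditional probabilities for a fixed instance x and m = 2 labels.
cond = [
    lambda prev: 0.4,                            # P(y_1 = 1 | x)
    lambda prev: 0.9 if prev[0] == 1 else 0.45,  # P(y_2 = 1 | x, y_1)
]

exact = exact_mode(cond)
greedy = greedy_mode(cond)
```

With these assumed values the greedy procedure commits to the first label being irrelevant and therefore misses the most probable combination, which shows why the greedy prediction need not coincide with the joint mode.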
Theoretically, the result of the product rule does not depend on the order of the variables. Practically, however, two different classifier chains will produce different results, simply because they involve different classifiers learned on different training sets. To reduce the influence of the label order, Read et al. (2009) propose to average the multilabel predictions of CC over a (randomly chosen) set of permutations. Thus, the labels λ_1, …, λ_m are first reordered by a permutation π of {1, …, m}, which moves the label λ_i from position i to position π(i), and CC is then applied as usual. This extension is called the ensembled classifier chain (ECC). In ECC, a prediction is made by averaging over several CC predictions. However, as in the case of RAKEL, it is rather unclear what this approach actually tends to estimate, and what loss function it seeks to minimize.
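The averaging step of ECC can be sketched as follows. The sketch assumes that each chain outputs a binary vector in chain order, that label i sits at chain position pi[i] (0-based indices), and that the averaged votes are thresholded at 0.5; the threshold and all names are assumptions made for illustration.

```python
def undo_permutation(chain_pred, pi):
    """Map a prediction in chain order back to the original label order.

    chain_pred[pos] is the prediction at chain position pos, and label i was
    moved to position pi[i].
    """
    return [chain_pred[pi[i]] for i in range(len(pi))]


def ecc_average(chain_preds, perms, threshold=0.5):
    """Average several CC predictions label-wise and threshold the result."""
    m = len(perms[0])
    restored = [undo_permutation(z, pi) for z, pi in zip(chain_preds, perms)]
    avg = [sum(r[i] for r in restored) / len(restored) for i in range(m)]
    return avg, [int(a > threshold) for a in avg]


# Three hypothetical chains over m = 3 labels, each with its own label order.
perms = [(0, 1, 2), (2, 0, 1), (1, 2, 0)]
chain_preds = [(1, 0, 1), (0, 1, 1), (1, 1, 1)]
avg, prediction = ecc_average(chain_preds, perms)
```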
Summary of the properties of the most popular reduction algorithms for multilabel classification problems

| Method | Marginal dependence | Conditional dependence | Loss function |
|---|---|---|---|
| BR | no | no | Hamming loss^a and rank loss^b |
| Stacking | yes | no | Hamming loss^a and rank loss^b |
| C&W, RRR, FICYREG | yes | no | squared error loss |
| KDE | yes | no | kernel-based loss functions^c |
| Compressive sensing | yes^d | no | squared error loss, Hamming loss |
| LP | no | yes | subset 0/1 loss, any loss^e |
| RAKEL | no | yes | not explicitly defined |
| PCC | no | yes | any loss^e |
| CC | no | yes | subset 0/1 loss |
| ECC | no | yes | not explicitly defined |
6.4 Other approaches to MLC
For the sake of completeness, let us mention that the list of methods discussed so far is not exhaustive. In fact, there are several other methods that are potentially interesting in the context of MLC. This includes, for example, conditional random fields (CRF) (Lafferty et al. 2001; Ghamrawi and McCallum 2005), a specific type of graphical model that allows for representing relationships between labels and features in a quite convenient way. This approach is designed for finding the joint mode, thus for minimizing the subset 0/1 loss. It can also be used for estimating the joint probability of label combinations.
Instead of estimating the joint probability distribution, one can also try to minimize a given loss function in a more direct way. Concretely, this can be accomplished within the framework of structural support vector machines (SSVM) (Tsochantaridis et al. 2005); indeed, a multilabel prediction can be seen as a specific type of structured output. Finley and Joachims (2008) and Hariharan et al. (2010) (M3L) tailored this algorithm explicitly to minimize the Hamming loss in MLC problems. Let us also notice that Pletscher et al. (2010) introduced a generalization of SSVMs and CRFs that can be applied for optimizing a variety of MLC loss functions. Yet another approach to direct loss minimization is the use of boosting techniques. In Amit et al. (2007), so-called label covering loss functions are introduced that include the Hamming and the subset 0/1 losses as special cases. The authors also propose a learning algorithm suitable for minimizing covering losses, called AdaBoost.LC.
Finally, let us briefly discuss algorithms that have been designed for the problem of label ranking, i.e., MLC problems in which ranking-based performance measures, like the rank loss (17), are of primary interest. One of the first algorithms of this type was BoosTexter (Schapire and Singer 2000), an adaptation of AdaBoost. This idea has been further generalized to log-linear models by Dekel et al. (2004). RankSVM is an instantiation of SVMs that can be applied to this type of problem (Elisseeff and Weston 2002). Ranking by pairwise comparison (Hüllermeier et al. 2008; Fürnkranz et al. 2008) is a reduction method that transforms the MLC problem into a quadratic number of binary problems, one for each pair of labels.
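The pairwise reduction can be sketched as follows. For m labels it creates m(m−1)/2 binary problems. The sketch assumes, as one common construction, that the problem for a pair (i, j) uses only those instances on which exactly one of the two labels is relevant, with target 1 if label i is the relevant one; this construction and all names are assumptions made for illustration.

```python
from itertools import combinations


def pairwise_problems(X, Y):
    """Build one binary training set per unordered pair of labels."""
    m = len(Y[0])
    problems = {}
    for i, j in combinations(range(m), 2):
        # Keep only instances that distinguish label i from label j.
        problems[(i, j)] = [
            (x, int(y[i] == 1)) for x, y in zip(X, Y) if y[i] != y[j]
        ]
    return problems


# Hypothetical toy data: three instances (named by strings) and m = 3 labels.
X = ["a", "b", "c"]
Y = [(1, 0, 0), (1, 1, 0), (0, 0, 1)]
problems = pairwise_problems(X, Y)
```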
7 Experimental evidence
To corroborate our theoretical results by means of empirical evidence, this section presents a number of experimental studies, using both synthetic and benchmark data. We restricted the experiments to four reduction algorithms: BR, Stacking (SBR), CC, and LP. We test these methods in terms of Hamming and subset 0/1 loss. First, we investigate the behavior of these methods on synthetic datasets, pointing to some important pitfalls often encountered in experimental studies of MLC. Then, we present some results on benchmark datasets and discuss them in the light of these pitfalls.
We used the implementations of BR and LP from the MULAN package (Tsoumakas et al. 2010),^{5} and the original implementation of CC (Read et al. 2009) from the MEKA package.^{6} We implemented our own code for Stacking, built upon the code of BR. In the following experiments, we employed linear logistic regression (LR) as the base classifier of the MLC methods, taking the implementation from WEKA (Witten and Frank 2005).^{7} In some experiments, we also used a rule ensemble algorithm called MLRules,^{8} which can be treated as a nonlinear version of logistic regression, as this method trains a linear combination of decision (classification) rules by maximizing the likelihood (Dembczyński et al. 2008). In SBR, we first trained binary relevance based on LR or MLRules, and subsequently a second LR for every label, in which the predicted labels (in fact, probabilities) of a given instance are used as additional features. In CC, the base classifier was trained for each consecutive label using the preceding labels as additional inputs, and the prediction was computed in a greedy way, since we adopted the original version of this algorithm (not the probabilistic one). We kept the original order of the labels (in one experiment we trained an ensemble of CCs, and in this case we randomized the order of labels). In LP, we used the 1-vs-1 method to solve the multiclass problem.
For each binary problem resulting from the reduction algorithm, we applied an internal three-fold cross-validation on the training data for tuning the regularization parameters of the base learner. For a given binary problem, we chose the model with the lowest misclassification error. For LR, we used the following set of possible values of the regularization parameter: {1000, 100, 10, 1, 0.1, 0.01, 0.001}. For MLRules, we varied pairs of the number of rules and the shrinkage parameter. The possible values for the number of rules are {5, 10, 20, 50, 100, 200, 500}. We associated the shrinkage parameter with the number of rules by taking, respectively, the following values: {1, 1, 1, 0.5, 0.2, 0.2, 0.1}.
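The tuning setup can be written down compactly as follows. The two grids are taken from the text; the interleaved fold assignment and the `cv_error` callable are hypothetical placeholders, since the exact splitting procedure and the interface of the base learners are not specified here.

```python
# Candidate settings as listed in the text.
LR_GRID = [1000, 100, 10, 1, 0.1, 0.01, 0.001]
N_RULES = [5, 10, 20, 50, 100, 200, 500]
SHRINKAGE = [1, 1, 1, 0.5, 0.2, 0.2, 0.1]
# Each number of rules is paired with the shrinkage value at the same position.
MLRULES_GRID = list(zip(N_RULES, SHRINKAGE))


def k_fold_indices(n, k=3):
    """Split indices 0..n-1 into k interleaved folds; return (train, validation) pairs."""
    splits = []
    for f in range(k):
        val = list(range(f, n, k))
        held_out = set(val)
        train = [i for i in range(n) if i not in held_out]
        splits.append((train, val))
    return splits


def select_setting(candidates, cv_error):
    """Return the candidate with the lowest cross-validated misclassification error.

    cv_error is a hypothetical callable mapping a candidate setting to its
    average validation error over the folds.
    """
    return min(candidates, key=cv_error)
```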
According to this setting and our theoretical claims, BR and SBR should perform well for the Hamming loss, while CC and LP are more appropriate for the subset 0/1 loss.
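For reference, the two evaluation measures can be computed as in the following sketch, which uses the usual definitions: the Hamming loss is the fraction of labels predicted incorrectly, and the subset 0/1 loss is 1 whenever the predicted label vector differs from the true one. The standard error is computed here as the sample standard deviation of the per-instance losses divided by the square root of the number of test instances; that this is how the standard errors were obtained in the experiments is an assumption.

```python
from math import sqrt


def hamming_losses(Y_true, Y_pred):
    """Per-instance Hamming loss: fraction of labels predicted incorrectly."""
    return [sum(a != b for a, b in zip(y, h)) / len(y) for y, h in zip(Y_true, Y_pred)]


def subset01_losses(Y_true, Y_pred):
    """Per-instance subset 0/1 loss: 1 unless the whole label vector is correct."""
    return [int(tuple(y) != tuple(h)) for y, h in zip(Y_true, Y_pred)]


def mean_and_se(values):
    """Mean and standard error (sample standard deviation / sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, sqrt(var / n)


# Hypothetical true and predicted label vectors for four test instances.
Y_true = [(1, 0, 1), (0, 1, 0), (1, 1, 0), (0, 0, 1)]
Y_pred = [(1, 0, 0), (0, 1, 0), (1, 0, 1), (0, 0, 1)]

hamming_mean, hamming_se = mean_and_se(hamming_losses(Y_true, Y_pred))
subset_mean, subset_se = mean_and_se(subset01_losses(Y_true, Y_pred))
```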
7.1 Synthetic data
7.2 Marginal independence
7.3 Conditional independence
In this experiment, we analyze the case of conditional independence. In this case, we used only two features and each label was computed on them using different linear models, in contrast to the previous experiment, where two separate features were constructed for each label individually. The error terms for different labels were independently sampled. They followed a Bernoulli distribution as before with π=0.1. First, we generated data for τ=0. This results in models sharing the same structural part and differing in the stochastic part only. Later, we changed τ to 1. In this case some of the labels can still share some similarities. We can observe marginal dependence, but it is not so extreme as in the previous case. Let us also notice that in this case the risk minimizers for both loss functions coincide.
Interestingly, CC is not better than BR in terms of Hamming loss in the case of the same structural parts. Moreover, for CC the standard errors of the Hamming loss are insensitive to the number of labels. For τ=1, its performance decreases as the number of labels increases. However, it performs much better with respect to subset 0/1 loss, and its behavior is similar to BR in this case. These results can be interpreted as follows. For the same structural parts, CC tends to build a model based on the values of previous labels. In the prediction phase, however, once an error is made, it will be propagated along the chain. From this point of view, its behavior is similar to using, for all labels, a base classifier that has been learned on the first label. That is why the standard errors do not change in the case of Hamming loss. This behavior gives a small advantage for subset 0/1 loss, as the predictions become more homogeneous. On the other hand, training in the case of different structural parts (τ=1) becomes more difficult, as there are no clear patterns among the previous labels. From this point of view, the overall performance is influenced by both the training and the prediction phase, as the algorithm makes mistakes in both phases.
More generally speaking, in addition to the potential existence of dependence between the error terms in the underlying statistical process that generates the data, one can claim as well that dependence can occur in the errors of the fitted models on test data. From this perspective, BR and SBR can be interpreted as methods that do not induce additional dependence between error terms, although the errors might be dependent due to the existence of dependence in the underlying statistical process. CC on the other hand will typically induce some further dependence, in addition to the dependence in the underlying statistical process. So, even if we have conditional independence in the data, the outputs of CC tend to result in dependent errors, simply because errors propagate through the chain. Obviously, this does not have to be at all a bottleneck in minimizing the subset 0/1 loss, but it can have a big impact on minimizing the Hamming loss, even if the true labels are conditionally independent.
LP seems to break down completely when the number of labels increases. Since the errors are independently generated for each label, the training sets contain a lot of different label combinations, resulting in a large number of metaclasses for LP. For small training datasets, the majority of these metaclasses will not even occur in the training data.
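The growth in the number of metaclasses under independent label noise can be illustrated with a small simulation. The sketch makes strong simplifying assumptions: all instances share one noise-free label vector, and the Bernoulli error is modeled as an independent flip of each label with probability pi. It is not a reproduction of the data-generating process used in the experiments.

```python
import random


def count_metaclasses(n, m, pi, seed=0):
    """Count distinct label combinations among n noisy copies of one label vector."""
    rng = random.Random(seed)
    base = tuple(i % 2 for i in range(m))  # hypothetical noise-free label vector
    seen = set()
    for _ in range(n):
        # Flip each label independently with probability pi.
        y = tuple(b ^ (rng.random() < pi) for b in base)
        seen.add(y)
    return len(seen)


# Number of distinct combinations for an increasing number of labels.
counts = {m: count_metaclasses(n=200, m=m, pi=0.1) for m in (2, 5, 10)}
```

Without noise there is a single label combination; with independent flips, the number of distinct combinations is bounded only by the sample size and by 2^m, so many metaclasses are represented by very few training instances.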
7.4 Conditional dependence
The behavior of CC is quite similar to that in the previous experiment with independent errors on different structural parts. Apart from the dependence of errors, it seems that the structural part of the model influences the performance to a greater degree, and the algorithm is not able to learn accurately. In addition, one can also observe that LP performs much better in comparison to previous settings. The main reason is that the number of different label combinations is much lower than before. Nevertheless, LP still behaves worse than binary relevance.
7.5 Joint mode ≠ marginal modes
SBR and BR perform the best for the Hamming loss, with the former yielding a slight advantage. For the subset 0/1 loss, LP now becomes the best, thereby supporting our theoretical claim that BR is estimating marginal modes, while LP is seeking the mode of the joint conditional distribution. Moreover, an interesting behavior of CC can be observed; for a small number of labels, it properly estimates the joint mode. However, its performance decreases with an increase in the number of labels. It follows that one has to use a proper base classifier to capture the conditional dependence. A linear classifier is too weak in this case. Moreover, CC employs a greedy approximation of the joint mode, which might also have a negative impact on the performance.
7.6 XOR problem
In the literature, LP is often shown to outperform BR even in terms of Hamming loss. Given our results so far, this is somewhat surprising and calls for an explanation. We argue that results of that kind should be considered with caution, mainly because a meta learning technique (such as BR and LP) must always be considered in conjunction with the underlying base learner. In fact, differences in performance should not only be attributed to the meta but also to the base learner. In particular, since BR uses binary and LP multiclass classification, they are typically applied with different base learners, and hence are not directly comparable.
We illustrate this by means of an example in which we generated data as before, but using XOR instead of linear models. More specifically, we first generated a linear model, and then converted it to an XOR problem by combining it with the corresponding orthogonal linear model. Each label depends on the same two features, but the parameters were generated independently for each label with τ=1. For simplicity, we did not use any kind of error.
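One possible reading of this construction is sketched below: a label is relevant if the instance lies on the positive side of exactly one of two hyperplanes, namely a random linear model and the linear model orthogonal to it. The sketch assumes two features, no bias term, Gaussian weights, and uniformly distributed features; it does not model the parameter τ. All of these choices are assumptions made for illustration.

```python
import random


def xor_label(w, x):
    """Label is 1 iff exactly one of the two orthogonal linear models is positive."""
    w_orth = (-w[1], w[0])  # vector orthogonal to w in two dimensions
    s1 = w[0] * x[0] + w[1] * x[1] > 0
    s2 = w_orth[0] * x[0] + w_orth[1] * x[1] > 0
    return int(s1 != s2)


def generate_xor_data(n, m, seed=0):
    """Generate n instances with two features and m noise-free XOR labels."""
    rng = random.Random(seed)
    # Independent parameters for each label (assumed Gaussian).
    weights = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(m)]
    X = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    Y = [tuple(xor_label(w, x) for w in weights) for x in X]
    return X, Y, weights


X, Y, weights = generate_xor_data(n=100, m=3)
```

No single linear classifier on the two original features can separate such a label, which is why a linear base learner in BR is at a disadvantage in this setting.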
7.7 Benchmark data
Basic statistics for the datasets, including training and test set sizes, number of features and labels, and minimal, average, and maximal number of relevant labels

| Dataset | # train inst. | # test inst. | # attr. | # lab. | min | ave. | max |
|---|---|---|---|---|---|---|---|
| scene | 1211 | 1196 | 294 | 6 | 1 | 1.062 | 3 |
| yeast | 1500 | 917 | 103 | 14 | 1 | 4.228 | 11 |
| medical | 333 | 645 | 1449 | 45 | 1 | 1.255 | 3 |
| emotions | 391 | 202 | 72 | 6 | 1 | 1.813 | 3 |
Scene is a semantic scene classification dataset proposed by Boutell et al. (2004), in which a picture can be categorized into one or more classes. In this dataset, pictures can have the following classes: beach, sunset, foliage, field, mountain, and urban. Features of this dataset correspond to spatial color moments in the LUV space. Color as well as spatial information have been shown to be fairly effective in distinguishing between certain types of outdoor scenes: bright and warm colors at the top of a picture may correspond to a sunset, while those at the bottom may correspond to a desert rock.
From the biological field, we have chosen the yeast dataset (Elisseeff and Weston 2002), which is about predicting the functional classes of genes in the yeast Saccharomyces cerevisiae. Each gene is described by the concatenation of microarray expression data and a phylogenetic profile, and associated with a set of 14 functional classes. The dataset contains 2417 genes in total, and each gene is represented by a 103-dimensional feature vector.
The medical dataset (Pestian et al. 2007) has been used in the Computational Medicine Center's 2007 Medical Natural Language Processing Challenge.^{10} It is a medical-text dataset that includes a brief free-text summary of patient symptom history and their prognosis, labeled with insurance codes. Each instance is represented with a bag-of-words of the symptom history and is associated with a subset of 45 labels (i.e., possible prognoses).
The emotions data was created from a selection of songs from 233 musical albums (Trohidis et al. 2008). From each song, a sequence of 30 seconds after the initial 30 seconds was extracted. The resulting sound clips were stored and converted into wave files with a 22050 Hz sampling rate, 16 bits per sample, and mono. From each wave file, 72 features have been extracted, falling into two categories: rhythmic and timbre. Then, in the emotion labeling process, 6 main emotional clusters are retained corresponding to the Tellegen-Watson-Clark model of mood: amazed-surprised, happy-pleased, relaxing-calm, quiet-still, sad-lonely, and angry-aggressive.
Hamming loss, standard error and the rank of the algorithms on benchmark datasets

| Method | scene | yeast | medical | emotions |
|---|---|---|---|---|
| BR | 0.1013 ± 0.0033 (7) | 0.1981 ± 0.0046 (1) | 0.0150 ± 0.0006 (6) | 0.2071 ± 0.0109 (3) |
| SBR | 0.0980 ± 0.0036 (6) | 0.2002 ± 0.0045 (3) | 0.0145 ± 0.0007 (4) | 0.1955 ± 0.0110 (1) |
| BR Rules | 0.0909 ± 0.0032 (4) | 0.1995 ± 0.0046 (2) | 0.0123 ± 0.0007 (2) | 0.2203 ± 0.0116 (5) |
| SBR Rules | 0.0864 ± 0.0035 (1) | 0.2080 ± 0.0050 (4) | 0.0124 ± 0.0007 (3) | 0.1980 ± 0.0112 (2) |
| CC | 0.1342 ± 0.0046 (8) | 0.2137 ± 0.0052 (8) | 0.0146 ± 0.0007 (5) | 0.2244 ± 0.0122 (6) |
| CC Rules | 0.0884 ± 0.0036 (3) | 0.2121 ± 0.0050 (6) | 0.0119 ± 0.0007 (1) | 0.2327 ± 0.0131 (8) |
| LP | 0.0942 ± 0.0042 (5) | 0.2096 ± 0.0056 (5) | 0.0174 ± 0.0009 (7) | 0.2129 ± 0.0144 (4) |
| LP Rules | 0.0874 ± 0.0041 (2) | 0.2123 ± 0.0056 (7) | 0.0181 ± 0.0009 (8) | 0.2294 ± 0.0154 (7) |
Subset 0/1 loss, standard error and the rank of the algorithms on benchmark datasets

| Method | scene | yeast | medical | emotions |
|---|---|---|---|---|
| BR | 0.4891 ± 0.0145 (8) | 0.8408 ± 0.0121 (6) | 0.5380 ± 0.0196 (8) | 0.7772 ± 0.0293 (8) |
| SBR | 0.4381 ± 0.0144 (5) | 0.8550 ± 0.0116 (8) | 0.5116 ± 0.0197 (7) | 0.7574 ± 0.0302 (5) |
| BR Rules | 0.4507 ± 0.0144 (6) | 0.8408 ± 0.0121 (6) | 0.4093 ± 0.0194 (2) | 0.7624 ± 0.0300 (6) |
| SBR Rules | 0.3880 ± 0.0141 (4) | 0.8277 ± 0.0125 (5) | 0.4248 ± 0.0195 (3) | 0.7426 ± 0.0308 (3) |
| CC | 0.4582 ± 0.0144 (7) | 0.7895 ± 0.0135 (3) | 0.5008 ± 0.0197 (6) | 0.7624 ± 0.0300 (6) |
| CC Rules | 0.3855 ± 0.0141 (3) | 0.8092 ± 0.0130 (4) | 0.3767 ± 0.0191 (1) | 0.7475 ± 0.0306 (4) |
| LP | 0.3152 ± 0.0134 (2) | 0.7514 ± 0.0143 (1) | 0.4434 ± 0.0196 (4) | 0.6535 ± 0.0335 (1) |
| LP Rules | 0.2943 ± 0.0132 (1) | 0.7557 ± 0.0142 (2) | 0.4527 ± 0.0196 (5) | 0.6634 ± 0.0333 (2) |
In general, the results confirm our theoretical claims. In the case of the yeast and emotions datasets, we can observe a kind of Pareto front of the classifiers. This suggests a strong conditional dependence between labels, resulting in different risk minimizers for Hamming and subset 0/1 loss. In the case of the scene and medical datasets, it seems that both risk minimizers coincide. The best algorithms perform equally well for both losses.
Moreover, one can also observe for the scene dataset that LP with a linear base classifier outperforms linear BR in terms of Hamming loss, but the use of a nonlinear classifier in BR improves the results again over LP. As pointed out above, comparing LP and BR with the same base learner is questionable and may lead to unwarranted conclusions. Similar to the synthetic XOR experiment, performance gains of LP and CC might be primarily due to a hypothesis space extension, especially because the methods with nonlinear base learners perform well in general.
8 Conclusions
In this paper, we have addressed a number of issues around one of the core topics in current MLC research, namely the idea of improving predictive performance by exploiting label dependence. In our opinion, this topic has not received enough attention so far, despite the increasing interest in MLC in general. Indeed, as we have argued in this paper, empirical studies of MLC methods are often meaningless or even misleading without a careful interpretation, which in turn requires a thorough understanding of underlying theoretical conceptions.
In particular, by looking at the current literature, we noticed that papers proposing new methods for MLC rarely give a precise definition of the type of dependence they have in mind, despite stating the exploitation of label dependence as an explicit goal. Besides, the type of loss function to be minimized, i.e., the concrete goal of the classifier, is often not mentioned either. Instead, a new method is shown to be better than existing ones “on average”, evaluating on a number of different loss functions.
Based on a distinction between two types of label dependence that seem to be important in MLC, namely marginal and conditional dependence, we have established a close connection between the type of dependence present in the data and the type of loss function to be minimized. In this regard, we have also distinguished three classes of problem tasks in MLC, namely the minimization of single-label loss functions, multilabel loss functions, and the estimation of the joint conditional distribution.

The type of loss function has a strong influence on whether or not, and perhaps to what extent, an exploitation of label dependencies can be expected to yield a true benefit.

Marginal label dependence can help in boosting the performance for single-label and multilabel loss functions that have marginal conditional distributions as risk minimizers, while conditional dependence plays a role for loss functions having a more complex risk minimizer, such as the subset 0/1 loss, which requires estimating the mode of the joint conditional distribution.

Loss functions in MLC are quite diverse, and minimizing different losses will normally require different estimators. Using the Hamming and subset 0/1 loss as concrete examples, we have shown that a minimization of the former may cause a high regret for the latter and vice versa.
We believe that these results have a number of important implications, not only from a theoretical but also from a methodological and practical point of view. Perhaps most importantly, one cannot expect the same MLC method to be optimal for different types of losses at the same time, and each new approach shown to outperform others across a wide and diverse spectrum of different loss functions should be considered with reservation. Besides, more efforts should be made in explaining the improvements that are achieved by an algorithm, laying bare its underlying mechanisms, the type of label dependence it assumes, and the way in which this dependence is exploited. Since experimental studies often contain a number of side effects, relying on empirical results alone, without a careful analysis and reasonable explanation, appears to be disputable.
Please note that we use the term “marginal distribution” with two different meanings, namely for P(Y) (marginalization over the joint distribution P(X, Y)) and for P(Y_i | x) (marginalization over P(Y | x)).
Note that the denominator is 0 if y_i = h_i(x) = 0 for all i = 1, …, m. In this case, the loss is 0 by definition. The same remark applies to the Jaccard distance.
Acknowledgements
Krzysztof Dembczyński started this work during his postdoctoral stay at Marburg University, supported by the German Research Foundation (DFG), and finalized it at Poznań University of Technology under the grant 91515/DS of the Polish Ministry of Science and Higher Education. Willem Waegeman is supported as a postdoc by the Research Foundation of Flanders (FWO-Vlaanderen). Part of this work was done during his visit at Marburg University. Weiwei Cheng and Eyke Hüllermeier are supported by DFG. The authors are grateful to the anonymous reviewers for their valuable comments and suggestions.
Open Access
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.