A flexible class of dependence-aware multi-label loss functions

The idea of exploiting label dependencies for better prediction is at the core of methods for multi-label classification (MLC), and performance improvements are normally explained in this way. Surprisingly, however, there is no established methodology for analyzing the dependence-awareness of MLC algorithms. With that goal in mind, we introduce a class of loss functions that are able to capture the important aspect of label dependence. To this end, we leverage the mathematical framework of non-additive measures and integrals. Roughly speaking, a non-additive measure allows for modeling the importance of correct predictions of label subsets (instead of single labels), and thereby their impact on the overall evaluation, in a flexible way. The well-known Hamming and subset 0/1 losses are rather extreme special cases of this function class, which assign full importance to single labels or to the entire label set, respectively. We present concrete instantiations of this class, which appear to be especially appealing from a modeling perspective. The assessment of multi-label classifiers in terms of these losses is illustrated in an empirical study, clearly showing their aptness at capturing label dependencies. Finally, while not being the main goal of this study, we also show some preliminary results on the minimization of this parametrized family of losses.


Introduction
The setting of multi-label classification (MLC), which generalizes standard multi-class classification by relaxing the assumption of mutual exclusiveness of classes, has received a lot of attention in the recent machine learning literature; we refer to Tsoumakas et al. [2010] and Zhang and Zhou [2014] for survey articles on this topic. The motivation for MLC originated in the field of text categorization [Hayes and Weinstein, 1990, Lewis, 1992, Apté et al., 1994], but nowadays multi-label methods are used in applications as diverse as music categorization [Trohidis et al., 2011], semantic scene classification [Boutell et al., 2004], and protein function classification [Diplaris et al., 2005].
Formally, the task of a multi-label classifier is to assign a subset of a given set of candidate labels to any query instance. A straightforward approach for learning such a predictor is via a reduction to binary classification, i.e., by training one binary classifier per label and combining the predictions of these classifiers into an overall multi-label prediction. This approach, known as binary relevance (BR) learning, is often criticized for ignoring possible label dependencies, because each label is predicted independently of all other labels. Indeed, the idea of exploiting statistical dependencies between the labels in order to improve predictive performance on the level of the entire label set is a major theme in research on multi-label classification, and many MLC methods proposed in the literature are motivated by this idea.
Of course, the usefulness of such methods very much depends on whether or not there is a need to capture label dependence. This is not always the case, for example if such dependencies are indeed not present in the data. Besides, it turns out that the underlying loss function used to evaluate multi-label predictions plays an important role [Dembczynski et al., 2012]. Since a subset of predicted labels can be compared with a ground-truth subset in many ways, various loss functions have been proposed in the literature. Two simple but commonly used examples are the Hamming and the subset 0/1 loss, which are both generalizations of the 0/1 loss in conventional single-label classification. While the former assesses the quality of predictions as the percentage of incorrectly predicted labels, the latter measures the fraction of label subsets that are not predicted correctly in their entirety, i.e., for which at least one label is predicted incorrectly. As will be explained in more detail later on, capturing label dependencies is crucial for performing well in terms of subset 0/1 loss, but, at least in theory, not necessary in the case of Hamming.
Despite their widespread use and interesting theoretical properties, both Hamming and subset 0/1 can be criticized for various reasons, especially for being rather extreme. The Hamming loss is often close to 0, simply because the label cardinality (average percentage of relevant labels per example) is very small in typical MLC data sets. Thus, even the default classifier that predicts all labels as irrelevant will usually perform well according to Hamming loss, and is indeed often difficult to beat. Even if an improvement is possible, the performance differences are typically small, and therefore difficult to test for statistical significance. On the other hand, the subset 0/1 loss is normally quite high and may appear overly stringent, especially in the case of many labels. Moreover, since a mistake on a single label is penalized as severely as a mistake on all labels, it does not discriminate well between "almost correct" and completely wrong predictions.
To overcome disadvantages of commonly used losses such as Hamming and subset 0/1, we introduce a new class of loss functions for multi-label classification (Section 3). To this end, we leverage the mathematical framework of non-additive measures and integrals. Roughly speaking, the overall loss is obtained by integrating over the errors on individual labels. This integration is done with respect to a non-additive measure, which allows for controlling the "dependence-awareness" of the loss, i.e., for modeling the importance of correct predictions of label subsets. We present concrete instantiations of this type of loss functions, which allow for controlling dependence-awareness by means of a single parameter. The assessment of the dependence-awareness of multi-label classifiers in terms of these losses is illustrated in an empirical study (Section 4).

Multi-label Classification
Let X denote an instance space, and let L = {λ_1, ..., λ_K} be a finite set of class labels. We assume that an instance x ∈ X is (probabilistically) associated with a subset of labels Λ = Λ(x) ∈ 2^L; this subset is often called the set of relevant labels, while the complement L \ Λ is considered as irrelevant for x. We identify a set Λ of relevant labels with a binary vector y = (y_1, ..., y_K), where y_k = 1 if λ_k ∈ Λ and y_k = 0 otherwise. By Y = {0, 1}^K we denote the set of possible labelings.
We assume observations to be realizations of random variables generated independently and identically (i.i.d.) according to a probability measure P on X × Y (with density/mass function p), i.e., an observation y = (y_1, ..., y_K) is the realization of a corresponding random vector Y = (Y_1, ..., Y_K). We denote by p(Y | x) the conditional distribution of Y given X = x, and by p_i(Y_i | x) the corresponding marginal distribution of the i-th label Y_i:

p_i(b | x) = Σ_{y ∈ Y: y_i = b} p(y | x) , b ∈ {0, 1}. (1)

Given training data in the form of a finite set of observations drawn independently from P(X, Y), the goal in MLC is to learn a predictive model that generalizes well beyond these observations, i.e., which yields predictions that minimize the expected risk with respect to a specific loss function. In this regard, we need to clarify what type of predictions are sought and how these predictions are assessed.

Predictive Models in MLC
A multi-label classifier h is a mapping X −→ Y that assigns a (predicted) label subset to each instance x ∈ X. Thus, the output of a classifier h is a vector

h(x) = (h_1(x), ..., h_K(x)) ∈ {0, 1}^K.

Predictions of this kind will also be denoted ŷ = (ŷ_1, ..., ŷ_K).
Sometimes, MLC is treated as a ranking (instead of a subset selection) problem, in which the labels are sorted according to their degree or probability of relevance. Then, the prediction takes the form of a scoring function s: X −→ R^K, with s(x) = (s_1(x), ..., s_K(x)). A prediction of that kind encodes a ranking π: {1, ..., K} −→ {1, ..., K} such that π(i) is the position of label λ_i. This ranking is obtained by sorting the labels λ_i in decreasing order according to their scores s_i(x).

MLC Loss Functions
In the literature, various MLC loss functions have been proposed. Commonly used are the Hamming loss ℓ_H and the subset 0/1 loss ℓ_S, which both generalize the standard 0/1 loss for multi-class classification, albeit in very different ways:

ℓ_H(y, ŷ) = (1/K) Σ_{k=1}^{K} ⟦y_k ≠ ŷ_k⟧ , (5)
ℓ_S(y, ŷ) = ⟦y ≠ ŷ⟧ . (6)

Besides, other performance metrics are often reported in experimental studies. For example, the (instance-wise) F-measure is defined in terms of the harmonic mean of precision and recall, and can be written as follows:

F(y, ŷ) = 2 Σ_{k=1}^{K} y_k ŷ_k / (Σ_{k=1}^{K} y_k + Σ_{k=1}^{K} ŷ_k) . (7)

The F-measure takes values in the unit interval and can be turned into a loss function by setting ℓ_F(y, ŷ) = 1 − F(y, ŷ).
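As a reading aid, these three measures can be sketched in a few lines of Python; the function names are ours, not from any particular MLC library:

```python
import numpy as np

def hamming_loss(y, y_hat):
    """Fraction of incorrectly predicted labels."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    return float(np.mean(y != y_hat))

def subset_zero_one_loss(y, y_hat):
    """1 if the prediction differs from the ground truth on any label, else 0."""
    return float(not np.array_equal(y, y_hat))

def f_measure(y, y_hat):
    """Instance-wise F-measure, i.e., the harmonic mean of precision and recall."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    denom = y.sum() + y_hat.sum()
    return 2.0 * float(np.sum(y * y_hat)) / denom if denom > 0 else 1.0

y, y_hat = [0, 1, 1, 0], [0, 1, 0, 0]
print(hamming_loss(y, y_hat))          # 0.25: one of four labels is wrong
print(subset_zero_one_loss(y, y_hat))  # 1.0: not an exact match
print(f_measure(y, y_hat))             # 0.666...: 2*1 / (2 + 1)
```

The example illustrates the different granularity: a single wrong label already incurs the full subset 0/1 loss, while Hamming and the F-measure degrade gradually.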

Label Dependence
The goal of classification algorithms in general is to capture dependencies between input features X_i and the target variable Y. In fact, the prediction of a scoring classifier is often regarded as an approximation of the conditional probability p(Y = ŷ | x), i.e., the probability that ŷ is the true label for the given instance x. In MLC, dependencies may not only exist between the features X_i and each target, but also between the targets Y_1, ..., Y_K themselves. The idea to improve predictive accuracy by capturing such dependencies is a driving force in research on multi-label classification.
In this regard, a distinction between unconditional and conditional independence of labels can be made [Dembczynski et al., 2012]. In the first case, the joint distribution p(Y) in the label space factorizes into the product of the marginals, i.e., p(Y) = p_1(Y_1) × ... × p_K(Y_K), whereas in the latter case, the factorization holds conditioned on x, for every instance x. In other words, unconditional dependence is a kind of global dependence (for example originating from a hierarchical structure on the labels), whereas conditional dependence is a dependence locally restricted to a single point in the instance space.
It turns out that there is a close connection between label dependence and the decomposability of loss functions: A decomposable loss can be expressed in the form

ℓ(y, ŷ) = Σ_{k=1}^{K} ℓ_k(y_k, ŷ_k)

with suitable binary loss functions ℓ_k: {0, 1}^2 −→ R, whereas a non-decomposable loss does not permit such a representation. It can be shown that, to produce optimal predictions ŷ = h(x) minimizing expected loss, knowledge about the marginals p_k(Y_k | x) is enough in the case of a decomposable loss (such as Hamming), but not in the case of a non-decomposable loss [Dembczynski et al., 2012]. Instead, if a loss is non-decomposable, higher-order probabilities are needed, and in the extreme case even the entire distribution p(Y | x) (as in the case of the subset 0/1 loss). On an algorithmic level, this means that MLC with a decomposable loss can be tackled by binary relevance learning (i.e., learning one binary classifier for each label individually), whereas non-decomposable losses call for more sophisticated learning methods that are able to take label dependencies into account.

MLC Loss Functions based on Non-Additive Measures
The Hamming and the subset 0/1 loss are often considered as prototypical examples of losses which, respectively, do and do not impel the learner to take label dependencies into account: Hamming is label-wise decomposable and can in principle be optimized by learning algorithms like BR. The subset 0/1 loss, on the other hand, is not label-wise decomposable. Therefore, this loss is often used to quantify the learner's ability to capture label dependencies. For example, consider a (conditional) ground-truth distribution p(· | x) on the label space Y = {0, 1}^3 in which, for each of the three labels, the individual probability of relevance is higher than the probability of irrelevance, while the single most probable labeling is (0, 0, 0). Then ŷ = (1, 1, 1) is the Bayes-optimal prediction (minimizing the loss in expectation) in the case of Hamming, whereas for the subset 0/1 loss, the Bayes-optimal prediction is ŷ = (0, 0, 0). In general, the Bayes-optimal prediction is given by the marginal mode of the distribution p(· | x) in the case of Hamming and by the joint mode in the case of subset 0/1.
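The effect can be reproduced with a small brute-force computation. The distribution below is hypothetical (it is not the paper's concrete example), chosen only so that every marginal exceeds 0.5 while the joint mode is (0, 0, 0):

```python
import itertools
import numpy as np

# Hypothetical conditional distribution p(y | x) on {0,1}^3: every marginal
# P(y_i = 1 | x) equals 0.56 > 0.5, yet the joint mode is (0, 0, 0).
p = {(0, 0, 0): 0.28, (1, 1, 1): 0.24,
     (1, 1, 0): 0.16, (1, 0, 1): 0.16, (0, 1, 1): 0.16}

def risk(y_hat, loss):
    """Expected loss of predicting y_hat under the distribution p."""
    return sum(q * loss(np.array(y), np.array(y_hat)) for y, q in p.items())

hamming = lambda y, y_hat: float(np.mean(y != y_hat))
subset01 = lambda y, y_hat: float(not np.array_equal(y, y_hat))

candidates = list(itertools.product([0, 1], repeat=3))
best_h = min(candidates, key=lambda y_hat: risk(y_hat, hamming))
best_s = min(candidates, key=lambda y_hat: risk(y_hat, subset01))
print(best_h)  # (1, 1, 1): the marginal mode
print(best_s)  # (0, 0, 0): the joint mode
```

Enumerating all eight candidate predictions confirms that the two losses pull the learner toward different optima.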
As already said, both Hamming and subset 0/1 can be criticized for being rather extreme. Due to the reasons already explained in the introduction (imbalance between relevant and irrelevant labels), the Hamming loss is often very low.
As opposed to this, the subset 0/1 loss is normally quite high, since an entirely correct prediction becomes very unlikely with increasing K. It is an "all or nothing" measure, for which a mistake on a single label is as bad as a mistake on many labels, and which does not reward correct predictions on larger subsets of the labels.
To overcome these disadvantages, we introduce a new class of loss functions for multi-label classification in Section 3.3. These loss functions are able to assess a learner's dependence-awareness, i.e., its aptness at capturing label dependencies, in a more nuanced manner. To this end, we leverage the mathematical framework of non-additive measures and integrals, the essentials of which are recalled in Sections 3.1 and 3.2. Roughly speaking, a non-additive measure is used for modeling the importance of correct predictions of label subsets (instead of single labels), and thereby their impact on the overall evaluation. As will be seen, Hamming and subset 0/1 will be recovered as special cases of our family, which, in a sense, allows for "interpolating" between these two extremes.
For didactic reasons, let us anticipate the basic construction principle of our family of loss functions, which will be introduced step by step alongside a couple of other (auxiliary) functions. More specifically, considering the correctness of predictions on individual labels λ_i as evaluation criteria c_i, a loss ℓ_µ(y, s) will be defined as a suitably weighted aggregation of the correctness degrees

f(c_i) = u_i = 1 − |y_i − s_i| , (8)

where s_i ∈ [0, 1] is the score predicted for label λ_i and y_i ∈ {0, 1} the corresponding ground truth. Allowing for predictions in terms of a score vector s = (s_1, ..., s_K) ∈ [0, 1]^K is more general than a binary prediction ŷ = (ŷ_1, ..., ŷ_K) ∈ {0, 1}^K, but obviously comprises the latter as a special case. The loss will then be specified in terms of an integral of the "correctness function" f given by (8), i.e., as an aggregated (in-)correctness

ℓ_µ(y, s) = 1 − C_µ(f) . (9)

To this end, two main ingredients are needed, namely the measure µ for weighting and the integral for aggregation:
• A measure µ assigns a weight µ(A) to every subset A, in our case to a subset of labels, which can be interpreted as the importance of that subset. Formally, a measure µ is a mapping from subsets to the unit interval, which can be equivalently represented by its Möbius transform m_µ.
• The aggregation in (9) is accomplished with the so-called (discrete) Choquet integral C_µ, which is a weighted aggregation of the values of a function (in our case f) with respect to the underlying measure µ.
In the following, we discuss the components on the right-hand side of (9) in more detail.

Non-Additive Measures
Let C = {c_1, ..., c_K} be a finite set of (desirable) "criteria" and µ: 2^C −→ [0, 1] a measure on this set. For each A ⊆ C, we interpret µ(A) as the weight or, say, the importance of the subset of criteria A. In the context of MLC, we can think of the criterion c_i as the correctness of the prediction on the i-th label λ_i. Thus, µ({c_1}) is the importance of predicting the first label correctly, and µ({c_1, c_2}) is the importance of jointly predicting the first and the second label correctly.
A standard assumption on a measure µ, which is at the core of probability theory, is additivity: µ(A ∪ B) = µ(A) + µ(B) for all disjoint A, B ⊆ C. Unfortunately, additive measures cannot model any kind of "interaction": Extending a set of elements A by a set of elements B always increases the weight µ(A) by the weight µ(B), regardless of A and B. For example, we cannot express that predicting λ_1 and λ_2 correctly, i.e., both together, has a higher value than the sum of getting both of them individually right.
Non-additive measures, also called capacities or fuzzy measures, are simply normalized and monotone, but not necessarily additive [Sugeno, 1974]:

µ(∅) = 0, µ(C) = 1, and µ(A) ≤ µ(B) for all A ⊆ B ⊆ C. (10)

Thus, a set of criteria B is always at least as important as any of its subsets.
A useful representation of non-additive measures is in terms of the Möbius transform:

µ(B) = Σ_{A ⊆ B} m_µ(A) (11)

for all B ⊆ C, where the Möbius transform m_µ of the measure µ is defined as follows:

m_µ(A) = Σ_{B ⊆ A} (−1)^{|A| − |B|} µ(B) . (12)

The value m_µ(A) can be interpreted as the weight that is exclusively allocated to A, instead of being indirectly connected with A through the interaction with other subsets.
A measure µ is said to be k-order additive, or simply k-additive, if k is the smallest integer such that m_µ(A) = 0 for all A ⊆ C with |A| > k. This property is interesting for several reasons. First, as can be seen from (11), it means that a measure µ can formally be specified by significantly fewer than the 2^K values needed in the general case. Second, k-additivity is also interesting from a semantic point of view, as it means that there are no interaction effects between subsets A, B ⊆ C whose cardinality exceeds k.
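As a small illustration, the Möbius transform can be computed directly from its definition; the measure below is a hypothetical one on two criteria, chosen to exhibit a positive interaction:

```python
from itertools import chain, combinations

def subsets(s):
    """All subsets of s, as tuples."""
    s = list(s)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

def moebius_transform(mu, criteria):
    """m_mu(A) = sum over B subset of A of (-1)^(|A|-|B|) * mu(B)."""
    return {frozenset(A): sum((-1) ** (len(A) - len(B)) * mu[frozenset(B)]
                              for B in subsets(A))
            for A in subsets(criteria)}

# Hypothetical non-additive measure on C = {1, 2}: getting both labels right
# is worth more than the sum of the individual rewards.
mu = {frozenset(): 0.0, frozenset({1}): 0.2, frozenset({2}): 0.2,
      frozenset({1, 2}): 1.0}
m = moebius_transform(mu, [1, 2])
print(round(m[frozenset({1, 2})], 10))  # 0.6: a positive interaction effect
# Sanity check: mu(C) is recovered by summing m over all subsets of C.
print(round(sum(m[frozenset(B)] for B in subsets([1, 2])), 10))  # 1.0
```

Here the mass 0.6 on the pair {1, 2} is exactly the "bonus" for joint correctness that no additive measure could express.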

The Choquet Integral
Suppose that f: C −→ R_+ is a non-negative function that assigns a value to each criterion c_i. In the case of MLC, we can think of f(c_i) as the correctness of a prediction on the label λ_i. An important question, then, is how to aggregate the evaluations of individual criteria, i.e., the values f(c_i), into an overall evaluation, in which the criteria are properly weighted according to the measure µ. Mathematically, this overall evaluation can be considered as an integral C_µ(f) of the function f with respect to the measure µ.
Indeed, if µ is an additive measure, the standard integral just corresponds to the weighted mean

C_µ(f) = Σ_{i=1}^{K} µ({c_i}) · f(c_i) , (13)

which is a natural aggregation operator in this case. For example, in the context of MLC, the Hamming loss is a special case of (13), with f(c_i) ∈ {0, 1} depending on whether the prediction on λ_i is right or wrong, and uniform weights µ({c_i}) = 1/K.

A non-trivial question, however, is how to generalize (13) in the case where µ is non-additive. This question, namely how to define the integral of a function with respect to a non-additive measure (not necessarily restricted to the discrete case), is answered in a satisfactory way by the Choquet integral [Choquet, 1954]. The point of departure of the Choquet integral is an alternative representation of the "area" under the function f, which, in the additive case, is a natural interpretation of the integral. Roughly speaking, this representation decomposes the area in a "horizontal" instead of a "vertical" manner, thereby making it amenable to a straightforward extension to the non-additive case. More specifically, note that the weighted mean can be expressed as follows:

Σ_{i=1}^{K} µ({c_i}) · f(c_i) = Σ_{i=1}^{K} (f(c_(i)) − f(c_(i−1))) · µ(A_(i)) ,

where (·) is a permutation sorting the criteria in increasing order of their values, i.e., f(c_(1)) ≤ ... ≤ f(c_(K)), with f(c_(0)) = 0 by definition and A_(i) = {c_(i), ..., c_(K)}; see Fig. 1 for an illustration.

Figure 1: Vertical (left) versus horizontal (right) integration. In the first case, the height of a single bar, f(c_i), is multiplied with its "width" (the weight µ({c_i})), and these products are added. In the second case, the height of each horizontal section, f(c_(i)) − f(c_(i−1)), is multiplied with the corresponding "width" µ(A_(i)).

Now, the key difference between the left- and right-hand side of the above expression is that, whereas the measure µ is only evaluated on single elements c_i on the left, it is evaluated on subsets of elements on the right. Thus, the right-hand side suggests an immediate extension to the case of non-additive measures, namely the Choquet integral, which, in the discrete case, is formally defined as follows:

C_µ(f) = Σ_{i=1}^{K} (f(c_(i)) − f(c_(i−1))) · µ(A_(i)) .

A simple derivation shows that, in terms of the Möbius transform of µ, the Choquet integral can also be expressed as

C_µ(f) = Σ_{T ⊆ C} m_µ(T) · min_{c_i ∈ T} f(c_i) . (14)
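A direct implementation of the discrete Choquet integral, as a sketch (criteria are indexed 0, ..., K−1, and the measure is represented by a Python function on frozensets of indices):

```python
import numpy as np

def choquet(f, mu):
    """Discrete Choquet integral of f = (f_1, ..., f_K) w.r.t. a set function mu,
    where mu maps a frozenset of indices {0, ..., K-1} to [0, 1]."""
    f = np.asarray(f, dtype=float)
    order = np.argsort(f)                        # criteria sorted by increasing value
    total, prev = 0.0, 0.0
    for rank, i in enumerate(order):
        A = frozenset(order[rank:].tolist())     # A_(i) = {c_(i), ..., c_(K)}
        total += (f[i] - prev) * mu(A)
        prev = f[i]
    return float(total)

K = 4
f = [0.9, 0.1, 0.6, 0.6]

# For the additive measure mu(A) = |A|/K, the integral is the plain mean ...
print(round(choquet(f, lambda A: len(A) / K), 10))        # 0.55
# ... and for mu(C) = 1, mu(A) = 0 otherwise, it is the minimum.
print(round(choquet(f, lambda A: float(len(A) == K)), 10))  # 0.1
```

The two sanity checks mirror the two extreme measures discussed in the text: additivity recovers the weighted mean, and the measure concentrated on the full set recovers the minimum.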

MLC Loss Functions based on Non-Additive Measures
In the context of MLC, non-additive measures and generalized integrals can be used to define flexible loss functions: Each criterion c_i corresponds to the (correct) prediction on a label λ_i, and µ(A) quantifies the importance of being correct on the subset of labels A as a whole. Moreover, the function to be integrated is the correctness function (8). Thus,

u_i = f(c_i) = 1 − |y_i − s_i|

is the degree of correctness on the label λ_i, where s_i ∈ [0, 1] is the score predicted for label λ_i and y_i ∈ {0, 1} the corresponding ground truth: u_i = 1 for a perfectly correct prediction and u_i = 0 for a completely wrong prediction. Now, given the u_i as values on the criteria c_i (the higher the better), the idea is to aggregate these values with the Choquet integral (based on the measure µ) into an overall degree of correctness, and to define a loss as the complement of this degree of correctness. Formally, this leads to

ℓ_µ(y, s) = 1 − C_µ(f) = 1 − Σ_{i=1}^{K} (u_(i) − u_(i−1)) · µ(A_(i)) , (15)

where the permutation (·) sorts the correctness degrees in increasing order, u_(1) ≤ ... ≤ u_(K), with u_(0) = 0 and A_(i) = {c_(i), ..., c_(K)}.

Special cases. Important special cases include the additive measure µ(A) = |A|/K, for which we obtain

ℓ_µ(y, s) = (1/K) Σ_{i=1}^{K} |y_i − s_i| ,

i.e., the Hamming loss (or, strictly speaking, a generalization of the Hamming loss in the case of real-valued scores s_i ∈ [0, 1]), and the measure µ defined by µ(C) = 1 and µ(A) = 0 for A ⊊ C, for which we obtain

ℓ_µ(y, s) = 1 − min_{i=1,...,K} u_i = max_{i=1,...,K} |y_i − s_i| ,

i.e., the subset 0/1 loss (or again a generalization).
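A minimal sketch of this loss (with a small Choquet routine inlined; function names are ours), checking on an example that the two measures just mentioned indeed recover generalized Hamming and subset 0/1:

```python
import numpy as np

def choquet(f, mu):
    """Discrete Choquet integral of f w.r.t. a set function mu on index sets."""
    f = np.asarray(f, dtype=float)
    order = np.argsort(f)                  # criteria sorted by increasing value
    total, prev = 0.0, 0.0
    for rank, i in enumerate(order):
        total += (f[i] - prev) * mu(frozenset(order[rank:].tolist()))
        prev = f[i]
    return float(total)

def mlc_loss(y, s, mu):
    """Loss (15): one minus the Choquet-aggregated correctness degrees u_i."""
    u = 1.0 - np.abs(np.asarray(y, dtype=float) - np.asarray(s, dtype=float))
    return 1.0 - choquet(u, mu)

y = [0, 1, 1, 0]
s = [0.2, 0.3, 0.9, 0.1]
K = len(y)

# The additive measure |A|/K recovers the (generalized) Hamming loss ...
print(round(mlc_loss(y, s, lambda A: len(A) / K), 6))          # 0.275
# ... and mu(C) = 1, mu(A) = 0 otherwise recovers (generalized) subset 0/1.
print(round(mlc_loss(y, s, lambda A: float(len(A) == K)), 6))  # 0.7
```

The individual errors here are 0.2, 0.7, 0.1, 0.1, so the first value is their mean and the second their maximum, as claimed.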
Another interesting special case is the covering error introduced by Amit et al. [2007]. The latter is defined as the sum of subset 0/1 losses on a family of predefined label subsets, called a covering. The connection to this loss can nicely be seen from the representation (14) of the Choquet integral in terms of the Möbius transform. Here, the min-terms correspond to subset 0/1 losses on subsets T. In contrast to the covering error, where these losses are weighted equally, they are weighted by the values of the Möbius function in our case.
Counting measures. The two measures above are examples of so-called counting measures, which only depend on the cardinality of A. In other words, µ is a counting measure if it can be expressed as µ(A) = v(|A|/K) for a suitable function v: [0, 1] −→ [0, 1], which means that the measure of a set depends only on its cardinality, not on its elements. For example, µ({c_1, c_2}) = v(2/K) = µ({c_3, c_4}). This kind of symmetry property is certainly meaningful in MLC, where the different labels are normally considered equally important; stated differently, the performance metric is normally invariant under permutation of the labels. Here, v(k/K) can be interpreted as the importance of a correct prediction on a subset of k labels, which means that the loss function (15) is completely specified by the values 0 = v(0), v(1/K), ..., v(1) = 1.
Formally, for an increasing function v: [0, 1] −→ [0, 1] such that v(0) = 0 and v(1) = 1, we obtain an OWA (ordered weighted averaging) [Yager and Filev, 1999, Yager and Kacprzyk, 2012] aggregation of the degrees of correctness u_i, namely

ℓ_v(y, s) = Σ_{i=1}^{K} w_i · e_(i) , (16)

with weights w_i = v(i/K) − v((i−1)/K) applied to the errors e_i = 1 − u_i = |y_i − s_i|, sorted in increasing order e_(1) ≤ ... ≤ e_(K). In other words, we obtain an OWA loss function with w_1 + ... + w_K = 1. Again, Hamming is obtained for the special case v: x → x and subset 0/1 for v such that v(x) = 1 for x = 1 and v(x) = 0 otherwise. Let us highlight that, in spite of a somewhat involved derivation (based on non-additive measures and integrals) and the flexibility of our class of loss functions in general, the form (16) we end up with in the case of counting measures is both intuitively appealing and easy to compute. In principle, it is nothing but a weighted average of the errors on individual labels, with the important difference that the weights w_i now pertain, not to the i-th label, but to the i-th order statistic of the error, i.e., the i-th smallest error. Let us illustrate this with a simple example, in which the ground-truth labeling is y = (0, 1, 1, 0, 0, 0) and the prediction s = (0.2, 0.3, 0.9, 0.1, 0.4, 0.3).
Here, the errors on the individual labels are given, respectively, by 0.2, 0.7, 0.1, 0.1, 0.4, 0.3. Sorting these from lowest to highest yields the increasing sequence 0.1, 0.1, 0.2, 0.3, 0.4, 0.7. Consider now three weight vectors: the uniform weights w = (1/6, ..., 1/6), the extreme weights w = (0, 0, 0, 0, 0, 1), and an intermediate weight vector that distributes its mass over the largest errors. The first case with uniform weights corresponds to the Hamming loss and yields a simple averaging of the errors. In the second case, the full weight is given to the largest error, which corresponds to the subset 0/1 loss. The third case is in-between these two extremes.
Let us also note that the computation is further simplified in the case of binary predictions, i.e., where the scores s_i and hence also the individual errors are either 0 or 1. In this case, the loss merely depends on the total number of errors k, and is given by

ℓ_v(y, ŷ) = w_{K−k+1} + ... + w_K , (17)

i.e., by the sum of the k weights associated with the largest errors.
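The worked example above can be verified in a few lines; as an assumption consistent with the example, the weights are applied to the errors sorted in increasing order:

```python
import numpy as np

def owa_loss(y, s, w):
    """OWA loss: weights w (summing to 1) applied to the errors |y_i - s_i|,
    sorted in increasing order."""
    e = np.sort(np.abs(np.asarray(y, dtype=float) - np.asarray(s, dtype=float)))
    return float(np.dot(w, e))

y = [0, 1, 1, 0, 0, 0]
s = [0.2, 0.3, 0.9, 0.1, 0.4, 0.3]
K = len(y)

uniform = np.full(K, 1 / K)   # Hamming: plain averaging of the errors
extreme = np.eye(K)[-1]       # subset 0/1: full weight on the largest error
print(round(owa_loss(y, s, uniform), 6))  # 0.3
print(round(owa_loss(y, s, extreme), 6))  # 0.7
```

The first value is the mean of the six errors, the second their maximum, matching the Hamming and subset 0/1 interpretations of the two weight vectors.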

Parameterized Families
In the following, we present two families of such loss functions, which allow for modeling the dependence-awareness in terms of a single parameter.
• Polynomial loss: First, one could think of using a convex function of the form

v: x → x^α (18)

for α ≥ 1. The larger α, the more important it becomes to predict larger subsets correctly, and subset 0/1 is recovered in the limit α → ∞. In other words, α can be used to smoothly interpolate between Hamming and subset 0/1.

• Binomial loss: To motivate a second family of losses, suppose we are only interested in getting k-subsets of labels right, whereas a correct prediction on a subset of size < k should not be rewarded. This could be reflected by a Möbius function of the form

m_µ(A) = 1/C(K, k) if |A| = k, and m_µ(A) = 0 otherwise, (19)

where C(n, k) denotes the binomial coefficient. In this case, we obtain

µ(A) = C(|A|, k) / C(K, k) .

Again, the Hamming and subset 0/1 loss can be recovered by setting, respectively, k = 1 and k = K, while interpolations are obtained in-between.
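The OWA weight vectors induced by both families can be generated from v; for the binomial family we use v(x) = C(xK, k)/C(K, k), which is our reading of the induced counting measure, so treat this formula as an assumption:

```python
import numpy as np
from math import comb

def owa_weights(v, K):
    """w_i = v(i/K) - v((i-1)/K) for i = 1, ..., K."""
    return np.diff([v(i / K) for i in range(K + 1)])

# Polynomial family: v(x) = x^alpha, interpolating from Hamming (alpha = 1)
# towards subset 0/1 (alpha -> infinity).
def poly_v(alpha):
    return lambda x: x ** alpha

# Binomial family: v(x) = C(xK, k) / C(K, k), from mu(A) = C(|A|, k) / C(K, k).
def binom_v(k, K):
    return lambda x: comb(round(x * K), k) / comb(K, k)

K = 5
print(owa_weights(poly_v(1), K))      # uniform weights: Hamming
print(owa_weights(binom_v(1, K), K))  # also uniform: k = 1 recovers Hamming
print(owa_weights(binom_v(K, K), K))  # [0. 0. 0. 0. 1.]: k = K is subset 0/1
```

Intermediate parameter values (1 < α < ∞, 1 < k < K) shift mass gradually towards the weights attached to the largest errors.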
In principle, non-symmetric measures could of course be used in MLC as well, for example to express that different labels or different label subsets are of different importance. Yet, as already said, symmetry appears to be a natural property. Moreover, as it significantly reduces the number of degrees of freedom, this property facilitates the specification of a measure-based loss function (15). What could nevertheless be interesting is a weighting of label subsets in proportion to the number of relevant labels they contain. More concretely, starting from a "base measure" µ, the Möbius mass m_µ(A) could be adjusted depending on the number of relevant labels in A: increased if A contains many and reduced if it contains only few relevant labels. Thereby, more emphasis could be put on correct predictions for relevant labels. The resulting loss function would then depend on the ground truth y.
As an example of loss minimization for the binomial loss, i.e., the loss (17) with v induced by (19), consider a distribution on labelings y ∈ {0, 1}^5 (given an instance x). One can then verify (e.g., through simple enumeration of all candidate predictions) that the Bayes-optimal predictions for the binomial loss with different parameters k are given as follows:

k = 1: (1, 0, 0, 1, 0)
k = 2: (0, 0, 1, 1, 1)
k = 3: (0, 0, 1, 1, 1)
k = 4: (1, 1, 0, 0, 0)
k = 5: (1, 0, 1, 1, 0)

This example shows that, by changing the parameter of the loss, the optimal prediction may change quite drastically. For example, the prediction of three of the five labels changes when going from k = 1 to k = 2, and all five labels change when passing from k = 3 to k = 4.
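Since the concrete five-label distribution is not reproduced here, the following sketch illustrates the enumeration procedure on a small hypothetical three-label distribution; it shows the same qualitative effect of the optimum shifting with k:

```python
import itertools
import numpy as np
from math import comb

def owa_weights_binom(k, K):
    """OWA weights for the binomial loss (assuming v(x) = C(xK, k) / C(K, k))."""
    v = lambda x: comb(round(x * K), k) / comb(K, k)
    return np.diff([v(i / K) for i in range(K + 1)])

def owa_loss(y, y_hat, w):
    """Weights w applied to the errors sorted in increasing order."""
    e = np.sort(np.abs(np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)))
    return float(np.dot(w, e))

def bayes_optimal(p, K, w):
    """Enumerate all 2^K binary predictions; return the expected-loss minimizer."""
    candidates = itertools.product([0, 1], repeat=K)
    return min(candidates, key=lambda y_hat: sum(q * owa_loss(y, y_hat, w)
                                                 for y, q in p.items()))

# Hypothetical three-label distribution (not the paper's five-label example).
p = {(1, 1, 0): 0.40, (0, 0, 1): 0.35, (1, 0, 0): 0.25}
for k in (1, 2, 3):
    print(k, bayes_optimal(p, 3, owa_weights_binom(k, 3)))
# 1 (1, 0, 0)   <- the marginal mode
# 2 (1, 1, 0)   <- already the joint mode
# 3 (1, 1, 0)   <- the joint mode (subset 0/1)
```

Here two of the three labels flip when moving from k = 1 to k = 2, mirroring the drastic changes reported above.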

Empirical Case Study
In this section, we showcase how the proposed class of multi-label loss functions can be applied as an analysis tool for capturing the "dependence-awareness" of different multi-label classifiers, i.e., for assessing a learner's ability to capture label dependence.

Experimental Setup
For the comparison of different multi-label classifiers, we apply them to various benchmark datasets originating from different domains. Table 1 provides an overview of the considered datasets together with their statistical properties: the number of instances, the number of labels, the ratio of the number of labels to the number of instances, the absolute number of unique label combinations, and the average number of relevant labels per instance, also referred to as label cardinality.
We use paired 10-fold cross-validations for obtaining out-of-sample predictions in the form of label relevance scores.
Although we restrict our analysis to binary predictions s_i ∈ {0, 1} in order to isolate our evaluation from the classifiers' ability to shape their scores, our methodology is in principle also suitable for comparing soft predictions s_i ∈ [0, 1], independently of the thresholding technique used.

Methods
We experiment with several publicly available multi-label algorithms:

• Binary Relevance (BR) is a reduction to binary classification, which learns one binary classifier for each label independently of the others [Boutell et al., 2004]. Despite its simplicity, BR has proven to be highly competitive with state-of-the-art multi-label learners in recent studies, especially regarding measures that are not dependence-aware (cf., e.g., Rivolli et al. [2020], Wever et al. [2018, 2020]).
• Classifier Chains (CC) take label dependencies into account by imposing an order on the label set and using the predictions for the previous labels as additional feature information for the next label predictor [Read et al., 2009].
• Label Powerset (LP) is a reduction to multi-class classification [Tsoumakas et al., 2010]. It converts each possible label subset into a separate (meta-)class and then solves a standard classification problem. Thereby, it takes label dependence into account, though at the expense of treating similar label sets as independent classes.
• Random k-Labelsets (RAkEL) randomly selects several label subsets of a given size k, learns an (LP) multi-label classifier for each subset, and combines their predictions [Tsoumakas and Vlahavas, 2007]. This may be viewed as a generalization of binary relevance (K classifiers with k = 1) and label powerset (1 classifier with k = K). Obviously, the larger k, the more dependence-aware this method should be.
• Predictive Clustering Trees (PCT) build up a multi-objective decision tree by using example variance and multi-label prediction quality to guide the tree construction [Kocev et al., 2007]. Full label vectors are predicted at the leaves; hence, PCT allows a certain control over the dependence-awareness by setting the leaf and ensemble sizes.

For all algorithms we used the implementations of MEKA, except for PCTs, for which we used the implementation in Mulan. Due to their runtime, we used decision trees as single-label base learners in all MEKA methods. Except for RAkEL, which is evaluated for different values of k and different numbers of ensemble members m, and PCT, which is used with single trees (PCT) and bagged ensembles of 10 trees (EPCT), all hyper-parameters are set to their default values.

Results
In the following, we present a selection of the results we produced and highlight several interesting insights. For a more comprehensive and detailed presentation, we refer to the supplementary material.
To analyze the dependence-awareness of the considered multi-label algorithms, we evaluate their performance in terms of the polynomial instantiation (18) of our loss function, as well as the binomial instantiation (19); we denote the former by ℓ_poly and the latter by ℓ_bin. While the (discrete) parameter k of ℓ_bin takes values in {1, ..., K}, we vary the (continuous) parameter α of the polynomial loss between 1 and 1000. In both cases, the lowest parameter value 1 corresponds to the Hamming loss and the highest value to the subset 0/1 loss (in the case of ℓ_poly, strictly speaking, only for α → ∞), whereas intermediate values interpolate between these two extremes.
We start the analysis with a comparison of the evaluated algorithms on the llog-f dataset. The graphs in Fig. 2 plot the value of the parameter k (respectively α) on the x-axis against the loss of the method on the y-axis. On closer examination, we can observe that some algorithms work better than others for a small k or α, while the order may change as the parameter values increase and the losses demand more dependence-awareness. For example, we can observe that PCT performs favorably compared to LP for small k or α, but LP catches up with increasing parameter values until it finally outperforms PCT. In general, the dependence-awareness of a learner is reflected by the slope of its performance curve (the flatter the better).
While the parameter of ℓ_bin has a simpler interpretation, as k corresponds to the number of labels that are required to be predicted correctly, α allows for a more fine-grained analysis of dependence-awareness.
However, with both families, we can observe intersections between the loss curves of the algorithms, explicitly showing when the order of the methods changes.
The visualizations chosen in Fig. 3 and Fig. 4 allow for a more focused comparison between two methods over several datasets. The graphs shown (one per dataset) are produced by plotting the loss of the first learner (on the x-axis) against the loss of the second learner (on the y-axis), again varying the values of the parameters (1 ≤ k ≤ K for bin and 1 ≤ α ≤ 1000 for poly). To interpret these plots, let us highlight the following properties:
• Since the loss increases with increasing dependence-awareness, the graphs run from the lower left to the upper right.
• A point on the graph above the diagonal indicates better performance of the first method, a point below the diagonal just the opposite. Thus, the intersections of the curve with the diagonal are of particular interest.
• Also of interest is the curvature of the graph: a convex (concave) shape indicates better dependence-awareness of the first (second) method, as it improves relative to the second (first) method with increasing dependence-awareness.
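Such a pairwise trajectory is easy to reproduce. The sketch below compares two synthetic classifiers with the same Hamming-end behavior in spirit as BR and LP: one scatters errors uniformly over individual labels, the other concentrates them on a few entirely wrong instances. The polynomial form 1 − (|A|/K)^α used here is an assumption for illustration, as are all names in the snippet.

```python
import numpy as np

def mean_poly_loss(Y, P, alpha):
    # Mean loss over a test set, assuming the polynomial form 1 - (|A|/K)^alpha,
    # where A is the set of correctly predicted labels of an instance.
    frac_correct = (Y == P).mean(axis=1)
    return float(np.mean(1.0 - frac_correct ** alpha))

# Two synthetic classifiers: the first flips 10% of individual labels,
# the second gets roughly 30% of instances entirely wrong and the rest exactly right.
rng = np.random.default_rng(0)
Y = rng.integers(0, 2, size=(200, 6))
P1 = np.where(rng.random(Y.shape) < 0.1, 1 - Y, Y)
P2 = np.where(rng.random((len(Y), 1)) < 0.3, 1 - Y, Y)

# Trajectory of (loss of classifier 1, loss of classifier 2) as alpha grows;
# plotting these points against the diagonal yields a graph like those in Fig. 3.
trajectory = [(mean_poly_loss(Y, P1, a), mean_poly_loss(Y, P2, a))
              for a in (1, 2, 5, 10, 100)]
```

The trajectory starts above the diagonal (the label-wise classifier wins at the Hamming end) and crosses below it as α grows, since the second classifier gets many label combinations right in their entirety.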
Despite their different appearance in Fig. 2, the trajectories in the pairwise comparisons are quite comparable for the two loss functions (as can be seen from the first three comparisons, respectively), demonstrating the consistency between the two losses. In general, the experimental results confirm our expectations: with an increasing dependence-awareness of the loss (increasing k or α, respectively), simple methods such as BR tend to perform worse than dependence-aware methods like LP, which is also reflected in the late crossing of the diagonal by the graphs. This observation is confirmed by the comparison of LP with PCT. However, compared to the case of BR, the differences at intermediate levels of dependence-awareness are larger, suggesting that PCT is better able to take label dependencies into account than BR. This advantage at intermediate levels is diminished if we compare to CC, a method which is less extreme than LP in its attempt to correctly predict the entire label combination.
In contrast, RAkEL allows a more fine-grained control over the dependence-awareness through its parameter k, which is reflected in the comparisons in Fig. 3. When the ensemble members are trained to predict label subsets of size 2, RAkEL behaves quite similarly to BR, whereas for subsets of size 5 it approaches LP. The full set of pairwise comparisons is depicted in Figs. 5-10 in the supplement.

Conclusion and Future Work
We consider a multi-label loss function as "dependence-aware" if it puts emphasis on getting larger label combinations right in their entirety, instead of "merely" making correct predictions on individual labels. In this paper, we introduced a flexible class of loss functions that allows for modeling dependence-awareness by means of non-additive measures.
More specifically, we defined a loss function in terms of the Choquet integral of label-wise correctness with respect to such a measure. We also proposed two instantiations of our family, in which dependence-awareness can be controlled by a single parameter, thereby "interpolating" between the Hamming and the subset 0/1 loss.
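The construction can be summarized in a few lines of code. The following is a minimal sketch (not the paper's implementation) of the discrete Choquet integral with respect to a set function μ; for a 0/1 correctness vector, as used here, the integral collapses to μ(A), where A is the set of correctly predicted labels, so the measure form alone determines the loss.

```python
def choquet_integral(f, mu):
    """Discrete Choquet integral of a vector f (values in [0, 1]) with
    respect to a set function mu defined on frozensets of indices."""
    idx = sorted(range(len(f)), key=lambda i: f[i])  # ascending by value
    total, prev = 0.0, 0.0
    for j, i in enumerate(idx):
        level_set = frozenset(idx[j:])   # indices with value >= f[i]
        total += (f[i] - prev) * mu(level_set)
        prev = f[i]
    return total

# For binary correctness, only the level set of the correct labels contributes:
mu = lambda A: (len(A) / 3) ** 2         # an assumed polynomial-type measure
print(choquet_integral([1.0, 0.0, 1.0], mu))  # equals mu({0, 2}) = 4/9
```

For an additive μ, the integral reduces to a weighted average of the f-values, which is how the Hamming loss emerges as the additive special case of the family.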
A first experimental study has shown the potential usefulness of our loss functions as a tool for analyzing the dependence-awareness of different MLC methods, i.e., their ability to capture label dependence. Going beyond the analysis of existing algorithms, the natural next step is to develop new algorithms that are specifically tailored to our family of losses and can be customized for minimizing specific instantiations thereof.

Figure 2: Comparing multi-label algorithms on the dataset llog-f w.r.t. the binomial loss (left) and polynomial loss (right).

Figure 3: Pairwise comparison of multi-label classifiers for the binomial loss.

Table 1: Overview of datasets with statistics of their main properties.