Abstract
Multilabel classification exhibits several challenges not present in the binary case. The labels may be interdependent, so that the presence of a certain label affects the probability of other labels’ presence. Thus, exploiting dependencies among the labels could be beneficial for the classifier’s predictive performance. Surprisingly, only a few of the existing algorithms address this issue directly by identifying dependent labels explicitly from the dataset. In this paper we propose new approaches for identifying and modeling existing dependencies between labels. One principal contribution of this work is a theoretical confirmation of the reduction in sample complexity that is gained from unconditional dependence. Additionally, we develop methods for identifying conditionally and unconditionally dependent label pairs; clustering them into several mutually exclusive subsets; and finally, performing multilabel classification incorporating the discovered dependencies. We compare these two notions of label dependence (conditional and unconditional) and evaluate their performance on various benchmark and artificial datasets. We also compare and analyze labels identified as dependent by each of the methods. Moreover, we define an ensemble framework for the new methods and compare it to existing ensemble methods. An empirical comparison of the new approaches to existing baseline and state-of-the-art methods on 12 diverse benchmark datasets demonstrates that in many cases the proposed single-classifier and ensemble methods outperform many multilabel classification algorithms. Perhaps surprisingly, we discover that the weaker notion of unconditional dependence plays the decisive role.
Keywords
Multilabel classification · Conditional and unconditional label dependence · Generalization bounds · Multilabel evaluation measures · Ensemble learning algorithms · Ensemble models diversity · Empirical experiment · Artificial datasets

1 Introduction and motivation
Conventional classification tasks deal with problems where each item should be assigned to exactly one category from a finite set of available labels. This type of classification is referred to in the literature as (single-label) multiclass. Conversely, in multilabel classification, an instance can be associated with several labels simultaneously. Multilabel classification has many applications in everyday life. For example, a news item about an assassination attempt in the course of a presidential election campaign can be classified under both the National Elections and Crime concepts simultaneously; a photograph can similarly belong to more than one conceptual class, such as sunset and beaches; and in music categorization, a song may belong to more than one genre. Multilabeling is a very common problem in text classification: medical documents, Web pages, and scientific papers, for example, often belong simultaneously to a number of concept classes. Due to its increasing practical relevance as well as its theoretical interest, multilabel classification has received more attention from the machine learning community in recent years, and many recent studies look for efficient and accurate algorithms for coping with this classification challenge.
In an exhaustive overview of existing approaches for multilabel classification, Tsoumakas et al. (2010) partition them into two main categories: problem transformation and algorithm adaptation. Problem transformation includes methods that transform the multilabel classification problem into one or more single-label classification problems. The main advantage of these methods is that they are suitable for use with any readily available single-label classifier. Algorithm adaptation embraces methods that extend specific learning algorithms in order to handle multilabel data directly. The main criticism of the adaptation methods is that their application requires changing known classification algorithms in order to adapt them to a specific problem. Algorithm adaptation methods are beyond the scope of this paper.
In the problem transformation category, the common methods used are the Label Powerset (LP) and Binary Relevance (BR) approaches. According to the LP approach, each distinct combination of labels that exists in the multilabel dataset is considered as a single class. The main problem of this method is that many of the created classes are associated with too few examples.
According to the BR approach, a multilabel classification problem is decomposed into multiple, independent binary classification problems and the final labels for each data point are determined by aggregating the classification results from all binary classifiers. The main criticism of this method is that possible dependencies among the labels are ignored.
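To make the two transformations concrete, the following sketch (on a hypothetical toy label matrix) shows how the same multilabel data decomposes under each approach:

```python
import numpy as np

# Hypothetical toy multilabel matrix: 5 instances, 3 labels.
Y = np.array([
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
])

# LP: each distinct label combination observed in the data becomes
# one class of a single-label multiclass problem.
lp_classes, lp_targets = np.unique(Y, axis=0, return_inverse=True)
lp_targets = np.asarray(lp_targets).ravel()
print(len(lp_classes))   # 4 distinct combinations -> a 4-class problem

# BR: one independent binary problem per label; inter-label
# dependencies are ignored by construction.
br_targets = [Y[:, j] for j in range(Y.shape[1])]
print(len(br_targets))   # 3 binary problems
```

Note how LP's class count grows with the number of distinct combinations (up to one class per training example), while BR's problem count stays fixed at the number of labels.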
Recently, many problem transformation methods addressing problems in the LP and BR approaches have been proposed. Some of these methods are discussed in the next section, Related Work.
The aim of this paper is to examine whether the dependencies (conditional or unconditional) among labels can be leveraged to improve the classification accuracy. To this end, we define a natural family of cost functions that interpolates between the 0–1 and the Hamming distances on the multilabel vectors. For each cost function in this family, we derive apparently novel generalization bounds. Furthermore, we give theoretical evidence that unconditional dependence reduces sample complexity. In addition, we propose new algorithms for explicitly identifying conditionally and unconditionally dependent labels and multilabel classification incorporating the discovered dependencies. Heuristic analysis is used to demonstrate why and under what circumstances the proposed multilabel classification algorithm will be beneficial. Empirical evaluation of the proposed methods on a wide range of datasets confirms our theoretical findings.
The rest of the paper is organized as follows. In the next section, related work is discussed. In Sect. 3 we formally define the multilabel classification problem and analyze a number of measures commonly used for evaluating multilabel classification algorithms. In Sect. 4 some general theoretical results for multilabel learning are derived. Section 5 describes the proposed method and analyzes the circumstances in which it will be beneficial. Section 6 presents the setup of the empirical experiment conducted for evaluating the proposed approaches. And in Sect. 7 the results of the experiment are presented. Finally, Sect. 8 concludes the current work and outlines some further research directions.
2 Related work
In this section we briefly review several recently proposed algorithms for multilabel classification and then, in the light of these and other previous works, summarize what we believe are the original contributions of this paper.
The data sparseness problem of the LP approach was addressed in Read et al. (2008). The authors propose the Pruned Sets (PS) and Ensemble of Pruned Sets (EPS) methods to concentrate on the most important correlations. This is achieved by pruning away examples with infrequently occurring label sets. Some of the pruned examples are then partially reintroduced into the data by decomposing them into more frequently occurring label subsets. Finally, a process similar to the regular LP approach is applied on the new dataset. The authors show empirically that the proposed methods are often superior to other multilabel methods. However, these methods are likely to be inefficient in domains with a large proportion of distinct label combinations (Read et al. 2008) and with an even distribution of examples over those combinations. Another limitation of the PS and EPS methods is the need to balance the tradeoff between information loss (caused by pruning training examples) and adding too many decomposed examples with smaller label sets. For this purpose there is a need to choose some non-trivial parameter values before applying the algorithm or, alternatively, to perform calibration tests for parameter adjustment. Still another limitation is that the dependencies within the decomposed label sets are not considered.
Another approach for multilabel classification in domains with a large number of labels was proposed by Tsoumakas et al. (2008). The proposed algorithm (HOMER) organizes all labels into a tree-shaped hierarchy with a much smaller set of labels at each node. A multilabel classifier is then constructed at each non-leaf node, following the BR approach. The multilabel classification is performed recursively, starting from the root and proceeding into the child nodes only if their labels are among those predicted by the parent’s classifier. One of the main HOMER processes is the clustering of the label set into disjoint subsets so that similar labels are placed together. This is accomplished by applying a balanced k-means clustering algorithm to the label part of the data. In this work, unlike HOMER, we cluster labels based on the level of dependency among them and perform flat multilabel classification (considering internal dependencies) within each of the clusters.
A recent paper by Read et al. (2009) argues in defense of the BR method. It presents a method for chaining binary classifiers, Classifier Chains (CC), in a way that overcomes the label independence assumption of BR. According to the proposed method, a single binary classifier is associated with each of the predefined labels in the dataset, and all these classifiers are linked in an ordered chain. The feature space of each classifier in the chain is extended with the 0/1 label associations of all previous classifiers. Thus, each classification decision for a certain label in the chain is augmented by all prior binary relevance predictions in the chain; in this way, correlations among labels are considered. The CC method has been shown to improve classification accuracy over the BR method on a number of regular (not large-scale) datasets. One disadvantage of this method, noted by the authors, is that the order of the chain itself affects accuracy. This can be solved either by a heuristic for selecting a chain order or by using an ensemble of chain classifiers; either solution increases the required computation time.
Recently, a probabilistic extension of the CC algorithm was proposed (Dembczynski et al. 2010a). According to the probabilistic classifier chains (PCC) approach, the conditional probability of each label combination is computed using the product rule of probability. For estimating the joint distribution of labels, a model is learned for each label on a feature space augmented by the previous labels as additional attributes. The classification prediction is then derived from the estimated joint distribution in an explicit way. The authors confirm, theoretically and empirically, the expectation that PCC produces better estimates than the original classifier chains; however, the price is a much higher computational complexity. In fact, the main disadvantage of the PCC method is that it is applicable only to datasets with a small number of labels, no more than about 15.
An idea relatively close to that described in this research is presented in Tsoumakas and Vlahavas (2007). The authors propose an approach that constructs an ensemble of LP classifiers. Each LP classifier is trained using a different, small random subset of the set of labels. This approach (RAkEL) aims at taking into account label correlations and at the same time avoiding the LP limitations mentioned above. A comparison shows the superiority of RAkEL’s performance over popular BR and LP methods on the full set of labels. Tsoumakas and Vlahavas note that the random nature of their method may lead to including models that affect the ensemble’s performance in a negative way.
To the best of our knowledge, few works on multilabel learning have directly identified dependent labels explicitly from the dataset. One such method, where the degree of label correlation is explicitly measured, was presented in Tsoumakas et al. (2009). In this paper the authors use stacking (Wolpert 1992) of BR classifiers to alleviate the label correlations problem. The idea in stacking is to learn a second (or meta) level of models that take as input the output of all first (or base) level models. In this way, correlations between labels are modeled by a meta-level classifier. To avoid the noise that may be introduced by modeling uncorrelated labels at the meta-level, the authors prune the models participating in the stacking process by explicitly measuring the degree of label correlation using the phi coefficient. They showed, by exploratory analysis, that the detected correlations are meaningful and useful. The main disadvantage of this method is that the identified correlations between labels are utilized by a meta-level classifier only. In this paper we show that direct exploration of label dependence (i.e., by a base-level classifier) is more beneficial for predictive performance.
Another recent paper, by Zhang and Zhang (2010), exploited conditional dependencies among labels. For this purpose the authors utilize a Bayesian network representing the joint probability of all labels conditioned on the feature space, such that dependency relations among labels are explicitly expressed by the network structure. Zhang and Zhang learn an approximate network structure from the classification errors of independent binary models for all labels. In the next step, a new binary classifier is learned for each label by treating its parental labels in the network as additional input features. The labels of unseen examples are predicted by the binary classifiers learned on the feature space augmented by the parental labels. The ordering of the labels is implied by the Bayesian network structure. Zhang and Zhang (2010) showed empirically that their method is highly comparable to some of the state-of-the-art approaches over a range of datasets using three multilabel evaluation measures. Note that according to this method, all parental labels on which a certain label is found to depend are added to the feature space. The main limitation of this method is the complexity of Bayesian network learning, which can be performed efficiently with only a small number of variables (up to 20). To handle cases where the number of variables is larger than 20, the authors switch the algorithm to approximate maximum a posteriori (MAP) structure learning, for which the maximum running time and some other parameters must be specified.
Tenenboim et al. (2009) demonstrated that dividing the whole set of labels into several mutually exclusive subsets of dependent labels and applying a combination of BR and LP methods to these subsets provides in many cases higher predictive performance than regular LP and BR approaches.
A general, detailed analysis of the label dependence issue was presented in Dembczynski et al. (2010b). The paper distinguishes and formally explains the differences and connections between two dependence types, conditional and unconditional. It also gives an overview of state-of-the-art MLC algorithms, categorized according to the type of label dependence they seek to capture, and analyzes the potential benefit of exploiting label dependencies in light of three different loss functions.
Recently, some algorithm adaptation methods considering label correlations in various ways have been proposed. For example, Zhang et al. (2009) proposed an extension of the popular Naive Bayes classifier for dealing with multilabel instances, called MLNB. The authors incorporated feature selection techniques (principal component analysis and genetic algorithms) to mitigate the harmful effects of the classic Naive Bayes assumption of class conditional independence. Furthermore, correlations between different labels were also explicitly addressed through the specific fitness function used by the genetic algorithm. The authors experimentally demonstrated the effectiveness of these feature selection techniques in addressing inter-label relationships and reported a significant rise in MLNB’s performance due to them.
Another example of algorithm adaptation for multilabel classification that considers inter-label relationships is MLSVDD, a fast multilabel classification algorithm based on support vector data description (Xu 2010). According to this algorithm, a k-label problem is divided into k subproblems, each of which consists of the instances of a specific class. For each class a sub-classifier is learned using the support vector data description method. To make the overall multilabel classification decision, the predictions of all sub-classifiers are combined as follows: the classes whose predicted pseudo-posterior probability is above some threshold are added to the set of predicted labels. To compensate for missing correlations between labels, a linear ridge regression model is used when constructing the threshold function.
In this paper, we analyze the commonly used multilabel evaluation measures and demonstrate that classification accuracy is better suited than other measures for general evaluation of classifier performance for most regular multilabel classification problems. Thus, accuracy is the target evaluation measure that we aim to improve in the current research.
We propose to discover existing dependencies among labels in advance, before any classifiers are induced, and then to use the discovered dependencies to construct a multilabel classifier. We define methods that estimate conditional and unconditional dependencies between labels from a training set, and apply a new algorithm that combines the LP and BR methods to the results of each of the dependence identification methods. The new algorithm is termed “LPBR”. We then compare the contributions of both label dependency identification methods to classifier predictive performance in terms of four evaluation measures. Moreover, an ensemble framework that we introduce for the proposed algorithm makes it possible to further improve the classifier’s predictive performance.
We empirically evaluate the proposed methods on twelve multilabel datasets and show that the new approach for multilabel classification outperforms many existing methods.
The main contributions of this paper are:

- The development of the LPBR algorithm, which combines the best features of the LP and BR methods while eliminating their inherent disadvantages.
- Theoretical confirmation of the reduction in sample complexity that is gained from unconditional dependence.
- A heuristic analysis of the conditions under which the LPBR method is expected to be beneficial.
- The formulation of novel algorithms, ConDep and ChiDep, for explicitly identifying conditionally and unconditionally dependent label pairs, clustering them into disjoint subsets, and applying LPBR to model the identified dependencies.
- An ensemble framework for the ConDep and ChiDep algorithms.
- An extensive empirical evaluation comparing the effectiveness of the developed algorithms to nine existing multilabel algorithms on a wide range of datasets.
3 Problem formulation and evaluation measures analysis
3.1 Formal definitions
Our learning model is the following standard extension of the binary case. We have a teacher generating examples \(X\in\mathcal{X}\) iid according to some distribution P. Each example X _{ i } is accompanied by its multilabel Y _{ i }, for which we use subset notation Y⊆[L] or vector notation Y∈{0,1}^{ L }, as dictated by convenience.^{1}
Assume \(\mathcal{Y}\) is a given set of predefined binary labels \(\mathcal{Y}=\{\lambda_{1},\ldots, \lambda_{L}\}\). For a given set of labeled examples D={(X _{1},Y _{1}),(X _{2},Y _{2}),…,(X _{ n },Y _{ n })}, the goal of the learning process is to find a classifier \(h : \mathcal{X}\rightarrow 2^{\mathcal{Y}}\), which maps an object \(X\in\mathcal{X}\) to the set of its classification labels \(Y\subseteq\mathcal{Y}\), such that h(X)⊆{λ _{1},…,λ _{ L }} for all X in \(\mathcal{X}\).
The main feature distinguishing multilabel classification from a regular classification task is that a number of labels have to be predicted simultaneously. Thus, exploiting potential dependencies between labels is important and may improve classifier predictive performance. In this paper we consider two types of label dependence, namely conditional and unconditional. The first type refers to dependencies between labels conditional to (i.e. given) a specific instance, while the second one refers to general dependencies existing in the whole set, independently of any concrete observation.
Both types of dependence are formally defined below.
Definition 1
Definition 2
We refer to these definitions when talking about conditional and unconditional dependencies in Sect. 5.1, Label Dependence Identification.
3.2 Evaluation measures analysis
In this paper we consider the most commonly used multilabel evaluation measures from Tsoumakas and Vlahavas (2007), namely multilabel example-based classification accuracy, subset accuracy, Hamming loss, and label-based micro-averaged F-measure. Their formal definitions and analysis are presented below.
Let D be a multilabel evaluation dataset consisting of |D| multilabel examples (X _{ i },Y _{ i }), i=1…|D|, Y _{ i }⊆[L]. Let h be a multilabel classifier.
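As a reference for the analysis that follows, here is a sketch implementing the four measures according to their standard definitions in the multilabel literature: example-based accuracy as the Jaccard score averaged over examples, subset accuracy as the exact-match rate, Hamming loss as the fraction of incorrectly predicted label slots, and micro-averaged F-measure aggregated over all example/label pairs. The conventions for empty label sets are assumptions, not taken from this paper.

```python
import numpy as np

def multilabel_measures(Y_true, Y_pred):
    """Example-based accuracy, subset accuracy, Hamming loss and
    micro-averaged F-measure for 0/1 label matrices of shape (n, L)."""
    Y_true = np.asarray(Y_true, dtype=bool)
    Y_pred = np.asarray(Y_pred, dtype=bool)

    inter = (Y_true & Y_pred).sum(axis=1)
    union = (Y_true | Y_pred).sum(axis=1)
    # Jaccard-style accuracy; convention: an example with empty true and
    # predicted label sets counts as fully correct.
    accuracy = np.where(union == 0, 1.0, inter / np.maximum(union, 1)).mean()

    subset_acc = (Y_true == Y_pred).all(axis=1).mean()
    hamming = (Y_true != Y_pred).mean()

    tp = (Y_true & Y_pred).sum()
    fp = (~Y_true & Y_pred).sum()
    fn = (Y_true & ~Y_pred).sum()
    # Convention: perfect score when there are no positives anywhere.
    micro_f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    return accuracy, subset_acc, hamming, micro_f1
```

The discussion below of each measure's sensitivities can be traced directly to these formulas, e.g. Hamming loss averages over all n·L label slots while subset accuracy rewards only exact matches.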
Hamming loss is very sensitive to the label set size L. It measures the percentage of incorrectly predicted labels, both positive and negative. In cases where the percentage of positive labels is low relative to L, low values of the Hamming loss measure therefore do not indicate high predictive performance. As the empirical evaluation results below demonstrate, the accuracy of a classification algorithm on two datasets with similar Hamming loss values may vary from about 30 to above 70 percent (as, for example, on the “bibtex” and “medical” datasets). However, the Hamming loss measure can be useful for applications where errors of all types (i.e., incorrect prediction of negative labels and missing positive labels) are equally important.
Note that the micro-averaged F-measure gives equal weight to each document/label pair and can therefore be considered an average over all such pairs. It tends to be dominated by the classifier’s performance on common categories and is less influenced by its performance on rare categories.
Of the various measures discussed here, the micro-averaged F-measure seems to be the most balanced and the least dependent on dataset properties. Thus it could be the most useful indicator of a classifier’s general predictive performance across various classification problems. However, it is harder for humans to interpret, as it combines two other measures (precision and recall).
Summarizing the above analysis of some of the most commonly used evaluation measures, we conclude that accuracy and the micro-averaged F-measure are better suited for general evaluation of algorithm performance on most regular multilabel classification problems, while the Hamming loss and subset accuracy measures may be more appropriate for some specific multilabel classification problems.
In fact, the accuracy measure, in which the number of true labels among the predicted ones is what matters, can be considered a “golden mean” between Hamming loss, where all labels are equally important, and subset accuracy, where only the whole set of positive labels is important. Thus, in this research we aim at improving the accuracy measure.
Some other evaluation measures specially designed for multilabel ranking, such as one-error, coverage, ranking loss, and average precision, do exist (Schapire and Singer 2000). This category of measures, known as ranking-based, is often used in the literature (although it is not directly related to multilabel classification) and is nicely presented in Tsoumakas et al. (2010), among other publications. These measures are tailored to the evaluation of specific ranking problems and are of less interest for our research.
4 Generalization bounds for multilabel learning
In this section, we derive some general theoretical results for multilabel learning.
4.1 General theory
Lemma 1
(Baum and Haussler 1989)
Corollary 1
Proof
Theorem 2
Proof
Our desired generalization bound (4) is now immediate:
Theorem 3
4.2 Tighter bounds via unconditional dependence
To illustrate the advantage conferred by unconditionally dependent labels, let us consider the following toy example. Take \(\mathcal{X}=\{0,1\}^{n}\) and let \(\mathcal{C}\) be the set of all monotone conjunctions over the n Boolean variables. That is, each member of \(\mathcal{C}\) is an AND of some fixed subset of the n variables, without negations.
We will consider the 0–1 distance over multilabels (corresponding to k=L). Since \(|\mathcal{C}|=2^{n}\), we have \(d=\operatorname{VCdim} (\mathcal{C})\leq n\). In fact, d=n, since \(\mathcal{C}\) shatters the set {01^{ n−1},101^{ n−2},…,1^{ n−1}0}⊂{0,1}^{ n } of vectors with a single zero: the monotone conjunction over an index set T accepts such a vector exactly when its zero position lies outside T. Similarly, it is easy to see that \(\mathcal{H}=\mathcal{C}^{L}\) has VC-dimension Ln. Thus, a sample of size Ω(Ln/ε) is necessary in order to achieve a distribution-free generalization error bounded by ε with high probability (Blumer et al. 1989).
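For small n, the shattering behavior of monotone conjunctions can be checked mechanically. The sketch below enumerates every monotone conjunction over n Boolean variables and verifies that the n vectors containing a single zero are shattered (the conjunction over an index set T accepts such a vector iff its zero position lies outside T, so every labeling is realizable):

```python
from itertools import combinations

def monotone_conjunction(T):
    """AND of the variables indexed by T, without negations."""
    return lambda x: int(all(x[t] for t in T))

def shatters(points, n):
    """True iff monotone conjunctions over n Boolean variables realize
    every possible 0/1 labeling of `points`."""
    realizable = set()
    for r in range(n + 1):
        for T in combinations(range(n), r):
            c = monotone_conjunction(T)
            realizable.add(tuple(c(x) for x in points))
    return len(realizable) == 2 ** len(points)

n = 4
# Vectors with exactly one zero each.
single_zero = [tuple(0 if j == i else 1 for j in range(n)) for i in range(n)]
print(shatters(single_zero, n))   # True, witnessing VCdim >= n
```

By contrast, the n unit vectors are not shattered by this class: a monotone conjunction accepts a unit vector only when its index set is empty or a singleton, so only the all-ones, all-zeros, and singleton labelings are realizable.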
Then, combining the sample complexity lower bound (Blumer et al. 1989) with the VC dimension upper bound (Vapnik and Chervonenkis 1971), we obtain the following result:
Theorem 4
Putting computational issues aside (e.g., how to efficiently find a consistent pair of τclose conjunctions), this analysis implies that dependence among the labels can be exploited to significantly reduce the sample complexity.
5 Methods
This section describes the proposed methods for multilabel classification.
Our approach comprises two main steps. The aim of the first step is preliminary identification of dependencies and the clustering of all labels into several independent subsets. In this paper we examine two methods for handling this step, namely, identifying unconditional and conditional label dependencies.
The second step is multilabel classification incorporating the category dependencies discovered in the previous step. For this we apply a combination of standard BR and LP approaches to the independent groups of labels that have been defined.
Following is a description of the methods applied for each of the steps.
5.1 Label dependence identification
5.1.1 Unconditional label dependence
The proposed method is based on analyzing the number of instances in each category. We apply the chi-square test for independence to the instance counts for each possible combination of two categories. The level of dependence between each pair of labels in the dataset is thus identified.
General contingency table for labels λ _{ i } and λ _{ j }
Label pairs with a χ ^{2} score higher than the χ ^{2} critical value are considered dependent.
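As a sketch of this test (assuming 0/1 label columns and non-degenerate margins), the χ² statistic for one label pair can be computed from its 2×2 contingency table of co-occurrence counts:

```python
import numpy as np

def chi_square_score(y_i, y_j):
    """Chi-square statistic of independence for two binary label columns,
    built on their 2x2 contingency table. Assumes no row or column of the
    table sums to zero (otherwise expected counts would be zero)."""
    y_i = np.asarray(y_i, dtype=bool)
    y_j = np.asarray(y_j, dtype=bool)
    observed = np.array([
        [( y_i &  y_j).sum(), ( y_i & ~y_j).sum()],
        [(~y_i &  y_j).sum(), (~y_i & ~y_j).sum()],
    ], dtype=float)
    n = observed.sum()
    # Expected counts under independence: outer product of the margins / n.
    expected = (observed.sum(axis=1, keepdims=True)
                * observed.sum(axis=0, keepdims=True)) / n
    return ((observed - expected) ** 2 / expected).sum()
```

A pair is then marked dependent when its score exceeds the χ² critical value for one degree of freedom at the chosen significance level (e.g. 6.635 at the 0.01 level).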
5.1.2 Conditional label dependence
Conditional label dependence is hard to identify since it is specific for each instance x. We try to estimate the conditional dependencies between each pair of labels by evaluating the advantage gained by exploiting this dependence for classification.
Following this condition, for each pair of labels λ _{ i } and λ _{ j }, we train two binary classifiers for predicting λ _{ i }. One, unconditional on λ _{ j }, is trained on the regular feature space x; the second, conditional on λ _{ j }, is trained on the feature space x augmented by label λ _{ j } as an additional feature. We then compare the accuracy of both classifiers using 5×2-fold cross-validation. If the accuracy of the conditional model is significantly higher than that of the unconditional one, we conclude that the labels are conditionally dependent. The statistical significance of the difference between the resultant accuracy vectors of the two classifiers is determined using a t-test; since both models were evaluated on the same folds of data, a paired t-test is applied. To comply with the t-test assumptions, each of the two compared populations should follow a normal distribution. We checked the normality assumption using the Shapiro-Wilk test and found that in most cases it holds.
Label pairs with a t-value higher than the critical t-value are considered dependent. We perform this procedure for all possible label pairs, taking the order of the labels in a pair into account.^{2} Of the two pairs containing the same labels, the pair with the maximal t-statistic value is added to the resulting list of dependent pairs (for the clustering algorithm, the order of labels within a pair does not matter). Finally, we sort the resulting label pairs by t-statistic value in descending order (i.e., from the most to the least dependent pair).
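The full procedure trains a pair of classifiers per label pair; the significance decision itself reduces to a paired t-statistic over the matched fold accuracies, which can be sketched as follows (the accuracy vectors in the usage example are hypothetical):

```python
import numpy as np

def paired_t_statistic(acc_conditional, acc_unconditional):
    """Paired t-statistic over matched accuracy estimates, e.g. the ten
    folds of a 5x2 cross-validation: one accuracy per fold for the model
    trained on x augmented with label j, and for the model trained on x
    alone. A large positive value indicates the conditional model wins."""
    d = (np.asarray(acc_conditional, dtype=float)
         - np.asarray(acc_unconditional, dtype=float))
    n = d.size
    # Standard paired t: mean difference over its standard error.
    return d.mean() / (d.std(ddof=1) / np.sqrt(n))

# Hypothetical fold accuracies for one label pair:
t = paired_t_statistic([0.80, 0.90, 0.85, 0.95], [0.70, 0.80, 0.80, 0.80])
```

The resulting t is compared against the critical t-value for n−1 degrees of freedom; pairs exceeding it are kept and later sorted by this statistic.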
Note that even such approximate estimation of conditional dependencies is very computationally expensive. Indeed, the approach becomes very time consuming and sometimes not feasible at all for large datasets.
5.1.3 Dependent labels clustering
The clustering procedure merges labels into groups step by step and stops when one of the following conditions is met:

- there are no more label pairs to consider;
- all labels are clustered into one single group;
- the pair dependence score (chi-square value or t-value) falls below some threshold value t;
- the number of allowed “non-improving” label pairs n is exceeded.
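A greedy version of this clustering can be sketched as follows. The model-evaluation step that counts "non-improving" merges is omitted for brevity (real use would also stop after n such merges), and the score threshold and cluster-size cap are stand-ins for whichever dependence identification method produced the pairs:

```python
def cluster_dependent_labels(pairs, num_labels, score_threshold=0.0,
                             max_cluster_size=None):
    """Greedy clustering sketch: merge the label groups of each dependent
    pair, processed from the highest to the lowest dependence score.

    `pairs` is a list of (label_i, label_j, score) tuples; labels are
    integers in range(num_labels). Returns a list of disjoint label sets."""
    groups = [{label} for label in range(num_labels)]
    for i, j, score in sorted(pairs, key=lambda p: -p[2]):
        if score < score_threshold:        # dependence below threshold t
            break
        gi = next(g for g in groups if i in g)
        gj = next(g for g in groups if j in g)
        if gi is gj:                       # already in the same group
            continue
        if max_cluster_size and len(gi | gj) > max_cluster_size:
            continue                       # keep LP subproblems small
        groups.remove(gi)
        groups.remove(gj)
        groups.append(gi | gj)
        if len(groups) == 1:               # all labels fell into one group
            break
    return groups
```

Labels that never appear in a retained pair remain singleton groups and will later be handled by plain binary classifiers.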
5.2 LPBR method
In this way:

- the BR approach is applied to the independent groups of labels, without the limitation of ignored dependencies;
- the LP approach is applied to classification within the groups of dependent labels, without incurring the problem of a large number of class combinations (since it is applied to a group with a limited, potentially small, number of classes).
For each independent group of labels, one single LP classifier is created independently to determine the relatedness of each instance to the labels of that group. In the case of a group containing only one label, the classifier is binary. However, if a group consists of k labels, the classifier is actually a single-label multiclass classifier with up to 2^{ k } classes. Note that k, the maximum number of labels within one group, can be controlled by the model designer. Eventually, the final classification prediction is determined, similarly to the BR approach, by combining the labels generated by each single LP and BR classifier.
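A minimal sketch of this combination scheme is shown below. The factory interface `make_clf` is an assumption (any scikit-learn-style single-label classifier with fit/predict would do), not an API from the paper:

```python
import numpy as np

class LPBR:
    """Minimal sketch of the LPBR combination: one LP classifier (a
    single-label multiclass model over the label combinations observed in
    training) per group of dependent labels; a group of size one reduces
    to an ordinary binary-relevance model."""

    def __init__(self, groups, make_clf):
        self.groups = [sorted(g) for g in groups]
        self.make_clf = make_clf

    def fit(self, X, Y):
        Y = np.asarray(Y)
        self.num_labels = Y.shape[1]
        self.models = []
        for g in self.groups:
            # LP transformation restricted to this group's label columns.
            combos, targets = np.unique(Y[:, g], axis=0, return_inverse=True)
            clf = self.make_clf()
            clf.fit(X, np.asarray(targets).ravel())
            self.models.append((g, combos, clf))
        return self

    def predict(self, X):
        # Combine per-group predictions, BR-style, into one label vector.
        Y_hat = np.zeros((len(X), self.num_labels), dtype=int)
        for g, combos, clf in self.models:
            Y_hat[:, g] = combos[np.asarray(clf.predict(X)).ravel()]
        return Y_hat
```

With one group per label this degenerates to BR; with a single group containing all labels it degenerates to LP, mirroring the borderline cases discussed below.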
On the one hand, this approach is more powerful than the regular BR approach because it does not make the label independence assumption, which is wrong at times; on the other hand, it still allows simple multilabel classification using any readily available single-label classifier.
5.3 LPBR discussion
Dembczynski et al. (2010a) showed that the BR approach estimates marginal label probabilities and is thus tailored to the Hamming loss measure, while the LP approach estimates the joint distribution of the whole label subset and is thus tailored to the subset accuracy measure. The authors also show that in the case of conditionally independent labels, both the BR and LP approaches are expected to perform equally well.
In this paper we claim that the LPBR approach can improve the classification performance of both the LP and BR methods in terms of the accuracy measure on datasets where part of the labels are dependent. We also claim that LPBR might be beneficial in terms of subset accuracy in cases where the training set is small and lacks sufficient examples for all label combinations.
In this section two reasons for the above conjectures are presented.
Assuming that label dependencies supplied to LPBR were identified correctly, the LPBR prediction is expected to be more accurate than that of LP and BR separately.
The cases where all labels are independent or, conversely, all labels are dependent are the borderline cases of LPBR: it then applies the BR or LP strategy, respectively, to all labels and produces the same predictions as BR or LP.
Although the above justification of the LPBR method refers to conditional label dependence, in this paper we consider unconditional label dependence as well and compare the two types.
The next reason for the superiority of LPBR refers to the targeted evaluation measures. As shown in Sect. 3, the accuracy measure can be considered as a kind of “golden mean” between subset accuracy and Hamming loss measures. Thus, the LPBR method, which interpolates between these two endpoint measures, should perform well for the accuracy measure. The power of the LP classifier can suffer in the case of a few, if any, training examples for some label combinations. In such cases, separating the labels of these combinations into different subsets, which are treated independently of each other, is an advantage of the LPBR approach, allowing it to provide more accurate results. These hypotheses were tested by empirical experiments discussed below.
5.4 Computational complexity
The complexity of the dependent label identification step differs between the ChiDep and ConDep methods; a comparison of the computation times of these methods is presented in Sect. 7.1.2. This section analyzes the computational complexity of the ChiDep\ConDep approaches after the dependent label identification step.
The proposed clustering approach searches the space between the BR and LP models in an optimized manner: if no dependencies are found, the result of ChiDep\ConDep is the same as the BR result, and if all labels are interdependent, the result is the same as that of LP. Thus, the complexity of the ChiDep\ConDep algorithm depends on the number of dependent label pairs identified within the dataset and on the complexity of the underlying learner. Let O(f(C,A,D)) be the complexity of the single-label base classifier, where C is the number of classes, A is the number of attributes and D is the number of examples in the training set, and let L be the number of labels. ChiDep\ConDep’s ‘best case’ complexity is then equal to that of BR, i.e., O(L·f(2,A,D)); its ‘worst case’ complexity is equal to that of LP, i.e., O(f(2^{L},A,D)). Note that the value 2^{L} is bounded by D, as there cannot be more label combinations than training examples.
We assume that in most typical cases there are partial dependencies among the labels within the dataset, and ChiDep\ConDep will stop well before it reaches the ‘worst case’. ChiDep\ConDep’s complexity can thus be approximated as follows: at each step, when two groups are clustered and a new set of labels is created, an evaluation of the new model is performed. It is important to note that the new model differs from the previous one only in the newly clustered group. This allows us to optimize the computation time by reusing the models of the unchanged subsets, constructed at previous steps, which reduces the computation time of each step to \(\mathrm{O}(f(2^{k_{s}}, A, D))\), where k_{s} is the size of the new cluster at step s and grows from 2 towards L as the algorithm proceeds. This results in a total complexity of \(\mathrm{O}(L \cdot f(2, A, D)) + \mathrm{O}(s \cdot f(2^{k_{s}}, A, D))\), where s is the number of clustering steps actually performed by the algorithm. The whole expression can be summarized as \(\mathrm{O}((L+s) \cdot f(2^{k_{s}}, A, D))\).
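The model-reuse optimization described above can be sketched as a cache keyed by label subsets, so that only the newly merged cluster is retrained at each step (the helper names here are hypothetical; the actual implementation is part of the authors’ Mulan-based code):

```python
# Sketch of model reuse across clustering steps (assumed structure, not the
# actual implementation). Each label subset is trained at most once and cached.
model_cache = {}

def train_lp_model(label_subset):
    """Stand-in for training one LP classifier on a label subset."""
    return ("model-for", tuple(sorted(label_subset)))

def get_model(label_subset):
    key = frozenset(label_subset)
    if key not in model_cache:          # only subsets not seen before
        model_cache[key] = train_lp_model(label_subset)
    return model_cache[key]

# A step merges clusters {1, 2} and {3} into {1, 2, 3}; the untouched
# clusters {4} and {5} are served from the cache at no training cost.
for partition in ([{1, 2}, {3}, {4}, {5}], [{1, 2, 3}, {4}, {5}]):
    models = [get_model(c) for c in partition]
```

Across the two partitions above, five distinct subset models are trained in total rather than seven, which is exactly the saving that reduces each step to \(\mathrm{O}(f(2^{k_{s}}, A, D))\).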
Table 2 Datasets used in the experiment: information and statistics
Name  Domain  Labels  Train  Test  Attributes  L _{CARD}  L _{DC} 

Emotions  Music  6  391  202  72*  1.869  27 
Scene  Image  6  1211  1196  294*  1.074  15 
Yeast  Biology  14  1500  917  103*  4.237  198 
Genbase  Biology  27  463  199  1186  1.252  32 
Medical  Text  45  645  333  1449  1.245  94 
Enron  Text  53  1123  579  1001  3.378  753 
Slashdot  Text  22  2348  1434  1079  1.18  156 
Ohsumed  Text  23  8636  5293  1002  1.66  1147 
tmc2007^{a}  Text  22  21519  7077  500  2.158  1341 
rcv1(subset1)  Text  101  3000  3000  944*  2.88  1028 
Mediamill  Video  101  30993  12914  120*  4.376  6555 
Bibtex  Text  159  4880  2515  1836  2.402  2856 
5.5 Ensemble framework for ChiDep and ConDep algorithms
For classification of a new instance, the binary decisions of all models for each label are averaged and the final decision is taken: all labels whose average is greater than a user-specified threshold t are returned as the final classification result.
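The voting rule can be sketched as follows (illustrative code, assuming each model outputs a binary decision vector over the labels):

```python
def ensemble_vote(model_predictions, t=0.5):
    """Average the binary votes for each label across models and return
    the indices of labels whose average vote exceeds the threshold t."""
    n_models = len(model_predictions)
    n_labels = len(model_predictions[0])
    avg = [sum(pred[j] for pred in model_predictions) / n_models
           for j in range(n_labels)]
    return [j for j, a in enumerate(avg) if a > t]

# Three models voting on four labels:
votes = [[1, 0, 1, 0],
         [1, 1, 0, 0],
         [1, 0, 1, 1]]
print(ensemble_vote(votes, t=0.5))  # [0, 2]
```

With t=0.5 this is simple majority voting, the setting used in the experiments below.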
Ensemble model diversity
The accuracy of the ensemble classifier might be further improved by selecting the most mutually different among the highly scored models. This property has been demonstrated in other ensemble settings (Rokach 2010). We therefore defined a strategy that selects more diverse models for participation in the ensemble, rather than simply selecting the m models with the highest score.
1. Compute the distance matrix between all pairs of the N partitions with the highest “dependency” scores. We will refer to these N high-scored partitions as the set of “candidate” models.
2. Select the label set partition with the highest dependence score and add it to the set of “selected” models for participation in the ensemble.
3. Find the minimal distance d_{i,min} from each of the “candidate” models to all “selected” models.
4. Sort all “candidate” models in descending order of their d_{min} value (from step 3) and select the k percent of partitions with the highest d_{min}; that is, the k percent of “candidate” models that are most different from the “selected” models. We refer to this set of models as the “best candidates”.
5. From the “best candidates” set, select the partition with the highest dependence score and add it to the set of “selected” models.
6. Repeat steps 3–5 until the number of models in the “selected” set reaches m.
7. Return the “selected” set of models for participation in the ensemble.
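Steps 1–7 can be sketched as a greedy selection loop (a sketch with illustrative names; the partition distance measure and the dependency scores are taken as given inputs):

```python
import math

def select_diverse(scores, dist, m, k):
    """Greedy trade-off between partition 'dependency' scores and mutual
    diversity (steps 1-7 above; names are illustrative).

    scores: dependency score of each of the N candidate partitions.
    dist:   N x N matrix of distances between candidate partitions.
    m:      number of partitions to select.
    k:      fraction of the farthest candidates kept at each round.
    """
    candidates = set(range(len(scores)))
    best = max(candidates, key=lambda i: scores[i])   # step 2
    selected = [best]
    candidates.remove(best)
    while len(selected) < m and candidates:
        # Step 3: minimal distance from each candidate to the selected set.
        d_min = {i: min(dist[i][j] for j in selected) for i in candidates}
        # Step 4: keep the k-fraction of candidates farthest from 'selected'.
        n_keep = max(1, math.ceil(k * len(candidates)))
        best_cands = sorted(candidates, key=lambda i: -d_min[i])[:n_keep]
        # Step 5: among those, pick the highest dependency score.
        pick = max(best_cands, key=lambda i: scores[i])
        selected.append(pick)
        candidates.remove(pick)
    return selected
```

With N=100 and k=0.2 (the defaults used below), each round keeps the 20 candidates farthest from the already selected models and adds the highest-scoring one among them.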
This procedure allows us to trade off between the “diversity” and “dependency” scores of the selected models. Let us clarify how the values of parameters N and k influence the model selection process and the “diversity”–“dependency” balance among the ensemble models.

Parameter N defines the number of high-scored partitions that are considered as “candidates” for the ensemble. The higher this number, the more diverse, but less “dependent”, the selected partitions will be. For example, for a dataset with 6 labels there are 172 possible distinct label set partitions, whose dependence scores may vary from high positive to high negative values. Thus, when setting N=100, more than half of all possible partitions will be considered and some of them will probably have negative “dependency” scores. However, for a dataset with 14 labels there are over 6,300 possible distinct label set partitions, and most probably all of the 100 highest-scored ones will have high positive “dependence” scores. Thus, for datasets with a small number of labels and/or a low dependency level among the labels, relatively small values of N (between 20 and 100) should be considered, whereas for datasets with a large number of labels and/or a higher dependency level among the labels, higher values of N (100 and above) are likely to perform better.

Parameter k dynamically defines a threshold for models that are “different enough” from the already selected ones. For example, given N=100, setting k to 0.2 means that the 20 (20 percent of 100) models most different from all the currently selected models will be considered sufficiently different, and the one among them with the highest dependency score will be added to the ensemble. Larger values of k are expected to reduce the level of diversity among the selected models. Clearly, the “best” values for these parameters depend on dataset properties; thus, in order to achieve the best performance, we recommend calibrating these parameters for each dataset specifically.
In this research we carried out a variety of global calibration experiments in order to determine appropriate default values that allow the parameters to perform sufficiently well on most datasets. The selected values are presented in the following section.
6 Empirical evaluation
This section presents the setup of empirical experiments we conducted to evaluate the proposed approaches. It describes the datasets and learning algorithms used during the experiments, and presents the measures used for evaluation.
6.1 Datasets
We empirically evaluated the proposed approach by measuring its performance on twelve benchmark multilabel datasets^{5} from different domains and of varying sizes. All datasets, along with their properties, are listed in Table 2.
Besides the regular classification properties, such as the label set and feature set sizes and the numbers of training and test examples, we present statistical information specific to multilabel classification: (1) Label Cardinality (L _{CARD}), a measure of the “multilabeledness” of a dataset introduced by Tsoumakas et al. (2010) that quantifies the average number of labels per example in a dataset; and (2) Label Distinct Combinations (L _{DC}), a measure representing the number of distinct combinations of labels found in the dataset.
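Both statistics are straightforward to compute from a binary label matrix; a minimal sketch:

```python
def label_stats(Y):
    """Return (L_CARD, L_DC) for a binary label matrix Y: the average
    number of labels per example, and the number of distinct label
    combinations appearing in the data."""
    l_card = sum(sum(row) for row in Y) / len(Y)
    l_dc = len({tuple(row) for row in Y})
    return l_card, l_dc

Y = [[1, 0, 1],
     [1, 0, 1],
     [0, 1, 0],
     [1, 1, 1]]
print(label_stats(Y))  # (2.0, 3)
```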
As shown in Table 2, six small to medium size datasets and six large size datasets are included in the experiment. The datasets are ordered by their approximate complexity and roughly divided (by horizontal line) between regular and large size datasets. From the rcv1 corpus only the subset 1 dataset was used. In addition, dimensionality reduction has been performed on this dataset as in Zhang and Zhang (2010), such that the top 944 features with highest document frequency have been retained.
6.2 Procedure
We implemented the ChiDep and ConDep methods and their ensemble versions (the implementation is written in Java using the Weka^{6} and Mulan^{7} open-source Java libraries). The new algorithms have been integrated into the Mulan library. The tests were performed using the original train and test dataset splits. The internal selection of a model label set was carried out using 3-fold cross-validation over the large training sets (namely, slashdot, ohsumed, tmc2007, rcv1, mediamill and bibtex) and 10-fold cross-validation over the remaining training sets. The overall cross-validation process was repeated ten times for the training sets with fewer than 500 examples (namely, emotions and genbase). For the remaining training sets we followed the approach proposed by Kohavi and John (1997), in which the number of repetitions is determined on the fly from the standard deviation of the accuracy estimate: if the standard deviation of the accuracy estimate was above 1 % and fewer than ten cross-validations had been executed, another cross-validation run was executed. Although this is only a heuristic, Kohavi and John claim that it seems to work well in practice and avoids multiple cross-validation runs for large datasets.
First, we examined the results achieved by the ChiDep and ConDep algorithms and compared the level of performance achieved by modeling unconditional vs. conditional label dependencies. We observed no benefit from modeling conditional dependencies (as presented in Sect. 7.1). Considering that modeling conditional dependencies is very computationally expensive and not feasible for large datasets, we continued the evaluation with the ChiDep method (modeling unconditional dependencies).
We compared the results achieved by the ChiDep (denoted CD) approach to those of standard multilabel classification methods, the BR and LP approaches, and also to some state-of-the-art methods addressed in the Related Work section: HOMER (HO), ML-Stacking (2BR), Pruned Sets (PS) and Classifier Chains (CC).
The RAkEL (RA) method is an ensemble algorithm combining the votes of a number of multilabel classifiers into a single classification decision. We thus compare it to the ensemble versions of the ChiDep (CDE), Classifier Chains (ECC) and Pruned Sets (EPS) methods. Due to the random nature of the RAkEL and ECC algorithms and the partially random nature of CDE, results vary between runs. We therefore averaged the results of each of these algorithms over five distinct runs on each dataset. Consecutive numbers from 1 to 5 were used as initialization seeds for the random number generator, allowing reproducibility of the experimental results.
All methods were evaluated using the Mulan library. All algorithms were supplied with Weka’s J48 implementation of the C4.5 tree classifier as the single-label base learner. We compare the results using the evaluation measures presented in Sect. 3. The statistical significance of differences in the algorithms’ results was determined by the Friedman test (Demsar 2006) and the post-hoc Holm procedure for controlling the family-wise error in multiple hypothesis testing.
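The significance testing scheme can be sketched with SciPy (the accuracy values below are made up for illustration; Holm’s step-down procedure is implemented manually):

```python
from scipy.stats import friedmanchisquare

# Hypothetical accuracy results: one row per algorithm, one column per dataset.
results = {
    "CD": [0.43, 0.58, 0.43, 0.99, 0.73],
    "BR": [0.41, 0.55, 0.42, 0.98, 0.71],
    "LP": [0.40, 0.57, 0.40, 0.97, 0.70],
}
stat, p = friedmanchisquare(*results.values())  # omnibus test over datasets

def holm(p_values, alpha=0.05):
    """Holm's step-down procedure: reject the hypothesis with the i-th
    smallest p-value while p_(i) <= alpha / (m - i)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject
```

If the Friedman test rejects the null hypothesis of equal performance, Holm’s procedure is applied to the pairwise comparison p-values while controlling the family-wise error rate.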
6.3 Parameter configuration
All configurable parameters of the participating algorithms were set to their optimal values as reported in the relevant papers. For HOMER, its balanced k-means version with k=3 was used. 2BR was supplied with J48 as both the base-level and meta-level binary classification algorithm. The PS p and s parameters require tuning for each dataset; we used the parameters chosen by the authors for the datasets presented in the PS paper (Read et al. 2008). For the other datasets we set p=1 (the dominant value among those chosen in the PS paper), and s was computed by PS’s utility (from the Meka^{8} library) according to the label cardinality and the number of labels in the dataset, as recommended by the authors. BR, LP and CC do not require parameters. For the ChiDep and ConDep algorithms, we set the n parameter to 10 for all datasets, for the reason mentioned in Sect. 5.1.3. The target evaluation measure was set according to each of the considered measures, namely accuracy, subset accuracy, micro-averaged F-measure and Hamming loss. The χ^{2} and t critical values were set to 6.635 and 3.25 respectively, as described in Sect. 5.
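The quoted critical values can be reproduced from the inverse CDFs; the χ² value corresponds to significance 0.01 with one degree of freedom (a 2×2 contingency table of two binary labels), and the t value of 3.25 is consistent with a two-tailed 0.01 level at 9 degrees of freedom (the degrees of freedom here are our assumption; Sect. 5 defines the exact test):

```python
from scipy.stats import chi2, t

# Chi-square critical value at significance 0.01 with one degree of
# freedom (a 2x2 contingency table of two binary labels has df = 1):
chi2_crit = chi2.ppf(0.99, df=1)    # ~6.635

# The t critical value of 3.25 matches a two-tailed 0.01 level at df = 9
# (an assumption for illustration; see Sect. 5 for the exact setup):
t_crit = t.ppf(1 - 0.01 / 2, df=9)  # ~3.25
```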
Ensemble methods configuration
The number of models participating in an ensemble classifier is expected to influence its predictive accuracy. For the sake of a fair comparison, we wanted to evaluate ensemble models of equivalent complexity. To achieve this, we configured all of the ensemble algorithms in the experiment to construct the same number of distinct models. The ChiDep\ConDep ensemble algorithms construct a varying number of models for each label set partition, according to the number of dependent label groups in each partition. Thus, for the ChiDep\ConDep ensemble we set the number of label set partitions m to 10 (a value frequently used for the number of classifiers in an ensemble) and averaged the number of distinct models constructed by the ensemble across all the random runs. This number, presented in Table 13, was taken as the base number of models for all ensemble methods. RAkEL, ECC and EPS were configured to construct the same number of distinct models. RAkEL and EPS allow supplying the number of desired models via the constructor. ECC, however, aggregates a number of CC classifiers, each of which constructs as many models as there are labels in the dataset; thus, for ECC we divided the desired number of models by L, rounded up, and set the result as the number of CC classifiers participating in ECC. For all ensemble methods, the majority voting threshold was set to the commonly used, intuitive value of 0.5.
Other parameters of the ensemble methods were configured as follows. The RAkEL k parameter was set to 3. ECC does not require parameters other than the number of models and the threshold. For EPS, the p and s parameters were set, for each dataset, to the same values as those used for the PS algorithm. The diverse version of the ChiDep ensemble (CDEd) was supplied with N=100 and k=0.2 (i.e., twenty percent), according to the results of the calibration experiments described in Sect. 7.2.
7 Experimental results
This section presents the results of the evaluation experiments that we conducted. Initially, we compare the results of the algorithms utilizing unconditional (ChiDep) vs. conditional (ConDep) dependence identification and select the better one for further comparison to other baseline and state-of-the-art multilabel algorithms. Then we compare the selected algorithm and its ensemble version to other single-classifier and ensemble algorithms, respectively.
7.1 Conditional vs. unconditional dependencies
In this section we present the results of the new algorithm utilizing two different methods for dependence identification: ChiDep, using the chi-square test for unconditional dependence identification, and ConDep, estimating conditional label dependencies.
Table 3 Predictive performance of the ChiDep and ConDep algorithms and their ensemble versions on various datasets
Dataset  Accuracy  micro-avg. F-measure  

Singleclassifier  Ensemble  Singleclassifier  Ensemble  
ChiDep  ConDep  ChiDep  ConDep  ChiDep  ConDep  ChiDep  ConDep  
Emotions  43.28  43.84  51.05  44.87  60.31  59.25  63.87  59.66 
Scene  57.71  56.76  59.91  59.29  60.4  60.4  66.06  65.72 
Yeast  42.88  42.88  49.53  44.73  56.92  56.72  62.31  58.9 
Genbase  99.16  98.66  98.66  98.66  98.97  98.77  98.77  98.77 
Medical  72.52  72.9  71.34  71.17  79.36  77.99  78.97  78.92 
Enron  41.14  38.73  42.91  40.46  51.94  50.63  54.81  53.58 
Slashdot  38.73  38.78  38.42  38.69  48.93  49.56  47.81  49.59 
Dataset  Subset accuracy  Hamming loss  

Singleclassifier  Ensemble  Singleclassifier  Ensemble  
ChiDep  ConDep  ChiDep  ConDep  ChiDep  ConDep  ChiDep  ConDep  
Emotions  22.28  17.82  26.24  13.86  25.41  25.99  22.77  25.66 
Scene  53.68  50.59  54.26  50.42  14.51  14.35  11.4  12.12 
Yeast  14.94  12.00  14.76  10.45  26.59  26.59  21.5  24.39 
Genbase  98.49  97.49  97.49  97.49  0.09  0.11  0.11  0.11 
Medical  65.47  64.26  62.93  62.76  1.17  1.17  1.11  1.12 
Enron  12.95  11.92  12.55  10.71  5.38  5.40  5.08  5.17 
Slashdot  33.47  33.12  33.66  33.1  4.36  4.20  4.4  4.21 
Consider first the single-classifier algorithms. It can be observed that the ChiDep algorithm outperforms ConDep on all datasets for the subset accuracy measure and on most datasets for the F-measure. However, for the accuracy and Hamming loss measures, ConDep outperforms ChiDep on 3 and 2 datasets, respectively. For the ensemble algorithms, we observe that ChiDep outperforms ConDep for all measures on almost all datasets.
The above results indicate that of the two methods of modeling dependencies, modeling unconditional dependencies (ChiDep) is superior in terms of subset accuracy and F-measure; in addition, its ensemble version is superior in terms of accuracy and Hamming loss as well.
To verify these rather surprising results, we performed experiments on artificial datasets in which various types of dependencies were simulated. The results of these simulation experiments (presented in the next section) are consistent with our findings on the benchmark datasets.
Furthermore, these results are consistent with those presented in Ghamrawi and McCallum (2005), where two models were evaluated: CMLF, which parameterizes conditional dependencies, and CML, which parameterizes unconditional dependencies. The results presented in their work show that the CMLF model is no better, and in many cases even worse (in terms of micro- and macro-averaged F-measure and subset accuracy), than the CML model.
Taking into consideration that the computational cost of identifying conditional dependencies is also much higher than that of unconditional dependencies (as can be observed from the time comparison reported in Sect. 7.1.2), we conclude that modeling unconditional dependencies is sufficient for solving multilabel classification problems. The remaining comparisons are therefore performed with ChiDep and the ensemble of ChiDep algorithms.
7.1.1 Experiments on artificial datasets
In this section, the methods for conditional and unconditional label dependence identification are compared on artificially created multilabel datasets.
We experimented with two types of artificially generated datasets. In the first type of datasets, a predefined pattern was introduced to a set of randomly generated sequences. In the second type, artificial datasets were generated according to the models defined by Dembczynski et al. (2010a).
We hypothesize that in cases where label dependence is clearly expressed, the conditional method might perform better than the unconditional one. However, in non-ideal cases, where various levels of noise are present and there are fewer examples of the dependence, unconditional dependence identification should perform better than conditional. These hypotheses are tested in the experiments below.
Experiments on the predefined patterns data
For each of these datasets, two distinct experiments were conducted. In the first (referred to as experiment1), 9,000 random records (i.e., records not matching the pattern) and 1,000 records matching the pattern were selected from the full set of generated records; these 10,000 instances were used as a training set. For the second (referred to as experiment2), 500 “pattern” records were removed from the experiment1 dataset, thus reducing the number of dependency examples; the remaining 9,500 instances were used for training. In both experiments, the two algorithms, ChiDep, using the chi-square test for unconditional dependence identification, and ConDep, estimating conditional label dependencies, were tested on 1,000 records of the corresponding pattern, generated using another set of random vectors E′={0,1}^{16}.
Table 4 Dependent label pairs identified by the ChiDep and ConDep algorithms on the training data
Pattern number  Experiment1  Experiment2  

ChiDep  ConDep  ChiDep  ConDep  
pair  pvalue  pair  pvalue  pair  pvalue  pair  pvalue  
1  [2, 8]  1.2E24  [2, 8]  1.0E06  [2, 8]  2.6E09  [2, 8]  9.0E08 
[4, 8]  1.6E21  [4, 8]  3.9E05  [4, 8]  1.9E07  [4, 8]  4.7E06  
[2, 4]  1.9E12  [2, 1]  4.2E03  [8, 3]  1.9E03  
[5, 6]  8.1E03  
[8, 3]  9.7E03  
2  [5, 7]  2.8E24  [5, 7]  1.8E06  [5, 7]  3.6E09  [5, 7]  1.8E05 
[6, 7]  1.8E17  [6, 7]  1.7E04  [6, 7]  2.2E05  [7, 1]  1.0E03  
[5, 6]  6.0E09  [6, 7]  1.3E03  
3  [2, 3]  9.8E26  [2, 3]  5.8E07  [2, 3]  5.9E10  [1, 3]  5.7E07 
[1, 3]  4.0E14  [1, 3]  4.4E04  [1, 3]  1.4E03  [2, 3]  1.3E05  
[1, 2]  4.3E12  [5, 7]  1.6E03  [6, 3]  4.4E04  
[6, 2]  6.1E03  
[5, 4]  6.8E03  
4  [2, 3]  2.6E18  [2, 3]  4.0E08  [2, 3]  2.3E05  [2, 3]  2.5E07 
[2, 5]  2.9E03  
5  [4, 6]  7.0E24  [1, 6]  1.1E05  [4, 6]  1.3E08  [1, 6]  5.6E06 
[1, 6]  2.1E22  [4, 6]  2.6E05  [1, 6]  6.5E08  [4, 6]  7.2E05  
[1, 4]  6.5E12  [1, 8]  7.6E03  [8, 4]  6.2E03  
[7, 6]  6.4E03  
[8, 2]  8.3E03 
Consider the algorithms’ results on pattern number 1, where labels 2, 4, and 8 are dependent. We can see that in experiment1, the “unconditional” (ChiDep) method identifies all three pairs of dependent labels, while the “conditional” (ConDep) method identifies only two pairs correctly and additionally misidentifies three other label pairs as dependent. In experiment2, both methods identify the same two (out of three) label pairs as dependent; however, ConDep again misidentifies another pair as dependent. The results on all the other patterns are quite similar. In experiment1, the ChiDep method correctly identifies all pairs of dependent labels on all patterns, while ConDep identifies only two out of three dependent pairs on all patterns except pattern number 4, in which only one pair of labels (2 and 3) is dependent and is correctly identified by both methods. In experiment2, both methods identify the same label pairs as dependent; however, ConDep misidentifies some other pairs as dependent on all the patterns.
The results of these experiments demonstrate that on “noisy” datasets with different levels of dependence representation, the “unconditional” ChiDep method identifies label dependencies more accurately than the “conditional” ConDep method. One of the problems with the ConDep method is the identification of spurious dependencies. We hypothesize that these misidentifications originate from false positives of the statistical t-test, known in statistics as Type I errors. As can be seen from the results, the p-values of the misidentified pairs are relatively high and almost always much higher than those of the actual dependent pairs. An additional shortcoming of the ConDep method is that the actual dependent pairs are not always identified.
Table 5 Predictive performance of the ChiDep and ConDep algorithms on the predefined patterns data
Measure  Experiment1  Experiment2  

Pattern5  Pattern2  Pattern3  Pattern5  
ChiDep  ConDep  ChiDep  ConDep  ChiDep  ConDep  ChiDep  ConDep  
Accuracy  0.5735  0.5647  0.5606  0.5588  0.5888  0.5720  0.5789  0.5712 
Subset Accuracy  0.0340  0.0250  0.0270  0.0220  0.0260  0.0250  0.0240  0.0220 
Micro F-measure  0.7213  0.7136  0.7110  0.7091  0.7362  0.7210  0.7284  0.7212 
Hamming Loss  0.3140  0.3160  0.3199  0.3185  0.3113  0.3116  0.3165  0.3154 
It can be seen that the ChiDep method was outperformed by ConDep in only two experiments, and only on the Hamming loss measure; on the other hand, ChiDep outperforms ConDep in four experiments on almost all measures.
Experiments on independent, dependent, and combined datasets
In Dembczynski et al. (2010a), two models for generating artificial three-label datasets are defined and used. In the first dataset all the labels are conditionally independent, although an unconditional dependence exists between labels 2 and 3; in the second dataset, all the labels are conditionally dependent. We utilized these models for our experiments. The models defined in Dembczynski et al. (2010a) are briefly presented below for the reader’s convenience.
For each dataset, 10,000 instances were generated. The first (conditionally independent) dataset was generated by uniformly drawing instances from the square x∈[−0.5,0.5]^{2}. The label distribution is given by the product of the marginal distributions defined by P _{ x }(y _{ i })=1/(1+exp(−f _{ i }(x))), where the f _{ i } are linear functions: f _{1}(x)=x _{1}+x _{2}, f _{2}(x)=−x _{1}+x _{2}, f _{3}(x)=x _{1}−x _{2}. This model generates conditionally independent labels. However, labels 2 and 3 are dependent marginally (unconditionally) as f _{2}(x)=−f _{3}(x). This dataset will subsequently be referred to as independent or conditionally independent.
The second (dependent) dataset was generated by drawing the instances from a univariate uniform distribution x∈[−0.5,0.5]. The label distribution is given by the product rule P _{ x }(Y)=P _{ x }(y _{1})P _{ x }(y _{2}|y _{1})P _{ x }(y _{3}|y _{1},y _{2}), where the probabilities are modeled by linear functions as before: f _{1}(x)=x, f _{2}(y _{1},x)=−x−2y _{1}+1, f _{3}(y _{2},y _{1},x)=x+12y _{1}−2y _{2}−11. This dataset will subsequently be referred to as dependent.
Additionally, to simulate “noisy” dependencies, we created a third dataset combining the first two. The third dataset consists of two features and six labels: the feature attributes and the first three labels were generated according to the conditionally independent model, and the next three labels were generated according to the dependent model, taking x=x _{2}. This dataset will subsequently be referred to as combined. On the combined dataset, we experimented with various training set sizes, namely, 10,000, 1,000, 500 and 300 instances.
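The two generative models can be sketched as follows (a sketch under the definitions above; the sampling details, such as the seed, are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def gen_independent(n):
    """Conditionally independent model: labels drawn from the product of
    marginals P_x(y_i) = 1 / (1 + exp(-f_i(x)))."""
    x = rng.uniform(-0.5, 0.5, size=(n, 2))
    f = np.stack([x[:, 0] + x[:, 1],    # f1
                  -x[:, 0] + x[:, 1],   # f2
                  x[:, 0] - x[:, 1]],   # f3 = -f2: y2, y3 marginally dependent
                 axis=1)
    return x, (rng.random((n, 3)) < sigmoid(f)).astype(int)

def gen_dependent(n):
    """Conditionally dependent model: chain rule P(y1) P(y2|y1) P(y3|y1,y2)."""
    x = rng.uniform(-0.5, 0.5, size=n)
    y1 = (rng.random(n) < sigmoid(x)).astype(int)
    y2 = (rng.random(n) < sigmoid(-x - 2 * y1 + 1)).astype(int)
    y3 = (rng.random(n) < sigmoid(x + 12 * y1 - 2 * y2 - 11)).astype(int)
    return x, np.stack([y1, y2, y3], axis=1)

x_ind, y_ind = gen_independent(10000)
x_dep, y_dep = gen_dependent(10000)
```

In the independent model the labels are sampled independently given x, yet because f_2 = −f_3 the labels y_2 and y_3 are (negatively) correlated marginally, which is exactly the unconditional dependence the chi-square test is expected to pick up.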
Table 6 Dependence identification results on the dependent (left) and independent (right) datasets. Label pairs identified as dependent are marked in bold
Dependent dataset  Independent dataset  

ChiDep  ConDep  ChiDep  ConDep  
pair  pvalue  pair  pvalue  pair  pvalue  pair  pvalue 
[1, 3]  0.0E+00  [1, 3]  6.5E17  [2, 3]  1.1E06  [2, 3]  2.3E01 
[2, 3]  0.0E+00  [2, 3]  4.0E15  [1, 3]  6.3E01  [1, 2]  3.2E01 
[1, 2]  0.0E+00  [1, 2]  6.4E15  [1, 2]  8.8E01  [3, 1]  5.3E01 
Table 7 Dependence identification results on the combined datasets
10000 training examples  1000 training examples  

ChiDep  ConDep  ChiDep  ConDep  
pair  pvalue  pair  pvalue  pair  pvalue  pair  pvalue 
[4, 6]  0.0E+00  [4, 6]  1.1E15  [4, 6]  5.6E102  [4, 6]  2.3E10 
[5, 6]  0.0E+00  [5, 6]  5.1E15  [5, 6]  3.9E68  [5, 4]  7.9E10 
[5, 4]  0.0E+00  [5, 4]  2.1E14  [5, 4]  2.9E51  [5, 6]  1.5E08 
[2, 3]  1.1E06  [2, 3]  1.1E03  
[1, 6]  1.2E04  
[1, 5]  2.4E04  
[2, 6]  2.2E03  
[2, 4]  4.9E03 
As can be seen in Table 6, on the dependent dataset both methods correctly identified all label pairs as dependent. On the independent dataset, the identification results are correct as well: the ConDep method identifies all the labels as independent, while the ChiDep method identifies pairs [1,3] and [1,2] as independent and pair [2,3] as dependent. Note that labels 2 and 3 are indeed unconditionally dependent. For the combined datasets with 500 and 300 training instances, both methods identified the same three label pairs from the dependent model as dependent; thus, these results are omitted from Table 7. The results presented are for the training sets of 10,000 and 1,000 instances. It can be seen that on both these training sets, the ConDep method correctly identifies the three label pairs from the dependent model as dependent. The ChiDep method identifies the same three pairs as the most dependent, and correctly identifies the pair of labels [2,3] as dependent as well. Additionally, on the training set with 10,000 examples, four other pairs are identified as unconditionally dependent by the ChiDep method; these dependencies are caused by a mutual interaction between the two underlying data models.
Table 8 Classification results on the independent, dependent, and combined datasets
Measure  Independent  Dependent  Combined (10000 train)  Combined (300 train)  

ChiDep  ConDep  ChiDep  ConDep  ChiDep  ConDep  ChiDep  ConDep  
Accuracy  0.43  0.4265  0.4795  0.4795  0.4155  0.4166  0.3363  0.3194 
Subset Accuracy  0.1936  0.1939  0.4024  0.4024  0.0795  0.0786  0.0267  0.0233 
Micro F-measure  0.5836  0.5774  0.5336  0.5336  0.5549  0.5559  0.492  0.4714 
Hamming Loss  0.4231  0.4217  0.4127  0.4127  0.4168  0.4168  0.4828  0.49 
Considering the classification results of the ChiDep and ConDep algorithms, we can see that on the dependent data both methods perform equally on all evaluated measures. On the independent dataset, the results of ConDep are slightly better for the subset accuracy and Hamming loss measures, while utilizing the unconditional dependence discovered between labels 2 and 3 enables ChiDep to perform better in terms of accuracy and micro-averaged F-measure.
In the case of the combined dataset, we can see that on the large training set of 10,000 instances the results of the two algorithms are very close: ConDep slightly outperforms ChiDep for accuracy and micro-averaged F-measure, while ChiDep slightly outperforms ConDep for the subset accuracy measure. However, on the small training set of 300 instances, ChiDep significantly outperforms ConDep on all evaluated measures. On the training sets of 1,000 and 500 instances, both methods perform equally on all evaluated measures; thus, these results are omitted from Table 8.
Experiments conclusions
Summarizing the experiments on artificial datasets, where various dependence types and conditions were simulated, we draw the following conclusions. Both the conditional and the unconditional dependence identification methods correctly identify the existing dependencies on datasets with well-expressed dependencies and many examples of the dependence; in such cases, the classification results of both methods are either the same or very close. However, on small datasets, or datasets with a limited number of dependence examples, the unconditional method is more accurate in dependence identification and, consequently, in classification predictions. In most such cases, the conditional method does not identify all the existing dependencies and additionally suffers from Type I errors, causing the identification of spurious dependencies. These results are consistent with our results on the benchmark datasets and confirm that, in practice, modeling unconditional dependencies provides either very similar or more effective results than modeling conditional dependencies.
7.1.2 Comparison of computational time for unconditional vs. conditional label dependence identification
It can be seen from the graph that the time required for the approximation of conditional label dependencies increases exponentially with the number of training examples in the dataset. For each of the five largest datasets (namely, ohsumed, tmc2007, rcv1, mediamill and bibtex), this approximation took more than one week and thus could not be completed. On the other hand, the chi-square test applied to the label data for the identification of unconditional label dependencies took about 4–5 seconds even on the largest datasets.
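The unconditional identification step amounts to a chi-square test on the 2×2 contingency table of each label pair, computed from the label columns alone; a sketch of the idea (not the paper’s exact implementation):

```python
import numpy as np
from itertools import combinations
from scipy.stats import chi2_contingency

def dependent_pairs(Y, alpha=0.01):
    """Return label pairs whose 2x2 contingency table fails the chi-square
    independence test at level alpha, sorted by ascending p-value."""
    n, L = Y.shape
    found = []
    for i, j in combinations(range(L), 2):
        table = np.array([[np.sum((Y[:, i] == a) & (Y[:, j] == b))
                           for b in (0, 1)] for a in (0, 1)])
        stat, p, dof, expected = chi2_contingency(table)
        if p < alpha:
            found.append(((i, j), p))
    return sorted(found, key=lambda t: t[1])

# Toy data: labels 0 and 1 are identical (strongly dependent); label 2 is noise.
rng = np.random.default_rng(1)
y0 = rng.integers(0, 2, 500)
Y = np.stack([y0, y0, rng.integers(0, 2, 500)], axis=1)
pairs = dependent_pairs(Y)  # (0, 1) comes out first, with a tiny p-value
```

Since the test touches only the L×(L−1)/2 pairs of binary label columns and never the feature space, its cost is independent of the number of attributes, which explains the few-seconds running times reported above.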
7.1.3 Identified label dependencies
Number of label pairs identified as dependent by the methods for conditional and unconditional label dependence identification at the significance level of p=0.01 (NF: the approximation was not finished; see Sect. 7.1.2)
Dataset  Total label pairs  Number of unconditionally dependent label pairs  Number of conditionally dependent label pairs 

Emotions  15  14  2 
Scene  15  15  5 
Yeast  91  53  16 
Genbase  351  42  0 
Medical  990  29  2 
Enron  1378  136  6 
Slashdot  231  50  1 
Ohsumed  253  110  NF 
tmc2007  231  178  NF 
rcv1(subset1)  5050  582  NF 
Mediamill  5050  1532  NF 
Bibtex  1566  1157  NF 
For the emotions dataset, unconditional label dependence identification using the chi-square test identified 14 (of 15) label pairs as dependent. The only pair of independent labels is {amazed-surprised, happy-pleased}. According to our conditional dependence identification procedure, there are two dependent pairs of labels at the 0.01 significance level, and four more pairs are dependent at the 0.05 significance level. The two most conditionally dependent label pairs are {relaxing-calm, angry-aggressive} and {happy-pleased, angry-aggressive}; the next four (at p=0.05) are {happy-pleased, sad-lonely}, {happy-pleased, quiet-still}, {amazed-surprised, quiet-still} and {amazed-surprised, relaxing-calm}. The {relaxing-calm, angry-aggressive} pair is also the most unconditionally dependent pair.
For the scene dataset, all label pairs were found to be unconditionally dependent, but only five were found to be conditionally dependent at the p=0.01 significance level; three more pairs were conditionally dependent at p=0.05. The five conditionally dependent label pairs are {Mountain, Urban}, {Beach, Urban}, {Sunset, Fall Foliage}, {Beach, Mountain} and {Beach, Field}. The {Mountain, Urban} pair is also the most unconditionally dependent pair.
For the yeast dataset, 53 unconditionally dependent pairs and 16 conditionally dependent pairs were found out of a total of 91 pairs. In this dataset too, the most dependent pair of labels is the same for both methods. Moreover, all 16 conditionally dependent pairs were also found to be unconditionally dependent.
For the slashdot dataset, 50 unconditionally dependent pairs and only one conditionally dependent pair, {Idle, Games}, were found out of a total of 231 pairs. Three more pairs, {Games, Technology}, {Idle, Mobile} and {Science, Technology}, were conditionally dependent at p=0.05. In this dataset too, the most dependent pair of labels identified by both methods is the same.
The situation was very similar for the genbase, medical and enron datasets. Details for these datasets can be seen in Table 9.
Summarizing these results, we conclude that in general many more label pairs are identified as unconditionally dependent than as conditionally dependent. Although the results of the two methods are very different, some correspondence between the identified dependent label pairs can be noticed: in many cases the most dependent label pair is the same for both methods, and most of the conditionally dependent pairs are also identified as unconditionally dependent.
7.2 ChiDep vs. singleclassifier algorithms
Comparing ChiDep to other single-classifier algorithms on various datasets. [+], [−]: statistically significant improvement or degradation vs. ChiDep
Dataset  ChiDep rank  Accuracy  

ChiDep  BR  LP[−]  HO[−]  2BR[−]  PS[−]  CC  
Emotions  2  43.28  43.84  42.15  42.34  41.7  41.49  42.77 
Scene  2.5  57.71  51.34  57.71  46.98  47.07  57.04  59.45 
Yeast  4  42.88  42.26  39.84  43.27  45.62  40.76  43.17 
Genbase  1  99.16  98.66  98.41  97.91  98.74  98.41  98.66 
Medical  3  72.52  71.17  71.92  69.07  66.12  73.22  73.17 
Enron  1  41.14  36.71  30.97  40.19  37.25  33.18  39.41 
Slashdot  3  38.73  38.69  41.87  34.26  37.58  42.28  38.43 
Ohsumed  1  36.46  35.84  32.14  34.37  35.19  32.18  35.89 
tmc2007  1  75.71  75.12  75.38  73.75  59.56  75.08  74.16 
rcv1(subset1)  2  6.88  6.94  6.26  6.49  6.8  6.41  6.7 
Mediamill  2  36.66  36.89  33.72  36.46  35.08  33.66  36.47 
Bibtex  2  29.62  29.92  25.17  24.23  25.92  OOM  28.79 
Avg. rank  2.0  2.0  3.1  5.0  5.1  4.6  4.8  3.1 
Dataset  ChiDep rank  Subset accuracy  

ChiDep  BR[−]  LP  HO[−]  2BR[−]  PS  CC  
Emotions  1.5  22.28  12.87  22.28  12.38  19.8  18.81  16.34 
Scene  1.5  53.68  40.13  53.68  32.61  42.31  51.92  53.34 
Yeast  1  14.94  6.43  11.89  6.98  9.49  12.43  14.29 
Genbase  1  98.49  97.49  96.98  96.48  97.49  96.98  97.49 
Medical  2  65.47  62.76  63.66  59.46  58.56  65.47  65.47 
Enron  1  12.95  8.64  9.67  10.36  3.8  11.23  11.57 
Slashdot  3  33.47  33.12  36.89  25.94  32.71  36.75  32.64 
Ohsumed  4  17.48  18.04  16.36  13.02  21.12  16.78  18.12 
tmc2007  3  53.82  52.18  64.45  47.25  29.31  63.71  53.45 
rcv1(subset1)  2.5  0.13  0.03  0.13  0  0.13  0.13  0.03 
Mediamill  4  6.90  5.35  7.22  5.68  4.89  7.43  8.42 
Bibtex  1  14.19  13.32  13.96  9.62  12.6  OOM  13.96 
Avg. rank  2.1  2.1  5.0  3.1  6.3  4.8  3.2  3.2 
Dataset  ChiDep rank  microavg. Fmeasure  

ChiDep  BR  LP[−]  HO[−]  2BR  PS[−]  CC  
Emotions  1  60.31  59.25  52.59  56.75  55.78  52.69  55.63 
Scene  3  60.4  60.86  58.75  56.13  58.36  58.59  61.53 
Yeast  3  56.92  56.9  52.65  57.6  60.04  53.98  55.84 
Genbase  1  98.97  98.77  98.35  98.14  98.77  98.35  98.77 
Medical  2  79.36  78.94  74.54  75.78  75.82  75.74  79.46 
Enron  1  51.94  50.42  39.53  50.38  51.39  42.35  51.45 
Slashdot  3  48.93  49.64  42.26  44.24  47.68  42.98  49.28 
Ohsumed  3  48.04  48.4  37.52  45.54  46.64  37.56  48.36 
tmc2007  2  83.31  83.42  78.79  81.2  70.87  78.43  81.85 
rcv1(subset1)  2  12.48  12.3  10.09  11.46  11.75  10.72  12.68 
Mediamill  2  50.44  50.55  45.39  48.66  48.82  46.08  49.41 
Bibtex  2  39.09  39.67  28.97  29.37  36.09  OOM  38.18 
Avg. rank  2.1  2.1  2.2  6.4  4.8  4.0  5.7  2.6 
Dataset  ChiDep rank  Hamming loss  

ChiDep  BR  LP[−]  HO[−]  2BR  PS[−]  CC  
Emotions  2  25.41  25.99  30.2  30.94  25.25  30.53  28.96 
Scene  4  14.51  13.89  14.72  17.49  12.63  14.91  13.92 
Yeast  4  26.59  25.88  28.7  28.91  21.43  27.18  26.38 
Genbase  1  0.09  0.11  0.15  0.17  0.11  0.15  0.11 
Medical  3  1.17  1.11  1.39  1.29  1.23  1.31  1.11 
Enron  2  5.38  5.4  7.41  6.86  5.59  6.45  5.3 
Slashdot  3.5  4.36  4.2  5.96  5.29  4.36  5.88  4.25 
Ohsumed  4  6.59  6.49  8.78  8.98  5.89  8.5  6.49 
tmc2007  2  3.13  3.11  4.2  3.97  5.65  4.21  3.4 
rcv1(subset1)  4  4.6  4.56  5.04  5.59  4.29  4.82  4.45 
Mediamill  2.5  3.82  3.82  4.7  5.03  3.71  4.59  4.01 
Bibtex  3.5  1.49  1.48  2.08  2.72  1.34  OOM  1.49 
Avg. rank  3.0  3.0  2.2  5.9  6.3  2.4  5.3  2.7 
Consider first the baseline BR and LP algorithms. It can be observed that for accuracy, subset accuracy and F-measure, the ChiDep algorithm outperforms BR and LP in 25 and 28 (of a total of 36) cases, respectively, and achieves the same results in 1 and 3 other cases. For the Hamming loss measure, the ChiDep algorithm is more accurate than LP on all datasets; however, it is outperformed by BR in 9 of 12 cases. In general, ChiDep is significantly better than LP with respect to accuracy, F-measure and Hamming loss, and significantly better than BR with respect to the subset accuracy measure.
Considering the recently developed state-of-the-art methods, we notice that ChiDep is significantly better than HOMER, 2BR and PS with respect to the accuracy measure. It is also significantly better than HOMER and 2BR with respect to the subset accuracy measure, and significantly better than HOMER and PS for the F-measure and Hamming loss measures.
The overall comparison indicates that the ChiDep algorithm is most successful for accuracy and subset accuracy, ranking first on four datasets for each of these measures and sharing the first rank with other algorithms on the subset accuracy measure in three additional cases. For the Hamming loss measure, ChiDep is outperformed by the BR, 2BR and CC methods; however, the difference is not statistically significant.
Summarizing this comparison, we notice that, as expected, the ChiDep algorithm is mainly beneficial for accuracy, subset accuracy and F-measure, providing the best average rank for these measures. Even when ChiDep is outperformed by other methods, its predictive quality remains relatively high and is among the three highest values in 34 of 36 prediction cases for the above measures.
We also notice that the 2BR method demonstrates high performance in terms of the Hamming loss measure. It provided the best (lowest) Hamming loss on 7 of 12 datasets, and on four other datasets it was only slightly outperformed by BR and/or ChiDep; however, on the tmc2007 dataset it is outperformed by all other algorithms. In general, the BR method is the most successful in terms of the Hamming loss measure, achieving the best average rank over all datasets, although it has the best rank on only three datasets. For the subset accuracy measure, the LP algorithm performs best on 5 of 12 datasets; however, its average rank is much worse than that of the ChiDep method. Note that LP achieves higher results than the ChiDep algorithm mainly on datasets with a relatively large number of examples per distinct label combination (i.e., a large Train/L _{DC}).
7.2.1 LP vs. BR considering percentage of dependent label pairs
Comparing the superiority of LP over BR in terms of the subset accuracy measure as a function of the percentage of dependent label pairs in the dataset
Dataset  Subset accuracy  LP vs. BR superiority (diff. in subset accuracy values)  Unconditionally dependent label pairs (%)  

BR  LP  
Scene  40.13  53.68  13.55  100 
tmc2007  52.18  64.45  12.27  77 
Emotions  12.87  22.28  9.41  93 
Yeast  6.43  11.89  5.46  58 
Slashdot  33.12  36.89  3.77  22 
Mediamill  5.35  7.22  1.87  30 
Enron  8.64  9.67  1.03  10 
Medical  62.76  63.66  0.9  3 
Bibtex  13.32  13.96  0.64  9 
rcv1(subset1)  0.03  0.13  0.1  12 
Genbase  97.49  96.98  −0.51  12 
Ohsumed  18.04  16.36  −1.68  43 
7.3 Ensemble diversity
In this section we compare the predictive performance of the “base” version of the ChiDep Ensemble (CDE) method, in which the m best models (i.e., those with the highest dependence scores) are selected to participate in the ensemble, to the “diverse” version (CDEd), in which we try to select the most different models (differing in at least k percent) from among the N highest-scored ones. Here k and N are configurable parameters of the algorithm and may be specific to each dataset.
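A greedy sketch of this selection rule follows. It assumes a candidate model is represented by its label-set partition (a set of frozensets of label indices) and measures the "difference" between two models as the Jaccard distance between their partitions; both the representation and the distance are illustrative assumptions, not the paper's exact definitions:

```python
def select_diverse(candidates, k=0.2, n_top=100, m=10):
    """From the n_top highest-scored candidate partitions, greedily keep
    a model only if it differs from every already-selected model by at
    least a fraction k; stop once m models are selected.
    candidates: list of (score, partition) pairs, where a partition is a
    set of frozensets of label indices."""
    def distance(p, q):
        # Jaccard distance between two partitions (assumed measure)
        union = len(p | q)
        return 1.0 - len(p & q) / union if union else 0.0

    top = sorted(candidates, key=lambda c: c[0], reverse=True)[:n_top]
    selected = []
    for score, part in top:
        if all(distance(part, s) >= k for _, s in selected):
            selected.append((score, part))
        if len(selected) == m:
            break
    return selected

# Toy candidates over four labels; p3 duplicates p1 and is rejected.
p1 = {frozenset({0, 1}), frozenset({2, 3})}
p2 = {frozenset({0, 1}), frozenset({2}), frozenset({3})}
p3 = {frozenset({0, 1}), frozenset({2, 3})}
chosen = select_diverse([(0.9, p1), (0.8, p2), (0.7, p3)], k=0.2, n_top=3, m=3)
```

The greedy pass keeps the diverse version as cheap as the base version: each candidate is compared only against the models already accepted, never against all N candidates.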
We performed several calibration experiments to examine whether the “diverse” version of the CDE algorithm with default values for the configurable parameters can improve the predictive performance of the “base” version. The calibration experiments were run on the scene, emotions, yeast and medical training sets using 10-fold cross-validation, with parameter k varying from 0.1 to 0.9 in steps of 0.1 and parameter N varying from 100 to 500 in steps of 50. Analyzing the results, we found that the combination k=0.2 and N=100 performed well and was the only combination appearing among the 25 best results on all evaluated datasets. Testing CDEd with the selected parameters on all the datasets showed that the “diverse” version of the CDE algorithm, even with default parameters, indeed improves the predictive performance of the ensemble.
Comparing ChiDep Ensemble base and “diverse” versions
Dataset  Accuracy  microavg. Fmeasure  

CDE  CDEd  CDE  CDEd  
Emotions  51.05  53.92  63.87  66.84 
Scene  59.91  60.05  66.06  67.29 
Yeast  49.04  49.7  62.42  62.91 
Genbase  98.56  98.66  98.6  98.77 
Medical  71.48  71.56  79.12  79.02 
Enron  43.14  43.26  55.21  55.29 
Slashdot  38.57  38.37  47.55  47.47 
Ohsumed  39.69  39.74  51.22  51.1 
tmc2007  83.7  83.55  88.92  88.85 
rcv1(subset1)  7.2  7.2  12.81  12.78 
Mediamill  42.54  43.11  56.17  56.63 
Bibtex  30.06  30.17  40.03  40.23 
Dataset  Subset accuracy  Hamming loss  

CDE  CDEd  CDE  CDEd  
Emotions  26.24  28.22  22.77  21.62 
Scene  54.26  54.93  11.4  10.76 
Yeast  14.92  15.01  21.66  21.3 
Genbase  97.09  97.49  0.13  0.11 
Medical  63.06  63.18  1.11  1.12 
Enron  12.23  12.57  5.03  5.02 
Slashdot  33.67  33.77  4.49  4.45 
Ohsumed  22.49  22.79  5.87  5.84 
tmc2007  67.39  67.12  2.11  2.12 
rcv1(subset1)  0.09  0.11  4.42  4.42 
Mediamill  9.52  10.07  3.23  3.18 
Bibtex  13.63  13.74  1.45  1.44 
The results show that model diversity, even with default parameters, improves the prediction quality of the ensemble classifier (in terms of all considered evaluation measures) on almost all datasets. Note that even higher predictive performance can be achieved by calibrating the parameters specifically for each dataset.
7.4 ChiDep vs. ensemble algorithms
In this section the results of the “diverse” version of the ChiDep Ensemble algorithm are compared to those of other multilabel ensemble algorithms.
Comparing ChiDep Ensemble (CDEd) to other ensemble algorithms on various datasets. [+], [−]: statistically significant improvement or degradation vs. CDEd
Dataset  CDEd ModelsNum  CDEd rank  Accuracy  

CDEd  RA  ECC  EPS[−]  
Emotions  22  2  53.92  51.28  54.02  53.63 
Scene  21  3  60.05  60.87  62.24  58.5 
Yeast  37  1  49.7  48.39  45.63  49.14 
Genbase  111  1.5  98.66  98.66  98.63  98.32 
Medical  239  3  71.56  71.14  73.82  75.03 
Enron  249  1  43.26  41.89  42.26  34.65 
Slashdot  95  2  38.37  38.54  36.20  37.22 
Ohsumed  86  1  39.74  37.90  39.36  32.56 
tmc2007  77  1  83.55  80.90  72.19  75.69 
rcv1(subset1)  473  1  7.2  7.04  6.662  6.94 
Mediamill  459  2  43.11  44.52  40.89  OOM 
Bibtex  761  1  30.17  29.88  29.37  OOM 
Avg. rank  1.6  1.6  2.4  2.8  3.1 
Dataset  CDEd ModelsNum  CDEd rank  Subset accuracy  

CDEd  RA  ECC  EPS  
Emotions  22  2  28.22  24.75  24.95  29.21 
Scene  21  2  54.93  54.60  51.97  55.77 
Yeast  37  2  15.01  12.52  14.42  16.9 
Genbase  111  1.5  97.49  97.49  97.39  96.48 
Medical  239  3  63.18  62.70  65.29  67.57 
Enron  249  2  12.57  11.54  13.47  11.57 
Slashdot  95  1  33.77  32.98  31.52  33.26 
Ohsumed  86  1  22.79  21.58  19.48  21.46 
tmc2007  77  1  67.12  62.44  48.28  58.08 
rcv1(subset1)  473  1.5  0.11  0.11  0.01  0.07 
Mediamill  459  3  10.07  11.38  10.98  OOM 
Bibtex  761  2.5  13.74  13.74  15.79  OOM 
Avg. rank  1.9  1.9  2.7  2.9  2.2 
Dataset  CDEd ModelsNum  CDEd rank  microavg. Fmeasure  

CDEd  RA  ECC  EPS[−]  
Emotions  22  2  66.84  64.93  66.17  67.21 
Scene  21  3  67.29  68.62  66.57  67.61 
Yeast  37  1  62.91  62.08  58.78  62.41 
Genbase  111  1.5  98.77  98.77  98.73  98.34 
Medical  239  2  79.02  78.91  79.7  77.79 
Enron  249  1  55.29  54.87  53.48  45.29 
Slashdot  95  3  47.47  49.63  47.53  45.92 
Ohsumed  86  2  51.1  50.04  52.08  41.99 
tmc2007  77  1  88.85  87.37  80.35  81.3 
rcv1(subset1)  473  1.5  12.78  12.45  12.78  12.42 
Mediamill  459  2  56.63  57.21  53.82  OOM 
Bibtex  761  1  40.23  40.03  38.61  OOM 
Avg. rank  1.8  1.8  2.2  2.7  3.2 
Dataset  CDEd ModelsNum  CDEd rank  Hamming loss  

CDEd  RA  ECC  EPS  
Emotions  22  2  21.62  22.19  23.48  20.05 
Scene  21  3  10.76  10.45  12.53  9.87 
Yeast  37  2  21.3  21.81  23.60  19.71 
Genbase  111  2  0.11  0.11  0.11  0.15 
Medical  239  2.5  1.12  1.12  1.1  1.18 
Enron  249  2.5  5.02  4.90  5.05  5.02 
Slashdot  95  4  4.45  4.19  4.21  4.43 
Ohsumed  86  1  5.84  5.91  6.33  5.85 
tmc2007  77  1  2.12  2.37  3.85  3.56 
rcv1(subset1)  473  4  4.42  4.34  4.04  4.05 
Mediamill  459  2  3.18  3.03  3.2  OOM 
Bibtex  761  3  1.44  1.41  1.32  OOM 
Avg. rank  2.4  2.4  2.1  2.8  2.4 
The differences between the algorithms were found to be statistically significant in terms of the accuracy and F-measure scores by the Friedman test at p=0.02. The subsequent post-hoc Holm's procedure indicates that there is no case where RAkEL, ECC or EPS is significantly more accurate than CDEd. On the other hand, CDEd is significantly more accurate than EPS for accuracy and F-measure. In addition, the accuracy, subset accuracy and F-measure values of CDEd are higher than those of the ECC and RAkEL algorithms in most cases, although these differences are not statistically significant. In general, the CDEd method obtains the best average rank for the accuracy, subset accuracy and F-measure scores. A detailed comparison shows that the CDEd algorithm achieves the best results on 7, 5 and 6 datasets (of a total of 12), respectively, for the accuracy, subset accuracy and F-measure scores.
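The testing protocol can be sketched as follows. The scores below are illustrative numbers, not the paper's results, and the pairwise Wilcoxon signed-rank tests stand in for the rank-based statistics of the Demsar (2006) methodology:

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Accuracy of four algorithms (columns) on six datasets (rows);
# illustrative numbers only.
scores = np.array([
    [53.9, 51.3, 54.0, 53.6],
    [60.1, 60.9, 62.2, 58.5],
    [49.7, 48.4, 45.6, 49.1],
    [98.8, 98.7, 98.6, 98.3],
    [43.3, 41.9, 42.3, 34.7],
    [39.7, 37.9, 39.4, 32.6],
])
# Friedman test: do the algorithms' ranks differ across datasets?
stat, p = friedmanchisquare(*scores.T)

# Holm's step-down procedure with column 0 as the control algorithm.
pvals = [wilcoxon(scores[:, 0], scores[:, j]).pvalue for j in range(1, 4)]
order = np.argsort(pvals)
alpha, rejected = 0.05, []
for rank, idx in enumerate(order):
    adj = alpha / (len(pvals) - rank)
    if pvals[idx] < adj and len(rejected) == rank:  # stop at first retain
        rejected.append(idx + 1)
```

Note that with only six datasets the exact two-sided Wilcoxon p-value can never drop below 2/64 ≈ 0.031, so no hypothesis survives Holm's strictest threshold of 0.05/3 here; the paper's tests have more power because they use all 12 datasets.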
With respect to Hamming loss, the RAkEL method performs best, achieving the best result on four datasets and the best average rank. Generally, the Hamming loss values of all algorithms are very close in many cases, and the differences between them are not significant.
7.5 Algorithms computational time comparison
It can be observed from the graphs that ChiDep's train times are relatively long; however, its test times are short and comparable to those of BR. On the other hand, both the train and test times of the ChiDep Ensemble method are comparable to those of the other ensemble methods. The CDE train times are shorter than RAkEL's and slightly longer than ECC's. The test times of all ensemble methods are almost equal. Recall that the EPS algorithm terminated with an out-of-memory exception on these datasets; thus its computational times are not presented.
7.6 Discussion
Comparing the two methods for label dependence identification indicates that estimating conditional dependencies is much more computationally expensive than estimating unconditional dependencies. The former method is also inferior in terms of predictive performance on regular-size datasets.
Analysis of the dependent label pairs identified by both methods on the benchmark datasets showed that many more label pairs were identified as unconditionally dependent than as conditionally dependent.
The results of the empirical experiments evaluating the ChiDep algorithm support our conjectures from Sect. 5.3 that modeling partial dependencies by combining the BR and LP approaches (as the LPBR method does) can be beneficial, compared to each algorithm separately, in terms of (1) the accuracy measure and (2) the subset accuracy measure on small and medium training sets.
Indeed, the empirical results demonstrate that the ChiDep algorithm significantly outperforms BR in terms of the subset accuracy measure, and LP in terms of the accuracy, micro-averaged F-measure and Hamming loss measures. In addition, for the subset accuracy measure, ChiDep outperforms LP on datasets with a limited number of training examples or with a small average number of examples per distinct label combination (i.e., Train/L _{DC} is comparatively small). As for the F-measure, ChiDep outperforms LP on all datasets and also outperforms BR on most of the regular-size datasets; however, on datasets with a sufficient number of training examples, BR achieves the highest F-measure values.
Moreover, we found that the ChiDep algorithm has the highest average rank among all the compared algorithms in terms of the accuracy, subset accuracy and micro-averaged F-measure scores. Likewise, the ChiDep Ensemble has the highest average rank among all the compared ensemble algorithms in terms of the same measures.
Summarizing the above, we conclude that the ChiDep and ChiDep Ensemble methods are especially beneficial when accuracy is the target measure of a classification problem. ChiDep is also beneficial with respect to subset accuracy and F-measure on datasets with a limited number of training examples.
Another finding is that the BR and 2BR methods demonstrate the highest performance in terms of the Hamming loss measure among the single-classifier algorithms, whereas ensemble models make it possible to reduce the Hamming loss even further. It was also found that, in terms of the subset accuracy measure, the superiority of LP over BR grows with the level of label interdependence in the dataset.

“Among the baseline methods, use the LP algorithm if you are interested in the highest Subset accuracy on a dataset with many interdependent labels and enough training examples for the existing label combinations.”

“If Accuracy or F-measure is the target of a classification problem, the BR algorithm is able to provide the highest results (among the single-classifier methods) on most datasets with a large number of training examples.”

“For the highest Accuracy values on datasets with few training examples, choose the ChiDep algorithm (among the single-classifier methods).”

“For the best Hamming loss values, use the BR or 2BR method (among the single-classifier methods). To reduce the Hamming loss further, use one of the ensemble methods.”
8 Conclusions
In this paper we have presented a novel algorithm that partitions the label set into several subsets of dependent labels and applies a combination of the LP and BR methods to these subsets. The basic idea is to decompose the original set of labels into several subsets of dependent labels, build an LP classifier for each subset, and then combine the classifiers as in the BR method.
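This decomposition can be sketched in a few lines. The class name, the base learner (a scikit-learn decision tree), and the toy partition are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class LPBRSketch:
    """Minimal sketch of the ChiDep idea: partition the label set into
    groups of (presumed) dependent labels, train one label-powerset (LP)
    classifier per group, and combine the groups' predictions as in
    binary relevance (BR)."""
    def __init__(self, label_groups):
        self.groups = label_groups          # e.g. [[0, 1], [2], [3]]

    def fit(self, X, Y):
        self.models = []
        for g in self.groups:
            # LP step: each distinct combination of the group's labels
            # becomes one class of a multiclass problem.
            combos = [tuple(row) for row in Y[:, g]]
            classes = sorted(set(combos))
            y = np.array([classes.index(c) for c in combos])
            clf = DecisionTreeClassifier(random_state=0).fit(X, y)
            self.models.append((clf, classes))
        return self

    def predict(self, X):
        # BR step: the groups' classifiers predict independently.
        Yhat = np.zeros((len(X), sum(len(g) for g in self.groups)), dtype=int)
        for g, (clf, classes) in zip(self.groups, self.models):
            for row, c in zip(Yhat, clf.predict(X)):
                row[g] = classes[c]    # decode the combination back to labels
        return Yhat

rng = np.random.default_rng(1)
X = rng.random((200, 5))
Y = (X[:, :4] > 0.5).astype(int)        # four synthetic labels
model = LPBRSketch([[0, 1], [2], [3]]).fit(X, Y)
Yhat = model.predict(X)
```

With a singleton group per label the sketch degenerates to pure BR, and with one group containing all labels it degenerates to pure LP, which is exactly the spectrum between the two baselines that the partition controls.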
To evaluate the new algorithm, we first compared the methods for identifying conditional vs. unconditional label dependencies on various benchmark and artificial datasets. The results confirm that modeling unconditional dependencies is sufficient for solving multilabel classification problems, and that modeling conditional dependencies does not improve the predictive performance of the classifier. We then evaluated and compared the new algorithm and its ensemble version to nine other multilabel algorithms with four different measures, utilizing 12 datasets of various complexities to give as wide a picture as possible of the algorithm's effectiveness.
Summarizing the results of this evaluation experiment, we conclude that the multilabel classification method presented in this paper is able to improve prediction accuracy, and in some cases also subset accuracy, compared to other known multilabel classification algorithms.
In addition, we presented generalization bounds for a flexible family of multilabel penalty functions. Our analysis yields a theoretical understanding of the reduction in sample complexity that is gained from unconditional label independence. The present analysis is a worst-case model, and we intend to examine average-case models in future work.
Among the additional issues to be studied further are: developing a more efficient label clustering procedure to improve ChiDep's time performance; exploring better methods for identifying conditional label dependence; and developing rules and tools that could help select the optimal multilabel classification algorithm according to the properties of a specific dataset and classification problem.
Footnotes
 1.
Actually, most of our results continue to hold in the agnostic setting (Kearns et al. 1994), where there is no “teacher” and the probability distribution P is over examplelabel pairs (X,Y).
 2.
Theoretically, based on the symmetry rule of conditional independence, we could skip the test of a pair (λ _{ j },λ _{ i }) if the pair (λ _{ i },λ _{ j }) was found to be independent. However, the described procedure only approximately estimates conditional independence; thus, in these circumstances, the symmetry rule may not hold for some pairs.
 3.
Critical value of χ ^{2} statistic with one degree of freedom is 6.635 for the significance level 0.01.
 4.
Critical value of t statistic with nine degrees of freedom is 3.25 for the significance level 0.01.
 5.
The datasets are available at http://mlkd.csd.auth.gr/multilabel.html, http://meka.sourceforge.net/#datasets, http://davis.wpi.edu/~xmdv/datasets/ohsumed.html.
 6.
Software is available at http://www.cs.waikato.ac.nz/ml/weka/.
 7.
Software is available at http://mulan.sourceforge.net/.
 8.
Software is available at http://meka.sourceforge.net/.
 9.
The tests were performed on a 64-bit Intel Core Quad CPU Q6600@2.40 GHz machine providing 4 GB of RAM for each algorithm.
References
 Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1(1), 151–160.
 Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4), 929–965.
 Dembczynski, K., Cheng, W., & Hullermeier, E. (2010a). Bayes optimal multilabel classification via probabilistic classifier chains. In Proc. ICML 2010, Haifa, Israel.
 Dembczynski, K., Waegeman, W., Cheng, W., & Hüllermeier, E. (2010b). On label dependence in multilabel classification. Working notes of the 2nd international workshop on learning from multilabel data, Haifa, Israel.
 Demsar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.
 Eisenstat, D., & Angluin, D. (2007). The VC dimension of k-fold union. Information Processing Letters, 101(5), 181–184.
 Eisenstat, D. (2009). k-fold unions of low-dimensional concept classes. Information Processing Letters, 109(23–24), 1232–1234.
 Ghamrawi, N., & McCallum, A. (2005). Collective multilabel classification. In CIKM 2005 (pp. 195–200).
 Kearns, M. J., Schapire, R. E., & Sellie, L. (1994). Toward efficient agnostic learning. Machine Learning, 17(2–3), 115–141.
 Kohavi, R., & John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1–2), 273–324.
 Pollard, D. (1984). Convergence of stochastic processes. New York: Springer.
 Read, J., Pfahringer, B., & Holmes, G. (2008). Multilabel classification using ensembles of pruned sets. In Proceedings of the eighth IEEE international conference on data mining (pp. 995–1000).
 Read, J., Pfahringer, B., Holmes, G., & Frank, E. (2009). Classifier chains for multilabel classification. In Proceedings of the 20th European conference on machine learning and knowledge discovery in databases (Vol. 2, pp. 254–269).
 Rokach, L. (2008). Genetic algorithm-based feature set partitioning for classification problems. Pattern Recognition, 41(5), 1693–1717. doi:10.1016/j.patcog.2007.10.013.
 Rokach, L. (2010). Pattern classification using ensemble methods. Series in machine perception and artificial intelligence: Vol. 75. Singapore: World Scientific.
 Rokach, L., & Maimon, O. (2005). Feature set decomposition for decision trees. Journal of Intelligent Data Analysis, 9(2), 131–158.
 Schapire, R. E., & Singer, Y. (2000). Boostexter: a boosting-based system for text categorization. Machine Learning, 39(2–3), 135–168.
 Tenenboim, L., Rokach, L., & Shapira, B. (2009). Multilabel classification by analyzing labels dependencies. In G. Tsoumakas, M. L. Zhang, & Z. H. Zhou (Eds.), Proceedings of the 1st international workshop on learning from multilabel data, Bled, Slovenia (pp. 117–132).
 Tsoumakas, G., & Vlahavas, I. (2007). Random k-labelsets: an ensemble method for multilabel classification. In Proceedings of the 18th European conference on machine learning, Warsaw, Poland (pp. 406–417).
 Tsoumakas, G., Katakis, I., & Vlahavas, I. (2008). Effective and efficient multilabel classification in domains with large number of labels. In Proceedings of the ECML/PKDD 2008 workshop on mining multidimensional data (pp. 30–44).
 Tsoumakas, G., Dimou, A., Spyromitros, E., Mezaris, V., Kompatsiaris, I., & Vlahavas, I. (2009). Correlation-based pruning of stacked binary relevance models for multilabel learning. In G. Tsoumakas, M. L. Zhang, & Z. H. Zhou (Eds.), Proceedings of the 1st international workshop on learning from multilabel data, Bled, Slovenia (pp. 101–116).
 Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010). Mining multilabel data. In O. Maimon & L. Rokach (Eds.), Data mining and knowledge discovery handbook (2nd ed., pp. 667–686). New York: Springer.
 Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16, 264–279.
 Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer.
 Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5, 241–259.
 Xu, J. (2010). Constructing a fast algorithm for multilabel classification with support vector data description. In IEEE international conference on granular computing (pp. 817–821).
 Zhang, M. L., Peña, J. M., & Robles, V. (2009). Feature selection for multilabel naive Bayes classification. Information Sciences, 179(19), 3218–3229.
 Zhang, M., & Zhang, K. (2010). Multilabel learning by exploiting label dependency. In Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining, Washington, DC, USA (pp. 999–1008). http://doi.acm.org/10.1145/1835804.1835930.