1 Introduction

Multi-label learning has been investigated widely by the machine learning community in recent years (de Carvalho and Freitas 2009; Tsoumakas et al. 2010; Gibaja and Ventura 2014). It deals with classification tasks where an instance can be simultaneously classified into more than one class. Each class is represented by one label. Several domains, such as text (Klimt and Yang 2004; Pestian et al. 2007), multimedia (Duygulu et al. 2002; Zhou and Zhang 2006; Briggs et al. 2013) and biology (Elisseeff and Weston 2001), are intrinsically multi-label.

A common approach to dealing with multi-label classification tasks is to transform the original data set into one or more single-label data sets. A conventional binary classification algorithm, called base algorithm here, is used to induce predictive models for each one of them. As such, a transformation strategy defines how to decompose the original task into a set of single-label tasks and how to combine the results obtained from these tasks to solve the original task (Tsoumakas et al. 2010). Many strategies have been proposed to transform the data and address multi-label tasks, exploring different aspects, such as label correlation (Read et al. 2011; Cherman et al. 2012; Montañes et al. 2014), dimensionality reduction (Tsoumakas et al. 2008; Zhang and Wu 2015) and class imbalance (Zhang and Wu 2015; Tsoumakas et al. 2011b).

Although the base algorithm can be seen as a hyperparameter of the transformation strategies, it is generally fixed for all strategies, so that only a single base algorithm is considered in the whole experiment (Read et al. 2011; Montañes et al. 2014; Madjarov et al. 2012). Given that a comprehensive comparison of the binary transformation strategies using different base algorithms has not yet been performed, this study assesses the hypothesis that the base algorithms can have a stronger influence than the binary transformation strategies on the predictive performance of multi-label models. At first glance, this may seem a trivial question to investigate; however, if the choice of the base algorithm matters more for the quality of the results than the choice of the strategy, then several base algorithms should be considered in empirical studies evaluating these strategies.

In the multi-label literature, the most similar comparative study was performed by Madjarov et al. (2012), where 12 strategies (including 3 binary transformation strategies) were evaluated under several measures, using the original train and test partitions of 11 benchmark data sets. Even though a variety of different algorithms were considered, the transformation strategies were evaluated with a single base algorithm, Support Vector Machine (SVM). Another large empirical study covering multiple ensemble strategies (Moyano et al. 2018) used only the C4.5 decision tree as the base algorithm. Nevertheless, a few studies have considered more than one base algorithm. These include Tsoumakas and Katakis (2007) and Cherman et al. (2012), who did not compare strategies using different base algorithms, and Zufferey et al. (2015), who compared strategies with distinct base algorithms, but only on a single data set.

Methods using Automatic Machine Learning (Auto-ML) to address multi-label classification tasks also consider multiple base algorithms (de Sá et al. 2017, 2018; Wever et al. 2018, 2019). During the search for a solution, the Auto-ML method may find a suitable combination of strategy and base algorithm that optimizes a fitness function. In these cases, choosing the base algorithm is seen as part of the solution, and the comparison of the strategies does not fix a base algorithm, as is done in other studies.

Since the most common strategies are based on binary transformations, this paper focuses on these strategies. Hence, 10 binary transformation strategies and 5 different base algorithms (plus one with its hyperparameters tuned) were evaluated using \(5\times 2\)-fold cross-validation on 20 benchmark data sets. In contrast to previous studies, which used null hypothesis significance testing, we ran Bayesian statistical tests (Benavoli et al. 2017) to assess the statistical significance of the differences in the predictive performance of the assessed strategies over different evaluation measures. To the best of our knowledge, this is the most extensive multi-label empirical study carried out so far.

The results reported reinforce the claim that the predictive performance obtained by transformation strategies is affected by the base algorithm used. Thus, experimental studies in multi-label learning must take into account experiments with several different base algorithms. In particular, many of the binary transformation strategies obtained very similar results, with differences mainly due to the choice of the base algorithm. Therefore, previous comparative studies (Madjarov et al. 2012; Moyano et al. 2018) might have reached different conclusions if other base algorithms had been employed. Additionally, for many data sets, the investigated strategies consistently predicted only a subset of the existing labels, never assigning the remaining labels to any instance. This problem was previously observed in the food truck data set (Rivolli et al. 2018); however, as far as we know, it has never been widely investigated.

The rest of the paper is organized as follows: Sect. 2 formally defines the main concepts relevant for multi-label learning. Section 3 details the investigated strategies. Section 4 describes the experimental design, including data sets, evaluation procedures, base classifiers, tools, and hyperparameter values adopted. Section 5 presents, analyzes and discusses the empirical results. In the last section, conclusions are drawn concerning relevant findings from the experimental study and future work directions.

2 Multi-label learning

In multi-label learning, an instance can be simultaneously associated with more than one label. The main tasks in this field are Multi-Label Classification and Label Ranking.

Multi-Label Classification (MLC), the most common task (Tsoumakas et al. 2010), induces a predictive model \(h: {\mathcal {X}} \rightarrow {\mathcal {Y}}\) from a set of training data \({\mathcal {D}}\), which later assigns labels to new examples. This task can be formally defined as follows. Let \({\mathcal {D}}\) be a set of labeled instances, such that \({\mathcal {D}} = \left\{ (x_1, Y_1), ..., (x_n, Y_n)\right\}\). Every labeled instance is composed of \(x_i = (x_{i1}, x_{i2}, ..., x_{id}) \in {\mathbb {R}}^d\), and \(Y_i \subseteq {\mathcal {L}}\), such that \({\mathcal {L}} = \left\{ \lambda _1, \lambda _2, ..., \lambda _q\right\}\) is the set of all q labels \(\lambda _i\). For the sake of convenience, the labels associated with the \(i^{th}\) instance, also called label set, can be seen as a binary vector \(y_i = (y_{i1}, y_{i2}, \ldots , y_{iq}) \in \{0,1\}^q\), where \(y_{ij} = 1 \,\textit{iff}\, \lambda _j \in Y_i\) and \(y_{ij} = 0 \,\textit{iff}\, \lambda _j \not \in Y_i\). Finally, model h is used to predict, for a test instance \((x_i, ?)\), the set of relevant labels \(\hat{Y}_i\) (or \(\hat{y}_i\) as a binarized prediction).

In the Label Ranking (LRK) task, a model outputs a ranking of the labels for each test instance. This ranking can easily be computed by any model that provides a score indicating the probability of each label being relevant to a given instance: the higher the score, the better the ranking position of the corresponding label. In turn, MLC can be derived from the LRK formulation (Gibaja and Ventura 2015).

A multi-label model can be obtained by using two approaches (Tsoumakas and Katakis 2007), problem transformation and algorithm adaptation. The former converts the original multi-labeled data into a set of binary or multi-class data sets, whereas for the latter, the multi-label support is embedded into the algorithm’s structure. Thus, the transformation approach fits the data to the algorithms, and the adaptation approach fits the algorithms to the data (Zhang and Zhou 2014).

A straightforward transformation is to build a binary classifier for each label individually. This is known as the Binary approach. On the other hand, a multi-class transformation can be considered, in which each label set (combination of labels) is mapped to one class. Both approaches are algorithm independent (de Carvalho and Freitas 2009), in the sense that any traditional classification algorithm that is capable of handling such problems can be used as the base algorithm.

We want to emphasize that the binary transformation approach implies that algorithms are trained separately, but not necessarily independently; this will become apparent in the following section. In addition, many hybrid approaches exist, such as Pairwise, which models pairwise combinations (a one-vs-one approach), and subset approaches, which include the well-known RAkEL strategy (Tsoumakas et al. 2011a).

Binary transformation generates at least one data set per label. Each binary data set \({\mathcal {D}}_j'\) is related to the label \(\lambda _j\). The instances associated with \(\lambda _j\) are labeled with a class value of “1”, and all others are labeled with a class value of “0”. The number of binary data sets generated is defined by \(|{\mathcal {D}}'| = mq\), where m is the number of data sets per label. Therefore, the complexity of this family of strategies is linear in the number of labels q. Negative aspects of this approach include the tendency to generate rather imbalanced data sets and the fact that some of these strategies ignore the relationships between labels (Zhou et al. 2012).

The binary transformation strategies are organized into three groups, one-round, stacking, and ensemble, according to the value of m. One-round strategies are the simplest, with \(m = 1\). A special case of one-round is chaining, which increases the input space by adding already predicted labels as features to predict the others, in a chain. In stacking strategies, two rounds of training and prediction steps are performed, thus \(m = 2\). They augment the input space in the second round by using the values of the labels predicted in the first round as features. When all the labels are used, they are called full-stacking; when only a subset of the labels is used, they are called pruned-stacking. Finally, ensemble strategies use more than two models for each label (\(m>2\)), and the value of m is usually a hyperparameter defined by the user. When the same instances and attributes are shared by all internal models, the ensemble is homogeneous. However, when each member and label use distinct data sets as training data, the ensemble is heterogeneous. The former can be seen as an ensemble of multi-labeled data, whereas the latter corresponds to multiple ensembles of single-label data (Gibaja and Ventura 2015). These groups and their strategies are detailed in Sect. 3.

A base classification algorithm must always be chosen to induce predictive models for each transformed data set \({\mathcal {D}}'\). Later, these models are used to predict the relevance of each label for new instances. If the models predict a score instead of a class, the strategies support both tasks, MLC and LRK (Gibaja and Ventura 2015). Since producing the scores is the responsibility of the base algorithm and the binary transformation strategies are independent of it, any transformation strategy can be used to address both tasks. Distinctions between them will not be considered in the rest of this paper.

As previously mentioned, this study is restricted to analyzing strategies based on binary transformation, which are relevant for a broad group of researchers and practitioners. Besides, for most of them, their individual models can be trained separately (thus, allowing for parallelism), they are simple to interpret, they have been successfully used in many state-of-the-art comparisons in the literature, and they usually exhibit acceptable time complexity, almost linear with the number of labels. Using separate classifiers, each focused on only one label, allows for higher flexibility, choosing potentially different approaches on a per-label basis. Furthermore, new labels can usually be added to the problem without retraining the models built for existing labels. In general, as some of the strategies are conceptually quite similar to each other, their practical differences may be highlighted by comparing their performances using different base algorithms, an approach we put forward in this paper.

3 Strategies

In this section, the 10 binary transformation strategies considered are described. Table 1 presents the strategies organized into groups, defined by the number of binary models generated per label, and the subgroups according to their main characteristic.

Table 1 Binary transformation strategies organized into groups/subgroups according to the number of binary models per label and their main characteristic

3.1 One-round

The one-round strategies are characterized by generating only a single binary data set for each label. Binary models are induced from these data sets and used for multi-label prediction. The strategies from this group differ mainly by how they transform the data sets.

Binary Relevance (BR) (Boutell et al. 2004) is the simplest and most popular multi-label strategy (Luaces et al. 2012; Montañes et al. 2014). For each label \(\lambda _j\), an independent binary data set is generated according to

$$\begin{aligned} {\mathcal {D}}_j' = \left\{ (x_i, y_{ij}) \mid 1 \le i \le n\right\} , \end{aligned}$$
(1)

and will be used to induce a binary model \(\theta _j\). The prediction is performed using the values of all binary models as follows:

$$\begin{aligned} h_{br} = \{\lambda _j \,|\, \theta _j(x) = 1, \,1 \le j \le q\}. \end{aligned}$$
(2)
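To make the transformation concrete, the following is a minimal sketch of BR in R, the environment used in this work (Sect. 4.5). It assumes X is a data.frame of predictive attributes and Y an n x q binary label matrix; logistic regression (glm) stands in for an arbitrary base algorithm, and the function names br_train and br_predict are illustrative, not the utiml API.

```r
# Minimal BR sketch (Eqs. 1 and 2); glm is an arbitrary stand-in base algorithm.
br_train <- function(X, Y) {
  lapply(seq_len(ncol(Y)), function(j) {
    Dj <- cbind(X, class = factor(Y[, j]))        # D'_j = {(x_i, y_ij)}
    glm(class ~ ., data = Dj, family = binomial)  # binary model theta_j
  })
}

br_predict <- function(models, Xnew, threshold = 0.5) {
  scores <- sapply(models, predict, newdata = Xnew, type = "response")
  (scores >= threshold) * 1                       # bipartition h_br
}
```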

3.1.1 Chaining

The Classifier Chains (CC) strategy (Read et al. 2009, 2011) organizes the labels in a chain and increases the original input space of the transformed data set for a given label with the values of all previous labels in the chain. Thus, the data set is transformed as follows:

$$\begin{aligned} {\mathcal {D}}_j' = \left\{ ([x_i, y_{i1}, y_{i2}, \ldots , y_{i(j-2)}, y_{i(j-1)}], y_{ij}) \mid 1 \le i \le n\right\} . \end{aligned}$$
(3)

The model related to the first label in the chain is obtained exclusively from the original input data, without adding any predictive attributes, as shown in Eq. 1. The other models increase their input space by adding \(j - 1\) new attributes, where j is the position of the respective label in the chain. During the prediction phase, as the labels are predicted, their values are used to increase the input space, as shown next

$$\begin{aligned}&h_{cc} = \{\lambda _j \,|\, \hat{y}_{j} = 1, \,1 \le j \le q\}, \text {where} \nonumber \\&\hat{y}_{j} = \theta _j([x, \hat{y}_{1}, \hat{y}_{2}, \ldots , \hat{y}_{(j-2)}, \hat{y}_{(j-1)}]). \end{aligned}$$
(4)
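The chained prediction of Eq. 4 can be sketched as follows, under the assumption that the chain models were trained (e.g., with glm, as in the BR sketch above) on the original attributes plus columns named "y1", ..., "y(j-1)" holding the previous labels in the chain; cc_predict is an illustrative name.

```r
# Minimal sketch of CC prediction for a single instance (Eq. 4).
cc_predict <- function(models, xnew, threshold = 0.5) {
  q <- length(models)
  yhat <- integer(q)
  aug <- xnew                                        # one-row data.frame
  for (j in seq_len(q)) {
    score <- predict(models[[j]], newdata = aug, type = "response")
    yhat[j] <- as.integer(score >= threshold)
    aug[[paste0("y", j)]] <- yhat[j]                 # augment the input space for the next link
  }
  yhat
}
```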

Nested Stacking (NS) (Senge et al. 2013) brings two modifications to CC. In the training phase, it uses the predicted labels instead of the real labels. Furthermore, in the prediction phase, it makes a subset correction, in order to predict only preexisting label sets.

The transformation step is similar to Eq. 3. However, the original label values y are replaced by the predicted values \(\hat{y}\), such that

$$\begin{aligned} {\mathcal {D}}_j' = \left\{ ([x_i, \hat{y}_{i1}, \hat{y}_{i2}, \ldots , \hat{y}_{i(j-2)}, \hat{y}_{i(j-1)}], y_{ij}) \mid 1 \le i \le n\right\} , \end{aligned}$$

where \(\hat{y}_{ij}\) is the prediction of the binary model \(\theta _j\) for the instance \(x_i\) presented in the training data. The prediction is obtained similarly to Eq. 4 followed by the subset correction. The \(\hat{y}\) is replaced by \(y^* \in Y\), which is the vector in Y that is most similar to \(\hat{y}\), such that

$$\begin{aligned}&h_{ns} = \{\lambda _j \,|\, y^*_{j} = 1, \,1 \le j \le q\}, \text {where} \\&y^* = \mathop {\mathrm{arg}\,\mathrm{min}}\limits _{y \,\in \, Y} \, dist(\hat{y}, y), \end{aligned}$$

and dist is the Hamming distance, which corresponds to the number of differences between two binary vectors. When more than one minimum is found, the label set with the highest frequency in the training data is selected.
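The subset correction step can be sketched as below, assuming yhat is the predicted binary vector and Ytrain the n x q binary label matrix of the training data; because every training instance contributes one row, the frequency-based tie-breaking falls out of counting the rows at minimum distance. The function name is illustrative.

```r
# Minimal sketch of the NS subset correction.
subset_correction <- function(yhat, Ytrain) {
  dists <- rowSums(abs(sweep(Ytrain, 2, yhat)))       # Hamming distance to each training label set
  candidates <- which(dists == min(dists))
  keys <- apply(Ytrain[candidates, , drop = FALSE], 1, paste, collapse = "")
  best <- names(which.max(table(keys)))               # most frequent label set among the minima
  as.integer(strsplit(best, "")[[1]])
}
```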

3.2 Stacking

The stacking strategies are characterized by using the stacked generalization learning paradigm (Wolpert 1992). In the multi-label context, they use two rounds of binary transformation, where, in the second round, the input space is augmented with information about the labels obtained in the first round. The main difference among the stacked strategies is how they choose the labels that augment the input space. Some of them use all labels (full stacking), while others use only a subset of the labels (pruned stacking).

3.2.1 Full stacking

BR+ (Cherman et al. 2012) and Dependent Binary Relevance (DBR) (Montañes et al. 2014) are very similar to each other. In the training phase, they perform exactly the same procedure. The first round is characterized by the induction of a BR model, according to Eqs. 1 and 2. In the second round, the transformation is performed by increasing the input space using the original labels. To illustrate how it works, let \(\phi _j(y)\) be a function that removes the label \(\lambda _j\) from the vector y, such that

$$\begin{aligned}&{\mathcal {D}}_j^{''} = \left\{ ([x_i, \phi _j(y_i)], y_{ij}) \mid 1 \le i \le n\right\} ,\ \text {where}\nonumber \\&\phi _j(y) = (y_1, \ldots , y_{(j-1)}, y_{(j+1)}, \ldots , y_{q}). \end{aligned}$$
(5)

It should be noted, though, that there is a subtle difference in the prediction phase, specifically in the second round. DBR predicts the labels using the second-round binary models, which take as input the labels obtained from the first-round binary models. Using the \(\phi\) function presented in Eq. 5, the prediction is obtained as follows:

$$\begin{aligned} h_{dbr} = \{\lambda _j \,|\, \theta _j^{''} ([x,\phi _j(h_{br}(x))]) = 1, \,1 \le j \le q\}. \end{aligned}$$

In contrast, BR+ updates the labels obtained from the first-round binary models while the second-round prediction is taking place. Given a chain of labels (for example, \(\lambda _1 \prec \lambda _2 \prec \cdots \prec \lambda _q\)), the prediction is obtained in the following way:

$$\begin{aligned}&h_{br+} = \{\lambda _j \,|\, \theta _j^{''}([x, \phi _j(\hat{y})]) = 1, \,1 \le j \le q\},\nonumber \\&\text {for each } j\text {, }\quad \hat{y} = (\hat{y}_1, \ldots , \hat{y}_{(j-1)}, \theta _j^{''}([x, \phi _j(\hat{y})]), \hat{y}_{(j+1)}, \ldots , \hat{y}_q). \end{aligned}$$
(6)

Recursive Dependent Binary Relevance (RDBR) (Rauber et al. 2014) induces two rounds of models, as DBR does, but it applies the second-round models several times, in a recursive way. The labels predicted by the second-round models are used to update the input space, and the second round is executed again until either the result converges or a fixed number of iterations is reached. In practice, it is the same process as in Eq. 6, but while BR+ performs a single update, RDBR updates recursively until a stopping criterion is reached.
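The second-round transformation shared by BR+, DBR and RDBR (Eq. 5) can be sketched as follows, reusing br_train from the BR sketch; Y is assumed to be a binary label matrix with column names, so that the same feature names can be reproduced from the first-round predictions at prediction time. The name dbr_train and the returned list structure are illustrative.

```r
# Minimal sketch of the full-stacking training phase (Eq. 5).
dbr_train <- function(X, Y) {
  second <- lapply(seq_len(ncol(Y)), function(j) {
    aug <- cbind(X, Y[, -j, drop = FALSE])        # phi_j(y): every label except lambda_j
    Dj2 <- cbind(aug, class = factor(Y[, j]))     # D''_j
    glm(class ~ ., data = Dj2, family = binomial)
  })
  list(first = br_train(X, Y), second = second, labels = colnames(Y))
}
```

At prediction time, the strategies differ only in how the label features are filled in: DBR uses the first-round predictions directly, BR+ updates them label by label along a chain, and RDBR repeats the BR+ update until convergence or a maximum number of iterations.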

3.2.2 Pruned stacking

The Meta-BR (MBR) strategy (Godbole and Sarawagi 2004; Read et al. 2011) augments the input space using the values of the most correlated labels (Tsoumakas et al. 2009). The Pearson product moment correlation coefficient for categorical variables \(\rho\) is computed for each pair of labels and a threshold \(\tau\) is used to define which labels should augment the space of attributes. The data set in the second round is obtained in the following way:

$$\begin{aligned}&{\mathcal {D}}_j^{''} = \left\{ ([x_i, \phi _j(\hat{y}_i)], y_{ij}) \mid 1 \le i \le n\right\} \text {, where}\\&\phi _j(\hat{y}) = \{\hat{y}_l \,|\, \rho (\lambda _j, \lambda _l) \ge \tau , \,1 \le l \le q\}, \end{aligned}$$

and \(\phi _j(\hat{y}_i)\) returns only the most related labels. Unlike the other stacked strategies, instead of using the original labels in the second transformation, it uses the predicted labels obtained in the first round.

The final prediction is the result of the binary models in the second step, such that:

$$\begin{aligned} h_{mbr} = \{\lambda _j \,|\, \theta _j^{''} ([x,\phi _j(h_{br}(x))]) = 1, \,1 \le j \le q\}. \end{aligned}$$
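The pruning step of \(\phi _j\) can be sketched as below, assuming Y is an n x q binary label matrix; the threshold value tau = 0.3 and the function name prune_labels are illustrative (the value actually used in the experiments is given in Table 3). Following the definition of \(\phi _j\) above, a label's own column is not explicitly excluded.

```r
# Minimal sketch of MBR's correlation-based label pruning.
prune_labels <- function(Y, tau = 0.3) {
  rho <- cor(Y)                                   # pairwise label correlations
  lapply(seq_len(ncol(Y)), function(j) which(rho[j, ] >= tau))
}
```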

The Pruned and confiDent (PruDent) strategy (Alali and Kubat 2015) uses only the most relevant labels, as MBR does, and the original values to augment the second input space, as BR+ and DBR do. The Information Gain (IG) measure is used to prune the irrelevant labels based on a threshold \(\tau\). The PruDent transformation is the same as Eq. 5, with the exception of the \(\phi\) function:

$$\begin{aligned} \phi _j(y) = \{y_l \,|\, \textit{IG}(\lambda _j, \lambda _l) \ge \tau , \,1 \le l \le q, \,l \ne j\}. \end{aligned}$$

Unlike the others, PruDent assigns a label to an example if either of the corresponding models, from the first or the second round, predicts it. The predictions are obtained in the following way:

$$\begin{aligned} h_{prud} = \{\lambda _j \,|\, \theta _j(x) = 1 \,\vee \, \theta _j^{''} ([x, \phi _j(h_{br}(x))]) = 1, \,1 \le j \le q\}. \end{aligned}$$

3.3 Ensemble

Ensemble of Binary Relevance (EBR) and Ensemble of Classifier Chains (ECC) (Read et al. 2011) are simply ensembles of models induced by the BR strategy and by the CC strategy, respectively. Both ensembles use bagging and choose a different random subset of the attributes at each bagging iteration. To illustrate how EBR computes predictions, let m be the number of models in the ensemble and \(\phi _l\) a function that selects the random subset of attributes used by the l-th model:

$$\begin{aligned}&h_{ebr} = \{\lambda _j \,|\, \left( \frac{1}{m} \sum ^m_{l=1} \hat{y}_{lj} \right) > \tau , \,1 \le j \le q\},\ \text {where}\\&\hat{y}_l = h_{br}^l(\phi _l(x)), \end{aligned}$$

\(\hat{y}_{lj}\) is the predicted value of the BR model l for the label \(\lambda _j\) and \(\tau\) is a threshold value. For the ECC strategy, internal models are built using \(h_{cc}\) with different chains, avoiding the influence that choosing an inappropriate chain could have on the results.
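The vote aggregation of EBR can be sketched as follows, assuming predictions is a list with the m score (or 0/1) matrices produced by the internal BR models, each of size n_test x q; the default tau and the function name are illustrative.

```r
# Minimal sketch of the EBR/ECC vote aggregation.
ebr_predict <- function(predictions, tau = 0.5) {
  avg <- Reduce(`+`, predictions) / length(predictions)  # (1/m) * sum of the votes
  (avg > tau) * 1                                         # thresholded bipartition
}
```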

4 Experimental design

This section presents an experimental comparison across the binary transformation strategies and base algorithms. It describes the multi-label data sets, followed by a short overview of evaluation measures and procedures. Next, it explains the methodology adopted and the experimental setup.

4.1 Data sets

Table 2 lists the 20 multi-label data sets used for the experiments. They are from distinct domains (column Domain) and have a wide diversity in their characteristics. The columns Inst, Attr and Lbl are respectively the number of instances, attributes and labels. Label sets (lSets) is the number of distinct label combinations; the proportion of unique label sets (PUL) indicates the proportion of label sets associated with a single instance; label cardinality (lCard) measures the average number of labels per instance; label density (lDen) describes the average frequency of the labels; dependency (Dep) shows the average unconditional label dependency (Luaces et al. 2012); inner imbalance degree (IID) measures the average label imbalance in the binary data sets (Raez et al. 2004); and, finally, correlation (Corr) indicates the average correlation between the predictive attributes and the labels.

Letting \(\rho _{jk}\) be the Pearson correlation coefficient between the \(j^{th}\) attribute and the label \(\lambda _k\), the correlation is computed as

$$\begin{aligned} \textit{Corr} = \frac{1}{q} \sum ^q_{k=1} max(|\rho _{1k}|, | \rho _{2k}|, ..., |\rho _{dk}|), \end{aligned}$$

where d is the number of attributes. A high value for this measure means that there is at least one attribute which is strongly correlated to each label, while a low value indicates the opposite.
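A minimal sketch of this measure, assuming X is a numeric attribute matrix (n x d) and Y a binary label matrix (n x q); the function name is illustrative:

```r
# Average, over the labels, of the best absolute attribute-label correlation.
corr_measure <- function(X, Y) {
  rho <- abs(cor(X, Y))        # d x q matrix of |rho_jk|
  mean(apply(rho, 2, max))     # Corr
}
```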

Table 2 Characteristics of the multi-label data sets

These data sets are frequently used as benchmarks for multi-label experiments. They come from different domains, organized here as text, image, audio, biology and other. The text-domain data sets are related to aviation safety reports (tmc2007-500, Srivastava and Zane-Ulman 2005), medical documents (medical, Pestian et al. 2007), emails (enron, Klimt and Yang 2004), newsgroups (20ng, Lang 1995), scientific literature (fapesp, Cherman et al. 2014; ohsumed, Joachims 1998), web forums (stackex_chess, Charte et al. 2015), and web content (langlog and slashdot, Read et al. 2011). Text data sets have a higher number of attributes than most of the data sets from the other domains and also contain the largest average value of correlation between attributes and labels.

The image-domain data sets are related to food (yelp), images extracted from videos (mediamill, Snoek et al. 2006), scene classification (image, Zhou and Zhang 2006; scene, Boutell et al. 2004), and vector graphics (corel5k, Duygulu et al. 2002). They have the highest average number of labels and label sets of all domains. The data sets with the highest average dependency degree among the labels are from the audio domain. They are related to detecting emotions in songs (emotion, Trohidis et al. 2011), the identification of music styles (msd-195, Bernardini et al. 2014), music effects classification (cal500, Turnbull et al. 2008) and sounds of birds (birds, Briggs et al. 2013).

The last two data sets are yeast (Elisseeff and Weston 2001), a data set from the biology domain that associates gene expressions with biological functions, and flags (Gonçalves et al. 2013), a data set of countries in which the colors of their flags are the labels.

The data sets come from the Cometa repository (Charte et al. 2018), an exhaustive collection of MLC data sets, integrated with the tools used in this work. The exceptions are the data sets fapesp and msd-195, obtained from their respective authors, and yelp, obtained from the Kaggle website. The data sets were preprocessed with three operations. First, labels with fewer than 10 instances were removed, to ensure a minimum number of instances with each label in the training and test folds. Next, instances with no labels were also removed. Finally, predictive attributes with constant values were removed.

Concerning the characteristics shown in Table 2, the density (lDen) and the inner imbalance degree (IID) are inversely correlated. As the density increases, the imbalance degree decreases, and vice-versa. We did not find high correlation among the other characteristics.

4.2 Evaluation measures

The evaluation of the predictive performance of multi-label strategies requires using different measures to assess different dimensions (Tsoumakas et al. 2010). They are organized here into example-based, label-based and ranking measures. The example-based measures summarize the predictive performance over all instances, whereas the label-based measures summarize the performance over all labels. The ranking measures are a specialization of the former, using the prediction scores instead of the crisp values. As many evaluation measures are highly correlated with each other (Pereira et al. 2018), only a subset was used.

4.2.1 Example-based measures

Hamming-loss (HL) is an error measure that evaluates the misclassification rate for each label of every instance (Schapire and Singer 1999). This measure does not distinguish between false positive and false negative errors, giving the same weight for both, as shown next

$$\begin{aligned} \textit{HL}&= \frac{1}{n} \sum ^n_{i = 1} \frac{1}{q} {\mid }h(x_i) \, \Delta \, Y_i{\mid },\ \text {where}\nonumber \\&\quad A \, \Delta \, B = (A - B) \cup (B - A). \end{aligned}$$
(7)

While Hamming-loss is the most relaxed measure, Subset-accuracy (SA) is the strictest (Gibaja and Ventura 2015). It accounts only for correctly predicted label sets, ignoring partial hits. A partially correct prediction is valued the same way as a completely incorrect one, as if the set of predicted or observed labels were treated as a class value in single-label classification (Zhang and Zhou 2014). It is computed as

$$\begin{aligned} SA&= \frac{1}{n} \sum ^n_{i = 1} I(h(x_i) = Y_i) ,\ \text {where} \nonumber \\&\quad I(\cdot ) = {\left\{ \begin{array}{ll} 1 &{} \text {if the predicate is true,}\\ 0 &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(8)

Let us call the labels associated with an instance its relevant labels. We can use them to define the following measures. Precision is the fraction of relevant labels among those predicted. A high precision indicates a high ability of a model to correctly predict labels, although not necessarily all of them. Recall is the fraction of relevant labels that have been predicted out of all relevant labels. A high recall indicates that a model predicts many labels correctly, but not necessarily only the relevant ones. The \(F_1\) measure (F1) computes the harmonic mean between precision and recall. A model with a high value for this measure can predict the relevant labels accurately and only them. It does not take the true negatives into account, combining just the rate of relevant labels among the predicted ones and the rate of predicted relevant labels over all relevant labels. F1 is computed as

$$\begin{aligned} F1 = \frac{1}{n} \sum ^n_{i = 1} \frac{2{\mid }h(x_i) \cap Y_i{\mid }}{{\mid }h(x_i){\mid } + {\mid }Y_i{\mid }}. \end{aligned}$$
(9)
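The three example-based measures can be computed directly from the bipartitions, as in the sketch below, where Yhat and Y are assumed to be n x q binary matrices of predictions and true labels (instances with empty predicted or true label sets are assumed not to occur, cf. Sect. 4.5); the function names are illustrative.

```r
# Minimal sketches of the example-based measures (Eqs. 7-9).
hamming_loss    <- function(Yhat, Y) mean(Yhat != Y)
subset_accuracy <- function(Yhat, Y) mean(apply(Yhat == Y, 1, all))
f1_example <- function(Yhat, Y) {
  inter <- rowSums(Yhat * Y)                       # |h(x_i) intersect Y_i|
  mean(2 * inter / (rowSums(Yhat) + rowSums(Y)))   # per-instance F1, averaged
}
```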

4.2.2 Label-based measures

Label-based measures usually come in two variants: micro-averaged and macro-averaged. The macro-averaged measures summarize the label distribution by giving the same weight to all labels (Yang 1999). They assess the consistency across all labels and are thus highly sensitive to the performance on the least common labels, which is usually low (Jackson and Moulinier 2002).

To illustrate how these measures work, let \(\textit{TP}\), \(\textit{FP}\), \(\textit{TN}\) and \(\textit{FN}\) be respectively the true positive, false positive, true negative and false negative values from a confusion matrix, such that

$$\begin{aligned} \textit{Precision}_b= {} \frac{\textit{TP}}{\textit{TP} + \textit{FP}}, \end{aligned}$$
(10)
$$\begin{aligned} \textit{Recall}_b= {} \frac{\textit{TP}}{\textit{TP} + \textit{FN}}, \end{aligned}$$
(11)
$$\begin{aligned} \textit{F1}_b= {} \frac{2\textit{TP}}{2\textit{TP} + \textit{FP} + \textit{FN}}. \end{aligned}$$
(12)

The macro label-based version computes the previous measures for each label and returns their average value, such that

$$\begin{aligned} \text {macro-}\beta =\frac{1}{q}\sum ^q_{j=1} \beta (\textit{TP}_j, \textit{FP}_j, \textit{TN}_j, \textit{FN}_j), \end{aligned}$$

where \(\beta = \{\textit{Precision}_{b}\,| \,\textit{Recall}_{b}\,|\,\textit{F1}_b\}\), from Eqs. 10, 11 and 12, respectively.

The label problem measures MLP and WLP (Rivolli et al. 2018) will also be considered. The Missing Label Prediction (MLP) measure indicates the proportion of labels that are never predicted by a strategy. The Wrong Label Prediction (WLP) measure, which can be seen as a generalization or relaxation of MLP, represents the case where a label might be predicted for some instances, but these predictions are always wrong. Equations 13 and 14 formalize these measures, respectively. In an ideal scenario, their expected value is zero.

$$\begin{aligned} \textit{MLP}= {} \frac{1}{q} \sum ^q_{j=1} I(\textit{TP}_j + \textit{FP}_j == 0) \end{aligned}$$
(13)
$$\begin{aligned} \textit{WLP}= {} \frac{1}{q} \sum ^q_{j=1} I(\textit{TP}_j == 0) \end{aligned}$$
(14)
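The macro-averaged F1 and the two label problem measures can be sketched as follows, again assuming n x q binary matrices Yhat and Y; labels that are neither observed nor predicted in a test fold would yield undefined per-label values and are assumed not to occur. Function names are illustrative.

```r
# Minimal sketches of macro-F1 (Eq. 12 averaged over labels) and MLP/WLP (Eqs. 13-14).
macro_f1 <- function(Yhat, Y) {
  mean(sapply(seq_len(ncol(Y)), function(j) {
    tp <- sum(Yhat[, j] == 1 & Y[, j] == 1)
    fp <- sum(Yhat[, j] == 1 & Y[, j] == 0)
    fn <- sum(Yhat[, j] == 0 & Y[, j] == 1)
    2 * tp / (2 * tp + fp + fn)
  }))
}
mlp <- function(Yhat)    mean(colSums(Yhat) == 0)      # labels never predicted
wlp <- function(Yhat, Y) mean(colSums(Yhat * Y) == 0)  # labels never predicted correctly
```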

4.2.3 Ranking measures

Ranking measures consider the ranking of the labels instead of the quality of the bipartitions that define the predicted labels. One-error (OE) is an extreme measure that assesses only the error of the label predicted with the most confidence. This measure is computed as follows:

$$\begin{aligned} \textit{OE} = \frac{1}{n} \sum ^n_{i = 1} I(\mathrm{arg} \, \underset{\lambda _j \in {\mathcal {L}}}{\mathrm{max}} \, f(x_i, \lambda _j) \not \in Y_i ) \end{aligned}$$

Ranking-loss (RL) computes the average rate of label pairs that are incorrectly sorted when using their predicted probabilities. It is calculated as follows:

$$\begin{aligned} RL= &\, {} \frac{1}{n} \sum ^n_{i = 1} \frac{{\mid }\{(\lambda _j, \lambda _k)| f(x_i, \lambda _j) \le f(x_i, \lambda _k),(\lambda _j, \lambda _k) \in Y_i \times \overline{Y}_i \}{\mid }}{{\mid }Y_i{\mid } {\mid }\overline{Y}_i{\mid }}, \\& {\text {where }} \overline{Y_i} = {\mathcal {L}} \setminus Y_i. \end{aligned}$$
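Both ranking measures can be sketched from a score matrix, assuming scores is an n x q matrix with the values \(f(x_i, \lambda _j)\) and Y the corresponding binary label matrix, and that every instance has at least one relevant and one irrelevant label; the function names are illustrative.

```r
# Minimal sketches of one-error and ranking-loss.
one_error <- function(scores, Y) {
  mean(sapply(seq_len(nrow(scores)), function(i) Y[i, which.max(scores[i, ])] == 0))
}
ranking_loss <- function(scores, Y) {
  mean(sapply(seq_len(nrow(scores)), function(i) {
    rel <- scores[i, Y[i, ] == 1]
    irr <- scores[i, Y[i, ] == 0]
    mean(outer(rel, irr, "<="))       # fraction of misordered (relevant, irrelevant) pairs
  }))
}
```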

4.3 Multi-label baselines

Different baselines were adopted, optimizing different measures. With the exception of the \(\hbox {baseline}_{{ RL}}\), they were proposed by Metz et al. (2012). The \(\hbox {baseline}_{{ F1}}\) literally predicts the label set that maximizes the F1 measure (Eq. 9) for the training data, such that

$$\begin{aligned} \textit{baseline}_{{ F1}} = \mathop {\mathrm{arg}\,\mathrm{max}}\limits _{\hat{Y} \,\subseteq \, {\mathcal {L}}} \, F1(Y, \hat{Y}), \end{aligned}$$

where \(\hat{Y}\) is the predicted label set. This baseline is also used for the label-based measures macro-F1, macro-precision and macro-recall.

The \(\hbox {baseline}_{{ HL}}\) predicts the labels present in more than 50% of the training instances, such that

$$\begin{aligned} \textit{baseline}_{{ HL}} = \{\lambda _j \,|\, \textit{freq}(\lambda _j) > 0.5, 1 \le j \le q \}, \end{aligned}$$

where \(\textit{freq}(\lambda _j)\) is the frequency of the label \(\lambda _j\) in the training data. In turn, baseline\(_{SA}\) predicts the most frequent label set in the training data, such that

$$\begin{aligned} \textit{baseline}_{{ SA}} = \mathop {\mathrm{arg}\,\mathrm{max}}\limits _{\hat{Y} \,\subseteq \, {\mathcal {L}}} \, \sum ^n_{i=1} I(Y_i = \hat{Y}) \end{aligned}$$

where I is the indicator function defined in Eq. 8.

Finally, the \(\hbox {baseline}_{{ RL}}\) (Rivolli et al. 2018), an adaptation of the \({ General}_B\) baseline (Metz et al. 2012), predicts a ranking of labels according to their frequency, such that

$$\begin{aligned} \textit{rank}(\lambda _j) = |{\mathcal {L}}| - |\left\{ \lambda _k \mid \lambda _k \in {\mathcal {L}}, \, \textit{freq}(\lambda _j) > \textit{freq}(\lambda _k) \right\} |, \end{aligned}$$

and

$$\begin{aligned} \textit{baseline}_{{ RL}} = \{\lambda _j \,|\, \textit{rank}(\lambda _j) \le \textit{lcard}, \, 1 \le j \le q \}, \end{aligned}$$

where lcard is the label cardinality of the training data. This baseline is used for the ranking measures: one-error and ranking-loss.
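Two of these baselines can be sketched directly from the label frequencies of the training data, assuming Ytrain is its n x q binary label matrix; the function names are illustrative.

```r
# Minimal sketches of baseline_HL and baseline_RL.
baseline_hl <- function(Ytrain) as.integer(colMeans(Ytrain) > 0.5)    # labels with frequency > 50%

baseline_rl <- function(Ytrain) {
  freq  <- colMeans(Ytrain)                                           # freq(lambda_j)
  lcard <- mean(rowSums(Ytrain))                                      # label cardinality
  rnk   <- sapply(freq, function(fj) length(freq) - sum(freq < fj))   # rank as defined above
  list(rank = rnk, bipartition = as.integer(rnk <= lcard))
}
```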

4.4 Base algorithms

The strategies described in Sect. 3 require using a base algorithm to induce binary models. Algorithms that are frequently used as the base algorithm in multi-label experiments are Decision Tree Induction Algorithms (Cherman et al. 2012; Alali and Kubat 2015; Tsoumakas et al. 2009), Logistic Regression (LR) (Montañes et al. 2014; Rauber et al. 2014; Senge et al. 2013; Tsoumakas et al. 2009) and Support Vector Machines (SVM) (Read et al. 2011; Cherman et al. 2012; Li and Zhang 2014; Luaces et al. 2012; Madjarov et al. 2012; Tsoumakas et al. 2009).

Two classification algorithms that have been very successful in classification tasks, but not commonly used for multi-label classification, Random Forest (RF) and eXtreme Gradient Boosting (XGB), complete the set of base algorithms used in our experiments.

The k-Nearest Neighbors and Naive Bayes algorithms were initially considered. They were discarded because they did not show competitive results when compared with the others. Although other base algorithms, such as Multilayer Perceptron, could also be investigated, they were not considered because those selected were able to support the claims addressed in this paper.

4.5 Experimental setup

The experiments were carried out using the R environment. The data sets were handled using code from the mldr package (Charte and Charte 2015). The strategies used R code from the utiml package (Rivolli and de Carvalho 2018). By default, utiml prevents empty predictions (Liu and Chen 2015): whenever a strategy would otherwise predict no label for an example, the label with the highest probability/score is output instead.

Most strategies and base algorithms used in the experiments require the definition of hyperparameter values. Table 3 shows, for each strategy used, the default values recommended by the packages for the main hyperparameters.

Table 3 Hyperparameters values for the strategies used in the experiments

The implementations of the base algorithms used in the experiments come from the packages C50, stats, randomForest, e1071 and xgboost, for C5.0, LR, RF, SVM and XGB, respectively. Table 4 shows the values used for the hyperparameters of each base algorithm, which were those recommended in the corresponding package. SVMt is a tuned version of SVM for the macro-F1 measure, for which the range of values used in a Grid Search procedure is reported. To validate the hyperparameter values, holdout with 70% of the data for training and 30% for validation was adopted for all data sets. SVM was singled out for tuning due to the high effect of hyperparameter values on its performance (Mantovani et al. 2015).

Table 4 Hyperparameter values of the base algorithms used in the experiments

All results were obtained using \(5 \times 2\)-fold cross-validation with paired folds across all combinations of strategies and base algorithms. An iterative algorithm for the stratification of multi-labeled data (Sechidis et al. 2011) was applied to ensure similar label distributions between training and test data.

Unlike previous comparative studies in the multi-label domain, two Bayesian statistical tests were used (Benavoli et al. 2017). The Bayesian hierarchical correlated t-test was used to compare two strategies over multiple data sets, whereas the Bayesian correlated t-test was used for a single data set. When comparing two strategies, the Bayesian statistical test outputs the probability of three situations: strategy 1 is better (left); strategy 2 is better (right); and there is a draw between them (rope), a region of practical equivalence indicating an insignificant difference in performance between the strategies. Benavoli et al. (2017) suggest the interval \([-0.01, 0.01]\), which represents a difference of 1% for a measure whose range is [0, 1]. This interval was used for all evaluation measures, with the exception of hamming-loss, for which the interval was narrowed to \([-0.001,0.001]\) due to its finer granularity. Without this adjustment, no statistical differences would be observed, since hamming-loss divides the number of mistakes made by a strategy by the number of test instances times the number of labels; thus, the larger the data set, the smaller the differences between the strategies.

5 Experimental results

This section presents the experimental results and the main findings from this study. The complete set of experimental results is publicly available online at https://rivolli.github.io/ml-binary-transformation/.

This section first compares the results with the multi-label baselines, followed by a comparison of the most similar strategies. Next, the strategies are compared using fixed base algorithms, which is the traditional approach used in the multi-label literature. Afterwards, the base algorithms are compared by fixing the strategies. In the last set of comparisons, strategies and base algorithms are combined without distinction. Finally, the main findings are highlighted.

5.1 Comparison with the baselines

Despite their importance for evaluating predictive performance, baselines have not been frequently used in multi-label experiments (Metz et al. 2012). As a result, there are no clear standards for selecting baselines for evaluation. Table 5 presents a comprehensive set of results for the different baselines (Sect. 4.3) used in the experiments.

Table 5 Baseline values obtained for each data set and measure

The \(\hbox {baseline}_{{ F1}}\) obtained the highest results for all measures in data sets with high average label frequency and low imbalance degree. The \(\hbox {baseline}_{{ HL}}\), in contrast, had its best results in data sets with low average label frequency and high imbalance degree. Regarding the \(\hbox {baseline}_{{ RL}}\), used to evaluate the ranking measures, the results obtained are inversely correlated with the label cardinality, i.e., the lowest ranking-loss values were observed in data sets with high lCard. Finally, as the number of labels and label sets increases, the results obtained for the \(\hbox {baseline}_{{ SA}}\) decrease.

Figure 1 summarizes the number of strategy/base-algorithm pairs that did not perform statistically significantly better than the baselines for each data set and evaluation measure. With the exception of macro-recall, which can easily be maximized by predicting all labels, and of some other measures in the case of the cal500 data set, at least one strategy/base-algorithm combination was always able to outperform the baselines for all measures and data sets. However, the considerable number of non-zero entries in Fig. 1 corroborates the claim of Metz et al. (2012) that any new strategy should be compared with others using appropriate multi-label baselines.

Fig. 1
figure 1

Number of pairs strategy/base-algorithm that did not perform statistically significantly better than the baselines according to different evaluation measures

5.2 Similarity of strategies

How the base algorithms affect the behavior of the binary transformation strategies is one of the questions investigated in this paper. According to Table 1, it is reasonable to assume that strategies within a group/subgroup are more similar to each other than to the rest. However, the transformation strategies work with a base algorithm, which is used to induce the learning models from the transformed data, and its effect on the strategies has so far been unknown. Following this rationale, the similarity of strategies using different base algorithms is analyzed in two distinct ways: first, by comparing their predictions, which removes the bias of a specific evaluation measure; second, by comparing their predictive performance statistically over distinct evaluation measures, which considers particularities of the learning process.

To compare the predictions obtained by the strategies, the Hamming distance (defined in Eq. 7) is computed for each pair of strategies. The result indicates the difference between the predictions, and therefore, the average value over all data sets and repetitions can indicate how similar or distinct any two given strategies are.

Initially, the strategies were compared by fixing the base algorithm. To do so, they were organized according to their similarity using the Averaged-Linkage hierarchical clustering algorithm (Jain and Dubes 1988). Figure 2 shows the hierarchy of strategies for each base algorithm. Similar results are observed regardless of the base algorithm, with some exceptions. In summary, the similarity of the predictions follows the intuition behind the groups of strategies presented in Table 1.
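A minimal sketch of this analysis is given below, assuming preds is a named list with one binary prediction matrix per strategy, all over the same instances and labels; the Manhattan distance between the flattened predictions equals the Hamming distance up to a constant factor, which does not affect the clustering. The function name is illustrative.

```r
# Minimal sketch of the similarity analysis behind Fig. 2.
strategy_dendrogram <- function(preds) {
  flat <- t(sapply(preds, as.vector))       # one row of 0/1 predictions per strategy
  d <- dist(flat, method = "manhattan")     # proportional to the Hamming distance
  hclust(d, method = "average")             # Averaged-Linkage clustering
}
# plot(strategy_dendrogram(preds)) yields a dendrogram analogous to Fig. 2.
```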

Fig. 2
figure 2

Similarity of strategies according to their bipartition predictions

For all base algorithms, the ensembles EBR and ECC presented the largest difference from all others. The full stacking strategies BR+, DBR and RDBR were grouped together, following different paths depending on the base algorithm. These were the only consistent groupings across all results. Other strategy pairs, such as the chaining strategies CC and NS, were the closest strategies only for the base algorithms C5.0, RF and XGB. Similarly, the pruned stacking strategies MBR and PruDent were not always in the same group.

Regarding the subgroups, the chaining strategies were more similar to full stacking for some base algorithms, and to pruned stacking for others. Pruned stacking was more closely related to BR than full stacking was, which may indicate that, for these strategies, the pruning approach impacted the results more than the use of stacking.

Looking at the base algorithms, the use of C5.0 leads to a larger difference among the results obtained by the strategies, and, on the other hand, RF leads to a higher similarity.

Next, when all the strategy/base-algorithm pairs were compared together (Fig. 3), the similarity between the base algorithms could also be assessed. The base algorithms RF and XGB produced similar results, and likewise SVM and LR. In the latter case, the observed similarity was even stronger, since the same strategies using these two distinct base algorithms were clustered together. On the other hand, SVM and SVMt, despite being the same base algorithm with different hyperparameter values, were not as closely related as SVM and LR were.

Fig. 3
figure 3

Similarity of strategies and base algorithms according to their bipartition predictions

With the exception of the ensembles and of the SVM and LR base algorithms, the strategies are clustered according to the base algorithm, instead of the opposite, i.e., different variants of the same strategy being grouped together. For instance, in this comparison, \(\hbox {BR}_{{ RF}}\) is more similar to \(\hbox {DBR}_{{ RF}}\), a full stacking approach, than to \(\hbox {BR}_{{ XGB}}\). This shows that, for these strategies, their differences might not be strong enough to always be apparent, regardless of the choice of base algorithm.

To identify when small differences in the predictions are significant, the pairs of strategies within a group/subgroup were statistically compared. The hypothesis investigated is that the two performance distributions are equivalent, so that a high rope probability means that the two strategies are similar, whereas a low value indicates that they are indeed dissimilar. Figure 4 presents the rope probability for different pairs of strategies. The pairs are sorted according to their average values, from the most similar to the most distinct (from bottom to top). Likewise, the base algorithms are sorted from left to right.

Fig. 4
figure 4

Rope probabilities from the Bayesian hierarchical test in the comparison of related strategies (y axis) for different base algorithms (x axis). The symbol ‘=’ is used for probabilities greater than 0.95

As previously observed, C5.0 was the base algorithm with the largest number of differences between strategies, whereas RF was the base algorithm with the lowest number of differences. Regardless of the evaluation measure, all pairs were considered similar to each other when RF was used. Additionally, the differences between the strategies were captured in different ways by the evaluation measures. For instance, no differences in F1 results were observed; the ranking measures were more sensitive when comparing the pruned stacking strategies; and hamming-loss and subset-accuracy produced clear differences for the ensemble and full stacking strategies.

In summary, the results presented in this section showed that the base algorithms impact the strategies in different ways. Although all the investigated strategies use the same paradigm (binary transformation), their small differences were captured by the evaluation measures for some of the base algorithms. By varying the base algorithm, a pair of closely related strategies can appear more similar to, or more distinct from, each other, given a specific evaluation measure. Therefore, it can be concluded that some base algorithms are more dominant than others.

5.3 Analysis of strategies

Following the procedure used in many multi-label studies, the strategies are compared with each other by fixing the base algorithm. As distinct base algorithms are considered, the differences between them can be contrasted. Using the Bayesian hierarchical statistical test, each pair of strategies with the same base algorithm is compared. Figure 5 presents the results of the paired test, varying the base algorithms. For each base algorithm, the strategy whose probability of statistically outperforming the other is higher than or equal to 95% is highlighted. Similar strategies (rope \(\ge 95\%\)) are represented with an “=” character, and an empty value indicates inconclusive results (probabilities \(< 95\%\)). The pairs of strategies with similar or inconclusive results for all base algorithms were removed from the chart.

Fig. 5
figure 5

Best strategy according to the results of the Bayesian hierarchical statistical test. The symbol ‘=’ indicates they are similar with statistical significance

The main discrepancies in the results are observed in relation to the ensemble strategies and the base algorithm C5.0. For C5.0, EBR and ECC outperformed all other strategies for most evaluation measures, whereas for other base algorithms, ensembles were outperformed by different strategies. For the measures F1, macro-F1 and macro-recall a more homogeneous result is observed across the base algorithms. In this case, the ensembles are clearly the best choice, probably due to the fact that they internally perform a thresholding calibration that allows them to obtain more balanced precision and recall results regardless of the base algorithm.

To detail the contradictions, Table 6 presents the cases where conflicting probabilities from the statistical test were found across distinct base algorithms. Probabilities indicating that the strategies are similar (rope > 50%) and inconclusive results (all probabilities < 50%) were omitted from the table, which led to the elimination of the columns relative to the base algorithms RF and SVM. For each base algorithm, the highest value is shown in bold, and the cases where the probability is greater than or equal to 95% are underlined.

Table 6 Divergent probabilities found across the base algorithms in the comparison of the strategies

Many observations showed low probabilities for at least one of the base algorithms. This indicates that, although still conflicting, the differences were not so evident according to the statistical test. The most noticeable differences were observed for the ranking-loss measure, probably because the scores produced by the binary models are more sensitive to variation than the bipartitions.

Regarding the base algorithm, C5.0 shows many strongly significant differences, which reinforces the previous conclusions concerning C5.0 behaving very differently from the other base algorithms. Regarding the strategies, all observed differences are related to pairs of strategies where each comes from a different subgroup, e.g., a chaining strategy against a full stacking strategy.

In conclusion, the comparison of the transformation strategies showed different results, for some measures, according to the base algorithm used. In this particular case, all strategies use a binary transformation, which makes them very similar to each other. Given that differences were still observed, it is reasonable to assume that when different transformation strategies are evaluated, it is important to investigate distinct base algorithms.

5.4 Analysis of base algorithms

Exploring a different perspective, the base algorithms are compared by fixing the strategies. The hypothesis investigated is that, for each strategy, some specific base algorithms perform better than the rest. Analogous to the previous section, Fig. 6 presents the results of the paired test for the base algorithms, in which all base algorithms were compared against each other for each one of the strategies. In this test, for each strategy, the base algorithm whose probability of statistically outperforming the other is higher than or equal to 95% is highlighted. Similar base algorithms (rope \(\ge\) 95%) are represented with an “=” character, and an empty value indicates inconclusive results (probabilities < 95%).

Fig. 6
figure 6

Best base algorithm according to the results of the Bayesian hierarchical statistical test. The best option for each pair and strategy is indicated by the first letter of the base algorithm, such that C, L, R, S, St and X indicate C5.0, LR, RF, SVM, SVMt and XGB, respectively. The symbol ‘=’ indicates they are similar with statistical significance

At a glance, RF and XGB were the dominant base algorithms, regardless of the evaluation measure used. However, they have not been used as the base algorithm in previous studies. In contrast, C5.0, followed by LR, obtained the worst results, despite their popularity in multi-label studies.

Probably due to the lack of diversity in the strategies considered, few variations concerning the best base algorithm were observed. Nevertheless, they are related to the ensembles, the most distinctive strategies among the ones investigated, as noticed in Sect. 5.2. An illustrative example that reinforces the investigated hypothesis is related to the ranking-loss measure. For many strategies, RF was the best base algorithm; for the ensembles, however, it was the worst. On the other hand, C5.0, which is not a good choice for many strategies, is a suitable alternative for the ensembles. This is plausible: ensemble strategies, like RF itself, perform better when their base learners are unstable, which is why decision tree induction algorithms (e.g., C5.0) are popular choices inside machine learning ensembles. Since ensemble-based base algorithms such as RF already reduce the variance of their predictions, they are less suitable as members of ensemble strategies.

For some comparisons and evaluation measures, one of the base algorithms was statistically better than the other regardless of the strategy, mainly when C5.0, typically the worse of the two, was involved. In spite of this regularity, the results reinforce the conjecture that the performance of the strategies depends on the base algorithm. In particular, the results of the ensemble strategies presented greater variation concerning the best base algorithm than those of the other strategies. However, additional tests, including a more varied set of strategies, could increase support for this claim.

Some pairs of base algorithms, in particular LR/SVM and SVM/SVMt, presented similar results, with statistical significance, for different evaluation measures. Between LR and SVM, the latter was the best option only for the ensembles, and not for all measures. Comparing SVM and its tuned version, SVMt, although the latter apparently performed better than the former in terms of F1, macro-F1 and macro-recall, the probabilities obtained in the Bayesian test did not reach 95%. Regarding C5.0 and LR, the latter shows clear advantages over the former. Finally, between RF and XGB, the most dominant base algorithms according to the experimental results, the choice depends on the evaluation measure: XGB was the best option for macro-F1 and macro-recall, while RF was the best for hamming-loss, one-error and ranking-loss.

In summary, the results presented in this section provide some support for the claim that the choice of base algorithm can strongly influence a strategy’s performance. Furthermore, some base algorithms performed better on average than others, which again can influence and even distort comparisons of multi-label learning strategies.

5.5 Combining strategies and base algorithms

The previous analyses showed that the ranking of the best strategies varies according to the base algorithm used. To further investigate this issue, all strategy/base-algorithm pairs are evaluated against each other without distinctions. In order to summarize the 60 pairs (strategy/base-algorithm), Annex A presents the ranking for each pair considering all data sets and the strategies’ results using the best base algorithm. The statistical results comparing those strategies are presented in Annex B.

Considering the BR strategy as a more robust baseline, its performance is analyzed in relation to the other strategies. For the measures F1, macro-F1 and macro-recall, the ensembles outperform BR with statistical significance, regardless of the base algorithm. By contrast, BR outperforms them for the measures hamming-loss, macro-precision, ranking-loss and subset-accuracy. In relation to the other strategies, there is no case in which BR is completely outperformed by another strategy, or vice versa. Specifically for the one-error measure, \(\hbox {BR}_{RF}\) achieved the best ranking over all combinations and outperformed the other strategies for 4 or 5 base algorithms.

To complement these results, Table 7 presents, for all the selected pairs, the number and percentage of other pairs that were statistically outperformed with a probability greater than or equal to 95%, according to the Bayesian statistical test. The strategies are sorted from top to bottom based on the number of pairs outperformed.

Table 7 Selected pairs of strategy/base-algorithm and the percentage of other pairs that were statistically outperformed by them
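
The Bayesian test itself is not detailed here. Purely as an illustration, the sketch below implements a simple Bayesian sign test with a region of practical equivalence, one common choice for this kind of pairwise comparison; the function name, the ROPE width and the 95% decision rule are assumptions, not the paper's exact procedure.

```python
import numpy as np

def bayesian_sign_test(scores_a, scores_b, rope=0.01, prior=1.0,
                       n_samples=50_000, seed=0):
    """Posterior probability that pair A outperforms pair B (Bayesian sign test).

    scores_a, scores_b: per-data-set scores of two strategy/base-algorithm pairs
    (higher is better). Differences inside [-rope, +rope] count as ties.
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(scores_a) - np.asarray(scores_b)
    counts = np.array([
        np.sum(diff > rope),           # A wins
        np.sum(np.abs(diff) <= rope),  # practically equivalent
        np.sum(diff < -rope),          # B wins
    ]) + prior                         # symmetric Dirichlet prior
    theta = rng.dirichlet(counts, size=n_samples)
    return np.mean(theta[:, 0] > theta[:, 2])

# Example decision rule: declare A better than B when the probability >= 0.95.
# bayesian_sign_test(f1_pair_a, f1_pair_b) >= 0.95   (array names are hypothetical)
```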

None of the strategies achieved a consistently good performance over all evaluation measures. The highest results were observed for the ensembles using XGB, which outperformed more than 90% of the other pairs in terms of F1, macro-F1 and macro-recall; consequently, they are the best-ranked strategy/base-algorithm pairs according to the number of outperformed pairs. The lack of a dominant combination for the other measures shows that every strategy obtained a good performance for some base algorithm.

Concerning the base algorithms, the best results were obtained mainly by either RF or XGB; both appear in the table combined with all strategies. In terms of strategies, despite being the simplest, BR presented a good performance for hamming-loss, one-error and ranking-loss.

To sum up, when all strategy/base-algorithm pairs are compared, some strategies are dominant for certain measures regardless of the choice of base algorithm, such as EBR and ECC for macro-F1. Conversely, for some evaluation measures, the choice of base algorithm dominates the results regardless of the chosen strategy, such as RF for ranking-loss. Even though all strategies use binary transformation, and are consequently very similar to each other, statistical differences were observed between them. In conclusion, any future study proposing new transformations should consider an empirical comparison of multiple transformation strategies together with multiple base algorithms.

5.6 Label problems

It can be observed in Fig. 7 that the values of F1 are substantially higher than the values of macro-F1 for many data sets. This occurs when the per-label F1 is very low for one or more labels; in practice, the least frequent labels are often behind these differences. As the previously defined label problems MLP and WLP (Eqs. 13 and 14) provide a possible explanation, their average proportions over all strategy/base-algorithm pairs are presented in Table 8.
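
The exact definitions of MLP and WLP are given by Eqs. 13 and 14. As an illustration only, the sketch below computes proportions along those lines from binary ground-truth and prediction matrices; the function name and the precise reading of the two definitions are assumptions.

```python
import numpy as np

def label_problems(Y_true, Y_pred):
    """Sketch of the MLP/WLP proportions (cf. Eqs. 13 and 14).

    Y_true, Y_pred: binary matrices of shape (n_instances, n_labels).
    MLP: fraction of labels never predicted as relevant for any test instance.
    WLP: fraction of labels that are predicted, but never correctly
         (no true positive for that label).
    """
    Y_true = np.asarray(Y_true, dtype=bool)
    Y_pred = np.asarray(Y_pred, dtype=bool)

    predicted     = Y_pred.any(axis=0)             # label predicted at least once
    true_positive = (Y_true & Y_pred).any(axis=0)  # label ever predicted correctly

    mlp = np.mean(~predicted)                  # never predicted
    wlp = np.mean(predicted & ~true_positive)  # predicted, but always wrongly
    return mlp, wlp
```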

Fig. 7 Comparative results of the measures F1 and macro-F1 for all data sets and strategy/base-algorithm pairs

Table 8 Average label problems results over all strategy/base-algorithm pairs

For the sake of clarity, the data sets without problems were removed from the table. For many data sets, the values obtained paint a clear picture, indicating that many labels were wrongly predicted or even never predicted at all. For example, in the worst case, on average 73% of the labels of corel5k (\(\approx\) 159 labels) were wrongly predicted for all test instances, and 55% (\(\approx\) 120 labels) were never predicted. The high values observed for many data sets reveal a previously undetected problem generated by the binary transformation strategies.

This also explains the high macro-precision values compared with the macro-recall values (Fig. 8). The best results for the measures F1, macro-F1 and macro-recall were achieved by the ensemble strategies. Since they use an internal threshold technique for selecting relevant labels, their recall is enhanced and, consequently, their F1 results are also higher. Additional studies are needed to test whether this behaviour is mainly due to this post-processing step used by the ensembles.
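
The internal threshold technique used by the ensembles is not reproduced here. As an illustrative sketch only, the snippet below shows one common form of such post-processing, cardinality-based threshold calibration, in which a global threshold is chosen so that the predicted label cardinality approximates the one observed in the training set; the function and variable names are hypothetical.

```python
import numpy as np

def calibrate_threshold(scores, train_cardinality, grid=None):
    """Pick a global threshold so that the average number of predicted labels
    approximates the label cardinality observed in the training set.

    scores: (n_instances, n_labels) averaged votes/probabilities from the ensemble.
    train_cardinality: mean number of relevant labels per training instance.
    """
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    # Choose the threshold whose predicted cardinality is closest to the training one.
    cardinalities = [(scores >= t).sum(axis=1).mean() for t in grid]
    best = int(np.argmin([abs(c - train_cardinality) for c in cardinalities]))
    return grid[best]

# Usage sketch: threshold the ensemble's averaged scores.
# t = calibrate_threshold(ensemble_scores, Y_train.sum(axis=1).mean())
# Y_pred = (ensemble_scores >= t).astype(int)
```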

Fig. 8 Comparative results of the measures macro-precision and macro-recall for all data sets and strategy/base-algorithm pairs

5.7 Summary

The main motivation for this study was to obtain a better understanding of how the base algorithm affects binary transformation strategies. The results presented in the previous sections show that the choice of base algorithm can change the behaviour of binary transformation strategies. Thus, by considering distinct base algorithms, an empirical study involving transformation strategies becomes less biased.

Different rankings of strategies and different statistical results were obtained when different base algorithms were used. Using several base algorithms, however, is not common practice in multi-label research: transformation strategies are usually proposed and compared using a single base algorithm (Read et al. 2011; Madjarov et al. 2012; Montañes et al. 2014; Moyano et al. 2018). The claim that segmenting the comparison by base algorithm yields more consistent results (Moyano et al. 2018) might actually be misleading. In addition, across all assessed measures, no single base algorithm obtained the best results for all strategies. Consequently, comparing strategies using only one fixed base algorithm should be avoided.

Nevertheless, it is still valid to compare strategies using a fixed base algorithm, since this can help to understand the scenarios in which a strategy is improved. For instance, a clear superiority of the ensembles EBR and ECC, regardless of the evaluation measure, was observed when C5.0 was used as the base algorithm. On the other hand, when using LR and RF, the ensemble strategies did not perform as well, showing that some strategies might not be suitable for a given base algorithm. Even though some base algorithms may obtain a better overall performance than others, a diverse set of base algorithms is valuable for determining the conditions under which each strategy is appropriate. Furthermore, although predictive performance is very important, there are other reasons to consider different base classifiers: for example, decision trees provide good interpretability, while logistic regression provides good probability estimates. Therefore, it is useful to consider the relative performance difference rather than simply the top performance.

Considering the large experimental scenario evaluated, the hyperparameter tuning procedure adopted was simple and did not achieve the best results for the optimized measure. The SVMt base algorithm produced distinct results when compared to SVM, but when compared to others, such as RF and XGB, the SVMt results remained closer to those of SVM. Therefore, in this context, hyperparameter tuning can be seen as secondary to base algorithm selection, provided reasonable default parameter settings can be identified for the selected base algorithm. We remark, however, that this depends on the model class in question, since some models are more sensitive to their hyperparameter settings than others. Ideally, if computational resources allow, the base algorithms should be tuned as part of the base-algorithm selection process, especially when the performance difference between them is small. Of course, for large-scale experimental comparisons this may not be feasible, due to the extra degree of complexity involved.
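
For readers who want to reproduce a simple tuning setup, the sketch below shows one possible way to tune an SVM inside a binary-relevance-style wrapper with scikit-learn. The synthetic data, parameter grid and scoring choice are illustrative only and do not correspond to the paper's actual tuning procedure.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Illustrative multi-label data; in the paper's setting this would be a benchmark data set.
X, Y = make_multilabel_classification(n_samples=300, n_classes=5, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=0)

# Binary-relevance-style wrapper around an SVM, tuning C and gamma of every binary model.
br_svm = MultiOutputClassifier(SVC())
grid = {"estimator__C": [0.1, 1, 10], "estimator__gamma": ["scale", 0.01]}
search = GridSearchCV(br_svm, grid, scoring="f1_micro", cv=3)
search.fit(X_tr, Y_tr)

print(search.best_params_)
print(f1_score(Y_te, search.predict(X_te), average="micro"))
```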

Auto-ML for MLC (de Sá et al. 2017; Wever et al. 2019) can be used to find the best combination of strategy and base algorithm. Furthermore, it can tune the hyperparameters of both, as well as the pipeline of the solution, to obtain the best results for a given problem. Thus, Auto-ML tools offer one answer to the question of which multi-label strategy and base algorithm to use. However, they demand considerable computational resources, which may limit their use.

Regarding the closely-related strategies (BR and pruned stacking; chaining; full stacking; and the ensembles investigated here), their differences were shown to be subtle and circumstantial. Given the relatively small number of data sets considered in empirical studies, finding characteristics of a problem that distinguish strategies is not a trivial task. Thus, the choice among these closely-related strategies might also be seen as merely a matter of convenience, potentially influenced by other practical considerations, such as memory or runtime cost.

The differences between strategies from distinct groups are very consistent across the different evaluation criteria. Therefore, for empirical studies involving binary transformation strategies in MLC, we strongly recommend using strategies from different groups, as well as various base algorithms. Selecting among strategies within the same group is not an easy task; however, it is important to provide some guidance concerning which one to use. We decided to use the average ranking considering all base algorithms (“Appendix 1”).

Table 9 summarises the experimental results, indicating good strategies for different evaluation measures. In practical applications, RF and XGB should be considered as base algorithms, in addition to the usual favorites, which include C5.0, LR and SVM. We note that if the median rank of each base algorithm, or another criterion, were adopted, different recommendations would probably emerge, but the resulting predictive performance would not be expected to differ much.

Table 9 Suggestion of binary transformation strategies to be picked in empirical experiments

6 Conclusion

This paper presented an extensive experimental evaluation of binary transformation strategies for multi-label classification. Different perspectives were considered in addition to the traditional approach of selecting a single base algorithm when comparing multi-label strategies. Thus, bipartition predictions were compared, strategies were compared for fixed base algorithms, base algorithms were compared for fixed strategies, and all possible pairs of strategy and base algorithm were compared with each other.

The main conclusions to draw from this study are:

  • Binary transformation strategies are strongly influenced by the base algorithm used. Consequently, empirical studies should always consider a diverse set of distinct base algorithms.

  • RF and XGB, which showed high predictive performance across a number of strategies, should be considered in the subset of base algorithms selected to perform an empirical study in MLC.

  • The investigated strategies and base algorithms always either misclassified some labels or never predicted them at all. So far this problem has been ignored, mainly because traditional evaluation measures cannot capture it. Nevertheless, it deserves more attention in future studies.

More specific conclusions for multi-label strategies and evaluation measures include:

  • Ensembles using internal threshold selection obtained good results for F1, macro-F1 and macro-recall.

  • Despite being considered a baseline in many studies, BR obtained the best predictive performance for the ranking measures, one-error and ranking-loss. In addition, BR obtained good results for the macro-precision and hamming-loss measures, depending on the choice of base algorithm.

  • The full stacking strategies and the NS strategy, which uses a subset correction procedure, obtained the best results for the subset-accuracy measure.

Future work includes investigating the impact of the base algorithm on other transformations, such as the label-powerset method. Recommending combinations of strategy and base algorithm based on a desired measure and on data set characteristics is another promising direction. Finally, the two types of label prediction failure, MLP and WLP, need to be researched in more depth.